Tuesday, July 30, 2013

Data Quality and its components

                Data Quality is a fitness level of a data. A set of data which we receive in beginning of any project needs to be analyzed by several means of statistical theories. Data quality is a result of value it generate aligning the subject area goal. There are several “Ilities” (around 200) to judge the quality of data, but data analyst put some of them to weighing the purity and impurity in a set. Goal is to get To-the-point, Correct and Effective data.
*** Data quality has a vast scope of discussion. Several organizations is researching and inventing new ways to analyze a piece of data. We are not going to discuss everything in a blog post (hence it is impossible). Here I want to explain some parameters of data quality which I use in my BI projects. I will use Automobile sales information as example here.  

Five w + H ( A short introduction)
What Data Quality means?
It determines that the data we are using for business decision is dependable, well processed and error free. A quality data is achieved by combined analysis of business and technology.
When to apply this?
Data Quality analysis is not a project it is not also similar to quality control. It is a horizontal in and organization which is used in different segment of SDLC phases of different project running in the same organization. Data is generated regularly and problems are identified frequently. Data changes with the change of organizational planning and business rule changes. A quality data is always viewed in term of business impact.

Whether you are starting a project or the project is in support (maintenance) this can be implemented anytime.
Which are the benefits for implementing it?
Data quality maintaining or implementing is a costly affair. But it benefits every user who uses a set of data. A fully computerized organization also has a huge manual task to remove redundant data from their reports using separate tool. Data quality removes that redundancy.

In my experience I saw people used to extract data from two system and match to get confidence on data quality. Implementing data quality analysis removes the problem.
Who need to implement?
Business owns the data, SMEs understand the data, and Technology structures the data and end user use the data. To get a quality data there should be a collaborative effort of all stakeholders of the data.
Where it need to be implemented
When a data quality gets implemented it captured all the aspects of the data. Suggestion from Data Quality analysis results modification of all the systems involved in an organization. It starts with the legacy data in an organization have and continue to the everyday generated operational data.
How to implement?
There are several tools present in the industry to get it done with some organization level customization. Those tools are built on the basic data analysis theories. You also can analyze the data by checking several parameters.

A real life data quality measurement
Sales information is always the high priority in an organization as it is the lifeline of an industry. Sales volume is connected to most of the decision making KPIs in an organization. Automobile industry or manufacturing industry endorses automation prior to any other industry, but they are the last in game to understand importance of data flow.
                Intense labor oriented organization works with people and there people update the information manually. In an oil industry retail volume directly comes from automated gas stations. But in automotive industry there is an involvement of third party settlement and negotiation. Dealership involvement makes the whole things complicated.
                In any automotive organization this is common problem to get the correct sales data by different hierarchy. Different department looks sales volume differently. After sales department don’t see the sale of Car. So they have parts sales volume. Sales department don’t see parts sales volume. Multinational Car Company has different departments sitting in headquarter for individual country car sales. Trend management analytics checks the volume of the old car sales to new car sales.  Finance department see sales volume per individual dealership, and individual sales person. Automotive trainers look to sales volume with the level of knowledge of the dealership professional. They see the sales volume of their courses taking the car and parts sales volume as a parameter. A dealership looks to a company sales by their own sales, comparative sales with other dealership and their salespersons sales volume. A sales person sees the sales volume of his own to compare with trend to get the incentive.
                So they all have different view on sales volume. As result sales volume require several different aspects to be checked. For representing a sales volume a big amount of information need to be checked. In a project of finding incentive given to sales professional, project owner wants to check the data quality of the modeled data on which the reports are built and ready to go to production.
                If no data quality measurement tool is provided then I personally prefer the X parameters in 39 parameters set by Blundell, Hines and Stach for information quality. It is obvious that you may not require all. Database is in Oracle, ETL is in Informatica, reporting tool is Business Objects. Let’s start the analysis of the data with some parameter used by them:

Blundell, Hines, and Stach
 Read the rest and at the end of the point accuracy will be discussed as it is the single show stopper in data quality analysis.
Change ability
1.       Check if the reports are dynamic. If the reports data are not dynamic and not data driven then a same report needs modification in time which affects changeability.
2.       Check the variable definition. If the variables are smart enough to handle change in rule that will adapt business rule changes.
Ex: The Company started business in Cuba and need to add the sales volume in the sales data. If the country is a variable then it will automatically updated with the new data comes in.
3.       Prohibit Hardcoding. Try to remove hardcoding in universe, in data tables, in reports.
1.       Checks if audit trail is enabled if not enable it for important tables. Like Sales volume and dealer attributes
2.       Create a purging plan to avoid maintaining huge logs
3.       Assign timestamp to all set of data. History information helps to audit a database.
1.       Check report output and turnaround time.
2.       If output time is more than 30 second then do a root-cause-analysis.

3.       If the turnaround time (from opening the application to getting the output) is more than 3 min, availibity is challenged.
4.       Test with a full set of filter rather narrowing down the criteria.
5.       Reports and data should be available in a relevant location. It should run in a place where all the analytical data is present. Location determines the quality of information.
Ex: Sales data should appear right after the employee information.
Completeness Correctness
1.       Compare with the legacy system data and check the correctness.
2.       Create SQL queries separate than application generated SQL with the business rules mentioned in requirement. Return random small set of data and compare with a large volume and do a theoretical match. 
3.       Check the user expectation from BRD and the data returned in reports. Ex: if user wants to see only California Sales. It is irrelevant to give West coast Sales data with drill down capabilities.
4.       Check language requirement. Sometime English language developers think that Spanish and French are using same character set and create a mess when writing in report. May be they have similar set but ñ or Œ are never used in English language. So confirm with the native language experts to completeness of data.
5.       Though Dollar is vastly used in international calculations. But multinationals face lots of problem calculating the native currency to dollar. Ex: the exchange rate varies every time. And the sales data from different country generates in different currency. The analytical data should comprise the both and fix a rule for exchange rate conversion.
Completeness, Correctness and accuracy have several sub parameters to deal with. These are the end result parameter of data quality measurement.
1.       Consistency in data and documentation need to be checked. If in user given mockup “Employee ID” is given then this should not be changed to “Emp ID” for optimized space utilization.
2.       Consistency with bigger picture. Sales data is used in different application across the organization. If different application is showing different data then the consistency and reliability is getting hampered. Make sure the downstream feed is coming at the same time as other application using the same source.
3.       Consistency of design. A design must follow a guidelines and standard. If those are not followed then a same set of data looks different when compared. Ex: in an application dealer name and dealer code are referred as N_Dealer and C_Dealer. If somewhere else it is notified as Dealer_code, Dealer_Name, it will be hard to maintain.
4.       Check Point-in-Time Consistency. This means check the if the data are indexed properly and all the data are uniform at a point of time. If the system crash then the data should be properly retrieved in a point of time. Simply set a backup policy and data retrieval point.
5.       Check application consistency. If the sales data showing in your report is differs from the sales data showing in another application then you need to find justification. If your application is showing a day old data then you need to compare with other application which is showing a day old data. If find inconsistent, must be fixed.
6.       Check transaction consistency. If you have a transactional system where sales data updates in a real time environment. Then you need to check that the sales volume shows same in managerial application and employee report in same point of time. If you have historical analytical system you need to check the delay of update of your system.
Data commonality
Covered in data consistency in design.
Also data commonality refers to data architecture and variable definition guidelines. If a rule is set in an organization that the dimension key need to be 5 digit. Then it should be 5 digits across all tables in the system.
Dependability comes from the accuracy of the data on time. Check all the business rule and the generated data return time. With multiple checks in multiple times in different scenario build the confidence of a system that it can show sales volume of the guy name X from Y dealership of Z country or guy P from Q district of R state in S country.
Efficiency leads to dependability of a data. To check the data is efficient you need to:
1)       Check the number of joins pulled in a report. Joins are the single most expensive item regarding data. Less number of joins makes a system more efficient.
2)       Avoid loops in modeling. It causes cancer in system.
3)       Unnecessary variable declaration.
4)       Remove cascading if possible.
5)       Use flags more in condition in place of major where clause.
6)       Minimize user input and prompts.
7)       Schedule reports.
8)       Refresh reports with data before users open computer in morning.
9)       Remove trap, loops, fan trap, circular joins.
10)   Avoid outer joins.
Error tolerance
1)       There should be a fault tolerance system works for the database.
2)       Check if the tables contain proper error handling mechanism.
3)       Checks the database have proper recovery system.

Hardware independence
This is common problem now days because of the mobile application boom. Data should be visible and proper in different hardware. You need to check the data in desktop and mobile application.
Sales data is one of the most secured data in an organization. Information’s are confidential. For a data quality inspector there is no need to check the physical or access level security for a data. Every organization has a security or governance plan and policy which safeguard the data integrity. But you need to make sure the documentation provided by the different team of data and network security which actually save your project data.

But you need to :
1)       Check if any row level or object level security is needed or not. If needed then check the sanctity of the security.
2)       Authorized data should be visible to authorized person. Not everyone. Setup a security matrix sitting with SME and implement.
3)       Test the security matrix using different dummy id setup for different level. This should be thoroughly tested.
4)       Data administration should be controlled by skilled employees and at least three layer of data security should be implemented.
Maintainability is the one of the most important parameter to check as data quality depend on availibity of data. As a regular maintenance is needed for a data set you need to check if the data is maintainable or not.
For that  you need to check:
1)       Data loading mechanism and delay in data availibity. That means if you have a OLTP system the sales data should appear in your dashboard almost in real time. If OLAP then analytical data should appear in next business day.
2)       Quantity or volume of the data. If the data set is huge and cannot be broken into pieces by dimensions or similar type of dimension is present with a large volume then need to check modeling methodology to retrieve and maintain that.
3)       Complexity of the report. Sometime it happens that the retrieval query is too much complex and if any error occur, code fix will take time. It happens due to lack of documentation and non-reusability of code snippets.
4)       Joins are the single most expensive item in data retrieval time. Less number of joins is less complexity in code and less data retrieval time. De-normalization solves most of the problem and makes data maintainable.
5)       Some of the other maintenance check needed for below times(Mentioned in original document)
a.       problem recognition time
b.       Administrative delay time
c.        maintenance tools collection time
d.       Problem analysis time
e.        Change specification time
f.         Active correction time
g.       Local testing time
h.       Global testing time
i.         Maintenance review time
j.         Total recovery time
Modularity is important while coding for a dataset. Modularity enhances maintainability. If the data is not modularize and every piece of data is loaded as a heap or dump then it get reflected in all the development done on that piece of dataset. Check if the data set is having modular categorization of data by business values.
Reusability is having vast meaning in data quality analysis. It means reusability of all components in a dataset. Sometime creating more usability of data creates lots of joins which affects the performance. So smart and to the point reusability is important.
Check if your data has reusable content or not. As example sales data is one of the most reusable content in any industry. But the presence of that data should be in standardizing format and easily referenced.
Documentation have major role in reusability. Improper documentation can lead data modeling to confusion. A good practice I saw is to maintain a table where definition of all the column of the tables in a schema is detailed. On that table column should be like, Table name, Column Name, Description, Updated by date, Updated by person name.
This can be achieved by several means:
1)       Data modeling with organization standard naming convention.
2)       Proper Meta data table.
3)       Organized modeling separated by business rules.


Ease of understanding the function of a data set. VIN numbers have meaning for each and every number in it. But to decrypt it you need to put logic behind it. So it is not simple to use to get sales volume. Individual level sales volume can show the VIN number which they sold but sales volume should be directly picked up from the sales to a dealership. That makes the whole set easily understandable and simple.
Above mentioned checks leads to Accuracy of data. But several level of testing makes a data accurate. There is lots of testing methodology available in market but trust me best testing can be done by the people who use the data.
So before delivery:
·         Do unit testing.
·         Do testing with legacy system
·         Do testing referencing BRD
·         Do pick up and check testing
·         Introduce error and do fault testing
·         Ask SME’s for doing the testing
·         And also test with all testing methodology
If a data is accurate then user can live with the other problem in data quality but if data is not accurate then it is the end of the story.

If you read this whole story I believe you may not agree to all of my point. And that’s the beauty of data modeling and data quality analysis. It differs from people to people, organization to organization, structure to structure, business to business. But yes , you can start from this and start your own definition. Thanks for reading.

No comments:

Post a Comment