Data quality is the fitness level of data. The data set we receive at the beginning of any project needs to be analyzed by several statistical means. Data quality is the result of the value the data generates in alignment with the subject-area goal. There are many "ilities" (around 200) for judging the quality of data, but data analysts pick some of them to weigh the purity and impurity of a data set. The goal is to get to-the-point, correct, and effective data.
***
Data quality has a vast scope of discussion. Several organizations are researching and inventing new ways to analyze data. We are not going to discuss everything in one blog post (that would be impossible). Here I want to explain some parameters of data quality that I use in my BI projects. I will use automobile sales information as the example.
Five Ws + H (a short introduction)

What does data quality mean?
It determines whether the data we use for business decisions is dependable, well processed, and error free. Quality data is achieved through a combined analysis of business and technology.

When should it be applied?
Data quality analysis is not a project, nor is it the same as quality control. It is a horizontal function in an organization, used in different SDLC phases of the different projects running in that organization. Data is generated regularly and problems are identified frequently. Data changes as organizational planning and business rules change, and quality data is always viewed in terms of business impact. Whether you are starting a project or the project is already in support (maintenance), data quality analysis can be implemented at any time.

What are the benefits of implementing it?
Maintaining or implementing data quality is a costly affair, but it benefits every user who works with the data. Even a fully computerized organization has a huge manual task of removing redundant data from its reports using separate tools; data quality work removes that redundancy. In my experience, I have seen people extract data from two systems and match the results just to gain confidence in the data. Implementing data quality analysis removes that problem.

Who needs to implement it?
Business owns the data, SMEs understand the data, technology structures the data, and end users use the data. To get quality data there must be a collaborative effort from all stakeholders of the data.

Where does it need to be implemented?
When data quality is implemented, it captures all aspects of the data. Suggestions from data quality analysis result in modifications to all the systems involved in an organization. It starts with the legacy data the organization already has and continues to the operational data generated every day.

How is it implemented?
There are several tools in the industry that can get it done with some organization-level customization. Those tools are built on basic data analysis theories. You can also analyze the data yourself by checking several parameters.
A real-life data quality measurement

Sales information is always a high priority in an organization, as it is the lifeline of the business, and sales volume feeds most of the decision-making KPIs. The automobile (manufacturing) industry embraced automation before most other industries, but it has been among the last to understand the importance of data flow.

A labor-intensive organization works with people, and those people update information manually. In the oil industry, retail volume comes directly from automated gas stations; in the automotive industry, however, third-party settlement and negotiation are involved, and dealership involvement makes the whole thing complicated.

In any automotive organization it is a common problem to get correct sales data across the different hierarchies, because different departments look at sales volume differently. The after-sales department does not see car sales; it tracks parts sales volume. The sales department does not see parts sales volume. A multinational car company has separate departments at headquarters for each country's car sales. Trend management analytics compares old car sales volume to new car sales volume. The finance department sees sales volume per dealership and per salesperson. Automotive trainers look at sales volume against the level of knowledge of the dealership professionals; they measure the uptake of their courses using car and parts sales volume as a parameter. A dealership looks at company sales through its own sales, its sales compared with other dealerships, and its salespersons' sales volumes. A salesperson sees his own sales volume and compares it with the trend to get his incentive.

So they all have different views of sales volume. As a result, sales volume has several different aspects that need to be checked, and a large amount of information must be verified to represent it. In a project to calculate the incentives given to sales professionals, the project owner wants to check the data quality of the modeled data on which the reports are built before going to production.
If no data quality measurement tool is provided, I personally prefer to pick from the 39 parameters set by Blundell, Hines, and Stach for information quality. It is obvious that you may not require all of them. In this example the database is Oracle, the ETL is Informatica, and the reporting tool is Business Objects. Let's start the analysis of the data with some of the parameters they define:
1. Accuracy

Read through the rest of the list first; accuracy is discussed at the end, as it is the single showstopper in data quality analysis.
2. Adaptability (changeability, modifiability, flexibility)

1. Check whether the reports are dynamic. If report data is not dynamic and data driven, the same report needs modification over time, which hurts changeability.
2. Check the variable definitions. If the variables are smart enough to handle rule changes, the solution adapts to business rule changes. Example: the company starts doing business in Cuba and needs to add that country's sales volume to the sales data. If country is a variable, the reports pick it up automatically as the new data comes in (see the sketch after this list).
3. Prohibit hardcoding. Try to remove hardcoding from the universe, the data tables, and the reports.
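As a minimal sketch of the data-driven approach, here is what the difference looks like in SQL. The SALES_FACT and COUNTRY_DIM tables and their columns are hypothetical names used only for illustration:

-- Hardcoded version: breaks the moment a new market (e.g. Cuba) is added.
-- SELECT country, SUM(sales_amount) FROM sales_fact
-- WHERE  country IN ('USA', 'CANADA', 'MEXICO') GROUP BY country;

-- Data-driven version: the country list lives in a dimension table,
-- so a new market shows up in the report without any code change.
SELECT d.country_name,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact  f
JOIN   country_dim d ON d.country_key = f.country_key
WHERE  d.active_flag = 'Y'            -- hypothetical "market is live" flag
GROUP  BY d.country_name;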
3. Auditability

1. Check whether an audit trail is enabled; if not, enable it for the important tables, such as sales volume and dealer attributes.
2. Create a purging plan so you do not end up maintaining huge logs.
3. Assign a timestamp to every set of data. History information helps in auditing a database (a sketch follows this list).
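A minimal sketch of both checks on Oracle, using traditional auditing syntax; the SALES schema and the SALES_FACT and DEALER_DIM tables are hypothetical names for illustration:

-- Enable a basic audit trail on the important tables.
AUDIT INSERT, UPDATE, DELETE ON sales.sales_fact BY ACCESS;
AUDIT INSERT, UPDATE, DELETE ON sales.dealer_dim BY ACCESS;

-- Add timestamps so history can be reconstructed during an audit.
ALTER TABLE sales.sales_fact ADD (
    created_ts TIMESTAMP DEFAULT SYSTIMESTAMP,
    updated_ts TIMESTAMP
);

Your DBA may prefer fine-grained or unified auditing instead; the purging plan would then be built on the corresponding audit views.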
4. Availability

1. Check report output and turnaround time.
2. If output time is more than 30 seconds, do a root-cause analysis.
3. If the turnaround time (from opening the application to getting the output) is more than 3 minutes, availability is challenged.
4. Test with a full set of filters rather than narrowing down the criteria.
5. Reports and data should be available in a relevant location. They should run in the place where all the analytical data is present; location determines the quality of information. Example: sales data should appear right after the employee information.
5. Completeness and correctness

1. Compare with the legacy system data and check correctness.
2. Write SQL queries of your own, separate from the application-generated SQL, based on the business rules in the requirements. Return a random small set of data, compare it with the full volume, and do a theoretical match (see the sketch after this list).
3. Check the user expectation in the BRD against the data returned in the reports. Example: if the user wants to see only California sales, it is irrelevant to give them West Coast sales data with drill-down capabilities.
4. Check language requirements. Sometimes English-speaking developers assume Spanish and French use the same character set and create a mess in the reports. The sets may be similar, but characters like ñ or Œ are never used in English, so confirm the completeness of the data with native-language experts.
5. The dollar is widely used in international calculations, but multinationals face plenty of problems converting native currencies to dollars: the exchange rate changes constantly, and the sales data from different countries is generated in different currencies. The analytical data should hold both and fix a rule for exchange rate conversion.

Completeness, correctness, and accuracy have several sub-parameters to deal with. These are the end-result parameters of data quality measurement.
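A minimal sketch of check 2, reconciling the modeled data against an independent query written from the business rules. The table and column names (SALES_FACT, LEGACY_SALES, dealer_code, sales_amount) are hypothetical:

-- Independent reconciliation query, written from the BRD rather than
-- copied from the application-generated SQL.
SELECT m.dealer_code,
       m.model_year,
       m.total_sales                   AS modeled_total,
       l.total_sales                   AS legacy_total,
       m.total_sales - l.total_sales   AS difference
FROM   (SELECT dealer_code, model_year, SUM(sales_amount) AS total_sales
        FROM   sales_fact GROUP BY dealer_code, model_year) m
JOIN   (SELECT dealer_code, model_year, SUM(sales_amount) AS total_sales
        FROM   legacy_sales GROUP BY dealer_code, model_year) l
       ON l.dealer_code = m.dealer_code AND l.model_year = m.model_year
WHERE  m.total_sales <> l.total_sales;   -- every row returned is a correctness gap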
6. Consistency

1. Check consistency between the data and the documentation. If the user-provided mockup says "Employee ID", it should not be changed to "Emp ID" for better space utilization.
2. Check consistency with the bigger picture. Sales data is used in different applications across the organization; if different applications show different numbers, consistency and reliability suffer. Make sure the downstream feed arrives at the same time as in the other applications using the same source.
3. Check consistency of design. A design must follow guidelines and standards; if it does not, the same set of data looks different when compared. Example: in one application dealer name and dealer code are referred to as N_Dealer and C_Dealer. If somewhere else they appear as Dealer_Code and Dealer_Name, the system becomes hard to maintain.
4. Check point-in-time consistency. This means checking that the data is indexed properly and that all the data is uniform at a given point in time. If the system crashes, the data should be recoverable to a point in time. Put simply, set a backup policy and a data retrieval point.
5. Check application consistency. If the sales data shown in your report differs from the sales data shown in another application, you need to find a justification. If your application shows day-old data, compare it with another application that also shows day-old data; if they are inconsistent, it must be fixed (see the sketch after this list).
6. Check transaction consistency. If you have a transactional system where sales data updates in real time, check that the sales volume is the same in the managerial application and in the employee report at the same point in time. If you have a historical analytical system, check the update delay of your system.
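A minimal sketch of the application consistency check (point 5), comparing two applications that are both supposed to show day-old data. The MY_APP_SALES and OTHER_APP_SALES tables and the snapshot_date column are hypothetical:

-- Both applications should report the same day-old totals per dealer.
SELECT a.dealer_code,
       a.daily_total  AS my_app_total,
       b.daily_total  AS other_app_total
FROM   (SELECT dealer_code, SUM(sales_amount) AS daily_total
        FROM   my_app_sales
        WHERE  snapshot_date = TRUNC(SYSDATE) - 1
        GROUP  BY dealer_code) a
FULL OUTER JOIN
       (SELECT dealer_code, SUM(sales_amount) AS daily_total
        FROM   other_app_sales
        WHERE  snapshot_date = TRUNC(SYSDATE) - 1
        GROUP  BY dealer_code) b
       ON b.dealer_code = a.dealer_code
WHERE  a.daily_total IS NULL
   OR  b.daily_total IS NULL
   OR  a.daily_total <> b.daily_total;   -- any row is an inconsistency to justify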
7. Data commonality

Mostly covered under consistency of design above. Data commonality also refers to the data architecture and variable definition guidelines: if a rule is set in the organization that a dimension key must be 5 digits, then it should be 5 digits across all tables in the system (a sketch of that check follows).
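A minimal sketch of checking the 5-digit key rule on one hypothetical table, DEALER_DIM with key column dealer_key; in practice you would repeat or generate this check for every dimension table:

-- Every row returned violates the "dimension key is 5 digits" rule.
SELECT dealer_key
FROM   dealer_dim
WHERE  NOT REGEXP_LIKE(TO_CHAR(dealer_key), '^[0-9]{5}$');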
8. Dependability

Dependability comes from the accuracy of the data over time. Check all the business rules and the return time of the generated data. Multiple checks, at multiple times, in different scenarios build confidence that the system can show the sales volume of person X from dealership Y in country Z, or of person P from district Q of state R in country S.
9. Efficiency

Efficiency leads to dependability of the data. To check that the data is efficient you need to:

1) Check the number of joins pulled into a report. Joins are the single most expensive item in data retrieval; fewer joins make a system more efficient.
2) Avoid loops in the model; they act like a cancer in the system.
3) Remove unnecessary variable declarations.
4) Remove cascading where possible.
5) Use flags in conditions instead of large WHERE clauses (see the sketch after this list).
6) Minimize user input and prompts.
7) Schedule reports.
8) Refresh reports with data before users switch on their computers in the morning.
9) Remove traps, loops, fan traps, and circular joins.
10) Avoid outer joins.
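A minimal sketch of point 5, replacing a long WHERE clause with a precomputed flag. The reportable_flag column on the hypothetical SALES_FACT table would be set once during the ETL load:

-- Instead of repeating the business rule in every report query ...
-- WHERE sale_type = 'RETAIL' AND cancel_code IS NULL AND settlement_status = 'FINAL'
-- ... set a flag at load time and filter on it.
SELECT dealer_code,
       SUM(sales_amount) AS total_sales
FROM   sales_fact
WHERE  reportable_flag = 'Y'
GROUP  BY dealer_code;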
10. Fault tolerance

1) There should be a fault tolerance system working for the database.
2) Check that the tables have a proper error handling mechanism (a sketch follows this list).
3) Check that the database has a proper recovery system.
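A minimal sketch of point 2 using Oracle DML error logging, so one bad row does not abort a whole load. The SALES_FACT and SALES_STAGE tables are hypothetical:

-- One-time setup: create an error logging table for the target.
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'SALES_FACT');
END;
/

-- Load from staging; rejected rows land in ERR$_SALES_FACT instead of failing the job.
INSERT INTO sales_fact (dealer_code, sale_date, sales_amount)
SELECT dealer_code, sale_date, sales_amount
FROM   sales_stage
LOG ERRORS INTO err$_sales_fact ('nightly load') REJECT LIMIT UNLIMITED;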
11. Hardware independence

This is a common problem nowadays because of the mobile application boom. Data should be visible and rendered properly on different hardware, so you need to check the data in both the desktop and the mobile application.
12. Integrity and security

Sales data is among the most protected data in an organization, and the information is confidential. A data quality inspector does not need to check the physical or access-level security of the data; every organization has a security or governance plan and policy that safeguards data integrity. You do, however, need to review the documentation provided by the data and network security teams, which is what actually protects your project data.

Beyond that you need to:

1) Check whether any row-level or object-level security is needed; if it is, check that the security actually holds (see the sketch after this list).
2) Make sure authorized data is visible only to authorized people, not to everyone. Set up a security matrix with the SME and implement it.
3) Test the security matrix using dummy IDs set up for the different levels; this should be thoroughly tested.
4) Data administration should be controlled by skilled employees, and at least three layers of data security should be implemented.
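A minimal sketch of row-level security driven by a security matrix table. The SALES_FACT and USER_SECURITY_MATRIX tables are hypothetical, and in a real project this would more likely be done with Oracle VPD or with universe-level restrictions in Business Objects:

-- Each user only sees the dealerships granted to them in the security matrix.
CREATE OR REPLACE VIEW sales_fact_secure AS
SELECT f.*
FROM   sales_fact f
JOIN   user_security_matrix m
       ON  m.dealer_code = f.dealer_code
       AND m.user_name   = USER;        -- USER = current database login

The dummy IDs created for each level of the matrix can then be used to confirm that each login sees only its own rows.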
13. Maintainability

Maintainability is one of the most important parameters to check, because data quality depends on the availability of data, and a data set needs regular maintenance. You need to check whether the data is maintainable or not. For that, check:

1) The data loading mechanism and the delay before data is available. If you have an OLTP system, sales data should appear in your dashboard almost in real time; if OLAP, analytical data should appear by the next business day (see the sketch after this list).
2) The quantity or volume of the data. If the data set is huge and cannot be broken into pieces by dimensions, or a similar type of dimension is present with a large volume, then review the modeling methodology used to retrieve and maintain it.
3) The complexity of the reports. Sometimes the retrieval query is too complex, and if an error occurs, the code fix takes a long time. This usually happens because of poor documentation and non-reusable code snippets.
4) The joins. Joins are the single most expensive item in data retrieval time; fewer joins mean less complex code and faster retrieval. De-normalization solves most of these problems and keeps the data maintainable.
5) Some other maintenance checks, measured as times (mentioned in the original document):
a. Problem recognition time
b. Administrative delay time
c. Maintenance tools collection time
d. Problem analysis time
e. Change specification time
f. Active correction time
g. Local testing time
h. Global testing time
i. Maintenance review time
j. Total recovery time
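A minimal sketch of check 1 for an OLAP load, measuring how far the warehouse lags behind. It reuses the hypothetical SALES_FACT table and the created_ts column from the auditability sketch (both illustrative names):

-- Latency between the most recent load and now; for a nightly OLAP feed,
-- anything over one business day is a maintainability problem to investigate.
SELECT MAX(created_ts)                                            AS last_load,
       ROUND((SYSDATE - CAST(MAX(created_ts) AS DATE)) * 24, 1)   AS hours_behind
FROM   sales_fact;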
14. Modularity

Modularity is important when coding for a data set, and it enhances maintainability. If the data is not modularized and every piece of data is loaded as one heap or dump, that gets reflected in all the development done on top of that data set. Check whether the data set has a modular categorization of data by business values.
15. Reusability

Reusability has a broad meaning in data quality analysis: it means reusability of all the components in a data set. Sometimes driving more reuse of data creates lots of joins, which hurts performance, so smart, to-the-point reusability is what matters.

Check whether your data has reusable content. Sales data, for example, is some of the most reusable content in any industry, but it has to be present in a standardized format and be easily referenced.

Documentation plays a major role in reusability; improper documentation can lead data modeling into confusion. A good practice I have seen is to maintain a table holding the definition of every column of every table in a schema, with columns such as table name, column name, description, updated date, and updated by (a sketch follows).
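A minimal sketch of that data dictionary table; the name COLUMN_DICTIONARY and its layout are just one way to do it:

-- One row per column of every table in the schema.
CREATE TABLE column_dictionary (
    table_name    VARCHAR2(128) NOT NULL,
    column_name   VARCHAR2(128) NOT NULL,
    description   VARCHAR2(4000),
    updated_date  DATE          DEFAULT SYSDATE,
    updated_by    VARCHAR2(128),
    CONSTRAINT pk_column_dictionary PRIMARY KEY (table_name, column_name)
);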
16. Self-documentation

This can be achieved by several means:

1) Data modeling that follows the organization's standard naming convention.
2) A proper metadata table.
3) Organized modeling, separated by business rules.
17. Understandability and simplicity

This is the ease of understanding the function of a data set. Every position in a VIN has a meaning, but you need logic to decode it, so it is not a simple way to get sales volume. Individual-level sales records can show the VIN of the car that was sold, but sales volume should be picked up directly from the sales to a dealership. That keeps the whole data set easily understandable and simple.
18. Accuracy

All the checks above lead to accuracy of data, but it is the various levels of testing that make data accurate. There are plenty of testing methodologies on the market, but trust me, the best testing is done by the people who use the data.

So before delivery:

· Do unit testing.
· Test against the legacy system.
· Test against the BRD.
· Do pick-and-check (spot) testing.
· Introduce errors and do fault testing.
· Ask the SMEs to do testing.
· And also test with every testing methodology you have.

If the data is accurate, users can live with the other problems in data quality; but if the data is not accurate, it is the end of the story.
If you have read this whole story, I believe you may not agree with all of my points, and that is the beauty of data modeling and data quality analysis: it differs from person to person, organization to organization, structure to structure, and business to business. But yes, you can start from here and build your own definition. Thanks for reading.