|
Data profiling is the first phase of any data migration or
data integration project. Broadly speaking, data profiling
helps you in two different ways:
-
Identify potential problems in the current data. This
helps in avoiding late project surprises.
-
Give better understanding of your current data. This helps
in, for example, planning your final data schema.
Even if you are implementing a continuous improvement plan
for your organization, you need to proactively manage data
quality. As you are well aware, garbage-in is garbage-out.
In this article, I will cover some common data profiling techniques
to help you identify potential problems in the input data.
If you are undertaking any data migration/integration project,
you can use these techniques to establish "rules"
that constitute towards quality data.
I. Accuracy
Accuracy addresses the correctness of the data. Sometimes,
even a quick examination of input may reveal data that is
obviously incorrect:
-
Numbers: Are there negative values in
a column where we are expecting only positive values?
-
Dates: Are there future dates for past
events?
-
Types: Are there alphabets in a numeric
column?
-
Variance: When using multiple data sources,
it is possible that the master data on one system is not
consistent with others. For example, your ERP system may
report an address for a customer that is different than
the one in your CRM system. For such cases, you need to
work with the stakeholders to identify the data source that
must be considered as the reference. Sometimes, you may
have to manually cleanse the data.
II. Conformance
Conformance is an aspect of accuracy. It helps discover data
that is not valid for your business requirements.
-
Domain validation: Does the input column
conform to the defined values. For example,
-
Range validation: Does the input column
conform to the defined range of values? For example, the
ages of children in kindergarten are expected to be between
4 and 5.
-
Business rules: Identify data that does
not confirm to your specific business rules. For example,
you business rule may require that a column may not contain
negative values.
III. Uniqueness
Duplicate entries have always been a problem for data integration
projects. Specifically, columns that are deemed as primary keys
must not contain duplicate entries.
-
Exact duplicates: Are there any duplicates
in the column(s)? For example, the CustomerID
column in the customer master table is expected to contain
unique values.
-
Phonetic duplicates: Are there any values
in the column that don't exactly match but sound similar?
An example set is Bryan and Brian.
Algorithms such as Soundex and Double Metaphone
can detect such duplicates.
-
Typographical errors: Are there values
that don't exactly match because of typographical errors?
Levenstein Distance algorithm, for example, can
find values that are within n keystrokes of
each other.
IV. Consistency
Does your data conform to the standards? If it does not,
you would need a strategy to cleanse the data first.
-
Non-standard Representations: California
may be represented as CA, Ca,
California, etc. North American phone numbers
may be in the multiple formats such as (999)999-9999,
1-999-999-9999, etc. Non-standard representations
complicate data-driven projects.
-
Date Formats: Do all values have a consistent
date format?
-
Sort Order: Are all values in required
ascending or descending order?
-
Completeness: Are there any values in
the column that are either null or empty?
V. Integrity
When relating multiple columns, either from the same table,
or from tables across multiple data sources, it is important
to identify the cases where the relationship fails.
-
Relationship enforcement: If there is
a primary/foreign key relationship, is it enforced?
-
Orphan analysis: When joining tables
from two different data sources, it is always important
to identify the primary keys that are present in one table
but not found in another table.
-
Interdependency: Within a table, the
zip code field always depends on the country code.
I hope you found this article useful. In the next article,
I will cover data profiling techniques that help you understand
your current data. |