|
Controlling data quality is one of the most important tasks on any data migration
or integration project. Most ETL platforms allow you to use regular expressions
to identify improperly formatted input data. In this article, I will cover some
often-used cases for regular expressions in the world of data quality.
1. Names
A name typically does not contain non-alphabetic characters such as digits or symbols,
unless you are the artist formerly known as Prince. The following regular expression
validates that a name contains one or more alphabets and spaces.
^[a-zA-Z\ ]+$
Note the use of ^ at the beginning and $ at the end. This ensures that the match
happens from the beginning to the end of the text, and not anywhere in the middle.
2. Integers
The following regular expression ensures that the input contains one or more digits
and nothing else:
^[0-9]+$
3. Currency
Currency values typically have one or more digits. It can optionally have a decimal
point followed by two digits. For example, 300, and 400.50 are valid but 500.757
is not. Here is the regular expression:
^[0-9]+(\.[0-9]{2})?$
4. Credit Card Numbers
Credit card numbers are 16 digits in length. Sometimes, the numbers are grouped
as four digits with either a space or a dash separating the groups. Here is the
regular expression:
^([0-9]{4}(\ |-|)){4}$
5. North American Phone Numbers
Phone numbers in North America are typically stored in the format (NNN)NNN-NNNN.
The following expression can be used to identify phone numbers that do not follow
this format:
^\([0-9]{3}\)[0-9]{3}-[0-9]{4}$
5. Email Addresses
Email addresses typically have the format of username@domain.ext where username
can have alphabets, digits, dashes and dots. The domain name can have alphabets,
dashes and dots. The extension is either a two-letter (nz, in) word or a three-letter
word (com, edu, gov). Here is the syntax for such a regular expression:
^[\w-\.]{1,}\@([\da-zA-Z-]{1,}\.){1,}[\da-zA-Z-]{2,3}$
I hope this was a good introduction into identifying bad data using regular expression.
Of course, the flexibility that regular expressions provide go far beyond what I
have covered here. If you are interested in learning more, you can find very good
articles on the Internet.
|