| Controlling data quality is
one of the most important tasks on any data migration or integration
project. Most ETL platforms allow you to use regular expressions
to identify improperly formatted input data. In this article,
I will cover some often-used cases for regular expressions
in the world of data quality.
1. Names
A name typically does not contain non-alphabetic characters
such as digits or symbols, unless you are the artist formerly
known as Prince. The following regular expression validates
that a name contains one or more alphabets and spaces.
^[a-zA-Z\ ]+$
Note the use of ^ at the beginning and $ at the end. This
ensures that the match happens from the beginning to the end
of the text, and not anywhere in the middle.
2. Integers
The following regular expression ensures that the input contains
one or more digits and nothing else:
^[0-9]+$
3. Currency
Currency values typically have one or more digits. It can
optionally have a decimal point followed by two digits. For
example, 300, and 400.50 are valid but 500.757 is not. Here
is the regular expression:
^[0-9]+(\.[0-9]{2})?$
4. Credit Card Numbers
Credit card numbers are 16 digits in length. Sometimes, the
numbers are grouped as four digits with either a space or
a dash separating the groups. Here is the regular expression:
^([0-9]{4}(\ |-|)){4}$
5. North American Phone Numbers
Phone numbers in North America are typically stored in the
format (NNN)NNN-NNNN. The following expression can be used
to identify phone numbers that do not follow this format:
^\([0-9]{3}\)[0-9]{3}-[0-9]{4}$
5. Email Addresses
Email addresses typically have the format of username@domain.ext
where username can have alphabets, digits, dashes and dots.
The domain name can have alphabets, dashes and dots. The extension
is either a two-letter (nz, in) word or a three-letter word
(com, edu, gov). Here is the syntax for such a regular expression:
^[\w-\.]{1,}\@([\da-zA-Z-]{1,}\.){1,}[\da-zA-Z-]{2,3}$
I hope this was a good introduction into identifying bad
data using regular expression. Of course, the flexibility
that regular expressions provide go far beyond what I have
covered here. If you are interested in learning more, you
can find very good articles on the Internet. |