Many businesses run multiple
business applications. As we live in an error-prone world, it
is possible to find the same person spelled differently in two
different applications. When consolidating data, detecting
such inexact duplicates is very important from a data quality
perspective. In this article, I will discuss a few commonly
used algorithms for approximate name matching.
Soundex Soundex comes under the category of phonetic
algorithms. Soundex can match, for example, Bryan and
Brian. The basis of this algorithm is to index a word
by its English language pronunciation. To do so, first all the
occurrences of a, e, h, i,
o, u, w, and y are removed. Next,
letters that sound the same are grouped together. For example,
b, f, p, and v are grouped
together.
Note that Soundex retains the first letter of the
string. For example, it will not match Bryan with
Fryan.
Metaphone and Double Metaphone
Metaphone is also a phonetic algorithm. It was developed to
overcome the deficiencies in the Soundex algorithm. It
essentially uses a variable-length key as opposed to
fixed-length key used by Soundex. Double Metaphone is the
second generation of the Metaphone algorithm. It generates two
keys for each word. This accounts for some ambiguous cases as
well as multiple variants of surnames with common ancestry.
For example, Double Metaphone can match Smith and
Schmidt. Note that, like Soundex, Metaphone also
retains the first letter.
Levenshtein Distance
Levenshtein distance is a metric for measuring the amount
of distance between two strings. The distance is defined as
the minimum number of operations needed to transform one
string into another, where an operation is an insertion,
deletion, or substitution of a single character. For example,
the Levenshtein distance between kitten and
sitting is 3. Using this algorithm, you can detect all
the words that are closer to each other. Many spell checkers,
including the one included in Microsoft Office, use this
algorithm.
Levenshtein distance does not suffer from the first-letter
retention problem. However, it is a combinatorial algorithm
and hence takes a longer time to run. These algorithms are
used to improve the task of data cleansing. Unclean data costs
companies billions of dollars each year around the globe. Any
proper data management project should include some aspect of
data cleansing. The above algorithms should be an expected
feature in any comprehensive data management software
solution. |