News and Information for the Data Management Community

Spring 2008

 

Weekly Webinar on xFusion. Register via email to:
info@softlabsco.com

Top three gems for detecting duplicates

When data is consolidated or integrated across multiple data sources, handling duplicate records is a common issue. This article discusses three algorithms that detect duplicate names even if the spellings do not match exactly.

Announcements
Latest happenings in xFusion and at Software Labs.

Did you know?


From the support desk

 
 


Top Three Gems for Detecting Duplicates
Dr. Pradeep Tapadiya

 


Many businesses run multiple business applications. As we live in an error-prone world, it is common to find the same person's name spelled differently in two different applications. When consolidating data, detecting such inexact duplicates is very important from a data quality perspective. In this article, I will discuss a few commonly used algorithms for approximate name matching.

Soundex

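Soundex belongs to the family of phonetic algorithms. It can match, for example, Bryan and Brian. The basis of this algorithm is to index a word by its English language pronunciation. The first letter is kept as is. In the rest of the word, all the occurrences of a, e, h, i, o, u, w, and y are dropped, and each remaining consonant is replaced by a digit, with letters that sound the same sharing a digit. For example, b, f, p, and v are grouped together. Adjacent letters with the same digit are coded once, and the result is padded with zeros or truncated to one letter followed by three digits. Bryan and Brian both become B650.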

Note that Soundex retains the first letter of the string, so two names that differ in their first letter never match. For example, it will not match Bryan with Fryan.
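To make the steps concrete, here is a minimal Python sketch of the common American Soundex variant. The function name soundex and the CODES table are illustrative names chosen for this example, not part of xFusion, and other Soundex variants differ in small details such as how h and w are treated.

```python
# Digit assigned to each group of similar-sounding consonants.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Return a 4-character Soundex key (letter + three digits), or "" if no letters."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()           # the first letter is kept as is
    prev = CODES.get(first, "")      # its digit still suppresses an adjacent repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w are dropped and do not separate repeats
        code = CODES.get(ch, "")     # vowels and y have no digit
        if code and code != prev:
            result += code
        prev = code                  # a vowel resets prev, so a repeat after it is coded again
    return (result + "000")[:4]      # pad with zeros or truncate to four characters


print(soundex("Bryan"), soundex("Brian"), soundex("Fryan"))  # B650 B650 F650
```

Bryan and Brian share the key B650, while Fryan gets F650 because the first letter is part of the key.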

Metaphone and Double Metaphone

Metaphone is also a phonetic algorithm. It was developed to overcome the deficiencies in the Soundex algorithm. It essentially uses a variable-length key as opposed to the fixed-length key used by Soundex. Double Metaphone is the second generation of the Metaphone algorithm. It generates two keys for each word, a primary and an alternate, and two words are considered a match if they share a key. This accounts for some ambiguous cases as well as multiple variants of surnames with common ancestry. For example, Double Metaphone can match Smith and Schmidt.

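Note that, like Soundex, Metaphone is sensitive to the start of the word. The key begins with the encoding of the first sound, so names that begin with different sounds will not match.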
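The two-key comparison can be sketched in a few lines of Python. This example does not compute Double Metaphone keys, since the full rule set is long. Instead, it hard-codes the key pairs commonly quoted for Smith and Schmidt, and the helper name keys_match is an illustrative assumption.

```python
from typing import Tuple


def keys_match(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Return True if two (primary, alternate) key pairs share a non-empty key."""
    return bool({k for k in a if k} & {k for k in b if k})


# Key pairs commonly quoted for Double Metaphone; hard-coded here, not computed.
smith = ("SM0", "XMT")
schmidt = ("XMT", "SMT")

print(smith[0] == schmidt[0])      # False: the primary keys alone differ
print(keys_match(smith, schmidt))  # True: both words share the key XMT
```

Comparing only the primary keys would miss this pair, which is why the second key matters for surname variants.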

Levenshtein Distance

Levenshtein distance is a metric for measuring the difference between two strings. The distance is defined as the minimum number of operations needed to transform one string into another, where an operation is an insertion, deletion, or substitution of a single character. For example, the Levenshtein distance between kitten and sitting is 3: substitute s for k, substitute i for e, and insert g at the end. Using this algorithm, you can detect words that lie within a small distance of each other. Many spell checkers, including the one in Microsoft Office, use this algorithm.

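Levenshtein distance does not suffer from the first-letter retention problem. However, it takes longer to run. Computing one distance takes time proportional to the product of the two string lengths, and because the algorithm compares pairs of strings rather than producing a key that can be indexed, every candidate pair has to be compared.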
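The distance can be computed with a standard row-by-row dynamic programming table. The following Python sketch is one common way to write it. The function name levenshtein is chosen for this example and is not an xFusion function.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a                       # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))    # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                     # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,          # delete ca
                current[j - 1] + 1,       # insert cb
                previous[j - 1] + cost,   # substitute ca with cb, or keep it
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Bryan", "Fryan"))     # 1
```

Bryan and Fryan, which the phonetic keys treat as unrelated, are only one substitution apart under this measure.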

These algorithms support the task of data cleansing. Unclean data costs companies billions of dollars each year around the globe. Any proper data management project should include some aspect of data cleansing. The above algorithms should be an expected feature in any comprehensive data management software solution.

xFusion Studio v3.4 Released
xFusion Studio v3.4 added some new features including additional capabilities for SAP customers, improved transformation functions, and a new way to quickly extract data from Microsoft Outlook. Look for an upcoming announcement of xFusion Studio v4.0 in June.

Do you need a better data conversion path to new ERP systems? See what businesses have done recently to make the whole process much easier. Read more online here.

Need to move data in and out of your ERP system, or between departments? More companies are seeing vast improvements in productivity with new software solutions. Read more online here.

Supply & Demand Chain Executive magazine shows how one manufacturing company has shortened processing times to increase sales. Read more online here.

New Partners
We are happy to announce that Beyond438, Performance PR, Business and Technology Strategies, Netsirk Technologies, Tallman Palmer, and Software Solutions have joined as partners of Software Labs. These companies are experts in their respective industries, and valuable additions to our partner program.

Sales Promotions
Our Q1 promotion was a great success! Thanks to all who participated. Through the end of June 2008, we are extending a special discount to new and existing customers. Take advantage of these aggressive deals today by contacting sales@softlabsco.com.

Promotion highlights include:
25% discount for new customers (software purchases only)
25% discount on additional user licenses of xFusion Studio and WebDB Server
$2500 fixed-price migration services (contact sales for more details)

Remember, we provide FREE software training with every product purchase!



Did you know xFusion can compare two data sources side-by-side, for example SAP and Microsoft Excel? Read more…

Support Questions:

Dear Support,

In one of the data migration projects that we are handling, it is important for us to get the list of records with missing IDs. If we can get this list before uploading the data, we can have our clients fix those records. Can xFusion help us in this regard?

Thanks

Hi,

Yes, you can handle such validation checks with the transformation functions in xFusion. The IsNullOrEmpty transform function, used together with the Filter function, handles your requirement very efficiently. Here is an example:

Filter ( [[SampleData]], IsNullOrEmpty([SampleData.ID]))

The query above returns all the records in "SampleData" that have a missing ID.

Hope this helps.

Thank you,

Support


Software Labs, Inc. 1225 Pleasant Grove Blvd. #100, Roseville CA 95768, USA
ph: 1-916-773-6272 fax: 916-773-6281 web: http://www.softlabsco.com email: info@softlabsco.com
unsubscribe via email: unsubscribe@softlabsco.com unsubscribe via web: click here