News and Information for the Data Integration community

June 2007

 

SAP related solutions

Sage Software related solutions


Data migration solutions

Weekly Webinar on xFusion. Register via email to:
info@softlabsco.com

Newsletter Unsubscribe

More Newsletters

Most Recent

May 2007

April 2007

Mar 2007

Feb 2007

Jan 2007

Dec 2006

Nov 2006

Improve Data Quality with Regular Expressions

Controlling data quality is one of the most important tasks on any data migration or integration project. Most ETL platforms allow you to use

regular expressions to identify improperly formatted input data. In this article, Dr. Pradeep Tapadiya will cover some often-used cases for regular expressions in the world of data quality.

Announcements

Latest happenings in xFusion and at Software Labs.

Did you know?


From the support desk

 
 

Improve Data Quality with Regular Expressions
Author: Dr. Pradeep Tapadiya

 

Controlling data quality is one of the most important tasks on any data migration or integration project. Most ETL platforms allow you to use regular expressions to identify improperly formatted input data. In this article, I will cover some often-used cases for regular expressions in the world of data quality.

1. Names

A name typically does not contain non-alphabetic characters such as digits or symbols, unless you are the artist formerly known as Prince. The following regular expression validates that a name contains one or more alphabets and spaces.
^[a-zA-Z\ ]+$

Note the use of ^ at the beginning and $ at the end. This ensures that the match happens from the beginning to the end of the text, and not anywhere in the middle.

2. Integers

The following regular expression ensures that the input contains one or more digits and nothing else:
^[0-9]+$

3. Currency

Currency values typically have one or more digits. It can optionally have a decimal point followed by two digits. For example, 300, and 400.50 are valid but 500.757 is not. Here is the regular expression:
^[0-9]+(\.[0-9]{2})?$

4. Credit Card Numbers

Credit card numbers are 16 digits in length. Sometimes, the numbers are grouped as four digits with either a space or a dash separating the groups. Here is the regular expression:
^([0-9]{4}(\ |-|)){4}$

5. North American Phone Numbers

Phone numbers in North America are typically stored in the format (NNN)NNN-NNNN. The following expression can be used to identify phone numbers that do not follow this format:
^\([0-9]{3}\)[0-9]{3}-[0-9]{4}$

5. Email Addresses

Email addresses typically have the format of username@domain.ext where username can have alphabets, digits, dashes and dots. The domain name can have alphabets, dashes and dots. The extension is either a two-letter (nz, in) word or a three-letter word (com, edu, gov). Here is the syntax for such a regular expression:
^[\w-\.]{1,}\@([\da-zA-Z-]{1,}\.){1,}[\da-zA-Z-]{2,3}$

I hope this was a good introduction into identifying bad data using regular expression. Of course, the flexibility that regular expressions provide go far beyond what I have covered here. If you are interested in learning more, you can find very good articles on the Internet.

Software Labs Releases xFusion Studio version 3.2
The new version simplifies integration of data across complex business applications, adds Peachtree data connector, and xFusion Scheduler among other features. Read the entire article online here: http://www.softlabsco.com/Company/PressRelease.aspx

Fixed-Price Migration Services for SAP
Software Labs continues to offer SAP Business One partners an easier way to reduce the cost of new customer implementations while accelerating the project toward on-time completion. Working as an extension of your business, Software Labs uses its proprietary data conversion templates, proven migration methodology and popular xFusion Studio software to quickly and reliably migrate customers' data from QuickBooks, Peachtree, or Microsoft Dynamics Great Plains. To learn more contact Software Labs at info@softlabsco.com or call 916-773-6272 ext. 130

Did you know that xFusion Studio supports generating the log file name dynamically based on the date and time of the run. Read more…

Support Questions:

Dear Support,

I have two tables containing a column of type string. The value in the field may be made up of digits alone, or digits and letters.

TableA contains:

Key Value
00123456

Hello

000000123456 World
00123456A Goodbye
000000123456A Moon
000000123456A Moon

TableB contains:

Key Value
000123456

Red

000000123456A Blue

I need to join the tables using this business rule:

"If Key is made up of digits, any zeroes on the left are ignored. However, if Key has non-numeric characters, any zeroes on the left are significant." My business rule would say that 00123456 and 000000123456 in Table A and 000000000000123456 in Table B are all the same value.Also, 00123456A and 000000123456A in Table A and 123456A in Table B are different.

How can I do this in xFusion Studio?

Thanks

Hi,

The transform function RegexFind can be used to find the strings with non-numeric characters in it. Then using if condition on the RegexFind, we can trim the leading zeros for the strings with non numeric strings.

To trim the leading zeros, the trick is to convert the string to a double value using ToDouble function and convert the resultant double value to string. When a string is converted to numeric values the leading zeros will be ignored automatically.

The transform query looks like this:

 

Select (

If (

RegexFind([TableA.Key], "[a-zA-z]")=true,

[TableA.Key],

ToString(ToDouble([TableA.Key]))

) as Key,

[TableA.Value]

)

Same has to be done for tableB also.

An aggregate query with join between these two resultant tables achieves the required result.

Hope this helps!

Best regards,

Support


Software Labs, Inc. 1225 Pleasant Grove Blvd. #100, Roseville CA 95768, USA
ph: 1-916-773-6272 fax: 916-773-6281 web: http://www.softlabsco.com email: info@softlabsco.com
unsubscribe via email: unsubscribe@softlabsco.com unsubscribe via web: click here