News and Information for the Data Integration community

February 2007

 

SAP related solutions

Sage related solutions

Data migration solutions

Weekly Webinar on xFusion. Register via email to:
info@softlabsco.com

Newsletter Unsubscribe

More Newsletters

Most Recent

Jan 2007

Dec 2006

Nov 2006

Data Profiling - Part I


Data profiling is an important phase of any data migration or data integration project. Dr. Pradeep Tapadiya introduces data profiling techniques and shows how data profiling can help you avoid late project suprises.

Announcements

Latest happenings in xFusion and at Software Labs.

Did you know?


From the support desk

 
 

Data Profiling - Part I

Author: Dr. Pradeep Tapadiya

 


Data profiling is the first phase of any data migration or data integration project. Broadly speaking, data profiling helps you in two different ways:

  • Identify potential problems in the current data. This helps in avoiding late project surprises.

  • Give better understanding of your current data. This helps in, for example, planning your final data schema.

Even if you are implementing a continuous improvement plan for your organization, you need to proactively manage data quality. As you are well aware, garbage-in is garbage-out.

In this article, I will cover some common data profiling techniques to help you identify potential problems in the input data. If you are undertaking any data migration/integration project, you can use these techniques to establish "rules" that constitute towards quality data.

I. Accuracy

Accuracy addresses the correctness of the data. Sometimes, even a quick examination of input may reveal data that is obviously incorrect:

  • Numbers: Are there negative values in a column where we are expecting only positive values?

  • Dates: Are there future dates for past events?

  • Types: Are there alphabets in a numeric column?

  • Variance: When using multiple data sources, it is possible that the master data on one system is not consistent with others. For example, your ERP system may report an address for a customer that is different than the one in your CRM system. For such cases, you need to work with the stakeholders to identify the data source that must be considered as the reference. Sometimes, you may have to manually cleanse the data.

II. Conformance

Conformance is an aspect of accuracy. It helps discover data that is not valid for your business requirements.

  • Domain validation: Does the input column conform to the defined values. For example,

    • The State column for US addresses is expected to be one of the fifty states of USA. Any other value is considered to be out of domain.

    • The Gender column is expected to be M or F.

  • Range validation: Does the input column conform to the defined range of values? For example, the ages of children in kindergarten are expected to be between 4 and 5.

  • Business rules: Identify data that does not confirm to your specific business rules. For example, you business rule may require that a column may not contain negative values.

III. Uniqueness

Duplicate entries have always been a problem for data integration projects. Specifically, columns that are deemed as primary keys must not contain duplicate entries.

  • Exact duplicates: Are there any duplicates in the column(s)? For example, the CustomerID column in the customer master table is expected to contain unique values.

  • Phonetic duplicates: Are there any values in the column that don't exactly match but sound similar? An example set is Bryan and Brian. Algorithms such as Soundex and Double Metaphone can detect such duplicates.

  • Typographical errors: Are there values that don't exactly match because of typographical errors? Levenstein Distance algorithm, for example, can find values that are within n keystrokes of each other.

IV. Consistency

Does your data conform to the standards? If it does not, you would need a strategy to cleanse the data first.

  • Non-standard Representations: California may be represented as CA, Ca, California, etc. North American phone numbers may be in the multiple formats such as (999)999-9999, 1-999-999-9999, etc. Non-standard representations complicate data-driven projects.

  • Date Formats: Do all values have a consistent date format?

  • Sort Order: Are all values in required ascending or descending order?

  • Completeness: Are there any values in the column that are either null or empty?

V. Integrity

When relating multiple columns, either from the same table, or from tables across multiple data sources, it is important to identify the cases where the relationship fails.

  • Relationship enforcement: If there is a primary/foreign key relationship, is it enforced?

  • Orphan analysis: When joining tables from two different data sources, it is always important to identify the primary keys that are present in one table but not found in another table.

  • Interdependency: Within a table, the zip code field always depends on the country code.

I hope you found this article useful. In the next article, I will cover data profiling techniques that help you understand your current data.

New Partners
We are happy to announce the addition of Tryarc to our growing list of partners. xFusion will help them to streamline customer migrations, and integrating customer data across business applications.

Data Integration for SAP Business One
Going beyond data migration, xFusion tackles the task of managing data movement across SAP Business One and other business applications. Read about our free templates for integrating data and improving automation in our new brochure here:

New Events
Software Labs to attend Microsoft Across America Launch Event on Feb 13, 2007.

Did you know that xFusion can save the repository settings under roaming profile?
If users connect to a machine running xFusion Studio using remote desktop or Citrix client, they do not have to lose the repository settings. xFusion Studio 3.0 has the capability to save the repository in an alternate location under the roaming profile. Read more

Support Questions:

Dear Support,

I added two connections to a new xFusion package. I can create a standard query using the first connection but when I create a standard query using the second connection it does not show the tables for the 2nd connection?

The Query automatically connects to the first connection. How do we solve this problem?

Thanks

Answer:

When you create a standard query, there is a combo box on the tool bar that shows the current connection you are working on. By default, the first connection is selected for a new query. However, you can change this selection anytime. When you do that, the table list will automatically update.

Hope this helps.

Support


Dear Support,

Can I round off a number to two decimals using xFusion Studio?

Thanks

Answer:

Format transform function in xFusion Studio can be used to round off to any number of decimal places as desired.

For example,

format(10.257, "f2")
formats the number to 2 places.
format(10.2572345678, "f5")
formats the number to 5 places.

Format function can also be applied to other data types such as date and currency.

Hope this helps.

Support


Software Labs, Inc. 1225 Pleasant Grove Blvd. #100, Roseville CA 95768, USA
ph: 1-916-773-6272 fax: 916-773-6281 web: http://www.softlabsco.com email: info@softlabsco.com
unsubscribe via email: unsubscribe@softlabsco.com unsubscribe via web: click here