Bad Data Costs Big Money, Part 3

Data Deduplication

A duplicate is any record that inadvertently shares data with another record.  Duplicates come from many sources: other systems, manual entry by users, and imports.  They can relate to customers, suppliers, products, and more.  Data duplication costs companies vast amounts of money and can lead to missed opportunities and leads.

How do we deduplicate records?

The main difficulty in deduplicating data is deciding when one row is a duplicate of another.  Many factors can determine this: the name, address, phone number, or other fields.  At VectorX, we apply multiple rules to identify duplicate rows.  For example, a company name rule normalizes rows by removing suffixes such as LLC., LLC, Inc., Inc, and Incorporated.  We also use algorithms such as Levenshtein distance, Soundex, and others.

For example:

  • ACME Parts LLC.
  • ACME Parts LLC

Levenshtein distance will match these two names: they differ by a single character (the trailing period), which falls under the typical limit of 2 edits.  Another approach is to combine multiple fields into a single key and use it as a uniqueness test.
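The name rule and the Levenshtein check described above can be sketched in Python as follows.  This is a minimal illustration, not our production rule set: the suffix list, the function names, and the choice to treat a distance of exactly 2 as a match are all assumptions made for the example.

```python
import re

# Hypothetical suffix list for illustration; a real rule set would be longer.
LEGAL_SUFFIXES = {"llc", "inc", "incorporated"}


def normalize_company_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def is_probable_duplicate(a: str, b: str, max_distance: int = 2) -> bool:
    """Assumed rule: normalized names within max_distance edits are a match."""
    distance = levenshtein(normalize_company_name(a), normalize_company_name(b))
    return distance <= max_distance
```

With this sketch, "ACME Parts LLC." and "ACME Parts LLC" are one edit apart before normalization and identical after it, while "ACME Parts" and "ACME Equipment" stay well outside the limit.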

For example:

  • ACME Parts 1st Main Street Atlanta GA 30066
  • ACME Equipment 1st Main Street Atlanta GA 30066

Both records produce the same key:

  • ACME1stMainAtlanGA30066
  • ACME1stMainAtlanGA30066

The key is made up of the first four letters of the name, the first seven letters of the street, the first five letters of the city, the first two letters of the state, and the first five digits of the zip code.  This helps in distinguishing unique and non-unique records in your source.  You will notice that the two records have different account names.  Is this a duplicate?  Did they change their name?  Or is it a typo?  This is why we need business experts to help in the clean-up effort.  This isn’t a one-person show; we will need your help in the clean-up.
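The combined-field key can be sketched in Python like this.  It is only an illustration under stated assumptions: the field names, the helper names, and the decision to strip spaces and punctuation before taking each prefix are chosen to reproduce the example key above, not taken from an actual implementation.

```python
import re
from collections import defaultdict


def _prefix(value: str, length: int) -> str:
    """Keep only letters and digits, then take the first `length` characters."""
    return re.sub(r"[^A-Za-z0-9]", "", value)[:length]


def match_key(name: str, street: str, city: str, state: str, zip_code: str) -> str:
    """Build the key: 4 of name, 7 of street, 5 of city, 2 of state, 5 of zip."""
    return (
        _prefix(name, 4)
        + _prefix(street, 7)
        + _prefix(city, 5)
        + _prefix(state, 2)
        + _prefix(zip_code, 5)
    )


def find_candidate_duplicates(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by key and return only the keys shared by 2+ records."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(**record)].append(record)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical records mirroring the example above.
records = [
    {"name": "ACME Parts", "street": "1st Main Street",
     "city": "Atlanta", "state": "GA", "zip_code": "30066"},
    {"name": "ACME Equipment", "street": "1st Main Street",
     "city": "Atlanta", "state": "GA", "zip_code": "30066"},
]
candidates = find_candidate_duplicates(records)
```

Here both records land under the key ACME1stMainAtlanGA30066.  The sketch only flags them as candidates; whether they are truly the same account is still a question for a business expert.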

This process normally takes the longest to get right.  Just imagine doing it for millions of rows of account data.  Done correctly, it gives us a clean dataset for a fresh start.  If not, you will spend a lot of time and money manually cleaning up the bad data that was loaded.