Duplicate Data Cleanup for Importing to CRM

So you have built your CRM environment to replace your legacy system and now to the data import. But wait…when reviewing your legacy data you notice that there is some bad mojo going on with duplicates. So just how do you go about cleaning up all this data before importing into CRM? Well, do not fear, there are several ways to attack this monster.

If your data just happens to be in a delimited format that you can open with Excel then you are off to a good start. There are a couple of ways to look for duplicates in Excel, grouping, manual viewing (yuck) or some good old fashioned formulas.

Excel has a couple of ways to use formulas. One is the built in “countif” function which takes a range and a criteria and determines the number of times the criteria appears.

For example, “=COUNTIF(C1:C10, “Customer Effective”) would return the number of times Customer Effective was found in the column range C1 through C10. Now let’s take it a little further for duplicates: IF(MAX(COUNTIF(C1:C10,C1:C10))>1,"Possible Duplicate","Good") Here the range is compared against itself and finds those with a count greater than 1. Where this is of most value is to format your column and highlight those values that are potential duplicates.

Next up is a SQL statement comparison. There are probably many ways to accomplish some type of data comparison, and there are some other great Customer Effective solutions: http://blog.customereffective.com/blog/2008/01/find-duplicate.html., but here are a couple more.

First there is the SQL grouping:  Select last_name, count(last_name) as LastNameCount From contact group by last_name having (count(last_name) > 1). So, as you SQL Server gurus know, there are some other iterations of this but hopefully this gives you an idea of one way to accomplish the task.

Another great SQL Server function, and my personal favorite, is the “Except” operand. Use of the “Except” operand allows you to query all the rows from the left table and compare values that are NOT also found in the right table. A good use of this is where you might have a set of data, ie. table, that is updated nightly from a larger repository. If you only want the new changes and additions, use of this operand would only return the changes. Now for an example:

Select * from contacts except select * from contacts_old

Here we are comparing ALL the columns from the contacts table and comparing to ALL the columns in the contacts_old table. The result set are those rows which are new or have a change to any column. Two caveats to remember: 1) the number and order of the columns must be the same in both tables and 2) the data types must be compatible.

I hope this gives you a few more tools to use in identifying and cleanup of import duplicates. Good luck!

Post by: Jeff Trantham, Customer Effective

Show Buttons
Hide Buttons