More Duplicate Data Cleanup for Importing to CRM: ROW_NUMBER

Jeff Trantham’s blog on this subject has inspired me to share my just about, all-time-favorite duplicate detection tool in SQL.  It’s the ROW_NUMBER() function and I like to consider it my first line of defense in preventing duplicates from winding their way into the CRM.  As Jeff mentioned, there are many tools in SQL to accomplish this, but I find myself gravitating toward ROW_NUMBER() again and again.  Here’s an example of how it works, and notice my use of the Common Table Expression:

;WITH ContactCheck AS

(

SELECT legacyfirstname,

legacylastname,

legacycompanyname,

legacycontactid,

ROW_NUMBER() OVER (PARTITION BY legacylastname, legacyfirstname, legacycompanyname OVER ORDER BY legacylastname) as RN

FROM LegacyContactTable

)

SELECT *

FROM ContactCheck

WHERE RN=1

Lets breakdown that ROW_NUMBER() function works:

image

You can then alter the code based on your definition of duplicate.  You may think that by virtue of having the same first and last name they are a duplicate, or you may think that a duplicate is a record with the same first name, last name, company, email address, and street address. The PARTITION clause would need only be modified accordingly, making for some tidy code.

I would highly recommend Gregory Larsen’s article from DatabaseJournal.com which discusses this as well as many other great functions that came around with SQL Server 2005.

Enjoy!

Post by: James Diamond, Customer Effective

Show Buttons
Hide Buttons