Return All-but-One Duplicate Records in CRM using SQL

Frequently when importing / scrubbing data within CRM, identifying duplicates to remove or ‘flag’ may be needed.  It’s helpful to have a query that will return all the duplicates in the database – but exclude a single ‘surviving’ record.  (i.e. Return all-but-one of the duplicated records.)

This query is one of my go-to query patterns – I rarely, if ever, use it as-is, but will modify it to fit the particular entity / scenario I need to investigate.

The returned record set might used as the source for a Scribe Insight job to either ‘flag’ or inactivate or delete the duplicates depending on the situation. – Of course you’ll need to take into account the record’s relationship to other entities etc.


 

This example query assumes the duplicated contacts were entered more than 3 milliseconds apart in SQL. If your data was migrated from another system or entered as a result of a large batch and you need finer control, consider using the 'versionnumber' field as a unique timestamp field. (The VERSIONNUMBER field is a unique value that gets incremented as records are updated – it can be very useful.) 

Another variations might be that you want to change MAX to MIN – and adjust the JOIN in the query to return all but the most recently created or updated record.

The key is that this is a pattern you can build on and adapt to your particular situation.

( Also, by blogging it here, I no longer have to search my laptop for copies of it when I need a example for someone. )

Smile 

 

/*
Title: Find Duplicate Contact to Remove (Example)
Author: Scott Sewell, Customer Effective
Description: Find the most recently created contacts that duplicates an existing contact
based on fullname and emailaddress1. The Subquery finds the Name/Email and CreatedOn date of the most recently created duplicates. e.g. if there are multiple contact records with the same name and emailaddress1, return the date of all but the oldest one.
*/

SELECT    FC.contactid   ,FC.fullname   ,FC.emailaddress1 FROM    dbo.FilteredContact AS FC JOIN    ( SELECT    FC2.fullname   ,FC2.emailaddress1   ,MIN(FC2.createdon) MIN_CREATED_ON   FROM    dbo.FilteredContact AS FC2   WHERE    FC2.emailaddress1 IS NOT NULL    AND FC2.statecode = 0   GROUP BY    FC2.fullname   ,FC2.emailaddress1   HAVING    COUNT(FC2.contactid) >= 2    ) AS FC_DUPE    ON FC_DUPE.MIN_CREATED_ON < FC.createdon    AND FC.statecode = 0    AND FC.fullname = FC_DUPE.fullname    AND FC.emailaddress1 = FC_DUPE.emailaddress1 
 
-- @ScottSewell

Post by: Scott Sewell, Customer Effective

Show Buttons
Hide Buttons