One of the really nice features of Dynamics CRM is the built-in duplicate detection. You can look for things like contacts with the same email address, or contacts whose first and last names share the same first N characters. But sometimes you need just a little more, and the rest of the Microsoft stack gives you exactly that. In this post I’m going to show how to use SQL Server Integration Services (SSIS) to uncover duplicates by assigning a score to the likelihood that two records are the same. If you’re on the “business” side of the house, you understand the importance of a clean list of contacts. Nothing hurts user adoption more than questionable data: if users don’t trust it, they won’t use it. With that said, most of this post is geared toward CRM system administrators and those on the IT side of the house. However, take a look at some of the finished product here, which we’ll come back to:
On a personal note, I’ve used this process with CRM versions 3.0 and 4.0, and expect to use it shortly in 2011. On a more personal note, people have tried, with varying success, to call me Jim, Jimmy, Jamie, or some other derivation of James. More than likely, that is also how they try to enter me into their Dynamics CRM. The out-of-the-box duplicate detection will not ferret these out, but the example I’m about to walk through does exactly that.
1. In the interest of time and space, I’m going to refer you to this posting from Microsoft on setting up the Fuzzy Grouping transformation in SSIS.
2. I will typically use another database on the same database server as the CRM database as the destination for the Fuzzy Grouping output. The Fuzzy Grouping transformation will also use this database for the many temp tables it needs (see the sketch just below). As a rule, I don’t modify the CRM database.
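If it helps to see that setup spelled out, here is a minimal sketch; CrmStaging is a name I made up for the staging database:

```sql
-- CrmStaging is an assumed name for a separate staging database on the
-- same SQL Server instance as the CRM database.  Nothing in this process
-- touches the CRM database itself.
CREATE DATABASE CrmStaging;
GO
-- Point both the OLE DB destination and the Fuzzy Grouping transformation's
-- connection manager at CrmStaging; the transformation creates and drops
-- its temp tables there.
```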
Now that you’ve read the aforementioned posting from Microsoft, these next sentences will make sense. I include contactid and masterid as pass-through output columns: they appear in the result set and do not factor into the similarity score, but they are still important later. Beyond that, I include firstname, lastname, parentcustomeridname, and possibly address1_city as input columns. Include whatever other input columns make sense for how you define a duplicate. Your Fuzzy Grouping step should look something like this:
Let’s take a look at the Advanced tab in the above screen as well:
The ability to throttle the similarity threshold is important, as it gives you another way to reduce or increase the number of records that show up as duplicates.
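For those building the data flow from scratch, the OLE DB source feeding the Fuzzy Grouping transformation can be a simple query against the CRM filtered view. This is only a sketch based on the columns described above; the statecode filter is my own assumption:

```sql
-- Source query for the SSIS data flow.  contactid and masterid are
-- pass-through columns; the rest are Fuzzy Grouping input columns.
SELECT
    contactid,
    masterid,
    firstname,
    lastname,
    parentcustomeridname,
    address1_city
FROM dbo.FilteredContact
WHERE statecode = 0;   -- active contacts only (assumption)
```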
Here’s what the destination table should look like after you run it:
The _key_out value is the same across all of the records in a group, so now we need some fancy SQL to get to this:
Note the GUIDs were redacted, if only because I was looking for a chance to use that word.
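For the curious, the “fancy SQL” boils down to a self-join on _key_out: within each group, the row whose _key_in equals its _key_out is the canonical (master) record, and every other row in the group is a candidate duplicate. A minimal sketch, assuming the Fuzzy Grouping output landed in a table I’m calling dbo.ContactFuzzyGroups:

```sql
-- Pair each candidate duplicate with the canonical record of its group.
-- dbo.ContactFuzzyGroups is an assumed table name; _key_in, _key_out, and
-- _score are columns the Fuzzy Grouping transformation adds automatically.
SELECT
    master.contactid AS master_contactid,       -- canonical record
    dup.contactid    AS subordinate_contactid,  -- candidate duplicate
    dup._score       AS similarity_score
FROM dbo.ContactFuzzyGroups AS dup
JOIN dbo.ContactFuzzyGroups AS master
    ON  master._key_out = dup._key_out
    AND master._key_in  = master._key_out  -- canonical row of the group
WHERE dup._key_in <> dup._key_out          -- skip the canonical row itself
ORDER BY dup._score DESC;
```

Those two contactid values per row are the pair of (redacted) GUIDs that end up in the cross-reference entity described next.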
With this all set, I now need to get it into CRM. I have a custom cross-reference entity with two N:1 relationships to the contact entity; that’s where the redacted GUIDs go. Load those records into CRM using your ETL tool of choice and voilà. See this screenshot for an example of how one of these records looks and how the entity is set up:
This leads us back to that first screenshot. I’ve included information from the contact records in my view to help me decide whether they are in fact duplicates. I can also use the Bulk Edit feature to update the Merge Status flag en masse.
Now that we’ve identified the duplicates and which record should be merged into which, what happens next? If you’re like me, you put in a call to the Customer Effective development team. They then provide you with a tool that uses the CRM SDK to cycle through these records and actually perform the merge. (In case you were wondering, the Flip and Merge setting tells the application to swap which contact is the Subordinate and which is the Master.) One could still merge records based on this information manually, but I would not recommend it.
Additional Points:
1. If you want to automate this Fuzzy Grouping process, you will need SQL Server Enterprise Edition. I have not yet experimented with using the Windows Task Scheduler to kick it off, but I will give that the old college try.
2. Depending on your recordset size, you would be advised to run this off-hours. I’ve run this against approximately 250,000 records, and it took about an hour as I recall. Let this not be your benchmark (see point #3).
3. Test.
4. This process won’t replace a human being eyeballing the information. “Bright” and “Wright” will achieve a very high score with this process.
5. See point #3.
6. The sample screenshots here are for contacts, but this works swimmingly for accounts as well.
7. This should really only be used where your data is so complicated that the built-in duplicate detection is not up to the task, which is usually not the case.
If you’d like more details on this process or information on how we at Customer Effective can help you with cleaning your data, please contact us at info@customereffective.com.
Post by: James Diamond,