CiviCRM Community Forums (archive)


This forum was archived on 25 November 2017. Learn more.




Author Topic: Here's a new p.o.c. for smart contact matching (dedupe by params)  (Read 1463 times)

Coleman Watts

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 2346
  • Karma: 183
  • CiviCRM version: The Bleeding Edge
  • CMS version: Various
Here's a new p.o.c. for smart contact matching (dedupe by params)
March 27, 2011, 04:48:18 pm
Howdy folks. In my ongoing quest to reduce the burden on crm-db administrators (me) while dreaming up new gadgets, I have finished a working proof-of-concept for a smart deduping rule to run when saving new contacts. It is not intended to replace the current system that can dedupe 100,000 contacts in a few seconds -- compared to that, the query this code runs is probably a lot less efficient. It is nevertheless very useful for accurately determining whether a contact already exists in the db before creating a potential duplicate.

The three things this query builder does that standard deduping does not are:
  • Uses OR logic in addition to AND (match email OR address OR phone number)
  • Cross-matches between different fields (first name <-> nick name)
  • Removes non-numeric separators from phone numbers (spaces, dashes, dots, and parentheses) during search
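
As a rough illustration of the phone-number cleanup in the last bullet: the actual query builder is not shown in this thread, so the function names below are hypothetical and the comparison is done in PHP rather than in SQL. This is a minimal sketch that only strips the separators the post lists.

```php
<?php
// Hypothetical helper (not from the original post): strip the separators
// named above so differently formatted numbers compare equal.
function normalize_phone(string $phone): string
{
    // Spaces, dashes, dots, and parentheses are removed; everything else is kept.
    return str_replace([' ', '-', '.', '(', ')'], '', $phone);
}

// Two numbers match when their normalized forms are identical and non-empty.
function phones_match(string $a, string $b): bool
{
    $a = normalize_phone($a);
    $b = normalize_phone($b);
    return $a !== '' && $a === $b;
}
```

With this sketch, "(555) 123-4567" and "555.123.4567" both normalize to "5551234567" and are treated as the same number.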

Currently it sports two modes: "strict" and "stricter"
The logic for those modes looks like this:

Strict Mode
IF first_name OR nick_name matches first_name OR nick_name
AND last_name matches
AND email OR street address OR phone number match
Then we have a match!

Given enough information, this virtually guarantees you will match the contact in your database if they exist, without any false positives. The only possible exception to this I can think of is the case in which a parent and child both have the same name, and are living at the same address or have the same phone number. For the sake of us crm folks, people really shouldn't be allowed to do that. But just in case, I added a second mode:

Stricter Mode
IF first_name OR nick_name matches first_name OR nick_name
AND last_name matches
AND email OR (street address AND dob) OR (phone number AND dob) match
Then we have a match! And this time, no false positives from Bob and Bob, jr.
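
The two rule sets above can be sketched as a single predicate over two contact records. This is only an illustration under stated assumptions: the flat field names (first_name, nick_name, last_name, email, street_address, phone, birth_date), the case-insensitive comparison, and both function names are hypothetical, and the real implementation described in the post is a SQL query builder that is not shown in the thread.

```php
<?php
// Hypothetical comparison helper: case-insensitive, and empty values never match.
function field_equal(array $a, array $b, string $key_a, string $key_b): bool
{
    $x = strtolower(trim((string) ($a[$key_a] ?? '')));
    $y = strtolower(trim((string) ($b[$key_b] ?? '')));
    return $x !== '' && $x === $y;
}

// Sketch of the "strict" and "stricter" rules applied to two flat records.
function is_match(array $new, array $existing, string $mode = 'stricter'): bool
{
    // first_name OR nick_name matches first_name OR nick_name (cross-matching).
    $given = false;
    foreach (['first_name', 'nick_name'] as $k1) {
        foreach (['first_name', 'nick_name'] as $k2) {
            $given = $given || field_equal($new, $existing, $k1, $k2);
        }
    }
    // AND last_name matches.
    if (!$given || !field_equal($new, $existing, 'last_name', 'last_name')) {
        return false;
    }
    $email  = field_equal($new, $existing, 'email', 'email');
    $street = field_equal($new, $existing, 'street_address', 'street_address');
    // Phones are compared as given here; separator stripping is left out of this sketch.
    $phone  = field_equal($new, $existing, 'phone', 'phone');
    if ($mode === 'strict') {
        return $email || $street || $phone;
    }
    // Stricter: address or phone only counts together with a matching date of birth.
    $dob = field_equal($new, $existing, 'birth_date', 'birth_date');
    return $email || ($street && $dob) || ($phone && $dob);
}
```

In this sketch, a parent and child who share a name and address but have different birth dates match in 'strict' mode and do not match in 'stricter' mode.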

Because of the OR logic, the number of required params is flexible: first name OR nick name is required (provide one or the other, or, for best results, both), last name is always required, and at least one of email, street address, or phone number is required as well.
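
The minimum-input rule can be written as a small guard to run before attempting a match. This is a sketch only; the function name and the flat field names are assumptions, not part of the original code.

```php
<?php
// Hypothetical pre-check: are enough params present to attempt a match?
function has_required_match_params(array $p): bool
{
    // A field counts as present when it is set and not blank.
    $has = fn(string $k): bool => trim((string) ($p[$k] ?? '')) !== '';

    return ($has('first_name') || $has('nick_name'))
        && $has('last_name')
        && ($has('email') || $has('street_address') || $has('phone'));
}
```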

So, what's next? I've already added this function to my own personal API and am incorporating it into all the in-house modules we use here at woolman.org. I'm curious to hear from people if they would like to see the code and try it out, or even if there is some possibility of adding it to CiviCRM core. I realize that with only two modes it is far less configurable than what currently exists, but people may be willing to sacrifice flexibility for power and accuracy.
Try asking your question on the new CiviCRM help site.

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: Here's a new p.o.c. for smart contact matching (dedupe by params)
March 27, 2011, 07:59:08 pm

hey coleman:

a couple of thoughts and comments:

1. Let's see what the interest level is for this

2. I'm curious as to the UI for achieving this flexibility (or is the current set of fields hardcoded?). Ideally we like things to be customizable so different folks can plug in their own field sets, etc.

In general, making dedupe a bit more field-aware (e.g. ignoring non-numeric characters in phone numbers) is a good step forward

lobo
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

Coleman Watts

Re: Here's a new p.o.c. for smart contact matching (dedupe by params)
March 27, 2011, 08:51:35 pm
Currently there is no UI; it is basically an API function I've created for my own use. The matching engine is actually only part of that function, which is a mega-wrapper for contact+location search, get, add, and update. These are things I do all the time in custom modules, and I was getting pretty tired of writing out similar code over and over, plus it was getting cumbersome to support. So I created an API for myself that includes wrapper functions for a bunch of Civi and Drupal functions to make them more useful. With this particular function you can add/update a contact (including multiple emails, addresses and phones). Just pass in the contact info and it takes care of all the nitty-gritty for you (like matching on external id or user id or legal id if you pass those, or else running the matching query to your specified level of strictness). It supports several modes: 'dry run', which only matches and returns the contact; 'update only', which will update an existing contact but not create a new one; 'update all', which updates all fields to the new values given; and 'fill empty fields', which will update the contact without overwriting any existing data. Makes my life much easier :)
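
To make the difference between the last two modes concrete, here is a minimal sketch of how 'update all' and 'fill empty fields' might merge incoming values into an existing record. The function name is hypothetical, the mode strings follow the docblock posted later in the thread ('update_all', 'fill_empty_fields'), and skipping blank incoming values is an assumption rather than documented behavior.

```php
<?php
// Hypothetical merge step: decide which incoming fields overwrite the stored contact.
function merge_contact_fields(array $existing, array $incoming, string $write_mode): array
{
    if (!in_array($write_mode, ['update_all', 'fill_empty_fields'], true)) {
        throw new InvalidArgumentException("Unsupported write mode: $write_mode");
    }
    foreach ($incoming as $field => $value) {
        // Assumption: blank incoming values never erase stored data.
        if ($value === null || $value === '') {
            continue;
        }
        $is_empty = !isset($existing[$field]) || $existing[$field] === '';
        // 'update_all' always overwrites; 'fill_empty_fields' only fills gaps.
        if ($write_mode === 'update_all' || $is_empty) {
            $existing[$field] = $value;
        }
    }
    return $existing;
}
```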

Providing a totally configurable UI for this would be difficult and maybe not even desirable. I think matching would be better (and easier to configure) if we just hard-coded a few presets like the ones I've outlined here, and let the site admin choose between them. Much less work for them, much better matching results. We could always leave the existing UI in place and let people forgo the presets if they really want to.

Donald Lobo

Re: Here's a new p.o.c. for smart contact matching (dedupe by params)
March 28, 2011, 06:30:57 am
just to clarify:

you use the below for contact create / update and you do your own dedupe before you decide to call either create or update?

So this is a "one" contact dedupe and you don't "hook" into the current dedupe process

lobo

Coleman Watts

Re: Here's a new p.o.c. for smart contact matching (dedupe by params)
March 28, 2011, 09:25:44 am
That's right. This function is for finding/adding/updating a single contact. What it does depends on the params it is passed and the various options.
The original version of this function used CRM_Dedupe_Finder::dupesByParams() but now uses the above-mentioned query builder, which is both better at finding matches and better at preventing false positives than anything I have been able to configure with Civi's dedupe options.
The params and options look like this:
Code:
/**
 * Wrapper for adding/finding/updating contacts in CiviCRM.
 *
 * @param array $params Array of arrays; keys should be 'contact', 'address', 'phone', 'email'.
 * @param string $write_mode 'search', 'do_not_update', 'update_all', or 'fill_empty_fields'.
 * @param string $return 'return_cid' or 'return_contact'.
 * @param string $match_mode 'strict' or 'stricter'.
 * @param string $exclude Comma-separated list of contact IDs that this person is not a match for (or a subquery which returns IDs).
 * @return array array('op' => action taken, 'contact_id', 'contact' => full contact record)
 */
function woolman_contact_match($params, $write_mode = 'update_all', $return = 'return_cid', $match_mode = 'stricter', $exclude = '') {
...
}

Note the $exclude @param, which I find useful when, for example, entering a whole family -- pass in the contact IDs of the other family members to make sure you don't get a false match for Bob and Bob, jr. My camp registration form does this: mom enters the names, etc. of all her family members in a single webform. When the form is submitted, we save each one with this function in a foreach() loop. During the loop each successfully saved cid is added to the exclude list, which is then passed in while saving the next contact and so on.
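
The family-registration loop described above might look roughly like the following. Since the body of woolman_contact_match() is not shown in the thread, this sketch takes the save function as a callable with the same argument order. The name save_family() is hypothetical, and reading the ID from a 'contact_id' key is an assumption based on the @return line of the docblock.

```php
<?php
// Hypothetical loop: save each family member, excluding already-saved members
// from matching so that relatives sharing a name and address are not merged.
function save_family(array $members, callable $save): array
{
    $saved_cids = [];
    foreach ($members as $params) {
        // $exclude is a comma-separated list of contact IDs, per the docblock.
        $exclude = implode(',', $saved_cids);
        $result = $save($params, 'update_all', 'return_cid', 'stricter', $exclude);
        // Only successfully saved contacts are added to the exclude list.
        if (!empty($result['contact_id'])) {
            $saved_cids[] = (int) $result['contact_id'];
        }
    }
    return $saved_cids;
}
```

In real use the callable would be 'woolman_contact_match'; passing it in keeps the sketch runnable without that function.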


This forum was archived on 2017-11-26.