CiviCRM Community Forums (archive)

This forum was archived on 25 November 2017. Learn more.

Author Topic: smart automatic merge dedupe?  (Read 23792 times)

Sean Madsen

  • I post occasionally
  • **
  • Posts: 98
  • Karma: 5
  • CiviCRM implementer/developer
    • Bikes Not Bombs
  • CiviCRM version: 4.6
  • CMS version: Drupal 7
smart automatic merge dedupe?
May 19, 2009, 06:45:40 pm
I like the dedupe feature. I find that moving through each contact in the dedupe results is pretty mindless though. For example, I often run into the case where a duplicate contact exists containing only the same email address and the duplicate contact has no other information. Clearly in this case I want to delete the duplicate (and whether or not I merge the email address is pretty trivial since the two email addresses are the same). Is there a way to get Civi to be a bit smarter and automatically merge some of the dedupe results in a batch fashion?

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: smart automatic merge dedupe?
May 19, 2009, 07:46:48 pm

The dedupe code does not do this currently. You should be able to do it using the internal dedupe functions and checking some rules based on your database. If you write code to do this, please share it with the community; I suspect a few folks would use it. If you need help writing the code, ping us on IRC.

lobo
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

Blake

  • I post occasionally
  • **
  • Posts: 36
  • Karma: 5
    • LinkedIn Profile
Re: smart automatic merge dedupe?
August 17, 2009, 02:21:43 pm
I agree. Our database has about 10,000 contacts to merge (only relationships though), and a batch dedupe WOULD be very helpful.  ;D

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: smart automatic merge dedupe?
May 10, 2010, 07:34:34 am
Have you been able to cook up something to batch dedupe?

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 13, 2011, 07:21:52 pm
I'm looking to write some code to do this as my next task - probably in the form of a PHP function, ultimately wrapped in a Drupal 6 module...

What I need to know is which tables contain records where I can just flick over to a new contact ID, and which tables are unsafe to do this on and should force a manual merge?

Of course if I can get some help with the table list (and even if I don't, for that matter) I'll post back my code for the community.

Basic plan in (pidgin PHP) is (please don't help me debug this, it's illustrative only!!!):

get passed 2 contact_ids
Code: [Select]
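// Assumes CiviCRM is bootstrapped so that CRM_Core_DAO is available.
// $unsafeTables and $safeTables map table name => contact id column; which
// tables belong in which list is still the open question (what records are
// unsafe to automatically merge?).
function mergeContacts($keeper, $duplicate, array $unsafeTables, array $safeTables) {
  // clean input
  $keeper = (int) $keeper;
  $duplicate = (int) $duplicate;

  // check $duplicate to see if complex data is attached
  $matchingTables = array();
  foreach ($unsafeTables as $table => $idFieldName) {
    $sql = "SELECT COUNT(*) FROM $table WHERE $idFieldName = $duplicate";
    if (CRM_Core_DAO::singleValueQuery($sql) > 0) {
      $matchingTables[] = $table;
    }
  }

  if ($matchingTables) {
    $message = "Sorry, unable to merge $duplicate into $keeper because one or more records exist in the following tables: (" . implode(', ', $matchingTables) . "). Please use the traditional <a href='@todo url in here with $duplicate & $keeper'>CiviCRM de-duping screen</a> or delete the records and try again.";
    return array('success' => FALSE, 'message' => $message);
  }

  foreach ($safeTables as $table => $idFieldName) {
    CRM_Core_DAO::executeQuery("UPDATE $table SET $idFieldName = $keeper WHERE $idFieldName = $duplicate");
    // todo build a report in here to advise what table had affected rows
  }
  // civicrm_contact is keyed on id and flags trashed contacts with is_deleted
  CRM_Core_DAO::executeQuery("UPDATE civicrm_contact SET is_deleted = 1 WHERE id = $duplicate"); // todo check this
  return array('success' => TRUE, 'message' => "Merged $duplicate into $keeper.");
}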

Erich Schulz

Re: smart automatic merge dedupe?
July 13, 2011, 08:46:08 pm
One step closer - think this gives me the list of tables I need to consider - as either road-blocks or targets for updating:

Code: [Select]
-- Tables that reference a contact directly
SELECT `TABLE_NAME`
FROM information_schema.`COLUMNS`
WHERE `COLUMN_NAME` = 'contact_id'
  AND `TABLE_SCHEMA` = 'au_drupal';

-- Tables keyed on entity_id, excluding custom tables that extend non-contact entities
SELECT TABLE_NAME
FROM information_schema.COLUMNS
WHERE `COLUMN_NAME` = 'entity_id'
  AND `TABLE_SCHEMA` = 'au_drupal'
  AND TABLE_NAME NOT IN (
    SELECT table_name
    FROM au_main.civicrm_custom_group
    WHERE `extends` IN ('Location','Address','Contribution','Activity',
      'Relationship','Group','Membership','Participant','Event','Grant','Pledge','Case'
    /* i.e. keep custom tables for: 'Contact','Individual','Household','Organization' */
    )
  );

comments welcome!

xavier

Re: smart automatic merge dedupe?
July 13, 2011, 09:58:50 pm
Quote from: Erich Schulz on July 13, 2011, 07:21:52 pm
I'm looking to write some code to do this as my next task - probably in the form of a PHP function, ultimately wrapped in a Drupal 6 module...

What I need to know is which tables contain records where I can just flick over to a new contact ID, and which tables are unsafe to do this on and should force a manual merge?


Not sure I understand, could you give an example of one safe and one unsafe?


Erich Schulz

Re: smart automatic merge dedupe?
July 14, 2011, 04:43:30 am
By safe I mean "safe to transfer from the duplicate to the original contact_id without review by a human".

I'm thinking a safe record would be a tag, group or activity.

An unsafe record might be a membership record? Or a financial contribution?

I'm not sure to be honest - but most dupes don't have a lot of data attached, so deduping automatically shouldn't be hard. The issue is where you draw the line and say "the computer can't merge this contact".
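
One way to picture that line-drawing is the rough sketch below. It is only an illustration under stated assumptions: the category names come from the examples in this post, `screen_duplicate` and its input shape are hypothetical, and anything not explicitly listed as safe is sent to a human.

```python
# Record categories treated as safe to move without review. These are just
# the examples floated in this thread, not a vetted list.
SAFE_CATEGORIES = {"tag", "group", "activity"}


def screen_duplicate(record_counts):
    """Return (ok_to_auto_merge, blocking_categories) for one duplicate.

    record_counts maps a record category to the number of rows the duplicate
    contact owns. Any category not in SAFE_CATEGORIES (membership,
    contribution, or anything unrecognised) blocks the automatic merge.
    """
    blocking = sorted(
        kind
        for kind, count in record_counts.items()
        if count > 0 and kind not in SAFE_CATEGORIES
    )
    return (not blocking, blocking)
```

Defaulting unknown categories to "unsafe" means a new table or extension can only cause extra manual reviews, never a silent merge.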

xavier

Re: smart automatic merge dedupe?
July 15, 2011, 02:51:58 am
Hi,

I think more or less all external data is safe to add (some cases are a pain, e.g. you end up with two "employee of" relationships, but OK).


IMO the main problem is the conflicting fields: which source do you keep if both contacts have it? external_ref, nickname, middle name...
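
To make the conflict idea concrete, here is a minimal sketch, assuming each contact is available as a plain dictionary of field values. The function name and the field names in the usage are hypothetical and this is not CiviCRM's own comparison code.

```python
def find_conflicts(main, other, fields):
    """Return {field: (main_value, other_value)} for genuine conflicts.

    A field only conflicts when both contacts have a non-empty value and the
    values differ. If just one side is filled in there is nothing to decide.
    """
    conflicts = {}
    for field in fields:
        main_value = main.get(field)
        other_value = other.get(field)
        if main_value and other_value and main_value != other_value:
            conflicts[field] = (main_value, other_value)
    return conflicts
```

For example, comparing `{"nick_name": "Bob", "middle_name": ""}` with `{"nick_name": "Rob", "middle_name": "J"}` reports only `nick_name`, since the middle name exists on one side only.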

X+

P.S. Why are you trying to go directly at the db level instead of using the BAO or better yet, add that to the API?

Erich Schulz

Re: smart automatic merge dedupe?
July 15, 2011, 04:17:41 am
Hi Xavier,

I have a little problem in that the org I'm working with is on Civi 3.2 and they have a large dupe problem - I'm not keen to wrestle with v2 of the API since I gather v3 is a major revision...

I'm also finding the API docs a bit, well, um, missing - am I looking in the wrong spot?

I'm happy to work with someone who is more API savvy - but currently I see a database, I write SQL and the problem goes away... is that bad?

I guess if I got pointed at a nice "proper" function that used the API and did something similar I'd be happy to work in that context, as I understand the upside of an API rather than just getting the job done in SQL.

Addendum: I guess I should have a look at the dedupe code.

« Last Edit: July 15, 2011, 04:35:24 am by Erich Schulz »

Erich Schulz

Re: smart automatic merge dedupe?
July 15, 2011, 05:41:13 am
mmm ok in \civicrm\CRM\Dedupe\Merger.php one finds these two functions:

maybe the solution is to put a wrapper around them:

Code: [Select]
    /**
     * Based on the provided two contact_ids and a set of tables, move the
     * belongings of the other contact to the main one.
     */
    function moveContactBelongings($mainId, $otherId, $tables = false)

and

Code: [Select]
    /**
     * Find differences between contacts.
     */
    function findDifferences($mainId, $otherId)

Eileen

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4195
  • Karma: 218
    • Fuzion
Re: smart automatic merge dedupe?
July 15, 2011, 01:04:36 pm
Hi Erich

You might want to check out this

http://wiki.civicrm.org/confluence/display/CRM/DeDupe+Optimization+Project+for+v3.3

Note that some of the items on that page have been implemented in 3.3 & 3.4 (the 3 UI improvements down the bottom of the page).

You probably will find that API v3 has been backported to the site you are on but I certainly haven't looked into how it would work on a scripted de-dupe.

The 2 biggest issues with scripted dedupes are that you need to be really sure that the contacts really are a match, and that if there is conflicting data you need to select which value to keep (like Xavier says).

However, there is a 3rd element on the site you are working on which is an external system (EMS) that is also being de-duped. I'm pretty sure it's relying on the hook hook_civicrm_merge running so if you bypass that then it won't get de-duped.

Make today the day you step up to support CiviCRM and all the amazing organisations that are using it to improve our world - http://civicrm.org/contribute

Erich Schulz

Re: smart automatic merge dedupe?
July 15, 2011, 05:15:43 pm
Hi Eileen, fancy meeting you here :-)

Cool - will do some tests. I am in the process of documenting EMS and I'll document this as I go (happy to discuss further in a different forum if you like).

I am happy to split off "accurate identification of duplicates" as a separate thread/project- my back-of-envelope analysis of the task at hand is as follows:


[a) find potential dupes]
-> [b) classify potential dupes on {certainty} as definite/probable/possible]
-> [c) screen definite dupes for {suitability} for automated merging]
-> [d) automatically merge dupes where {certainty = definite} and {suitability = true}]
-> [e) place remainder of potential duplicates in a queue for human review]

In this forum I was mainly looking to clarify elements of c) and d)

my sense is that 80% of our dupes are email only with very few "belongings"
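
Steps d) and e) of the outline above could be sketched roughly as follows. This is only an illustration: the dictionary keys and the `triage` name are hypothetical, and it assumes steps a) to c) have already attached a certainty label and a suitability flag to each candidate pair.

```python
def triage(candidates):
    """Split candidate pairs into (auto_merge, review_queue).

    Each candidate is a dict with hypothetical keys:
      "pair"      -> (keeper_id, duplicate_id)
      "certainty" -> "definite", "probable" or "possible"  (step b)
      "suitable"  -> bool result of the screening in step c
    """
    auto_merge = []
    review_queue = []
    for candidate in candidates:
        if candidate["certainty"] == "definite" and candidate["suitable"]:
            auto_merge.append(candidate["pair"])    # step d
        else:
            review_queue.append(candidate["pair"])  # step e
    return auto_merge, review_queue
```

Everything that is not both definite and suitable falls through to the human queue, so the automatic path stays deliberately narrow.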

Eileen

Re: smart automatic merge dedupe?
July 15, 2011, 05:25:27 pm
Quote
my sense is that 80% of our dupes are email only with very few "belongings"

I think that's very likely! At one point a bug caused a contact to be created for each user for each domain :-(. The civicrm_uf_match table could probably be queried to get a list of these.

I would say that for C you should probably focus specifically on those contacts with email but no name details in the first instance as they are the simplest & probably the bulk.

My main point about EMS was that it is probably an argument against just doing SQL as the hook won't be called if you do it that way.

It would be really handy to have an API for deduping contacts wouldn't it? But we have to have a talk fest (and a warm beer) before we introduce any new API functions!

A-Team - contact_merge as an API? With find differences as an option (i.e. you can call the API as a dry run or a do it?), leveraging the functions Erich identified?

xavier

Re: smart automatic merge dedupe?
July 15, 2011, 08:59:44 pm
Quote from: Eileen on July 15, 2011, 05:25:27 pm
It would be really handy to have an API for deduping contacts wouldn't it? But we have to have a talk fest (and a warm beer) before we introduce any new API functions!

A-Team - contact_merge as an API? With find differences as an option (i.e. you can call the API as a dry run or a do it?), leveraging the functions Erich identified?

Most def a good idea.

api.contact.merge
with params: either two contact ids, or
dedupe rule + group id + dry_run/get_only?

Your example (quite common in my installs as well) of having one complete contact and an email-only one wouldn't be identified easily with our existing dedupe rules: either it won't be found with a first+last+email rule, or an email-only rule will identify plenty of legit contacts that share the same email.

Wondering if changing the dedupe to say that an empty field always matches would solve this and be easy to implement.
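
A rough sketch of what "an empty field always matches" could mean, assuming contacts are plain dictionaries. The function names and rule fields are hypothetical, and the extra guard requiring at least one shared non-empty field is an added assumption so that two blank records do not match each other. It is not how the existing dedupe rules work.

```python
def fields_match(a, b):
    """Two values match if they are equal or if either one is empty."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    return not a or not b or a == b


def is_candidate(main, other, rule_fields=("first_name", "last_name", "email")):
    """True if every rule field matches and some non-empty field is shared."""
    shared = any(main.get(f) and other.get(f) for f in rule_fields)
    return shared and all(
        fields_match(main.get(f), other.get(f)) for f in rule_fields
    )
```

With this rule a complete contact and an email-only record with the same address become a candidate pair, while two named contacts sharing an address do not, because their names conflict.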

(and it might be good to start a new thread for that discussion).

X+


This forum was archived on 2017-11-26.