CiviCRM Community Forums (archive)


This forum was archived on 25 November 2017.




Author Topic: smart automatic merge dedupe?  (Read 23790 times)

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: smart automatic merge dedupe?
August 02, 2011, 09:34:54 pm

1. Check CRM/Utils/Hook.php for how hooks are implemented

2. Having it as a Drupal module lets folks use it WITHOUT coding or hacking the code base :) That makes it much simpler to get a lot more folks using it.

3. Note that if we go down the path of adding it to the API / core, there is a strong implicit expectation of ongoing maintenance, support and improvements until things are stable. Unit tests are also not optional at this stage :)

4. You can use CRM/Core/BAO/Cache.php to cache things in the DB. I'll document that class on a long plane ride today :)

lobo
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
August 02, 2011, 10:31:06 pm
thanks Lobo

cool - i have some homework.

Reading between the lines, it sounds like I should just pop a temporary Drupal 6 module wrapper around this, and then see if anyone ever wants to run with it and help me do the extra steps to get it into core.

I think once I have the entity id question sorted the code will be usable, "at own risk".

Oops, gotta go, so will add more later.

My vision was that several "automerge behaviours" would be added as extra meta-data attributes in the data dictionary XML, and that data dictionary reads would then replace several of the long key lists and select statements. I've indicated which functions in the source code. Just a suggestion, but it seems to me this would make it easier to maintain not just the automerge code but also the manual merge code, as there seem to be several examples of related (but not identical) meta-data embedded in those functions too.

Re unit testing: I may be very slow  :o  but I've only just figured out how to do it. I can't see myself getting to it before 2012 (and most likely never, to be truly honest), but I'll make some notes on GitHub in case someone who has actually written a CiviCRM test case, and is likely to write more in the future, rises to the task.

btw I'd have thought being able to merge clear duplicates without human intervention would be a sensible core "addition" rather than a "hack"... but maybe everyone else in the universe is running really tight ships with ne'er a duplicate in sight, or they have legions of obsessive-compulsive volunteers with no other mission in life but to sit and manually merge duplicates ;)

In terms of ongoing support, as I've said I'm happy to work with others to get this code to comply with 'house rules', and I've taken a fair bit of time to make it clear and maintainable, ideally by someone else. I think anyone writing code that is only maintainable by themselves isn't actually doing the universe a favour, so it comes down to how much need there is for this out there.

sorry not meaning to be difficult, just honest  :)

« Last Edit: August 02, 2011, 11:32:38 pm by Erich Schulz »

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: smart automatic merge dedupe?
August 02, 2011, 10:54:35 pm
Starting to discuss a hook so a module can add apis (you're welcome to join the list).

So the module can add an api.contact.merge

Code:
'careful' - conservative and only merge if no discernible risk of data loss
'normal' - take a typical approach, i.e. allow "Doug" into "Douglas" etc
'forced' - merge contact regardless of apparent data inconsistencies

The risk is that my 'careful' is your 'normal', i.e. it is ambiguous what each mode means.

If the list is not too long, it would be good to have an option per "real action" behind it, e.g. option.merge_partial="firstname,lastname"...
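
One way to reduce that ambiguity is to write down exactly what each mode tolerates. The following is only a sketch under assumptions: the function name, the conflict list and the 'minor'/'major' severity labels are hypothetical and do not exist in CiviCRM.

```php
<?php
// Hypothetical sketch: none of these names are part of CiviCRM.
const MERGE_MODES = ['careful', 'normal', 'forced'];

/**
 * Decide whether a pair of contacts may be merged without human review.
 *
 * @param string $mode      one of MERGE_MODES
 * @param array  $conflicts list of ['field' => string, 'severity' => 'minor'|'major']
 */
function automerge_allowed(string $mode, array $conflicts): bool {
    if (!in_array($mode, MERGE_MODES, true)) {
        throw new InvalidArgumentException("Unknown merge mode: $mode");
    }
    if ($mode === 'forced') {
        return true; // merge regardless of apparent inconsistencies
    }
    foreach ($conflicts as $conflict) {
        // 'careful' refuses on any conflict; 'normal' tolerates minor ones only
        if ($mode === 'careful' || $conflict['severity'] === 'major') {
            return false;
        }
    }
    return true;
}

// Assumed example: "Doug" vs "Douglas" is classified as a minor conflict.
$nickname = [['field' => 'first_name', 'severity' => 'minor']];
var_dump(automerge_allowed('careful', $nickname)); // bool(false)
var_dump(automerge_allowed('normal', $nickname));  // bool(true)
```

With the tolerated conflicts spelled out per mode, a site admin can at least see what 'normal' would do before switching it on.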
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Erich Schulz

Re: smart automatic merge dedupe?
August 02, 2011, 11:59:35 pm
huh? what list? goodness, a wiki, a book, a forum, irc, github... you guys like to spread your stuff around!!!

no wonder I'm so confused... no wait... I have an excuse for being clueless and disoriented now  :D

agree clarity on automerge mode will be important!

My thought would be to plan on three modes. The one in the middle will always be a bit of a compromise, but it may also be the most useful. Then the sysadmin can decide whether to go with one of the "clear" extreme options or something a bit woolly in the middle.

again if this stuff is defined in the xml datadictionary then it would be possible to report the tests that will be used

I don't think "partial" is the best word though. As Eileen (I think) said earlier, you don't want to do a partial merge: you either merge or you don't. This is about where you draw the line for initiating the merge; once you decide to automerge it has to be a complete merge.

I think any "partial merge" would be done in the standard user interface.

Erich Schulz

Re: smart automatic merge dedupe?
August 04, 2011, 04:23:54 am
ok some notes on test case https://github.com/ErichBSchulz/CiviCRMAutoMerge/wiki/Test-Plan

just have to work out the entity_id conundrum then I'll be testing it on our data

Donald Lobo

Re: smart automatic merge dedupe?
August 04, 2011, 07:02:01 am

might want to check:

tests/phpunit/CRM/Contact/BAO/QueryTest*.php

and basically create a dataset consisting of an array of:

( contact type A, contact type B, expected result)

and the test iterates thru the dataset and does the merge and checks the result

That way, it makes it a lot easier to add another tuple to the test case when we discover a few discrepancies

the big challenge would be now to figure out an easy way to create contacts with a fair amount of info spread across multiple tables
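
A minimal sketch of that tuple-driven test, written without PHPUnit so it stands alone. The sketch_merge() function is a hypothetical stand-in for the real merge, and the field names are illustrative only.

```php
<?php
// Hypothetical stand-in for the real merge: fills blank fields of the
// surviving contact from the duplicate, and returns null on a conflict.
function sketch_merge(array $main, array $other): ?array {
    foreach ($other as $field => $value) {
        if (!isset($main[$field]) || $main[$field] === '') {
            $main[$field] = $value;
        } elseif ($value !== '' && $main[$field] !== $value) {
            return null; // conflicting values: refuse to merge
        }
    }
    return $main;
}

// Each tuple: (contact A, contact B, expected result)
$dataset = [
    [
        ['first_name' => 'John', 'email' => ''],
        ['first_name' => 'John', 'email' => 'john@doe.com'],
        ['first_name' => 'John', 'email' => 'john@doe.com'],
    ],
    [
        ['first_name' => 'John'],
        ['first_name' => 'Jane'],
        null,
    ],
];

// Iterate through the dataset, do the merge and check the result.
$failures = [];
foreach ($dataset as $i => [$a, $b, $expected]) {
    if (sketch_merge($a, $b) !== $expected) {
        $failures[] = $i;
    }
}
echo count($failures) === 0
    ? "all cases pass\n"
    : 'failed cases: ' . implode(', ', $failures) . "\n";
```

When a discrepancy turns up, covering it means adding one more tuple to $dataset. In a PHPUnit suite the same array would presumably be returned from a data provider method.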

lobo

xavier

Re: smart automatic merge dedupe?
August 04, 2011, 07:09:28 am
Quote from: Donald Lobo on August 04, 2011, 07:02:01 am


the big challenge would be now to figure out an easy way to create contacts with a fair amount of info spread across multiple tables

lobo

Did I hear someone calling for an API rescue? :) (and chaining)



Donald Lobo

Re: smart automatic merge dedupe?
August 04, 2011, 07:13:49 am

It's more a question of how you represent the data in an easy manner (currently done in XML in the data provider cases) rather than the code to create it :P

lobo

xavier

Re: smart automatic merge dedupe?
August 04, 2011, 07:35:18 am
JSON to the rescue, in a text file?

{"action": "create", "entity": "contact", "first_name": "John", "email": "john@doe.com",
  "api.phone.create": {"location_type_id": 1, "phone": "101 32934", "type_id": 1}}


or simply a long list of civicrm_api calls
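
A rough sketch of how such a fixture might be loaded. It assumes the fixture is valid JSON and that the chained call is keyed as api.phone.create; the fixture contents are illustrative, and the civicrm_api() call is left commented out because it needs a bootstrapped CiviCRM.

```php
<?php
// Fixture text as it might sit in a text file (contents are illustrative).
$fixture = '{"action": "create", "entity": "contact", "first_name": "John",
  "email": "john@doe.com",
  "api.phone.create": {"location_type_id": 1, "phone": "101 32934"}}';

$params = json_decode($fixture, true);
if ($params === null) {
    throw new RuntimeException('Invalid fixture: ' . json_last_error_msg());
}

// Split the routing keys from the parameters handed to the API.
$entity = $params['entity'];
$action = $params['action'];
unset($params['entity'], $params['action']);

// With CiviCRM bootstrapped, the call would presumably look like:
// $result = civicrm_api($entity, $action, $params + ['version' => 3]);
echo "$entity.$action with " . count($params) . " params\n";
```

One fixture per line in a text file would keep the test data readable without any XML.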

X+

Erich Schulz

Re: smart automatic merge dedupe?
August 05, 2011, 06:19:24 pm
ooh, I love JSON... xml makes my hair curl!

btw I have finally gotten a few minutes to sleuth the "what does entity_id mean" issue (and I think this gives the full picture of all the civicrm_contact.id foreign keys).

the answer is in CRM/Dedupe/Merger.php

CRM_Dedupe_Merger::cidRefs()
CRM_Dedupe_Merger::eidRefs()
CRM_Dedupe_Merger::getActiveRelTables()

cidRefs are simple single keys, and eidRefs are compound keys. Sadly the field name is not 100% reliable for classifying the two types of key (i.e. civicrm_entity_tag.entity_id is a simple key).

The getActiveRelTables function spells out the logic:

$sqls[] = "SELECT COUNT(*) AS count FROM $table WHERE $field = $cid";
$sqls[] = "SELECT COUNT(*) AS count FROM $table WHERE $entityId = $cid AND $entityTable = 'civicrm_contact'";

employer_id is handled in a different way, for a reason I can't explain.

I'll write an interface into these two functions and modify the code to cope with the mixture of simple/compound keys.
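
As a sketch of what such an interface could look like, each reference can carry an optional entity-table column, so one function builds both query shapes quoted above. The $refs array and count_sql() are hypothetical, and the table list is illustrative rather than read from Merger.php.

```php
<?php
// Hypothetical metadata: a reference is "compound" when it names the
// column that stores the owning table, otherwise it is a simple key.
$refs = [
    ['table' => 'civicrm_email', 'field' => 'contact_id'],
    // simple key despite the entity_id name, as noted in the post
    ['table' => 'civicrm_entity_tag', 'field' => 'entity_id'],
    ['table' => 'civicrm_note', 'field' => 'entity_id', 'table_field' => 'entity_table'],
];

// Build the COUNT query for one reference and one contact id.
function count_sql(array $ref, int $cid): string {
    $sql = "SELECT COUNT(*) AS count FROM {$ref['table']} WHERE {$ref['field']} = $cid";
    if (isset($ref['table_field'])) {
        // compound key: also require that the row points at a contact
        $sql .= " AND {$ref['table_field']} = 'civicrm_contact'";
    }
    return $sql;
}

foreach ($refs as $ref) {
    echo count_sql($ref, 42), "\n";
}
```

Classifying by an explicit flag rather than by field name avoids the civicrm_entity_tag.entity_id trap.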

then (deep breath) that code should work

:-)

my spouse expects me to interact with my offspring this weekend so maybe I'll fit it in - not sure

xavier

Re: smart automatic merge dedupe?
August 06, 2011, 12:47:27 am
I find it very weird that tags are simple. Will check, thx.

Erich Schulz

Re: smart automatic merge dedupe?
August 06, 2011, 06:00:37 pm
Yeah, it is weird. As the FIXME says, all this stuff really should be read from the "schema"/"meta-data" as a property of the field:

 "simple foreign key to":"contact"

The optimal way of representing the compound foreign keys may be to name the entityType field rather than the entity:

"compound foreign key with":"owner_entity_table"


Erich Schulz

Re: smart automatic merge dedupe?
August 07, 2011, 06:06:23 am
ok - this is getting so close I can almost smell the ozone from the vapourising duplicates...

https://github.com/ErichBSchulz/CiviCRMAutoMerge

would highly value any feedback, will have more time to work on it in the next day or two and hopefully nail it

Erich Schulz

Re: smart automatic merge dedupe?
August 08, 2011, 07:34:01 pm
mmm looking at this code from CRM/Dedupe/Merger.php:

Code:
        require_once 'CRM/Core/Transaction.php';
        $transaction = new CRM_Core_Transaction( );
        foreach ($sqls as $sql) {
            CRM_Core_DAO::executeQuery( $sql,
                                        CRM_Core_DAO::$_nullArray,
                                        true, null, true );
        }
        $transaction->commit( );
    }

gee it would be nice to have an idea how it reports errors!!!
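
For comparison, here is a generic sketch of a loop that does report which statement failed and rolls back. It says nothing about how CRM_Core_DAO::executeQuery or CRM_Core_Transaction actually behave; the three callables are hypothetical stand-ins for a database layer.

```php
<?php
/**
 * Run statements as one unit. On the first failure, roll back and report
 * which statement failed. $execute, $commit and $rollback are hypothetical
 * callables, not CiviCRM APIs.
 */
function run_all_or_nothing(array $sqls, callable $execute, callable $commit, callable $rollback): array {
    $current = null;
    try {
        foreach ($sqls as $sql) {
            $current = $sql;
            $execute($sql);
        }
        $current = 'COMMIT';
        $commit();
        return ['ok' => true, 'failed' => null, 'error' => null];
    } catch (Exception $e) {
        $rollback();
        return ['ok' => false, 'failed' => $current, 'error' => $e->getMessage()];
    }
}

// Usage with a fake executor that rejects one statement.
$log = [];
$result = run_all_or_nothing(
    ['UPDATE a SET x = 1', 'UPDATE broken', 'UPDATE b SET y = 2'],
    function (string $sql) use (&$log) {
        if (strpos($sql, 'broken') !== false) {
            throw new RuntimeException('syntax error');
        }
        $log[] = $sql;
    },
    function () use (&$log) { $log[] = 'COMMIT'; },
    function () use (&$log) { $log[] = 'ROLLBACK'; }
);
print_r($result);
```

The caller gets back the failing statement and the message, which is the information the snippet above gives no obvious way to retrieve.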

xavier

Re: smart automatic merge dedupe?
August 09, 2011, 01:24:44 am
I doubt it does ;(


This forum was archived on 2017-11-26.