CiviCRM Community Forums (archive)

*

News:

Have a question about CiviCRM?
Get it answered quickly at the new
CiviCRM Stack Exchange Q+A site

This forum was archived on 25 November 2017. Learn more.
How to get involved.
What to do if you think you've found a bug.



  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion »
  • Scalability (Moderator: Donald Lobo) »
  • smart automatic merge dedupe?
Pages: 1 2 [3] 4 5

Author Topic: smart automatic merge dedupe?  (Read 23789 times)

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: smart automatic merge dedupe?
July 30, 2011, 02:04:30 am
Hi,

Very interesting progress. Could you clarify what you mean by schema (looks like the db name) and datadictionary?

How do you use the class? If the auto merge is supposed to be safe, shouldn't it be run without html report from the cron?

I'm probably not having the same assumption than you on how it's supposed to be plugged and used in the system, so please allow me to explain how I understand it and correct me, so I can read your code looking at the right thing.

I had in mind an api contact.merge (with id A and id B as param) that would either merge the contacts, or return an error (eg. with the explanation of why it couldn't merge (eg different first names, don't know which one to keep). That can either be run from a cron automatically or from the UI (ajax) eg offer an automerge option.

With that in mind, looking at your code:
- you mix UI stuff (generating the html report) and code (merging). I'd much rather see your class only generating warning/error strings/code and let the caller do the html report (or whaterver report format they want, a json or plain text for cli).
- you seem to have introduced new tables (egc_xxx and ems_xxx), can you provide the sql to create them?
- a bigger concern of starting from scratch and not re-using the existing merge stuff: means duplicated features doing kinda the same stuff, ie. twice more maintenance and risk of error everytime civicrm schema is modified.

It really would be useful to have one code (at that point, don't have an opinion if it should be yours or CRM/Contact/Form/Merge.php). As lobo was pointing it out, the BAO seems to be the logical place to put it.

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 30, 2011, 02:45:33 am
Hey Xavier!

Thanks for replying

Some of your questions are answered in the code, but I’m very happy to clarify as as I’d love your help to properly assimilate this code…

autoMergeSQL is the primary interface and it’s what you’d interact with from the cli – and this is the function that needs an api wrapper around it. Note that this function only generates the sql – it does not execute it. (?yet) but it’s the function you want (I hope!!). this function returns a plain text error report on failure.

By data dictionary, I guess I mean the civicrm xml files… I read somewhere that is what civi uses to generate the some of the code.

Re the html, you’ll hopefully be pleased to hear its not as bad as you think. The three html function are for development and debugging purposes – run them to explain how the class is proposing to perform the merge and how it got there – these functions merely provide a window into the logic called upon by autoMergeSQL – if you wanted I could easily tweak them to return a nested array instead of a html report. Given the need to do more work on this I wanted to build some proper windows into the machinery as I when along.

(don’t worry about those additional tables – I didn’t make them, but they are part of the legacy environment I’m in – they would be harmless if they’re missing)

re “schema” – good question, sorry I didn’t realise would cause confusion (maybe showing my oracle roots) yep schema = database in mysql see http://lists.mysql.com/mysql/211616 

Again, I am sorry for not reusing the existing code – but (1) its very scantly commented and I couldn’t follow it (2) it looked like it needed major surgery to extract form dependencies and (3) it is full of significant ‘todos’ and ‘fixmes’


Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: smart automatic merge dedupe?
July 30, 2011, 03:55:58 am

I just took a quick look at the current code, specifically the function postProcess (note: i did not write it, so have no vested interest in defending it :)

1. The only dependency on the form object, is the formValues at the very top of the function

2. There are 3 fixme's (two of them being similar/same). Could not find any todo's

While starting from scratch is a good idea in some cases, am not sure the current replacement is a drop in replacement for the current code base.

lobo
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 30, 2011, 04:20:13 am
sorry you're correct there are no todos

but there are five more fixme in civicrm\CRM\Dedupe\Merger.php

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: smart automatic merge dedupe?
July 30, 2011, 04:31:04 am
Quote from: Erich Schulz on July 30, 2011, 02:45:33 am
By data dictionary, I guess I mean the civicrm xml files… I read somewhere that is what civi uses to generate the some of the code.

You might want to use the api to access it. Every entity got a getField method, eg:

http://drupal.demo.civicrm.org/civicrm/ajax/doc#/civicrm/ajax/rest?json=1&debug=1&version=3&entity=Contact&action=getfields

(press go, that's the values you want)

To get the list of entities:

http://drupal.demo.civicrm.org/civicrm/ajax/doc#/civicrm/ajax/rest?json=1&debug=1&version=3&entity=Entity&action=get

Quote from: Erich Schulz on July 30, 2011, 02:45:33 am

Re the html, you’ll hopefully be pleased to hear its not as bad as you think. The three html function are for development and debugging purposes – run them to explain how the class is proposing to perform the merge and how it got there – these functions merely provide a window into the logic called upon by autoMergeSQL – if you wanted I could easily tweak them to return a nested array instead of a html report. Given the need to do more work on this I wanted to build some proper windows into the machinery as I when along.


Yeap, getting the errors/warning as array will be easier to wrap in an api

Quote from: Erich Schulz on July 30, 2011, 02:45:33 am

re “schema” – good question, sorry I didn’t realise would cause confusion (maybe showing my oracle roots) yep schema = database in mysql see http://lists.mysql.com/mysql/211616 



$cfg = CRM_Core_Config::singleton();

and you got more or less all what you want. Top of my head: $cfg->dsn

btw, to run a sql query within civi: CRM_Core_DAO::executeQuery

Quote from: Erich Schulz on July 30, 2011, 02:45:33 am

Again, I am sorry for not reusing the existing code – but (1) its very scantly commented and I couldn’t follow it (2) it looked like it needed major surgery to extract form dependencies and (3) it is full of significant ‘todos’ and ‘fixmes’

cf. lobo's comment. I agree with you about the existing code not being super clear/commented, but pretty sure the right answer is to replace it by your new code (and you will need to understand it first) or refactor it (and you will need to understand it first too).

Looking at your code, isn't it a risk handling custom set (entity_id) that it's merging custom set on a participant that has the same id than the contact?

eg. you are merging contact 1 and 42, you got a custom data for participant entity_id=42, you end up putting it with entity_id 1, ie. to another unrelated participant?

Basically, the schema might not be that fit to "simply" merge without having too many assumptions that are going to bite you at one point or another. Using CRM_Dedupe_Merger might be more complex now, but avoid quite a few problems down the road.

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 30, 2011, 05:06:41 am
thanks Xavier  :)

Thanks for the api links, but what i'm suggesting is that each conctact_id foreign key field gets a new attribute of "automerging behaviour", ie how should this record be handled during an automated contact merge?

- block automerge (ie because mergeing this information isn't safe without human review)
- updated (to point to target conctact_id),
- deleted altogether
- ignored
- examined in detail

if someone in the core adds these attributes then the automerge class just needs to read them (ie in the *_list functions)

the same applies to the columns of civcrm_contact that are examined in columnAutomergeBehaviour()... those behaviours eg "BlockIfGreater"  for do_not_email or "BlockOnDifferentValue" is_deceased should also go into the datadictionary then be read from there


re the entity_id being a reference to a participant - i *think* i have that covered with excludeEntitiesSQLList() which lists the entities which I thought are not a synonym for the a contact



Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 30, 2011, 06:30:42 am
Lobo, I am also not proposing this code is a replacement for the existing code. An automatic merge is quite a different beast to a human mediated merger. I’m trying to apply the 80:20 rule to the automerges – most of them are going to be very simple records with limited baggage the big challenge is to make sure I abort on hitting one of the 20% where it’s a bit trickier.

I think I have pretty much done the code for the fixme at 212 of merger.php – but I’m a bit confused by the structure at 262

I am really curious about what the point of the long and windy road is with the three queries at line 327 of merger.php – am I missing something?? or is that excessively convoluted??

Both types of merge would do well to be reading the core data dictionary (through the BAO/DAO as appropriate) but it seems that the functions would be actually for the most part be reading different attributes.

I’m afraid (as Eileen can attest) that I’m a bit constrained by my current environment which is a bit ‘non-standard’… so honest I’m trying to comply with sandpit norms but…

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 31, 2011, 12:02:16 am
Hey all, I would appreciate some more feedback on this class – I understand the importance of code reuse, but
in reality the only code I could have reused was to replace the functionality between lines 175-190. And there didn’t seem to be a simple way to refactor the many hundred lines of code that are designed to do something quite different into the actually very simple 15 lines.

Xavier autoMergeSQL() is ready to go – it just needs an api wrapper around it. It returns a single array with elements: is_error, error_message (on failure), sql (on success) and report.

A fair amount of effort went into making automerge class transparent and easy to refine and maintain by others. Lines 57 – 137 do nothing other than provide a trace report to make it easier for ongoing development. These function are purely there to quickly generate reports for the eyes of fellow programmers (these are ‘reports’ – they are not data, which is why they generate html), if you want the information reported by these functions then call the same function that the html reporter is accessing. 

Lines 208 – 254 clearly document the key settings.

Lines 255-322, 394-439 and 442-459 are designed to be replaced by meta-data reads of “data automerge behaviour”, defining (1) what to do with foreign key relationships (2) what to do with civicrm_contact fields (3) and any special record level processing of particular fields, respectively. As I’ve said I’m happy to work with someone in core to implement that. If you want…

Lines 364-393 implement a range of potential system reactions to non-identical data in records proposed to be automatically merged.

Most of the remaining lines are either simple api-wrappers or schema reading and filtering functions not available in the current code.

I’ll swap out the drupal api calls for civicrm. So I’m happy to keep working on this, but I need to know I’m not wasting my time.

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: smart automatic merge dedupe?
July 31, 2011, 01:37:59 pm
Quote from: Erich Schulz on July 31, 2011, 12:02:16 am
Xavier autoMergeSQL() is ready to go – it just needs an api wrapper around it. It returns a single array with elements: is_error, error_message (on failure), sql (on success) and report.


For the api, check out examples in api/v3, you got civicrm_api3_create_success and civicrm_api3_create_error to generate the array to return in the right format (and does some clean up to keep the memory under control too).

IMO, the api should be contact.merge (array (id => 1, id_duplicate => 42))
it returns either an array of errors (if it couldn't auto merge), or the record of the contact (assuming you have all the data anyway) like a contact.get

Make sense?

Quote from: Erich Schulz on July 31, 2011, 12:02:16 am

I’ll swap out the drupal api calls for civicrm. So I’m happy to keep working on this, but I need to know I’m not wasting my time.

Definitely something that's going to be useful for me, but still not convinced about the going to the sql approach (eg. you can't know if an entity_id=42 is for the contact 42 or the participant 42 or the activity 42 without having to dig more into civicrm logic than looking at the schema).

And could you start writing unit tests? That will help trusting that the auto merge doesn't fubar too much the db ;)

@lobo & co, what do you think?

As for your PM: api.entity.get returns the list of entities, api.{nameOfTheEntity}.getfields returns the list of all the fields (in values[]). That's these ones that will contain contact_id or contact_id_a or entity_id (cf above about the warning that entity id isn't only about contacts).
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 31, 2011, 03:55:28 pm
xavier - so the wrapper you want is something like this - the return value is the contact_id on success,

Code: [Select]
Function automerge_api_interface($param) {
  $result = AgcAutoMerge::autoMergeSQL($param['id'], $param['id_duplicate']);
  if ($result['is_error']) {
    return array($result['error_message']);
  }
  else {
    // where is the api definition for execute query?
    CRM_Core_DAO::executeQuery($result['sql']);
    if {{{no error returned by executeQuery}}} {
      return ($param[id];)
    }
    else {
      return {{{errors returned by executeQuery();}}}
    }
  }
}

I know you’re all vollies and are only working for warm fuzzies too, but sorry guys when I contributed this I did it on the assumption that someone who is intimately familiar with civicrm would pick it up and go with it, I have a full time job, 3 kids and a wife, for a “hobby” I’m going to be sysadmin supporting hopefully about 20-30 users managing 20000 contacts in a major campaign culminating in March. I have a lot to do. I could have just hacked this de-dupe and got on with the other 20-50 things on my to-do list but this seemed like a way I could contribute to civicrm.

Sorry this is a long and windy way of saying I don’t have time to write formal test cases (even tho i understand their value!) and you’ll need to be creative to write a test that proves an entity has disappeared without leaving orphan entities or corrupted references.

Re using the sql vs api, I think the API is a great and amazing and really useful thing, but so is SQL! if you want to re-write the 15 lines of code that implement the writes (updates and deletes) be my guest (more specifically replace lines 183 and 188 with an api call if you want, but you'll lose functionality and make the code less transparent as far as i can tell). I'm also happy to help integrate datadictionary reads to find the foreign key references if they exist. The benefit of constructing the update as an sql script, other than scalability and efficiency is that an sql script can run as a single transaction and can also simply be aggregated into batches. The existing manual merge code uses a good smattering of sql - so I'm assuming that 'safe entity transfer' isn't a current api call

so my question about the entitiy_id columns is:

*  how to identify which ones are a direct foreign key into civicrm_contact.id ?

I've already got a list of all collumns with those names and i've had a stab at eliminating the entity id's that don't point at contacts my stab at the logic is in getContactKeyFields(). if i can do this with the api cool, how do i? sorry i currently i can't see how, i can only get as far are pulling a list of field names, not the foreign key relationships
« Last Edit: July 31, 2011, 06:17:51 pm by Erich Schulz »

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: smart automatic merge dedupe?
August 01, 2011, 01:01:19 am
Quote from: Erich Schulz on July 31, 2011, 03:55:28 pm
The existing manual merge code uses a good smattering of sql - so I'm assuming that 'safe entity transfer' isn't a current api call

My point is that I don't like sql spread in the manual merge already and I think it should be in the BAO that is used both by the manual and the auto. By spreading the sql even more, that's making the problem bigger (and fully aware you didn't create the problem in the first place, and that there is a finite amount of time in a day). And pretty sure you agree with me on a theoretical level ;)

Quote

so my question about the entitiy_id columns is:

*  how to identify which ones are a direct foreign key into civicrm_contact.id ?


Short answer: you can't.

Long one: entity_id is often used to store an id of rows in different tables (so you can't put an external ref). That's based on the context that you can know if the id is in civicrm_contact or civicrm_case. Sometimes, that's kind of simple, eg civicrm_entity_tag has a entity_table that let you know what it is.

I can't tell you from the top of my head how that's done in details for the custom fields, but that's more complicated. I guess that's one of the reasons I was suggesting to re-use the exiting code, to hide that complexity and the "interesting and creative" ways of storing related stuff around a contact.

Can someone from the core join and see if we can be reasonably sure to cover all cases ?

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
August 01, 2011, 07:58:15 pm
Thanks Xavier for the many pointers

 :)

At Xaviers suggestion I've reposted the code here: https://github.com/ErichBSchulz/CiviCRMAutoMerge

I've done a major reorder of the functions around a bit into a more logical clustering - but the code itself is little changed since the first version. Only code changes are to remove the "support code" to a separate class, and de-drupalise the db calls.

 

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: smart automatic merge dedupe?
August 02, 2011, 08:09:23 am

so a few thoughts and comments from an incredibly brief look at the code:

1. Might help to package this as a drupal module so folks can use it as is (with a simple ajax interface to pick contact 1 and contact 2)

2. I would fix the ignore_list function and make it hookable. I'd remove your org's specific tables in there (agc_, ems_, civicrm_value_electorates)

3. I'd avoid using $_SESSION directly.

4. We have most of the schema info in PHP arrays / DAO objects. However i think your method of going to the DB directly might be faster. I dont kow what the current code does with regard to this (i.e. how much is hard coded vs dynamically figured out). However note that schema introspection queries in mysql are really really slow. So if u r doing it for a set of contacts, you might want to cache this information.

lobo

A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: smart automatic merge dedupe?
August 02, 2011, 09:29:42 am
Hi,

I really would like to have it as an API. Ok, I need to work on a hook to add/override API methods then, let me think about it.

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
August 02, 2011, 08:51:59 pm
@lobo - thanks for the feedback :-)


1. happy to package as drupal module if that is simplest - but would rather work with xavier to api it. rather than creating a formal module, I could just post my existing menu option and wrapper function... just want to avoid creating an expectation of ongoing maintence in parallel to the api - I guess it depends how long it takes to get the api published and if there is an urgent user need... happy to be guided

2. very happy to, is there is a "best practice" guide for creating hooks... or "creating hooks for dummies"... i'll have a look in the wiki, but if there is a url to current ideal practice that'd get me off to good start

3 & 4... cool - i have cached some of the schema reads (getContactKeyFields) but not all (Describe) - didn't realise they were slow actually, but just seemed like a good idea.... but I use session...  should I just put them in a static variable? like I did with schema()?

@xavier - i thought of a third api parameter (as well a the two ID's) - it would be helpful for some type of mode, eg

'carfeful' - conservative and only merge if no discernable risk of dataloss
'normal' - take a typical approach ie allow "Doug" into "Douglas" etc
'forced' - merge contact regardless of apparent data inconsistencies

I think this would be usfeul depending on the nature of the installation... an install with 300 carefull nurtured contacts will chose different behavious to an org with a scrappy list of a 1000000 that they've pulled together from multiple sources

btw is it just you and me who are on github - or do others use it too?

Pages: 1 2 [3] 4 5
  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion »
  • Scalability (Moderator: Donald Lobo) »
  • smart automatic merge dedupe?

This forum was archived on 2017-11-26.