CiviCRM Community Forums (archive)

*

News:

Have a question about CiviCRM?
Get it answered quickly at the new
CiviCRM Stack Exchange Q+A site

This forum was archived on 25 November 2017. Learn more.
How to get involved.
What to do if you think you've found a bug.



  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion »
  • Scalability (Moderator: Donald Lobo) »
  • smart automatic merge dedupe?
Pages: 1 [2] 3 4 5

Author Topic: smart automatic merge dedupe?  (Read 23789 times)

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 15, 2011, 09:44:53 pm
ok i have a sense of progress!

it seems to me that at some point there needs to be a table of contact pairs - the relationship table is a potential candidate that has been mentioned - but maybe there needs to be another table - this table would underpin phase (a) to (e) that I outlined previously


this would allow firstly some caching to occur and secondly recording the human opinion about the relationship - which means that potential duplicates only need to be assessed by a human once rather than every time a rule gets run


the following relationships need recording:

- a is a duplicate of b
- a is a possible duplicate of b
- a is not a duplicate of b

the definition of "is not a duplicate" could also include records marked as "spouse of" or "cohabitates with" etc


Eileen

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4195
  • Karma: 218
    • Fuzion
Re: smart automatic merge dedupe?
July 15, 2011, 09:47:57 pm
Sigh "- a is not a duplicate of b" is already a table in 3.3....
Make today the day you step up to support CiviCRM and all the amazing organisations that are using it to improve our world - http://civicrm.org/contribute

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: smart automatic merge dedupe?
July 15, 2011, 10:38:38 pm
Would be good to extend civicrm_dedupe_exception and add a "type" column (modifying the xml schema and run Gencode.php). Probably needs to modify a bit the existing code to only consider the records with the type "not dupe".

Don't know how much more complex it would be, but being able to flag which field(s) where in problem so it couldn't do the merge automatically would make it easier to manually go through the "duplicate needs manual merge", "probably duplicate needs manual merge" type.

Kind of related: Merging phone needs to be smarter. eg "483 39 40 34" is the same as "83394034" or "83.39.40.34". Bonus point to deal with international prefixes (001 or +1 for the US, 0041 or +41 for CH...). On that one, got the list of prefix ready to be imported in Civi.
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 17, 2011, 05:13:48 pm
sorry Eileen missed that one - nice to see my thinking is in parallel tho!

re the type in the dedup exception table does is that  to cope with degrees of uncertainty?

just to explore the need and usage: "Jack" and  "Jim" with the same home number maybe father and son or the same person...

the point is if someone looks at the record and quite reasonably shrugs and says "i dont' know if these are dupes or not" then the system needs to record that assessment - and there is then no point revisiting that record until you either have Jack or Jim on the phone for some other reason ("by the way, are you also known as Jim, or is that a different person?") - i guess ideally they go in a queue for the "local office manager" or "case worker" to resolve...

if these "uncertain cases" aren't actively recognised and managed they either build up and get repeatedly revisited by the central office whenever the de-dupe is run, or someone eventually just gets fed up and decides arbitrarily they are dupes or not-dupes - with a 50% error rate.

I think telephone number formatting is best handled outside of the dedupe process - each install should have its house rule for standardised formatting and this should be enforced by the UI with an SQL back-up - I have an SQL script for Australian numbers, but for each country there probably needs to be a php function (and maybe a js one too?) that does input reformatting and also an sql script to do the mass clean up - is there a good place to post my script?

Eileen

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4195
  • Karma: 218
    • Fuzion
Re: smart automatic merge dedupe?
July 17, 2011, 11:16:23 pm
I don't think Xavier was suggesting the dedupe should fix phone formatting but that if phone no is a dedupe criteria it should be able to see past the syntax
Make today the day you step up to support CiviCRM and all the amazing organisations that are using it to improve our world - http://civicrm.org/contribute

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 18, 2011, 01:04:10 am
sure and there are some good reasons actually reformat the telephone numbers in civicrm_phone (or at least offer sys-admins the ability to), and doing this work well outside of a deduping process:

- if the dedupe rule can assume pre-tidied phone numbers then its going to be a much faster job
- many phone numbers are duplicated within the same contact with minor format variations readily avoidable on imports by calling on the standard logic
- readability/ui issues - choose the most human readable standardised format (spaces, and even brackets and dashes etc if you must) then that's what you give the user
- finding area codes involves some guess work but probably worth doing (the script I wrote checked state, postcode and some custom data) and if you're going to attempt its better to present that logic to users in case your assumptions are overly presumptuous

this is a bit off topic - but this is script that illustrates what "tidy telephone numbers" means to me - this could run as a housekeeping cron if the ui in incapable of negotiating with the user to keep the numbers neat, or if back end jobs were causing ongoing pollution:

Code: [Select]
/* remove +61 from long numbers: */
UPDATE `civicrm_phone`
SET phone=replace(phone,'+61','')
WHERE phone LIKE '%+61%'
  AND length(phone) > 11;

/* Replace letter O with number 0, strip spaces, brackets, -.,` sign: */
UPDATE `civicrm_phone`
SET phone=trim(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
 phone,'O','0'),'o','0'),' ',''),'`',''),'*',''),',',''),'.',''),'[',''),']',''),'(',''),')',''),'-',''))
WHERE length(phone) BETWEEN 8 AND 16;

/* strip leading 61 */
UPDATE `civicrm_phone`
SET phone= substr(phone,3)
WHERE phone LIKE '61%'
  AND length(phone) IN (11,12); /*Should only be 11, but many invalid 610 numbers */

/* temporarily strip leading 0's off valid numbers */
UPDATE `civicrm_phone`
SET phone= substr(phone,2)
WHERE phone LIKE '0%'
  AND length(phone) = 10;

/* attempt to supply missing area codes from first digit of post code */
UPDATE
  civicrm_phone AS p,
  civicrm_address AS a
SET p.phone=concat(CASE substr(a.postal_code, 1, 1)
  WHEN 2 THEN 2
  WHEN 3 THEN 3
  WHEN 4 THEN 7
  WHEN 5 THEN 8
  WHEN 6 THEN 8
  WHEN 7 THEN 3
  WHEN 8 THEN 8
END, p.phone)
WHERE length(phone) = 8
  AND a.contact_id = p.contact_id
  AND  substr(postal_code, 1, 1) BETWEEN 2 AND 8;

/* attempt to supply missing area codes from state_province_id */
UPDATE
  civicrm_phone AS p,
  civicrm_address AS a
SET p.phone=concat(CASE state_province_id
  WHEN 1639 THEN 8
  WHEN 1640 THEN 2
  WHEN 1641 THEN 7
  WHEN 1642 THEN 8
  WHEN 1643 THEN 3
  WHEN 1644 THEN 3
  WHEN 1645 THEN 8
END, p.phone)
WHERE length(phone) = 8
  AND a.contact_id = p.contact_id
  AND  state_province_id BETWEEN 1639 AND 1645;


/* attempt to supply missing area codes from civicrm state group */
UPDATE
  civicrm_phone AS p,
  civicrm_group_contact AS gc
SET p.phone=concat(CASE gc.group_id
  WHEN 116 THEN 3 /* vic */
  WHEN 117 THEN 2 /* act */
  WHEN 118 THEN 8 /* nt */
  WHEN 119 THEN 2 /* nsw */
  WHEN 120 THEN 7 /* qld */
  WHEN 122 THEN 8 /* sa */
  WHEN 123 THEN 3 /* tas */
  WHEN 124 THEN 8 /* wa */
  WHEN 494 THEN 3 /* vic */
END, p.phone)
WHERE length(phone) = 8
  AND gc.group_id in (116,117,118,119,120,122,123,124,494) /*ignore 115 AG and 121 fedmps */
  AND gc.status='Added'
  AND gc.contact_id=p.contact_id;

/* put back the spaces in 1800 and 1300 numbers: */
UPDATE `civicrm_phone`
SET phone=CONCAT(SUBSTR(phone,1,4),' ',SUBSTR(phone,5,3),' ',SUBSTR(phone,8,3))
WHERE phone LIKE '1_00%'
  AND length(phone) = 10;

/* put back the spaces and area codes on correctly constructed mobile and home numbers: */
UPDATE `civicrm_phone`
SET
  phone=IF(phone LIKE '4%',
    CONCAT('0',SUBSTR(phone,1,3),' ',SUBSTR(phone,4,3),' ',SUBSTR(phone,7,3)),
    CONCAT('0',SUBSTR(phone,1,1),' ',SUBSTR(phone,2,4),' ',SUBSTR(phone,6,4)))
WHERE phone BETWEEN 100000000 AND 999999999
  AND length(phone) = 9;



Eileen

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4195
  • Karma: 218
    • Fuzion
Re: smart automatic merge dedupe?
July 18, 2011, 01:36:28 am
Quote
- if the dedupe rule can assume pre-tidied phone numbers then its going to be a much faster job

You're right - I don't think it can deal with untidied phones without an unacceptable performance hit,

I guess there must be some place on the wiki were you could share your script (although no-one knows their way around the wiki :-)
Make today the day you step up to support CiviCRM and all the amazing organisations that are using it to improve our world - http://civicrm.org/contribute

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 18, 2011, 04:17:08 am
Oh good i'm glad it's not me getting lost in the wiki then... thought just occurred to me that that script could easily be modified to include a $start_phone_id and an $end_phone_id parameter... to plug into a "AND id BETWEEN $start_phone_id AND $end_phone_id" clause ...

to allow scalabilty and load management

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 21, 2011, 06:34:28 am
ok back on topic - I've been give the hurry up on deduping - i've just saved a poor admin officer who knows how long by asking if I can have a crack at auto merging before the manual merge starts (1700 dupes on first look alone!)

i've gotten as far as sniffing out the the fields in the schema where a contact_id (possibly hiding as entity_id) may live



Code: [Select]

  /**
   * Access the information_schema to pull out all fields with contacts
   *
   * This function supports deduping by providing a complete list
   * of all tables in the schema - even if CiviCRM does not know about them!
   * It includes all table.collumn containing contact_id or entity_id, except
   * custom civitable relating to a non-contact id
   * // fixme - assumes all data is found in a single schema
   * @param boolean $clear_cache maybe useful if adding tables or testing
   * @uses $_SESSION to cache results
   * @return array of (schema_name).(table name).(key field)
   */
  static public function getContactKeyFields($clear_cache = false) {
    // check for cached table list:
    if ($clear_cache || !array_key_exists('_agc.dedupefields',$_SESSION)) {
      $schema = 'au_drupal';  //fixme this should look up correct setting
      // 1 find tables with 'contact_id' as a field
      $sql = <<<sql
        SELECT concat(
          `TABLE_SCHEMA`, '.',
          `TABLE_NAME`, '.',
          `COLUMN_NAME`) as value
        FROM information_schema.COLUMNS
        WHERE `COLUMN_NAME` LIKE '%contact_id%'
          AND `TABLE_SCHEMA` = '$schema'
sql;
      // 2 add in tables with 'entity_id' as a field
      $FieldList[1] = AgcDedupe::sqlToArray($sql);
      $sql = <<<sql
        SELECT concat(
          `TABLE_SCHEMA`, '.',
          `TABLE_NAME`, '.',
          `COLUMN_NAME`) as value
        FROM information_schema.COLUMNS
        WHERE `COLUMN_NAME` LIKE '%entity_id%'
          AND `TABLE_SCHEMA` = '$schema'
sql;
      $FieldList[2] = AgcDedupe::sqlToArray($sql);
      // 3 remove civi custom tables where entity_id is not a contact
      $sql = <<<sql
        SELECT concat('$schema.',`table_name`, '.entity_id') as value
        FROM au_drupal.civicrm_custom_group
        WHERE `extends` IN ('Location','Address','Contribution','Activity',
          'Relationship','Group',
          'Membership','Participant','Event','Grant','Pledge','Case')
sql;
      $FieldList[3] = AgcDedupe::sqlToArray($sql);
      $_SESSION['_agc.dedupefields'] = // = 1 + (2-3)
        $FieldList[1] + array_diff($FieldList[2], $FieldList[3]);
    }
    return $_SESSION['_agc.dedupefields'];
  }
 
  /**
   * Produce a list of columns where a given contact_id is found
   *
   * @uses getFieldList() to identify field to search in db schema
   * @return array with (table name).(key field)
   */
  static public function locate($contact_id) {
    $keys = AgcDedupe::getContactKeyFields(true);
    $locations = array(); // places $contact_id is found
    foreach ($keys as $key) {
      $split = explode('.',$key);
      $table = $split[0].'.'.$split[1]; // = schema.table
      $field = $split[2]; // = field_name
      $sql =
        "SELECT count(*) AS n FROM $table WHERE $field=".(int)$contact_id.';';
      $res=db_query($sql);
      $db_row =  db_fetch_object($res);
      $n = $db_row->n;
      if ($n) { // contact found!
        $locations[] = $key;
      }
    }
    return $locations;
  }

this is my first sniff for a duplicate:


Code: [Select]
    au_drupal.civicrm_activity.source_contact_id
    au_drupal.civicrm_activity_target.target_contact_id
    au_drupal.civicrm_email.contact_id
    au_drupal.civicrm_group_contact.contact_id
    au_drupal.civicrm_group_contact_cache.contact_id
    au_drupal.civicrm_mailing_event_queue.contact_id
    au_drupal.civicrm_participant.contact_id
    au_drupal.civicrm_subscription_history.contact_id

questions remain as I asked(ish) earlier...

which of these can i just "update table set field =newid where field =  oldid;"??

any pointers welcome!!
« Last Edit: July 21, 2011, 07:06:49 am by Erich Schulz »

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: smart automatic merge dedupe?
July 21, 2011, 12:41:32 pm

any reason why the obvious answer of: "you should do it for all of them" is not right

lobo
 
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

Eileen

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4195
  • Karma: 218
    • Fuzion
Re: smart automatic merge dedupe?
July 21, 2011, 01:31:38 pm
Erich - I think if you decide something is a duplicate you should merge all data about them if you are doing it automatically. I think that too much data retained is safer than not enough.

The risk with autodeduping is merging non-duplicates (which you have already thought through)

You can truncate this table rather than merge it

civicrm_group_contact_cache
Make today the day you step up to support CiviCRM and all the amazing organisations that are using it to improve our world - http://civicrm.org/contribute

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 21, 2011, 06:56:11 pm
thanks all  - my assumption is this must be reasonbably complicated since no-one had done it - if it's simpler than I imagined that is great!

@lobo: I guess that's my question - just want to see if there are potential unintended consequences - I thought I better ask given the scary notice (that I'm currently unable to locate in the wiki) warning the sky will fall on my head if I directly interact with the database

which tables (or actually fields) should:

- block an auto dedupe and force human review and traditional dedupe screen? are you saying there are none?
- the dedupe simply ignore and not update?
-  get special handling in some other way?

@eileen - I'm thinking rather than truncating the group_cache I may just ignore it but document the need to consider truncating it after a batch process

I'm going to have to think about civicrm_contact itself too - I guess I just need to scan for it for inconsistencies (eg one record with a "do not email" and one with)

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: smart automatic merge dedupe?
July 21, 2011, 10:18:28 pm

I would first check and see if you can use the merge function and get it to merge all fields

check: CRM/Contact/Form/Merge.php, function postProcess

we'll first need to refactor the code and put it in its own BAO and isolate it from form dependencies

lobo
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 29, 2011, 05:35:55 am
ok comments welcome!

there are plenty of comments in the file and a few todo's but this is about as simple and safe as I can make it!

I'd really appreciate the core teams appraisal - I'm hoping someone in the core will rise to the challenge of some of the todo's

(sorry Lobo - I looked at the file you linked to and I'm afraid I found it pretty cryptic and also full of todo's... so I started green-field... hopefully my code is clear and open enough that we can run with it)

:-)

the current primary todos are:

 * TODO:  Define the automerge behaviour of tables and fields in the datadictionary then...
 * TODO:  Rewrite the *_list functions and the columnAutomergeBehaviour functions to read the datadictionary
 * TODO:  Rewrite schema() to use correct setting
 * TODO:  db_query needs to be replaced with civicrm equivilant!
 * TODO:  ideally this should be undoable - achievable if a list of updates is made and stored as a note attached to the 'deleted' contact

if someone in the core can run with item 1, i'm happy to keep playing with 2,3,4 and maybe five

(fyi - I've asked people in my local install to comment as well)


Erich Schulz

  • I post frequently
  • ***
  • Posts: 142
  • Karma: 5
    • When no-one understands what you are going on about its time to start a blog
  • CiviCRM version: 4.4
  • CMS version: Drupal 7
  • MySQL version: 5.somthing
  • PHP version: 5.3.3
Re: smart automatic merge dedupe?
July 29, 2011, 05:32:03 pm
based on some early feedback from my main site, I'll amend the behaviour of the first_name check (as specified in columnAutomergeBehaviour() and implemented in automergeSuitablilityTestSQL() ) so it only accepts blanks, exact matches or initial (single character) matches, so it will force manual review of:

 "doug"->"douglas" and
 "tom" -> "tom and jerry"

But it will allow automerge of:
"d" -> "douglas"

(remembering the decision has already been made that the two records in question are duplicates)

so the test in question will need to change from
Code: [Select]
IFNULL(ca.$column,'') NOT LIKE CONCAT(IFNULL(cb.$column,''),'%')
to a new behaviour 'AllowFirstcharBlankOrMatch' like

Code: [Select]
IFNULL(ca.$column,'') NOT LIKE CONCAT(IFNULL(cb.$column,''),'%') AND UPPER(SUBSTR(ca.$column,1,1))<>UPPER(cb.$column)
« Last Edit: July 29, 2011, 05:46:39 pm by Erich Schulz »

Pages: 1 [2] 3 4 5
  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion »
  • Scalability (Moderator: Donald Lobo) »
  • smart automatic merge dedupe?

This forum was archived on 2017-11-26.