CiviCRM Community Forums (archive)

This forum was archived on 25 November 2017.



  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion »
  • Scalability (Moderator: Donald Lobo) »
  • Proof of concept to reduce time to run a dedupe from a few days to a few minutes
Pages: [1] 2

Author Topic: Proof of concept to reduce time to run a dedupe from a few days to a few minutes  (Read 13042 times)

dalin

  • I post occasionally
  • **
  • Posts: 89
  • Karma: 8
  • CiviCRM version: many
  • CMS version: Drupal 6
  • MySQL version: 5.0
  • PHP version: 5.2
Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 04, 2010, 05:32:55 am
While dedupe is better than it was in 2.1, it's still mostly unusable on large datasets.  Running a rule manually with just one field (and no length) works OK, but anything with more than one field, or using a length, will take hours or days (and will likely time out before it completes).  Even more dangerous, this tends to lock the tables in question, preventing INSERTs and UPDATEs while the dedupe is in progress.  But with some small changes we can reduce the SQL time from a few days to a few minutes.  

Currently the main query looks like this:

Code: [Select]
CREATE TEMPORARY TABLE dedupe
SELECT t1.contact_id id1, t2.contact_id id2, 5 weight
FROM civicrm_email t1
JOIN civicrm_email t2
   ON (SUBSTR(t1.email, 1, 6) = SUBSTR(t2.email, 1, 6))
WHERE t1.contact_id < t2.contact_id
  AND (t1.contact_id IN (12,88,121,156,164,169,170) OR t2.contact_id IN (12,88,121,156,164,169,170))
UNION ALL SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
  USING (first_name)
WHERE t1.id < t2.id
  AND (t1.id IN (12,88,121,156,164,169,170) OR t2.id IN (12,88,121,156,164,169,170));
Those IN() clauses will only exist if you are deduping a group rather than the entire set of contacts.  Unfortunately, even deduping a group of 30k contacts against a total of 600k contacts still takes several hours.  An additional UNION is added for each additional field in the dedupe rule.  The crazy JOIN with the SUBSTR() functions comes from using a length on the given field.  

Let's run EXPLAIN on this:
Code: [Select]
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
         type: ALL
possible_keys: FK_civicrm_email_contact_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 597673
        Extra:
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: t2
         type: ALL
possible_keys: PRIMARY,FK_civicrm_email_contact_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 597673
        Extra: Using where
*************************** 3. row ***************************
           id: 2
  select_type: UNION
        table: t1
         type: index
possible_keys: PRIMARY,index_first_name
          key: index_first_name
      key_len: 195
          ref: NULL
         rows: 565324
        Extra: Using index
*************************** 4. row ***************************
           id: 2
  select_type: UNION
        table: t2
         type: ref
possible_keys: PRIMARY,index_first_name
          key: index_first_name
      key_len: 195
          ref: democrats_civicrm_staging.t1.first_name
         rows: 4
        Extra: Using where; Using index
*************************** 5. row ***************************
           id: NULL
  select_type: UNION RESULT
        table: <union1,2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra:

Ouch, there's a full table scan on both tables in the first query. The total number of rows analyzed in this query is 357,215,276,225 to dedupe 15 contacts.  I've never seen it finish running even for just 15 contacts (and I've waited hours).  I'm not even going to calculate the rows analyzed to dedupe all 600k contacts.  
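For the curious, that rows figure can be reproduced from the EXPLAIN output above: multiply the per-table `rows` estimates within each SELECT, then sum across the two halves of the UNION (a back-of-the-envelope sketch, not MySQL's exact cost model):

```python
# Rough row-count estimate from the EXPLAIN output above.
# Within each SELECT id, MySQL examines roughly the product of the
# per-table "rows" estimates; the UNION sums the two SELECTs.
email_scan = 597673 * 597673   # SELECT id 1: full scan of t1 x t2 on civicrm_email
name_scan = 565324 * 4         # SELECT id 2: index scan of t1, ~4 ref rows in t2

total = email_scan + name_scan
print(total)  # 357215276225 -- the figure quoted above
```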

There are a number of ways we can improve this:
1) The JOIN with no index is the big killer.  MySQL can't use an index on this JOIN because of the string functions, so instead we can take all the values, truncate them, and put them into another temporary table before inserting into dedupe.  The challenge is that our dedupe query joins the table to itself, which TEMPORARY tables can't do.  Instead we can use a MEMORY table.  

2) Use MEMORY tables instead of TEMPORARY tables.  This has the added benefit of being much faster than a temporary MyISAM table (which is what this likely is, since we don't specify an engine).  It will, however, require checking the max_heap_table_size session value (increasing it if necessary) and checking the available RAM, to avoid running out of space.  We should probably be doing that anyway: if a dedupe table gets sent to disk it will never finish and will take down the server.  In the example below, deduping a group with 30k records, civicrm_temp_dedupe is 184MB.  Also, since MEMORY tables are visible to all sessions, we'd need a random string suffix on table names to avoid collisions.  See
http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html
This also means that we need to create the table explicitly before inserting into it.  

3) Split the IN() OR IN() into two queries.  MySQL doesn't seem to be able to utilize indexes well with the OR.  

4) Create the temporary table before inserting into it, and don't use a UNION.  MySQL then doesn't need to combine all the results before inserting them into the table; just let MySQL be smart about it.  This also means we can add indexes to the temporary tables to speed up subsequent steps.
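The size check from point 2 could be sketched like this (a rough estimator only: the column widths and the 16-byte per-row overhead are assumptions based on the MEMORY engine's fixed-length row format, not measured values):

```python
def estimate_memory_table_bytes(rows, fixed_row_bytes, overhead_per_row=16):
    """Rough upper bound for a MEMORY table: the engine stores
    fixed-length rows plus some per-row bookkeeping (the 16-byte
    figure is a guess; measure on a real server before relying on it)."""
    return rows * (fixed_row_bytes + overhead_per_row)

# civicrm_temp_email_truncated: INT contact_id (4 bytes) plus a
# VARCHAR(6) stored fixed-length (~7 bytes, single-byte charset assumed).
per_row = 4 + 7
needed = estimate_memory_table_bytes(600_000, per_row)
print(needed // (1024 * 1024), "MB (order-of-magnitude only)")
```

The point is only to compare the estimate against max_heap_table_size and available RAM before starting, and warn the user (or fall back) rather than silently spilling to disk.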

Here's my modified example.  

Code: [Select]
SET SESSION max_heap_table_size = 1024 * 1024 * 1024;

DROP TABLE IF EXISTS civicrm_temp_email_truncated;

CREATE TABLE civicrm_temp_email_truncated (
  contact_id INT(8) UNSIGNED NULL DEFAULT NULL,
  email VARCHAR(6) NULL DEFAULT NULL,
  INDEX (contact_id),
  INDEX (email)
) ENGINE MEMORY;

INSERT INTO civicrm_temp_email_truncated (contact_id, email)
SELECT contact_id, LEFT(email, 6)
FROM civicrm_email;

DROP TABLE IF EXISTS civicrm_temp_dedupe;

CREATE TABLE civicrm_temp_dedupe (
  contact_id_1 INT(8) UNSIGNED NULL DEFAULT NULL,
  contact_id_2 INT(8) UNSIGNED NULL DEFAULT NULL,
  weight INT(2) UNSIGNED NULL DEFAULT 0
--  ,INDEX (contact_id_1, contact_id_2)
) ENGINE MEMORY;

INSERT INTO civicrm_temp_dedupe (contact_id_1, contact_id_2, weight)
SELECT t1.contact_id AS contact_id_1, t2.contact_id AS contact_id_2, 3 AS weight
FROM civicrm_temp_email_truncated t1
INNER JOIN civicrm_temp_email_truncated t2
  ON t1.email = t2.email
WHERE
  t1.contact_id < t2.contact_id
  AND t1.contact_id IN (12,88,121,156,164,169,170,217,220,267,293,350,361,418,496,519,532,555,565,574,578,595,705,711,725,756,774,781,782,793,837,838,843,846,914,932,949,950,955);

INSERT INTO civicrm_temp_dedupe (contact_id_1, contact_id_2, weight)
SELECT t1.contact_id AS contact_id_1, t2.contact_id AS contact_id_2, 3 AS weight
FROM civicrm_temp_email_truncated t1
INNER JOIN civicrm_temp_email_truncated t2
  ON t1.email = t2.email
WHERE
  t1.contact_id < t2.contact_id
  AND t2.contact_id IN (12,88,121,156,164,169,170,217,220,267,293,350,361,418,496,519,532,555,565,574,578,595,705,711,725,756,774,781,782,793,837,838,843,846,914,932,949,950,955);

INSERT INTO civicrm_temp_dedupe (contact_id_1, contact_id_2, weight)
SELECT t1.id AS contact_id_1, t2.id AS contact_id_2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
  USING (first_name)
WHERE t1.id < t2.id
  AND t1.id IN (12,88,121,156,164,169,170,217,220,267,293,350,361,418,496,519,532,555,565,574,578,595,705,711,725,756,774,781,782,793,837,838,843,846,914,932,949,950,955);

INSERT INTO civicrm_temp_dedupe (contact_id_1, contact_id_2, weight)
SELECT t1.id AS contact_id_1, t2.id AS contact_id_2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
  USING (first_name)
WHERE t1.id < t2.id
  AND t2.id IN (12,88,121,156,164,169,170,217,220,267,293,350,361,418,496,519,532,555,565,574,578,595,705,711,725,756,774,781,782,793,837,838,843,846,914,932,949,950,955);

SELECT contact_id_1, contact_id_2, SUM(weight) AS threshold
FROM civicrm_temp_dedupe
GROUP BY contact_id_1, contact_id_2
HAVING threshold >= 7;

The SQL above executes in a mere few seconds.  Deduping a group of 30k contacts takes 300 seconds.  

The biggest offender is the final query.  I would've thought that a multi-column index on (contact_id_1, contact_id_2) would help, but EXPLAIN tells me it isn't utilized: it's a full table scan with no indexes being used.  One technique would be to send only those rows that exist more than once to a third temp table, but that might be difficult.  There's a trade-off between temp table size and performance that needs to be balanced.  

Unfortunately this won't work on the entire 600k contacts because civicrm_temp_dedupe takes too much memory: even with max_heap_table_size set to 2GB it runs out of space and gets sent to disk.  Deduping that many contacts may not be feasible with any technique, short of moving the data to a separate beefy server, deduping, and then moving it back.  

I'm proposing to the client that we make these modifications and submit a patch.  I'll let you know if they agree, but I thought I should start this dialogue either way.  
« Last Edit: March 04, 2010, 05:38:13 am by dalin »
--
Dave Hansen-Lange
Web Developer
Advomatic LLC
http://advomatic.com
Hong Kong office

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 05, 2010, 10:55:09 pm
These look like very interesting improvements.

One of the biggest issues users have is that for each contact you have to merge manually via a second screen, then come back and let the dedupe calculate the list again.

I don't have a solution, just raising the issue.

Maybe offer some sane defaults:
- keep the oldest (smallest ID)
- fill mode (i.e. keep the fields that are filled)
- copy activities/relationships/events...

and add that as an AJAX option on each line?

X+

-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

petednz

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4899
  • Karma: 193
    • Fuzion
  • CiviCRM version: 3.x - 4.x
  • CMS version: Drupal 6 and 7
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 06, 2010, 01:57:31 am
Doesn't doing the merge in a new tab avoid having to rebuild the list?  Though of course you then have to remember who you've merged.
Sign up to StackExchange and get free expert advice: https://civicrm.org/blogs/colemanw/get-exclusive-access-free-expert-help

pete davis : www.fuzion.co.nz : connect + campaign + communicate

dalin

  • I post occasionally
  • **
  • Posts: 89
  • Karma: 8
  • CiviCRM version: many
  • CMS version: Drupal 6
  • MySQL version: 5.0
  • PHP version: 5.2
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 07, 2010, 05:04:53 pm
Is anyone willing to offer some technical critique of my proposal?

One thing that should probably come with this is an intermediary screen that says things like:

Quote
In order to run this dedupe you need:
  • The MEMORY storage engine: success.
  • An estimated NNN MB of available memory.
    Currently you have only NN MB available.  The temporary tables required will be sent to disk instead and the dedupe may not be able to complete.
 
Do you wish to continue?
--
Dave Hansen-Lange
Web Developer
Advomatic LLC
http://advomatic.com
Hong Kong office

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 07, 2010, 11:03:19 pm
As for the MEMORY table, isn't there a high risk that the server will become dead slow for other users because it is running low on memory?

Have you tried solving point 1, which looks like the biggest problem, and then, instead of doing point 2 in memory, doing it on disk with MyISAM and the index?

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

dalin

  • I post occasionally
  • **
  • Posts: 89
  • Karma: 8
  • CiviCRM version: many
  • CMS version: Drupal 6
  • MySQL version: 5.0
  • PHP version: 5.2
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 08, 2010, 01:37:57 am
What we could do is:
If (available RAM > (estimated amount for our temp tables + 10% of total RAM)) {
  use MEMORY
}
else {
  use InnoDB
}

Let's avoid MyISAM  ;)
Memory would be much preferable, especially if we can get the running time down to something reasonable for large datasets.  It would be safe to let memory use climb to within 10% of total RAM if the thing only runs for under 5 minutes. 
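That pseudocode might look like the following in practice (illustrative only: the function name, the byte-valued parameters, and the 10% safety margin are taken from the sketch above, and the real implementation would live in CiviCRM's PHP layer):

```python
def pick_engine(available_ram, total_ram, estimated_table_bytes):
    """Choose a storage engine for the dedupe temp tables.
    Falls back to InnoDB when the MEMORY engine would push the
    server to within 10% of total RAM (threshold from the post)."""
    safety_margin = total_ram * 0.10
    if available_ram > estimated_table_bytes + safety_margin:
        return "MEMORY"
    return "InnoDB"

# 4 GiB free, 8 GiB total, ~200 MiB of temp tables: MEMORY is safe.
print(pick_engine(available_ram=4 * 2**30, total_ram=8 * 2**30,
                  estimated_table_bytes=200 * 2**20))
```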

I haven't yet had an opportunity to investigate how to improve that last query in the proposed change. 

The client wants to put this on hold pending other more pressing matters.  Based on where in the list they've placed it I don't see us getting to this within the next 6 months. 

I believe David Strauss gave architectural input for the original dedupe mechanism.  It would be great to get his input on this new approach. 
--
Dave Hansen-Lange
Web Developer
Advomatic LLC
http://advomatic.com
Hong Kong office

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 08, 2010, 03:36:57 am
I'm not sure MyISAM (or ARCHIVE) isn't more appropriate, as you don't need any transaction integrity.  What you'd need is an enhanced flat file where you can add an index, i.e. MyISAM.

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

torenware

  • I post frequently
  • ***
  • Posts: 153
  • Karma: 4
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 20, 2010, 10:47:02 pm
Quote from: dalin on March 07, 2010, 05:04:53 pm
Is anyone willing to offer some technical critique of my proposal?

One thing that should probably come with this is an intermediary screen that says things like:

Quote
In order to run this dedupe you need:
  • The MEMORY storage engine: success.
  • An estimated NNN MB of available memory.
    Currently you have only NN MB available.  The temporary tables required will be sent to disk instead and the dedupe may not be able to complete.
 
Do you wish to continue?

Dave,

Having done some serious tweaking on CiviCRM queries (more in 1.9 than later), I've found that the killer tends to be the huge temporary tables created for a query, which of course will not be indexed.  You've certainly done some serious stuff with ACLs, so you likely know exactly what I'm talking about.  And your EXPLAIN data indicates that you're running into a similar problem: huge temp tables that end up getting full scans.

I found that simply understanding how these temporary tables get constructed by the MySQL query planner was the best way to speed up what MySQL was doing.  Short version is:  anything you can do to reduce the size of these tables as created by various JOINs will speed up these kinds of queries dramatically.

Often the best way to do this is to move logic that lives in the WHERE clause and move it into the ON clauses for the various joins.   A good join is a sparse join, since that massive temporary table that kills your speed is typically N1 x N2 x N3 x .... Nn, which gets big very very fast, and very likely, larger than any MEMORY table you could reasonably construct.  The way to keep it as small as possible is to make sure that Ni is a small number, which you can do by putting restrictive logic into the ON clauses for the joins.

Hinting also helps, but much less, I found, than reducing the number of rows the WHERE clause needs to process.

I haven't looked at your query in detail, but some of those IN lists can be moved out of the WHERE clause into the ON clauses, and there's also the possibility that you could construct your temporary table in two steps: first using the SUBSTR operators, and then, after you construct the step-1 table, indexing the relevant columns.

EXPLAIN is definitely your friend here.  You've clearly already done some of this.  If you have your sample SQL, I'd say run EXPLAIN on it, and see if any of the rearrangements I'm proposing affect the size of the tables the query planner constructs.  Once you know what the efficient SQL looks like, it becomes more a matter of figuring out how to change the CiviCRM query-building code to make it do the "right thing".
« Last Edit: March 20, 2010, 11:03:45 pm by torenware »

dalin

  • I post occasionally
  • **
  • Posts: 89
  • Karma: 8
  • CiviCRM version: many
  • CMS version: Drupal 6
  • MySQL version: 5.0
  • PHP version: 5.2
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
March 22, 2010, 08:45:11 pm
@torenware Thanks for the ideas.  I implemented most of them (or something similar) into my original proposal.  

The last hurdle is the final query which takes about 90% of the time:

SELECT contact_id_1, contact_id_2, SUM(weight) AS threshold
FROM civicrm_temp_dedupe
GROUP BY contact_id_1, contact_id_2
HAVING threshold >= 7;

This does a full scan and uses a temp table.  That makes sense: MySQL needs to create sums for every combination of contact_id_1 and contact_id_2, so it needs a temp table, and it has no indexes on that temp table, which is probably why it takes so much time.  I think this can be avoided by creating a third MEMORY table and populating it manually.  Something like the following (my stored-procedure knowledge is pretty slim, and I haven't benchmarked, or even executed, this one):

Code: [Select]

DROP TABLE IF EXISTS civicrm_temp_dedupe_results;

CREATE TABLE civicrm_temp_dedupe_results (
  contact_id_1 INT(8) UNSIGNED NULL DEFAULT NULL,
  contact_id_2 INT(8) UNSIGNED NULL DEFAULT NULL,
  score INT(2) UNSIGNED NULL DEFAULT 0,
  PRIMARY KEY (contact_id_1, contact_id_2)
) ENGINE MEMORY;

DELIMITER //

CREATE PROCEDURE calculateDedupeResults()
BEGIN
  DECLARE done BOOLEAN DEFAULT FALSE;
  DECLARE c1, c2 INT(8);
  DECLARE w INT(2);
  -- The cursor must also fetch the weight column.
  DECLARE cur1 CURSOR FOR SELECT contact_id_1, contact_id_2, weight FROM civicrm_temp_dedupe;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

  OPEN cur1;

  REPEAT
    FETCH cur1 INTO c1, c2, w;
    IF NOT done THEN
      -- Accumulate the weight for each pair via the primary key.
      INSERT INTO civicrm_temp_dedupe_results (contact_id_1, contact_id_2, score)
      VALUES (c1, c2, w)
      ON DUPLICATE KEY UPDATE score = score + w;
    END IF;
  UNTIL done END REPEAT;

  CLOSE cur1;
END //

DELIMITER ;

CALL calculateDedupeResults();

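The upsert-style accumulation this procedure aims for can be demonstrated outside MySQL.  Here is a sketch using Python's built-in sqlite3, where SQLite's `ON CONFLICT ... DO UPDATE` plays the role of MySQL's `ON DUPLICATE KEY UPDATE` (the sample pair data is made up):

```python
import sqlite3

# Demonstrate accumulating per-pair weights through an indexed upsert,
# instead of a GROUP BY over an unindexed temp table.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results (
    contact_id_1 INTEGER, contact_id_2 INTEGER, score INTEGER DEFAULT 0,
    PRIMARY KEY (contact_id_1, contact_id_2))""")

# Made-up (id1, id2, weight) rows as produced by the per-field queries.
pairs = [(12, 88, 3), (12, 88, 3), (12, 88, 5), (121, 156, 5)]
con.executemany("""INSERT INTO results VALUES (?, ?, ?)
    ON CONFLICT(contact_id_1, contact_id_2)
    DO UPDATE SET score = score + excluded.score""", pairs)

# Pairs at or above the rule threshold (7 in the running example).
print(con.execute("SELECT contact_id_1, contact_id_2, score FROM results "
                  "WHERE score >= 7").fetchall())  # [(12, 88, 11)]
```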
« Last Edit: March 22, 2010, 08:55:40 pm by dalin »
--
Dave Hansen-Lange
Web Developer
Advomatic LLC
http://advomatic.com
Hong Kong office

Sean Madsen

  • I post occasionally
  • **
  • Posts: 98
  • Karma: 5
  • CiviCRM implementer/developer
    • Bikes Not Bombs
  • CiviCRM version: 4.6
  • CMS version: Drupal 7
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
April 12, 2010, 10:51:21 am
I'm doing all my de-duping from the back end with variations on the following query.

Features of this query:
  • it's pretty fast
  • it will de-dupe contacts of varying type
  • it is very good at matching contacts by name. (lots of complicated logic in there!)
  • it takes into account whether the contacts are part of the same household (because if so, they'll be more likely to have the same contact info)
  • it does not present duplicates that have a direct relationship to each other. (We have a relationship type set up called "Is not a duplicated contact of" that we use to link different people with the same name so they won't show up in de-dupe searches anymore.)
  • it outputs one piece of text in MediaWiki format that I can plop into our internal wiki to keep track of the dupes done/to-do
  • it sorts the results grouped by contact type, with the most likely dupes listed first

Code: [Select]
/* =========================================================================*/
/*__________________________ DE-DUPE SCRIPT ________________________________*/

set @site_domain = 'http://example.org';

/* set points values for similarities (used when sorting results) */

set @same_name_points = 6;
set @same_email_points = 3;
set @same_address_points = 4;
set @same_phone_points = 1;
set @common_related_household_points = -4;

set @score_threshold = 6;

set @total_points = @same_name_points    + @same_email_points +
                    @same_address_points + @same_phone_points ;

/* make a place to store the pool of contacts to de-dupe */

drop temporary table if exists pool;
create temporary table pool (
  contact_id integer not null,
  index(contact_id));

/* CHANGE THIS SECTION IF YOU'D LIKE TO DE-DUPE A SUB-SET OF YOUR CONTACTS */

insert into pool select contact_id
from civicrm_contact;

drop temporary table if exists pool2;
create temporary table pool2 like pool;
insert into pool2 select * from pool;

/* create a place to store dupes */

drop temporary table if exists dupes;
create temporary table dupes (
  contact_id_1 integer not null,
  contact_id_2 integer not null,
  score integer,
  match_method char(20),
  index(contact_id_1,contact_id_2),
  index(score),
  index(match_method) );


/* find contacts with the same name */

drop temporary table if exists c_name_1;
create temporary table c_name_1 (
  contact_id integer not null,
  n_last_strict char(30),
  n_last_fuzz char(30),
  n_first_strict char(30),
  n_first_fuzz char(30),
  n_nick char(30),
  n_mid char(1),
  n_first_nick char(50),
  n_sort_strict char(100),
  n_sort_fuzz char(100),
  index(contact_id),
  index(n_last_strict),
  index(n_last_fuzz),
  index(n_first_strict),
  index(n_first_fuzz),
  index(n_nick),
  index(n_mid),
  index(n_first_nick),
  index(n_sort_strict),
  index(n_sort_fuzz) );

update civicrm_contact set first_name = NULL where first_name = '';
update civicrm_contact set last_name = NULL where last_name = '';
update civicrm_contact set nick_name = NULL where nick_name = '';
update civicrm_contact set sort_name = NULL where sort_name = '';

insert into c_name_1
select
  con.id                                                      as contact_id,
  concat(last_name,'%')                                       as n_last_strict,
  if(length(last_name)<3,null,concat('%',last_name,'%'))      as n_last_fuzz,
  concat(first_name,'%')                                      as n_first_strict,
  if(length(first_name)<3,null,concat('%',first_name,'%'))    as n_first_fuzz,
  concat('%',nick_name,'%')                                   as n_nick,
  middle_name                                                 as n_mid,
  concat_ws(', ',first_name,nick_name)                        as n_first_nick,
  sort_name                                                   as n_sort_strict,
  if(length(sort_name)<9,null,concat('%',sort_name,'%'))      as n_sort_fuzz
from pool
join civicrm_contact con
  on con.id = pool.contact_id;

drop temporary table if exists c_name_2;
create temporary table c_name_2 like c_name_1;
insert into c_name_2 select * from c_name_1;

insert into dupes (
  select c1.contact_id, c2.contact_id, @same_name_points, 'name'
  from c_name_1 c1, c_name_2 c2
  where c1.contact_id < c2.contact_id and
  (
   
    /* for individual vs. individual */
    (   
      /* similar last name */
      (    coalesce(c1.n_last_strict like c2.n_last_strict,0)
        or coalesce(c2.n_last_strict like c1.n_last_strict,0) ) and
      /* and similar first/nick name */
      (    coalesce(c1.n_first_nick like c2.n_first_fuzz,0)
        or coalesce(c1.n_first_nick like c2.n_nick ,0)
        or coalesce(c1.n_first_strict like c2.n_first_strict ,0)
        or coalesce(c2.n_first_nick like c1.n_first_fuzz,0)
        or coalesce(c2.n_first_nick like c1.n_nick ,0)
        or coalesce(c2.n_first_strict like c1.n_first_strict ,0)
      ) and
      /* and they don't have the same middle initial */
      coalesce(c1.n_mid != c2.n_mid,1)
    )
   
    /* for households and organizations */
    or coalesce(c1.n_sort_fuzz like c2.n_sort_fuzz,0)
    or coalesce(c2.n_sort_fuzz like c1.n_sort_fuzz,0)
   
   
    /* for household/organization vs. individual */
    or ( coalesce(c1.n_sort_fuzz like c2.n_first_fuzz,0) and
         coalesce(c1.n_sort_fuzz like c2.n_last_fuzz ,0)
       )
    or ( coalesce(c1.n_sort_fuzz like c2.n_nick, 0) and
         coalesce(c1.n_sort_fuzz like c2.n_last_fuzz,0)
       )
    or ( coalesce(c2.n_sort_fuzz like c1.n_first_fuzz,0) and
         coalesce(c2.n_sort_fuzz like c1.n_last_fuzz,0)
       )
    or ( coalesce(c2.n_sort_fuzz like c1.n_nick, 0) and
         coalesce(c2.n_sort_fuzz like c1.n_last_fuzz,0)
       )
       
  )
);

/* find contacts with the same email */

insert into dupes (
  select distinct e1.contact_id, e2.contact_id, @same_email_points, 'emal'
  from pool
  join pool2
  join civicrm_email e1
    on pool.contact_id = e1.contact_id
  join civicrm_email e2
    on pool2.contact_id = e2.contact_id
  where e1.contact_id < e2.contact_id
    and e1.email = e2.email
);


/* find contacts with the same phone */

insert into dupes (
  select distinct p1.contact_id, p2.contact_id, @same_phone_points, 'phon'
  from pool
  join pool2
  join civicrm_phone p1
    on pool.contact_id = p1.contact_id
  join civicrm_phone p2
    on pool2.contact_id = p2.contact_id
  where p1.contact_id < p2.contact_id
    and p1.phone = p2.phone and length(p1.phone)>6
);


/* find contacts with the same address */

insert into dupes (
  select distinct a1.contact_id, a2.contact_id, @same_address_points, 'addr'
  from pool
  join pool2
  join simple_results a1
    on pool.contact_id = a1.contact_id
  join simple_results_2 a2
    on pool2.contact_id = a2.contact_id
  where a1.contact_id < a2.contact_id
    and a1.address_compare = a2.address_compare
);


/* see which contacts are part of the same household */
insert into dupes (
  select distinct
    if(r1.contact_id_a < r2.contact_id_a, r1.contact_id_a, r2.contact_id_a),
    if(r1.contact_id_a < r2.contact_id_a, r2.contact_id_a, r1.contact_id_a),
    @common_related_household_points,
    '{household}'
  from civicrm_relationship r1
  join civicrm_relationship r2
    on r2.contact_id_b = r1.contact_id_b and r1.contact_id_a != r2.contact_id_a
  where
        r1.relationship_type_id = 7
    and r2.relationship_type_id = 7
);

/* remove dupes that are related to each other */

delete dupes from dupes
left join civicrm_relationship rel
  on  if(contact_id_a < contact_id_b, contact_id_a, contact_id_b) = contact_id_1
  and if(contact_id_a < contact_id_b, contact_id_b, contact_id_a) = contact_id_2
where rel.id is not null;


/* display the results */

set @@group_concat_max_len = 1024000;

select
  count(distinct merge_result) as num,
  group_concat( distinct merge_result separator '\n') as merge_results
from (
  select
    concat(
     
      '*(',
      ( case
        when c1.contact_type='Household' and c2.contact_type='Household'
          then '3HH'
        when c1.contact_type='Household' and c2.contact_type='Individual'
          then '5HI'
        when c1.contact_type='Household' and c2.contact_type='Organization'
          then '2HO'
        when c1.contact_type='Individual' and c2.contact_type='Household'
          then '5HI'
        when c1.contact_type='Individual' and c2.contact_type='Individual'
          then '6II'
        when c1.contact_type='Individual' and c2.contact_type='Organization'
          then '4IO'
        when c1.contact_type='Organization' and c2.contact_type='Household'
          then '2HO'
        when c1.contact_type='Organization' and c2.contact_type='Individual'
          then '4IO'
        when c1.contact_type='Organization' and c2.contact_type='Organization'
          then '1OO'
        end ), '-',
       
     
      lpad(format(sum(dupes.score)/@total_points*100,0),3,'0'),'%: ',
      group_concat(distinct match_method order by dupes.score
      desc separator ', '),') ',
     
      '[',@site_domain,
      '/civicrm/contact/view?action=browse&selectedChild=rel&cid=',c1.id,' ',
      c1.sort_name,']',
     
      ' and ',

      '[',@site_domain,
      '/civicrm/contact/view?action=browse&selectedChild=rel&cid=',c2.id,' ',
      c2.sort_name,']',
     
      ' => \'\'\'[',@site_domain,'/civicrm/contact/merge?reset=1&cid=',
      c2.id, '&oid=', c1.id, '&action=update&rgid=7 merge]\'\'\''
     
    ) as merge_result
  from dupes
  join civicrm_contact c1
    on c1.id = dupes.contact_id_1
  join civicrm_contact c2
    on c2.id = dupes.contact_id_2
  group by c1.id, c2.id
  having sum(dupes.score) >= @score_threshold
  order by merge_result desc
) a;

dalin

  • I post occasionally
  • **
  • Posts: 89
  • Karma: 8
  • CiviCRM version: many
  • CMS version: Drupal 6
  • MySQL version: 5.0
  • PHP version: 5.2
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
April 12, 2010, 05:10:34 pm
Sean, how many contacts do you run this against?  I think the last few queries will run into the same problem I had: full table scans and more temp tables sent to disk. 
--
Dave Hansen-Lange
Web Developer
Advomatic LLC
http://advomatic.com
Hong Kong office

jbertolacci

  • I post occasionally
  • **
  • Posts: 54
  • Karma: 1
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
April 21, 2010, 01:47:18 pm
I have been talking to dlobo about sponsoring work. His estimate for a "base set of changes / sql fixes would be between 40-50 hours to optimize all queries and take less resources" is about $5k of work.

I can sponsor half of that and am looking for a co-sponsor. Would anyone like to co-sponsor improving the merge-duplicate code and SQL in Civi?

xcf33

  • I post frequently
  • ***
  • Posts: 181
  • Karma: 7
  • CiviCRM version: 3.3.2
  • CMS version: Drupal 6.19/6.20
  • MySQL version: 5.x
  • PHP version: 5.2.6
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
June 29, 2010, 08:34:38 am
Hi guys,

I'm really interested in what you guys are trying here.

We currently have CiviCRM deployments from 50k contacts to 250k contacts. We are planning to use it for 500k to 1 million contacts.

I just wanted to throw in my 2 cents, because I think that regardless of query optimization and database tuning, MySQL just has its limits. Practically speaking, I don't know how we can optimize the query enough to avoid a full table scan at some point. There is also the performance-vs-complexity issue of creating multiple temporary tables.

Coming from another CRM system, I liked one of the things that they did for the de-duping process:

Instead of using the hardware stack and DB server to process the entire database for de-duping, I'm thinking about the preventative approach of stopping duplicate contacts from being created in the first place.

Although I have not dealt with every aspect of the system, I believe de-duping should be a two-pronged approach.

***********************************************************
1. Prevent duplicate contacts from being created in the first place.

Contacts are created in the following scenarios:

1. Importing process (dupe check here)
2. API call for creating contacts (dupe check here)
3. Activities (making contribution, register for event, using a profile form, etc etc)

Activities are the culprit here, because when a user completes an activity without logging into the CMS, CiviCRM does not connect that visitor to an existing contact record, and therefore creates a duplicate contact.

This has happened to me many times when I was testing contribution pages.

I believe the solution would be to apply the same de-duping rules to activities, so that the person completing the activity is merged with an existing contact when the rules match.

So for example, if I fill in a contribution form and the First Name, Last Name, and Email match an existing contact in the system, and first name + last name + email is the de-dupe rule currently in use, then the contribution activity should be recorded against the existing contact rather than creating a new one.
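That routing rule might be sketched like this (a toy illustration only: the field handling, the `find_existing_contact` helper, and the in-memory contact store are all hypothetical, not CiviCRM's actual dedupe API):

```python
# Toy sketch: route a form submission to an existing contact when the
# active dedupe rule (first + last + email here) matches; otherwise
# create a new contact.  All names and data are hypothetical.
contacts = {1: ("Jane", "Doe", "jane@example.org")}

def find_existing_contact(first, last, email):
    """Return the id of a contact matching the rule, or None."""
    key = (first.lower(), last.lower(), email.lower())
    for cid, (f, l, e) in contacts.items():
        if (f.lower(), l.lower(), e.lower()) == key:
            return cid
    return None

def record_contribution(first, last, email):
    """Attach the submission to a matched contact, creating one if needed."""
    cid = find_existing_contact(first, last, email)
    if cid is None:
        cid = max(contacts) + 1
        contacts[cid] = (first, last, email)
    return cid

print(record_contribution("Jane", "Doe", "JANE@example.org"))  # 1 (matched)
print(record_contribution("John", "Doe", "john@example.org"))  # 2 (new contact)
```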


I believe this kind of auto-merge can help prevent duplicate contacts from being created in the first place, and therefore minimize the need for a resource-intensive process.

***********************************************************
2. Thinking about duplicates contextually


For us, dupe contacts are a problem mainly because we do a lot of email marketing and we really DO NOT want to send the same email two or more times to the same person, so I believe the ability to contextually filter the contacts you perform dedupe on is important as well.

So for example, I could filter my contacts to only people who have made a contribution before, or contacts who have received mailings from us and are not on hold or marked DO NOT EMAIL.

This way we can take a subset and only perform de-duping on the contacts we care about.



Let me know what you guys think,



Cheers!


Chang

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
June 29, 2010, 11:51:12 pm
Hi,

There is deduping on activities already. The problem is that you can't choose which dedupe rule is applied in each case; you might register for an event using only an email, but the duplication criterion is first + last + email.

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

dalin

  • I post occasionally
  • **
  • Posts: 89
  • Karma: 8
  • CiviCRM version: many
  • CMS version: Drupal 6
  • MySQL version: 5.0
  • PHP version: 5.2
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
June 29, 2010, 11:53:08 pm
And if I understand your question, deduping a subset is possible: create a group of the contacts in your set and just dedupe that group.
--
Dave Hansen-Lange
Web Developer
Advomatic LLC
http://advomatic.com
Hong Kong office
