While dedupe is better than it was in 2.1, it's still mostly unusable on large datasets. Running a rule manually with just one field (and no length) works ok, but for anything with more than one field or using a length will take hours or days (and likely timeout before it completes). Even more dangerous is that this tends to lock up the tables in question preventing INSERTs and UPDATEs to the table while the dedupe is in process. But with some small changes we can reduce the SQL time from a few days to a few minutes.
Currently the main query looks like this:
CREATE TEMPORARY TABLE dedupe
SELECT t1.contact_id id1, t2.contact_id id2, 5 weight
FROM civicrm_email t1
JOIN civicrm_email t2
ON (SUBSTR(t1.email, 1, 6) = SUBSTR(t2.email, 1, 6))
WHERE t1.contact_id < t2.contact_id
AND (t1.contact_id IN (12,88,121,156,164,169,170) OR t2.contact_id IN (12,88,121,156,164,169,170))
UNION ALL SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (first_name)
WHERE t1.id < t2.id
AND (t1.id IN (12,88,121,156,164,169,170) OR t2.id IN (12,88,121,156,164,169,170));
Those IN() clauses will only exist if you are deduping a group rather than the entire set of contacts. Unfortunately even deduping a group of 30k contacts against a total of 600k contacts still takes several hours. An additional UNION will be made for each additional field that you add to the dedupe rule. The crazy JOIN with the SUBSTRING functions is caused by using a length on the given field.
Lets run EXPLAIN on this:
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: t1
type: ALL
possible_keys: FK_civicrm_email_contact_id
key: NULL
key_len: NULL
ref: NULL
rows: 597673
Extra:
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: t2
type: ALL
possible_keys: PRIMARY,FK_civicrm_email_contact_id
key: NULL
key_len: NULL
ref: NULL
rows: 597673
Extra: Using where
*************************** 3. row ***************************
id: 2
select_type: UNION
table: t1
type: index
possible_keys: PRIMARY,index_first_name
key: index_first_name
key_len: 195
ref: NULL
rows: 565324
Extra: Using index
*************************** 4. row ***************************
id: 2
select_type: UNION
table: t2
type: ref
possible_keys: PRIMARY,index_first_name
key: index_first_name
key_len: 195
ref: democrats_civicrm_staging.t1.first_name
rows: 4
Extra: Using where; Using index
*************************** 5. row ***************************
id: NULL
select_type: UNION RESULT
table: <union1,2>
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra:
Ouch, there's a full table scan on both tables in the first query. The total number of rows analyzed in this query is 357,215,276,225 to dedupe 15 contacts. I've never seen it finish running even for just 15 contacts (and I've waited hours). I'm not even going to calculate the rows analyzed to dedupe all 600k contacts.
There's a number of ways that we can improve this: 1)
The JOIN with no index is the big killer, create a new temporary table before inserting into dedupe. MySQL can't use an index on this JOIN because of the string functions. So instead we can take all values, truncate them and put them into another temporary table before inserting into dedupe. The challenge being that our dedupe query joins the table to itself which TEMPORARY tables can't do. Instead we can use a MEMORY table.
2)
Use MEMORY tables instead of TEMPORARY TABLES. Which has the added benefit of being much faster than a temporary MyISAM table (which is what this likely is since we don't specify an engine). This will however require checking the max_heap_table_size session value and increasing if necessary to avoid running out of space and checking the available RAM. Which we should probably be doing anyway - if a dedupe table gets sent to disk it will never finish and take down the server. In the example below deduping a group with 30k records civicrm_temp_dedupe is 184MB. Also since MEMORY TABLES are available to all sessions we'd need to use a random string suffix on tables names to avoid collisions. See
http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html This also means that we should Create the temporary table before we insert into.
3)
Split the IN() OR IN() into 2 queries. MySQL doesn't seem to be able to utilize indexes well in this situation.
4)
Create the temporary table before we insert into it and don't use a UNION. This means that MySQL doesn't need to try to combine all the results before inserting it in the table. Just let MySQL be smart about it. This also means that we can add indexes to the temporary tables to improve our subsequent actions.
Here's my modified example.
SET SESSION max_heap_table_size = 1024 * 1024 * 1024;
DROP TABLE IF EXISTS civicrm_temp_email_truncated;
CREATE TABLE civicrm_temp_email_truncated (
contact_id INT(8) UNSIGNED NULL DEFAULT NULL,
email VARCHAR(6) NULL DEFAULT NULL,
INDEX (contact_id),
INDEX (email)
) ENGINE MEMORY;
INSERT INTO civicrm_temp_email_truncated (contact_id, email)
SELECT contact_id, LEFT(email, 6)
FROM civicrm_email;
DROP TABLE IF EXISTS civicrm_temp_dedupe;
CREATE TABLE civicrm_temp_dedupe (
contact_id_1 INT(8) UNSIGNED NULL DEFAULT NULL,
contact_id_2 INT(8) UNSIGNED NULL DEFAULT NULL,
weight INT(2) UNSIGNED NULL DEFAULT 0
-- ,INDEX (contact_id_1, contact_id_2)
) ENGINE MEMORY;
INSERT INTO civicrm_temp_dedupe (contact_id_1, contact_id_2, weight)
SELECT t1.contact_id AS contact_id_1, t2.contact_id AS contact_id_2, 3 AS weight
FROM civicrm_temp_email_truncated t1
INNER JOIN civicrm_temp_email_truncated t2
ON t1.email = t2.email
WHERE
t1.contact_id < t2.contact_id
AND t1.contact_id IN (12,88,121,156,164,169,170,217,220,267,293,350,361,418,496,519,532,555,565,574,578,595,705,711,725,756,774,781,782,793,837,838,843,846,914,932,949,950,955);
INSERT INTO civicrm_temp_dedupe (contact_id_1, contact_id_2, weight)
SELECT t1.contact_id AS contact_id_1, t2.contact_id AS contact_id_2, 3 AS weight
FROM civicrm_temp_email_truncated t1
INNER JOIN civicrm_temp_email_truncated t2
ON t1.email = t2.email
WHERE
t1.contact_id < t2.contact_id
AND t2.contact_id IN (12,88,121,156,164,169,170,217,220,267,293,350,361,418,496,519,532,555,565,574,578,595,705,711,725,756,774,781,782,793,837,838,843,846,914,932,949,950,955);
INSERT INTO civicrm_temp_dedupe (contact_id_1, contact_id_2, weight)
SELECT t1.id AS contact_id_1, t2.id AS contact_id_2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (first_name)
WHERE t1.id < t2.id
AND t1.id IN (12,88,121,156,164,169,170,217,220,267,293,350,361,418,496,519,532,555,565,574,578,595,705,711,725,756,774,781,782,793,837,838,843,846,914,932,949,950,955);
INSERT INTO civicrm_temp_dedupe (contact_id_1, contact_id_2, weight)
SELECT t1.id AS contact_id_1, t2.id AS contact_id_2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (first_name)
WHERE t1.id < t2.id
AND t2.id IN (12,88,121,156,164,169,170,217,220,267,293,350,361,418,496,519,532,555,565,574,578,595,705,711,725,756,774,781,782,793,837,838,843,846,914,932,949,950,955);
SELECT contact_id_1, contact_id_2, SUM(weight) AS threshold
FROM civicrm_temp_dedupe
GROUP BY contact_id_1, contact_id_2
HAVING threshold >= 7;
The SQL above executes in a mere few seconds. For deduping a group of 30k contacts takes 300 seconds.
The biggest offender is the final query. I would've thought that a multi-key index on contact_id_1,contact_id_2 would be good but EXPLAIN tells me that it isn't utilized. It's a full table scan with no indexes being used. One technique would be to send only those rows that exist more than once to a third temp table. But that might be difficult. There's a dichotomy of temp table size vs. performance that needs to be balanced.
Unfortunately this won't work on the entire 600k contacts because civicrm_temp_dedupe takes too much memory (with max_heap_table_size set to 2GB it still runs out of space (and gets sent to disk)). Deduping that many contacts may not be feasible with any technique short of moving the data to a separate beefy server, deduping, and then moving it back.
I'm proposing to the client that we make these modifications and submit a patch. I'll let you know if they agree, but I thought I should start this dialogue either way.