CiviCRM Community Forums (archive)

*

News:

Have a question about CiviCRM?
Get it answered quickly at the new
CiviCRM Stack Exchange Q+A site

This forum was archived on 25 November 2017. Learn more.
How to get involved.
What to do if you think you've found a bug.



  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion »
  • Scalability (Moderator: Donald Lobo) »
  • Proof of concept to reduce time to run a dedupe from a few days to a few minutes
Pages: 1 [2]

Author Topic: Proof of concept to reduce time to run a dedupe from a few days to a few minutes  (Read 13039 times)

xcf33

  • I post frequently
  • ***
  • Posts: 181
  • Karma: 7
  • CiviCRM version: 3.3.2
  • CMS version: Drupal 6.19/6.20
  • MySQL version: 5.x
  • PHP version: 5.2.6
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
June 30, 2010, 03:02:19 pm
Quote from: xavier on June 29, 2010, 11:51:12 pm
Hi,

They are dedupe already on activities. The problem is that you can't choose what dedupe rule has to be applied to each case, and you might register to an event using only an email, but the duplication criteria is first+last+email.

X+

Like I said, I haven't played with all activities but that's good to hear. I think it would be a good idea to have a setting to select which de-dupe rules to use for activities as well as even for contact_create APIs.

Quote from: dalin on June 29, 2010, 11:51:12 pm
And if I understand your question, deduping a subset is possible.  Create a group of the contacts in your set and just dedupe that group.

I understand you can de-dupe by group but there are some short falls to groups as well. First, it is an extra step, second it creates another x number of rows in the database (and with large number of contacts that's not very affordable). I'm thinking along the lines of almost like a search builder for deduping.

On a separate de-duping email discussions Piotr has also mentioned "soundex" based algorithms that could be employed for auto de-duping.

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
July 06, 2010, 07:02:37 am
i was playing with a DB sent by chang to check whats happening and ran into this issue :). In this case (with relatively few contact ids in the IN clause), the OR clause is messing things up, if you rewrite the query sent by dalin as:

Code: [Select]
SELECT t1.contact_id id1, t2.contact_id id2, 5 weight
FROM civicrm_email t1
JOIN civicrm_email t2
   ON (SUBSTR(t1.email, 1, 6) = SUBSTR(t2.email, 1, 6))
WHERE t1.contact_id < t2.contact_id
  AND (t1.contact_id IN (12,88,121,156,164,169,170))
UNION ALL
SELECT t1.contact_id id1, t2.contact_id id2, 5 weight
FROM civicrm_email t1
JOIN civicrm_email t2
   ON (SUBSTR(t1.email, 1, 6) = SUBSTR(t2.email, 1, 6))
WHERE t1.contact_id < t2.contact_id
  AND (t2.contact_id IN (12,88,121,156,164,169,170))
UNION ALL
SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
  USING (first_name)
WHERE t1.id < t2.id
  AND (t1.id IN (12,88,121,156,164,169,170))
UNION ALL
SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
  USING (first_name)
WHERE t1.id < t2.id
  AND (t2.id IN (12,88,121,156,164,169,170))

it returns quite quickly (less than 1 sec) on my 4 year old macbook pro with 188K contacts and 127K email addresses

deduping the whole set is still quite expensive

lobo
« Last Edit: July 06, 2010, 07:25:35 am by Donald Lobo »
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

dalin

  • I post occasionally
  • **
  • Posts: 89
  • Karma: 8
  • CiviCRM version: many
  • CMS version: Drupal 6
  • MySQL version: 5.0
  • PHP version: 5.2
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
July 06, 2010, 07:00:03 pm
Nice, the extra UNION avoids the problems of the OR.

I think it still will scale O(n2) though due to the JOIN on a function.  We may be able to combine the techniques to hopefully find something that is at least O(n lg(n)).
--
Dave Hansen-Lange
Web Developer
Advomatic LLC
http://advomatic.com
Hong Kong office

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: Proof of concept to reduce time to run a dedupe from a few days to a few minutes
July 06, 2010, 10:52:43 pm

Note that this by itself will not work as is, since we will need to remove duplicate triplets from the result set. An easy way might be to populate the table while we are generating the query and use REPLACE INTO or something similar, but this will need more code changes for it to be accurate

if folks are interested in contributing code AND/OR funding improvements to dedupe for the 3.2.x series please send us email. Would be good if we can spend 50-100 hours or looking and optimizing various queries and ensure dedupe does scale to at least a million records :). Currently dedupe is also a pretty big time sink for imports on a large DB

lobo
 
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

Pages: 1 [2]
  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion »
  • Scalability (Moderator: Donald Lobo) »
  • Proof of concept to reduce time to run a dedupe from a few days to a few minutes

This forum was archived on 2017-11-26.