CiviCRM Community Forums (archive)

*

News:

Have a question about CiviCRM?
Get it answered quickly at the new
CiviCRM Stack Exchange Q+A site

This forum was archived on 25 November 2017. Learn more.
How to get involved.
What to do if you think you've found a bug.



  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Support »
  • Using CiviCRM »
  • Using Import (Moderator: Yashodha Chaku) »
  • Small patch to drastically improve dedupe performance
Pages: [1]

Author Topic: Small patch to drastically improve dedupe performance  (Read 1760 times)

nkinkade

  • I post occasionally
  • **
  • Posts: 56
  • Karma: 5
Small patch to drastically improve dedupe performance
May 12, 2010, 10:43:23 am
CiviCRM 3.1.3

We have been trying to import a CSV file with 465 contacts, which also includes a single relationship (Employee of -> Organization Name).  When launching the import MySQL would peg at 100% CPU usage and an hour or two later there was no sign of it completing any time soon.  I turned on slow query logging in MySQL and found this query to be the culprit:

CREATE TEMPORARY TABLE dedupe SELECT t1.id id1, t2.id id2, 10 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 ON (SUBSTR(t1.organization_name, 1, 10) = SUBSTR(t2.organization_name, 1, 10)) WHERE t1.id < t2.id;

With manual testing, consistently it was taking 35 to 45 seconds to complete that query.  Removing the SUBSTR() functions caused it to complete in a fraction of a second.  It would appear that SUBSTR is a very costly function.  Replacing the SUBSTR() function with a where clause like WHERE LENGTH(rule_field) <= rule_length allowed the import to complete in less than 2 minutes, as opposed to hours or possibly never completing.   Is there any reason that such a where clause would not work the same as the SUBSTR() functions?  If not, then I propose the attached patch to drastically improve dedupe performance.

Nathan

nkinkade

  • I post occasionally
  • **
  • Posts: 56
  • Karma: 5
Re: Small patch to drastically improve dedupe performance
May 12, 2010, 11:43:45 am
After chatting with Chastell on #civicrm I realized that my logic for the previous post/patch was ridiculously (and embarrassingly) wrong.  Clearly (now) organization names <= $number is not the same as substr(org_name, 1, 10).  That said, the following patch also improved performance significantly by simply not operating SUBSTR() on null values, which in our case makes a huge difference, but your mileage may vary depending on your data.  This isn't a final solution, but perhaps a small stop-gap until the issues with SUBSTR can be addressed.  The following allowed an import of 475 records into a contact database of ~ 17,000 records complete in about 20 minutes:

Code: [Select]
diff --git a/CRM/Dedupe/BAO/Rule.php b/CRM/Dedupe/BAO/Rule.php
index 0f5894e..02b34b2 100644
--- a/CRM/Dedupe/BAO/Rule.php
+++ b/CRM/Dedupe/BAO/Rule.php
@@ -127,6 +127,7 @@ class CRM_Dedupe_BAO_Rule extends CRM_Dedupe_DAO_Rule
             }
             if ($this->rule_length) {
                 $where[] = "SUBSTR({$this->rule_field}, 1, {$this->rule_length}) = SUBSTR('$str', 1, {$this->rule_length})";
+                $where[] = "{$this->rule_field} IS NOT NULL";
             } else {
                 $where[] = "{$this->rule_field} = '$str'";
             }
@@ -141,6 +142,7 @@ class CRM_Dedupe_BAO_Rule extends CRM_Dedupe_DAO_Rule
         // finish building WHERE, also limit the results if requested
         if (!$this->params) {
             $where[] = "t1.$id < t2.$id";
+            $where[] = "t1.{$this->rule_field} IS NOT NULL";
         }
         if ($this->contactIds) {
             $cids = array();

Piotr Szotkowski

  • I live on this forum
  • *****
  • Posts: 1497
  • Karma: 57
Re: Small patch to drastically improve dedupe performance
May 12, 2010, 12:00:23 pm
Thanks! Filed the NOT NULL sub-thread as CRM-6235 and fixed for 3.1.5.
If you found the above helpful, please consider helping us in return – you can even steer CiviCRM’s future and help us extend CiviCRM in ways useful to you.

nkinkade

  • I post occasionally
  • **
  • Posts: 56
  • Karma: 5
Re: Small patch to drastically improve dedupe performance
May 20, 2010, 03:59:38 pm
Piotr, this is only marginally related, but I mention it here because I found another of the huge causes of our dedupe slowdowns during import.  I was able to undo the "IS NOT NULL" patch and apply the following patch and the import of ~475 records completed in almost exactly 1 minute.  With only the "IS NOT NULL" patch the time to complete the import went from basically hours or never to about 20 minutes.  With both patches applied it was around 1 minute.  What that tells me is that the following patch, in this case, may obviate the need for the "IS NOT NULL" patch.  Apparently the dedupe code, at a lower level, doesn't like to be passed a null set of dedupe params and somehow hangs for around 45 seconds when you do (the SUBSTR() queries cause MySQL to do something intensive for that time).  I didn't try to get to the bottom of that, but simply avoid sending an empty set of dedupe params.  I suspect someone should investigate why sending empty params causes such a problem, for it may affect other parts of CiviCRM:

Code: [Select]
diff --git a/CRM/Dedupe/BAO/Rule.php b/CRM/Dedupe/BAO/Rule.php
index 0f5894e..02b34b2 100644
--- a/CRM/Dedupe/BAO/Rule.php
+++ b/CRM/Dedupe/BAO/Rule.php
@@ -127,6 +127,7 @@ class CRM_Dedupe_BAO_Rule extends CRM_Dedupe_DAO_Rule
             }
             if ($this->rule_length) {
                 $where[] = "SUBSTR({$this->rule_field}, 1, {$this->rule_length}) = SUBSTR('$str', 1, {$this->rule_length})";
+                $where[] = "{$this->rule_field} IS NOT NULL";
             } else {
                 $where[] = "{$this->rule_field} = '$str'";
             }
@@ -141,6 +142,7 @@ class CRM_Dedupe_BAO_Rule extends CRM_Dedupe_DAO_Rule
         // finish building WHERE, also limit the results if requested
         if (!$this->params) {
             $where[] = "t1.$id < t2.$id";
+            $where[] = "t1.{$this->rule_field} IS NOT NULL";
         }
         if ($this->contactIds) {
             $cids = array();

Piotr Szotkowski

  • I live on this forum
  • *****
  • Posts: 1497
  • Karma: 57
Re: Small patch to drastically improve dedupe performance
May 21, 2010, 01:18:25 pm
Quote from: nkinkade on May 20, 2010, 03:59:38 pm
Piotr, this is only marginally related, but I mention it here because I found another of the huge causes of our dedupe slowdowns during import.  I was able to undo the "IS NOT NULL" patch and apply the following patch and the import of ~475 records completed in almost exactly 1 minute.

Hm, I’m not sure I understand. You say you ‘were able to undo the "IS NOT NULL" patch and apply the following patch’, but the patch you then quote is exactly the same that I used to fix the issue: http://fisheye2.atlassian.com/changelog/CiviCRM/?cs=27612 → http://fisheye2.atlassian.com/rdiff/CiviCRM?csid=27612&u&N
If you found the above helpful, please consider helping us in return – you can even steer CiviCRM’s future and help us extend CiviCRM in ways useful to you.

nkinkade

  • I post occasionally
  • **
  • Posts: 56
  • Karma: 5
Re: Small patch to drastically improve dedupe performance
May 21, 2010, 01:36:58 pm
Sorry.  I'm was just careless.  This is the patch:

http://code.creativecommons.org/viewgit/civicrm.git/commit/?h=cc_staging_3.1.3&id=50cf2ce397085dc5ddeec901ab4d7a1ee50360a3

Code: [Select]
diff --git a/CRM/Dedupe/Finder.php b/CRM/Dedupe/Finder.php
index 350864e..c1aac3e 100644
--- a/CRM/Dedupe/Finder.php
+++ b/CRM/Dedupe/Finder.php
@@ -85,6 +85,10 @@ class CRM_Dedupe_Finder
      * @return array  matching contact ids
      */
     function dupesByParams($params, $ctype, $level = 'Strict', $except = array()) {
+        // If $params is empty there is zero reason to proceed.
+        if ( ! $params ) {
+            return array();
+        }
         $rgBao =& new CRM_Dedupe_BAO_RuleGroup();
         $rgBao->contact_type = $ctype;
         $rgBao->params = $params;

Piotr Szotkowski

  • I live on this forum
  • *****
  • Posts: 1497
  • Karma: 57
Re: Small patch to drastically improve dedupe performance
May 24, 2010, 04:57:59 am
Thanks, this makes perfect sense now. Filed as CRM-6295 and fixed for 3.1.6.
If you found the above helpful, please consider helping us in return – you can even steer CiviCRM’s future and help us extend CiviCRM in ways useful to you.

nkinkade

  • I post occasionally
  • **
  • Posts: 56
  • Karma: 5
Re: Small patch to drastically improve dedupe performance
May 24, 2010, 07:59:40 am
And an embarrassing bit of irony can be found in this: "I'm was just careless." :-(  I guess I need to put a higher priority on proofreading. :-)

Pages: [1]
  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Support »
  • Using CiviCRM »
  • Using Import (Moderator: Yashodha Chaku) »
  • Small patch to drastically improve dedupe performance

This forum was archived on 2017-11-26.