CiviCRM Community Forums (archive)

This forum was archived on 25 November 2017.



  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion »
  • APIs and Hooks (Moderator: Donald Lobo) »
  • API Performance and Big Synchronisation Processes

Author Topic: API Performance and Big Synchronisation Processes  (Read 3183 times)

capo

API Performance and Big Synchronisation Processes
February 01, 2013, 04:16:36 am
I've been testing the API performance of CiviCRM.

The motivation
I need to run large synchronisation processes daily (new contacts, updates and unsubscribes coming from other databases). The database holds over two million records, but only a few thousand of them change on any given day, so that is the volume I need to "move" daily.

The results may be interesting for other people, so I want to share them with you.

The REST "experiment"
Through Talend (http://www.talend.com/), I used the CiviCRM API to create new contacts in a fresh CiviCRM installation. Both Talend and CiviCRM were installed on the same machine (so I "attacked" the REST interface using something like http://localhost/.../civicrm/extern/rest.php).

The process I executed was:

  • I chose a text file with 2 million records as the source (its column length was about 300 characters),
  • I only filled 6 fields (date added, first and last name, mobile and work phone),
  • I only created contacts (didn't care about other related tables),
  • the machine processor was: Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz,
  • the machine had 4 GB of RAM and
  • the machine was running Ubuntu 12.04.

The performance was very close to 2 records per second. (I ran the same "experiment" with SugarCRM, using Talend's SugarCRM plugin, and obtained almost the same performance: about 2.2 records per second.)

The PHP "experiment"
After that, I tried a second option: writing PHP code to be executed on the same machine. The conditions were more or less the same as in the previous experiment (except that I only filled email, first name and last name). The code I wrote for the test was:

Code: [Select]
    require_once "/var/www/drupal/sites/all/modules/civicrm/api/class.api.php";
    $api = new civicrm_api3 ( array ( "conf_path" => "/var/www/drupal/sites/default" ) );

    $sourceFile = "source.txt";
    $stopInRow = 500; /* 0 if you want to process the whole file */

    /* microtime(true) gives sub-second resolution; time() only counts whole seconds */
    $startTime = microtime(true);

    $row = 1;
    $inserted = 0; /* data rows sent to the API (the header row is not counted) */
    if (($handle = fopen($sourceFile, "r")) !== FALSE) {
        while (($row <= $stopInRow || $stopInRow == 0) &&
            ($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {

            if ($row == 1) {
                /* The first row holds the column names: look up each position once */
                $emailCol = array_search ( "EMAIL", $data );
                $firstNameCol = array_search ( "FIRSTNAME", $data );
                $lastNameCol = array_search ( "LASTNAME", $data );
            } else {
                $api->Contact->Create ( array (
                    "contact_type"=>"Individual",
                    "last_name"=>$data[$lastNameCol],
                    "first_name"=>$data[$firstNameCol],
                    "email"=>$data[$emailCol]
                ) );
                $inserted++;
            }

            $row++;
        }

        fclose($handle);
    }

    $timeLapse = microtime(true) - $startTime;
    echo round($timeLapse, 2) . " seconds" . "\n";
    echo $inserted . " records inserted" . "\n";
    if ($timeLapse > 0) { /* avoid a division by zero on very short runs */
        echo round($inserted / $timeLapse, 2) . " records / second" . "\n";
    }

This time, I obtained a performance of 6 records per second. My hypothesis is that this process is faster because it doesn't pass through Apache / Drupal. Anyway, that is still too slow for my purposes.

I tried to run the same experiment with SugarCRM, but I didn't find a way to use its API without going through REST (and so through Apache). Point for CiviCRM! The performance of the SugarCRM API, in this context, was still close to 2 records per second (as expected, because the experiment was almost the same one I ran through Talend, only with a different interface). (If anyone wants the code I used, just ask.)

The risky solution
Neither API result is fast enough for my purposes, so I'm thinking about doing something risky and I'd like to hear your comments:

- What about writing some stored procedures to play the role of a "SQL API"? Something like "CREATE STORED PROCEDURE spContactCreate...".

I understand that, if I write it, I'll have to "replicate" the API behaviour in SQL and review it on every version change.

- If you think it's not a good idea, do you have any suggestions about what to do instead?

Thanks for reading!
Best!

totten

Re: API Performance and Big Synchronisation Processes
February 01, 2013, 05:32:27 am
A few thoughts:

 * Agree that a SQL-based interface would be more performant. It's fairly straightforward to work out which tables/columns/keys to use for the core data model, but it becomes more challenging when it needs to support Civi's dynamic, configurable model while also supporting the varied data-access goals of different developers/projects. It would probably be easier to write your own custom SQL and then write test-cases to validate that the SQL works correctly. (The test-cases can be small/fast, and they'll help you quickly find issues when upgrading to newer Civi releases with a modified schema.) Of course, working on a SQL procedure API sounds like fun -- to fully support the Civi data model, one would probably wind up doing a lot of stored-procedure metaprogramming.

 * You might try running multi-threaded experiments -- that should improve performance by some multiple (5x, 10x, whatever -- depending on HW configuration and Civi's code). For example, you said that it needs to sync several thousand records per day. Some back-of-the-napkin math:

  + Assume 5000 rec
  + Assume PHP API with 6 rec/sec
  + Assume 10 concurrent threads
  + 5000 / 6 / 10 = 83 sec

I don't know your requirements, but 83sec may be acceptable for a daily job. Of course, that quick estimate assumes that performance scales linearly with #threads -- which is probably true at the beginning but won't hold for really high concurrency. Only experimentation will provide real numbers.
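
The back-of-the-napkin estimate above can be reproduced with a small shell helper. This is only a sketch: the function name estimate_seconds is made up for illustration, and it bakes in the same assumption that throughput scales linearly with the number of workers.

```shell
#!/bin/bash
# Hypothetical helper (not part of CiviCRM or Talend): estimate the wall-clock
# time of a sync job, assuming throughput scales linearly with worker count.
#   $1 = records to sync, $2 = records/second per worker, $3 = workers
estimate_seconds() {
  awk -v records="$1" -v rate="$2" -v workers="$3" \
    'BEGIN { printf "%.0f\n", records / (rate * workers) }'
}

estimate_seconds 5000 6 10   # PHP API, 10 workers: prints 83
estimate_seconds 5000 2 1    # REST via Talend, 1 worker: prints 2500
```

Plugging in measured rates from a real parallel run, rather than the single-worker figure, would show how far the linear assumption holds.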

As far as implementing a multi-threaded design, I'm a Talend novice, but this might be the thing:

https://help.talend.com/display/TalendComponentsReferenceGuide521EN/18.8+tParallelize

For PHP, you could get a quick estimate by splitting the text file into 10 parts and then running the program ten times in parallel. IIRC:

Code: [Select]
#!/bin/bash
echo "Worker threads started at:"
date

php my-test.php sample-data-0.csv &
php my-test.php sample-data-1.csv &
php my-test.php sample-data-2.csv &
php my-test.php sample-data-3.csv &
php my-test.php sample-data-4.csv &
php my-test.php sample-data-5.csv &
php my-test.php sample-data-6.csv &
php my-test.php sample-data-7.csv &
php my-test.php sample-data-8.csv &
php my-test.php sample-data-9.csv &
wait

echo "All worker threads finished at:"
date
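
One way to produce the ten input files is sketched below. It assumes the source file is tab-separated with a header row (as in capo's script), so the header is copied into every part. The function name split_with_header is made up for illustration, and the output names simply mirror the sample-data-N.csv names used above. It also assumes the PHP script has been changed to take the file name from its first command-line argument instead of the hard-coded source.txt.

```shell
#!/bin/bash
# Hypothetical helper: split a file with a header row into N parts,
# distributing data rows round-robin and repeating the header in each part.
#   $1 = source file, $2 = number of parts, $3 = output prefix
split_with_header() {
  awk -v n="$2" -v prefix="$3" '
    NR == 1 {
      # Header row: write it to every part file
      for (i = 0; i < n; i++) print > (prefix i ".csv")
      next
    }
    # Data rows: row 2 goes to part 0, row 3 to part 1, and so on
    { print > (prefix ((NR - 2) % n) ".csv") }
  ' "$1"
}

# Example (creates sample-data-0.csv ... sample-data-9.csv):
#   split_with_header source.txt 10 sample-data-
```

Round-robin distribution keeps the parts the same size to within one row, so the workers should finish at roughly the same time.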

For a real implementation, one might do the bash trick above or look into something fancier (like PHP's pcntl, beanstalkd, zeromq, RabbitMQ, ActiveMQ, etc).
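
A slightly more general version of the bash trick, as a sketch: it starts one background process per input file and waits for all of them. The workers are separate OS processes rather than threads. The function name run_workers is hypothetical, and the command passed in (php my-test.php in the example) is assumed to accept the input file as its last argument.

```shell
#!/bin/bash
# Hypothetical helper: run "$cmd <file>" in the background for every file
# given, then block until all of the background jobs have exited.
#   $1 = command to run, remaining arguments = input files
run_workers() {
  local cmd="$1"; shift
  local file
  for file in "$@"; do
    # $cmd is deliberately unquoted so "php my-test.php" splits into words
    $cmd "$file" &
  done
  wait
}

# Example (assumes the part files exist and my-test.php reads $argv[1]):
#   time run_workers "php my-test.php" sample-data-*.csv
```

This starts every worker at once; for many input files, a queue-based tool like the ones listed above would give better control over concurrency.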

xavier

Re: API Performance and Big Synchronisation Processes
February 01, 2013, 07:29:55 am
Hi,

With the REST interface from node, I could get around 1000 individuals created per minute on my laptop (i7), including their organisation (200 created, 800 fetched and relationship created), email, phone and address.

 https://github.com/tttp/civi-charlatan

It's as parallel as the webserver can deal with. I didn't test the error messages much, and some contact creations might fail for concurrency reasons.

Still, an order of magnitude faster than your case with Talend over REST.

Could you try running civi-charlatan to see if the difference is somewhere in the config or if it's just faster because of more concurrent operations?

xavier

Re: API Performance and Big Synchronisation Processes
February 01, 2013, 07:38:42 am


And I did have a few "down to the SQL metal" interfaces. It's obviously faster, but it breaks on upgrade, or it means you have to deal with SQL injection in yet another place, or it slightly alters the behaviour of the imported data because the model has changed, or...

Bottom line: unless you are prepared to spend loads of hours trying to debug why the whatever isn't behaving properly on some edge cases reported by some angry user a few weeks after a minor upgrade, you should try sticking with the API. Because it's not if it will happen, but when. I think you can consider yourself warned ;)

X+

P.S. and the usual disclaimer: use the api at your own risk and don't blame us too much if it fails, so far it has happened significantly less often than with my hand made sql requests.

capo

Re: API Performance and Big Synchronisation Processes
February 01, 2013, 08:11:10 am
totten, xavier, thanks for your advice. I'll try a new "experiment", this time parallelizing. I totally agree: if I manage to reduce the execution time to a couple of minutes, that would be completely acceptable and I won't have any excuse to skip the API (using it has been the intention from the very first minute).

Thanks!
Keep you informed!

JonGold

  • CiviCRM version: 4.1 to the latest
  • CMS version: Drupal 6-7, Wordpress 4.0+
  • PHP version: PHP 5.3-5.5
Re: API Performance and Big Synchronisation Processes
February 25, 2015, 11:48:50 am
Hi capo,

Sorry to post on an old thread - but I'm having the exact same issue while using the Kettle CiviCRM plugin you wrote.  Did you ever find a way to improve REST performance that works with the Kettle plugin?
