CiviCRM Community Forums (archive)


This forum was archived on 25 November 2017.




Author Topic: Contact Importing scalability, and scalability of the system in general  (Read 15168 times)

xcf33

  • I post frequently
  • ***
  • Posts: 181
  • Karma: 7
  • CiviCRM version: 3.3.2
  • CMS version: Drupal 6.19/6.20
  • MySQL version: 5.x
  • PHP version: 5.2.6
Contact Importing scalability, and scalability of the system in general
May 06, 2010, 11:25:36 am
Hi guys,

I have just started brainstorming ideas for a scalable contact importing tool, or possibly for re-organizing some of the current process flows of the CiviCRM contact import. Due to my lack of understanding of the system, I would like to ask for any opinions, suggestions or comments you have, and possibly get some help from the community as well.

General problem: CiviCRM's scalability. So far, the scalability issues I have discovered are: contact import on large datasets (say 50,000 to 500,000 contacts), de-duping of contacts, and the mailing process and queuing.

Specific problem in my scenario:
 
  • Infrastructure Limitation:
    Our sites are hosted on Rackspace Cloud Hosting, which imposes a notorious 30-second timeout/redirect at the load balancer. Even if the PHP timeout is set to 0, the end user gets a "website unavailable" message whenever a script runs longer than 30 seconds. One interesting detail, though: even when a script runs longer than 30 seconds on a specific server in the cloud and the user receives the timeout message, the server will still finish running the script in most cases.

    This problem is especially severe in the contact import because it is a multi-step process that performs logic at each step. Our sites usually hang at the Preview stage, which means that the temporary import job table has been created and populated but the import itself has not finished.
  • Import File Size
    We generally import files of 50,000 to 500,000 contacts each, with anywhere from 10 to 25 fields. Even in a hypothetical scenario where our cloud host's restriction does not apply, CiviCRM's 8MB limit would generally keep an import file under 100,000 contacts, not to mention the wait the end user experiences during the import (the slow progress bar).

    Another problem with this method is session integrity. Someone could mistakenly navigate to another page in the browser tab, or open another browser tab on the site and go to another area to work on other things.

To address these 2 problems:
  • 1. The import process hanging at the Preview stage
  • 2. Long waits on large imports and loss of session integrity (and possibly raising the import file size limit)

Here's my plan:

Instead of a multi-step process, importing contacts will become a 2-step pseudo job-queue system.

Step 1: Data source/Upload File

This step will remain largely the same as the current contact importing process

Step 2: Field Matching, Group/Tag and all other options.

In this step, the form will collect everything the import job needs: the field matching, the groups to apply to this import and everything else. Some of the options shown on the preview page will now be on this form.

This step will be the final page the end user sees. After the user has set all the import configurations, the system will tell the user that the import job will start now, that they are free to leave, and that they will be notified when the job is complete. It can also give them a link and a checksum hash of some sort to check the import status.

In the backend, all the import configurations will be saved in a table, basically instructing CiviCRM on how to perform this import job.

A cron script will scan this table on a regular basis to see which new jobs have come up and need to be processed. It will perform the import based on the configuration options collected from the form described above.

Errors from the import job will be written to a log file on the file system.

When the import job is done, the cron script will trigger an email that contains the information about this import job, very much like the summary page step. It will also provide the user with a link to the error report log file.
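
A minimal sketch of the pseudo job-queue idea above, assuming an in-memory array in place of the import job table. The field names (`status`, `file`, `notify`) and the functions `run_import_job()` and `process_pending_jobs()` are hypothetical and only illustrate one cron pass; they are not CiviCRM code.

```php
<?php
// Hypothetical stand-in for the import job table described above.
// Field names are illustrative, not a CiviCRM schema.
$jobs = array(
    array('id' => 1, 'status' => 'pending', 'file' => 'contacts_a.csv', 'notify' => 'user@example.org'),
    array('id' => 2, 'status' => 'done',    'file' => 'contacts_b.csv', 'notify' => 'user@example.org'),
);

// Placeholder for the real import; returns a list of per-row error strings.
function run_import_job(array $job)
{
    return array(); // this sketch assumes the import produced no errors
}

// One cron pass: pick up pending jobs, run them, record the outcome.
function process_pending_jobs(array $jobs)
{
    $summaries = array();
    foreach ($jobs as $i => $job) {
        if ($job['status'] !== 'pending') {
            continue; // already handled by an earlier pass
        }
        $jobs[$i]['status'] = 'running';
        $errors = run_import_job($job);
        $jobs[$i]['status'] = 'done';
        // In the plan above this summary would be e-mailed to $job['notify'].
        $summaries[] = sprintf('Job %d (%s): %d error(s)', $job['id'], $job['file'], count($errors));
    }
    return array($jobs, $summaries);
}

list($jobs, $summaries) = process_pending_jobs($jobs);
foreach ($summaries as $line) {
    echo $line, "\n";
}
```

Because the browser request only writes the job row, the 30-second load balancer limit would apply to the form submission and not to the import itself.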


This could probably be done in 2 ways.

1. Create a Drupal module and use its Form API to "re-construct" the current Import forms and save the import job metadata. A cron script will call various methods from CRM_Import_ImportJob and other classes to run the actual import.

2. Re-organize the workflow of the current Import process by changing the form files in CRM\Import\Form and changing the processing logic.


Due to my lack of understanding of the core CiviCRM code at this moment, I have not made up my mind yet. I do have more Drupal API knowledge, so I am leaning towards the first strategy.


I'm sorry for this long post. I'm not trying to rant; I'm simply thinking out loud about something that may be beyond my ability. I would like to get feedback, comments and suggestions from anyone who's interested.


If you are interested in a scalable import tool, feel free to contact me or post to this thread. Meanwhile, I will try to start a group on CiviCRM.org or somewhere else where discussions regarding the system's scalability can be further explored.



Thanks,
Cheers!


Chang



Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: Contact Importing scalability, and scalability of the system in general
May 08, 2010, 07:54:58 am

a few comments and thoughts:

1. dedupe and import are the biggest issues. Mail can be batched in a configurable amount, and hence sending should be fine. The queuing stuff is a series of queries which i think can be optimized via SQL (if shown to be inefficient)

2. there is a separate thread on dedupe, which i'll move to the scalability forum board later today

3. with regard to import a few thoughts:

* Uploading a large file is also time intensive, especially since upload speeds are probably slower than download speeds. Should the "quick" import form just assume the file is on the server and take the directory location? Maybe a separate form for folks to upload / manage uploaded files (if you don't want users to copy a file to the server, but i'd script it and check whether the files you want to deal with can be uploaded in 30 seconds)
* Compressing import into one form will potentially make scripting it even easier
* If we do the above, we basically can tell the one form, only process lines ABC - DEF
* We can send in the "mapping" field table in that form (i.e. avoid the mapping form)
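
A rough sketch of the "only process lines ABC - DEF" idea, assuming the CSV file already sits on the server. The helpers `plan_batches()` and `read_csv_slice()` are hypothetical names used for illustration, and the sample file is created only for the demonstration.

```php
<?php
// Split $totalRows rows into offset/limit batches small enough for one request.
function plan_batches($totalRows, $batchSize)
{
    $batches = array();
    for ($offset = 0; $offset < $totalRows; $offset += $batchSize) {
        $batches[] = array(
            'offset' => $offset,
            'limit'  => min($batchSize, $totalRows - $offset),
        );
    }
    return $batches;
}

// Read only rows $offset .. $offset + $limit - 1 of a CSV file on the server.
function read_csv_slice($path, $offset, $limit)
{
    $rows   = array();
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $rows;
    }
    $index = 0;
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if ($index >= $offset + $limit) {
            break; // past the requested slice
        }
        if ($index >= $offset) {
            $rows[] = $row;
        }
        $index++;
    }
    fclose($handle);
    return $rows;
}

// Demonstration with a small temporary file standing in for the uploaded import file.
$path = tempnam(sys_get_temp_dir(), 'imp');
file_put_contents($path, "a,1\nb,2\nc,3\nd,4\ne,5\n");

$batches = plan_batches(5, 2);
$slice   = read_csv_slice($path, $batches[1]['offset'], $batches[1]['limit']);
unlink($path);

print_r($batches);
print_r($slice);
```

Each batch can then be submitted as its own short request, so no single call has to outlast the host's timeout.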

4. I'd probably strongly recommend we write a url called "BatchImport" and an api function which basically does all the work (and returns the result), along with unit tests to check this out. Would definitely make a great needed addition to CiviCRM

thanx for starting this discussion.

lobo
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

xcf33

Re: Contact Importing scalability, and scalability of the system in general
May 10, 2010, 06:11:16 am
Thanks for the reply Lobo,

I really like your feedback. I think

1. Managing file uploads separately would be a wonderful idea; we can even take advantage of Drupal's or Joomla's File API to handle the upload.
2. Breaking everything down into one form would be ideal. For the mapping preview we can just take the first few lines of the file they have uploaded and selected (it could even be an ajax process where they select one of the files already uploaded through the separate upload utility and see what the first few lines look like).
3. A URL and a public API for Import would be terrific. To be honest, I studied the files in the Import folder a little and had trouble invoking the classes and private APIs in CiviCRM. A public API would allow folks with API knowledge of their respective CMS to "plug" into CiviCRM.

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: Contact Importing scalability, and scalability of the system in general
May 10, 2010, 07:33:40 am
Yeap, an API would be great, and open doors like being able to import from the cli as well (no more timeout/memory problems)


You might be interested in using the SQL import or plugging in a new import system:

http://en.flossmanuals.net/CiviCRM/DevelopImport
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Donald Lobo

Re: Contact Importing scalability, and scalability of the system in general
May 10, 2010, 10:04:34 am

hey chang:

1. just want to confirm that all the code u'll develop and release will be released under a GPL v2 or later license, right? Can you please check on this and explicitly confirm that this is indeed the case

2. Since batch import is an issue across platforms, we'd love it if you can develop the whole thing in CiviCRM rather than having different versions across the two platforms. We can definitely help you out and create the structure etc for you to get started

3. We can do the API, but we are pretty slammed till the end of this month (when 3.2.alpha gets out). We should chat and we can point u in the right direction so you can move forward on it

lobo





xcf33

Re: Contact Importing scalability, and scalability of the system in general
May 10, 2010, 12:03:46 pm
Hi Lobo,

1. sounds good, GPL v2 would be perfectly fine. (I will create public repo on github)
2. The only reason I wanted to use some Drupal components is that I'm relatively new to the CiviCRM API (forms, file handling, pages, etc.), but I wouldn't mind using it if it helps make this more cross-platform. It would be great if you or anyone else could point me to how to invoke the private APIs in the importjob class and other needed libraries (I tried to play with it and got a few errors), as well as which class methods to call in the different steps and what data structures/parameters they expect.
3. I totally understand that you guys are slammed and have your development timeline, we can chat more.


Looking forward to this :)


xcf33

Re: Contact Importing scalability, and scalability of the system in general
May 19, 2010, 07:16:12 am
From LOBO...

Quote
hey chang:

i just spoke with wes (who wrote the sql importer part of the code) and we tried to track back who wrote the code and did not succeed (the branch it was developed on was deleted after the merge). If i had to guess, a lot of it was trying to write a good abstraction that we would move into at some point

So i don't think we store the various config options (group/tag/mapping etc) in the db currently

I've been thinking about the batch import process, i think we need 3 calls:

civicrm_import_create_table( $fileName ); // creates and return the table name

// import rows from $offset to $offset + $limit from $tableName
// the import parameters are in the params array
civicrm_import_import_rows( $tableName, $offset, $limit, $params );

civicrm_import_delete_table( $tableName ); // drops the table

things are getting a bit better now, so i can work with you on the above interface and get it going. Would be good to design a set of unit tests that we can use to test functionality before we write code. will help us improve the import code this way

ping me on IRC if the above makes sense and we can get the ball rolling

lobo


Hey Lobo,


That's great,

I think the 3 API calls is a great start

Code:
civicrm_import_create_table( $fileName );
civicrm_import_import_rows( $tableName, $offset, $limit, $params );
civicrm_import_delete_table( $tableName );
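
As a sketch of how a caller might drive these three calls in batches: the `sketch_import_*` functions below are hypothetical stand-ins backed by an in-memory array, since the real calls were only being proposed at this point. The five sample rows, the batch size of 2, and the row-count return value are assumptions made for the example.

```php
<?php
// Hypothetical stand-ins for the three proposed calls, backed by an
// in-memory array instead of a MySQL temp table.
$GLOBALS['sketch_tables'] = array();

function sketch_import_create_table($fileName)
{
    $tableName = 'import_job_' . md5($fileName);
    // Pretend the file was parsed into five rows.
    $GLOBALS['sketch_tables'][$tableName] = array('r1', 'r2', 'r3', 'r4', 'r5');
    return $tableName;
}

function sketch_import_import_rows($tableName, $offset, $limit, $params)
{
    $rows = array_slice($GLOBALS['sketch_tables'][$tableName], $offset, $limit);
    return count($rows); // number of rows handled in this batch
}

function sketch_import_delete_table($tableName)
{
    unset($GLOBALS['sketch_tables'][$tableName]);
}

$params    = array('type' => 'contact', 'dupe_check' => 'overwrite');
$tableName = sketch_import_create_table('contacts.csv');
$limit     = 2;
$imported  = 0;

// Keep asking for the next slice until a batch comes back empty.
for ($offset = 0; ; $offset += $limit) {
    $count = sketch_import_import_rows($tableName, $offset, $limit, $params);
    if ($count === 0) {
        break;
    }
    $imported += $count;
}
sketch_import_delete_table($tableName);

echo "Imported $imported rows\n";
```

Keeping the offset and limit on the caller's side means a cron job, a CLI script or a series of short web requests could all reuse the same calls.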


So this is my understanding:


1st call will create the table and parse all the csv into a table.
2nd call $params will store import type,  mapping information, date format, dupe check, geocoding, groups, tags, error_reporting


I think it could look this way

Code:
// Proposed layout of $params for civicrm_import_import_rows()
$params = array(
  'type' => 'contact',
  'date_format' => 'mm/dd/yyyy',
  'dupe_check' => 'overwrite',
  'error_report' => FALSE,
  'groups_id_add' => array(5, 6),
  'tags_id_add' => array(6, 7),
  'geocode' => FALSE,
  // temp table column number => CiviCRM contact field
  // (keys must be unique; a repeated key silently overwrites the earlier entry)
  'mapping' => array(
    0 => 'first_name',
    1 => 'middle_name',
    2 => 'last_name',
    3 => 'email',
    4 => 'phone',
    5 => 'street_address',
    6 => 'supplemental_address_1',
    7 => 'city',
    8 => 'state_province',
    9 => 'country',
    10 => 'custom_1',
    11 => 'custom_2',
  ),
);

The mapping array will probably have to be more elaborate but the idea is just to match up the temp table column number with a civicrm contact field. For email, phone etc it probably can be another nested array with more specifications.
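
One way the more elaborate mapping could look, purely as a sketch: a plain string maps a column directly to a contact field, while a nested array carries extra specifications. The keys `field`, `location_type` and `phone_type` and the function `apply_mapping()` are assumptions for illustration, not an agreed format.

```php
<?php
// Hypothetical richer mapping: strings map a column straight to a field,
// arrays add qualifiers such as a location type.
$mapping = array(
    0 => 'first_name',
    1 => 'last_name',
    2 => array('field' => 'email', 'location_type' => 'Home'),
    3 => array('field' => 'phone', 'location_type' => 'Work', 'phone_type' => 'Mobile'),
);

// Turn one temp table row (indexed by column number) into a
// field => value structure using the mapping.
function apply_mapping(array $row, array $mapping)
{
    $contact = array();
    foreach ($mapping as $column => $target) {
        if (!array_key_exists($column, $row) || $row[$column] === '') {
            continue; // skip missing or empty columns
        }
        if (is_array($target)) {
            $entry = $target;
            $field = $entry['field'];
            unset($entry['field']);
            $entry['value'] = $row[$column];
            $contact[$field][] = $entry; // a contact may have several emails or phones
        } else {
            $contact[$target] = $row[$column];
        }
    }
    return $contact;
}

$contact = apply_mapping(array('Ada', 'Lovelace', 'ada@example.org', ''), $mapping);
print_r($contact);
```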
« Last Edit: May 19, 2010, 07:47:13 am by xcf33 »

Donald Lobo

Re: Contact Importing scalability, and scalability of the system in general
May 19, 2010, 09:34:13 am

we've started work on this on IRC: http://issues.civicrm.org/jira/browse/CRM-6273

if you are interested please jump on IRC and help with the code / spec / unit test

lobo

Erik Hommel

  • Forum Godess / God
  • I live on this forum
  • *****
  • Posts: 1773
  • Karma: 59
    • EE-atWork
  • CiviCRM version: all sorts
  • CMS version: Drupal
  • MySQL version: Ubuntu's latest LTS version
  • PHP version: Ubuntu's latest LTS version
Re: Contact Importing scalability, and scalability of the system in general
May 25, 2010, 09:47:36 am
Hi guys,
sounds like an interesting development! Question: I come across situations where I need to add contacts in a migration, and for each contact also:
* add custom data;
* add contact to a group;
* add relationships and tags.
Are we taking that kind of situation into account in this interface, or is it more focused on batching the standard contact import? At the moment of course I do multiple API calls to cater for these options, and add the custom data directly in the database. So far I have not experienced problems with this approach, but I wondered whether this would come on top of the scalability issues: what if I have 500,000 records to import AND need to import into more than one table?
Consultant/project manager at EEatWork and CiviCooP (http://www.civicoop.org/)

mbriney

  • I’m new here
  • *
  • Posts: 21
  • Karma: 2
  • Technical Product Manager / VP at Edelman
    • Edelman Public Relations
Re: Contact Importing scalability, and scalability of the system in general
May 25, 2010, 10:55:32 am
Erik,

We have the same need to be able to move across all of those tables.  The front-end that we are working on right now will facilitate this when it's finished.

-Matt
support CiviCRM through 'make it happen' initiatives!
http://civicrm.org/mih

Erik Hommel

Re: Contact Importing scalability, and scalability of the system in general
May 25, 2010, 10:59:16 am
That sounds good Matt! Are you saying it is part of the import API, or are you creating a separate solution that you can share with us?
Erik

Michael McAndrew

  • Forum Godess / God
  • I live on this forum
  • *****
  • Posts: 1274
  • Karma: 55
    • Third Sector Design
  • CiviCRM version: various
  • CMS version: Nearly always Drupal
  • MySQL version: 5.5
  • PHP version: 5.3
Re: Contact Importing scalability, and scalability of the system in general
May 25, 2010, 11:04:58 am
Hello all,

When I hear Import and API in the same discussion, my eyes light up :)

Here is a forum post that describes two import use cases.  It would be great to hear feedback on how/if you see them fitting into this project.  Neither is the same use case as the one you first outlined, but maybe they complement it.  Interested to hear what you think.  I am able to commit time to developing this (and will be working on use case 1 all tomorrow).  Happy to be told that they are way off or need to be integrated down the line, but I have been meaning to talk about these for a while and now seemed like a good time...

Use case 1 - Complex contribution import

I have a CSV file that gets exported from an external payments site on an ongoing basis.  It needs to be imported into Civi by clients but we can't use the Import Contributions functionality because
1) The CSV doesn't get exported in a completely Civi-friendly format, and
2) we need to do a little extra work on the data as it is getting imported (deliver / update memberships depending on some logic, and populate some custom fields with calculated data)

So what we would like to do is create a page that is very similar in look to the import pages and allow us to do this processing.  Looks like what we would be re-using here is your api that can be used to upload a remote file and create a table from it.

We'd then probably skip the mapping, or at least create our own mapping which allows us to do the extra things we want to (though having a framework to do this would be cool).  I'm not sure I understand how civicrm_import_mapping_create works but maybe that is what you mean.

I've already written the script that does this importing and am about to start on the interface.

Because this is an automated process, and because my script is only so clever, there are times that we want to skip the row and report back to the user why we skipped (just like the standard import).  At the moment, the script just creates a text report but it would be great if we were able to leverage some of the already existing import reporting.

Use case 2 - Clean before import

This is a slightly different use case, and probably more away from the original post, but it relates to Erik Hommel's post about relationships, multiple groups, etc. and Xavier's post about using the CLI for importing.  It's something I do with pretty much every client and I think it would be really useful to have a standardised shared import framework...

Because just about every client's data comes in a different structure, needs more or less cleaning of things like address data, and has things like multiple groups to import, I tend to write import scripts that have three stages:

1) create a MySQL table from their 'raw data'

2) loop through each of the records, cleaning as I go, and creating an array which I feed to the relevant API to do the import (I have a fairly messy library of civicrm specific cleaning functions that i reuse here)

3) report / do something else based on the result of the API call.

This is generally much more pleasant than using the GUI, even when the GUI can do the import, because I can easily change the 'mapping' I create in stage 2 without having to go through the Civi screens.  And because I am leveraging the API rather than interacting directly with the DB, it's relatively easy and I don't have to worry about data integrity issues.  Though obviously this approach is no good if you want the client to do the import :)
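
The three stages above could be sketched roughly as follows. Everything here is hypothetical: `clean_row()` stands in for the library of cleaning functions, `fake_contact_api()` stands in for the real API call, and its `is_error` / `error_message` result shape is an assumption about how the API reports failures.

```php
<?php
// Stage 1 stand-in: rows as they might come out of the raw-data table.
$rawRows = array(
    array('name' => ' jane DOE ', 'email' => 'Jane@Example.org '),
    array('name' => 'no email',   'email' => ''),
);

// Stage 2a: an example cleaning function (not a CiviCRM helper).
function clean_row(array $row)
{
    $parts = preg_split('/\s+/', trim($row['name']));
    return array(
        'first_name' => ucfirst(strtolower($parts[0])),
        'last_name'  => ucfirst(strtolower(isset($parts[1]) ? $parts[1] : '')),
        'email'      => strtolower(trim($row['email'])),
    );
}

// Stage 2b: placeholder for the real API call.
function fake_contact_api(array $params)
{
    if ($params['email'] === '') {
        return array('is_error' => 1, 'error_message' => 'missing email');
    }
    return array('is_error' => 0);
}

// Stage 3: report what was imported and why rows were skipped.
$report = array('imported' => 0, 'skipped' => array());
foreach ($rawRows as $i => $raw) {
    $result = fake_contact_api(clean_row($raw));
    if ($result['is_error']) {
        $report['skipped'][] = "row $i: " . $result['error_message'];
    } else {
        $report['imported']++;
    }
}
print_r($report);
```

Changing the 'mapping' then only means editing `clean_row()`, which is the flexibility described above.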

OK - apologies for long post :) hope this was useful.  interested to hear what you think and happy to talk further on IRC.
Service providers: Grow your business, build your reputation and support CiviCRM. Become a partner today

mbriney

Re: Contact Importing scalability, and scalability of the system in general
May 25, 2010, 11:12:11 am
Quote from: Erik Hommel on May 25, 2010, 10:59:16 am
That sounds good Matt! Are you saying it is part of the import API, or are you creating a separate solution that you can share with us?
Erik

We are working in coordination with the core team.  We hope to have a standalone Drupal module that leverages existing APIs later this month that will temporarily bridge the gap until the Import API is finished.  Once the Import API is ready we will update our code to work within that structure.

Erik Hommel

Re: Contact Importing scalability, and scalability of the system in general
May 25, 2010, 11:13:15 am
Michael,
I recognize part of what I do in your second story. I tend to:
* import the source data into MySQL
* do some simple manipulation in MySQL
* use some of my little tools to clean some more
* use my base script to run the APIs in one batch
I have added a little error reporting as well.

If there will be an import function that will allow mapping to field/table, something like this:
incoming columnA goes to contacts / first name
incoming columnB goes to contacts/last name
incoming columnC goes to custom data table/custom data field
incoming columnD sets group A when it has the value.....

that would be great....it will slice off a significant part of the work, although I think there will always be little horrors in migration that require unexpected intervention
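
A small sketch of the column-to-field/table mapping described above. The spec format (`entity`, `field`, `group`, `when`) and the function `route_record()` are invented for illustration and are not part of any existing import function.

```php
<?php
// Hypothetical mapping spec: each incoming column names a target entity and
// field, or a group to add when the column holds a given value.
$spec = array(
    'columnA' => array('entity' => 'contact', 'field' => 'first_name'),
    'columnB' => array('entity' => 'contact', 'field' => 'last_name'),
    'columnC' => array('entity' => 'custom',  'field' => 'custom_1'),
    'columnD' => array('entity' => 'group',   'group' => 'Group A', 'when' => 'yes'),
);

// Split one incoming record into per-entity parameter arrays.
function route_record(array $record, array $spec)
{
    $out = array('contact' => array(), 'custom' => array(), 'groups' => array());
    foreach ($spec as $column => $rule) {
        if (!isset($record[$column])) {
            continue; // column not present in this record
        }
        if ($rule['entity'] === 'group') {
            if ($record[$column] === $rule['when']) {
                $out['groups'][] = $rule['group'];
            }
        } else {
            $out[$rule['entity']][$rule['field']] = $record[$column];
        }
    }
    return $out;
}

$routed = route_record(
    array('columnA' => 'Erik', 'columnB' => 'Example', 'columnC' => 'blue', 'columnD' => 'yes'),
    $spec
);
print_r($routed);
```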

mbriney

Re: Contact Importing scalability, and scalability of the system in general
May 25, 2010, 11:45:53 am
We actually don't envision the process working much differently from the current system, except that you will upload the file, save your de-dupe settings and define your field mappings, and all of that data will then be saved in a config file.

A background service would then check for and run the jobs, and e-mail the results when each job is complete.

We're looking into the batch/queue system in Drupal at the moment for that functionality.


This forum was archived on 2017-11-26.