CiviCRM Community Forums (archive)

*

News:

Have a question about CiviCRM?
Get it answered quickly at the new
CiviCRM Stack Exchange Q+A site

This forum was archived on 25 November 2017. Learn more.
How to get involved.
What to do if you think you've found a bug.



  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion (Moderator: Donald Lobo) »
  • Imports - line-by-line vs one parsed object?
Pages: [1]

Author Topic: Imports - line-by-line vs one parsed object?  (Read 1538 times)

alanms

  • I post occasionally
  • **
  • Posts: 72
  • Karma: 5
Imports - line-by-line vs one parsed object?
March 07, 2011, 04:55:10 am
Hi,

I'm putting in a proposal to do some work on some changes to the import script. Naturally anything I come up with will be shared back when fully working and tested :)

What makes this interesting is, my client's needs are the opposite to a lot of the orgs working on improving the import script. They aren't a large org with in-house/regular technical staff looking for ways to program huge imports on a high-spec dedicated server. They're that group comparitively unrepresented on the CiviCRM forums - a small org with irregular technical support from occasionally-hired freelancers (like me), with a low-medium spec VPS, who want their regular office staff to be able to do imports without worrying about any of the many possible server errors and the many necessary preparatory steps for lower-end servers such as chopping up import sheets based on which fields need mysql joins.

(if you're lucky enough to never have encountered these problems, browse the forums. They're an ordinary non-techie's nightmare - the server falls over, the white box never disappears, the import stops an unknown way through and sometimes resumes but sometimes doesn't, and occasionally strange incorrect data is created such as blank contacts)

So there's two needs - a user friendly interface, and a fail-safe import script. I have a plan for the user interface (a fish-eye style one-page display with live row-by-row reporting and AJAX forms for fixing invalid fields and possible dupes if you're interested).

The tricky part is the fail-safe script. I have a plan, but it's almost the opposite approach to all existing versions of the import script, so I want to be sure that I haven't missed anything obvious and that my approach doesn't sacrifice too much in performance.

All the current solutions parse the csv file into an object, then iteratate over the object. My instinct is to restructure the process to work like this (after uploading, basic validation, mapping etc):

  • (before process) Create a record of the import in the Civi database, recording mapping, file, number of rows, and number of rows successfully processed (N)
  • Hit row N of the file, parsing into a PHP object
  • Pass the appropriate attributes to the appropriate validation methods. If a field fails validation, print the row to the page, highlighting the row and the invalid fields in the fish-eye display., then go to the next row, record the row number & details in an array field in the DB, and skip to 5)
  • If valid, run a dupe check on this row. Update where appropriate, and if there's ambiguity (it passes the 'strict' test but fails the 'fuzzy' test - or did I get that the wrong way round?) AJAX print details to the screen, fish-eye highlighted, record the row number & details in an array field in the DB, and scip to 5)
  • If valid, pass the appropriate attributes to the appropriate APIs, creating or updating.
  • Record in the DB that N=N+1, and if row count < N, repeat.

Unless I've missed something obvious, this should make for a good failsafe. It should be possible to keep the server strain at any given point in the process more or less the same if an import has rows in the hundreds or in the millions. There's also loads of potential for smooth interface interactivity (trust me on this one!) and even if the server does clonk out mid import, it should be possible to always manually resume where it left off. Not only that but it doesn't look like too daunting a programming task.

So, two questions:

  • How big a performance hit is this method likely to be?
  • What other disadvantagtes are there to this method that I haven't thought of?
« Last Edit: March 07, 2011, 04:57:23 am by alanms »

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: Imports - line-by-line vs one parsed object?
March 07, 2011, 08:22:22 am
Hi,

What do you mean by fish-eye display ?

The import already transfer all the lines of the csv into a temporary mysql table. I would make sense indeed to be able to split (or re-run) an import process and be able to see in the import tables which rows have not been imported.

I'm wondering to what extend adding an extra status column on these tables shouldn't allow to do both that and the error records (I would love being able to fix some contacts without having to re-run a complete import)

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

Donald Lobo

  • Administrator
  • I’m (like) Lobo ;)
  • *****
  • Posts: 15963
  • Karma: 470
    • CiviCRM site
  • CiviCRM version: 4.2+
  • CMS version: Drupal 7, Joomla 2.5+
  • MySQL version: 5.5.x
  • PHP version: 5.4.x
Re: Imports - line-by-line vs one parsed object?
March 07, 2011, 08:26:59 am
hey alan:

thanx for the detailed blog post. i do think that the concept behind "api"fying the import meshes quite nicely with what you are suggesting. We had blogged about it last year here:

http://civicrm.org/blogs/lobo/implementing-batch-import-api

I think this will allow an ajax interface to import pretty easily (at the same time allowing batch imports on the server also)

Ultimately the current import (or any other custom import scripts) all follow the same pattern:

- set up import

- for all rows, impor row

- clean up import

in the current case the above code is very tightly integrated and "api"fying it will help the structure and allow people to stitch different styles of import

lobo
A new CiviCRM Q&A resource needs YOUR help to get started. Visit our StackExchange proposed site, sign up and vote on 5 questions

alanms

  • I post occasionally
  • **
  • Posts: 72
  • Karma: 5
Re: Imports - line-by-line vs one parsed object?
March 10, 2011, 08:33:03 am
Thanks for the replies - prompt and intelligent as always!

re xavier's "Fish-eye" question - a bit like http://interface.eyecon.ro/demos/fisheye.html but rather than emulating the Mac OS X application doc, imagine a table resizing height by row by vertical proximity, width by column by horizontal proximity, and font-size by both combined, going from tiny 1px text in 3px x 8px cells which simply commuicate how many rows and columns there are, which cells have data and which are completed (e.g. green=done, pale grey=to do), to ordinary spreadsheet size for what is currently being worked on or focused on (with a 'full cell' button label to show a full cell scrollable popup for cells with long content such as notes).

re Lobo's post, the aim is definitely to offer a user-friendly reliable, interactive 'slow but steady' interface for the Import API which can sit alongside fast, task specific/programmatic import options.   

re both posts' more general points, I think there's another dimension to this. The current import script already has a kind of resume functionality with the 'status' column of the temporary tables - but as I've found in the past only certain types of errors lead to resume attempts and the results are sometimes not what is expected. Also, looking at what records I have left of dead imports of my own, many of them failed during the row-by-row process.

Giving it some more thought, speed is expendable because expert-led mega-import projects will be best done bypassing the import interface entirely, programming into the import API directly - and most orgs with high-performance requirements and a decent database budget already seem to be pointing this direction. The priority for interface import users are that it's failsafe, and that they have tools available to avoid bad data being imported by mistake (which can include up-to-date data being overwritten by perfectly valid data which only a human can judge is less up to date).

So here's what I think would work best: an asynchronous chained process with each execution of the Import API being just one row and initiated by a chain of AJAX functions.

One major benefit is that unlike a  line of cheap Christmas lights, if one bulb fails, the rest of the process doesn't just die - whether it's an error 500, 'MySQL gone away', etc, AJAX can try again, recording each attempt in a data-* attribute in the tr element for that row, and then mark that row as an error and moves on to the next row if it fails more than X (5?) times.

The other major benefit is that it makes all sorts of useful interactivity possible and simple. A few examples:
- Pausing and resuming imports, at any time, for any reason. AJAX simply doesn't tell the next row to go.
- Setting breakpoints, so a cautious user can check how the import's going after, say, 50 rows before allowing the rest to go. 
- Manually fixing invalid data or choosing whether to merge possible duplicates case-by-case while the rest of the import carries on importing.

And a few rather attractive possibilities that would require a little extra work...
- Cell-by-cell change reporting. When an import is about to 'update' rather than 'add' to an entity, we can get the current data with the AJAX API, or after with some kind of pull or FK from the change log, and show what was overwritten as well as what the new records are.
- Cell-by-cell undo. If we have what a record was as well as ID etc, we can 'undo' any change in an import with a simple AJAX button that takes the .text of the 'before' record and writes it using the API.

I'm sure there's more. Essentially this'd give two import options: the experts can set up their own top speed imports which require technical skills and which trust trust them to know precisely what they're doing. For everyone else, there's a slow but steady route, built on the same core script from the API, which gives numerous opportunities to spot problems of all types and correct them immediately.

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: Imports - line-by-line vs one parsed object?
March 10, 2011, 08:37:42 am
One request per row is a denial of service. Between one import failing or crashing the server, I'd rather keep the existing version.

But definitely sure you are on something, let me process now that I feel better having avoided a DOS ;)

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

alanms

  • I post occasionally
  • **
  • Posts: 72
  • Karma: 5
Re: Imports - line-by-line vs one parsed object?
March 10, 2011, 08:50:12 am
Ha - good point! Definitely needs more thought...

(p.s. I trust you meant I'm 'on to something' as in there's some substance to the idea, rather than 'on something' as in narcotics...)
« Last Edit: March 10, 2011, 08:53:24 am by alanms »

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: Imports - line-by-line vs one parsed object?
March 10, 2011, 09:25:42 am
Quote from: alanms on March 10, 2011, 08:50:12 am
(p.s. I trust you meant I'm 'on to something' as in there's some substance to the idea, rather than 'on something' as in narcotics...)

Oops, who could have thought that a missing "to" would change the meaning so much ? English has not enough words, so you are forced to (ab)use and stuff your sentences with a lot of a on to up at of in..., that change completely the meaning of the verb!

I'm on something, definitely, as for you, you might be on to(o) ;)

X+
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

alanms

  • I post occasionally
  • **
  • Posts: 72
  • Karma: 5
Re: Imports - line-by-line vs one parsed object?
March 10, 2011, 09:41:45 am
X:  :)

One quick thought to minimise the hit on the server (disclaimer - performance isn't at all my area of expertise so I may well have missed something obvious):

- To keep the data traffic down to something negligible, each AJAX request passes only the temporary table name and row number, to a wrapper around the API call which retrieves the data from the appropriate row in the temporary table and passes it to the import API. In-line cell by cell edits save to the appropriate row in the temporary table and mark it as not done if it's marked as done.
- To keep the volume of requests at any given time low, as well as chaining the requests to only happen either after processing the previous completed request or when it is clear that the chain has broken, wrap each call in a setTimeout or equivalent, spacing each request out by a number of milliseconds.

The setTimeout option also opens up the interesting possibility of an option, maybe presented in an 'Advanced Options' accordion, maybe as a slider next to a 'Go/Pause' button, to allow users to control the speed of their import: speeding it up if their import needs to be done comparatively quickly, and slowing it down if they aren't in a rush and want to minimise the impact on other users or their own CiviCRM activities in other browser tabs, or, if they want to be able to watch and control the import more closely as it happens.

Unless I've missed something, it should be possible to keep the impact on site performance of an import as close to equivalent of one extremely busy user (I'm not sure how this compares with the current process).

It also feels like there could be a lot of uses for the ability to export a sheet, make changes, import it and slow down the import enough to be able to watch each row go in and interrupt where necessary.

Pages: [1]
  • CiviCRM Community Forums (archive) »
  • Old sections (read-only, deprecated) »
  • Developer Discussion (Moderator: Donald Lobo) »
  • Imports - line-by-line vs one parsed object?

This forum was archived on 2017-11-26.