CiviCRM Community Forums (archive)

This forum was archived on 25 November 2017.

Author Topic: Pluggable, faster import system  (Read 7901 times)

cap10morgan

Pluggable, faster import system
July 30, 2008, 10:33:17 am
While working on importing some existing contacts from another system, I realized that the CSV import was just too slow and cumbersome for my needs. The existing data all lives on the same MySQL server that CiviCRM runs on, so I thought, "Why not just add the ability to specify an SQL query to pull the initial data from?" I had hoped the resulting internal data structure would resemble the one you end up with after parsing the CSV file anyway. However, after discussing it with Lobo and looking through the code, it seemed that a deeper redesign was in order.

Here's the wiki page going over what I'm working on now: http://wiki.civicrm.org/confluence/display/CRM/SQL+Import

The highlights are:

1. DataSource plugins -- You can write plugin classes that let CiviCRM pull data in from other systems. They present a form to the user and must get the data from the external system into a temporary table in the database (but don't need to transform it beyond that). The standard import system then takes over from there. Initially I hope to have a CSV plugin (to duplicate the existing functionality) and an SQL plugin.

2. More efficient import -- Rather than spending lots of time transforming each record one at a time, the new system will apply transformations to the entire table at once whenever possible. This lets the database do the heavy lifting rather than PHP. For some things, PHP may still be necessary. (A rough sketch of the idea follows this list.)
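
To make the set-based idea concrete, here is a rough sketch of what a run could look like once a DataSource plugin has landed raw rows in a temporary table. It's purely illustrative -- written as a perl/DBI script with invented table and column names, since perl is what's being shared elsewhere in this thread; the real implementation would be PHP inside CiviCRM's import layer.

    #!/usr/bin/perl
    # Sketch only: a DataSource plugin has already landed raw rows in a
    # temporary table; we then transform whole columns with single SQL
    # statements instead of looping over records. All names are invented.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=civicrm', 'user', 'pass',
                           { RaiseError => 1 });

    # One statement per transformation, applied to every row at once.
    my @transforms = (
        # Normalize gender codes to their labels.
        q{UPDATE import_tmp SET gender = CASE gender
             WHEN 'F' THEN 'Female' WHEN 'M' THEN 'Male' ELSE NULL END},
        # Trim whitespace from names in bulk.
        q{UPDATE import_tmp
             SET first_name = TRIM(first_name), last_name = TRIM(last_name)},
        # Map yes/no flags to booleans.
        q{UPDATE import_tmp SET do_not_mail = IF(do_not_mail = 'Y', 1, 0)},
    );

    $dbh->do($_) for @transforms;
    $dbh->disconnect;

The point is that each transformation is one round trip to the database, however many rows are in the table, instead of one PHP call per record.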

I've not yet committed any code to trunk, as we're in the 2.1 code freeze. It's not usable yet, anyway.

Does anyone have comments on this? Does this seem useful to others? Are there things you'd want to see (or not see) in here before it would be useful to you?

Donald Lobo

Re: Pluggable, faster import system
July 30, 2008, 11:34:48 am

cap10morgan:

Do you want to expand a bit more on 2 and share some thoughts on how you'd do it? That's the crux of the issue, IMO. If we can figure out a simple way of transforming a row into a set of PHP/SQL statements, it would make import super efficient. There's also the whole issue of data consistency and validation (which we can ignore for now).

lobo

cap10morgan

Re: Pluggable, faster import system
July 30, 2008, 11:37:53 am
Sure thing, once I get there and really dive into it. I'm concentrating on the SQL import stuff first (and making that appropriately general so that it can support other DataSources). But once that's working, this will be front and center.

nmiracle

Re: Pluggable, faster import system
August 02, 2008, 06:01:46 am
I had a similar problem (roughly a million records in a MySQL db on the same server as our CiviCRM implementation that needed importing), and after fiddling with the CSV import, I ended up doing roughly the same thing you've described -- writing a script, executable at the shell level, that would read the existing DB, format the data as needed, and put it into the various CiviCRM tables. (We carry phone numbers as pure numerics, for instance, formatting them at the time of access, and it seemed a better fit to supply them to Civi already formatted.)

The original source of much of our data was our county registrar of voters. I'm not sure this is typical, but if it is, and the records are typical for the source, then it would be very handy if the import script could:

a)  deal with transforms like "Y"/"N" or "F"/"M" to booleans;
b)  deal with transforms like "jr" and "sr" to tables of values;
c)  deal with custom fields;
d)  deal with compound transforms. In our case, we had election data for 8 years in a variable format: there were groups of three fields, where the first field was an election code ('0308' was the March 2008 primary election), the second a voting type ("A", "E" or "Y" for absentee, early or election-day voting) and the third a party affiliation ("D", "R" and "I"). Our users don't care whether someone voted early, but it is of intense interest to know which party the voter selected during the last three or four elections, so we needed to support custom fields for (say) "2008_Primary" with a value of D, R or I, if recorded (it is not always), and Y as a fallback.
e)  if the import is run over HTTP, I think it will probably have to do some sort of periodic write to the client to avoid browser timeouts.

(d) is probably a PITA and may not be that commonly needed; I think the others are both simpler and more necessary. (A sketch of what (a) and (b) could look like as bulk SQL follows.)
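
Hedged sketch of what (a) and (b) might look like as set-based SQL under the proposed system -- the staging table, lookup table and column names here are all invented for illustration:

    # (a) and (b) as single bulk statements against an imaginary
    # staging table; every name below is made up.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=civicrm', 'user', 'pass',
                           { RaiseError => 1 });

    # (a) "Y"/"N" to boolean in one statement.
    $dbh->do(q{UPDATE voter_tmp SET is_active = IF(is_active = 'Y', 1, 0)});

    # (b) suffix strings ("jr", "sr") to a table of values, via a join
    # against a small lookup table.
    $dbh->do(q{UPDATE voter_tmp t
               JOIN suffix_lookup s ON UPPER(t.suffix_raw) = s.code
               SET t.suffix_id = s.value_id});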

We also had a bunch of data (pretty junky data, in fact) that we got from the local campaign. It had been collected without much of anything in the way of edits for data integrity, and the addresses were neither canonicalized nor split into street number, predirectional, postdirectional, street type, etc. Because the addresses were so messed up, the data couldn't be matched/merged with the rest of what we had, even though it held some extremely useful data -- phone numbers and email addresses. I ended up having to write a parser that took into account just about every weird way a human being can mis-format their own address. Again, not knowing the data sources or whether this would be a common need, having something like this as a feature might be very useful. I can guarantee that without it, it would surely have been a mistake to import that data, because we would have ended up with thousands of dupes that had to be resolved by a person somehow.

Nancy

Donald Lobo

Re: Pluggable, faster import system
August 02, 2008, 11:40:10 am

Hey Nancy:

1. Can you share your voter file table structure and your import script? (It might be useful to others.) We'd like to get a better idea of the type and amount of data a voter file has.

2. You might also be interested in this forum topic: http://forum.civicrm.org/index.php?topic=3105.0

3. If you think your data cleaner is generic and useful to other folks, sharing that might also be a good idea :)

lobo

nmiracle

Re: Pluggable, faster import system
August 02, 2008, 10:08:56 pm
Yikes.  Those are short questions with long answers.

I am happy to share what I know about the voter files and also what I wrote. However, my experience with PHP is extremely limited, and what I wrote was in perl, which may not be helpful. MySQL is MySQL, however...

1.  We had two schemas:

  • The source data we got from Tarrant County, TX with $1 and a freedom-of-information request (we had to show up in person and prove we were registered voters in the county). This was provided as a .txt file and served as our raw data. The list of fields is fairly long, so I put it here: http://www.digitalmiracles.com/stores/sdio/info/TCFileLayout.html
     
  • The schema we developed to meet the needs of the campaign we are serving is somewhat different. There were four additional classes of information that were needed (and you've probably heard this before):

    • email address
    • multiple phone numbers
    • information about volunteer activities -- basically, all the information we had could be rendered as booleans for whether the voter had expressed interest in various particular campaign activities: phone banking, fundraising, blockwalking, calling other volunteers, etc. We did not have information as to whether they actually showed up, or the date they expressed interest, just the interest; hence the representation as booleans.
    • information about the voter's other political activities and offices -- for instance, have they served as a county, state or national delegate, been a precinct captain, elector, member of the SDEC, etc. Again, we did not have much depth of information here, just sort of a 'yes' or 'no'.

    Because I knew this was a file we were creating for a very specific purpose (a one-time import) and the data domain was very strictly defined, I did not normalize the data; I just put it in a table, one record per voter. I would not advise the current schema as a permanent file structure. In fact, I'm a little embarrassed to admit I did it, because it is so very much brute force, but it was a good and simple solution to a specific problem. At any rate, here is the schema, annotated to indicate where normalized tables would be a better long-term solution: http://www.digitalmiracles.com/stores/sdio/info/VoterFileLayout.html

    I did the import in four steps:
    • Copied the .txt file from the county to a directory on my server and wrote a very simple program that just read the file, parsed it and stuffed it into a MySQL table. This was executed from the shell prompt; there isn't any UI. I'd be happy to share it, but it's pretty basic: "OPEN, while not EOF {read, parse to array, insert into MySQL}, CLOSE"; a minimal sketch of that shape follows this list.
    • Because we had a pretty substantial database already, I then wrote a second program that matched the data from the first table up to the existing db by state file ID and, if there was no match, tested with a combination of name and address. If the record is found, it does an update; if not, an insert. Ditto on "not very complicated and executes from shell." I don't think this one is generally useful.
    • Wrote a cleanup program that uses the address cleanup macro and also, for those voters for whom there is no email address, generates a valid but unusable email address (rand()@myserver.com for whatever your server is). I did this because, based on the observed behavior of the current import program, I was concerned that CiviCRM would choke if it got a record with no email address. It would be preferable (at least IMHO) not to add email records with addresses that we know don't work, even if they are flagged as on hold, but I didn't know if that would be OK. This might be generally useful, particularly the address cleaner, which is described in more detail below; you tell me.
    • Wrote a program that reads the cleaned file and puts it into the CiviCRM tables.
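
    For reference, the first of those four steps is essentially the following shape. This is a minimal sketch, not the actual script; the file layout, delimiter and column names are invented for illustration:

        #!/usr/bin/perl
        # Read the county's .txt export line by line, parse each record
        # and insert it into a MySQL staging table. OPEN, while not EOF
        # {read, parse to array, insert into MySQL}, CLOSE.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('DBI:mysql:database=voters', 'user', 'pass',
                               { RaiseError => 1 });
        my $sth = $dbh->prepare(
            'INSERT INTO voter_raw (state_file_id, last_name, first_name,
                                    street) VALUES (?, ?, ?, ?)');

        open my $fh, '<', 'county_export.txt' or die "open: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my @fields = split /\t/, $line;  # assuming a tab-delimited export
            $sth->execute(@fields[0 .. 3]);
        }
        close $fh;
        $dbh->disconnect;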

    I'm guessing the last program is probably the one you think might be generally useful? I'd be happy to share it; however, like the others, it is written in perl and executed from the shell. If nothing else, having the logic and MySQL statements as a guide might be useful. If it would be useful to have this function in PHP with a proper UI, I'd be interested in working on such a project, but I hardly know PHP at all, and to ensure the program matches the quality of the rest of the codebase I'd need assistance from a development partner who does. If someone wants to help me, I promise that I'm friendly and I learn new languages pretty quickly.



2.  That does sound interesting; I'll pay them a visit.



3.  There actually were several data cleaners.
  • A number of very trivial formatters, which I didn't mention. Specifically: a phone formatter, which just takes a phone number, rips out any non-numerics with a regular expression and then reformats with dashes -- that's a no-brainer, and you probably have one already. A subroutine that, given a first, middle and last name, reassembles and re-parses them correctly based on the location of spaces (we had a ton of cases from the campaign where the person's first name was stated as 'Emma Jean' or 'Bobby Joe'; to be able to match with public records, that has to be 'Emma' in the first-name field and 'Jean' in the middle). Also trivial. A routine that standardizes city names for our area (people often abbreviate 'Fort Worth' as 'FW' or 'FT Worth', and 'North Richland Hills' as 'NRH'), which I'd rate as geographically specific and not generally useful. Plus one that looks for common misspellings in email addresses -- probably general. While these are generally applicable, I'm sure anyone would write the same, and they are not very interesting. At least, I'd be surprised if they were. (Simplified sketches of the phone formatter and the address splitter appear after this list.)
     
  • The address cleaner is considerably more interesting and might be generally useful (this is based on the fact that I see a similar address structure in the civicrm_address table). The routine accepts an address and a second field which is (if known) an apartment type for the address. The second field is optional, but often available and, if available, proved to be a very helpful hint when disambiguating complex addresses. The subroutine returns an array consisting of ($canonicallyFormattedAddress, $streetNo, $streetPrefix, $streetName, $streetType, $streetSuffix, $aptType, $aptNo).

    So if it is given "Apt 26, 123 N. Elm Street W", it would return the array ('123 N Elm St W Apt 26', 123, 'N', 'Elm', 'St', 'W', 'Apt', 26).

    The problem is that a certain level of knowledge-based parsing is required. This is like the classic lexical relationship/precedence problem illustrated by the two English sentences 'Time flies like an arrow' v. 'Fruit flies like a banana'. The correct identification of the verb in each sentence depends on the listener being aware that 'fruit flies' is a legitimate adjective-noun pair but 'time flies' is not.

    So it is with the words 'Avenue', 'Freeway', 'Farm Road' and 'Highway' here in Texas.

    Sample interesting (and pesky) cases here include "2103 Mariposa Avenue N 114" (canonically "2103 Mariposa Ave N Apt 114", with a street name of 'Mariposa', a street type of 'Ave' and a streetSuffix of 'N') as compared to "2301 E Avenue H 110" (canonically "2301 E Avenue H Apt 110", with a street name of 'Avenue H'), or "115 W State Highway 360" (canonically the street name is "State Highway 360") v. "115 W State Street 360" (canonically "115 W State St Apt 360"). In the first pair, identifying the word 'Avenue' as street name or street type depends on whether the preceding segment is identified as a directional. In the second pair, although a numeric trailing entity is generally a very good hint of an apartment number, if it is preceded by 'Highway', 'Farm Road' or any abbreviation thereof, then the last entity is almost certainly part of the street name, unless it is preceded by an apartment-type indicator of some sort or by a street-type indicator.

    This information is useful for precisely addressing mailings and generating the postal system barcodes, but it is also extremely useful if you are creating walk lists for a campaign. It is particularly useful if you can calculate:

    my $streetSide = $streetNo % 2;   # perl's modulus operator; there is no mod() builtin

    Because that will let you generate walk lists for a particular side of the street, as well as by block number.

    Based on our usage, I know that my algorithm works very well on Texas addresses, but there may be regionally different cases that break it, and it would be interesting to see what those are.

    I'd be happy to share the routine if anyone wants it, but one caveat: it is written in perl as a standalone routine (there is no file I/O), and I don't know how easy it is for PHP to call perl. So I'd need some guidance as to how best to package it for use with PHP or (as above) some coaching in translating it to PHP.
     
  • There is also a much tougher set of cases involving serious misspellings and/or information so incomplete that it is impossible to use machine logic to get a reliable match. As a result, we built several programs that use fuzzy logic to make best guesses at probable matches, based on an estimate of which part of the input file is most likely to be correct. (For instance, we got one file of information that had been OCR'd from poll sign-in records; the addresses had been canonicalized, but the names were almost beyond belief. We got another where the personal names were very accurate, but the addresses and email addresses had not been carefully entered at all.) The programs use logic to propose a 'best match' and then a list of other possibles, and ask a human to review and choose which match is best. I don't honestly think those would be useful, because they are very specialized, but I'd be happy to discuss the idea if anyone else finds it interesting. And I'd love to know how other people have solved it.
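
To make the trivial formatter and the easy path of the address splitter concrete, here is a simplified sketch. It is not the attached routine: it handles only the straightforward cases, deliberately punts on the knowledge-based parsing ('Avenue H', 'Highway 360') described above, and its lookup tables are heavily abbreviated.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Rip out non-numerics and reformat a ten-digit US number with dashes.
    sub format_phone {
        my ($raw) = @_;
        (my $digits = $raw) =~ s/\D//g;
        return $digits =~ /^(\d{3})(\d{3})(\d{4})$/ ? "$1-$2-$3" : $digits;
    }

    # Tiny stand-ins for what would be much longer lookup tables.
    my %STREET_TYPE = (STREET => 'St', ST => 'St', AVENUE => 'Ave',
                       AVE => 'Ave', ROAD => 'Rd', RD => 'Rd');
    my %DIRECTIONAL = map { $_ => 1 } qw(N S E W NE NW SE SW);

    # Returns ($canonical, $streetNo, $prefix, $name, $type, $suffix,
    # $aptType, $aptNo) for simple addresses only.
    sub split_address {
        my ($addr) = @_;
        my ($apt_type, $apt_no) = ('', '');

        # Pull a leading ("Apt 26, 123 ...") or trailing apartment clause.
        if ($addr =~ s/^\s*(?:apt|unit|#)\.?\s*(\w+)\s*,\s*//i
         or $addr =~ s/,?\s+(?:apt|unit|#)\.?\s*(\w+)\s*$//i) {
            ($apt_type, $apt_no) = ('Apt', $1);
        }

        my @tok = map { (my $t = uc $_) =~ s/\.//g; $t } split ' ', $addr;
        my $street_no = shift @tok;

        my $prefix = (@tok > 1 && $DIRECTIONAL{$tok[0]})  ? shift @tok : '';
        my $suffix = (@tok > 1 && $DIRECTIONAL{$tok[-1]}) ? pop @tok   : '';
        my $type   = (@tok > 1 && $STREET_TYPE{$tok[-1]})
                   ? $STREET_TYPE{pop @tok} : '';
        my $name   = join ' ', map { ucfirst lc $_ } @tok;

        my $canonical = join ' ', grep { length }
            $street_no, $prefix, $name, $type, $suffix, $apt_type, $apt_no;
        return ($canonical, $street_no, $prefix, $name, $type, $suffix,
                $apt_type, $apt_no);
    }

    # "Apt 26, 123 N. Elm Street W" -> "123 N Elm St W Apt 26"
    print join('|', split_address('Apt 26, 123 N. Elm Street W')), "\n";
    print format_phone('(817) 555-0123'), "\n";   # 817-555-0123

A real splitter needs the precedence rules described above (directionals before 'Avenue', 'Highway' absorbing a trailing number) layered on top of this token pass.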

I apologize for the length of the answer, but I don't know this community well and I wasn't sure how much really needed to be explained. Hopefully it isn't TMI. Bottom line: I'd be happy to share, but I will need help from someone more knowledgeable, because I'm a complete noob at php.

Nancy

Donald Lobo

Re: Pluggable, faster import system
August 03, 2008, 02:05:42 pm

hey nancy:

thanks for the detailed answers. Seems like you've been doing some pretty cool work with voter matching / dupe checking and importing all that stuff into CiviCRM. Some of the things from the list below will definitely be useful to us (i.e. a perl script is fine; we know perl well enough to convert things if/when needed):

  • Wrote a program that reads the cleaned file and puts it into the CiviCRM tables.
  • The address parser and any associated walk list generation code that you might have written

It might be good for you to package all the scripts and attach them to the forum so that someone later on can potentially use them as a starting point. Also, any other insights you might have with regard to voter management would be great as we start thinking about integrating CiviVoter into the core system.

lobo

xavier

Re: Pluggable, faster import system
August 03, 2008, 11:20:45 pm
Hi,

Two points that I found cumbersome in the import (besides what has already been mentioned/worked on):

- The table lookups (e.g. salutation/boolean, country...) should be made more pluggable, so you can add extra conversions/controls. E.g., I can decide on an install that "UK" = "United Kingdom", and that "(other)" is translated to an empty field instead of the record not being imported and producing an error message.

It would also be convenient to define how errors are handled: so far a bad value means the record isn't imported, but it would be nice to have the option to treat it as a warning and import whatever is correct (i.e. not stop when the country lookup fails; just leave the field empty and import the other fields).

- The possibility to set a "fixed" value for all the imported contacts.
E.g.: set the source to "Contacts Xavier outlook", set the country to...

(I know it's not that difficult to add a column and put the same value in it, but I'd argue it'd be faster not to read/convert/look up each record when the value is a constant.)
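
A hedged sketch of what those two requests could look like: a per-install override map consulted before the standard lookup, failures downgraded to warnings that blank the field instead of rejecting the record, and a constant applied in one statement rather than per row. Every name here (the overrides, the staging table, the columns) is invented for illustration:

    use strict;
    use warnings;
    use DBI;

    # Per-install overrides consulted before the standard country lookup.
    my %country_overrides = ('UK' => 'United Kingdom', '(other)' => '');

    sub lookup_country {
        my ($raw, $known) = @_;
        $raw = $country_overrides{$raw} if exists $country_overrides{$raw};
        return $raw if $raw eq '' or $known->{$raw};
        # Downgrade to a warning: keep the record, blank the field.
        warn "unknown country '$raw'; importing with empty country\n";
        return '';
    }

    my %known = map { $_ => 1 } ('United Kingdom', 'France');
    print lookup_country('UK',      \%known), "\n";  # United Kingdom
    print lookup_country('(other)', \%known), "\n";  # empty, record kept

    # A fixed value for every imported contact: one statement, no
    # per-record read/convert/lookup.
    my $dbh = DBI->connect('DBI:mysql:database=civicrm', 'user', 'pass',
                           { RaiseError => 1 });
    $dbh->do(q{UPDATE import_tmp SET source = 'Contacts Xavier outlook'});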

X+

nmiracle

Re: Pluggable, faster import system
August 13, 2008, 10:01:25 pm
My apologies for not getting back on this sooner -- I had to take my daughter up to college, and then (within a week of moving in) she got mono, and it has been a real pain. I'll post the code sometime this weekend.


nmiracle

Re: Pluggable, faster import system
August 22, 2008, 09:04:02 pm
OK, attached is a version of the address-splitting subroutine. It is really part of a .pm module, but there isn't a provision for uploading that, so I saved it as text after pulling it out of its normal harness.

nmiracle

Re: Pluggable, faster import system
August 22, 2008, 09:05:15 pm
And here is a copy of a generalized version of the program we wrote to put things into the civi tables. Again, this has been changed into a .txt file so it could be attached.

jpm

Re: Pluggable, faster import system
November 17, 2008, 01:25:47 pm
We are looking to implement CiviCRM, but we must have a transition path off of our current systems. Is there a timeline for when these data import/export routines will be ready for commercial use?

Also, as we go through our migration, how can we contribute back to this development effort?

nmiracle

Re: Pluggable, faster import system
November 18, 2008, 03:55:04 am
If you want to use the import code I wrote in perl, it is available now (see the attachments above). It is really intended for use by a programmer, however, and isn't packaged for end-user access.

Do you have a php programmer?  I'd be happy to help them understand what we did.

We also, incidentally, ended up adding some tables for doing walklists / phonelists, but that is another story...
