CiviCRM Community Forums (archive)


This forum was archived on 25 November 2017. Learn more.

Author Topic: GSOC 2015: Predictive and Data Mining Project  (Read 3377 times)

mohit019

  • I’m new here
  • *
  • Posts: 20
  • Karma: 0
  • CiviCRM version: 4.5.8
  • CMS version: Wordpress
  • MySQL version: 5.5.41-0ubuntu0.14.04.1
  • PHP version: 5.5.9
Re: GSOC 2015: Predictive and Data Mining Project
April 16, 2015, 01:48:16 am
Hi xavier,

I asked Dr. Shailesh Kumar about co-mentoring this project, but unfortunately he has other commitments. Right now I can't find anyone else who would be willing to co-mentor it.

Please let me know if you find someone, so that I can discuss the project with them and get acquainted.

Thanks
Mohit

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
May 21, 2015, 02:56:01 pm
Hi Xavier,
Here is today's progress update.

I created the wiki page for DataMine Project. Link: http://wiki.civicrm.org/confluence/display/CRMDOC/CiviMail+Installation

I also posted about the project on my own blog. Here is the link: https://mohitishere.wordpress.com/
I look forward to posting it on the CiviCRM blog as well.

I got stuck on the CiviMail installation because of some errors. I'll figure it out and get it done today.

Also, I have installed CiviCRM using this link: http://wiki.civicrm.org/confluence/display/CRMDOC/Installing+CiviCRM+for+WordPress/
Should I also install civicrm-buildkit? https://github.com/civicrm/civicrm-buildkit
What is the difference between the two? I was a bit confused about this. It would be great if you could guide me.

Thanks
Mohit


mohit019

Re: GSOC 2015: Predictive and Data Mining Project
May 24, 2015, 12:31:05 pm
Hi Xavier,
I have successfully installed CiviMail and configured it. I'm able to send mails and have understood the basic structure.

It would be great if you could provide the dump now so that I can start working on the next steps.

Thanks
Mohit

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: GSOC 2015: Predictive and Data Mining Project
May 25, 2015, 02:35:30 am
Did you find which tables store the clicks and opens? Tell me which ones you want and I'll dump them for you.

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
May 25, 2015, 09:38:47 am
Yes I did find them.

The civicrm_mailing_event_opened table stores the opens.
The civicrm_mailing_event_trackable_url_open table stores the clicks.

One more thing: I found that whenever a recipient opens a mailing multiple times, each open is stored as a separate row in the civicrm_mailing_event_opened table, but this is not reflected in the Mailing Reports on the web interface. The 'Tracked opens' field shows only the number of unique opens. Is there any field in the report that shows the total? If not, I think it would be great to add a field to the Mailing Report that shows the total number of opens for a particular mailing.
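A minimal sketch of the total-versus-unique distinction described above. It assumes that each recipient of a mailing has one row in civicrm_mailing_event_queue and that opens reference it through an event_queue_id column; the join path through civicrm_mailing_job and the sample rows are illustrative assumptions, not taken from the dump, and SQLite stands in for MySQL here.

```python
import sqlite3


def open_counts(conn):
    """Return {mailing_id: (total_opens, unique_opens)}."""
    rows = conn.execute(
        """
        SELECT j.mailing_id,
               COUNT(o.id)                      AS total_opens,
               COUNT(DISTINCT o.event_queue_id) AS unique_opens
        FROM civicrm_mailing_event_opened o
        JOIN civicrm_mailing_event_queue q ON q.id = o.event_queue_id
        JOIN civicrm_mailing_job j ON j.id = q.job_id
        GROUP BY j.mailing_id
        """
    ).fetchall()
    return {mailing_id: (total, unique) for mailing_id, total, unique in rows}


def demo_connection():
    """Build a tiny in-memory stand-in for the three tables (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE civicrm_mailing_job (id INTEGER PRIMARY KEY, mailing_id INTEGER);
        CREATE TABLE civicrm_mailing_event_queue (id INTEGER PRIMARY KEY, job_id INTEGER);
        CREATE TABLE civicrm_mailing_event_opened (
            id INTEGER PRIMARY KEY, event_queue_id INTEGER, time_stamp TEXT);
        INSERT INTO civicrm_mailing_job VALUES (10, 1);
        INSERT INTO civicrm_mailing_event_queue VALUES (100, 10), (101, 10);
        -- Recipient 100 opens twice, recipient 101 opens once.
        INSERT INTO civicrm_mailing_event_opened VALUES
            (1, 100, '2015-05-25 09:00:00'),
            (2, 100, '2015-05-25 09:30:00'),
            (3, 101, '2015-05-25 10:00:00');
        """
    )
    return conn


if __name__ == "__main__":
    print(open_counts(demo_connection()))
```

Counting distinct queue entries is only one plausible reading of how the report derives its unique figure; the report's actual query may differ.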
« Last Edit: May 25, 2015, 09:40:37 am by mohit019 »

emilyf

  • Ask me questions
  • ****
  • Posts: 696
  • Karma: 54
  • CiviCRM version: 2.x - 4.x
  • CMS version: Drupal 5, 6, 7
Re: GSOC 2015: Predictive and Data Mining Project
June 01, 2015, 11:51:12 am
Now that we have entered the official coding section of the project, we are looking for daily updates from the student detailing the following:

- What you are working on today
- What issues / roadblocks you are trying to overcome
- Any other questions you have where we can help

Please keep us updated as to your progress! How is everything going?

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 05, 2015, 12:48:26 pm
Hi Emily,

Sorry for the late reply.

Right now I'm working on a dump of mailings that contains tables with information about mail open rates, click rates and so on. I'm trying to find patterns so that I can devise a function that makes some basic predictions about mailings.

I've become familiar with the table structure and am using Python for the data analysis. At the moment I'm focused on finding the min, max, mean and median time to open/click for different mailings.

The project is quite interesting and challenging, and I'm getting to learn a lot by working on this project.

Thanks
Mohit

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 07, 2015, 04:59:12 am
I worked on the dump and attached are the results I got so far.

I have written a Python script that extracts data from the SQL tables (civicrm_mailing_event_delivered and civicrm_mailing_event_opened), finds the time between the sending and opening of each mail, and computes the min, max and average time.

I found that there were 85 mailings in total. Some mailings are similar in terms of their min time or average time. I aim to do some more analysis to come up with a more conclusive overview of the data.

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 12, 2015, 03:31:07 am
I realized that I made some wrong assumptions about the mailings. After correcting them, I have revised the analytics data. The file is attached.

This involved working on civicrm_mailing_event_delivered, civicrm_mailing_event_opened, civicrm_mailing_event_queue and civicrm_mailing_job tables in order to find the time between sending and opening of mails, and compute min, max and average for each mailing.
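A minimal sketch of the per-mailing summary described above, assuming the joins across the four tables have already produced one (mailing_id, delivered_at, opened_at) tuple per open. The tuple layout, the timestamp format and the function name are assumptions for illustration; this is not the attached script.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def delay_stats(events):
    """Summarise the send-to-open delay (in seconds) for each mailing."""
    delays = defaultdict(list)
    for mailing_id, delivered_at, opened_at in events:
        delta = datetime.strptime(opened_at, FMT) - datetime.strptime(delivered_at, FMT)
        delays[mailing_id].append(delta.total_seconds())
    return {
        mailing_id: {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "median": median(values),
        }
        for mailing_id, values in delays.items()
    }


if __name__ == "__main__":
    sample = [  # made-up rows
        (1, "2015-06-01 10:00:00", "2015-06-01 10:05:00"),
        (1, "2015-06-01 10:00:00", "2015-06-01 10:15:00"),
        (2, "2015-06-02 09:00:00", "2015-06-02 10:00:00"),
    ]
    for mailing_id, stats in delay_stats(sample).items():
        print(mailing_id, stats)
```

The same function would work for clicks if the third element of each tuple is the click timestamp instead.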

I aim to do the same for clicks soon and then see how I can apply linear regression techniques.
« Last Edit: June 12, 2015, 03:32:50 am by mohit019 »

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 12, 2015, 02:36:07 pm
Some more data analytics (about clicks).

This involved working on the civicrm_mailing_event_trackable_url_open, civicrm_mailing_event_delivered, civicrm_mailing_event_queue and civicrm_mailing_job tables in order to find the time between the sending of a mail and the clicks on its links, and to compute the min, max and average for each mailing.

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 15, 2015, 03:57:20 am
Here is the link to the github repository for this project:

https://github.com/mohit-agarwal/Data-Mine-GSoC-Project


mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 21, 2015, 01:44:18 pm
I've been working on the small dump Xavier provided. I applied linear regression to the dump but did not get good results. I took the sent time of the mail (ignoring the date/day) and the email_id of the recipient (to take into account the person the mail was sent to) as the independent variables and tried to predict the open_time of the mail (the time at which the mail was first opened).

I got the linear equation y = x1 * (sent_time) + x2 * (email_id) + c.
Here I'm getting a very high value for the intercept c, which swamps the effect of the coefficients x1 and x2. The residuals range from -15 hrs to +8 hrs, so the predictions are very far from the real data.

After discussing with Xavier, I realized I also need to take into account the day on which the mail was sent. So I will now add the day as one of the independent variables, and I will predict open_time as the time since the mail was sent (in seconds). I hope to get better results with this approach.
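A minimal sketch of how the variables mentioned above could be derived from two timestamps, assuming sent_time is minutes since midnight, the day is encoded as Python's weekday number (Monday = 0), and the target is seconds between sending and the first open. The encoding, the dictionary keys and the function name are illustrative assumptions; the project's own code may encode them differently.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def features(sent_at, opened_at):
    """Derive regression inputs and the target from two timestamps."""
    sent = datetime.strptime(sent_at, FMT)
    opened = datetime.strptime(opened_at, FMT)
    return {
        # Time of day the mail was sent, ignoring the date.
        "sent_time": sent.hour * 60 + sent.minute,
        # Day of the week the mail was sent (Monday = 0, Sunday = 6).
        "day": sent.weekday(),
        # Target: seconds elapsed between sending and the first open.
        "open_delay": int((opened - sent).total_seconds()),
    }


if __name__ == "__main__":
    print(features("2015-06-22 10:54:00", "2015-06-22 14:00:00"))
```

Treating the weekday as a single number imposes an ordering on the days; encoding it as separate indicator variables is a common alternative.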

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 22, 2015, 09:36:38 am
I tried taking all three factors (sent_time, week_day, email_id/mailing_id) into account and built a linear regression model to predict open_time (here I am taking open_time as the time since 00:00 hrs).

I got the following results:

Model 1
Independent variables : sent_time (in minutes), email_id, day
Dependent variable : open_time (in minutes)

Coefficients:
(Intercept)    sent_time     email_id          day
  888.98032      0.05877     -0.02989      0.71612

Residuals:
    Min      1Q  Median      3Q     Max
-954.04 -177.38   29.97  235.56  517.03

Positive Residual Mean = 230.8751 (~ 4 hrs)
Negative Residual Mean = -271.0273 (~ 4.5 hrs)

Model 2
Independent variables : sent_time (in minutes), mailing_id, day
Dependent variable : open_time (in minutes)

Coefficients:
(Intercept)    sent_time   mailing_id          day
  859.95509      0.08822      4.27276    -14.70451

Residuals:
    Min      1Q  Median      3Q     Max
-975.44 -187.02   31.98  225.68  516.58

Positive Residual Mean = 233.8716 (~ 4 hrs)
Negative Residual Mean = -265.854 (~ 4.4 hrs)

Observations
Total tuples :  2153

Min sent_time : 654
Max sent_time : 1150
Avg sent_time : 1004

Given that sent_time (in the training data) only varies from 654 to 1150, the predicted open_times also fall in a small range (they mainly cluster around the intercept). Due to this, I get extreme min and max residual values.
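A small worked example of the point above, using the Model 1 coefficients and the min and max sent_time from the Observations list. Holding email_id and day fixed at 0 is an assumption made only to isolate the effect of sent_time; the function names are hypothetical.

```python
# Model 1 coefficients as reported in the post (open_time in minutes).
INTERCEPT = 888.98032
COEF = {"sent_time": 0.05877, "email_id": -0.02989, "day": 0.71612}


def predict_open_time(sent_time, email_id, day):
    """Evaluate the fitted linear equation for one mail."""
    return (
        INTERCEPT
        + COEF["sent_time"] * sent_time
        + COEF["email_id"] * email_id
        + COEF["day"] * day
    )


def sent_time_spread(lo=654, hi=1150, email_id=0, day=0):
    """Change in predicted open_time across the observed sent_time range."""
    return predict_open_time(hi, email_id, day) - predict_open_time(lo, email_id, day)


if __name__ == "__main__":
    print(f"spread due to sent_time: {sent_time_spread():.2f} minutes")
```

Across the whole observed sent_time range the prediction moves by roughly 29 minutes, which is small next to residuals of several hundred minutes.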

The positive and negative residual means are for the training data. The predicted open_time values show an error of about +/- 4 hrs. I'm not sure how good or bad this error rate is. Also, it is measured on the training data (the data I used to build the model), so I expect a greater error rate when I test the model on new data. How close do we want the predictions to be to the real data?
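A minimal sketch of how the positive and negative residual means could be computed, assuming a residual is defined as actual minus predicted (the convention R uses). The function name and the sample numbers are made up for illustration.

```python
from statistics import mean


def residual_means(actual, predicted):
    """Return (mean of positive residuals, mean of negative residuals)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    positive = [r for r in residuals if r > 0]
    negative = [r for r in residuals if r < 0]
    # Fall back to 0.0 when one side has no residuals at all.
    return (
        mean(positive) if positive else 0.0,
        mean(negative) if negative else 0.0,
    )


if __name__ == "__main__":
    actual = [1000, 1200, 700, 900]     # made-up open_times in minutes
    predicted = [900, 900, 900, 900]    # made-up model output
    print(residual_means(actual, predicted))
```

Running the same function on a held-out set of mails gives a test-set figure that can be compared directly with the training-set one.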

I did some analysis on the data and found that it contains only 25 unique sent_times (ignoring the date, most of the mails are sent at the same time of day). Because of this, sent_time cannot significantly affect the predicted open_time, and I get nearly the same open_time for most of the test cases. The data also contains only 4 unique days and 15 unique mailing_ids. I think having a larger dump will improve the results.

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 23, 2015, 05:01:55 am
I have pushed the R code that builds the model and generates the results for the training and test data sets.

You can find the code here:
https://github.com/mohit-agarwal/Data-Mine-GSoC-Project

The commands to run the code are given in the README.

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
June 26, 2015, 01:48:59 pm
I've written a PHP script that you can host on your local server and use to build a prediction model and test it on new data. It asks the user to choose the independent and dependent variables and then uses R libraries to build the prediction model. I aim to add more functionality to this module (for example, letting the user decide which data to learn from). I have not put much emphasis on the UI so far. Please let me know your views.

I've pushed the code to the GitHub repo. The necessary instructions are in the README file. Here is the link:

https://github.com/mohit-agarwal/Data-Mine-GSoC-Project


This forum was archived on 2017-11-26.