CiviCRM Community Forums (archive)


This forum was archived on 25 November 2017. Learn more.




Author Topic: GSOC 2015: Predictive and Data Mining Project  (Read 3377 times)

mohit019

  • I’m new here
  • *
  • Posts: 20
  • Karma: 0
  • CiviCRM version: 4.5.8
  • CMS version: Wordpress
  • MySQL version: 5.5.41-0ubuntu0.14.04.1
  • PHP version: 5.5.9
GSOC 2015: Predictive and Data Mining Project
March 07, 2015, 09:33:14 am
Hi,
I am Mohit Aggarwal, a final-year undergraduate pursuing a Bachelor of Technology in Computer Science and Engineering at IIIT Hyderabad, India.

Regarding my open source experience, I was selected for Google Summer of Code 2014 and worked with Benetech on the open source MathML cloud software application.

I am quite interested in working on the 'Predictive and Data Mining' project. I think I have the skills and experience needed for it.

I have successfully installed CiviCRM and hosted it inside the WordPress CMS. Right now I am focusing on understanding the civicrm-core codebase. As per your advice, I have also submitted a pull request (PR #5333) that deals with the CRM-11369 issue. I have also started answering questions on Stack Overflow so that I can earn a reputation of 200.

If possible, I would like to discuss the project in more detail, so that I can build a clear timeline for my proposal. It would also be great if you could list some more issues, so that I can demonstrate my understanding of the CiviCRM codebase by submitting more PRs.

Thanks and Regards
Mohit Aggarwal


petednz

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4899
  • Karma: 193
    • Fuzion
  • CiviCRM version: 3.x - 4.x
  • CMS version: Drupal 6 and 7
Re: GSOC 2015: Predictive and Data Mining Project
March 08, 2015, 02:50:18 pm
Hi Mohit - what is your stackexchange handle - maybe we can help you get the 200

I couldn't see any likely names at http://area51.stackexchange.com/proposals/77367?phase=commitment&committers=mostrecent#tab-top
Sign up to StackExchange and get free expert advice: https://civicrm.org/blogs/colemanw/get-exclusive-access-free-expert-help

pete davis : www.fuzion.co.nz : connect + campaign + communicate

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
March 09, 2015, 03:40:47 am
Hi,
My stack exchange handle is mohit-agarwal.

Right now, I have very little reputation. It would be great if you could suggest ways to increase it.

Meanwhile, I would also like to start working on the project. It would help me a lot if you could provide more details about it. I have good experience in statistics and data mining, and would therefore like to discuss this project further.
« Last Edit: March 09, 2015, 03:56:36 am by mohit019 »

xavier

  • Forum Godess / God
  • I’m (like) Lobo ;)
  • *****
  • Posts: 4453
  • Karma: 161
    • Tech To The People
  • CiviCRM version: yes probably
  • CMS version: drupal
Re: GSOC 2015: Predictive and Data Mining Project
March 11, 2015, 11:55:43 pm
Quote from: mohit019 on March 09, 2015, 03:40:47 am
Hi,
My stack exchange handle is mohit-agarwal.

Right now, I have very little reputation. It would be great if you could suggest ways to increase it.

Answering questions and asking good ones is the best way.

Quote from: mohit019 on March 09, 2015, 03:40:47 am
Meanwhile, I would also like to start working on the project. It would help me a lot if you could provide more details about it. I have good experience in statistics and data mining, and would therefore like to discuss this project further.

Cool, then you will probably be able to answer some questions on these topics too.

What did you read about the project? What experience do you have?
-Hackathon and data journalism about the European parliament 24-26 jan. Watch out the result

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
March 12, 2015, 10:49:39 am
Hi xavier,
From what I understood, we need to add prediction to the existing data. For that, we would need to do pattern recognition on the existing data. We could also do regression analysis, but that depends on the data and on the relationships that exist between the constituents. If we can model the data as a collection of dependent and independent variables and find the relationships among them, then we can use parametric methods such as linear regression to add a predictive factor to the data.

I went through this link: http://wiki.civicrm.org/confluence/display/CRMDOC/The+codebase and got an idea of the database structure. I think we can use clustering techniques to group similar entities, for instance events, based on their attributes (which would act as features). Then, for a new event, we can predict the number of participants (by taking the average) based on the cluster it belongs to. We could even try fuzzy clustering to get better results.

Please let me know if my approach is correct and if I'm heading in the right direction. Right now, I'm focusing on understanding the database by going through the codebase, and will be able to come up with more ideas once I fully understand it.

Regarding my experience, I have done various projects in the field of Machine Learning and Data Mining. For instance, I built a scalable and efficient search engine using 40GB of Wikipedia data. Then I did a project that aimed at extracting the context and meaning of a sentence from a tweet and linking it to a Wikipedia page for better understanding. I also built a Face Recognition Tool that could recognize a person by comparing characteristics of the face to those of known individuals, using Eigenfaces. My B.Tech project aimed at improving the accuracy of a classifier by finding the best possible combination of features using spectral clustering. I do feel I have the required skills and experience to do this project, and would be eager to learn new things if need be. I think it is a challenging and interesting project, and working on it would be a great learning experience.
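As a rough illustration of the linear-regression idea, here is a minimal sketch of an ordinary least squares fit with one independent variable. The variable names and the sample numbers (emails sent versus registrations) are hypothetical and are not taken from any CiviCRM installation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical history: emails sent for past events vs. registrations received.
emails_sent = [100, 200, 300, 400]
registrations = [12, 22, 32, 42]

slope, intercept = fit_line(emails_sent, registrations)
# Predicted registrations if 500 emails were sent for a new event.
predicted = slope * 500 + intercept
```

Whether a straight line is a reasonable model depends entirely on the data, which is the caveat already raised in the post.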

Thanks

JoeMurray

  • Administrator
  • Ask me questions
  • *****
  • Posts: 578
  • Karma: 24
    • JMA Consulting
  • CiviCRM version: 4.4 and 4.5 (as of Nov 2014)
  • CMS version: Drupal, WordPress, Joomla
  • MySQL version: MySQL 5.5, 5.6, MariaDB 10.0 (as of Nov 2014)
Re: GSOC 2015: Predictive and Data Mining Project
March 14, 2015, 07:21:39 pm
Hi Mohit,
What would prove most useful is to be able to make predictions about how likely an individual is to respond positively or negatively or neutrally to a particular engagement action given their relationship history. This history is composed of all of the outreach actions to them and their reaching back to the organization (these blend a bit), such as bulk emails, personal emails, petition signings, survey responses, phone calls, meetings and other custom activities, contributions of various sorts (some will be purchases of goods or services and others will be donations), purchases and renewals of memberships, participation in events through registering, cases they have been involved in (these are sometimes used as workflow for selling memberships, but they can also be for things like helping a person get housing), grants they may have applied for and/or received, and so on. Secondary sources of information come from the associated user accounts in WordPress, Drupal or Joomla! particularly for sites with significant user generated content, and from various extensions. For example, we have an extension that can draw in contacts' social media accounts and their recent posts to the social web.

Some installations also have appended census tract information on demographics based on the home postal code of their contacts. Political campaigns often have significant salient information on past voting intentions, taking a lawn sign, etc.

Some installations have done more work to map relationships among the contacts in their database. For example, they will know the employment history of leaders in a field like Canadian non-profit professionals working in foreign aid, or the board appointments of business leaders in a polluting sector of the economy, or friendship and professional networks of high net worth individuals in a city as philanthropy prospects.

By positive response I mean completing a user journey to make a donation, buy a membership, buy a ticket or sponsorship for an event, or respond positively to a major donor request to make a bequest sometime in the next 5 years. By neutral I mean something close to ignoring the outreach. By negative response, I'm trying to indicate the negative reaction that comes from spamming or asking too much or too often or too early in a relationship.

Being able to make these sorts of predictions would help staff figure out how to segment and design their engagement campaigns. Should we send one email followed by a phone call to our top 50 prospects, or an email followed by direct mail to our top 500 prospects?

Some of this might benefit from cluster analysis.

As CiviCRM is used by organizations of widely varying sizes with widely varying amounts of data on their users, it would be highly useful to know when insufficient information is available to make a prediction with a given level of confidence.
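One way to make "insufficient information" concrete is to refuse to report a prediction when the sample is too small or the confidence interval is too wide. The sketch below is only an illustration: the `MIN_SAMPLES` threshold and the `max_half_width` parameter are hypothetical choices, and it uses a normal approximation for the interval, which is itself a simplifying assumption for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

MIN_SAMPLES = 5  # hypothetical threshold; a real value would need tuning

def mean_with_interval(values, confidence=0.95, max_half_width=None):
    """Return (mean, low, high), or None when the data is insufficient.

    Uses a normal approximation for the interval (an assumption; for
    small samples a t-distribution would be more appropriate).
    """
    n = len(values)
    if n < MIN_SAMPLES:
        return None  # too few past observations to say anything
    centre = mean(values)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half = z * stdev(values) / sqrt(n)
    if max_half_width is not None and half > max_half_width:
        return None  # interval too wide for the requested precision
    return centre, centre - half, centre + half

# Hypothetical attendance figures for six comparable past events.
past_attendance = [100, 110, 90, 105, 95, 100]
estimate = mean_with_interval(past_attendance)
```

An organization with only two past events would get `None` here rather than a misleadingly precise number.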

I've thrown out a bunch of ideas here, and your project should focus down on something specific. It would likely be best if you had a specific project / client that would allow you to develop generally useful tools and techniques.

HTH.

Cheers,
Joe
Co-author of Using CiviCRM https://www.packtpub.com/using-civicrm/book

xavier

Re: GSOC 2015: Predictive and Data Mining Project
March 17, 2015, 06:37:42 am
Hi,

As for predicting the income of a fundraising campaign or an event registration, wouldn't it be easier and more accurate to simply look at how people acted on previous events? For example, if we see that in our events 50% of the participants have booked 2 months before the event, then for an event that is two months away we expect the final number of participants to be today's number x 2. There are various data points to take into account (type of event, event fee amounts), and a registration date can be read in a lot of different ways (x days from the start of registration = y days before the event, or as a day of the week, or...). To make it fancier, map that against other data (mailings containing a URL to the event page, activity on Twitter...).
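The booking-curve idea can be sketched as follows. The data layout (a list of "days before the event" values per past event) and the function names are assumptions made for illustration only.

```python
def historical_fraction(past_events, days_before):
    """Average share of final registrations already booked `days_before` days out."""
    shares = []
    for booking_days in past_events:
        if not booking_days:
            continue  # skip events with no registrations
        booked = sum(1 for d in booking_days if d >= days_before)
        shares.append(booked / len(booking_days))
    return sum(shares) / len(shares) if shares else None

def project_final(current_registrations, fraction):
    """Scale today's count by the share normally booked at this point."""
    if not fraction:
        return None  # no history, or nobody books this early
    return current_registrations / fraction

# Hypothetical history: for each past event, how many days before the
# event each registration was made.
past_events = [
    [90, 70, 30, 10],
    [80, 65, 61, 40, 20, 5],
]

fraction = historical_fraction(past_events, days_before=60)  # 0.5 here
projected = project_final(40, fraction)                      # 40 / 0.5
```

With half of registrations historically in by the 60-day mark, 40 registrations today project to 80 at the event, which is the "x 2" rule from the post.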

X+

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
March 23, 2015, 03:16:06 pm
Hi all,
Thanks for providing your ideas on the project.

I do understand that the scope of the project is quite big. Therefore I have decided to focus on the Events/Campaigns entity since making predictions about various things related to events (expected attendance, income and so on) would be quite useful data. Once we have the algorithms that are able to make predictions about the events, we can extend it to other domains also.

I think clustering can play an important role in this. We can cluster similar events using clustering techniques like k-means++ or spectral clustering. We can use the event information (event type, location, fees, number of volunteers, number of attendees, etc) to define the feature vector for each event. There are many ways to compute the similarity between any 2 events. One way is to plot each event as a point in the vector space and use the Euclidean distance between points. We can then use this similarity function to cluster similar events.

Once we have clusters of events, we can make predictions with some level of confidence. For instance, suppose I need to predict the number of attendees expected for an event. I look at all the events in the cluster to which this event belongs and take the average number of attendees across them. The average over similar events gives a very basic estimate of the number of attendees expected for this event. We can make it more accurate by giving different weights to different features in the event feature vector. For instance, we can give weights to different timestamps. As xavier said, if I know that 50% of the participants book by 2 months (1 timestamp) before the event, then for an event that is two months away we expect the final number of participants to be today's number x 2. Similarly, if I know the number of participants at a more recent timestamp (say 1 month before), I'll give more weight to it when predicting the final number of participants.

We can also take into account other data points, like activity about the event on social media, mailings containing links to the event, etc. The more discussion there is about the event, the higher the probability of a good turnout. To predict the income from an event, we can mine the users' data, look at their history (how much they have contributed in similar events, how many booked but did not attend, etc) and come up with numbers that give a fair estimate of the expected income.

We can also group similar users into clusters. We can then use this information to find more users for an event. For instance, if we have x users registered for an event 2 months before, we can send a mail inviting (to this event) the users from the cluster of each of the x users. Users with similar interests are likely to participate in the same event. This approach can also be used to decide whether a user would make a donation, buy a membership, or buy a ticket or sponsorship for an event. Based on a user's own history and similar users' history, one can predict whether they would respond positively, neutrally or negatively.
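To show what clustering on Euclidean distance looks like in practice, here is a minimal sketch of plain k-means (Lloyd's iterations). It is deliberately simpler than what the post names: the initial centroids are supplied by the caller rather than chosen by k-means++ seeding, and the two-dimensional feature vectors are made-up, already-scaled values.

```python
from math import dist

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members (keep it if empty)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations from caller-supplied starting centroids."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, centroids

# Hypothetical events as (scaled fee, scaled volunteer count) vectors.
events = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9)]
labels, centroids = kmeans(events, centroids=[events[0], events[2]])
```

Because Euclidean distance is sensitive to scale, features such as fees and head counts would need to be normalised before being compared this way.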
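The cluster-average estimate described here could look like the sketch below, assuming each past event already carries a cluster label. The dictionary keys `cluster` and `attendees` are hypothetical names, not CiviCRM fields.

```python
from statistics import mean

def predict_attendance(new_event_cluster, past_events):
    """Average attendance of past events in the same cluster, or None."""
    peers = [
        e["attendees"] for e in past_events if e["cluster"] == new_event_cluster
    ]
    if not peers:
        return None  # no comparable events, so no estimate
    return mean(peers)

# Hypothetical labelled history.
past_events = [
    {"cluster": 0, "attendees": 100},
    {"cluster": 0, "attendees": 140},
    {"cluster": 1, "attendees": 20},
]

estimate = predict_attendance(0, past_events)
```

This is only the baseline; the weighting by features or timestamps mentioned in the post would refine it.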
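As a simplified stand-in for user clustering, the sketch below finds users whose event history overlaps with that of already-registered users, using Jaccard similarity on sets of attended events. This is a neighbourhood lookup rather than a true clustering step, and the user names, event names and `threshold` value are all hypothetical.

```python
def jaccard(a, b):
    """Share of events two users have in common, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def invite_candidates(history, registered, threshold=0.5):
    """Unregistered users similar to at least one registered user."""
    candidates = set()
    for user in registered:
        own_events = history.get(user, set())
        for other, events in history.items():
            if other in registered:
                continue
            if jaccard(own_events, events) >= threshold:
                candidates.add(other)
    return candidates

# Hypothetical attendance history: user -> set of past events attended.
history = {
    "alice": {"gala", "run"},
    "bob": {"gala", "run", "talk"},
    "carol": {"webinar"},
    "dave": {"gala"},
}

to_invite = invite_candidates(history, registered={"alice"})
```

Here bob and dave share enough history with alice to be invited, while carol does not.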

These are my observations at an abstract level. Please provide your views on this and let me know if I'm going in the right direction.

JoeMurray

Re: GSOC 2015: Predictive and Data Mining Project
March 24, 2015, 08:13:08 am
This seems like a very good approach to the problem that Nicolas and Xavier are interested in, namely, predicting things at the level of a campaign or event.

Personally I would prefer more actionable predictions and data mining, such as how doing some action x (eg sending another email, or phonebanking the invitees who have not responded) will cause some outcome to change (eg attendance or revenue for the event / campaign). But I don't want to flog a dead horse if Nicolas and Xavier are not as keen on this.

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
March 25, 2015, 03:03:37 pm
Hi Joe,
I agree with you that more actionable predictions, such as how doing some action x changes an outcome, would be more useful. However, I feel this would require more data mining and more complex algorithms. I do believe that clustering to find similar events/campaigns, and then using the history of similar events to decide on some action x, can help solve this problem. I think I would be able to work on these predictions with a better understanding once I have coded the algorithms that answer basic questions about events/campaigns. Working on this problem would be an interesting task.

Also, I had one query. Right now, I'm focusing on predictions related to events/campaigns. This again has a big scope. It would be great if you could list some specific predictions (like income of an event, attendance, etc) that you want me to work on in this project. This would help me with the proposal: once I know the specific predictions I need to make, I can work on the required algorithms and implementation accordingly.

Thanks
Mohit

JoeMurray

Re: GSOC 2015: Predictive and Data Mining Project
March 25, 2015, 04:36:09 pm
I think focussing on attendance and income for an event is a good start. If there is only 1 ticket price and nothing else, then they are equivalent. Discounts, usually based on buying sufficiently far in advance of the event, are one type of price difference. Another thing that could be modelled is selling tickets and selling sponsorships. To keep focus, I would concentrate on evaluating a campaign that is aimed at selling tickets for an event.

It would be best if you could find an organization that has adequate data to allow you to get relevant results with tight enough confidence intervals.

When graphing the predictions, it would be useful to indicate the confidence interval for the predictions made at different points in time. It would be highly useful to see how the prediction has been changing during the period leading up to the event.

In order to get enough data, especially for organizations without many events of their own (eg they run a single annual fundraising event), it might be beneficial to create an infrastructure for organizations to opt in to sharing relevant statistics about such events. A phase 2, 2016 project ;). In this case, one would likely want to collect all of the data points that might be relevant causal factors in the success of the event, but in a way that does not require the whole db to be uploaded, in order to deal with privacy concerns. For example, how many contacts were emailed in how many mailings at what period in advance of the event, and info about the contacts, eg the distribution of the number of previous event purchases, donations, memberships, other activities, etc. In fundraising the key predictive metrics on an individual are often expressed as RFM, usually parsed as recency, frequency, monetary amount (try googling).
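For readers unfamiliar with RFM, a minimal sketch of computing it per contact is shown below. The input layout, a list of `(contact_id, date, amount)` tuples, is an assumption for illustration and does not reflect the actual CiviCRM contribution schema.

```python
from datetime import date

def rfm(contribution_rows, today):
    """Compute recency (days), frequency (count) and monetary (sum) per contact.

    `contribution_rows` is an iterable of (contact_id, date, amount) tuples.
    """
    summary = {}
    for contact_id, when, amount in contribution_rows:
        entry = summary.setdefault(
            contact_id, {"last": when, "frequency": 0, "monetary": 0.0}
        )
        entry["last"] = max(entry["last"], when)
        entry["frequency"] += 1
        entry["monetary"] += amount
    return {
        contact_id: {
            "recency_days": (today - e["last"]).days,
            "frequency": e["frequency"],
            "monetary": e["monetary"],
        }
        for contact_id, e in summary.items()
    }

# Hypothetical contribution history.
rows = [
    (1, date(2015, 1, 10), 50.0),
    (1, date(2015, 3, 1), 25.0),
    (2, date(2014, 12, 1), 100.0),
]

scores = rfm(rows, today=date(2015, 3, 25))
```

Summaries of this kind, or distributions over them, are the sort of aggregate that could be shared without uploading the whole database.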

xavier

Re: GSOC 2015: Predictive and Data Mining Project
March 26, 2015, 01:38:47 am
Quick reminder: I'm not sure whether the student(s) interested in this project have done it already, but you have to register and submit the proposal before tomorrow.

http://forum.civicrm.org/index.php/topic,36143.0.html

If it's done already, all good. We'll discuss internally and with you, help you get to know our development workflow and the tools we use, and give you a chance to mingle with the community.

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
March 26, 2015, 03:32:24 pm
Hi,
I have submitted the proposal. It would be great if you could provide your feedback on it.

Thanks
Mohit

mohit019

Re: GSOC 2015: Predictive and Data Mining Project
April 09, 2015, 12:20:23 pm
I am posting my proposal below. It would be great if you could provide your feedback.

PERSONAL INFORMATION

Name: Mohit Aggarwal

Email: mohit.agarwal019@gmail.com

School/University: International Institute of Information Technology Hyderabad

Graduation Date: 2015

Major/Focus: Computer Science and Engineering

Location: India

Timezone: IST (UTC+05:30)

Drupal.org Profile: NIL

IRC: mohit019

CV/Resume URL: http://mohit-agarwal.github.io/Resume.pdf

Preferred time of day for virtual/video interview: IST 11:00 to IST 23:00

Open Source Experience

I have been contributing voluntarily to the codebase of Benetech (a non-profit organization that develops technology to create positive social change) since March 2014. I worked on the MathML cloud accessibility project and made significant contributions to the MathML Cloud API project by adding various features to the API. I also helped set up a server for the MathML Cloud API project on Google Compute Engine. During my internship with Google, I participated in the Google Serve event on ‘Global Literacy - Accessibility Code Sprint’, wherein I joined Benetech for the ‘Google Docs Addon for Accessible Math project’ (http://goo.gl/Ky6QMx).


GSOC INFORMATION


If you have participated in Google Summer of Code in the past, please describe your participation.

I got selected for Google Summer of Code 2014 for Benetech. I worked on the project ‘Accessible Math for the Blind and Vision Impaired - MathML Cloud’ (https://github.com/benetech/svgtex/wiki/Ideas-for-GSoC-2014). After 1 month, I had to officially drop from GSoC as I got selected for Google Summer Internship 2014 (Google Interns are not eligible for GSoC). However, I continued to work for Benetech and completed the whole project as a volunteer. Out of my passion for open source contribution, I still continue to voluntarily contribute to Benetech.

Have you applied to GSoC in the past and are you applying to any other organizations this year? If so, please explain.

I applied to GSoC in 2014 and got selected for Benetech. I’m not applying for any other organization this year.

How many hours will you devote to your GSoC project each week? What are your other summer plans (vacation, this time of year isn't your summer and you are in class, etc)?


I will be able to devote my entire time to the project as I will be on vacation. I have no other commitments during the 3 month timeline. I plan to contribute around 40 hours a week to this project. I am very interested in contributing to CiviCRM. I think this opportunity would be a great platform for me to expand my horizons and contribute to your organization using my diverse skill set.


CIVICRM INFORMATION

Have you registered an account at CiviCRM.org?


Yes. My username is mohit019.

Have you ever built a site with Drupal, WordPress, or Joomla?  (Please provide details)

Yes, I have built a site using Drupal. I worked on a project (under the Structured System Analysis and Design course) that aimed to build a virtual classroom in which students from different areas can watch the lecture videos posted by the teacher and at the same time interact with each other, giving them the feel of a real classroom. I chose the Drupal CMS to implement this product, which had functionality such as registration, login for students, a discussion forum, security and so on.

I also have experience working with Wordpress. My GSoC 2014 project with Benetech was to enhance the functionality of the mathjax-latex Wordpress plugin to support the MathML Cloud API.

Have you ever built a CiviCRM site or helped on a CiviCRM project? 

No.

Have you ever posted a question to the CiviCRM Forums, JIRA, or GitHub?

Yes. I have been posting questions on CiviCRM Forums to get proper understanding of the project.

Have you ever contributed code to CiviCRM?


No, not yet. I have successfully installed CiviCRM and hosted it inside the WordPress CMS. Right now I am focusing on understanding the civicrm-core codebase. I am also currently working on a bug and have submitted a pull request (PR #5333) that deals with the CRM-11369 issue (https://github.com/civicrm/civicrm-core/pull/5333).


TECHNICAL INFORMATION

Have you ever utilized IRC?

Yes. While working on open source projects, IRC becomes my main source of communication.

Have you ever worked with Git?

Yes. I have good experience with Git. I have worked on many open source projects, and every time I used Git for version control. My GitHub link: https://github.com/mohit-agarwal

Question based on knowledge of git:

You just committed some code (and not pushed yet), but you realized there is a typo in commit message. How would you change it (please explain each step of the solution) ?


Since the commit has not been pushed yet, we can replace its message directly from the command line with the following command:

git commit --amend -m "Updated message"

Question based on given 'almighty_function':

Given this function:

function gsoc_funciton($x, $y, $z) {
 if ($y != $z && $x == $y && $x == $z) {
    return "Success!";
 }
return "FAIL!"
}


Please provide set of values for $x, $y and $z for function to return "Success!". Explain your solution.

What, if anything would you change or fix in the function?

There is no set of values of $x, $y and $z for which the given function will return Success. All the 3 conditions in the if statement can never be true simultaneously.

In order for the given function to return Success, we can drop any one of the conditions in the if statement. Or we can change the first condition to '$y==$z'. Then this function will return Success when all the variables have the same value.


PROJECT INFORMATION


Which project idea sparks your interest and why?

Project - Predictive and Data Mining

I find this project quite interesting since it involves applying machine learning and data mining techniques to huge datasets. I have a good amount of interest and experience in this field. The problem does not have a well defined solution and requires exploring different options in order to come up with efficient algorithms that can help make predictions. I think this makes the project quite challenging and exciting. I also believe that this project, once implemented, would prove quite useful to users and organizations, since they would be able to make actionable predictions about how outcomes (eg attendance or revenue for the event / campaign) can be changed. Working on this project would be a great learning experience.

Treating this project as a real proposal, provide your implementation plan with as much detail as possible such as weekly time breakdowns, methods of mentor communication, project management, and when to expect specific results/deliverables.

Project Overview

The main aim of the project is to add prediction to the existing data, i.e. to be able to make predictions about how likely an individual is to respond positively, negatively or neutrally to a particular engagement action given their relationship history.

- A positive response means completing a user journey to make a donation, buy a membership, buy a ticket or sponsorship for an event, or respond positively to a major donor request to make a bequest sometime in the next 5 years.
- A neutral response means something close to ignoring the outreach.
- A negative response means the negative reaction that comes from spamming, or from asking too much, too often or too early in a relationship.

The relationship history is composed of all of the outreach actions to them and their reaching back to the organization (these blend a bit), such as bulk emails, personal emails, petition signings, survey responses, phone calls, meetings and other custom activities, contributions of various sorts (some will be purchases of goods or services and others will be donations), purchases and renewals of memberships, participation in events through registering, cases they have been involved in (these are sometimes used as workflow for selling memberships, but they can also be for things like helping a person get housing), grants they may have applied for and/or received, and so on.

Need

As CiviCRM is used by organizations of widely varying sizes with widely varying amounts of data on their users, it would be highly useful to know when insufficient information is available to make a prediction with a given level of confidence.

Approach

I discussed the project in detail with the mentors and finally came up with the approach mentioned below.

The scope of the project is quite big. Therefore I have decided to focus on the Events/Campaigns entity since making predictions about various things related to events (expected attendance, income and so on) would be quite useful data. Once we have the algorithms that are able to make predictions about the events, we can extend it to other domains also.

I would start by working on predicting the attendance and income for an event. To keep focus, I would focus on evaluating a campaign that is aimed at selling tickets for an event.

In order to make predictions with good confidence levels, I would start with an organization that has adequate data to allow me to get relevant results with tight enough confidence intervals. I would be able to test the accuracy and efficiency of my algorithms if I select an organization that has sufficient data (to be used as training data) for data mining and machine learning.

Once I’ve implemented the algorithms, I plan to graph the predictions so as to indicate the confidence interval for the predictions made at different points in time. It would be highly useful to see how the prediction has been changing during the period leading up to the event.

I then plan to move on to more actionable predictions and data mining, such as how doing some action x (eg sending another email, or phone banking the invitees who have not responded) will cause some outcome to change (eg attendance or revenue for the event / campaign).

Future Work


There will be many organizations for which the CiviCRM database does not hold enough data to be used as training data for machine learning. In order to get enough data, especially for organizations without many events of their own (eg they run a single annual fundraising event), there will be a need to create an infrastructure for organizations to opt in to sharing relevant statistics about such events. In this case, one would likely want to collect all of the data points that might be relevant causal factors in the success of the event, but in a way that does not require the whole db to be uploaded, in order to deal with privacy concerns.

Implementation

Events - Feature Extraction and Clustering

Clustering can play an important role in making predictions. We can cluster similar events using clustering techniques like k-means++ or spectral clustering. We can use the event information (event type, location, fees, number of volunteers, number of attendees, etc) to define the feature vector for each event (Feature Extraction). There are many ways to compute the similarity between any 2 events. One way is to plot each event as a point in the vector space and use the Euclidean distance between points. We can then use this similarity function to cluster similar events.
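The feature-extraction step could be sketched as below: numeric fields are min-max scaled to the range 0 to 1 and a categorical field is one-hot encoded, so that Euclidean distance treats all features on a comparable scale. The field names (`type`, `fee`, `volunteers`) and the choice of scaling are assumptions for illustration, not the actual CiviCRM event schema.

```python
def build_vectors(events, numeric_fields, category_field):
    """Turn event dicts into numeric feature vectors.

    Numeric fields are min-max scaled; the category field is one-hot encoded.
    Returns (vectors, categories) where `categories` gives the one-hot order.
    """
    categories = sorted({e[category_field] for e in events})
    ranges = {}
    for field in numeric_fields:
        values = [e[field] for e in events]
        ranges[field] = (min(values), max(values))

    vectors = []
    for e in events:
        vec = []
        for field in numeric_fields:
            lo, hi = ranges[field]
            # A constant field carries no information; map it to 0.0.
            vec.append((e[field] - lo) / (hi - lo) if hi > lo else 0.0)
        vec.extend(1.0 if e[category_field] == c else 0.0 for c in categories)
        vectors.append(vec)
    return vectors, categories

# Hypothetical events.
events = [
    {"type": "Conference", "fee": 100, "volunteers": 10},
    {"type": "Fundraiser", "fee": 50, "volunteers": 4},
    {"type": "Conference", "fee": 0, "volunteers": 2},
]

vectors, categories = build_vectors(events, ["fee", "volunteers"], "type")
```

The resulting vectors can be fed to whichever clustering method is chosen.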

Making Predictions

Once we have clusters of events, we can make predictions with some level of confidence. For instance, suppose I need to predict the number of attendees expected for an event. I look at all the events in the cluster to which this event belongs and take the average number of attendees across those similar events. The average over similar events gives a very basic estimate of the number of attendees expected for this event. We can make it more accurate by giving different weights to different features in the event feature vector. For instance, we can give weights to different timestamps. As xavier said, if I know that 50% of the participants have usually booked 2 months (1 timestamp) before the event, then for an event that is two months away we can expect the final attendance to be today's number of participants x 2. Similarly, if I know the number of participants at a more recent timestamp (say 1 month before), I'll give more weight to it when predicting the final number of participants.
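The arithmetic above can be sketched as follows. The function names, the sample numbers and the weights (1 for the older timestamp, 2 for the more recent one) are assumptions chosen only to illustrate the idea.

```python
def predict_from_cluster(similar_attendance):
    """Baseline: average attendance over the similar events in the cluster."""
    return sum(similar_attendance) / len(similar_attendance)


def predict_from_bookings(bookings_so_far, fraction_booked_by_now):
    """If 50% of participants have usually booked by now, expect bookings / 0.5."""
    return bookings_so_far / fraction_booked_by_now


def weighted_prediction(estimates):
    """Combine (estimate, weight) pairs into one weighted average."""
    total_weight = sum(weight for _, weight in estimates)
    return sum(estimate * weight for estimate, weight in estimates) / total_weight


# Baseline from three hypothetical similar events.
baseline = predict_from_cluster([70, 90, 80])

# 2 months before: 40 bookings, historically 50% have booked by then.
two_months = predict_from_bookings(40, 0.5)

# 1 month before: 66 bookings, historically 75% have booked by then.
one_month = predict_from_bookings(66, 0.75)

# The more recent timestamp gets the larger weight.
combined = weighted_prediction([(two_months, 1), (one_month, 2)])
print(baseline, two_months, one_month, round(combined, 1))
```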

Data Mapping

We can also take into account other data points, like activity about the event on social media, mailings containing links to the event, etc. The more discussion there is about the event, the higher the probability of a good turnout. To predict the income from an event, we can mine the users' data, look at their history (how much they have contributed in similar events, how many booked but did not attend, etc) and come up with numbers that give a fair estimate of the expected income.
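One simple way to turn user history into an income estimate is to weight each registrant's average past contribution by how often they actually showed up. This is only a sketch; the field names and the default show rate for users without history are assumptions.

```python
def expected_income(registrants, default_show_rate=0.5):
    """Sum over registrants of P(attend) * average past contribution.

    P(attend) is estimated as past_attended / past_bookings; users with
    no booking history fall back to default_show_rate (an assumption).
    """
    total = 0.0
    for r in registrants:
        if r["past_bookings"]:
            show_rate = r["past_attended"] / r["past_bookings"]
        else:
            show_rate = default_show_rate
        total += show_rate * r["avg_contribution"]
    return total


# Hypothetical registrants for an upcoming event.
registrants = [
    {"past_bookings": 4, "past_attended": 3, "avg_contribution": 20.0},
    {"past_bookings": 0, "past_attended": 0, "avg_contribution": 10.0},
]

print(expected_income(registrants))
```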

Clustering Similar Users

We can also group similar users into clusters. We can then use this information to find more users for an event. For instance, if we have x users registered for an event 2 months before it takes place, we can send a mail inviting the other users in the clusters that each of these x users belongs to. Users with similar interests are likely to participate in the same event. This approach can also be used to decide whether a user would make a donation, buy a membership, or buy a ticket or sponsorship for an event. Based on a user's own history and similar users' history, one can predict whether they would respond positively, neutrally or negatively.
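A small sketch of both ideas, assuming the user clusters have already been computed and are available as a mapping from user to cluster id. The names, the majority-vote rule and the extra weight given to a user's own history are illustrative assumptions.

```python
from collections import Counter


def users_to_invite(registered, user_cluster):
    """Users who share a cluster with a registered user but are not registered."""
    target_clusters = {user_cluster[u] for u in registered if u in user_cluster}
    return sorted(
        user
        for user, cluster in user_cluster.items()
        if cluster in target_clusters and user not in registered
    )


def predict_response(own_history, similar_history, own_weight=2):
    """Weighted majority vote over past responses; own history counts more."""
    votes = Counter()
    for response in own_history:
        votes[response] += own_weight
    for response in similar_history:
        votes[response] += 1
    if not votes:
        return "neutral"  # no history at all
    return votes.most_common(1)[0][0]


# Hypothetical cluster assignment of users.
user_cluster = {"alice": 0, "bob": 0, "carol": 1, "dave": 1, "erin": 2}

print(users_to_invite({"alice"}, user_cluster))
print(predict_response(["positive"], ["negative", "negative", "negative"]))
```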

Mentor Communication

I will stay in touch with my mentors throughout the project. I would be active on IRC and the mailing lists most of the time (10 am to 11 pm IST), and I would update the mentors on my progress weekly. I am open to working overtime if I fall behind on a deadline.

I also plan to blog about my work and progress on the project on a weekly basis. This way the mentors and other members of the community can get updates on the project in an easy and timely manner.

When to expect specific deliverables: Please see below

Detailed Description (only if this is a new idea, otherwise if there is a description online, please provide the URL):

http://wiki.civicrm.org/confluence/display/CRM/Google+Summer+of+Code+-+2015


Expected Deliverables: (list the main items that you will deliver during the program):

Item 1 - 09 June

Training data (for events/campaigns) that can be used to make predictions.

Item 2 - 23 June

A code module that makes simple predictions related to events (attendance/income) based on similar events’ data.

Item 3 - 07 July

A detailed analysis of predictions so as to indicate confidence interval for the predictions made at different points in time.

Item 4 - 21 July

ML Algorithms to make more actionable predictions and data mining.

Item 5 - 11 Aug

Prediction algorithms that take into account secondary sources of information about users apart from their relationship history.

 

Timeline (break down by every week of GSoC):

19 May: Research Week

-  Understand the civicrm-core codebase.
-  Understand the correlations between different entities in CiviCRM.

26 May: Organization Selection

-  Find a few organizations that have enough data that can be used for prediction.
-  Select one organization and decide on the specific predictions (like income of event, attendance, etc) we need to devise the algorithms for.

02 June: Train Data Preparation

-  Data cleaning
-  Data mining and
-  Feature extraction (for events) for 1 particular organization.

09 June: Clustering

-  Apply basic Clustering Algorithms to find similar events.
-  Test and plot the accuracy of the algorithms against new test events (by checking if they fall in the correct cluster).
-  Improve upon the results by changing the features of the events or the clustering algorithms.
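The accuracy check in this step could look roughly like the sketch below. It assumes the cluster centroids are already known and that each held-out test event has been labelled by hand with the cluster it is expected to fall into; both are assumptions for illustration.

```python
import math


def nearest_cluster(vector, centroids):
    """Index of the centroid closest to the vector (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def cluster_accuracy(test_events, centroids):
    """Fraction of (vector, expected_cluster) pairs assigned as expected."""
    correct = sum(
        1
        for vector, expected in test_events
        if nearest_cluster(vector, centroids) == expected
    )
    return correct / len(test_events)


# Hypothetical centroids (scaled features) and labelled test events.
centroids = [[0.1, 0.1], [0.9, 0.9]]
test_events = [
    ([0.0, 0.2], 0),
    ([0.8, 1.0], 1),
    ([0.7, 0.6], 0),  # expected cluster 0, but closer to centroid 1
]

print(cluster_accuracy(test_events, centroids))
```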

16 June: Prediction

-  Find different ways to make predictions from similar events’ data, for instance by averaging.
-  Make simple predictions like attendance/income of an event.

23 June: Mid term submission

30 June: Feedback

-  Work on the feedback received from the evaluation.
-  Rigorously test the API under extreme conditions to check for bugs.
-  Clean the code and write required documentation.

30 June: Result Analysis

-  Graph the predictions so as to indicate confidence interval for the predictions made at different points in time.
-  Analyse and infer as to how the prediction has been changing during the period leading up to the event.
-  Accordingly think of ways to improve accuracy of the algorithms.

07 July: Improve the accuracy of prediction algorithms

-  Take into account other data points, like activity about the event on social media, mailings containing links to the event, etc, while making predictions.
-  Work on how these things can be accounted for in the prediction algorithms.

14 July: Work on more actionable predictions

-  Work on the feedback from the mentors.
-  Devise algorithms to predict how some action x (e.g. sending another email, or phone banking the invitees who have not responded) would change some outcome (e.g. attendance or revenue for the event / campaign).

21 July: Use secondary sources of Information to do more ML

-  Collect data from the associated user accounts in WordPress, Drupal or Joomla!, particularly for sites with significant user-generated content, and from various extensions.
-  Devise ways to use this information for doing more ML and improve upon the prediction algorithms.

28 July, 04 Aug: Cluster Similar Users

-  Use data collected about users (from both primary and secondary sources) to cluster similar users.
-  Devise ways to use similar users’ data to add more detail and complexity to the prediction algorithms.

11 Aug: Wrap up

-  Clean the code.
-  Write the required documentation.

18 Aug: Final submission


Potential Mentors (optional): xavier dutoit, Joe Murray (JMA Consulting), totten


Which aspect of the project idea do you see as the most difficult?

Finding similarity between any two events. We need to come up with the best combination of features that can be used to build the feature vector of each event/campaign so that clustering algorithms perform well on these feature vectors while grouping similar events.

Which aspect of the project idea do you see as the easiest?

Applying clustering algorithms once we have the event/campaign feature vectors. There are various libraries that provide built-in functions for different clustering techniques like k-means++.

Which portion of the project idea will you start with?

I would start by selecting an organization that has enough data (to be used as training data for ML) and then do data analysis to decide on the features (feature extraction) and the clustering techniques.
« Last Edit: April 09, 2015, 12:25:34 pm by mohit019 »

xavier

Re: GSOC 2015: Predictive and Data Mining Project
April 12, 2015, 02:24:27 pm
Thanks,

As I was explaining to you, we'd like to have a co-mentor (from outside the civicrm community) with stronger expertise in data mining tools and techniques. If you have a name to toss in the bag, that would be great ;)
