Author Topic: Sneak Peak: PR Testing (Read 2241 times)

totten · July 18, 2014, 05:31:32 pm

With support from the Wikimedia Foundation, we've been working to integrate our test-suite with Github pull-requests so that test regressions can be identified *before* a PR is accepted. I'll present more in-depth about this in the future, but for now I want to share a little about how to use it -- and what's coming next.

As an example, let's take a look at an arbitrary pull-request -- like eileen's https://github.com/civicrm/civicrm-core/pull/3681 . Several things to notice:

As soon as Eileen submitted the PR, a comment was automatically added by "civicrm-builder". civicrm-builder is the bot which monitors our PRs and runs the tests.
Specifically, "civicrm-builder" asks a question: "Can one of the admins verify this patch?" civicrm-builder is the new kid on the block, so he hasn't yet figured out that Eileen is High Priestess of Patching, and he's worried that the patch might do something really malicious (like run "rm -rf /" on the test machine).
One of the admins -- totten -- sees the comment and responds to "civicrm-builder." Like a young puppy, civicrm-builder only knows a couple phrases in English. (See https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin for a list.) In this case, totten says "add to whitelist" because he recognizes that Eileen is High Priestess of Patching. Now this patch (and any future patches from Eileen) will be tested automatically.
civicrm-builder went off on its own for a while to test Eileen's patch. Eventually, it passed (which makes civicrm-builder happy). Like an enthusiastic environmentalist, civicrm-builder issues green badges whenever he's happy. Notice the green checkmark next to Eileen's patch. Clicking on the green checkmark shows a full report.

So we have a working system, right? Well, mostly. There's a balancing act we still need to complete: when civicrm-builder performs tests, it could run all of them -- which would provide good feedback, but it would take a long time. For the moment, civicrm-builder runs only a handful of basic tests ( https://test.civicrm.org/job/CiviCRM-Core-PR/26/testReport/%28root%29/ ). So the next step is finding a good balance which (a) runs enough tests to be useful, and (b) runs them fast enough to be useful, and (c) can be achieved with reasonable manpower. I'll report back as this progresses.

petednz · July 18, 2014, 05:56:19 pm

i like giving eileen green checkmarks too - but for some reason it doesn't replace the paycheck

Chris Burgess · July 18, 2014, 06:06:23 pm

Looks great! A couple of thoughts on infrastructure load -

I expect it would be good for contributors to be able to reproduce such a setup, to better test against unpublished modules or live customer data. How can I help make that happen? It would let us shift the load away to self-hosted dev setups (& make it easier to expand test infrastructure too).

Approaches like https://wiki.hpdd.intel.com/display/PUB/Changing+Test+Parameters+with+Gerrit+Commit+Messages might help us to target initial patch testing (eg tagging a patch with "contribute" so it either runs contribute tests first and bails out if it fails there, or runs only contribute tests and doesn't get awarded a green "good to merge" badge until it's fully tested later).

totten · July 18, 2014, 07:03:24 pm

Chris, those are great points.

Quote from: Chris Burgess on July 18, 2014, 06:06:23 pm

I expect it would be good for contributors to be able to reproduce such a setup, to better test against unpublished modules or live customer data. How can I help make that happen?

Part of the arrangement with Wikimedia includes producing documentation so that other dev teams can create similar configurations.

As that matures, perhaps I can send you a draft for review/testing/comment?

Quote from: Chris Burgess on July 18, 2014, 06:06:23 pm

Approaches like https://wiki.hpdd.intel.com/display/PUB/Changing+Test+Parameters+with+Gerrit+Commit+Messages might help us to target initial patch testing (eg tagging a patch with "contribute" so it either runs contribute tests first and bails out if it fails there, or runs only contribute tests and doesn't get awarded a green "good to merge" badge until it's fully tested later).

I think that's potentially a good approach. There are a few approaches I want to try and wonder if you have any reactions on relative cost/benefit:

Hinting - Using Github tags or Github comments, allow the developers (either the original dev or a reviewer) to indicate which test suite to run.
Pattern matching - For example, if the patch touches "CRM/Event/*" or "api/v3/Event*", then run any test-cases that declare "@group CiviEvent".
Parallelization - When running the tests, create 3-4 instances (eg Drupal multi-site) and run 3-4 copies of PHPUnit (with different test-cases in each).
Optimization - Tackle slowness at the source by changing the tests. Dawnthorne had some good POCs.

My inclincation is to push more on pattern-matching and parallelization (because they don't require user-interaction or broad changes to the actual tests) but could adjust if there's a better argument.

Chris Burgess · July 18, 2014, 08:20:56 pm

Hinting feels like the lowest entry point but the most work for contributors (don't mis-spell! etc) and the least advantage (at least as far as focussing testbot resources on useful testing). Pattern matching is the converse - we have to teach it patterns but contributors get it 'for free'. Parallelization may make sense if it offers a speed boost but that's mostly dependent on free server resources? Optimization should target performant CiviCRM then performant testing IMO - because of the useful side-effects (this may be what you mean?) - and that points towards being able to test "real world" DBs alongside the test DB.

totten · July 18, 2014, 09:13:28 pm

Regarding parallelization - I think we have sufficient resources that we could do a few more threads.

Regarding optimization - There may in fact be overlap in what helps performance of tests and real-usage, but I can't name any concrete items off the top of my head -- unless there are suggestions, we'd have to go on a fishing expedition. Trying to think of more concrete issues, the obvious painpoints (to me) don't really overlap: real-performance is heavily influenced by bootstrap, but that's negligible for test-performance. Test-performance is heavily influenced by cleaning up DB-tables and caches, but that's irrelevant to real-performance.

totten · July 18, 2014, 11:12:08 pm

Regarding pattern-matching -- I'm thinking about the best way to encode the patterns. Two ideas that have come up -- use a "testmap.json" or use PHP annotations.

Code: [Select]

// FILE: tests/testmap.json
{
  {match: "api/[A-Za-z]+.php",                                      phpunit-group: "ApiFramework"},
  {match: "api/v3/utils.php",                                       phpunit-group: "ApiFramework"},
  {match: "CRM/Utils/API/.*",                                       phpunit-group: "ApiFramework"},
  {match: "Civi/API/.*",                                            phpunit-group: "ApiFramework"},
  {match: "CRM/Event/.*php",                                        phpunit-group: "CiviEvent"},
  {match: "CRM/Mailing/.*php",                                      phpunit-group: "CiviMail"},
  {match: "CRM/Mailing/.*php",                                      phpunit-group: "CiviMail},
  {match: "(CRM|api|Civi)/([A-Za-z]+).php",                         phpunit-class: "$1_$2"},
  {match: "(CRM|api|Civi)/([A-Za-z]+)/([A-Za-z]+).php",             phpunit-class: "$1_$2_$3"},
  {match: "(CRM|api|Civi)/([A-Za-z]+)/([A-Za-z]+)/([A-Za-z]+).php", phpunit-class: "$1_$2_$3_$4"},
}

The nice thing about this: there's no particular PHP bias, so it can work just as well with PHP+PHPUnit or JS+(QUnit|Jasmine)+(karama|protractor). On the otherhand, you need to understand regex, and you need to update a centralized file, and it's won't be very clever about deduping. (So if the class "CRM_Foo_BarTest" matches on account of a phpunit-group rule and on account of a phpunit-class rule, then it would get run twice... unless we take other measures to dedupe.)

Code: [Select]

// In any source-code file, you can add annotations to classes or global functions like:
/**
 * @TestWith\group CiviEvent
 * ...or...
 * @TestWith\class CRM_Event_BAO_WhizzyTest
 */
class CRM_Event_BAO_Participant { ... }

// OR, in any test file, you can add annotations to classes or functions like:
/**
 * @TestFor\group CiviEvent
 * ...or...
 * @TestFor\file "CRM/Event/BAO/Event.php"
 * ...or...
 * @TestFor\file "sample-data.*.csv"
 */
class CRM_Event_BAO_EventTest { ... }

In this case, the mappings are more "in situ", which is probably a bit nicer when coding. They're closely tied to PHP parsing which means.... we can be more quite clever (e.g. deduping and matching tests based on *class* name instead of file-name), but we do have to parse the files, and it won't work with pure-JS tests. (OTOH, our pure-JS test-suite is currently so small that we don't currently need this level of granularity.)

totten · July 23, 2014, 08:19:02 pm

FWIW, related JIRA issues are:
* Parallelization: https://issues.civicrm.org/jira/browse/INFRA-124
* Pattern matching: https://issues.civicrm.org/jira/browse/INFRA-122

Eileen · July 23, 2014, 09:06:31 pm

Run syntaxconformance tests on every one - if that doesn't break nothing will :-)

PS glad you picked a PR that didn't make me cringe when I looked

Eileen · July 24, 2014, 03:48:50 pm

as a thought - could you do full tests on a throttled basis? ie. run the full tests if they haven't been run for 2 hours. From experience the tests that fail are often a bit unpredictable so it would be good to run full tests too. Not sure what you'd do with the results - ie. if there have been 5 PRs since the last full run & then it fails would you then post a comment on them all? Re-run the failing test on each of them (of course - does it still fail in isolation?)

totten · July 24, 2014, 04:09:39 pm

So, I agree that we still need to do full runs. How much advantage do you see in the throttling approach compared to this approach:

1. The first job monitors pending PR's. It runs a subsection of the test-suite (hopefully targeted based on the content of the patch).
2. The second job runs periodically (say, every 4hr). It monitors the official branches (not PRs) and executes the full test suite.

General aside: The problem of inconsistency in test results has recurred in a few contexts for me (e.g. WebTest suite; timing-sensitive tests; and making our tests compatible with vanilla phpunit tooling), so I'm experimenting with a new script ("The Jeanine Matthews Divergent Test Hunter") to help identify/isolate divergent test results.

Eileen · July 24, 2014, 04:24:22 pm

Well - I guess the scenario you are looking at is how do you find the commit that breaks the tests? If you merge 10 PRs, none of which break the subset that were run on them but after doing that you have broken tests what do you do?

If you are going to resolve that manually then you have the risk that it won't be treated as a priority & the situation will go stale & become a problem both on new PRs - ie. the tests are failing already & in tracking down where the failure stemmed from.

I think we need to treat webtests & unittests separately because the webtests aren't really reliable enough to put processes around but the unit tests we probably should. We have 2 sets of unit tests that regularly fail due to time related issues - the Authorize.net ones due to the fact Authorize.net runs on Mountain time - this could probably be handled in the test. The other one is a membership related one which is timezone related in 'some' way. I'm about 70% sure this is a real bug because I get the same intermittant reports in the wild. Anyway - I think the phpunit ones could be managed & the webtest ones ignored.

Ah I see - I was assuming the PRs would be built into the tests cumulatively & that they would not usually be merged more often than 2 hours - so I was thinking that if there were 10 PRs in 2 hours then the 2 hourly run would do a full test on all 10. So we could be sure the full set would not cause some breakage.

totten · July 24, 2014, 04:46:21 pm

Quote from: Eileen on July 24, 2014, 04:24:22 pm

Well - I guess the scenario you are looking at is how do you find the commit that breaks the tests? If you merge 10 PRs, none of which break the subset that were run on them but after doing that you have broken tests what do you do?

The same way we do now -- with the periodic/scheduled tests against the official code. The changelog may have some noise but you can usually eyeball it; and if not, then use "git bisect" ( http://alanhollis.com/solving-bugs-phpunit-git-bisect/ ). If we had time/resources and wanted to be clever, I suppose we could automatically call "git bisect" to analyze the regressions.

Quote from: Eileen on July 24, 2014, 04:24:22 pm

Ah I see - I was assuming the PRs would be built into the tests cumulatively & that they would not usually be merged more often than 2 hours - so I was thinking that if there were 10 PRs in 2 hours then the 2 hourly run would do a full test on all 10. So we could be sure the full set would not cause some breakage.

As soon as the first PR is submitted, the testing begins immediately (~1 min later?). When the second PR is submitted, testing begins immediately. If there's a flood of concurrent PRs, then Jenkins will queue/backlog the tasks, but generally the tests are immediate. Doingfull test runs (CRM+API+Civi+WebTest suites) is only problematic when doing immediate execution.

Eileen · July 24, 2014, 05:14:50 pm

hm - It wasn't my impression that the current process was working terribly well - in that tests are often in a broken state for long periods of time - although it has been improving. It's over 2 months since the tests completed.

I just took at look & saw that a PR I did 9 days ago is now causing a test to fail - from my point of view that is completely in the past now - ie. the issue is resolved from the customer's point of view as the patch is deployed to them & they can reasonably expect that it will be fixed in 4.6 by the time they next upgrade. We've invoiced the customer and in the absence of any clear release schedule to plan around I'm not going to commit to putting time into it before the next release (and certainly not before the end of the billing month).

In this case your test pattern probably WOULD have picked the test up & not let it in. In this case had that happened immediately I would have fixed it (since it would have been billable). Had it happened 9 days later I would have left it hanging until we were next looking at upgrading or I attended a sprint.

As it is I'm inclined to say- reverse it & I'll fix it next sprint or next time it appears billable. Except that on a quick perusal I'm not sure if my patch caused a bug or exposed one
( https://github.com/civicrm/civicrm-core/commit/29347f3d407cfcd5f51c1fbc77a55bb607c6691b )

totten · July 26, 2014, 03:01:25 pm

Quote from: Eileen on July 24, 2014, 05:14:50 pm

hm - It wasn't my impression that the current process was working terribly well - in that tests are often in a broken state for long periods of time - although it has been improving. It's over 2 months since the tests completed.

Agree. That's why we're adding tests which run against a PR as soon as it's submitted (and before the PR is merged). That should do a lot of good, but (as with most tools/systems) it's likely to have limits and edge-cases. The old-process remains as a fallback or catch-all (until something better comes along).

CiviCRM Community Forums (archive)

News:

Author Topic: Sneak Peak: PR Testing (Read 2241 times)

totten

Sneak Peak: PR Testing

petednz

Re: Sneak Peak: PR Testing

Chris Burgess

Re: Sneak Peak: PR Testing

totten

Re: Sneak Peak: PR Testing

Chris Burgess

Re: Sneak Peak: PR Testing

totten

Re: Sneak Peak: PR Testing

totten

Re: Sneak Peak: PR Testing

totten

Re: Sneak Peak: PR Testing

Eileen

Re: Sneak Peak: PR Testing

Eileen

Re: Sneak Peak: PR Testing

totten

Re: Sneak Peak: PR Testing

Eileen

Re: Sneak Peak: PR Testing

totten

Re: Sneak Peak: PR Testing

Eileen

Re: Sneak Peak: PR Testing

totten

Re: Sneak Peak: PR Testing