Repair pantsd+watchman integration test flakiness.

Review Request #4067 — Created July 12, 2016 and submitted

3565, 3622
jsirois, nhoward_tw, stuhood

Presently, the pantsd+watchman integration test can occasionally fail in CI with a logged error along the lines of:

W0602 21:35:21.260252 22326] encountered exception during SchedulerService.get_build_graph():
Traceback (most recent call last):
File "/home/travis/build/pantsbuild/pants/src/python/pants/pantsd/service/", line 65, in runner_factory
build_graph = self._scheduler_service.get_build_graph(spec_roots)
File "/home/travis/build/pantsbuild/pants/src/python/pants/pantsd/service/", line 84, in get_build_graph
return self._graph_helper.create_graph(spec_roots)
File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/", line 64, in create_graph
for _ in graph.inject_specs_closure(spec_roots):  # Ensure the entire generator is unrolled.
File "/home/travis/build/pantsbuild/pants/src/python/pants/engine/legacy/", line 169, in inject_specs_closure
for address in self._inject(specs):
File "/home/travis/build/pantsbuild/pants/src/python/pants/engine/legacy/", line 185, in _inject
entries = maybe_list(state.value, expected_type=LegacyTarget)
AttributeError: 'NoneType' object has no attribute 'value'

This seems to only happen when invalidation activity is high (in this case due to constant filesystem event activity in the CI environments), indicating a locking issue around graph construction. I was able to reliably reproduce the flakiness locally using a simple test harness described in #3565 that simulated the fs event conditions seen in CI.

To repair:

  • Add coarser grain locking to graph construction in LegacyGraphHelper.create_graph(), which revealed deadlock potential at the time of the Pailgun handler's fork() in subsequent testing.
  • Add a request handler-level lock to the Pailgun to prevent requests from forking when the scheduler lock is inadvertantly held by the invalidation event processing thread (which can result in a deadlock/hung pailgun client).
  • Add some informational print statements to the pantsd+watchman integration test to aid in future diagnostics in CI, if necessary.

Fixes #3565

+ failed to reproduce locally in a lengthy while test loop using the test harness setup described in #3565.

  2. src/python/pants/pantsd/service/ (Diff revision 2)

    The indirection through scheduler_service to the scheduler's lock sortof obfuscates things.

    Maybe the SchedulerService is the thing that should actually have a lock (since it is the thing actually expecting concurrency) and scheduler should just be concurrency oblivious?

    1. discussed offline, but essentially the thinking here is/was that the locking in the scheduler is ingrained enough now that it made more sense to simply surface that (rentrant) lock at a higher level to retain use of a singular lock for all scheduler work.

      filed to revisit this.

  3. Since you don't need the ability to mutate this, would making it an @property be more defensive?

  4. Debug output is probably fine... assuming it's intentional?

    1. yup - purely intentional and only emitted in the integration test context when pytest's -s option is used.

      I suspect this output will be useful in more quickly understanding the failure mode for future cases of breakage in and out of CI.

Review request changed

Status: Closed (submitted)

Change Summary:

thanks Stu! this is in @