Repair pantsd+watchman integration test flakiness.

Review Request #4067 - Created July 12, 2016 and submitted

Information
Kris Wilson
pants
kwlzn/pantsd/locking
3565, 3622
Reviewers
pants-reviews
jsirois, nhoward_tw, stuhood

Presently, the pantsd+watchman integration test can occasionally fail in CI with a logged error along the lines of:

W0602 21:35:21.260252 22326 pailgun_service.py:69] encountered exception during SchedulerService.get_build_graph():
Traceback (most recent call last):
File "/home/travis/build/pantsbuild/pants/src/python/pants/pantsd/service/pailgun_service.py", line 65, in runner_factory
build_graph = self._scheduler_service.get_build_graph(spec_roots)
File "/home/travis/build/pantsbuild/pants/src/python/pants/pantsd/service/scheduler_service.py", line 84, in get_build_graph
return self._graph_helper.create_graph(spec_roots)
File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/engine_initializer.py", line 64, in create_graph
for _ in graph.inject_specs_closure(spec_roots):  # Ensure the entire generator is unrolled.
File "/home/travis/build/pantsbuild/pants/src/python/pants/engine/legacy/graph.py", line 169, in inject_specs_closure
for address in self._inject(specs):
File "/home/travis/build/pantsbuild/pants/src/python/pants/engine/legacy/graph.py", line 185, in _inject
entries = maybe_list(state.value, expected_type=LegacyTarget)
AttributeError: 'NoneType' object has no attribute 'value'

This seems to only happen when invalidation activity is high (in this case due to constant filesystem event activity in the CI environments), indicating a locking issue around graph construction. I was able to reliably reproduce the flakiness locally using a simple test harness described in #3565 that simulated the fs event conditions seen in CI.

To repair:

  • Add coarser grain locking to graph construction in LegacyGraphHelper.create_graph(), which revealed deadlock potential at the time of the Pailgun handler's fork() in subsequent testing.
  • Add a request handler-level lock to the Pailgun to prevent requests from forking when the scheduler lock is inadvertantly held by the invalidation event processing thread (which can result in a deadlock/hung pailgun client).
  • Add some informational print statements to the pantsd+watchman integration test to aid in future diagnostics in CI, if necessary.

Fixes #3565

https://travis-ci.org/pantsbuild/pants/builds/144029267

+ failed to reproduce locally in a lengthy while test loop using the test harness setup described in #3565.

Stu Hood
Kris Wilson
Review request changed

Status: Closed (submitted)

Change Summary:

thanks Stu! this is in @ https://github.com/pantsbuild/pants/commit/279470ad41fbe84c8b764aef5010d068ca1d1034

Loading...