Repair pantsd+watchman integration test flakiness.
Review Request #4067 - Created July 12, 2016 and submitted
|jsirois, nhoward_tw, stuhood|
Presently, the pantsd+watchman integration test can occasionally fail in CI with a logged error along the lines of:W0602 21:35:21.260252 22326 pailgun_service.py:69] encountered exception during SchedulerService.get_build_graph(): Traceback (most recent call last): File "/home/travis/build/pantsbuild/pants/src/python/pants/pantsd/service/pailgun_service.py", line 65, in runner_factory build_graph = self._scheduler_service.get_build_graph(spec_roots) File "/home/travis/build/pantsbuild/pants/src/python/pants/pantsd/service/scheduler_service.py", line 84, in get_build_graph return self._graph_helper.create_graph(spec_roots) File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/engine_initializer.py", line 64, in create_graph for _ in graph.inject_specs_closure(spec_roots): # Ensure the entire generator is unrolled. File "/home/travis/build/pantsbuild/pants/src/python/pants/engine/legacy/graph.py", line 169, in inject_specs_closure for address in self._inject(specs): File "/home/travis/build/pantsbuild/pants/src/python/pants/engine/legacy/graph.py", line 185, in _inject entries = maybe_list(state.value, expected_type=LegacyTarget) AttributeError: 'NoneType' object has no attribute 'value'
This seems to only happen when invalidation activity is high (in this case due to constant filesystem event activity in the CI environments), indicating a locking issue around graph construction. I was able to reliably reproduce the flakiness locally using a simple test harness described in #3565 that simulated the fs event conditions seen in CI.
- Add coarser grain locking to graph construction in
LegacyGraphHelper.create_graph(), which revealed deadlock potential at the time of the Pailgun handler's
fork()in subsequent testing.
- Add a request handler-level lock to the Pailgun to prevent requests from forking when the scheduler lock is inadvertantly held by the invalidation event processing thread (which can result in a deadlock/hung pailgun client).
- Add some informational print statements to the pantsd+watchman integration test to aid in future diagnostics in CI, if necessary.
+ failed to reproduce locally in a lengthy
whiletest loop using the test harness setup described in #3565.
The indirection through
scheduler_serviceto the scheduler's lock sortof obfuscates things.
Maybe the SchedulerService is the thing that should actually have a lock (since it is the thing actually expecting concurrency) and scheduler should just be concurrency oblivious?
Since you don't need the ability to mutate this, would making it an @property be more defensive?
Debug output is probably fine... assuming it's intentional?
Address Stu's feedback.
Revision 3 (+49 -7)