[engine] Walk references in the ProductGraph

Review Request #3803 — Created May 2, 2016 and submitted — Latest diff uploaded

kwlzn, patricklaw

With caching disabled (will revisit in a future review), the majority of the runtime for ./pants list :: with the new engine was in cycle detection. After attempting to implement a smarter cycle detection algorithm [0], it became obvious that the actual cost of walking was the dominant factor.

This review replaces the internals of ProductGraph with _Entry objects that directly reference one another. This avoids executing dict lookups (and thus a lot of Node equality checks) during graph walks.

  • Add explicit tests of cycle detection.
  • Move from multiple defaultdicts to a non-default dict of _Entry objects and walk the graph directly by _Entry reference, which avoids a lot of dict lookups (~20% improvement).
  • Switch from recursion to iteration for graph walk (~6% improvement).
  • Enable some default filters in legacy:commands for venv directories.

With caching disabled, the runtimes for time ./pants -q run src/python/pants/engine/exp/legacy:fsnodes -- //:: > /dev/null were:

  real  0m22.143s
  user  0m19.913s
  sys   0m1.914s

complex algorithm (with reference walking):
  real  0m22.874s
  user  0m20.438s
  sys   0m2.078s

this review (simple algorithm, with reference walking):
  real  0m17.940s
  user  0m15.944s
  sys   0m1.847s

[0]: I was able to get that algorithm solidly working, but in addition to being more complicated, the extra overheads meant that (at least for the content of the pantsbuild/pants repo) it was significantly slower than the naive m^2 approach. For posterity, the implementation is tagged here.