[engine] Walk references in the ProductGraph

Review Request #3803 — Created May 1, 2016 and submitted

kwlzn, patricklaw

With caching disabled (will revisit in a future review), the majority of the runtime for ./pants list :: with the new engine was in cycle detection. After attempting to implement a smarter cycle detection algorithm [0], it became obvious that the actual cost of walking was the dominant factor.

This review replaces the internals of ProductGraph with _Entry objects that directly reference one another. This avoids executing dict lookups (and thus a lot of Node equality checks) during graph walks.

  • Add explicit tests of cycle detection.
  • Move from multiple defaultdicts to a non-default dict of _Entry objects and walk the graph directly by _Entry reference, which avoids a lot of dict lookups (~20% improvement).
  • Switch from recursion to iteration for graph walk (~6% improvement).
  • Enable some default filters in legacy:commands for venv directories.

With caching disabled, the runtimes for time ./pants -q run src/python/pants/engine/exp/legacy:fsnodes -- //:: > /dev/null were:

  real  0m22.143s
  user  0m19.913s
  sys   0m1.914s

complex algorithm (with reference walking):
  real  0m22.874s
  user  0m20.438s
  sys   0m2.078s

this review (simple algorithm, with reference walking):
  real  0m17.940s
  user  0m15.944s
  sys   0m1.847s

[0]: I was able to get that algorithm solidly working, but in addition to being more complicated, the extra overheads meant that (at least for the content of the pantsbuild/pants repo) it was significantly slower than the naive m^2 approach. For posterity, the implementation is tagged here.


  1. lgtm! handful of misc comments.

  2. stru[c]tural

  3. assuming the default object equality is identity based anyway, could this be made clearer with an explicit is comparison?

  4. consider self._nodes.iteritems() here and downward.

    1. T602:ERROR src/python/pants/engine/exp/scheduler.py:152 iteritems disappears in Python 3.x. Use non-iter instead.
      | for node, entry in self._nodes.iteritems():

      Don't our imports mean that these are py3 style anyway?

    2. ah, k. fwiw: http://markmail.org/message/t2a6tp33n5lddzvy

  5. how about making the all_predicate function signature default the second arg e.g.:

    def all_predicate(node, state=None): return True

    and reusing that here?

  6. src/python/pants/engine/exp/scheduler.py (Diff revision 2)

    maybe these definitions should happen under if branches to avoid potentially unnecessary function definition overhead?


    if predicate:
      def entry_predicate(...):
        return predicate(...)
      def entry_predicate(...):
        ...default case...
  7. quick micro-optimization: use object literals where possible for initial creation (e.g. root_entries = []) - they're 2-3x faster:

    [illuminati source (master)]$ python -m timeit -n 10000 "x = []"
    10000 loops, best of 3: 0.0245 usec per loop
    [illuminati source (master)]$ python -m timeit -n 10000 "x = list()"
    10000 loops, best of 3: 0.0888 usec per loop
  2. src/python/pants/engine/exp/scheduler.py (Diff revision 2)

    or return entry and entry.state is not None, slightly more readable

  3. second param node seems not used, or can get from node_entry, either way can kill

Review request changed

Status: Closed (submitted)

Change Summary:

Merged as 8b163caa1213ae5b845fa936c559877bcc65fa84