[engine] Eliminate non-determinism computing cache keys

Review Request #3593 — Created March 22, 2016 and submitted

kwlzn, patricklaw, stuhood

There is an existing unit test that verify reruning the same request hits cache
100%, but that's within the same python process. As it turns out there is
randomness when persisting cache to local path and reuse from restarting a
different process.

Have identified two sources for non-determinism:

  • StepRequest.step_id this could change as when requests may be executed,
    returned and scheduled in different order. Alternatives considered
    1) remove step_id from StepRequest, since we still pass step_id to and
    back from sub process, the interface would look less clean.
    2) overwrite StepRequest__reduce__ provide fields that are to be pickled,
    essentially whitelisting the fields, that's fragile if we ever add/remove

  • StepRequest.dependencies is an unsorted map. So sorting is provided in this
    review that sorts in two steps:
    1) group nodes by their types, since comparison for nodes of different types
    is not defined
    2) add Key.__lt__, currently undefined which defaults to identity based.

Implemented StepRequest.compute_key to handle the above two.

Manual testing on list 3rdparty/:: now gets 100% hit rate, was <20%, this
results about 3x improvement speed up.

Also fixed a bug that is_cacheable is missing @property, this results every
node is_cacheable will return True, therefore FilesystemNode is cached!
AddressMapperTest.test_no_address_no_family is supposed to catch that, but
when we invalidate the graph we don't reset step_id therefore in the second run
requests get higher step_ids, i.e, different fingerprints. Now with the step_id
change the bug properly shows up.

https://travis-ci.org/peiyuwang/pants/builds/117578595 passed
http://jenkins.pantsbuild.org/job/pants_ci.unit_tests/136/ passed

manually tested list 3rdparty/::

  • 0
  • 0
  • 3
  • 0
  • 3
Description From Last Updated
  1. Great finds Peiyu!

    Please add a test that catches the multi-run non-determinism case: it should be possible to do by running an execution twice with the same Storage instance, but with different Schedulers?

    1. multiprocess engine would be good testing non-determinism (dependencies are more likely to come in different orders)

      I switched test_engine.test_rerun_with_cache to LocalMultiprocessEngine

  2. src/python/pants/engine/exp/nodes.py (Diff revision 1)

    The reason the abstractproperty didn't blow up is that this doesn't extend AbstractClass.

    Good find!

    1. Fixed, and another place

  3. src/python/pants/engine/exp/scheduler.py (Diff revision 1)

    Rather than constructing the StepRequest with a sorted dict, I'd rather see us kill two birds with one stone here and add a def key method to StepRequest that returns a cache key for the StepRequest. That would also solve the step_id problem.

    Note that StepRequest is different from other classes that we content address, because the Scheduler and Engine are aware of it. I would be strongly against a def key method on most other objects (certainly on any user-defined object), but here I think it makes sense.

    1. only con is it will need to take a storage parameter, but otherwise makes sense.


  2. Why not just return the key, which the caller can put in Storage?

    1. Storage doesn't have a seperate interface for just computing the key.

      current put serializes obj into a blob and computes the key, save the blob.

      So if we do that, we would need to return both key and the blob, too much coupling.

    2. Fixed, misunderstood. this way we don't have to rely storage passed in

Review request changed

Status: Closed (submitted)

Change Summary:

Submitted as 26386a87caa36cc800f3604b113f5fae85947396