[engine] Content address node and state only in engine

Review Request #3615 — Created March 25, 2016 and submitted

peiyu
pants
3066, 3097
pants-reviews
kwlzn, patricklaw, stuhood

This review conslidates node/state to key and key back to node/state into Storage, all called by engine.

This review alone does not improve performance (see [1]) but it will allow better control by engine what requests are to be CA-ed. By placing the CA logic in one place also cleaned up the interface, which might address some of https://github.com/pantsbuild/pants/issues/3070.

  • No more storage in Scheduler, all it sees are states, nodes, no more keys.
  • Step.__call__ no longer dealing with keys
  • Turning nodes and states into keys and the other way only happens in one code base in Storage, and is only called by Engine
  • Also killed state_key, the 'keyed' version of nodes or states is only for either computing cache keys or multi-process

[1] Manual performance test: cold cache, 30% faster (1.42 sec vs 1.15 sec), warmed-up cache, 64% slower (0.46 sec vs 0.75 sec) is because now we compute keys more, for nodes. Only converting nodes into keys guarantees stable ordering without making nodes support ordering. A lot of keying are repeated so expect https://rbcommons.com/s/twitter/r/3597/ would help.

https://travis-ci.org/peiyuwang/pants/builds/118573674 passed.
https://travis-ci.org/peiyuwang/pants/builds/118612254 passed.
https://travis-ci.org/peiyuwang/pants/builds/118633765 passed.
https://travis-ci.org/peiyuwang/pants/builds/118885399 passed.
https://travis-ci.org/peiyuwang/pants/builds/118898889 passed.

  • 0
  • 0
  • 1
  • 0
  • 1
Description From Last Updated
  1. 
      
  2. src/python/pants/engine/exp/engine.py (Diff revision 1)
     
     

    I think this is wrong:

    is_cacheable is for a different purpose, we will need a separate boolean for is_content_addressable

  3. 
      
  1. 
      
  2. src/python/pants/engine/exp/engine.py (Diff revision 1)
     
     

    Don't mix camel case and underscores.

    But also, why not just call this 'Storage'? StorageIO is redundant.

    1. fixed camel case.

      Removed the old .storage as it is needed, just storage_io and cache now.

      Keeping storage_io to distinguish from storage, they are doing different things

  3. src/python/pants/engine/exp/engine.py (Diff revision 1)
     
     
     

    one space

  4. src/python/pants/engine/exp/engine.py (Diff revision 1)
     
     
     
     

    This boolean could probably be more opaque... while you will usually only set it when something is multi-process, you could set it for any reason. It basically means "should content-address"?

    Based on the fact that this boolean doesn't have anything to do with the engine, it seems like this could just go into the Storage class? Should StepRequest and StepResult go in storage.py too?

    1. Removed this flag.

      Discussed offline, engine will make the in-process, out-process decision based on request as opposed to a global flag.

  5. src/python/pants/engine/exp/engine.py (Diff revision 1)
     
     
     

    This seems odd to me. On a request-by-request basis, we either have to content address everything that isn't already content addressed, or not. So the fact that we're inspecting all of the fields individually seems weird.

    1. No longer maybe_key.

      One thing worth mentioning is step_request.node isn't keyed, this is only for convenience, because for cache we need 1) a keyed request to compute hash 2) an unkeyed node to decide whether is_cacheable. Or I will have sth like _maybe_cache_get(self, step_request, keyed_step_request) and _maybe_cache_put(self, step_request, keyed_step_request, step_result), that looks... ugly

      Added a comment

  6. src/python/pants/engine/exp/nodes.py (Diff revision 1)
     
     
     
     

    This is probably too coarse... there is no reason not to CA most states... it's a property of the consumer rather than the state, no?

    1. Removed state.is_content_addressable now, engine will decide based on request whether to CA.

  7. 
      
  1. Looks good. But please fix the naming =/

  2. src/python/pants/engine/exp/engine.py (Diff revision 3)
     
     

    StorageIO is a redundant name: please change it.

    Storage ~= IO, IO ~= Storage. So putting both in the name is not helpful.

    Still seems like it should just be merged with Storage.

    1. Merged with Storage now.

  3. src/python/pants/engine/exp/engine.py (Diff revision 3)
     
     

    storageIo

  4. src/python/pants/engine/exp/engine.py (Diff revision 3)
     
     

    Since this no longer "controls" that, I think it should definitely just be merged into Storage.

  5. src/python/pants/engine/exp/engine.py (Diff revision 3)
     
     

    int

  6. src/python/pants/engine/exp/engine.py (Diff revision 3)
     
     

    int

  7. src/python/pants/engine/exp/engine.py (Diff revision 3)
     
     
     
     

    This should probably fail fast if possible... the method shouldn't be called in that case.

    1. Good point, now just need to check if it is exception first.

      -      result = self._storage.resolve_result(result)
             if isinstance(result, Exception):
               raise result
      +      result = self._storage.resolve_result(result)
      
  8. src/python/pants/engine/exp/scheduler.py (Diff revision 3)
     
     
     

    This probably indicates that this method is in the wrong place. Should move it near the code on StorageIO that deals with Steps. ... and maybe rename it StepStorage?

    1. Moved into Cache as a private method, because it is used only for cache calls to compute cache key.

  9. 
      
Review request changed

Status: Closed (submitted)

Change Summary:

Submitted as 1a1ffdcde387cd87b950db1723b1fdcf63875583

Loading...