[engine] cache StepResult under StepRequest

Review Request #3494 — Created Feb. 24, 2016 and submitted

peiyu
pants
3545
pants-reviews
kwlzn, patricklaw, stuhood

This review adds:

  • Cache that caches StepResult for StepRequest by storing the blobs in the main storage like the rest subjects and states, in addition, the request key to result key mappings are stored in a separate lookup table. Therefore two lookups are needed for cache hits.
  • Lmdb support for creating child database, for secondary index use case.
  • Add add_mapping, get_mapping interface to Storage, Cache can choose to have a different location of the subjects storage, or can be the same, but implementation is shared, this uniformation allows for sharing future GC logic for example.
  • Hooks in various execution environment to check cache and save to cache: LocalSerialEngine, LocalMultiprocessEngine, _stateful_pool_loop.
  • Use OrderedDict for Struct.kwargs to ensure consistent hashing.

Notes:

  • Now since the main blobs states and subjects are already in keys, both StepRequest and StepResult are light, so we can afford to do this inside engine before step calls, essentially doing cache checks earliest possible.

TODO:

  • There is some weirdness from tuples, that even for the same tuple for example, ((u'thrift', u'apache_java'),) may produce different hash. I tried if flatten the tuple pickle will be produce consistent result. Still need some work.

https://travis-ci.org/peiyuwang/pants/builds/114562555
https://travis-ci.org/peiyuwang/pants/builds/116460440

  • 0
  • 0
  • 7
  • 0
  • 7
Description From Last Updated
  1. 
      
  2. src/python/pants/engine/exp/scheduler.py (Diff revision 2)
     
     
     
     
     
     
     
     

    It's looking like the scheduler should not have access to the Cache. Since that is a property of execution, it should be a property of the Engine.

    The Scheduler/ProductGraph would then only have State(StorageKey) for each node, and the engine would do all Cache and Storage lookups.

    1. Addressing Storage to Engine move at https://rbcommons.com/s/twitter/r/3554/

  3. I think StepRequest needs to be visible to the Engine, and that the engine needs to do all Storage/Cache lookups. So _create_step below shouldn't be doing the subject lookup.

  4. src/python/pants/engine/exp/scheduler.py (Diff revision 2)
     
     
     
     
     
     

    I think StepRequest needs to be visible to the Engine, and that the engine needs to do all Storage/Cache lookups.

    So rather than this _call_ method, the Engine would be explicitly looking up all inputs (by their keys) and then calling node.step itself.

    So _create_step below shouldn't be doing the subject lookup... instead, it should probably be passing across the subject_key.

  5. src/python/pants/engine/exp/scheduler.py (Diff revision 2)
     
     
     
     

    I think this needs to move to the engine.

  6. 
      
  1. 
      
  2. This isn't actually creating blobs... it looks like the create method is just str()ing these objects directly.

    Instead, you need to be using the pickling infrastructure.

    I'm pretty sure this should just be a Storage.put(tuple(sorted_keyables()) call?

    1. you are right, i should just make them a tuple and turn them input key by calling Storage.put

  3. 
      
  1. 
      
  2. src/python/pants/engine/exp/storage.py (Diff revision 3)
     
     
     
     

    nit: in memory -> in-memory

  3. src/python/pants/engine/exp/storage.py (Diff revision 3)
     
     

    Do you mean "Save the StepResult for a given StepRequest." ?

    1. thanks for the catch, fixed!

  4. src/python/pants/engine/exp/storage.py (Diff revision 3)
     
     
     
     
     
     
     
     
     
     
     
     

    make these @property or prepend "get_"?

  5. 
      
  1. 
      
  2. src/python/pants/engine/exp/nodes.py (Diff revision 3)
     
     

    This shouldn't be necessary, because all of the subclasses of Node are datatypes... ie, namedtuple, which is pickleable without any extra work.

    1. removed, _asdict is probably be called by pickle, it will break if it is not.

      I added just to be sure, since _asdict returns a OrderedDict which is what we want.

  3. This is not the right place for this. The execution pool should not know about the cache. Probably the function that runs in the pool should store this?

    1. Fixed, wrong place. Between _stateful_pool_loop and Node's step calls, _stateful_pool_loop is a better place because cache is part of the execution environment.

  4. src/python/pants/engine/exp/storage.py (Diff revision 3)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     

    Although the cache should definitely be backed by Storage, I don't think the Cache should be using the same instance of Storage (ie, the same database file).

    For one thing, it creates a weird coupling in this method. But the other is that the content-addressable data currently living in Storage/LMDB is what goes under .pants.d... but the content stored in the Cache might go outside in .cache for example.

    1. Talked offline, now Storage provides add_mapping and get_mapping interfaces so it handles both content storage as well as retrieving links between keys, uniforming the implementations allows us we improving them the same way for example for GC.

      We can choose to configure storage in the same or a different location from cache.

  5. src/python/pants/engine/exp/storage.py (Diff revision 3)
     
     
     
     
     

    Let's talk about this today.

    1. will do a bit more debugging.

  6. 
      
  1. The factoring looks better now, thanks.

  2. src/python/pants/engine/exp/engine.py (Diff revision 4)
     
     

    Rather than continue, I think I'd prefer to see if/else here... this is more complicated than just a guard clause.

  3. src/python/pants/engine/exp/processing.py (Diff revision 4)
     
     
     
     

    I'm pretty sure this is still not quite right.

    This should be happening in engine._execute_step. The processing module should not know about the cache.

    1. gotcha, fixed now, moved to _execute_step

  4. xx

  5. The only reason __str__ was overridden was to make the string shorter... this change makes it match the default for datatype, so you could just remove the implementation at this point.

    1. attaching dependencies is more useful when eyeballing different nodes, project_tree is less information since they are all identical, but removing __str__ is a lot easier.

      removed.

  6. src/python/pants/engine/exp/storage.py (Diff revision 4)
     
     
     
     

    (just marking as a blocking issue... happy to help debug this)

    1. https://rbcommons.com/s/twitter/r/3574/ to turn off pickle memorization.

  7. src/python/pants/engine/exp/storage.py (Diff revision 4)
     
     

    This is probably more efficient as:

    self[self.HIT_KEY] += 1
    

    ... since it doesn't have the intermediate list.

    ditto other usages.

    1. fixed both places.

  8. src/python/pants/engine/exp/storage.py (Diff revision 4)
     
     

    path is not None

    (an empty string is a legal directory)

  9. src/python/pants/engine/exp/struct.py (Diff revision 4)
     
     

    I don't think this makes sense... it's already too late to turn this into an OrderedDict and have it make any difference, iirc.

    One thing you could do though is have _normalize_utf8_keys do something like:

    dict_type = type(kwargs)
    dict_type(six.text_type(k): v ...)
    

    ...to preserve the type of its argument.

    1. I misunderstood, I thought OrderedDict would sort by keys.

      did what's suggested to properly preserve the original dict type.

      sorting is not for now.

  10. 
      
  1. lgtm!

  2. src/python/pants/engine/exp/storage.py (Diff revision 5)
     
     

    nit:

    :return: <desc>
    
  3. 
      
Review request changed

Status: Closed (submitted)

Change Summary:

submitted as 73642848697b5ee364309bb046f17032c0d97201

Loading...