Payload refactor

Review Request #1063 — Created Sept. 22, 2014 and submitted

patricklaw
pants
https://github.com/patricklaw/pants/tree/pl.payload_refactor_final
6c051d5...
pants-reviews
benjyw, ity, jsirois, zundel

Payload refactor.

  • Break Payload into a Payload/PayloadField abstraction, and fix the major design flaw preventing clean Target subclassing.
  • Allow explicit freezing of Payloads and always freeze Payloads in the Target superclass __init__.
  • Tighten up invalidation hashing (fingerprinting) and testing.
  • Change the name of "invalidation hash" in general to "fingerprint", since the former is ambiguous and used to mean different things in several different places.
  • Make Repository and Wiki plain objects rather than Target subclasses. Users who want to expose a custom Repository instance in all BUILD files now register this instance in their internal backend's register.py as an exposed object. Then consumers of this Repository refer to this object directly. This allowed the removal of the horrible hack in JvmTarget.provides. It gets us much closer to having truly immutable Payloads.

ci.sh is green locally

CI is green: https://travis-ci.org/pantsbuild/pants/builds/36363381

  • 0
  • 0
  • 3
  • 6
  • 9
Description From Last Updated
PA
JS
  1. page 1/3

  2. This assumes an implementation for the import items' hash that is collision free - seems a bit much for not even a comment explaining how you happen to know hash(_) will be collision free in practice.

    1. All uses of hash() for fingerprinting have been removed.

  3. 
      
PA
  1. 
      
  2. Good point. In general we never have a guarantee of collision free hashes, but python's internal hash implementation will surely be more likely to collide than sha1. Maybe I should always find a string representation of the underlying instances and sha1() those, as I do in other places.

    1. +1 to that - I think relying on a standard definition of collision-free, sha1 for now, is a good idea. Maybe it does make sense in the end to support a single container type - TupleField - and handle the hashing as the sha1 of the str(tuple) given.

    2. Part of the intent of this change was to get all of these underlying objects serializable as well. It might be that the best way to do this is to standardize on sha1(json.dumps(self.serialize())).hexdigest() in almost all cases. Unfortunately, I don't want to go down the serialization road in the same CR, since there are some difficult edge cases (JarDependency in particular, also OrderedSet). There's also the trickiness that JSON can't differentiate between tuples and lists, but I suppose I could avoid that by having a couple of custom serializer/deserializer implementations.

    3. If the serialize contract was primitive or tuple, nested ok - then I can't think of a datatype that you couldn't implement serialize for. I like the serialize idea - it seems strightforward for the 2 cases you mention as well.

  3. 
      
JS
  1. page 2/3

  2. These should not be exposed in jvm backend - they are particular to pantsbuild/pants. Larry is trying to get a pants internal_backend [1] that is divorced from the pantsbuild.pants sdist - that would be appropriate to use here.

    [1] https://rbcommons.com/s/twitter/r/1030/

    1. I see this has landed now, I'll update the CR presently

    2. Fixed, but blocking on https://rbcommons.com/s/twitter/r/1073/ upstream

  3. As discussed on page 1/3 - s/hash(self)/more robust/

  4. Consider +='s in this file for the tuple joining

  5. 
      
PA
  1. 
      
  2. Agreed, but right now we don't have support for this, and I don't want this change to block on that. I don't think this is harmful to leave in as long as we make a note to remove it as soon as there's a standard way to inject it properly.

    Would you be okay with me dropping a TODO saying as much?

    1. No - this really should be fixed. If you want to help review Larry's RB which he added you and Benjy to last week, then you could rebase and make a small move change using his work.

  3. 
      
ZU
  1. 
      
  2. nit: wrong oder

  3. I hope the example of moving repo definitions into register.py is well commented when you move it into internal_backend. That is very different from all of our other configuration.

    There is a huge hole in testing for testing all of the different payload types we have defined, making sure that sources in different order come out with the same hash, etc.

JS
  1. I've held back on p3/3 after offline talks with Patrick. He'll be changing some of the core code on that page.

  2. 
      
PA
JS
  1. I know there is at least 1 more diff coming. I'll way for that to do a final pass.

  2. src/python/pants/base/payload.py (Diff revisions 2 - 3)
     
     

    Why the nested sha1 of the key?

    1. Mild paranoia that I haven't been consistent about:

      hasher.update('foo')
      hasher.update('bar')
      

      is the same as

      hasher.update('f')
      hasher.update('oobar')
      

      Meaning there are lots of trivial collisions when just jamming keys and values into a hasher. The proper solution to this is to always generate a JSON serializable object, get its stable json representation, then hash that, but that was a bit more work than I want to do in one CR, and this is no worse than what we have now.

  3. src/python/pants/base/payload.py (Diff revisions 2 - 3)
     
     

    field = self._fields.get(attr) or else embrace the raise if that's the api - I think it should be - and just return self._fields[attr].value

    1. The problem is that we have some places (e.g. JvmTarget provides parameter) which set either an instance of a PayloadField (e.g. jvm.Artifact or None since that's the default, and acceptable. It would be awkward to make Target authors rewrap or None-check these types of parameters, so I decided that the API would allow for None as a value for a PayloadField instance. Do you have an alternative?

    2. I don't - but then the 1st suggestion stands - this is a bug unless you switch from [attr] to .get(attr) IIUC

  4. tests/python/pants_test/base/test_payload.py (Diff revisions 2 - 3)
     
     

    Kill or implement

    1. Still finishing up the tests

  5. shenanigans - mostly unused imports including this one

    1. I'm in the process of writing all of the tests in this module as we speak, which will use all of these.

  6. 
      
PA
  1. 
      
  2. src/python/pants/base/payload.py (Diff revisions 2 - 3)
     
     

    I'm not clear on how it's a bug. If attr isn't a key in the fields dict, I want it to raise in the natural way. It's if the value of the field in the dictionary is None that I need to do this special casing.

    1. Ah yes - I was being dumb. Thanks for hand-holding.

  3. 
      
PA
PA
PA
PA
PA
IT
  1. @patrick - i have refrained from commenting on this rb as you were still updating it, but this looks like a big change especially the api break so would like to review it. i will be able to get to this by tomorrow the latest, hang tight!

    1. NB: the API break is tiny:
      sed script across BUILD files to replace repo='<str>' with the corresponding repo=<object>
      export the repo 1 repo object per current repo target you have - for Twitter this is 2 objects

    2. tiny? :) It was in at least 2 places that I could see - both repo and payload, is that what you meant? and as you probably realized already, this does mean all the Twitter internal implementations would need to be modified to conform with these.
      Not undermining this change at all, I think its a positive change - I wanted to make sure that I assess how much work it would be to absorb this change and now that I have gone through all the four pages -- I am definitely more informed. Thanks for waiting for me to review!

  2. 
      
JS
  1. 
      
  2. TODO(John Sirois): why is an override needed here?

    1. Published this too soon - from p 1/4 - I still need to research this and may find my own answer by the time I get to p 4/4.

  3. This raises - no super called yet

  4. Needed any longer? Its odd to define __hash__ but not __eq__

  5. How did you determine allowing None is ok here - this is an api change with no obvious justification.

  6. 
      
JS
  1. page 2/4

  2. Again - now I doubt myself - this must be broken since super has not been called yet and thus no self.payload yet.

  3. provides isn't meant in the plural sense, its intended to be read "this target provides this artifact" - so its singular and so s/ProvidesTupleField/ProvidesField/

    ie: this is a case for ProvidesField or None you educated me on earlier in this review

  4. more death before super ... at this point I have to suspect this somehow works, but at the very least it locally makes no sense at all and should be moved after super to look sane for the reading.

    I'll refrain from more comments on this issue, but if it is a real issue, please fix throughout targets/ in this review.

  5. For inline repo declarations presumably. I know I captialized Duplicate & Skip, but that way because Duplicate.CONCAT, ... Skip were rules. maybe not justified. It seems to me Repository should be repo(...) - its not a constant like those init-caps examples are at any rate.

  6. Odd to have __eq__ and __hash__ defined and yet using different field sets. consider __eq__ over the same tuple as __hash__ for sanity sake... or revert the __hash__ change.

  7. NB, neither a description change nor a repo change should invalidate the owning Target .. for compile, doc, test, etc - only maybe for publish if it used an invalidator.

  8. consider isinstance(other, Repository) and tuple == tuple for simple parallelism with __hash__

  9. 
      
PA
  1. 
      
  2. Other way around--this raises if super is called first. super calls self.payload.freeze().

    1. Where soes the magic non-None self.payload come from then? Either I'm dense or this will be a continual reader's confusion.

  3. I think I remember why this was added, but I think the reason has now been obviated. Removing.

  4. Good question. I'll undo the default param and fix the one place that I changed the invocation (unnecessarily).

  5. 
      
PA
  1. 
      
  2. Unfortunately if you look at how it's actually used in this target (rather than JvmTarget), that's not the case. I documented as much below--if this needs to be fixed, it should be in a followup CR.

    1. tanget - when replying to a comment from the diff view, click the reply button in the 'Other reviews' blue pane adjacent (left) to the green 'Your comment' pane. This will make the final review comment appear inline to the the original query and is generally much more sane. This is actually the single RB misfeature that I am regularly actually upset by.

  3. Will roll back since I think it was supposed to support the old, bad hash() behavior. But in a followup I think we should audit all of our hash/eq methods.

  4. Okay, this is important. I will elide them from the fingerprint.

  5. 
      
JS
  1. page 3/4

  2. TODO(John Sirois): isn't there now a helper for this that gets ensure_ascii, allow_nan, and sort_keys correct in one central spot?

    1. Yup, missed this one.

  3. src/python/pants/base/payload.py (Diff revision 4)
     
     

    I'm very much a fan of "state changes should be reflected by returning new objects" - ie the caller should receive an FrozenPayload and use that going forward. This is tantamount to saying mutable objects (save for maybe builders) are evil though, and that may be throwing the baby out with the bathwater. As you see fit.

    1. Yeah I didn't like passing the Payload up, but honestly the surface area of this change is pretty small. If we decide the other way is better, flipping the interface isn't too hard at all.

  4. src/python/pants/base/payload_field.py (Diff revision 4)
     
     

    Unused? looks like .num_chunking_units calls receivers are now all Targets

    1. Good catch, will remove.

  5. 
      
PA
  1. 
      
  2. See Target.payload

    1. Will do and I'll try to find a constructive alternative. If you read this from a pure python code-reader perspective though, I think you have to admit it will be a major cause for confusion.

  3. 
      
JS
  1. 
      
  2. 
      
JS
  1. 
      
  2. src/python/pants/base/target.py (Diff revision 4)
     
     

    I really don't like this. How about take a payload object in the constructor and freeze that - this way the subclass code looks normal, and the inheritance chain just does:

    def __init__(name, address, build_graph, payload=None, ...)
      payload = (payload or Payload()).[rest]
      super(...)
    

    I'm just really uncomfortable with accessing a field before construction - uniform access principle making the baseclass property look like a field at any rate.

    1. I just answered my own objection. I think this is just me reacting to 'too cute'. If there were a Target.add_field method that hid the gymanstics and self.payload were self._payload or self.__payload, then the subclasses would read sanely.

  3. src/python/pants/base/target.py (Diff revision 4)
     
     

    Wait - this is crazy to read - a class field is being used as a safeguard to set an instance field of the same name (you could be using hasattar). It seems to me this is a perfect use for __new__, but that still leaves the readability problem. This will never read well in subclasses - its just odd. If the oddness is the lesser evil, then I think a docs/ rst develper update is in-order. I now get it, you get it, Wickman is a super-fast genius and would get it before having to ask anyone, but I think thats the end of the line on comprehension.

    1. The class attr becoming an instance attr is idiomatic python for memoizing, but that's not the real issue.

      Taking a step back, I agree about the goofiness of the magic property here. I was trying to avoid boilerplate, but it's not that much boilerplate and it's not worth compromising comprehension of a core interface. I'll change to the builder model, shouldn't be much effort.

  4. 
      
JS
  1. page 4/4 lgtm

  2. 
      
JS
  1. https://rbcommons.com/s/twitter/r/1073/ is in master @ https://github.com/pantsbuild/pants/commit/5e8a6c9eee8656172a1a0a57957b6a2d94c58f46 now, so you're good to go from that angle.
    I'm also happy enough with my review to ship - I know you'll fix the bugs that are really bugs. You understand my issues with .payload access in __init__ pre-super as well and I trust your solution be it docs or code.

    Thanks for this core refactor and go, go, Go!: http://web.archive.org/web/20101125181523im_/http://madriverglen.com/ski_museum/index-Images/40.jpg

  2. 
      
PA
PA
PA
PA
PA
JI
  1. Please holding on merging... we have a Twitter internal pants_internal build break due to this RB. Investigating...

    1. Note that there were two API breaks: how you manipulate payloads in custom Target subclass __init__ methods, and how you declare Repository objects for JVM publishing. There are several examples in this CR of how to adjust your internal repo accordingly, and according to John this should not be a difficult change for Twitter to make.

    2. Thanks, I didn't have much context on this RB, but will now take a look at it and the fixes.

    3. For how the Repository setup changed, see these two files:
      src/python/internal_backend/repositories/register.py
      examples/src/scala/com/pants/example/hello/welcome/BUILD

      For how payload manipulation changed, see:
      src/python/pants/backend/core/targets/resources.py
      or
      src/python/pants/backend/jvm/targets/jar_library.py

  2. 
      
PA
PA
PA
IT
  1. are you still adding more tests?

    1. Nope, tests are done.

    2. I've addressed all of the issues below and updated the CR. I plan on submitting later today if there aren't any other issues.

  2. where is sha1 defined? i cant see it in the imports

    1. Interesting, good catch. Also obviously missing coverage, I'll add a test to hit this. Eric Ayers has also mentioned that this weird behavior (imports being either a JarDependency or a spec) is not used in practice, and we can simplify this code with a deprecation cycle.

    2. (In this case, I don't think we need a deprecation cycle. I can't imagine someone else is using the imports= feature of protobufs.)

    3. I'm going to leave the support for JarDependency now to reduce the scope of this change, but I suggest in a quick followup that one of us throw away all of that code and just use address specs.

  3. is there no need to self.assert_list(jars ..) anymore?

    1. No particular reason not to, added it back to reduce the change.

  4. shouldnt this check be before the payload gets created?

    1. It doesn't really matter AFAICT, and this change is focused on making this API break in the least disruptive and easiest to review way possible, so I'm not going to unnecessarily mess with all of the Target subclass __init__s except dropping comments.

  5. src/python/pants/base/payload_field.py (Diff revision 7)
     
     

    forgot to implement eq?

    1. I can't think of a good way to implement __eq__ generally, and for a lot of subclasses it doesn't really make sense to implement __eq__. On the other hand, I think there's no good reason for this to have __hash__. I'll remove it.

  6. 
      
ZU
  1. Thanks Patrick, I am adding working on a new target for our repo and this is going to make it a lot easier.

  2. 
      
PA
PA
IT
  1. Ship It!

  2. 
      
PA
PA
Review request changed

Status: Closed (submitted)

Change Summary:

Thanks all, submitted at 7bcd00e8c7c98719055411bce758d04e838a8b45
Loading...