Support zinc name hashing.

Review Request #1779 — Created Feb. 16, 2015 and submitted

benjyw
pants
7bd1a93...
pants-reviews
nhoward_tw, patricklaw
- Upgrade to a recent version of zinc.
- Support version 5 of the analysis serialization format.
- This version moves the CompileSetup sections from the end of
the file to the beginning.
- This version adds a "name hashing" section to the CompileSetup.
- Add a pants option to turn name-hashing on.
-  Fixed the implementation of is_nonempty_analysis(), which can no
longer simply look at a prefix (because the order of elements in the zinc
analysis file has now changed).

We had already added support for splitting/merging the analysis sections
used by name hashing, under the assumption that their structure and semantics
were the same as for their equivalent pre-name-hashing sections.  We had been
told by TypeSafe that this assumption was correct, but had never tested it.
I have now verified that it is indeed true.

Note that when name hashing is turned on, the member* and inheritance* sections
are populated by zinc INSTEAD OF the direct* and public* sections. However the
"used names" section is populated AS WELL AS the "class names" section. This means
that turning on name hashing will cause analysis files to be larger. Whether this
is significant, in particular wrt split/merge times, needs to be measured. I suspect
it should be OK, since split/merge of these sections is simple - they don't have
the complicated internalization/externalization logic.

CI passes: https://travis-ci.org/pantsbuild/pants/builds/50907336

  • 0
  • 0
  • 6
  • 4
  • 10
Description From Last Updated
ST
  1. Awesome. Would you mind adding Nick Howard to this one?

  2. 
      
PA
  1. Very exciting!

  2. Any reason not to use abc.abstractproperty here?

    1. No strong reason, I just have an aversion to using abc because I have Python metaclass PTSD... Metaclasses make everything harder to debug, and it seems like huge overkill to solve such a minor problem. But I can go that route if you insist.

  3. Here or in the while loop below, could you provide a bit more explanation about what the intention is? It's throwing away a bunch of information and then returning "is non-trivial". Can you elaborate?

    1. Added better comments.

  4. Normal spiel about 'rw' here. Do we have guarantees about encoding of this file?

    1. I don't know what "Normal spiel" means.

      I believe the file should always be UTF-8. And it can actually contain non-ASCII because classnames can be anything. However the lines we care about here are ASCII-only, and line iteration is (I'm pretty sure) robust on UTF-8 content.

    2. After reading the docs on open again and experimenting a little, it seems like this is okay on POSIX systems because they do not differentiate between text and binary file modes. But the docs suggest adding b for documentation if you mean binary data, and here we do mean binary data (since we certainly don't mean ascii).

    3. Actually I think we don't want to add 'b': Its effect is to read the Windows line endings (\r\n) verbatim. Without the 'b', it'll convert them to \n before passing them to the caller. I'm pretty sure that's what we want, as we assume \n in so many places, including here.

      Even if we change this, it's something we need to do in a principled way, across all of Pants, so I don't want to touch it in this one place in this one change.

    1. Why? This works fine and acknowledges that we're aware that the line string has the \n at the end.

    2. Can it be the last line without a trailing newline? I guess it's a minor nit.

  5. I don't see how this can help compared to before--in the best case scenario it's decoded as ascii. Do you want to decode as utf-8 to be sure?

    1. Before this was simply a bug - arg_diff was treated as a string but was actually an instance of DictDiff. Calling str() will invoke __str__ on DictDiff, which I think does the right thing.
  6. Same thing--what's the str buying? I'm not convinced we should ever allow str, considering it means precisely opposite things in py3 vs py2.

    1. Same reason, except here these are ZincAnalysisElementDiff instances we were treating as strings.

  7. 
      
BE
BE
BE
  1. Ping?

  2. 
      
NH
  1. 
      
  2. use .format().

    Also it might be good to include more information about where the line is, unless this is caught somewhere else that already does that.

    If the parser hit a malformed analysis file, you wouldn't know what file it was from the error message.

    It looks like other ParseErrors from this class don't do that right now, but I think it would be nice to have.

    1. It would be good to have, but that's more of a refactor than I want to deal with right now.

  3. 
      
BE
ST
  1. Thanks Benjy!

  2. I think this will execute the format string for each input line? Maybe execute it once before looping?

    1. I'd like to think the runtime is smart enough to optimize this, but maybe not...

  3. Could use some more context here... filename maybe? Or catch and rethrow with filename?

    1. The parser could use some refactoring to allow it to do this easily, but that's a bigger change than I want to make right now... And to be honest this has always been enough information for me to figure out any problems.

  4. Might this be untrue in the context of distributed caching? Or do the cache keys account for different inputs to zinc?

    1. The scalac args that go into the setup should be mixed into the cache keys. If they aren't that's a bug, but not here.

  5. could pass maxsplit=1 here I guess?

    1. Good idea, done.

  6. 
      
BE
BE
  1. Submitted as 2d9972a20496dfd47b0e4ae1851832666443996c. Thanks everyone!

  2. 
      
BE
Review request changed

Status: Closed (submitted)

ST
  1. It looks like this change does not cause the buildcache hash to change, so if the cache fetches an artifact generated with the previous version, it fails with:

    FAILURE: Unrecognized version line: format version: 4

    I guess the tool bootstrap identity isn't included, but should be?

    1. Hmm. At least the format version should be. The tool identity might be overkill, as every upgrade will invalidate every artifact.

    2. A workaround is to bump the cache key generation in pants.ini. This will invalidate every artifact (not just scala ones) so it's a bit extreme.

    3. Wait, it looks like the tool jars are in the key. See platform_version_info() in scala_compile.py. Weird.

    4. It looks like the platform jars use the scala version, but not the zinc version. Will open a review to include that. Thanks!

    5. Opened https://rbcommons.com/s/twitter/r/1858/

  2. 
      
Loading...