Changes to zinc analysis split/merge test data generation:

Review Request #2095 — Created April 17, 2015 and submitted

davidt, patricklaw, stuhood
- Fix various issues with the anonymizer.
- Add a mode to the test that causes it to generate canonical test data.
- Add copious comments on how to use all of the above.
- Replace the old test data tarball with loose test files that have
  non-ASCII content. Python tarfile doesn't play nicely with non-ASCII
  names (even with the pax format), so loose test files are easier.
  Git should compress them fine anyway.
- Made the test a little more comprehensive - it now verifies the merged
  analysis against an expected file.
- Made sure that split and merge don't leave dicts with empty values,
  as those create spurious diffs when comparing analyses that are
  semantically identical.

This change establishes that our split/merge logic is sound in the presence
of non-ascii file/class names in zinc analysis files.

TODO: Test that rebasing is similarly robust to non-ascii content.

CI passes:

  • 0
  • 0
  • 4
  • 0
  • 4
Description From Last Updated
  1. Is there a strong motivation to turn these into loose files? On the one hand, I dislike checking in binaries, but on the other hand this is generating a lot of random words that will pollute git grep results. Could this alternatively just be a binary that lives in bintray rather than checked in at all?

    1. No strong motivation, mostly that getting tarfile to work with non-ASCII filenames is going to be a lot of finicky work.

    2. i prefer in-repo over a tarfile anyway, and the git-grep ship kinda sailed when jquery was checked in

  2. It's not clear to me that this is the case in all terminals, since some terminals use different encodings. Might be worth including a little python script to do the same thing very precisely.

    1. Will table for a future change, but I qualified the comment.

    1. In fact, if the intent here is to ensure that this is unicode but it might utf-8 encoded bytes, then you'll want to use arg_diff.decode('utf-8'). Ideally though we should be positive what type this is, and either not need decoding or definitely need it.

    2. Actually, we need to use our special utility (src/python/pants/util/ that does type checking if we don't know whether it's unicode or encoded bytes, because in python2 calling u'føo'.decode('utf-8') results in a decode error (calling decode on a unicode object first encodes it using the standard encoding, which is ASCII, which fails before it gets a chance to try to decode it).

    3. arg_diff is neither bytes nor text, it's an object of type DictDiff. That cast to unicode is intended to invoke its __unicode__ method. I'm not sure if __unicode__ means anything in python3 though?
    4. Anyway, switching to six.text_type does the right thing in py2 and py3.

  3. The u'...' is unnecessary because we import __future__ ... unicode_literals.

  1. Ship It!
Review request changed

Status: Closed (submitted)

Change Summary:

Submitted as e4321d1888c160a8493f5b333ec4442861c5a240.