Customize tarfile module next() method

Review Request #4123 — Created July 25, 2016 and submitted

benjyw, jsirois, kwlzn, mateor, nhoward_tw, patricklaw, peiyu, stuhood

tarfile.TarFile object is iterable and has a next() method. next() will parse the header and save parsed info. During parsing, a lot of checks are done, to make sure the header is valid. And if there is something wrong with the header, exceptions will be thrown. next() catches a lot of them but not reraise what it catches in all cases.

We have a corrupted tgz file inside Twitter. During parsing, an error in one of the headers is caught, but next() hide it silently. Thus cache read succeeds with no error. When pants uses this corrupted tgz for analysis, it throws exception because it couldn't find a file which is supposed to be in the tarball.

From source code (, we can see that InvalidHeaderError will ONLY be raised if it happens in the beginning of the tar file. Actually, a lot of exceptions are hidden by tarfile module. tarfile module simply thinks these exceptions mark the end of tarball. It probably is not a violation of tar standard, but it is not acceptable for pants. Pants relies on the integrity of tarball. Any defect in tarball should cause cache read to fail. A partially extracted tarball serves no use to Pants.

This changeset does the following:
1. Customize next() method in tarfile.TarFile for Python2.x. Now it will throws exception loudly whenever an invalidheader is seen if errorlevel is bigger than 0. If on Python3.x interpreter, original tarfile.Tarfile will be used.
2. Add test cases.

ci green:

  • 0
  • 0
  • 3
  • 1
  • 4
Description From Last Updated
  1. lgtm mod a few comments/nits.

    1. Kris, thanks for reviewing. next() is basically a copy of the official tarfile module next() method. The only change I made is to throw an exception when invalid header is seen and errorlevel is bigger than 0. But I think the tafile module itself is not very well commented:) I will add the comments you suggested.

    2. ah, gotcha - that's my bad. I was looking at an older copy of for comparison (just noticed the 2.7.12 copy has most of this code verbatim), so feel free to ignore the related comments.

    3. ah yeah. I took next() method from latest It has some other differences from the tarfile module we currently use.
      diffs are mostly here
      notice that __read() change is not in tarfile currently used by pants. But the change in next() does not depend on that, and I believe we will switch to new tarfile module eventually, thus I think using next() from latest should be good.
  2. src/python/pants/util/ (Diff revision 2)
  3. src/python/pants/util/ (Diff revision 2)

    this could probably stand to get a better description of the reasons behind it so someone doesn't inadvertantly remove it.

  4. src/python/pants/util/ (Diff revision 2)

    maybe add a comment to help clarify the point of this check vs just seeking+reading every time?

    1. ah, nevermind on this comment - I was looking at an old copy of the module for comparison that didn't include this.

  5. src/python/pants/util/ (Diff revision 2)

    should this include additional data that might be useful in troubleshooting a broken tar, such as the offset at the time of the unexpected EOF?

  6. it might make sense here to write out some actual contents into the files in the test tar and verify they emerge identically post-extraction.

    this would help ensure your changes don't inadvertantly introduce corruption in the untarring process.

    1. this is probably also no longer necessary - I assumed you had added the seek offset-1+read(1) bits.

  1. Nice! I like the tests. That must have been fun to put together. I've got a documentation/license issue and a minor comment.

  2. src/python/pants/util/ (Diff revision 2)

    We might want to put a more substantial attribution for this. It looks like tarfile is MIT licensed:

    Also, I think it would be good to add a commit sha or something for the version of the method this copy+modifies. That way, it'll be easier to figure out the modification. It'll also be easier to check whether it needs to be updated.

    1. Nick, thanks for catching this. Should I copy line 5 - 18 of to docstring of next() method?

    2. Hm. I think that would be reasonable.

  3. src/python/pants/util/ (Diff revision 2)

    It might be a good idea to add a comment here that marks the modification so that future readers don't need to try to compare.

  1. Ship It!
  1. Please include/replace some of the reviewers with non-Twitter reviewers... this is a fundamental change.

    In particular, one or more of [jsirois, benjyw, patricklaw] might be good.

  2. src/python/pants/util/ (Diff revision 2)

    I'm uncomfortable with overriding a single method of a class which wasn't necessarily intended to be overwritten.

    IMO, it would be safer to include the entire module (with the SHA/note/attribution that Nick mentioned) than to override a single method.

    1. in that case, should I still put it under src/python/pants/util? and maybe I can call it directly?

    2. I agree it is safer to copy the whole file. Either way, we need some sync mechanism to keep up with latest version. I wonder if we have done similar things in Pants before.

    3. Drop this as Benjy and Kris suggest that this kind of monkeypatching is common in Python.

  3. This is 8 bytes.

  2. src/python/pants/util/ (Diff revision 3)

    Are we sure this license allows us to do this without affecting our own license? I.e., this isn't a transitive license like the GPL?

    Does our standard copyright notice at the top of the file contradict this notice? It's possible this needs to be at the top of the file instead of our standard one.

    Basically, someone who understands software licenses needs to look this over.

    1. From Stu's comments, I think I will just copy the whole tarfile module to Pants and make some minor changes. That way, we will have only one copyright notice from tarfile module at the top of the file.

    2. I disagree that copying the entire module is better. I think it would be worse. this sort of mobkeypatching is somewhat idiomatic in Python, so I have no problem with extending this class in this way.

    3. I meant "monkeypatching", of course. I will defer to Patrick's judgement here though. I just think that the less copypasta of core python code the better.

    4. fwiw, I did actually check that across all recent python 2.7 versions that there were no changes to other methods in tarfile that would cause an issue w/ the initial subclassing approach of next(). so I'm =Benjy - at least for python2.7 runtimes.

      I did not check for python3 tho, which is where I might expect this to become problematic assuming the tarfile impl could change more drastically. we don't really provide guarantees beyond best effort for python3k today tho afaict.

    5. So I have consulted legal expert inside Twitter. I quote his answer here:
      "This file ( is under a BSD-style license which is compatible with pants' Apache2 license (even the Python2 license is compatible). My recommendation is to keep the pants (c) at the top of the file and maintain the BSD-style license in the code block as you have done. In addition, if there is a LICENSES file for pants where we are listing third party OSS licenses for attribution, we should add this to that file list."

      So I think the current layout of copyright notice should be fine.
      As for LICENSE file, I think he actually means NOTICE file, since we don't have it in pants, it should be fine. I am checking with him on that.

  3. It would be good to test that reading non-corrupted tarfiles still works with our modified TarFile.

  2. src/python/pants/util/ (Diff revision 3)

    Just like Benjy, I am not a lawyer. And when adding outside licenses to your project it is probably best to consult with a professional. I'd prefer you ask your legal department for some advice. If you have a blocker for that I can talk to Foursquare's, I think.

    But my experience with this is as follows:

    CPython is publised under PSFL, which is modeled after the BSD license and is pretty permissive - so in this instance inlining the license is likely safe (above hedging applies).

    Reasonable ways I have seen OSS code inlining:
    1) the way you have it here - add the original license to the class definition
    2) Add the original copyright to the top of the file, but adding an end date to the copyright date, and then followed up with our own copyright notice, which would apply from today forward on the modified version.
    3) Inline the entire file unchanged and extend it in subclass, etc.

    I like any of those solutions.

    This is a code-base wide issue we should get a proper answer for. There is heavily modified Typesafe code in the Pants repo that doesn't include the Pants copyright at all.

    1. Mateo,

      thank you for the comment.

      I have asked corresponding team inside Twitter. Hope I can hear from them soon.

    2. Mateo,
      I have got response from legal expert inside Twitter. I have post it under Benjy's comment.

  1. lgtm - one final thought.

  2. src/python/pants/util/ (Diff revision 4)

    how about in lieu of whole-file copying, we could at least make this explicitly applicable to only python 2.x:

    import six
    if six.PY2:
      class TarFile(...):
      from tarfile import TarFile
  1. Sure, this solution is fine with me. Thanks!

  1. Ship It!
Review request changed

Status: Closed (submitted)

Change Summary:

a9957e91d8f262c8d1384c380cf8fb77ec075eb1. Thanks, Kris, Nick, Mateo, Stu and Benjy!