Make checkstyle's options filename-agnostic.

Review Request #3975 — Created June 6, 2016 and submitted

3555, 3559
benjyw, patricklaw, stuhood, zundel
See issue:

We were getting excessive cache invalidation in our shared build
cache for checkstyle, because it was fingerprinting the absolute
paths to the configuration files. There's really no reason for
checkstyle to care about the names of the files containing rules
and suppressions, all it should care about are the file contents.
This change enforces that.

Added integration tests exercising the caching behavior (unit tests don't work because FakeOptions are not fingerprinted).

CI went green here:
CI went green here:
CI went green here:

  • 0
  • 0
  • 6
  • 0
  • 6
Description From Last Updated
  1. This looks fine, but it's a shame to have it be an ad-hoc thing in just this task. Is there a way to generalize this into the fingerprint system itself? We used to have something like this (invalidate_for_files).

  1. I don't love that this is so much ad-hoc code in this one task. Especially since we already have a fairly sophisticated fingerprint mechanism. It would be better to incorporate this into the fingerprinting machinery, so all tasks can enjoy it. For example, could this be a fingerprint strategy?

    1. I wasn't sure, since we already have a file_option used by --configuration that explicitly does mix in the file path. And the way it works over the properties dict, fingerprinting the contents of any values which happen to be files, is a little weird and maybe not widely applicable?

  2. Accessing a 'private' method in Task like this is a code smell.

    1. Oh whoops that was just there for debugging, that line doesn't do anything.

  1. I'm fine with this solution, but let's hear out the other reviews (and also please add Patrick, as he did the fingerprinting stuff IIRC), just in case others have other opinions on how to do this.

  2. src/python/pants/option/ (Diff revision 2)

    I think you want to call dict_option() here, no? See above.

    1. I don't think so, since we favor type=dict over type=dict_option.

    2. type=dict gets turned into type=dict_option behind the scenes (

      dict_option does various special-case handling of dicts, which your code bypasses.

    3. It doesn't quite work:

          E            File "/Users/gmalmquist/Src/Pants/src/python/pants/backend/jvm/tasks/", line 122, in checkstyle
          E              for k, v in self.get_options().properties.items():
          E          Exception message: 'DictValueComponent' object has no attribute 'items' 

      I think there's some magic that happens with the options that isn't replicated by return dict_option(s).

      Do you want me to use return dict_option(s).val?

    4. The "magic" happens in is_dict_option() in (or rather the thing that calls it).

      You'll need to modify that function to check for your new type. I think that should be sufficient.

    5. Thanks, that did the trick.

  1. Thanks for doing this! #shipit (pending Benjy's feedback, and maybe a naming nit)

  2. src/python/pants/option/ (Diff revision 2)

    Given that this is dict_option + file_contents_option, the full name would probably be dict_with_file_contents_option.

    But, iirc, Benjy added support for applying a type to the members of a collection; could that be done here? ... ie:

    register(.., type=dict, member_type=file_contents_option)
    1. member_type is only for lists, afaicr.

    2. It would also be wrong here, because not all values of dict are required to be file paths.

    3. If we were to be fully pendantic we'd have to call it dict_with_some_file_contents_option, but that seems very clunky.

  1. Leaving file_option() and fingerprint_files() as is seems like a landmine waiting for the next task implementer to step in to. Maybe in _fingerprint_files() we should fingerprint the filepaths relative to the buildroot? and/or maybe we should we warn if an absolute path is passed to file_option()? At the very least add an implementation note on file_option() about avoiding using an absolute path here.

    1. I agree, but I think that may be a job for a separate review.

    2. You wouldn't need file_contents_option() at all if we just relativized the filepath to pants_buildroot before fingerprinting it. You would also fix findbugs and scalastyle without having to change more ccode.

    3. I don't think we want to relativize any part of the path for checkstyle though. What does pants care what the filename of the rules file is if the contents haven't changed?

    4. Then let's consider remove baking in the filename. None of the other code in pants that uses file_option() cares either so why are we creating 2 ways to do this, one of which almost surely introduces a bug?

    5. If you don't feel comfortable changing the implementation of file_option(), maybe you could deprecate it and convert all existing uses of file_option() to the new file_contents_option()

    6. I was considering that at the outset of this review; but that could potentially be a breaking change I think, if anyone has plugins that depend on that behavior?

    7. There are so few current uses of file_option, I think it would be fine to have them all use content only.

    8. Okay.

  2. This is confusing and requires explanation.

  1. I'm vaguely uncomfortable with removing the path from the hash. It's likely fine for direct use of file_option, since if you're expecting a single file (a config file, for example) the name of the option acts as its name implicitly.

    But type=list_option, member_type=file_option, which used to provide strong enough guarantees to match the fingerprint of target sources, would now match for various permutations of the inputs.

    If it would be possible to relativize to the buildroot and have that work 99% of the time, that would be preferable.

    1. I'm not sure it would necessarily match for various permutations of the inputs if the file contents were different. file_option bakes in both path and content and those contents are still baked in and you'd still get the ordering guarantee if file contents were different. Still, I think that relativizing the path is the way to go that would introduce the least possible incompatibilities.

    2. The files OrderedDict(('a', 'a') ('b', 'b')) and OrderedDict(('b', 'a') ('a', 'b')) would hash the same way.

    3. Stu, can you clarify what you mean above? If it were a dictionary type of option, the key would be baked in.

    4. Sorry. Meant: two files in each collection, with the names swapped, but the contents the same. How about: [('a', '1') ('b', '2')] and [('b', '1') ('a', '2')]

      Anyway, this is an improvement... won't hold it up.

  2. Does anything set contents_only=False any more?

  1. Thanks for iterating on this. I think doing the "right thing" by default is is going to prevent inadvertent shared cache invalidation between our CI builder and laptops in new plugins. I hate discovering that using the remote buildcache is mysteriously broken.

  2. Add a test to exercise this?

  1. Ship It!
  1. Ship It!
  2. This is subject to collision. I'd recommend making a map from path to hash-of-contents, then using the provided stable_json_sha1.

  1. So we're back to always fingerprinting the path as well as the content? And this is OK for your use case because the relative pathname is stable?

    1. That's exactly right.

  2. This isn't new, but since you're here, might be worth a docstring comment that this assumes the files are small enough to read entirely into memory.

  3. Same here: This assumes that the files are small enough to read into memory. Might be worth a comment in the docstring.

Review request changed

Status: Closed (submitted)

Change Summary:

In 5a59b2a3349bd715458e98fed27a001e92031a4f. Thanks Stu, Eric, Patrick, and Benjy!