Add support for pants-ignore to ProjectTree

Review Request #3698 — Created April 15, 2016 and submitted

2956, 3232
benjyw, gmalmquist, stuhood

This review adds built-in support to src/python/pants/base/ for an option --ignore-patterns. This will be a new option, independent from the existing --ignore-patterns option.

The idea behind the split is that there are patterns that should never ever be matched by pants (similar to the things that are utterly ignored by git: those specified by .gitignore files) and should thus be pushed down into ProjectTree, which we hope will eventually be responsible for all filesystem access in pants. Additionally, there is a separate, higher-level set of ignore patterns that only applies to BUILD file parsing. The current --ignore-patterns option represents the latter high level option, as it is only (and should only) be applied to BUILD file parsing.

Changes made:
1. A new option has been added in
2. ProjectTree takes an additional init argument which is a list of ignore patterns (default to be [])
3. All public methods of FileSystemProjectTree and ScmProjectTree have been made private and renemed to ..._raw since they don't have ignore logic.
4. ProjectTree base class now has the implementation of common APIs for FileSystem and SCM project tree, with pants ignore logic. When querying ignored path, it will be shown as non-existent. When accessing ignored path, an exception will be thrown.
5. ProjectTree created in legacy package is defaulted to ignore dotfiles.
6. add 2 new test bases - ProjectTreeTestBase and PantsIgnoreTestBase (derived from ProjectTreeTestBase)
7. 2 new test files to test pants ignore, 1 for file system project tree, 1 for scm project tree.
8. remove "API public" doc string that was accodentally added in

Something to notice:
1. For pants ignore to work, there should be no "." in relative path to build_root.

Latest CI:

  • 0
  • 0
  • 17
  • 1
  • 18
Description From Last Updated
  2. src/python/pants/base/ (Diff revision 2)

    Rather than filtering the input value, you'll want to filter the output value.

    So if blah.txt is excluded, and the glob1 is for *.txt, then blah.txt should not be returned.

    BUT, in some cases, if the directory matches the patterns, then you can also skip running the glob at all.

    1. Also, should check with Timur, but I believe the API for the ignore patterns is such that you need to add a trailing slash to a directory in order for it to be matched correctly.

    2. Stu, I have created a new revision regarding this issue. Can you take a look at that one? Thank you!

  1. Looks good to me! Please add support to ScmProjectTree as well.

    1. Also, when you're satisfied that this is ready to land, please see the "create a pull request" bit in ... we need a green CI in order to merge this.

  2. The underlying library supports taking a list of inputs... perhaps add a second method on the base class that takes a list and filters matches?

    def filter_ignored(self, path_list):
      ignored_paths = ...
      return set(path_list) - ignored_paths
  3. This should probably fail loudly with an exception for requests for filtered files...

    1. Once you've addressed the feedback below, please add gmalmquist, and benjyw to this review.

  2. src/python/pants/base/ (Diff revision 4)

    Can drop 'pants' from the name here; it's implicit.

    Maybe ignore_patterns?

  3. src/python/pants/base/ (Diff revision 4)

    Returns True if

  4. src/python/pants/base/ (Diff revision 4)

    I don't think this will preserve the order of the inputs... probably use OrderedSet(path_list) instead.

    Also, consider returning early with the input list if no paths were ignored (ie, ignored_paths is empty).

  5. src/python/pants/base/ (Diff revision 4)

    ditto naming

  6. Since this is a dir, I think it needs a trailing slash appended in order to properly match the patterns... should confirm/fix with a unit test.

    1. A trailing slash is required for a dir to match correctly. In that case, isdir, exists, listdir, lstat, relative_readlink all have to be enhanced. If ignore pattern is "fruit/" and "fruit" is a dir, then "fruit" should be ignored. But isdir('fruit') will return true becase "fruit" does not match "fruit/", os.path.isdir("fruit") returns true.
      So for those methods, I think we want to check if the path is dir first, then apply '/' if it is dir.
    2. checked

  7. 'The path {} is ignored in {}'.format(file_relpath, self)

  8. src/python/pants/base/ (Diff revision 4)

    Can you move these access exceptions to a helper method? ie, self.raise_access_ignored(relpath)

  9. The fast_relpath function enforces that all paths are located below the root: should use it here both because it is faster, and because it has that enforcement.

    1. It seems fast_relpath was not used in scm_project_tree originally. _scm_relpath in scm_project_tree is also using os.path.relpath. Do we want to change that to fast_relpath as well?
    2. Yea, it should... ProjectTree should enforce that files it works with don't live or link outside the buildroot that is passed in.

  10. Please add a comment indicating what this is doing.

    Also, note that we generally use single quotes for strings in the repo.

  11. Maybe add a comment indicating that this option is experimental, since it won't yet apply everywhere.

  12. Both subclasses of ProjectTreeTestBase seem to have mostly the same methods... you should move as much shared code as possible into the base class.

    One thing you will see is that the concrete base classes should extend unittest.TestCase, but the concrete base class should extend AbstractClass.

    See tests/python/pants_test/engine/exp/ for an example.

    1. Er, "but the abstract base class should extend AbstractClass"

  2. Stu, on line 17, I used self.root_dir which is set in setUp() method defined in base class.
    However, when I move all these test_... functions to base class, it complains about FileSystemPantsIgnoreTest has no attribute root_dir.
    Should it still run setUp first to initialize self.root_dir?
  1. Thanks Yujie. Let's wait for some more folks to chime in.

    Can you please include a link to a green CI in the "Testing done" section?

  2. This is only used in one place... probably not worth breaking out.

    1. this is used in as well. in function: _append_slash_if_dir_path.
      _isdir_raw is an abstract method in project_tree since filesystem and scm have their own implementations.

  3. It's not very obvious why this is being appended... maybe have a form of isignored for directories that appends the slash?

    isignored_dir? or a param isignored(relpath, directory=True) or something.

    1. added a new parameter "directory" which is default to False.

  4. The 'isignored' check is cheaper than the actual os.path.isdir check... should probably swap the order of these.

    Also, consider just:

    return not self.isignored(..) and os.path.isdir(..)
    1. (oh, I see why now. bummer.)

    2. Cool. I tried to reply as a comment. But somehow it came as another review...

      paste here anyway so other reviewers may see.

      if we do isignored first, there is a counterexample.

      assume 'fruit' is a dir in root. ignore pattern is 'fruit/'

      isdir('fruit') should return false because it is ignored.

      but if we do isignored first, then 'fruit' does not match 'fruit/', isdir('fruit') will then return True.

      I shuffled the order to avoid that case.

    3. To reply to the comment, you have to reply from the "View Reviews" screen, rather than the "View Diff" screen. Weird, I know.

  5. unnecessary temp variable

  6. ditto

  7. ditto

  8. I don't think listdir returns things with slashes on the end, unfortunately... it's not stating them, so it doesn't know whether they are dirs.

    For now, I think you might need to basically just do something similar to self.exists across all of the entries. But please include a TODO pointing to

    1. I think _append_slash_if_dir_path already does that.

      _append_slash_if_dir_path does 2 things:
      1. check if input path is a dir
      2. append trailing slash if it is a dir.

      so for each item returned by listdir, I will append trailing slash only when it is a dir and then do the matching.

      Does it address the concern?

    2. Yea, you're right... it does. Missed that.

      Please still add the TODO here.

    3. done.

  9. xx: use fast_relpath rather than replace in both of these.

  10. is this strip supposed to be an rstrip?

    1. I used "strip" along with "replace" because after "replace" the string may start with '/'. But after I change this to fast_relpath, I believe we can just use rstrip.

    2. checked.

  2. if we do isignored first, there is a counterexample.

    assume 'fruit' is a dir in root. ignore pattern is 'fruit/'

    isdir('fruit') should return false because it is ignored.

    but if we do isignored first, then 'fruit' does not match 'fruit/', isdir('fruit') will then return True.

    I shuffled the order to avoid that case.

  3. I think _append_slash_if_dir_path already does that.

    _append_slash_if_dir_path does 2 things:
    1. check if input path is a dir
    2. append trailing slash if it is a dir.

    so for each item returned by listdir, I will append trailing slash only when it is a dir and then do the matching.

    Does it address the concern?

  4. I used "strip" along with "replace" because after "replace" the string may start with '/'. But after I change this to fast_relpath, I believe we can just use rstrip.

  1. @Yujie: Do you think you could move some of the repeated logic from the two ProjectTree implementations into the baseclass? You'd have more concrete methods on the base class, and have abstract private _raw methods instead.

    1. OK I will see what I can do.

  1. Looks pretty good. I've got some minor suggestions / nits I noticed and a couple issues I wanted to highlight.

    1. Nick,
      thanks for reviewing it.

      I have addressed some of the comments, marked with "done". I will either address or comment on remaining ones soon. Meantime, I have refactored the code a little bit to reduce duplication. I have post a new review about this.

  2. Use os.path.sep here.

  3. Use os.path.sep here.

  4. Use os.path.sep here.

  5. Use os.path.sep here.

  6. src/python/pants/base/ (Diff revision 6)

    nit: rm spaces in ignore_patterns = None.

  7. src/python/pants/base/ (Diff revision 6)

    nit: period.

  8. src/python/pants/base/ (Diff revision 6)

    Instead of creating a list here, you could check len(matches) > 0.

  9. src/python/pants/base/ (Diff revision 6)

    nit: period.

  10. src/python/pants/base/ (Diff revision 6)

    You could use a comprehension here:

    return [p for p in path_list if p not in ignored_paths]

    1. It works well with Timur's suggestion to remove dependency on OrderedSet, although I have a concern about performance may get a little worse.


    2. I'm not sure it would be worse. It really depends on the number of elements and the nature of the operations.

      I'd recommend writing the clearest implementation, then doing some benchmarking and profiling. If it turns out that the bottleneck is there, then change it. And, include a note, or name things so that later readers see the reason it is the way it is.

  11. src/python/pants/base/ (Diff revision 6)

    nit: capitalize & period.

  12. src/python/pants/base/ (Diff revision 6)

    nit: capitalize & period.

  13. src/python/pants/base/ (Diff revision 6)

    nit: capitalize & period.

  14. Use os.path.sep here.

  15. Instead of "{}/", I think this should use os.path.sep.

  16. src/python/pants/base/ (Diff revision 6)

    I think doing a remove here changes the traversal done by _do_walk since _do_walk uses the dirnames to recurse and does it after yielding. Could you add a note about that, because it's a little confusing.

    Alternatively, you could undo removing dirpaths below, which would make it so that the walk is unaffected, only the output.

    1. change the traversal behavior is my intention actually. Because for ignored dirs we do not want to walk them. If I keep dirpath, then walk will go to every dir and every file which is unnecessary.

    2. Maybe the filtering should happen in _do_walk then? That way you could filter the recursion for both topdown values.

    3. I could, but I prefer not to. Two reasons:

      1. on FileSystemProjectTree side, we used os.walk (corresponds to _do_walk in ScmProjectTree). The document about os.walk says (
        When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again. Modifying dirnames when topdown is False has no effect on the behavior of the walk, because in bottom-up mode the directories in dirnames are generated before dirpath itself is generated.

      I think the raw implementation of "walk" in ScmProjectTree should behave the same way as os.walk.

      1. In the latest code review, the ignore logic has been extracted from 2 derived classes and moved to ProjectTree base class to remove duplicate code. Only raw implementations are in derived classes now.
    4. That's a reasonable. You might want to note that in a docstring too.

  17. Use os.path.sep here.

    1. Git uses / internally, so that may not be accurate.

    2. In the latest version, I moved all these pattern matching logic to base class. So they are shared between FS and SCM project tree. I wonder if I should use '/' for all cases?

    3. @Stu, Ah. So, it depends on whether it's on the fs side or the pattern creation side as to whether we should use it?

  18. src/python/pants/bin/ (Diff revision 6)

    nit: I think pants_ignore should be a required arg here, since it's only used from line 184 and it's not public.

  19. nit: capitalize and period.

  20. nit: capitalize, and period.

  21. nit: add a period after patterns

  22. If we're making this class no longer public, the summary should call that out more clearly. Could you add a note about it? Methods marked :API: public are special after 1.0.0 since they'll have to go through a deprecation cycle(

    It would be good to make sure that these are not used outside the pants repo if we're unmarking them.

    1. I think that this was pretty clearly accidental: I suggested the Yujie remove these.

      BuildFile itself is not a public API for subclassing, let alone its tests.

    2. Ok cool. In that case, I'd prefer if this were a separate change, but I'd be happy with a note in the summary so that it ends up in the git history.

  23. nit: year

  24. Feels like this could be extracted as a helper method.

  25. How would you feel about extracting a variable for the full file list and making these assertions about what was missing rather than what's left.

    I feel like that would make the examples more clear. eg, for this one, it could say self.assertEquals(self._all_files, set(files_list)).

    For some tests where we're expecting certain things to be ignored, it might be clarifying to show what is ignored.

    eg the /apple case, where only the root apple entry is removed.

    1. It is a good idea!
      I have made the change accordingly.

  26. nit: year

  27. Should this use fullpath too? To mirror the other methods?

    1. well, rmdirs is only meant to be used for cleanup, ie, remove base dir, so it should not take a relative path as argument. I think I will change the name of this method to rm_base_dir, in correspondence to make_base_dir.

    2. thanks.

  28. See comment on ScmPantsIgnoreTest.

  29. nit: year

  30. Could you add a docstring or a comment noting that the test cases are in PantsIgnoreTestBase? I don't think we have a lot of shared test suites like this, so I'd prefer if that was called out somehow.

    Alternatively, we could use a different naming convention for shared test suites, Base is used for classes that contain a framework for tests, but not tests themselves. Maybe something like PantsIgnoreTestShared.

    1. I personally prefer the first approach. I think base in pants tests always contains common things that subclasses should do. They can be not only setup and teardown steps, but also common test cases. FileSystemPantsIgnoreTest and ScmPantsIgnoreTests can also define their unique test cases. I will add docstrings in both derived classes.

  1. Thank you very much, looks awesome.
    My main concern is about global option - I think it's confusing while it's affect only new engine part of Pants.

    1. It affects core pants as well, but only for BUILD file parsing. So it will likely just look redundant.

      But not for very long: Kris is currently working on the review to move engine code out into core.

  2. src/python/pants/base/BUILD (Diff revision 6)

    I think we are trying do not use twitter.common.collections in pants anymore, so better to remove this ProjectTree dependency.

    1. This combined with Nick's review works pretty well.

  3. src/python/pants/base/ (Diff revision 6)

    So it's list with file paths?

    1. It's list with both file paths and dir paths.
      I have modified doc string to "Takes a list of paths and filters out ignored ones."

  4. src/python/pants/base/ (Diff revision 6)

    More 'pythonic' way to do this:
    if not ignored_paths:
    I'm not a big fan of this way but...

    1. I wonder if we use this widely in Pants? If so I would love to follow the convention. Although I think it is more confusing for readers.

  5. src/python/pants/base/ (Diff revision 6)

    Almost all usages of this method are self.isignored(self._append_slash_if_dir_path(relpath)), so I prefer to see this append inside isignored method.

    1. It is actually used in 2 methods: "isignored" and "filter_ignored"

  6. src/python/pants/bin/ (Diff revision 6)

    In addition I think it's a good idea to make pants_ignore required in whole ProjectTree hierarchy.

    1. You mean to make pants_ignore required in ProjectTree class? Is there a particular reason for that? I make it optional because I think there may be use cases where users of this class do not want to ignore anything.

  7. I think it makes sense to mention that option not only experimental but also not working outside of new engine and even maybe therefore not introduce this option right now.

  8. nit: better to use same quotes.

  9. Why AbstractClass and not unittest.TestCase ?

    1. This is an abstract class because _mk_project_tree will have different implementations for different derived class. Also, unittest.TestCase cannot be subclassed by an abstract class.
      I referred to tests/python/pants_test/engine/exp/
  10. All this methods are FS related, not SCM, which looks suspiosious in a base class...

    1. These methods are actually used for SCM tests as well. To test SCM, I believe we also have to create dirs and files. like in tests/python/pants_test/base/, here SCM test also inherits from a base class which has makedirs and touch methods.

  1. Thanks!

  2. Should move this comment into the help for the option.

  1. Ship It!
  1. Ship It!
Review request changed

Status: Closed (submitted)

Change Summary:

Merged as e63d4aed345e9b29081e0361d30ea6b9d8fa24c8