Introduce ignore_patterns option

Review Request #3414 — Created Feb. 3, 2016 and submitted

tabishev
pants
tabishev/pants_build_ignore_option
2880
pants-reviews
dturner-tw, patricklaw, stuhood, zundel

Original motivation:
npm install command creates a lot of BUILD.md files which pants fail to interpret as pants BUILD files.

Other problems:
Currently there is no good way to prevent some directories / BUILD files from parsing. spec_path stands only for spec prefixes, exclude_target_regexp for targets.
Ability to exclude some directories not only based on prefixes can be useful in more cases than original motivation: with it we can exclude node_modules and bower_components directories from scanning which can be good for performance. Also there are /target/ exclusion in Square's code which is also target for this change.

Solution:
Introduce ignore_patterns option which behave like .gitignore but for pants BUILD files.
Dependency on pathspec library (sources: https://github.com/cpburnz/python-path-specification) was added to make .gitignore-like filtration. It doesn't contain any C/C++ extensions.

Performance impact:
I've checked ignore_patterns on big twitter repository trying to execute "scanning heavy" commands such as filter and dependees. There is no noticeable impact on scan_build_files performance (about 15%, but in general it's only about 2%) and some noticeable impact on get_build_files_family performance (about x2.5, but in general it's only about 7%).

But executions of get_build_files_family can be much faster as their result most likely was calculated already in scan_build_files. I will publish follow-up PR with this change.
Upd: I've implemented the fix and it doesn't seems to make situation better. Also it doesn't work well with spec_excludes code. So I've decided to return to this question after all deprecated code will be removed and with profiler.

5-6 diff contains spec_excludes deprecation and deprecation of ability to filter broken BUILD files using exclude_target_regexp.

https://travis-ci.org/ttim/pants/builds/107142688

Also I've tested --spec-excludes after deprecation and checked exclude_target_regexp for ignoring BUILD files.

  • 0
  • 0
  • 12
  • 2
  • 14
Description From Last Updated
TA
TA
PA
  1. 
      
    1. Also, can we do better than "hoping" for the licensing of an outside library? I'm pretty sure since we aren't distributing it as either a source or a binary, there's no big deal either way. But I'd like to be sure.

    2. My hope based on actually reading it, but I'm afraid to be sure in this kind of questions.

  2. pants_build_ignore = pathspec.PathSpec.from_lines(GitIgnorePattern, pants_build_ignore_paths or [])

    1. Looks much more better, thanks!

  3. It's terrifying that this isn't using kwargs. But not your change, so up to you.

  4. I'd pull the rhs into an ancestors variable for readability.

  5. src/python/pants/base/build_file.py (Diff revision 1)
     
     

    Never use + for string concatenation, prefer '/{}'.format(...)

    1. Changed, I've noticed that there is no + in pants code. What is a reason?

    2. It clarifies that you're doing string formatting rather than some other operation. It also makes certain kinds of unicode interpolation errors easier to read and understand when they occur.

  6. src/python/pants/base/build_file.py (Diff revision 1)
     
     

    Ditto +.

    [] is unnecessary, this can be a generator comp.

  7. src/python/pants/base/build_file.py (Diff revision 1)
     
     
     
     

    This is sufficiently tricky that it deserves a comment explaining the purpose of the trailing slash.

  8. src/python/pants/base/build_file.py (Diff revision 1)
     
     

    [] unnecessary

  9. src/python/pants/bin/goal_runner.py (Diff revision 1)
     
     

    I'm starting to think that this should just live on a subsystem rather than having to be threaded everywhere.

    1. I disagree with it. It's appearing in three places currently: as constructor argument to BuildFileAddressMapper which seems logical as BuildFileAddressMapper should be the only place which cares about it, in BuildFileTestBase because it's testing this logic from BuildFile and in ide_gen.py because this task currently trying to deal with actual filesystem representation of BUILD files (I hope this will be removed later). Also there is some mentions about it in other tests because I need to pass it as argument to BuildFileAddressMapper which is initialized in BaseTest. I think it's small count of usages for extracting it to separate subsystem. And basically it's already 'like' this - there is only one 'main' place - BuildFileAddressMapper.

    2. But in practice, there is exactly one value for this per-repo. And for tests, subsystem values can be easily overridden per-test. It's one thing for there to be only a few places that it's used, but we also have to look at the cost of all of the places it's threaded through, and redefaulted. The subsystem could just expose a single method that does all of this for you. I think it would clean up this code quite a bit.

    3. We want to use pants_build_ignore in BuildFileAddressMapper, can we?
      About "subsystem values can be easily overridden per-test" - pants_build_ignore values can affect BuildGraph for example, is it handled by values overriding for subsystems? Should we store pants_workdir & buildroot in subsystems for example?

      There were a lot of spec_excludess before, it's right, but it's not true for pants_build_ignore.

    4. Alright, I'm not going to push the point further. It's not that big of a deal, and you're right that spec_excludes didn't used to be a subsystem, so this has the benefit of changing less stuff.

  10. Ditto defaulting to []

  11. Similar to or exactly? Ideally link to the lib's definition of the actual spec.

  12. tests/python/pants_test/base/build_file_test_base.py (Diff revision 1)
     
     
     
     
     

    ditto defaulting

  13. 
      
ST
  1. 
      
  2. The import pathspec line is redundant: just from pathspec.gitignore ... should be enough.

    1. It was used later for pathspec.PathSpec, which is not good, fixed.

  3. My preference would be for deprecating this in this change, rather than in another. Idea would be to defend against finding a usecase later that we have to refactor this feature to support.

    1. I've already made something like this locally before, so shouldn't be a problem.

    2. Finally I've decided to move deprecation here. In 5-6 diff.

  4. Having pants_ in the variable name is a bit redundant. build_ignore_patterns ? or file_ignore_patterns?

    1. Renamed to build_ignore_patterns for variables and left pants_build_ignore for option.

  5. .* seems like a strange default, unless I'm misunderstanding how the ignore syntax works. Could you expand the help to include some examples?

    1. .* means all files and directories started from .. Should I expand default meaning in hint? Also I can point to gitignore doc from git site.

    2. Updated hint, could you verify it? Thanks!

  6. 
      
TA
TA
TA
PA
  1. 
      
  2. src/python/pants/base/build_file.py (Diff revisions 1 - 2)
     
     

    The extra parens are also unnecessary

  3. src/python/pants/base/build_file.py (Diff revisions 1 - 2)
     
     

    but here the extra parens are necessary, since there is another argument to the function.

  4. 
      
TA
BE
  1. This seems like a sledgehammer to crack a nut. How about just special-casing .md suffixes and ignoring them when gathering BUILD files? We can revisit a more complex solution if more such examples pop up, but frankly I'd be fine with a hard-coded list of special-cases, because it will suffice in pretty much all situations.

    1. Currently there is no good way to prevent some directories / BUILD files from parsing. spec_path stands only for spec prefixes, exclude_target_regexp for targets. We already have code which trying to prevent places to be parsed based on exclude_target_regexp, but this code not fixing dependees task for example...

      Ability to exclude some directories not only based on prefixes can be useful in more cases than that: with it we can exclude node_modules and bower_components from scanning which can be good for performance. Also I've seen /target/ exclusion in Square's code which is also target for this change.

      Also worth mentioning that BUILD.md is not the only example. The same happens, for example, with BUILD.orig files while git merge.

    2. Ah yes, I can see the utility for the other cases (not scanning into bower_components etc.) We already have 2-3 places in the code base where we hard-code some "don't scan these dirs for performance reasons" logic. My point was that this is overkill just to ignore certain suffixes, which is how you introduced this change. The performance issue is orthogonal.

      The Pants "feature" of supporting BUILD.anything is not that important, and is usually more trouble than its worth. So to solve the problem of recognizing I'd favor simply whitelisting some allowed suffixes (literally just ".tools" today, I think), and treating all other files BUILD.something files as "not BUILD files".

      But as I said, I can justify this change for the other feature of ignoring directories for performance reasons.

      That said, I would really like to see these all unified into a single concept. Do we really need this confusing patchwork of spec_excludes, exclude_target_regexp and this? "directories", "specs" and "targets" are overlapping concepts, especially once you introduce wildcarding. I would greatly prefer to see a path to merging them into a single way of saying "pretend these files and directories aren't there", and then applying that universally.

    3. I've changed description, thanks for pointing this out.

      Current plan is to leave two concepts for ignoring: this one to ignore actual BUILD files and another to ignore targets (which is possible only after actual BUILD files was read). And I'm going to deprecate spec_excludes in favor of pants_build_ignore.

    4. That's great! So now the question is, do we still need to ignore targets? exclude_target_regexp's help string is "Exclude targets that match these regexes. Useful with ::, to ignore broken BUILD files." So it's really about ignoring BUILD files, not individual targets within BUILD files.

    5. I'm going to remove this part from hint for exclude_target_regexp in RB with spec_excludes deprecation =)

      Is it still needed (way to ignore targets)? I don't know, honestly. I've seen net.sf.grinder.grinder and :aux- patterns from Eric in Slack, so let's leave them for now.

    6. Then let's add a TODO to look into deprecating exclude_target_regexp.

    7. These concepts are still separate, and need to stay separate. The build_ignore is all about not parsing BUILD files, and not descending into certain directory structures to parse those BUILD files--both for performance and correctness. build_ignore is notably a repo global setting.

      exclude_target_regexp can be overridden on a per task basis using options in order to have that task ignore (otherwise perfectly valid) targets. The canonical example here is when you have some expensive targets that you don't want to test during a test-changed. You can set

      [test-changed]
      exclude_target_regexp: ['src/my/expensive/targets::']
      

      We currently use this internally. While there's some confusion and overlap with build_ignore, I think once we get rid of spec_excludes that the overlap will disappear, and it will be clear that the two have different purposes. Hopefully it will also be clearly documented when and how to use each one!

  2. 
      
TA
TA
TA
BE
  1. 
      
  2. How about, instead, a --pants-ignore option that points to a file with ignore patterns in it, and defaults to .pantsignore.

    My thinking is:

    A) The analogy to .gitignore is strong. It's basically the exact same paradigm: "Hey git/pants, pretend these files aren't there."

    B) One can put comments in that file to explain the various patterns.

    1. I will prefer to stick with current implementation, at least for now. I've asked about it on Slack and here is main reasons: 1) we don't want to create one more file just for couple of strings, 2) you can point to file from pants.ini, 3) better explorability, mainly for a default - .*. Let's leave list option and see how people will use it.

    2. How about, instead, a --pants-ignore option that points to a file with ignore patterns in it, and defaults to .pantsignore.

      Note that we have @file support already... I'd prefer not to require that this be a file (what would the default value be if no file exists? how does it play into help? etc).

  3. 
      
TA
TA
ST
  1. 
      
  2. You need to pass fromfile=True in order to enable the @file parsing.

  3. s/to to/to/

  4. The default value (when rendered by ./pants help, for example) changes based on what is specified in pants.ini. So perhaps explaining the default isn't as useful as just linking to the docs.

  5. Might be good to indicate that unlike .gitignore, this is a list of entries rather than a single string.

  6. 
      
BE
  1. Looks OK to me, but (and not to bikeshed) can we call the flag something like --ignore-paths? Because there are other places in the code base that have "don't descend into these dirs" logic, and they're looking for source files or other things, not BUILD files. So I think it's a clearer paradigm if we give this a name that implies "pretend that paths (files or dirs) matching these patterns don't exist", which is precisely analogous to .gitignore.

    1. In other words, this is bigger and more useful than just "Ignore these paths when reading BUILD files", so I'd like its name to reflect that.

    2. Ignoring files generically throughout the codebase is probably a good idea, but updating the globs implementation to do that here would be out of scope. Once this is in and exists, we can later give it another name and apply it in more places... but right now, it would commit "someone" to implementing that in order for the naming not to awkward.

      Can we punt on that?

    3. Hmmm. Giving it another name later would require a deprecation cycle etc. I'd rather get the right name now, even if its implied functionality is not applied as aggressively yet (note that the other places would mostly use this for performance reasons, so the semantics don't change, and therefore nor would the name need to). I certainly don't think anything further needs to be implemented just yet.

    4. From my point of view it's ok to give --ignore-patterns (?) name with help pointing to BUILD files only and make at some moment breaking change which will make --ignore-patterns working with globs/whatever too. If it's ok to have any breaking changes after 1.0, otherwise better to deprecate.

    5. Great, I mean there will be no breaking change as far as I can tell. Just some stuff happening behind the scenes.

  2. 
      
TA
TA
TA
TA
TA
TA
Review request changed

Status: Closed (submitted)

Change Summary:

6f4b8511bb181e7dbd15726084f5f0897490d96c

BE
  1. Ship It!
  2. 
      
BE
  1. This is marked as submitted, but I'm not seeing it in master?

  2. 
      
BE
  1. This is marked as submitted, but I'm not seeing it in master?

    1. Looks like it's in https://github.com/pantsbuild/pants/commit/6f4b8511bb181e7dbd15726084f5f0897490d96c.

    2. My bad, my IDE wasn't refreshing.

  2. 
      
Loading...