Subprocess clean-all

Review Request #4011 — Created June 16, 2016 and submitted

echuba
pants
3590
pants-reviews
benjyw, kwlzn, stuhood, wisechengyi, zundel

Move current .pants.d contents to a temporary directory and subprocess its removal after task execution

https://travis-ci.org/ebubae/pants/builds/140860387

  • 0
  • 0
  • 13
  • 1
  • 14
Description From Last Updated
  1. Thanks for this, but see my comments below.

  2. src/python/pants/core_tasks/clean.py (Diff revision 1)
     
     

    We used to do this in a separate clean-all-async task, but got rid of it. I don't totally remember why, but I think it was because in typical cases builds are disk-bound, so the background deletion would slow down the next build anyway.

    So I'm not sure about this. I think it would be better to wait for the daemon, which can do this more naturally.

    1. in the case of ./pants clean-all there would be no other task, so waiting on the daemon (async task to remove .pants.d) wouldn’t speed up anything?

    2. a bit more background info: the motivation is to unblock pants from going into other tasks (such as compile which is more CPU intensive) and to not have to wait on a large .pants.d to be deleted during an upgrade for example.

  3. src/python/pants/core_tasks/clean.py (Diff revision 1)
     
     

    This is very scary. A future bug that causes tmpdir to point to the wrong thing might wreak havoc on the filesystem.

    Better to fork the process and run safe_rmtree(self.get_options().pants_workdir) in one-liner.

    But again, I'm not sure how I feel about this change, and might recommend waiting for the daemon.

  4. 
      
  1. echo on "fork the process and run safe_rmtree(self.get_options().pants_workdir)"

    1. can you elaborate on this

    2. As @benjyw pointed out, rm -rf could be scary, and safe_rmtree does a few additional checks, so it would be better to fork another process then call safe_rmtree. Some examples can be found http://www.petercollingridge.co.uk/blog/running-multiple-processes-python, except there is no need to wait for the child process to complete in this case.

    3. I really don't think os.fork is what we want here. In the example you gave above, fork() was used to split the program to two identical processes, and the child function would be completed by program execution. This takes away the advantage we get from using subprocess. Additionally, I don't see any extra checks safe_rmtree actually does other than surpressing errors from output, which isn't actually a check.

    4. safe_rmtree calls shutil.rmtree that checks for symlinks and what not.

      like subprocess, fork generates another process but with more context, and it is generally cleaner and easier to use library code than calling rm -rf directly.

      The technicality about using os.fork should be resolved in offline discussion.

    5. The issue is also that safe_rmtree(self.get_options().pants_workdir) is a one liner that is more obviously correct. Shelling out to rm -rf with an argument that was set several lines earlier has many more ways to go wrong.

  2. src/python/pants/core_tasks/clean.py (Diff revision 3)
     
     
     
     

    would be good to set default to false

  3. src/python/pants/core_tasks/clean.py (Diff revision 3)
     
     

    nit comment formatting:

    'Although ... subprocess.'

    same for other comments.

  4. 
      
  1. Stopping by to say I'd like to see some tests for this. tests/python/pants_test/tasks/test_clean_all_integration.py would be a good place for them to live.

    1. tests added on last commit

  2. 
      
  1. 
      
  2. src/python/pants/core_tasks/clean.py (Diff revision 5)
     
     

    for large pants workdir.

  3. Do you mean to put test_clean_all_async on the same indentation as test_clean_all_on_wrong_dir?

  4. Is `os.waitpid()` meant for waiting on the forked process to finish? Might be worth a comment as it may not be obvious.

    Also might be good to assert ".pants.d" is not there after clean-all.

  5. src/python/pants/core_tasks/clean.py (Diff revision 6)
     
     

    nit: dedent for the function.

  6. src/python/pants/core_tasks/clean.py (Diff revision 6)
     
     

    nit: 'Temporary directory created at {}'.format(tmpdir) should do since there's only one argument here.

  7. src/python/pants/core_tasks/clean.py (Diff revision 6)
     
     

    staled comment?

  8. 
      
  1. Thanks for the test.

    Another thought: does the child process need to give up the file lock set in RoundEngine's attempt method in order to be async?

  2. src/python/pants/core_tasks/clean.py (Diff revision 6)
     
     
     
     
     
     
     
     

    I think this portion should happen in the parent before the child is forked.

    If something in the parent process attempts to use the workdir after the fork but before it is moved it could cause weird behavior.

    For example, some users run pants in the following way:

    ./pants clean-all compile my-target:bin
    

    With the current implementation, the tasks after clean-all would be racing the child process for the workdir.

  3. dedent please.

    1. Missed the last update.

  4. 
      
  1. 
      
  2. src/python/pants/core_tasks/clean.py (Diff revisions 8 - 9)
     
     
     
     
     
     
     
     
     
     
     
     
     

    does it mean this chunk of code will be executed by both parent and child process? I was thinking pid=os.fork() should be moved to right above if pid == 0:

    1. Yes, I was just about to leave the same comment.

  3. 
      
  1. 
      
  2. src/python/pants/core_tasks/clean.py (Diff revision 11)
     
     

    would it not be cleaner and safer to anchor this tmpdir either inside a freshly created pants workdir itself, or alongside it (e.g. $WORKDIR -> $WORKDIR.purge)?

    this would:

    1) ~guarantee that the mv/rename operation always happened on the same filesystem. in the event that the default system tmpdir (/tmp etc) is on another filesystem, this might avoid an expensive synchronous copy prior to removal.

    2) ensure that in the event that the fork'd dir removal failed, that subsequent clean-alls could re-attempt removal and potentially expose an error vs possibly orphaning a giant (in our case, ~10-30GB) workdir in $TMP.

    so maybe the ideal flow would be something along the lines of these pseudo-commands:

    1) mkdir -p $WORKDIR.new/trash
    2) mv $WORKDIR $WORKDIR.new/trash/
    3) mv $WORKDIR.new $WORKDIR

    with the clean-all removal target always being $WORKDIR/trash iff that exists. having a static trash dir location would work better with automated daemon-based removal as well.

  3. src/python/pants/core_tasks/clean.py (Diff revision 11)
     
     
     
     
     

    this seems a bit perilous imho, primarily due to a complete lack of error handling/failure notification in the child process once the fork happens.

    I'm guessing any errors here could likely end up with a raw traceback on a fork-held copy of the parents stderr, which could happen after the parent pants process had already returned. this failure mode also has the potential to orphan a partially dealt-with copy of the workdir in a tmp dir - or worse, a wedged subprocess that never made it to the os._exit(0) line.

    so, at a minimum this should probably get a dedicated try/except block as a safeguard - or possibly a more holistic regrouping on the os.fork approach.

  4. 
      
  1. 
      
  2. src/python/pants/core_tasks/clean.py (Diff revisions 11 - 12)
     
     
     
     
     
     
     
     
     
     
     

    this still doesn't seem to address the concern that the default tmp location as used by temporary_dir() may be on a separate filesystem, which could synchronously incur an expensive shutil.move operation during the first safe_concurrent_rename().

    note that temporary_dir() has a kwarg root_dir that would allow you to create a temp dir directly in a given dir:

    :param string root_dir: The parent directory to create the temporary directory.
    
  3. src/python/pants/core_tasks/clean.py (Diff revisions 11 - 12)
     
     
     
     
     
     

    the os._exit(0) here should probably go in a finally block to ensure it's execution in the face of an exception. it'd also be a good idea to safeguard against both OSError and IOError:

    try:
      safe_rmtree(...)
    except (IOError, OSError):
      ...
    finally:
      os._exit(0)
    

    + possibly a mocked test that simulates this failure case to ensure appropriate handling?

  4. 
      
  1. looking good - handful of additional comments.

  2. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     
     

    nit: this docstring change probably isn't necessary as the fundamental nature of Clean isn't changing - but if you choose to leave it, you should drop a space inbetween the summary + body and make the body a complete sentence:

    """Summary sentence.
    
    Body sentence one. Body Sentence two. etc."""
    
  3. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     
     

    nit: implicit string concatenation won't add spaces for you between combined strings - this currently reads:

    '... Can dramatically speed up clean-allfor large pants workdir.'

    would add a space after the clean-all bit on the top string and pluralize workdir -> workdirs.

  4. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     

    this mkdir probably isn't necessary? the subsequent mkdir(tmpdir/trash) should suffice since this dir will just be moved and rm'd anway.

  5. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     

    this tmpdir should be hidden (prefixed with .pants_cleanall etc) to avoid showing up in git status.

    I think you'll need to plumb a prefix kwarg in temporary_dir()->tempfile.mkdtemp() to achieve this.

  6. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     

    should this be logger.debug vs info?

    and should the log msg provide more context about what the temp dir is being used for?

  7. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     

    redundant? fairly sure temporary_dir should already create this directory for you.

  8. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     

    nit: https://www.python.org/dev/peps/pep-0008/#comments

  9. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     

    at the very least, this fork() should probably result in logging indicating that 1) it forked a new process that is doing something in the background 2) the forked processes pid in case anything goes wrong:

    if pid == 0:
      ...
    else:
      logger.info('Forked an asynchronous clean-all worker at pid: {}'.format(pid))
    
  10. src/python/pants/core_tasks/clean.py (Diff revision 13)
     
     

    unclear if this added advertisement is ideal for the default case?

    if we truly want to push users toward this, we should probably just option it on by default once proven vs added nagging that most users will outright ignore.

    1. i don't know. discussion on the Slack led to the compromise that users should know if they have async processes running

  11. 
      
  1. 
      
  2. src/python/pants/core_tasks/clean.py (Diff revision 14)
     
     

    Please update this comment, since the tmp dir is no longer deleted in the subprocess, instead it's <pants-work-dir>/trash.

  3. src/python/pants/core_tasks/clean.py (Diff revision 14)
     
     

    How would you feel about using <build_root>/.pants_cleanall as the root_dir and dropping the prefix?

    I think with this current implementation, if a clean-all were interrupted it would leave a .pants_cleanall<uniquechars> directory.

    1. sounds good. I'll still leave the prefix edit in util though

  4. src/python/pants/core_tasks/clean.py (Diff revision 14)
     
     

    This isn't necessary as safe_concurrent_rename deletes the destination.

  5. src/python/pants/core_tasks/clean.py (Diff revision 14)
     
     

    Update this comment please.

  6. src/python/pants/core_tasks/clean.py (Diff revision 14)
     
     
     

    safe_rmtree silences errors, so this shouldn't be necessary.

    1. you might actually consider using rm_rf vs safe_rmtree here, which would surface vs swallow any errors.

    2. based on previous impl, it seems like we want to suppress errors (which I'm guessing would mainly come from symblinks).

  7. Perhaps this should also check that the .pants.d directory is gone / empty.

    Since you are modifying how clean-all works without async. It would be good to also have a test that checks that for it as well.

  8. 
      
  1. 
      
  2. src/python/pants/core_tasks/clean.py (Diff revision 15)
     
     

    I think this should delete tmpdir (pants/.pants_cleanall/<unique tmp dir>) instead of pants_trash (pants/.pants_cleanall/), because if there are competing processes running clean-all, it could potentially create or delete pants_trash at the same time, whereas tmpdir is unique to each process.

  3. src/python/pants/util/contextutil.py (Diff revision 15)
     
     

    seems prefix isn't used.

    1. not used anymore, but could be used in the future?

    2. in that case, you can do:

      path = tempfile.mkdtemp(dir=root_dir, suffix=suffix, prefix= prefix or '')

      >>> '12355' or ''
      '12355'
      >>> 
      >>> None or ''
      ''
      
    3. or
      def temporary_dir(root_dir=None, cleanup=True, suffix=str(), permissions=None, prefix=''):

    4. sorry about the spam. it's better to conform to the existing style

      def temporary_dir(root_dir=None, cleanup=True, suffix=str(), permissions=None, prefix=str())
      then there's no need to check prefix being None or not.

    5. seems to me like it'd be most ideal to simply match the underlying behavior of tempfile.mkdtemp vs clobbering its default prefix?

      def temporary_dir(..., prefix=tempfile.template)
        ...
        tempfile.mkdtemp(..., prefix=prefix)
        ...
      
  4. 
      
  1. Ship It!
  2. 
      
  1. Looks good. I've got a few remaining comments, then ship it!

    1. Keep em coming #berigorous

  2. src/python/pants/core_tasks/clean.py (Diff revision 17)
     
     

    Last comment update I promise. :)

    The last bit here isn't accurate after the last batch of changes since the trash directory now lives in .pants_cleanall

  3. src/python/pants/core_tasks/clean.py (Diff revision 17)
     
     

    nit: add period.

  4. tests/python/pants_test/tasks/test_clean_all_integration.py (Diff revision 17)
     
     
     
     
     
     
     
     
     
     
     
     

    I think you need to use run_pants_with_workdir for these, otherwise the work_dir will be ignored.

    Also, it'd be good to have a

    self.assertFalse(os.path.exists(work_dir))
    

    in there too.

  5. 
      
  1. 
      
  2. tests/python/pants_test/tasks/test_clean_all_integration.py (Diff revision 18)
     
     
     
     
     
     
     
     
     
     
     
     

    shouldn't the success condition for both tests here be that trash_dir is completely removed (e.g. self.assertFalse(os.path.exists(trash_dir))) vs its dir listing being empty (e.g. self.assertTrue(os.listdir(trash_dir) == []))?

    I'd actually expect the assertion here to throw OSError post clean-all:

    >>> os.listdir('non-existent-dir')
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    OSError: [Errno 2] No such file or directory: 'non-existent-dir'
    

    but the fact that this is passing seems to indicate that the new trashdir itself is left behind now in both cases?

    IMO, both clean-all paths should result in complete removal of the dir itself.

    1. See Yi's comment about possible concurrent clean-all processes running:

      I think this should delete tmpdir (pants/.pants_cleanall/<unique tmp dir>) instead of pants_trash (pants/.pants_cleanall/), because if there are competing processes running clean-all, it could potentially create or delete pants_trash at the same time, whereas tmpdir is unique to each process.

    2. To safeguard from this event, .pants_cleanall is left as an empty directory (thus the tests make sense)

    3. I don't think this approach will work. failures to cleanup just the temp dir itself could potentially orphan gigabytes of data in a temp dir nested inside the trash dir with no subsequent handle/pointer for removal (esp now that this has moved outside of the workdir).

      Thus, the top level trash dir itself inherently must be cleaned to account for such a situation - and to generally account for a future case of daemon-based async clean-all which wouldn't have a handle to the tmpdir name.

    4. @Kris, if I understand correctly, instead of the static .pants_cleanall dir, do the following:
      1. create a tempdir prefixed with .pants_cleanall_ in the repo, e.g. <Pants repo>/.pants_cleanall_<tmpdir>
      2. move .pants.d into <Pants repo>/.pants_cleanall_<tmpdir>
      3. then remove <Pants repo>/.pants_cleanall_<tmpdir>

    5. Not sure if this addresses the issue @Kris brought up about orphaning data. Would the daemon check all .pants_cleanall_XX files?

    6. @Yi huh? I don't think I suggested any of that in my comment - and that approach could potentially make things even worse by nesting temp dirs in the build root without a top level container dir that has the property of "is always removed".

      the crux of the issue with the current implementation that I can see is that we create a net-new "trash dir" and then only remove temporary directories that were created in there within the same run. subsequent runs have zero handle/context to the tempdir name that was just created (and don't enumerate them).. and since nothing ever removes the top-level "trash dir" (like we always do with the workdir - which this is no longer part of), arbitrary failures (a control-c, etc) no longer get cleaned up in subsequent runs like they naturally should be.

      so IMO, it's not much a "trash dir" if it never gets holistically emptied, but rather now a mutation of a secondary workdir only used for temporary copying and removal - which I'm firmly against.

      so while the concern of "competing processes running clean-all" is certainly valid, avoiding removal of the top level trash dir seems like the wrong way to solve it (and besides, this case has been/is already possible today without --async). this could just be a matter of concurrent-safe rm handling (tho maybe it is already safe now?) or potentially guarding the critical part of clean-all with an interprocess lockfile.

    7. latest revision deletes all tmpdirs

    8. another issue to keep in mind here is that there's no firm association between the buildroot and workdir, we just happen to use the default mode where the workdir is nested inside the buildroot. in this RB we're creating static dirs alongside the workdir that may live completely outside of the repo itself - and those would be largely invisible to a casual observer and most likely a surprise to find that they exist.

      for example, in the Scoot scenario the buildroot could be ~/dev/pants but the workdir could live outside of that using options in something like ~/.pants_repo_workdir. then, when the code in this RB creates the "trash dir" we're now polluting the users home directory with ~/.pants_cleanall which will hold trash for a mysteriously unlinked workdir and remain there forever. in the case of a second repo that uses a workdir like ~/.source_repo_workdir, things are now colliding based on the static use of dirname(workdir) as a fixed trash dir vs temporary path.

      thus overall, I think the approach of creating this fixed sidecar trash dir is probably not the most hygienic from a disk footprint perspective in addition to the above concern - this should always only ever be temporary at best. IMO, it'd be great to always self-contain this inside the pants workdir under a "trash" folder like it was in prior revs of this RB which would inherently provide the guarantee of "a top level container that is always removed" - and solve both issues.

      something crudely along the lines of:

      1) TEMP_DIRNAME = dirname($WORKDIR)
      2) TEMP_NAME = temporary_dir(prefix='.pants_cleanall', root_dir=$TEMP_DIRNAME)
      3) mkdir $TEMP_NAME/trash
      4) mv $WORKDIR $TEMP_NAME/trash/
      5) mv $TEMP_NAME $WORKDIR
      6) rm -rf $WORKDIR/trash

  3. 
      
  1. 
      
  2. src/python/pants/core_tasks/clean.py (Diff revisions 18 - 19)
     
     
     

    with the "colliding competing clean-all processes" concern in-mind, how is this any better than just a straight up safe_rmtree(pants_trash)?

  3. 
      
  1. lgtm w/ a few nits.

  2. src/python/pants/core_tasks/clean.py (Diff revisions 19 - 20)
     
     
     

    the safe_mkdir() here seems unnecessary in its current form given that the subsequent safe_concurrent_rename() will stomp directly on the newly created dir.

    to illustrate:

    >>> safe_mkdir('workdir')
    >>> os.system('touch workdir/A_FILE')
    >>> safe_mkdir('trash')
    >>> os.system('touch trash/STALE')
    >>> os.system('find .')
    ./trash
    ./trash/STALE
    ./workdir
    ./workdir/A_FILE
    >>> safe_concurrent_rename('workdir', 'trash')
    >>> os.system('find .')
    ./trash
    ./trash/A_FILE
    
  3. src/python/pants/core_tasks/clean.py (Diff revisions 19 - 20)
     
     
     
     
     
     
     
     
     
     
     
     
     

    how about instead of the 2x os.path.join(pants_wd, 'trash') here only doing this once above (just after defining pants_wd maybe?) and then reusing the variable in the subsequent safe_rmtree() calls?

  4. tests/python/pants_test/tasks/test_clean_all_integration.py (Diff revisions 19 - 20)
     
     
     
     
     
     
     
     

    you should probably wrap the calls to self.run_pants_with_workdir here with an self.assert_success() to ensure the pants runs completed successfully.

  5. tests/python/pants_test/tasks/test_clean_all_integration.py (Diff revisions 19 - 20)
     
     
     
     
     
     
     
     

    the convention is that if a symbol is prefixed with a _, that it's private to the class/module/etc - hence rather than using the private os._exists here both tests here should use the public os.path.exists.

  6. tests/python/pants_test/tasks/test_clean_all_integration.py (Diff revisions 19 - 20)
     
     
     
     
     
     

    it would probably make sense testing-wise here to populate the empty, temporary_workdir()-spawned work_dir with at least one sentinel file as part of setup and then ensure that said file was successfully removed as an added check. this seems more true to the nature of the behavior you'd want to verify for a clean-all - not to mention that proving correctness with the movement of an empty workdir in the current impl seems perilous.

    as-is, if the pants run completely failed/no-oped this test would still pass because the trash dir would never get created in the first place.

  7. 
      
Review request changed

Status: Closed (submitted)

Change Summary:

Merged as c06ac5dde91583fb5bb7ae036496a1d58ffaf8a7

Loading...