[pantsd] Repair watchman startup flakiness.

Review Request #3644 — Created April 3, 2016 and submitted

peiyu, stuhood
  • Bump Watchman socket timeout from 1 second to 5 seconds to resolve occasional SocketTimeout on the initial Watchman.watch_project.
  • Remove the prior retry strategy for Watchman.watch_project in favor of higher timeout, making a single 5s SocketTimeout fatal.
  • Add pants.util.retry.retry_on_exception and utilize for re-running the inital Watchman launch command to mitigate a socket race between watchman client and server on startup.
  • Improve ProcessManager.get_subprocess_output to capture stderr and pass it with the raised ExecutionError exception for better logging.
  • Move watchman startup logging from info to debug now that it's launched in the tool runtime path vs post-fork daemon path.
  • Bump pywatchman dep to latest.

All of this combined leads to a 100% success rate in launching Watchman on my machine over the course of ~20+ back-to-back start/stop test runs.

CI is green @ https://travis-ci.org/pantsbuild/pants/builds/121002400

+ tested the race fix under a failure case repro and verified the retry strategy resolves it:

DEBUG] acquiring lock: <pants.process.lock.OwnerPrintingInterProcessFileLock object at 0x1167ac4d0>
DEBUG] launching pantsd
DEBUG] Selected watchman binary bootstrapped to: /Users/kwilson/.cache/pants/bin/watchman/mac/10.11/4.5.0/watchman
DEBUG] launching watchman
DEBUG] watchman cmd is: /Users/kwilson/.cache/pants/bin/watchman/mac/10.11/4.5.0/watchman get-pid --no-save-state --statefile=/Users/kwilson/dev/pants/.pants.d/watchman/watchman.state --sockname=/Users/kwilson/dev/pants/.pants.d/watchman/watchman.sock --logfile=/Users/kwilson/dev/pants/.pants.d/watchman/watchman.log --log-level 2
DEBUG] ensuring creation of directory: /Users/kwilson/dev/pants/.pants.d/watchman
DEBUG] encountered exception on retry #0: ExecutionError(message="Command '[u'/Users/kwilson/.cache/pants/bin/watchman/mac/10.11/4.5.0/watchman', u'get-pid', u'--no-save-state', u'--statefile=/Users/kwilson/dev/pants/.pants.d/watchman/watchman.state', u'--sockname=/Users/kwilson/dev/pants/.pants.d/watchman/watchman.sock', u'--logfile=/Users/kwilson/dev/pants/.pants.d/watchman/watchman.log', u'--log-level', '2']' returned non-zero exit status 1", output='2016-04-02T17:26:45,555: [cli] unable to talk to your watchman on /Users/kwilson/dev/pants/.pants.d/watchman/watchman.sock! (Permission denied)')
DEBUG] watchman is running, pid=1459 socket=/Users/kwilson/dev/pants/.pants.d/watchman/watchman.sock
DEBUG] purging metadata directory: /Users/kwilson/dev/pants/.pids/pantsd
DEBUG] released lock: <pants.process.lock.OwnerPrintingInterProcessFileLock object at 0x1167ac4d0>
DEBUG] pantsd is running at pid 1502
Cache stats: hits=0, misses=1118, total=1118

while monitoring watchman processes in a simultaneous window.

  2. src/python/pants/pantsd/watchman.py (Diff revision 1)

    worth making this configurable?

    1. hmm, yeah - good idea. fixed.

  3. src/python/pants/pantsd/watchman.py (Diff revision 1)

    Do we now explicitly kill watchman on clean-all?

    1. yup - it's killed with pantsd when optioned on.

  4. src/python/pants/pantsd/watchman.py (Diff revision 1)

    Worth making this configurable?

    1. don't believe so here - in my testing, the first retry seems to always get the job done.

  5. src/python/pants/util/retry.py (Diff revision 1)

    Exponential backoff generally makes a better default.

    1. sure, tho seems like exponential makes more sense when you're backing off of a resource for load purposes. afaict, tho it's possible we could encounter system load related issues here - the case of the current usage of this guard seems to be more of a factor of time alone - and since we can't readily distinguish beteen a race-spurn failure and a bad/broken watchman I was trying to avoid sleeping any more than necessary (e.g. exponentially) in the bad/broken watchman case.

      I improved the general usage case for retry_on_exception to accept a backoff lambda/function and default to exponential backoff tho.

  2. src/python/pants/util/retry.py (Diff revision 2)

    exponential backoff is 2^n, not nn, maybe you mean 2*n?

    1. rb formatting: exponential back off is 2^n, not n*n, maybe you mean 2**n?

    2. argh, yeah great catch. I think I'm going to just make the default no backoff - I can't imagine the exponential growth as a default being practical in the pants runtime with a 27s sleep on the 3rd retry.


Review request changed

Status: Closed (submitted)

Change Summary:

thanks Stu & Peiyu! submitted @ 0bff3f9c5b4ceb057c8fc0d0b119e3e431a9af1d