Allow failover for remote cache

Review Request #3374 — Created Jan. 25, 2016 and submitted

nhoward_tw, patricklaw, stuhood, zundel

Today pants can have multiple remote cache hosts either through config or
through the resolve process, within each task pants will first decide a
best url and stick with the same url throughout the task run.

The problem is if this url dies there is no failover even though other urls are
still be available. The cache instances dying turns out to be not that
uncommon, most recent incident happened in Twitter was due to a memory leak
that was made worse by bloated artifacts so hosts died a few times a day. When
this happens we see pants default to local compile that takes forever.

This review does not attempt to implement load balancing, it does not even try
to re-ping (because running pinger is expensive) it only tries to address the
very specific problem: individual host dying during the single pants run and
allow pants to fail over to 2nd or 3rd instance.

It does not address if all urls die but as long as there is one instance alive
now pants can eventually fail over.

  • 0
  • 0
  • 10
  • 0
  • 10
Description From Last Updated
  2. src/python/pants/cache/ (Diff revision 1)

    is Exception too general here?

    1. yea, that's on purpose, the sole purpose here is to count the metric then rethrow and let caller handle the exception

  3. tests/python/pants_test/cache/ (Diff revision 1)

    I am not too familiar with the caching mechanics, but is it possible that it will keep failing over between url1 and url2 and never fall to local compile?

    1. BestUrlSelector is only to handle failover between remote caches, we have separate tests for remote cache to local cache fail over, see TestArtifactCache.test_local_backed_remote_cache

  1. Does this scheme support being able to mix a local URL in with a remote URL in the list of caches?

    1. No, it does not, but that's no different from it was before, where the best url needs to be either http or https.

    2. I'm going to stay out of this review, but just wanted to mention that we've had a use case where we mix local caches and remote caches and want to favor the local cache. Currently we don't have this configured anymore so I don't want to add some kind of useless requirement.

  2. src/python/pants/cache/ (Diff revision 1)

    This debug log was useful. Maybe log the entire returned list before returning it?

  2. src/python/pants/cache/ (Diff revision 1)

    nit: please use snake_case, ie underscore delimiting for bestUrlSelector.

    1. this always got me, thanks, fixed

  3. src/python/pants/cache/ (Diff revision 1)

    nit: you could use a tuple instead of a list here.

  4. src/python/pants/cache/ (Diff revision 1)

    This will be executed in multiprocessing.Pool worker threads / processes. Should this have some kind of synchronization to ensure concurrent modifications don't cause problems?

    1. yea, that's why the check is greater than

            if self.unsuccessful_calls[best_url] > self.max_failures:

      the point is not to be 100% accurate but if first url dies and there are still cache calls it will eventually fail over (up to pool size * max_failures + 1 under the extreme race conditions)

      Will add a comment

    2. That doesn't prevent races, because if it were executed on multiple threads, you could end up doing multiple rotates if multiple threads ran += and the > check. If the number of threads that did that was equal to the number of urls, you could end up rotating the list back where it was.

      That said, the pool most cache checks happen in is based on subprocesses, so I think the url lists won't be shared between them anyway. On the one hand, that means it won't be a race. On the other, it means that each worker will have different url selection state.

      If you wanted it to be shared, you'd have to use something like, but I don't have much experience with those.

    3. Thanks! reworded.

      Will avoid passing Value among processes for now, instead decrease max_failures from 5 to 3, worst case for our ci that has 8 workers, will take 8*3 =24 failures for all workers to do the fail-over switch, that's 24 local compiles, compared with 800+ compiles for large target, still a win.

  5. nit: snake_case

  6. nit: snake_case

  7. nit: snake_case

  8. nit: snake_case for arguments.

    Also, I think it'd be clearer to invert the successful flag, so usages would be

    self.call_url(expected_url=some_url, with_error=True)

    1. good point, changed to call_url(self, expected_url, with_error=False)

      my only preference is to make call by default successful, that seems more conventional.

  9. You could make expected_url required, since there are no calls that don't use it.

  10. nit: capitalize, and add period to comments also s/couter/counter/.

  2. This :param: stanza is invalid.

  3. src/python/pants/cache/ (Diff revision 2)

    To re-raise correctly you must use 'raise', not 'raise e'.

    The former will preserve the original stack trace (which is what we care about), the latter will start a new stack trace at the point of re-raise, which is not very useful.

    1. fixed, thanks!

  1. LGTM, modulo an assertTrue that could be more specific.

  2. nit: use assertIn

  1. Ship It!
  1. Ship It!
Review request changed

Status: Closed (submitted)

Change Summary:

Commited as d6c25992a62bcd7d8d324ac54569a8110e507331