[engine] fix unpickling error in multiprocessing engine tests

Review Request #3751 — Created April 25, 2016 and discarded — Latest diff uploaded

3149, 3261
jsirois, kwlzn, patricklaw, stuhood

Not clear what the root cause of the UnpicklingError is, but switching
from cStringIO to StringIO seems to fix the flakiness, have run many tests
to confirm, at the cost of significant performance impact.

A few things we know about the failure:

  • Those unpickable data appear to be old space has been GC-ed [1] for sth else.
  • Failures only were seen in travis, not on personal mac.
  • Switching to use in-memory db instead of lmdb makes no difference.
  • Switching cPickle to pickle also makes no difference.
  • Switching cStringIO to StringIO haven't seen any failures.

Did some search in python bugs with no avail. Our cStringIO usage looks alright. [2]
Tried to read the cStringIO source [3], but didn't spend too much time.

The reason it hits travis my guess could be due to 1) multiprocessing has
higher memory overhead, 2) we use #cpu to allocate multiprocessing pool size,
3) travis has more cpus compared to local mac.

This does impact performance significantly, about +50% for the cold cache,
and +100% for the warmed up cache in the list test I did.

[1] https://github.com/pantsbuild/pants/issues/3149 with examples of the
hexlified and unhexlified unpickable data.
[2] https://github.com/pantsbuild/pants/blob/master/src/python/pants/engine/exp/storage.py#L169-L173
blob is what's reallocated.
[3] cStringIO https://github.com/python/cpython/blob/2.7/Modules/cStringIO.c


  • Test pickle/stringIO combinations:
    (with test has a loop for running test 5 times)

cStringIO, cPickle: job 457 to 467 all failed except for 1
cStringIO, pickle: job 469 to 479 all failed
StringIO, cPickle: job 480 to 490 all passed
StringIO, pickle: job 491 to 501 all passed

  • This patch vs master:

with this patch: job 502 to 512, 22/22 (restarted once)
without this patch: job 513 to 523, 3/22 failed