avoid pathological regex performance when linkifying large ivy output

Review Request #3603 — Created March 24, 2016 and submitted

kwlzn, stuhood, zundel

Our build contains an 800 MB jar. Ivy outputs 110k dots while downloading it. Pants then tries to linkify that output, and re.findall takes 10+ minutes.

I first tried making all the groups non-capturing and all the ? and + quantifiers non-greedy. That helped a bit, but not enough to solve the problem.

The negative lookahead solution in this commit feels hacky, and I'm open to other suggestions. It might be worth trying the regex or re2 libraries for non-backtracking regexes, if it's worth adding those deps to pants. I'm also not sure those would actually help, since we'd still be quadratic.
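
For readers who want to see the general idea, here is a minimal sketch of a negative-lookahead guard. The patterns below are hypothetical stand-ins, not the actual regex in linkify.py, and the three-dot threshold is an assumption chosen for illustration.

```python
import re

# Hypothetical path-like pattern (NOT the real pants regex): one component,
# then one or more "/component" parts. Components may contain dots.
_COMPONENT = r'[\w.-]+'
NAIVE_PATH_RE = re.compile(_COMPONENT + r'(?:/' + _COMPONENT + r')+')

# Same pattern, with a negative lookahead that refuses to START a match at a
# run of three or more dots. The threshold of three is an assumed heuristic.
GUARDED_PATH_RE = re.compile(
    r'(?!\.{3})' + _COMPONENT + r'(?:/' + _COMPONENT + r')+')


def find_paths(pattern, text):
    """Return every full match; the patterns contain no capturing groups."""
    return pattern.findall(text)


if __name__ == '__main__':
    sample = 'see src/python/pants/reporting/linkify.py and ../foo/bar.txt'
    print(find_paths(NAIVE_PATH_RE, sample))
    print(find_paths(GUARDED_PATH_RE, sample))
    # A long run of dots contains no path, so both patterns return an empty list.
    print(find_paths(GUARDED_PATH_RE, '.' * 3000))
```

With the unguarded pattern, each start position inside a run of dots consumes the rest of the run and then backtracks looking for a slash, which is where the quadratic cost comes from. With the guard, start positions inside the run are rejected after a constant amount of work, apart from the last two. The trade-off in this sketch is that a path beginning with three or more dots is no longer matched from its first dot, and other inputs (for example long runs that mix dots and letters) can still be slow.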


Also turned linkify.py into a standalone script to reproduce the issue on our ivy output, and verified that execution time went from awful to good.
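
A repro along these lines can be sketched as a small timing harness. This is not the script used for the review; the pattern and the `time_findall` helper are hypothetical, and `time.perf_counter` assumes Python 3.

```python
import re
import time

# Hypothetical path-like pattern, used only to reproduce the slowdown.
# It is NOT the regex from linkify.py.
PATH_RE = re.compile(r'[\w.-]+(?:/[\w.-]+)+')


def time_findall(pattern, n_dots):
    """Time pattern.findall on a run of n_dots dots; return (seconds, matches)."""
    text = '.' * n_dots
    start = time.perf_counter()
    matches = pattern.findall(text)
    return time.perf_counter() - start, matches


if __name__ == '__main__':
    # Doubling the input size makes the growth rate easy to eyeball.
    for n in (1000, 2000, 4000, 8000):
        elapsed, _ = time_findall(PATH_RE, n)
        print('%6d dots: %.4fs' % (n, elapsed))
```

If the time roughly quadruples each time the input doubles, the matching cost is quadratic in the length of the dot run.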

  1. seems more than reasonable to me for the net perf gain - from 11m19.392s (before) -> 0m0.026s (after) on my machine in the test harness for the 100k dots case.

    excellent find!

  1. Ship It!
  1. Excellent find, and excellent fix.

  2. src/python/pants/reporting/linkify.py (Diff revision 1)

    This is a 100% reasonable heuristic.

    And this is slightly hacky, but in a clever way, so I don't mind it at all.

  1. thanks Matt! this is in @ https://github.com/pantsbuild/pants/commit/b50df7c64656c2b8ed119190b27a5f49ecc15071 - please mark this RB as submitted when you get a chance.

  1. Awesome and very unexpected find. Thanks!

Review request changed

Status: Closed (submitted)