Remove Duplicates in File System tasks in v2 Engine

Review Request #4096 — Created July 19, 2016 and submitted

kwlzn, stuhood

Currently, file system tasks in the v2 Engine (such as glob matching) keep duplicate paths.
This can cause errors. For example, if two glob patterns defined in the "sources" field of a target both match the same Scala file, that file is passed to the compiler twice, triggering errors like "X is already defined as class X".
In fact, we have no use case that requires keeping duplicates, but we do have use cases like the above where removing them saves a lot of trouble. In addition, the v1 engine does not produce duplicate paths, so de-duping will make the transition from v1 to v2 smoother.

In this review:
1. Enhanced the logic in merge_paths to remove duplicates while preserving the order of paths.
2. Added a test case.
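The change described above can be sketched as follows (a hypothetical illustration, not the actual pants source: the function name merge_paths comes from the review, but the signature and the plain-iterable inputs are assumptions):

```python
def merge_paths(paths_list):
    """Merge several path collections, dropping duplicates while
    preserving first-seen order.

    paths_list: an iterable of iterables of path strings (assumed shape).
    """
    seen = set()
    merged = []
    for paths in paths_list:
        for p in paths:
            if p not in seen:      # membership check against a set is O(1)
                seen.add(p)
                merged.append(p)   # list append preserves encounter order
    return tuple(merged)
```

For example, merging two glob results that both matched Foo.scala yields it only once, in the order it was first seen.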

ci green:

  1. lgtm! one quick thought on simplifying the de-duping implementation.

  2. src/python/pants/engine/ (Diff revision 1)

    any reason to not just use t.c.collections.OrderedSet here instead of the for loop?

    >>> from twitter.common.collections import OrderedSet
    >>> l = ['p', 'a', 'n', 't', 's', 'b', 'u', 'i', 'l', 'd', 'p', 'a', 'n', 't', 's']
    >>> tuple(OrderedSet(l))
    ('p', 'a', 'n', 't', 's', 'b', 'u', 'i', 'l', 'd')
    1. You mean like:
      Paths(tuple(OrderedSet(p for paths in paths_list for p in paths.dependencies)))?

      I don't have a reason for not using OrderedSet, it simply didn't come to my mind when I made the change, but it does look neat.

      I remember vaguely that we want to move away from t.c stuff eventually, but I am happy to make the change now.


    2. yup, exactly. I think there is a general desire to move away from t.c as a dep, but in most of the cases I've seen that usually just means copying the same code from t.c directly into pants' own codebase.

      in this case, OrderedSet is pretty heavily used so I think you're good to go:

      [illuminati pants (master)]$ ag "from twitter.common.collections.* import .*OrderedSet.*" | wc -l
    3. Generally I would recommend OrderedSet, but I could go either way in this case (which is likely to be very hot). A manual dedupe implementation can avoid some extra shuffling without any loss of clarity (since the method is already isolated).

    4. what extra shuffling are you referring to? my guess would be that the impl as-is is likely slower than OrderedSet would be (due to list append operations), especially as the input size grows.

      should be pretty easy to benchmark.

    5. OrderedSet keeps a dict of tuples in addition to a custom doubly linked list. I expect that Python's list (which is both native and array-backed) is faster than the doubly linked list, and the extra tuples are not needed for an append-only case.

      Probably worth a quick benchmark, but I don't feel strongly either way.
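Such a benchmark might look like the sketch below. It compares the manual set-plus-list dedupe against a minimal linked-list OrderedSet written here as a stand-in (modeled on the classic recipe that linked-list ordered sets are based on), so it runs without the twitter.common dependency; absolute timings will of course vary by machine and input.

```python
import timeit

def dedupe_manual(items):
    """Manual dedupe: set for membership, list for order."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return tuple(out)

class LinkedOrderedSet:
    """Minimal stand-in for a linked-list-backed OrderedSet."""
    def __init__(self, iterable=()):
        self.map = {}            # key -> [key, prev, next]
        end = self.end = []
        end += [None, end, end]  # circular sentinel node
        for x in iterable:
            self.add(x)
    def add(self, key):
        if key not in self.map:
            end = self.end
            last = end[1]
            last[2] = end[1] = self.map[key] = [key, last, end]
    def __iter__(self):
        end = self.end
        curr = end[2]
        while curr is not end:
            yield curr[0]
            curr = curr[2]

if __name__ == "__main__":
    # Synthetic input with many duplicates, loosely mimicking overlapping globs.
    data = ["src/file%d.scala" % (i % 1000) for i in range(50000)]
    t_manual = timeit.timeit(lambda: dedupe_manual(data), number=20)
    t_linked = timeit.timeit(lambda: tuple(LinkedOrderedSet(data)), number=20)
    print("manual: %.3fs  linked ordered set: %.3fs" % (t_manual, t_linked))
```

Both paths produce the same deduplicated tuple; the question the benchmark answers is only how much the linked-list node bookkeeping costs relative to a plain list append.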

    6. I did some standalone profiling; the raw dedupe implementation is about 10x faster than OrderedSet. But I will use OrderedSet anyway, since most reviewers suggested it.

    7. if it's 10x faster, then I'd say by all means use that instead.

  1. Ship It!
  1. lgtm!
    +1 on using OrderedSet (more for clearer intent)

  1. thanks Yujie! this is in master @ 454dc92602b789dd7819388bb9c844f11b28408c

    please mark this RB as submitted when you get a chance.

Review request changed

Status: Closed (submitted)

Change Summary:

thanks, Kris, Stu and Ity for reviewing!