Implemented isolated code-generation strategy for simple_codegen_task.

Review Request #2322 — Created June 4, 2015 and submitted

gmalmquist
pants
gmalmquist/isolated-codegen-strategy
1609
ca61360...
pants-reviews
jsirois, patricklaw, zundel

Added a --strategy flag to simple codegen wich allows for code generation with either 'isolated' or 'global' strategy.

The 'isolated' strategy generates code per-target, in a (stable) unique directory for that target. The 'global' strategy does it the same way codegen has worked in the past.

The 'isolated' strategy additionally takes advantage of the isolation to find what sources a target generates via file system inspection, which is much more reliable than trying to predict what sources will be generated in advance.

Test-cases were added to test simple_code_gen (test_simple_codegen_task.py), and integration tests for
protobuf and wire gen were updated to accomodate the modified file paths
due to the isolation strategy.

Travis CI is green.

  • 0
  • 0
  • 3
  • 0
  • 3
Description From Last Updated
GM
JS
  1. 
      
  2. This should probably be a classmethod that you query in register_options so you can omit the option altogether.
    1. Then users get a message telling them the flag doesn't exist if they try to specify the --strategy for a Task that forcible overrides it. Which makes sense, but might confuse users by making them assume they just typoed it or something. I think it might be better to keep it as a valid option, but print a warning if it's ignored?

    2. Well, that's the case today for users who add random unsupported flags to the CLI!  I think you're coming from the perspective of knowing the code, and knowing codegen can possibly have boh these options.  If I'm just a user, I get my flag info from `-h` and so I'll have no reason to assume --strategy is a valid flag.  ... but this brings up the good point that the flag shouldn't be normally visible anyway!  It should be advanced, the sort of thing you'd set repo wide.
    3. @gmalmquist: To implement the behavior John is suggesting, just add advanced=True to the call to register --strategy

    4. Okay, I see your point. I will change it.

  3. I don't understand the self._codegen_strategy.name != strategy - I'm not seeing the case where you actually want a prior returned self.codegen_strategy to change - it seems like this should be a property whose value never changes. If that's right, just s/@property/@memoized_property/ and simply return self._codegen_strategy_for_name(strategy)

    1. Yeah, I can't actually think of a situation where we'd ever want the strategy to change mid-generation.

      I didn't know memoized_property existed, neat! I will use it.

  4. You could just apply @memoized_method here, but Patrick may be gutting that.
  5. A long winded question really, may be an issue:

    This assumes that 2 different targets (owning 2 different source files) cannot make the error of producing the same gen files. I don't think this is a good assumption, even for protoc. You could very easily - esp in a coarse-grained maven multi-module style project, have a file in project1/src/main/protobuf/com/squareup/base.proto and project2/src/main/protobuf/com/squareup/base.proto that produce the same output file path. With option java_package and other options, the relationship from source to gen could be even more obscure, but also overlapping in output path names. The correct answer for compilations of both in the same pants run may be to fail and call this illegal use of protobuf, it may be to isolate these 2 from each other and let things be. The current solution addresses neither though and assumes sanity on the BUILD/protobug author with a silent error if sanity does not prevail.

    Clearly if the 2 targets generating the same filename don't depend on each other via imports, all is well and you can pick either strategy - allow it, or fail fast. If those 2 targets do depend on each other though, in the current scheme the 1st target whose sources are found will silently win.

    And now for the reality of the situation, which does not bode well:

    $ cat a.proto 
    package a;
    
    option java_outer_classname = 'Foo';
    
    message Person {}
    
    $ cat b.proto 
    package a;
    
    option java_outer_classname = 'Foo';
    
    message Cookie {}
    
    $ rm -rf /tmp/protobuf_java && mkdir /tmp/protobuf_java && protoc --java_out=/tmp/protobuf_java a.proto b.proto 
    a/Foo.java: Tried to write the same file twice.
    
    
    
    
    
    $ cat a.proto 
    import "b.proto";
    
    package a;
    
    option java_outer_classname = 'Foo';
    
    message Person {
      optional Cookie cookie = 1;
    }
    
    $ cat b.proto 
    package a;
    
    option java_outer_classname = 'Foo';
    
    message Cookie {}
    
    $ rm -rf /tmp/protobuf_java && mkdir /tmp/protobuf_java && protoc --java_out=/tmp/protobuf_java a.proto
    $ rm -rf /tmp/protobuf_java2 && mkdir /tmp/protobuf_java2 && protoc --java_out=/tmp/protobuf_java2 a.proto
    $ diff -r /tmp/protobuf_java /tmp/protobuf_java2 | head
    diff -r /tmp/protobuf_java/a/Foo.java /tmp/protobuf_java2/a/Foo.java
    2c2
    < // source: a.proto
    ---
    > // source: b.proto
    11,12c11,12
    <   public interface PersonOrBuilder extends
    <       // @@protoc_insertion_point(interface_extends:a.Person)
    ---
    >   public interface CookieOrBuilder extends
    

    So - no solutions, just problems pointed out from me in this round.

    1. I am concerned that the use cases we have where we want to use two different .proto files that create the same message are going to remain broken by this logic. Our case where we re-define the same .proto files are for testing:

      a.proto

      message Foo {
        ...
      }
      

      test-a.proto

      message Foo {
        ... something intentially overridden for test ...
      }
      

      Since the target that contains test-a.proto may depend on another target that includes a.proto, we aren't going to get the overriding behavior we want - we're going to always get the Foo defined in a.proto.

    2. What is the "correct" behavior that we expect to happen in that case?

      Assuming we have:

      java_protobuf_library(name='foo',
        sources=['stuff.proto', 'test-a.proto',]
        dependencies=[':lib',]
      )
      
      java_protobuf_library(name='lib',
        sources=['a.proto', 'b.proto', 'c.proto',]
      )
      

      Here, if we try to compile :foo, do we expect test-a.proto be compiled after a.proto, and thus overwrite the code generated from a.proto?

      The current code in protofub_gen.py appears to add all .proto sources from a target and its transitive dependencies in post-order traversal ordering to the protoc invocation, which would result in something like:

      protoc a.proto b.proto c.proto stuff.proto test-a.proto
      

      I'm not sure whether that produces the desired behavior or not. Even if it does though, that still does leave the problem of a.proto, not test-a.proto, claiming the generated .java file for the purposes of find_sources, which presumably would cause problems down the line. I'm not sure how to distinguish between sources generated by dependencies, and sources generated by the actual target, given how we're currently invoking protoc.

    3. My experiment above shows that if you pass 2 files into protoc at once that produce the same output file, protoc errors.
      
      It seems to me, the only correct way to use protoc is to pass it all the files in the transitive dep graph of a root target.  Per the experiment, it will helpfully error if there is an ambiguity.  If there is not an ambiguity, you're left with the problem of assigning outputs to targets in that trans dep graph.  How about, gen the root passing all trans dep sources and assign all found generated sources to it, then walk the graph down to the leaves doing said-same, but subtracting generated sources from "roots" already visited until you hit the leaves.
    4. In maven, you would run code generation into separate diretories and then compile them each, then let the ordering of files on the classpath sort things out. I

    5. You were cut off there, but hopefully we do not need to resort to classpath-order wins like maven.
    6. I was going to say, I don't know if we want to rely on that. Maybe we could use the logic like Garrett has by default and provide some other mechanism to 'force' an override of a generated class? That is kind of icky, but helps call out the horrible nature of this practice while still giving it a way to work.

    7. Protoc is intended to be invoked with just a subset of the .proto files you want. It has the concept of a --proto-path for files that contain imports that you don't necessarily want to compile into output classes. The problem we are talking about isn't with the .proto sources, its the generated output. In my example above, test-a.proto won't import a.proto, so the proto compiler shouldn't have a problem with it. But if there is a dependency relationship between the targets, the code above will prefer the output from the target that compiled a.proto, which is the opposite of the behavior I wanted.

    8. Is my proposal (passing all transitive proto sources for each target (solving dups through error, which is correct in a trans dep graph) and walking from root to leaves, deriving generated sources through a subtractive process) feasible?  AFAICT it avoids all ambiguities and generates correct generated source<->target associations.
      
      Solving the protoc case aside for a moment, this may be highlighting that the auto-source association bit is not an appropriate part of the isolated strategy, or at least should be pluggable.  Maybe SimpleCodeGen should ask for a globale strategy and also ask for an isolated startegy from subclasses.  If it gets instances of both, it props up the option, if not, it does not.  Either way, the subclass can pass in a custom IsolatedStrategy that implements its own source collection startegy as needed.  The default, or a canned one at least could be what Garret has here.
    9. Solving the problem of dups is the only reason we're working on this... If we didn't have dups, the global strategy would be working for us.

    10. Aha - I did not know (or remember that).  That's... unfortunate.  Is your use case sane or just an organic thing you're trying to deal with without re-writing a bunch of protos / code?
    11. Letting subclasses also override the implementation of the strategy does seem reasonable.

      Your proposal involving global compilation and transitive sources actually seems to be how the code as I have it now is working. It may make sense in terms of erroring on conflicts, but it doesn't entirely solve the problems in our repo with proto collisions.

    12. One use case which seems legit is to test some code that wants to test what happens when a proto is changed in an incompatible way talking to an older version of a server (an RPC that uses the proto should fail with a particular error case)

  6. 
      
ZU
  1. 
      
  2. Garrett and I talked about this a bit before posting publicly. This seems a bit unusual, I thought maybe it would be better to follow a pattern like we use for product_types in Task where we could return the list of strategies supported by this task.

    But then he has a special use case for this in testing...

    1. I think it makes more sense to simply force a particular strategy rather than erroring out on invalid options; the latter possiblity seems like it would needlessly complicate things for users.

      However, I think printing a warning if the user tries to specify a strategy for a task that forces a strategy anyway it would be good.

  3. Its pretty neat that wire_gen just "worked" when switching to the isolated strategy

  4. The original file_protoc_blocks() looks like the gutted remains of a more meaningful function from somewhere else. It only split on empty lines which never happens in pants output, and if it did, it has nothing to do with the number of protoc blocks.

    Now we split when there is an empty line or it matches r'Executing: .*?\bprotoc' which seems much more reasonable. Seems like the logic could be made more straightforward?

    1. Yeah the original test would never actually split the output into more than one block, so I had to update it. I can probably clean up the logic, I didn't spend too much time on it.

  5. regular expressions are usually raw strings r'...'

    1. I didn't find it necessary since I didn't use any escapes, but I can change it over for consistency.

  6. 
      
GM
ZU
  1. 
      
  2. you just overwrote sources, maybe the line above the if stmt should go in an else: branch?

    1. Yeah that's sensible.

  3. This makes sense to me as a human, but it isn't the right syntax for machines to intepret it. It is supposed to be:

    :return: the codegen strategy object
    :rtype: SimpleCodegenTask.CodegenStrategy
    
    1. Weird; I assumed it would be consistent with the :param: syntax.

  4. Can you provide some context here? Why are you sorting the targets like this? This looks like an attempt to walk the targets in bfs order, but it mixes the height of different targets together, so I'm not sure if it might have a problem.

    1. I'm just doing a topsort of the targets, to make sure targets appear before their dependees. This becomes important if we're generating targets separately rather than all together, because targets may need their dependencies to be generated before they are.

      I defined the 'height' of a target as its distance from a root dependency, so python's built-in sorted() function can use it to order the targets appropriately.

      If there's already something in the pants codebase to top-sort a subset of the dependency graph, let me know and I'll use that instead.

    2. ... we found sort_targets() in build_graph.py

    3. So, I'm finding my code to run 1.7 to 2.0 times faster than the sort_targets in build_graph. But I can defer to the older (more tested) implementation if people prefer.

    4. Always the right answer: use the slow API, circle back and replace the slow impl.  Everybody wins, no proliferation of impls.
    5. Or replace it 1st in a separate RB if you want your API usage to land at full speed.
    6. Sounds good. Looking more carefully, I think my implementation is actually asymtotically slower anyway, it just happens to be faster for small sets of targets.

  5. 
      
JS
  1. 
      
  2. This needs a paired BUILD target dep edit.
  3. I'd be happy enough if you added a TODO with issue link that described how this strategy can fail and track a fully robust impl or - as it seems like we need to support a Square use case, a way to toggle strategies to allow for strict no dup by default with a fallback to support the Square style case.
    1. The --strategy=global already enforces no duplicates; should there also be one for the isolated strategy?

    2. Well, you've got a generic piece of code here that's very attractive to use when implementing a new codegen task since you need not calculate target<->gen mappings, it does it for you.  But there is this big gotcha thats not documented except sort-of in a source code comment and even then the comment does not acknowledge the silent dup issue directly.
      
      So there are 2 big issues identified:
      1. This code implements a policy acceptable to Square, but not an obviously universalizable policy: its very reasonable to want dups in an dep graph to error
      2. The behavior of the algorithm for associating generated files even without dups in play is really part of the API and an adopter needs to know about it when making their isolated/global or support both choice.
      
      2. says you at least need to document the isolated strategy behavior in a python doc comment.
      1. I can live with for now, but really should be addressed / issue tracked.
    3. Ah okay. I'll add more docs and a flag then.

    4. To your point of being able to choose different policies toward duplicates, we could deal with this in a post processing step. There is already the detect-duplicates task - maybe we could wire it up so that it ran in the compile step for folks that want to detect duplicate symbols on the classpath before the pakcaging steps.

    5. I can just run through the sources in simple_codegen_task to check it; it's not a difficult check to implement.

  4. This assert and the 3 below all have a funny line continuation indent.  Its not 2 spaces or 4 or 8 which would cover out 2 space standard, python's 4, or double of each.  It also does not drop-align with the open-paren which is common.
    1. Oops, I think that was a side-effect of a variable rename.

  5. 
      
JS
  1. 
      
  2. This is another sign that subclasses should return a startegy instance (or 2 if they support toggling).  In that structure this method could be eliminated from the base class and only be present as an abstract method in the global strategy for SimpleCodeGen subclasses to implement.
    1. I'll think a bit on how to re-structure things.

  3. 
      
GM
GM
ZU
  1. 
      
  2. Same comment about the name shadowing as in jaxb_gen()

  3. Aside: I'm trying to think of a simpler way to implement what is currently supported_strategy_types, forced_codegen_strategy and codegen_strategy and friends, but I think we can rework later if we have to. I'm happy enough with the interface to pants end users (aside from the options changes I mentioned above) and it's not the critical part of this change.

    1. Yeah... pants developers who want to subclass SimpleCodegenTask fortunately will only have to worry about supported_strategy_types. The exact behavior of the other two methods is a little involved but most users shouldn't have to care.

  4. nit: """ should be on a line by itself.

  5. This line cleans up a problem we've been having when users switch branches. (stale generated classes get left behind and picked up by IntelliJ)

    It would be nice to add a test case for it. If you don't want to do it in this PR, could you add a TODO()? We could make this an intern task.

    1. Sure. You mention 'stale generated classes' -- this will just clean up stale generated .java files. The isolated compile strategy for javac would have to do the .class cleaning up.

  6. This is fine for now, but I have a feeling we might want to turn off these warnings in our repo, at least for some types of codegen.

    1. Yeah, we'll see how obnoxious it is. Currently even sake/rpc/protos-test doesn't trigger it though.

  7. nit: remove extra n/l?

  8. Maybe give this a new name so it doesn't shadow SimpleCodegenTask. I think it is confusing due to the way you referenced them using the cls variable in supported_codegen_strategy_types above.

    1. Yeah that's true.

  9. Same comments as the other option below.

    --allow-empty instead of --allow-empty and consider making this a boolean option.

    1. I tried doing that a few days ago and it didn't work for some reason, so I assumed hyphens in options were illegal due to how the option scoping works. But I just tried it again now and it worked fine, so I guess I was just doing something silly before.

    2. I also didn't know boolean options existed, I'll look into that.

  10. 1) convention is to put a '-' character between words, so this would be --allow-dups
    2) Is there a reason not to make this a boolean option? this allows conveniences like allowing --no-allow-dups to turn off the option.

  11. nit: you could shorten this to:

    return { strategy_type.name : strategy_type for strategy_type in cls.supported_strategy_types() }
    
    1. Huh, I didn't know dictionary comprehensions existed. Cool.

  12. I would prefer we raise a custom exception type here. Its good for testing and easier to go find in the code.

  13. Does wire need to support global codegen strategy? There shouldn't be any operational difference, should there? Maybe a TODO() to get rid of GlobalCodegenStrategy once we are sure we are happy with Isolated.

    1. I can just remove it if there's no reason to keep it around. It's only like 4 lines of code to put it back if we change our minds.

  14. 
      
GM
ZU
  1. 
      
  2. Simplify this to something like: "Skip targets with no sources defined"

  3. Drop the "whether". "Allow multiple targets..."

  4. this comment should be before the 'by default..." sentence.

  5. nit: could be @abstractmethod

  6. 
      
JS
  1. 
      
  2. Typically you'd do:
    ```python
    register('--flag', default=True, action='store_true', ...)
    ```
    The type is then automatically set by the action.
    
    This allows `--flag` to specify `True` and `--no-flag` to specify `False`.  Folks using pants don't write `--flag=False` on the CLI as a rule.
  3. kill extra blank line
  4. It would be great to circle back and add tests for this.  A TODO added now would be fine.
  5. 
      
GM
JS
  1. LGTM.  I'll let Eric have the patch-in honors.
  2. 
      
GM
ZU
  1. Submitted to master @ commit a42b8a5. Please close this and the PRs and mark as submitted.

  2. 
      
GM
Review request changed

Status: Closed (submitted)

Loading...