[engine] Isolated Process Execution - First Cut

Review Request #4029 — Created June 28, 2016 and submitted

Bugs: 2978, 3609
Reviewers: benjyw, jsirois, kwlzn, peiyu, stuhood

This review is the first cut of isolated process execution in the new engine: the beginnings of snapshotting, isolated workspace preparation, and execution. It's not a finished product, but it's a point from which we can start iterating on a process execution model for the engine.

Process Execution:

This first cut is more complicated than it needs to be, but I intend to refine it and make it more transparent to the engine.

How it works:

You define a process execution rule using the SnapshottedProcess rule type. It may be that this can be decomposed into its components, with the user-facing pieces broken up in a more fine-grained way, but again, this is a first cut.

Here's an example of what that looks like from one of the tests.

                     input_selectors=(Select(Files), SelectLiteral(JavaOutputDir('build'), JavaOutputDir)),

The first parameter is the product type produced by running the process. This maps exactly to the product type parameter for the existing task concept.

The input_* parameters communicate which products are needed to run the process and how to translate them into a SnapshottedProcessRequest; together they essentially form an embedded task rule. The output_conversion is called with the result of running the process and the Checkout, which allows it to construct the product.
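To make the shape of the rule type concrete, here is a hypothetical, self-contained sketch of a full SnapshottedProcess declaration modeled on the excerpt above. The field names, the stub selector/product types, and the conversion functions are illustrative assumptions, not the engine's actual definitions:

```python
from collections import namedtuple

# Stand-in for the engine's SnapshottedProcess rule type (hypothetical fields).
SnapshottedProcess = namedtuple('SnapshottedProcess', [
    'product_type',       # the product type produced by running the process
    'input_selectors',    # which products are needed to run
    'input_conversion',   # inputs -> SnapshottedProcessRequest
    'output_conversion',  # (result, checkout) -> product
])

# Illustrative stubs standing in for engine selectors and products.
Select = namedtuple('Select', ['product'])
SelectLiteral = namedtuple('SelectLiteral', ['subject', 'product'])
Files = namedtuple('Files', ['paths'])
JavaOutputDir = namedtuple('JavaOutputDir', ['path'])
ClasspathEntry = namedtuple('ClasspathEntry', ['path'])

def java_sources_to_javac_request(*inputs):
  """Would construct a SnapshottedProcessRequest from the selected inputs."""

def process_result_to_classpath_entry(result, checkout):
  """Would construct the ClasspathEntry from the result and the Checkout."""

rule = SnapshottedProcess(
    product_type=ClasspathEntry,
    input_selectors=(Select(Files),
                     SelectLiteral(JavaOutputDir('build'), JavaOutputDir)),
    input_conversion=java_sources_to_javac_request,
    output_conversion=process_result_to_classpath_entry,
)
```

Here the conversion functions are empty; in the engine they would carry the request-construction and product-construction logic described above.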

Internally, when a SnapshottedProcess task is selected, the scheduler creates a ProcessOrchestrationNode, providing a place where the before and after steps of execution can be managed.

The orchestration node
- Creates a task node for constructing the process request.
- Creates a checkout directory and dumps snapshots into it based on the contents of the process request.
- Calls the prep_fn on the request with the checkout. I want to get rid of this and come up with a better scheme for preparing execution for things like build output directories, but for this review I punted with this escape hatch.
- Once the checkout is prepared, actually executes the process using a ProcessExecutionNode.
- Finally, converts the output into its final form.
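The steps above can be sketched as a standalone flow. This is a minimal illustration using tempfile and subprocess, not the engine's actual node code; the function and parameter names are made up:

```python
import subprocess
import sys
import tempfile

def run_snapshotted_process(argv, prepare_checkout, convert_output):
  """Sketch of the orchestration steps: create a checkout directory,
  prepare it, execute the process inside it, then convert the output."""
  checkout = tempfile.mkdtemp(prefix='checkout-')  # cleanup is punted, as above
  prepare_checkout(checkout)  # e.g. dump snapshots, create build output dirs
  result = subprocess.run(argv, cwd=checkout, capture_output=True, text=True)
  return convert_output(result, checkout)

# Example: run a trivial process and convert its stdout into a "product".
product = run_snapshotted_process(
    [sys.executable, '-c', "print('compiled')"],
    prepare_checkout=lambda checkout: None,
    convert_output=lambda result, checkout: result.stdout.strip())
```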

Future steps
- I'm not happy with naming, but for this review, I'm focusing more on a rough structure and I'm planning to iterate on it.
- Cleaning up checkouts. I left cleaning up temp directories as a future step.
- Configuration. Need to store those tar files somewhere. For now, I just create temp directories.
- Documentation. I think this needs to wait on naming and maturing the user facing representation.
- Integration with the planner example.
- Nailgun.

I've got a number of high level tests I've been using to drive development in test_isolated_process.

I've attached a PR with CI running.

  1. Thanks Nick!

    Just some initial high level comments.

  2. src/python/pants/engine/isolated_process.py (Diff revision 1)

    One critical bit that is missed here (and which probably led to the supposed need to add "UncacheableTaskNode"), is that a snapshot needs to encode a unique id, which will usually be its fingerprint.

    A Snapshot object must include its fingerprint/identity, because that's what is allowing us to avoid keeping the entire object in memory, and allowing us to hit the cache for a task, even if it only inputs/outputs Snapshots.

    1. ...and actually, the Snapshot probably shouldn't have a filename/archive as part of its identity, as that abstraction leaks to the consumers of the Snapshot.

      I would expect that the Node implementations that consume snapshots would know how to convert from the snapshot id into a location on disk/etc.

    2. Makes sense. For now, I'm going to just hash all of the files together to generate a fingerprint. I think it could be better to use FileDigest, but I don't want to get into the product merging problem just yet and I feel like a proper solution would end up there. If you've got an idea for getting around that, I'd be interested though.

    3. Makes sense. Thanks!

  3. src/python/pants/engine/isolated_process.py (Diff revision 1)

    So, it's clear that SnapshotNode needs to be a Node/intrinsic, and it's clear that ProcessExecutionNode needs to be intrinsic.

    But it's not clear that ProcessOrchestrationNode/OpenCheckoutNode/ApplyCheckoutNode need to be Nodes/intrinsics. Why are they not all a single pipelined step in ProcessExecutionNode?

I put it together that way because my thought was that sandbox creation needs to have a happens-before relationship with executing inside it. I didn't want ProcessExecutionNode to be responsible for creating and populating the sandbox, partly to allow for pipelining separate process execution setups.

      That might be a bit premature, but I found it helpful as part of feeling my way around how nodes interact with each other. I was having a lot of difficulty modeling how stateful node interactions might work. I got that sorted, but it's probably not necessary at this point.

      I'll just slurp all of the operations up into ProcessOrchestrationNode and rename it as ProcessExecutionNode. We could break it up later if necessary.

    2. If we can avoid making Nodes stateful, that would be ideal. The idea of treating them as coroutines (with state on the stack) eventually might be worth doing to remove redundant setup/teardown, but I'd rather not be forced into it if the need for state is more of a convenience.

  4. src/python/pants/engine/scheduler.py (Diff revision 1)

    It seems like Rule is only useful to NodeBuilder, so they should likely be together.

    But I'm not sure I see the necessity of Rule yet... given that they are always 1 to 1 with some Node type, it seems like they could probably just be converted into a standardized classmethod factory factory function on Node.

    But the addition of Rule seems to have grown out of the addition of a bunch of different Node types, and I'd like to see whether it is possible to avoid doing that instead.

To allow users to define process executions, we either need a new rule/task type, or we need to require that a process execution definition be broken up into at least a pair of tasks.

      I went with the former, which leads to a strawman UI like

                           (Select(Files), SelectLiteral(JavaOutputDir('build'), JavaOutputDir)),

      If we were to do the latter, it would look something like

      (ClasspathEntry, (Select(JavacProcessExecutionResult)), process_result_to_classpath_entry),
      (JavacProcessRequest, (Select(Files), SelectLiteral(JavaOutputDir('build'), JavaOutputDir)), java_sources_to_javac_process_request)

The intrinsic lookup would match subclasses of ProcessExecutionResult, then the node would select for the paired ProcessExecutionRequest subtype, which would find the appropriate task, and we'd be off to the races. In this scheme, the binary would be defined on the request subclass, I think. I'd prefer not to go this route, but I could.

      What I'm going to try doing is removing the Rule abstraction by making NodeBuilder an index of curried constructors for TaskNode, or the special-cased process execution nodes. Right now, that's essentially what Rules are and as such they don't pull much weight. I still think that having a type for task/rule definitions is worthwhile, but this may be the wrong place for it.

    2. Gotcha: that helps to explain Rule. But I would wonder what it does to composability to fuse the definitions of those operations. For example, what if rather than composing a single ClasspathEntry, the first thing I do after running a process is merge a bunch of Snapshots, and then operate on that?

  1. Thanks Nick: looks great from a fundamentals standpoint. Would love to see this mixed into the "real world" tests/python/pants_test/engine/examples sandbox a bit more deeply to make sure that usage seems sane.

  2. src/python/pants/engine/isolated_process.py (Diff revision 3)

    There is a concurrency issue here: the computation of the Snapshot ID needs to be fused with the creation of the archive, so that they don't observe different values.

    Naively, creating the tarball and then fingerprinting that would be concurrency safe. But it would mean that timestamps/etc in the tarball would be included in the fingerprint, which we don't want.

    If the tar implementation supports adding files from streams, you could both fingerprint and stream the content of each file as you go?

    1. I think I'll just create the snapshot fingerprint after creating the tar file. That way the snapshot is fingerprinted based on what made it in. We could potentially also create Snapshot objects from fingerprinting loose files for doing cache lookup in the future. I think that would be fairly safe, but there might be some things I haven't thought through yet. Punting on that for another review.

  3. I expect that FileDigest is more of a temporary workaround for integration with the existing "Target fingerprinting" mechanism.

Instead (if possible), it would be preferable to:
1) make creation of Snapshots as cheap as possible, and
2) change LegacyBuildGraph's construction of EagerFilesetWithSpec to use the unique id of a Snapshot.

    In particular, I don't think the sha of an individual file is the right primitive. Killing individual FileDigests and using the Snapshot id of the entire glob would help to reduce our total Node count as well.

  4. src/python/pants/engine/isolated_process.py (Diff revision 3)

    I think that from a user's perspective, this is the entire representation of the snapshot... it's more than just an ID, and probably deserves a bit more documentation.

  5. Please add a TODO or ticket: it feels like this will eventually need to be integrated with BinaryUtil. I'm not sure that we'll want to support running binaries from PATH (certainly not initially?).

  6. src/python/pants/engine/isolated_process.py (Diff revision 3)

    Would be great to avoid adding this if we can... could we just start with "list of relative output directories to create"? Seems like it would be sufficient for now.

    1. Sure. I'll replace with just a list of output directories. That's more specific.

      Another thought I've had here is that we could have both input snapshot subjects and output snapshot subjects, which would then be used to gather the results and allow them to be distributed and reused. I feel like there could be some trickiness around things like project tree though. If the output snapshot subjects were pathglobs / files, we could potentially precreate the directories leading up to them prior to execution.

  7. src/python/pants/engine/isolated_process.py (Diff revision 3)

Let's not do any casting. Would prefer to explode here.

    1. Cool. I'll add some explicit explosions here.

  8. src/python/pants/engine/isolated_process.py (Diff revision 3)

    IIRC, scoot's API indicates that stdout/stderr go in well-defined output locations, so that they can just be part of a Snapshot (or split out of the result snapshot), which is important for remoting of large executions.

    This is probably fine for now, but would be good to be ready for that change.

    1. I considered that, but I wasn't sure where I'd want to put the files. I don't think it would be too hard to change as the code is right now, so I'm punting on that.

I wonder whether supporting capture of a failing process is a good idea, or whether exit_code != 0 should just result in a Throw from ProcessNode? From looking around the codebase, all non-foreground/frontend processes currently fail for a non-zero exit code (foreground tasks will not use this API).

    1. Hm. My thinking here was that the conversion code could raise on failure or not as it was warranted. But, I think it would probably be better to have an explicit failure callback that defaults to something like Throw('exit code <non-zero> for <binary>') on non-zero values.
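The failure-callback idea in the reply could look roughly like this; the names, the exception type, and the default message are hypothetical stand-ins:

```python
class ProcessFailure(Exception):
  pass

def default_on_failure(exit_code, binary):
  # Stands in for the suggested Throw('exit code <non-zero> for <binary>').
  raise ProcessFailure('exit code {} for {}'.format(exit_code, binary))

def check_exit(exit_code, binary, on_failure=default_on_failure):
  """Invoke the failure callback on any non-zero exit code. Note `!= 0`
  rather than `> 0`: return codes can be negative (e.g. killed by signal)."""
  if exit_code != 0:
    on_failure(exit_code, binary)
  return exit_code
```

Conversion code that wants to tolerate failure would pass its own callback instead of taking the raising default.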

  10. Generally variants need to propagate downward through the graph, so the creator of a ProcessExecutionNode should be passing some in if they exist. There is a TODO below where you request a TaskNode: the variants should be passed there too.

  11. src/python/pants/engine/isolated_process.py (Diff revision 3)

    I think that all of these dependencies should be flattened and returned as a single Waiting dependencies list. If any of them are not available, the Node isn't ready to run.

    As this stands, it's possible that you'll bounce into and out of the ProcessNode 3 times, each time computing a few more things to Wait for. Since you know up front what all the initial conditions are, you should wait for them eagerly.

    1. True. I considered doing that initially, but I was still sorting out what I wanted to request and doing it early would have made things more complex. I'll do that now.

  12. src/python/pants/engine/isolated_process.py (Diff revision 3)

    I don't think this should explicitly create a SelectDependencies node... instead, could probably just return a list of the individual Selects.

    1. I could be convinced otherwise, but I think the ordering that SelectDependencies maintains might be needed here.

    2. See the implementation of SelectDependencies itself: the reason it can provide that guarantee is that it assumes it can iterate over its inputs deterministically... you can do the same thing here.

  13. src/python/pants/engine/isolated_process.py (Diff revision 3)

    If this is supposed to be user pluggable, then it should likely be using select_node.

Which begs the question: if you know which task you're executing (and don't intend for it to be pluggable via installation of user tasks), then why not just call the function directly?

    1. I could inline it too. I didn't because the behavior I wanted was exactly the same as what a task node does. I could change the special case in the NodeBuilder so that it adds an additional task for the conversion, or inline the select handling.

      I'll inline the select handling since that feels less fragile.

  14. src/python/pants/engine/isolated_process.py (Diff revision 3)

    This is redundant I think: datatype already results in this repr.

    (TaskNode has a repr because function objects don't str well by default)

    1. Ah. That makes sense.

  15. src/python/pants/engine/scheduler.py (Diff revision 3)

    Doesn't extend rule. (Rule is gone I think?)

    1. Missed that one. Thanks.

  2. Probably worth disabling tar compression here.

  3. Please note that this will fail for GitProjectTree... switching to project_tree.content + TarFile.addfile with a buffer (that you'd also fingerprint) might get two birds with one stone.

    A TODO might also be ok.

Hm. Adding a TODO. I think we'd need the ability to stat files in the project tree for things like timestamps. It'd also be good if project trees could produce io-like objects for files so that we wouldn't have to read the full file into memory.

    2. We explicitly do not want timestamps, afaik.

  4. This is likely something for the StepContext... probably worth doing before landing this.

    1. Minimally added to step context.

  5. newline

  6. Doesn't need a wrapper class probably, as it is now private to ProcessExecutionNode.

  7. src/python/pants/engine/isolated_process.py (Diff revision 4)

Can just extend this context manager down to the "TODO: clean up the checkout" below, and drop the Checkout type.

    1. I extended the context manager, but left cleanup false. Leaving the tmp dirs in place will allow for punting on what to do with output files for the time being.

  8. Probably != 0

Right. I forgot that Python sometimes sees the code as signed.

  9. Since this uses filesystem operations, it's not cacheable.

    1. Ok. I guess I don't understand the semantics of caching in the new engine entirely.

Review request changed

Status: Closed (submitted)

Change Summary:

Submitted as https://github.com/pantsbuild/pants/commit/68b5a6c91ea661f3a6c2fca94b2b9d0234a562c6