expose products for jvm bundle create and python binary create tasks

Review Request #4015 — Created June 21, 2016 and submitted

3477, 3501, 3593
mateor, patricklaw, peiyu, stuhood

bundle.jvm currently does not expose a product_type for any archives it creates (only the loose bundle directory). And binary.python-binary-create currently does not expose any products at all. We would like these tasks to expose products, in order to be able to consume those products in downstream (internal) tasks.

In particular, JVM apps (at least at Twitter) are generally published as archived (zip'd/tar'd) bundles. And python apps are deployed directly as pexes.
1. Expose deployable_archive product in jvm bundle_create task (arhives, including zip, tar, tar.gz, tar.bz2).
2. Expose same deployable_archive product in python_binary_create task (pexes).
3. Above products live in .pants.d, and they have symlinks created in dist dir.
4. Modify integration and unit test cases to accommodate changes in (3).
5. Add test cases.

Cloned from the pull request in RB: https://rbcommons.com/s/twitter/r/3959/

ci pending: https://travis-ci.org/pantsbuild/pants/builds/141133645

  • 0
  • 0
  • 13
  • 0
  • 13
Description From Last Updated
  2. Please add a pydoc for this method indicating what is being symlinked, and where.

    Also, this is doing a mix of symlinking and copying... if possible, would be good to make it more uniform. Since the loose bundle just contains a bunch of symlinks, copying it should be very cheap.

    1. Stu, the reason I created symlink for bundles is that jvm bundle dir itself contains symlinks to .pants.d, thus create a symlink for jvm bundles in dist/ should not break any current use cases.

  3. src/python/pants/util/dirutil.py (Diff revision 1)

    I'd prefer to see the behaviour here be similar to relative_symlink, where rather than deleting a destination even if it isn't already a link, it will only delete a destination if it is a link.

    So maybe rename this to absolute_symlink with similar semantics, and drop the safe prefix?

    1. ...or, if you just switch to copying, this should be moot?

    2. Odd, I didn't touch dirutil in any of my changes. I wonder if there was a commit in the PR that hadn't been added to the RB yet.

  4. src/python/pants/util/dirutil.py (Diff revision 1)

    This sortof exists at https://github.com/pantsbuild/pants/blob/master/src/python/pants/util/fileutil.py#L15-L20 ... possible to use that instead?

  1. Thanks for pinging me, Matt. I am a lot happier with the bundle task as it is now, so thanks for giving it some additional love. I have a very specific use case for bundle products, this has been on my "someday" list for a long time!

  2. The one ask I have left is to expose a pex_archive product alongside the deployed_archives product.

    The TLDR is that the existing compromise was to expose 3 products - jvm_archives, pex_archives and a joint product called deployable_archives.

    Stu wanted the joint product so that the bundle_create task could easily get add new languages which I also want.

    But I want separate products as well so that it could be possible to schedule only Jvm tasks or only Python tasks. When they share a product the scheduler has no choice but to schedule every task in the task graph.

    With just deployable archives, if I invoke a python task that requires deployable_archives it would also be forced to schedule every Jvm task as well.

    If there are no jvm targets in context, then the cost of scheduling extraneous tasks is somewhat minimal. But we have some tasks (buildgen, pom-resolve) that inject large swaths of targets into the buildgraph - unrelated to the root target. That works out fine as long as we are strict about relying on products for scheduling - and not having independent outputs coupled to the same product types.

  2. src/python/pants/backend/jvm/tasks/bundle_create.py (Diff revision 3)

    products should be updated for valid vts as well.

    1. From @stuhoods comment my understanding is that the archives should already exist for valid vt's, is this not the case or am I misunderstanding something?

      "Also, this block is not currently looking at which vts are valid... for any vt.valid == True, the archive should already exist."

    2. The archives exist for valid vt. But downstream tasks (like Wei's new task) get paths for those archives by consuming products. If products is not updated then downstream task will think there is no products available.

  2. I think if vt is valid, we do not have to call archiver.create, instead we can just do os.path.join.

  3. src/python/pants/backend/jvm/tasks/bundle_create.py (Diff revision 6)

    Current review will create archive only if target is root. This is not always correct. I believe for every vt we should create archive if required and save it to .pants.d. For root target, we create a copy of archive in dist/.

  4. here archive_path may be undeclared.

  2. In my approach, I created a symlink for bundle and a copy for archive in dist dir. The reason I am doing this is that (1) bundle dir itself contains symlink to .pants.d, thus create a symlink would be enough. (2) archive is an independent entity that does not contain any symlinks. In current pants implementation, when we issue a clean-all command, .pants.d is deleted, but dist is not. So creating a symlink for archive here will have the risk of linking to non-exist file. Thus we always create a full copy of archive in dist/.

    In Stu's previous comment, he mentioned creating copy for bundle as well. We may want to reach an agreement on this one with him.

  3. "if archivepath" should be enough.

    "archive" is an imported class which is always True.

  4. you may want to use "extensions" dict below as well.

  5. we want to populate jvm_archives product as well according to Mateo's comments.

  6. add products to pex_archives as well.

  7. Probably we should rename archive_type to archive_extension.

    It was my mistake that archive types from options can be different from their extentions (tgz vs tar.gz)

  8. rename to test_absolute_symlink

  1. Ship It!
  1. Thanks Matt/Yujie.

  2. src/python/pants/backend/jvm/tasks/bundle_create.py (Diff revision 11)

    This seems like a change in behaviour which is likely to break consumers. Was this necessary to get the tests passing?

    These should likely be defined somewhere in src/python/pants/fs/archive.py

    1. Archiver gets the appropriate archiver lib based on the key used.

      TAR = TarArchiver('w:', 'tar')
      TGZ = TarArchiver('w:gz', 'tar.gz')
      TBZ2 = TarArchiver('w:bz2', 'tar.bz2')
      ZIP = ZipArchiver(ZIP_DEFLATED, 'zip')

      _ARCHIVER_BY_TYPE = OrderedDict(tar=TAR, tgz=TGZ, tbz2=TBZ2, zip=ZIP)

      I could refactor this so that rather than literals the extensions are pulled from a dict in archive and then fetch the appropriate data out of archive.py.

      Not sure why this would break consumers, since its a new dynamic property.

  3. src/python/pants/backend/jvm/tasks/bundle_create.py (Diff revision 11)

    Ideally these would both use the same codepath to compute the filenames (ie, compute them before the vt.valid check, and then use the same paths in both branches). As it stands, they're likely to go out of sync.

  4. You'll want to bump the implementation_version for all affected tasks here.

    1. Good point, thanks for calling this out.

  5. ditto on the path symmetry

  6. src/python/pants/util/dirutil.py (Diff revision 11)

    space after : and before the path name

  1. I have one syntax thing and the rest basically ask for TODOs. Thanks for all the work on this Matt.

  2. Can you add a TODO to lift distdir management somewhere general, possibly even to Task itself?

    Obviously tasks will sometimes need to push to dist - it has always seemed a leak to make individual tasks handle managing the distdir by themselves. Ideally, the distdir itself would be hidden from concrete tasks.

    For instance - we could imagine distributables as just another product and a PantsDistributeTask would just consume that product and push to dist within it's execute(). Arbitrary tasks could just add to that product as needed.

    All of this is outside this review and probably something that needs a good bit of thought though - but a TODO would be nice :)

  3. This is another thing that I think should be its own task or perhaps inlined into the existing classpath tasks. The consolidate classpath is something that we could do earlier in the pipeline and it would be nice to just add those unpacked classed to a product and get the consolidated classpath for free.

    For example - some tasks unpack jars after resolve. Mightn't we get some compile speed if we consolidate the classpath before those unpacked dirs get sent to the compiler?

    Another example- our Docker plugin has to duplicate the consolidate_classpath logic since it is scheduled before bundle. The Android plugin would also benefit from this in a big way - it basically pays the loose source penalty 3 times - compile, dex, and bundle.

    If you agree, a TODO here would be fine for now.

  4. I don't think you need the else '' here since there isn't an assignment.

  5. Just to surface an unpopular opinion - it may be useful to Foursquare and Twitter to use symlinks for dist. But I consider that to be pretty user unfriendly to people who aren't involved in the inside-baseball.

    Especially since I actually question the utility of the symlink at all. How full does your distdir ever get? We generally use tempdirs and at worst blow it away before anything production related.

  1. Ship It!
Review request changed

Status: Closed (submitted)

Change Summary:

commit a09ac81420589749c3adc73e9fb02a0d8bc8447d