Make monolithic jar slimmer

Review Request #3133 — Created Nov. 16, 2015 and submitted

peiyu
pants
peiyuwang:peiyu/make-fat-jar-optional2
2577
73411e9...
pants-reviews
fkorotkov, gmalmquist, jsirois, patricklaw, stuhood, zundel

Monolithic jar does not scale, even with --no-deployjar that excludes external
jars, internal dependencies can still be large. In Twitter we saw a few hundred
Mbytes jars to transfer, for remote offices this is more painful. Breaking into
smaller jars can also benefit hadoop folks to better utilize their caching layer [1]

This review makes --no-deployjar no longer bundle internal libraries, instead,
make them symlinks under libs like we do for 3rdparty, the links will go to those
jars under .pants.d, symlinks will have stable names, those stable names are used
as archive contents, they are also to be embeded as the Manifest's Class-Path
attribute.

Current --no-deployjar

app-bundle.jar // this can be large
libs/
libs/3rdparty-x.jar
libs/3rdparty-others.jar

New --no-deployjar

app-bundle.jar // this only sets Class-Path + other Manifest entries
libs/
libs/3rdparty-x.jar
libs/3rdparty-other.jar
libs/internal-library-a.jar // symlinks to .pants.d, unless it's shaded
libs/internal-library-other.jar

---deployjar will remain as is.

[1] Hadoop use case is they use SharedCache distribute jars from HDFS. They have
a caching layer that does file checksums so if we break down into smaller jars
they will be able to transfer less.
http://www.slideshare.net/ctrezzo/hadoop-summit2015-yarnsharedcachev12

https://travis-ci.org/peiyuwang/pants/builds/91439042

  • 0
  • 0
  • 2
  • 3
  • 5
Description From Last Updated
  1. 
      
    1. yea, I can save this change by doing what previous test_shader_integration._bundle_and_run did, load the json returned and compare.

    2. Fixed.

  2. 
      
  1. 
      
  2. I think it would be more consistent to copy the inputs rather than symlinking in this case. Adding shading rules shouldn't change the layout of your bundle.

    1. yep, will do

    2. fixed now, shader makes copies so we are totoally safe.

      Only difference is shader is run multiple times compared to before 1 for internal + number of external jars times.

  3. underscore/snake_case for variables

  4. snake_case

  5. ...the use of target_name here is weird... wish I'd noticed that in the original review.

    1. can you elaborate?

      target_name is used when address.spec_path is None which is for root level targets, seems reasonable to me

  6. "Avoid name collisions by ..."

  7. Please include a commented indicating why this is using an ordered map.

    1. no necessary any more, reverted by using json load and compare the returned map

  8. 
      
  1. In general LGTM. I looked over this to convince myself this isn't going to change square's use of deploying monolithic binaries using ./pants binary I'll leave the final approval to others who are already engaged.

  2. 
      
  1. Others have pointed out the style nits I would, and nothing in this looks particularly dangerous for 4sq's usage of ./pants bundle, but I'm confused about how this method helps with shipping around large binaries. Could you include a small, concrete example of how the structure between a bundle in the old way and a bundle in this new way would differ?

    1. I think the difference is that right now, bundle puts 3rdparty jars in a separate directory and puts them on the classpath. In this change, it will also put jars compiled in the repo in the separate directory. So the jar in the root of the bundle jar will be very slim indeed.

    2. But for the purposes of shipping this jar around, what has changed? Isn't it useless without its libs + first party classfiles?

    3. This specific request is from hadoop people, I checked with them, hadoop has a caching layer, called sharedcache to copy the jars, internally computing checksum checks. Also see, http://www.slideshare.net/ctrezzo/hadoop-summit2015-yarnsharedcachev12

      Patrick, look for test_jvm_app_integration.test_bundle in this reviewboard for an example, that's exactly like Eric said. We don't just ship this jar we will still ship the entire bundle archive. If you just want to ship the jar, the old option deployjar is still provided.

    4. Interesting, this is an exciting change then. We've been wanting to ship around slim hadoop jars for a while now, and I was waiting to see how these upstream bundling changes shook out.

  2. 
      
  1. 
      
  2. src/python/pants/backend/jvm/tasks/bundle_create.py (Diff revision 2)
     
     
     
     
     
     
     
     
     
     
     
     
     

    I removed this, I suspect this is not useful since internal classpath is handled by JvmBinaryTask.monolithic_jar, external classpath is handled below.

    I placed an assert False there and no tests break.

    Will check with jsirios, maybe just old code.

    1. Tested this in twitter repo and examined the bundled archive. (use buildcache service as an example, check and deploy)

      verified we get

      • synthetic jar (now just 4.1K, was 28M, for hadoop it's 29K vs 520M)
      • library jars under libs
      • resource files under libs

      Unzip and compared with the current fat jar, no more no less, deployed and runs fine.

      I am going to drop this, so it is covered by add_target logic

  3. 
      
  1. Thanks Peiyu.

  2. It feels like we should be able to delete some code somewhere in here, since this change should eliminate the "monolithic for internal dependencies" case entirely, leaving us only the "monolithic for everything" and "symlink for everything" cases. Am I missing cases?

    Can you make a pass over this and see whether we can delete some additional codepaths?

    1. One method that's cleaned up is outside of jar_task in bundle_create, add_jars (see my comment above to jsirios)

      What's left in add_target seems all necessary - adding manefest and handling the monolithic jar case.

  3. src/python/pants/backend/jvm/tasks/jar_task.py (Diff revision 2)
     
     
     
     

    I think there is a bit of a false dichotomy here... you will be adding classpath entries to the manifest either way... it's just that in one case they will be symlinks, and in the other case they will be a new monolithic jar.

    1. in the --deployjar case, no classpath entries (both external and internal classpath) are needed in manifest.

      in the --no-deployjar case, it's the opposite, full classpath entries are needed in manifest (Unless shading happens, there will be real files instead of symlinks.

      TODOs for me:
      1. improve testing, I didn't check manifest explicitly in integration, only run the jar, that only ensure I have sufficient classpath doesn't check only necessary classpath are kept.
      2. improve testing, need to have test check not symlinks for the shading case.

      Will address them.

    2. Improve tests:
      - library jars under bundle/libs are symlinks except for shading
      - Add a 3rdparty dependency to testprojects/src/java/org/pantsbuild/testproject/bundle
      so test_jvm_app_integration can test bundle behavior (whether Class-Path) is set
      properly for both internal and external dependencies

  4. 
      
  1. Feel free to drop the first issue--it's more a question and a request for clarification on the design goals here, not necessarily a blocker.

  2. I think this is equivalent to os.path.join(address.spec, address.target_name), though I'm a little wary of a stable name that looks like this. We also already have Address.path_safe_spec, although reading the code it seems like we still haven't fixed it to use a different separator for the target name.

    Is the expectation here that this is being run out of dist, and that re-bundling / re-running the jar will always result in the corresponding internal lib symlinks being updated? I'm slightly concerned that there could be a case where the jar is rerun but the symlinks haven't been updated, and as a result it silently runs with out of date or otherwise unexpected jars because the names don't have any fingerprinting information in them.

    1. (a reminder that changing this will require changing the intellij plugin)

    2. path_safe_spec would work too but this is a shared between plugin so probably not worth the change

      for deploy purpose, people call --bundle-jvm-archive=zip/tgz, at that point, all symlinks are materialized (see archive.py followlinks=True in this rb)

      to call locally, there are two cases
      1. --deployjar mode is no different from today, one big jar for everything in the repo plus all external symlinks
      2. --no-deployjar mode with or without this change, you will need to rerun bundle command in order to get the change, the difference is one to update the jar, the other to update symlinks

  3. What is an example of where such a name collision would occur?

    1. Noticed after the fact that this is moved code. Feel free to disregard the question (though I am curious).

    2. right, from moved code.

      I am curious too, @fkorotkov would know, I searched two bundles I tried, all 0-...jar, didn't find 1-. will try to dig a bit more, maybe just to be safe

    3. The original test RuntimeClasspathPublisherTest included a scenario for incremental compile: 0-z2.jar 1-z3.jar

      Another example I noticed that 1- is used, is when the additional resource are appended, in twitter for example the build.properties is created that's appended to the classpath. this is not exactly a name collision, because one is jar the other is a directory.

  4. Does this actually need to pretend to be platform-independent? In this case we always want :, because java always requires : (I think).

    1. ditto agree but to change to : will have to change plugin

  5. Whitespace is off

  6. 
      
Review request changed

Status: Closed (submitted)

Change Summary:

Merged as 6285e4f27a0fa35f219449c00395cf78de125a20

Loading...