Fix unicode string on stdout causing TaskError

Review Request #3944 — Created May 28, 2016 and updated

Bugs: 3517, 3532
Reviewers: jsirois, kwlzn, stuhood

- No encoding was set on the payload, which caused the encoding error detailed in the ticket
- Encode the payload string as utf-8 before it gets written to stdout

  • Added a new test which fails without this change
  • Ran locally
  1. Tests look great, thanks.
  2. src/python/pants/java/ (Diff revision 1)

    Could this go deeper? Into def iter_chunks, perhaps?

    1. also, it'd be great to fire off a PR and link that in the bugs field above - this will trigger both Jenkins and Travis runs.

  2. tests/python/pants_test/java/ (Diff revision 1)

    this strikes me as odd - IIUC you redirect sys.stdout to a StringIO then print() (implicitly to the StringIO referenced by sys.stdout) and then retrieve the value?

    but as near as I can tell, the whole point here is just to test the default encoding type of print()?

    if so, could you not achieve the same thing with:

    out_s = StringIO()
    print(self.content, file=out_s)
    self.stdout_content = out_s.getvalue().strip()

    or even simpler by mocking the behavior of print() sans files:

    self.stdout_content = self.content.encode(sys.stdout.encoding)

    (but this may all be moot after the comment below)
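Both alternatives suggested above can be sketched as a self-contained test (py3 print-function syntax; `content` is a stand-in for the original test's `self.content`, and utf-8 is assumed here in place of `sys.stdout.encoding` for portability):

```python
from io import StringIO

content = u'unicode caf\xe9'  # sample payload; stand-in for self.content

# Alternative 1: print into an explicit StringIO instead of redirecting
# sys.stdout and restoring it afterwards.
out_s = StringIO()
print(content, file=out_s)
stdout_content = out_s.getvalue().strip()
assert stdout_content == content

# Alternative 2: skip print() entirely and mirror its implicit encoding
# step directly (the original suggestion used sys.stdout.encoding).
encoded = content.encode('utf-8')
assert encoded.decode('utf-8') == content
```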

  3. tests/python/pants_test/java/ (Diff revision 1)

    seems to me like the scope of these tests should probably continue to test against the raw byte stream as written by write_chunk() (i.e. the wire format) vs stashing and comparing against a unicode conversion this early?

    then, as Stu said - you could move the chunk payload encoding to 'utf-8' deeper into NailgunProtocol.iter_chunks and then test the unicode reads using that directly - i.e. in the context the client would read and decode the byte stream as utf-8 on the read/client side of the protocol.

  2. tests/python/pants_test/java/ (Diff revision 1)

    happy to move the testing out to a different call stack in the code, but I'm not sure I clearly understand this.

    write_chunk() and iter/read_chunk(), as I see it, are doing the right thing: write_chunk() encodes unicode to bytes before writing out, and iter/read_chunk() decodes bytes to unicode after reading.

    it makes sense to encode to bytes before writing to stdout, since writing a unicode string to stdout causes an implicit encoding - output is always bytes.

    so maybe I am missing something here -- why would we encode again in iter/read(..)?

    1. well, afaict the reason you have to do a payload.encode('utf-8') in here is because of the implicit payload.decode('utf-8') that happens in

      so the idea here I think - is instead of the payload.encode('utf-8') in, you'd do something along the lines of making NailgunProtocol.read_chunk+iter_chunks take parameters that would control this (likely defaulting to decoding to 'utf-8') - but with an option to not do that as used by iter_chunks.

      then in you could just do something like:

      for chunk_type, payload in self.iter_chunks(self._sock, decoding=None):

      which would mean the change and its tests would move up into and - where all the tests are modeled as a connected pair of sockets that test both sides of the protocol.

      in there, I'm imagining you'd whip up a test to write a properly encoded payload, read it on the other end of the socket while exercising the new options, etc.
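A minimal sketch of that direction, with hypothetical free functions standing in for the NailgunProtocol classmethods (the 4-byte big-endian length + 1-byte type chunk header and the exact signatures are assumptions to check against the pants source):

```python
import socket
import struct

def _recv_exact(sock, size):
    # Read exactly `size` bytes, or raise if the peer hangs up early.
    buf = b''
    while len(buf) < size:
        data = sock.recv(size - len(buf))
        if not data:
            raise IOError('socket closed mid-chunk')
        buf += data
    return buf

def write_chunk(sock, chunk_type, payload=u''):
    # Encode unicode payloads to bytes on the write side of the protocol.
    if not isinstance(payload, bytes):
        payload = payload.encode('utf-8')
    sock.sendall(struct.pack('>Ic', len(payload), chunk_type) + payload)

def read_chunk(sock, decoding='utf-8'):
    # Decode on the read side by default, but let callers opt out with
    # decoding=None to avoid a redundant decode/encode round trip.
    payload_len, chunk_type = struct.unpack('>Ic', _recv_exact(sock, 5))
    payload = _recv_exact(sock, payload_len)
    return chunk_type, payload.decode(decoding) if decoding else payload

# Exercise both sides over a connected socket pair, as described above.
client, server = socket.socketpair()
write_chunk(client, b'O', u'caf\xe9')
chunk_type, raw = read_chunk(server, decoding=None)   # raw bytes path
write_chunk(client, b'O', u'caf\xe9')
_, text = read_chunk(server)                          # default decoding
```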

    2. > well, afaict the reason you have to do a payload.encode('utf-8') in here is because of the implicit payload.decode('utf-8') that happens in

      afaics, that payload.decode('utf-8') is in the right place. and yes the effect of that is that I get a unicode string in which to me is the right behavior.

    regardless of who the client is, calling a read(..) or iter(..) should only return unicode strings, not bytes. I think the right approach is that we should always decode after read, because we want unicode strings to be passed around and not bytes -- which is what is happening currently. let me know if you disagree.

    3. I definitely agree that payload.decode('utf-8') should be the /default/ case - but in the case of NailgunClient it seems like it'd be more efficient to simply avoid the initial decoding to begin with if we're just immediately calling encode() again, no? afaict, this dual decode/encode will effectively create two copies of every string the NailgunClient processes - one bytes and one unicode. this could add up to 2x additional memory allocation for nailgun/pailgun runs with a bunch of output.
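The double allocation described here can be seen directly (the size is illustrative):

```python
# A decode immediately followed by an encode materializes two full
# copies of every payload the client relays to stdout.
raw = b'x' * (1 << 20)                 # 1 MiB chunk off the wire

decoded = raw.decode('utf-8')          # copy #1: a unicode string
re_encoded = decoded.encode('utf-8')   # copy #2: bytes again

assert re_encoded == raw               # the round trip changed nothing
```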

    4. ok agreed - between a clear contract and memory efficiency, efficiency always wins -- it wasn't clear that you were suggesting the removal of decode(..) for memory footprint. adding the new code path then.

Review request changed
  1. Ship It!
  2. I'd be a fan of explicitly passing return_bytes=False here with an explanation; i.e. the payload, which becomes working_dir or command, can contain unicode chars as parts of path names or command argument values.
    That said, there is one caller, and that caller doesn't use these tuple slots at all:

    $ git grep -n "\.parse_request" -- src/
    src/python/pants/pantsd/    _, _, arguments, environment = NailgunProtocol.parse_request(self.request)

    So can the new return_bytes parameter just be dropped wholesale and the bytes never decoded?

    1. I think the decode() is a safeguard for any future cases of this being used. Having said that, since that future is not here currently, I'm happy to remove the decode altogether with a TODO of adding it back if a future caller needs it.

    2. The TODO seems like the best path to me. At this low layer there is no telling what constitutes the byte stream, and it's really only the particular nailgun application riding above that knows it's safe to decode the bytes in a particular way.
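The agreed direction might look roughly like the sketch below. It is an assumption-heavy illustration: the chunk-type letters follow the public nailgun protocol ('A' argument, 'E' environment, 'D' working directory, 'C' command), and the four-tuple shape is inferred from the grep output above; the real parse_request may differ.

```python
def parse_request(chunks):
    """Collect a nailgun request from (chunk_type, payload) pairs.

    Payloads are left as raw bytes at this layer.
    TODO: add an opt-in decode if a future caller needs text here; only
    the nailgun application riding above knows how the byte stream may
    safely be decoded.
    """
    working_dir, command, arguments, environment = None, None, [], {}
    for chunk_type, payload in chunks:
        if chunk_type == b'D':
            working_dir = payload
        elif chunk_type == b'C':
            command = payload
        elif chunk_type == b'A':
            arguments.append(payload)
        elif chunk_type == b'E':
            key, _, value = payload.partition(b'=')
            environment[key] = value
    return working_dir, command, arguments, environment

# Example: a request whose working dir contains a non-ascii path segment,
# exactly the case that motivated keeping the bytes undecoded here.
request = [
    (b'D', u'/home/caf\xe9'.encode('utf-8')),
    (b'A', b'--version'),
    (b'E', b'LANG=en_US.UTF-8'),
    (b'C', b'org.example.Main'),  # hypothetical main class
]
_, _, arguments, environment = parse_request(request)
```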