Fix unicode string on stdout causing taskerror
Review Request #3944 - Created May 28, 2016 and updated
|jsirois, kwlzn, stuhood|
- No encoding was set on payload which caused the encoding error detailed in the ticket
- Encode payload string with utf-8 before it gets written to stdout
- Added a new test which fails without this change
- Ran ci.sh locally
this strikes me as odd - IIUC you redirect
print()(implicitly to the
sys.stdout) and then retrieve the value?
but as near as I can tell, the whole point here is just to test the default encoding type of
if so, could you not achieve the same thing with:
out_s = StringIO() print(self.content, file=out_s) self.stdout_content = out_s.getvalue().strip()
or even simpler by mocking the behavior of
self.stdout_content = self.content.encode(sys.stdout.encoding)
(but this may all be moot after the comment below)
seems to me like the scope of these tests should probably continue to test against the raw byte stream as written by
write_chunk()(e.g. the wire format) vs stashing and comparing against a unicode conversion this early?
then, as Stu said - you could move the chunk payload encoding to 'utf-8' deeper into
NailgunProtocol.iter_chunksand then test the unicode reads using that directly - e.g. in the context the client would read and encode from the byte stream to
utf-8on the read/client-side of the protocol.
happy to move the testing out to different call stack in the code but not sure I clearly understand this.
write_chunk() and iter/read_chunk() as I see are doing the right thing. write_chunk() encodes unicode to bytes before writing it out and iter/read_chunk() decodes bytes to unicode after reading.
it makes sense to encode in bytes before stdout(..) since stdout of a unicode string causes an implicit encoding - output is always bytes.
so maybe I am missing something here -- why would we encode again in iter/read(..)?
I'd be a fan of explicitly passing
return_bytes=Falsehere with an explanation; ie: the payload, which becomes
commandcan contain unicode chars as parts of path names or command argument values.
That said, there is 1 caller, and that caller doesn't use these tupe slots at all:
$ git grep -n "\.parse_request" -- src/ src/python/pants/pantsd/pailgun_server.py:62: _, _, arguments, environment = NailgunProtocol.parse_request(self.request)
So can the new
return_bytesparameter just be dropped wholesale and the bytes never decoded?