Fix .proto files with unicode characters
Review Request #1330 - Created Nov. 13, 2014 and submitted
Fix .proto files with unicode characters in the comments. (Problem found in '_same_contents')
Updated unit tests, added a testproject example.
I tried naming the .proto file itself with a non-ASCII (UTF-8) character and it failed miserably. Fortunately I don't have that case, so I didn't try to fix it.
Whitespace really doesn't affect tokenization at all? It seems like you really want to compare canonicalized ASTs, though I suppose you don't have one readily available.
Also, the behavior has changed from before: now you're only stripping newlines, whereas before you stripped only spaces. Is this intentional? Do you also want tabs?
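To make the difference between the stripping variants concrete, here is a minimal sketch. The function name `same_contents_sketch` and its `pattern` parameter are hypothetical and are not the actual `_same_contents` code under review; the sketch only shows how the choice of stripped characters changes what counts as "the same".

```python
import re


def same_contents_sketch(a, b, pattern=r'\s+'):
    """Compare two strings after deleting everything that matches `pattern`.

    Hypothetical helper for illustration only:
      r' +'  deletes spaces only
      r'\n+' deletes newlines only
      r'\s+' deletes all whitespace, including tabs
    """
    return re.sub(pattern, '', a) == re.sub(pattern, '', b)


# Deleting all whitespace treats a reformatted file as unchanged...
reformatted_equal = same_contents_sketch('message Foo {\n}\n', 'message Foo{}')

# ...but it also merges adjacent tokens, so these compare equal too.
merged_tokens_equal = same_contents_sketch('int32 x', 'int32x')

# Deleting only newlines leaves tab differences visible.
tabs_differ = not same_contents_sketch('a\tb', 'ab', pattern=r'\n+')
```

The second comparison is the tokenization concern in a nutshell: deleting whitespace outright can make two files that tokenize differently look identical, which a comparison of canonicalized ASTs would not do.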
This seems like one of those cases where, down the road, some malformed (non-UTF-8-encoded) file is going to show up in the codebase or come from an external party, and an ensure_text TypeError is going to be super cryptic. Maybe catch that here and re-raise with a more specific error that includes the offending filename.
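Something along these lines is what I have in mind. This is only a sketch: `ProtoDecodeError` and `read_utf8_text` are hypothetical names, and it uses plain `bytes.decode` (which raises `UnicodeDecodeError`) rather than `ensure_text`, so the exception type caught in the real code may need to differ.

```python
class ProtoDecodeError(Exception):
    """Hypothetical error type that names the file that failed to decode."""


def read_utf8_text(path):
    """Read `path` as UTF-8 text, naming the file if decoding fails."""
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError as e:
        # Re-raise with the offending filename so the failure is actionable.
        raise ProtoDecodeError(
            '{} is not valid UTF-8: {}'.format(path, e)) from e
```

Chaining with `from e` keeps the original decode error (including the byte offset) in the traceback while the top-level message points at the file.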
Found why this behavior changed and reverted the code in _same_contents to what it was before the previous protobuf refactor. Left in the new test for the method.
Revision 2 (+75 -20)