Do not crash on unicode filenames

Review Request #1193 — Created Oct. 22, 2014 and submitted

dturner-tw
pants
d4ef453...
pants-reviews
benjyw, ity, jinfeng, jsirois, peiyu, stuhood

Introduce a new function, safe_walk, which wraps os.path.walk, but
ensures that the returned values are unicode objects. This isn't
strictly safe, in that it is possible that some paths will not be
decodeable, but that case is rare, and the only alternative is to
somehow avoid all interaction between paths and unicode objects, which
seems especially tough in the presence of unicode_literals. See e.g.
https://mail.python.org/pipermail/python-dev/2008-December/083856.html

Note that rbcommons has mangled the filename of the newly added file -- it is in fact testprojects/tests/java/com/pants/testproject/unicode/中文/.gitsave (if you download the diff and apply it, all will be well).

Ran new test.

  • 0
  • 0
  • 3
  • 1
  • 4
Description From Last Updated
DT
JI
  1. 
      
  2. is it your change?

    1. No, RBT just decided to bite me for no reason. Sadness.

  3. ditto.

  4. 
      
DT
BE
  1. 
      
  2. src/python/pants/util/dirutil.py (Diff revision 3)
     
     

    Nit - docstrings should begin with a single summary line.

    Any subsequent text should be separated from that first sentence by a single blank line.

    See: http://legacy.python.org/dev/peps/pep-0257/#multi-line-docstrings

    Also, it's os.walk, not os.path.walk.

  3. src/python/pants/util/dirutil.py (Diff revision 3)
     
     

    This isn't python3 safe. In python3 a 'str' is already decoded.

    Use six to make this work across python2 and python3: Look at src/python/pants/util/strutil.py. You can add an ensure_text() function there and use it here.

  4. src/python/pants/util/dirutil.py (Diff revision 3)
     
     

    I don't see how this guarantees anything about the returned values. All you've done is ensure something about the type of the input.

    If this causes some magic to happen inside os.walk then that needs to be documented.

    1. It does cause magic to happen inside os.walk (this is discussed in the linked thread but long thread is long); will comment it.

  5. Use single quotes.

  6. You're not testing the case of a non-ascii filename somewhere underneath an ascii root dir. And based on my comment above about the implementation, I'm not even sure that will work?

    1. os.path.join(tmpdir, '中文') will almost always be a non-ascii project under an ascii root dir (because tmpdir will almost always be ascii). Or do you mean a filename as distinct from a dirname?

  7. 
      
ZU
  1. At a high level, the best way for us to make sure that unicode filenames work and continue to work is to create one somewhere in testprojects/ and then run integration tests that include it. If we accept this change and a new instance of a 'naked' os.walk() is added, developers won't be aware that unicode filename processing is broken.

    1. OK, I'll move the file from examples into an existing project in testprojects/ -- since there are already integration tests over that, this should avoid future breakage.

    2. Sorry, my comment is kind of boneheaded. I didn't see the unicode filename in rbcommons' summary of the change (it truncated it) and it re-write the name in the diff lit.

    3. Yeah, rbcommons sort of mangled that one -- I filed a bug on that last night.

  2. 
      
DT
DT
BE
  1. 
      
  2. Ooops, I misread which dir you were starting your walk at.

  3. 
      
BE
  1. Ship It!

  2. 
      
IT
  1. Ship It!

  2. 
      
DT
DT
DT
DT
DT
Review request changed

Status: Closed (submitted)

Change Summary:

In master at eaa2fc58b2fe66f6635b847f74f16ed5fecc8eec
Loading...