[Python-Dev] PEP 277 (unicode filenames): please review
Martin v. Loewis
martin@v.loewis.de
14 Aug 2002 08:23:42 +0200
Skip Montanaro <skip@pobox.com> writes:
> What's the current behavior? If my program receives an input in utf-8
> (let's say it comes from a form on a website), what form will it be in, or
> can't I tell?
In general, you cannot tell in advance - it will depend on the data
source.
W3C advocates "early normalization" towards "NFC", meaning that on the
Internet, you should always see NFC data - unless you are the primary
data source, e.g. by reading from a terminal, or after decoding some legacy
encoding. It turns out that most Python codecs will produce NFC
already, so normalization to NFC would be required only for user input,
and - as it turns out - when reading file names on OS X.
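A minimal sketch of normalizing at the point where data enters a program. It assumes a Python version whose standard library provides unicodedata.normalize (a later addition, not something available when this message was written) and uses Python 3 string syntax. The helper name to_nfc and the sample file name are hypothetical, chosen only for illustration.

```python
import unicodedata


def to_nfc(text):
    """Return text in Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# A decomposed name of the kind a file system might hand back:
# each accented letter is 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed_name = "re\u0301sume\u0301.txt"

# Normalize once, at the boundary, before storing or comparing.
composed_name = to_nfc(decomposed_name)

# The composed form is shorter because each 'e' + U+0301 pair
# becomes the single code point U+00E9.
print(len(decomposed_name), len(composed_name))
```

Applying to_nfc to data that is already NFC returns it unchanged, so it should be safe to call on every input regardless of its source.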
> Is it possible I will get spurious inequalities today if I compare
> two different unicode objects which were created from different
> sources and in different normal forms?
If they are in different normal forms, you *will* get inequalities
reliably. In the real world, those inequalities will be spurious, since
both strings represent the same text to the user.
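To illustrate, here is a small sketch comparing the same character in composed and decomposed form. It assumes Python 3 string syntax and a standard library that offers unicodedata.normalize, which is not something Python did at the time of writing; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Comparison is done code point by code point, so the two differ.
plain_equal = composed == decomposed

# Bringing both sides to the same normal form removes the difference.
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(plain_equal, normalized_equal)  # False True
```

Either normal form would do for the comparison, as long as both operands use the same one.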
> What about a string and a unicode object? Where can I read all
> about it (Python and unicode normalization)?
Python does no normalization, so there is nothing to read. For
Unicode, you may want to start with the Normalization FAQ
http://www.unicode.org/unicode/faq/normalization.html
Regards,
Martin