[Python-Dev] Unicode <--> UTF-8 in CPython extension modules
John Dennis
jdennis at redhat.com
Fri Feb 22 22:23:58 CET 2008
I've uncovered what seems to me to be a problem with Python Unicode
string objects passed to extension modules. Or perhaps it's revealing
a misunderstanding on my part :-) So I would like to get some
clarification.
Extension modules written in C receive strings from Python via the
PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
format parameter.
Many C libraries in Linux use the UTF-8 encoding.
The 's' format when passed a Unicode object will encode the string
according to the default encoding which is immutably set to 'ascii' in
site.py. Thus a C library expecting UTF-8 which uses the 's' format in
PyArg_ParseTuple will get an encoding error when passed a Unicode
string which contains any code points outside the ascii range.
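To make the failure concrete, here is a minimal sketch in pure Python. It is only an analogy for what the 's' conversion does with an 'ascii' default encoding, not a call into the C API itself; the helper name encode_or_none and the sample string are made up for illustration.

```python
# Sample text containing U+00E9, a code point outside the ASCII range.
text = u"caf\u00e9"


def encode_or_none(s, encoding):
    """Hypothetical helper: return the encoded bytes, or None when the
    chosen encoding cannot represent the string."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None


# Encoding with 'ascii' (the default encoding discussed above) fails ...
ascii_result = encode_or_none(text, "ascii")

# ... while encoding with UTF-8, which the C library expects, succeeds.
utf8_result = encode_or_none(text, "utf-8")
```

The same string that cannot be encoded as ASCII encodes cleanly as UTF-8, which is the mismatch a binding using 's' runs into.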
Now my questions:
* Is the use of the 's' or 's#' format parameter in an extension
binding expecting UTF-8 fundamentally broken and not expected to
work? Instead should the binding be using a format conversion which
specifies the desired encoding, e.g. 'es' or 'es#'?
* The extension modules could successfully use the 's' or 's#' format
conversion in a UTF-8 environment if the default encoding was
UTF-8. Changing the default encoding to UTF-8 would in one easy
stroke "fix" most extension modules, right? Why is the default
encoding 'ascii' in UTF-8 environments and why is the default
encoding prohibited from being changed from ascii?
* Did Python 2.5 introduce anything which now makes this issue visible
whereas before it was masked by some other behavior?
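For comparison with the first question, one way a caller might sidestep the default encoding is to encode to UTF-8 itself before handing the string to a binding. This is a rough sketch under that assumption only; to_utf8 is a hypothetical helper, and it assumes that any byte string passed in is already UTF-8.

```python
def to_utf8(value):
    """Hypothetical helper: return UTF-8 bytes for a Unicode string and
    pass byte strings through unchanged (assumed to be UTF-8 already)."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


# The caller encodes explicitly instead of relying on the default encoding.
payload = to_utf8(u"na\u00efve")
```

This moves the encoding decision to every call site, which is part of why I am asking whether the binding itself should state the encoding it needs.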
Summary:
Python programs which use Unicode string objects for their i18n and
which "link" to C libraries expecting UTF-8 through a CPython binding
that only uses the 's' or 's#' formats often seem to fail with
encoding errors. However, I have yet to see a CPython binding which
explicitly defines its encoding requirements. This suggests to me
that either I do not understand the issue in its entirety, or many
CPython bindings in Linux UTF-8 environments are broken with respect
to their i18n handling and the problem is currently not addressed.
--
John Dennis <jdennis at redhat.com>