7 Unicode Changes

Python's Unicode support has been enhanced a bit in 2.2. Unicode strings are usually stored as UCS-2, as 16-bit unsigned integers. Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned integers, as its internal encoding by supplying --enable-unicode=ucs4 to the configure script. (It's also possible to specify --disable-unicode to completely disable Unicode support.)

When built to use UCS-4 (a ``wide Python''), the interpreter can natively handle Unicode characters from U+000000 to U+110000, so the range of legal values for the unichr() function is expanded accordingly. Using an interpreter compiled to use UCS-2 (a ``narrow Python''), values greater than 65535 will still cause unichr() to raise a ValueError exception. This is all described in PEP 261, ``Support for `wide' Unicode characters''; consult it for further details.

Another change is simpler to explain. Since their introduction, Unicode strings have supported an encode() method to convert the string to a selected encoding such as UTF-8 or Latin-1. A symmetric decode([encoding]) method has been added to 8-bit strings (though not to Unicode strings) in 2.2. decode() assumes that the string is in the specified encoding and decodes it, returning whatever is returned by the codec.

Using this new feature, codecs have been added for tasks not directly related to Unicode. For example, codecs have been added for uu-encoding, MIME's base64 encoding, and compression with the zlib module:

>>> s = """Here is a lengthy piece of redundant, overly verbose,
... and repetitive text.
... """
>>> data = s.encode('zlib')
>>> data
'x\x9c\r\xc9\xc1\r\x80 \x10\x04\xc0?Ul...'
>>> data.decode('zlib')
'Here is a lengthy piece of redundant, overly verbose,\nand repetitive text.\n'
>>> print s.encode('uu')
begin 666 <data>
M2&5R92!I<R!A(&QE;F=T:'D@<&EE8V4@;V8@<F5D=6YD86YT+"!O=F5R;'D@
>=F5R8F]S92P*86YD(')E<&5T:71I=F4@=&5X="X*

end
>>> "sheesh".encode('rot-13')
'furrfu'

To convert a class instance to Unicode, a __unicode__ method can be defined by a class, analogous to __str__.

encode(), decode(), and __unicode__ were implemented by Marc-André Lemburg. The changes to support using UCS-4 internally were implemented by Fredrik Lundh and Martin von Löwis.

See Also:

PEP 261, Support for `wide' Unicode characters: Written by Paul Prescod.

See About this document... for information on suggesting changes.