Hus to Know?: Python 2.x and Unicode

Wednesday, November 17, 2010

Suppose you tried the following:

>>> a = u"Hey\u2019t"
>>> b = a.encode('utf-8')
>>> b.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)

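Calling encode('utf-8') on a Python unicode object returns a byte string (a str object) holding the UTF-8 encoding of the text, not an ASCII string. Calling encode('utf-8') again on that byte string makes Python 2 first decode it back to unicode implicitly, using the default ASCII codec. That hidden decode fails on the first non-ASCII byte (0xe2 at position 3), which is why the traceback above reports a UnicodeDecodeError even though the code asked for an encode.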
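The hidden step can be made visible by doing the ASCII decode by hand. The following is a minimal sketch rather than a description of the interpreter's internals; the variable names (text, data, error, round_trip) are made up for illustration, and the code is written so that it should behave the same on Python 2.7 and Python 3.3 or later.

```python
text = u"Hey\u2019t"           # Unicode text containing a right single quote
data = text.encode("utf-8")    # byte string holding the UTF-8 encoding

# Python 2's str.encode() begins by decoding the byte string with the
# default ASCII codec. Doing that decode explicitly hits the same failure.
error = None
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc                # failure at the first non-ASCII byte

# Decoding with the codec that produced the bytes round-trips cleanly.
round_trip = data.decode("utf-8")
```

Here error.start is 3, the same position the traceback reports, and round_trip equals the original text. The practical rule is to call decode on byte strings and encode on unicode objects, never encode on something that is already encoded.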

The best slide talk that discusses these issues can be found here: http://farmdev.com/talks/unicode/

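Python 2.x runs into these problems because it has two separate string types, unicode (written u'...') for text and str (written '...') for byte strings, both derived from basestring, and it converts between them implicitly using the ASCII codec. Python 3 addresses this by making str a Unicode text type and moving raw bytes into a separate bytes type, with no implicit conversion between the two.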
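For comparison, here is a small sketch of how the same mistake surfaces on Python 3. It assumes a Python 3 interpreter, and the names has_encode and mixed are illustrative only.

```python
text = "Hey\u2019t"            # str: Unicode text in Python 3
data = text.encode("utf-8")    # bytes: the encoded form

# bytes objects offer decode() but no encode(), so encoding twice is
# rejected up front instead of triggering a hidden ASCII decode.
has_encode = hasattr(data, "encode")

# Mixing text and bytes is rejected as well; nothing is converted implicitly.
try:
    mixed = text + data
except TypeError:
    mixed = None
```

With this split, data.encode('utf-8') fails with an AttributeError that points straight at the wrong type, rather than a UnicodeDecodeError that only appears once non-ASCII input shows up.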
