Wednesday, November 17, 2010

Python 2.x and Unicode

Suppose you tried the following:
>>> a = u"Hey\u2019t"
>>> b = a.encode('utf-8')
>>> b.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
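Calling encode('utf-8') on a Python unicode object returns a Python str object, which in Python 2.x is a byte string holding the UTF-8 encoded bytes (not ASCII). Encoding only makes sense for unicode objects, so when you call encode('utf-8') on a str, Python 2.x first decodes it to unicode implicitly, using the default 'ascii' codec. That hidden decode fails on the first non-ASCII byte (0xe2 at position 3), which is why the error above is a UnicodeDecodeError even though you asked for an encode.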
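
As a rough illustration, the sketch below spells out the implicit step by hand. The helper name explicit_reencode is made up for this example, and the claim that it mirrors Python 2.x's internal behaviour is a simplification. The code is written to run the same way on Python 2.7 and Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Sketch only: makes Python 2.x's hidden ASCII decode explicit.

text = u"Hey\u2019t"           # unicode text, 5 characters
data = text.encode("utf-8")    # byte string, 7 bytes (the quote takes 3)


def explicit_reencode(raw):
    """Hypothetical helper: roughly what raw.encode('utf-8') does in 2.x."""
    # Step 1: implicit decode with the default 'ascii' codec (fails here).
    # Step 2: encode the resulting unicode object as UTF-8.
    return raw.decode("ascii").encode("utf-8")


try:
    explicit_reencode(data)
    error = None
except UnicodeDecodeError as exc:
    error = exc                # same error type as in the traceback above

# The fix: decode bytes with the codec they were written in, encode once.
roundtrip = data.decode("utf-8")
```

The general habit this suggests is to decode bytes to unicode as soon as they enter the program, work with unicode internally, and encode only once on the way out.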

The best slide talk that discusses these issues can be found here: http://farmdev.com/talks/unicode/

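Python 2.x has these problems in general because it has two separate string types, unicode (u'...' literals) for text and str ('...' literals) for byte strings, both derived from the basestring type, and it converts between them implicitly using the ASCII codec. Python 3.0 addresses this by making str a Unicode text type and moving binary data into a separate bytes type, with no implicit conversion between the two.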
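
For comparison, here is a small sketch of the same experiment on Python 3 (it will not behave this way on Python 2.x). The variable name outcome is just for illustration.

```python
# Python 3 only: str is text, bytes is binary data.
text = "Hey\u2019t"
data = text.encode("utf-8")    # str -> bytes

try:
    data.encode("utf-8")       # bytes objects have no encode() method
    outcome = "no error"
except AttributeError:
    outcome = "AttributeError"

# Going back to text always requires an explicit decode.
restored = data.decode("utf-8")
```

Instead of a confusing UnicodeDecodeError from a hidden conversion, the mistake shows up immediately as a missing method.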
