Friday, September 21, 2012

Unicode Quirks in Django

Python 2.x's Unicode implementation really leaves something to be desired.  The major issue often arises when you need to go between str() and unicode() types, the former is a 8-bit character while the latter is a 16-bit character.   The problem is that doing .encode('utf-8') on a unicode object is idempotent (i.e. u'\u2013t'.encode('utf-8') but doing it on a str() twice will cause Python to trigger ascii codec errors.

Here's a great introduction of troubleshooting Unicode issues:

http://collective-docs.readthedocs.org/en/latest/troubleshooting/unicode.html

There's a great PowerPoint slide about demystifying Unicode in Python, which should be required reviewing.  It's more detailed about the complexities of UTF-encoding, but it's worthwhile to review.

http://farmdev.com/talks/unicode/

One of the general rule of thumbs that you'll get from this talk is 1) decode early 2) unicode everywhere and 3) encode late.

In Django, this approach is closely followed when writing data to the database.  You usually don't need to convert your unicode objects because it's being handled at the database layer.  Assuming your SQL database is configured properly and your Django settings are set correctly, Django's database layer handles the unicode to UTF-8 conversion seamlessly.  For example, just look inside the MySQLdb Python wrapper and right before a query is executed, the entire string is encoded into the specified character set:

MySQLdb/cursors.py:

if isinstance(query, unicode):
            query = query.encode(charset)
        if args is not None:

What if you attempt to use logging.info() on Django objects?   (i.e. logging.info("%s" % User.objects.all()[0])   If you searched on Stack Overflow, you'd see a recommendation to create a __str__(self) in your Python classes that call unicode() and convert to UTF-8:

http://stackoverflow.com/questions/1307014/python-str-versus-unicode

def __str__(self):
    return unicode(self).encode('utf-8')
Django's base model definitions (django.db.models.base) also follow this convention:

  def __str__(self):
        if hasattr(self, '__unicode__'):
            return force_unicode(self).encode('utf-8')
        return '%s object' % self.__class__.__name__

Normally, Python handles string interpolations automatically by determining whether the string is unicode or str() type.  Consider these cases:

>>> print type("%s" % a)
<type 'str'>
print type("%s" % 'hey')
<type 'str'>
print type("%s" % u'hey')
<type 'unicode'>

Assuming your character set on your database is set to UTF-8, consider this example and how Python deals with string interpolations for class. Normally Python does unicode conversions automatically, but for Python classes, "%s" always means to invoke the str() function.

class A(object):

    def __init__(self):
        self.tst = u'hello'

    def __str__(self):
        if hasattr(self, '__unicode__'):
            return self.__unicode__().encode('utf-8')
        return 'hey'

    def __unicode__(self):
        return u'hey\u2013t'

>>> a = A()
>>> print "%s" % a
hey-t
>>> print "%s %s" % (a, a.tst)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)

In this failing case, the problem is that printing the A class results in printing a str() type intermixed with a.tst, which is a unicode type.  When this issue happens, you're likely to see the UnicodeDecodeError

The same problem happens when trying to attempt to declare the __unicode__() method in your Django models and attempt to print out Django objects and attributes that have Unicode characters, similar to the issues reported in this Stack Overflow article.  Because Python string interpolation will  invoke the __str__() method, you have to be careful about intermingling Django objects and Django attributes when printing or logging them.

What's the solution?  In your Django models, it actually may be useful to force returning the Unicode type in the __str__() method, assuming you also have a __unicode__() method defined.  One of the quirks of Python is that if a Unicode type is returned, the __unicode__() method will be attempted to execute.   It's somewhat counter-intuitive, but by adding this section of code, you can avoid the hazards of intermingling Django objects and attributes:

# http://www.gossamer-threads.com/lists/python/bugs/842076
def __str__(self):
    return u'%s object' % self.__class__.__name__

The recommendation is also consistent with this python-dev discussion about how to implement __str__() and __unicode__() methods:

This was added to make the transition to all Unicode in 3k easier:

. __str__() may return a string or Unicode object.
. __unicode__() must return a Unicode object.

There is no restriction on the content of the Unicode string
for __str__().

Another alternative is to prefix your logging statements with u', which will force the Python string interpolation to run __unicode__() instead of str(). But it's easy to forget this prefix, so overriding the Django base model with this __str__() has helped to avoid triggering these UnicodeDecodeException errors for me.

No comments:

Post a Comment