Here's a great introduction to troubleshooting Unicode issues:
http://collective-docs.readthedocs.org/en/latest/troubleshooting/unicode.html
There's also a great presentation about demystifying Unicode in Python, which should be required reading. It goes into more detail about the complexities of UTF encodings, and it's well worth reviewing.
http://farmdev.com/talks/unicode/
One of the general rules of thumb you'll get from this talk is: 1) decode early, 2) use unicode everywhere, and 3) encode late.
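The three rules can be sketched roughly as follows. This is only an illustration, written in Python 3 syntax so that it runs on a current interpreter (there, bytes and str play the roles that str and unicode play in Python 2). The helper name clean_and_measure and the sample input are made up for this example.

```python
def clean_and_measure(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: bytes come in, bytes go out, text in between."""
    text = raw_bytes.decode(encoding)   # 1) decode early, at the input boundary
    text = text.strip()                 # 2) work only with text in the middle
    char_count = len(text)              #    len() now counts characters, not bytes
    encoded = text.encode(encoding)     # 3) encode late, at the output boundary
    return encoded, char_count


# Sample input: UTF-8 bytes for u'hey\u2013t' with stray whitespace around it
incoming = b"  hey\xe2\x80\x93t \n"
encoded, char_count = clean_and_measure(incoming)

print(repr(encoded))   # the cleaned value, as UTF-8 bytes
print(char_count)      # number of characters
print(len(encoded))    # number of bytes, larger because the dash takes 3 bytes
```

Keeping the decode and encode calls at the edges means the code in the middle never has to guess which kind of string it is holding.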
In Django, this approach is closely followed when writing data to the database. You usually don't need to encode your unicode objects yourself because that is handled at the database layer. Assuming your SQL database is configured properly and your Django settings are correct, Django's database layer handles the unicode to UTF-8 conversion seamlessly. For example, look inside the MySQLdb Python wrapper: right before a query is executed, the entire query string is encoded into the specified character set:
MySQLdb/cursors.py:

    if isinstance(query, unicode):
        query = query.encode(charset)
    if args is not None:
What if you attempt to use logging.info() on Django objects (e.g. logging.info("%s" % User.objects.all()[0]))? If you search Stack Overflow, you'll see a recommendation to define a __str__(self) method in your Python classes that calls unicode() and encodes the result as UTF-8:
http://stackoverflow.com/questions/1307014/python-str-versus-unicode
    def __str__(self):
        return unicode(self).encode('utf-8')
Django's base model definitions (django.db.models.base) also follow this convention:

    def __str__(self):
        if hasattr(self, '__unicode__'):
            return force_unicode(self).encode('utf-8')
        return '%s object' % self.__class__.__name__
Normally, Python handles string interpolation automatically, producing either a str or a unicode result depending on the types involved. Consider these cases:
    >>> print type("%s" % a)  # a is an instance of the class A defined below
    <type 'str'>
    >>> print type("%s" % 'hey')
    <type 'str'>
    >>> print type("%s" % u'hey')
    <type 'unicode'>
    class A(object):
        def __init__(self):
            self.tst = u'hello'

        def __str__(self):
            if hasattr(self, '__unicode__'):
                return self.__unicode__().encode('utf-8')
            return 'hey'

        def __unicode__(self):
            return u'hey\u2013t'

    >>> a = A()
    >>> print "%s" % a
    hey–t
    >>> print "%s %s" % (a, a.tst)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
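In this failing case, the problem is that interpolating the A instance produces a str (the UTF-8 bytes returned by __str__()), which is then mixed with a.tst, a unicode object. Because one of the arguments is unicode, Python 2 promotes the whole result to unicode and implicitly decodes the byte string with the default ASCII codec. The first non-ASCII byte (0xe2, the start of the encoded dash) triggers the UnicodeDecodeError.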
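The implicit step can be reproduced explicitly. The sketch below is written in Python 3 syntax (an assumption made so that it runs today; Python 3 never performs this decode on its own), and it only imitates what Python 2 does behind the scenes when a byte string meets a unicode object.

```python
text = u"hey\u2013t"
utf8_bytes = text.encode("utf-8")    # the kind of value A.__str__() returns

caught = None
try:
    # Explicit version of the decode Python 2 performs implicitly
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    caught = exc

print(caught)          # the codec error raised by the ASCII decode
print(caught.start)    # index of the offending byte

# Decoding with the codec that was actually used keeps everything as text
mixed = u"%s %s" % (utf8_bytes.decode("utf-8"), u"hello")
print(repr(mixed.encode("utf-8")))   # encode late, only for output
```

The error position points at the first byte of the encoded dash, which is why the traceback in the post reports position 3.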
The same problem occurs when you declare a __unicode__() method in your Django models and then print Django objects together with attributes that contain Unicode characters, similar to the issues reported in this Stack Overflow article. Because Python string interpolation invokes the __str__() method, you have to be careful about intermingling Django objects and Django attributes when printing or logging them.
What's the solution? In your Django models, it may actually be useful to have the __str__() method return a unicode object, assuming you also have a __unicode__() method defined. One of the quirks of Python 2 is that if __str__() returns a unicode object during string interpolation, Python switches to unicode formatting and then calls the __unicode__() method. It's somewhat counter-intuitive, but by adding this section of code, you can avoid the hazards of intermingling Django objects and attributes:
    # http://www.gossamer-threads.com/lists/python/bugs/842076
    def __str__(self):
        return u'%s object' % self.__class__.__name__
The recommendation is also consistent with this python-dev discussion about how to implement __str__() and __unicode__() methods:
    This was added to make the transition to all Unicode in 3k easier:

     * __str__() may return a string or Unicode object.
     * __unicode__() must return a Unicode object.

    There is no restriction on the content of the Unicode string for __str__().