Tuesday, July 12, 2011

Django and large datasets

If you've worked with large datasets in Python/Django, you may have seen memory consumption climb to 2GB on a dev machine, even with settings.DEBUG=False.

Django's queryset documentation hints at the issue, but does little to spell out what happens with big datasets:

https://docs.djangoproject.com/en/dev/topics/db/queries/#

Caching and QuerySets
Each QuerySet contains a cache, to minimize database access. It's important to understand how it works, in order to write the most efficient code.

In a newly created QuerySet, the cache is empty. The first time a QuerySet is evaluated -- and, hence, a database query happens -- Django saves the query results in the QuerySet's cache and returns the results that have been explicitly requested (e.g., the next element, if the QuerySet is being iterated over). Subsequent evaluations of the QuerySet reuse the cached results.

Keep this caching behavior in mind, because it may bite you if you don't use your QuerySets correctly. For example, the following will create two QuerySets, evaluate them, and throw them away:
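(The example the docs give is roughly the following; Entry is the docs' example model, and each comprehension builds, runs, and caches its own QuerySet before throwing it away.)

print [e.headline for e in Entry.objects.all()]
print [e.pub_date for e in Entry.objects.all()]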

...so if you run a query against a big dataset, e.g.:

for item in MyTable.objects.all():

...memory consumption can go through the roof. The culprit was iterating through all 475,000 objects: as the iteration proceeds, self._result_cache in the QuerySet object just keeps appending. Even though Django fetches 100 rows at a time from MySQL, each batch gets added to the _result_cache list, so you end up caching all 475K objects. No clearing, nada.
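You can see this for yourself with something like the following sketch (_result_cache is a private attribute, and MyTable stands in for whatever large model you have):

qs = MyTable.objects.all()
for item in qs:
    pass  # just touch every row

# Every row ever yielded is still referenced by the queryset's private cache,
# so the whole table stays in memory until qs itself is garbage-collected.
print len(qs._result_cache)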

Disqus approached the problem by implementing a SkinnyQuerySet to avoid storing extra data in memory, though it's not clear to me that the changes they discussed were aimed at querying a truly large dataset (i.e. one where you'd want LIMIT range queries). See p. 25 of http://www.scribd.com/doc/37072033/DjangoCon-2010-Scaling-Disqus (or the source code for SkinnyQuerySet at https://gist.github.com/550438). FYI -- the code also adds guards to prevent you from doing list() over a QuerySet.
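The rough idea, sketched from memory rather than copied from the gist (the class name and manager wiring here are mine, so treat it as an approximation):

from django.db import models
from django.db.models.query import QuerySet

class SkinnyishQuerySet(QuerySet):
    """Hypothetical approximation of the SkinnyQuerySet idea."""
    def __iter__(self):
        # Hand back the raw iterator instead of going through
        # _result_iter()/_fill_cache(), so nothing piles up in _result_cache.
        # (The real gist also adds guards around list() and repeated iteration.)
        return self.iterator()

class SkinnyishManager(models.Manager):
    def get_query_set(self):
        return SkinnyishQuerySet(self.model, using=self._db)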

The code that does all this stuff is here (ITER_CHUNK_SIZE is 100):
external/django/db/models/query.py(757)_fill_cache()
def _result_iter(self):
    pos = 0
    while 1:
        upper = len(self._result_cache)
        while pos < upper:
            yield self._result_cache[pos]
            pos = pos + 1
        if not self._iter:
            raise StopIteration
        if len(self._result_cache) <= pos:
            self._fill_cache()

def _fill_cache(self, num=None):
    """
    Fills the result cache with 'num' more entries (or until the results
    iterator is exhausted).
    """
    if self._iter:
        try:
            for i in range(num or ITER_CHUNK_SIZE):
                self._result_cache.append(self._iter.next())
        except StopIteration:
            self._iter = None
The iterator() method fixes the problem of Django's cache growing over time, but I realized that the problem shows up on the very first iteration of a queryset. Looking at an object reference graph (you have to install the objgraph library), I inspected one of the leaked objects and saw that it belongs to the rows already loaded into MySQLdb's _rows attribute after just one QuerySet iteration. So basically any QuerySet, even with iterator(), will hold the entire SQL result in memory.
(Pdb) import objgraph
(Pdb) objgraph.show_most_common_types(limit=20)
tuple                      41682

(Pdb) next
-> if not result_type:
(Pdb) import objgraph
(Pdb) objgraph.show_most_common_types(limit=20)
tuple                      419858
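If you want to reproduce that kind of reference graph yourself, something along these lines works (a sketch; it needs graphviz installed, and the output filename is arbitrary):

import random
import objgraph

# Grab one of the leaked tuples and draw where it is referenced from; in this
# case the chain led back to the MySQLdb cursor's _rows attribute.
leaked = random.choice(objgraph.by_type('tuple'))
objgraph.show_backrefs([leaked], max_depth=5, filename='tuple-backrefs.png')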
What is ultimately happening is that Django generates the SQL command and sends it to the Python MySQLdb wrapper, which eventually calls cursor.execute(), which calls store_result(), which stores the entire result set in memory (see the store_result() reference in http://mysql-python.sourceforge.net/MySQLdb.html). The tradeoff is either to load everything into memory or to fetch one row at a time. Obviously we have more control if we use LIMIT ranges to balance too much memory against too many DB queries.
django/db/models/sql/compiler.py:
728  ->         cursor = self.connection.cursor()
729             cursor.execute(sql, params)

/usr/local/lib/python2.6/dist-packages/MySQLdb/cursors.py(313)
"""This is a MixIn class which causes the entire result set to be
stored on the client side, i.e. it uses mysql_store_result(). If the
result set can be very large, consider adding a LIMIT clause to your
query, or using CursorUseResultMixIn instead."""

def _get_result(self): return self._get_db().store_result()
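For completeness, the "one row at a time" side of the tradeoff is what MySQLdb's SSCursor (built on CursorUseResultMixIn) gives you. A minimal sketch outside the ORM, with made-up connection parameters and a hypothetical handle_row() helper:

import MySQLdb
import MySQLdb.cursors

# SSCursor streams rows from the server (mysql_use_result) instead of copying
# the whole result set to the client up front.
conn = MySQLdb.connect(db='mydb', user='myuser', passwd='secret',
                       cursorclass=MySQLdb.cursors.SSCursor)
cursor = conn.cursor()
cursor.execute("SELECT id, name FROM myapp_mytable")
row = cursor.fetchone()
while row is not None:
    handle_row(row)          # process one row at a time
    row = cursor.fetchone()
cursor.close()
conn.close()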
The solution seems to be an implementation similar to queryset_iterator, where you do LIMITs upfront on the QuerySet (and also use iterator() to avoid caching the result set)...so what seems to work is a combination of the Disqus approach and this snippet: http://djangosnippets.org/snippets/1949/...
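Something along the lines of that snippet (a sketch, not the snippet verbatim; it assumes an integer pk, a non-empty queryset, and an arbitrary chunk size):

import gc

def queryset_iterator(queryset, chunksize=1000):
    # Walk the table in pk order, chunksize rows per query, and use iterator()
    # so Django never fills _result_cache. gc.collect() between chunks keeps
    # the working set small.
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize].iterator():
            pk = row.pk
            yield row
        gc.collect()

for item in queryset_iterator(MyTable.objects.all()):
    handle_row(item)   # hypothetical per-row handler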
