How to quack like a QuerySet

The main database I work on has a bit of a quirk where there are two sets of tables for something that is conceptually one entity. For the sake of this example, let’s call them cogs and sprockets. So, I have a “cog” table, a “sprocket” table, and then a bunch of related tables like “cog_tooth”, “sprocket_tooth”, and so on. As it turns out, the split between cogs and sprockets wasn’t really necessary, since they’re really both just gears, but enough infrastructure is in place to deal with cogs and sprockets as separate groups of entities that it’s too much work to change that now.

The basic problem, which I’ve been trying to solve for months now and have only just recently figured out, is how to just show a list of gears, with cogs and sprockets mixed together, while still allowing searching, sorting, and pagination. In other words, I don’t want one page for browsing cogs and another page for browsing sprockets; I just want a gear list. I want to bury the detail of whether a gear is a cog or a sprocket and let the user page through them all sorted together. And I want to do this, somehow, with the Django ORM.

What didn’t work

Before I realized I needed to quack like a QuerySet, I tried many other things. Each was a near-solution, but wasn’t practical for one reason or another.

Model inheritance

The obvious solution was to create a Gear model, and let Cog and Sprocket inherit from that model. Then, I could paginate through Gears and follow an association to a Cog or Sprocket if I needed further information. The problem with this approach is that it introduces a new table for gears. Outside of Django, there is a lot of existing infrastructure for importing, indexing, and managing cogs and sprockets. All of that code would have to be updated to create and update the gear table whenever working with cogs and sprockets. A lot of fields would need to be moved to the gear table to prevent having to do lots of joins during pagination. This would be far too much work just to accommodate Django’s particular inheritance technique, which is otherwise unnecessary for these systems that would need to change.

Database views

If I were to solve this problem from a strictly SQL perspective, I would use a UNION query, like:

SELECT name, num_teeth, 'cog' AS type
FROM   cog
UNION
SELECT name, num_teeth, 'sprocket' AS type
FROM   sprocket

So, it seemed like I could wrap that in a database view and create a Gear model with managed=False using that view as its “db_table”. I tried this, and it actually worked, but it was dreadfully slow. Perhaps this means I shouldn’t be using MySQL; I imagine Postgres’s query optimizer is sufficiently intelligent to run this type of query faster, but these are the cards I’ve been dealt. Unfortunately, performance is a show-stopper with this approach.

RawQuerySet

At this point, I thought maybe it would be best to drop to raw SQL so that I could optimize the queries by hand. I started studying the Manager.raw() method and RawQuerySet class, and built a custom QuerySet that ran a UNION query with a subselect and dynamically generated LIMITs to allow for pagination. This actually worked, and it performed just fine, but it had several issues. One was that I was generating SQL from strings in order to make pagination work, and this was dirty and error-prone. The larger issue is that I had no obvious way to implement .filter() for these RawQuerySets. Faking .order_by() is easy enough, but .filter() is very complex and pulls in a ton of Django internals. It would have taken forever to reinvent this wheel, and I really didn’t want to roll with a half-assed .filter() implementation, especially since my main use case was to be able to provide Django Admin ChangeList-style list views with lots of filtering options.

Looking at RawQuerySet was very inspiring, however. It made me realize that maybe what I wanted wasn’t so much a Gear model at all, but rather a Gear QuerySet – that Django Models were a bit too low-level for what I was trying to accomplish, and that QuerySets were a much better abstraction to code against. Django’s generic list view doesn’t require a model, for example; you can feed it any QuerySet, or really, any object that supports .count() and slicing. This led me to the big a-ha! moment:

Quack like a QuerySet

I mean “quack” in the “duck-typing” sense: create a class that behaves like a QuerySet, but don’t bother inheriting from the QuerySet class. It doesn’t have to provide the full suite of QuerySet features (which is huge), but just enough to perform the operations that the views require. This was the winning solution, and everything fell into place once I made this realization. Here’s what I ended up with:

from django.db.models import Q
from django.db.models.query import REPR_OUTPUT_SIZE
 
from myproject.myapp.models import Cog, Sprocket
 
ITER_HARD_LIMIT = 10000
 
class GearQuerySet(object):
    def __init__(self):
        self.ordering = ('description',)
        self.cog_query = Cog.objects.order_by(*self.ordering)
        self.sprocket_query = Sprocket.objects.order_by(*self.ordering)
 
    def __iter__(self):
        for row in self[:ITER_HARD_LIMIT]:
            yield row
 
    def __repr__(self):
        data = list(self[:REPR_OUTPUT_SIZE + 1])
        if len(data) > REPR_OUTPUT_SIZE:
            data[-1] = "...(remaining elements truncated)..."
        return repr(data)
 
    def __getitem__(self, k):
        if not isinstance(k, (slice, int, long)):
            raise TypeError
        assert ((not isinstance(k, slice) and (k >= 0))
                or (isinstance(k, slice) and (k.start is None or k.start >= 0)
                    and (k.stop is None or k.stop >= 0))), \
                "Negative indexing is not supported."
 
        if isinstance(k, slice):
            ordering = tuple(field.lstrip('-') for field in self.ordering)
            reverse = (ordering != self.ordering)
            if reverse:
                assert (sum(1 for field in self.ordering
                            if field.startswith('-')) == len(ordering)), \
                        "Mixed sort directions not supported."
 
            cq = self.cog_query
            sq = self.sprocket_query
 
            if k.stop is not None:
                cq = cq[:k.stop]
                sq = sq[:k.stop]
 
            rows = ([row + (Cog,)
                     for row in cq.values_list(*(ordering + ('pk',)))] +
                    [row + (Sprocket,)
                     for row in sq.values_list(*(ordering + ('pk',)))])
 
            rows.sort()
            if reverse:
                rows.reverse()
            rows = rows[k]
 
            pk_idx = len(ordering)
            klass_idx = pk_idx + 1
            cog_pks = [row[pk_idx] for row in rows
                            if row[klass_idx] is Cog]
            sprocket_pks = [row[pk_idx] for row in rows
                           if row[klass_idx] is Sprocket]
            cogs = Cog.objects.in_bulk(cog_pks)
            sprockets = Sprocket.objects.in_bulk(sprocket_pks)
 
            results = []
            for row in rows:
                pk = row[-2]
                klass = row[-1]
                if klass is Cog:
                    cogs[pk].type = 'cog'
                    results.append(cogs[pk])
                elif klass is Sprocket:
                    sprockets[pk].type = 'sprocket'
                    results.append(sprockets[pk])
            return results
        else:
            return self[k:k+1][0]
 
    def count(self):
        return self.cog_query.count() + self.sprocket_query.count()
 
    def all(self):
        return self._clone()
 
    def filter(self, *args, **kwargs):
        qs = self._clone()
        qs.cog_query = qs.cog_query.filter(*args, **kwargs)
        qs.sprocket_query = qs.sprocket_query.filter(*args, **kwargs)
        return qs
 
    def exclude(self, *args, **kwargs):
        qs = self._clone()
        qs.cog_query = qs.cog_query.exclude(*args, **kwargs)
        qs.sprocket_query = qs.sprocket_query.exclude(*args, **kwargs)
        return qs
 
    def order_by(self, *ordering):
        qs = self._clone()
        qs.cog_query = qs.cog_query.order_by(*ordering)
        qs.sprocket_query = qs.sprocket_query.order_by(*ordering)
        qs.ordering = ordering
        return qs
 
    def _clone(self):
        qs = GearQuerySet()
        qs.cog_query = self.cog_query._clone()
        qs.sprocket_query = self.sprocket_query._clone()
        qs.ordering = self.ordering
        return qs

The above QuerySet-like class implements the parts of the QuerySet interface that Django’s generic list view depends on: .count() and slicing with .__getitem__(). It also provides a ._clone() method which, despite the private-looking name, is generally expected to exist for a QuerySet. As usual, .filter(), .exclude(), and .order_by() call ._clone() to make a copy of the QuerySet before making modifications so that immutability is preserved.

The .__iter__() method is defined in terms of .__getitem__() and uses a constant, ITER_HARD_LIMIT, to define the maximum number of results that will be returned if the QuerySet is used as an iterator. This is to prevent accidentally loading the entire table into memory with a for-loop, since my .__getitem__() returns a list for slices, not an iterator. It is rather difficult to support lazy iteration through multiple tables in parallel while keeping them sorted together, so I elected not to solve this particular problem. Django’s RawQuerySet, on the other hand, defines an iterator directly and builds slicing in terms of that, so the right way to structure these methods really depends on your use case.

Quack like a Manager

I could stop right here and create instances of this custom QuerySet class directly, but I decided to carry the fa├žade further and create duck-typed Manager and Model classes as well:

class GearManager(object):
    def count(self):
        return self.get_query_set().count()
 
    def all(self):
        return self.get_query_set()
 
    def filter(self, *args, **kwargs):
        return self.get_query_set().filter(*args, **kwargs)
 
    def exclude(self, *args, **kwargs):
        return self.get_query_set().exclude(*args, **kwargs)
 
    def order_by(self, *args, **kwargs):
        return self.get_query_set().order_by(*args, **kwargs)
 
    def get_query_set(self):
        return GearQuerySet()
 
class Gear(object):
    objects = GearManager()

With the above code, I can write expressions like Gear.objects.filter(…), Gear.objects.all().order_by(…), and so on, maintaining consistency with the way model objects are usually queried. This seems a bit silly at first glance, but in the future I will be adding methods to create new instances, so it’s good to have a place to put them (GearManager).

More curious is the fake model class (Gear), since it is currently just acting as a namespace for the “objects” attribute. I wonder if it would be worthwhile to make this class actually do something, perhaps making it a superclass of the Cog and Sprocket model classes (though I dislike the idea of adding a circular dependency between the modules these classes reside in). The idea of using Python’s new Abstract Base Class support might be worthwhile exploring, since it provides an advisory way to make isinstance() return True.

Has anyone else tried to do this sort of thing before? What do you think of this solution?

Edit: I updated the GearQuerySet example to include the algorithm for searching for cogs and sprockets together.

3 thoughts on “How to quack like a QuerySet”

  1. Thanks! I modified the code example to include a full definition of .__getitem__() – I had originally left the algorithm out since it is pretty lengthy and sort of tangential to my main goal of this article which was to demonstrate a basic pattern for writing custom QuerySets. However, in the interest of providing a runnable example, I have added it back in.

    To really be a complete demo, it would require models and preferably a list view user interface – this would take more time than I have at the moment. One thing that I thought might be useful would be a “search everything” reusable app where you could configure which models and fields you want to be searchable, and get a search that runs across many models in your database. This might not be an optimal solution compared to Haystack, etc., if you’re trying to do fulltext search, but in some cases it could be handy, and would make a decent proof of concept.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">