How to quack like a QuerySet

The main database I work on has a bit of a quirk where there are two sets of tables for something that is conceptually one entity. For the sake of this example, let’s call them cogs and sprockets. So, I have a “cog” table, a “sprocket” table, and then a bunch of related tables like “cog_tooth”, “sprocket_tooth”, and so on. As it turns out, the split between cogs and sprockets wasn’t really necessary, since they’re really both just gears, but enough infrastructure is in place to deal with cogs and sprockets as separate groups of entities that it’s too much work to change that now.

The basic problem, which I’ve been trying to solve for months now and have only just recently figured out, is how to just show a list of gears, with cogs and sprockets mixed together, while still allowing searching, sorting, and pagination. In other words, I don’t want one page for browsing cogs and another page for browsing sprockets; I just want a gear list. I want to bury the detail of whether a gear is a cog or a sprocket and let the user page through them all sorted together. And I want to do this, somehow, with the Django ORM.

What didn’t work

Before I realized I needed to quack like a QuerySet, I tried many other things. Each was a near-solution, but wasn’t practical for one reason or another.

Model inheritance

The obvious solution was to create a Gear model, and let Cog and Sprocket inherit from that model. Then, I could paginate through Gears and follow an association to a Cog or Sprocket if I needed further information. The problem with this approach is that it introduces a new table for gears. Outside of Django, there is a lot of existing infrastructure for importing, indexing, and managing cogs and sprockets. All of that code would have to be updated to create and update the gear table whenever working with cogs and sprockets. A lot of fields would need to be moved to the gear table to prevent having to do lots of joins during pagination. This would be far too much work just to accommodate Django’s particular inheritance technique, which is otherwise unnecessary for these systems that would need to change.

Database views

If I were to solve this problem from a strictly SQL perspective, I would use a UNION query, like:

SELECT name, num_teeth, 'cog' AS type
FROM   cog
UNION
SELECT name, num_teeth, 'sprocket' AS type
FROM   sprocket

So, it seemed like I could wrap that in a database view and create a Gear model with managed=False using that view as its “db_table”. I tried this, and it actually worked, but it was dreadfully slow. Perhaps this means I shouldn’t be using MySQL; I imagine Postgres’s query optimizer is sufficiently intelligent to run this type of query faster, but these are the cards I’ve been dealt. Unfortunately, performance is a show-stopper with this approach.

RawQuerySet

At this point, I thought maybe it would be best to drop to raw SQL so that I could optimize the queries by hand. I started studying the Manager.raw() method and RawQuerySet class, and built a custom QuerySet that ran a UNION query with a subselect and dynamically generated LIMITs to allow for pagination. This actually worked, and it performed just fine, but it had several issues. One was that I was generating SQL from strings in order to make pagination work, and this was dirty and error-prone. The larger issue is that I had no obvious way to implement .filter() for these RawQuerySets. Faking .order_by() is easy enough, but .filter() is very complex and pulls in a ton of Django internals. It would have taken forever to reinvent this wheel, and I really didn’t want to roll with a half-assed .filter() implementation, especially since my main use case was to be able to provide Django Admin ChangeList-style list views with lots of filtering options.

Looking at RawQuerySet was very inspiring, however. It made me realize that maybe what I wanted wasn’t so much a Gear model at all, but rather a Gear QuerySet – that Django Models were a bit too low-level for what I was trying to accomplish, and that QuerySets were a much better abstraction to code against. Django’s generic list view doesn’t require a model, for example; you can feed it any QuerySet, or really, any object that supports .count() and slicing. This led me to the big a-ha! moment:

Quack like a QuerySet

I mean “quack” in the “duck-typing” sense: create a class that behaves like a QuerySet, but don’t bother inheriting from the QuerySet class. It doesn’t have to provide the full suite of QuerySet features (which is huge), but just enough to perform the operations that the views require. This was the winning solution, and everything fell into place once I made this realization. Here’s what I ended up with:

from django.db.models import Q
from django.db.models.query import REPR_OUTPUT_SIZE
 
from myproject.myapp.models import Cog, Sprocket
 
ITER_HARD_LIMIT = 10000
 
class GearQuerySet(object):
    def __init__(self):
        self.ordering = ('description',)
        self.cog_query = Cog.objects.order_by(*self.ordering)
        self.sprocket_query = Sprocket.objects.order_by(*self.ordering)
 
    def __iter__(self):
        for row in self[:ITER_HARD_LIMIT]:
            yield row
 
    def __repr__(self):
        data = list(self[:REPR_OUTPUT_SIZE + 1])
        if len(data) > REPR_OUTPUT_SIZE:
            data[-1] = "...(remaining elements truncated)..."
        return repr(data)
 
    def __getitem__(self, k):
        if not isinstance(k, (slice, int, long)):
            raise TypeError
        assert ((not isinstance(k, slice) and (k >= 0))
                or (isinstance(k, slice) and (k.start is None or k.start >= 0)
                    and (k.stop is None or k.stop >= 0))), \
                "Negative indexing is not supported."
 
        if isinstance(k, slice):
            ordering = tuple(field.lstrip('-') for field in self.ordering)
            reverse = (ordering != self.ordering)
            if reverse:
                assert (sum(1 for field in self.ordering
                            if field.startswith('-')) == len(ordering)), \
                        "Mixed sort directions not supported."
 
            cq = self.cog_query
            sq = self.sprocket_query
 
            if k.stop is not None:
                cq = cq[:k.stop]
                sq = sq[:k.stop]
 
            rows = ([row + (Cog,)
                     for row in cq.values_list(*(ordering + ('pk',)))] +
                    [row + (Sprocket,)
                     for row in sq.values_list(*(ordering + ('pk',)))])
 
            rows.sort()
            if reverse:
                rows.reverse()
            rows = rows[k]
 
            pk_idx = len(ordering)
            klass_idx = pk_idx + 1
            cog_pks = [row[pk_idx] for row in rows
                            if row[klass_idx] is Cog]
            sprocket_pks = [row[pk_idx] for row in rows
                           if row[klass_idx] is Sprocket]
            cogs = Cog.objects.in_bulk(cog_pks)
            sprockets = Sprocket.objects.in_bulk(sprocket_pks)
 
            results = []
            for row in rows:
                pk = row[-2]
                klass = row[-1]
                if klass is Cog:
                    cogs[pk].type = 'cog'
                    results.append(cogs[pk])
                elif klass is Sprocket:
                    sprockets[pk].type = 'sprocket'
                    results.append(sprockets[pk])
            return results
        else:
            return self[k:k+1][0]
 
    def count(self):
        return self.cog_query.count() + self.sprocket_query.count()
 
    def all(self):
        return self._clone()
 
    def filter(self, *args, **kwargs):
        qs = self._clone()
        qs.cog_query = qs.cog_query.filter(*args, **kwargs)
        qs.sprocket_query = qs.sprocket_query.filter(*args, **kwargs)
        return qs
 
    def exclude(self, *args, **kwargs):
        qs = self._clone()
        qs.cog_query = qs.cog_query.exclude(*args, **kwargs)
        qs.sprocket_query = qs.sprocket_query.exclude(*args, **kwargs)
        return qs
 
    def order_by(self, *ordering):
        qs = self._clone()
        qs.cog_query = qs.cog_query.order_by(*ordering)
        qs.sprocket_query = qs.sprocket_query.order_by(*ordering)
        qs.ordering = ordering
        return qs
 
    def _clone(self):
        qs = GearQuerySet()
        qs.cog_query = self.cog_query._clone()
        qs.sprocket_query = self.sprocket_query._clone()
        qs.ordering = self.ordering
        return qs

The above QuerySet-like class implements the parts of the QuerySet interface that Django’s generic list view depends on: .count() and slicing with .__getitem__(). It also provides a ._clone() method which, despite the private-looking name, is generally expected to exist for a QuerySet. As usual, .filter(), .exclude(), and .order_by() call ._clone() to make a copy of the QuerySet before making modifications so that immutability is preserved.

The .__iter__() method is defined in terms of .__getitem__() and uses a constant, ITER_HARD_LIMIT, to define the maximum number of results that will be returned if the QuerySet is used as an iterator. This is to prevent accidentally loading the entire table into memory with a for-loop, since my .__getitem__() returns a list for slices, not an iterator. It is rather difficult to support lazy iteration through multiple tables in parallel while keeping them sorted together, so I elected not to solve this particular problem. Django’s RawQuerySet, on the other hand, defines an iterator directly and builds slicing in terms of that, so the right way to structure these methods really depends on your use case.

Quack like a Manager

I could stop right here and create instances of this custom QuerySet class directly, but I decided to carry the façade further and create duck-typed Manager and Model classes as well:

class GearManager(object):
    def count(self):
        return self.get_query_set().count()
 
    def all(self):
        return self.get_query_set()
 
    def filter(self, *args, **kwargs):
        return self.get_query_set().filter(*args, **kwargs)
 
    def exclude(self, *args, **kwargs):
        return self.get_query_set().exclude(*args, **kwargs)
 
    def order_by(self, *args, **kwargs):
        return self.get_query_set().order_by(*args, **kwargs)
 
    def get_query_set(self):
        return GearQuerySet()
 
class Gear(object):
    objects = GearManager()

With the above code, I can write expressions like Gear.objects.filter(…), Gear.objects.all().order_by(…), and so on, maintaining consistency with the way model objects are usually queried. This seems a bit silly at first glance, but in the future I will be adding methods to create new instances, so it’s good to have a place to put them (GearManager).

More curious is the fake model class (Gear), since it is currently just acting as a namespace for the “objects” attribute. I wonder if it would be worthwhile to make this class actually do something, perhaps making it a superclass of the Cog and Sprocket model classes (though I dislike the idea of adding a circular dependency between the modules these classes reside in). The idea of using Python’s new Abstract Base Class support might be worthwhile exploring, since it provides an advisory way to make isinstance() return True.

Has anyone else tried to do this sort of thing before? What do you think of this solution?

Edit: I updated the GearQuerySet example to include the algorithm for searching for cogs and sprockets together.

Trying Out PyCharm, Part 2

In which I actually use the editor to write code

Previously, I attempted to write up my experiences with the new beta of the PyCharm editor for Python, from JetBrains, makers of IntelliJ IDEA. By the time I got a basic installation working and started a Django project, I ran out of time. This time around, I actually wrote a little bit of code using PyCharm, so I have a few more things to say.

Things I like

It defaults to spaces, not tabs. This is correct.

Ctrl-space does menu completion, and this works very well. If you’re used to Intellisense, this does what you expect. Any IDE worth its salt should be able to accomplish this basic task.

Overriding methods get a little icon in the left gutter that takes you to the source code for the methods they override, and similarly, overridden methods allow you to navigate up and down the hierarchy. This is neat, and not something I’ve seen any Python editor do before.

The Structure panel is great. It updates as you code, and uses icons to distinguish between variables, functions, and methods.

You can duplicate a line with Ctrl+D, comment a line with Ctrl+/, and indent/dedent a line with Tab/Shift-Tab. If there is an active selection, these commands work on the selection instead. These are super-handy and very intuitive features.

Entire blocks can be easily rearranged by the Move Statement Up / Move Statement Down (Ctrl+Shift+Up/Down) commands. This seems like it would be really handy for code reorganization.

Ctrl+B will jump to the definition of the symbol under the cursor, and if it’s ambiguous it pops up a menu where you can pick which definition you mean. Likewise, Ctrl+N lets you type a class name to jump to, searches as you type, and lets you refine your choice with a pop-up menu. I can see this being a huge time saver.

The basic refactorings are all there: Rename, Change Signature, Move, Copy, Safe Delete, Inline, Pull Up, Push Down, Extract Interface, Extract Superclass. I’m not sure what Extract Interface does, since Python doesn’t really have interfaces (third-party implementations aside), but Extract Superclass has a nice interface where you can check which methods you want to move into the new base class. If the current class already extends a class, it will use multiple inheritance. This isn’t Java!

Out of the box, PyCharm supports CVS, Subversion, Git, and Mercurial. Not too shabby!

Editor windows can be split horizontally and vertically, and each split pane can have its own tabs. The tabs don’t currently support drag’n'drop, which is a bummer, but I bet they’ll have that working soon, since everyone expects that to work with web browsers nowadays.

Things I’m on the fence about

The real-time syntax checking is slightly distracting, just as I recall it being with IDEA. As I type, various things get underlined with red squigglies to notify me that my syntax is invalid. It’s invalid, usually, because I haven’t finished typing yet.

Code folding works great, if you’re into that sort of thing. I seem to be the odd one out – I can’t stand it. It just creates a bunch of noise, and I feel like if your program is so big and deeply nested that you need to collapse sections of it, you should fix your program. But anyway, PyCharm does folding just fine.

Things I don’t like

Introspection doesn’t always work too well. When I tried to create a models.CharField for a model, it popped up with the hint that CharField takes the arguments (self, *args, **kwargs). I’m sure there’s some fancy footwork going on in Django that makes this difficult, but it would be really helpful to have argument hints about basic stuff like this if PyCharm is to claim any sort of Django support.

One of the more amazing refactorings that IDEA can do, which I think is called Replace Inheritance With Delegation, is not available in PyCharm. I think this is a great tool, and hopefully it will get integrated with PyCharm someday. Python’s support for multiple inheritance makes it less necessary, perhaps (since the big benefit with Java is to be able to pick a different base class without breaking anything, since Java only allows single-inheritance).

By default, you can click anywhere and it will put the cursor at that point, even if it’s 50 characters to the right of the text. This really bothers me, but fortunately it’s easy to disable: File->Settings->IDE Settings->Editor->Virtual Space->Allow placement of caret after end of line.

PyCharm follows the convention that many Windows-style programs do these days, where Ctrl-Left/Right moves between words but Alt-Up/Down moves between paragraphs. I don’t know who came up with this, but I really prefer Ctrl-Up/Down for symmetry, and so that I can keep Ctrl held down and easily move around. Ctrl-Up/Down works like a mouse wheel, which is a nice feature but I wish it had a different shortcut. Alt-Left/Right switches between tabs, which is handy but also weird to me. But I’m really nitpicking here.

Indexing takes a long time. After opening the editor, you’ll want to grab some coffee or what-have-you.

Thumbs Up

This is really just a superficial evaluation of PyCharm, since I would have to use it for much longer to really critique it properly. It appears to be well-implemented, full-featured, and artfully designed. I think I could be happy using PyCharm. I don’t know if it’s compelling enough to break my Emacs addiction, but it’s certainly a worthy contender. I recommend giving it a try!

Ruby’s surprising handling of local variables

I was trying to read some Rails code today and came across the definition of the link_to view helper, which looks like this:

def link_to(*args, &block)
  if block_given?
    options      = args.first || {}
    html_options = args.second
    concat(link_to(capture(&block), options, html_options))
  else
    name         = args.first
    options      = args.second || {}
    html_options = args.third
 
    url = url_for(options)
 
    if html_options
      html_options = html_options.stringify_keys
      href = html_options['href']
      convert_options_to_javascript!(html_options, url)
      tag_options = tag_options(html_options)
    else
      tag_options = nil
    end
 
    href_attr = "href=\"#{url}\"" unless href
    "<a #{href_attr}#{tag_options}>#{name || url}</a>"
  end
end

At first glance, I thought for sure I had found a bug! The variable, href, is only initialized if html_options is specified. It seems like the “href_attr = … unless href” line would blow up otherwise, since it’s testing a variable that may not have been set. Or, so I thought. It turns out that my understanding of Ruby’s local variable semantics was wrong, as demonstrated by this simple test:

irb(main):001:0> x
NameError: undefined local variable or method 'x' for main:Object
        from (irb):1
        from :0
irb(main):002:0> if false
irb(main):003:1>   x = 0
irb(main):004:1> end
=> nil
irb(main):005:0> x
=> nil

It seems that assigning to an uninitialized variable in code that does not execute is sufficient to create that variable and assign it a default value of nil. This is in contrast to Python, which doesn’t define new variables unless the code that sets them actually executes:

>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
>>> if False:
...   x = 0
...
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined

Neither does Javascript:

js> x
typein:1: ReferenceError: x is not defined
js> if (false) x = 0;
js> x
typein:3: ReferenceError: x is not defined

Is it just me, or is Ruby’s behavior completely bizarre here?

Using the Python tokenizer for source transformation

I have been working on porting a medium-sized Django project from Django 0.96 to Django 1.0, and one of the necessary changes is converting to use Unicode strings (u’like this’) instead of byte strings (‘like this’). There was too much code to do this reliably by hand, so it seemed like a good idea to write a script to do it. Rather than hack together a bunch of regular expressions, I decided to try the Python tokenize module, since it seemed like I could get very reliable source translation that way.

My first attempt was to use the new untokenize function, which takes tokenizer output and turns it back into source code. However, despite the documentation which states that “conversion is lossless and round-trips are assured”, the coding style is not preserved. Whitespace is added in some places and removed in others, and even though the code runs the same, it looks ugly and generates huge undreadable diffs. Instead, I built the output source manually, since the tokenizer provides enough information about row and column positions. Here’s how it ended up:

#!/usr/bin/env python
 
import sys
import itertools
from tokenize import *
 
def token_line_number((num, token, spos, epos, line)):
    return spos[0]
 
def token_lines(tokens):
    return itertools.groupby(tokens, token_line_number)
 
def convert_strings(token_line):
    result = ''
    pad = 0
    for num, token, spos, epos, line in token_line:
        result += ' ' * (spos[1] + pad - len(result))
        if num == STRING and token[0] != 'u':
            result += 'u'
            pad += 1
        result += token
    return result
 
def convert_unicode(tokens):
    for line_number, token_line in token_lines(tokens):
        token_line = list(token_line)
        has_strings = False
        for num, _, _, _, _ in token_line:
            if num == STRING:
                has_strings = True
                break
        if has_strings:
            yield convert_strings(token_line)
        else:
            yield token_line[0][4]
 
tokens = generate_tokens(sys.stdin.readline)
for line in convert_unicode(tokens):
    sys.stdout.write(line.replace('__str__', '__unicode__'))

Overall, it was very simple to write and didn’t take too long. For any lines that didn’t have string literals, I just printed them out verbatim. Otherwise, I built a new line by assembling it token-by-token, padding with spaces to match the original column positions of each token (compensating for the additional padding introduced by adding the extra ‘u’s). As a post-process, I changed all definitions and calls to “__str__” with the preferred “__unicode__” with a simple search-and-replace.