Category Archives: Web Development

How to quack like a QuerySet

The main database I work on has a bit of a quirk where there are two sets of tables for something that is conceptually one entity. For the sake of this example, let’s call them cogs and sprockets. So, I have a “cog” table, a “sprocket” table, and then a bunch of related tables like “cog_tooth”, “sprocket_tooth”, and so on. As it turns out, the split between cogs and sprockets wasn’t really necessary, since they’re really both just gears, but enough infrastructure is in place to deal with cogs and sprockets as separate groups of entities that it’s too much work to change that now.

The basic problem, which I’ve been trying to solve for months now and have only just recently figured out, is how to just show a list of gears, with cogs and sprockets mixed together, while still allowing searching, sorting, and pagination. In other words, I don’t want one page for browsing cogs and another page for browsing sprockets; I just want a gear list. I want to bury the detail of whether a gear is a cog or a sprocket and let the user page through them all sorted together. And I want to do this, somehow, with the Django ORM.

What didn’t work

Before I realized I needed to quack like a QuerySet, I tried many other things. Each was a near-solution, but wasn’t practical for one reason or another.

Model inheritance

The obvious solution was to create a Gear model, and let Cog and Sprocket inherit from that model. Then, I could paginate through Gears and follow an association to a Cog or Sprocket if I needed further information. The problem with this approach is that it introduces a new table for gears. Outside of Django, there is a lot of existing infrastructure for importing, indexing, and managing cogs and sprockets. All of that code would have to be updated to create and update the gear table whenever working with cogs and sprockets. A lot of fields would need to be moved to the gear table to prevent having to do lots of joins during pagination. This would be far too much work just to accommodate Django’s particular inheritance technique, which is otherwise unnecessary for these systems that would need to change.

Database views

If I were to solve this problem from a strictly SQL perspective, I would use a UNION query, like:

SELECT name, num_teeth, 'cog' AS type
FROM   cog
SELECT name, num_teeth, 'sprocket' AS type
FROM   sprocket

So, it seemed like I could wrap that in a database view and create a Gear model with managed=False using that view as its “db_table”. I tried this, and it actually worked, but it was dreadfully slow. Perhaps this means I shouldn’t be using MySQL; I imagine Postgres’s query optimizer is sufficiently intelligent to run this type of query faster, but these are the cards I’ve been dealt. Unfortunately, performance is a show-stopper with this approach.


At this point, I thought maybe it would be best to drop to raw SQL so that I could optimize the queries by hand. I started studying the Manager.raw() method and RawQuerySet class, and built a custom QuerySet that ran a UNION query with a subselect and dynamically generated LIMITs to allow for pagination. This actually worked, and it performed just fine, but it had several issues. One was that I was generating SQL from strings in order to make pagination work, and this was dirty and error-prone. The larger issue is that I had no obvious way to implement .filter() for these RawQuerySets. Faking .order_by() is easy enough, but .filter() is very complex and pulls in a ton of Django internals. It would have taken forever to reinvent this wheel, and I really didn’t want to roll with a half-assed .filter() implementation, especially since my main use case was to be able to provide Django Admin ChangeList-style list views with lots of filtering options.

Looking at RawQuerySet was very inspiring, however. It made me realize that maybe what I wanted wasn’t so much a Gear model at all, but rather a Gear QuerySet – that Django Models were a bit too low-level for what I was trying to accomplish, and that QuerySets were a much better abstraction to code against. Django’s generic list view doesn’t require a model, for example; you can feed it any QuerySet, or really, any object that supports .count() and slicing. This led me to the big a-ha! moment:

Quack like a QuerySet

I mean “quack” in the “duck-typing” sense: create a class that behaves like a QuerySet, but don’t bother inheriting from the QuerySet class. It doesn’t have to provide the full suite of QuerySet features (which is huge), but just enough to perform the operations that the views require. This was the winning solution, and everything fell into place once I made this realization. Here’s what I ended up with:

from django.db.models import Q
from django.db.models.query import REPR_OUTPUT_SIZE
from myproject.myapp.models import Cog, Sprocket
class GearQuerySet(object):
    def __init__(self):
        self.ordering = ('description',)
        self.cog_query = Cog.objects.order_by(*self.ordering)
        self.sprocket_query = Sprocket.objects.order_by(*self.ordering)
    def __iter__(self):
        for row in self[:ITER_HARD_LIMIT]:
            yield row
    def __repr__(self):
        data = list(self[:REPR_OUTPUT_SIZE + 1])
        if len(data) > REPR_OUTPUT_SIZE:
            data[-1] = "...(remaining elements truncated)..."
        return repr(data)
    def __getitem__(self, k):
        if not isinstance(k, (slice, int, long)):
            raise TypeError
        assert ((not isinstance(k, slice) and (k >= 0))
                or (isinstance(k, slice) and (k.start is None or k.start >= 0)
                    and (k.stop is None or k.stop >= 0))), \
                "Negative indexing is not supported."
        if isinstance(k, slice):
            ordering = tuple(field.lstrip('-') for field in self.ordering)
            reverse = (ordering != self.ordering)
            if reverse:
                assert (sum(1 for field in self.ordering
                            if field.startswith('-')) == len(ordering)), \
                        "Mixed sort directions not supported."
            cq = self.cog_query
            sq = self.sprocket_query
            if k.stop is not None:
                cq = cq[:k.stop]
                sq = sq[:k.stop]
            rows = ([row + (Cog,)
                     for row in cq.values_list(*(ordering + ('pk',)))] +
                    [row + (Sprocket,)
                     for row in sq.values_list(*(ordering + ('pk',)))])
            if reverse:
            rows = rows[k]
            pk_idx = len(ordering)
            klass_idx = pk_idx + 1
            cog_pks = [row[pk_idx] for row in rows
                            if row[klass_idx] is Cog]
            sprocket_pks = [row[pk_idx] for row in rows
                           if row[klass_idx] is Sprocket]
            cogs = Cog.objects.in_bulk(cog_pks)
            sprockets = Sprocket.objects.in_bulk(sprocket_pks)
            results = []
            for row in rows:
                pk = row[-2]
                klass = row[-1]
                if klass is Cog:
                    cogs[pk].type = 'cog'
                elif klass is Sprocket:
                    sprockets[pk].type = 'sprocket'
            return results
            return self[k:k+1][0]
    def count(self):
        return self.cog_query.count() + self.sprocket_query.count()
    def all(self):
        return self._clone()
    def filter(self, *args, **kwargs):
        qs = self._clone()
        qs.cog_query = qs.cog_query.filter(*args, **kwargs)
        qs.sprocket_query = qs.sprocket_query.filter(*args, **kwargs)
        return qs
    def exclude(self, *args, **kwargs):
        qs = self._clone()
        qs.cog_query = qs.cog_query.exclude(*args, **kwargs)
        qs.sprocket_query = qs.sprocket_query.exclude(*args, **kwargs)
        return qs
    def order_by(self, *ordering):
        qs = self._clone()
        qs.cog_query = qs.cog_query.order_by(*ordering)
        qs.sprocket_query = qs.sprocket_query.order_by(*ordering)
        qs.ordering = ordering
        return qs
    def _clone(self):
        qs = GearQuerySet()
        qs.cog_query = self.cog_query._clone()
        qs.sprocket_query = self.sprocket_query._clone()
        qs.ordering = self.ordering
        return qs

The above QuerySet-like class implements the parts of the QuerySet interface that Django’s generic list view depends on: .count() and slicing with .__getitem__(). It also provides a ._clone() method which, despite the private-looking name, is generally expected to exist for a QuerySet. As usual, .filter(), .exclude(), and .order_by() call ._clone() to make a copy of the QuerySet before making modifications so that immutability is preserved.

The .__iter__() method is defined in terms of .__getitem__() and uses a constant, ITER_HARD_LIMIT, to define the maximum number of results that will be returned if the QuerySet is used as an iterator. This is to prevent accidentally loading the entire table into memory with a for-loop, since my .__getitem__() returns a list for slices, not an iterator. It is rather difficult to support lazy iteration through multiple tables in parallel while keeping them sorted together, so I elected not to solve this particular problem. Django’s RawQuerySet, on the other hand, defines an iterator directly and builds slicing in terms of that, so the right way to structure these methods really depends on your use case.

Quack like a Manager

I could stop right here and create instances of this custom QuerySet class directly, but I decided to carry the façade further and create duck-typed Manager and Model classes as well:

class GearManager(object):
    def count(self):
        return self.get_query_set().count()
    def all(self):
        return self.get_query_set()
    def filter(self, *args, **kwargs):
        return self.get_query_set().filter(*args, **kwargs)
    def exclude(self, *args, **kwargs):
        return self.get_query_set().exclude(*args, **kwargs)
    def order_by(self, *args, **kwargs):
        return self.get_query_set().order_by(*args, **kwargs)
    def get_query_set(self):
        return GearQuerySet()
class Gear(object):
    objects = GearManager()

With the above code, I can write expressions like Gear.objects.filter(…), Gear.objects.all().order_by(…), and so on, maintaining consistency with the way model objects are usually queried. This seems a bit silly at first glance, but in the future I will be adding methods to create new instances, so it’s good to have a place to put them (GearManager).

More curious is the fake model class (Gear), since it is currently just acting as a namespace for the “objects” attribute. I wonder if it would be worthwhile to make this class actually do something, perhaps making it a superclass of the Cog and Sprocket model classes (though I dislike the idea of adding a circular dependency between the modules these classes reside in). The idea of using Python’s new Abstract Base Class support might be worthwhile exploring, since it provides an advisory way to make isinstance() return True.

Has anyone else tried to do this sort of thing before? What do you think of this solution?

Edit: I updated the GearQuerySet example to include the algorithm for searching for cogs and sprockets together. Continue reading

Using the Python tokenizer for source transformation

I have been working on porting a medium-sized Django project from Django 0.96 to Django 1.0, and one of the necessary changes is converting to use Unicode strings (u’like this’) instead of byte strings (‘like this’). There was too much code to do this reliably by hand, so it seemed like a good idea to write a script to do it. Rather than hack together a bunch of regular expressions, I decided to try the Python tokenize module, since it seemed like I could get very reliable source translation that way.

My first attempt was to use the new untokenize function, which takes tokenizer output and turns it back into source code. However, despite the documentation which states that “conversion is lossless and round-trips are assured”, the coding style is not preserved. Whitespace is added in some places and removed in others, and even though the code runs the same, it looks ugly and generates huge undreadable diffs. Instead, I built the output source manually, since the tokenizer provides enough information about row and column positions. Here’s how it ended up:

#!/usr/bin/env python
import sys
import itertools
from tokenize import *
def token_line_number((num, token, spos, epos, line)):
    return spos[0]
def token_lines(tokens):
    return itertools.groupby(tokens, token_line_number)
def convert_strings(token_line):
    result = ''
    pad = 0
    for num, token, spos, epos, line in token_line:
        result += ' ' * (spos[1] + pad - len(result))
        if num == STRING and token[0] != 'u':
            result += 'u'
            pad += 1
        result += token
    return result
def convert_unicode(tokens):
    for line_number, token_line in token_lines(tokens):
        token_line = list(token_line)
        has_strings = False
        for num, _, _, _, _ in token_line:
            if num == STRING:
                has_strings = True
        if has_strings:
            yield convert_strings(token_line)
            yield token_line[0][4]
tokens = generate_tokens(sys.stdin.readline)
for line in convert_unicode(tokens):
    sys.stdout.write(line.replace('__str__', '__unicode__'))

Overall, it was very simple to write and didn’t take too long. For any lines that didn’t have string literals, I just printed them out verbatim. Otherwise, I built a new line by assembling it token-by-token, padding with spaces to match the original column positions of each token (compensating for the additional padding introduced by adding the extra ‘u’s). As a post-process, I changed all definitions and calls to “__str__” with the preferred “__unicode__” with a simple search-and-replace. Continue reading

Surfing linkland

I was reading Roy Fielding’s you’re-doing-it-wrong article on REST for the second or third time today, and this time I tried to avoid my usual knee-jerk dismissal and learn something instead. His assertion that hyperlinks are essential to understanding REST got me thinking about the idea of auto-discovery. In essence, you should be able to request the root of a site via HTTP and, without knowing anything about the site’s particular approach to URL design, be able to navigate through it’s services just like clicking links on a browser. I realized something which has sort of been in the back of my mind for awhile but never really struck me fully until today: the “link” tag is the de-facto standard for exposing services. If you want to see what services a web site wants you to know about, just scrape its links. So, I wrote a little Python script to do just that:

#!/usr/bin/env python
import os
import re
import sys
import urllib
if len(sys.argv) != 2:
    print 'usage: %s <url>' % os.path.basename(sys.argv[0])
url = sys.argv[1]
html = urllib.urlopen(url).read()
for tag in re.compile(r'<link.*?>', re.DOTALL).findall(html):
    print tag

Here’s what the output looks like for some common web sites:

ramen@pedro:~$ dumplinks
<link rel="stylesheet" href="/static/reddit.css?v=2a07c701b9a58519a6c333860d8add98" type="text/css" />
<link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="" />
ramen@pedro:~$ dumplinks
<link rel="alternate" type="application/rss+xml" title=" News" href="" />
<link rel="apple-touch-icon" href=""/>
<link href="/wp-content/themes/h4/ie6.css?m=1223022215a" rel="stylesheet" type="text/css" media="screen" />
<link rel='stylesheet' href='' type='text/css' />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="" />
<link rel="introspection" type="application/atomserv+xml" title="Atom API" href="" />
<link rel='openid.server' href='' />
<link rel='openid.delegate' href='' />
ramen@pedro:~$ dumplinks
<link rel="stylesheet" rev="stylesheet" href="//" media="screen">
<link rel="stylesheet" rev="stylesheet" href="//" media="screen">
<link rel="stylesheet" rev="stylesheet" href="//" media="screen">
<link rel="stylesheet" type="text/css" media="screen" href="//" />
<link rel="stylesheet" type="text/css" media="screen" href="//" />
<link rel="stylesheet" type="text/css" media="screen" href="//" />
<link rel="top"       title="News for nerds, stuff that matters" href="//" >
<link rel="search"    title="Search Slashdot" href="//">
<link rel="alternate" title="Slashdot RSS" href="" type="application/rss+xml">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
ramen@pedro:~$ dumplinks
<link rel="alternate" type="application/atom+xml" title="It’s just data" href=""/>
<link rel="openid.server" href=""/>
<link rel="search" type="application/opensearchdescription+xml" href="" title="intertwingly blog search"/>
<link rel="stylesheet" href="/css/blog5.css" type="text/css" media="screen"/>
<link rel="stylesheet" href="/css/halloween.css" type="text/css" media="screen"/>
<link rel="stylesheet" href="/css/print.css" type="text/css" media="print"/>
<link rel="shortcut icon" href="/favicon.ico"/>
ramen@pedro:~$ dumplinks
<link rel="stylesheet" href="/wp-content/themes/starship/styles/site-3300.css" type="text/css" media="screen,print" />
<link rel="alternate" type="application/rss+xml" title=" RSS Feed" href="" />
<link rel="pingback" href="" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="" />
<link rel='stylesheet' href='' type='text/css' media='all' />
<link rel='stylesheet' href='' type='text/css' media='all' />
<link rel='stylesheet' href='' type='text/css' media='all' />

I already learned about a few new things like the apple-touch-icon feature and opensearchdescription. It seems like there’s another world out there that I never really noticed because it was lost in the tag soup. Continue reading site relaunch and birthday celebration

We just launched the new site, which is a major rewrite of both the content management system and recycling location search engine, as well as a much-needed redesign of the look-and-feel and navigation!

Please drop by and leave a comment on our Happy Birthday page, celebrating the company’s 17 years of operation providing information and resources on recycling to the world.

Dave Continue reading

Apache-style access logs for Tomcat

To get an Apache-style access log, complete with referrers and user-agents, I created a directory called /var/log/tomcat writable by the Tomcat process, and I added the following to tomcat/conf/context.xml:

<Valve className=”org.apache.catalina.valves.FastCommonAccessLogValve”
rotatable=”false” />

This will create a file called /var/log/tomcat/access.log and start logging requests to it. I’m turning off Tomcat’s date stamping and log rotation, since I prefer to use logrotate. Continue reading

Running Tomcat from daemontools

I’m a big fan of the daemontools server framework for quite a few reasons. For one, it’s incredibly stable. So stable, I use it to watch Apache and restart it when it crashes because daemontools never crashes. “Depend in the direction of stability” is my mantra.

Your server might crash, but daemontools will ruthlessly restart it, and since it runs from inittab, the OS will restart daemontools if it ever crashes. But, like I said, daemontools never crashes.

Another reason I love daemontools is that it does automatic logging and log rotation, so you can create as many servers you need and you don’t have to worry that the one last server you threw in there in a hurry won’t get its logs rotated, causing your hard drive to fill up.

Tomcat, on the other hand, has never been very friendly to the system administrator. It creates many different log files at once and rotates them by renaming them with date stamps; this makes it really annoying when you’re trying to tail the log files. Log messages tend to be large and span multiple lines, so it’s very hard to see what’s going on with all the noise. Killing Tomcat requires running a shell script that sends a shutdown command to a socket, completely breaking UNIX convention for no good reason.

I decided to bite the bullet and get Tomcat to run under daemontools instead, and I’ve never been happier (well, given that I’m still talking about Java here, let’s just say I’ve never been less unhappy…). Here’s how I did it:

My run script looks like this:

exec 2>&1
exec envdir ./env setuidgid tomcat /usr/local/tomcat/bin/ run

In my env directory I have two files that set up my environment: JAVA_HOME and CLASSPATH. JAVA_HOME contains the following:


CLASSPATH contains this (I’ll explain why in a bit):


I had to comment out the following line at the top of tomcat/bin/ to keep it from clobbering the CLASSPATH environment variable:

# First clear out the user classpath

Now, tiny-formatter.jar is a little hack that makes Tomcat’s logger use only one line per log message and removes the datestamp, since multilog adds one already. It contains a class file, TinyFormatter.class, generated by compiling the following source file,


import java.util.logging.Formatter;
import java.util.logging.LogRecord;

public class TinyFormatter extends Formatter
static final String lineSep = System.getProperty(“line.separator”);

public String format(LogRecord record)
StringBuffer buf = new StringBuffer(180);
buf.append(“: “);
buf.append(” (“);

Throwable throwable = record.getThrown();
if (throwable != null) {
StringWriter sink = new StringWriter();
throwable.printStackTrace(new PrintWriter(sink, true));

return buf.toString();

Okay, maybe it’s not so tiny. That’s Java for ‘ya.

To install this custom formatter, I edited tomcat/conf/ and replaced its contents with the following:

handlers = java.util.logging.ConsoleHandler
.handlers = java.util.logging.ConsoleHandler

java.util.logging.ConsoleHandler.level = FINE
java.util.logging.ConsoleHandler.formatter = TinyFormatter

The run script for the daemontools logger is pretty standard:

exec setuidgid tomcat multilog t /var/multilog/tomcat

I can now start and stop Tomcat with “svc -u /service/tomcat” and “svc -d /service/tomcat”, and restarting is the usual “svc -t /service/tomcat”. To learn all the details I’ve left out, I highly recommend reading the djb way. I’m out of time. ;) Continue reading

PMA Scanbots

I can’t think of any good reason why you’d want to put your phpMyAdmin installation in any of the following locations:

  1. /MYADMIN/
  2. /MYadmin/
  3. /MyAdmin/
  5. /PHPMYadmin/
  6. /PHPmyadmin/
  7. /PMA/
  8. /PhPmYaDmIn/
  9. /admin/
  10. /admin/mysql/
  11. /admin/phpmyadmin/
  12. /admin/pma/
  13. /db/
  14. /dbadmin/
  15. /myADMIN/
  16. /myadmin/
  17. /mysql-admin/
  18. /mysql/
  19. /mysqladmin/
  20. /pHpMyAdMiN/
  21. /phpMYadmin/
  22. /phpMyAdmin-2.2.0/
  23. /phpMyAdmin-2.2.3/
  24. /phpMyAdmin-2.2.6/
  25. /phpMyAdmin-2.2.7-pl1/
  26. /phpMyAdmin-2.2.7/
  27. /phpMyAdmin-2.5.1/
  28. /phpMyAdmin-2.5.4/
  29. /phpMyAdmin-2.5.6/
  30. /phpMyAdmin-2.6.4-pl4/
  31. /phpMyAdmin-2.6.4/
  32. /phpMyAdmin-2.7.0-pl2/
  33. /phpMyAdmin-2.7.0/
  34. /phpMyAdmin-2.8.1/
  35. /phpMyAdmin-
  36. /phpMyAdmin-
  37. /phpMyAdmin-
  38. /phpMyAdmin-
  39. /phpMyAdmin-
  40. /phpMyAdmin-2.9.0/
  41. /phpMyAdmin-2.9.1/
  42. /phpMyAdmin/
  43. /phpmyADMIN/
  44. /phpmyadmin/
  45. /phpmyadmin2/
  46. /pma/
  47. /pmamy/
  48. /web/phpMyAdmin/

It’s a jungle out there. Continue reading

Ten things that XML-RPC does… that REST leaves unspecified

  1. Standard, cross-language, typeful serialization of data
  2. User-defined error codes and messages
  3. “Boxcarring” of requests to reduce overhead
  4. Serialization of binary content
  5. Serialization of date-time values
  6. Standardized parameter passing
  7. Introspection allowing for straightforward code generation
  8. High-level APIs for just about every language
  9. No manual parsing of XML, ever
  10. Only three lines to call a function in Python and several other languages:

>>> import xmlrpclib
>>> s = xmlrpclib.Server(‘http://localhost/xmlrpc’)
>>> s.demo.addTwoNumbers(3, 4)

Not that the REST doesn’t have its benefits, but someone ought to be saying this. XML-RPC isn’t complicated like SOAP, it runs just about everywhere, and it lets you get on with your work rather than arguing about semicolons versus slashes or XML versus JSON or countless other things. Besides, when your goal is to support as many languages as possible, you want to minimize the amount of code you write for each language. As far as I’ve seen, nothing else accomplishes literally no-code binding like XML-RPC. Continue reading