How to quack like a QuerySet

The main database I work on has a bit of a quirk where there are two sets of tables for something that is conceptually one entity. For the sake of this example, let’s call them cogs and sprockets. So, I have a “cog” table, a “sprocket” table, and then a bunch of related tables like “cog_tooth”, “sprocket_tooth”, and so on. As it turns out, the split between cogs and sprockets wasn’t really necessary, since they’re really both just gears, but enough infrastructure is in place to deal with cogs and sprockets as separate groups of entities that it’s too much work to change that now.

The basic problem, which I’ve been trying to solve for months now and have only just recently figured out, is how to just show a list of gears, with cogs and sprockets mixed together, while still allowing searching, sorting, and pagination. In other words, I don’t want one page for browsing cogs and another page for browsing sprockets; I just want a gear list. I want to bury the detail of whether a gear is a cog or a sprocket and let the user page through them all sorted together. And I want to do this, somehow, with the Django ORM.

What didn’t work

Before I realized I needed to quack like a QuerySet, I tried many other things. Each was a near-solution, but wasn’t practical for one reason or another.

Model inheritance

The obvious solution was to create a Gear model, and let Cog and Sprocket inherit from that model. Then, I could paginate through Gears and follow an association to a Cog or Sprocket if I needed further information. The problem with this approach is that it introduces a new table for gears. Outside of Django, there is a lot of existing infrastructure for importing, indexing, and managing cogs and sprockets. All of that code would have to be updated to create and update the gear table whenever working with cogs and sprockets. A lot of fields would need to be moved to the gear table to prevent having to do lots of joins during pagination. This would be far too much work just to accommodate Django’s particular inheritance technique, which is otherwise unnecessary for these systems that would need to change.

Database views

If I were to solve this problem from a strictly SQL perspective, I would use a UNION query, like:

SELECT name, num_teeth, 'cog' AS type
FROM   cog
UNION
SELECT name, num_teeth, 'sprocket' AS type
FROM   sprocket

So, it seemed like I could wrap that in a database view and create a Gear model with managed=False using that view as its “db_table”. I tried this, and it actually worked, but it was dreadfully slow. Perhaps this means I shouldn’t be using MySQL; I imagine Postgres’s query optimizer is sufficiently intelligent to run this type of query faster, but these are the cards I’ve been dealt. Unfortunately, performance is a show-stopper with this approach.

RawQuerySet

At this point, I thought maybe it would be best to drop to raw SQL so that I could optimize the queries by hand. I started studying the Manager.raw() method and RawQuerySet class, and built a custom QuerySet that ran a UNION query with a subselect and dynamically generated LIMITs to allow for pagination. This actually worked, and it performed just fine, but it had several issues. One was that I was generating SQL from strings in order to make pagination work, and this was dirty and error-prone. The larger issue is that I had no obvious way to implement .filter() for these RawQuerySets. Faking .order_by() is easy enough, but .filter() is very complex and pulls in a ton of Django internals. It would have taken forever to reinvent this wheel, and I really didn’t want to roll with a half-assed .filter() implementation, especially since my main use case was to be able to provide Django Admin ChangeList-style list views with lots of filtering options.

Looking at RawQuerySet was very inspiring, however. It made me realize that maybe what I wanted wasn’t so much a Gear model at all, but rather a Gear QuerySet – that Django Models were a bit too low-level for what I was trying to accomplish, and that QuerySets were a much better abstraction to code against. Django’s generic list view doesn’t require a model, for example; you can feed it any QuerySet, or really, any object that supports .count() and slicing. This led me to the big a-ha! moment:

Quack like a QuerySet

I mean “quack” in the “duck-typing” sense: create a class that behaves like a QuerySet, but don’t bother inheriting from the QuerySet class. It doesn’t have to provide the full suite of QuerySet features (which is huge), but just enough to perform the operations that the views require. This was the winning solution, and everything fell into place once I made this realization. Here’s what I ended up with:

from django.db.models import Q
from django.db.models.query import REPR_OUTPUT_SIZE
 
from myproject.myapp.models import Cog, Sprocket
 
ITER_HARD_LIMIT = 10000
 
class GearQuerySet(object):
    def __init__(self):
        self.ordering = ('description',)
        self.cog_query = Cog.objects.order_by(*self.ordering)
        self.sprocket_query = Sprocket.objects.order_by(*self.ordering)
 
    def __iter__(self):
        for row in self[:ITER_HARD_LIMIT]:
            yield row
 
    def __repr__(self):
        data = list(self[:REPR_OUTPUT_SIZE + 1])
        if len(data) > REPR_OUTPUT_SIZE:
            data[-1] = "...(remaining elements truncated)..."
        return repr(data)
 
    def __getitem__(self, k):
        if not isinstance(k, (slice, int, long)):
            raise TypeError
        assert ((not isinstance(k, slice) and (k >= 0))
                or (isinstance(k, slice) and (k.start is None or k.start >= 0)
                    and (k.stop is None or k.stop >= 0))), \
                "Negative indexing is not supported."
 
        if isinstance(k, slice):
            ordering = tuple(field.lstrip('-') for field in self.ordering)
            reverse = (ordering != self.ordering)
            if reverse:
                assert (sum(1 for field in self.ordering
                            if field.startswith('-')) == len(ordering)), \
                        "Mixed sort directions not supported."
 
            cq = self.cog_query
            sq = self.sprocket_query
 
            if k.stop is not None:
                cq = cq[:k.stop]
                sq = sq[:k.stop]
 
            rows = ([row + (Cog,)
                     for row in cq.values_list(*(ordering + ('pk',)))] +
                    [row + (Sprocket,)
                     for row in sq.values_list(*(ordering + ('pk',)))])
 
            rows.sort()
            if reverse:
                rows.reverse()
            rows = rows[k]
 
            pk_idx = len(ordering)
            klass_idx = pk_idx + 1
            cog_pks = [row[pk_idx] for row in rows
                            if row[klass_idx] is Cog]
            sprocket_pks = [row[pk_idx] for row in rows
                           if row[klass_idx] is Sprocket]
            cogs = Cog.objects.in_bulk(cog_pks)
            sprockets = Sprocket.objects.in_bulk(sprocket_pks)
 
            results = []
            for row in rows:
                pk = row[-2]
                klass = row[-1]
                if klass is Cog:
                    cogs[pk].type = 'cog'
                    results.append(cogs[pk])
                elif klass is Sprocket:
                    sprockets[pk].type = 'sprocket'
                    results.append(sprockets[pk])
            return results
        else:
            return self[k:k+1][0]
 
    def count(self):
        return self.cog_query.count() + self.sprocket_query.count()
 
    def all(self):
        return self._clone()
 
    def filter(self, *args, **kwargs):
        qs = self._clone()
        qs.cog_query = qs.cog_query.filter(*args, **kwargs)
        qs.sprocket_query = qs.sprocket_query.filter(*args, **kwargs)
        return qs
 
    def exclude(self, *args, **kwargs):
        qs = self._clone()
        qs.cog_query = qs.cog_query.exclude(*args, **kwargs)
        qs.sprocket_query = qs.sprocket_query.exclude(*args, **kwargs)
        return qs
 
    def order_by(self, *ordering):
        qs = self._clone()
        qs.cog_query = qs.cog_query.order_by(*ordering)
        qs.sprocket_query = qs.sprocket_query.order_by(*ordering)
        qs.ordering = ordering
        return qs
 
    def _clone(self):
        qs = GearQuerySet()
        qs.cog_query = self.cog_query._clone()
        qs.sprocket_query = self.sprocket_query._clone()
        qs.ordering = self.ordering
        return qs

The above QuerySet-like class implements the parts of the QuerySet interface that Django’s generic list view depends on: .count() and slicing with .__getitem__(). It also provides a ._clone() method which, despite the private-looking name, is generally expected to exist for a QuerySet. As usual, .filter(), .exclude(), and .order_by() call ._clone() to make a copy of the QuerySet before making modifications so that immutability is preserved.

The .__iter__() method is defined in terms of .__getitem__() and uses a constant, ITER_HARD_LIMIT, to define the maximum number of results that will be returned if the QuerySet is used as an iterator. This is to prevent accidentally loading the entire table into memory with a for-loop, since my .__getitem__() returns a list for slices, not an iterator. It is rather difficult to support lazy iteration through multiple tables in parallel while keeping them sorted together, so I elected not to solve this particular problem. Django’s RawQuerySet, on the other hand, defines an iterator directly and builds slicing in terms of that, so the right way to structure these methods really depends on your use case.

Quack like a Manager

I could stop right here and create instances of this custom QuerySet class directly, but I decided to carry the façade further and create duck-typed Manager and Model classes as well:

class GearManager(object):
    def count(self):
        return self.get_query_set().count()
 
    def all(self):
        return self.get_query_set()
 
    def filter(self, *args, **kwargs):
        return self.get_query_set().filter(*args, **kwargs)
 
    def exclude(self, *args, **kwargs):
        return self.get_query_set().exclude(*args, **kwargs)
 
    def order_by(self, *args, **kwargs):
        return self.get_query_set().order_by(*args, **kwargs)
 
    def get_query_set(self):
        return GearQuerySet()
 
class Gear(object):
    objects = GearManager()

With the above code, I can write expressions like Gear.objects.filter(…), Gear.objects.all().order_by(…), and so on, maintaining consistency with the way model objects are usually queried. This seems a bit silly at first glance, but in the future I will be adding methods to create new instances, so it’s good to have a place to put them (GearManager).

More curious is the fake model class (Gear), since it is currently just acting as a namespace for the “objects” attribute. I wonder if it would be worthwhile to make this class actually do something, perhaps making it a superclass of the Cog and Sprocket model classes (though I dislike the idea of adding a circular dependency between the modules these classes reside in). The idea of using Python’s new Abstract Base Class support might be worthwhile exploring, since it provides an advisory way to make isinstance() return True.

Has anyone else tried to do this sort of thing before? What do you think of this solution?

Edit: I updated the GearQuerySet example to include the algorithm for searching for cogs and sprockets together.

Database model objectives

I was flipping through Chris Date’s SQL And Relational Theory and came across this little gem, which paraphrases “Codd’s own stated objectives in introducing his relational model.” I think this bears repeating today because I have felt for awhile that the NoSQL movement has a significantly different set of goals–which is fine–but seems to be ignoring some of the things that make the relational model nice to work with. I wonder if it is necessarily either-or, or if perhaps some of these NoSQL systems can work toward satisfying more of the needs that relational database systems satisfy, without sacrificing the speed and ease of distribution that has made the NoSQL concept popular.

Here are the stative objectives, quoting Date:

  1. To provide a high degree of data independence
  2. To provide a community view of the data of spartan simplicity, so that a wide variety of users in an enterprise, ranging from the most computer naive to the most computer sophisticated, can interact with a common model (while not prohibiting superimposed user views for specialized purposes)
  3. To simplify the potentially formidable job of the database administrator
  4. To introduce a theoretical foundation, albeit modest, into database management (a field sadly lacking in solid principles and guidelines)
  5. To merge the fact retrieval and file management fields in preparation for the addition at a later time of inferential services in the commercial world
  6. To lift database application programming to a new level–a level in which sets (and more specifically relations) are treated as operands instead of being processed element by element

I want to ponder on these objectives for a bit before drawing too many conclusions, but a few things seem starkly obvious.

The need to build indexes by hand in NoSQL systems in order to search (efficiently or not) by different criteria is a step away from the relational model’s goals of data independence because these indexes are likely to be built with a particular application in mind, sometimes (often?) to the disadvantage of other applications requiring a different view of the data.

To further that point, if the indexes designed into the database are insufficient, it will probably be the case that applications will have to drop back to the level of processing one record at a time, rather than working with data sets as units, unless all application developers have enough control over the database system to be able to make the needed changes.

The job of the database administrator is no doubt at a disadvantage today with NoSQL systems, though this is more of a tools issue than a fundamental design issue. The “how do I query the database?” comic sums up the current situation amusingly.

A theoretical foundation of NoSQL systems is hard to find. Most of the theory seems to be in regard to eventual consistency and other issues related more to distributed systems than data modeling in the abstract. This will surely come with time, though as soon as you get into the details of data modeling in NoSQL systems, you really have to specify which one, as they are more different than they are similar. A theory of data management with key-value stores seems, to me, unenlightening at first glance.

Whatever the base model is, if NoSQL databases are here to stay, I think we are going to see a need for some theoretical foundations to manage the growing complexity of our data models given the new strengths and limitations of NoSQL systems.

PMA Scanbots

I can’t think of any good reason why you’d want to put your phpMyAdmin installation in any of the following locations:

  1. /MYADMIN/
  2. /MYadmin/
  3. /MyAdmin/
  4. /PHPMYADMIN/
  5. /PHPMYadmin/
  6. /PHPmyadmin/
  7. /PMA/
  8. /PhPmYaDmIn/
  9. /admin/
  10. /admin/mysql/
  11. /admin/phpmyadmin/
  12. /admin/pma/
  13. /db/
  14. /dbadmin/
  15. /myADMIN/
  16. /myadmin/
  17. /mysql-admin/
  18. /mysql/
  19. /mysqladmin/
  20. /pHpMyAdMiN/
  21. /phpMYadmin/
  22. /phpMyAdmin-2.2.0/
  23. /phpMyAdmin-2.2.3/
  24. /phpMyAdmin-2.2.6/
  25. /phpMyAdmin-2.2.7-pl1/
  26. /phpMyAdmin-2.2.7/
  27. /phpMyAdmin-2.5.1/
  28. /phpMyAdmin-2.5.4/
  29. /phpMyAdmin-2.5.6/
  30. /phpMyAdmin-2.6.4-pl4/
  31. /phpMyAdmin-2.6.4/
  32. /phpMyAdmin-2.7.0-pl2/
  33. /phpMyAdmin-2.7.0/
  34. /phpMyAdmin-2.8.1/
  35. /phpMyAdmin-2.8.2.1/
  36. /phpMyAdmin-2.8.2.2/
  37. /phpMyAdmin-2.8.2.4/
  38. /phpMyAdmin-2.9.0.1/
  39. /phpMyAdmin-2.9.0.2/
  40. /phpMyAdmin-2.9.0/
  41. /phpMyAdmin-2.9.1/
  42. /phpMyAdmin/
  43. /phpmyADMIN/
  44. /phpmyadmin/
  45. /phpmyadmin2/
  46. /pma/
  47. /pmamy/
  48. /web/phpMyAdmin/

It’s a jungle out there.

I did my taxes with MySQL

For the second year now, I used MySQL to do my taxes. I find that even with GainsKeeper, it’s tedious to do investment taxes with TurboTax because it takes a considerable amount of research (even with a small portfolio like mine) to provided the needed information to calculate the cost basis: When did I buy this stock or fund? Did I buy multiple lots? Were there reinvested dividends? Was this the first sale? If not, when else did I sell?

As geeky as it sounds, SQL is a powerful tool to answer these kinds of ad-hoc questions. It can transform investment taxes from a several-week project into something you can do in half a day (which I did, by the way). Here are the tables I use, which are pretty close to E*Trade’s CSV export format:

CREATE TABLE `security` (
`cusip` varchar(32) NOT NULL,
`symbol` varchar(32) NOT NULL,
`description` varchar(255) NOT NULL,
PRIMARY KEY (`cusip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `trade` (
`trade_date` date NOT NULL,
`order_type` enum(‘BUY’,'SELL’) NOT NULL,
`cusip` varchar(32) NOT NULL,
`description` varchar(255) NOT NULL,
`quantity` decimal(16,8) NOT NULL,
`executed` decimal(16,8) NOT NULL,
`commision` decimal(16,8) NOT NULL,
`net_amount` decimal(16,8) NOT NULL,
PRIMARY KEY (`trade_date`,`order_type`,`cusip`,`quantity`,`executed`),
KEY `cusip` (`cusip`),
CONSTRAINT `trade_ibfk_1` FOREIGN KEY (`cusip`) REFERENCES `security` (`cusip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

I also have a couple of views that show just buys or sells for convenience:

CREATE VIEW `buys` AS select `t`.`trade_date` AS `trade_date`,`t`.`order_type` AS `order_type`,`t`.`cusip` AS `cusip`,`t`.`description` AS `description`,`t`.`quantity` AS `quantity`,`t`.`executed` AS `executed`,`t`.`commision` AS `commision`,`t`.`net_amount` AS `net_amount`,`s`.`symbol` AS `symbol`,`s`.`description` AS `symbol_description` from (`trade` `t` join `security` `s` on((`t`.`cusip` = `s`.`cusip`))) where (`t`.`order_type` = ‘BUY’)

CREATE VIEW `sells` AS select `t`.`trade_date` AS `trade_date`,`t`.`order_type` AS `order_type`,`t`.`cusip` AS `cusip`,`t`.`description` AS `description`,`t`.`quantity` AS `quantity`,`t`.`executed` AS `executed`,`t`.`commision` AS `commision`,`t`.`net_amount` AS `net_amount`,`s`.`symbol` AS `symbol`,`s`.`description` AS `symbol_description` from (`trade` `t` join `security` `s` on((`t`.`cusip` = `s`.`cusip`))) where (`t`.`order_type` = ‘SELL’)

Finding all the purchases I made for a particular stock is as simple as this:

mysql> select * from buys where cusip = ’007903107′ order by trade_date \G
*************************** 1. row ***************************
trade_date: 2005-12-06
order_type: BUY
cusip: 007903107
description: ADV MICRO DEVICES
quantity: 5.00000000
executed: 27.59000000
commision: 12.99000000
net_amount: 150.94000000
symbol: AMD
symbol_description: ADV MICRO DEVICES
1 row in set (0.00 sec)

I decided to use CUSIP numbers as primary keys for securities since they are more stable and work for bonds as well as stocks. This makes things a little bit annoying because I always have to look the CUSIP up from the security table. That’s what I get for trying to “do the right thing” I guess. =)

Easy rotating database backups with logrotate

“logrotate” is a utility that comes with most Linux distributions. Its intended purpose is to rename, compress, and delete old log files. If you look in /var/log, you’ll probably see files like the following:

/var/log/auth.log
/var/log/auth.log.0
/var/log/auth.log.1.gz
/var/log/auth.log.2.gz
/var/log/auth.log.3.gz

This is “logrotate” in action. Recently, I realized that this is the same kind of behavior I’d like for my database backups. With a little creative misuse I was able to turn it into a backup rotator for MySQL backups. Here’s how to do it:

Create the directories /usr/local/mysql-backup and /var/backups/mysql. Paste the following into /usr/local/mysql-backup/mysql-backup.logrotate.conf:

/var/backups/mysql/*.sql {
rotate 7
daily
compress
missingok
nocreate
}

To read about these and other configuration options, type “man logrotate”. Now, create a shell script in /usr/local/mysql-backup/mysql-backup with the following:

#!/bin/sh
cd /usr/local/mysql-backup
/usr/sbin/logrotate -f -s mysql-backup.logrotate.state mysql-backup.logrotate.conf
/usr/bin/mysqldump -uUSERNAME -pPASSWORD -A > /var/backups/mysql/database.sql

If you want weekly backups as well, create a directory called /var/backups/mysql-weekly, and add the following line to the mysql-backup script:

gzip -c /var/backups/mysql/database.sql > /var/backups/mysql-weekly/database-`date +%G-%V`.sql.gz

Now all you need to do is “chmod +x mysql-backup” and “ln -s mysql-backup /etc/cron.daily/mysql-backup” and you’ll get nicely rotated and compressed database backups every day. If you want hourly backups, symlink it to “cron.hourly” instead, and change “rotate 7″ to taste in the logrotate configuration.