Using the Python tokenizer for source transformation

I have been working on porting a medium-sized Django project from Django 0.96 to Django 1.0, and one of the necessary changes is converting byte strings ('like this') to Unicode strings (u'like this'). There was too much code to do this reliably by hand, so it seemed like a good idea to write a script to do it. Rather than hack together a bunch of regular expressions, I decided to try the Python tokenize module, since it seemed like I could get very reliable source translation that way.

My first attempt was to use the new untokenize function, which takes tokenizer output and turns it back into source code. However, although the documentation states that “conversion is lossless and round-trips are assured”, the coding style is not preserved. Whitespace is added in some places and removed in others, and even though the code runs the same, it looks ugly and generates huge unreadable diffs. Instead, I built the output source manually, since the tokenizer provides enough information about row and column positions. Here’s how it ended up:

#!/usr/bin/env python
 
import sys
import itertools
from tokenize import *
 
def token_line_number((num, token, spos, epos, line)):
    return spos[0]
 
def token_lines(tokens):
    return itertools.groupby(tokens, token_line_number)
 
def convert_strings(token_line):
    result = ''
    pad = 0
    for num, token, spos, epos, line in token_line:
        result += ' ' * (spos[1] + pad - len(result))
        if num == STRING and token[0] not in 'uU':
            result += 'u'
            pad += 1
        result += token
    return result
 
def convert_unicode(tokens):
    for line_number, token_line in token_lines(tokens):
        token_line = list(token_line)
        has_strings = False
        for num, _, _, _, _ in token_line:
            if num == STRING:
                has_strings = True
                break
        if has_strings:
            yield convert_strings(token_line)
        else:
            yield token_line[0][4]
 
tokens = generate_tokens(sys.stdin.readline)
for line in convert_unicode(tokens):
    sys.stdout.write(line.replace('__str__', '__unicode__'))

Overall, it was very simple to write and didn’t take too long. Any line that didn’t contain a string literal is printed out verbatim. Otherwise, I build a new line by assembling it token by token, padding with spaces to match the original column position of each token (compensating for the shift introduced by each added 'u'). As a post-process, I replaced all definitions and calls of “__str__” with the preferred “__unicode__” using a simple search-and-replace.
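
For comparison, here is a minimal Python 3 sketch of the same position-preserving rewrite. It is not the script above: the function name add_u_prefix is made up for illustration, and instead of grouping tokens by line it tracks the gap between consecutive tokens, which also copes with triple-quoted strings that span several lines. It assumes the source is indented with spaces and contains no backslash line continuations.

```python
import io
import tokenize

def add_u_prefix(source):
    """Return source with a 'u' added to every unprefixed string literal.

    Hypothetical helper for illustration. Whitespace between tokens is
    rebuilt from the original column positions, so layout is preserved.
    Assumes space indentation and no backslash continuations.
    """
    out = []
    prev_row, prev_col = 1, 0
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        (srow, scol), (erow, ecol) = tok.start, tok.end
        if srow != prev_row:
            # First token on a new line: the gap is measured from column 0.
            prev_col = 0
        # Re-create the original gap between this token and the previous one.
        out.append(' ' * (scol - prev_col))
        text = tok.string
        # Only touch literals that start directly with a quote character.
        if tok.type == tokenize.STRING and text[0] in '\'"':
            text = 'u' + text
        out.append(text)
        prev_row, prev_col = erow, ecol
    return ''.join(out)
```

Because the gaps are measured in the original coordinates, there is no need for a running pad counter: each added 'u' simply pushes the rest of the line one column to the right.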

OCaml streams tutorial

I wrote a tutorial on the Stream module for the Objective CAML Tutorial wiki. It’s got a lot of code that I’ve been sitting on for almost two years, so it feels good to finally get it up on the interwebs where a hard drive crash can’t hurt it.

The Stream module is not very well known, and until now has had almost no documentation. I found out a lot by studying the source while working on the PLEAC project. I also got some ideas from Raymond Hettinger’s excellent work on Python’s itertools module; Python’s generators are very similar to OCaml’s streams.

I hope to find the time to also write a tutorial on camlp4 stream expressions, which use the Stream module to implement syntax sugar that makes stream processing and parsing much more concise.

Raising kids in this crazy world

As I’m about to have my first child, I’ve been thinking about the public education system he will probably suffer through just as I did. So it was well timed that a Reddit commenter found this little gem about how school prepares us for a life of servitude rather than one of leadership, free thought, or creativity.

It reminded me of the book by Charles J. Sykes: Dumbing Down Our Kids, Why American Children Feel Good About Themselves But Can’t Read, Write, Or Add (worth it for the title alone). I found this book in my bookshelf today, and noticed that I had written some sort of poem-like thing on the back page, presumably after reflecting on what I read in the first few chapters (I’m pretty sure that’s as far as I got). I’ll reproduce it for you here:

knowledge is important.
applying knowledge is also important.
knowledge is a necessary prerequisite.

learning is hard work.
making this seem untrue or avoidable
is popular and lucrative.
often, the result of catering toward this
interest is something other than learning.

we are not learning.
our laziness and desire to feel good
have obscured this.
we are being sold our own stupidity.
we pay a little for nothing instead of a lot
for something, and we believe this
to be a great deal.

Something to think about.

Surfing linkland

I was reading Roy Fielding’s you’re-doing-it-wrong article on REST for the second or third time today, and this time I tried to avoid my usual knee-jerk dismissal and learn something instead. His assertion that hyperlinks are essential to understanding REST got me thinking about the idea of auto-discovery. In essence, you should be able to request the root of a site via HTTP and, without knowing anything about the site’s particular approach to URL design, be able to navigate through its services just like clicking links in a browser. I realized something which has been in the back of my mind for a while but never really struck me fully until today: the “link” tag is the de facto standard for exposing services. If you want to see what services a web site wants you to know about, just scrape its links. So, I wrote a little Python script to do just that:

#!/usr/bin/env python
 
import os
import re
import sys
import urllib
 
if len(sys.argv) != 2:
    print 'usage: %s <url>' % os.path.basename(sys.argv[0])
    sys.exit(1)
 
url = sys.argv[1]
html = urllib.urlopen(url).read()
for tag in re.compile(r'<link.*?>', re.DOTALL).findall(html):
    print tag

Here’s what the output looks like for some common web sites:

ramen@pedro:~$ dumplinks http://reddit.com
<link rel="stylesheet" href="/static/reddit.css?v=2a07c701b9a58519a6c333860d8add98" type="text/css" />
<link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.reddit.com/.rss" />
 
ramen@pedro:~$ dumplinks http://wordpress.com
<link rel="alternate" type="application/rss+xml" title="WordPress.com News" href="http://en.blog.wordpress.com/feed/" />
<link rel="apple-touch-icon" href="http://s2.wordpress.com/wp-content/themes/h4/i/webclip.png"/>
<link href="/wp-content/themes/h4/ie6.css?m=1223022215a" rel="stylesheet" type="text/css" media="screen" />
<link rel='stylesheet' href='http://s.wordpress.com/wp-content/themes/h4/global.css?m=1214319868a' type='text/css' />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://wordpress.com/xmlrpc.php?rsd" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://wordpress.com/wp-includes/wlwmanifest.xml" />
<link rel="introspection" type="application/atomserv+xml" title="Atom API" href="http://wordpress.com/wp-app.php" />
<link rel='openid.server' href='http://wordpress.com/?openidserver=1' />
<link rel='openid.delegate' href='http://wordpress.com/' />
 
ramen@pedro:~$ dumplinks http://slashdot.org
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/idlecore-tidied.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/lick.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/comments-idle.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie7-idle.css?T_2_5_0_227" />
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie6-idle.css?T_2_5_0_227" />
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie8-idle.css?T_2_5_0_227" />
<link rel="top"       title="News for nerds, stuff that matters" href="//slashdot.org/" >
<link rel="search"    title="Search Slashdot" href="//slashdot.org/search.pl">
<link rel="alternate" title="Slashdot RSS" href="http://rss.slashdot.org/Slashdot/slashdot" type="application/rss+xml">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
 
ramen@pedro:~$ dumplinks http://intertwingly.net
<link rel="alternate" type="application/atom+xml" title="It’s just data" href="http://intertwingly.net/blog/index.atom"/>
<link rel="openid.server" href="http://intertwingly.net/id/"/>
<link rel="search" type="application/opensearchdescription+xml" href="http://intertwingly.net/search/" title="intertwingly blog search"/>
<link rel="stylesheet" href="/css/blog5.css" type="text/css" media="screen"/>
<link rel="stylesheet" href="/css/halloween.css" type="text/css" media="screen"/>
<link rel="stylesheet" href="/css/print.css" type="text/css" media="print"/>
<link rel="shortcut icon" href="/favicon.ico"/>
 
ramen@pedro:~$ dumplinks http://earth911.com
<link rel="stylesheet" href="/wp-content/themes/starship/styles/site-3300.css" type="text/css" media="screen,print" />
<link rel="alternate" type="application/rss+xml" title="Earth911.com RSS Feed" href="http://earth911.com/feed/" />
<link rel="pingback" href="http://earth911.com/xmlrpc.php" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://earth911.com/xmlrpc.php?rsd" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://earth911.com/wp-includes/wlwmanifest.xml" />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-email/email-css.css?ver=2.40' type='text/css' media='all' />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-polls/polls-css.css?ver=2.40' type='text/css' media='all' />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-postratings/postratings-css.css?ver=1.40' type='text/css' media='all' />

I already learned about a few new things like the apple-touch-icon feature and opensearchdescription. It seems like there’s another world out there that I never really noticed because it was lost in the tag soup.
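
If you want the attributes rather than the raw tags, the regular expression can be swapped for a real HTML parser. The following is a minimal Python 3 sketch under that assumption; the names LinkCollector and collect_links are made up for illustration, and it works on an HTML string you have already fetched.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the attributes of every <link> tag as a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for <link ...> and, by default, for self-closing <link ... /> too.
        if tag == 'link':
            self.links.append(dict(attrs))

def collect_links(html):
    """Return a list of attribute dicts, one per <link> tag in the HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

if __name__ == '__main__':
    sample = '''<html><head>
    <link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.reddit.com/.rss" />
    <link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
    <link rel="search"    title="Search Slashdot" href="//slashdot.org/search.pl">
    </head></html>'''
    for link in collect_links(sample):
        print(link.get('rel'), link.get('href'))
```

With the attributes in dicts, it becomes easy to filter by rel and look only at the feeds, the search descriptions, or the OpenID endpoints.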

Earth911.com site relaunch and birthday celebration

We just launched the new Earth911.com site, which is a major rewrite of both the content management system and recycling location search engine, as well as a much-needed redesign of the look-and-feel and navigation!

Please drop by and leave a comment on our Happy Birthday page, celebrating the company’s 17 years of operation providing information and resources on recycling to the world.

Cheers!
Dave

PHP daemontools service with SIGHUP support

Here’s a template script for writing a daemontools service in PHP. It handles HUP signals, so you can make the service reload its configuration file without restarting.

<?php
 
// Don't forget this line or signal handling won't work:
declare(ticks = 1);
 
// This global flag is used to trigger a configuration file reload.
$load_config = true;
 
// The SIGHUP handler doesn't do any actual work; it just sets the flag.
function handle_hup($signal) {
    global $load_config;
    $load_config = true;
    pcntl_signal(SIGHUP, 'handle_hup', false);
}
 
// Register the signal handler.
pcntl_signal(SIGHUP, 'handle_hup', false);
 
// Start the service.
echo "myserver: starting service\n";
while (true) {
    // Load the config file if necessary.
    if ($load_config) {
        $load_config = false;
        echo "myserver: loading configuration\n";
        include 'config.php';
    }

    // Do something useful.
    // ...

    // Pause for a configurable amount of time.
    echo "myserver: sleeping for $SLEEP_TIME seconds...\n";
    sleep($SLEEP_TIME);
}
 
?>

And in config.php, you can start with something like this, to make the sleep time configurable:

<?php
 
// Sleep time in seconds.
 
$SLEEP_TIME = 60;
 
?>
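
The flag-setting pattern is not specific to PHP. For comparison, here is a minimal sketch of the same idea in Python 3 on a Unix system. The names reload_requested, load_config, and reload_if_requested are made up for illustration, and load_config is only a stand-in for reading a real configuration file.

```python
import os
import signal

# Set by the signal handler; starts True so the first pass loads the config.
reload_requested = True

def handle_hup(signum, frame):
    """Do no real work here; just ask the main loop to reload."""
    global reload_requested
    reload_requested = True

def load_config():
    """Hypothetical stand-in for reading a real configuration file."""
    return {'sleep_time': 60}

def reload_if_requested(state):
    """Reload the config into state if the flag is set; return True if reloaded."""
    global reload_requested
    if reload_requested:
        reload_requested = False
        state['config'] = load_config()
        return True
    return False

# Register the handler once; Python keeps it installed across signals.
signal.signal(signal.SIGHUP, handle_hup)
```

A service loop would call reload_if_requested(state) at the top of each iteration, do its work, and then sleep for state['config']['sleep_time'] seconds, just as the PHP loop does.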

Some random thoughts

  • Slashes are to URLs what dots are to objects
  • HTML is assembly language
  • The spec for HTML tables is practically frozen
  • Programming languages encapsulate philosophies
  • Across the spectrum of languages, some properties are nearly universal, others divisive
  • Blogging is like Usenet with bigger egos (if that is possible)

Apache-style access logs for Tomcat

To get an Apache-style access log, complete with referrers and user-agents, I created a directory called /var/log/tomcat writable by the Tomcat process, and I added the following to tomcat/conf/context.xml:

<Valve className="org.apache.catalina.valves.FastCommonAccessLogValve"
       pattern="combined"
       directory="/var/log/tomcat"
       prefix="access.log"
       rotatable="false" />

This will create a file called /var/log/tomcat/access.log and start logging requests to it. I’m turning off Tomcat’s date stamping and log rotation, since I prefer to use logrotate.