I have been working on porting a medium-sized Django project from Django 0.96 to Django 1.0, and one of the necessary changes is converting byte strings ('like this') to Unicode strings (u'like this'). There was too much code to do this reliably by hand, so it seemed like a good idea to write a script to do it. Rather than hack together a bunch of regular expressions, I decided to try the Python tokenize module, since it seemed like I could get very reliable source translation that way.
My first attempt was to use the new untokenize function, which takes tokenizer output and turns it back into source code. However, although the documentation states that “conversion is lossless and round-trips are assured”, the coding style is not preserved. Whitespace is added in some places and removed in others, and even though the code runs the same, it looks ugly and generates huge, unreadable diffs. Instead, I built the output source manually, since the tokenizer provides the row and column position of every token. Here’s how it ended up:
    #!/usr/bin/env python
    import sys
    import itertools
    from tokenize import *

    def token_line_number((num, token, spos, epos, line)):
        return spos[0]

    def token_lines(tokens):
        return itertools.groupby(tokens, token_line_number)

    def convert_strings(token_line):
        result = ''
        pad = 0
        for num, token, spos, epos, line in token_line:
            result += ' ' * (spos[1] + pad - len(result))
            if num == STRING and token[0] != 'u':
                result += 'u'
                pad += 1
            result += token
        return result

    def convert_unicode(tokens):
        for line_number, token_line in token_lines(tokens):
            token_line = list(token_line)
            has_strings = False
            for num, _, _, _, _ in token_line:
                if num == STRING:
                    has_strings = True
                    break
            if has_strings:
                yield convert_strings(token_line)
            else:
                yield token_line[0][4]

    tokens = generate_tokens(sys.stdin.readline)
    for line in convert_unicode(tokens):
        sys.stdout.write(line.replace('__str__', '__unicode__'))
Overall, it was very simple to write and didn’t take too long. Any line that didn’t contain a string literal was printed out verbatim. Otherwise, I built a new line by assembling it token-by-token, padding with spaces to match the original column position of each token (compensating for the extra characters introduced by the added ‘u’ prefixes). As a post-process, I replaced all definitions and calls of “__str__” with the preferred “__unicode__” using a simple search-and-replace.
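For comparison, here is a minimal sketch of a related variant, not the script above. Instead of rebuilding each line from its tokens, it uses the token start positions only to decide where to splice a 'u' into the original text, so every other character is left exactly as it was. It assumes Python 3 (where the prefix is accepted but redundant), the function name add_u_prefix is made up for illustration, and it deliberately skips any literal that already carries a prefix.

```python
import io
import tokenize


def add_u_prefix(source):
    """Return source with a 'u' prefix added to every unprefixed string literal."""
    # Split with the same readline logic the tokenizer uses, so row numbers agree.
    lines = io.StringIO(source).readlines()

    # Record the (row, column) where each unprefixed string literal starts.
    # Literals that already have a prefix (u, b, r, f, ...) do not start
    # with a quote character and are skipped.
    positions = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and tok.string[0] in "'\"":
            positions.append(tok.start)

    # Splice from the last position to the first, so inserting a character
    # never shifts the column of a literal that is still waiting its turn.
    for row, col in reversed(positions):
        line = lines[row - 1]  # tokenizer rows are 1-based
        lines[row - 1] = line[:col] + "u" + line[col:]

    return "".join(lines)


example = "name = 'django'  # 'comment'\n"
print(add_u_prefix(example), end="")  # prints: name = u'django'  # 'comment'
```

Working backwards through the positions plays the same role as the pad counter in the original script: both exist because each added 'u' moves everything after it one column to the right.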

