ramenlabs

a blog by dave benjamin

Database model objectives

June 25th, 2010

I was flipping through Chris Date’s SQL And Relational Theory and came across this little gem, which paraphrases “Codd’s own stated objectives in introducing his relational model.” I think this bears repeating today because I have felt for a while that the NoSQL movement has a significantly different set of goals–which is fine–but seems to be ignoring some of the things that make the relational model nice to work with. I wonder if it is necessarily either-or, or if perhaps some of these NoSQL systems can work toward satisfying more of the needs that relational database systems satisfy, without sacrificing the speed and ease of distribution that has made the NoSQL concept popular.

Here are the stated objectives, quoting Date:

  1. To provide a high degree of data independence
  2. To provide a community view of the data of spartan simplicity, so that a wide variety of users in an enterprise, ranging from the most computer naive to the most computer sophisticated, can interact with a common model (while not prohibiting superimposed user views for specialized purposes)
  3. To simplify the potentially formidable job of the database administrator
  4. To introduce a theoretical foundation, albeit modest, into database management (a field sadly lacking in solid principles and guidelines)
  5. To merge the fact retrieval and file management fields in preparation for the addition at a later time of inferential services in the commercial world
  6. To lift database application programming to a new level–a level in which sets (and more specifically relations) are treated as operands instead of being processed element by element

I want to ponder on these objectives for a bit before drawing too many conclusions, but a few things seem starkly obvious.

The need to build indexes by hand in NoSQL systems in order to search (efficiently or not) by different criteria is a step away from the relational model’s goals of data independence because these indexes are likely to be built with a particular application in mind, sometimes (often?) to the disadvantage of other applications requiring a different view of the data.

To further that point, if the indexes designed into the database are insufficient, it will probably be the case that applications will have to drop back to the level of processing one record at a time, rather than working with data sets as units, unless all application developers have enough control over the database system to be able to make the needed changes.
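To make the last two points a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration (the users data, the city field, and the function names), and a plain dict stands in for a key-value store, so take it as a toy rather than a picture of any particular NoSQL system. The key-value version has to walk the records one at a time because nobody built an index on city; the relational version just describes the set it wants, and an index can be added afterwards without touching the query.

```python
import sqlite3

# Hypothetical sample data: three users keyed by id.
USERS = {
    1: {"name": "ann", "city": "Phoenix"},
    2: {"name": "bob", "city": "Tempe"},
    3: {"name": "cy", "city": "Phoenix"},
}

def names_in_city_kv(store, city):
    """Key-value style: no index on city, so visit one record at a time."""
    return sorted(rec["name"] for rec in store.values() if rec["city"] == city)

def make_db(store):
    """Load the same data into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(key, rec["name"], rec["city"]) for key, rec in store.items()],
    )
    return conn

def names_in_city_sql(conn, city):
    """Relational style: describe the set wanted; the system picks the access path."""
    rows = conn.execute(
        "SELECT name FROM users WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in rows]

conn = make_db(USERS)
before_index = names_in_city_sql(conn, "Phoenix")
# Adding an index later changes performance, not the query or its answer.
conn.execute("CREATE INDEX users_city ON users (city)")
after_index = names_in_city_sql(conn, "Phoenix")
```

The application code in names_in_city_sql never mentions the index, which is roughly what I understand data independence to mean in practice.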

The database administrator is no doubt at a disadvantage today with NoSQL systems, though this is more of a tools issue than a fundamental design issue. The “how do I query the database?” comic sums up the current situation amusingly.

A theoretical foundation of NoSQL systems is hard to find. Most of the theory seems to be in regard to eventual consistency and other issues related more to distributed systems than data modeling in the abstract. This will surely come with time, though as soon as you get into the details of data modeling in NoSQL systems, you really have to specify which one, as they are more different than they are similar. A theory of data management with key-value stores seems, to me, unenlightening at first glance.

Whatever the base model is, if NoSQL databases are here to stay, I think we are going to see a need for some theoretical foundations to manage the growing complexity of our data models given the new strengths and limitations of NoSQL systems.

Ruby’s surprising handling of local variables

February 26th, 2010

I was trying to read some Rails code today and came across the definition of the link_to view helper, which looks like this:

def link_to(*args, &block)
  if block_given?
    options      = args.first || {}
    html_options = args.second
    concat(link_to(capture(&block), options, html_options))
  else
    name         = args.first
    options      = args.second || {}
    html_options = args.third
 
    url = url_for(options)
 
    if html_options
      html_options = html_options.stringify_keys
      href = html_options['href']
      convert_options_to_javascript!(html_options, url)
      tag_options = tag_options(html_options)
    else
      tag_options = nil
    end
 
    href_attr = "href=\"#{url}\"" unless href
    "<a #{href_attr}#{tag_options}>#{name || url}</a>"
  end
end

At first glance, I thought for sure I had found a bug! The variable, href, is only initialized if html_options is specified. It seems like the “href_attr = … unless href” line would blow up otherwise, since it’s testing a variable that may not have been set. Or, so I thought. It turns out that my understanding of Ruby’s local variable semantics was wrong, as demonstrated by this simple test:

irb(main):001:0> x
NameError: undefined local variable or method 'x' for main:Object
        from (irb):1
        from :0
irb(main):002:0> if false
irb(main):003:1>   x = 0
irb(main):004:1> end
=> nil
irb(main):005:0> x
=> nil

It seems that Ruby decides which names are local variables when it parses the code, not when it runs it: merely seeing an assignment to x, even in code that never executes, is sufficient to create the variable, and until an assignment actually runs its value is nil. This is in contrast to Python, which doesn’t define new variables unless the code that sets them actually executes:

>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
>>> if False:
...   x = 0
...
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined

Neither does JavaScript:

js> x
typein:1: ReferenceError: x is not defined
js> if (false) x = 0;
js> x
typein:3: ReferenceError: x is not defined
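To be fair to Ruby, Python has a related wrinkle once you move from the top level into a function. The sketch below (the function name and flag parameter are made up for illustration) shows that an assignment anywhere in a function body makes the name local to the whole function, even if the assignment never runs. The difference is that Python leaves the variable unbound and raises an error on read, where Ruby hands back nil.

```python
def conditional_assignment(flag):
    if flag:
        x = 0  # when flag is false this never runs, but x is still a local name
    return x   # reading the unbound local raises UnboundLocalError

try:
    conditional_assignment(False)
    outcome = "no error"
except UnboundLocalError as exc:
    # UnboundLocalError is a subclass of NameError.
    outcome = type(exc).__name__
```

So the scoping decision is made ahead of time in both languages; they just disagree about what an unassigned local should evaluate to.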

Is it just me, or is Ruby’s behavior completely bizarre here?

Taming the Unicode terminal

February 25th, 2010

For the past six months or so, I’ve been on a quest for the perfect graphical Linux terminal, and it’s led me in some odd directions. The thing is, I already found the perfect terminal (perfect for me, that is): mrxvt. I’ve been using mrxvt for about five years now, and it’s been my favorite for several reasons:

  • It’s insanely fast
  • It’s easy to customize
  • It has tabs, and you can put them at the bottom

About my only gripe has been the lack of Zmodem support. Call me crazy, but I still like to use Zmodem from time to time. The only terminals that still do Zmodem are SecureCRT and old versions of Konsole; since KDE 4’s Konsole regressed, this feature no longer exists.

I had little reason to switch to another terminal until the day I realized that UTF-8 is here to stay, and it’s something a terminal needs to do. I fought this for a long time, but ultimately I got tired of setting LANG=C in all my profiles and jumping through hoops to turn off what is essentially progress. The problem is that mrxvt does not support Unicode, and it probably never will.

What are my other options? I tried many:

  • xterm and rxvt have Unicode options, but no tabs.
  • xfce4-terminal is actually quite close to what I want, but it won’t let me put the tabs at the bottom.
  • gnome-terminal is slow, feature-poor, hard to configure, and also won’t let me put the tabs at the bottom.
  • urxvt looks promising, but the tab support is text, not graphical, and needs some UI work. (I hear there’s a GTK version, but it doesn’t work very well and lacks the keyboard support I depend on.)
  • Konsole is probably the best terminal available for Linux, but it’s slow, clunky, and I can’t customize the keyboard shortcuts the way I want. I like how it remembers what sessions I had open last, but I hate not being able to assign hotkeys – with mrxvt, I type Ctrl-Shift-Enter, type a host name, and get an instant ssh connection with the host name as the tab title. I can’t seem to get this effect with Konsole.

I’ve tried others, but I can’t remember them anymore. I tried everything available through Debian and a few others, and wasn’t happy with any of them. I even tried making my own terminal using Python and libvte, and at one point considered that a viable option, but ultimately gave up because I saw how much work I had to do to get decent terminal emulation (I use a lot of keyboard shortcuts in emacs), and besides, VTE just feels kind of slow (which is why gnome-terminal also feels slow).

After flirting with Konsole for a few months (hey, at least it motivated me to try out KDE 4, which is awesome), I got tired of the slowness and lack of keyboard customizations and I switched back to mrxvt. I started looking for other ways to get Unicode to work, and somehow I stumbled upon GNU Screen, which I had been ignoring up to that point, since mrxvt’s tabs left me little reason to add another layer of virtual screen support.

However, GNU Screen had, this whole time, been a workable solution, and I was just completely unaware of what it could do. I don’t think many people know that this is even possible, since it’s hard to find any conversations on the web about it, but screen can emulate UTF-8 on a latin-1 terminal, and I find this completely amazing. Here’s my simple .screenrc configuration:

defbce on
defutf8 on
escape ^]^]
markkeys "h=^B:l=^F:$=^E"
setenv LANG en_US.UTF-8
startup_message off
term $TERM
termcapinfo xterm|xterms|xs|rxvt ti@:te@
zmodem catch

Here’s what it does:

  • “defutf8 on” turns on UTF-8 support, including translation to the parent terminal’s encoding!
  • “defbce on” turns on “background color erase”, which fixes a problem where status-line background colors don’t go all the way across the screen in various full-screen console programs
  • “escape ^]^]” moves the escape key from the default of Ctrl-A to Ctrl-], which is the best compromise I could come up with since my brain is completely wired to use Ctrl-A to go to the beginning of the line (emacs, zsh, etc.)
  • The “markkeys” line I don’t really use much anymore, now that I got mrxvt’s scrollback buffer working again, but it lets me use the other emacs keys to navigate screen’s scroll/copy buffer
  • “setenv LANG en_US.UTF-8” sets the LANG environment variable to indicate that I want UTF-8 encoding. This is essential because, from mrxvt, I have LANG=C so that screen knows my terminal doesn’t support UTF-8.
  • “startup_message off” skips the startup screen, since I use screen for every single tab, and this would be totally redundant.
  • “term $TERM” uses the TERM setting from the parent terminal rather than letting screen overwrite it as TERM=screen or TERM=screen-bce. I actually use “xterm” as my TERM type, even though I use an rxvt-derived terminal, because this has given me the best compatibility with the various console programs I use. I’ve spent many hours fixing keyboard escape codes so that mrxvt is xtermy enough.
  • The “termcapinfo …” line tells screen not to use the “alternate screen” for anything, which is necessary for mrxvt’s scrollback buffer and corresponding keyboard and mouse support to work.
  • Last but not least, “zmodem catch” enables–holy shit–Zmodem support! I had pretty much already given up on Zmodem, after having bad experiences with zssh, but screen’s Zmodem support really works! Supposedly it’s experimental and crashes sometimes, but so far I haven’t had any major problems with it.

That’s it for my screen setup so far. I made the following change to my .mrxvtrc to use screen for every tab and make sure that LANG=C before screen runs (which screen will later set to en_US.UTF-8):

Mrxvt.command: \!LANG=C exec screen
Mrxvt.macro.Ctrl+Shift+Return: NewTab "ssh" \!read -p"Host: " host; echo -ne "\e]0;$host\a"; LANG=C exec screen ssh $host

The first line is the only one necessary to start using screen for all new tabs. The second one is my handy shortcut for opening up ssh sessions quickly. It really helps when I need to log onto a bunch of boxes at once.

I got all of this working and ran with it for a couple of weeks. Everything was great except for one minor issue, which wasn’t really a big deal, but I came up with a fix for that, too. The issue was that certain Unicode characters were showing up as ‘?’. After some investigation I learned that these characters did not have corresponding characters in latin-1. They included bullets, various dashes/hyphens, and curly single- and double-quotes. I don’t really care if my quotes are curly, but it is distracting when they show up as question marks, so I started digging through the screen manual trying to find a solution. I couldn’t find any way to customize the replacement character it chooses, so I finally downloaded the screen source and started hacking on it. I wrote a little patch that does the replacements I want, and figured I might as well try doing a fork on GitHub so I can keep up with future changes to screen. Here is the commit.
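The real patch is C code inside screen, so the following is only a Python sketch of the idea, not the patch itself. The FALLBACKS table and the to_latin1 name are made up for illustration, and the particular ASCII stand-ins are my own guesses at reasonable substitutes: anything latin-1 can represent passes through, a few known troublemakers get a lookalike, and everything else still becomes a question mark.

```python
# Hypothetical fallback table: Unicode characters with no latin-1 equivalent,
# mapped to ASCII lookalikes.
FALLBACKS = {
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2022": "*",   # bullet
}

def to_latin1(text):
    """Encode text as latin-1, substituting lookalikes where possible."""
    out = []
    for ch in text:
        try:
            ch.encode("latin-1")
        except UnicodeEncodeError:
            # No latin-1 equivalent: use a lookalike, or '?' as a last resort.
            ch = FALLBACKS.get(ch, "?")
        out.append(ch)
    return "".join(out).encode("latin-1")
```

For example, to_latin1("\u201chi\u201d") gives the bytes for "hi" in straight quotes instead of two question marks.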

As it turns out, the latest version from GitHub, which I think is a mirror of the CVS trunk, once again kills mrxvt’s scrollback buffer support, and after messing around with it for a while I gave up and went back to the latest screen source package from Debian unstable, applying my patch to encoding.c and running “debian/rules binary” to build a custom .deb package. That is what I am using now, and it’s working just fine!

All this, and I still haven’t really learned how to use screen for its intended purpose. I still don’t really need the virtual screens, but I’m intrigued by the bind, bindkey, and stuff commands, which seem to be a general-purpose keyboard macro tool. I’m looking forward to learning more about what screen can do.

Redesigning my blog, for reals this time

February 23rd, 2010

Okay, so this is about the 4th theme I’ve picked for my blog, though I doubt anyone reads it anyway. In any case, this time I finally realized that I’ll never be happy with the coding style of another WordPress theme developer, so I started from scratch. Plus, I really wanted to try using 960.gs, and converting an existing theme from one framework (or no framework) to another is pretty hard and time-consuming. So here’s my brand new theme, with not many features or anything. Hope you like it. =)

Synchronet runs on Linux

August 31st, 2009

I just found this out. I followed the directions and got Synchronet BBS running on my Debian box. It runs great under daemontools, and out of the box it sets up telnet, ssh, ftp, finger (yes, finger), mail, and irc servers. It has multiple sets of menu keys including one that emulates Renegade, which is more nostalgic for me than emacs bindings.

Now, the question is, what would I do with it? If I made a BBS, would you log into it? If so, would you log in more than 20 times? What if it had ANSI art? What if it was UTF-8?!

I doubt it, right? Internet BBSes are weird.

I miss BBSes a lot but I never stay on an internet one for more than a couple of weeks. But I have to say, Synchronet is a really nice BBS, and I’d totally use it if I still knew how to do ANSI art and more than 1 people would call it. Maybe I can gate it into Twitter somehow.

Using the Python tokenizer for source transformation

March 26th, 2009

I have been working on porting a medium-sized Django project from Django 0.96 to Django 1.0, and one of the necessary changes is converting to use Unicode strings (u'like this') instead of byte strings ('like this'). There was too much code to do this reliably by hand, so it seemed like a good idea to write a script to do it. Rather than hack together a bunch of regular expressions, I decided to try the Python tokenize module, since it seemed like I could get very reliable source translation that way.

My first attempt was to use the new untokenize function, which takes tokenizer output and turns it back into source code. However, despite the documentation which states that “conversion is lossless and round-trips are assured”, the coding style is not preserved. Whitespace is added in some places and removed in others, and even though the code runs the same, it looks ugly and generates huge unreadable diffs. Instead, I built the output source manually, since the tokenizer provides enough information about row and column positions. Here’s how it ended up:

#!/usr/bin/env python
 
import sys
import itertools
from tokenize import *
 
def token_line_number((num, token, spos, epos, line)):
    return spos[0]
 
def token_lines(tokens):
    return itertools.groupby(tokens, token_line_number)
 
def convert_strings(token_line):
    result = ''
    pad = 0
    for num, token, spos, epos, line in token_line:
        result += ' ' * (spos[1] + pad - len(result))
        if num == STRING and token[0] not in 'uUbB':
            result += 'u'
            pad += 1
        result += token
    return result
 
def convert_unicode(tokens):
    for line_number, token_line in token_lines(tokens):
        token_line = list(token_line)
        has_strings = False
        for num, _, _, _, _ in token_line:
            if num == STRING:
                has_strings = True
                break
        if has_strings:
            yield convert_strings(token_line)
        else:
            yield token_line[0][4]
 
tokens = generate_tokens(sys.stdin.readline)
for line in convert_unicode(tokens):
    sys.stdout.write(line.replace('__str__', '__unicode__'))

Overall, it was very simple to write and didn’t take too long. For any lines that didn’t have string literals, I just printed them out verbatim. Otherwise, I built a new line by assembling it token-by-token, padding with spaces to match the original column positions of each token (compensating for the additional padding introduced by adding the extra ‘u’s). As a post-process, I replaced all definitions and calls of “__str__” with the preferred “__unicode__” using a simple search-and-replace.
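For comparison, here is a sketch of a different way to get the same effect, written for Python 3 and not the script I actually ran. Instead of rebuilding each line from its tokens, it uses the tokenizer only to find where unprefixed string literals start and then splices a ‘u’ into the original text at those positions, so everything else on the line is untouched by construction. The add_u_prefix name is made up for illustration, and literals that already carry a prefix (r, b, f, u) are deliberately left alone.

```python
import io
import tokenize

def add_u_prefix(source):
    """Prefix unprefixed string literals with 'u', leaving all other text as-is."""
    lines = source.splitlines(keepends=True)
    # Map each row to the columns where an unprefixed string literal starts.
    inserts = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and tok.string[0] in "'\"":
            row, col = tok.start
            inserts.setdefault(row, []).append(col)
    for row, cols in inserts.items():
        line = lines[row - 1]
        # Work right to left so earlier column offsets stay valid.
        for col in sorted(cols, reverse=True):
            line = line[:col] + "u" + line[col:]
        lines[row - 1] = line
    return "".join(lines)
```

Because comments come through the tokenizer as their own token type, quotes inside a comment are not mistaken for string literals, which is the main thing a regular expression would get wrong.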

OCaml streams tutorial

March 11th, 2009

I wrote a tutorial on the Stream module for the Objective CAML Tutorial wiki. It’s got a lot of code that I’ve been sitting on for almost two years, so it feels good to finally get it up on the interwebs where a hard drive crash can’t hurt it.

The Stream module is not very well known, and until now has had almost no documentation. I found out a lot by studying the source while working on the PLEAC project. I also got some ideas from Raymond Hettinger’s excellent work on Python’s itertools module; Python’s generators are very similar to OCaml’s streams.

I hope to find the time to also write a tutorial on camlp4 stream expressions, which use the Stream module to implement syntax sugar that makes stream processing and parsing much more concise.

Raising kids in this crazy world

November 14th, 2008

As I’m about to have my first child, I’ve been thinking about the public education system he will probably suffer through just as I did, so it was well-timed that a Reddit commenter found this little gem about how school prepares us for a life of servitude and not one of leadership, free thought, or creativity.

It reminded me of the book by Charles J. Sykes: Dumbing Down Our Kids, Why American Children Feel Good About Themselves But Can’t Read, Write, Or Add (worth it for the title alone). I found this book in my bookshelf today, and noticed that I had written some sort of poem-like thing on the back page, presumably after reflecting on what I read in the first few chapters (I’m pretty sure that’s as far as I got). I’ll reproduce it for you here:

knowledge is important.
applying knowledge is also important.
knowledge is a necessary prerequisite.

learning is hard work.
making this seem untrue or avoidable
is popular and lucrative.
often, the result of catering toward this
interest is something other than learning.

we are not learning.
our laziness and desire to feel good
have obscured this.
we are being sold our own stupidity.
we pay a little for nothing instead of a lot
for something, and we believe this
to be a great deal.

Something to think about.

Surfing linkland

October 29th, 2008

I was reading Roy Fielding’s you’re-doing-it-wrong article on REST for the second or third time today, and this time I tried to avoid my usual knee-jerk dismissal and learn something instead. His assertion that hyperlinks are essential to understanding REST got me thinking about the idea of auto-discovery. In essence, you should be able to request the root of a site via HTTP and, without knowing anything about the site’s particular approach to URL design, be able to navigate through its services just like clicking links in a browser. I realized something which has sort of been in the back of my mind for a while but never really struck me fully until today: the “link” tag is the de facto standard for exposing services. If you want to see what services a web site wants you to know about, just scrape its links. So, I wrote a little Python script to do just that:

#!/usr/bin/env python
 
import os
import re
import sys
import urllib
 
if len(sys.argv) != 2:
    print 'usage: %s <url>' % os.path.basename(sys.argv[0])
    sys.exit(1)
 
url = sys.argv[1]
html = urllib.urlopen(url).read()
for tag in re.compile(r'<link.*?>', re.DOTALL).findall(html):
    print tag

Here’s what the output looks like for some common web sites:

ramen@pedro:~$ dumplinks http://reddit.com
<link rel="stylesheet" href="/static/reddit.css?v=2a07c701b9a58519a6c333860d8add98" type="text/css" />
<link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.reddit.com/.rss" />

ramen@pedro:~$ dumplinks http://wordpress.com
<link rel="alternate" type="application/rss+xml" title="WordPress.com News" href="http://en.blog.wordpress.com/feed/" />
<link rel="apple-touch-icon" href="http://s2.wordpress.com/wp-content/themes/h4/i/webclip.png"/>
<link href="/wp-content/themes/h4/ie6.css?m=1223022215a" rel="stylesheet" type="text/css" media="screen" />
<link rel='stylesheet' href='http://s.wordpress.com/wp-content/themes/h4/global.css?m=1214319868a' type='text/css' />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://wordpress.com/xmlrpc.php?rsd" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://wordpress.com/wp-includes/wlwmanifest.xml" />
<link rel="introspection" type="application/atomserv+xml" title="Atom API" href="http://wordpress.com/wp-app.php" />
<link rel='openid.server' href='http://wordpress.com/?openidserver=1' />
<link rel='openid.delegate' href='http://wordpress.com/' />

ramen@pedro:~$ dumplinks http://slashdot.org
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/idlecore-tidied.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/lick.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/comments-idle.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie7-idle.css?T_2_5_0_227" />
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie6-idle.css?T_2_5_0_227" />
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie8-idle.css?T_2_5_0_227" />
<link rel="top"       title="News for nerds, stuff that matters" href="//slashdot.org/" >
<link rel="search"    title="Search Slashdot" href="//slashdot.org/search.pl">
<link rel="alternate" title="Slashdot RSS" href="http://rss.slashdot.org/Slashdot/slashdot" type="application/rss+xml">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">

ramen@pedro:~$ dumplinks http://intertwingly.net
<link rel="alternate" type="application/atom+xml" title="It’s just data" href="http://intertwingly.net/blog/index.atom"/>
<link rel="openid.server" href="http://intertwingly.net/id/"/>
<link rel="search" type="application/opensearchdescription+xml" href="http://intertwingly.net/search/" title="intertwingly blog search"/>
<link rel="stylesheet" href="/css/blog5.css" type="text/css" media="screen"/>
<link rel="stylesheet" href="/css/halloween.css" type="text/css" media="screen"/>
<link rel="stylesheet" href="/css/print.css" type="text/css" media="print"/>
<link rel="shortcut icon" href="/favicon.ico"/>

ramen@pedro:~$ dumplinks http://earth911.com
<link rel="stylesheet" href="/wp-content/themes/starship/styles/site-3300.css" type="text/css" media="screen,print" />
<link rel="alternate" type="application/rss+xml" title="Earth911.com RSS Feed" href="http://earth911.com/feed/" />
<link rel="pingback" href="http://earth911.com/xmlrpc.php" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://earth911.com/xmlrpc.php?rsd" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://earth911.com/wp-includes/wlwmanifest.xml" />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-email/email-css.css?ver=2.40' type='text/css' media='all' />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-polls/polls-css.css?ver=2.40' type='text/css' media='all' />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-postratings/postratings-css.css?ver=1.40' type='text/css' media='all' />

I already learned about a few new things like the apple-touch-icon feature and opensearchdescription. It seems like there’s another world out there that I never really noticed because it was lost in the tag soup.
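A regular expression is fine for a quick look, but if I wanted to actually do something with these links, a real parser would be the safer bet. Here is a minimal sketch in Python 3 using the standard library’s html.parser module. The LinkCollector and find_links names are made up for illustration; the sketch takes the HTML as a string, so fetching the page is left out.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the attributes of every <link> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing tags also end up here.
        if tag == "link":
            self.links.append(dict(attrs))

def find_links(html):
    """Return one dict of attributes per <link> tag found in html."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

With the attributes in dicts, grouping the links by their rel value, or pulling out just the feeds, becomes a one-liner instead of another regular expression.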

Earth911.com site relaunch and birthday celebration

October 29th, 2008

We just launched the new Earth911.com site, which is a major rewrite of both the content management system and recycling location search engine, as well as a much-needed redesign of the look-and-feel and navigation!

Please drop by and leave a comment on our Happy Birthday page, celebrating the company’s 17 years of operation providing information and resources on recycling to the world.

Cheers!
Dave