Surfing linkland

I was reading Roy Fielding’s you’re-doing-it-wrong article on REST for the second or third time today, and this time I tried to avoid my usual knee-jerk dismissal and learn something instead. His assertion that hyperlinks are essential to understanding REST got me thinking about the idea of auto-discovery. In essence, you should be able to request the root of a site via HTTP and, without knowing anything about the site’s particular approach to URL design, navigate through its services just like clicking links in a browser. I realized something that has been in the back of my mind for a while but never really struck me until today: the “link” tag is the de facto standard for exposing services. If you want to see what services a web site wants you to know about, just scrape its link tags. So, I wrote a little Python script to do just that:

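#!/usr/bin/env python3

import os
import re
import sys
import urllib.request

if len(sys.argv) != 2:
    print('usage: %s <url>' % os.path.basename(sys.argv[0]))
    sys.exit(1)

url = sys.argv[1]
# Fetch the page; decode leniently, since only the tags matter here.
html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
# Non-greedy match with DOTALL so that tags spanning several lines are caught.
for tag in re.compile(r'<link.*?>', re.DOTALL).findall(html):
    print(tag)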

Here’s what the output looks like for some common web sites:

ramen@pedro:~$ dumplinks http://reddit.com
<link rel="stylesheet" href="/static/reddit.css?v=2a07c701b9a58519a6c333860d8add98" type="text/css" />
<link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.reddit.com/.rss" />
 
ramen@pedro:~$ dumplinks http://wordpress.com
<link rel="alternate" type="application/rss+xml" title="WordPress.com News" href="http://en.blog.wordpress.com/feed/" />
<link rel="apple-touch-icon" href="http://s2.wordpress.com/wp-content/themes/h4/i/webclip.png"/>
<link href="/wp-content/themes/h4/ie6.css?m=1223022215a" rel="stylesheet" type="text/css" media="screen" />
<link rel='stylesheet' href='http://s.wordpress.com/wp-content/themes/h4/global.css?m=1214319868a' type='text/css' />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://wordpress.com/xmlrpc.php?rsd" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://wordpress.com/wp-includes/wlwmanifest.xml" />
<link rel="introspection" type="application/atomserv+xml" title="Atom API" href="http://wordpress.com/wp-app.php" />
<link rel='openid.server' href='http://wordpress.com/?openidserver=1' />
<link rel='openid.delegate' href='http://wordpress.com/' />
 
ramen@pedro:~$ dumplinks http://slashdot.org
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/idlecore-tidied.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/lick.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/comments-idle.css?T_2_5_0_227" media="screen">
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie7-idle.css?T_2_5_0_227" />
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie6-idle.css?T_2_5_0_227" />
<link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie8-idle.css?T_2_5_0_227" />
<link rel="top"       title="News for nerds, stuff that matters" href="//slashdot.org/" >
<link rel="search"    title="Search Slashdot" href="//slashdot.org/search.pl">
<link rel="alternate" title="Slashdot RSS" href="http://rss.slashdot.org/Slashdot/slashdot" type="application/rss+xml">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
 
ramen@pedro:~$ dumplinks http://intertwingly.net
<link rel="alternate" type="application/atom+xml" title="It’s just data" href="http://intertwingly.net/blog/index.atom"/>
<link rel="openid.server" href="http://intertwingly.net/id/"/>
<link rel="search" type="application/opensearchdescription+xml" href="http://intertwingly.net/search/" title="intertwingly blog search"/>
<link rel="stylesheet" href="/css/blog5.css" type="text/css" media="screen"/>
<link rel="stylesheet" href="/css/halloween.css" type="text/css" media="screen"/>
<link rel="stylesheet" href="/css/print.css" type="text/css" media="print"/>
<link rel="shortcut icon" href="/favicon.ico"/>
 
ramen@pedro:~$ dumplinks http://earth911.com
<link rel="stylesheet" href="/wp-content/themes/starship/styles/site-3300.css" type="text/css" media="screen,print" />
<link rel="alternate" type="application/rss+xml" title="Earth911.com RSS Feed" href="http://earth911.com/feed/" />
<link rel="pingback" href="http://earth911.com/xmlrpc.php" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://earth911.com/xmlrpc.php?rsd" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://earth911.com/wp-includes/wlwmanifest.xml" />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-email/email-css.css?ver=2.40' type='text/css' media='all' />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-polls/polls-css.css?ver=2.40' type='text/css' media='all' />
<link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-postratings/postratings-css.css?ver=1.40' type='text/css' media='all' />
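
The raw tags are fine for eyeballing, but the hrefs come in three flavors (absolute, site-relative like /favicon.ico, and protocol-relative like //slashdot.org/search.pl), and a regex won't untangle the attributes. Here is a rough sketch of one way to get at them, using the standard library's html.parser and urljoin. The names LinkCollector and collect_links are made up for this example, and the sample input is just two of the Slashdot tags from above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the attributes of every <link> tag as a dict."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != 'link':
            return
        link = dict(attrs)
        if link.get('href'):
            # Resolve relative and protocol-relative hrefs against the page URL.
            link['href'] = urljoin(self.base_url, link['href'])
        self.links.append(link)


def collect_links(html, base_url):
    """Return a list of attribute dicts, one per <link> tag in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = '''
<link rel="search"    title="Search Slashdot" href="//slashdot.org/search.pl">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
'''
for link in collect_links(sample, 'http://slashdot.org'):
    print(link.get('rel'), link['href'])
```

With the hrefs resolved, each link is something a client could actually follow without knowing how the site lays out its URLs.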

I’ve already learned about a few new things, like the apple-touch-icon feature and OpenSearch description documents (the application/opensearchdescription+xml type). It seems like there’s another world out there that I never really noticed because it was lost in the tag soup.
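
To pull the services out of the tag soup, one option is to group links by their rel value and drop the purely presentational ones. This is only a sketch: the PRESENTATIONAL set is my own guess at what counts as "not a service", services_by_rel is a made-up name, and the input is assumed to be a list of attribute dicts (here hand-copied from the intertwingly.net output above).

```python
from collections import defaultdict

# Assumption: links carrying any of these rel tokens are presentation, not services.
PRESENTATIONAL = {'stylesheet', 'icon', 'shortcut', 'apple-touch-icon'}


def services_by_rel(links):
    """Map each non-presentational rel token to the hrefs that carry it."""
    services = defaultdict(list)
    for link in links:
        # rel is a space-separated list of tokens; compare case-insensitively.
        tokens = (link.get('rel') or '').lower().split()
        if not tokens or PRESENTATIONAL.intersection(tokens):
            continue
        for token in tokens:
            services[token].append(link.get('href'))
    return dict(services)


links = [
    {'rel': 'alternate', 'type': 'application/atom+xml',
     'href': 'http://intertwingly.net/blog/index.atom'},
    {'rel': 'openid.server', 'href': 'http://intertwingly.net/id/'},
    {'rel': 'search', 'type': 'application/opensearchdescription+xml',
     'href': 'http://intertwingly.net/search/'},
    {'rel': 'stylesheet', 'href': '/css/blog5.css'},
    {'rel': 'shortcut icon', 'href': '/favicon.ico'},
]
for rel, hrefs in sorted(services_by_rel(links).items()):
    print(rel, hrefs)
```

What is left after filtering is roughly the list of things the site wants a client to discover: feeds, search, OpenID endpoints, and so on.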
