I was reading Roy Fielding’s you’re-doing-it-wrong article on REST for the second or third time today, and this time I tried to avoid my usual knee-jerk dismissal and learn something instead. His assertion that hyperlinks are essential to understanding REST got me thinking about the idea of auto-discovery. In essence, you should be able to request the root of a site via HTTP and, without knowing anything about the site’s particular approach to URL design, navigate through its services just like clicking links in a browser. I realized something that has been in the back of my mind for a while but never really struck me until today: the “link” tag is the de facto standard for exposing services. If you want to see what services a web site wants you to know about, just scrape its link tags. So, I wrote a little Python script to do just that:
    #!/usr/bin/env python
    # Python 2 script: print every <link> tag found at the given URL.
    import os
    import re
    import sys
    import urllib

    if len(sys.argv) != 2:
        print 'usage: %s <url>' % os.path.basename(sys.argv[0])
        sys.exit(1)

    url = sys.argv[1]
    html = urllib.urlopen(url).read()

    # Non-greedy match with DOTALL so tags spanning several lines are caught.
    for tag in re.compile(r'<link.*?>', re.DOTALL).findall(html):
        print tag
Here’s what the output looks like for some common web sites:
    ramen@pedro:~$ dumplinks http://reddit.com
    <link rel="stylesheet" href="/static/reddit.css?v=2a07c701b9a58519a6c333860d8add98" type="text/css" />
    <link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
    <link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.reddit.com/.rss" />

    ramen@pedro:~$ dumplinks http://wordpress.com
    <link rel="alternate" type="application/rss+xml" title="WordPress.com News" href="http://en.blog.wordpress.com/feed/" />
    <link rel="apple-touch-icon" href="http://s2.wordpress.com/wp-content/themes/h4/i/webclip.png"/>
    <link href="/wp-content/themes/h4/ie6.css?m=1223022215a" rel="stylesheet" type="text/css" media="screen" />
    <link rel='stylesheet' href='http://s.wordpress.com/wp-content/themes/h4/global.css?m=1214319868a' type='text/css' />
    <link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://wordpress.com/xmlrpc.php?rsd" />
    <link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://wordpress.com/wp-includes/wlwmanifest.xml" />
    <link rel="introspection" type="application/atomserv+xml" title="Atom API" href="http://wordpress.com/wp-app.php" />
    <link rel='openid.server' href='http://wordpress.com/?openidserver=1' />
    <link rel='openid.delegate' href='http://wordpress.com/' />

    ramen@pedro:~$ dumplinks http://slashdot.org
    <link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/idlecore-tidied.css?T_2_5_0_227" media="screen">
    <link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/lick.css?T_2_5_0_227" media="screen">
    <link rel="stylesheet" rev="stylesheet" href="//images.slashdot.org/comments-idle.css?T_2_5_0_227" media="screen">
    <link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie7-idle.css?T_2_5_0_227" />
    <link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie6-idle.css?T_2_5_0_227" />
    <link rel="stylesheet" type="text/css" media="screen" href="//images.slashdot.org/ie8-idle.css?T_2_5_0_227" />
    <link rel="top" title="News for nerds, stuff that matters" href="//slashdot.org/" >
    <link rel="search" title="Search Slashdot" href="//slashdot.org/search.pl">
    <link rel="alternate" title="Slashdot RSS" href="http://rss.slashdot.org/Slashdot/slashdot" type="application/rss+xml">
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">

    ramen@pedro:~$ dumplinks http://intertwingly.net
    <link rel="alternate" type="application/atom+xml" title="It’s just data" href="http://intertwingly.net/blog/index.atom"/>
    <link rel="openid.server" href="http://intertwingly.net/id/"/>
    <link rel="search" type="application/opensearchdescription+xml" href="http://intertwingly.net/search/" title="intertwingly blog search"/>
    <link rel="stylesheet" href="/css/blog5.css" type="text/css" media="screen"/>
    <link rel="stylesheet" href="/css/halloween.css" type="text/css" media="screen"/>
    <link rel="stylesheet" href="/css/print.css" type="text/css" media="print"/>
    <link rel="shortcut icon" href="/favicon.ico"/>

    ramen@pedro:~$ dumplinks http://earth911.com
    <link rel="stylesheet" href="/wp-content/themes/starship/styles/site-3300.css" type="text/css" media="screen,print" />
    <link rel="alternate" type="application/rss+xml" title="Earth911.com RSS Feed" href="http://earth911.com/feed/" />
    <link rel="pingback" href="http://earth911.com/xmlrpc.php" />
    <link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://earth911.com/xmlrpc.php?rsd" />
    <link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://earth911.com/wp-includes/wlwmanifest.xml" />
    <link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-email/email-css.css?ver=2.40' type='text/css' media='all' />
    <link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-polls/polls-css.css?ver=2.40' type='text/css' media='all' />
    <link rel='stylesheet' href='http://earth911.com/wp-content/plugins/wp-postratings/postratings-css.css?ver=1.40' type='text/css' media='all' />
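The script above is Python 2 and leans on a regular expression, which prints the raw tags but leaves the attributes unparsed. As a rough sketch of the same idea in Python 3, the standard library's html.parser can hand back each link tag's attributes as a dictionary. The names LinkCollector, extract_links, and fetch_links are made up for illustration and are not part of the original script.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the attributes of every <link> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; values are left
        # as written. Self-closing tags (<link ... />) also end up here,
        # because the default handle_startendtag calls handle_starttag.
        if tag == "link":
            self.links.append(dict(attrs))


def extract_links(html):
    """Return a list of attribute dicts, one per <link> tag in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def fetch_links(url):
    """Download url and return its <link> tags (hypothetical helper)."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html)
```

Calling extract_links on a page's HTML gives one dictionary per tag, so something like `link["rel"]` or `link.get("href")` can be read directly instead of picking the tag apart by hand.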
I already learned about a few new things like the apple-touch-icon feature and opensearchdescription. It seems like there’s another world out there that I never really noticed because it was lost in the tag soup.
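One way to dig the interesting entries out of the tag soup is to group the links by their rel value and drop the purely presentational ones. The sketch below assumes the link tags have already been parsed into attribute dictionaries (one per tag). The function name services_by_rel and the PRESENTATION_RELS set are my own illustrative choices, and which rel values count as "presentation" is a judgment call rather than any standard. It also resolves relative and scheme-relative hrefs, like the //slashdot.org/search.pl one above, against the page URL.

```python
from collections import defaultdict
from urllib.parse import urljoin

# Assumption: rel tokens treated here as presentation rather than services.
PRESENTATION_RELS = {"stylesheet", "icon", "shortcut", "apple-touch-icon"}


def services_by_rel(base_url, links):
    """Map each non-presentational rel token to a list of absolute URLs.

    links is a list of attribute dicts, one per <link> tag.
    """
    services = defaultdict(list)
    for attrs in links:
        href = attrs.get("href")
        if not href:
            continue
        # rel is a space-separated list of tokens; compare case-insensitively.
        for rel in (attrs.get("rel") or "").lower().split():
            if rel not in PRESENTATION_RELS:
                # urljoin handles "/path", "//host/path", and absolute URLs.
                services[rel].append(urljoin(base_url, href))
    return dict(services)
```

Run against the Slashdot output above, this would leave the "top", "search", and "alternate" entries and discard the stylesheets and the favicon.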
