Over the next seven days I will be moving this site to a new server, so at some point between now and then: expect disruptions!
January 26, 2009
Pipe Cleaner – a delicious example
This is going to be a very brief post: here’s how you use Pipe Cleaner to download every link in your delicious account tagged “python” – outputting it as OPML, RSS, or Atom comes for free:
import module:api_delicious; api_delicious.PostsList to:items tag:"python" authenticate:delicious;
January 25, 2009
ChangeCamp – notes from redesigning www.toronto.ca/opendata
This is a transcription-with-license from a session I attended yesterday at ChangeCamp entitled “Designing www.toronto.ca/opendata”. You can read more about ChangeCamp here; my primary reason for attending was an interest in promoting and helping our governments share the information they are gathering, to make them more transparent, accountable and even potentially useful. The session I attended was run by some folks from the City of Toronto and its 311 initiative (I think). These notes are based on memory and the photos I took, here on Flickr. Note that this is only one of several (3?) parallel discussions that were happening during this session, and this is only the “data” section of the session; I’m afraid I wandered off during the “tools” parts as I thought it might be a little premature.
Tags, Geocoding, Ontologies
A common theme of discussion was providing ways to organize the data. In particular:
- geocode,
- geocode by tag: e.g. “the annex”
- tagged by topic; i.e. that this is an orthogonal axis to geocoding
- metadata
- we should look to how other cities have organized data to find a common vocabulary
A common concept was that we could use tags / geocodes to:
- spontaneously form communities around ideas
- track related issues
- form feeds on related information
- use to query related information that would generally be spread widely
Information Dissemination
- be able to track issues through the system; this is a very common theme
- have access to historical information
- feeds for everything
Crime and Public Safety
- Police reports (geocoded)
- Public health services
- Emergency information
- Info (like about SARS)
- Releases
- Public health
Scheduling
- Pools
- Skating rinks
- Ferries
- Public meetings
This is related to the concept discussed that any information that goes into a PDF should be available in raw form.
Politician Information
- Voting records
- Expenses
- Finances
Service Information
- Power grid disruptions
- Critical incident
- Zoning (i.e. what’s the zoning information for this location)
- Density / population / demographics
- Parking information (e.g. what’s the parking rules; how do I get a parking permit, etc.)
- Real-time pollution:
- Water quality
- Air quality
- Tagged, available as feeds
- Historic information
- Traffic
- Roads
- street conditions
- Trains
- Utilization rates
- Roads
- Sewer / waterflow data; apparently sensors are already in place for this
Complaints
- be able to add data into the system
- be able to track that information
Tourism Information
- event dates, locations, price; e.g. Nuit Blanche
- standard information for tour operators
Budget Information
- spreadsheets
- all the raw data in PDFs should be available as XLS/CSV
- be able to trace the evolution of data from its source; follow back up the chain
Tendering Information
- What is up for tender
- What tenders have been awarded
- Make interaction with city more efficient and open
311 Information
- Track whether services were successful
- Raw feeds
- Ticketing system (i.e. issue tracking)
- Turnaround time
Community Group Information
Information to empower groups, enable spontaneous community formation…
- Mayor’s initiatives
- Bike lanes
- Deal with language issues
- Schools
- What assets are available (pools, gyms)
- Parks & rec
- Open spaces
- Community centres
- Community health centres
My issues with OAuth
The other day I twittered Chris Messina about OAuth:
@factoryjoe #OAuth is an incomprehensible mess. Programming a Python client to connect to a service has never been so hard
These are the details of my experience, plus suggestions about how to fix the problems I’ve encountered.
The username/password gold standard
To interact with a service like Twitter’s API, I need three pieces of information: a username, a password and the API endpoint I want to use. Once I have this information I can use a standard library in almost any language to start using the Twitter API – I am up and running within 45 seconds. Now, don’t mistake this for me thinking that giving up your username and password to a third party service is a good idea: it’s horrible. However, the other part – being up and running within a few minutes – is critical from a programming usability point of view. No bucks, no Buck Rogers; no API users, no API usage.
It took me a day and a half (albeit of scattered hours) to get OAuth to work for me. To put this in perspective, I had Google’s authentication system – including recoding urllib2 to deal with PUT and 301/302 errors – usable in about 2 hours.
What I’ve discovered is that OAuth is almost as easy to use as HTTP Basic Authentication (the username / password scheme above); the issue is the confusing way OAuth currently presents information to developers. I have documented my coding experiences here and the code is freely available for use and perusal here (though you’ll really be buying into a web resources model that might not be your thing if you do).
The informational issues I had – with suggested fixes – are documented below.
Critical OAuth information is poorly packaged
This is what you need to know to access an OAuth API:
- A “consumer key”
- A “consumer secret”
- An “authorization URL”
- A “token URL”
- An “access token URL”
Not to mention a list of API URLs that are the API end points. Note that all of these items are defined by terminology unique to OAuth and thus unfamiliar to the new developer. Now, try going to Fire Eagle and getting all of that information, and when you’re finished that head on over to PostRank to do the same. If you’re clever, it’ll take you 5 minutes but more likely (especially if you’ve never seen OAuth before) it’ll take you about 10 minutes and you’re likely to have got something wrong. Did you catch that Fire Eagle has similar looking but not quite the same URLs? Does PostRank use “http://” or “https://” for “standard request paths“? Did you know that PostRank also has two entirely different hostnames for URLs in its APIs? If you didn’t, well, you’ll probably be revisiting your lists.
Here’s what I suggest: OAuth should recommend that every OAuth Service Provider return the following JSON dictionary (in a TEXTAREA or PRE) in the place where they’re currently returning the consumer key & secret:
{
    "api_uri": [
        "http://www.postrank.com/myfeeds/",
        "http://www.postrank.com/user/",
        "http://api.postrank.com/"
    ],
    "oauth_access_token_url": "http://www.postrank.com/oauth/access_token",
    "oauth_authorization_url": "http://www.postrank.com/oauth/authorize",
    "oauth_token_url": "http://www.postrank.com/oauth/request_token",
    "oauth_consumer_key": "XXXXXXXXXXXXXXXXXXXXXX",
    "oauth_consumer_secret": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_version": "1.0"
}
That’s it: one single piece of information that can be read without difficulty by every single language modern developers use and can be copied and pasted in a single operation by the developer.
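To make the "read without difficulty" claim concrete, here is a minimal sketch of what consuming such a pasted blob might look like in modern Python. The names `REQUIRED_KEYS` and `load_descriptor` are hypothetical, and treating every listed key as mandatory is an assumption of this sketch rather than anything OAuth specifies.

```python
import json

# Key names follow the suggested dictionary; requiring all of them
# is an assumption made for this sketch.
REQUIRED_KEYS = (
    "api_uri",
    "oauth_access_token_url",
    "oauth_authorization_url",
    "oauth_token_url",
    "oauth_consumer_key",
    "oauth_consumer_secret",
)


def load_descriptor(text):
    """Parse a pasted descriptor and fail loudly if anything is missing."""
    descriptor = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in descriptor]
    if missing:
        raise ValueError("descriptor is missing: " + ", ".join(missing))
    return descriptor


pasted = """
{
  "api_uri": ["http://api.postrank.com/"],
  "oauth_access_token_url": "http://www.postrank.com/oauth/access_token",
  "oauth_authorization_url": "http://www.postrank.com/oauth/authorize",
  "oauth_token_url": "http://www.postrank.com/oauth/request_token",
  "oauth_consumer_key": "XXXX",
  "oauth_consumer_secret": "XXXX"
}
"""

descriptor = load_descriptor(pasted)
print(descriptor["oauth_token_url"])
```

One copy, one paste, one parse: there is no opportunity to mistype a hostname or mix up http and https.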
I’d also like to note that Fire Eagle has the most useful OAuth developer documentation I’ve seen and the OAuth team should consider adopting it wholesale as their own.
The OAuth website is confusing
The front page of the OAuth website is very promising. There’s a big round button-like area that says “For Consumer developers…” and another that says “For Service Provider developers…”. From there things go rapidly downhill. Neither of these items is a button. Instead, one clicks on the “Get Started…” link; from there you examine a list of other links and then start reading about how OAuth got its name after the sound Blaine Cook’s college roommate’s brother’s cat used to make horking up furballs or whatever. Honestly: no one cares at this stage. What I’d like to do is click on a “Consumer developers” link and start seeing a concrete example of what I need to do to interact with an OAuth-enabled service. All the other stuff is filler.
One further point: it’d be nice to see a proper logo.
The OAuth website needs an API playground
My final recommendation is that OAuth provide on their website a live sample API Service Provider that all the libraries interact with out of the box. It’s difficult enough to get an API working without wondering whether the problem is on my side or on their side.
Creating OPML subscription lists using Pipe Cleaner
Here’s a neat API I completed this morning, called api_feeds. It takes a URL (or a list of them) and transforms them into:
- the home page associated with the URL
- the feed(s) for the URL
- the name of the home page
If you’re following along at home, this is essentially the information needed for a single outline in an OPML subscription list.
Here’s a simple python example:
import pprint
import api_feeds

api = api_feeds.OneFeed()
api.request = {
    "uri" : "http://code.davidjanes.com/blog/2009/01/23/transparently-working-with-oauath/",
}
pprint.pprint(api.response, width = 1)
And here’s what the output looks like:
{'link': u'http://code.davidjanes.com/blog',
 'links': [{'href': u'http://feeds.feedburner.com/DavidJanesCode',
            'rel': 'alternate',
            'type': u'application/rss+xml'}],
 'title': u"David Janes' Code Weblog"}
There’s actually quite a bit going on here behind the scenes, most of it using code I didn’t initially write but have quite heavily hacked: the Universal Feed Parser and the Feed Finder.
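For readers who want a feel for the feed discovery step, here is a rough standard-library sketch in modern Python. It is not how api_feeds, the Universal Feed Parser or the Feed Finder work internally; `FeedLinkParser` and `discover` are hypothetical names, and the sketch only looks at `<link rel="alternate">` tags in a page's HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = ("application/rss+xml", "application/atom+xml")


class FeedLinkParser(HTMLParser):
    """Collects the page title and any <link rel="alternate"> feed links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "alternate" in rel and attrs.get("type") in FEED_TYPES and attrs.get("href"):
                self.links.append({
                    "href": urljoin(self.base_url, attrs["href"]),
                    "rel": "alternate",
                    "type": attrs["type"],
                })

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def discover(html, base_url):
    """Return a dict shaped like the api_feeds output shown above."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return {"link": base_url, "links": parser.links, "title": parser.title.strip()}


sample = """<html><head><title>David Janes' Code Weblog</title>
<link rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/DavidJanesCode">
</head><body></body></html>"""

result = discover(sample, "http://code.davidjanes.com/blog")
print(result)
```

Real pages are messier (relative links, feeds linked only from the body, redirects), which is where the heavier libraries earn their keep.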
What becomes really interesting is what happens when we combine this with other modules. Here’s an example of how we can build an OPML subscription list from all the posts I’ve tagged “python” and “django” in del.icio.us. The code looks up each link I’ve bookmarked, does the feed discovery above, filters out items that don’t have feeds, and outputs as OPML. Note the neat pipeline type aspect to the code:
api_delicious = api_delicious.PostsList(tag = "python django")
api_many = api_feeds.ManyFeeds(require_feed = True)
api_opml = api_opml.OPMLWriter()

api_many.items = api_delicious.items
api_opml.items = api_many.items

print api_opml.Produce()
Producing the following OPML:
<opml encoding="utf-8" version="2.0">
<head>
<title>[Untitled]</title>
</head>
<body>
<outline htmlUrl="http://push.cx"
rssUrl="http://push.cx/feed"
text="Push cx"
type="rss"/>
<outline htmlUrl="http://crankycoder.com"
rssUrl="http://crankycoder.com/feed/"
text="crankycoder.com"
type="rss"/>
<outline htmlUrl="http://blog.dowski.com"
rssUrl="http://blog.dowski.com/feed/"
text="the occasional occurrence"
type="rss"/>
<outline htmlUrl="http://www.b-list.org/feeds/entries/"
rssUrl="http://feeds2.feedburner.com/b-list-entries"
text="The B-List: Latest entries"
type="rss"/>
<outline htmlUrl="http://blog.thescoop.org"
rssUrl="http://blog.thescoop.org/feed/"
text="The Scoop"
type="rss"/>
<outline htmlUrl="http://effbot.org"
rssUrl="http://effbot.org/zone/rss.xml"
text="effbot.org"
type="rss"/>
<outline htmlUrl="http://blog.disqus.net"
rssUrl="http://feeds.feedburner.com/BigHeadLabs"
text="Disqus"
type="rss"/>
<outline htmlUrl="http://blog.ianbicking.org"
rssUrl="http://blog.ianbicking.org/feed/atom/"
text="Ian Bicking: a blog"
type="rss"/>
<outline htmlUrl="http://antoniocangiano.com"
rssUrl="http://feeds.feedburner.com/ZenAndTheArtOfRubyProgramming"
text="Zen and the Art of Programming"
type="rss"/>
<outline htmlUrl="http://www.carthage.edu/webdev"
rssUrl="http://www.carthage.edu/webdev/?feed=rss2"
text="carthage webdev"
type="rss"/>
<outline htmlUrl="http://www.eweek.com"
rssUrl="http://www.eweek.com/rss-feeds-13.xml"
text="Application Development - RSS Feeds"
type="rss"/>
<outline htmlUrl="http://jeffcroft.com/"
rssUrl="http://feeds.feedburner.com/jeffcroft/blog"
text="JeffCroft.com: Latest blog entries"
type="rss"/>
</body>
</opml>
This will be just as terse (terser, probably) when written as a Pipe Cleaner script; I’m just struggling over how to introduce the authentication code gracefully into the scripts.
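As an aside, the last stage of that pipeline is easy to approximate with the standard library. The following is a sketch in modern Python, not the actual api_opml module: `build_opml` is a hypothetical name, the input dictionaries are assumed to have the shape api_feeds returns, and the attribute names are simply copied from the OPML output shown earlier.

```python
import xml.etree.ElementTree as ET


def build_opml(items, title="[Untitled]"):
    """Turn a list of {'link', 'title', 'links'} dicts into an OPML string."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for item in items:
        if not item.get("links"):
            continue  # same idea as require_feed = True: skip pages without a feed
        ET.SubElement(body, "outline", {
            "type": "rss",
            "text": item["title"],
            "htmlUrl": item["link"],
            "rssUrl": item["links"][0]["href"],
        })
    return ET.tostring(opml, encoding="unicode")


items = [
    {"link": "http://push.cx", "title": "Push cx",
     "links": [{"href": "http://push.cx/feed"}]},
    {"link": "http://example.com", "title": "No feed here", "links": []},
]

opml_text = build_opml(items)
print(opml_text)
```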
Pipe Cleaner Progress
I’ve made substantial progress in the spare hours I have on Pipe Cleaner recently. The current plan is to spend February documenting and packaging and start selling it to whoever needs it. In your case, the price is almost certainly free, so no worries.
One discovery I’ve made this morning is that the command line application is going to be as important as the web application. This is because certain scripts – as you’ll see in the next post – inherently take a long time to run and they’ll almost certainly cause timeouts on an HTTP interface. I’ve thought about a few ways to work around this, and may yet implement them, but there is almost certainly going to be a command line set of tools.
January 23, 2009
Transparently working with OAuth
This is part one of two posts I’m going to write about OAuth; the second will be somewhat more critical in tone. Before I criticize – and I know it’s hard to put together technologies like OAuth – I want to actually accomplish something with it, so that I at least appear to have somewhat of a clue about it. This is a report of what I’ve done.
bm_uri is a library and tool I’ve written for working with URIs, and in particular http:// and https:// URLs. Here are some of the advantages of using bm_uri over all the normal Python urllib and urllib2 methods:
- downloads are cached; if a URL is temporarily not available, bm_uri will return the cached version, likewise if it has been downloaded in the near past, the cached version will be returned rather than hitting the net again
- downloads can be cooked, meaning converted into a more useful form such as TIDY-cleaned up HTML, JSON, Unicode text and so forth
- bm_uri handles all the protocol stuff for you (such as User-Agent, Last-Modified and so forth) so you don’t have to
- authentication is handled as “invisibly” as possible for you … at least after the initial setup
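The caching behaviour in the first bullet can be sketched in a few lines of modern Python. This is an illustration of the idea only, not bm_uri's implementation: `CachedLoader`, its `max_age` parameter and the injected `fetch` and `clock` callables are all hypothetical.

```python
import time


class CachedLoader:
    """Serve recent downloads from memory; fall back to stale copies on failure."""

    def __init__(self, fetch, max_age=300, clock=time.time):
        self.fetch = fetch        # callable taking a URL and returning the body
        self.max_age = max_age    # seconds a cached copy counts as "recent"
        self.clock = clock
        self._cache = {}          # url -> (timestamp, body)

    def load(self, url):
        cached = self._cache.get(url)
        now = self.clock()
        if cached and now - cached[0] < self.max_age:
            return cached[1]      # downloaded recently: skip the network
        try:
            body = self.fetch(url)
        except OSError:
            if cached:
                return cached[1]  # temporarily unavailable: serve the stale copy
            raise
        self._cache[url] = (now, body)
        return body


calls = []


def fake_fetch(url):
    """Stand-in for a real download: succeeds once, then the 'network' dies."""
    calls.append(url)
    if len(calls) > 1:
        raise OSError("network down")
    return "body of " + url


now = [0.0]
loader = CachedLoader(fake_fetch, max_age=300, clock=lambda: now[0])
first = loader.load("http://example.com/")   # hits fake_fetch
second = loader.load("http://example.com/")  # served from the cache
now[0] = 1000.0
third = loader.load("http://example.com/")   # fetch fails, stale copy returned
print(first, len(calls))
```

Passing the fetcher in as a parameter keeps the sketch testable without a network; a real version would persist the cache to disk.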
Here is an example of accessing an OAuth resource using bm_uri, returning my current location from Fire Eagle as a Python object. From a programming point of view, I believe I have reduced this to close to the minimum number of steps possible. Here’s the setup phase:
import pprint

import bm_cfg
import bm_oauth
import bm_uri

bm_cfg.cfg.initialize()
bm_oauth.OAuth(service_name = "fireeagle")
Here’s using it in code – note how there’s no reference to OAuth here whatsoever.
loader = bm_uri.JSONLoader('https://fireeagle.yahooapis.com/api/0.1/user.json?format=json')
loader.Load()
pprint.pprint(loader.GetCooked())
And here’s the output of the program:
{u'stat': u'ok',
u'user': {u'location_hierarchy': [{u'best_guess': True,
u'geometry': {u'coordinates': [-79.418426513699998,
43.731891632100002],
u'type': u'Point'},
u'id': 572261,
u'label': None,
u'level': 1,
u'level_name': u'postal',
u'located_at': u'2008-03-19T04:09:30-07:00',
...
u'name': u'Canada',
u'normal_name': None,
u'place_id': u'EESRy8qbApgaeIkbsA',
u'woeid': 23424775}],
u'readable': True,
u'writable': False}}
Gather information
The devil is in the details, obviously, and with OAuth the little satan is doing the initial setup. Here’s how I did this for Fire Eagle – there’ll be something analogous for whatever service you are using:
- Log in or sign up (obviously)
- Go to the Developers’ Page
- Click on Create a New App
- Copy the “Consumer Key” and the “Consumer Secret” … these will be long-ish strings of nonsense
- Find out the Request Token URL, the Access Token URL, and the Authorization URL. These are public knowledge and for Fire Eagle are:
- https://fireeagle.yahooapis.com/oauth/request_token
- https://fireeagle.yahooapis.com/oauth/access_token
- http://fireeagle.yahoo.net/oauth/authorize
Note how Yahoo has conveniently made that last URL similar looking to the others, but not quite the same. Thanks!
However you implement OAuth, you’re probably going to need to be able to persist information to disk or database. As documented here several weeks ago, we already have that covered with our bm_cfg module. In ~/.cfg/fireeagle.json, create the following JSON format file:
{
    "fireeagle": {
        "api_uri" : "https://fireeagle.yahooapis.com/",
        "oauth_access_token_url": "https://fireeagle.yahooapis.com/oauth/access_token",
        "oauth_authorization_url": "http://fireeagle.yahoo.net/oauth/authorize",
        "oauth_consumer_key": "ABCDEFGHIJKL",
        "oauth_consumer_secret": "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345",
        "oauth_token_url": "https://fireeagle.yahooapis.com/oauth/request_token"
    }
}
The only new item here is the api_uri: that’s the prefix of URLs that bm_uri will use OAuth with.
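Conceptually, that prefix check is all that is needed to decide whether a request should be signed. Here is a sketch in modern Python of how such a lookup might work; `service_for` is a hypothetical helper, not part of bm_uri, and it accepts either a single prefix string or a list of them.

```python
def service_for(url, cfg):
    """Return the name of the configured service whose api_uri prefixes url."""
    for name, settings in cfg.items():
        prefixes = settings.get("api_uri", [])
        if isinstance(prefixes, str):
            prefixes = [prefixes]  # allow a single string or a list
        if any(url.startswith(prefix) for prefix in prefixes):
            return name
    return None


cfg = {
    "fireeagle": {
        "api_uri": "https://fireeagle.yahooapis.com/",
    },
}

matched = service_for("https://fireeagle.yahooapis.com/api/0.1/user.json?format=json", cfg)
unmatched = service_for("http://example.com/", cfg)
print(matched, unmatched)
```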
Set it up
Next you have to do all sorts of OAuth stuff to actually work with OAuth. If the why interests you, please go read the spec! I’m more of a how person myself, and this is what we need to do:
- run: python bm_uri.py --service fireeagle --authorize
- this will pop up a browser window; grant your application access and then…
- run: python bm_uri.py --service fireeagle --exchange
And that’s it – you should now be able to just work with the Fire Eagle API in bm_uri without even having to know OAuth is there!
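For anyone curious about what is being hidden, the core of every OAuth 1.0 request is an HMAC-SHA1 signature over a "signature base string". The following modern Python sketch shows that one step as described in the OAuth Core 1.0 spec; it is not bm_oauth's code, the function names are hypothetical, the keys and tokens are illustrative values, and URL normalization (lowercasing the host, dropping default ports) is left out.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def encode(value):
    """Percent-encode as OAuth 1.0 requires: only unreserved characters survive."""
    return quote(str(value), safe="-._~")


def signature_base_string(method, url, params):
    """METHOD & encoded URL & encoded, sorted, '&'-joined parameters."""
    pairs = sorted((encode(k), encode(v)) for k, v in params.items())
    normalized = "&".join(k + "=" + v for k, v in pairs)
    return "&".join([method.upper(), encode(url), encode(normalized)])


def sign(method, url, params, consumer_secret, token_secret=""):
    """HMAC-SHA1 over the base string, keyed by both secrets, base64 encoded."""
    key = encode(consumer_secret) + "&" + encode(token_secret)
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode("ascii"), base.encode("ascii"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Illustrative values only; real keys and tokens come from the service.
params = {
    "file": "vacation.jpg",
    "size": "original",
    "oauth_consumer_key": "dpf43f3p2l4k3l03",
    "oauth_nonce": "kllo9940pd9333jh",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1191242096",
    "oauth_token": "nnch734d00sl2jdk",
    "oauth_version": "1.0",
}

base = signature_base_string("GET", "http://photos.example.net/photos", params)
signature = sign("GET", "http://photos.example.net/photos", params,
                 consumer_secret="kd94hf93k423kf44",
                 token_secret="pfkkdhi9sl3r4s00")
print(base)
print(signature)
```

The resulting value is sent as the oauth_signature parameter alongside the others.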
End notes
January 19, 2009
Atom as a Rosetta Stone for WORK objects
WORK – Web Object Records – is a way of describing messages we pass over the web: a single header object called the “meta” and zero or more objects called “items”. Each object can be encoded as a JSON record, though we can access individual items within each WORK object using a WORK Path which allows quite a bit of latitude for type coercion and vagaries in packaging.
Pipe Cleaner is a project I’ve been working on for the last two months that allows one to script data using WORK, to accomplish tasks such as remixing and filtering RSS feeds, read or produce OPML, make JSON interfaces and so forth. I actually have one live deployment which I will blog about soon and hope to have it beta productized for March.
Atom is a standard for syndicating feeds, not dissimilar to RSS but with a richer, better-described vocabulary. I already have one major “project” built around Atom: the hAtom microformat for describing microcontent and information that can be syndicated. hAtom has also been morphed by Microsoft to produce the Web Slice format, so you may be seeing that about. Atom conforms to WORK: there’s a “feed” meta header and zero or more “entry” items.
With Pipe Cleaner I’m trying not only to make a way where feeds and other data can be remixed, but also make it easy to do so! To do that, I’ve decided that by default, even though you are working with (say) OPML or RSS, we’ll translate all the terms to their Atom equivalents as best as possible. You’ll have to read the spec yourselves, but here’s a quick rundown of common elements, not all required by any means:
- author, with possible sub-fields uri and email
- content – the body
- summary – a summary of the body; currently my feeling is that content & summary must always be HTML
- updated – when last updated
- created – when created, assume to be updated if not present
- link – the main URI
- links – for alternate URIs (this is a variance from the Atom spec; it should be easy to find the main URI for an element; I may reconsider this before release)
- id – a unique identifier
- category – tags, encoded in a sub-field term
Note that I’m not slavish about making the output conformant to all the SHOULDs, MUSTs, etc. that are in the Atom spec: my pragmatic programming approach says “do the best we can” and if the user needs better, they can walk the extra mile.
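To give a flavour of the translation, here is a small sketch in modern Python of mapping one RSS item onto the Atom-style vocabulary listed above. The `RSS_TO_ATOM` table and `rss_item_to_atom` are hypothetical and are not taken from Pipe Cleaner; the particular field pairings are an assumption of this sketch, and dates are passed through unconverted.

```python
# Assumed pairings of RSS 2.0 item fields with Atom-style names.
RSS_TO_ATOM = {
    "title": "title",
    "link": "link",
    "description": "summary",
    "pubDate": "updated",
    "guid": "id",
}


def rss_item_to_atom(item):
    """Translate one RSS item dict into an Atom-flavoured dict, best effort."""
    entry = {}
    for rss_key, value in item.items():
        if rss_key == "category":
            values = value if isinstance(value, list) else [value]
            entry["category"] = [{"term": term} for term in values]
        elif rss_key == "author":
            entry["author"] = {"email": value}  # RSS 2.0 authors are email addresses
        elif rss_key in RSS_TO_ATOM:
            entry[RSS_TO_ATOM[rss_key]] = value
    # As in the list above: created is assumed to equal updated if not present.
    if "updated" in entry:
        entry.setdefault("created", entry["updated"])
    return entry


item = {
    "title": "Hello",
    "link": "http://example.com/hello",
    "description": "<p>Hi</p>",
    "pubDate": "Mon, 19 Jan 2009 12:00:00 GMT",
    "guid": "http://example.com/hello",
    "category": ["python", "atom"],
}

entry = rss_item_to_atom(item)
print(entry)
```

Unknown fields are silently dropped here, which is one reading of “do the best we can”.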
Here are some examples of data that’s been run through Pipe Cleaner, translating to Atom upon input and translating back to whatever is needed upon output. The JSON (actually pretty printed JSON) output is the most instructive for what’s going on inside Pipe Cleaner.
RSS Feed
OPML Data
Note how the OPML is “flattened”, with hierarchy being encoded into the Category. This can be turned off if needed.
hCard microformat (in HTML)
Note the neat namespacing in the RSS output. The OPML is almost devoid of useful information, further consideration is needed.
hCalendar microformat (in HTML)
Similar to hCard. We’ll probably also (or exclusively) encode the hCalendar data in an xCal extension.
hAtom microformat (in HTML)
hAtom -> RSS is basically turning an hAtom page into a feed!
Source example
Since no blog post is complete without a little source code, here’s a Pipe Cleaner script to parse the hCard document. If you’re following closely, the output format is selected by the user at runtime. All the other scripts are of similar terseness.
import module:api_microformat; api_microformat.HCard uri:"http://tantek.com/" to:items meta:meta;
January 9, 2009
Thinking about Configuration
Happy New Year, everyone. I’ve been busy at paying work recently, plus cleaning up and testing existing code I’ve been discussing here over the last few months. At work I’ve been developing in WebObjects, which though a lovely platform is not the way of the future so I’m not documenting many of my experiences here.
The applications I’ve been working on recently, Pipe Cleaner and GenX, need – like most applications – configuration. This will store information which can be safely exposed to the public, such as my Google Maps API key, and information that I need to keep private within the application, such as my Freebase username and password (cf. however the password anti-pattern). Furthermore, though the code I’m writing is in Python it is possible that the code that provides the UI will be written in another language, such as PHP inside of WordPress.
Given these considerations, here are my design choices:
- configuration files are stored as multiple individual files inside a directory (or directories)
- configuration files are in JSON, and contain a dictionary of dictionaries (see below)
- configuration files can be marked as private or public
- the same logical configuration (say for Amazon, which has both public and private information) can be in a public and private file
- the configuration is global, but is accessed through setter/getter properties
- non-global versions of the configuration can be made
That all said, here’s what I’ve written. First, the setters and getters:
class Cfg:
    _cfg_private = {}
    _cfg_public = {}

    @apply
    def public():
        def fget(self):
            return self._cfg_public

        return property(**locals())

    @apply
    def private():
        def fget(self):
            return self._cfg_private

        return property(**locals())
As an aside, I’m not 100% sure about Python decorators and wonder if my favorite language is being turned into a C++ like mess.
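For comparison, the built-in property can itself be used as a decorator, which avoids the @apply and locals() indirection. This is a sketch in modern Python rather than a drop-in replacement: unlike the class above it keeps the dictionaries per instance instead of on the class, which is an assumption about what is wanted.

```python
class Cfg(object):
    def __init__(self):
        # Per-instance storage, so non-global configurations stay separate.
        self._cfg_private = {}
        self._cfg_public = {}

    @property
    def public(self):
        """Read-only access to the public configuration."""
        return self._cfg_public

    @property
    def private(self):
        """Read-only access to the private configuration."""
        return self._cfg_private


cfg = Cfg()
cfg.private["freebase"] = {"username": "someone"}
print(cfg.private, cfg.public)
```

Because no setter is defined, assigning to cfg.public raises AttributeError, though the dictionary it returns can still be mutated.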
Next, the ‘add’ function that adds information to the configuration ensuring private and public are handled correctly. Note that there can be multiple dictionaries inside of ‘d’, but ‘d’ is either all Public or not.
    def add(self, d):
        if type(d) != types.DictType:
            raise TypeError("only dictionaries can be added")

        if d.get('@Public'):
            #
            #   Public definitions never overwrite private definitions
            #
            for key, value in d.iteritems():
                if type(value) != types.DictType:
                    continue

                if not self._cfg_private.has_key(key):
                    self._cfg_private[key] = value

                self._cfg_public[key] = value
        else:
            self._cfg_private.update(d)
And finally the loader, which gets everything in a directory or one level down. Note the ‘exception’ parameter which makes me a bad person, but I don’t like code failing unless I tell it to.
    def load(self, path, exception = False, depth = 0):
        try:
            if os.path.isdir(path) and depth < 2:
                for file in os.listdir(path):
                    # pass depth along so the recursion really stops one level down
                    self.load(os.path.join(path, file), exception = exception, depth = depth + 1)
            elif os.path.isfile(path):
                if path.endswith(".json"):
                    self.add(json.loads(bm_io.readfile(path)))
        except:
            if exception:
                raise

            Log("ignoring exception", exception = True, path = path)
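The “directory or one level down” rule is easy to check in isolation. Here is a standalone sketch in modern Python that mirrors only the traversal, not the error handling; `load_json_tree` and its `max_depth` parameter are hypothetical, and the temporary directory exists just to exercise the depth cutoff.

```python
import json
import os
import tempfile


def load_json_tree(path, depth=0, max_depth=2):
    """Return parsed .json files found at path, descending at most one level."""
    loaded = []
    if os.path.isdir(path) and depth < max_depth:
        for name in sorted(os.listdir(path)):
            loaded.extend(load_json_tree(os.path.join(path, name), depth + 1, max_depth))
    elif os.path.isfile(path) and path.endswith(".json"):
        with open(path, encoding="utf-8") as handle:
            loaded.append(json.load(handle))
    return loaded


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub", "deeper"))
    for relative in ("a.json", "sub/b.json", "sub/deeper/c.json"):
        with open(os.path.join(root, relative), "w", encoding="utf-8") as handle:
            json.dump({"file": relative}, handle)
    found = [entry["file"] for entry in load_json_tree(root)]

print(found)
```

The file two levels down is never read, because the directory containing it is reached at depth 2 and is not descended into.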
And one more thing: make the global configuration:
cfg = Cfg()
Here’s how you use it:
import pprint
import sys

import bm_cfg

# setup ... on a per-file or directory basis
for file in sys.argv[1:]:
    bm_cfg.cfg.load(file)

# use it
pprint.pprint({
    "private" : bm_cfg.cfg.private,
    "public" : bm_cfg.cfg.public,
}, width = 1)
Here’s what my configuration directory looks like:
$ pwd
/Users/davidjanes/Sites/pc/cfg
$ ls
amazon.json         freebase.json       praized.json
amazon.public.json  gmaps.json          yahoo.json
Here’s the (private) amazon.json:
{
"amazon" : {
"Locale" : "us",
"AccessKeyID" : "0......",
"AssociateTag" : "ona-20",
"Private" : "Don't See"
}
}
And here’s the (public) amazon.public.json:
{
"@Public" : 1,
"amazon" : {
"Locale" : "us",
"AccessKeyID" : "0......",
"AssociateTag" : "ona-20"
}
}
Note that if the private version of the Amazon file wasn’t available, the public version would also be in the private one. I.e. the private configuration basically is “everything” (noting possible exceptions above in the code).
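The merge rule can be demonstrated with the two Amazon files. This is a sketch in modern Python that restates the logic of add as a free function over two plain dictionaries; the function signature is an assumption made to keep the example self-contained, and the key values are shortened placeholders.

```python
def add(private, public, d):
    """Merge one loaded configuration file into the private/public dictionaries."""
    if not isinstance(d, dict):
        raise TypeError("only dictionaries can be added")
    if d.get("@Public"):
        for key, value in d.items():
            if not isinstance(value, dict):
                continue                    # skips the "@Public" marker itself
            private.setdefault(key, value)  # never overwrite a private entry
            public[key] = value
    else:
        private.update(d)


amazon_private = {"amazon": {"Locale": "us", "AssociateTag": "ona-20", "Private": "Don't See"}}
amazon_public = {"@Public": 1, "amazon": {"Locale": "us", "AssociateTag": "ona-20"}}

# Load both files, in either order: private keeps the full private entry.
private, public = {}, {}
add(private, public, amazon_public)
add(private, public, amazon_private)

# Load only the public file: its contents show up in private as well.
only_private, only_public = {}, {}
add(only_private, only_public, amazon_public)

print(private["amazon"], public["amazon"], only_private["amazon"])
```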