David Janes' Code Weblog

February 28, 2009

AUAPI: the Atom core vocabulary for items

auapi · David Janes · 7:58 am ·

To quickly review, the AUAPI sees API results as composed of two parts: a “response” (formerly the “meta”) and “items”, a list of one or more items. This division is documented in Web Object Records.

Here is how we use Atom to encode API items in the AUAPI (we will document the “response” in a different post). This sticks fairly closely to the Atom standard, the differences being in how each term is serialized into JSON for convenience and ease of use – and that there are no strictly required elements. Don’t forget that each “item” is a discrete unit returned from an API and represents “whatever” – a Flickr photo, an Amazon product and so forth. A minimal hypothetical item is sketched just after the field list below.

  • title (string, plain text) – the title; this is a plain text string, i.e. no entities or HTML allowed
  • id (string, plain text) – a unique ID (in the context of the API being called) for this result
  • content (string, HTML) – the complete text; this is HTML
  • summary (string, HTML) – the summary text; this is HTML.
  • link (string, URL) – this is the “main” link of whatever the item represents, a page on Amazon, a blog post’s original HTML and so forth
  • links (array of dictionary) – these are other links related to the item; the format is documented below
  • category (array of dictionary) – tags/categories; the format is documented below
  • author (a string or dictionary) – if a dictionary, the format is documented below
  • updated (string, Atom datetime format) – when the item was last updated
  • posted (string, Atom datetime format) – when the item was originally posted; if using updated and/or posted, always use updated and then use posted only if you have a meaningful difference.
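
For illustration, here is a minimal sketch of what a single hypothetical item might look like, in the same JSON-ish style as the example at the end of this post – the values are made up, and remember that none of the elements are strictly required:

{'id': '12345',
 'title': 'A hypothetical blog post',
 'link': 'http://www.example.com/posts/12345',
 'summary': 'A <b>short</b> HTML summary',
 'content': '<p>The complete HTML content of the post</p>',
 'updated': '2009-02-28T07:58:00Z',
 'posted': '2009-02-27T09:00:00Z'}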

Links Format

links are other links related to the item; the most important link should be encoded in link. links is a list of dictionaries, each dictionary containing (further documentation):

  • href (string, URI)
  • rel (string, enumeration)
  • type (string, MIME type)
  • hreflang (string, language code)
  • title (string, plain text)
  • length (string, integer)

Category Format

category holds the tags for the item; it is a list of dictionaries – an example follows the list – each dictionary containing (further documentation):

  • @ (string, plain text) – the tag
  • rel (string, from enumeration)
  • scheme (string, URI)
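
For example, a hypothetical item tagged twice might carry the following (the tag names and scheme URI are made up):

'category': [{'@': 'astrophysics',
              'scheme': 'http://www.example.com/tags'},
             {'@': 'telescopes'}]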

Author Format

The author can be encoded as a string (containing the author’s name), as a dictionary, or as a list of dictionaries. If a dictionary, this is how it should be encoded (further documentation) – examples follow the list:

  • @ (string, plain text) – the author’s name
  • uri (string, URI)
  • email (string, email address)
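
For example, any of these three encodings would be acceptable for a hypothetical author (the name, URI and email address are made up):

'author': 'Jane Example'

'author': {'@': 'Jane Example',
           'uri': 'http://www.example.com/~jane',
           'email': 'jane@example.com'}

'author': [{'@': 'Jane Example'}, {'@': 'John Example'}]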

Example

This is an API result from Whitepages.com, encoded as AUAPI. I have removed non-Atom items and shortened long text for clarity.

{'content': '<div class="vcard">\n<div class="fn">Jack Smith</div>...\n</div>\n',
 'id': '40b296d1a95f3b379a8108b27daf009c',
 'links': [{'href': 'http://www.whitepages.com/16176/t...',
            'rel': 'related',
            'title': 'Find Neighbours',
            'type': 'text/html'},
           {'href': 'http://www.whitepages.com/16176/track...',
            'rel': 'related',
            'title': 'Map',
            'type': 'text/html'},
           {'href': 'http://www.whitepages.com/16...',
            'rel': 'related',
            'title': 'Map',
            'type': 'text/html'},
           {'href': 'http://www.whitepages.com/16176/track/102...',
            'rel': 'alternate',
            'title': 'Whitepages.com',
            'type': 'text/html'}],
 'title': u'Jack Smith'}

February 27, 2009

Introducing The Almost Universal API

auapi · David Janes · 9:54 am ·

The Almost Universal API is a culmination – or at least a local maximum – of several projects I’ve been working on for the last few months: in particular, Web Object Records, Pipe Cleaner and PyBM. The AUAPI is:

  • a way of presenting results returned from many popular APIs
  • a Python library to actually do this

I’ll be making several posts about how to use the AUAPI, including installation instructions. The plan is to make an easy_install version, but initially this will be an SVN-from-Google-Code thing.

The AUAPI is mainly about how to present results returned from APIs, not how to send data to APIs nor how to encode requests. The encoding is designed to “look good” in JSON and be easily and algorithmically encoded into XML. The AUAPI data model is based on:

  • Atom, the “core” vocabulary, particularly providing title, content, summary, updated, category, link and links
  • MediaRSS, for encoding images
  • hCard, for encoding information about people
  • hCalendar, for encoding information about events

There are several “maybe” standards too:

  • hProduct, for encoding information about things
  • Google’s SGN URLs, for providing a universal way of talking about accounts

I have already worked a fair number of APIs into the AUAPI. These are documented on the Mashematica Wiki.

February 22, 2009

What is the framework for public APIs?

ideas,semantic web · David Janes · 3:55 pm ·

This post was originally sent to the ChangeCamp mailing list in response to a question about “what framework should we use for public APIs?”.

The core “frameworks” are POSH, REST and JSON. POSH is “Plain Old Semantic HTML”, meaning websites should be developed using modern web standards, pages should validate and use HTML elements correctly, and presentation should be coded using CSS. REST can have deeper implications, but amongst the simplest is that pages can be retrieved using simple GET requests against well-known URLs. JSON has emerged as the de facto standard for returning API results; among the reasons for this are the simplicity of creating mashups and its embeddability.

Atom and/or RSS provide the framework for update notifications. There are emerging technologies for real-time delivery, but it’s too early to worry about that.

Microformats provide a framework for embedding well-understood objects in HTML, are based on popular and well-understood standards, are easy(-ish) to implement, and a “consumer” ecosystem exists. In particular, people can be represented by hCard, events by hCalendar, tagged data by rel-tag and microcontent (articles within a page) by hAtom. Note that no parallel infrastructure need exist to do microformats: they are served within HTML pages.

Identity should use OAuth and OpenID; pragmatism says Facebook Connect and Google Friend Connect should be in the mix too, though I have a number of reservations about those.

I am very non-bullish about RDF, particularly as a model for delivering data in well-defined formats. IMHO it has almost entirely missed the mashup wave of the last few years, and its successes seem scattered at best. RDFa is competing in microformats’ “space” and may yet see success if it starts providing concrete solutions rather than “here’s a format that can do anything”, especially given microformats’ process issues.

February 21, 2009

Using Pipe Cleaner to convert CSV list of Science Journals to an OPML subscription list

demo,pipe cleaner · David Janes · 3:43 pm ·

Here’s a Pipe Cleaner script that takes this text list of Science Journals and converts it into an OPML subscription list (here):

import module:api_csv.CSV;

CSV uri:'http://www.tictocs.ac.uk/text.php' delimeter:'\t';

items := map value:$items map:{
    "title" : "{{ C1 }}",
    "links" : {
        "href" : "{{ C2 }}",
        "type" : "text/xml",
        "rel" : "alternate",
    }
};

I decided not to use the “header name” feature of the CSV command because I had to remap anyway to create the links object. This has to be run with the following command (or from the web UI):

pc --format opml science-journals
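
If it helps to see the same transformation spelled out in plain Python, here is a rough sketch – this is not how Pipe Cleaner works internally, and it assumes that C1 and C2 refer to the first and second tab-delimited columns (title and feed URI respectively); the OPML serialization that pc --format opml performs is only hinted at in the final comment:

import csv
import urllib

# fetch the same tab-delimited journal list the CSV command reads
raw = urllib.urlopen("http://www.tictocs.ac.uk/text.php").readlines()

items = []
for row in csv.reader(raw, delimiter="\t"):
    if len(row) < 2:
        continue

    # column 1 is (assumed to be) the journal title, column 2 the feed URI
    items.append({
        "title": row[0],
        "links": {
            "href": row[1],
            "type": "text/xml",
            "rel": "alternate",
        },
    })

# pc --format opml would then serialize each of these items as an
# OPML <outline> element (roughly the title plus the feed URL)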

Of course, the resulting list is a little unwieldy in size, so maybe you only want journals with “Astrophysics” in their title:

import module:api_csv.CSV;

CSV uri:'http://www.tictocs.ac.uk/text.php' delimeter:'\t';

items := map value:$items map:{
    "title" : "{{ C1 }}",
    "links" : {
        "href" : "{{ C2 }}",
        "type" : "text/xml",
        "rel" : "alternate",
    }
};

items := search value:$items for:"Astrophysics";

Cool, eh? Not only that, this can be run entirely from the Web Interface with selectable strings, so (theoretically) a Pipe Cleaner user would have an API to this data.

February 5, 2009

Turning garbage "HTML" into XML parsable XHTML using Beautiful Soup

html / javascript,python · David Janes · 6:56 am ·

Here’s our problem-child HTML: Members of Provincial Parliament. Amongst the atrocities committed against humanity, we see:

  • use of undeclared namespaces in both tags (<o:p>) and attributes (<st1:City w:st="on">)
  • XML processing instructions – incorrectly formatted! – dropped into the middle of the document in multiple places (<?xml:namespace prefix = "o" ns = "urn:schemas-microsoft-com:office:office" />)
  • leading space before the DOCTYPE

This is so broken that even HTML Tidy chokes on it, producing a severely truncated file. This broken document did, however, give me an opportunity to play with the Python library Beautiful Soup, which lists amongst its advantages:

  • Beautiful Soup won’t choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don’t have to create a custom parser for each application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.

Alas, straight out of the box Beautiful Soup didn’t do it for me, perhaps because of some of my strange requirements (my data flow works something like this: raw document → XML → DOM parser → JSON). However, Beautiful Soup does provide the necessary calls to manipulate the document to do the trick. Here’s what I did:

First, we import Beautiful Soup and parse the raw document into the object soup. We’re expecting an html node at the top, so we check for that.

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(raw)

if not hasattr(soup, "html"):
	return

Next, we loop through every node in the document, using Beautiful Soup’s findAll interface. You will see several variants of this call here in the code. What we’re looking for is use of namespaces, which we then add to the HTML element as attributes using fake namespace declarations.

We need to find namespaces already declared:

used = {}
for ns_key, ns_value in soup.html.attrs:
	if not ns_key.startswith("xmlns:"):
		continue

	used[ns_key[6:]] = 1

Then we look for ones that are actually used:

nsd = {}
for item in soup.findAll():
	name = item.name
	if name.find(':') > -1:
		nsd[name[:name.find(':')]] = 1

	for name, value in item.attrs:
		if name.find(':') > -1:
			nsd[name[:name.find(':')]] = 1

Then we add all the missing namespaces to the HTML node.

for ns in nsd.keys():
	if not used.get(ns):
		soup.html.attrs.append(( "xmlns:%s" % ns, "http://www.example.com#%s" % ns, ))

Next we look for attributes that aren’t proper XML attribute declarations, e.g. HTML-style valueless attributes like <input checked />.

for item in soup.findAll():
	for index, ( name, value ) in enumerate(item.attrs):
		if value == None:
			item.attrs[index] = ( name, name )

Then we remove all nodes from the document that we aren’t expecting to see. If you keep the script tags, you’re going to have to make sure that each one is properly CDATA-encoded; I didn’t care about this, so I just remove them.

[item.extract() for item in soup.findAll('script')]
[item.extract() for item in soup.findAll(
    text = lambda text:isinstance(text, BeautifulSoup.ProcessingInstruction ))]
[item.extract() for item in soup.findAll(
    text = lambda text:isinstance(text, BeautifulSoup.Declaration ))]

In the final step we convert the document to Unicode. This requires one more piece of post-processing: html2xml changes all entity references that XML doesn’t recognize into the &#...; numeric style – e.g. we do change &nbsp; but we don’t change &amp;. At this point we have a document that can be processed by standard DOM parsers (if you convert to UTF-8 bytes, sigh).

cooked = unicode(soup)
cooked = bm_text.html2xml(cooked)
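
For the curious, here is a rough sketch of the kind of entity rewriting a call like bm_text.html2xml has to do, plus the final hand-off to a standard DOM parser. The function below is an illustration under my assumptions, not the actual bm_text code:

import re
import htmlentitydefs
from xml.dom import minidom

# the five entity names XML itself understands; everything else gets rewritten
XML_ENTITIES = set(["amp", "lt", "gt", "quot", "apos"])

def html2xml_sketch(text):
    # e.g. "&nbsp;" becomes "&#160;", while "&amp;" is left alone
    def fix(match):
        name = match.group(1)
        if name in XML_ENTITIES:
            return match.group(0)
        codepoint = htmlentitydefs.name2codepoint.get(name)
        if codepoint is None:
            return match.group(0)
        return "&#%d;" % codepoint

    return re.sub(r"&(\w+);", fix, text)

# standard DOM parsers want UTF-8 bytes, sigh
dom = minidom.parseString(html2xml_sketch(unicode(soup)).encode("utf-8"))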

February 2, 2009

And we're back!

Uncategorized · David Janes · 2:46 pm ·

This should be up and running on our new host with a new database.
