AUMFP – The Almost Universal Microformats Parser

I’ve completely refreshed the Almost Universal Microformats Parser up on Google Code. Changes from the previous (very old) version include:

  • Tarballs available
  • Much better handling of Internationalized Characters
  • Many improvements to parsing
  • Simplified iterator interface (see below)
  • Spun off the support library files into their own library called PyBM. If you’re using tarballs this won’t be an issue

Microformat support includes:

  • hCard
  • hCalendar
  • hAtom
  • hListing
  • hResume
  • rel-tag
  • xfolk

There’s also an additional ‘hdocument’ parser that treats an arbitrary webpage like the other parsers, returning information such as feeds, links, images and so forth.

Use

Using the parser is simple:

import hcard
import pprint

parser = hcard.MicroformatHCard(page_uri = 'http://tantek.com')

# Each iteration yields one parsed hCard found on the page
for d in parser.Iterate():
    pprint.pprint(d)

The ‘d’ returned is an extended Python ‘dict’ called ‘mfdict’. Because we capture information about classes within paths, there’s no guarantee of how a key will be named. For example, a phone number could be keyed ‘tel’ or ‘tel.home’ (or a number of other things). To deal with this, ‘mfdict’ provides a number of ‘find’ functions to pull out values. For example, this will pull out the least dot-specified telephone number:

tel = d.find('tel')
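
To make “least dot-specified” concrete, here is a minimal sketch of how such a lookup could work on a plain ‘dict’. This is not the actual ‘mfdict’ code: the function name ‘find_least_specific’, the tie-breaking rule and the sample data are all assumptions made up for illustration.

```python
def find_least_specific(record, name):
    """Hypothetical helper, not part of AUMFP.

    Return the value stored under `name` or `name.<something>`,
    preferring the matching key with the fewest dots.
    Returns None when no key matches.
    """
    prefix = name + '.'
    candidates = [k for k in record if k == name or k.startswith(prefix)]
    if not candidates:
        return None
    # Fewest dots wins; ties are broken alphabetically (an assumption)
    # so that the result is deterministic.
    best = min(candidates, key=lambda k: (k.count('.'), k))
    return record[best]


# Made-up sample data, shaped like a parsed hCard
card = {
    'fn': u'Example Person',
    'tel.home': u'555-0100',
    'tel.work.pref': u'555-0199',
}

print(find_least_specific(card, 'tel'))
```

Here ‘tel.home’ is chosen over ‘tel.work.pref’ because it has fewer dots, and a bare ‘tel’ key would be chosen over both if it were present.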

We also add special keys beginning with an ‘@’ for well-known, particularly interesting or commonly used fields, to save you the trouble of figuring this information out yourself. Here’s an example parsed hCard (from the example above):

{'@html': u'<address id="hcard" class="vcard author"></address>',
 '@index': 'vcard-36',
 '@loose-uris': [u'http://tantek.com/'],
 '@parents': u'author copyright xoxo',
 '@title': u'Tantek \xc7elik',
 '@uf': 'hCard',
 '@uri': u'http://tantek.com#hcard',
 u'_url': '',
 u'adr.country-name': '',
 u'adr.locality': u'San Francisco',
 u'adr.region': u'CA',
 u'fn': u'Tantek \xc7elik',
 u'logo': u'icon-2007-128px.png',
 'n.family-name': u'\xc7elik',
 'n.given-name': u'Tantek',
 u'photo': u'http://tantek.com/icon-2007-128px.png',
 u'uid': u'Tantek \xc7elik',
 u'url': u'http://feeds.technorati.com/contact/tantek.com/%23hcard'}
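
Since the special keys all share the ‘@’ prefix, they are easy to separate from the microformat properties themselves. The following is only a sketch on a plain ‘dict’. The helper ‘split_special_keys’ is a hypothetical name, not something the parser provides, and the sample record is a trimmed copy of the output above.

```python
def split_special_keys(record):
    """Hypothetical helper, not part of AUMFP.

    Split a parsed record into two dicts: the '@'-prefixed special
    keys and the ordinary microformat property keys.
    """
    special = {}
    properties = {}
    for key, value in record.items():
        if key.startswith('@'):
            special[key] = value
        else:
            properties[key] = value
    return special, properties


# A trimmed copy of the parsed hCard shown above
card = {
    '@uf': 'hCard',
    '@uri': u'http://tantek.com#hcard',
    u'fn': u'Tantek \xc7elik',
    u'adr.region': u'CA',
}

special, properties = split_special_keys(card)
print(sorted(special))
print(sorted(properties))
```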
