David Janes' Code Weblog

October 24, 2008

AUMFP – The Almost Universal Microformats Parser

aumfp,python,semantic web · David Janes · 8:49 am ·

I’ve completely refreshed the Almost Universal Microformats Parser up on Google Code. Changes from the previous (very old) version include:

  • Tarballs available
  • Much better handling of Internationalized Characters
  • Many improvements to parsing
  • Simplified iterator interface (see below)
  • Spun off the support library files into their own library, called PyBM. If you’re using the tarballs this won’t be an issue

Microformat support includes:

  • hCard
  • hCalendar
  • hAtom
  • hListing
  • hResume
  • rel-tag
  • xfolk

There’s also an additional ‘hdocument’ parser that treats an arbitrary webpage like the other parsers, returning information such as feeds, links, images and so forth.

Use

Using the parser is simple:

import hcard
import pprint

# Parse the page at page_uri; each hCard found comes back as a dictionary.
parser = hcard.MicroformatHCard(page_uri='http://tantek.com')
for d in parser.Iterate():
    pprint.pprint(d)

The ‘d’ returned is an extended Python ‘dict’. Because we capture information about classes within paths, there’s no guarantee about how a key is going to be named. For example, a phone number could be keyed ‘tel’ or ‘tel.home’ (or a number of other things). Our dictionary class, ‘mfdict’, provides a family of ‘find’ functions to pull out values. For example, this will pull out the least dot-specified telephone number:

tel = d.find('tel')
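
To make "least dot-specified" concrete, here is a rough sketch of that kind of lookup over a plain ‘dict’. It is not the actual ‘mfdict’ code: the helper name ‘find_least_specific’, the alphabetical tie-breaking rule and the sample record are all assumptions made up for illustration.

```python
def find_least_specific(record, name):
    """Return the value keyed by `name` or `name.<qualifier>`.

    Hypothetical helper, not part of AUMFP: among matching keys,
    the one with the fewest dots wins.
    """
    candidates = [
        key for key in record
        if key == name or key.startswith(name + ".")
    ]
    if not candidates:
        return None
    # Fewest dots first; ties are broken alphabetically so the
    # result is deterministic (an assumption of this sketch).
    best = min(candidates, key=lambda key: (key.count("."), key))
    return record[best]


# Made-up sample data, shaped like the dotted keys described above.
record = {
    "tel.home": "555-0100",
    "tel.work.fax": "555-0101",
    "fn": "Example Person",
}
print(find_least_specific(record, "tel"))  # prints 555-0100
```

With a bare ‘tel’ key present, that key would win over ‘tel.home’, since it has no dots at all.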

We also add special keys beginning with an ‘@’ for well-known, otherwise interesting or commonly used fields, to save you the trouble of figuring this information out yourself. Here’s an example of a parsed hCard (from the example above):

{'@html': u'<address id="hcard" class="vcard author"></address>',
 '@index': 'vcard-36',
 '@loose-uris': [u'http://tantek.com/'],
 '@parents': u'author copyright xoxo',
 '@title': u'Tantek \xc7elik',
 '@uf': 'hCard',
 '@uri': u'http://tantek.com#hcard',
 u'_url': '',
 u'adr.country-name': '',
 u'adr.locality': u'San Francisco',
 u'adr.region': u'CA',
 u'fn': u'Tantek \xc7elik',
 u'logo': u'icon-2007-128px.png',
 'n.family-name': u'\xc7elik',
 'n.given-name': u'Tantek',
 u'photo': u'http://tantek.com/icon-2007-128px.png',
 u'uid': u'Tantek \xc7elik',
 u'url': u'http://feeds.technorati.com/contact/tantek.com/%23hcard'}
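
If you want to treat the ‘@’ keys separately from the microformat properties, something along these lines should do. This is only a sketch over a plain ‘dict’: the helper ‘split_special_keys’ is a hypothetical name, not something AUMFP provides, and the sample card is a trimmed copy of the output above.

```python
def split_special_keys(record):
    """Split a parsed record into ('@' keys, everything else).

    Hypothetical helper for illustration; not part of AUMFP.
    """
    special = {}
    properties = {}
    for key, value in record.items():
        if key.startswith("@"):
            special[key] = value
        else:
            properties[key] = value
    return special, properties


# A trimmed copy of the parsed hCard shown above.
card = {
    "@uf": "hCard",
    "@uri": "http://tantek.com#hcard",
    "fn": "Tantek \xc7elik",
    "adr.locality": "San Francisco",
    "adr.region": "CA",
}

special, properties = split_special_keys(card)
print(sorted(special))     # ['@uf', '@uri']
print(sorted(properties))  # ['adr.locality', 'adr.region', 'fn']
```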

