David Janes' Code Weblog

November 11, 2008

WORK – Web Object Records

ideas,pybm,python,semantic web,work · David Janes · 11:37 am ·

Introduction

As technologists, we’re all familiar with REST – Representational State Transfer:

Representational state transfer (REST) is a style of software architecture for distributed hypermedia systems such as the World Wide Web. As such, it is not strictly a method for building what are sometimes called “web services.” The terms “representational state transfer” and “REST” were introduced in 2000 in the doctoral dissertation of Roy Fielding, one of the principal authors of the Hypertext Transfer Protocol (HTTP) specification.

REST talks about how we address and use information on the World Wide Web. I’d like to introduce the concept of WORK (Web Object Records), which defines how we think about data being transmitted across the web.

WORK is not a prescriptive standard: it is not telling you what to do, it’s describing what you are already doing. The hope is that with a clearly delineated description of what we are doing, we can write tools that cut through the babel of API standards currently being promulgated by a multitude of vendors; we can standardize the unstandardized.

Definition

A WORK item:

  • is conceptually a JSON-like dictionary, consisting of string keys and object values
  • each value in the dictionary is a (usually-) shallow JSON-like object, that is:
    • a dictionary, list or basic value type
  • the basic value types are Unicode strings, floating point numbers, integers and booleans
  • the difference between strings and other basic value types is fuzzy (data encoded in XML, HTML form data)
  • null/None is rarely sent explicitly; instead it is implied by the absence of a value
  • the difference between a list of objects and a single object is fuzzy and fluid (XML children)
  • the data model defined implicitly by “what you see” is as useful as formal definition elsewhere
  • there are no cycles or explicit ways of cross referencing within a WORK item
  • WORK items can be, and often are, nested within another WORK item, but only one level deep
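To make the definition a bit more concrete, here’s a minimal Python sketch of a checker for it, plus a best-effort coercion for the “fuzzy strings” point. The names is_work_value, is_work_item and coerce_scalar are made up for illustration, and the max_depth limit of 4 is an assumed stand-in for “usually shallow”; none of this comes from an existing library.

```python
from typing import Any

# The basic value types named in the definition (bool is a subclass of int).
BASIC_TYPES = (str, int, float, bool)


def is_work_value(value: Any, max_depth: int = 4) -> bool:
    """Return True if value is a basic type, or a shallow list/dict of them.

    max_depth is an assumed limit standing in for "usually shallow"; it also
    means a self-referencing (cyclic) structure is rejected, not followed.
    """
    if isinstance(value, BASIC_TYPES):
        return True
    if max_depth <= 0:
        return False
    if isinstance(value, list):
        return all(is_work_value(v, max_depth - 1) for v in value)
    if isinstance(value, dict):
        return all(
            isinstance(k, str) and is_work_value(v, max_depth - 1)
            for k, v in value.items()
        )
    return False  # None, sets, custom objects and so on are not WORK values


def is_work_item(obj: Any) -> bool:
    """A WORK item is a dictionary with string keys and JSON-like values."""
    return isinstance(obj, dict) and is_work_value(obj)


def coerce_scalar(text: str) -> Any:
    """Best-effort guess at the basic type hiding inside a string."""
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text  # leave anything else as a plain string
```

With this sketch, is_work_item({"title": "x", "links": [{"url": "..."}]}) is true, a dictionary holding an explicit None is rejected, and coerce_scalar("300") gives the integer 300 while coerce_scalar("Island") stays a string.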

Benefits

The fact that we technologists inherently use a WORK model of data explains:

  • why we prefer XML over CSV – because we like to store more than a single atomic value in a “cell”
  • why we prefer JSON to XML – because we think about data as JSON-like WORK objects, not as nested text constructs
  • why we don’t adopt RDF (in its variants) for transmitting data, implementing APIs and so forth – because we don’t think in graphs
  • why we find it easier to work with web data in Python and Ruby than in Java – because those languages explicitly use the same model for storing data as we think about the data

Examples

Here are a few examples of how one can view common API / feed results as WORK items.

RSS feeds

RSS is defined by a two level WORK hierarchy. The first level is:

{
  "channel" : CHANNEL-WORK,
  "item" : [ ITEM-WORK, ITEM-WORK, ... ]
}

An ITEM-WORK looks like:

{
  "title" : STRING,
  "link" : STRING,
  "description" : STRING
}

If you look at the XML for an RSS feed with only one ITEM, there’s no way to tell without reading the spec that ITEM repeats. This is what we mean by saying that the difference between a single object and a list is sometimes fuzzy.
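One way a tool might paper over that fuzziness is to normalize “one or many” into a list before using it. This is only a sketch: as_list and item_titles are hypothetical helper names, and the two feed dictionaries are invented sample data shaped like the RSS WORK items above.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Normalize a missing value, a single object, or a list into a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def item_titles(feed: dict) -> List[str]:
    """Collect item titles whether "item" holds one ITEM-WORK or several."""
    return [item["title"] for item in as_list(feed.get("item"))]


# Invented sample data: a feed with a single item, as a naive XML-to-dict
# conversion might produce it, and a feed with two items.
feed_one = {
    "channel": {"title": "Example feed"},
    "item": {"title": "Only post", "link": "http://example.com/1", "description": "..."},
}
feed_many = {
    "channel": {"title": "Example feed"},
    "item": [
        {"title": "First", "link": "http://example.com/1", "description": "..."},
        {"title": "Second", "link": "http://example.com/2", "description": "..."},
    ],
}

print(item_titles(feed_one))
print(item_titles(feed_many))
```

The calling code never needs to know which shape it was handed, which is the point of treating the list/single distinction as fluid.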

White Pages API

The White Pages API is also a two level WORK hierarchy (this pattern is very common). Here’s the first level, slightly more complicated than RSS due to the XML serialization:

{
 "meta" : META-WORK,
 "listings" : {
   "listing" : [ LISTING-WORK, LISTING-WORK, ... ]
 }
}

A LISTING-WORK looks like:

{
  "geodata" : OBJECT,
  "phonenumbers" : OBJECT,
  "business" : { "businessname" : "Fred's Pizza" },
  "address" : OBJECT
}

The OBJECTs above in the White Pages API are somewhat complicated, but tractable (as we shall see in another post).
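The extra "listings" / "listing" layer in the first level is the kind of XML wrapper a tool could unwrap mechanically. Here’s a rough sketch in Python; unwrap_children is a hypothetical helper, and the response dictionary is invented sample data (only “Fred’s Pizza” comes from the example above), not real White Pages output.

```python
from typing import Any, List


def unwrap_children(wrapper: Any, child_key: str) -> List[Any]:
    """Return the repeated children hiding inside an XML-style wrapper dict."""
    if not isinstance(wrapper, dict):
        return []  # wrapper missing or not a dictionary
    children = wrapper.get(child_key, [])
    return children if isinstance(children, list) else [children]


# Invented sample response shaped like the first-level WORK item above.
response = {
    "meta": {},
    "listings": {
        "listing": [
            {"business": {"businessname": "Fred's Pizza"}},
            {"business": {"businessname": "Hypothetical Hardware"}},
        ]
    },
}

names = [
    listing.get("business", {}).get("businessname")
    for listing in unwrap_children(response.get("listings"), "listing")
]
print(names)
```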

Amazon AWS API

The Amazon Associates Web Service allows one to retrieve information about Amazon products via XML responses. The response is a little convoluted but still recognizable:

{
 "Items" : {
   "RequestHeader" : REQUEST-HEADER-WORK,
   "Item" : [ ITEM-WORK, ITEM-WORK, ... ]
 },
 "OperationRequest" : { ... }
}

Each individual ITEM-WORK describes a product:

{
 "ASIN" : STRING,
 "ImageSets": {
   "ImageSet": {
     "LargeImage": {
       "URL": "http://ecx.images-amazon.com/images/I/31e55zf53VL.jpg",
       "Width": "300",
       "Height": "300"
     }
   }
 },
 "ItemAttributes": {
   "Title": "Under a Blood Red Sky - Deluxe Edition CD/DVD",
   "Manufacturer": "Island",
   "ProductGroup": "Music",
   "Artist": "U2"
 }
}

Google search result

We can also look at HTML pages as if they’re returning data as WORK items. This could be explicit if rules such as microformats or RDFa were used, or once again it could be just a convenient way of modeling the data. Here’s a hypothetical WORK item for a single result returned from a Google search:

{
 "title" : "Bombardier Inc. - Bombardier - Home",
 "url" : "http://www.bombardier.com/",
 "description" : "Manufacturers of a large range of regional...",
 "links" : [
  {
   "title" : "Careers",
   "url" : "..."
  },
  {
   "title" : "Business Aircraft",
   "url" : "..."
  },
  ...
 ]
}

Conclusion

WORK gives us a powerful way of looking at, and simplifying, data that’s retrieved over the Internet via REST calls. If we can view API results as being made up of standardized components (WORK items), then the amount of work we need to do to work with new APIs can be minimized.

Designing and writing some of these tools is my next task.

October 28, 2008

Amazon's OpenSearch: mostly useless

search,semantic web · David Janes · 8:28 am ·

As part of a broader project I’m working on, I decided to see if there’s a way I could easily get search results from the web in a machine-readable fashion. One project to facilitate this is Amazon/A9’s OpenSearch. Alas, it’s useless:

  • No big web search provider has signed on to provide machine-readable results, including A9/Alexa! A9 will aggregate search results from different OpenSearch providers for you; it just won’t let you use Alexa’s results elsewhere (search for Alexa on that page)
  • Even if you were to buy into the search aggregation approach, many (most?) of the sources are dead now. A little pruning wouldn’t hurt here, guys! (search for IMDB on that page)

I wouldn’t be tempted to offer my search results in OpenSearch format, because who’s going to use it after I put in the work? And if all that’s available as search sources are mostly broken C- and D-list sites, well, who cares? It’s a fringe benefit, but not one that I’m looking for, and likely neither are you. You’d think that Amazon would use Alexa search results in OpenSearch to “prime the pump”, but I guess being the Nth-placed web search service is good enough for them.

Note that there’s a great argument for simply marking up search results with hAtom and using rel=next to navigate to the next page of results, but that’s a topic for another day.

If I have any of my facts wrong here, I apologize in advance: the documentation kind of sucks. I’m also sure there’s some difference between A9, Alexa and Amazon – I really just don’t have the time to work it out.

More style updates

administrivia,html / javascript,semantic web · David Janes · 6:36 am ·

I’ve added hAtom to this weblog’s template: you can see a parsed version here. I’ve also updated the comments to be prettier.

Next, to figure out what this gravatar stuff is and to expand the blogroll.

October 25, 2008

AUMFP – Demo

aumfp,demo,python,semantic web · David Janes · 1:13 pm ·

I now have the AUMFP up as a demo page. Here are a few examples:

October 24, 2008

AUMFP – The Almost Universal Microformats Parser

aumfp,python,semantic web · David Janes · 8:49 am ·

I’ve completely refreshed the Almost Universal Microformats Parser up on Google Code. Changes from the (very old) previous version include:

  • Tarballs available
  • Much better handling of Internationalized Characters
  • Many improvements to parsing
  • Simplified iterator interface (see below)
  • Spun off the support library files into their own library called PyBM. If you’re using the tarballs this won’t be an issue

Microformat support includes:

  • hCard
  • hCalendar
  • hAtom
  • hListing
  • hResume
  • rel-tag
  • xfolk

There’s also an additional ‘hdocument’ parser that treats an arbitrary webpage like the other parsers, returning information such as feeds, links, images and so forth.

Use

Using the parser is simple:

import hcard
import pprint

parser = hcard.MicroformatHCard(page_uri = 'http://tantek.com')
for d in parser.Iterate():
  pprint.pprint(d)

The ‘d’ returned is an extended Python ‘dict’. Because we capture information about classes within paths, there’s no guarantee about how a key is going to be named. For example, a phone number could be keyed ‘tel’ or ‘tel.home’ (or a number of other things). Our dictionary ‘mfdict’ provides a number of functions called ‘find’ to pull out values. For example, this will pull out the least dot-specified telephone number:

tel = d.find('tel')
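To illustrate what “least dot-specified” could mean, here’s a rough standalone sketch. It is not the actual mfdict code, only a guess at the behaviour described: find_least_specific is a hypothetical function, the alphabetical tie-break is an assumption, and the sample card is invented.

```python
from typing import Any, Mapping, Optional


def find_least_specific(d: Mapping[str, Any], name: str) -> Optional[Any]:
    """Return the value whose key is `name`, or `name.something`, with the fewest dots."""
    candidates = [k for k in d if k == name or k.startswith(name + ".")]
    if not candidates:
        return None
    # Fewest dots wins; ties are broken alphabetically (an assumption).
    best = min(candidates, key=lambda k: (k.count("."), k))
    return d[best]


# Invented sample data, keyed the way the parser output above is keyed.
card = {
    "fn": "Example Person",
    "tel.home": "555-0100",
    "tel.work.fax": "555-0199",
}

print(find_least_specific(card, "tel"))
```

Here there is no plain ‘tel’ key, so the lookup falls back to ‘tel.home’ as the key with the fewest dots.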

We also add special keys beginning with an ‘@’ for well known, additionally interesting or commonly used fields, to save you the trouble of figuring this information out yourself. Here’s an example parsed hCard (from the example above):

{'@html': u'<address id="hcard" class="vcard author"></address>',
 '@index': 'vcard-36',
 '@loose-uris': [u'http://tantek.com/'],
 '@parents': u'author copyright xoxo',
 '@title': u'Tantek \xc7elik',
 '@uf': 'hCard',
 '@uri': u'http://tantek.com#hcard',
 u'_url': '',
 u'adr.country-name': '',
 u'adr.locality': u'San Francisco',
 u'adr.region': u'CA',
 u'fn': u'Tantek \xc7elik',
 u'logo': u'icon-2007-128px.png',
 'n.family-name': u'\xc7elik',
 'n.given-name': u'Tantek',
 u'photo': u'http://tantek.com/icon-2007-128px.png',
 u'uid': u'Tantek \xc7elik',
 u'url': u'http://feeds.technorati.com/contact/tantek.com/%23hcard'}

September 27, 2008

Zap2it: where's my read/write web?

ideas,semantic web · David Janes · 9:00 am ·

Zap2it is a TV listings service. I’m not a big TV guy but it’s useful for me to look up a few things once in a while, such as F1 racing, sailing programs, NFL football and other mindless pursuits.

Here’s a listing for a BBC show tomorrow on Paul Cayard. I have little confidence that this link will still work in a week’s time, let alone a year, but let’s leave that alone for now. Zap2it provides a way of saving it to “My Favorites” but I’m not really interested in signing up just yet.

  • Why can’t I access this as an iCalendar (or an hCalendar!) so I can add this to my Google Calendar or my Apple iCal program? I can see why Zap2it would like to retain customers as accounts for monetization purposes, but I’m more likely to remain a loyal Zap2it user if it provides functionality I need. Otherwise, some day someone else will do it for me.
  • Once I’ve made an account, why can’t I access all my favorites as an iCal object or even export them directly into Google Calendar? Then I could share that calendar with my friends (i.e. the gang I get together with on the weekend to watch football) and bring more people to Zap2it!

Another rant on the account creation process:

  • Zap2it doesn’t allow you to register valid email addresses in the format myname+zap2it@example.com! Come on guys, get it together. This is especially important because:
  • Zap2it makes you check a box to opt out of partner spam; if you’re not a form reader, you could find yourself getting useless information that you had no desire to receive in the first place. Sigh.
  • After I create my account, I’m not logged in. I then have to re-find the show I was trying to “Save to Favorites”; try to bookmark that; be forced to login; be brought to my account page; re-find the show again; and then bookmark. High comedy!