David Janes' Code Weblog

October 22, 2009

hAtom hits the big time

microformats,semantic web · David Janes · 2:14 pm ·

From the Read/Write Web:

Earlier this year, the Associated Press, together with the Media Standards Trust, introduced hNews, a new microformat for describing news content. HNews allows publishers to easily attach machine-readable news semantics to content on the web. Today, the AP announced the completion of the first draft of hNews. In addition, TownNews, announced that is will support hNews in its BLOX content management system, which is being used by over 1,500 newspapers in the US.

HNews, which is an extension of the hAtom format, only requires content users to specify information about the source organization. In addition, publishers can specify geo-information, a dateline element, license information and information about the code of ethics that governed the behavior of the author of a given site. At its most basic level, hNews, just like other microformats like hCard or hCalendar, allows search engines spiders to identify and read semantic information that would otherwise be buried within a text and would be hard to identify for search engines.

The RWW article then goes on to posit some ideas about this being related to AP’s efforts to track use of their content across the web. This strikes me as rather farfetched, as stripping out the microformat tags is beyond trivial. What makes this exciting for me is that it makes it more likely that search engines will start recognizing hAtom tags and thus start properly indexing blogs and other microcontent.

In other exciting hAtom-related news, WordPress 2.7 has the post_class function to allow (new) templates to automatically include the hentry tag on blog posts! Also see the Smashing Magazine article on this.

August 10, 2009

Travel Websites & Web 3.0

Discover Anywhere Mobile,ideas,semantic web · David Janes · 2:53 pm ·

On my Discover Anywhere Mobile blog, I’ve posted a list of recommendations about how travel websites can use information to extend their reach.

March 2, 2009

AUAPI: encoding hCards in JSON

auapi,aumfp,semantic web · David Janes · 9:15 am ·

The best model for describing people is the vCard standard, RFC 2425 and RFC 2426. The microformats community has adapted the vCard standard for serialization into HTML using hCard. In the Almost Universal API (AUAPI), people and organizations should almost always be described using a JSON-encoded hCard.

It is difficult to describe, without going into great minutiae, the difficulties in transforming the hCard and vCard standards into a pleasant-looking and, more importantly, easy-to-use hierarchy: there are certainly a number of edge cases that one has to deal with! There’s certainly an argument for just encoding hCards/vCards as a straight vCard serialization – at least in terms of simplicity of encoding. The issue is that the end consumer (who I believe should be the strongest focus) then has to do the dirty work of grouping everything together themselves.

Algorithm

This algorithm is destructive to the data structure it works upon, so generally you’ll want to make a copy first.

  • note that although we refer to hCard attributes in upper case, mixed case, camel case and so forth, all attributes are physically encoded in lower case with “-” separators
  • let the “groupers” be ADR, GEO, N, ORG, TEL. Groupers group together attributes that are related (such as FirstName and LastName)
  • let the “narrowers” be Home, Work, Parcel, Postal (and no-narrower). Narrowers assign a specific meaning to a value, e.g. this is a Work phone number.
  • assume each value is described by a number of attributes, e.g. “416-515-5555” can be described by ( TEL, Work, Mobile )

Then:

  • for each Narrower, then for each Grouper
    • create a dictionary ‘subd’
    • for each value that is described by the ( Narrower, Grouper )
      • for each remaining attribute (besides Narrower and Grouper), add to subd
      • if the value was fully described by ( Narrower, Grouper ), add to subd under the key ‘@’
    • for key, value in subd
      • add to the final result
      • if narrower is not ‘no-narrower’, add ‘@narrower = narrower’
    • add subd to the result under the key Grouper
  • add all remaining values from the original hCard to the result, noting that
    • if the value is described by a Narrower, we encode it as a dictionary with ‘@narrower = narrower’

Clear? Well, the examples below will help. With the “416-515-5555” above we would get:

{
 "hcard:hcard" : {
  'tel' : {
   '@work' : 'work',
   'mobile' : '416-515-5555',
  }
 }
}
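As a rough illustration only (this is not the AUMFP code), the grouping steps can be sketched in Python 3. The input shape, a list of (value, attributes) pairs, and the name `decompose_sketch` are assumptions made for this example. It keeps a single sub-dictionary per Grouper, follows the ‘@work’ : ‘work’ shape shown in the examples, and ignores the edge cases mentioned earlier.

```python
GROUPERS = ("adr", "geo", "n", "org", "tel")
NARROWERS = ("home", "work", "parcel", "postal")


def decompose_sketch(values):
    """Group (value, attributes) pairs into a nested dictionary.

    Hypothetical helper: 'values' is a list of (value, attributes) pairs,
    where 'attributes' is a collection of lower-case attribute names.
    """
    result = {}
    for value, attributes in values:
        attributes = set(attributes)
        grouper = next((g for g in GROUPERS if g in attributes), None)
        narrower = next((n for n in NARROWERS if n in attributes), None)
        # Attributes left over once the grouper and narrower are removed.
        remaining = sorted(attributes - {grouper, narrower})

        if grouper is None:
            # Ungrouped values (e.g. 'fn') go straight into the result.
            for key in remaining:
                result[key] = value
            continue

        subd = result.setdefault(grouper, {})
        if narrower is not None:
            # Mirrors the '@work': 'work' shape used in the examples.
            subd["@" + narrower] = narrower
        if remaining:
            for key in remaining:
                subd[key] = value
        else:
            # Fully described by (narrower, grouper): store under '@'.
            subd["@"] = value
    return result


card = decompose_sketch([
    ("416-515-5555", {"tel", "work", "mobile"}),
    ("Jack Smith", {"fn"}),
])
wrapped = {"hcard:hcard": card}
```

Because two values with the same Grouper and the same remaining attribute would overwrite each other here, a fuller version would need the per-Narrower handling described in the steps above.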

Code

The source code for this algorithm is in the AUMFP tree, in the file vcard.py, function decompose (see around line 1083).

Namespace

All JSON-encoded hCards are in the namespace hcard:. In the AUAPI serialization, this namespace should only be on the enclosing element; all children will be assumed to be in the namespace. I am currently using the URI http://purl.org/uF/hCard/1.0/ for this namespace (when XML serializing); this may change in the future.

Example 1 – home phone number from whitepages.com

{
 'hcard:hcard': {'adr': {'country-name': u'United States',
                         'locality': u'Huntsville',
                         'postal-code': '35801-2908',
                         'region': 'Alabama',
                         'street-address': u'1114 Humes Avenue NE'},
                 'fn': u'Jack Smith',
                 'geo': {'latitude': 34.743763000000001,
                         'longitude': -86.572568000000004},
                 'n': {'family-name': u'Smith', 'given-name': u'Jack'},
                 'tel': {'voice': u'256-539-8788'}},
}

Example 2 – work phone number from whitepages.com

{ 'hcard:hcard': {'adr': {'country-name': u'United States',
                         'locality': u'Gurley',
                         'postal-code': '35748-8715',
                         'region': 'Alabama',
                         'street-address': u'148 Little Cove Road'},
                 'fn': u'Jack Smith',
                 'geo': {'latitude': 34.698258000000003,
                         'longitude': -86.383027999999996},
                 'n': {'family-name': u'Smith', 'given-name': u'Jack'},
                 'org': {'organization-name': u'Alldyne Powder Technoliges'},
                 'tel': {'@work': 'work', 'voice': u'256-776-1238'}},
}

Example 3 – hCard directly to JSON

{ 'hcard:hcard': {
                 'adr': {u'country-name': u'United States of America',
                         u'locality': u'San Francisco',
                         u'region': u'CA'},
                 u'fn': u'Tantek \xc7elik',
                 u'logo': u'icon-2007-128px.png',
                 'n': {'family-name': u'\xc7elik',
                       'given-name': u'Tantek'},
                 u'photo': u'http://tantek.com/icon-2007-128px.png',
                 u'url': u'http://feeds.technorati.com/contact/tantek.com/#hcard'},
}

February 22, 2009

What is the framework for public APIs?

ideas,semantic web · David Janes · 3:55 pm ·

This post was originally sent to the ChangeCamp mailing list in response to a question about “what framework should we use for public APIs?“.

The core “frameworks” are POSH, REST and JSON. POSH is “Plain Old Semantic HTML”, meaning websites should be developed using modern web standards, pages should validate and use HTML elements correctly, and presentation is coded using CSS. REST can have deeper implications, but amongst the simplest is that pages can be returned using simple GET statements against well-known URLs. JSON has emerged as the de facto standard for returning API results; among the reasons are its simplicity for creating mashups and its embeddability.

Atom and/or RSS provide the framework for update notifications. There are emerging technologies for real-time delivery, but it’s too early to worry about that.

Microformats provide a framework for embedding well-understood objects in HTML, are based on popular and well-understood standards, are easy(-ish) to implement, and a “consumer” ecosystem exists. In particular, people can be represented by hCard, events by hCalendar, tagged data by rel-tag and microcontent (articles within a page) by hAtom. Note that no parallel infrastructure need exist to do microformats: they are served within HTML pages.
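To make the “served within HTML pages” point concrete, here is a minimal Python 3 sketch that pulls a couple of hCard properties out of an HTML fragment using only the standard library. The class name `HCardSketchParser`, the sample markup and the short property list are assumptions for illustration; a real microformats parser handles far more (nesting, URLs in attributes, multiple cards per page).

```python
from html.parser import HTMLParser


class HCardSketchParser(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    # Only a handful of properties, chosen for this example.
    PROPERTIES = {"fn", "org", "tel"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, data):
        # The first non-blank text after a property tag is its value.
        if self._current and data.strip():
            self.card[self._current] = data.strip()
            self._current = None


html = (
    '<div class="vcard">'
    '<span class="fn">Jack Smith</span> '
    '<span class="tel">256-539-8788</span>'
    "</div>"
)
parser = HCardSketchParser()
parser.feed(html)
parser.close()
card = parser.card
```

The same page serves both the human reader and the consumer; nothing beyond the class attributes had to be added.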

Identity should use OAuth and OpenID; pragmatism says Facebook Connect and Google Friend Connect should be in the mix too, though I have a number of reservations about those.

I am very non-bullish about RDF, particularly as a model for delivering data of well-defined formats. IMHO it has missed almost entirely the mashup wave of the last few years, and successes seem to be scattered at best. RDFa is competing in the microformats “space” and may see success yet if it starts providing concrete solutions rather than “here’s a format that can do anything”, especially given the microformats process issues.

December 29, 2008

Interesting links from the last month

db,ideas,semantic web · David Janes · 9:29 am ·
  • Aspen – a web server for highly extensible Python-based publication, application, and hybrid websites. A potential alternative to Python’s built-in HTTPServer. MIT license.
  • V8 – V8 is Google’s open source JavaScript engine; written in C++; can run standalone, or can be embedded into any C++ application. I am very excited by this, as allowing users to send JavaScript to the server for execution is an amazingly powerful idea. If anyone knows of a Python wrapper, let me know please. New BSD license.
  • KomodoEdit (a testimonial) – I am going to try this out, though vi/vim will always be my first love (JJ also has an article on using ctags).
  • Virtuoso – an innovative Universal Server platform that delivers an enterprise level Data Integration and Management solution for SQL, RDF, XML, Web Services, and Business Processes. There’s way too much bla bla bla in that sentence, but apparently this is really sweet at handling SPARQL/RDF triples. Kingsley Idehen writes extensively about this on his blog (e.g.).
  • Drizzle – a database optimized for Cloud and Net applications. Way too early to commit to this yet. See The New MySQL Landscape for more interesting goings-on.
  • AuthKit – authentication and authorization toolkit for WSGI applications and frameworks.
  • GeoDjango – a world-class geographic web framework. Lots of great ideas and pointers to libraries in here, even if you’re not planning to use this itself.
  • Disco – an open-source implementation of the Map-Reduce framework for distributed computing. The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks often only in tens of lines of code. Here’s a blog post about the same, with references to vs. Hadoop.
  • On (Python) packaging. Debating distutils, easy_install and pip.

December 11, 2008

A brief survey of Yahoo Pipes as a DQT

demo,djolt,dqt,ideas,semantic web,work · David Janes · 7:19 am ·

Yahoo Pipes is a visual editor of mashups, allowing you to take data from sources on the net, transform them in various interesting ways and output the result as Atom, RSS or JSON. The primary downside of Pipes, of course, is that you’re totally dependent on Yahoo for the infrastructure: it runs at Yahoo, pulling feeds that have to be accessible through the public Internet.

It’s easy to use Pipes: just go to this page and start working with the sample Pipe. You’ll need a Yahoo login ID, but most of us have that anyway. I’ve created an example that uses Yahoo Pipes to feed a Djolt template which you can see here.

We can analyze Pipes in terms of the DQT paradigm we’ve outlined in the previous post.

Data Sources and Queries

Sources and Queries are merged (quite logically) in the Pipes interface. You can read the in-depth documentation here.

  • Fetch CSV
  • Feed Autodiscovery – outputs syndication feeds found on a page (RSS feeds on a CBC page)
  • Fetch Feed
  • Fetch Page – will read a page and parse the contents with a reg
  • Fetch Site Feed – this is the logical combination of Fetch Feed and Feed Autodiscovery
  • Flickr – find images by tag near a location (photos of cats in Toronto)
  • Google Base – look up information in Google Base
  • Item Builder – a way of building new items from existing items
  • Yahoo Local
  • Yahoo Search

Transforms

The operator documentation can be read here.

  • Count
  • Filter
  • Location Extractor – a geocoder that magically looks for locations
  • Loop
  • Regex
  • Rename
  • Reverse
  • Sort
  • Split
  • Sub-element – pulls a particular sub-element of an item and makes that the item. This is very much like WORK path manipulation
  • Tail
  • Truncate
  • Union
  • Unique
  • Web Service

Plus a number of specialized data services, for dealing with elements such as dates.
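For readers who think better in code, a few of these operators can be approximated as plain Python 3 functions over lists of dictionaries. This is only a sketch of the idea, not how Pipes is implemented; the function names and the sample items are made up for the example.

```python
def filter_items(items, predicate):
    """Filter: keep only the items the predicate accepts."""
    return [item for item in items if predicate(item)]


def unique(items, key):
    """Unique: drop items whose value for 'key' was already seen."""
    seen = set()
    result = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            result.append(item)
    return result


def sort_items(items, key, reverse=False):
    """Sort: order items by one field."""
    return sorted(items, key=lambda item: item[key], reverse=reverse)


def truncate(items, count):
    """Truncate: keep the first 'count' items."""
    return items[:count]


# Hypothetical feed items standing in for the output of a source module.
items = [
    {"title": "Cats in Toronto", "published": "2008-12-01"},
    {"title": "Dogs in Ottawa", "published": "2008-12-03"},
    {"title": "Cats in Toronto", "published": "2008-12-01"},
    {"title": "Cats in Vancouver", "published": "2008-12-05"},
]

# Chain the operators the way modules are wired together in a Pipe.
cats = filter_items(items, lambda item: "Cats" in item["title"])
newest = sort_items(unique(cats, "title"), "published", reverse=True)
titles = [item["title"] for item in truncate(newest, 2)]
```

Each function takes a list and returns a new list, which is what makes the operators easy to chain in any order.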

Templates

Pipes does not provide an arbitrary Djolt-like template producing HTML. Instead, it provides a number of pre-made code templates that output well-known data types, including RSS, JSON and Atom (and some stranger choices, like PHP).

December 6, 2008

All your Base are belong to us

db,freebase,semantic web · David Janes · 6:26 am ·

Freebase is a user-editable, user-extensible structured database, a sort of one-stop shop semantic web/Wikipedia application. I started playing with Freebase about a year ago and the application has made significant strides over that period, especially in the usability department. Freebase also provides a very nice API which I’m using in GenX, with the caveat that it’s currently almost useless because of query timeouts.

I just came across the following page on Freebase: http://vancouver.freebase.com/. This page is what Freebase calls a Base, which is a collection of Tables/Views, which are things like “Vancouver Bloggers“, “Mayoral Candidates 2008” and so forth. A Table/View is a list of Topics, which are basically the equivalent of a Wikipedia page. Get all that? It makes sense after a while.

A few observations:

  • Why have I written Table/View above? Because in some places it’s called a Table and other places it’s called a View. Which is it? I’m guessing View but it’s still not 100% clear.
  • I decided to create our own Toronto Base especially for the TorCamp community. Given that you get your own top-level domain name there’s somewhat of an incentive to be a first-mover on this.
  • When you create a Base, it provides a list of suggested Views that can be added. Nice. Unfortunately, it added each View twice. I then had to go delete the duplicate View manually. Not so nice. And then even though I’ve deleted the View it still shows up on a detail page. Sigh.
  • On the plus side, this is all done in a nice Ajaxy way.
  • It’s really not at all obvious how you create a new View. Really not obvious. Here’s the documentation.
  • My initial opinion was that Views seem to be copies, not references: this turns out to be a wrong assumption on my part. Views are in fact (if I got this right) the results of a query on the Freebase db. This means that as more Topics match the View query, they’ll automatically show up. The query is a copy, not a reference, but this is a good thing.
  • The implication is that it’s difficult to create a View that is an arbitrary “bag” of topics. For example, if I want to create a Toronto Bloggers View, I have to actually make sure that all the Topics that will show up are marked with some attribute that can be matched to give them a Toronto-bloggerness quality.
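The difference between a View as a stored query and a View as a hand-picked “bag” can be sketched in a few lines of Python 3. This is not Freebase’s query language; the topic dictionaries, attribute names and function names are invented for the example.

```python
# Hypothetical topics, each described by a few attributes.
topics = [
    {"name": "Example Blog A", "type": "blog", "city": "Toronto"},
    {"name": "Example Blog B", "type": "blog", "city": "Vancouver"},
]


def toronto_bloggers(topic):
    """The View's query: a topic matches if its attributes say so."""
    return topic.get("type") == "blog" and topic.get("city") == "Toronto"


def run_view(query, all_topics):
    """Evaluate the query against the current set of topics."""
    return [topic["name"] for topic in all_topics if query(topic)]


before = run_view(toronto_bloggers, topics)

# A new matching topic appears in the View without editing the View itself.
topics.append({"name": "Example Blog C", "type": "blog", "city": "Toronto"})
after = run_view(toronto_bloggers, topics)
```

The flip side is the one noted in the last bullet: a topic that lacks the matching attributes can never be dropped into the View by hand.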

November 24, 2008

Database roundup

db,ideas,python,semantic web · David Janes · 7:27 am ·

Here’s a few things I was reading about over the weekend.

SQLAlchemy

SQLAlchemy is a full-featured Design Pattern-heavy pythonic database ORM. I am totally going to use this for my next Python SQL database project and may even do some playing with old datasets (using the reflection features, yum) soon. If you are considering doing SQL work on your next Python project, don’t even bother with the usual PEP 249 stuff, start with this.

Note that if you’re working with Django it handles the DB in its own way so SQLAlchemy may be of limited utility.

CouchDB

CouchDB “is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. I couldn’t have written that more succinctly myself, so I didn’t. I qualified the paragraph above on SQLAlchemy by saying I’m going to use it for my next SQL project because I’m really champing at the bit to try CouchDB out. The CouchDB design philosophy – a REST API returning lists of JSON objects – reflects my current design paradigm very closely, and the only question I have is whether it scales in practice to millions of rows.

A caveat is that it’s written in the-cool-nerds-are-doing-it language Erlang, but because you don’t have to interact with that it should be OK for us mortals.

CouchDB is about to officially become a “top level” Apache project, though none of the documentation on the Apache.org site reflects this yet.

Virtuoso

Virtuoso is a “high-performance object-relational SQL database”. It apparently can perform well. I came across it through the Planet RDF aggregator, so this may be something you want to look into if you’re working on an RDF/SPARQL project.

Amazon Web Services Hosted Data Sets

That’s a mouthful, isn’t it? Amazon is offering to host public datasets on EC2 for free. What’s the catch? It will host the data, but you have to pay for the computing resources to use that data in the normal EC2 manner. Still, if you’re using a large public dataset and you’re already EC2-friendly, you might want to consider this program. An even more interesting thought occurs (though I’m not sure if it will fly): if you’re using large amounts of your own data on EC2, you may want to offer it up as a free resource.

There’s more on this by Lidija Davis on Read/Write Web.

November 12, 2008

Work API Teaser II – Praized API

demo,ideas,python,search,semantic web,work · David Janes · 6:46 pm ·

Implementing a merchant search using the Praized API took about 10 minutes (mainly finding the right documentation), using my WORK framework:

class PraizedMerchants(bm_api.API):
    """See: http://code.google.com/p/praized/wiki/A_Second_Tutorial_Search"""

    _uri_base = "http://api.praized.com/apitribe/merchants.xml"
    _meta_path = "community"
    _item_path = "merchants.merchant"
    _page_max_path = 'pagination.page_count'
    _page_max = -1

    def __init__(self, api_key, slug = "apitribe", **ad):
        bm_api.API.__init__(self, api_key = api_key, **ad)

        self._uri_base = "http://api.praized.com/%s/merchants.xml" % slug

    def CustomizePageURI(self, page_index):
        if page_index > 1:
            return  "page=%s" % page_index

Partially hardcoding ‘apitribe’ as a ‘community slug’ is probably a bad idea. Anyhoo, here’s how you call it…

import json
import os

api_key = os.environ["PRAIZED_APIKEY"]
api = PraizedMerchants(api_key = api_key, slug = "david-janess-code")
api.SearchOn(
    q = "Bistro",
    l = "Toronto",
)
for item in api.IterItems():
    print(json.dumps(item, indent = 1))

… and a set of results, somewhat edited, below. I’ll have to figure out what that “permalink” is all about (I’ve edited it to shorten it) … it could be something neat, but I haven’t quite grasped all the ins and outs of what Praized wants to accomplish as a business.

{
 "@Index": 0,
 "@Page": 1,
 "short_url": "http://przd.com/zAU-7",
 "pid": "af5bebd604f3d1517a8113e0a2e8cc58",
 "updated_at": "2008-10-04T20:49:34Z",
 "phone": "(416) 585-7896",
 "permalink":
   ".../praized/places/ca/ontario/toronto/coffee-supreme-bistro?l=Toronto&q=Bistro",
 "name": "Coffee Supreme Bistro",
 "created_at": "2008-10-04T20:49:34Z",
 "location": {
  "city": {
   "name": "Toronto"
  },
  "country": {
   "code": "CA",
   "name_fr": "Canada",
   "name": "Canada"
  },
  "longitude": "-79.384071",
  "regions": {
   "province": "Ontario"
  },
  "postal_code": "M5J 1T1",
  "latitude": "43.646347",
  "street_address": "40 University Avenue"
 }
}
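As a guess at what the paging hook is for, here is a small Python 3 sketch of how a base URI, the search terms and the fragment returned by a CustomizePageURI-style method could be combined into a request URI. The functions `build_page_uri` and `customize_page_uri` are hypothetical and are not part of the WORK framework; the API key is left out because its parameter name isn’t shown here.

```python
from urllib.parse import urlencode


def customize_page_uri(page_index):
    """Return an extra query fragment for pages after the first."""
    if page_index > 1:
        return "page=%s" % page_index
    return None


def build_page_uri(uri_base, query, page_index):
    """Combine the base URI, search terms and optional page fragment."""
    # Sort the terms so the resulting URI is stable between runs.
    uri = uri_base + "?" + urlencode(sorted(query.items()))
    fragment = customize_page_uri(page_index)
    if fragment:
        uri += "&" + fragment
    return uri


base = "http://api.praized.com/apitribe/merchants.xml"
terms = {"q": "Bistro", "l": "Toronto"}
first = build_page_uri(base, terms, 1)
second = build_page_uri(base, terms, 2)
```

With this split, the subclass only has to say how its API spells “page N”, and everything else about building the request stays in one place.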

WORK API Teaser

ideas,python,semantic web,work · David Janes · 9:41 am ·

Following from the concepts I wrote about yesterday, here’s two examples of API parsers using a WORK model.

RSS 2.0

Class definition – that’s the whole thing there!:

class RSS20(API):
    _item_path = "channel.item"
    _meta_path = "channel"

    def __init__(self, uri):
        API.__init__(self)

        self._uri_base = uri

Using it:

api = RSS20(uri = 'http://feeds.feedburner.com/DavidJanesCode')
for item in api.IterItems():
    print("- %s" % item['title'])

Results:

- WORK - Web Object Records
- Syntax Error on Line 1
- Adding MapField to inputEx
- Switching between mapping APIs and universal zoom levels
- How to dynamically load map APIs
- How to use the Google Maps API
- How to use the Microsoft Virtual Earth API
- Tip - how to get your browser’s User Agent
- How to use the MapQuest API
- How to use the Yahoo Maps Service AJAX API
- How to detect internal link jumps
- GenX - first public demonstration
- Amazon’s OpenSearch: mostly useless
- More style updates
- How to do multi-column multilingual full text searching in Oracle
- Tip - fixing broken menus over form on IE6 and IE7
- New style for this weblog
- AUMFP - Demo
- Tip - use mod_rewrite to redirect to subdirectory
- AUMFP - The Almost Universal Microformats Parser
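The dotted paths in the class definition, such as "channel.item", read naturally as a walk through nested dictionaries. A minimal Python 3 sketch of such a lookup follows; `get_path` and the sample record are assumptions for illustration and may not match how WORK resolves paths internally.

```python
def get_path(record, path, default=None):
    """Follow a dotted path such as 'channel.item' through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical parsed feed, shaped like an RSS 2.0 document.
feed = {
    "channel": {
        "title": "David Janes' Code Weblog",
        "item": [
            {"title": "WORK - Web Object Records"},
            {"title": "Syntax Error on Line 1"},
        ],
    }
}

items = get_path(feed, "channel.item", default=[])
titles = [item["title"] for item in items]
missing = get_path(feed, "channel.image.url")
```

Returning a default instead of raising keeps the caller simple when a feed omits an optional element.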

Amazon ECS

This will probably end up replacing PyECS!

Class definition:

class AmazonECS(API):
    _base_query = {
        "Sort" : "relevancerank",
        "Operation" : "ItemSearch",
        "Version" : "2008-08-19",
        "ResponseGroup" : [ "Small", ],
    }
    _uri_base = "http://ecs.amazonaws.com/onca/xml"
    _meta_path = "Items.Request"
    _item_path = "Items.Item"
    _page_max_path = 'Items.TotalPages'
    _item_max_path = 'Items.TotalResults'
    _page_max = -1

    def __init__(self, **ad):
        API.__init__(self, **ad)

    def CustomizePageURI(self, page_index):
        if page_index == 1:
            return

        return  "%s=%s" % ( "ItemPage", page_index )

Using it:

import os

api = AmazonECS(AWSAccessKeyId = os.environ["AWS_ECS_ACCESSKEYID"])
api.SearchOn(
    Keywords = "Larry Niven",
    SearchIndex = "Books",
    Condition = "New",
)
for item in api.IterItems():
    print("- %s" % item['ItemAttributes.Title'])

Results … note that this is fetching many pages of results:

- Fleet of Worlds
- Juggler of Worlds
- Escape from Hell
- Inferno
- N-Space
- The Ringworld Engineers (Ringworld)
- The Draco Tavern
- Legacy of Heorot: Legacy of Heorot
- Footfall
- A WORLD OUT OF TIME (ORBIT BOOKS)
- The Burning City (Hardback)
- Protector
- Burning Tower
- Three Books of Known Space
- Ringworld Throne
- Tales of Known Space: The Universe of Larry Niven
- Scatterbrain
- Ringworld
- Lucifer's Hammer
... (continues) ...