David Janes' Code Weblog

December 11, 2008

A brief survey of Yahoo Pipes as a DQT

demo,djolt,dqt,ideas,semantic web,work · David Janes · 7:19 am ·

Yahoo Pipes is a visual editor of mashups, allowing you to take data from sources on the net, transform it in various interesting ways and output the result as Atom, RSS or JSON. The primary downside of Pipes, of course, is that you’re totally dependent on Yahoo for the infrastructure: it runs at Yahoo, pulling feeds that have to be accessible through the public Internet.

It’s easy to use Pipes: just go to this page and start working with the sample example Pipe. You’ll need a Yahoo login ID, but most of us have that anyway. I’ve created an example that uses Yahoo Pipes to feed a Djolt template which you can see here.

We can analyze Pipes in terms of the DQT paradigm outlined in the previous post.

Data Sources and Queries

Sources and Queries are merged (quite logically) in the Pipes interface. You can read in-depth documentation here.

  • Fetch CSV
  • Feed Autodiscovery – outputs syndication feeds found on a page (RSS feeds on a CBC page)
  • Fetch Feed
  • Fetch Page – will read a page and parse the contents with a reg
  • Fetch Site Feed – this is the logical combination of Fetch Feed and Feed Autodiscovery
  • Flickr – find images by tag near a location (photos of cats in Toronto)
  • Google Base – look up information in Google Base
  • Item Builder – a way of building new items from existing items
  • Yahoo Local
  • Yahoo Search

Transforms

The operator documentation can be read here.

  • Count
  • Filter
  • Location Extractor – a geocoder that magically looks for locations
  • Loop
  • Regex
  • Rename
  • Reverse
  • Sort
  • Split
  • Sub-element – pulls a particular sub-element of an item and makes that the item. This is very much like WORK path manipulation
  • Tail
  • Truncate
  • Union
  • Unique
  • Web Service

Plus a number of specialized data services, for dealing with elements such as dates.
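To make the operator list a little more concrete, here is a rough Python 3 sketch of how a few of them (Filter, Unique, Sort, Truncate) might look as functions over a stream of items. This is only an illustration of the idea, not how Pipes is implemented; the function names and the sample items are made up.

```python
from itertools import islice

def filter_items(items, predicate):
    """Filter: keep only items for which predicate(item) is true."""
    return (item for item in items if predicate(item))

def unique(items, key):
    """Unique: drop items whose `key` field has already been seen."""
    seen = set()
    for item in items:
        value = item.get(key)
        if value not in seen:
            seen.add(value)
            yield item

def sort_items(items, key, reverse=False):
    """Sort: order items by one field (this has to consume the whole stream)."""
    return iter(sorted(items, key=lambda item: item[key], reverse=reverse))

def truncate(items, n):
    """Truncate: keep only the first n items."""
    return islice(items, n)

# Hypothetical feed items, invented for this example.
items = [
    {"title": "b post", "link": "http://example.com/b"},
    {"title": "a post", "link": "http://example.com/a"},
    {"title": "b post again", "link": "http://example.com/b"},
    {"title": "c page", "link": "http://example.com/c"},
]

# Operators are wired together much like boxes in the Pipes editor.
posts = filter_items(items, lambda item: "post" in item["title"])
result = list(truncate(sort_items(unique(posts, "link"), "title"), 2))
titles = [item["title"] for item in result]
print(titles)
```

Each operator takes a stream and returns a stream, which is what lets them be chained in any order.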

Templates

Pipes does not provide an arbitrary Djolt-like template producing HTML. Instead, it provides a number of pre-made code templates that output well-known data types, including RSS, JSON and Atom (and some stranger choices, like PHP).

December 9, 2008

Introducing DQT – Data/Query/Transform/Template

dqt,ideas · David Janes · 4:05 pm ·

Data/Query/Transform/Template – DQT, dropping the final T – is a commonly used pattern for displaying data on a website. The elements of this pattern are:

  • a Data source, such as a blog database, an e-mail store, the Internet as a whole, a MySQL database or, in particular, the results of an API call.
  • a Query, which is a way of selecting a particular subset or slice of the data (typically homogeneous)
  • Transform rules, which can make the data look different by renaming fields, enhancing data using tools such as geolocation, filtering records out, merging multiple data sources and so forth.
  • a Template, which is a way of converting to a useful end-user format, such as HTML, JSON or XML

In the particular context of what I’m writing about, we can assume that we’re manipulating WORK items – that is, an API returns a “Meta” block of information and a stream of “Items”, each of which is in turn a WORK item. By identifying common patterns of dynamic page construction, my hope is that we can simplify page and mashup creation.

You’ve seen this pattern many times. My plan is that by describing it properly, we can make it easier to do.
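As a rough sketch of the four pieces, here is the pattern written as four small Python 3 functions chained together. Everything here is hypothetical (the function names, the field names, the sample data); it only shows the shape of the pattern, with a “meta” block and a list of “items” standing in for WORK items.

```python
from html import escape

def data_source():
    """Data: a made-up API result with a Meta block and a stream of Items."""
    return {
        "meta": {"title": "Example blog"},
        "items": [
            {"headline": "First post", "category": "python", "url": "http://example.com/1"},
            {"headline": "Second post", "category": "maps", "url": "http://example.com/2"},
            {"headline": "Third post", "category": "python", "url": "http://example.com/3"},
        ],
    }

def query(data, category):
    """Query: select a homogeneous slice of the data."""
    selected = [item for item in data["items"] if item["category"] == category]
    return {"meta": data["meta"], "items": selected}

def transform(data):
    """Transform: rename fields into what the template expects."""
    renamed = [{"title": i["headline"], "link": i["url"]} for i in data["items"]]
    return {"meta": data["meta"], "items": renamed}

def template(data):
    """Template: convert to an end-user format, here a scrap of HTML."""
    lines = ["<h1>%s</h1>" % escape(data["meta"]["title"]), "<ul>"]
    for item in data["items"]:
        lines.append('<li><a href="%s">%s</a></li>'
                     % (escape(item["link"], quote=True), escape(item["title"])))
    lines.append("</ul>")
    return "\n".join(lines)

page = template(transform(query(data_source(), "python")))
print(page)
```

Dropping `transform` from the chain gives the shape of an example with no Transform step, as long as the template reads the original field names.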

WordPress Blogs

  • The Data source is the MySQL table with blog posts, plus ancillary information pulled from other tables.
  • The Query is some combination of ( page number, post path, category, tag ). Not all combinations are legal obviously, but this is the information that can be encoded in a URL request. The Data source and the Query result in a number of posts being made available for further processing.
  • The Template is the PHP code that converts the individual database items into HTML for display

There is no Transform in this example. See it here ;-) .

BTW: Don’t take this as a how-to guide for WordPress. I’m trying to look at this from a high-level conceptual point-of-view.

Google Mail

  • the Data source is all the e-mails in the Google database, probably billions or trillions of messages
  • the Query is some combination of ( userid, page number, search ). The userid is not encoded in the URL, it is known be
  • the Template is the Javascript code that

That’s a really high level view: in fact, Google Mail does this DQT twice: the first time around to select JSON or XML data to be transmitted to the user’s browser; the second time around, locally in the user’s browser, to select and display items.

Yahoo News

  • the Data source is Yahoo’s news database
  • the Query is a category, or no category at all (poorly encoded in the URL, I may add)
  • the Transform groups news into like categories (play along with me here)
  • the Template is Yahoo’s HTML generator, whatever that may be

See it here.

An RSS feed

The Data and the Query are substantially similar to the WordPress Blog example, but:

  • we Transform the fields into a format that can be understood by an RSS generator
  • the Template is a specialized object that converts WORK items into RSS entries, that is, we don’t (or shouldn’t) use a Djolt-like template to generate XML.

This is obviously somewhat of a hypothetical example, but it reflects my recent ideas about how machine-readable data should be generated.
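To illustrate the point about not using a text template for XML, here is a minimal Python 3 sketch that builds RSS 2.0 with the standard library’s xml.etree.ElementTree, so that escaping is handled by the XML library rather than by hand. The function name `items_to_rss` and the sample data are assumptions for the example, not part of WORK or Djolt.

```python
import xml.etree.ElementTree as ET

def items_to_rss(meta, items):
    """Build an RSS 2.0 document from a meta dict and an iterable of item dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # The three channel elements RSS 2.0 requires
    for name in ("title", "link", "description"):
        ET.SubElement(channel, name).text = meta.get(name, "")
    for item in items:
        node = ET.SubElement(channel, "item")
        for name in ("title", "link", "description"):
            if name in item:
                ET.SubElement(node, name).text = item[name]
    return ET.tostring(rss, encoding="unicode")

# Made-up data; note the characters that would need escaping in a text template.
meta = {"title": "Example feed", "link": "http://example.com/", "description": "Made-up data"}
items = [{"title": "Fish & chips <today>", "link": "http://example.com/1"}]

rss_xml = items_to_rss(meta, items)
print(rss_xml)
```

A text template would have to remember to escape the `&` and `<` in the title; the generator object cannot forget.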

December 8, 2008

Coding backwards for simplicity

djolt,dqt,ideas,pybm,python,work · David Janes · 4:58 pm ·

I haven’t been posting as much as I’d like here for the last three weeks, not for lack of ideas but because I haven’t been able to consolidate what I’ve been working on into a coherent thought. I’m trying to come up with an overarching conceptual arch that covers WORK, Djolt and the various API interfaces I’ve been coding. Tentatively and horribly, I’m calling this Data/Query/Transform/Template right now, though I’m expecting this to change.

The first demo of this … without further explanation … can be seen here. More details about what this is actually demonstrating (besides formatting this blog) will be forthcoming.

What I want to draw attention to in this post is how I coded this. What I’ve been doing for the last several weeks is coding backwards: I start with what I want the final code to look like and then figure out all the libraries, little languages and so forth that would be needed to code that. After several false starts, my conceptual logjam broke about a week ago and code started radically simplifying.

The ideal code, in my mind, is almost entirely static declarations: no loops, no if statements, no while statements, no goto-type statements (god help us). We simply specify how the parts are connected, and hope that we can abstract the complexity into the libraries that make this all happen. The code that you see below is actually post all my conceptualizing: I just wanted to write some code and since I had almost all the parts together it fell together quite nicely:

import bm_wsgi
import bm_io

import djolt
import api_feed

from bm_log import Log

class Application(bm_wsgi.SimpleWrapper):
    def __init__(self, *av, **ad):
        bm_wsgi.SimpleWrapper.__init__(self, *av, **ad)

    def CustomizeSetup(self):
        self.html_template_src = bm_io.readfile("index.dj")
        self.html_template = djolt.Template(self.html_template_src)

        self.context = djolt.Context()
        self.context["paramd"] = {
            "feed" : "http://feeds.feedburner.com/DavidJanesCode",
            "template" : """\
<ul>
{% for item in data.items %}
	<li><a href="{{ item.link }}">{{ item.title }}</a></li>
{% endfor %}
</ul>
""",
        }
        self.context.Push()
        self.context["paramd"] = self.paramd
        self.context["data"] = api_feed.RSS20(self.context.as_string("paramd.feed"))

    def CustomizeContent(self):
        yield   self.html_template.Render(self.context)

if __name__ == '__main__':
    Application.RunCGI()

There’s almost nothing there! In particular, note:

  • bm_wsgi.SimpleWrapper handles all the WSGI interface work, including determining when to output HTML headers, error trapping, and Unicode to UTF-8 encoding
  • the most complicated part of the application is setting up the Context. In particular, note that self.paramd is automatically populated by the QUERY_STRING passed to the application, and the double setting we do here allows us to have default values.
  • If you want to see the HTML template that drives the application it is here. Note two variations from Django templates: the {% asis %} block which doesn’t interpret its content as Djolt code and the {{ *paramd.template|safe }} variable which interprets the variable’s contents as a template.
  • Methods called Customize-something are my convention for framework functions, i.e. methods that will be called for us rather than methods we call.
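The “double setting” of paramd is essentially a stack of dictionaries: defaults at the bottom, values from the query string on top. Here is a rough Python 3 sketch of the same idea using only the standard library. It is an analogy, not how djolt.Context or bm_wsgi actually work; `build_params` and the placeholder template value are made up.

```python
from collections import ChainMap
from urllib.parse import parse_qs

# Defaults, playing the role of the first context["paramd"] assignment
defaults = {
    "feed": "http://feeds.feedburner.com/DavidJanesCode",
    "template": "(default template)",
}

def build_params(query_string, defaults):
    """Layer the values found in a QUERY_STRING over a dict of defaults."""
    parsed = parse_qs(query_string)  # maps each key to a list of values
    supplied = {key: values[0] for key, values in parsed.items()}
    # Lookups try `supplied` first, then fall back to `defaults`
    return ChainMap(supplied, defaults)

paramd = build_params("feed=http%3A%2F%2Fexample.com%2Frss", defaults)
print(paramd["feed"])      # supplied by the query string
print(paramd["template"])  # falls through to the default
```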

How to JSON encode iterators

ideas,python · David Janes · 2:32 pm ·

As part of my recent explorations, I’ve been playing a lot with Python iterators/generators. The key efficiency of iterators is that when working with lengthy list-like objects, you need only create the part that’s being looked at. It’s just-in-time objects.

If you attempt to JSON serialize an object with an iterator/generator object in it, the json module throws a cog: it doesn’t know how to serialize these types of objects. The json module is extensible and the documentation makes a suggestion how to do this:

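import json

class IterEncoder(json.JSONEncoder):
    def default(self, o):
        # Anything iter() accepts gets serialized as a list
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Otherwise let the base class raise the usual TypeError
        return json.JSONEncoder.default(self, o)

print json.dumps(xrange(4), cls = IterEncoder)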

This seems somewhat ugly to me. In particular, lots of objects can be wrapped by the iter function that don’t need to be, plus lots of objects will cause that TypeError to be thrown which seems to be rather a bit of waste. Here’s the solution I came up with:

class IterEncoder(json.JSONEncoder):
    def default(self, o):
        try:
            return  json.JSONEncoder.default(self, o)
        except TypeError, x:
            try:
                return  list(o)
            except:
                # not list-able either: re-raise the original TypeError
                raise x

This tries to encode the object the normal way. Only if that doesn’t work do we try to turn the object into a list. If that’s not convertible (i.e. the list object constructor fails) we go back and throw the original exception provided by JSONEncoder – we’ve really failed.

You use this as follows:

class X:
    def Iter(self):
        yield 1
        yield 2
        yield 3
        yield 4

xi = X().Iter()

print json.dumps(xi, cls = IterEncoder)
print json.dumps(xrange(4), cls = IterEncoder)

Which yields the expected:

[1, 2, 3, 4]
[0, 1, 2, 3]

Don’t be overly tempted to check the type of o: it may be types.GeneratorType or types.XRangeType or perhaps even something else that I haven’t found out yet.
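For reference, here is a sketch of the same encoder in Python 3 syntax (the listings above are Python 2). The logic is unchanged: try the normal route, fall back to list(), and re-raise the original TypeError if that fails too. The `countdown` generator is just sample data.

```python
import json

class IterEncoder(json.JSONEncoder):
    def default(self, o):
        try:
            # The normal route; for unknown types this raises TypeError
            return super().default(o)
        except TypeError as original:
            try:
                # Fall back to materializing the object as a list
                return list(o)
            except TypeError:
                # Not iterable either: report the encoder's original error
                raise original from None

def countdown():
    yield 3
    yield 2
    yield 1

encoded_gen = json.dumps(countdown(), cls=IterEncoder)
encoded_nested = json.dumps({"squares": (n * n for n in range(4))}, cls=IterEncoder)
print(encoded_gen)
print(encoded_nested)
```

An object that is neither serializable nor iterable, such as a bare object(), still fails with the usual TypeError from the json module.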

December 7, 2008

Once more with the maps

maps · David Janes · 7:48 am ·

Michal Migurski added a very informative comment about tiling levels here, with a pointer to way more detailed information here.

December 6, 2008

JavaFX 1.0 released

html / javascript,java · David Janes · 6:54 am ·

CNET has a very detailed article on JavaFX, Sun’s better-late-than-never-maybe-hopefully answer to Adobe’s Flash / AIR and Microsoft’s Silverlight:

With a back-to-the-future technology called JavaFX to be launched Thursday, Sun Microsystems hopes to attract a new class of developer while building a much-needed new revenue source.

JavaFX 1.0 returns to the sales pitch that Sun used during Java’s launch more than 13 years ago: a foundation for software on a wide variety of computing “clients” such as desktop computers or mobile phones. JavaFX builds on current Java technology but adds two major pieces.

First is a new software foundation designed to run so-called rich Internet applications–network-enabled programs with lush user interfaces. Second is a new programming language called JavaFX Script that’s intended to be easier to use than traditional Java.

The benefit for me (and maybe you) is that most web shops have three different classes of developers: backend programmers, HTML/CSS coders, and The Flash Guy. The problem with The Flash Guy is that they have to be involved to make any changes to Flash components; JavaFX instead provides components that can easily be manipulated by coders and by web designers (with a little training I think).

The JavaFX site looks a hell of a lot better than anything Sun’s produced in the past, so maybe they’re learning that looks matter too. Here are some examples to play with.

All your Base are belong to us

db,freebase,semantic web · David Janes · 6:26 am ·

Freebase is a user-editable, user-extensible structured database, a sort of one-stop shop semantic web/Wikipedia application. I started playing with Freebase about a year ago and the application has made significant strides over that period, especially in the usability department. Freebase also provides a very nice API which I’m using in GenX, with the caveat that it’s currently almost useless because of query timeouts.

I just came across the following page on Freebase: http://vancouver.freebase.com/. This page is what Freebase calls a Base, which is a collection of Tables/Views, which are things like “Vancouver Bloggers”, “Mayoral Candidates 2008” and so forth. A Table/View is a list of Topics, which are basically the equivalent of a Wikipedia page. Get all that? It makes sense after a while.

A few observations:

  • Why have I written Table/View above? Because in some places it’s called a Table and other places it’s called a View. Which is it? I’m guessing View but it’s still not 100% clear.
  • I decided to create our own Toronto Base especially for the TorCamp community. Given that you get your own top-level domain name there’s somewhat of an incentive to be a first-mover on this.
  • When you create a Base, it provides a list of suggested Views that can be added. Nice. Unfortunately, it added each View twice. I then had to go delete the duplicate View manually. Not so nice. And then even though I’ve deleted the View it still shows up on a detail page. Sigh.
  • On the plus side, this is all done in a nice Ajaxy way.
  • It’s really not at all obvious how you create a new View. Really not obvious. Here’s the documentation.
  • My initial opinion was that Views seem to be copies, not references: this turns out to be a wrong assumption on my part. Views are in fact (if I got this right) the results of a query on the Freebase db. This means that as more Topics match the View query, they’ll automatically show up. The query is a copy, not a reference, but this is a good thing.
  • The implication is that it’s difficult to create a View that is an arbitrary “bag” of topics. For example, if I want to create a Toronto Bloggers View, I have to actually make sure that all the Topics that will show up are marked with some attribute that can be matched to give them a Toronto-bloggerness quality.
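The difference between a View as a query and a View as a “bag” can be sketched in a few lines of Python 3. This is purely illustrative: the topics, the attribute names and `make_view` are invented, and this is not the Freebase API or its query language.

```python
# Invented topics; in Freebase these would be Topics with typed attributes.
topics = [
    {"name": "Alice", "type": "blogger", "city": "Toronto"},
    {"name": "Bob", "type": "blogger", "city": "Vancouver"},
]

def make_view(**criteria):
    """A View is a saved query: it stores the criteria, not the results."""
    def view(all_topics):
        return [
            topic["name"]
            for topic in all_topics
            if all(topic.get(key) == value for key, value in criteria.items())
        ]
    return view

toronto_bloggers = make_view(type="blogger", city="Toronto")
before = toronto_bloggers(topics)

# A new Topic that matches the query shows up with no change to the View
topics.append({"name": "Carol", "type": "blogger", "city": "Toronto"})
after = toronto_bloggers(topics)

print(before)
print(after)
```

The flip side is the one noted above: a Topic only appears if it carries the attributes the query looks for.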

December 4, 2008

Djolt Indirection

demo,djolt,ideas,python · David Janes · 6:05 am ·

I’ve been working through a sticky problem with Djolt, trying to implement my Toronto Fires example in as few lines as possible. As part of this, I’ve come up with the idea of adding indirection to Djolt templates:

import djolt

d = {
    "a" : "It says: {{ b }}",
    "b" : "Hello, World"
}

t = djolt.Template("""
a: {{ a }}
b: {{ b }}
*a: {{ *a }}
""")

print t.Render(d)

Which yields:

a: It says: {{ b }}
b: Hello, World
*a: It says: Hello, World

This is significantly updated from the original version I posted here an hour ago. The indirection now makes the variable read as a template. This is a much more powerful concept.
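To show what “read as a template” means mechanically, here is a toy Python 3 renderer that handles only {{ name }} and {{ *name }}. It is a sketch of the idea and not Djolt’s implementation; the `render` function, the regular expression and the depth limit are assumptions made for the example.

```python
import re

# Matches {{ name }} and {{ *name }}; group 1 is the optional star
VARIABLE = re.compile(r"\{\{\s*(\*?)(\w+)\s*\}\}")

def render(source, context, depth=0):
    """Substitute variables; a leading * renders the value as a template."""
    if depth > 10:
        # Guard against a variable that (indirectly) includes itself
        raise RecursionError("template indirection nested too deeply")

    def substitute(match):
        star, name = match.groups()
        value = str(context.get(name, ""))
        if star:
            # Indirection: treat the variable's value as template source
            return render(value, context, depth + 1)
        return value

    return VARIABLE.sub(substitute, source)

d = {
    "a": "It says: {{ b }}",
    "b": "Hello, World",
}

output = render("a: {{ a }}\nb: {{ b }}\n*a: {{ *a }}", d)
print(output)
```

Without the star, the value of a is inserted verbatim, braces and all; with it, the value gets one more rendering pass against the same context.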

Python 3000 – we are no longer flying

Uncategorized · David Janes · 5:45 am ·

Python 3000 – the next generation, backwards incompatible with all previous versions of Python – is now available. I think I’ll sit this one out for a while.
