David Janes' Code Weblog

February 5, 2009

Turning garbage "HTML" into XML parsable XHTML using Beautiful Soup

html / javascript,python · David Janes · 6:56 am ·

Here’s our problem child HTML: Members of Provincial Parliament. Amongst the attrocities committed against humanity, we see:

  • use of undeclared namespaces in both tags (<o:p>) and attributes (<st1:City w:st="on">)
  • XML processing instructions – incorrectly formatted! – dropped into the middle of the document in multiple places (<?xml:namespace prefix = "o" ns = "urn:schemas-microsoft-com:office:office" />)
  • leading space before the DOCTYPE

This is so broken that even HTML TIDY chokes on it, producing a severely truncated file. This broken document provided me however an opportunity to play with the Python library Beautiful Soup, which lists amongst it’s advantages:

  • Beautiful Soup won’t choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don’t have to create a custom parser for each application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.

Alas, straight out of the box Beautiful Soup didn’t do it for me, perhaps because of some of my strange requirements (my data flow works something like this: raw document → XML → DOM parser → JSON). However, Beautiful Soup does provide the necessary calls to manipulate the document to do the trick. Here’s what I did:

First, we import Beautiful Soup and parse it to the object soup. We’re expecting an HTML node at the top, so we look for that.

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(raw)

if not hasattr(soup, "html"):
	return

Next, we loop through every node in the document, using Beautiful Soup’s findAll interface. You will see several variants of this call here in the code. What we’re looking for is use of namespaces, which we then add to the HTML element as attributes using fake namespace declarations.

We need to find namespaces already declared:

used = {}
for ns_key, ns_value in soup.html.attrs:
	if not ns_key.startswith("xmlns:"):
		continue

	used[ns_key[6:]] = 1

Then we look for ones that are actually used:

nsd = {}
for item in soup.findAll():
	name = item.name
	if name.find(':') > -1:
		nsd[name[:name.find(':')]] = 1

	for name, value in item.attrs:
		if name.find(':') > -1:
			nsd[name[:name.find(':')]] = 1

Then we add all the missing namespaces to the HTML node.

for ns in nsd.keys():
	if not used.get(ns):
		soup.html.attrs.append(( "xmlns:%s" % ns, "http://www.example.com#%s" % ns, ))

Next we look for attributes that aren’t properly XML declarations, e.g. HTML style <input checked />-type items.

for item in soup.findAll():
	for index, ( name, value ) in enumerate(item.attrs):
		if value == None:
			item.attrs[index] = ( name, name )

Then we remove all nodes from the document that we aren’t expecting to see. If you keep the script tags you’re going to have to make sure that each node is properly CDATA encoded; I didn’t care about this so I just remove them.

[item.extract() for item in soup.findAll('script')]
[item.extract() for item in soup.findAll(
    text = lambda text:isinstance(text, BeautifulSoup.ProcessingInstruction ))]
[item.extract() for item in soup.findAll(
    text = lambda text:isinstance(text, BeautifulSoup.Declaration ))]

In the final step we convert the document to Unicode. This requires another step of post-processing: html2xml changes all entity uses that XML doesn’t recognize into a &#...; style. E.g. we do change &nbsp; but we don’t change &amp;. At this point we now have a document that can be processed by standard DOM parsers (if you convert to UTF-8 bytes, sigh).

cooked = unicode(soup)
cooked = bm_text.html2xml(cooked)

January 25, 2009

Creating OPML subscription lists using Pipe Cleaner

authentication,demo,pipe cleaner,pybm,python · David Janes · 11:40 am ·

Here’s a neat API I completed this morning, called api_feeds. It takes a URL (or a list of them) and transforms them into:

  • the home page associated with the URL
  • the feed(s) for the URL
  • the name of the home page

If you’re following along at home, this is essentially the information needed for a single outline in an OPML subscription list.

Here’s a simple python example:

api = api_feeds.OneFeed()
api.request = {
    "uri" : "http://code.davidjanes.com/blog/2009/01/23/transparently-working-with-oauath/",
}

pprint.pprint(api.response, width = 1)

And here’s what the output looks like:

{'link': u'http://code.davidjanes.com/blog',
 'links': [{'href': u'http://feeds.feedburner.com/DavidJanesCode',
            'rel': 'alternate',
            'type': u'application/rss+xml'}],
 'title': u"David Janes' Code Weblog"}

There’s actually quite a bit going on here behind the scenes, most of it using code I didn’t initially write but have quite heavily hacked: the Universal Feed Parser and the Feed Finder.

What becomes really interesting what happens when we combine this with other modules. Here’s an example of how we can build an OPML subscription list from all the posts I’ve tagged “python” and “django” in del.icio.us. The code looks up each link I’ve bookmarked, does the feed discovery above, filters out items that don’t have feeds, and outputs as OPML. Note the neat pipeline type aspect to the code:

api_delicious = api_delicious.PostsList(tag = "python django")
api_many = api_feeds.ManyFeeds(require_feed = True)
api_opml = api_opml.OPMLWriter()

api_many.items = api_delicious.items
api_opml.items = api_many.items

print api_opml.Produce()

Producing the following OPML:

<opml encoding="utf-8" version="2.0">
  <head>
    <title>[Untitled]</title>
  </head>
  <body>
    <outline htmlUrl="http://push.cx"
      rssUrl="http://push.cx/feed"
      text="Push cx"
      type="rss"/>
    <outline htmlUrl="http://crankycoder.com"
      rssUrl="http://crankycoder.com/feed/"
      text="crankycoder.com"
      type="rss"/>
    <outline htmlUrl="http://blog.dowski.com"
      rssUrl="http://blog.dowski.com/feed/"
      text="the occasional occurrence"
      type="rss"/>
    <outline htmlUrl="http://www.b-list.org/feeds/entries/"
      rssUrl="http://feeds2.feedburner.com/b-list-entries"
      text="The B-List: Latest entries"
      type="rss"/>
    <outline htmlUrl="http://blog.thescoop.org"
      rssUrl="http://blog.thescoop.org/feed/"
      text="The Scoop"
      type="rss"/>
    <outline htmlUrl="http://effbot.org"
      rssUrl="http://effbot.org/zone/rss.xml"
      text="effbot.org"
      type="rss"/>
    <outline htmlUrl="http://blog.disqus.net"
      rssUrl="http://feeds.feedburner.com/BigHeadLabs"
      text="Disqus"
      type="rss"/>
    <outline htmlUrl="http://blog.ianbicking.org"
      rssUrl="http://blog.ianbicking.org/feed/atom/"
      text="Ian Bicking: a blog"
      type="rss"/>
    <outline htmlUrl="http://antoniocangiano.com"
      rssUrl="http://feeds.feedburner.com/ZenAndTheArtOfRubyProgramming"
      text="Zen and the Art of Programming"
      type="rss"/>
    <outline htmlUrl="http://www.carthage.edu/webdev"
      rssUrl="http://www.carthage.edu/webdev/?feed=rss2"
      text="carthage webdev"
      type="rss"/>
    <outline htmlUrl="http://www.eweek.com"
      rssUrl="http://www.eweek.com/rss-feeds-13.xml"
      text="Application Development - RSS Feeds"
      type="rss"/>
    <outline htmlUrl="http://jeffcroft.com/"
      rssUrl="http://feeds.feedburner.com/jeffcroft/blog"
      text="JeffCroft.com: Latest blog entries"
      type="rss"/>
  </body>
</opml>

This will be just as terse (terser, probably) when written as a Pipe Cleaner script; I’m just struggling over how to introduce the authentication code gracefully into the scripts.

January 23, 2009

Transparently working with OAuath

authentication,demo,pipe cleaner,pybm,python · David Janes · 5:03 am ·

This is part one of two posts I’m going to write about OAuth; the second will be somewhat more critical in tone. Before I criticize – and I know it’s hard to put together technologically things like OAuth – I want to actually accomplish something with it, so I at least I appear that I have somewhat of a clue about it. This is a report of what I’ve done.

bm_uri is a libary and tool I’ve written for working with URIs, and in particular http:// and https:// URLs. Here are some of the advantages of using bm_uri over all the normal Python urllib and urllib2 methods:

  • downloads are cached; if a URL is temporarily not available, bm_uri will return the cached version, likewise if it has been downloaded in the near past, the cached version will be returned rather than hitting the net again
  • downloads can be cooked, meaning converted into a more useful form such as TIDY-cleaned up HTML, JSON, Unicode text and so forth
  • bm_uri handles all the protocol stuff for you (such as User-Agent, Last-Modified and so forth) so you don’t have to
  • authentication is handled “invisibly” as possible for you … at least after the initial setup

Here is an example of accessing a OAuth resource using bm_uri returning my current location from Fire Eagle as a Python object. From a programming point of a view, I believe I have reduced this to close to the minimum number of steps possible. Here’s the setup phase:

import bm_uri
import bm_oauth
import pprint

bm_cfg.cfg.initialize()

bm_oauth.OAuth(service_name = "fireeagle")

Here’s using it in code – note how there’s no reference to OAuth here whatsoever.

loader = bm_uri.JSONLoader('https://fireeagle.yahooapis.com/api/0.1/user.json?format=json')
loader.Load()

pprint.pprint(loader.GetCooked())

And here’s the output of the program:

{u'stat': u'ok',
 u'user': {u'location_hierarchy': [{u'best_guess': True,
         u'geometry': {u'coordinates': [-79.418426513699998,
                   43.731891632100002],
              u'type': u'Point'},
         u'id': 572261,
         u'label': None,
         u'level': 1,
         u'level_name': u'postal',
         u'located_at': u'2008-03-19T04:09:30-07:00',
...
         u'name': u'Canada',
         u'normal_name': None,
         u'place_id': u'EESRy8qbApgaeIkbsA',
         u'woeid': 23424775}],
     u'readable': True,
     u'writable': False}}

Gather information

The devil is in the details, obviously and with OAuth, the little satan is doing the initial setup. Here’s how I did this for Fire Eagle – there’ll be something analogous for whatever service you are using:

  • Log in or sign up (obviously)
  • Go to the Developers’ Page
  • Click on Create a New App
  • Copy the “Consumer Key” and the “Consumer Secret” … these will be long-ish strings of nonsense
  • Find out the Request Token URL, the Access Token URL, and the Authorization URL. These are public knowledge and for Fire Eagle are:
    • https://fireeagle.yahooapis.com/oauth/request_token
    • https://fireeagle.yahooapis.com/oauth/access_token
    • http://fireeagle.yahoo.net/oauth/authorize

Note how Yahoo has conveniently made that last URL similar looking to the others, but not quite the same. Thanks!

However you implement OAuth, you’re probably going to need to be able to persist information to disk or database. As documented here several weeks ago, we already have that covered with our bm_cfg module. In ~/.cfg/fireeagle.json, create the following JSON format file:

{
 "fireeagle": {
  "api_uri" : "https://fireeagle.yahooapis.com/",
  "oauth_access_token_url": "https://fireeagle.yahooapis.com/oauth/access_token",
  "oauth_authorization_url": "http://fireeagle.yahoo.net/oauth/authorize",
  "oauth_consumer_key": "ABCDEFGHIJKL",
  "oauth_consumer_secret": "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345",
  "oauth_token_url": "https://fireeagle.yahooapis.com/oauth/request_token",
 }
}

The only new item here is the api_uri: that’s the prefix of URLs that bm_uri will use OAuth with.

Set it up

Next you have to do all sorts of OAuth stuff to actually work with OAuth. If the why interests you, please go read the spec! I’m more of how person myself, and this is what we need to do:

  • run: python bm_uri.py --service fireeagle --authorize
  • this will pop up a browser window; grant your application access and then…
  • run: python bm_uri.py --service fireeagle --exchange

And that’s it – you should now be able to just work with the Fire Eagle API in bm_uri without even having to know OAuth is there!

End notes

  • the current implementation only works with HTTP/REST GET; POST to come soon, DELETE and PUT as needed
  • bm_uri, bm_config and the rest of the code is freely licensed and available here. It is a constantly changing product, albeit converging on perfection in my own mind ;-)

January 9, 2009

Thinking about Configuration

ideas,python · David Janes · 7:20 am ·

Happy New Year, everyone. I’ve been busy at paying work recently, plus cleaning up and testing existing code I’ve been discussing here over the last few months. At work I’ve been developing in WebObjects, which though a lovely platform is not the way of the future so I’m not documenting many of my experiences here.

The applications I’ve been working on recently, Pipe Cleaner and GenX, need – like most applications – configuration. This will store information which can be safely exposed to the public, such as my Google Maps API key, and information that I need to keep private within the application, such as my Freebase username and password (cf. however the password anti-pattern). Furthermore, though the code I’m writing is in Python it is possible that the code that provides the UI will be written in another language, such as PHP inside of WordPress.

Given these considerations, here’s my design choices:

  • configuration files are stored as multiple individual files inside a directory (or directories)
  • configuration files are in JSON, and contain a dictionary of dictionaries (see below)
  • configuration files can be marked as private or public
  • the same logical configuration (say for Amazon, which has both public and private information) can be in a public and private file
  • the configuration is global, but is accessed through setter/getter properties
  • non-global versions of the configuration can be made

That all said, here’s what I’ve written. First, the setters and getters:

class Cfg:
    _cfg_private = {}
    _cfg_public = {}

    @apply
    def public():
        def fget(self):
            return  self._cfg_public

        return property(**locals())

    @apply
    def private():
        def fget(self):
            return  self._cfg_private

        return property(**locals())

As an aside, I’m not 100% sure about Python decorators and wonder if my favorite language is being turned into a C++ like mess.

Next, the ‘add’ function that adds information to the configuration ensuring private and public are handled correctly. Note that there can be multiple dictionaries inside of ‘d’, but ‘d’ is either all Public or not.

    def add(self, d):
        if type(d) != types.DictType:
            raise TypeError("only dictionaries can be added")

        if d.get('@Public'):
            #
            #   Public definitions never overwrite private definitions
            #
            for key, value in d.iteritems():
                if type(value) != types.DictType:
                    continue

                if not self._cfg_private.has_key(key):
                    self._cfg_private[key] = value

                self._cfg_public[key] = value
        else:
            self._cfg_private.update(d)

And finally the loader, which gets everything in a directory or one level down. Note the ‘exception’ parameter which makes me a bad person, but I don’t like code failing unless I tell it to.

    def load(self, path, exception = False, depth = 0):
        try:
            if os.path.isdir(path) and depth < 2:
                for file in os.listdir(path):
                    self.load(os.path.join(path, file))
            elif os.path.isfile(path):
                if path.endswith(".json"):
                    self.add(json.loads(bm_io.readfile(path)))

        except:
            if exception:
                raise

            Log("ignoring exception", exception = True, path = path)

And one more thing: make the global configuation:

cfg = Cfg()

Here’s how you use it:

import bm_cfg

# setup ... on a per-file or directory basis
for file in sys.argv[1:]:
    bm_cfg.cfg.load(file)

# use it
pprint.pprint({
    "private" : bm_cfg.cfg.private,
    "public" : bm_cfg.cfg.public,
}, width = 1)

Here’s what my configuration directory looks like:

$ pwd
/Users/davidjanes/Sites/pc/cfg
$ ls
amazon.json		freebase.json		praized.json
amazon.public.json	gmaps.json		yahoo.json

Here’s the (private) amazon.json:

{
    "amazon" : {
        "Locale" : "us",
        "AccessKeyID" : "0......",
        "AssociateTag" : "ona-20",
        "Private" : "Don't See"
    }
}

And here’s the (public) amazon.public.json:

{
    "@Public" : 1,
    "amazon" : {
        "Locale" : "us",
        "AccessKeyID" : "0......",
        "AssociateTag" : "ona-20"
    }
}

Note that if the private version of the Amazon file wasn’t available, the public version would also be in the private one. I.e. the private configuration basically is “everything” (noting possibly exceptions above in the code).

December 22, 2008

Issues with utcoffset and pytz

demo,python · David Janes · 10:14 am ·

In the previous entry, we talked about the difficultly in finding out the delta from UTC for a timezone returned from the pytz module. In particular, consider the offset for St. John’s, Newfoundland which should be at -3:30.

dt_now = datetime.datetime.now()
tz = pytz.timezone('America/St_Johns')

offset = tz.utcoffset(dt_now)

Log(
    "using datetime.utcoffset",
    offset = format(offset),
)

With the unexpected result:

  message: using datetime.utcoffset
  offset: -4:29 (-12660)

I did a fair bit of Google searching for an answer without finding a satisfactory result, so I did further research on my own. To find the correct offset value, I found that this works:

dt_sj = tz.localize(dt_now)
offset = dt_sj - pytz.UTC.localize(dt_now)

Log(
    "using delta to UTC",
    offset = format(offset),
)

Which yields the correct:

  message: using delta to UTC
  offset: 03:30 (12600)

Note that if you’re going to use the above method for finding deltas, you’re going to have to take Daylight Savings Time into consideration also. I have not done this here, as I’m a little pressed for time and just want to illustrate the problem.

The issue seems to be with the way that pytz uses the Olson database entry (from here) for St. John’s – and all other locations. It appears that pytz is using the first rule it sees, from 1884, rather than the rule for the date that was passed in. I think this is a bug.

#
# St John's has an apostrophe, but Posix file names can't have apostrophes.
# Zone  NAME        GMTOFF  RULES   FORMAT  [UNTIL]
Zone America/St_Johns   -3:30:52 -  LMT 1884
            -3:30:52 StJohns N%sT   1918
            -3:30:52 Canada N%sT    1919
            -3:30:52 StJohns N%sT   1935 Mar 30
            -3:30   StJohns N%sT    1942 May 11
            -3:30   Canada  N%sT    1946
            -3:30   StJohns N%sT

The setup code for the examples above is:

from bm_log import Log
import dateutil.parser
import pytz
import datetime

def format(td):
    seconds = td.seconds + td.days * ( 24 * 3600 )
    return  "%02d:%02d (%s)" % ( seconds // 3600, seconds % 3600 // 60, seconds, )

Update 2010-03-09: This has been fixed in the code base and (presumably) will be in the next upcoming release.

Working with dates, times and timezones in Python

demo,python · David Janes · 7:37 am ·

Here’s a few examples of working with dates, times and timezones in Python. We are using the following packages:

  • datetime (part of the standard Python distribution)
  • dateutil – for date parsing, though there’s a lot more depth to this package that I’m not touching here
  • pytz – for timezone handling, and specifically making available the Olson timezone database to Python

There’s a lot of complexity to working with datetimes in any language; I’m not going to get into that but would prefer instead to show a few practical examples. Keep the following in mind:

  • datetimes may or may not have timezones associated with them. If they do not, they are called “naive” and their meaning is effectively defined by the program. In general, you want to work with non-naive datetimes. Generally the assumption would be that the naive datetime is in the application’s current timezone or the user’s preferred timezone
  • when working with datetimes, consider the strategy of converting everything to the universal UTC timezone, then converting back to the user’s timezone only when you need to display that to the user
  • if you are rolling your own code for handling dates, times and timezones and you haven’t done a lot of research, your implementation is garbage. Do yourself and everyone else a favor and use a library.

Our standard imports. Log is from the pybm library and it’s purpose is rather obvious.

from bm_log import Log
import dateutil.parser
import pytz
import datetime

Here’s an example of parsing the an e-mail or RSS type date using dateutil.

dts = "Thu, 13 Nov 2008 05:41:35 +0000"
dt = dateutil.parser.parse(dts)

Log(
    "Parsing an RFC type date",
    src = dts,
    dt = dt,
    iso = dt.isoformat(),
)
  message: Parsing an RFC type date
  dt: 2008-11-13 05:41:35+00:00
  iso: 2008-11-13T05:41:35+00:00
  src: Thu, 13 Nov 2008 05:41:35 +0000

Here’s an example of parsing an ISO Datetime

dts = '2008-11-13T05:41:35-0400'
dt = dateutil.parser.parse(dts)

Log(
    "Parsing an ISO Date with Timezone",
    src = dts,
    dt = dt,
    iso = dt.isoformat(),
)
  message: Parsing an ISO Date with Timezone
  dt: 2008-11-13 05:41:35-04:00
  iso: 2008-11-13T05:41:35-04:00
  src: 2008-11-13T05:41:35-0400

Here’s an example of parsing a naive timezone.

dts = '2008-11-13T05:41:35'
dt = dateutil.parser.parse(dts)

Log(
    "Parsing an ISO Date without a Timezone",
    src = dts,
    dt = dt,
    iso = dt.isoformat(),
)
  message: Parsing an ISO Date without a Timezone
  dt: 2008-11-13 05:41:35
  iso: 2008-11-13T05:41:35
  src: 2008-11-13T05:41:35

Here’s are two similar example, showing how to force the timezone if it’s not present. This will happen in the first part, but not the second.

tz = pytz.timezone('America/Toronto')
dts = '2008-11-13T05:41:35'
dt = dateutil.parser.parse(dts)
if dt.tzinfo == None:
    dt = dt.replace(tzinfo = tz)

Log(
    "Parsing an ISO Date without a Timezone BUT specifying default TZ",
    src = dts,
    dt = dt,
    iso = dt.isoformat(),
    tz = tz,
)

tz = pytz.timezone('America/Toronto')
dts = '2008-11-13T05:41:35-0400'
dt = dateutil.parser.parse(dts)
if dt.tzinfo == None:
    dt = dt.replace(tzinfo = tz)

Log(
    "Parsing an ISO Date with a Timezone AND specifying default TZ",
    src = dts,
    dt = dt,
    iso = dt.isoformat(),
    tz = tz,
)
  message: Parsing an ISO Date without a Timezone BUT specifying default TZ
  dt: 2008-11-13 05:41:35-05:00
  iso: 2008-11-13T05:41:35-05:00
  src: 2008-11-13T05:41:35
  tz: America/Toronto

  message: Parsing an ISO Date with a Timezone AND specifying default TZ
  dt: 2008-11-13 05:41:35-04:00
  iso: 2008-11-13T05:41:35-04:00
  src: 2008-11-13T05:41:35-0400
  tz: America/Toronto

Update: here’s an example of moving datetimes to UTC and then to a different Timezone. Remember: you want your backend code to work with UTC datetimes for simplicity and correctness:

dts = '2008-11-13T05:41:35-0400'
dt_orig = dateutil.parser.parse(dts)
dt_utc = dt.astimezone(pytz.UTC)

Log(
    "Changing a datetime to UTC",
    src = dts,
    dt_orig = dt_orig,
    dt_utc = dt_utc,
)

tz_vancouver = pytz.timezone('America/Vancouver')
dt_vancouver = dt_utc.astimezone(tz_vancouver)

Log(
    "Changing UTC datetime to a different timezone",
    dt_vancouver = dt_vancouver,
    dt_utc = dt_utc,
)
  message: Changing a datetime to UTC
  dt_orig: 2008-11-13 05:41:35-04:00
  dt_utc: 2008-11-13 09:41:35+00:00
  src: 2008-11-13T05:41:35-0400

  message: Changing UTC datetime to a different timezone
  dt_utc: 2008-11-13 09:41:35+00:00
  dt_vancouver: 2008-11-13 01:41:35-08:00

Here is an example of listing all “common” timezones using pytz. Note that “America” refers to the two continents, not the Irish word for the United States. Printing the actual timezone offset turned out to be a surprisingly complex task, which I will outline in a different blog post. For now let it suffice that with pytz try not to depend on utcoffset.

dt_now = datetime.datetime.now()

def tzname2offset(tzname):
    dt_in_utc = pytz.UTC.localize(dt_now)
    dt_in_tz = pytz.timezone(tzname).localize(dt_now)

    offset = dt_in_utc - dt_in_tz
    seconds = offset.seconds + offset.days * ( 24 * 3600 )

    return  "%02d:%02d" % ( seconds // 3600, seconds % 3600 // 60, )

Log(
    "Olsen (pytz) common timezones and their UTC offsets",
    timezones = map(
        lambda tzname: ( tzname, tzname2offset(tzname), ),
        pytz.common_timezones,
    )
)
  message: Olsen (pytz) common timezones and their UTC offsets
  timezones:
    [('Africa/Abidjan', '00:00'),
     ('Africa/Accra', '00:00'),
     ('Africa/Addis_Ababa', '03:00'),
     ('Africa/Algiers', '01:00'),
     ('Africa/Asmara', '03:00'),
...
     ('Pacific/Wake', '12:00'),
     ('Pacific/Wallis', '12:00'),
     ('US/Alaska', '-9:00'),
     ('US/Arizona', '-7:00'),
     ('US/Central', '-6:00'),
     ('US/Eastern', '-5:00'),
     ('US/Hawaii', '-10:00'),
     ('US/Mountain', '-7:00'),
     ('US/Pacific', '-8:00'),
     ('UTC', '00:00')]

December 8, 2008

Coding backwards for simplicity

djolt,dqt,ideas,pybm,python,work · David Janes · 4:58 pm ·

I haven’t been posting as much as I like here for the last three weeks, not because of lack of ideas but because I haven’t been able to consolidate what I’ve been working on into a coherent thought. I’m trying to come up with a overreaching conceptual arch that covers WORK, Djolt and the various API interfaces I’ve been coded. Tentatively and horribly, I’m calling this Data/Query/Transform/Template right now though I’m expecting this to change.

The first demo of this … without further explanation … can be seen here. More details about what this is actually demonstrating (besides formatting this blog) will be forthcoming.

What I want to draw attention to in this post is how I coded this. What I’ve been doing for the last several weeks is coding backwards: I start with what I want the final code to look like and then figure out all the libraries, little languages and so forth that would be needed to code that. After several false starts, my conceptual logjam broke about a week ago and code started radically simplifying.

The ideal code, in my mind, is almost entirely static declarations: no loops, no if statements, no while statements, no goto-type statements (god help us). We simply specify how the parts are connected, and hope that we can abstract the complexity into the libraries that make this all happen. The code that you see below is actually post all my conceptualizing: I just wanted to write some code and since I had almost all the parts together it fell together quite nicely:

import bm_wsgi
import bm_io

import djolt
import api_feed

from bm_log import Log

class Application(bm_wsgi.SimpleWrapper):
    def __init__(self, *av, **ad):
        bm_wsgi.SimpleWrapper.__init__(self, *av, **ad)

    def CustomizeSetup(self):
        self.html_template_src = bm_io.readfile("index.dj")
        self.html_template = djolt.Template(self.html_template_src)

        self.context = djolt.Context()
        self.context["paramd"] = {
            "feed" : "http://feeds.feedburner.com/DavidJanesCode",
            "template" : """\
<ul>
{% for item in data.items %}
	<li><a href="{{ item.link }}">{{ item.title }}</a></li>
{% endfor %}
""",
        }
        self.context.Push()
        self.context["paramd"] = self.paramd
        self.context["data"] = api_feed.RSS20(self.context.as_string("paramd.feed"))

    def CustomizeContent(self):
        yield   self.html_template.Render(self.context)

if __name__ == '__main__':
    Application.RunCGI()

There’s almost nothing there! In particular, note:

  • bm_wsgi.SimpleWrapper handles all the WSGI interface work, including determining when to output HTML headers, error trapping, and Unicode to UTF-8 encoding
  • the most complicated part of the application is setting up the Context. In particular, note that self.paramd is automatically populated by the QUERY_STRING passed to the application, and the double setting we do here allows us to have default values.
  • If you want to see the HTML template that drives the application it is here. Note two variations from Django templates: the {% asis %} block which doesn’t intrepret it’s content as Djolt code and the {{ *paramd.template|safe }} variable which interprets the variable’s contents as a template.
  • Methods called Customize-something are my convention for framework functions, i.e. methods that will be called for us rather than methods we call.

How to JSON encode iterators

ideas,python · David Janes · 2:32 pm ·

As part of my recent explorations, I’ve been playing a lot with Python iterators/generators. The key efficiency of iterators is that when working with lengthy list-like objects, you need only create the part that’s being looked at. It’s just-in-time objects.

If you attempt to JSON serialize an object with an iterator/generator object in it, the json module throws a cog: it doesn’t know how to serialize these types of objects. The json module is extensible and the documentation makes a suggestion how to do this:

class IterEncoder(json.JSONEncoder):
 def default(self, o):
   try:
       iterable = iter(o)
   except TypeError:
       pass
   else:
       return list(iterable)
   return JSONEncoder.default(self, o)

print json.dumps(xrange(4), cls = IterEncoder)

This seems somewhat ugly to me. In particular, lots of objects can be wrapped by the iter function that don’t need to be, plus lots of objects will cause that TypeError to be thrown which seems to be rather a bit of waste. Here’s the solution I came up with:

class IterEncoder(json.JSONEncoder):
    def default(self, o):
        try:
            return  json.JSONEncoder.default(self, o)
        except TypeError, x:
            try:
                return  list(o)
            except:
                return  x

This tries to encode the object the normal way. Only if that doesn’t work do we try to turn the object into a list. If that’s not convertible (i.e. the list object constructor fails) we go back and throw the original exception provided by JSONEncoder – we’ve really failed.

You use this as follows:

class X:
    def Iter(self):
        yield 1
        yield 2
        yield 3
        yield 4

xi = X().Iter()

print json.dumps(xi, cls = IterEncoder)
print json.dumps(xrange(4), cls = IterEncoder)

Which yields the expected:

[1, 2, 3, 4]
[0, 1, 2, 3]

Don’t be overly tempted to check the type of o: it may be types.GeneratorType or types.XRangeType or perhaps even something else that I haven’t found out yet.

December 4, 2008

Djolt Indirection

demo,djolt,ideas,python · David Janes · 6:05 am ·

I’ve been working through a sticky problem with Djolt, trying to implement my Toronto Fires example in as few lines as possible. As part of this, I’ve come up with the idea of adding indirection to Djolt templates:

import djolt

d = {
    "a" : "It says: {{ b }}",
    "b" : "Hello, World"
}

t = djolt.Template("""
a: {{ a }}
b: {{ b }}
*a: {{ *a }}
""")

print t.Render(d)
""")

print t.Render(d)

Which yields:

a: It says: {{ b }}
b: Hello, World
*a: It says: Hello, World

This is significantly updated from the original version I posted here an hour ago. The indirection now makes the variable read as a template. This is a much more powerful concept.

November 28, 2008

Djolt – Django-like Templates

djolt,pybm,python,work · David Janes · 4:34 pm ·

Djolt is a reimplementation of Django’s template language in Python. Why do this?

  • I like the Django template language
  • I wanted something that small and independent of Django
  • I wanted something that will work with WORK paths (this was the real deal breaker for using Django)
  • I wanted something that I could take and reimplement in Javascript and maybe Java too
  • Some template engines, Cheetah for example, are far too heavy for the kind of light-weight applications I have in mind; note that I’ve had great success with Cheetah in the past
  • Some template engines, such as that in Python 2.6, are for too underfeatured

However, if you’re really looking for the whole Django template experience and don’t want to use Djolt, just start here.

How do I get it?

Djolt is packaged as part of the pybm library.

How do I use it?

import djolt

t = djolt.Template("""
<ul>
{% for name in names %}
<li>{{ name }}</li>
{% endfor %}
</ul>
""")
print t.Render({
    "names" : [ "Johnny", "Jack", "Ray", "Mary & Sam", ]
})

Which gives the results:

<ul>
<li>Johnny</li>
<li>Jack</li>
<li>Ray</li>
<li>Mary &amp; Sam</li>
</ul>

Note the “autoescaping” of the & character.

What tags does it define?

  • autoescape/endautoescape
  • if/else/endif
  • equal/endequal
  • for/endfor
  • notequal/notendequal

It does not implement blocks.

What filters does it define?

  • add
  • cut
  • default (see otherwise below)
  • default_if_none
  • divisibleby
  • first
  • join
  • last
  • length
  • length_is
  • linebreaks
  • lower
  • pluralize
  • random
  • safe (respecting all the Django autoescape rules)
  • slug
  • upper

Unimplemented filters are due to laziness and will be done “on demand”. We also introduce a few new filters:

  • jslug – like slug, but more Javascript friendly
  • otherwise – like default, except the empty string/empty values trigger the filter also

Are their differences between Djolt and Django templates?

  • Djolt tags suck up whitespace if they’re on a line by themselves
  • If Djolt cannot resolve a variable, it resolves to the appropriate “empty” value (as opposed to failing). This is keeping in line with WORK philosophy

Beyond that you should be able to use most Django template examples (that don’t use block/implements) as-is.

Is it extensible?

Yes. You can add your own tags and filters by following the examples in code (djolt_nodes.py and djolt_filters.py respectively).

« Newer Posts · Older Posts »

Powered by WordPress