David Janes' Code Weblog

February 5, 2009

Turning garbage "HTML" into XML parsable XHTML using Beautiful Soup

html / javascript,python · David Janes · 6:56 am ·

Here’s our problem child HTML: Members of Provincial Parliament. Amongst the attrocities committed against humanity, we see:

  • use of undeclared namespaces in both tags (<o:p>) and attributes (<st1:City w:st="on">)
  • XML processing instructions – incorrectly formatted! – dropped into the middle of the document in multiple places (<?xml:namespace prefix = "o" ns = "urn:schemas-microsoft-com:office:office" />)
  • leading space before the DOCTYPE

This is so broken that even HTML TIDY chokes on it, producing a severely truncated file. This broken document provided me however an opportunity to play with the Python library Beautiful Soup, which lists amongst it’s advantages:

  • Beautiful Soup won’t choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don’t have to create a custom parser for each application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.

Alas, straight out of the box Beautiful Soup didn’t do it for me, perhaps because of some of my strange requirements (my data flow works something like this: raw document → XML → DOM parser → JSON). However, Beautiful Soup does provide the necessary calls to manipulate the document to do the trick. Here’s what I did:

First, we import Beautiful Soup and parse it to the object soup. We’re expecting an HTML node at the top, so we look for that.

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(raw)

if not hasattr(soup, "html"):
	return

Next, we loop through every node in the document, using Beautiful Soup’s findAll interface. You will see several variants of this call here in the code. What we’re looking for is use of namespaces, which we then add to the HTML element as attributes using fake namespace declarations.

We need to find namespaces already declared:

used = {}
for ns_key, ns_value in soup.html.attrs:
	if not ns_key.startswith("xmlns:"):
		continue

	used[ns_key[6:]] = 1

Then we look for ones that are actually used:

nsd = {}
for item in soup.findAll():
	name = item.name
	if name.find(':') > -1:
		nsd[name[:name.find(':')]] = 1

	for name, value in item.attrs:
		if name.find(':') > -1:
			nsd[name[:name.find(':')]] = 1

Then we add all the missing namespaces to the HTML node.

for ns in nsd.keys():
	if not used.get(ns):
		soup.html.attrs.append(( "xmlns:%s" % ns, "http://www.example.com#%s" % ns, ))

Next we look for attributes that aren’t properly XML declarations, e.g. HTML style <input checked />-type items.

for item in soup.findAll():
	for index, ( name, value ) in enumerate(item.attrs):
		if value == None:
			item.attrs[index] = ( name, name )

Then we remove all nodes from the document that we aren’t expecting to see. If you keep the script tags you’re going to have to make sure that each node is properly CDATA encoded; I didn’t care about this so I just remove them.

[item.extract() for item in soup.findAll('script')]
[item.extract() for item in soup.findAll(
    text = lambda text:isinstance(text, BeautifulSoup.ProcessingInstruction ))]
[item.extract() for item in soup.findAll(
    text = lambda text:isinstance(text, BeautifulSoup.Declaration ))]

In the final step we convert the document to Unicode. This requires another step of post-processing: html2xml changes all entity uses that XML doesn’t recognize into a &#...; style. E.g. we do change &nbsp; but we don’t change &amp;. At this point we now have a document that can be processed by standard DOM parsers (if you convert to UTF-8 bytes, sigh).

cooked = unicode(soup)
cooked = bm_text.html2xml(cooked)

December 18, 2008

Pipe Cleaner

demo,djolt,dqt,html / javascript,ideas,jd,maps,pipe cleaner,pybm,work · David Janes · 6:38 pm ·

I’ve been working (in my decreasing available spare time) on a project to pull together into a project called “Pipe Cleaner” all the various concepts I’ve been mentioning on this blog: Web Object Records (WORK) for API Access and object manipulation, Djolt for generating text from templates, Data/Query/Transform/Template (DQT) for transforming data and JD for scripting these elements together. The pieces came together this morning enough to put a demo together and here it is – the Toronto Fires Pt II Demo.

How, you may ask, does this differ from the original Toronto Fires Demo? The answer is how it is put together, which we describe here.

Index.dj

This is the Djolt template that generates the output. The data fed to this template is generate by the JD script, described in the next section.

<html>
<head>
    <link rel="stylesheet" type="text/css" href="css.css" />
    {{ gmaps.js|safe }}
</head>
<body>
<div id="content_wrapper">
    <div id="map_wrapper">
        {{ gmaps.html|safe }}
    </div>
    <div id="text_wrapper">
{% for incident in incidents %}
    <div id="{{ incident.IncidentNumber }}">
        {{ incident.body_sb|safe }}
    </div>
{% endfor %}
</div>
</body>
</html>

Quite simple … as you can see, most of the data is being pulled in from elsewhere. The elsewhere is provided by the script described in the next section.

Index.jd

This is the script that pull all the pieces together. Note that I’m not 100% happy with the way the data is imported, I would like the geocoding to become part of this data flow too. In the next release perhaps.

First we pull in the “fire” module that we wrote in the previous Map examples. This is doing exactly what you think: importing a Python module. We may have to increase the security or restrict this to working with an API for general purpose use.

import module:"fire";

Next we define two headers – one that is going to appear in the Google Maps popup, the next that is going to appear in the sidebar. They need to be different as they refer to themselves. Note that the sidebar header “breaks” the encapsulation of Google Maps – this seems to be unavoidable. The to:"fitem.head.map" and to:"fitem.head.sb" are manipulating a WORK dictionary to store values.

Note also here that we’ve extended JD to accept Python multiline strings – this was unavoidable if JD was to be useful to me.

set to:"fitem.head.map" value:"""
<h3>
<a href="#{{ IncidentNumber }}">{{ AlarmLevel}}: {{ IncidentType }} on {{ RawStreet }}
</h3>
""";

set to:"fitem.head.sb" value:"""
<h3>
{% if latitude and longitude %}
<a href="javascript:js_maps.map.panTo(new GLatLng({{ latitude }}, {{ longitude }}))">*
{% endif %}
<a href="#{{ IncidentNumber }}">{{ AlarmLevel}}: {{ IncidentType }} on {{ RawStreet }}
</h3>
""";

The next block defines the text of the body used to describe a fire incident. It follows much the same pattern as the previous block.

set to:"fitem.body" value:"""
<p>
Alarm Level: {{ AlarmLevel }}
<br />
Incident Type: {{ IncidentType }}
<br />
City: {{ City }}
<br />
Street: {{ Street }} ({{ CrossStreet }})
<br />
Units: {{ Units }}
</p>
""";

This is a map: it is translating the values in fire.GetGeocodeIncidents into a new format and storing that in incidents. The format that we were are storing it in is understood by the Google Maps generating module.

We may rename this translate, as the word map is somewhat overloaded.

map from:"fire.GetGeocodedIncidents" to:"incidents" map:{
    "latitude" : "{{ latitude }}",
    "longitude" : "{{ longitude }}",
    "title" : "{{ AlarmLevel}}: {{ IncidentType }} on {{ RawStreet }}",
    "uri" : "{{ HOME_URI }}#{{ IncidentNumber }}",
    "body" : "{{ *fitem.head.map|safe }}{{ *fitem.body|safe }}",
    "body_sb" : "{{ *fitem.head.sb|safe }}{{ *fitem.body|safe }}",
    "IncidentNumber" : "{{ IncidentNumber }}"
};

Next we set up the “meta” (see WORK meta description if you’re not following along) for the maps. The render_value:true declaration makes PC interpret the templates in strings). We then call our Google Maps generating code (which are actually more Pipe Cleaners) and that gets fed to the Djolt template we first showed you. Clear? Maybe not, we’ll have more examples coming…

set to:"map_meta" render_value:true value:{
    "id" : "maps",
    "latitude" : 43.67,
    "longitude" : -79.38,
    "uzoom" : -13,
    "gzoom" : 13,
    "api_key" : "{{ cfg.gmaps.api_key|otherwise:'...mykey...' }}",
    "html" : {
        "width" : "1024px",
        "height" : "800px"
    }
};

load template:"gmaps.js" items:"incidents" meta:"map_meta";
load template:"gmaps.html" items:"incidents" meta:"map_meta";

December 13, 2008

Brief notes on SIMILE Timeline

demo,html / javascript · David Janes · 8:01 am ·

SIMILE Timeline is “the Google Maps for time based events”. It used to be housed at MIT but now it’s graduated to Google Code. I’ve created an example application showing this year’s Oscar awards and a number of movies that are, umm, are not all likely to be nominated.

  • the application source can be seen here; it is based on this demo and some work I had done previously. Note:
    • multiple scrolling bands linked together
    • custom icons
    • custom colors
  • we demonstrate populating the timeline widget using JSON data coded in the application; the most difficult part of putting this demo together was cutting and pasting all this data, a task we hope to make easier with our DQT code
  • the documentation for Timeline is starting to diverge from the source code; showEventText is now replaced by overview in the band creation code
  • there’s a large number of weaknesses (still) with Timeline for dealing with arbitrary data, these may be corrected in the Javascript code but I don’t have time to go through all this
    • note the incorrect placement of custom icons; using Firebug to inspect the HTML, I discovered that unfortunately everything is placed using style tags so it’s difficult to correct using CSS. Ideally I would like to be able to assign classes and ID tags to everything
    • there doesn’t seem to be an obvious way to control the widths of the labels; in fact, if I reduce the band spacing in the top band, the text starts to overlap in a horrible manner
    • popup information boxes get confused when there is too little space to display information
    • when I don’t add a description to events, it uses “undefined” (see the Oscar Nominations Period)
    • date display functions are inferring more resolution (i.e. an actual time as opposed to the just the date) that I’m giving it

If anyone has corrections for me I’ll update the demo

December 6, 2008

JavaFX 1.0 released

html / javascript,java · David Janes · 6:54 am ·

CNET has a very detailed article on JavaFX, Sun’s better-late-than-never-maybe-hopefully answer to Adobe’s Flash / AIR and Microsoft’s Silverlight:

With a back-to-the-future technology called JavaFX to be launched Thursday, Sun Microsystems hopes to attract a new class of developer while building a much-needed new revenue source.

JavaFX 1.0 returns to the sales pitch that Sun used during Java’s launch more than 13 years ago: a foundation for software on a wide variety of computing “clients” such as desktop computers or mobile phones. JavaFX builds on current Java technology but adds two major pieces.

First is a new software foundation designed to run so-called rich Internet applications–network-enabled programs with lush user interfaces. Second is a new programming language called JavaFX Script that’s intended to be easier to use than traditional Java.

The benefit for me (and maybe you) is that most web shops have three different classes of developers: backend programmers, HTML/CSS coders, and The Flash Guy. The problem with The Flash Guy is that they have to be involved to make any changes to Flash components; JavaFX instead provides components that can easily be manipulated by coders and by web designers (with a little training I think).

The JavaFX site looks a hell of a lot better than anything Sun’s produced in the past too, so maybe they’re learning that looks matter too. Here’s some examples to play with.

November 14, 2008

How to make your blog readable on an iPhone

html / javascript,ideas,maps,mobile,tips · David Janes · 10:47 am ·

Here’s the following changes I made to this blog to make it readable on an iPhone

Add iPhone directives to header

  • the META tag informas the iPhone about how wide we want the page to look – i.e. the width of the iPhone
  • the LINK tag loads our iPhone specific CSS (tip from here)
  • this should be after your normal CSS LINK (or whatever) directive
<meta name="viewport" content="width=320" />
<link rel="stylesheet" type="text/css"
  media="only screen and (max-device-width: 480px)"
  href="/blog/wp-content/themes/davidjanes/iphone.css" />

Define iPhone specific CSS

  • this will obviously vary depending on what your current CSS does – that still gets loaded
  • most of this was pragmatically discovered
  • I hide several sections which I don’t think the user wants to see on an iPhone; I’ll probably play with this more
  • the PRE directive doesn’t use line breaking, so we just clip the examples, after making sure the font is small enough to get enough on a single line
body {
    padding: 5px;
    width: 480px;
}
div#content {
    float: none;
}
div#menu {
    display: none;
}
p.credit {
    display: none;
}
pre {
    overflow: hidden;
    font-size: 10px !important;
}
h1, h1 * {
    font-size: 36px;
}

What still needs to be done

  • we shouldn’t serve sections that users are not going to see – it’s a waste of bandwidth
  • we shouldn’t serve more than 10 articles
  • I’ll have to figure out how to do mobile browser detection on WordPress

How to enable/disable Mouse Wheel actions on your map

html / javascript,maps,tips · David Janes · 8:44 am ·

All the major map APIs have the ability to zoom in and out if your pointer is over the map and you scroll the mouse wheel. Being able to disable this function if you’re working in a small popup form window is very important!

Google Maps

By default, this feature is disabled. To enable:

map.enableScrollWheelZoom();

To disable (again):

map.disableScrollWheelZoom();

Source

Yahoo Maps

By default, this feature is enabled. To disable:

map.disableKeyControls()

There doesn’t appear to be a way to re-enable afterward.

Source

Microsoft Virtual Earth

By default, this feature is enabled. To disable you have to capture the event:

trap = function() { return true; }
map.AttachEvent("onmousewheel", trap);

To re-enable, you have to detach the exact same event (hence the trap function)

map.DetachEvent("onmousewheel", trap);

Source

November 10, 2008

Syntax Error on Line 1

html / javascript,tips · David Janes · 11:34 am ·

Here’s a nasty little error that I tracked down on IE6 this morning. The error message (click on the icon in the lower left of the browser) is something like Error: Syntax Error; Code 0; Line: 1.

Of course, when you go to look for the error in your Javascript, nothing seems to be wrong.  One explanation I found for this was that a BODY onload tag is broken. In our case however this wasn’t the issue.

Instead it was really something quite simple: we were using a SCRIPT tag to download a Javascript file that no longer existed and was 302 redirecting to an HTML page! The IE6 Javascript engine then attempted to execute the HTML and of course immediately failed. In our case, this happened because we were getting a partner widget and they had changed the URL without notifying us. A 404 error would have been more appropriate, though I have not tested to see whether the same error would then occur.

One further note: if you’re debugging scripts on IE6:

November 9, 2008

Adding MapField to inputEx

demo,genx,html / javascript,ideas,inputex,maps · David Janes · 7:20 am ·

InputEx is a neat Javascript library by Eric Abouaf and other contributors to dynamically create forms. It’s best shown by example, so:

I’m working on a project called GenX that allows people to dynamically lookup, edit and create data in editors (such as in WordPress). One of the features I want is the ability to add maps so after a week of research, I’ve created a MapField for inputEx:

Notes

  • this is just a preliminary version; there’s probably some code style and idiom issues that need to be worked out
  • I’m going to ask the InputEx group to pick this up; I reserve no rights
  • it’s not fully integrated into the Builder example yet
  • it works with Google Maps, Yahoo Maps and Microsoft’s Virtual Earth. You will need a key for your domain for Google Maps, see the example above about how to insert your key
  • it introduces a concept of a Globals for MapField, as there are certain settings (i.e. your mapping keys) that really belong at “class” level rather than instantiated object level
  • Mapping APIs are dynamically loaded
  • Zoom levels are returned in native format and “universal format
  • I’ve done all my testing of FF3, so there may be browser issues I need to look at

Code changes

  • the only files needed are
    • examples/map_field.html
    • js/fields/MapField.js

Where this is going

  • I’ve spent the last week working on maps; I need to get back to using this code rather than improving and documenting it
  • I want to add a geocoder
  • I want the geocoder to be able to be sensitive to other form fields, so for example if you’re entering an address it can actually use those as an input

One final mod to inputEx

I’ve modified the builder to make the pencil icon more obviously become the close button when you’ve opened a subpanel. You can see the demo here. Here’s the change set:

In images:

curl --location http://www.famfamfam.com/lab/icons/silk/icons/pencil_delete.png \
 > pencil_delete.png

In inputex/inputExBuilder/css/inputExBuilder.css add:

div.inputEx-TypeField-EditButton.opened {
   background-image: url(../../images/pencil_delete.png);
}

In inputex/js/fields/TypeField.js replace onTogglePropertiesPanel with:

   onTogglePropertiesPanel: function() {
      if (this.propertyPanel.style.display == 'none') {
         this.propertyPanel.style.display = '';
         YAHOO.util.Dom.addClass(this.button, "opened");
      } else {
         this.propertyPanel.style.display = 'none';
         YAHOO.util.Dom.removeClass(this.button, "opened");
      }
   },

November 8, 2008

Switching between mapping APIs and universal zoom levels

html / javascript,maps · David Janes · 3:03 pm ·

Did you know that Google Maps, Yahoo Maps and Virtual Earth all use the same map tile resolutions? That is, you can actually seamlessly switch between mapping systems and have everything line up exactly the same way. If you’d like to play with this, look at our map examples:

But it’s even easier to show this using a graphic (thank you to Alistair Morton of Peapod Studios for compositing): All Maps

Table of map zoom level equivalents

Google Maps and Virtual Earth the same zoom levels, excepting that Google has a level 0. Bigger numbers are closer to the ground, smaller numbers are up into space. Yahoo Maps has fewer levels to work with – it doesn’t let you drill down below the block level. Yahoo numbers its zoom levels the opposite of Google and Virtual Earth, small numbers being closer to the ground and so on and so forth.

To allow seamless switching of data between APIs, I propose here universal zoom levels, which can be used as-is on any of these mapping APIs. To avoid confusion of which type of zoom level one is using, we use the negative number range for universal zoom levels. As Google and Microsoft are using the same zoom levels, I’ve decided that the universal zoom levels should simply be the inverse of their zoom levels, bounded by (-1, -19) inclusive.

Google Maps Virtual Earth Yahoo Maps Universal
Ground 19 19 - -19
18 18 - -18
Block 17 17 1 -17
16 16 2 -16
15 15 3 -15
14 14 4 -14
City 13 13 5 -13
12 12 6 -12
11 11 7 -11
10 10 8 -10
9 9 9 -9
8 8 10 -8
State 7 7 11 -7
6 6 12 -6
5 5 13 -5
4 4 14 -4
3 3 15 -3
2 2 16 -2
Space 1 1 17 -1
0 - - -

Conversion Formulae

Here’s Javascript to convert between your favorite mapping API’s zoom levels and universal zoom levels. Note that all to_native functions accept positive numbers as-is.

zoom = {
  google : {
    to_univeral : function(native) {
      return  Math.min(-1, -native);
    },

    to_native : function(universal) {
      return  universal > 0 ? universal : -universal;
    }
  },

  virtual_earth : {
    to_univeral : function(native) {
      return  Math.min(-1, -native);
    },

    to_native : function(universal) {
      return  universal > 0 ? universal : -universal;
    }
  },

  yahoo : {
    to_univeral : function(native) {
      return  Math.min(-1, native - 18);
    },

    to_native : function(universal) {
      return  universal > 0 ? universal : Math.max(universal + 18, 1);
    }
  }
};

End Notes

A truly universal zoom level would just use altitude; dealing with a simple integer is easier and less complication prone though. MapQuest is not included in this post as its tiles do not line up with the other APIs.

How to dynamically load map APIs

html / javascript,maps · David Janes · 1:11 pm ·

When using various map APIs it’s often useful for performance reasons not to actually load the API until we’re sure we’re going to need it. This is called dynamic loading or lazy loading. The post explains how to do this on the three major mapping APIs.

Note about the examples

These have been excerpted from a project I am currently working on, and so don’t work as-is. The basic calling structure (not shown) is as follows:

  • assume each function is being called from CALLER.
  • If true is returned we allow the dynamic loader to do it’s work; when that is complete it calls CALLER.CREATE_MAP.
  • If true isn’t returned, we assume everything is good to go and the caller directly calls CALLER.CREATE_MAP – this will typically happen when the Map API is detected in Javascript!

There’s also a global context in GLOBALS used for keeping track of state. In particular, we assume the user may be trying to create multiple maps so we have to make sure all the CREATE_MAP callbacks are triggered once all the Javascript has been loaded!

In the Google Maps and Virtual Earth we see the Javascript-with-callback pattern:

  • a callback function is passed in the URL as a parameter
  • the downloaded Javascript document is dynamically modified so that it calls that function. Hence, you know the Javascript has been loaded.

If you’re a Javascript noob, scripts are dynamically loaded into your HTML document by creating a SCRIPT element and attaching it to the document HEAD.

Google Maps API

Google Maps API uses a multi-phase loading scheme:

  • get the generic loader, giving it a callback function to invoke when it’s completely loaded
  • when that’s loaded (into google)…
  • call the loader to get the maps API
  • when that’s loaded, create the map

This feature is documented here.

google_preload : function(CALLER) {
    if (window.GMap2) {
        return;
    }

    var api_key = YOUR_GOOGLE_API_KEY_FOR_YOUR_DOMAIN;

    var preloader = 'MapGooglePreloader_' + GLOBALS.MapFieldsNumber;
    GLOBALS[preloader] = function() {
        google.load("maps", "2", {
            "callback" : function() {
                CALLER.CREATE_MAP();
            }
        });
    }

    if (window.google) {
        GLOBALS[preloader]();
    } else {
        var script = document.createElement("script");
        script.src = "http://www.google.com/jsapi?key=" + api_key +
        "&callback=GLOBALS." + preloader;
        script.type = "text/javascript";

        document.getElementsByTagName("head")[0].appendChild(script);
    }

    return  true;
}

Microsoft Virtual Earth

The Virtual Earth API is easy to dynamically load also. As with Google this feature is directly supported through a Javascript callback mechanism; unlike Google it’s essentially a single phase design.

One issue I ran into was the error message p_elSource.attachEvent is not a function error. Although a Google search for a solution mostly turns up a blank, a tip of the hat to Soul Solutions in Australia for finding a work-around (and have similarly solved dynamic API loading at the link).

virtualearth_preload : function(CALLER) {
    if (window.VEMap) {
        return;
    }

    var preloader = 'MapVEPreloader_' + GLOBALS.MapFieldsNumber;
    GLOBALS[preloader] = function() {
        CALLER.CREATE_MAP();
    }

    if (!window.attachEvent) {
        var script = document.createElement("script");
        script.src = "http://dev.virtualearth.net/mapcontrol/v6.2/js/atlascompat.js";
        script.type = "text/javascript";

        document.getElementsByTagName("head")[0].appendChild(script);
    }

    var script = document.createElement("script");
    script.src = "http://dev.virtualearth.net/mapcontrol/mapcontrol.ashx?v=6.2" +
      "&onScriptLoad=GLOBALS." + preloader;
    script.type = "text/javascript";

    document.getElementsByTagName("head")[0].appendChild(script);

    return  true;
}

Yahoo AJAX Map API

The Yahoo AJAX Map API does not directly support dynamic API loading. However, just loading the Javascript and running a timer that waits to see whether the script has loaded seems to do the trick. If you’re new-ish to Javascript, remember that it’s not multi-threaded so we can always be sure that when a Javascript file is loaded, it is completely loaded (excepting errors, of course).

yahoo_preload : function(CALLER) {
    if (window.YMap) {
        return;
    }

    var preloader = 'MapYahooPreloader_' + GLOBALS.MapFieldsNumber;
    if (!GLOBALS[preloader]) {
        GLOBALS[preloader] = 1;

        window.YMAPPID = YOUR_YAHOO_API_KEY;

        var script = document.createElement("script");
        script.src = "http://us.js2.yimg.com/us.js.yimg.com/lib/map/js/api/ymapapi_3_8_0_7.js";
        script.type = "text/javascript";

        document.getElementsByTagName("head")[0].appendChild(script);
    }

    window.setTimeout(function() {
        if (window.YMap) {
            CALLER.CREATE_MAP();
        } else {
            CALLER.yahoo_preload(CALLER);
        }
    }, 0.1);

    return  true;
}

Older Posts »

Powered by WordPress