David Janes' Code Weblog

February 5, 2009

Turning garbage "HTML" into XML parsable XHTML using Beautiful Soup

html / javascript, python · David Janes · 6:56 am ·

Here’s our problem child HTML: Members of Provincial Parliament. Amongst the atrocities committed against humanity, we see:

  • use of undeclared namespaces in both tags (<o:p>) and attributes (<st1:City w:st="on">)
  • XML processing instructions – incorrectly formatted! – dropped into the middle of the document in multiple places (<?xml:namespace prefix = "o" ns = "urn:schemas-microsoft-com:office:office" />)
  • leading space before the DOCTYPE
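
To make the steps below concrete, here is a fabricated fragment (not the actual Members of Provincial Parliament markup) that exhibits all three problems; the later snippets parse it from the variable raw:

raw = """ <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html>
<body>
<?xml:namespace prefix = "o" ns = "urn:schemas-microsoft-com:office:office" />
<p>The member for <st1:City w:st="on">Toronto</st1:City> rose to speak.<o:p></o:p></p>
</body>
</html>"""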

This is so broken that even HTML Tidy chokes on it, producing a severely truncated file. This broken document did, however, provide me an opportunity to play with the Python library Beautiful Soup, which lists amongst its advantages:

  • Beautiful Soup won’t choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don’t have to create a custom parser for each application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.
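
On that last point: when autodetection fails, Beautiful Soup 3 lets you name the encoding yourself. A small illustration (fromEncoding and originalEncoding are the library’s real names; the windows-1252 guess is just an example):

import BeautifulSoup

# override autodetection with an explicit source encoding
soup = BeautifulSoup.BeautifulSoup(raw, fromEncoding="windows-1252")
print soup.originalEncoding   # the encoding Beautiful Soup settled on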

Alas, straight out of the box Beautiful Soup didn’t do it for me, perhaps because of some of my strange requirements (my data flow works something like this: raw document → XML → DOM parser → JSON). However, Beautiful Soup does provide the necessary calls to manipulate the document to do the trick. Here’s what I did:

First, we import Beautiful Soup and parse the raw document into the object soup. We’re expecting an HTML node at the top, so we look for that.

import BeautifulSoup

# parse the raw, broken HTML -- Beautiful Soup won't choke on it
soup = BeautifulSoup.BeautifulSoup(raw)

# (inside our conversion function) bail out if there is no top-level
# HTML node; soup.html is the first <html> tag, or None if there isn't one
if soup.html is None:
	return

Next, we loop through every node in the document, using Beautiful Soup’s findAll interface. You will see several variants of this call here in the code. What we’re looking for is use of namespaces, which we then add to the HTML element as attributes using fake namespace declarations.

We need to find namespaces already declared:

# namespace prefixes already declared as xmlns: attributes on the HTML node
used = {}
for ns_key, ns_value in soup.html.attrs:
	if not ns_key.startswith("xmlns:"):
		continue

	# strip the leading "xmlns:" to keep just the prefix
	used[ns_key[6:]] = 1

Then we look for ones that are actually used:

# namespace prefixes actually used, in either tag names or attribute names
nsd = {}
for item in soup.findAll():
	name = item.name
	if name.find(':') > -1:
		nsd[name[:name.find(':')]] = 1

	for attr_name, value in item.attrs:
		if attr_name.find(':') > -1:
			nsd[attr_name[:attr_name.find(':')]] = 1

Then we add all the missing namespaces to the HTML node.

# declare every used-but-undeclared prefix, pointing it at a placeholder URI
for ns in nsd.keys():
	if not used.get(ns):
		soup.html.attrs.append(( "xmlns:%s" % ns, "http://www.example.com#%s" % ns, ))

Next, we look for attributes that aren’t proper XML, e.g. HTML-style minimized attributes as in <input checked />, and give each one its name as its value.

# expand minimized attributes: checked becomes checked="checked"
for item in soup.findAll():
	for index, ( name, value ) in enumerate(item.attrs):
		if value is None:
			item.attrs[index] = ( name, name )

Then we remove all nodes from the document that we aren’t expecting to see. If you keep the script tags, you’re going to have to make sure that each one is properly CDATA-encoded; I didn’t care about them, so I just removed them.

# strip script tags, processing instructions and declarations outright
[item.extract() for item in soup.findAll('script')]
[item.extract() for item in soup.findAll(
    text = lambda text: isinstance(text, BeautifulSoup.ProcessingInstruction))]
[item.extract() for item in soup.findAll(
    text = lambda text: isinstance(text, BeautifulSoup.Declaration))]

In the final step we convert the document to Unicode. This requires one more bit of post-processing: html2xml changes every entity that XML doesn’t recognize into the numeric &#...; style. For example, we do change &nbsp; but we don’t change &amp;. At this point we have a document that can be processed by standard DOM parsers (if you convert to UTF-8 bytes, sigh).
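
bm_text is my own helper module, so I won’t reproduce it here; but a minimal sketch of the idea, assuming nothing more than Python’s htmlentitydefs table and a regular expression (an illustration, not the actual module), might look like:

import re
import htmlentitydefs

# the five entities XML understands natively
XML_ENTITIES = set([ "amp", "lt", "gt", "quot", "apos" ])

def html2xml(text):
	def fix(match):
		name = match.group(1)
		if name in XML_ENTITIES:
			return match.group(0)	# XML already understands this one
		codepoint = htmlentitydefs.name2codepoint.get(name)
		if codepoint is None:
			return match.group(0)	# unknown entity: leave it untouched
		return "&#%d;" % codepoint	# e.g. &nbsp; becomes &#160;
	return re.sub(r"&(\w+);", fix, text)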

cooked = unicode(soup)
cooked = bm_text.html2xml(cooked)
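
From here the document really can go through a standard DOM parser; a minimal sketch of that next step, assuming the standard library’s xml.dom.minidom (the JSON step of my pipeline is application-specific, so it’s omitted):

import xml.dom.minidom

# minidom wants bytes, hence the UTF-8 conversion lamented above
dom = xml.dom.minidom.parseString(cooked.encode("utf-8"))
print dom.documentElement.tagName	# -> html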

1 comment

  1. Thanks for this post, I didn’t know about BeautifulSoup before.
    For anyone wanting to change all entities, including &nbsp; – look here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Entity%20Conversion
