HTML

Python HTML Parser Performance

In preparation for my PyCon talk on HTML I thought I’d do a performance comparison of several parsers and document models.

The situation is a little complex because there are different steps in handling HTML:

  1. Parse the HTML
  2. Parse it into something (a document object)
  3. Serialize it

Some libraries handle 1, some handle 2, some handle 1, 2, 3, etc. For instance, ElementSoup uses ElementTree as a document, but BeautifulSoup as the parser. BeautifulSoup itself has a document object included. HTMLParser only parses, while html5lib includes tree builders for several kinds of trees. There is also XML and HTML serialization.
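To make the three steps concrete, here is a minimal sketch that uses only the Python 3 standard library. It is not one of the benchmarked combinations: html.parser does the parsing, an ElementTree TreeBuilder holds the document, and ElementTree's HTML serializer writes it back out. TreeParser, parse_html, and the VOID set are hypothetical names for illustration, and the sketch assumes well-nested input.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeParser(HTMLParser):
    """Step 1 (parsing) feeding step 2 (a document object)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse_html(text):
    """Parse well-nested HTML text into an ElementTree element."""
    parser = TreeParser()
    parser.feed(text)
    parser.close()
    return parser.builder.close()


root = parse_html("<div><p>Hello <em>world</em><br></p></div>")
# Step 3: serialize the document back to HTML.
serialized = ET.tostring(root, encoding="unicode", method="html")
print(serialized)
```

A library that only handles step 1 stops at the handle_* callbacks; one that handles all three hides the whole pipeline behind a single call.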

So I’ve taken several combinations and made benchmarks. The combinations are:

  • lxml: a parser, document, and HTML serializer. Also can use BeautifulSoup and html5lib for parsing.
  • BeautifulSoup: a parser, document, and HTML serializer.
  • html5lib: a parser. It has a serializer, but I didn’t use it. It has a built-in document object (simpletree), but I don’t think it’s meant for much more than self-testing.
  • ElementTree: a document object, and XML serializer (I think newer versions might include an HTML serializer, but I didn’t use it). It doesn’t have a parser, but I used html5lib to parse to it. (I didn’t use the ElementSoup.)
  • cElementTree: a document object implemented as a C extension. I didn’t find any serializer.
  • HTMLParser: a parser. It didn’t parse to anything. It also doesn’t parse lots of normal (but maybe invalid) HTML. When using it, I just ran documents through the parser, not constructing any tree.
  • htmlfill: this library uses HTMLParser, but at least pays a little attention to the elements as they are parsed.
  • Genshi: includes a parser, document, and HTML serializer.
  • xml.dom.minidom: a document model built into the standard library, which html5lib can parse to. (I do not recommend using minidom for anything — some reasons will become apparent in this post, but there are many other reasons not covered why you shouldn’t use it.)

I expected lxml to perform well, as it is based on the C library libxml2. But it performed better than I realized, far better than any other library. As a result, if it weren’t for some persistent installation problems (especially on Macs) I would recommend lxml for just about any HTML task.

You can try the code out here. I’ve included all the sample data, and the commands I ran for these graphs are here. These tests use a fairly random selection of HTML files (355 total) taken from python.org.

Parsing

Chart data, parse time (lower is better): lxml: 0.6; BeautifulSoup: 10.6; html5lib ElementTree: 30.2; html5lib minidom: 35.2; Genshi: 7.3; HTMLParser: 2.9; htmlfill: 4.5

The first test parses the documents. Things to note: lxml is about 5x faster than even HTMLParser (0.6 versus 2.9), even though HTMLParser isn’t doing anything (lxml is building a tree in memory). I didn’t include all the things html5lib can parse to, because they all take about the same amount of time. xml.dom.minidom is only included because it is so noticeably slow. Genshi is fairly fast, but it’s the most fragile of the parsers. html5lib, lxml, and BeautifulSoup are all fairly similarly robust. html5lib has the benefit of (at least in theory) parsing HTML the correct way.
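A parse-only timing of the kind described for HTMLParser might look like the sketch below. It is an assumption about the general shape of such a benchmark, not the script that produced the numbers above; time_parser, parse_only, and the generated sample documents are hypothetical, and it uses Python 3's html.parser.

```python
import time
from html.parser import HTMLParser


def time_parser(parse, documents, repeat=3):
    """Return the best wall-clock time (seconds) to parse every document once."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for doc in documents:
            parse(doc)
        best = min(best, time.perf_counter() - start)
    return best


def parse_only(text):
    """Run a document through HTMLParser without building any tree."""
    parser = HTMLParser()
    parser.feed(text)
    parser.close()


# Stand-in documents; a real run would read the HTML files from disk.
sample_docs = ["<html><body><p>page %d</p></body></html>" % i for i in range(100)]
elapsed = time_parser(parse_only, sample_docs)
print("parsed %d documents in %.4f seconds" % (len(sample_docs), elapsed))
```

Swapping parse_only for another library's parse function keeps the comparison fair, since every parser sees the same documents under the same loop.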

While I don’t really believe it matters often, lxml releases the GIL during parsing.

Serialization

Chart data, serialization time (lower is better): lxml: 0.3; BeautifulSoup: 2.0; html5lib ElementTree: 1.9; html5lib minidom: 3.8; Genshi: 4.4

Serialization is pretty fast across all the libraries, though again lxml leads the pack by a long distance. ElementTree and minidom are only doing XML serialization, but there’s no reason that the HTML equivalent would be any faster. That Genshi is slower than minidom is surprising. That anything is worse than minidom is generally surprising.
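The difference between XML and HTML serialization can be seen with the standard library's ElementTree in Python 3, whose tostring accepts a method argument. This is only an illustration of the two output styles, not the serializer setup used in the benchmark.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<div>line<br/>break<script>if (a &lt; b) go();</script></div>"
)

# XML rules: empty elements are self-closed and "<" is always escaped.
as_xml = ET.tostring(root, encoding="unicode", method="xml")

# HTML rules: void elements such as <br> get no closing tag, and the
# contents of <script> are written out unescaped.
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)
print(as_html)
```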

Memory

Chart data, memory growth (lower is better): lxml: 26; BeautifulSoup: 82; BeautifulSoup lxml: 104; html5lib cElementTree: 54; html5lib ElementTree: 64; html5lib simpletree: 98; html5lib minidom: 192; Genshi: 64; htmlfill: 5.5; HTMLParser: 4.4

The last test is of memory. I don’t have a lot of confidence in the way I made this test, but I’m sure it means something. This was done by parsing all the documents and holding the documents in memory, and using the RSS size reported by ps to see how much the process had grown. All the libraries should be imported when calculating the baseline, so only the documents and parsing should cause the memory increase.

HTMLParser is a baseline, as it just keeps the documents in memory as a string, and creates some intermediate strings. The intermediate strings don’t end up accounting for anything, since the memory used is almost exactly the combined size of all the files.

A tricky part of this measurement is that the Python allocator doesn’t let go of memory that it requests, so if a parser creates lots of intermediate strings and then releases them the process will still hang onto all that memory. To detect this I tried allocating new strings until the process size grew (trying to detect allocated but unused memory), but this didn’t reveal much — only the BeautifulSoup parser, serialized to an lxml tree, showed much extra memory.
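The measurement can be approximated in-process. The sketch below swaps the ps call for resource.getrusage, which reports peak (not current) resident set size and is only available on Unix, so it is a related technique rather than the one used for the graph. peak_rss and measure_growth are hypothetical names, and the "parser" is a stand-in so the example runs without third-party packages.

```python
import resource


def peak_rss():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_growth(parse, documents):
    """Parse every document, hold all results, and report peak RSS growth."""
    # Take the baseline after the library under test has been imported.
    baseline = peak_rss()
    held = [parse(doc) for doc in documents]  # keep every result alive
    return peak_rss() - baseline, held


# Stand-in parser: splitting on "<" just creates some objects to hold.
docs = ["<p>paragraph %d</p>" % i for i in range(1000)]
growth, held = measure_growth(lambda text: text.split("<"), docs)
print("peak RSS grew by", growth)
```

Because the peak never shrinks, this approach shares the caveat above: memory that a parser allocates and then frees still shows up in the result.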

This is one of the only places where html5lib with cElementTree was noticeably different than html5lib with ElementTree. Not that surprising, I guess, since I didn’t find a coded-in-C serializer, and I imagine the tree building is only going to be a lot faster for cElementTree if you are building the tree from C code (as its native XML parser would do).

lxml is probably memory efficient because it uses native libxml2 data structures, and only creates Python objects on demand.

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn’t expect it to be quite this fast.

So in conclusion: lxml kicks ass. You can use it in ways you couldn’t use other systems. You can parse, serialize, parse, serialize, and repeat the process a couple times with your HTML before the performance will hurt you. With high-level constructs many operations can happen in very fast C code without calling out to Python. As an example, if you do an XPath query, the query string is compiled into something native and traverses the native libxml2 objects, only creating Python objects to wrap the query results. In addition, things like the modest memory use make me more confident that lxml will act reliably even under unexpected load.

I also am more confident about using a document model instead of stream parsing. It is sometimes felt that streamed parsing is better: you don’t keep the entire document in memory, and your work generally scales linearly with your document size. HTMLParser is a stream-based parser, emitting events for each kind of token (open tag, close tag, data, etc). Genshi also uses this model, with higher-level stuff like filters to make it feel a bit more natural. But the stream model is not the natural way to process a document; it’s actually a really awkward way to handle a document that is better seen as a single thing. If you are processing gigabyte files of XML it can make sense (and both the normally document-oriented lxml and ElementTree offer options when this happens). This doesn’t make any sense for HTML. And these tests make me believe that even really big HTML documents can be handled quite well by lxml, so a huge outlying document won’t break a system that is appropriately optimized for handling normal-sized documents.
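The awkwardness of the stream model shows up even in a small task. The sketch below, written for Python 3's html.parser, collects each link together with its text; LinkTextCollector is a hypothetical class for illustration. Because the parser only emits events, the handler has to carry its own state between the opening tag, the data, and the closing tag.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, text) pairs from a stream of parse events."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # set while we are inside an <a href=...>
        self._chunks = []   # text pieces seen since the <a> opened

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks)))
            self._href = None


collector = LinkTextCollector()
collector.feed('<p><a href="/a">First <em>link</em></a> and <a href="/b">second</a></p>')
collector.close()
print(collector.links)
```

With a document model the same task is a loop over the link elements, with no bookkeeping about which tag is currently open.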


HTML Accessibility

So I gave a presentation at PyCon about HTML, which I ended up turning into an XML-sucks HTML-rocks talk. Well that’s a trivialization, but I have the privilege of trivializing my arguments all I want.

Somewhat to my surprise this got me a heckler (of sorts). I think it came up when I was making my <em> lies and <i> is truth argument. That is, presentation and intention are the same. There are those people who feel they can separate the two, creating semantic markup that represents their intent, but they are so few that the reader can never trust that the distinction is intentional, and so <i> and <em> must be treated as equivalent.

Someone then yelled out something like "what about blind people?" The argument being that screen readers would like to distinguish between the two, as not all things we render as italic would be read with emphasis.

It’s not surprising to me that the first time I’ve gotten an actively negative reaction to a talk it was about accessibility. When having technical discussions it’s hard to get that heated up. Is Python or Ruby better? We can talk shit on the web, where all emotions get mixed up and weirded, but in person these discussions tend to be quite calm and reasonable.

Discussions about accessibility, however, have strong moral undertones. This isn’t just What Tool Is Right For The Job. There is a kind of moral certainty to the argument that we should be making a world that is accessible to all people.

I fear this moral certainty has led people self-righteously down unwise paths. They believe — with of course some justification — that the world must be made right. And so many boil-the-ocean proposals are made, and even become codified by standards, but markup standards are useless unless embodied in actual content, and this is where accessibility falls down.

There are two posts that together have greatly eroded my trust in accessibility advocates, so that I feel like I am left adrift, unwilling to jump through the hoops accessibility advocates put up as I strongly suspect they are pointless.

The first post is about the longdesc attribute, an obscure attribute intended to tell the story of a picture. Where alt is typically used as a placeholder for the image, and a short description, longdesc can point to a document that describes the image at length. Empirically they (Ian Hickson in particular) found that the attribute was almost never used in a useful or correct way, rendering it effectively useless. If the discussion had clearly ended at this point, I would have deducted points from those people who advocated longdesc based on bad judgement, but it would not have affected my trust because anyone can mispredict. But the comments just seemed to reinforce the belief that because it should work, it would work.

The second post was Ian Hickson’s description of using a popular screen reader (JAWS) — you’ll have to dig into the article some, as it’s embedded in other wandering thoughts. In summary, JAWS is a horrible experience, and as an example it didn’t even understand paragraph breaks (where the reader would be expected to pause). What’s the point of semantic markup for accessibility when the most basic markup that is both presentation and semantic (<p>) is ignored? Ian’s brief summary is that if you want to make your page readable in JAWS you’d do better by paying attention to punctuation (which does get read) than to markup. And if you want to help improve accessibility, blind people need a screen reader that isn’t crap.

Months later we started talking a bit about the accessibility of openplans.org. Everyone wants to do the right thing, no? With my trust eroded, I argued strongly that we should only implement accessibility empirically, not based on "best practices". Well, barring some patterns that seem very logical to me, like putting navigation textually at the bottom of the page, and other stuff that any self-respecting web developer does these days. But except for that, if we want to really consider accessibility we should get a tool and use it. But I don’t really know what that tool should be; JAWS is around $1000, all for what sounds like a piece of crap product. We could buy that, even though of course most web developers couldn’t possibly justify the purchase. But is that really the right choice? I don’t know. If we could detect something in the User-Agent string we could see what our users actually use. But I don’t think there’s information there. And I don’t know what people are using. Optimizing for screen magnifiers is much different than optimizing for screen readers.

Another shortcut for accessibility — a shortcut I also distrust — is that to make a site accessible you make sure it works without Javascript. But don’t many screen readers work directly off browsers? Browsers implement Javascript. Do blind users turn Javascript off? I don’t know. If you use no-Javascript as a hint to make the site more accessible, you might just be wasting your effort.

There are also some weird perspective problems with accessibility. Blind users will always be a small portion of the population. It’s just unreasonable to expect sighted users to write to this small population. Relying on hidden hints in content to provide accessibility just can’t work. Hidden content will be broken; only visible content can be trusted. Admitting this does not mean giving up. As a sighted reader I do not expect the written and spoken word to be equivalent. I don’t think blind listeners lose anything by hearing something that is more a dialect specific to the computer translation of written text to spoken text. (Maybe treating text-to-speech as a translation effort would be more successful anyway?)

A freely available screen reader probably would help a lot as well. I write my markup to render in browsers, not to render to a spec. Anything else is just bad practice. I can’t seriously write my markup for readers based on a spec.


lxml.html

Over the summer I did quite a bit of work on lxml.html. I’m pretty excited about it, because with just a little work HTML starts to be very usefully manipulatable. This isn’t how I’ve felt about HTML in the past, with all HTML emerging from templates and consumed only by browsers.

The ElementTree representation (which lxml copies) is a bit of a nuisance when representing HTML. A few methods improve it, but it is still awkward for content with mixed tags and text (common in HTML, uncommon in most other XML). Looking at Genshi Transforms there are some things I wish we could do, like simply "unwrap" text and then wrap it again. But once you remove a tag the text is thoroughly merged into its neighbors. Another little nuisance is that el.text and el.tail can be None, which means you have to guard a lot of code.
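The text/tail bookkeeping is easiest to see in code. The sketch below uses the standard library's ElementTree in Python 3 (lxml copies the same model) and a hypothetical unwrap helper that removes a tag while keeping its content. Every access to .text and .tail needs an "or ''" guard, and once the tag is gone its text is indistinguishable from its neighbors' text.

```python
import xml.etree.ElementTree as ET


def unwrap(parent, el):
    """Remove el from parent, keeping el's text, children, and tail in place."""
    index = list(parent).index(el)
    children = list(el)

    # el.text merges into whatever text came just before el.
    if index == 0:
        parent.text = (parent.text or "") + (el.text or "")
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + (el.text or "")

    # el.tail merges into whatever text now ends el's former content.
    if children:
        last = children[-1]
        last.tail = (last.tail or "") + (el.tail or "")
    elif index == 0:
        parent.text = (parent.text or "") + (el.tail or "")
    else:
        prev.tail = (prev.tail or "") + (el.tail or "")

    # Replace el with its children, preserving their order.
    parent.remove(el)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)


body = ET.fromstring("<body>Some <em>body</em> text.</body>")
unwrap(body, body.find("em"))
print(ET.tostring(body, encoding="unicode"))
```

After the call there is no way to tell which part of the merged string used to be inside the em element, which is why an unwrap-then-wrap-again transformation is hard to express in this model.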

That said, here’s the Genshi example:

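>>> from genshi.builder import tag
>>> from genshi.core import TEXT
>>> from genshi.filters import Transformer
>>> from genshi.input import HTML
>>> html = HTML('''<html>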
...   <head></head>
...   <body>
...     Some <em>body</em> text.
...   </body>
... </html>''')
>>> print html | Transformer('body/em').map(unicode.upper, TEXT) \
...                                    .unwrap().wrap(tag.u).end() \
...                                    .select('body/u') \
...                                    .prepend('underlined ')

Here’s how you’d do it with lxml.html:

>>> from lxml.html import fromstring
>>> html = fromstring('''... same thing ...''')
>>> def transform(doc):
...     for el in doc.xpath('body/em'):
...         el.text = (el.text or '').upper()
...         el.tag = 'u'
...     for el in doc.xpath('body/u'):
...         el.text = 'underlined ' + (el.text or '')

I’m not sure if Genshi works in-place here, or makes a copy; otherwise these are pretty much equivalent. Which is better? Personally I prefer mine, and actually prefer it quite strongly, because it’s quite simple — it’s a function with loops and assignments. It’s practically pedestrian in comparison to the Genshi example, which uses methods to declaratively create a transformer.

Some of the things now in lxml.html include:

  • Link handling, which is particularly focused on rewriting links so you can put HTML fragments into a new context without breaking the relative links.

  • Smart doctest comparisons (attribute-order-neutral comparisons, with improved diffs, and also whitespace neutral, based loosely on formencode.doctest_xml_compare). Inside your doctest choose XML parsing with from lxml import usedoctest or HTML parsing with from lxml.html import usedoctest. I consider the import trick My Worst Monkeypatch Ever, but it kind of reads nicely. For testing it is very nice.

  • Cleaning code, to avoid XSS attacks, in lxml.html.clean. This is still pretty messy, because there’s lots of little things you may or may not want to protect against. E.g., I think I can mostly clean out style tags (at least of Javascript), but some people might want to remove all style. So there’s an option. There’s lots of options. Too many.

  • With the cleaning code there’s word-wrapping code and autolinking code. I think of these as clean-up-people’s-scrappy-HTML tools. Also important for putting untrusted HTML in a new context.

  • I rewrote htmlfill in lxml.html.formfill. It’s a bit simpler, and keeps error messages separate from actual value filling. They were really only combined because I didn’t want to do two passes with HTMLParser for the two steps, but that doesn’t matter when you load the document into memory. I also stopped using markup like <form:error> for placing error messages; it’s all automatic now, which I suppose is both good and bad.

  • After I wrote lxml.html.formfill I got it into my head to make smarter forms more natively. So now you can do:

    >>> from lxml.html import parse
    >>> page = parse('http://tripsweb.rtachicago.com/').getroot()
    >>> form = page.forms[0]
    >>> from pprint import pprint
    >>> pprint(form.form_values())
    [('action', 'entry'),
     ('resptype', 'U'),
     ('Arr', 'D'),
     ('f_month', '09'),
     ('f_day', '21'),
     ('f_year', '2007'),
     ('f_hours', '9'),
     ('f_minutes', '30'),
     ('f_ampm', 'AM'),
     ('Atr', 'N'),
     ('walk', '0.9999'),
     ('Min', 'T'),
     ('mode', 'A')]
    >>> for key in sorted(form.fields.keys()):
    ...     print key
    None
    Arr
    Atr
    Dest
    Min
    Orig
    action
    dCity
    endpoint
    f_ampm
    f_day
    f_hours
    f_minutes
    f_month
    f_year
    mode
    oCity
    resptype
    startpoint
    walk
    >>> form.fields['Orig'] = '1500 W Leland'
    >>> form.fields['Dest'] = 'LINCOLN PARK ZOO'
    >>> from lxml.html import submit_form
    >>> result = parse(submit_form(form)).getroot()
    

    From there I’d have to actually scrape the results to figure out what the best trip was, which isn’t as easy.

  • HTML diffing and something like svn blame for a series of documents, in lxml.html.diff. Someone noted a similarity between htmldiff and templatemaker, and they are conceptually similar, but with very different purposes. htmldiff goes to great trouble to ignore markup and focus only on changes to textual content. As such it is great for a history page. templatemaker focuses on the dissection of computer-generated HTML and extracting its human-generated components. Templatemaker is focused on screen scraping. It might be handy in that form example above…

  • There’s also a fairly complete implementation of CSS 3 selectors. It would be interesting to mix this with cssutils.

    Though some people aren’t so enthusiastic about CSS namespaces (and I can’t really blame them), conveniently this CSS 3 feature makes CSS selectors applicable to all XML. I don’t know if anyone is actually going to use them instead of XPath on non-HTML documents, but you could. Because the implementation just compiles CSS to XPath, you could potentially use this module with other XML libraries that know XPath. Of which I only actually know one (or two <http://genshi.edgewall.org/>?), though compiling CSS to XPath, then having XPath parsed and interpreted in Python, is probably not a good idea. But if you are so inclined, there’s also a parser in there you could use.

  • lxml and BeautifulSoup are no longer exclusive choices: lxml.html.ElementSoup.parse() can parse pages with BeautifulSoup into lxml data structures. While the native lxml/libxml2 HTML parser works on pretty bad HTML, BeautifulSoup works on really bad HTML. It would be nice to have something similar with html5lib.
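The CSS selector item above says the implementation compiles CSS to XPath. The toy translator below illustrates that idea for a tiny subset (tag names, #id, .class, and the descendant combinator). It is a sketch written for this post in Python 3, not lxml's code; css_to_xpath and simple_to_xpath are hypothetical names, and the real implementation covers far more of CSS 3.

```python
import re

# One compound selector: an optional tag (or *) followed by #id / .class parts.
_COMPOUND = re.compile(r"^(?P<tag>[A-Za-z][\w-]*|\*)?(?P<rest>(?:[#.][\w-]+)*)$")


def simple_to_xpath(compound):
    """Translate e.g. 'p.note' or '#main' into an XPath node test with predicates."""
    match = _COMPOUND.match(compound)
    if not compound or not match:
        raise ValueError("unsupported selector: %r" % compound)
    xpath = match.group("tag") or "*"
    for kind, name in re.findall(r"([#.])([\w-]+)", match.group("rest")):
        if kind == "#":
            xpath += "[@id='%s']" % name
        else:
            # Match one whole word inside a space-separated class attribute.
            xpath += (
                "[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]" % name
            )
    return xpath


def css_to_xpath(selector):
    """Translate a whitespace-separated descendant selector into XPath 1.0."""
    parts = selector.split()
    if not parts:
        raise ValueError("empty selector")
    steps = [simple_to_xpath(part) for part in parts]
    return "descendant-or-self::" + "/descendant::".join(steps)


print(css_to_xpath("div#main p.note"))
```

The output is a plain XPath string, so any XPath 1.0 engine could evaluate it, which is the property that makes the compiled form reusable across XML libraries.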


Reflection and Description Of Meaning

After writing my last post I thought I might follow up with a bit of cognitive speculation. Since the first comment was exactly about the issue I was thinking about writing on, I might as well follow up quickly.

Jeff Snell replied:

You parse semantic markup in rich text all the time. When formatting changes, you apply a reason. RFC’s don’t capitalize MUST and SHOULD because the author is thinking in upper-case versus lower-case. They’re putting a strong emphasis on those words. As a reader, you take special notice of those words being formatted that way and immediately recognize that they contain a special importance. So I think that readers do parse writing into semantic markup inside their brains.

Emphasis not added. Wait, bold isn’t emphasis, it’s strong! So sorry, STRONG not added.

I think the reasoning here is flawed, in that it supposes that reflection on how we think is an accurate way of describing how we think.

A few years ago I got interested in cognition for a while and particularly some of the new theories on consciousness. One of the parts that really stuck with me was the difference in how we think about thinking, and how thinking really works (as revealed with timing experiments). That is, our conscious thought (the thinking-about-thinking) happened after the actual thought; we make up reasons for our actions when we’re challenged, but if we aren’t challenged to explain our actions there’s no consciousness at all (of course, you can challenge yourself to explain your reasoning — but you usually won’t). And then we revise history so that our reasoning precedes our decision, but that’s not always very accurate. This gets around the infinite-loop problem, where either there’s always another level of meta-consciousness reasoning about the lower level of consciousness, or there’s a potentially infinite sequence of whys that have to be answered for every decision. And of course sometimes we really do make rational decisions and there are several levels of why answered before we commit. But this is not the most common case, and there’s always a limit to how much reflection we can do. There are always decisions made without conscious consideration — if only to free ourselves to focus on the important decisions.

And so as both a reader and a writer, I think in terms of italic and bold. As a reader and a writer there is of course translation from one form to another. There’s some idea inside of me that I want to get out in my writing, there’s some idea outside of me that I want to understand as a reader. But just because I can describe some intermediate form of semantic meaning, it doesn’t mean that that meaning is actually there. Instead I invent things like "strong" and "emphasis" when I’m asked to decide why I chose a particular text style. But the real decision is intuitive — I map directly from my ideas to words on the page, or vice versa for reading.

Obviously this is not true for all markup. But my intuition as both a reader and a writer about bold and italic is strong enough that I feel confident there’s no intermediary representation. This is not unlike the fact I don’t consider the phonetics of most words (though admittedly I did when trying to spell "phonetics"); common words are opaque tokens that I read in their entirety without consideration of their component letters. And a good reader reads text words without consideration of their vocal equivalents (though as a writer I read my own writing out loud… is that typical? I’m guessing it is). A good reader can of course vocalize if asked, but that doesn’t mean the vocalization is an accurate representation of their original reading experience.

Though it’s kind of an aside, I think the use of MUST and SHOULD in RFCs fits with this theory. By using all caps they emphasize the word over the prose, they make the reader see the words as tokens unique from "must" and "should", with special meanings that are related to but also much more strict than their usual English meaning. The caps are a way of disturbing our natural way of determining meaning because they need a more exact language.
