<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.2.3" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>
<channel>
	<title>Comments on: Python HTML Parser Performance</title>
	<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/</link>
	<description></description>
	<pubDate>Wed, 09 Jul 2008 12:08:17 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>

	<item>
		<title>By: Ian Bicking</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-20642</link>
		<dc:creator>Ian Bicking</dc:creator>
		<pubDate>Wed, 09 Jul 2008 03:08:49 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-20642</guid>
		<description>Well, hopefully you aren't considering scraping email addresses...?  You could look at python-spidermonkey, but the task isn't all that easy, especially getting document.write to work.</description>
		<content:encoded><![CDATA[<p>Well, hopefully you aren&#8217;t considering scraping email addresses&#8230;?  You could look at python-spidermonkey, but the task isn&#8217;t all that easy, especially getting document.write to work.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jerzy Orlowski</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-20627</link>
		<dc:creator>Jerzy Orlowski</dc:creator>
		<pubDate>Tue, 08 Jul 2008 23:31:04 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-20627</guid>
		<description>Sorry, the website rendered my example input as HTML. It was JavaScript code that concatenates some variables to generate an email address.</description>
		<content:encoded><![CDATA[<p>Sorry, the website rendered my example input as HTML. It was JavaScript code that concatenates some variables to generate an email address.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jerzy Orlowski</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-20626</link>
		<dc:creator>Jerzy Orlowski</dc:creator>
		<pubDate>Tue, 08 Jul 2008 23:27:57 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-20626</guid>
		<description>Hi

Is there any way to interpret javascript in Python?

example input:

  document.write('&#60;a href="mailto:biuro');
  document.write('@inprofi.pl"&#62;biuro');
  document.write('@inprofi.pl&#60;/a&#62;');
</description>
		<content:encoded><![CDATA[<p>Hi</p>

<p>Is there any way to interpret javascript in Python?</p>

<p>example input:</p>

<p>document.write('&lt;a href="mailto:biuro');
  document.write('@inprofi.pl"&gt;biuro');
  document.write('@inprofi.pl&lt;/a&gt;');</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric LEBIGOT</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-18336</link>
		<dc:creator>Eric LEBIGOT</dc:creator>
		<pubDate>Fri, 13 Jun 2008 13:15:37 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-18336</guid>
		<description>Great overview of HTML parsers &#38; friends!  Very useful indeed.

Just one point about Mac OS X (10.4): I had no problem installing lxml!  Just did a simple "sudo easy_install lxml".  libxml2 and Python were both installed in a quite standard way and without any pain (through Fink).</description>
		<content:encoded><![CDATA[<p>Great overview of HTML parsers &amp; friends!  Very useful indeed.</p>

<p>Just one point about Mac OS X (10.4): I had no problem installing lxml!  Just did a simple &#8220;sudo easy_install lxml&#8221;.  libxml2 and Python were both installed in a quite standard way and without any pain (through Fink).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Subeen</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-18115</link>
		<dc:creator>Subeen</dc:creator>
		<pubDate>Mon, 09 Jun 2008 18:25:56 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-18115</guid>
		<description>Really nice article. Thanks.</description>
		<content:encoded><![CDATA[<p>Really nice article. Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cesar Ortiz</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16743</link>
		<dc:creator>Cesar Ortiz</dc:creator>
		<pubDate>Tue, 15 Apr 2008 23:04:29 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16743</guid>
		<description>... and I wasn't able to find an article like this.
This post can save a lot of people time.</description>
		<content:encoded><![CDATA[<p>&#8230; and I wasn&#8217;t able to find an article like this.
This post can save a lot of people time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cesar Ortiz</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16742</link>
		<dc:creator>Cesar Ortiz</dc:creator>
		<pubDate>Tue, 15 Apr 2008 23:00:14 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16742</guid>
		<description>Hi,

Some time ago I was looking for a fast HTML parser that could handle 'bad HTML', and the winner was libxml2.
I only wanted SAX parsing because, instead of a generic tree, we use other kinds of data structures.

Excellent article,

Cesar</description>
		<content:encoded><![CDATA[<p>Hi,</p>

<p>Some time ago I was looking for a fast HTML parser that could handle &#8216;bad HTML&#8217;, and the winner was libxml2.
I only wanted SAX parsing because, instead of a generic tree, we use other kinds of data structures.</p>

<p>Excellent article,</p>

<p>Cesar</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ian Bicking</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16474</link>
		<dc:creator>Ian Bicking</dc:creator>
		<pubDate>Wed, 02 Apr 2008 15:22:19 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16474</guid>
		<description>Dan: no, no qualification, I would recommend lxml for just about any HTML task.</description>
		<content:encoded><![CDATA[<p>Dan: no, no qualification, I would recommend lxml for just about any HTML task.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16448</link>
		<dc:creator>Dan</dc:creator>
		<pubDate>Wed, 02 Apr 2008 09:29:49 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16448</guid>
		<description>"I would recommend lxml for just about any HTML task."

You're missing a crucial qualification here - I'd say "for just about any HTML task &lt;i&gt;where performance matters&lt;/i&gt;". 90% of the time when I'm parsing HTML, differences in speed and memory usage are totally insignificant. I'd certainly trade off an order of magnitude speed difference for even a slight benefit in ease-of-use.

Still, it's good to see numbers on how big the performance difference is, and I'll probably look into using lxml with BeautifulSoup some time in the future.</description>
		<content:encoded><![CDATA[<p>&#8220;I would recommend lxml for just about any HTML task.&#8221;</p>

<p>You&#8217;re missing a crucial qualification here - I&#8217;d say &#8220;for just about any HTML task <i>where performance matters</i>&#8221;. 90% of the time when I&#8217;m parsing HTML, differences in speed and memory usage are totally insignificant. I&#8217;d certainly trade off an order of magnitude speed difference for even a slight benefit in ease-of-use.</p>

<p>Still, it&#8217;s good to see numbers on how big the performance difference is, and I&#8217;ll probably look into using lxml with BeautifulSoup some time in the future.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jgraham</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16437</link>
		<dc:creator>jgraham</dc:creator>
		<pubDate>Tue, 01 Apr 2008 22:26:44 +0000</pubDate>
		<guid>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-16437</guid>
		<description>&#62; There seems to be an incomplete C parser in the html5lib project too?

Indeed, there is the very beginning of a pure-C HTML5 parser in the html5lib project. This has stalled recently as I have had other things to keep me busy and my total inexperience as a C developer makes it a tough project to dip into intermittently. If anyone fancies picking it up and giving it a push, they are most welcome to; you should be able to catch me in #whatwg on irc.freenode.net if you have any questions.</description>
		<content:encoded><![CDATA[<p>&gt; There seems to be an incomplete C parser in the html5lib project too?</p>

<p>Indeed, there is the very beginning of a pure-C HTML5 parser in the html5lib project. This has stalled recently as I have had other things to keep me busy and my total inexperience as a C developer makes it a tough project to dip into intermittently. If anyone fancies picking it up and giving it a push, they are most welcome to; you should be able to catch me in #whatwg on irc.freenode.net if you have any questions.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
