<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Python HTML Parser Performance</title>
	<atom:link href="http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/</link>
	<description></description>
	<lastBuildDate>Fri, 03 Sep 2010 01:53:06 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Haberler</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-159187</link>
		<dc:creator>Haberler</dc:creator>
		<pubDate>Tue, 13 Apr 2010 12:39:35 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-159187</guid>
		<description>I tried BeautifulSoup, but it was very slow,
so I recommend that everyone use lxml.

Regards</description>
		<content:encoded><![CDATA[<p>I tried BeautifulSoup, but it was very slow,
so I recommend that everyone use lxml.</p>

<p>Regards</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paul Boddie</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-149798</link>
		<dc:creator>Paul Boddie</dc:creator>
		<pubDate>Sun, 31 Jan 2010 18:06:32 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-149798</guid>
		<description>I finally got round to running the benchmarks with libxml2dom. Here are my results:

Parsing
-------

- libxml2dom     :  0.6696 sec ( 100% of libxml2dom)
- htmlparser     :  4.1295 sec ( 616% of libxml2dom)
- html5_et       :  39.7368 sec (5934% of libxml2dom)
- html5_simple   :  38.5487 sec (5756% of libxml2dom)

[Chart](http://chart.apis.google.com/chart?chxl=0%3A&#124;html5lib+simpletree+(38.5+sec)&#124;html5lib+ElementTree+(39.7+sec)&#124;HTMLParser+(4.1+sec)&#124;libxml2dom+(0.7+sec)&#124;&amp;cht=bhs&amp;chs=400x120&amp;chd=e%3ABFGp..-E&amp;chxt=y)

Serialising
-----------

(htmlparser omitted since it doesn&#039;t serialise)

- libxml2dom     :  0.3003 sec ( 100% of libxml2dom)
- html5_et       :  3.4985 sec (1164% of libxml2dom)
- html5_simple   :  2.2173 sec ( 738% of libxml2dom)

[Chart](http://chart.apis.google.com/chart?chxl=0%3A&#124;html5_simple+(2.2+sec)&#124;html5_et+(3.5+sec)&#124;libxml2dom+(0.3+sec)&#124;&amp;cht=bhs&amp;chs=400x90&amp;chd=e%3AFf..oj&amp;chxt=y)

Memory Usage
------------

(&quot;sec&quot; in output manually changed to &quot;MB&quot;)

- libxml2dom     :  26.3080 MB ( 100% of libxml2dom)
- htmlparser     :  4.4520 MB (  16% of libxml2dom)
- html5_et       :  64.0000 MB ( 243% of libxml2dom)
- html5_simple   :  99.1560 MB ( 376% of libxml2dom)

[Chart](http://chart.apis.google.com/chart?chxl=0%3A&#124;html5lib+simpletree+(99.2+Mb)&#124;html5lib+ElementTree+(64.0+Mb)&#124;HTMLParser+(4.5+Mb)&#124;libxml2dom+(26.3+Mb)&#124;&amp;cht=bhs&amp;chs=400x120&amp;chd=e%3AQ-C3pT..&amp;chxt=y)</description>
		<content:encoded><![CDATA[<p>I finally got round to running the benchmarks with libxml2dom. Here are my results:</p>

<h2>Parsing</h2>

<ul>
<li>libxml2dom     :  0.6696 sec ( 100% of libxml2dom)</li>
<li>htmlparser     :  4.1295 sec ( 616% of libxml2dom)</li>
<li>html5_et       :  39.7368 sec (5934% of libxml2dom)</li>
<li>html5_simple   :  38.5487 sec (5756% of libxml2dom)</li>
</ul>
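
<p>As a rough sketch of how the percentage column relates to the timings, the snippet below recomputes it from the four values listed above. The function name <code>percent_of_baseline</code> is made up for illustration, and truncating with <code>int()</code> is an assumption chosen because it matches the listed figures; it is not taken from the benchmark script itself.</p>

```python
# Timings (seconds) copied from the parsing results listed above.
timings = {
    "libxml2dom": 0.6696,
    "htmlparser": 4.1295,
    "html5_et": 39.7368,
    "html5_simple": 38.5487,
}


def percent_of_baseline(results, baseline):
    """Return each value as a whole-number percentage of the baseline entry."""
    base = results[baseline]
    # int() truncates toward zero; this is an assumption, not the
    # benchmark script's documented behaviour.
    return {name: int(value / base * 100) for name, value in results.items()}


percentages = percent_of_baseline(timings, "libxml2dom")
for name, value in timings.items():
    print(f"- {name:<15}: {value:8.4f} sec ({percentages[name]:4d}% of libxml2dom)")
```

<p>The same helper should work for the serialising and memory figures, though the displayed values are rounded, so a recomputed percentage may differ by one from the printed one.</p>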

<p><a href="http://chart.apis.google.com/chart?chxl=0%3A|html5lib+simpletree+(38.5+sec)|html5lib+ElementTree+(39.7+sec)|HTMLParser+(4.1+sec)|libxml2dom+(0.7+sec)|&amp;cht=bhs&amp;chs=400x120&amp;chd=e%3ABFGp..-E&amp;chxt=y">Chart</a></p>

<h2>Serialising</h2>

<p>(htmlparser omitted since it doesn&#8217;t serialise)</p>

<ul>
<li>libxml2dom     :  0.3003 sec ( 100% of libxml2dom)</li>
<li>html5_et       :  3.4985 sec (1164% of libxml2dom)</li>
<li>html5_simple   :  2.2173 sec ( 738% of libxml2dom)</li>
</ul>

<p><a href="http://chart.apis.google.com/chart?chxl=0%3A|html5_simple+(2.2+sec)|html5_et+(3.5+sec)|libxml2dom+(0.3+sec)|&amp;cht=bhs&amp;chs=400x90&amp;chd=e%3AFf..oj&amp;chxt=y">Chart</a></p>

<h2>Memory Usage</h2>

<p>(&#8220;sec&#8221; in output manually changed to &#8220;MB&#8221;)</p>

<ul>
<li>libxml2dom     :  26.3080 MB ( 100% of libxml2dom)</li>
<li>htmlparser     :  4.4520 MB (  16% of libxml2dom)</li>
<li>html5_et       :  64.0000 MB ( 243% of libxml2dom)</li>
<li>html5_simple   :  99.1560 MB ( 376% of libxml2dom)</li>
</ul>

<p><a href="http://chart.apis.google.com/chart?chxl=0%3A|html5lib+simpletree+(99.2+Mb)|html5lib+ElementTree+(64.0+Mb)|HTMLParser+(4.5+Mb)|libxml2dom+(26.3+Mb)|&amp;cht=bhs&amp;chs=400x120&amp;chd=e%3AQ-C3pT..&amp;chxt=y">Chart</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nyx-</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-148278</link>
		<dc:creator>Nyx-</dc:creator>
		<pubDate>Wed, 20 Jan 2010 06:54:06 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-148278</guid>
		<description>Wow, awesome tutorial. I was reading an outdated guide using xml.dom.ext.reader HtmlLib; I&#039;m really happy I was able to find this ^__^</description>
		<content:encoded><![CDATA[<p>Wow, awesome tutorial. I was reading an outdated guide using xml.dom.ext.reader HtmlLib; I&#8217;m really happy I was able to find this ^__^</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Software house</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-127448</link>
		<dc:creator>Software house</dc:creator>
		<pubDate>Fri, 04 Sep 2009 05:58:02 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-127448</guid>
		<description>I think you should invest some time in learning XPath. Parse the html string with lxml.html and then run xpath queries on that etree object. I bet it would simplify the code a lot (especially when you look for more information than just a node).</description>
		<content:encoded><![CDATA[<p>I think you should invest some time in learning XPath. Parse the html string with lxml.html and then run xpath queries on that etree object. I bet it would simplify the code a lot (especially when you look for more information than just a node).</p>
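
<p>A minimal sketch of that approach, assuming lxml is installed. The sample markup and the <code>entry</code> class name are made up for illustration and do not come from the post.</p>

```python
import lxml.html

# Hypothetical sample markup, not taken from the post.
html = """
<html><body>
  <div class="entry"><a href="/first">First</a></div>
  <div class="entry"><a href="/second">Second</a></div>
  <div class="other"><a href="/skip">Skip</a></div>
</body></html>
"""

# Parse the HTML string into an element tree.
doc = lxml.html.fromstring(html)

# One XPath query can return attribute values directly...
hrefs = doc.xpath('//div[@class="entry"]/a/@href')

# ...or elements, from which more than one piece of information can be read.
links = [(a.get("href"), a.text_content())
         for a in doc.xpath('//div[@class="entry"]/a')]

print(hrefs)
print(links)
```

<p>Selecting elements rather than attribute values is probably the more useful form when several fields are needed from each match.</p>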
]]></content:encoded>
	</item>
	<item>
		<title>By: aaaa</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-118587</link>
		<dc:creator>aaaa</dc:creator>
		<pubDate>Sun, 19 Jul 2009 18:57:53 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-118587</guid>
		<description>It&#039;s not hard to install on Ubuntu.

After you get the two components,

lxml can be installed with

sudo apt-get install python-lxml

BAM!</description>
		<content:encoded><![CDATA[<p>It&#8217;s not hard to install on Ubuntu.</p>

<p>After you get the two components,</p>

<p>lxml can be installed with</p>

<p>sudo apt-get install python-lxml</p>

<p>BAM!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: StatsProf</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-116311</link>
		<dc:creator>StatsProf</dc:creator>
		<pubDate>Thu, 09 Jul 2009 23:40:34 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-116311</guid>
		<description>Like your other readers, I was surprised by your results.

I was also struck by my difficulty recognizing that your first graph was not meant to suggest a Gaussian (&quot;Normal&quot;) distribution, and then trying to match the graphs to your text. 

Following Edward Tufte (http://www.edwardtufte.com/), they would be easier to interpret if the values were sorted before creating the graphs. I hope you&#039;ll excuse the character graphics, something like:

+ .................Sec

+ ..lxml.html  0.6&#124;X

+ .MLParser  2.9&#124;XXXX

+ .de.htmlfill  4.5&#124;XXXXXXX

+ ......Genshi  7.3&#124;XXXXXXXXXXXX

+ .tifulSoup 10.6&#124;XXXXXXXXXXXXXXXXX

+ mentTree 30.2&#124;XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

+ ..minidom 35.2&#124;XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

...............................+___________+___________+___________+___________+___________+________


Given my inexperience with / incomprehension of &quot;Markdown&quot;: how DO you get an unordered list?

Thanks, and best wishes.</description>
		<content:encoded><![CDATA[<p>Like your other readers, I was surprised by your results.</p>

<p>I was also struck by my difficulty recognizing that your first graph was not meant to suggest a Gaussian (&#8220;Normal&#8221;) distribution, and then trying to match the graphs to your text. </p>

<p>Following Edward Tufte (http://www.edwardtufte.com/), they would be easier to interpret if the values were sorted before creating the graphs. I hope you&#8217;ll excuse the character graphics, something like:</p>

<ul>
<li><p>&#8230;&#8230;&#8230;&#8230;&#8230;..Sec</p></li>
<li><p>..lxml.html  0.6|X</p></li>
<li><p>.MLParser  2.9|XXXX</p></li>
<li><p>.de.htmlfill  4.5|XXXXXXX</p></li>
<li><p>&#8230;&#8230;Genshi  7.3|XXXXXXXXXXXX</p></li>
<li><p>.tifulSoup 10.6|XXXXXXXXXXXXXXXXX</p></li>
<li><p>mentTree 30.2|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p></li>
<li><p>..minidom 35.2|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p></li>
</ul>

<p>&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;.+___________+___________+___________+___________+___________+________</p>
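
<p>A rough sketch of how a sorted character chart like the one above could be generated. The values and labels are copied from the list above, with the leading padding dots dropped and truncated names left as they appear. The function name <code>text_bar_chart</code> and the 50-character scale are assumptions made for illustration.</p>

```python
# Values (seconds) as listed above, deliberately out of order here.
# Labels are kept as they appear above, minus the leading padding dots.
results = {
    "Genshi": 7.3,
    "minidom": 35.2,
    "lxml.html": 0.6,
    "tifulSoup": 10.6,
    "MLParser": 2.9,
    "mentTree": 30.2,
    "de.htmlfill": 4.5,
}


def text_bar_chart(data, width=50):
    """Return chart lines sorted by ascending value, one bar of X's per entry."""
    longest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in sorted(data.items(), key=lambda item: item[1]):
        # Scale each bar against the largest value; always show at least one X.
        bar = "X" * max(1, round(value / longest * width))
        lines.append(f"{label:>{label_width}} {value:5.1f}|{bar}")
    return lines


chart = text_bar_chart(results)
print("\n".join(chart))
```

<p>Sorting before drawing is the whole point of the suggestion: the eye can then read the ranking straight off the bar lengths.</p>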

<p>Given my inexperience with / incomprehension of &#8220;Markdown&#8221;: how DO you get an unordered list?</p>

<p>Thanks, and best wishes.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Wayne Sadler</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-112648</link>
		<dc:creator>Wayne Sadler</dc:creator>
		<pubDate>Wed, 17 Jun 2009 10:11:09 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-112648</guid>
		<description>Hi Ian Bicking

I like your post on the lxml timings. I have been working on some parsing with another programmer, but we are having some timing issues: 3 sec on average.

May I ask whether you are good with these types of scripts? I am aiming for a shorter time to process the data. If you are available to assist, could you please email me?

Correction for the above post: Ian Bicking did not post on June 17, 2009 at 3:57 am. For some reason the name field auto-filled and I did not notice.

Many thanks Wayne</description>
		<content:encoded><![CDATA[<p>Hi Ian Bicking</p>

<p>I like your post on the lxml timings. I have been working on some parsing with another programmer, but we are having some timing issues: 3 sec on average.</p>

<p>May I ask whether you are good with these types of scripts? I am aiming for a shorter time to process the data. If you are available to assist, could you please email me?</p>

<p>Correction for the above post: Ian Bicking did not post on June 17, 2009 at 3:57 am. For some reason the name field auto-filled and I did not notice.</p>

<p>Many thanks Wayne</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ian Bicking</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-112647</link>
		<dc:creator>Ian Bicking</dc:creator>
		<pubDate>Wed, 17 Jun 2009 09:57:30 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-112647</guid>
		<description>Hi Ian Bicking

I like your post on the lxml timings. I have been working on some parsing with another programmer, but we are having some timing issues: 3 sec on average.

May I ask whether you are good with these types of scripts? I am aiming for a shorter time to process the data. If you are available to assist, could you please email me?

Many thanks Wayne</description>
		<content:encoded><![CDATA[<p>Hi Ian Bicking</p>

<p>I like your post on the lxml timings. I have been working on some parsing with another programmer, but we are having some timing issues: 3 sec on average.</p>

<p>May I ask whether you are good with these types of scripts? I am aiming for a shorter time to process the data. If you are available to assist, could you please email me?</p>

<p>Many thanks Wayne</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ian Bicking</title>
		<link>http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/comment-page-1/#comment-20642</link>
		<dc:creator>Ian Bicking</dc:creator>
		<pubDate>Wed, 09 Jul 2008 03:08:49 +0000</pubDate>
		<guid isPermaLink="false">http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/#comment-20642</guid>
		<description>Well, hopefully you aren&#039;t considering scraping email addresses...?  You could look at python-spidermonkey, but the task isn&#039;t all that easy, especially getting document.write to work.</description>
		<content:encoded><![CDATA[<p>Well, hopefully you aren&#8217;t considering scraping email addresses&#8230;?  You could look at python-spidermonkey, but the task isn&#8217;t all that easy, especially getting document.write to work.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

