Programming

WebOb decorator

Lately I’ve been writing a few applications (e.g., PickyWiki and a revisit of a request-tracking application, VaingloriousEye), and I usually use no framework at all. Pylons would be a natural choice, but given that I am comfortable with all the components, I find myself inclined to assemble the pieces myself.

In the process I keep writing bits of code to make WSGI applications from simple WebOb-based request/response cycles. The simplest form looks like this:

from webob import Request, Response, exc

def wsgiwrap(func):
    def wsgi_app(environ, start_response):
        req = Request(environ)
        try:
            resp = func(req)
        except exc.HTTPException as e:
            resp = e
        return resp(environ, start_response)
    return wsgi_app

@wsgiwrap
def hello_world(req):
    return Response('Hi %s!' % (req.POST.get('name', 'You')))

But each time I write it, I change things slightly, implementing more or fewer features. For instance, handling methods, or coercing other responses, or handling middleware.

Having implemented several of these (and reading other people’s implementations) I decided I wanted WebOb to include a kind of reference implementation. But I don’t like to include anything in WebOb unless I’m sure I can get it right, so I’d really like feedback. (There’s been some less than positive feedback, but I trudge on.)

My implementation is in a WebOb branch, primarily in webob.dec (along with some doctests).

The most prominent way this is different from the example I gave is that it doesn’t change the function signature; instead it adds an attribute .wsgi_app, which is the WSGI application associated with the function. My goal with this is that the decorator isn’t intrusive. Here’s the case where I’ve been bothered:

class MyClass(object):
    @wsgiwrap
    def form(self, req):
        return Response(form_html...)

    @wsgiwrap
    def form_post(self, req):
        pass  # handle submission

OK, that’s fine, then I add validation:

@wsgiwrap
def form_post(self, req):
    if not valid(req):  # valid() is a hypothetical validation helper
        return self.form
    # handle submission

This still works, because the decorator allows you to return any WSGI application, not just a WebOb Response object. But that’s not helpful, because I need errors…

@wsgiwrap
def form_post(self, req):
    if not valid(req):  # valid() is a hypothetical validation helper
        return self.form(req, errors)
    # handle submission

That is, I want to have an optional argument to the form method that passes in errors. But I can’t do this with the traditional wsgiwrap decorator; instead I have to refactor the code to have a third method that both form and form_post use. Of course, there’s more than one way to address this issue, but this is the technique I like.
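To make the idea concrete, here is a minimal sketch of a decorator that leaves the function alone and only attaches a .wsgi_app attribute. It is not the code from the WebOb branch: to stay runnable with just the standard library it passes the raw environ dictionary instead of a WebOb Request, returns plain text instead of a Response, and only handles plain functions (a real version would also need to cope with bound methods). The validity check is a made-up placeholder.

```python
def wsgiwrap(func):
    """Leave func's signature alone; expose the WSGI app as func.wsgi_app."""
    def wsgi_app(environ, start_response):
        # Stand-in for the WebOb request/response cycle: func receives the
        # raw environ and returns the response body as text.
        body = func(environ).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    func.wsgi_app = wsgi_app
    return func


@wsgiwrap
def form(req, errors=None):
    # The extra argument is usable because the decorator did not replace
    # this function with a two-argument WSGI callable.
    message = "<p>%s</p>" % errors if errors else ""
    return message + "<form method='POST'>...</form>"


@wsgiwrap
def form_post(req):
    if not req.get("QUERY_STRING"):  # hypothetical validity check
        return form(req, errors="name is required")
    return "<p>Thanks!</p>"
```

A server would be handed form_post.wsgi_app, while other code keeps calling form(req, errors) as an ordinary function.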

The one other notable feature is that you can also make middleware:

@wsgify.middleware
def cap_middleware(req, app):
    resp = app(req)
    resp.body = resp.body.upper()
    return resp

capped_app = cap_middleware(some_wsgi_app)
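For comparison, here is roughly what the same middleware has to do when written against plain WSGI, without WebOb. This is a simplified sketch: it buffers the whole body, ignores the write() callable and the iterable's close() method, and hello_app is a made-up application used only for illustration.

```python
def cap_middleware(app):
    """Wrap a WSGI app so that its response body is upper-cased."""
    def wrapped(environ, start_response):
        captured = {}

        def capture(status, headers, exc_info=None):
            # Hold the status and headers until the new body is known.
            captured["status"] = status
            captured["headers"] = headers

        body = b"".join(app(environ, capture)).upper()
        headers = [(k, v) for k, v in captured["headers"]
                   if k.lower() != "content-length"]
        headers.append(("Content-Length", str(len(body))))
        start_response(captured["status"], headers)
        return [body]
    return wrapped


def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


capped_app = cap_middleware(hello_app)
```

Most of the lines are bookkeeping around start_response, which is the part the request-in, response-out style of the decorator hides.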

Otherwise, for some reason I’ve found myself putting an inordinate amount of time into __repr__. Why I’ve done this I cannot say.

Programming
Python
Web


Treating configuration values as templates

A while back I described fassembler, and one of the things I liked in it is how the configuration works. It uses a conventional declarative INI-style format but also allows arbitrary code, so that defaults can be based on each other.

Here’s a basic example of a default configuration:

[some_app]
port_offset = 10
port = {{int(section.DEFAULT['base_port'])+int(port_offset)}}

Then if another configuration file defines base_port then this will all resolve. You can do this in Python, but you don’t get sections, and you have to define everything in just the right order. So while base_port will probably be defined in a deployment-specific configuration, it has to be defined before these other derivative settings are defined. On the other hand, you want deployment-specific configuration to take precedence… so there’s really no good ordering.

Anyway, the implementation really isn’t that hard. I use Tempita as the templating language because, well, I wrote it, and because it’s simple and appropriate for small strings. For the configuration parsing, ConfigParser will do.

Here’s what the basic code looks like in ConfigParser:

from ConfigParser import ConfigParser
from tempita import Template

class TempitaConfigParser(ConfigParser):

    def _interpolate(self, section, option, rawval, vars):
        ns = _Namespace(self, section, vars)
        tmpl = Template(rawval, name='%s.%s' % (section, option))
        value = tmpl.substitute(ns)
        return value

Actually, instead of using tempita.Template, we could just do eval(rawval, {}, ns); it would just require a lot more quoting (every value would have to be a valid Python expression). With either that or Tempita, the implementation of _Namespace will look the same.

Here’s a simple implementation:

from UserDict import DictMixin

class _Namespace(DictMixin):
    def __init__(self, config, section, vars):
        self.config = config
        self.section = section
        self.vars = vars

    def __getitem__(self, key):
        if key == 'section':
            return _Section(self)
        if self.config.has_option(self.section, key):
            return self.config.get(self.section, key)
        if self.vars and key in self.vars:
            return self.vars[key]
        raise KeyError(key)

    def __setitem__(self, key, value):
        if self.vars is None:
            self.vars = {key: value}
        else:
            self.vars[key] = value

We’ve introduced a magic variable section, which is used to refer to other sections. It looks like this:

class _Section(object):
    def __init__(self, namespace):
        self._namespace = namespace

    def __getattr__(self, attr):
        if attr.startswith('_'):
            raise AttributeError(attr)
        return _Namespace(self._namespace.config, attr, self._namespace.vars)

With these I think you get many of the benefits of using Python code as your configuration format, while still having the benefits of a more declarative approach to configuration, one that allows for forward and backward references.
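The same idea can be sketched against Python 3's configparser, which exposes interpolation as a pluggable class. This is not the recipe code: it uses the eval variant mentioned above instead of Tempita, finds {{...}} expressions with a regular expression, and the class names (ExprInterpolation, _Sections) are my own. Like the original, it evaluates arbitrary code, so it only makes sense for configuration you trust.

```python
import configparser
import re
from collections.abc import Mapping

_EXPR = re.compile(r"\{\{(.*?)\}\}")


class _Sections(object):
    """The magic `section` variable: section.NAME gives that section."""
    def __init__(self, parser):
        self._parser = parser

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return self._parser[name]


class _Namespace(Mapping):
    """Names visible to an expression: `section` plus this section's options."""
    def __init__(self, parser, section):
        self._parser = parser
        self._section = section

    def __getitem__(self, key):
        if key == "section":
            return _Sections(self._parser)
        if self._parser.has_option(self._section, key):
            return self._parser.get(self._section, key)
        raise KeyError(key)  # lets eval fall back to builtins such as int

    def __iter__(self):
        return iter(self._parser.options(self._section))

    def __len__(self):
        return len(self._parser.options(self._section))


class ExprInterpolation(configparser.Interpolation):
    def before_get(self, parser, section, option, value, defaults):
        ns = _Namespace(parser, section)
        # Values are evaluated lazily, on every get().
        return _EXPR.sub(lambda m: str(eval(m.group(1).strip(), {}, ns)), value)


DEFAULTS = """
[some_app]
port_offset = 10
port = {{int(section.DEFAULT['base_port'])+int(port_offset)}}
"""
DEPLOYMENT = """
[DEFAULT]
base_port = 8000
"""

parser = configparser.ConfigParser(interpolation=ExprInterpolation())
parser.read_string(DEFAULTS)     # refers to base_port before it exists
parser.read_string(DEPLOYMENT)   # defines base_port afterwards
port = parser.get("some_app", "port")
```

Because nothing is evaluated until get() is called, the order in which the files are read does not matter, which is the property the ordering discussion above was after.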

A full implementation has several more things than I show here, but you can see the full example in my recipes. It also has an example of using INITools instead of ConfigParser to give more accurate filenames and line numbers when there is an exception, while otherwise using the same interface.

Programming


Woonerf and Python

At TOPP there’s a lot of traffic discussion, since a substantial portion of the organization is dedicated to Livable Streets initiatives. One of the traffic ideas people have gotten excited about is Woonerf. This is a Dutch traffic planning idea. In areas where there’s the intersection of lots of kinds of traffic (car, pedestrian, bike, destinations and through traffic) you have to deal with the contention for the streets. Traditionally this is approached as a complicated system of rules and rights-of-way. There are spaces for each mode of transportation, lights to say which is allowed to go when (with lots of red and green arrows), crosswalk islands, concrete barriers, and so on.

A problem with this is that a person can only pay attention to so many things at a time. As the number of traffic controls increases, the controls themselves dominate your attention. It’s based on the ideal that so long as everyone pays attention to the controls, they don’t have to pay attention to each other. Of course, if there’s a circumstance the controls don’t take into account then people will deviate (for instance, crossing somewhere other than the crosswalk, or getting in the wrong lane for a turn; even the simple existence of a bike is usually unaccounted for). If all attention is on the controls, and everyone trusts that the controls are being obeyed, these deviations can lead to accidents. This can create a vicious cycle where the controls become increasingly complex to try to take into account every possibility, with the addition of things like Jersey barriers to exclude deviant traffic. At least in the U.S., and especially in the suburbs or in complex intersections, this feeling of an overcontrolled and restricted traffic plan is common.

Copenhagen retail street

So: Woonerf. This is an extreme reaction to traffic controls. An intersection designed with the principles of Woonerf eschews all controls. This includes even things like curbs and signage. It removes most cues about behavior, and specifically of the concept of "right of way". Every person entering the intersection must view it as a negotiation. The use of eye contact, body language, and hand signals determines who takes the right of way. In this way all kinds of traffic are peers, regardless of destination or mode of transport. Also each person must focus on where they are right now, and not where they will be a minute from now; they must stay engaged.


Code as Jersey Barrier

So, I was reading a critique of Python where someone was saying how they missed public/private/protected distinctions on attributes and methods. And it occurred to me: Python’s object model is like Woonerf.

Python does not enforce rules about what you must and must not do. There are cues, like leading underscores, the __magic_method__ naming pattern, or at the module level there’s __all__. But there are no curbs; you won’t even feel the slightest bump when you access a "private" attribute on an instance.
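A tiny illustration of how little resistance those cues offer. The Account class and its attributes are invented for the example; the behaviour shown (underscore names being ordinary attributes, and double-underscore names being merely renamed) is standard Python.

```python
class Account(object):
    def __init__(self, balance):
        self._balance = balance   # one underscore: "private" by convention only
        self.__audit = []         # two underscores: name-mangled, still reachable

    def balance(self):
        return self._balance


acct = Account(10)
acct._balance = 1000000           # no error, no warning
mangled = acct._Account__audit    # the mangled name is an ordinary attribute
```

Nothing stops the caller here; whether this is a good idea is left to the negotiation between the caller and whoever maintains Account.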

This can lead to conflicts. For example, during discussions on installation, some people will argue for creating requirements like "SomeLibrary>=1.0,<2.0", with the expectation that while version 2.0 doesn’t exist, so long as you install something in the 1.x line it will maintain compatibility with your application. This is an unrealistic expectation. Do you and the library maintainer have the same idea about what compatibility means? What if you depend on something the maintainer considers a bug?

Practically, you can’t be sure that future versions of a library will work. You also can’t be sure they won’t work; there’s nothing that requires the maintainer of the library to break your application with version 2.0. This is where it becomes a negotiation. If you decide to cross without a crosswalk (use a non-public API) then okay. You just have to keep an eye out. And library authors, whether they like it or not, need to consider the API-as-it-is-used as much as the API-they-have-defined. In open source in particular, there are a lot of ways to achieve this communication. We don’t use some third party (e.g., a QA team or language features) to enforce rules on both sides (there are no traffic controls); instead the communication is flatter, and speaks as much to intentions as mechanisms. When someone asks "how do I do X?" a common response is: "what are you trying to accomplish?" Often an answer to the second question makes the first question irrelevant.

Woonerf is great for small towns, for creating a humane space. Is it right for big cities and streets, for busy people who want to get places fast, for trucking and industry? I’m not sure, but probably not. This is where a multi-paradigm approach is necessary. Over time libraries have to harden and become more static; innovation should happen on top of them and not in the library. Sometimes we create third party controls through interfaces (of one kind or another). I suppose in this case there is a kind of negotiation about how we negotiate; there’s no one process for how to build negotiation-free foundations in Python. But it’s best not to harden things you aren’t sure are right, and I’m pretty sure there’s no "right" at this very-human level of abstraction.

Programming


Atompub as an alternative to WebDAV

I’ve been thinking about an import/export API for PickyWiki; I want something that’s sensible, and works well enough that it can be the basis for things like creating restorable snapshots, integration with version control systems, and being good at self-hosting documentation.

So far I’ve made a simple import/export system based on Atom. You can export the entire site as an Atom feed, and you can import Atom feeds. But whole-site import/export isn’t enough for the tools I’d like to write on top of the API.

WebDAV would seem like a logical choice, as it lets you get and put resources. But it’s not a great choice for a few reasons:

  • It’s really hard to implement on the server.
  • Even clients are hard to implement.
  • It uses GET to get resources. This is probably its most fatal flaw. There is no CMS that I know of (except maybe one) where the thing you view in the browser is the thing that you’d actually edit. To work around this CMSes use User-Agent sniffing or an alternate URL space.
  • WebDAV is worried about "collections" (i.e., directories). The web basically doesn’t know what "collections" are, it only knows paths, and paths are strings.
  • (In summary) WebDAV uses HTTP, but it is not of the web.

I don’t want to invent something new though. So I started thinking of Atom some more, and Atompub.

The first thought is how to fix the GET problem in WebDAV. A web page isn’t an editable representation, but it’s pretty reasonable to put an editable representation into an Atom entry. Clients won’t necessarily understand extensions and properties you might add to those entries, but I don’t see any way around that. An entry might look like:

<entry>
  <content type="html">QUOTED HTML</content>
  ... other normal metadata (title etc) ...
  <privateprop:myproperty xmlns:privateprop="URL" name="foo" value="bar" />
</entry>

While there is special support for HTML, XHTML, and plain text in Atom, you can put any type of content in <content>, encoded in base64.
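As a rough sketch of what producing such entries could look like, the following builds one HTML entry and one base64-encoded binary entry with the standard library’s ElementTree. The namespace URL for privateprop and the make_entry helper are placeholders of my own, and the metadata is cut down to a title.

```python
import base64
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
PRIV = "http://example.com/ns/privateprop"  # hypothetical namespace URL

ET.register_namespace("", ATOM)
ET.register_namespace("privateprop", PRIV)


def make_entry(title, body, content_type):
    """Serialize a minimal Atom entry; non-HTML bodies are base64-encoded."""
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    content = ET.SubElement(entry, "{%s}content" % ATOM, type=content_type)
    if content_type == "html":
        content.text = body  # ElementTree escapes the markup on output
    else:
        content.text = base64.b64encode(body).decode("ascii")
    ET.SubElement(entry, "{%s}myproperty" % PRIV, name="foo", value="bar")
    return ET.tostring(entry, encoding="unicode")


page = make_entry("Front page", "<p>Hello</p>", "html")
image = make_entry("Logo", b"\x89PNG...", "image/png")
```

A client that does not know the privateprop namespace can still read the title and content, and should simply ignore the extra element.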

To find the editable representation, the browser page can point to it. I imagine something like this:

<link rel="alternate" type="application/atom+xml; type=entry"
 href="this-url?format=atom">

The actual URL (in this example this-url?format=atom) can be pretty much anything. My one worry is that this could be confused with feed detection, which looks like:

<link rel="alternate" type="application/atom+xml"
 href="/atom.xml">

The only difference is "; type=entry", which I’m betting a lot of clients don’t pay attention to.
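A client that does want to tell the two apart has to parse the type parameter itself. Here is a small sketch using the standard library’s HTML parser; the EntryLinkFinder class is illustrative only, and a real client would also want to resolve the href values against the page URL.

```python
from html.parser import HTMLParser


class EntryLinkFinder(HTMLParser):
    """Collect <link rel="alternate"> URLs, split into entry and feed links."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.entry_links = []
        self.feed_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "link" or attrs.get("rel") != "alternate":
            return
        # Split "application/atom+xml; type=entry" into the media type
        # and its parameters.
        parts = [p.strip().lower() for p in (attrs.get("type") or "").split(";")]
        if parts[0] != "application/atom+xml":
            return
        if "type=entry" in parts[1:]:
            self.entry_links.append(attrs.get("href"))
        else:
            self.feed_links.append(attrs.get("href"))


finder = EntryLinkFinder()
finder.feed("""
<link rel="alternate" type="application/atom+xml; type=entry"
 href="this-url?format=atom">
<link rel="alternate" type="application/atom+xml" href="/atom.xml">
""")
```

After feeding the page, finder.entry_links holds the editable representation and finder.feed_links holds ordinary feeds.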

The Atom entries then can have an element:

<link rel="edit" href="this-url" />

This is a location where you can PUT a new entry to update the resource. You could allow the client to PUT directly over the old page, or use this-url?format=atom or whatever is convenient on the server-side. Additionally, DELETE to the same URL would delete.

This handles updates and deletes, and single-page reads. The next issue is creating pages.

Atompub makes creation fairly simple. First you have to get the Atompub service document. This is a document with the type application/atomsvc+xml and it gives the collection URL. It’s suggested you make this document discoverable like:

<link rel="service" type="application/atomsvc+xml"
 href="/atomsvc.xml">

This document then points to the "collection" URL, which for our purposes is where you create documents. The service document would look like:

<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>SITE TITLE</atom:title>
    <collection href="/atomapi">
      <atom:title>SITE TITLE</atom:title>
      <accept>*/*</accept>
      <accept>application/atom+xml;type=entry</accept>
    </collection>
  </workspace>
</service>

Basically this indicates that you can POST any media to /atomapi (both Atom entries, and things like images).

To create a page, a client then does a POST like:

POST /atomapi
Content-Type: application/atom+xml; type=entry
Slug: /page/path

<entry xmlns="...">...</entry>

There’s an awkwardness here, that you can suggest (via the Slug header) what the URL for the new page is. The client can find the actual URL of the new page from the Location header in the response. But the client can’t demand that the slug be respected (getting an error back if it is not), and there’s lots of use cases where the client doesn’t just want to suggest a path (for instance, other documents that are being created might rely on that path for links).

Also, "slug" implies… well, a slug. That is, some path segment probably derived from the title. There’s nothing stopping the client from putting a complete path in there, but it’s very likely to be misinterpreted (e.g. translating /page/path to /2009/01/pagepath).
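On the client side, the best available workaround is to send the Slug and then check the Location that comes back. A sketch with urllib follows; the host name is made up, created_url is a hypothetical helper, and the request is constructed but not actually sent.

```python
import urllib.request

entry = b'<entry xmlns="http://www.w3.org/2005/Atom">...</entry>'

req = urllib.request.Request(
    "http://wiki.example.com/atomapi",   # hypothetical collection URL
    data=entry,                          # supplying data makes this a POST
    headers={
        "Content-Type": "application/atom+xml; type=entry",
        "Slug": "/page/path",
    },
)


def created_url(response, wanted_path):
    """Return the new page's URL, or raise if the server ignored the slug."""
    location = response.headers["Location"]
    if not location.endswith(wanted_path):
        raise ValueError("asked for %s, got %s" % (wanted_path, location))
    return location

# To actually create the page:
# with urllib.request.urlopen(req) as response:
#     url = created_url(response, "/page/path")
```

By the time the check fails the page already exists at the wrong URL, so the client is left to clean up, which is exactly the awkwardness described above.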

But I digress. Anyway, you can post every resource as an entry, base64-encoding the resource body, but Atompub also allows POSTing media directly. When you do that, the server puts the media somewhere and creates a simple Atom entry for the media. If you wanted to add properties to that entry, you’d edit the entry after creating it.

The last missing piece is how to get a list of all the pages on a site. Atompub does have an answer for this: just GET /atomapi will give you an Atom feed, and for our purposes we can demand that the feed is complete (using paging so that any one page of the feed doesn’t get too big). But this doesn’t seem like a good solution to me. GData specifies a useful set of queries for feeds, but I’m not sure that this is very useful here; the kind of queries a client needs to do for this use case aren’t things GData was designed for.

The queries that seem most important to me are queries by page path (which allows some sense of "collections" without being formal) and by content type. Also to allow incremental updates on the client side, filtering these queries by last-modified time (i.e., all pages created since I last looked). Reporting queries (date of creation, update, author, last editor, and custom properties) of course could be useful, but don’t seem as directly applicable.

Also, often the client won’t want the complete Atom entry for the pages, but only a list of pages (maybe with minimal metadata). I’m unsure about the validity of abbreviated Atom entries, but it seems like one solution. Any Atom entry can have something like:

<link rel="self" type="application/atom+xml; type=entry"
 href="url?format=atom" />

This indicates where the entry exists, though it doesn’t suggest very forcefully that the actual entry is abbreviated. Anyway, I could then imagine a feed like:

<feed>
  <entry>
    <content type="some/content-type" />
    <link rel="self" href="..." />
    <updated>YYYY-MM-DDTHH:MM:SSZ</updated>
  </entry>
  ...
</feed>

This isn’t entirely valid, however — you can’t just have an empty <content> tag. You can use a src attribute to use indirection for the content, and then add Yet Another URL for each page that points to its raw content. But that’s just jumping through hoops. This also seems like an opportunity to suggest that the entry is incomplete.

To actually construct these feeds, you need some way of getting the feed. I suggest that another entry be added to the Atompub service document, something like:

<cmsapi:feed href="URI-TEMPLATE" />

That would be a URI Template that accepted several known variables (though frustratingly, URI Templates aren’t properly standardized yet). Things like:

  • content-type: the content type of the resource (allowing wildcards like image/*)
  • container: a path to a container, i.e., /2007 would match all pages in /2007/…
  • path-regex: some regular expression to match the paths
  • last-modified: return all pages modified at the given date or later

All parameters would be ANDed together.
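On the server, ANDing those parameters could be as simple as the following sketch. The page dictionaries, their keys, and the match_page function are assumptions made for the example; only the four filter names come from the list above.

```python
import re
from datetime import datetime
from fnmatch import fnmatchcase


def match_page(page, content_type=None, container=None,
               path_regex=None, last_modified=None):
    """Return True if page satisfies every filter that was supplied."""
    if content_type and not fnmatchcase(page["content_type"], content_type):
        return False
    if container and not page["path"].startswith(container.rstrip("/") + "/"):
        return False
    if path_regex and not re.search(path_regex, page["path"]):
        return False
    if last_modified and page["updated"] < last_modified:
        return False
    return True


pages = [
    {"path": "/2007/logo.png", "content_type": "image/png",
     "updated": datetime(2007, 5, 1)},
    {"path": "/2007/intro", "content_type": "text/html",
     "updated": datetime(2008, 1, 15)},
    {"path": "/2008/photo.jpg", "content_type": "image/jpeg",
     "updated": datetime(2008, 6, 2)},
]

images_2007 = [p["path"] for p in pages
               if match_page(p, content_type="image/*", container="/2007")]
recent = [p["path"] for p in pages
          if match_page(p, last_modified=datetime(2008, 1, 1))]
```

A filter that is left out matches everything, so an empty query returns the complete list of pages.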

So, open issues:

  • How to strongly suggest a path when creating a resource (better than Slug)
  • How to rename (move) or copy a page (it’s easy enough to punt on copy, but I’d rather move be a little more formal than just recreating a resource in a new location and deleting the original)
  • How to represent abbreviated Atom entries

With these resolved I think it’d be possible to create a much simpler API than WebDAV, and one that can be applied to existing applications much more easily. (If you think there’s more missing, please comment.)

HTML
Programming
Web


Avoiding Silos: “link” as a first-class object

One of the constant annoyances to me in web applications is the self-proclaimed need for those applications to know about everything and do everything, and only spotty ad hoc techniques for including things from other applications.

An example might be blog navigation or search, where you can only include data from the application itself. Or "Recent Posts" which can only show locally-produced posts. What if I post something elsewhere? I have to create some shoddy placeholder post to refer to it. Bah! Underlying this the data is usually structured in a specific way, with the HTML being a sort of artifact of the database, the markup transient and a slave to the database’s structure.

An example of this might be a recent post listing like:

<ul>
  for post in recent_posts:
    <li>
      <a href="/post/{{post.year}}/{{post.month}}/{{post.slug}}">
        {{post.title}}</a>
    </li>
</ul>

There’s clearly no room for exceptions in this code. I am thus proposing that any system like this should have the notion of a "link" as a first-class object. The code should look like this:

<ul>
  for post in recent_posts:
    <li>
      {{post.link()}}
    </li>
</ul>

Just like with changing IDs to links in service documents, the template doesn’t actually look any more complicated than it did before (simpler, even). But now we can use simple object-oriented techniques to create first-class links. The code might look like:

class Post(SomeORM):
    def url(self):
        if self.type == 'link':
            return self.body
        else:
            base = get_request().application_url
            return '%s/%s/%s/%s' % (
                base, self.year, self.month, self.slug)

    def link(self):
        return html('<a href="%s">%s</a>') % (
            self.url(), self.title)

The addition of the .url() method has the obvious effect of making these offsite links work. Using a .link() method has the added advantage of allowing things like HTML snippets to be inserted into the system (even though that is not implemented here). By allowing arbitrary HTML in certain places you make it possible for people to extend the site in little ways — possibly adding markup to a title, or allowing an item in the list that actually contains two URLs (e.g., <a href="url1">Some Item</a> (<a href="url2">via</a>)).

In the context of Python I recommend making these into methods, not properties, because it allows you to later add keyword arguments to specialize the markup (like post.link(abbreviated=True)).
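Here is a self-contained sketch of that shape, with no ORM and no request object. The constructor, the base argument, and the 20-character cutoff for abbreviated titles are all invented for the example; escaping is done with the standard library rather than an html() helper.

```python
from html import escape


class Post(object):
    def __init__(self, title, slug=None, year=None, month=None,
                 type="post", body=""):
        self.title, self.slug = title, slug
        self.year, self.month = year, month
        self.type, self.body = type, body

    def url(self, base=""):
        if self.type == "link":
            return self.body          # offsite posts carry their own URL
        return "%s/%s/%s/%s" % (base, self.year, self.month, self.slug)

    def link(self, abbreviated=False, base=""):
        title = self.title
        if abbreviated and len(title) > 20:
            title = title[:20].rstrip() + "..."
        return '<a href="%s">%s</a>' % (
            escape(self.url(base), quote=True), escape(title))


local = Post("Woonerf and Python", slug="woonerf", year=2009, month=1)
offsite = Post("A talk I gave elsewhere", type="link",
               body="http://example.com/talk?id=1&lang=en")
```

A template only ever calls post.link(), or post.link(abbreviated=True) for a sidebar, and never needs to know which kind of post it has.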

One negative aspect of this is that you cannot affect all the markup through the template alone, you may have to go into the Python code to change things. Anyone have ideas for handling this problem?

HTML
Programming
Python
Web


Javascript Status Message Display

In a little wiki I’ve been playing with I’ve been trying out little ideas that I’ve had but haven’t had a place to actually implement them. One is how notification messages work. I’m sure other people have done the same thing, but I thought I’d describe it anyway.

A common pattern is to accept a POST request and then redirect the user to some page, setting a status message. Typically the status message is either set in a cookie or in the session, then the standard template for the application has some code to check for a message and display it.

The problem with this is that this breaks all caching — at any time any page can have some message injected into it, basically for no reason at all. So I thought: why not do the whole thing in Javascript? The server will set a cookie, but only Javascript will read it.

The code goes like this; on the server (easily translated into any framework):

resp.set_cookie('flash_message', urllib.quote(msg))

I quote the message because it can contain characters unsafe for cookies, and URL quoting is a particularly easy quoting to apply.
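To see what the quoting buys you, here is a standard-library sketch of building the cookie by hand. It uses http.cookies and urllib.parse (the Python 3 home of quote) in place of the resp.set_cookie call above, and the message text is made up.

```python
from http.cookies import SimpleCookie
from urllib.parse import quote, unquote

msg = 'Page "Front Page" saved; 3 links updated'

cookie = SimpleCookie()
cookie["flash_message"] = quote(msg)   # only cookie-safe characters remain
cookie["flash_message"]["path"] = "/"

# The text that would follow "Set-Cookie: " in the response headers.
header_value = cookie["flash_message"].OutputString()

# What the reader of the cookie has to undo.
restored = unquote(quote(msg))
```

The semicolon and the double quotes in the message would otherwise collide with the cookie syntax itself; after quoting they are just %3B and %22.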

Then I have this Javascript (using jQuery):

$(function () {
    // Anything in $(function...) is run on page load
    var flashMsg = readCookie('flash_message');
    if (flashMsg) {
        flashMsg = decodeURIComponent(flashMsg);
        var el = $('<div id="flash-message">'+
          '<div id="flash-message-close">'+
          '<a title="dismiss this message" '+
          'id="flash-message-button" href="#">X</a></div>'+
          flashMsg + '</div>');
        $('a#flash-message-button', el).bind(
          'click', function () {
            $(this.parentNode.parentNode).remove();
        });
        $('#body').prepend(el);
        eraseCookie('flash_message');
    }
});

Note that I’ve decided to treat the flash message as HTML. I don’t see a strong risk of injection attack in this case, though I must admit I’m a little unclear about what the normal policies are for cross-domain cookie setting.

I use these cookie functions because oddly I can’t find cookie handling functions in jQuery. It’s always weird to me how primitive document.cookie is. Anyway, CSS looks like this:

#flash-message {
  margin: 0.5em;
  border: 2px solid #000;
  background-color: #9f9;
  -moz-border-radius: 4px;
  text-align: center;
}

#flash-message-close {
  float: right;
  font-size: 70%;
  margin: 2px;
}

a#flash-message-button {
  text-decoration: none;
  color: #000;
  border: 1px solid #9f9;
}

a#flash-message-button:hover {
  border: 1px solid #000;
  background-color: #009;
  color: #fff;
}

This doesn’t have non-Javascript fallback, but I think that’s okay. This isn’t something that a spider would ever see (since spiders shouldn’t be submitting forms that result in update messages). Accessible browsers generally implement Javascript so that’s also not particularly a problem, though there may be additional hints I could give in CSS or Javascript to help make this more readable (if there’s a message, it should probably be the first thing read on the page).

Another common component of pages that varies separately from the page itself is logged-in status, but that’s more heavily connected to your application. Get both into Javascript and you might be able to turn caching way up on a lot of your pages.

Javascript
Programming
Web


lxml: an underappreciated web scraping library

When people think about web scraping in Python, they usually think BeautifulSoup. That’s okay, but I would encourage you to also consider lxml.

First, people think BeautifulSoup is better at parsing broken HTML. This is not correct. lxml parses broken HTML quite nicely. I haven’t done any thorough testing, but at least the BeautifulSoup broken HTML example is parsed better by lxml (which knows that <td> elements should go inside <table> elements).

Second, people feel lxml is harder to install. This is correct. BUT, lxml 2.2alpha1 includes an option to compile static versions of the underlying C libraries, which should improve the installation experience, especially on Macs. To install this new way, try:

$ STATIC_DEPS=true easy_install 'lxml>=2.2alpha1'

Once you have lxml installed, you have a great parser (which happens to be super-fast, and that is not a tradeoff). You get a fairly familiar API based on ElementTree, which though a little strange feeling at first, offers a compact and canonical representation of a document tree, compared to more traditional representations. But there’s more…

One of the features that should be appealing to many people doing screen scraping is that you get CSS selectors. You can use XPath as well, but usually that’s more complicated. Here’s an example I found getting links from a menu in a page in BeautifulSoup:

from BeautifulSoup import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://java.sun.com').read())
menu = soup.findAll('div',attrs={'class':'pad'})
for subMenu in menu:
    links = subMenu.findAll('a')
    for link in links:
        print "%s : %s" % (link.string, link['href'])

Here’s the same example in lxml:

from lxml.html import parse
doc = parse('http://java.sun.com').getroot()
for link in doc.cssselect('div.pad a'):
    print '%s: %s' % (link.text_content(), link.get('href'))

lxml generally knows more about HTML than BeautifulSoup. Also I think it does well with the small details; for instance, the lxml example will match elements in <div class="pad menu"> (space-separated classes), which the BeautifulSoup example does not do (obviously there are other ways to search, but the obvious and documented technique doesn’t pay attention to HTML semantics).

One feature that I think is really useful is .make_links_absolute(). This takes the base URL of the page (doc.base) and uses it to make all the links absolute. This makes it possible to relocate snippets of HTML or whole sets of documents (as with this program). This isn’t just <a href> links, but stylesheets, inline CSS with @import statements, background attributes, etc. It doesn’t see quite all links (for instance, links in Javascript) but it sees most of them, and works well for most sites. So if you want to make a local copy of a site:

from lxml.html import parse, open_in_browser
doc = parse('http://wiki.python.org/moin/').getroot()
doc.make_links_absolute()
open_in_browser(doc)

open_in_browser serializes the document to a temporary file and then opens a web browser (using webbrowser).

Here’s an example that compares two pages using lxml.html.diff:

from lxml.html.diff import htmldiff
from lxml.html import parse, tostring, open_in_browser, fromstring

def get_page(url):
    doc = parse(url).getroot()
    doc.make_links_absolute()
    return tostring(doc)

def compare_pages(url1, url2, selector='body div'):
    basis = parse(url1).getroot()
    basis.make_links_absolute()
    other = parse(url2).getroot()
    other.make_links_absolute()
    el1 = basis.cssselect(selector)[0]
    el2 = other.cssselect(selector)[0]
    diff_content = htmldiff(tostring(el1), tostring(el2))
    diff_el = fromstring(diff_content)
    el1.getparent().insert(el1.getparent().index(el1), diff_el)
    el1.getparent().remove(el1)
    return basis

if __name__ == '__main__':
    import sys
    doc = compare_pages(sys.argv[1], sys.argv[2], sys.argv[3])
    open_in_browser(doc)

You can use it like:

$ python lxmldiff.py \
'http://wiki.python.org/moin/BeginnersGuide?action=recall&rev=70' \
'http://wiki.python.org/moin/BeginnersGuide?action=recall&rev=81' \
'div#content'

Another feature lxml has is form handling. All the cool sexy new sites use minimal forms, but searching for "registration forms" I get this nice complex form. Let’s look at it:

>>> from lxml.html import parse, tostring
>>> doc = parse('http://www.actuaryjobs.com/cform.html').getroot()
>>> doc.forms
[<Element form at -48232164>]
>>> form = doc.forms[0]
>>> form.inputs.keys()
['thank_you_title', 'City', 'Zip', ... ]

Now we have a form object. There are two ways to get to the fields: form.inputs, which gives us a dictionary of all the actual <input> elements (and textarea and select). There’s also form.fields, which is a dictionary-like object. The dictionary-like object is convenient, for instance:

>>> form.fields['cEmail'] = 'me@example.com'

This actually updates the input element itself:

>>> tostring(form.inputs['cEmail'])
'<input type="input" name="cEmail" size="30" value="me@example.com">'

I think it’s actually a nicer API than htmlfill and can serve the same purpose on the server side.

But then you can also use the same interface for scraping, by filling fields and getting the submission. That looks like:

>>> import urllib
>>> action = form.action
>>> data = urllib.urlencode(form.form_values())
>>> if form.method == 'GET':
...     if '?' in action:
...         action += '&' + data
...     else:
...         action += '?' + data
...     data = None
>>> resp = urllib.urlopen(action, data)
>>> resp_doc = parse(resp).getroot()
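
The GET/POST branching above can be pulled into a small helper. This is only a sketch, written for Python 3 (where urlencode lives in urllib.parse); build_submission is a hypothetical name, not part of lxml, and values stands for the list of pairs that form.form_values() returns:

```python
from urllib.parse import urlencode


def build_submission(action, method, values):
    """Return (url, body) for a form submission.

    values is a list of (name, value) pairs, such as the one
    lxml's form.form_values() returns.
    """
    data = urlencode(values)
    if method.upper() == 'GET':
        # GET forms carry their fields in the query string.
        separator = '&' if '?' in action else '?'
        return action + separator + data, None
    # Anything else (normally POST) sends the fields as the request body.
    return action, data.encode('ascii')


url, body = build_submission(
    'http://example.com/search?lang=en', 'GET', [('q', 'lxml')])
print(url)
```

The returned pair can be handed to urllib.request.urlopen(url, body), which sends a POST whenever the body is not None.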

Lastly, there’s HTML cleaning. I think all these features work together well and do useful things, and the library is based on an actual understanding of HTML instead of just treating tags and attributes as arbitrary. (Also if you really like jQuery, you might want to look at pyquery, which is a jQuery-like API on top of lxml.)

HTML
Programming
Python

The Magic Sentinel

In an effort to get back on the blogging saddle, here’s a little note on default values in Python.

In Python there are often default values. The most typical default value is None, an object of vague meaning that almost screams "I’m a default". But sometimes None is a valid value, and sometimes you want to detect the case of "no value given", and None can hardly be called no value.

Here’s an example:

def getuser(username, default=None):
    if not user_exists(username):
        return default
    ...

In this case there is always a default, and so anytime you call getuser() you have to check for a None result. But maybe you have code where you’d really just like to get an exception if the user isn’t found. To get this you can use a sentinel. A sentinel is an object that has no particular meaning except to signal the end (like a NULL byte in a C string), or a special condition (like no default user).

Sometimes people do it like this:

_no_default = ()
def getuser(username, default=_no_default):
    if not user_exists(username):
        if default is _no_default:
            raise LookupError("No user with the username %r" % username)
        return default
    ...

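This works because the comparison default is _no_default is an identity test, so only that exact object will trigger the LookupError. (One caveat: CPython reuses a single object for every empty tuple, so a caller who explicitly passes () as the default hits the error too; a fresh object() avoids that.)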

Once you understand the pattern, this is easy enough to read. But when you use help() or other automatic documentation tools it is a little confusing, because the default value just appears as (). You could also use object() or [] or anything else, but the automatically generated documentation still won’t look that nice. So for a bit more polish I suggest:

class _NoDefault(object):
    def __repr__(self):
        return '(no default)'
NoDefault = _NoDefault()
del _NoDefault

def getuser(username, default=NoDefault):
    ...

You might then think "hey, why isn’t there one NoDefault that everyone can share?" If you do share that sentinel you run the risk of accidentally passing in that value even though you didn’t intend to. The value "NoDefault" will become overloaded with meaning, just as None is. By having a more private sentinel object you avoid that. A single sentinel factory (like _NoDefault in this example) would be nice, though. Though really PEP 3102 will probably make sentinels like this unnecessary for Python 3.0.
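
For what it’s worth, such a factory could look something like the sketch below. make_sentinel and the _USERS dictionary are made up for illustration (the dictionary stands in for whatever user_exists() would really consult); the point is that each call produces a private object with a readable repr:

```python
def make_sentinel(label):
    """Create a new, unique sentinel whose repr is the given label."""
    class _Sentinel(object):
        __slots__ = ()

        def __repr__(self):
            return label
    return _Sentinel()


NoDefault = make_sentinel('(no default)')

# Hypothetical stand-in for a real user store.
_USERS = {'alice': 'Alice A.'}


def getuser(username, default=NoDefault):
    if username not in _USERS:
        if default is NoDefault:
            raise LookupError("No user with the username %r" % username)
        return default
    return _USERS[username]


print(getuser('alice'))
print(getuser('bob', None))
```

Every module that calls make_sentinel() gets its own object, so one library’s "no default" can never be mistaken for another’s.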

Note that you can also implement arguments with no default via *args and **kwargs, e.g.:

def getuser(username, *args):
    if not user_exists(username):
        if not args:
            raise LookupError(...)
        else:
            return args[0]

But to do this right you should test that len(args)<=1, raise appropriate errors, maybe consider keyword arguments, and so on. It’s a pain in the butt, and when you’re finished the signature displayed by help() will be wrong anyway.
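
To give a sense of how much checking that is, here is a rough sketch of the careful version. The _USERS dictionary is again a hypothetical stand-in for the real lookup, and the error messages only approximate what Python itself would say:

```python
_USERS = {'alice': 'Alice A.'}  # hypothetical user store


def getuser(username, *args, **kwargs):
    # Accept the default either positionally or by keyword, never both.
    if len(args) > 1:
        raise TypeError(
            "getuser() takes at most 2 positional arguments (%d given)"
            % (len(args) + 1))
    if args and 'default' in kwargs:
        raise TypeError("getuser() got multiple values for argument 'default'")
    unexpected = set(kwargs) - {'default'}
    if unexpected:
        raise TypeError(
            "getuser() got an unexpected keyword argument %r"
            % sorted(unexpected)[0])
    if username in _USERS:
        return _USERS[username]
    if args:
        return args[0]
    if 'default' in kwargs:
        return kwargs['default']
    raise LookupError("No user with the username %r" % username)


print(getuser('bob', default='nobody'))
```

That is a dozen lines of argument handling to replace one default value, and help(getuser) still only shows (username, *args, **kwargs).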

Programming
Python

pyinstall is dead, long live pip!

I’ve finished renaming pyinstall to its new name: pip. The name pip is an acronym and declaration: pip installs packages.

I’ve also added a small feature intended for Google App Engine users, allowing you to zip and unzip packages in an environment. For instance:

$ pip zip --list
In ./lib/python2.5/site-packages:
  No zipped packages.
  Unzipped packages:
    paste  (98 files)
    pygments  (64 files)
    tempita  (7 files)
    weberror  (31 files)
    webob  (22 files)
    webtest  (9 files)
    nose  (43 files)
    setuptools-0.6c9-py2.5.egg  (43 files)
    simplejson  (28 files)
$ pip zip webob
Zip webob (in ./lib/python2.5/site-packages/webob)

Right now this doesn’t work well with egg directories (i.e., packages installed with easy_install), though that shouldn’t be too hard to resolve. pip install itself does not install packages into egg directories (it does install eggs, which is to say it installs all the egg metadata and works fine with pkg_resources).

I don’t really use buildout myself, but I would like to throw it out there that I think someone should create a pip recipe as an alternative to zc.recipe.egg. There’s not really a stable programmatic API in pip at this point, but with no consumers of the API it feels like premature design to settle on something now; integrate with pip and we can figure out what that stable API should be. If you integrate buildout, another useful feature would probably be an option to have pip freeze write the packages out to a setting in your buildout.cfg.

Programming

Hypertext-driven URLs

Roy T. Fielding, author of the REST thesis, recently wrote an article: REST APIs must be hypertext-driven. I liked this article; it fit with an intuition I’ve had. Then he wrote an article explaining that he wouldn’t really explain the other articles because, I guess, he wanted a conversation with the specialists, and it seems like a kind of invitation to reinterpret his writing. So since others are doing it I figured I’d do it too.

I’d summarize his argument thus:

  • Focus on media types, i.e., resource formats, i.e., document formats. The protocol will flow from these if they are well specified.
  • URL structures are not a media type. They are some kind of server layout. You can’t hold them, you can’t pass them around, there is no notion of CRUD. Media types have all sorts of advantages that URL structures do not.

An example of a protocol based on a URL structure would be something like:

  • Do GET /articles/ to get a JSON list of all the article ids, with a response like [1, 2, 3]
  • Do a GET /articles/{id} to get the representation of a specific article.

JSON is a reasonable structure for a media type. It is not itself a fully explained type, because it’s just a container for data, just like XML. In this example you have a document, [1, 2, 3] which isn’t self-describing and just isn’t very useful. A more appropriate protocol would be:

  • You start with a container, in our example /articles/. Do GET /articles/ to get a JSON document listing the URLs of all the articles. These URLs are relative to the container URL. You’ll get a response like ["./1", "./2", "./3"] (actually ["1", "2", "3"] would be fine too).
  • Do GET {article-url} to get the article representation.

It’s a small difference. Heck, the communication could look identical in practice, but by putting URLs in the JSON document instead of this abstract "id" notion you’ve created a more flexible and self-describing system. You could probably give a name to that list of URLs, and then just talk about that name.
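
From the client’s side, following that listing is just ordinary relative-URL resolution. Here is a small sketch using only the standard library; article_urls and the example.com addresses are made up for illustration:

```python
import json
from urllib.parse import urljoin


def article_urls(container_url, listing_body):
    """Resolve every link in a JSON listing against the container URL."""
    return [urljoin(container_url, link) for link in json.loads(listing_body)]


container = 'http://example.com/articles/'
print(article_urls(container, '["./1", "./2", "./3"]'))
print(article_urls(container, '["1", "/archive/2", "http://other.example/3"]'))
```

The second call is the interesting one: the server has moved one article and farmed another out to a different host, and the client didn’t have to change.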

An example in Atompub is rel="edit". An Atom entry can look like:

<entry>...
  <link rel="edit" href="/post/15" />
</entry>

Instead of the client just somehow knowing where to go to edit an entry, it’s made explicit. Thus you can move the entry around, while still pointing back to the canonical location to edit that entry.
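
A client that respects this might find the edit location along these lines. It’s a sketch: edit_url is a hypothetical helper, the feed address is invented, and I’m assuming the entry carries the standard Atom namespace (which the abbreviated example above leaves out):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM = '{http://www.w3.org/2005/Atom}'


def edit_url(entry_xml, base_url):
    """Return the absolute rel="edit" URL of an Atom entry, or None."""
    entry = ET.fromstring(entry_xml)
    for link in entry.iter(ATOM + 'link'):
        if link.get('rel') == 'edit':
            # The href may be relative, so resolve it against the feed URL.
            return urljoin(base_url, link.get('href'))
    return None


entry = '''<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <link rel="edit" href="/post/15" />
</entry>'''
print(edit_url(entry, 'http://example.com/feed/'))
```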

There’s nothing really that complicated about this. The rule is quite simple: link to other things, don’t just expect the client to know or guess where those other things are.

For a more concrete example of where this linking works well, OpenID uses <link rel="openid.server" href="…"> and <link rel="openid.delegate" href="…">, which allows you to add a little information to any HTML homepage so that the login can happen at a third location. If OpenID used something like looking at {homepage}/openid for an OpenID server then you couldn’t select whatever OpenID service you liked, or change services, or apply OpenID to hosted locations where you couldn’t install an OpenID server.
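
The discovery step amounts to reading those two links out of the homepage. A minimal sketch with the standard library’s HTML parser follows; LinkRelParser, discover_openid and the openid.example.net URLs are all invented for the example, and a real consumer would do rather more validation:

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collect <link rel="..." href="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attrs = dict(attrs)
        if attrs.get('rel') and attrs.get('href'):
            # rel may hold several space-separated values; first link wins.
            for rel in attrs['rel'].split():
                self.links.setdefault(rel, attrs['href'])


def discover_openid(html):
    """Return the (server, delegate) URLs advertised by a homepage."""
    parser = LinkRelParser()
    parser.feed(html)
    return (parser.links.get('openid.server'),
            parser.links.get('openid.delegate'))


page = '''<html><head>
<link rel="openid.server" href="https://openid.example.net/server">
<link rel="openid.delegate" href="https://openid.example.net/user/me">
</head><body>Home</body></html>'''
print(discover_openid(page))
```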

I’ll add my own little opinion in here: this is why the URL structure of applications doesn’t affect their RESTfulness, nor is URL structure all that important a concern generally. Pretty URL structures are a nice thing to have, like indenting your code in a pleasant way, but they have nothing to do with your API, and if you can’t use a crappy URL structure with that same API then probably something is wrong with that API.

Programming
Web
