Python

WebOb decorator

Lately I’ve been writing a few applications (e.g., PickyWiki and a revisiting a request-tracking application VaingloriousEye), and I usually use no framework at all. Pylons would be a natural choice, but given that I am comfortable with all the components, I find myself inclined to assemble the pieces myself.

In the process I keep writing bits of code to make WSGI applications from simple WebOb -based request/response cycles. The simplest form looks like this:

from webob import Request, Response, exc

def wsgiwrap(func):
    def wsgi_app(environ, start_response):
        req = Request(environ)
        try:
            resp = func(req)
        except exc.HTTPException, e:
            resp = e
        return resp(environ, start_response)
    return wsgi_app

@wsgiwrap
def hello_world(req):
    return Response('Hi %s!' % (req.POST.get('name', 'You')))

But each time I’d write it, I change things slightly, implementing more or less features. For instance, handling methods, or coercing other responses, or handling middleware.

Having implemented several of these (and reading other people’s implementations) I decided I wanted WebOb to include a kind of reference implementation. But I don’t like to include anything in WebOb unless I’m sure I can get it right, so I’d really like feedback. (There’s been some less than positive feedback, but I trudge on.)

My implementation is in a WebOb branch, primarily in webob.dec (along with some doctests).

The most prominent way this is different from the example I gave is that it doesn’t change the function signature, instead it adds an attribute .wsgi_app which is WSGI application associated with the function. My goal with this is that the decorator isn’t intrusive. Here’s the case where I’ve been bothered:

class MyClass(object):
    @wsgiwrap
    def form(self, req):
        return Response(form_html...)

    @wsgiwrap
    def form_post(self, req):
        handle submission

OK, that’s fine, then I add validation:

@wsgiwrap
def form_post(self, req):
    if req not valid:
        return self.form
    handle submission

This still works, because the decorator allows you to return any WSGI application, not just a WebOb Response object. But that’s not helpful, because I need errors…

@wsgiwrap
def form_post(self, req):
    if req not valid:
        return self.form(req, errors)
    handle submission

That is, I want to have an option argument to the form method that passes in errors. But I can’t do this with the traditional wsgiwrap decorator, instead I have to refactor the code to have a third method that both form and form_post use. Of course, there’s more than one way to address this issue, but this is the technique I like.

The one other notable feature is that you can also make middleware:

@wsgify.middleware
def cap_middleware(req, app):
    resp = app(req)
    resp.body = resp.body.upper()
    return resp

capped_app = cap_middleware(some_wsgi_app)

Otherwise, for some reason I’ve found myself putting an inordinate amount of time into __repr__. Why I’ve done this I cannot say.

Programming
Python
Web

Comments (11)

Permalink

Avoiding Silos: “link” as a first-class object

One of the constant annoyances to me in web applications is the self-proclaimed need for those applications to know about everything and do everything, and only spotty ad hoc techniques for including things from other applications.

An example might be blog navigation or search, where you can only include data from the application itself. Or "Recent Posts" which can only show locally-produce posts. What if I post something elsewhere? I have to create some shoddy placeholder post to refer to it. Bah! Underlying this the data is usually structured in a specific way, with the HTML being a sort of artifact of the database, the markup transient and a slave to the database’s structure.

An example of this might be a recent post listing like:

<ul>
  for post in recent_posts:
    <li>
      <a href="/post/{{post.year}}/{{post.month}}/{{post.slug}}">
        {{post.title}}</a>
    </li>
</ul>

There’s clearly no room for exceptions in this code. I am thus proposing that any system like this should have the notion of a "link" as a first-class object. The code should look like this:

<ul>
  for post in recent_posts:
    <li>
      {{post.link()}}
    </li>
</ul>

Just like with changing IDs to links in service documents, the template doesn’t actually look any more complicated than it did before (simpler, even). But now we can use simple object-oriented techniques to create first-class links. The code might look like:

class Post(SomeORM):
    def url(self):
        if self.type == 'link':
            return self.body
        else:
            base = get_request().application_url
            return '%s/%s/%s/%s' % (
                base, self.year, self.month, self.slug)

    def link(self):
        return html('<a href="%s">%s</a>') % (
            self.url(), self.title)

The addition of the .url() method has the obvious effect of making these offsite links work. Using a .link() method has the added advantage of allowing things like HTML snippets to be inserted into the system (even though that is not implemented here). By allowing arbitrary HTML in certain places you make it possible for people to extend the site in little ways — possibly adding markup to a title, or allowing an item in the list that actually contains two URLs (e.g., <a href="url1">Some Item</a> (<a href="url2">via</a>)).

In the context of Python I recommend making these into methods, not properties, because it allows you to later add keyword arguments to specialize the markup (like post.link(abbreviated=True)).

One negative aspect of this is that you cannot affect all the markup through the template alone, you may have to go into the Python code to change things. Anyone have ideas for handling this problem?

HTML
Programming
Python
Web

Comments (13)

Permalink

Using pip Requirements

Following onto a set of recent posts (from James, me, then James again), Martijn Faassen wrote a description of Grok’s version management. Our ideas are pretty close, but he’s using buildout, and I’ll describe out to do the same things with pip.

Here’s a kind of development workflow that I think works well:

  • A framework release is prepared. Ideally there’s a buildbot that has been running (as Pylons has, for example), so the integration has been running for a while.
  • People make sure there are released versions of all the important components. If there are known conflicts between pieces, libraries and the framework update their install_requires in their setup.py files to make sure people don’t use conflicting pieces together.
  • Once everything has been released, there is a known set of packages that work together. Using a buildbot maybe future versions will also work together, but they won’t necessarily work together with applications built on the framework. And breakage can also occur regardless of a buildbot.
  • Also, people may have versions of libraries already installed, but just because they’ve installed something doesn’t mean they really mean to stay with an old version. While known conflicts have been noted, there’s going to be lots of unknown conflicts and future conflicts.
  • When starting development with a framework, the developer would like to start with some known-good set, which is a set that can be developed by the framework developers, or potentially by any person. For instance, if you extend a public framework with an internal framework (or even a public sub-framework like Pinax) then the known-good set will be developed by a different set of people.
  • As an application is developed, the developer will add on other libraries, or use some of their own libraries. Development will probably occur at the trunk/tip of several libraries as they are developed together.
  • A developer might upgrade the entire framework, or just upgrade one piece (for instance, to get a bug fix they are interested in, or follow a branch that has functionality they care about). The developer doesn’t necessarily have the same notion of "stable" and "released" as the core framework developers have.
  • At the time of deployment the developer wants to make sure all the pieces are deployed together as they’ve tested them, and how they know them to work. At any time, another developer may want to clone the same set of libraries.
  • After initial deployment, the developer may want to upgrade a single component, if only to test that an upgrade works, or if it resolves a bug. They may test out combinations only to throw them away, and they don’t want to bump versions of libraries in order to deploy new combinations.

This is the kind of development pattern that requirement files are meant to assist with. They can provide a known-good set of packages. Or they can provide a starting point for an active line of development. Or they can provide a historical record of how something was put together.

The easy way to start a requirement file for pip is just to put the packages you know you want to work with. For instance, we’ll call this project-start.txt:

Pylons
-e svn+http://mycompany/svn/MyApp/trunk#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk#egg=MyLibrary

You can plug away for a while, and maybe you decide you want to freeze the file. So you do:

$ pip freeze -r project-start.txt project-frozen.txt

By using -r project-start.txt you give pip freeze a template for it to start with. From that, you’ll get project-frozen.txt that will look like:

Pylons==0.9.7
-e svn+http://mycompany/svn/MyApp/trunk@1045#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk@1058#egg=MyLibrary

## The following requirements were added by pip --freeze:
Beaker==0.2.1
WebHelpers==0.9.1
nose==1.4
# Installing as editable to satisfy requirement INITools==0.2.1dev-r3488:
-e svn+http://svn.colorstudy.com/INITools/trunk@3488#egg=INITools-0.2.1dev_r3488

At that point you might decide that you don’t care about the nose version, or you might have installed something from trunk when you could have used the last release. So you go and adjust some things.

Martijn also asks: how do you have framework developers maintain one file, and then also have developers maintain their own lists for their projects?

You could start with a file like this for the framework itself. Pylons for instance could ship with something like this. To install Pylons you could then do:

$ pip -E MyProject install \\
>    -r http://pylonshq.com/0.9.7-requirements.txt

You can also download that file yourself, add some comments, rename the file and add your project to it, and use that. When you freeze the order of the packages and any comments will be preserved, so you can keep track of what changed. Also it should be ameniable to source control, and diffs would be sensible.

You could also use indirection, creating a file like this for your project:

-r http://pylonshq.com/0.9.7-requirements.txt
-e svn+http://mycompany/svn/MyApp/trunk#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk#egg=MyLibrary

That is, requirements files can refer to each other. So if you want to maintain your own requirements file alongside the development of an upstream requirements file, you could do that.

Packaging
Python

Comments (2)

Permalink

A Few Corrections To “On Packaging”

James Bennett recently wrote an article on Python packaging and installation, and Setuptools. There’s a lot of issues, and writing up my thoughts could take a long time, but I thought at least I should correct some errors, specifically category errors. Figuring out where all the pieces in Setuptools (and pip and virtualenv) fit is difficult, so I don’t blame James for making some mistakes, but in the interest of clarifying the discussion…

I will start with a kind of glossary:

Distribution:
This is something-with-a-setup.py. A tarball, zip, a checkout, etc. Distributions have names; this is the name in setup(name="…") in the setup.py file. They have some other metadata too (description, version, etc), and Setuptools adds to that metadata some. Distutils doesn’t make it very easy to add to the metadata — it’ll whine a little about things it doesn’t know, but won’t do anything with that extra data. Fixing this problem in Distutils is an important aspect of Setuptools, and part of what Distutils itself unsuitable as a basis for good library management.
package/module:
This is something you import. It is not the same as a distribution, though usually a distribution will have the same name as a package. In my own libraries I try to name the distribution with mixed case (like Paste) and the package with lower case (like paste). Keeping the terminology straight here is very difficult; and usually it doesn’t matter, but sometimes it does.
Setuptools The Distribution:
This is what you install when you install Setuptools. It includes several pieces that Phillip Eby wrote, that work together but are not strictly a single thing.
setuptools The Package:
This is what you get when you do import setuptools. Setuptools largely works by monkeypatching distutils, so simply importing setuptools activates its functionality from then on. This package is entirely focused on installation and package management, it is not something you should use at runtime (unless you are installing packages as your runtime, of course).
pkg_resources The Module:
This is also included in Setuptools The Distribution, and is for use at runtime. This is a single module that provides the ability to query what distributions are installed, metadata about those distributions, information about the location where they are installed. It also allows distributions to be "activated". A distribution can be available but not activated. Activating a distribution means adding its location to sys.path, and probably you’ve noticed how long sys.path is when you use easy_install. Almost everything that allows different libraries to be installed, or allows different versions of libraries, does it through some management of sys.path. pkg_resources also allows for generic access to "resources" (i.e., non-code files), and let’s those resources be in zip files. pkg_resources is safe to use, it doesn’t do any of the funny stuff that people get annoyed with.
easy_install:
This is also in Setuptools The Distribution. The basic functionality it provides is that given a name, it can search for package with that distribution name, and also satisfying a version requirement. It then downloads the package, installs it (using setup.py install, but with the setuptools monkeypatches in place). After that, it checks the newly installed distribution to see if it requires any other libraries that aren’t yet installed, and if so it installs them.
Eggs the Distribution Format:
These are zip files that Setuptools creates when you run python setup.py bdist_egg. Unlike a tarball, these can be binary packages, containing compiled modules, and generally contain .pyc files (which are portable across platforms, but not Python versions). This format only includes files that will actually be installed; as a result it does not include doc files or setup.py itself. All the metadata from setup.py that is needed for installation is put in files in a directory EGG-INFO.
Eggs the Installation Format:
Eggs the Distribution Format are a subset of the Installation Format. That is, if you put an Egg zip file on the path, it is installed, no other process is necessary. But the Installation Format is more general. To have an egg installed, you either need something like DistroName-X.Y.egg/ on the path, and then an EGG-INFO/ directory under that with the metadata, or a path like DistroName.egg-info/ with the metadata directly in that directory. This metadata can exist anywhere, and doesn’t have to be directly alongside the actual Python code. Egg directories are required for pkg_resources to activate and deactivate distributions, but otherwise they aren’t necessary.
pip:
This is an alternative to easy_install. It works somewhat differently than easy_install, but not much. Mostly it is better than easy_install, in that it has some extra features and is easier to use. Unlike easy_install, it downloads all distributions up-front, and generates the metadata to read distribution and version requirements. It uses Setuptools to generate this metadata from a setup.py file, and uses pkg_resources to parse this metadata. It then installs packages with the setuptools monkeypatches applied. It just happens to use an option python setup.py –single-version-externally-managed, which gets Setuptools to install packages in a more flat manner, with Distro.egg-info/ directories alongside the package. Pip installs eggs! I’ve heard the many complaints about easy_install (and I’ve had many myself), but ultimately I think pip does well by just fixing a few small issues. Pip is not a repudiation of Setuptools or the basic mechanisms that easy_install uses.
PoachEggs:
This is a defunct package that had some of the features of pip (particularly requirement files) but used easy_install for installation. Don’t bother with this, it was just a bridge to get to pip.
virtualenv:
This is a little hack that creates isolated Python environments. It’s based on virtual-python.py, which is something I wrote based on some documentation notes PJE wrote for Setuptools. Basically virtualenv just creates a bin/python interpreter that has its own value of sys.prefix, but uses the system Python and standard library. It also installs Setuptools to make it easier to bootstrap the environment (because bootstrapping Setuptools is itself a bit tedious). I’ll add pip to it too sometime. Using virtualenv you don’t have to worry about different library versions, because for any one environment you will probably only need one version of a library. On any one machine you probably need different versions, which is why installing packages system-wide is problematic for most libraries. (I’ve been meaning to write a post on why I think using system packaging for libraries is counter-productive, but that’ll wait for another time.)

So… there’s the pieces involved, at least the ones I can remember now. And I haven’t really discussed .pth files, entry points, sys.path trickery, site.py, distutils.cfg… sadly this is a complex state of affairs, but it was also complex before Setuptools.

There are a few things that I think people really dislike about Setuptools.

First, zip files. Setuptools prefers zip files, for reasons that won’t mean much to you, and maybe are more historical than anything. When a distribution doesn’t indicate if it is zip-safe, Setuptools looks at the code and sees if it uses __file__, an if not it presumes that the code is probably zip-safe. The specific problem James cites is what appears to be a bug in Django, that Django looks for code and can’t traverse into zip files in the same way that Python itself can. Setuptools didn’t itself add anything to Python to make it import zip files, that functionality was added to Python some time before. The zipped eggs that Setuptools installs are using existing (standard!) Python functionality.

That said, I don’t think zipping libraries up is all that useful, and while it should work, it doesn’t always, and it makes code harder to inspect and understand. So since it’s not that useful, I’ve disabled it when pip installs packages. I also have had it disabled on my own system for years now, by creating a distutils.cfg file with [easy_install] zip_ok = False in it. Sadly App Engine is forcing me to use zip files again, because of its absurdly small file limits… but that’s a different topic. (There is an experimental pip zip command mostly intended for App Engine.)

Another pain point is version management with setup.py and Setuptools. Indeed it is easy to get things messed up, and it is easy to piss people off by overspecifying, and sometimes things can get in a weird state for no good reason (often because of easy_install’s rather naive leap-before-you-look installation order). Pip fixes that last point, but it also tries to suggest more constructive and less painful ways to manage other pieces.

Pip requirement files are an assertion of versions that work together. setup.py requirements (the Setuptools requirements) should contain two things: 1: all the libraries used by the distribution (without which there’s no way it’ll work) and 2: exclusions of the versions of those libraries that are known not to work. setup.py requirements should not be viewed as an assertion that by satisfying those requirements everything will work, just that it might work. Only the end developer, testing the system together, can figure out if it really works. Then pip gives you a way to record that working set (using pip freeze), separate from any single distribution or library.

There’s also a lot of conflicts between Setuptools and package maintainers. This is kind of a proxy war between developers and sysadmins, who have very different motivations. It deserves a post of its own, but the conflicts are about more than just how Setuptools is implemented.

I’d love if there was a language-neutral library installation and management tool that really worked. Linux system package managers are absolutely not that tool; frankly it is absurd to even consider them as an alternative. So for now we do our best in our respective language communities. If we’re going to move forward, we’ll have to acknowledge what’s come before, and the reasoning for it.

Packaging
Python

Comments (38)

Permalink

lxml: an underappreciated web scraping library

When people think about web scraping in Python, they usually think BeautifulSoup. That’s okay, but I would encourage you to also consider lxml.

First, people think BeautifulSoup is better at parsing broken HTML. This is not correct. lxml parses broken HTML quite nicely. I haven’t done any thorough testing, but at least the BeautifulSoup broken HTML example is parsed better by lxml (which knows that <td> elements should go inside <table> elements).

Second, people feel lxml is harder to install. This is correct. BUT, lxml 2.2alpha1 includes an option to compile static versions of the underlying C libraries, which should improve the installation experience, especially on Macs. To install this new way, try:

$ STATIC_DEPS=true easy_install 'lxml>=2.2alpha1'

One you have lxml installed, you have a great parser (which happens to be super-fast and that is not a tradeoff). You get a fairly familiar API based on ElementTree, which though a little strange feeling at first, offers a compact and canonical representation of a document tree, compared to more traditional representations. But there’s more…

One of the features that should be appealing to many people doing screen scraping is that you get CSS selectors. You can use XPath as well, but usually that’s more complicated (for example). Here’s an example I found getting links from a menu in a page in BeautifulSoup:

from BeautifulSoup import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://java.sun.com').read())
menu = soup.findAll('div',attrs={'class':'pad'})
for subMenu in menu:
    links = subMenu.findAll('a')
    for link in links:
        print "%s : %s" % (link.string, link['href'])

Here’s the same example in lxml:

from lxml.html import parse
doc = parse('http://java.sun.com').getroot()
for link in doc.cssselect('div.pad a'):
    print '%s: %s' % (link.text_content(), link.get('href'))

lxml generally knows more about HTML than BeautifulSoup. Also I think it does well with the small details; for instance, the lxml example will match elements in <div class="pad menu"> (space-separated classes), which the BeautifulSoup example does not do (obviously there are other ways to search, but the obvious and documented technique doesn’t pay attention to HTML semantics).

One feature that I think is really useful is .make_links_absolute(). This takes the base URL of the page (doc.base) and uses it to make all the links absolute. This makes it possible to relocate snippets of HTML or whole sets of documents (as with this program). This isn’t just <a href> links, but stylesheets, inline CSS with @import statements, background attributes, etc. It doesn’t see quite all links (for instance, links in Javascript) but it sees most of them, and works well for most sites. So if you want to make a local copy of a site:

from lxml.html import parse, open_in_browser
doc = parse('http://wiki.python.org/moin/').getroot()
doc.make_links_absolute()
open_in_browser(doc)

open_in_browser serializes the document to a temporary file and then opens a web browser (using webbrowser).

Here’s an example that compares two pages using lxml.html.diff:

from lxml.html.diff import htmldiff
from lxml.html import parse, tostring, open_in_browser, fromstring

def get_page(url):
    doc = parse(url).getroot()
    doc.make_links_absolute()
    return tostring(doc)

def compare_pages(url1, url2, selector='body div'):
    basis = parse(url1).getroot()
    basis.make_links_absolute()
    other = parse(url2).getroot()
    other.make_links_absolute()
    el1 = basis.cssselect(selector)[0]
    el2 = other.cssselect(selector)[0]
    diff_content = htmldiff(tostring(el1), tostring(el2))
    diff_el = fromstring(diff_content)
    el1.getparent().insert(el1.getparent().index(el1), diff_el)
    el1.getparent().remove(el1)
    return basis

if __name__ == '__main__':
    import sys
    doc = compare_pages(sys.argv[1], sys.argv[2], sys.argv[3])
    open_in_browser(doc)

You can use it like:

$ python lxmldiff.py \
'http://wiki.python.org/moin/BeginnersGuide?action=recall&#038;rev=70' \
'http://wiki.python.org/moin/BeginnersGuide?action=recall&#038;rev=81' \
'div#content'

Another feature lxml has is form handling. All the cool sexy new sites use minimal forms, but searching for "registration forms" I get this nice complex form. Let’s look at it:

>>> from lxml.html import parse, tostring
>>> doc = parse('http://www.actuaryjobs.com/cform.html').getroot()
>>> doc.forms
[<Element form at -48232164>]
>>> form = doc.forms[0]
>>> form.inputs.keys()
['thank_you_title', 'City', 'Zip', ... ]

Now we have a form object. There’s two ways to get to the fields: form.inputs, which gives us a dictionary of all the actual <input> elements (and textarea and select). There’s also form.fields, which is a dictionary-like object. The dictionary-like object is convenient, for instance:

>>> form.fields['cEmail'] = 'me@example.com'

This actually updates the input element itself:

>>> tostring(form.inputs['cEmail'])
'<input type="input" name="cEmail" size="30" value="test2">'

I think it’s actually a nicer API than htmlfill and can serve the same purpose on the server side.

But then you can also use the same interface for scraping, by filling fields and getting the submission. That looks like:

>>> import urllib
>>> action = form.action
>>> data = urllib.urlencode(form.form_values())
>>> if form.method == 'GET':
...     if '?' in action:
...         action += '&#038;' + data
...     else:
...         action += '?' + data
...     data = None
>>> resp = urllib.urlopen(action, data)
>>> resp_doc = parse(resp).getroot()

Lastly, there’s HTML cleaning. I think all these features work together well, do useful things, and it’s based on an actual understanding HTML instead of just treating tags and attributes as arbitrary. (Also if you really like jQuery, you might want to look at pyquery, which is a jQuery-like API on top of lxml).

HTML
Programming
Python

Comments (41)

Permalink

The Magic Sentinel

In an effort to get back on the blogging saddle, here’s a little note on default values in Python.

In Python there are often default values. The most typical default value is None — None is a object of vague meaning that almost screams "I’m a default". But sometimes None is a valid value, and sometimes you want to detect the case of "no value given" and None can hardly be called no value.

Here’s an example:

def getuser(username, default=None):
    if not user_exists(username):
        return default
    ...

In this case there is always a default, and so anytime you call getuser() you have to check for a None result. But maybe you have code where you’d really just like to get an exception if the user isn’t found. To get this you can use a sentinel. A sentinel is an object that has no particular meaning except to signal the end (like a NULL byte in a C string), or a special condition (like no default user).

Sometimes people do it like this:

_no_default = ()
def getuser(username, default=_no_default):
    if not user_exists(username):
        if default is _no_default:
            raise LookupError("No user with the username %r" % username)
        return default
    ...

This works because that zero-item tuple () is a unique object, and since we are using the comparison default is _no_default only that exact object will trigger that LookupError.

Once you understand the pattern, this is easy enough to read. But when you use help() or other automatic generation it is a little confusing, because the default value just appears as (). You could also use object() or [] or anything else, but the automatically generated documentation still won’t look that nice. So for a bit more polish I suggest:

class _NoDefault(object):
    def __repr__(self):
        return '(no default)'
NoDefault = _NoDefault()
del _NoDefault

def getuser(username, default=NoDefault):
    ...

You might then think "hey, why isn’t there one NoDefault that everyone can share?" If you do share that sentinel you run the risk of accidentally passing in that value even though you didn’t intend to. The value "NoDefault" will become overloaded with meaning, just as None is. By having a more private sentinel object you avoid that. A single nice sentinal factory (like _NoDefault in this example) would be nice, though. Though really PEP 3102 will probably make sentinals like this unnecessary for Python 3.0.

Note that you can also implement arguments with no default via *args and **kwargs, e.g.:

def getuser(username, *args):
    if not user_exists(username):
        if not args:
            raise LookupError(...)
        else:
            return args[0]

But to do this right you should test that len(args)<=1, raise appropriate errors, maybe consider keyword arguments, and so one. It’s a pain in the butt, and when you’re finished the signature displayed by help() will be wrong anyway.

Programming
Python

Comments (10)

Permalink

Where Next For Plone Development?

I attended PloneConf 2008 recently, to talk about Deliverance. But I’ll talk here more about my observations as a relative outsider to that community. (This post is really written for the Plone community — if you aren’t familiar with Plone this post probably won’t make much sense or be very useful.)

One of the ongoing concerns in the Plone community has been the difficulty of attracting and maintaining developer interest in the community, and generally making Plone easier to work with as a developer. It’s been a very successful community for consulting and integration work, but it has not been as rewarding for developers. There’s the idea of a "Plone Tax", which means different things to different people but just generally speaks to the sense that Plone takes a little off the top — time to restart, time to run the tests, time to dig into code that just goes too deep. There are some distinct problems with Plone, but probably the biggest problem is the quantity of small challenges and the general size and complexity of the system.

At the moment there is no clear path forward to resolve this. A previous effort to fix things is a project called "Five", which referred to Zope 2+3 — backporting libraries and techniques from Zope 3 into Zope 2. Plone is intimately tied to Zope 2, and Five let them use new code without having to do abandon the old environment. But the result wasn’t terribly satisfying: Five didn’t remove any complexity, it only added to it. Even if the new components were superior that doesn’t make them simple. Plone is aching for more simplicity, not more power.

It’s unclear how Plone can actually reform itself as a codebase. Zope 2 is a behemoth, and its metaphors are deeply intertwined with existing code. Acquisition in particular is ubiquitous, essential to lots of the machinery, and deeply confusing. I saw a general interest in two directions: one was to encourage non-content-management tasks to be implemented in a complementary (but separate) technology, another direction is continuing to refactor the existing codebase while somehow trying to maintain backward compatibility.

It’s unclear how to refactor the existing codebase in such a way that it is any simpler, but I suppose these two directions are not exclusive. I want to focus on the idea of a complementary environment. There’s two products in particular that have been attracting interest: Grok and Repoze.

These two products usually go under the umbrella of "exciting new ways to improve the developer experience in Plone" which is a kind of generic positive sentiment. But to my mind they represent two significantly different paths forward, and the differences deserve some more critical thought.

Grok is a layer on top of Zope 3 that attempts to make development in that environment more pleasing. It eliminates most ZCML (ZCML is Zope 3’s XML-based language for declaring the relation of various components in a system). Grok uses conventions and introspection to make Zope 3 look more like a traditional web framework, with simple views and models and templates, and less of the wiring you have to set up in typical Zope 3 architectures. At the same time, you can add all the Zope 3 declarations to break out of the automatic conventions. Grok is a more pleasant layer on top of Zope 3, but it’s entirely focused on Zope 3, and it is led by Martijn Faassen and Philipp von Weitershausen who are both very involved with the Zope 3 community.

Repoze is a more recent project, led by Chris McDonough, Tres Seaver, and Paul Everitt (well, Paul might call himself more of a cheerleader). They all work together at Agendaless Consulting. They’ve been highly involved in Zope 2 for a long time, and are all former employees of Zope Corp and major contributors to Zope 2. I don’t think they’ve ever quite made the jump to Zope 3, and their consulting and experience kept them involved with Zope 2. A while back they got some WSGI religion and started splitting out some pieces of Zope into WSGI middleware and other independent libraries. This included things like pulling out the transaction handler from Zope, reimplementing the Zope 2 publisher so it was more WSGIish, and a variety of other libraries. These libraries are essentially extractions from Zope, or in the case of the Zope publisher and repoze.plone, a way to wrap what would be considered a "legacy" application in the same interface as other newer pieces. Having extracted nearly everything they wanted, they’ve started work on a framework intended to be familiar to the Zope community, repoze.bfg. This framework uses some Zope 3 concepts, but it’s more built from scratch than it is built from Zope 3, and it is attached to Zope 2 only insofar as it uses the pieces they’ve extracted and the ideas they’ve become comfortable with. They’ve described it as the framework they want to use when someone asks them to build something in Zope 2.

Grok and Repoze have a significantly different development methodology. One is a layer on Zope 3, the other is an extraction of ideas from Zope 2 (and a few from Zope 3). In part I think the distinction hasn’t been presented very clearly because the Repoze and Grok communities overlap a great deal, and everyone is quite congenial with each other, and they are reluctant to enter debates about the designs. While I also consider them all colleagues, I also feel pretty strongly about the design differences and I feel a discussion contrasting them is important.

Martijn recently wrote a post on why Plone should consider Grok and a follow-up post. These posts speak to a variety of advantages Grok has over plain Zope 3, which Grok can offer to Plone to manage its existing use of Zope 3 technologies.

I think Plone shouldn’t be so focused on managing the complexity of its stack, but focus on reducing that complexity. And it should reduce that complexity by focusing on content management and moving all the other pieces people have built on Plone out of Plone. It can’t just leave people hanging, which is why developing a clear story for how those other pieces should be developed is essential. Plone the community doesn’t have to map one-to-one to Plone the software. Plone the software should become smaller and more focused. Plone the community doesn’t have to become more focused — the community does what it needs to do, what customers ask for, what is necessary to make a site compelling. To be clear on what I’m proposing here: Plone should have a community-recommended way to build non-Plone applications. This doesn’t mean you have to use those techniques, but it should be much more concrete than just "there’s a bunch of cool things out there, and maybe you should look around and use one of those." By having a community-recommended pattern of development you can maintain and build on the Plone community, which is at least as important an asset as the Plone software.

With this in mind I believe an extraction of Zope 2 and Plone ideas is the right path forward. Extraction is the process of isolating code and ideas, and localizing the effect of that code. These are some of the most important ways to actually increase the simplicity of a codebase. In the Zope 2 (and even 3) codebase the thing that brings the most complexity is the non-locality of code, that parts of the system can effect each other in complex and unexpected ways. Zope 3 formalizes these patterns of non-locality: the Component Architecture is largely about introducing non-localized relationships between code.

Arguably this non-locality of effect is exactly what people want, it’s what enables pervasive customizations. When a client asks you to change some little piece deep in the system, you don’t really want to modify the system’s code — you want to add a little code to the outside of the system that effects the change you desire. The Component Architecture is a formalized way of making these kinds of changes, where Acquisition was a lower-level mechanism to do the same sort of thing. I remain a Component Architecture skeptic, mostly because I think the mechanism is overused. Still I think clearly Plone needs flexibility that a purely bespoke application would not require. But there’s no way out of it: that flexibility has a high cost. In this there is no clear solution. In addition to Acquisition, the complexity of Zope 2 security, the many layers of skinning… is the Component Architecture what Plone needs? From where I stand it seems like a step further in the wrong direction… and yes, it is a better placed step than any of the ones before, but if it’s still a step in the wrong direction does that matter?

I think Plone (at least the community) needs to be conservative in its enabling of these customizations. Sure, the customizability is "powerful," but I don’t hear people clamoring for power. They want simple, predictable, fast, maintainable. That’s not necessarily the same as "easy" — I think sometimes it’s worth making things a little longer and less automatic to make a system more explicit and make code more localized. I think Repoze is a step in this direction, a big (maybe even an intimidatingly giant) step towards simplicity. That’s the step I think Plone should make.

So, I offer this as my suggestion to Plone. I think Plone-the-community has the opportunity to be more than Plone-the-software; I think it must do this to remain viable in the long term. But to get there the community make some choices — you can’t add simplicity.

Python
Web
Zope/Plone

Comments (8)

Permalink

Decorators and Descriptors

So, decorators are neat (maybe check out a new tutorial on them). Descriptors are neat but may seem hard (though they hardly take a long time to describe). Sometimes these two things intersect, and this post describes how.

Here’s an example decorator where this comes up. First, we want something that looks like a WSGI application:

def application(environ, start_response):
    start_response('200 OK', [('content-type', 'text/html')])
    return ['hi!']

But we want to use WebOb, like:

from webob import Request, Response
def application(environ, start_response):
    req = Request(environ)
    resp = Response('hi!')
    return resp(environ, start_response)

(We don’t use req in this example, but of course you probably would in a real WSGI application)

Now req = Request(environ) is boilerplate, and it’d be nicer to just do return resp instead of return resp(environ, start_response). So let’s make a decorator to do that:

class wsgiapp(object):
    def __init__(self, func):
        self.func = func
    def __call__(self, environ, start_response):
        resp = self.func(Request(environ))
        return resp(environ, start_response)

@wsgiapp
def application(req):
    return Response('hi!')

If you don’t understand what happened there, go read up on decorators.
Now, what if you want to decorate a method? For instance:

class Application(object):
    def __init__(self, text):
        self.text = text
    @wsgiapp
    def __call__(self, req):
        return Response(self.text)
application = Application('hi!')

This won’t quite work, because @wsgiapp will call Application.__call__(req) — with no self argument. This is generally a problem with any decorator that changes the signature, because the signature for methods has this extra self argument. Descriptors can handle this. First we’ll have the same wsgiapp definition as we had before, but we’ll add the magic descriptor method __get__:

class wsgiapp(object):
    def __init__(self, func):
        self.func = func
    def __call__(self, environ, start_response):
        resp = self.func(Request(environ))
        return resp(environ, start_response)
    def __get__(self, obj, type=None):
        if obj is None:
            return self
        new_func = self.func.__get__(obj, type)
        return self.__class__(new_func)

So, to explain:

When you get an attribute from an instance, like Application().__call__, Python will check if the object that was fetched has a __get__ method. If it does, it will call that method and use the result of that method.

This part:

if obj is None:
    return self

is what happens when you do Application.__call__ — in other words, when get a class attribute. In that case obj (self) will be None, and it will just return the descriptor (it could do something else, like in this example, but usually it doesn’t).

Functions already have a __get__ method. You can try it yourself:

>>> def example(*args):
...     print 'got', args
>>> example_bound = example.__get__(1)
>>> example_bound('test')
got (1, 'test')

So in the example with wsgiapp we are just changing the decorator to wrap the new bound function instead of the old unbound function. This allows wsgiapp to be compatible with both plain functions and methods. In fact, it would probably be preferable to always call func.__get__(obj, type) (even if obj is None), as then we could also wrap class methods or other kinds of descriptors.

Python

Comments (8)

Permalink

The Philosophy of Deliverance

I’ll be attending PloneConf this year again, giving a talk about Deliverance. I’ve been working on Deliverance lately for work, but the hard part about it is that it’s not obviously useful. To help explain it I wrote the philosophy of Deliverance, which I will copy here, to give you an idea of what I’ve been doing:

Why is Deliverance? Why was it made, what purpose does it serve, why should you use it, how can it change the way you do web development?

On the Subject of Platforms

Right now we live in an age of platforms. Developers (or management or coincidence) decides on a platform, and that serves as the basis for all future development. Usually there’s some old things from a previous platform (or a primordial pre-platform age: I’m looking at you formmail.pl!) The goal is always to eliminate all of these old pieces, rewriting them for the new platform. That goal is seldom attained in a timely manner, and even before it is accomplished you may be moving to the next platform.

Why do you have to port everything forward to the newest platform? Well, presumably it is better engineered. The newest platform is presumably what people are most familiar with. But if those were the only reasons it would be hard to justify a rewrite of working software. Often the real push comes because your systems don’t work together. It’s hard to keep templates in sync across all the platforms. Multiple logins may be required. Navigation is inconsistent and incomplete. Functionality that cross-cuts pages — comments, login status, shopping cart status, etc — isn’t universally available.

A similar conflict arises when you consider how to add new functionality to a site. For example, you may want to add a blog. Do you:

  1. Use the best blogging software available?
  2. Use something native to your platform?
  3. Write something yourself?

The answer is probably 2 or 3, because it would be too hard to integrate something foreign to your platform. This form of choice means that every platform has some kind of "blog", but the users of that blog are likely to only be a subset of the users of the parent platform. This makes it difficult for winners to emerge, or for a well-developed piece of software to really be successful. Platform-based software is limited by the adoption of the platform.

Not all software has a platform. These tend to be the most successful web applications, things like Trac, WordPress, etc.

"Aha!" you think "I’ll just use those best-of-breed applications!" But no! Those applications themselves turn into platforms. WordPress is practically a CMS. Trac too. Extensible applications, if successful, become their own platform. This is not to place blame, they aren’t necessarily any worse than any other platform, just an acknowledgment that this move to platform can happen anywhere.

Beyond Platforms, or A Better Platform

One of the major goals of Deliverance is to move beyond platforms. It is an integration tool, to allow applications from different frameworks or languages to be integrated gracefully.

There are only a few core reasons that people use platforms:

  1. A common look-and-feel across the site.
  2. Cohesive navigation.
  3. Indexing of the entire site.
  4. Shared authentication and user accounts.
  5. Cross-cutting functionality (e.g., commenting).

Deliverance specifically addresses 1, providing a common look-and-feel across a site. It can provide some help with 2, by allowing navigation to be more centrally managed, without relying purely on per-application navigation (though per-application navigation is still essential to navigating the individual applications). 3, 4, and 5 are not addressed by Deliverance (at least not yet).

Deliverance applies a common theme across all the applications in your site. It’s basic unit of abstraction is HTML. It doesn’t use a particular templating language. It doesn’t know what an object is. HTML is something every web application produces. Deliverance’s means of communication is HTTP. It doesn’t call functions or create request objects [*]. Again, everything speaks HTTP.

Deliverance also allows you to include output from multiple locations. In all cases there’s the theme, a plain HTML page, and the content, whatever the underlying application returns. You can also include output from other parts of the site, most commonly navigation content that you can manage separately. All of these pieces can be dynamic — again, Deliverance only cares about HTML and HTTP, it doesn’t worry about what produces the response.

This is all very similar to systems built on XSLT transforms, except without the XSLT [†], and without XML. Strictly speaking you can apply XSLT to any parseable markup, even HTML, but the most common (or at least most talked about) way to apply XSLT is using "semantic" XML output that is transformed into HTML. Deliverance does not try to understand the semantics of applications, and instead expects them to provide appropriate presentation of whatever semantics the underlying application possesses. Presentation is more universal than semantics.

While Deliverance does its best to work with applications as-they-exist, without making particular demands on those applications, it is not perfect. Conflicting CSS can be a serious problem. Some applications don’t have very good structure to work with. You can’t generate any content in Deliverance, you can only manipulate existing content, and often that means finding new ways to generate content, or making sure you have a place to store your content (as in the case of navigation). This is why arguably Deliverance does not remove the need for a platform, but is just its own platform. In so far as this is true, Deliverance tries to be a better platform, where "better" is "more universal" rather than "more powerful". Most templating systems are more powerful than Deliverance transformations. It can be useful to have access to the underlying objects used to procude the markup. But Deliverance doesn’t give you these things, because it only implements things that can be applied to any source of content. Static files are entirely workable in Deliverance, just as any application written in Python, PHP, or even an application hosted on an entirely separate service is usable through Deliverance.

The Missing Parts

As mentioned before, two important benefits of a platform are missing from Deliverance. I’ll try to describe what I believe are the essential aspects. I hope at some time that Deliverance or some complementary application will be able to satisfy these needs. Also, I suggest some lines of development that might be easier than others.

Indexing The Entire Site

Typically each application has a notion of what all the interesting pages in that application are. Most applications have a set of uninteresting pages, or transient pages. A search result is transient, as an example. An application also knows when new pages appear, and when other pages disappear. A site-wide index of these pages would allow things like site maps, cross-application search, and cross-application reporting to be done.

An interesting exception to the knowledge an application has of itself: search results are generally boring. But a search result based on a category might still be interesting. The difference between a "search" and a "report" is largely in the eye of the beholder. An important feature is that the application shouldn’t be the sole entity allowed to mark interesting pages. Manually-managed lists of resources that may point to specific applications can allow people to usefully and easily tweak the site. Ideally even fully external resources could be included, such as a resource on an entirely different site.

To do indexing you need both events (to signal the creation, update, or deletion of an entity/page), and a list of entities (so the index can be completely regenerated). A simple way of giving a list of entities would be the Google Site Map XML resource. Signaling events is much more complex, so I won’t go into it in any greater depth here, but we’re working on a product called Cabochon to handle events.

One thing that indexing can provide is a way to use microformats. Right now microformats are interesting, but for most sites they are largely useless. You can mark up your content, but no one will do anything interesting with that markup. If you could easily code up an indexer that could keep up-to-date on all the content on your site, you could produce interesting results like cross-application mapping.

Shared Authentication And User Accounts

Authentication is one of the most common and annoying integration tasks when crossing platform boundaries. Systems like Open ID offer the ability to unify cross-site authentication, but they don’t actually solve the problem of a single site with multiple applications.

There is a basic protocol in HTTP for authentication, one that is workable for a system like Deliverance, and there are already several existing products (like repoze.who) that work this way. It works like this:

  • The logged-in username is sent in some header, e.g., X-Remote-User. Some kind of signing is necessary to really trust this header (Deliverance could filter out that header in incoming requests, but if you removed Deliverance from the stack you’d have a security hole).
  • If the user isn’t logged in, and the application wants them to log in, the application response with a 401 Unauthorized response. It is supposed to set the WWW-Authenticate header, probably to some value indicating that the intermediary should determine the authentication type. In some cases a kind of HTTP authentication is required (typically Basic or Digest) because cookie-based logins are too stateful (e.g., in APIs, or for WebDAV access).
  • The intermediary catches the 401 and initiates the login process. This might mean a redirect to a login page, and setting a cookie on successful login. The login page and setting the cookie could potentially be done by an application outside of the intermediary; the intermediary only has to do the appropriate redirects and setting of headers.
  • In the case when a user is logged in but isn’t permitted, the application simply sends a 403 Forbidden response. The intermediary shouldn’t actually do anything in this case (though maybe it could usefully add a logout link to that message). I only mention this because some systems use 401 for Forbidden, which causes no end of problems.

While some applications allow for this kind of authentication scheme, many do not. However, the scheme is general enough that I think it is justifiable that applications could be patched to work like this.

This handles shared authentication, but the only information handed around is a username. Information about the user — the real name, email, homepage, permission roles, etc — are not shared in this model.

You could add something like an internal location to the username. E.g.: X-Remote-User: bob; info_url=http://mysite.com/users/bob.xml. It would be the application’s responsibility to make a subrequest to fetch that information. This can be somewhat inefficient, though with appropriate caching perhaps it would be fine. But many applications want very much to have a complete record of all users. Changing this is likely to be much harder than changing the authentication scheme. A more feasible system might be something on the order of what is described in Indexing the Entire Site: provide a complete listing of the site as well as events when users are created, updated, or deleted, and allow applications to maintain their own private but synced databases of users.

A common permission system is another level of integration. One way of handling this would be if applications had a published set of actions that could be performed, and the person integrating the application could map actions to roles/groups on the system.

Cross-cutting Functionality

This item requires a bit of explanation. This is functionality that cuts across multiple parts of the site. An example might be comments, where you want a commenting system to be applicable to a variety of entities (though probably not all entities). Or you might want page-update notification, or to provide a feed of changes to the entity.

You might also want to include some request logger like Google Analytics to all pages, but this is already handled well by Deliverance theming. Deliverance’s aggregation handles universal content well, but it doesn’t handle content (or subrequests) that should only be present in a portion of pages.

One possible way to address this is transclusion, where a page can specifically request some other resource to be included in the page. A simple subrequest could accomplish this, but many applications make it relatively easy to include some extra markup (e.g., by editing their templates) but not so easy to do something like a subrequest. We’ve written a product Transcluder to use an HTML format to indicate transclusion.

It’s also possible using Deliverance that you could implement this functionality without any application modification, though it means added configuration — an application written to be inserted into a page via Deliverance, and a Deliverance rule that plugs everything together (but if written incorrectly would have to be debugged).

Other Conventions

In addition to this, other platform-like conventions would make the life of the integrator much easier.

Template Customization

While Deliverance handles the look-and-feel of a page, it leaves the inner chunk of content to the application. If you want to tweak something small you will still need to customize the template of the application.

It would be wonderful if applications could report on what files were used in the construction of a request, and used a common search path so you could easily override those files.

Backups and Other Maintenance

Process management can be handled by something like Supervisor, and maybe in the future Deliverance will even embed Supervisor.

But even then, regular backups of the system are important. Typically each application has its own way of producing a backup. Conventions for producing backups would be ideal. Additional conventions for restoring backups would be even better.

Many systems also require periodic maintenance — compacting databases, checking for any integrity problems, etc. Some unified cron-like system might be handy, though it’s also workable for applications to handle this internally in whatever ad hoc way seems appropriate.

Common Error Reporting

With a system where one of many components can fail, it’s important to keep track of these problems. If errors just end up in one of 10 log files, it’s unlikely anyone is closely tracking them.

One product we’re working on to help with this is ErrorEater, which works along with Supervisor. Applications have to be modified to emit errors in a specific format that Supervisor understands, but this is generally not too difficult.

Farming

Application farming is when one instance of an application can support many "sites". These might be sites with their own domains, or just distinct projects. Examples are Trac, which supports multiple projects in one instance, or WordPress MU which supports many WordPress instances running off a single database and code base.

It would be nice if you could add a simple header to a request, like X-Project-Name: foo and that would be used by all these products to select the site (or sub-site or project or any other organization unit). Then mapping domain names, paths, or other aspects of a request to the project could be handled once and the applications could all consistently consume it.

(Internally for openplans.org we’re using X-OpenPlans-Project and custom patches to several projects to support this, but it’s all ad hoc.)

Footnotes

[*] This isn’t entirely true, Deliverance internally uses WSGI which is a Python-level abstraction of HTTP calls.
[†] At different times in the past, in an experimental branch right now, and potentially integrated in the future, Deliverance has been compiled down to XSLT rules. So Deliverance could be seen even as an simple transformation language that compiles down to XSLT.

HTML
Programming
Python
Web

Comments (2)

Permalink

pyinstall pybundles

Update: pyinstall has been renamed to pip (per the suggestions in the comments to this post)

I added pybundles to pyinstall very shortly before I announced pyinstall. I hadn’t actually tried it out that much. Since then I’ve made three more minor releases of pyinstall, and I think the bundle support is working pretty decently.

A .pybundle file is just a bunch of source code, all the source code you need to install some package(s). For instance, for Deliverance I’ve created a bundle file so you can do:

$ easy_install pyinstall virtualenv
$ pyinstall.py -E DeliveranceTest/ \\
>   http://deliverance.openplans.org/dist/Deliverance-snapshot-latest.pybundle

This creates a virtualenv environment in DeliveranceTest/, unpacks all the source from the bundle, and installs all of it. It’s not magical — it still has to compile the source and move the files around — but it does mean just a single download, and the versions of everything that is installed aren’t going to change unless that bundle file is regenerated.

I’ve been thinking about some other features for pybundles, like post installation scripts. All of this has raised a problem though: pyinstall needs to be a two-level command, with commands like:

pyinstall.py install X
pyinstall.py bundle Y
pyinstall.py freeze req.txt
and of course the not-yet-implemented:
pyinstall.py remove Z

But pyinstall.py install does not read well. It’s not too late to rename the package (yet again), or just rename the script. Ideas?

Python

Comments (20)

Permalink