XML Stubs
Lately I have been considering the best way to stub out XML. Oftentimes I want to start with some arbitrary XML and add extra elements or change defaults, so the question comes up of how best to handle it.
The first option is to create a stub. For example, a potential Atom Entry stub might be something like this:
atom_stub = '''<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title> </title>
  <id> </id>
  <updated> </updated>
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml"></div>
  </content>
</entry>'''
The problem with this is that as you begin trying to organize these kinds of snippets, you end up wanting to initialize values as well as change the stub slightly. In the above example, if you wanted to change the atom:content type to “html”, you would need to change the attribute and remove the “div” element.
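To make that concrete, here is a rough sketch of the "change the attribute and remove the div" step. It uses the standard library's ElementTree rather than Amara purely so it runs anywhere, and the function name html_entry_from_stub is made up for illustration. The XML declaration is left off the stub only to keep the example short.

```python
import xml.etree.ElementTree as ET

ATOM_NS = 'http://www.w3.org/2005/Atom'
XHTML_NS = 'http://www.w3.org/1999/xhtml'

ATOM_STUB = '''<entry xmlns="http://www.w3.org/2005/Atom">
  <title> </title>
  <id> </id>
  <updated> </updated>
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml"></div>
  </content>
</entry>'''


def html_entry_from_stub(stub=ATOM_STUB):
    """Parse the stub, switch atom:content to type="html", drop the div."""
    entry = ET.fromstring(stub)
    content = entry.find('{%s}content' % ATOM_NS)
    content.set('type', 'html')
    # The xhtml div wrapper only makes sense for type="xhtml".
    for div in content.findall('{%s}div' % XHTML_NS):
        content.remove(div)
    content.text = ''
    return entry
```

Every variation on the stub needs a little function like this, which is exactly the kind of bookkeeping that piles up.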
Assuming you don’t need to change stubs very often, you still might create a rather large library of stubs. The obvious way to handle this would be to start giving stubs (or sets of stubs) their own files. This presents a potential problem in that, depending on your needs, you’ll end up with either a file for each stub or a set of modules that have nothing but strings in them. The file case requires you to consider some sort of resolver or to use pkg_resources to grab the values reliably. The string case can be problematic because of potential escaping issues, and it is generally unfriendly to edit.
The other option is to create classes, or more generally, Python code for your stubs. This addresses many of the issues mentioned above but also presents its own problems. For example, here would be a potential stub of the above example using Amara:
import amara

ATOM_NS = u'http://www.w3.org/2005/Atom'
XHTML_NS = u'http://www.w3.org/1999/xhtml'


class AtomEntry(object):
    def __init__(self, details=None):
        # Avoid a shared mutable default; the stub does not use details yet.
        self.details = details or {}
        self.entry = self._initialize()

    def _initialize(self):
        # Build the empty Atom entry skeleton element by element.
        doc = amara.create_document(u'entry', ATOM_NS)
        doc.entry.xml_append(doc.xml_create_element(u'id', ATOM_NS))
        doc.entry.xml_append(doc.xml_create_element(u'title', ATOM_NS))
        doc.entry.xml_append(doc.xml_create_element(u'updated', ATOM_NS))
        doc.entry.xml_append(
            doc.xml_create_element(
                u'content', ATOM_NS,
                attributes={u'type': u'xhtml'}
            ))
        doc.entry.content.xml_append(doc.xml_create_element(u'div', XHTML_NS))
        return doc
This is relatively clean, but as you start trying to initialize data, such as adding default text or including text from parameters (i.e., the details dict passed to the constructor), the code will become much larger. Also, it is not as clear what the code is actually doing. It doesn’t take long to see it is working with XML, but actually knowing what that stub will look like in the end without an understanding of the library could create a learning curve.
Some other options to consider would be using something like XSLT plus a valid base document to create stubs. You still end up with similar file resolution vs. module issues, but you can keep your base stub the same and simply modify the XSLT to alter the stub. This would probably not keep the code small, but it would provide a physical distinction between the Python and XML stub code as well as make it somewhat clearer what things will look like. Yet another option is to try to improve the syntax by finding patterns in your stubs and creating a wrapper around some internal XML object like Amara. You could then potentially use a set of core data types to create the XML. It might make stubs look something like:
atom_stub = {
    'entry': [
        {'title': ''},
        {'id': ''},
        {'updated': ''},
        {('content', ('type', 'xhtml')): [
            {'div': ''},
        ]},
    ],
}
This starts to look like a reasonable option, but I honestly would avoid it. The syntax seems error prone and you have to create your own parser, all because you didn’t want to write things out via a DOM-ish syntax. While I am not a fan of DOM proper, Amara’s interface for writing XML is actually pretty reasonable. There is no silver bullet to these issues, but it is something to think about, as it is always just grunt work creating stubs that are helpful and maintainable.
If anyone does have any other good ideas, please share!
Don’t make it a design problem
I have heard some discussions recently that made me think about what kinds of problems you want to provide users. That seems like heresy in that you really never want to give users problems, but in the real world the facts suggest that users will have issues with software. As programmers it is our responsibility to consider what will be an issue in terms of finding potential bugs. We should also go beyond bugs and consider what situational issues could arise from using our software, along with how users can (and will) work around known issues. This is different from something like test coverage, where a user hits a corner case that wasn’t coded for by the developer. The “issues” in this case are situational and would be different for every single user depending on how they consider the application within the scope of the virtual world.
In the world of HCI this is pretty much analyzing how well the application design matches the perceived viewpoint of the user. In other words, how well does the application meet users’ expectations? This is where you see metaphors come into play and dictate a real world situation where people map experiences on top of some application they are using. The problem goes beyond this though, because even within a metaphor or design, there are questions the user must answer in order to use the application. This is where a great application can buckle under the greatest design.
In an application like MS Word, there are many types of users. There are those people who struggle to get hanging indents and work with random changes affecting the entire document, and users who manage to reflect daily on proper use of obscure dialogs and settings. In both of these cases, the user is forced to make a design decision in how they create their document. This is very similar to deciding which objects a developer will use to create an application, which is always a very subjective decision depending on the constraints and requirements. It is this kind of issue that developers, if possible, should avoid at all costs.
Take Subversion, for example. Its developers made a decision to push the terminology of branches, trunk, and tags to be a documentation issue. They rely on conventions to push users towards a workflow instead of hard coding those items within the application. From the developer standpoint this is great because there is no end to the wealth of configurations and workflows you can use. But from the standpoint of someone trying to get something done, it doesn’t make a bit of difference. I am not saying it is not handy or valuable, but rather that there was some cost for the added flexibility.
When you are creating an application for a more traditional user, it is important that they do not get hung up on details that do not reflect getting actual work done. In the case of Subversion, a poorly laid out repository can be a real pain. In the case of a traditional user application, it can be a crippling bit of flexibility that has catastrophic effects on both productivity with the application and its continued success. This does not mean all applications should lock users into their internal model, but rather that sacrificing flexibility in order to reduce the design issues can prevent issues from ever cropping up. In addition to this, it can create the market for other orthogonal applications that can complement each other by completely different means. Applications centered around deployment are one good example of a tool that can be complementary to a confined application.
It is not that flexibility is inherently bad, but forcing the user to make a design decision will always leave the door open for poor experiences, which, in the long run, makes the application difficult to use.
Mapping is Bad
Lately there has been a good deal of discussion regarding Object Relational Mappers for Python. The discussion (from what I can tell) stemmed from using things like Elixir with SQLAlchemy and introducing more problems. I think this issue is almost identical to the different marshalling libraries for XML. The real issue is a perceived deficiency in some core data model compared to the programming model. There is a concern that having to think in two different conceptual paradigms will greatly slow development. Another aspect of this is the maintenance of a system with different languages and models sprinkled around. While there are some tough issues you can run into when constantly switching between models, the cost of normalizing each model to the programming paradigm is extremely high.
When thinking about this issue, I always come back to casting. Before I understood generics in C#, I was using generic collections where I constantly had to consider the type of the object. Even though I eventually realized my issues were solvable, it became clear that constantly having to reshape information in order to use it is time consuming. One thing someone can do is to create a translator or interface to more easily make transitions between types. This is exactly the path an ORM takes, as does an XML marshalling library. It aims to solve the problem of translating some database or XML to the model of the programming language, which is, more often than not, an object-oriented model.
Another path is to look at the problem from a build management standpoint. Instead of thinking of how to constantly translate data to one paradigm, try programming according to the data. This is the natural pattern for XML when you use tools such as XPath and XSLT. For example, in the XML as data case, you can just query the document via XPath for some value or run an XSLT on the file to change it to what you need. At this point the problem is not how to deal with the data, but rather how to deal with a build system that potentially includes many different models. The question comes closer to whether I should name my folder “xslt” or “transforms” and how I resolve those file locations.
This second case is not a trivial issue of course, but it is one that has been solved. In Python, we have the package resource tools such as eggs, easy_install, and setuptools. C# also has compiled transforms and the ability to save an XSLT as a dll, which means you could work with it from the GAC or use it via traditional Visual Studio project inclusion. None of this is perfect, but it changes the problem space from translating between data types to working with files. This problem has been around for quite a long time, so there are plenty of ways to find solutions that can fit within any application.
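As a small illustration of "just query the document", here is a sketch using the limited XPath subset in the standard library's ElementTree. The sample entry, its URLs, and the variable names are all made up for the example. A fuller XPath engine (4Suite, Amara, lxml) would let you go much further.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used by the path expressions below.
NS = {'atom': 'http://www.w3.org/2005/Atom'}

# A made-up entry, just to have something to query.
ENTRY = '''<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Mapping is Bad</title>
  <id>tag:example.org,2007:1</id>
  <link rel="alternate" href="http://example.org/1"/>
  <link rel="edit" href="http://example.org/edit/1"/>
</entry>'''

doc = ET.fromstring(ENTRY)

# Ask the document for values instead of mapping it onto classes first.
title = doc.findtext('atom:title', namespaces=NS)
edit_link = doc.find("atom:link[@rel='edit']", NS).get('href')
```

There is no intermediate object model to keep in sync here. The document stays the model and the code only asks it questions.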
That said, tools such as SQLAlchemy and Amara provide a great way to get the simple stuff done quickly while staying out of the way when things get complicated. The cost is a slightly less optimized API, but overall the benefits are huge because the maintenance question is already answered with the fuller featured API. It can be tough at times to switch models all the time, but with the web being what it is and developers already feeling pressure to understand more languages and paradigms, accepting the challenge only seems like a good first step to eventually finding better solutions to constant transitions.
Mercurial Feeds
I started using Mercurial for projects and I am really excited about it. One feature that I think is rather addicting is the RSS feed for commits. This simply rocks! I find that my commit messages keep something of a log of the project. I place TODO items and questions I might have on my decisions and generally give an overview of what I worked on. It really feels like a blog.
I am sure there are some downsides to doing this. I suppose in a huge project, if tons of contributors were double dipping by using a Mercurial repository as a blog, things could get sketchy. At this point though, it really makes communication as to what is going on in a project much simpler and easier to manage.
The next step is to hack together an Atom feed instead of using RSS…
Writing XML in Code
I have been working with Sylvain recently on Amplee and it has been interesting to see how each of us uses Amara. I have always tried to use it like an object as much as possible. What you end up seeing is things like this:
if hasattr(doc.entry, 'title'):
    title = str(doc.entry.title)
Amplee takes a slightly different approach that doesn’t feel as obvious, but is a little safer. I actually like this approach because it feels closer to using a Python dictionary rather than an object.
title = doc.entry.xml_child_elements.get('title')
This is one line, is safer when you don’t actually have the element, and doesn’t use the ugly hasattr function. Chalk one up for Sylvain!
Another difference I have noticed is when we write an XML document with Amara. These differences have no actual impact on the code, but I think it is interesting nonetheless.
col.xml_append(doc.xml_create_element(qname(u"title", ATOM10_PREFIX),
                                      ns=ATOM10_NS, attributes={u'type': u'text'},
                                      content=collection.title))
The qname function is different to me and the lines are a bit long for my taste.
I would do something like this:
col.xml_append(
    doc.xml_create_element(
        u'title', ATOM10_NS,
        attributes={u'type': u'text'},
        content=collection.title
    )
)
I am not sure this is clearer, but it does present a recognizable pattern across the code. In the places I add elements you can see the pattern, which is somewhat helpful. The downside is I think it kind of looks like a C based language, which isn’t as cool.
There are other differences as well. I think some of it stems from our backgrounds with different XML technologies. I never really wrapped my head around DOM in a way that I was effective with it. My path would usually lead to an object wrapper. Sylvain though has spent a good deal of time working with the DOM and understands how to use it more fluently. I also have been making a huge effort to keep lines under 80 columns, which is probably why some of my code tends to be longer vertically. It is always interesting to contrast your ideas and code with someone else’s, so this has been a fun exercise and a good learning experience.
URLs Rule the World
So, Bill de hÓra has been blogging quite a bit on URLs and their design. What is interesting to me is that, in the context of today’s entry, he reveals JSPs don’t have something like Routes. This is something of interest to me simply because I have written and used quite a few dispatchers over the past few months, with the common theme being that it is a simple problem to solve.
You can easily dispatch on the typical:
/{controller}/{method}/{*args}
without much trouble. In Bright Content, we use regex to get dates and create essentially an ID. I have also done slightly more complex patterns such as
/{model}/{id}/{action}/{args}
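To show how little code the first pattern needs, here is a minimal sketch of a dispatcher for /{controller}/{method}/{*args}. The dispatch function, the Blog class, and the controllers dict are all hypothetical names for this example and not taken from Bright Content or any framework.

```python
def dispatch(path, controllers):
    """Resolve /{controller}/{method}/{*args} to a call on a controller."""
    tokens = [token for token in path.split('/') if token]
    if len(tokens) < 2:
        raise LookupError('expected /controller/method, got %r' % path)
    name, method_name, args = tokens[0], tokens[1], tokens[2:]
    controller = controllers.get(name)
    method = getattr(controller, method_name, None)
    # Refuse unknown controllers, private names, and non-callable attributes.
    if controller is None or method_name.startswith('_') or not callable(method):
        raise LookupError('no handler for %r' % path)
    return method(*args)


class Blog(object):
    """A toy controller; every remaining URL token arrives as a string."""

    def show(self, year, slug):
        return 'post %s from %s' % (slug, year)
```

With this, dispatch('/blog/show/2007/hello', {'blog': Blog()}) calls Blog.show('2007', 'hello'). The second pattern is the same idea with the tokens read in a different order.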
What is interesting is that this method of viewing URLs is really different from something like mod_rewrite, because there is an assumption made that the URL contains important keys to the resource. I am not sure if this is conventional wisdom regarding cool URIs, but it seems like it should be. From the Java/C# perspective, I can see why someone could essentially disregard the content of the URL in favor of GET parameters. The history behind the tools and frameworks has been to use GET parameters instead of utilizing the URL tokens, which makes sense because there is less ambiguity. For example, if I have a URL:
/blog/2007/3/14/Some_Slugged_Up_Title/
The question arises as to what each token (that which is split by the “/” character) represents. In the above example, the pattern is essentially
/{controller}/{year}/{month}/{day}/{slug}
But the issue is that the year, month and day could be considered one parameter. The slug then could also be different based on an extension, meaning that in the case above, I have a trailing slash, yet if I added a “.atom”, it might serve a different content type.
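Here is one way those two ambiguities might be resolved with a regex, as a sketch only. The names ENTRY_URL and parse_entry_url are invented for the example, and the choice to fold year, month and day into a single date value is just one of the possible readings described above.

```python
import datetime
import re

# /{controller}/{year}/{month}/{day}/{slug}, ending in "/" or ".ext".
ENTRY_URL = re.compile(
    r'^/(?P<controller>[^/]+)'
    r'/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})'
    r'/(?P<slug>[^/.]+)'
    r'(?:\.(?P<ext>[a-z]+)|/)?$'
)


def parse_entry_url(path):
    """Return the URL's parts as a dict, or None if the path does not fit."""
    match = ENTRY_URL.match(path)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # Treat the three date tokens as one parameter.
        date = datetime.date(int(parts['year']), int(parts['month']),
                             int(parts['day']))
    except ValueError:
        return None  # e.g. month 13 or day 32
    return {
        'controller': parts['controller'],
        'date': date,
        'slug': parts['slug'],
        'ext': parts['ext'],  # None for the trailing-slash form
    }
```

The extension ends up as its own key, so the handler can decide whether ".atom" means a different content type for the same resource.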
These really aren’t hard problems to deal with, but I can imagine that for some, it is a new way of thinking. I know as I have thought about the problem the issues run very deep and there is little to stop you from making bad decisions that can really hurt an application’s design. It is always interesting to see issues that seem simple become complicated by going one path and then finding a new direction that starts the process all over again.
Creating a Resolver for Transformations
The other day I spent a bit of time getting a resolver working for 4Suite and my XSLTemplates library. I am pretty sure the concept of a resolver is in the XSLT spec, but it might not be, as it seems the implementations/examples rarely mention them. The idea is that you can resolve links yourself in XSLT. This alludes to the XInclude spec but really is a bit more magic. The resolver can look up where the requested resource is and grab it. Really it is a pretty simple concept.
The problem is that implementing a resolver is less than intuitive. In .NET my understanding is there is a specific XmlResolver class (XmlUrlResolver being the stock implementation) that must be extended in order to create a custom resolver. I haven’t looked into what it takes for something like libxslt (which would impact lxml) or Saxon. 4Suite handles the resolution when creating an InputSource, which is essentially just a wrapper for objects getting passed into different 4Suite objects and methods.
Part of the reason I think resolvers are not very widespread (from what I have seen at least) is that most folks just don’t think about the functionality. I wanted a resolver in order to allow an external tool to transparently find packages via web requests, local template directories and data files in eggs via pkg_resources. We use a resolver at my job to do a cascade of lookups so you can override different XSLT files at runtime.
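The cascade idea itself is small. Below is a sketch of just the lookup part in plain Python. CascadingResolver is a hypothetical class for illustration; it is not the 4Suite resolver interface and not the code from XSLTemplates, and it only handles local directories (no web requests or eggs).

```python
import os
import tempfile


class CascadingResolver(object):
    """Return the first existing file for a name across ordered directories."""

    def __init__(self, search_paths):
        self.search_paths = list(search_paths)

    def resolve(self, name):
        for base in self.search_paths:
            candidate = os.path.join(base, name)
            if os.path.isfile(candidate):
                return candidate
        raise IOError('could not resolve %r in %r' % (name, self.search_paths))


# Usage: a site-specific override shadows the packaged default.
defaults = tempfile.mkdtemp()
overrides = tempfile.mkdtemp()
for directory, name in [(defaults, 'entry.xslt'),
                        (defaults, 'feed.xslt'),
                        (overrides, 'entry.xslt')]:
    with open(os.path.join(directory, name), 'w') as handle:
        handle.write('<!-- placeholder stylesheet -->')

resolver = CascadingResolver([overrides, defaults])
entry_path = resolver.resolve('entry.xslt')  # found in overrides
feed_path = resolver.resolve('feed.xslt')    # falls through to defaults
```

Hooking something like this into an actual XSLT processor is the less intuitive part, since each library has its own way of plugging a resolver in.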
The one caveat regarding a resolver is that it adds a bit of magic to your XSLTs in that a simple URI could be referencing a file anywhere. This could make debugging an issue if someone didn’t understand there was a custom resolver in play. With that said, it is a pretty slick tool to have around. I’ll be adding my new template resolver in my XSLTemplates module, which might help to provide a good example.
AtomPub Hacking
Recently the Atom Publishing Protocol was blessed as an IETF standard. I have been personally working on an AtomPub implementation for Bright Content while playing around with Amplee here and there. One thing I realize regarding Atom and generally working with XML is that an understanding of all the different XML technologies is critical in order to get the most value.
Take for example updating an Atom entry. One thing that I constantly run into is recursively copying nodes. This is a breeze with XSLT, which is why I started processing updates through a stylesheet. The issue I found there was simply organizing my stylesheets within the context of my application. In the end I think I changed the design to simply use Amara, but there was a definite path for building Python and XSLT based apps using distutils, which is really exciting.
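For comparison, here is roughly what the recursive copy looks like when done in Python. This is a sketch using the standard library's ElementTree and copy.deepcopy rather than Amara or XSLT, and replace_child is a made-up helper name. It is not the update code from Bright Content.

```python
import copy
import xml.etree.ElementTree as ET

ATOM_NS = 'http://www.w3.org/2005/Atom'


def replace_child(stored, submitted, local_name):
    """Replace stored's Atom child with a deep copy of the submitted one."""
    tag = '{%s}%s' % (ATOM_NS, local_name)
    new = submitted.find(tag)
    if new is None:
        return stored  # nothing submitted for this element, keep the old one
    for old in stored.findall(tag):
        stored.remove(old)
    # deepcopy brings along the whole subtree (e.g. an xhtml div and its
    # children) without sharing nodes with the submitted document.
    stored.append(copy.deepcopy(new))
    return stored
```

In XSLT the same thing is an identity template plus one override, which is why the stylesheet route was tempting in the first place.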
Where this connection between XSLT and Python can get tricky is in the case where external sources are referenced. In my AtomPub client, I was getting service documents directly within the XSLT by wrapping httplib2 and creating a node-set from the GET results. I ended up changing this because it was more forgiving to get the service document in Python and grab any collections. This revealed a slight disjoint, but in the end it worked out since most of the code could still use the same match templates.
I think navigating these differences between models and taking advantage of each language’s strength simply takes time and practice. It can be hard to see where you might want to do something with XSLT when you have to think about resolving stylesheet links and everything else, so possibly the answer to making the whole thing work is finding a good way to easily apply an XSLT to XML. The Amara xml_xslt method is a good start, but something more robust might be beneficial. Something to think about…
Getting Started
When I first started working with XML, for some reason it appealed to me. I can’t really say why I was attracted to it. XPath seemed pretty slick and simple (even though it really isn’t that simple) and I like the idea of loading a document and asking it something. As I have moved along in my programming world, Python has become my language of choice. I enjoy Ruby as well, but the libraries and community behind Python are so fruitful regarding learning to be a better programmer that I can’t help but use it.
With Python and XML under my belt, the question is what’s next. The result has been the Atom Publishing Protocol, OpenID and generally creating better models for user interfaces in a roundabout way. In school the semantic web seemed like a dream come true, and as I have begun to realize the power of scripting small tools, it has only become more important to me. The result then is a new understanding of XML and how I can work with it in Python.