Thursday, June 25, 2009

Divide and Conquer, or XPath, XSLT, XQuery and XProc packaging

Packaging of the various X* technologies seems to be of interest to a lot of people right now. And of course it is to me. But it seems everyone comes with their own idea of packaging, as well as a different scope. So, to add yet more complexity, I will present here my own ideas on the matter. Hopefully, I will manage to tidy up the different concepts and identify the different needs. And as always, I like to talk about concrete things, if only to ease further discussions. So I will introduce a prototype of a packaging system for X* libraries and extensions for Saxon.

Packaging is nothing in itself. It is always related to something else (a language, a technology, a framework...). Packaging is just a means to ease sharing and delivering something in the scope of that "something else." The several files in an ODF document are packaged in a single ZIP file, with a pre-defined structure, to make it possible for an application to use its content. The important point is not the structure in itself, but rather the information it gathers.

I have followed some very interesting discussions about X* packaging during the last few weeks, with very interesting people. I quickly saw that everyone was talking about slightly (or not so slightly) different things. The most important point on which people have different views, IMHO, is the scope of packaging.

As with most modern languages, an XML developer may have to deliver different pieces of software, depending on the project: libraries, standalone applications, or web applications built for a specific framework. If you look at Java for instance, this is reflected quite clearly in its various packaging formats: JAR files for libraries and applications, WAR files for web applications, EAR files for entire enterprise applications...

WAR files contain Java classes, as JAR files do. But the structure is quite different, and there are a few other files, describing what is in the package: "that class is a servlet class, conforming to the definition of a servlet and coded to live in a servlet container, with a precise lifecycle," or "the package depends on this JAR file."

In the same way, you can package XSLT libraries or XQuery modules, telling a processor that when a stylesheet or a module imports a specific URI, some functions are available (provided as plain XSLT stylesheets, XQuery modules, or extension functions). Or you can package an entire web application using XProc to control the overall processes, XQuery to query XML databases and XSLT for the presentation layer (sounds very MVC, doesn't it?). But those packages are really different beasts: while the first example just needs to package some XSLT, XQuery, Java or whatever code, alongside a simple cataloging system, the second example requires defining a complete web framework, its lifecycle, and how scripts can plug into it and exchange information with it ("this XProc pipeline has to be evaluated on an HTTP GET on http://www.example/app/theuri, it knows you will provide it with request information as a wa:http-request element, as we agreed upon, and that XSLT stylesheet has to be applied to its result; by the way it will access runtime information by using the extension functions you provide").

There has been some work on XRX frameworks, and clearly it would be beneficial for everybody (users, but also implementors) to have a standard packaging format for entire applications following their rules (as WAR and EAR files are for Java). They would also benefit from a lower-level packaging format dedicated to packaging X* libraries, and would build upon it. But the two really are at different levels, and I think it is fundamental to make the distinction between both concepts.

As part of the EXPath project, and because I think this is the first step X* technologies have needed for several years to enable the delivery of libraries, I am particularly interested in a library packaging format.

To illustrate that, I've built a very simple prototype of a package manager for Saxon. On the one hand you have a simple GUI to install and delete packages in a repository, and on the other hand you have a shell script to launch Saxon (setting the classpath for extension functions and setting catalogs to resolve XSLT imports referring to libraries). If those tools are built around a well-defined, open package format, other implementations could be written (for eXist, for MarkLogic, XQilla, Zorba... but also for oXygen, providing a one-click way to install a package and then enable it in some scenarios).

You can find the manager at http://www.fgeorges.org/purl/20090624/. You should be able to run it simply by clicking on one of the links on the launch.html page (through Java Web Start), but you can also download the JAR file (look also in the lib/ sub-directory), put both JAR files on the classpath and run Java the usual way, with the main class org.expath.pkg.saxon.PackageManagerGUI (there is also a text interface with org.expath.pkg.saxon.PackageManagerTextUI). You first have to set up an environment variable EXPATH_REPO, pointing to a directory (that will be your EXPath Packaging repository; just create an empty directory). The interface is very simple: choose the install item in the file menu, and select the package file you want to install. To remove a package, select it in the list of installed modules and select delete in the menu.

Once a module is installed, you can use it via Saxon by adding the additional JARs to the classpath as needed (for extension functions) and by setting up the XML Catalogs support. The following script does that for you: http://www.fgeorges.org/purl/20090624/saxon. It needs a few environment variables: EXPATH_REPO as explained above, APACHE_XML_RESOLVER_JAR must point to the Apache XML Commons Resolver (see http://xml.apache.org/commons/, and be sure to pick the resolver JAR) and SAXON_HOME must point to the directory containing the Saxon JARs.
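To make the role of that launcher more concrete, here is a minimal Python sketch of the kind of command line such a script might assemble. It is not the actual script linked above. It assumes that the repository mirrors the package layout (module/saxon/jar and module/saxon/lib), that the catalog file location is passed in by the caller (the real location inside the repository is not specified here), and that Saxon is wired to the Apache resolver through its -r, -x and -y options and the xml.catalog.files system property. The function name saxon_command is made up for illustration.

```python
import os
from pathlib import Path


def saxon_command(repo, saxon_home, resolver_jar, catalog, saxon_args):
    """Build a java command line running Saxon against a package repository.

    Hypothetical helper: the directory layout and the catalog location are
    assumptions, not something defined by the prototype.
    """
    repo = Path(repo)
    # Saxon's own JARs, plus the Apache XML Commons Resolver JAR.
    jars = sorted(Path(saxon_home).glob("*.jar")) + [Path(resolver_jar)]
    # Extension JARs and their bundled dependencies, assuming the repository
    # keeps the <module>/saxon/jar and <module>/saxon/lib directories.
    jars += sorted(repo.glob("*/saxon/jar/*.jar"))
    jars += sorted(repo.glob("*/saxon/lib/*.jar"))
    classpath = os.pathsep.join(str(jar) for jar in jars)
    resolver = "org.apache.xml.resolver.tools"
    return [
        "java",
        "-cp", classpath,
        # System property read by the Apache resolver to find catalog files.
        "-Dxml.catalog.files=" + str(catalog),
        "net.sf.saxon.Transform",
        # Resolve stylesheet imports and source documents through catalogs.
        "-r:" + resolver + ".CatalogResolver",
        "-x:" + resolver + ".ResolvingXMLReader",
        "-y:" + resolver + ".ResolvingXMLReader",
    ] + list(saxon_args)
```

The returned list could be handed to subprocess.run, with the three locations read from EXPATH_REPO, SAXON_HOME and APACHE_XML_RESOLVER_JAR.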

But what about the package format itself? In this prototype, this is a simple ZIP file, with the following structure:

expath-pkg.xml
expath-http-client/
   saxon/
      xsl/
         expath-http-client-saxon.xsl
      jar/
         expath-http-client-saxon.jar
      lib/
         commons-codec-1.3.jar
         ...jar

where expath-pkg.xml is the package descriptor, and expath-http-client is the directory containing one module (here the EXPath HTTP Client module). This module is implemented as a Java extension, alongside a frontend XSLT stylesheet that takes care of the Saxon specifics to bind to the Java functions. During the install, an XML Catalogs file is created, to resolve the URI http://www.expath.org/mod/http-client.xsl to that stylesheet in the local repository. A stylesheet can then simply import that URI and use the functions of the module. The real package for the HTTP Client can be downloaded at the same place: http://www.fgeorges.org/purl/20090624/expath-http-client-saxon-0.3.zip.
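The catalog generation step can be sketched in a few lines of Python. This is only an illustration of the mapping, not the code of the prototype (which is written in Java). It assumes that the catalog sits at the root of the repository, so that entries can be relative paths, and that each import URI maps to the module directory followed by the file path given in the descriptor. The function name catalog_from_descriptor is hypothetical, and the descriptor string repeats the one used by the HTTP Client package.

```python
import xml.etree.ElementTree as ET

PKG_NS = "http://expath.org/mod/expath-pkg"
# Namespace of the OASIS XML Catalogs format.
CAT_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

DESCRIPTOR = """
<package xmlns="http://expath.org/mod/expath-pkg">
   <module version="0.3" name="expath-http-client">
      <title>EXPath HTTP Client</title>
      <xsl>
         <import-uri>http://www.expath.org/mod/http-client.xsl</import-uri>
         <file>saxon/xsl/expath-http-client-saxon.xsl</file>
      </xsl>
   </module>
</package>
"""


def catalog_from_descriptor(descriptor_xml):
    """Return an XML catalog (as a string) for one package descriptor.

    Hypothetical helper: entries are relative to the repository root, which
    is an assumption about where the catalog file lives.
    """
    ns = {"pkg": PKG_NS}
    package = ET.fromstring(descriptor_xml)
    # Serialize the catalog namespace as the default namespace.
    ET.register_namespace("", CAT_NS)
    catalog = ET.Element("{%s}catalog" % CAT_NS)
    for module in package.findall("pkg:module", ns):
        for xsl in module.findall("pkg:xsl", ns):
            public_uri = xsl.findtext("pkg:import-uri", namespaces=ns)
            local_file = xsl.findtext("pkg:file", namespaces=ns)
            # One <uri> entry per stylesheet: public URI -> file in the repo.
            entry = ET.SubElement(catalog, "{%s}uri" % CAT_NS)
            entry.set("name", public_uri)
            entry.set("uri", module.get("name") + "/" + local_file)
    return ET.tostring(catalog, encoding="unicode")


print(catalog_from_descriptor(DESCRIPTOR))
```

With such a catalog in place, an xsl:import of http://www.expath.org/mod/http-client.xsl would be redirected by the resolver to the stylesheet installed in the repository.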

There is of course still a lot of work to do: defining the package format exactly, working out how to handle dependencies, improving the implementation... But I think that gives the big picture. If you are interested, here is what the package descriptor looks like:

<package xmlns="http://expath.org/mod/expath-pkg">
   <module version="0.3" name="expath-http-client">
      <title>EXPath HTTP Client</title>
      <xsl>
         <import-uri>http://www.expath.org/mod/http-client.xsl</import-uri>
         <file>saxon/xsl/expath-http-client-saxon.xsl</file>
      </xsl>
   </module>
</package>

We can see the package contains one module, namely "EXPath HTTP Client," version 0.3. The import URI and the file path are used to create an XML catalog. This version of the package contains all the dependencies (the JARs used by the Java implementation of the extension functions), but they can also be left out, and declared with the following element:

<saxon>
   <dep type="jar">
      <title>Apache Commons Codec 1.3</title>
      <home>http://jakarta.apache.org/commons/codec/</home>
   </dep>
   <dep type="jar">
      <title>Apache Commons Logging 1.1.1</title>
      <home>http://commons.apache.org/logging/</home>
   </dep>
   <dep type="jar">
      <title>Apache HTTP Client 4.0-beta2</title>
      <home>http://hc.apache.org/</home>
   </dep>
   <dep type="jar">
      <title>Apache HTTP Core 4.0</title>
      <home>http://hc.apache.org/</home>
   </dep>
   <dep type="jar">
      <title>Tagsoup 1.2</title>
      <home>http://home.ccil.org/~cowan/XML/tagsoup/</home>
      <href>http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar</href>
   </dep>
</saxon>

The GUI does not take them into account yet, but it should propose to automatically download JARs when possible, and give the user a list of libraries and their homepage when a manual download is required. But of course, the same format can be used to package standard XSLT stylesheets, without any Java features, just by mapping the main entry point files to their public URIs.
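As a rough sketch of that planned behaviour (it is not implemented in the prototype), the following Python splits the dependencies into those carrying a direct href, which could be downloaded automatically, and those with only a homepage, which the user would have to fetch by hand. The function name split_dependencies is made up, the sample is a shortened copy of the element above, and the snippet is parsed as shown, without a namespace, which is an assumption about how it sits in the descriptor.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <saxon> element, with one entry of each kind.
DEPS = """
<saxon>
   <dep type="jar">
      <title>Apache Commons Codec 1.3</title>
      <home>http://jakarta.apache.org/commons/codec/</home>
   </dep>
   <dep type="jar">
      <title>Tagsoup 1.2</title>
      <home>http://home.ccil.org/~cowan/XML/tagsoup/</home>
      <href>http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar</href>
   </dep>
</saxon>
"""


def split_dependencies(saxon_xml):
    """Return (automatic, manual) lists of (title, url) pairs.

    Hypothetical helper: a dependency counts as automatic when it has an
    <href> pointing straight at the JAR, and as manual otherwise.
    """
    automatic, manual = [], []
    for dep in ET.fromstring(saxon_xml).findall("dep"):
        title = dep.findtext("title")
        href = dep.findtext("href")
        if href:
            # Direct link to the JAR: a manager could fetch it itself.
            automatic.append((title, href))
        else:
            # Only a homepage: show it to the user for a manual download.
            manual.append((title, dep.findtext("home")))
    return automatic, manual


automatic, manual = split_dependencies(DEPS)
for title, url in automatic:
    print("can download:", title, url)
for title, url in manual:
    print("install by hand:", title, url)
```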

Of course, this format will be particularly useful once it is precisely defined in an open spec, and if several processors support it (either natively, or through external managers).

To end this post, I would like to introduce an idea from Jim Fuller: CXAN. I am sure most of you know CTAN for TeX, or CPAN for Perl. They are central, organized repositories of libraries for those languages, accessible through HTTP. With a proper packaging format, it would be possible to set up such a web repository gathering XPath, XSLT, XQuery and XProc libraries and applications, installable automatically with a manager that would install a package from its name, handling dependencies and the like. But for sure, that is a further step.


4 Comments:

Blogger Adam Retter said...

Hi Florent, a very interesting article :-)

When I read this part "but it should propose to automatically download JARs when possible" I actually started thinking of Maven. Perhaps we could actually reuse Maven in some way for the dependency management?

15:15  
Blogger Florent Georges said...

Thanks Adam.

About dependencies, that's clearly the part that needs more work. I do not want to rely on the user having Maven installed, but it would be interesting to add the Maven info *in addition* to the homepage, name and version of the dependency (and maybe to a direct link to the JAR, if available.)

The point is to have in any case enough info to install by hand, but we can add optional info for any dependency manager, *in addition*.

Good point!

17:29  
Blogger Dan McCreary said...

Great article. You have clearly put some good thought into the issues of packaging XRX applications.

I also agree with Adam that we need to look into dependency management. The OSGi standards also have a nice layered architecture (for Java) that we might benefit from.

14:37  
Anonymous Anonymous said...

excellent stuff ... once I have a moment will look (and comment) in more detail.

Jim Fuller

16:20  
