We present FleXML, a program that generates fast validating XML processors from `self-contained' XML DTDs. It uses the flex (lexical analyser generator) program to translate the DTD into a finite automaton enriched with a stack with the `element context'. This means that the XML processor will act directly on each character received. The program is freely redistributable and modifyable (under GNU `copyleft').
Validating XML, DTD, lexical analysis, finite automata.
The `X' of XML stands for Extensible [XML]. This signifies that each and every XML document specifies in its header the details of the format that it will use and may change its format a bit relative to the used base format.
This has influenced the tools available to write validating XML processors: they use a call-back model where the XML processor passes strings with the tags and attributes names and values to the application. These call-backs must be generic because one cannot know whether a document will start by extending its own notation with more tags and attributes. For well-formed but non-validated XML documents this makes a lot of sense, of course, but we would in general like to exploit the information in the DTD for optimizations.
In particular, for many applications a fixed format suffices in the sense that a single DTD is used without individual extensions for a large number of documents. In that case we can do much better because the possible tags and attributes are static.
We have implemented an XML processor generator using the Flex scanner generator that produces deterministic finite automata [ASU]. Which means that there is almost no overhead for XML processing: the generated XML processors read the XML document character by character and can immediately dispatch the actions associated with each element (or reject the document as invalid).
Furthermore we have devised a simple extension of the C programming language that facilitates the writing of `element actions' in C, making it easy to write really fast XML applications. In particular we represent XML attribute values efficiently in C when this is possible, thus avoiding the otherwise ubiquitous conversions between strings and data values.
FleXML is available for free (from SourceForge). In this paper we present FleXML through an elaborated example and discuss some of the technical issues.
Assume that we have an XML document my-joke.xml
      containing the classical joke
<!DOCTYPE joke SYSTEM "my.dtd">
<joke><line>My appartment is so small</line> <suspense/>
<line type='punch-line'>the mice are round-shouldered</line></joke>
my.dtd containing
<!-- my.dtd: Small DTD for jokes (just for fun). -->
<!ELEMENT joke (line,(line|suspense)*)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line type (normal|question|punch-line) 'normal'>
<!ELEMENT suspense EMPTY>
      With FleXML this can be done by creating an `actions file'
      my-show.act which implements the desired actions
      for each element.  The remainder of this section explains the
      contents of such an actions file.
An actions file is itself an XML document which must begin with
<!DOCTYPE actions SYSTEM "flexml-act.dtd">
<actions>
flexml-act.dtd DTD is part of the FleXML
      system and is reproduced in the manual page.
    
      We decide that our application should react to a
      line element by printing the text inside it, and
      that it should differentiate between the three possible `type'
      attribute values by printing corresponding trailing punctuation.
This introduces a slight complication, because the attribute values are available when parsing the start tag whereas the element text is not available until we parse the end tag (where it has been read).
This means that we must declare a top-level variable.
<top><![CDATA[
char* terminator = ".";
]]></top>
CDATA sections to make sure that
      all characters (including white-space) are passed unmodified to
      the C compiler.
    
      With this we can write the action to set it when reading the
      line start tag as
<start tag='line'><![CDATA[
  switch ( {type} ) {
    case {!type}: terminator = "..."; break;
    case {type=normal}: terminator = "."; break;
    case {type=question}: terminator = "??"; break;
    case {type=punch-line}: terminator = "!!"; break;
  }
]]></start>
      The idea is that the enumeration attribute type is
      implemented in C as if it had been declared by
enum { {!type}, {type=normal}, {type=question}, {type=punch-line} } {type};
{...} units as C identifiers),
      hence the possibility of using the fast C switch
      statement to pick the right choice directly.  Note that the
      first choice, {!type}, corresponds to not
      setting the attribute; in this example the attribute has a
      default value so this can never happen, however, we include the
      choice anyway to prevent the C compiler from issuing warnings
      about missing choices in switch statements.
    
      With this in place we can write the action for
      </line>.  Since it prints something, however,
      we first need to add the appropriate header inclusion.
<top><![CDATA[
#include <stdio.h>
]]></top>
<end tag='line'><![CDATA[
  printf("%s%s\n", pcdata, terminator);
]]></end>
      Finally, we will make the application amusing by `displaying'
      the <suspense/> empty tag as a short delay;
      this also involves a header inclusion.
<top><![CDATA[
#include <unistd.h>
]]></top>
<start tag='suspense'><![CDATA[
  sleep(2);
]]></start>
That is all; the only remaining thing is to terminate the action file properly.
</actions>
We can now run FleXML with the DTD and the actions file as input and will get an XML application as output that, when run (after processing by flex and a C compiler), will indeed print
My appartment is so small.
the mice are round-shouldered!!
flexml -A -a my-show.act my.dtd 
flex -omy-show.c my-show.l
cc -omy-show my-show.c
./my-show <./my-joke.xml
      An important aspect of the design of FleXML is that the only
      thing that should matter to the programmer should be the
      complexity of the application, not of the used DTD.  As
      an example the following action file prints the
      href attribute of all hyperlinks in an XHTML
      document:
<!DOCTYPE actions SYSTEM "flexml-act.dtd">
<actions>
<top><![CDATA[
#include <stdio.h>
]]></top>
<start tag='a'><![CDATA[
if ({href}) printf("%s\n", {href});
]]></start>
</actions>
flexml $(FLEXDEBUG) -rhtml -p'-//IETF//DTD XHTML 1.0 Transitional//EN' \
  'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
gcc -Wall -ansi -pedantic -O2 -g -c xhtml-href.c -o xhtml-href.o
flex -Bsv -Ca -oxhtml1-transitional.c xhtml1-transitional.l
gcc -Wall -ansi -pedantic -O2 -g   -c xhtml1-transitional.c -o xhtml1-transitional.o
gcc -Wall -ansi -pedantic   xhtml-href.o xhtml1-transitional.o   -o xhtml-href
FleXML is a perl script [Perl] which reads and interprets a DTD and subsequently produces an XML processor source file for the flex scanner and optionally an XML application with the C source of the element actions from an actions file. The DTD is used to construct the rules used by flex to match the individual XML components in such a way that only valid documents match.
      For example, the flex code for scanning an attribute declaration
      of the line tag is the following:
<AL_line>{
 "type"{Eq}"'normal'"  |
 "type"{Eq}"\"normal\""  A_line_type = A_line_type_normal;
 "type"{Eq}"'question'"  |
 "type"{Eq}"\"question\""  A_line_type = A_line_type_question;
 "type"{Eq}"'punch-line'"  |
 "type"{Eq}"\"punch-line\""  A_line_type = A_line_type_punch_d_line;
 ">" {
  LEAVE; STag_line(); pcdata = BUFFERSET(pcdata); ENTER(IN_line);
 }
 "/>" {
  LEAVE; STag_line(); pcdata = ""; ETag_line();
  switch (YY_START) {
   case ROOT_line: SET(EPILOG); break;
   case S_joke: SET(S_joke_1); break;
   case S_joke_1: case S_joke_2: case S_joke_3: SET(S_joke_3); break;
}}}
{Eq} an alias for the regular expression
    matching an equal sign (corresponding to production `[25] Eq' of
    the XML specification).
    
      This reads as follows: when the XML processor is reading the
      attribute list of the line tag, i.e., when it is in
      the <AL_line> state, a `t' will
      enter an internal state where a `y' proceeds to
      another internal state but other characters makes the document
      invalid (because no rule permits it).  Once the equal sign has
      been scanned, the next characters determine the attribute value,
      and at the end one of the flex actions is performed, setting the
      attribute value (A_line_type is the real C for what
      we wrote as {type}, etc.).  The important thing is
      that one can ensure by careful tuning of the flex rules that a
      valid document will proceed only by looking each character up in
      a table and determining the subsequent `state' and `action'.
      One must avoid pairs of rules such as
"-->"		LEAVE(COMMENT);
.		SKIP;
.' matches any character) because they
    mean that the scanner will not be sure after having read a
    `-' character whether it is part of a comment
    terminator or `just' a dash.  In such cases an extra rule must be
    inserted because for the set
"-->"		LEAVE(COMMENT);
"--"            |
.		SKIP;
      After the actual attribute rules, two rules handle termination
      of the attribute list.  There are two cases corresponding to
      whether we just read a start tag or an empty element.  In case
      it was a start tag then we must enter the `inner' mode of the
      element called IN_line for the line
      element.  The switch handles the state changes
      needed for the line element resulting from the fact that the
      element can appear in different contexts.  This is always
      possible to construct because of the requirement that an XML DTD
      must be deterministic: we just need an element content
      stack (this is what the LEAVE and
      ENTER macros are for).
    
In comparison with the forthcoming XML Style-sheet Language [XSL] our approach is much more primitive for better and worse: only a limited range of applications can be produced with FleXML but those will be very fast.
This is useful for XML applications that are meant to process a large number of documents in a fixed format. One such application is the NTSys u-payment transaction server which is implemented as an Apache module where performance is of premium importance. Using FleXML permits a highly modular development of modules for the various transaction types based on a common DTD and a collection of applications that are generated separately and all linked together with the common processor.
FleXML is still under development: please try it out (either from SourceForge or from the Debian GNU/Linux distribution where FleXML is include from release 2.2. The author would welcome comments as to how the system can best evolve. Two things that are definitely on the agenda is a limited `context condition' language for expressing constraints on the position of an element in a document (corresponding to the Xpath subset of XSL), and an efficient way to combine several DTDs into one large t facilitate general XML servers (that can even dispatch to a generic XML interface in cases where the FleXML restrictions are not respected by a document).
I am grateful to NTSys for supporting the development of FleXML. Finally extend my sincere thanks to Jef Poskanzer, Vern Paxson, and the rest of the flex maintainers for a great tool.