===============================
Gamera XML format (version 2.0)
===============================

Introduction
------------

The Gamera XML format is used to store sets of small ONEBIT (binary)
images.  This is most commonly used to store training data for a
classifier.

Since these files tend to be quite large, Gamera supports loading and
saving of gzipped XML files.
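
For tools that read these files outside of Gamera, the following
minimal sketch shows one way to accept both plain and gzipped files.
The helper name ``open_gamera_xml`` is hypothetical and not part of
Gamera; it relies only on the fact that every gzip stream starts with
the bytes ``0x1f 0x8b``.

.. code:: Python

  import gzip


  def open_gamera_xml(path):
      """Open a Gamera XML file for binary reading, gzipped or not."""
      # Peek at the first two bytes: gzip streams start with 0x1f 0x8b.
      with open(path, "rb") as probe:
          is_gzipped = probe.read(2) == b"\x1f\x8b"
      # Hand back a file object that yields the uncompressed XML bytes.
      return gzip.open(path, "rb") if is_gzipped else open(path, "rb")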

XML format
----------

Below is an informal description of the XML format.  A more formal
definition is given in the `DTD`_.

Gamera uses Unicode internally to store symbol names, so the XML file
can be encoded in most standard encodings that Python can
automatically convert to Unicode (see the Python documentation for
the ``codecs`` module).

All XML files created directly from Gamera are encoded in UTF-8.

.. code:: XML

  <?xml version="1.0" encoding="utf-8"?>

``gamera-database`` (Toplevel)
``````````````````````````````

At the top level is the ``gamera-database`` element.  A version must
be specified.  Note that this is the version of the Gamera XML file
format, which is independent of the version of Gamera itself.
The version documented here is ``2.0``.  Version ``1.0`` was a weak
first attempt and is no longer supported.  We plan to make future
versions backward and forward compatible whenever possible.

.. code:: XML

  <gamera-database version="2.0">
    ...
  </gamera-database>
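
A third-party reader might check the version before going further.
The sketch below uses Python's standard ``xml.etree.ElementTree``
module; the function name ``check_format_version`` is hypothetical,
and treating every ``1.x`` version as unsupported is an assumption
based on the statement above about version ``1.0``.

.. code:: Python

  import xml.etree.ElementTree as ET


  def check_format_version(root):
      """Return the format version of a parsed file, or raise ValueError."""
      if root.tag != "gamera-database":
          raise ValueError("unexpected top-level element: %s" % root.tag)
      version = root.get("version")
      if version is None:
          raise ValueError("gamera-database has no version attribute")
      # Assumption: anything with major version 1 is the unsupported format.
      if version.split(".")[0] == "1":
          raise ValueError("version 1.0 files are no longer supported")
      return version


  root = ET.fromstring('<gamera-database version="2.0"></gamera-database>')
  version = check_format_version(root)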

``glyphs``
''''''''''

The ``glyphs`` element simply contains ``glyph`` elements.

.. code:: XML

  <glyphs>
    ...
  </glyphs>

``glyph``
+++++++++

Each ``glyph`` element represents a small ONEBIT (binary) image.

Its attributes specify the bounding box and are directly equivalent
to the corresponding members of the image object itself:
``uly``, ``ulx``, ``nrows`` and ``ncols``.

.. code:: XML

   <glyph uly="242" ulx="1758" nrows="26" ncols="18">
     ...
   </glyph>
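
As an illustration, the bounding box can be read with the standard
``xml.etree.ElementTree`` module.  The helper ``glyph_bounding_box``
is a hypothetical name, and the lower-right corner computed here
assumes that ``uly`` and ``ulx`` are the row and column of the
upper-left pixel.

.. code:: Python

  import xml.etree.ElementTree as ET


  def glyph_bounding_box(glyph):
      """Return (uly, ulx, nrows, ncols) of a glyph element as integers."""
      keys = ("uly", "ulx", "nrows", "ncols")
      return tuple(int(glyph.get(key)) for key in keys)


  glyph = ET.fromstring('<glyph uly="242" ulx="1758" nrows="26" ncols="18"/>')
  uly, ulx, nrows, ncols = glyph_bounding_box(glyph)
  # Inclusive lower-right corner, derived from the upper-left corner and size.
  lry, lrx = uly + nrows - 1, ulx + ncols - 1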

``ids``
.......

Each ``glyph`` contains an ``ids`` element which contains all of
the class names that define the glyph.  Since some classifiers will
return multiple class names with different confidences, ``ids`` is
actually a set of ``id`` elements.

``ids`` has an attribute ``state`` that defines how the glyph was
classified.  It can be one of {``UNCLASSIFIED``, ``AUTOMATIC``,
``HEURISTIC``, ``MANUAL``}.

Each ``id`` element within ``ids`` has two attributes:

``name``
	The class name, which is period-delimited by category.

``confidence``
	A confidence value representing how sure the classifier was
	about the identity of this glyph.  The value is in the range 0
	(not at all confident) to 1 (very confident).

.. code:: XML

      <ids state="AUTOMATIC">
        <id name="number.three" confidence="1.000000"/>
      </ids>
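
The following sketch reads the classification of one glyph with the
standard ``xml.etree.ElementTree`` module.  The function name
``read_ids`` is hypothetical; the fallback values are the attribute
defaults declared in the `DTD`_.

.. code:: Python

  import xml.etree.ElementTree as ET


  def read_ids(glyph):
      """Return (state, [(name, confidence), ...]) for one glyph element."""
      ids = glyph.find("ids")
      # Fallbacks mirror the DTD defaults for missing attributes.
      state = ids.get("state", "UNCLASSIFIED")
      names = [(i.get("name", "UNKNOWN"), float(i.get("confidence", "1.0")))
               for i in ids.findall("id")]
      return state, names


  glyph = ET.fromstring(
      '<glyph><ids state="AUTOMATIC">'
      '<id name="number.three" confidence="1.000000"/>'
      '</ids></glyph>')
  state, names = read_ids(glyph)
  # Class names are period-delimited by category.
  categories = names[0][0].split(".")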

``data``
........

The ``data`` element contains a run-length encoded representation of
the image itself.  This data is used to reconstruct the image when
loading.  It consists of decimal numbers separated by whitespace.  The
numbers alternate between the length of white runs and the length of
black runs, starting with a white run.  Runs wrap around from the
right edge of the image to the left edge of the next row.  Systems
that work on Gamera XML should take care not to modify the data block,
as it is very easy to corrupt and there is only minimal error checking
upon reloading.

.. code:: XML

      <data>
        6 4 12 9 8 12 5 6 1 7 4 5 3 7 2 6 4 6 2 7 3 6 2 7 3 6 3 6 3 6 3 4 5 6
        4 3 4 7 5 3 2 6 7 11 7 11 8 11 6 5 1 7 2 6 4 6 2 6 4 15 3 14 4 14 4 6
        2 6 3 7 2 15 4 12 7 10 11 3 9 0 
      </data>
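
The decoding can be sketched in a few lines of Python.  This is an
illustration, not Gamera's own loader: the name ``decode_runs`` is
hypothetical, and the sketch assumes that the first run is white and
that the runs cover exactly ``nrows * ncols`` pixels.  Both
assumptions hold for the sample above, whose runs sum to 468 = 26 x 18
pixels, 294 of them black.

.. code:: Python

  def decode_runs(text, nrows, ncols):
      """Decode run-length data into rows of 0 (white) and 1 (black)."""
      pixels = []
      color = 0  # assumption: the first run is white
      for run in text.split():
          pixels.extend([color] * int(run))
          color = 1 - color  # runs alternate white, black, white, ...
      if len(pixels) != nrows * ncols:
          raise ValueError("runs cover %d pixels, expected %d"
                           % (len(pixels), nrows * ncols))
      # Runs wrap from the right edge to the next row, so the flat
      # pixel list can simply be cut into rows of ncols pixels.
      return [pixels[r * ncols:(r + 1) * ncols] for r in range(nrows)]


  rows = decode_runs("1 2 3 2 0", nrows=2, ncols=4)
  # rows == [[0, 1, 1, 0], [0, 0, 1, 1]]

In the small example, the white run of length 3 covers the last pixel
of the first row and the first two pixels of the second row.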

``features``
............

The ``features`` element contains a set of ``feature`` elements, one
for each feature function.

.. note:: When Gamera XML files are loaded into Gamera, the features
   information is discarded and regenerated.  This is so the
   classifier has complete control over what features are used and how
   they are used.  The feature values are dumped to the XML file only
   for the benefit of third-party applications that may wish to use
   the features without using Gamera code itself.

The ``features`` element has one attribute, ``scaling``, which is used
to define how the features should be scaled (if at all).

Each ``feature`` element has one attribute, ``name``, which is the
feature function that generated the values.  More feature functions
can be added by `creating plugin methods`__ in the category
``Features``.

.. __: writing_plugins.html

The character data (``#PCDATA``) of the ``feature`` element contains
a whitespace-separated list of floating point numbers (in decimal)
that were generated by the given feature function.

.. code:: XML

      <features scaling="1.0">
        <feature name="area">
          468.0
        </feature>
        <feature name="aspect_ratio">
          0.692307692308
        </feature>
        <feature name="black_area">
          294.0
        </feature>
        <feature name="compactness">
          0.428571428571
        </feature>
        <feature name="moments">
          0.302825261159 0.307856673241 0.0657821501667 0.149170828096
          0.0533051094428 0.008619818861 -0.000427188837948 0.000613411663867
          0.0317229010445
        </feature>
        <feature name="nholes">
          0.944444444444 0.615384615385
        </feature>
        <feature name="nholes_extended">
          3.0 6.0 4.0 2.0 3.0 6.0 4.0 3.0
        </feature>
        <feature name="skeleton_features">
          3.0 10.0 0.206349206349 6.0 4.0 3.0
        </feature>
        <feature name="volume">
          0.628205128205
        </feature>
        <feature name="volume16regions">
          0.485714285714 0.514285714286 0.457142857143 0.8 0.771428571429
          0.685714285714 0.828571428571 0.8 0.657142857143 0.571428571429
          0.714285714286 0.685714285714 0.657142857143 0.885714285714
          0.857142857143 0.771428571429
        </feature>
        <feature name="volume64regions">
          0.0833333333333 0.5 0.5 0.0833333333333 0.0 0.583333333333
          0.916666666667 0.416666666667 0.5 1.0 1.0 0.583333333333 0.25
          0.833333333333 1.0 0.916666666667 0.833333333333 1.0 0.916666666667
          0.833333333333 0.916666666667 1.0 1.0 1.0 0.916666666667 0.5 0.5
          0.333333333333 1.0 0.583333333333 0.416666666667 0.833333333333
          0.833333333333 0.25 0.0 0.5 0.916666666667 0.166666666667
          0.0833333333333 0.833333333333 0.75 0.833333333333 0.666666666667
          0.916666666667 1.0 0.75 0.75 1.0 0.583333333333 1.0 1.0 1.0 1.0 1.0
          1.0 0.75 0.25 0.916666666667 1.0 0.666666666667 0.583333333333 1.0
          1.0 0.416666666667
        </feature>
        <feature name="zernike_moments">
          -10.6511384992 6.48862460298 -29.6215265124 21.4915669267
          -4.95741930842 9.70917594801 -41.1062437889 61.7375776134
          5.72778404882 5.31916863982 -16.2283589874 87.6140952777
          0.148787539745 -2.75048482171 -4.19045628996 74.7653517483
          0.403589004311 -0.403486704972 -4.89682421377 61.1321737136
          -1.01342510763 -0.526430778878 -13.8421959343 63.5845586372
          -0.188701625589 0.697880091555
        </feature>
      </features>
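
A third-party application that wants the stored feature values could
read them along the following lines.  This is a sketch using the
standard ``xml.etree.ElementTree`` module; the function name
``read_features`` is hypothetical, and the ``scaling`` fallback is the
default declared in the `DTD`_.

.. code:: Python

  import xml.etree.ElementTree as ET


  def read_features(glyph):
      """Return (scaling, {feature name: [values]}) for one glyph element."""
      features = glyph.find("features")
      if features is None:  # the DTD marks ``features`` as optional
          return None, {}
      scaling = float(features.get("scaling", "1.0"))
      # Each feature holds whitespace-separated decimal numbers.
      values = {f.get("name"): [float(v) for v in (f.text or "").split()]
                for f in features.findall("feature")}
      return scaling, values


  glyph = ET.fromstring(
      '<glyph><features scaling="1.0">'
      '<feature name="area"> 468.0 </feature>'
      '<feature name="nholes"> 0.944444444444 0.615384615385 </feature>'
      '</features></glyph>')
  scaling, values = read_features(glyph)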

``symbols``
'''''''''''

The ``symbols`` element contains a set of symbol names.  This is used
to import/export symbol tables without the expense of loading glyphs.

Each ``symbol`` element has one attribute, ``name``, which defines
a period-delimited class name.

.. code:: XML

  <symbols>
    <symbol name="notehead.half.ledger.middle" />
    <symbol name="dot" />
    <symbol name="clef.bass" />
    <symbol name="accidental.doubleflat" />
    <symbol name="notehead.triangle.hollow" />
    <symbol name="bracket" />
    <symbol name="brace" />
  </symbols>
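
One way to read only the symbol table is to parse incrementally and
stop as soon as the ``glyphs`` element begins.  The sketch below uses
``iterparse`` from the standard ``xml.etree.ElementTree`` module; the
function name ``read_symbol_names`` is hypothetical, and stopping
early relies on the ordering given in the `DTD`_, where ``symbols``
come before ``glyphs``.

.. code:: Python

  import io
  import xml.etree.ElementTree as ET


  def read_symbol_names(source):
      """Collect symbol names from a file object, skipping all glyphs."""
      names = []
      # Attributes are already available when an element starts.
      for _event, element in ET.iterparse(source, events=("start",)):
          if element.tag == "symbol":
              names.append(element.get("name"))
          elif element.tag == "glyphs":
              break  # symbols precede glyphs, so nothing more to find
      return names


  document = io.BytesIO(
      b'<gamera-database version="2.0">'
      b'<symbols><symbol name="dot" /><symbol name="clef.bass" /></symbols>'
      b'<glyphs></glyphs>'
      b'</gamera-database>')
  names = read_symbol_names(document)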

DTD
---

This is the formal definition of the Gamera XML Format.

.. code:: XML

  <!-- DTD for gamera glyph database files 
    11/18/02 Michael Droettboom
  -->

  <!ELEMENT gamera-database (symbols*,glyphs*)>
  <!ATTLIST gamera-database version CDATA #REQUIRED>

  <!ELEMENT symbols (symbol*)>

  <!ELEMENT symbol EMPTY>
  <!ATTLIST symbol name CDATA #REQUIRED>

  <!ELEMENT glyphs (glyph*)>

  <!ELEMENT glyph (ids,data,features?)>
  <!ATTLIST glyph uly CDATA #REQUIRED>
  <!ATTLIST glyph ulx CDATA #REQUIRED>
  <!ATTLIST glyph nrows CDATA #REQUIRED>
  <!ATTLIST glyph ncols CDATA #REQUIRED>

  <!ELEMENT ids (id*)>
  <!ATTLIST ids state 
    (UNCLASSIFIED|HEURISTIC|AUTOMATIC|MANUAL) "UNCLASSIFIED">

  <!ELEMENT id EMPTY>
  <!ATTLIST id name CDATA "UNKNOWN">
  <!ATTLIST id confidence CDATA "1.0">

  <!ELEMENT data (#PCDATA)>

  <!ELEMENT features (feature*)>
  <!ATTLIST features scaling CDATA "1.0">

  <!ELEMENT feature (#PCDATA)>
  <!ATTLIST feature name CDATA #REQUIRED>

Saving and loading Gamera XML files
-----------------------------------

The code for saving and loading Gamera XML files is in
``gamera/gamera_xml.py``.

Use the following functions to save and load Gamera XML files:

.. docstring:: gamera gamera_xml glyphs_from_xml glyphs_with_features_from_xml glyphs_to_xml strip_features
