XMill, the XML Compressor

XML is becoming an increasingly popular standard for representing and storing documents and for transporting data over the Internet. The amount of data available in XML is growing rapidly, so efficient transport and storage techniques are necessary. One such technique is compression. Conventional compressors, such as those based on Lempel-Ziv or Huffman coding, achieve reasonable compression. However, they do not consider the specific syntax and semantics of XML and thus miss several opportunities for compression.

XMill is a special-purpose compressor for XML that usually achieves about twice the compression ratio of gzip at roughly the same speed. The main component of XMill is a clustering technique that groups data elements together before applying conventional data compression to them. Depending on the type of XML data to be compressed, the user can choose among the default clustering techniques or define custom clustering strategies. Furthermore, XMill can be extended with specialized compressors for complex data structures, such as URLs, dates, images, or DNA sequences.
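
The clustering idea can be illustrated with a minimal Python sketch. This is not XMill's code or file format: the function names (cluster_by_path, compress_containers, decompress_containers), the choice to group text values by element path, and the sample document are all assumptions made for illustration, and zlib stands in for the conventional compressor. The sketch ignores attributes, mixed content, and the document structure itself, so it cannot rebuild the original XML; it only shows values of the same kind being collected into one container and each container being compressed on its own.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document, loosely modeled on weblog data.
SAMPLE = """<log>
  <entry><host>a.example.org</host><date>1999-01-01</date></entry>
  <entry><host>b.example.org</host><date>1999-01-02</date></entry>
</log>"""

# NUL cannot occur in XML 1.0 character data, so it is a safe separator.
SEPARATOR = "\x00"


def cluster_by_path(xml_text):
    """Collect element text values into containers keyed by element path.

    Assumption: grouping by path is used here as one plausible clustering
    strategy; other groupings are equally possible.
    """
    root = ET.fromstring(xml_text)
    containers = defaultdict(list)

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        text = (elem.text or "").strip()
        if text:  # skip whitespace-only text between child elements
            containers[path].append(text)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return dict(containers)


def compress_containers(containers):
    """Compress each container separately with a conventional compressor."""
    return {
        path: zlib.compress(SEPARATOR.join(values).encode("utf-8"))
        for path, values in containers.items()
    }


def decompress_containers(compressed):
    """Invert compress_containers, recovering the lists of values."""
    return {
        path: zlib.decompress(blob).decode("utf-8").split(SEPARATOR)
        for path, blob in compressed.items()
    }


if __name__ == "__main__":
    containers = cluster_by_path(SAMPLE)
    for path, blob in compress_containers(containers).items():
        print(path, len(containers[path]), "values,", len(blob), "bytes")
```

Keeping host names in one container and dates in another places similar strings next to each other, which is the kind of regularity a general-purpose compressor can exploit. A specialized compressor for a given data type could, under the same assumptions, replace zlib for individual containers.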

XMill can be downloaded from its homepage. The homepage also describes several experiments that illustrate how XMill improves over conventional compressors on several real data sets, such as weblog data, protein sequence data, linguistic data, and bibliographic data.

Project Members

Hartmut Liefke   Dan Suciu


