XMill, the XML Compressor

XML is becoming an increasingly popular standard for representing and storing documents and for transporting data over the Internet. The amount of data available in XML is growing rapidly, so efficient transport and storage techniques are necessary. One such technique is compression. Conventional compressors, such as those based on Lempel-Ziv or Huffman coding, achieve reasonable compression. However, they do not take the specific syntax and semantics of XML into account and thus miss several opportunities for compression.

XMill is a special-purpose compressor for XML that usually achieves about twice the compression ratio of gzip at roughly the same speed. The main component of XMill is a clustering technique that groups related data elements together before applying conventional data compression to each group. Depending on the type of XML data to be compressed, users can choose among the default clustering techniques or define their own clustering strategies. Furthermore, XMill can be extended with specialized compressors for complex data structures, such as URLs, dates, images, or DNA sequences.
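The clustering idea can be illustrated with a minimal Python sketch. This is not XMill's implementation or its container format: it assumes the simplest possible strategy (one container per element tag), ignores attributes and mixed content, and uses the standard zlib module as the conventional back-end compressor. The names cluster_by_tag, compress_clustered, and SAMPLE are hypothetical and exist only for this illustration.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def cluster_by_tag(xml_text):
    """Split a document into a structure stream plus one text container per tag.

    Simplifying assumptions: attributes and text that follows a child
    element (mixed content) are ignored.
    """
    root = ET.fromstring(xml_text)
    structure = []                  # sequence of open/close/value tokens
    containers = defaultdict(list)  # tag name -> text values in document order

    def walk(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
            structure.append("#")   # placeholder: the value lives in a container
        for child in elem:
            walk(child)
        structure.append(">")

    walk(root)
    return structure, dict(containers)


def compress_clustered(xml_text):
    """Compress the structure stream and each container separately with zlib."""
    structure, containers = cluster_by_tag(xml_text)
    # "#structure" cannot clash with an element name, since XML names
    # may not start with "#".
    blobs = {"#structure": zlib.compress("\n".join(structure).encode("utf-8"))}
    for tag, values in containers.items():
        blobs[tag] = zlib.compress("\n".join(values).encode("utf-8"))
    return blobs


# Synthetic weblog-style data, made up for this example.
SAMPLE = "<log>" + "".join(
    "<entry><host>10.0.0.%d</host><date>2011-08-%02d</date>"
    "<status>200</status></entry>" % (i % 7, i % 28 + 1)
    for i in range(200)
) + "</log>"

structure, containers = cluster_by_tag(SAMPLE)
blobs = compress_clustered(SAMPLE)
clustered_size = sum(len(blob) for blob in blobs.values())
plain_size = len(zlib.compress(SAMPLE.encode("utf-8")))
print(sorted(containers), clustered_size, plain_size)
```

Each container holds values of a single kind (all hosts, all dates, all status codes), which is what gives a conventional compressor more regularity to exploit. The sizes printed for this toy input say nothing about XMill's actual compression ratios.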

XMill can be downloaded from http://www.seas.upenn.edu/~liefke/xmill/xmill.html. The homepage also describes several experiments that illustrate how XMill improves over conventional compressors for several real data sets, such as weblog data, protein sequence data, linguistic data, and bibliographic data.

Project Members

Hartmut Liefke, Dan Suciu

Publications

Last update: 08/02/11