Compressed indexable representation of XML data


PPARC e-Science studentship, equiv. GBP 60K


March 2004 - Feb 2007


e-Science presents computer scientists with new challenges in handling huge volumes of data. The student allocated to this project will work closely with people involved in the AstroGrid project. The project is concerned with the efficient storage and processing of large XML files that arise in the context of the International Virtual Observatory Alliance (IVOA). VOTable is an XML-based astronomical data format developed by the IVOA for tables and (later) images.

Unfortunately, XML-based files are larger than their binary equivalents (such as FITS), and network bandwidth will be a scarce resource for the Virtual Observatory. Different VOTable encodings allow trade-offs between efficiency and ease of parsing. Even within the XML community at large there is growing concern that inefficiency arising from document size will hinder the adoption and use of XML. A few XML-specific approaches can compress XML files better than generic algorithms such as gzip. However, compression ratios can vary greatly (from 3:1 to 66:1) on different kinds of data. One issue, then, is to understand the characteristics of astronomical XML files and to invent or discover a compression method suited to these files.
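As a rough illustration of how a compression ratio for a generic algorithm such as gzip might be measured, the sketch below compresses a small synthetic document. The document is a hypothetical, heavily simplified VOTable-like fragment invented for this example, not a real IVOA file, and the helper name compression_ratio is likewise an assumption. The ratio it reports says nothing about real astronomical data.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Return the original size divided by the gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


# Hypothetical, simplified VOTable-like table: 1000 rows of two numeric cells.
rows = "".join(
    f"<TR><TD>{i}</TD><TD>{10.0 + i * 0.25:.2f}</TD></TR>" for i in range(1000)
)
votable_like = (
    "<VOTABLE><RESOURCE><TABLE><DATA><TABLEDATA>"
    + rows
    + "</TABLEDATA></DATA></TABLE></RESOURCE></VOTABLE>"
).encode("utf-8")

ratio = compression_ratio(votable_like)
print(f"{len(votable_like)} bytes, gzip ratio {ratio:.1f}:1")
```

Running the same function over files of different kinds (mostly markup, mostly numbers, mostly free text) is one simple way to observe how strongly the ratio depends on the data.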

Although compressed files are much smaller, their contents become inaccessible until uncompressed. Indeed, it would be impossible to support even the most rudimentary approaches to searching a compressed XML database, such as searching for sections that match an XPath expression. Thus, another issue is to develop a compressed file representation that supports sequential searching through the file for the necessary structural and semantic components.
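For comparison, the baseline available with standard tools can be sketched as follows: the file is decompressed on the fly and scanned sequentially for elements with a given tag. This is not a search over the compressed representation itself, which is what the project aims for; it only shows the kind of sequential structural search in question. The function name find_elements and the sample document are assumptions made for this example, and matching a single tag name stands in for a full XPath expression.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def find_elements(compressed: bytes, tag: str):
    """Yield the text of every element named `tag`, scanning sequentially.

    The gzip stream is decompressed incrementally as the parser reads it,
    so the whole uncompressed document is never held in memory at once.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == tag:
                yield elem.text
            elem.clear()  # discard content that has already been inspected


# Hypothetical VOTable-like sample, invented for illustration.
sample = (
    b"<VOTABLE><TABLE><DATA><TABLEDATA>"
    b"<TR><TD>1.5</TD><TD>2.5</TD></TR>"
    b"</TABLEDATA></DATA></TABLE></VOTABLE>"
)
cells = list(find_elements(gzip.compress(sample), "TD"))
```

The cost of this baseline is that every query pays for decompressing and parsing everything up to the point of the match, which is the overhead a searchable compressed representation would try to avoid.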

As XML-based formats such as VOTable become the norm for the extraction of data from astronomical archives, XML is likely to follow FITS in being used not only for data interchange but also for data storage. XML-based databases will therefore assume increased importance. However, current XML technology is not efficient enough to scale well. A final issue is to develop a compressed in-memory representation that supports complex queries.


R Raman.
