[fitsbits] Potential new compression method for FITS tables

Mark Taylor m.b.taylor at bristol.ac.uk
Thu Dec 16 08:42:51 EST 2010


Dear FITS table compression authors and lurkers,

I have some comments on this document.  Sorry it's taken a long time;
I've had some other things on.

Although it doesn't say so explicitly, I presume, since there's no
indication otherwise, that tables encoded in the way described by this
document are still XTENSION = 'BINTABLE'.  I think this is problematic,
since a table reader which is unaware of this convention may encounter
such a FITS extension, see that it's a BINTABLE, and believe that it can
make sense of it.  Although a table encoded according to this convention
is syntactically a correct BINTABLE, if interpreted as a normal BINTABLE,
the contents will be garbage.  Moreover, the semantics of some of the
other T***n headers will no longer hold.  For instance the TDIMn header,
whose content is not changed under the proposed convention, will no
longer contain the shape of elements in the column, and TUNITn will no
longer contain its units.  For this reason it seems to me that if the
proposal is to be adopted, it ought to propose a new XTENSION type for
tile-compressed tables, so that unaware software realises that it doesn't
know how to interpret such HDUs.

I also have a concern that these tables are harder to use than
existing non-compressed BINTABLEs.  There are two aspects to this.
Most obviously, tool/library authors who wish to support such files 
will need to write additional code for uncompression and/or
compression.  Secondly, tables which have been compressed in
this way are unsuitable for random access, since unlike for a
normal BINTABLE, it's not possible to calculate the HDU offset
of a given row/column cell.  This may have considerable performance
implications for data access patterns which require other than
sequential access to the data (it would certainly slow down a number
of operations in TOPCAT/STILTS).  Whether this matters depends on
who is using this convention in what context.  For archives that
want only to store table data as compactly as possible, it's 
not an issue.  But for tables which are distributed to users who 
want to do processing on them, the saving of disk space and bandwidth
may be outweighed by the inconvenience of slower and/or restricted 
access.  It might be a good idea to mention this issue somewhere
in the discussion.
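
To make the random-access point concrete, here is a minimal sketch of
the offset calculation that a normal BINTABLE allows.  The function
names (field_width, cell_offset) are made up for illustration, rows and
columns are counted from zero, and the result is relative to the start
of the HDU's data area rather than the start of the file.

```python
import math
import re

# Bytes per element for the BINTABLE TFORMn type codes.
# P and Q are variable-length array descriptors (8 and 16 bytes).
TYPE_BYTES = {"L": 1, "B": 1, "I": 2, "J": 4, "K": 8, "A": 1,
              "E": 4, "D": 8, "C": 8, "M": 16, "P": 8, "Q": 16}


def field_width(tform):
    """Width in bytes of one field, given its TFORMn value (e.g. '1J')."""
    match = re.match(r"\s*(\d*)([LXBIJKAEDCMPQ])", tform)
    if match is None:
        raise ValueError("unrecognised TFORM: %r" % tform)
    # A missing repeat count means 1.
    repeat = int(match.group(1)) if match.group(1) else 1
    code = match.group(2)
    if code == "X":
        # Bit arrays are padded up to a whole number of bytes.
        return math.ceil(repeat / 8)
    return repeat * TYPE_BYTES[code]


def cell_offset(tforms, row, col):
    """Byte offset of cell (row, col) within the data area (both 0-based)."""
    widths = [field_width(t) for t in tforms]
    row_length = sum(widths)  # should equal the NAXIS1 header value
    return row * row_length + sum(widths[:col])
```

No such closed-form calculation is available once each column of each
tile is stored as a variable-length compressed byte array; a reader
would presumably have to decompress at least the relevant part of the
tile that contains the row.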

Concerning the results table in section 6, it took me a while to
work out what the "Disk Savings factor" meant.  I think its value
is (1-1/GZIP_2)/(1-1/gzip).  This doesn't seem to be an especially
useful figure, since it's not scaled by the original size of the
file, so for instance if both the gzip and GZIP_2 methods save
a negligible amount, its value will diverge.  A better headline
figure for the improvement of tile-compression against plain gzip
might be GZIP_2/gzip.
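
The two figures can be compared with a short sketch.  It assumes that
"gzip" and "GZIP_2" denote compression ratios (original size divided by
compressed size), which is the reading under which the formula above
makes sense; the function names and the example ratios are invented for
illustration and are not taken from the document's results table.

```python
def savings_factor(ratio_tiled, ratio_gzip):
    """Ratio of the fractions of disk space saved by the two methods.

    Each argument is a compression ratio (original / compressed size).
    Undefined (division by zero) when ratio_gzip is exactly 1.
    """
    return (1 - 1 / ratio_tiled) / (1 - 1 / ratio_gzip)


def ratio_of_ratios(ratio_tiled, ratio_gzip):
    """The suggested headline figure: GZIP_2 / gzip."""
    return ratio_tiled / ratio_gzip


# A table that compresses well under both methods (made-up ratios).
good = (savings_factor(2.5, 2.0), ratio_of_ratios(2.5, 2.0))
# roughly 1.2 and 1.25

# A nearly incompressible table (made-up ratios): both methods save
# well under 1% of the file, yet the savings factor is about 2.
poor = (savings_factor(1.002, 1.001), ratio_of_ratios(1.002, 1.001))
# roughly 2.0 and 1.001
```

In the second case the savings factor suggests a large improvement even
though neither method saves a useful amount, whereas the ratio of
ratios stays close to 1.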

Finally, there appears to be a minor typo/editorial error:
Section 1 says: "...however in the prototype implementation described
here, the gzip algorithm is used to compress every column".
However, section 3 defines the ZCTYPn header, which can select the
algorithm from a choice which includes RICE.

Mark


On Thu, 28 Oct 2010, William Pence wrote:

> For the past few months, several of us (Rob Seaman, Rick White, and 
> myself) have been experimenting with a new compression method for FITS 
> binary tables that appears to be significantly more effective than the 
> usual method of simply compressing the whole FITS file with gzip.  We 
> have produced a document, available at 
> http://fits.gsfc.nasa.gov/tiletable.pdf that describes this proposed 
> convention in more detail;  here is a brief description from that document:
> 
> "This document describes a convention for compressing FITS binary
> tables that is modeled after the FITS tiled-image compression method
> (White et al. 2009) that has been in use for about a decade. The input
> table is first optionally subdivided into tiles, each containing an
> equal number of rows, then every column of data within each tile is
> compressed and stored as a variable-length array of bytes in the
> output FITS binary table.  All the header keywords from the input
> table are copied to the header of the output table and remain
> uncompressed for efficient access. The output compressed table
> contains the same number and order of columns as in the input
> uncompressed binary table. There is one row in the output table
> corresponding to each tile of rows in the input table.  In principle,
> each column of data can be compressed using a different algorithm
> that is optimized for the type of data within that column, however in
> the prototype implementation described here, the gzip algorithm is
> used to compress every column."
> 
> In experiments on a sample of FITS tables from the HEASARC archive, this 
> new compression method produced about 50% more disk space savings than 
> the simple "gzip-the-whole-file" method.  This compression improvement 
> is mainly a result of a) compressing the table column by column, instead 
> of on a row-by-row basis, and b) using a byte shuffling technique on 
> numeric columns that sorts the bytes in decreasing order of significance.
> 
> This is still a prototype, and we plan to do further testing before even 
> considering using this compression method on any publicly available FITS 
> files.  In the meantime, we would be interested in any comments or 
> suggestions on this potential new FITS compression convention.  We are 
> also interested in gathering a larger sample of representative FITS 
> tables for test purposes, so I would appreciate any suggestions of 
> suitable FITS files from different projects or observatories.
> 
> Bill Pence
> -- 
> ____________________________________________________________________
> Dr. William Pence                       William.Pence at nasa.gov
> NASA/GSFC Code 662       HEASARC        +1-301-286-4599 (voice)
> Greenbelt MD 20771                      +1-301-286-1684 (fax)

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/



