[fitsbits] Potential new compression method for FITS tables
William Thompson
William.T.Thompson at nasa.gov
Thu Dec 16 11:38:59 EST 2010
Mark Taylor wrote:
> Dear FITS table compression authors and lurkers,
>
> I have some comments on this document; sorry it's taken a long time,
> I've had some other things on.
>
> Although it doesn't say so explicitly, I presume since there's no
> indication otherwise that tables encoded in the way described by this
> document are still XTENSION = 'BINTABLE'. I think this is problematic,
> since a table reader which is unaware of this convention may encounter
> such a FITS extension, see that it's a BINTABLE, and believe that it can
> make sense of it. Although a table encoded according to this convention
> is syntactically a correct BINTABLE, if interpreted as a normal BINTABLE,
> the contents will be garbage. Moreover, the semantics of some of the
> other T***n headers will no longer hold. For instance the TDIMn header,
> whose content is not changed under the proposed convention, will no
> longer contain the shape of elements in the column, and TUNITn will no
> longer contain its units. For this reason it seems to me that if the
> proposal is to be adopted, it ought to propose a new XTENSION type for
> tile-compressed tables, so that unaware software realises that it doesn't
> know how to interpret such HDUs.
I don't think that a separate XTENSION type is needed for this convention. We
already have other types of data which cannot easily be made sense of by a naive
reader. The most directly relevant example is the Tiled Image Compression
convention, which is also stored in binary table form. (I would like to point
out, by the way, that the Tiled Image Compression convention is now being used
by the Solar Dynamics Observatory mission to help them manage their 2 terabytes
per day of data volume.)
You make a good point about the TDIMn keyword no longer being consistent with
the size of the array. This could conceivably cause problems with some readers.
However, those readers wouldn't be able to read the file anyway. The main
concern would be if the inconsistency caused the software to actually crash,
rather than just produce nonsense.
I don't agree with the statement that TUNITn is no longer applicable.
> I also have a concern that these tables are harder to use than
> existing non-compressed BINTABLEs. There are two aspects to this.
> Most obviously, tool/library authors who wish to support such files
> will need to write additional code for uncompression and/or
> compression. ...
That's already true for the Tiled Image Compression convention.
> ... Secondly, tables which have been compressed in
> this way are unsuitable for random access, since unlike for a
> normal BINTABLE, it's not possible to calculate the HDU offset
> of a given row/column cell. This may have considerable performance ...
That's true for any binary table using the variable length array convention.
> implications for data access patterns which require other than
> sequential access to the data (it would certainly slow down a number
> of operations in TOPCAT/STILTS). Whether this matters depends on
> who is using this convention in what context. For archives that
> want only to store table data as compactly as possible, it's
> not an issue. But for tables which are distributed to users who
> want to do processing on them, the saving of disk space and bandwidth
> may be outweighed by the inconvenience of slower and/or restricted
> access. It might be a good idea to mention this issue somewhere
> in the discussion.
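To make the random-access point concrete, here is a minimal Python sketch of
the offset arithmetic that a fixed-width BINTABLE allows, next to the lookup a
tiled table would seem to need instead. The function and parameter names
(cell_offset, tile_for_row, rows_per_tile, and so on) are illustrative
assumptions, not taken from the proposal or from any FITS library, and the
tile lookup assumes equal-sized tiles as described in the quoted abstract.

```python
# Sketch only: all names are hypothetical, not part of any FITS library.

def cell_offset(data_start, row_bytes, col_offsets, row, col):
    """Byte offset of cell (row, col), both 0-based, in a plain BINTABLE.

    data_start  -- byte offset of the first data byte of the HDU (assumed known)
    row_bytes   -- NAXIS1, the width of one table row in bytes
    col_offsets -- byte offset of each column within a row
    """
    return data_start + row * row_bytes + col_offsets[col]


def tile_for_row(row, rows_per_tile):
    """Locate a row in a tiled table: (tile index, row index within the tile).

    Assumes every tile holds rows_per_tile rows. The compressed column chunk
    for that tile still has to be uncompressed before the cell can be read.
    """
    return row // rows_per_tile, row % rows_per_tile


# Example: 16-byte rows, three columns starting at bytes 0, 4 and 12,
# with the data starting two 2880-byte FITS blocks into the file.
offset = cell_offset(2 * 2880, 16, [0, 4, 12], row=10, col=1)
tile, row_in_tile = tile_for_row(2500, rows_per_tile=1000)
```

In the first case the cell is reached with one seek. In the second, the
arithmetic only identifies which tile to uncompress.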
>
> Concerning the results table in section 6, it took me a while to
> work out what the "Disk Savings factor" meant. I think its value
> is (1-1/GZIP_2)/(1-1/gzip). This doesn't seem to be a specially
> useful figure, since it's not scaled by the original size of the
> file, so for instance if both the gzip and GZIP_2 methods save
> a negligible amount, its value will diverge. A better headline
> figure for the improvement of tile-compression against plain gzip
> might be GZIP_2/gzip.
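The two figures can be compared numerically. The sketch below assumes that
"gzip" and "GZIP_2" in the results table are compression ratios (original
size divided by compressed size), which is how the formula above reads; the
function names are made up for illustration, and the sample ratios are
invented rather than taken from the document.

```python
# Sketch only: assumes both inputs are compression ratios
# (original size / compressed size), each greater than 1.

def savings_factor(gzip_ratio, gzip2_ratio):
    """(1 - 1/GZIP_2) / (1 - 1/gzip): ratio of the fractions of space saved."""
    return (1.0 - 1.0 / gzip2_ratio) / (1.0 - 1.0 / gzip_ratio)


def ratio_of_ratios(gzip_ratio, gzip2_ratio):
    """GZIP_2 / gzip: how much smaller the tiled file is than the gzipped one."""
    return gzip2_ratio / gzip_ratio


# Invented example where neither method saves much space.
small_sf = savings_factor(1.01, 1.02)
small_rr = ratio_of_ratios(1.01, 1.02)

# Invented example where both methods compress well.
large_sf = savings_factor(2.0, 3.0)
large_rr = ratio_of_ratios(2.0, 3.0)
```

With ratios of 1.01 and 1.02 the first figure comes out close to 2 even
though both files are almost the same size as the original, while the second
figure stays close to 1.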
>
> Finally, there appears to be a minor typo/editorial error:
> Section 1 says: "...however in the prototype implementation described
> here, the gzip algorithm is used to compress every column".
> However, in section 3 the ZCTYPn header is defined which is able to
> select the algorithm from a choice which includes RICE.
>
> Mark
>
>
> On Thu, 28 Oct 2010, William Pence wrote:
>
>> For the past few months, several of us (Rob Seaman, Rick White, and
>> myself) have been experimenting with a new compression method for FITS
>> binary tables that appears to be significantly more effective than the
>> usual method of simply compressing the whole FITS file with gzip. We
>> have produced a document, available at
>> http://fits.gsfc.nasa.gov/tiletable.pdf that describes this proposed
>> convention in more detail; here is a brief description from that document:
>>
>> "This document describes a convention for compressing FITS binary
>> tables that is modeled after the FITS tiled-image compression method
>> (White et al. 2009) that has been in use for about a decade. The input
>> table is first optionally subdivided into tiles, each containing an
>> equal number of rows, then every column of data within each tile is
>> compressed and stored as a variable-length array of bytes in the
>> output FITS binary table. All the header keywords from the input
>> table are copied to the header of the output table and remain
>> uncompressed for efficient access. The output compressed table
>> contains the same number and order of columns as in the input
>> uncompressed binary table. There is one row in the output table
>> corresponding to each tile of rows in the input table. In principle,
>> each column of data can be compressed using a different algorithm
>> that is optimized for the type of data within that column, however in
>> the prototype implementation described here, the gzip algorithm is
>> used to compress every column."
>>
>> In experiments on a sample of FITS tables from the HEASARC archive, this
>> new compression method produced about 50% more disk space savings than
>> the simple "gzip-the-whole-file" method. This compression improvement
>> is mainly a result of a) compressing the table column by column, instead
>> of on a row-by-row basis, and b) using a byte shuffling technique on
>> numeric columns that sorts the bytes in decreasing order of significance.
>>
>> This is still a prototype, and we plan to do further testing before even
>> considering using this compression method on any publicly available FITS
>> files. In the meantime, we would be interested in any comments or
>> suggestions on this potential new FITS compression convention. We are
>> also interested in gathering a larger sample of representative FITS
>> tables for test purposes, so I would appreciate any suggestions of
>> suitable FITS files from different projects or observatories.
>>
>> Bill Pence
>> --
>> ____________________________________________________________________
>> Dr. William Pence William.Pence at nasa.gov
>> NASA/GSFC Code 662 HEASARC +1-301-286-4599 (voice)
>> Greenbelt MD 20771 +1-301-286-1684 (fax)
>
> --
> Mark Taylor Astronomical Programmer Physics, Bristol University, UK
> m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/
>
> _______________________________________________
> fitsbits mailing list
> fitsbits at listmgr.cv.nrao.edu
> http://listmgr.cv.nrao.edu/mailman/listinfo/fitsbits
>
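For readers unfamiliar with the byte shuffling mentioned in the announcement
quoted above, here is a rough Python sketch of the idea for one numeric
column. It is not the prototype's code: the function names are hypothetical,
zlib's deflate stands in for gzip, and the sample column is invented. FITS
stores integers big-endian, so byte 0 of each element is the most significant.

```python
import struct
import zlib


def shuffle_bytes(raw, width):
    """Group byte 0 of every element, then byte 1, and so on.

    raw   -- the column's bytes, len(raw) a multiple of width
    width -- bytes per element (4 for a 32-bit integer column)
    """
    return b"".join(raw[i::width] for i in range(width))


def unshuffle_bytes(shuffled, width):
    """Inverse of shuffle_bytes: restore the element-by-element byte order."""
    n = len(shuffled) // width          # number of elements in the column
    out = bytearray(len(shuffled))
    for i in range(width):
        out[i::width] = shuffled[i * n:(i + 1) * n]
    return bytes(out)


# Invented sample column: 1000 slowly varying big-endian 32-bit integers.
values = list(range(1000, 2000))
raw = struct.pack(">%di" % len(values), *values)

shuffled = shuffle_bytes(raw, 4)
plain_size = len(zlib.compress(raw))          # compress the column as stored
shuffled_size = len(zlib.compress(shuffled))  # compress after shuffling
restored = unshuffle_bytes(shuffled, 4)
```

Comparing plain_size with shuffled_size on real columns is one way to see
how much the shuffle contributes for a given data type.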
--
William Thompson
NASA Goddard Space Flight Center
Code 671
Greenbelt, MD 20771
USA
301-286-2040
William.T.Thompson at nasa.gov