[fitsbits] Potential new compression method for FITS tables

Thu Dec 16 11:32:20 EST 2010

Hi Mark,

> I have some comments on this document; sorry it's taken a long time,

Rather, thanks for taking the time to read it.

> Although it doesn't say so explicitly, I presume since there's no
> indication otherwise that tables encoded in the way described by this
> document are still XTENSION = 'BINTABLE'.

Yes.

> Although a table encoded according to this convention
> is syntactically a correct BINTABLE, if interpreted as a normal BINTABLE,
> the contents will be garbage.

This is true for the tiled-image convention, too.  Bill can likely do the best job of discussing the trade-offs.

> For instance the TDIMn header,
> whose content is not changed under the proposed convention, will no
> longer contain the shape of elements in the column, and TUNITn will no
> longer contain its units.

These are interpreted as applying to the decompressed elements (and have little utility for the compressed vectors).

> For this reason it seems to me that if the
> proposal is to be adopted, it ought to propose a new XTENSION type for
> tile-compressed tables, so that unaware software realises that it doesn't
> know how to interpret such HDUs.

I think there is a general concern about multiplying the number of XTENSION types.  A new type would also guarantee that unaware software can't make heads or tails of any part of such HDUs.

> I also have a concern that these tables are harder to use than
> existing non-compressed BINTABLEs.  There are two aspects to this.
> Most obviously, tool/library authors who wish to support such files 
> will need to write additional code for uncompression and/or
> compression.

Yes.  These files are the FITS equivalent of columnar database technology, which comes with similar advantages and disadvantages.
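
To make the columnar comparison concrete, here is a toy sketch in Python.  It is not taken from the proposal and it does not use the proposed algorithms: it packs the same synthetic two-column table once row by row (as in a normal BINTABLE) and once column by column, then runs zlib over each.  The column contents, the function names and the choice of zlib are all assumptions made for illustration only.

```python
import random
import struct
import zlib


def make_rows(n_rows, seed=0):
    """Synthetic table: a constant 32-bit flag column and a random 64-bit column."""
    rng = random.Random(seed)
    return [(7, rng.getrandbits(64)) for _ in range(n_rows)]


def row_major_bytes(rows):
    """Pack each row contiguously (big-endian, no padding), like an ordinary BINTABLE."""
    return b"".join(struct.pack(">iQ", flag, value) for flag, value in rows)


def column_major_bytes(rows):
    """Pack all flag values first, then all 64-bit values (columnar layout)."""
    flags = b"".join(struct.pack(">i", flag) for flag, _ in rows)
    values = b"".join(struct.pack(">Q", value) for _, value in rows)
    return flags + values


rows = make_rows(10000)
row_major_size = len(zlib.compress(row_major_bytes(rows)))
column_major_size = len(zlib.compress(column_major_bytes(rows)))
print("row-major compressed bytes:   ", row_major_size)
print("column-major compressed bytes:", column_major_size)
```

With this synthetic input the low-entropy column collapses to almost nothing once its values are stored together, while the cost is that a single row can no longer be read without touching the compressed column data.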

> Secondly, tables which have been compressed in
> this way are unsuitable for random access, since unlike for a
> normal BINTABLE, it's not possible to calculate the HDU offset
> of a given row/column cell.

Rather, the tiling works here as it does for random access to tiled images.  The software can calculate the offset to the row containing the tile.
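
As a minimal sketch of that calculation in Python, assuming every tile holds the same number of table rows and each tile occupies exactly one fixed-width row of the compressed table: the function and parameter names below are hypothetical and are not taken from the document.

```python
def locate_cell(row, rows_per_tile, compressed_row_bytes, data_start=0):
    """Find where an uncompressed table row lives in the tiled table.

    row                  -- 0-based row index in the uncompressed table
    rows_per_tile        -- table rows per tile (assumed constant)
    compressed_row_bytes -- width in bytes of one compressed-table row
    data_start           -- byte offset of the HDU data unit (assumed known)

    Returns (byte offset of the compressed-table row holding the tile,
             index of the requested row within that tile).
    """
    if row < 0:
        raise ValueError("row index must be non-negative")
    if rows_per_tile <= 0:
        raise ValueError("rows_per_tile must be positive")
    tile_index, row_in_tile = divmod(row, rows_per_tile)
    offset = data_start + tile_index * compressed_row_bytes
    return offset, row_in_tile
```

The tile found this way still has to be uncompressed before the cell can be read, so the cost of random access depends on the tile size.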

> This may have considerable performance
> implications for data access patterns which require other than
> sequential access to the data (it would certainly slow down a number
> of operations in TOPCAT/STILTS).

Again, this is fundamentally the same set of trade-offs as for columnar data stores.  We would certainly want to tune these for astronomical data, users and software.

> Whether this matters depends on
> who is using this convention in what context.  For archives that
> want only to store table data as compactly as possible, it's 
> not an issue.

Right.

> But for tables which are distributed to users who 
> want to do processing on them, the saving of disk space and bandwidth
> may be outweighed by the inconvenience of slower and/or restricted 
> access.  It might be a good idea to mention this issue somewhere
> in the discussion.

Good suggestion.  The focus should be on workflow throughput, as with tiled-image compression.  The details will vary from workflow to workflow.

> Concerning the results table in section 6, it took me a while to
> work out what the "Disk Savings factor" meant.

We should review how best to report this for tables.  For NOAO imaging data, fpack with Rice saves 19% over and above (unshuffled) gzip data holdings for 16-bit data and 28% for 32-bit data.  There are interesting differences for tabular data, particularly for low-entropy columns.

Rob