[fitsbits] Potential new compression method for FITS tables
Rob Seaman
seaman at noao.edu
Thu Dec 16 11:32:20 EST 2010
Hi Mark,
> I have some comments on this document; sorry it's taken a long time,
Rather, thanks for taking the time to read it.
> Although it doesn't say so explicitly, I presume since there's no
> indication otherwise that tables encoded in the way described by this
> document are still XTENSION = 'BINTABLE'.
Yes.
> Although a table encoded according to this convention
> is syntactically a correct BINTABLE, if interpreted as a normal BINTABLE,
> the contents will be garbage.
This is true for the tiled-image convention, too. Bill can likely do the best job of discussing the trade-offs.
> For instance the TDIMn header,
> whose content is not changed under the proposed convention, will no
> longer contain the shape of elements in the column, and TUNITn will no
> longer contain its units.
These are interpreted as applying to the decompressed elements (and have little utility for the compressed vectors).
> For this reason it seems to me that if the
> proposal is to be adopted, it ought to propose a new XTENSION type for
> tile-compressed tables, so that unaware software realises that it doesn't
> know how to interpret such HDUs.
I think there is a general concern about multiplying the number of XTENSION types. A new type would also guarantee that unaware software can't make heads or tails of these HDUs at all.
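As a rough sketch of how aware software might tell the two cases apart without a new XTENSION type: this assumes the convention sets a logical flag keyword in the header. The keyword name ZTABLE is an assumption here, not something stated in this thread, and the header is modelled as a plain dict rather than any real FITS library object.

```python
def is_tile_compressed_table(header):
    """Return True if the header marks a BINTABLE that holds a compressed table.

    header: plain dict of keyword -> value (a stand-in for a real FITS header).
    ZTABLE is an assumed flag keyword, used here for illustration only.
    """
    is_bintable = header.get("XTENSION", "").strip() == "BINTABLE"
    return is_bintable and header.get("ZTABLE") is True


# Hypothetical headers: an ordinary table and a tile-compressed one.
plain = {"XTENSION": "BINTABLE", "NAXIS1": 16, "NAXIS2": 1000}
packed = {"XTENSION": "BINTABLE", "ZTABLE": True, "NAXIS1": 48, "NAXIS2": 10}

print(is_tile_compressed_table(plain), is_tile_compressed_table(packed))
```

Software that does not know about the flag still sees a syntactically valid BINTABLE, which is the trade-off being discussed.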
> I also have a concern that these tables are harder to use than
> existing non-compressed BINTABLEs. There are two aspects to this.
> Most obviously, tool/library authors who wish to support such files
> will need to write additional code for uncompression and/or
> compression.
Yes. These files are the FITS equivalent of columnar database technology, which faces similar advantages and disadvantages.
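To illustrate the columnar point, here is a small sketch comparing row-interleaved bytes with per-column bytes for a made-up two-column table. It uses zlib purely as a stand-in compressor (not the algorithms in the proposal), and the column contents, names and sizes are all invented for the example.

```python
import random
import struct
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 10000
flags = [7] * n_rows                                  # low-entropy column
noise = [rng.getrandbits(32) for _ in range(n_rows)]  # high-entropy column

# Row-major layout, as in an ordinary BINTABLE: fields interleaved per row.
row_major = b"".join(struct.pack(">iI", f, v) for f, v in zip(flags, noise))

# Column-major layout: each column stored and compressed on its own.
col_flags = struct.pack(">%di" % n_rows, *flags)
col_noise = struct.pack(">%dI" % n_rows, *noise)

row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_flags)) + len(zlib.compress(col_noise))
print("row-major:", row_size, "bytes; column-major:", col_size, "bytes")
```

The constant column shrinks to almost nothing once it is no longer interleaved with the noisy one, which is the advantage; the cost is that a single row can no longer be read without touching every column's compressed stream.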
> Secondly, tables which have been compressed in
> this way are unsuitable for random access, since unlike for a
> normal BINTABLE, it's not possible to calculate the HDU offset
> of a given row/column cell.
Rather, tiling works here as it does for random access to tiled images. The software can calculate the offset to the row containing the tile.
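The lookup can be sketched as follows, assuming every tile holds the same fixed number of original rows and that each tile occupies one row of the compressed table. The function and parameter names are hypothetical, and the numbers are made up.

```python
def locate_row(row, tile_len, naxis1):
    """Map an uncompressed row index to (tile index, row within tile, byte offset).

    row:      zero-based row index in the original, uncompressed table.
    tile_len: number of original rows packed into each tile (assumed fixed).
    naxis1:   width in bytes of one row of the *compressed* table.
    The offset is relative to the start of the HDU's main data table.
    """
    tile, within = divmod(row, tile_len)
    return tile, within, tile * naxis1


# Hypothetical table: 10000 rows per tile, 48-byte compressed-table rows.
print(locate_row(25000, 10000, 48))  # -> (2, 5000, 96)
```

The tile itself still has to be decompressed before the requested row can be read, so the granularity of random access is the tile rather than the cell.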
> This may have considerable performance
> implications for data access patterns which require other than
> sequential access to the data (it would certainly slow down a number
> of operations in TOPCAT/STILTS).
Again, this is fundamentally the same set of trade-offs as for columnar data stores. We would certainly want to tune these for astronomical data, users and software.
> Whether this matters depends on
> who is using this convention in what context. For archives that
> want only to store table data as compactly as possible, it's
> not an issue.
Right.
> But for tables which are distributed to users who
> want to do processing on them, the saving of disk space and bandwidth
> may be outweighed by the inconvenience of slower and/or restricted
> access. It might be a good idea to mention this issue somewhere
> in the discussion.
Good suggestion. The focus should be on workflow throughput, as with tiled-image compression. The details will vary from workflow to workflow.
> Concerning the results table in section 6, it took me a while to
> work out what the "Disk Savings factor" meant.
We should review how best to report this for tables. For NOAO imaging data, fpack with Rice saves 19% over-and-above (unshuffled) gzip data holdings for 16-bit data and 28% for 32-bit data. There are interesting differences for tabular data, particularly low entropy columns.
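One way to make such figures unambiguous is to state which baseline a percentage refers to. The sketch below takes "saves 19% over gzip" to mean that the Rice-compressed holdings are 19% smaller than the gzip-compressed ones; that reading, the function names and the holdings size are assumptions for illustration.

```python
def relative_saving(baseline_bytes, new_bytes):
    """Fraction of the baseline size saved by switching to the new encoding."""
    return 1.0 - new_bytes / baseline_bytes


def compression_ratio(original_bytes, compressed_bytes):
    """Classic ratio: uncompressed size divided by compressed size."""
    return original_bytes / compressed_bytes


# Hypothetical holdings: 100 units under gzip, 19% less under Rice.
gzip_holdings = 100.0
rice_holdings = gzip_holdings * (1 - 0.19)

print(round(relative_saving(gzip_holdings, rice_holdings), 2))
```

Reporting both the baseline (raw or gzip) and the measure (ratio or fractional saving) alongside each number would avoid the ambiguity raised above.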
Rob