[fitsbits] Potential new compression method for FITS tables

William Pence William.Pence at nasa.gov
Tue Dec 21 14:48:15 EST 2010


Mark,

Thank you for carefully reading our document (describing a potential new 
compression method for FITS binary tables).  Here are a few more 
comments, in addition to the previous ones from Rob Seaman:

This new compression method is intended as an improvement over simply 
gzipping the whole FITS file, which is the current practice in many data 
archives.  The 3 main advantages of this new compression method are (1) 
it usually produces higher compression (primarily as a result of 
transposing the rows and columns in the table), (2) the header keywords 
remain uncompressed for easy access, and (3) if the FITS file contains 
multiple binary tables then each table can be accessed individually, 
without having to uncompress the entire file.
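
As a rough illustration of advantage (1), here is a minimal Python
sketch.  It is not the proposed convention itself and not CFITSIO code.
It assumes a synthetic table with a hypothetical column layout (obs_id,
channel, exposure, quality, counts) in which constant columns are
interleaved with noisy ones, which is a case that favors transposition,
and it uses zlib as a stand-in for gzip.  The same bytes are compressed
once in row order and once in column order:

```python
import random
import struct
import zlib

# Hypothetical column layout, big-endian as in FITS binary tables:
# obs_id int32, channel uint8, exposure float64, quality int16, counts uint8
COLUMN_CODES = "iBdhB"


def make_rows(n_rows, seed=1):
    """Build a synthetic table: constant columns interleaved with noisy ones."""
    rng = random.Random(seed)
    return [
        (12345, rng.randrange(256), 1.5, 0, rng.randrange(256))
        for _ in range(n_rows)
    ]


def row_major_bytes(rows):
    """Serialize row by row, the way an ordinary binary table is stored."""
    row_struct = struct.Struct(">" + COLUMN_CODES)
    return b"".join(row_struct.pack(*row) for row in rows)


def column_major_bytes(rows):
    """Serialize column by column (the transposed layout)."""
    n = len(rows)
    columns = zip(*rows)
    return b"".join(
        struct.pack(">%d%s" % (n, code), *column)
        for code, column in zip(COLUMN_CODES, columns)
    )


rows = make_rows(5000)
row_major = row_major_bytes(rows)
col_major = column_major_bytes(rows)

# Same number of bytes, different ordering, same compressor settings.
row_size = len(zlib.compress(row_major, 9))
col_size = len(zlib.compress(col_major, 9))
print("row-major compressed bytes:   ", row_size)
print("column-major compressed bytes:", col_size)
```

In the transposed stream the constant columns collapse into long runs,
while in the row-ordered stream every noisy byte interrupts them.  How
large the gain is on real tables depends on the data.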

I agree with Mark's observation that this compressed table format is not 
very convenient for applications that need random access to the rows and 
columns of data.  This is no different, however, from the case where the 
entire FITS file is compressed with gzip.   In both cases, it is usually 
necessary to uncompress the table before the application reads or writes 
data in the table.  This can be done either by explicitly creating an 
uncompressed copy of the FITS file (e.g., by using our fpack/funpack 
FITS file compression utility programs) which is then processed by the 
application program, or by having the FITS reader create an uncompressed 
virtual FITS file in memory, which is then accessed by the application 
program on the fly.  I'm planning to implement this latter approach in 
the CFITSIO library, similar to what has already been done to support 
the tiled-image compression format.  Application programs that use 
CFITSIO to access these compressed tables will be able to do so in 
exactly the same way as for normal uncompressed tables;  CFITSIO will 
transparently uncompress the table when necessary, and if the 
application modifies the table, then CFITSIO will automatically 
recompress it when the application is finished.
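
The "uncompressed virtual file in memory" idea can be sketched in a few
lines of Python.  This is only an illustration of the general approach,
not the planned CFITSIO implementation: the row layout (an int32 id and
a float64 value) and the helper names write_compressed, open_virtual,
and read_row are made up for the example, and plain gzip stands in for
the table compression.

```python
import gzip
import io
import struct

# Hypothetical fixed-width row: int32 id, float64 value (big-endian).
ROW = struct.Struct(">id")


def write_compressed(rows):
    """Pack the rows and gzip the result, as an archive might store it."""
    raw = b"".join(ROW.pack(*row) for row in rows)
    return gzip.compress(raw)


def open_virtual(compressed):
    """Uncompress once into an in-memory, seekable buffer."""
    return io.BytesIO(gzip.decompress(compressed))


def read_row(buf, index):
    """Random access works again because row offsets are computable."""
    buf.seek(index * ROW.size)
    return ROW.unpack(buf.read(ROW.size))


rows = [(i, i * 0.25) for i in range(1000)]
blob = write_compressed(rows)

virtual = open_virtual(blob)
print(read_row(virtual, 742))
```

The application only ever sees the uncompressed buffer; the cost is one
full decompression up front and enough memory to hold the table.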

Mark also expressed concerns about possible confusion between the 
compressed and uncompressed versions of the same table, by humans or by 
software that is unaware of this compression convention.  It is true 
that the headers of the compressed and uncompressed tables look quite 
similar, because only the NAXIS2, PCOUNT, and TFORMn keyword values must 
necessarily differ.  All the other keywords can remain unchanged.   I 
think this is largely a positive, because readers of the compressed 
table header (whether human or software) can quite easily understand the 
contents of the compressed table.   I don't think there is any danger 
that unsuspecting software could mistakenly process the compressed table 
and produce misleading scientific results, if for no other reason than 
because the compressed table will only contain a single row of data in 
most cases.  Mark suggested inventing a new extension type (instead of 
BINTABLE) for these compressed tables, but I don't think we want to 
encourage a proliferation of new extension types simply because the 
contents of the table are slightly different.  In any case, section 
3.4.2 of the FITS standard says that only one extension format shall be 
approved for each type of data organization.

One possible improvement we could make is to add a few COMMENT keywords 
to the header of the compressed table to tell readers that table columns 
have been compressed, and include a link to further information about 
how to interpret the contents.

Finally, I agree that there is room for improvement in our "Disk Savings 
factor" metric.  We'll try to come up with a more meaningful 
quantitative measure of the benefits of this new compression method as 
compared to simply gzipping the FITS file.

Bill

On 12/16/2010 11:32 AM, Rob Seaman wrote:
> Hi Mark,
>
>> I have some comments on this document; sorry it's taken a long time,
>
> Rather, thanks for taking the time to read it.
>
>> Although it doesn't say so explicitly, I presume since there's no
>> indication otherwise that tables encoded in the way described by this
>> document are still XTENSION = 'BINTABLE'.
>
> Yes.
>
>> Although a table encoded according to this convention
>> is syntactically a correct BINTABLE, if interpreted as a normal BINTABLE,
>> the contents will be garbage.
>
> This is true for the tiled-image convention, too.  Bill can likely do the best job of discussing the trade-offs.
>
>> For instance the TDIMn header,
>> whose content is not changed under the proposed convention, will no
>> longer contain the shape of elements in the column, and TUNITn will no
>> longer contain its units.
>
> These are interpreted as applying to the decompressed elements (and have little utility for the compressed vectors).
>
>> For this reason it seems to me that if the
>> proposal is to be adopted, it ought to propose a new XTENSION type for
>> tile-compressed tables, so that unaware software realises that it doesn't
>> know how to interpret such HDUs.
>
> I think there is a general concern about multiplying the number of XTENSION types.  This would also guarantee that such software can't make heads or tails of any sort of the HDUs.
>
>> I also have a concern that these tables are harder to use than
>> existing non-compressed BINTABLEs.  There are two aspects to this.
>> Most obviously, tool/library authors who wish to support such files
>> will need to write additional code for uncompression and/or
>> compression.
>
> Yes.  These files are the FITS equivalent of columnar database technology, which faces similar advantages and disadvantages.
>
>> Secondly, tables which have been compressed in
>> this way are unsuitable for random access, since unlike for a
>> normal BINTABLE, it's not possible to calculate the HDU offset
>> of a given row/column cell.
>
> Rather the tiling works here as with random access for tiled-images.  The software can calculate the offset to the row containing the tile.
>
>> This may have considerable performance
>> implications for data access patterns which require other than
>> sequential access to the data (it would certainly slow down a number
>> of operations in TOPCAT/STILTS).
>
> Again, this is fundamentally the same set of trade-offs as for columnar data stores.  We would certainly want to tune these for astronomical data, users and software.
>
>> Whether this matters depends on
>> who is using this convention in what context.  For archives that
>> want only to store table data as compactly as possible, it's
>> not an issue.
>
> Right.
>
>> But for tables which are distributed to users who
>> want to do processing on them, the saving of disk space and bandwidth
>> may be outweighed by the inconvenience of slower and/or restricted
>> access.  It might be a good idea to mention this issue somewhere
>> in the discussion.
>
> Good suggestion.  The focus should be on the workflow throughput as with tiled-image compression.  The details will vary from workflow to workflow.
>
>> Concerning the results table in section 6, it took me a while to
>> work out what the "Disk Savings factor" meant.
>
> We should review how best to report this for tables.  For NOAO imaging data, fpack with Rice saves 19% over-and-above (unshuffled) gzip data holdings for 16-bit data and 28% for 32-bit data.  There are interesting differences for tabular data, particularly low entropy columns.
>
> Rob
>
> _______________________________________________
> fitsbits mailing list
> fitsbits at listmgr.cv.nrao.edu
> http://listmgr.cv.nrao.edu/mailman/listinfo/fitsbits

-- 
____________________________________________________________________
Dr. William Pence                       William.Pence at nasa.gov
NASA/GSFC Code 662       HEASARC        +1-301-286-4599 (voice)
Greenbelt MD 20771                      +1-301-286-1684 (fax)




