[fitsbits] Potential new compression method for FITS tables

Rob Seaman seaman at noao.edu
Fri Dec 17 10:10:19 EST 2010


On Dec 17, 2010, at 5:40 AM, Preben Grosbol wrote:

> I can second your concern.  We had this discussion before, in the '80s, with the blocking convention.  I believe it is important that old readers, not knowing new conventions, are not misled.

...or BITPIX=-32 for floating-point.  The issue here is the potential consequences of having been misled.  We're not throwing an error just to be pedantic - there is a specific goal.  For instance, when adding a new data type, the risk is that some image-processing tool or workflow will fail to distinguish 32-bit floating-point from 32-bit integer data.
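To make the goal concrete, here is a minimal sketch of a defensive reader (Python with astropy purely for illustration; the file name is a placeholder):

    from astropy.io import fits

    # Refuse to guess when BITPIX is unfamiliar, rather than silently
    # misreading the data array.
    KNOWN_BITPIX = {8, 16, 32, 64, -32, -64}   # values defined by the FITS standard

    with fits.open("image.fits") as hdul:
        bitpix = hdul[0].header["BITPIX"]
        if bitpix not in KNOWN_BITPIX:
            raise ValueError(f"unrecognized BITPIX={bitpix}; refusing to guess")
        # Negative BITPIX means IEEE floating-point; positive means integers.
        kind = "floating-point" if bitpix < 0 else "integer"
        print(f"{abs(bitpix)}-bit {kind} data")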

Both the consequences of misapplying TDIM and the likelihood of doing so are significantly lower.  Every tool that handles images has to use BITPIX.  The same does not apply to table tools and TDIM - and those that do handle TDIM are likely to throw an error already.  Note that the issue of TDIM and variable-length arrays has come up before:

	http://listmgr.cv.nrao.edu/pipermail/fitsbits/2000-January/000005.html

This new convention could be taken as clarifying that earlier discussion.  One point, though: the issue isn't so much about compression as about transposing the table into a tiled, column-major representation.  For instance, ZCTYPn might be allowed a value of 'NONE'.
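For illustration, a minimal sketch (Python with numpy and astropy, both just for convenience) of what the transposed layout might look like.  The ZTABLE and ZCTYPn keywords follow the draft convention under discussion and are hypothetical, not part of the FITS standard:

    import numpy as np
    from astropy.io import fits

    # Sketch of the proposed tiled-table layout: the row-major binary
    # table is transposed so each column becomes one contiguous run of
    # bytes; a ZCTYPn of 'NONE' would record that the column is stored
    # transposed but uncompressed.
    rows = np.zeros(100, dtype=[("TIME", ">f8"), ("FLUX", ">f4")])

    hdr = fits.Header()
    hdr["ZTABLE"] = (True, "transposed (tiled) table")  # hypothetical keyword
    column_heaps = []
    for n, name in enumerate(rows.dtype.names, start=1):
        column_heaps.append(np.ascontiguousarray(rows[name]).tobytes())
        hdr[f"ZCTYP{n}"] = ("NONE", "transposed, not compressed")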

> As a minimum, it should be trivial for the end user (e.g. by reading the header keywords) to understand why data are not read correctly.

An explicit keyword could be added.  Perhaps this could be a general feature listing all registered conventions that might be used in a particular HDU, something like:

	FITS_REG= 'CHECKSUM,HIERARCH,TILEDTAB'  /  conventions may be used
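A reader could then report exactly which conventions it does not implement.  A hedged sketch (astropy for illustration; FITS_REG itself is the hypothetical keyword proposed above, and the file name is a placeholder):

    from astropy.io import fits

    SUPPORTED = {"CHECKSUM", "HIERARCH"}   # conventions this reader implements

    with fits.open("table.fits") as hdul:
        used = set(hdul[1].header.get("FITS_REG", "").split(","))
        unknown = used - SUPPORTED - {""}
        if unknown:
            print("warning: HDU uses unsupported conventions:", sorted(unknown))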

> As a side comment, I participated in a meeting on general archives a few years ago.  They did not recommend saving compressed data, since the effect of single-bit errors is more serious than for raw data and does not justify the gain in disk space.

Several recent papers - e.g., from Kepler, JDEM, and astrometry.net (as well as http://arxiv.org/abs/1007.1179) - have advocated aggressive lossy compression, with a compression factor of ~10 when starting with 32-bit data.  To simplify, this corresponds roughly to Shannon-Nyquist quantization.  It is an interesting question how robust different compression algorithms are against bit errors.  Rice, for instance, is applied on very small buffers of ~32 pixels at a time; a single bit error would be limited to that buffer, and perhaps to a single pixel.  In any event, it is up to each project archive to decide whether 10:1 (or "just" lossless 2:1) storage efficiency gains justify the risk.
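For reference, tiled Rice compression with floating-point quantization is already exposed through existing tools; a minimal sketch with astropy (the quantization level and file name here are arbitrary choices, not recommendations):

    import numpy as np
    from astropy.io import fits

    # Lossy compression of 32-bit floats: quantize relative to the
    # background noise, then Rice-compress the quantized integers.
    data = np.random.normal(1000.0, 10.0, size=(512, 512)).astype(np.float32)
    hdu = fits.CompImageHDU(data, compression_type="RICE_1",
                            quantize_level=16.0)
    hdu.writeto("compressed.fits", overwrite=True)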

Data representations, especially for archival data, should be coherently planned.  Compression is a separate issue from error detection and recovery.  If detecting bit errors is a concern, note that the FITS Checksum is available (and has been built into fpack).  If error correction is desired, then we should be discussing adding Gray code support to FITS.
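Embedding and verifying the checksum is nearly a one-liner in practice; a sketch with astropy (file names are placeholders):

    from astropy.io import fits

    # Write CHECKSUM/DATASUM keywords into every HDU on output...
    with fits.open("archive.fits") as hdul:
        hdul.writeto("protected.fits", checksum=True, overwrite=True)

    # ...and verify them on read; astropy warns if either check fails.
    with fits.open("protected.fits", checksum=True) as verified:
        verified.info()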

Ad hoc solutions like avoiding compression will result in archival holdings that are unevenly protected.  High-entropy data (e.g., flatfields or long science exposures) will be less well protected than low-entropy (high-redundancy) data such as bias frames and short or narrow-band exposures.  Where is the logic in that?

Meanwhile, the broader data storage community is implementing deduplication features that are far riskier than simply using efficient data representations: a single bit error in a deduplicated block can affect dozens of linked copies.  On the other hand, the various RAID strategies are just one way that robustness can be built into the mass storage systems themselves.

After the FITS Checksum was introduced at ADASS IV (http://www.adass.org/adass/proceedings/adass94/seamanr.html), I recall presenting something on digital signing technologies at the FITS BoF at ADASS V in Tucson.  Perhaps it is time to renew that discussion as well, in a broader context that includes encryption and error-correcting codes?

There is nothing magic about the mix of IEEE floating-point, 2's complement integers, and ASCII.  Efficient representation of data is one desirable trade-off among many.

Rob



