[fitsbits] Rice compression from the command line

Rob Seaman seaman at noao.edu
Wed Jul 12 15:23:53 EDT 2006


Work on the next release of the NOAO Science Archive has caused me to  
revisit an earlier selection of gzip (which itself was the result of  
an exercise in "satisficing" the choice of compression).  For all the  
obvious reasons (improved read/write speed, higher compression  
factors, transparent access) we're taking another look at FITS Rice  
compression.  Not much seems to have changed over the past five years  
- except that it seems like the example imcopy program in the cfitsio  
distribution is actually being used in production environments.  This  
program has several functional shortcomings, in addition to all the  
obvious logistical features that are missing in comparison to the  
unix gzip command, for instance.

I've appended a quickly modified prototype that addresses some of  
those issues.  (Compile and link as with imcopy.c.)  If there are  
alternative FITS Rice compression tools already available, I would be  
delighted to hear about them.  In the meantime, let me describe some  
of the issues I see with Rice compression, whether at the level of  
the FITS Convention, CFITSIO or imcopy:

	http://heasarc.gsfc.nasa.gov/docs/software/fitsio/compression.html
	http://heasarc.gsfc.nasa.gov/docs/software/fitsio/compression/compress_image.html
	http://heasarc.gsfc.nasa.gov/docs/software/fitsio/fitsio.html
	http://heasarc.gsfc.nasa.gov/docs/software/fitsio/cexamples/imcopy.c

Starting with the imcopy application, there are, as I say, many  
missing features.  The two most obvious are the ability to  
compress "in-place" and to process a list of files.  One of the  
primary use cases for compression is as a magic wand to wave over a  
file or a directory to shrink the disk usage.  Such a compression  
utility that instead creates a second file misses the point that many  
users will be aiming for.
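
To make the "in-place" and "list of files" use case concrete, here is a minimal sketch in plain C.  It is not taken from the attached prototype.  The names file_transform, copy_transform, replace_in_place and process_file_list are all hypothetical, the ".tmp" suffix is an arbitrary choice, and copy_transform is only a stand-in for a real compressor.  It also assumes POSIX rename() semantics, where renaming over an existing file replaces it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: read `in`, write the transformed file to `out`,
 * return 0 on success.  A real tool would call its compressor here. */
typedef int (*file_transform)(const char *in, const char *out);

/* Stand-in transform that only copies bytes (assumption: illustration only). */
int copy_transform(const char *in, const char *out)
{
    FILE *src = fopen(in, "rb");
    if (!src)
        return -1;
    FILE *dst = fopen(out, "wb");
    if (!dst) {
        fclose(src);
        return -1;
    }
    int c, ok = 1;
    while (ok && (c = fgetc(src)) != EOF)
        ok = (fputc(c, dst) != EOF);
    fclose(src);
    if (fclose(dst) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Write the result to "<path>.tmp", then rename it over the original so
 * that no second file is left behind.  Returns 0 on success, -1 on error. */
int replace_in_place(const char *path, file_transform transform)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    if (transform(path, tmp) != 0 || rename(tmp, path) != 0) {
        remove(tmp);    /* best-effort cleanup; the original is untouched */
        return -1;
    }
    return 0;
}

/* Process every file named on a command line; returns the failure count. */
int process_file_list(int nfiles, char *const paths[], file_transform transform)
{
    int failures = 0;
    for (int i = 0; i < nfiles; i++)
        if (replace_in_place(paths[i], transform) != 0)
            failures++;
    return failures;
}
```

A gzip-like main() could then call process_file_list(argc - 1, argv + 1, ...) with the real compressor in place of copy_transform.  Writing to a temporary file first means a failed run leaves the original intact.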

The next two issues appear to me to reflect limitations in the  
conceptual design of the CFITSIO interface.  1) a copy operation is  
not idempotent.  Since the interface is semantically aware of the  
meaning, as well as the contents of headers, a new copy may differ in  
various ways from the original.  This is a problem for a compression  
application that wants to be able to restore a byte-by-byte copy of  
the original. 2) updating an HDU does not necessarily update the  
checksums.  Failing this, the checksum convention mandates that the  
CHECKSUM and DATASUM keywords be deleted, but instead CFITSIO leaves  
stale keywords (which remain stale even after restoring the  
uncompressed HDU, see #1).
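
For readers who want to see what "stale" means in practice: the checksum convention defines DATASUM as the 32-bit ones'-complement sum of the data records, written as a decimal string.  The sketch below recomputes that sum in plain C and compares it with a recorded value.  CFITSIO has its own routines for this (fits_update_chksum, fits_verify_chksum); the version here avoids the library only so the arithmetic is visible.  The function names ones_complement_sum32 and datasum_is_stale are hypothetical, and parsing the keyword value out of the header is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit ones'-complement sum over nbytes of FITS records, read as
 * big-endian 32-bit words exactly as they appear on disk.  nbytes must be
 * a multiple of 4 (FITS records are multiples of 2880 bytes). */
uint32_t ones_complement_sum32(const unsigned char *buf, size_t nbytes)
{
    uint64_t sum = 0;

    assert(nbytes % 4 == 0);
    for (size_t i = 0; i < nbytes; i += 4) {
        uint32_t word = ((uint32_t)buf[i] << 24) |
                        ((uint32_t)buf[i + 1] << 16) |
                        ((uint32_t)buf[i + 2] << 8) |
                        (uint32_t)buf[i + 3];
        sum += word;
    }
    /* End-around carry: fold overflow bits back into the low 32 bits. */
    while (sum >> 32)
        sum = (sum & 0xFFFFFFFFu) + (sum >> 32);
    return (uint32_t)sum;
}

/* Returns 1 if the recorded DATASUM value no longer matches the data,
 * 0 if it still matches. */
int datasum_is_stale(uint32_t recorded, const unsigned char *data, size_t nbytes)
{
    return ones_complement_sum32(data, nbytes) != recorded;
}
```

An application that caches the original DATASUM before compressing could run the same comparison against the restored data unit.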

(Tests indicate that the file produced by compressing and then  
uncompressing any given input may itself be idempotent under a  
further round trip.  I don't know if this will hold up for all cases  
or for FITS interfaces other than CFITSIO.  Such a round trip is  
something like the FITS equivalent of canonicalizing XML.)

Finally the FITS compression convention is incomplete.  It doesn't  
actually express a coherent strategy for compressing and/or  
uncompressing general FITS objects, but is limited to per-HDU  
issues.  For example, if an "SIF" file (that is, not an "MEF") is  
compressed, an MEF is generated to contain the resulting binary  
table.  No information is retained to describe the original file  
structure, so uncompressing this file later generates an ambiguity  
about whether the original was indeed an SIF or rather was an  
uncompressed MEF with a single IMAGE extension.  A complementary  
issue arises with MEF input, if the primary HDU is not dataless.   
Does the "extra" extension resulting from compression become the  
first output extension or the last?  How many extensions does such a  
restored file have?  N or N+1?

Philosophically FITS compression is not like gzip or other "opaque"  
compression.  The output is itself a legal FITS object and interfaces  
like CFITSIO or tools like imcopy can invisibly regard a compressed  
image array as equivalent to an uncompressed array.  This is a great  
strength, but it doesn't remove the utility of other compression use  
cases.  For instance, I would be grateful if somebody could tell me  
how to infer the compression status of an HDU using CFITSIO.   
Invisibility is nice, but Claude Rains tells us its limits.  (One  
consequence is that the prototype doesn't currently uncompress,  
simply because it can't decide a priori whether the input is  
compressed to begin with.  The obvious workaround is to have separate  
"grice" and "gunrice" commands.  This might be desirable in any case  
for reasons I won't belabor here.)
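
One possible way to answer the compression status question without going through the library's transparent layer is to look at the raw header: the tiled image compression convention stores a compressed image as a binary table whose header carries ZIMAGE = T.  The sketch below scans a buffer of 80-character header cards for that keyword.  It is a plain C illustration under that assumption, and the name header_marks_compressed_image is hypothetical.  Current CFITSIO releases also document a fits_is_compressed_image routine, which may be the more direct route where it is available.

```c
#include <stddef.h>
#include <string.h>

#define FITS_CARD_LEN 80

/* Scan `ncards` consecutive 80-byte header cards (not NUL-terminated
 * individually) and return 1 if a ZIMAGE keyword with logical value T is
 * found before the END card, 0 otherwise. */
int header_marks_compressed_image(const char *header, size_t ncards)
{
    for (size_t i = 0; i < ncards; i++) {
        const char *card = header + i * FITS_CARD_LEN;

        if (strncmp(card, "END     ", 8) == 0)
            break;                      /* end of this header */
        if (strncmp(card, "ZIMAGE  = ", 10) == 0) {
            size_t j = 10;
            /* Skip blanks before the value; fixed format puts it in
             * column 30, but free format is tolerated here as well. */
            while (j < FITS_CARD_LEN && card[j] == ' ')
                j++;
            return j < FITS_CARD_LEN && card[j] == 'T';
        }
    }
    return 0;
}
```

A single tool could use a test like this to choose between compressing and uncompressing; separate "grice" and "gunrice" commands would not need it.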

Some questions to mull over:

1) Does a better alternative to the CFITSIO imcopy already exist?   
(Options don't have to be limited to ANSI C.)  How best might we  
encourage a wide adoption of a single standard across the  
astronomical community?  Gzip is ubiquitous, but so is FITS.

2) What features should a general purpose command line FITS  
compression tool have?  (For instance, should the checksums from the  
original file be cached for later comparison to restored HDUs -  
whether on disk or in memory?)

3) Should idempotency and correct checksum handling be the  
responsibility of CFITSIO, or rather of the application?

4) What logistical procedures and semantic structures need to be  
added to the FITS compression convention to support real-world usage?

5) Note that I have not talked about compression algorithms at all.   
Has any progress been made on these issues in the last few years that  
FITS could benefit from?  The compression convention is intended to  
support multiple algorithms, of course.

Please take a look at the attached code.  Please don't just take it  
and use it under battlefield conditions - this appears to be what  
happened with the original imcopy program :-)  I've traded some email  
with Bill Pence about this issue, but would be delighted to hear  
additional feedback.  If it turns out that further work is warranted  
on this prototype, I'll gladly donate the results to be incorporated  
into CFITSIO as Bill may deem appropriate.  Folks interested in  
collaborating are always welcome.

Rob Seaman
NOAO

--------


-------------- next part --------------
A non-text attachment was scrubbed...
Name: newimcopy.c
Type: application/octet-stream
Size: 4371 bytes
Desc: not available
URL: <http://listmgr.nrao.edu/pipermail/fitsbits/attachments/20060712/90760380/attachment.obj>