[fitsbits] Rice compression from the command line
Rob Seaman
seaman at noao.edu
Wed Jul 12 15:23:53 EDT 2006
Work on the next release of the NOAO Science Archive has caused me to
revisit an earlier selection of gzip (which itself was the result of
an exercise in "satisficing" the choice of compression). For all the
obvious reasons (improved read/write speed, higher compression
factors, transparent access) we're taking another look at FITS Rice
compression. Not much seems to have changed over the past five years
- except that it seems like the example imcopy program in the cfitsio
distribution is actually being used in production environments. This
program has several functional shortcomings, in addition to all the
obvious logistical features that are missing in comparison to the
unix gzip command, for instance.
I've appended a quickly modified prototype that addresses some of
those issues. (Compile and link as with imcopy.c.) If there are
alternative FITS Rice compression tools already available, I would be
delighted to hear about them. In the meantime, let me describe some
of the issues I see with Rice compression, whether at the level of
the FITS Convention, CFITSIO or imcopy:
http://heasarc.gsfc.nasa.gov/docs/software/fitsio/compression.html
http://heasarc.gsfc.nasa.gov/docs/software/fitsio/compression/compress_image.html
http://heasarc.gsfc.nasa.gov/docs/software/fitsio/fitsio.html
http://heasarc.gsfc.nasa.gov/docs/software/fitsio/cexamples/imcopy.c
Starting with the imcopy application, there are, as I say, many
missing features. The two most obvious are the ability to
compress "in-place" and to process a list of files. One of the
primary use cases for compression is as a magic wand to wave over a
file or a directory to shrink the disk usage. Such a compression
utility that instead creates a second file misses the point that many
users will be aiming for.
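
A minimal sketch of the gzip-style "in-place" pattern described above, assuming POSIX rename() semantics (the new file atomically replaces the old one). The names convert_fn, convert_in_place, convert_files and copy_stream are hypothetical and are not taken from imcopy or from the attached prototype; copy_stream is only a placeholder where a real tool would call the Rice compression code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: read the original from `in`, write the converted
   (e.g. Rice-compressed) stream to `out`; return 0 on success. */
typedef int (*convert_fn)(FILE *in, FILE *out);

/* Placeholder conversion that just copies bytes. A real tool would
   compress here; this stand-in only keeps the sketch self-contained. */
int copy_stream(FILE *in, FILE *out)
{
    char buf[2880];                     /* one FITS block at a time */
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    return ferror(in) ? -1 : 0;
}

/* Convert one file "in place": write to a temporary sibling file, then
   rename it over the original only if the conversion succeeded. */
int convert_in_place(const char *path, convert_fn convert)
{
    char tmp[4096];
    FILE *in, *out;
    int n, rc;

    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                      /* path too long */

    if ((in = fopen(path, "rb")) == NULL)
        return -1;
    if ((out = fopen(tmp, "wb")) == NULL) {
        fclose(in);
        return -1;
    }

    rc = convert(in, out);
    fclose(in);
    if (fclose(out) != 0)
        rc = -1;                        /* flush failed, e.g. disk full */

    if (rc != 0 || rename(tmp, path) != 0) {
        remove(tmp);                    /* leave the original untouched */
        return -1;
    }
    return 0;
}

/* Process a list of files; returns the number of failures. */
int convert_files(int nfiles, char **paths, convert_fn convert)
{
    int i, failures = 0;

    for (i = 0; i < nfiles; i++) {
        if (convert_in_place(paths[i], convert) != 0) {
            fprintf(stderr, "%s: conversion failed\n", paths[i]);
            failures++;
        }
    }
    return failures;
}
```

A command-line wrapper could then pass argv + 1 and argc - 1 to convert_files, so that "tool *.fits" behaves like gzip over a directory. Writing to a temporary file first means a failed conversion never destroys the only copy of the data.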
The next two issues appear to me to reflect limitations in the
conceptual design of the CFITSIO interface. 1) A copy operation is
not idempotent. Since the interface is aware of the meaning as well
as the contents of headers, a new copy may differ in
various ways from the original. This is a problem for a compression
application that wants to be able to restore a byte-by-byte copy of
the original. 2) Updating an HDU does not necessarily update the
checksums. Failing this, the checksum convention mandates that the
CHECKSUM and DATASUM keywords be deleted, but instead CFITSIO leaves
stale keywords (which remain stale even after restoring the
uncompressed HDU, see #1).
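
To make "stale" concrete: under the checksum convention, DATASUM records the 32-bit ones' complement sum of the data records, read as big-endian 32-bit unsigned integers and written as a decimal string. The sketch below recomputes that sum over a buffer and compares it with a recorded value. The function names are hypothetical; CFITSIO has its own checksum routines (fits_verify_chksum, fits_update_chksum) that an application would normally call instead, and this stand-alone version only illustrates what they have to check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* 32-bit ones' complement sum of `nbytes` of data, read as big-endian
   32-bit words. Pass the previous result as `sum` to chain over several
   buffers (e.g. one 2880-byte FITS block at a time); start with 0. */
uint32_t fits_ones_complement_sum(const unsigned char *buf, size_t nbytes,
                                  uint32_t sum)
{
    uint64_t acc = sum;
    size_t i;

    assert(nbytes % 4 == 0);            /* FITS blocks are multiples of 4 */
    for (i = 0; i < nbytes; i += 4) {
        acc += ((uint32_t)buf[i] << 24) | ((uint32_t)buf[i + 1] << 16) |
               ((uint32_t)buf[i + 2] << 8) | (uint32_t)buf[i + 3];
        /* end-around carry: fold any overflow back into the low 32 bits */
        acc = (acc & 0xFFFFFFFFu) + (acc >> 32);
    }
    return (uint32_t)acc;
}

/* Return 1 if the recorded DATASUM digits (without the surrounding quotes)
   match the data, 0 if the keyword is stale or unparsable. */
int datasum_matches(const char *recorded, const unsigned char *data,
                    size_t nbytes)
{
    char *end;
    unsigned long long value = strtoull(recorded, &end, 10);

    if (end == recorded)
        return 0;                       /* no digits found */
    return value == fits_ones_complement_sum(data, nbytes, 0);
}
```

An application that modifies an HDU and cannot recompute the sum would, per the convention, delete the CHECKSUM and DATASUM keywords rather than leave values for which a check like this fails.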
(Tests indicate that the output file resulting from compressing and
then uncompressing any input file may itself be idempotent, that is,
a further round trip reproduces it unchanged. I
don't know if this will hold up for all cases or for FITS interfaces
other than CFITSIO. Such an action is something like the FITS
equivalent of canonicalizing XML.)
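
One way to run such a test is a plain byte-for-byte comparison of the file before and after a compress/uncompress round trip. The helper below is a hypothetical sketch (files_identical is not part of CFITSIO or the prototype) and it knows nothing about FITS; it only answers whether two files are identical.

```c
#include <stdio.h>
#include <string.h>

/* Compare two files byte for byte.
   Returns 1 if identical, 0 if they differ, -1 on I/O error. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = 1;

    if (fa == NULL || fb == NULL) {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return -1;
    }

    for (;;) {
        unsigned char ba[2880], bb[2880];   /* one FITS block per read */
        size_t na = fread(ba, 1, sizeof ba, fa);
        size_t nb = fread(bb, 1, sizeof bb, fb);

        if (na != nb || memcmp(ba, bb, na) != 0) {
            result = 0;                     /* length or content differs */
            break;
        }
        if (na == 0)
            break;                          /* both files ended together */
    }

    if (ferror(fa) || ferror(fb))
        result = -1;
    fclose(fa);
    fclose(fb);
    return result;
}
```

If the restored file compares equal to its own round-tripped copy but not to the original input, the round trip is acting as a canonicalizer rather than as a faithful restore.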
Finally the FITS compression convention is incomplete. It doesn't
actually express a coherent strategy for compressing and/or
uncompressing general FITS objects, but is limited to per-HDU
issues. For example, if an "SIF" file (that is, not an "MEF") is
compressed, an MEF is generated to contain the resulting binary
table. No information is retained to describe the original file
structure, so uncompressing this file later generates an ambiguity
about whether the original was indeed an SIF or rather was an
uncompressed MEF with a single IMAGE extension. A complementary
issue arises with MEF input, if the primary HDU is not dataless.
Does the "extra" extension resulting from compression become the
first output extension or the last? How many extensions does such a
restored file have? N or N+1?
Philosophically FITS compression is not like gzip or other "opaque"
compression. The output is itself a legal FITS object and interfaces
like CFITSIO or tools like imcopy can invisibly regard a compressed
image array as equivalent to an uncompressed array. This is a great
strength, but it doesn't remove the utility of other compression use
cases. For instance, I would be grateful if somebody could tell me
how to infer the compression status of an HDU using CFITSIO.
Invisibility is nice, but Claude Rains tells us its limits. (One
such limit: the prototype doesn't currently uncompress, simply because
it can't decide a priori whether the input is compressed to begin
with. The obvious workaround is to have separate "grice" and
"gunrice" commands. This might be desirable in any case for reasons
I won't belabor here.)
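
For what it's worth, the compression convention marks a compressed image as a binary table carrying the keyword ZIMAGE = T, so the status can be inferred from the raw header cards without going through the transparent interface. The sketch below scans a header buffer of 80-character cards for that keyword. The function name is hypothetical, and the buffer is assumed to hold the raw extension header as stored on disk. Current CFITSIO documentation also lists a fits_is_compressed_image routine; whether it is available in a given release is not assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FITS_CARD_LEN 80

/* Scan raw 80-byte header cards for ZIMAGE = T.
   Returns 1 if the header marks a tile-compressed image, else 0. */
int header_is_compressed_image(const char *header, size_t nbytes)
{
    size_t off;

    assert(header != NULL);
    for (off = 0; off + FITS_CARD_LEN <= nbytes; off += FITS_CARD_LEN) {
        const char *card = header + off;
        size_t i;

        if (memcmp(card, "END     ", 8) == 0)
            break;                      /* end of this header */
        if (memcmp(card, "ZIMAGE  = ", 10) != 0)
            continue;                   /* some other keyword */

        /* skip blanks before the logical value (normally in column 30) */
        for (i = 10; i < FITS_CARD_LEN && card[i] == ' '; i++)
            ;
        return i < FITS_CARD_LEN && card[i] == 'T';
    }
    return 0;
}
```

A tool with this check could pick compress or uncompress automatically per HDU, although separate commands would still avoid any ambiguity about the user's intent.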
Some questions to mull over:
1) Does a better alternative to the CFITSIO imcopy already exist?
(Options don't have to be limited to ANSI C.) How best might we
encourage a wide adoption of a single standard across the
astronomical community? Gzip is ubiquitous, but so is FITS.
2) What features should a general purpose command line FITS
compression tool have? (For instance, should the checksums from the
original file be cached for later comparison to restored HDUs -
whether on disk or in memory?)
3) Should idempotency and correct checksum handling be the
responsibility of CFITSIO, or rather of the application?
4) What logistical procedures and semantic structures need to be
added to the FITS compression convention to support real-world usage?
5) Note that I have not talked about compression algorithms at all.
Has any progress been made on these issues in the last few years that
FITS could benefit from? The compression convention is intended to
support multiple algorithms, of course.
Please take a look at the attached code. Please don't just take it
and use it under battlefield conditions - this appears to be what
happened with the original imcopy program :-) I've traded some email
with Bill Pence about this issue, but would be delighted to hear
additional feedback. If it turns out that further work is warranted
on this prototype, I'll gladly donate the results to be incorporated
into CFITSIO as Bill may deem appropriate. Folks interested in
collaborating are always welcome.
Rob Seaman
NOAO
--------

-------------- next part --------------
A non-text attachment was scrubbed...
Name: newimcopy.c
Type: application/octet-stream
Size: 4371 bytes
Desc: not available
URL: <http://listmgr.nrao.edu/pipermail/fitsbits/attachments/20060712/90760380/attachment.obj>