[fitsbits] Rice compression from the command line
Rob Seaman
seaman at noao.edu
Wed Jul 19 10:44:15 EDT 2006
On Jul 18, 2006, at 9:05 PM, Mark Calabretta wrote:
> For the FITS binary table, 7zip is costly in CPU time for compression
> but beats gzip and bzip2 handsomely in compression ratio. However,
> 7zip is not nearly so costly in elapsed time for decompression. If
> these results are typical then 7zip would have to be the compressor
> of choice for FITS data distributed on the web.
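For anyone who wants to check such numbers against their own tables,
the comparison is easy to script. A minimal sketch in Python (the
stdlib lzma module implements the LZMA algorithm behind 7zip;
"table.fits" is just a placeholder file name):

    import bz2, gzip, lzma, time

    def benchmark(path):
        # Compare ratio and wall-clock time for gzip, bzip2, and LZMA.
        with open(path, "rb") as f:
            raw = f.read()
        for name, mod in [("gzip", gzip), ("bzip2", bz2), ("lzma", lzma)]:
            t0 = time.perf_counter()
            packed = mod.compress(raw)
            t1 = time.perf_counter()
            mod.decompress(packed)
            t2 = time.perf_counter()
            print(f"{name:5s} ratio={len(packed) / len(raw):.3f} "
                  f"comp={t1 - t0:.2f}s decomp={t2 - t1:.2f}s")

    benchmark("table.fits")  # placeholder: any FITS binary table

Real measurements should of course hold compression levels and
hardware fixed across the candidates.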
Which raises the general question of constructing a figure of merit
for data compression. Discussions like this usually focus on
compression ratio, the speed to compress, and the speed to decompress,
but there are a number of important, less quantifiable parameters:
1) market penetration - gzip is a clear leader here
2) openness of software - both ends of the spectrum may have issues.
Patents held by some multinational can quell our access (and
interest) if there is no loophole for educational licensing, but
navigating the intricacies of some extreme copyleft can do the same.
3) applicability to a particular purpose - tiled Rice and PLIO are
very attractive, tiled gzip much less so (with default parameters)
4) tailoring to data - a tile compressed FITS file is still a FITS file
5) stability across a range of data sets - even good ol' gzip varies
quite a bit in compression ratio from one file to the next. For
example, the average gzip compression ratio over two years of NOAO
Mosaic II data is 0.586 +/- 0.0449. Four and a half percent (1-sigma)
may not seem like a very wide distribution, but it's all in the
meaning of "average". That figure comes from 170 nights selected out
of 304 total: all nights with binned data, all multi-instrument
nights, and all nights with fewer than 10 object exposures were
rejected. More to the point, "average" here means the mean of nightly
means (see the sketch after this list). On a randomly chosen recent
night, the compression ratio varies between 0.33 and 0.79 across
several dozen overtly identical 140 MB files - calibrations at the
low end, of course, and object frames at the top. Obviously there are
issues of information theory here, and one could use the
incompressibility of the "science" data to gauge the skill of the
observer :-)
6) availability of software - if God hadn't created cfitsio, it would
have had to be invented. (Those who might be thinking that the same
applies for the Devil and IRAF - shame on you!)
7) community support - after 7 years one might have hoped that more
projects and software would support tile compression.
8) <your feature here>
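To make the "mean of nightly means" arithmetic in item 5 concrete,
here is a toy sketch - the ratios are invented, only the bookkeeping
matters:

    from statistics import mean, stdev

    # Invented per-file gzip compression ratios, keyed by night.
    ratios_by_night = {
        "night1": [0.33, 0.41, 0.55, 0.62, 0.70, 0.74, 0.76, 0.77, 0.78, 0.79],
        "night2": [0.35, 0.44, 0.58, 0.61, 0.66, 0.70, 0.72, 0.75, 0.77, 0.78],
        "night3": [0.30, 0.39],  # fewer than 10 object exposures: rejected
    }

    # Apply the exposure cut, then average night by night.
    nightly_means = [mean(r) for r in ratios_by_night.values() if len(r) >= 10]

    # "Average" = the mean of nightly means, with night-to-night scatter.
    print(f"{mean(nightly_means):.3f} +/- {stdev(nightly_means):.4f}")

The +/- 0.0449 measures the scatter of those nightly means, not of
the individual files - which is why a single night can still swing
from 0.33 to 0.79.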
In general, we often get bound up in theoretical discussions about
things like lossy compression, rather than focusing on pragmatic
issues of usability and suitability. Meanwhile the LSST tidal wave
approaches, but there are going to be several smaller waves impacting
astronomy's shores first, including Pan-STARRS (however many
telescopes) and next-generation instruments like the One-Degree
Imager and the Dark Energy Camera.
Features like items 1-8 can all be addressed through coordinated
community action - it might as well be the FITS community. On the
other hand, the best way to understand the quantitative parameters of
the figure of merit - compression ratio, speed in, and speed out -
may be to focus not on static archival holdings, but rather on the
costs of bandwidth and latency encountered when moving the data
around. After all, isn't the point of the emerging Virtual
Observatory to keep the pixels in play, ever moving and interacting?
Even if we co-locate processing with data, the data have to shuttle
from a SAN across gigabit Ethernet or Fibre Channel to the Beowulf
cluster next door. As Arnold just pointed out, customer satisfaction
(and thus our job security, I might add) depends on the aggregate
response of our systems.
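One way to fold ratio, speed in, and speed out into a single
network-centric number is to cost out the full round trip. A minimal
sketch, with every speed an invented placeholder:

    def end_to_end_seconds(size_mb, ratio, comp_mb_s, decomp_mb_s, link_mb_s):
        # Time to compress, ship, and decompress one file. ratio is
        # compressed/original size; comp/decomp speeds are MB/s of
        # original pixels; link_mb_s is effective network bandwidth.
        return (size_mb / comp_mb_s            # compress at the source
                + size_mb * ratio / link_mb_s  # move the smaller file
                + size_mb / decomp_mb_s)       # decompress at the sink

    # A 140 MB Mosaic frame over a 10 MB/s link (all speeds invented):
    print(end_to_end_seconds(140, 0.59, comp_mb_s=30,
                             decomp_mb_s=60, link_mb_s=10))

The interesting behavior is that the ranking of codecs can flip as
link_mb_s shrinks: a slow, tight compressor earns its keep on a thin
pipe.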
I stumbled across a very interesting, very recent paper on lossless
floating-point compression:
http://www-static.cc.gatech.edu/~lindstro/papers/floatzip/paper.pdf
...so recent it has yet to appear in either author's online
publication list.
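I won't try to reproduce the authors' scheme here, but the family it
belongs to is easy to sketch: predict each value from a neighbor and
entropy-code the XOR of the IEEE bit patterns, which is mostly zero
bits when the prediction is good. A toy Python illustration (my own,
emphatically not the paper's method):

    import struct, zlib

    def xor_delta_pack(values):
        # Predict each double by its predecessor, XOR the 64-bit
        # patterns, and let zlib squeeze the runs of zero bits left
        # behind by good predictions.
        bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
        residuals = [bits[0]] + [a ^ b for a, b in zip(bits[1:], bits)]
        return zlib.compress(b"".join(struct.pack("<Q", r) for r in residuals))

    def xor_delta_unpack(blob):
        raw = zlib.decompress(blob)
        res = [struct.unpack("<Q", raw[i:i + 8])[0]
               for i in range(0, len(raw), 8)]
        bits = [res[0]]
        for r in res[1:]:
            bits.append(bits[-1] ^ r)
        return [struct.unpack("<d", struct.pack("<Q", b))[0] for b in bits]

If I read it correctly, the paper's contribution is a far better
predictor and coder than this toy, but the shape of the pipeline -
predict, difference, entropy-code - is the same.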
As far as I can tell, there is nothing about any of the algorithms
referenced that would keep them from being used with astronomical
data. The real question is how to turn academic advances into useful
tools for our community. The FITS tile compression convention is one
step toward greasing the rails.
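For concreteness, this is what the convention looks like through
astropy.io.fits, a later Python implementation of the same tile
compression support that cfitsio pioneered (a minimal sketch;
"rice.fits" is a placeholder):

    import numpy as np
    from astropy.io import fits

    # Rice-compress an image under the tile compression convention:
    # the pixels live in a binary table, but the file is still FITS.
    data = np.random.poisson(200, size=(2048, 2048)).astype(np.int32)
    hdul = fits.HDUList([fits.PrimaryHDU(),
                         fits.CompImageHDU(data, compression_type="RICE_1")])
    hdul.writeto("rice.fits", overwrite=True)

    # Readers that understand the convention decompress transparently.
    assert (fits.getdata("rice.fits") == data).all()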
Bill Pence wants to add Hcompress to the cfitsio support for tile
compression. Imagine, rather, supporting any and all of the
algorithms mentioned above - perhaps using some sort of
plug-in/component architecture. We're never going to identify a
single best
compression scheme for all our data. This was the subtext of the
tile compression proposal in the first place. It's time to follow
through to the logical conclusion. If any application could
transparently access data compressed a dozen different ways (perhaps
HDU by HDU in the same MEF), there would be no reason not to store
such heterogeneous representations or to convert the data on the fly
for task-specific purposes. A suite of layered benchmark
applications would provide the tools to make these decisions. Those
tools could even be automated to operate in adaptive ways within the
data handling components of our archives, pipelines, web services and
portals.
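A hedged sketch of what such a plug-in layer might look like, keying
codecs the way the convention already does (by ZCMPTYPE name);
everything below is invented for illustration:

    import zlib
    from typing import Callable, Dict, Tuple

    # Plug-in registry: ZCMPTYPE-style codec names map to
    # (compress, decompress) pairs contributed by separate modules.
    Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]
    CODECS: Dict[str, Codec] = {}

    def register(name: str, compress, decompress) -> None:
        CODECS[name] = (compress, decompress)

    def decompress_tile(name: str, tile: bytes) -> bytes:
        # Any application can read any HDU whose codec is registered,
        # whatever mix of schemes happened to write the file.
        return CODECS[name][1](tile)

    register("GZIP_1", zlib.compress, zlib.decompress)
    # register("RICE_1", rice.compress, rice.decompress)  # hypothetical

An adaptive pipeline could then time the registered codecs on a
sample tile and pick the winner HDU by HDU.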
Sounds like a nifty ADASS abstract to me :-) I'd already asked Bill
if he wanted to work on such a paper - anybody else want to pile on?
Rob