[fitsbits] Rice compression from the command line
Rob Seaman
seaman at noao.edu
Wed Jul 19 10:44:15 EDT 2006
On Jul 18, 2006, at 9:05 PM, Mark Calabretta wrote:
> For the FITS binary table, 7zip is costly in CPU time for compression
> but beats gzip and bzip2 handsomely in compression ratio. However,
> 7zip is not nearly so costly in elapsed time for decompression. If
> these results are typical then 7zip would have to be the compressor
> of choice for FITS data distributed on the web.
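For anyone who wants to check such numbers against their own tables,
the comparison is easy to script. A minimal sketch in Python (the
stdlib lzma module implements the LZMA algorithm behind 7zip;
"table.fits" is just a placeholder file name):

    import bz2, gzip, lzma, time

    def benchmark(path):
        # Compare ratio and wall-clock time for gzip, bzip2, and LZMA.
        with open(path, "rb") as f:
            raw = f.read()
        for name, mod in [("gzip", gzip), ("bzip2", bz2), ("lzma", lzma)]:
            t0 = time.perf_counter()
            packed = mod.compress(raw)
            t1 = time.perf_counter()
            mod.decompress(packed)
            t2 = time.perf_counter()
            print(f"{name:5s} ratio={len(packed) / len(raw):.3f} "
                  f"comp={t1 - t0:.2f}s decomp={t2 - t1:.2f}s")

    benchmark("table.fits")  # placeholder: any FITS binary table

Real measurements should of course hold compression levels and
hardware fixed across the candidates.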
Which raises the general question of constructing a figure of merit
for data compression. Discussions like this usually focus on
compression ratio, the speed to compress, and the speed to decompress,
but there are a number of important, less quantifiable parameters:
1) market penetration - gzip is a clear leader here
2) openness of software - both ends of the spectrum may have issues.
Patents held by some multinational can quell our access (and
interest) if there is no loophole for educational licensing, but
navigating the intricacies of some extreme copyleft can do the same.
3) applicability to a particular purpose - tiled Rice and PLIO are
very attractive, tiled gzip much less so (with default parameters)
4) tailoring to data - a tile compressed FITS file is still a FITS file
5) stability across a range of data sets - even good ol' gzip varies
quite a bit in compression ratio from one file to the next. For
example, the average gzip compression ratio over two years of NOAO
Mosaic II data is 0.586 +/- 0.0449. Four and a half percent (1-sigma)
may not seem like a very wide distribution, but it's all in the
meaning of "average". That figure comes from 170 nights selected out
of 304 total: all nights with binned data, all multi-instrument
nights, and all nights with fewer than 10 object exposures were
rejected. More to the point, "average" here means the mean of nightly
means (see the sketch after this list). On a randomly chosen recent
night, the compression ratio varies between 0.33 and 0.79 across
several dozen overtly identical 140 MB files - calibrations at the
low end, of course, and object frames at the top. Obviously there are
issues of information theory here, and one could use the
incompressibility of the "science" data to gauge the skill of the
observer :-)
6) availability of software - if God hadn't created cfitsio, it would
have had to be invented. (Those who might be thinking that the same
applies for the Devil and IRAF - shame on you!)
7) community support - after 7 years one might have hoped that more
projects and software would support tile compression.
8) <your feature here>
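To make the "mean of nightly means" arithmetic in item 5 concrete,
here is a toy sketch - the ratios are invented, only the bookkeeping
matters:

    from statistics import mean, stdev

    # Invented per-file gzip compression ratios, keyed by night.
    ratios_by_night = {
        "night1": [0.33, 0.41, 0.55, 0.62, 0.70, 0.74, 0.76, 0.77, 0.78, 0.79],
        "night2": [0.35, 0.44, 0.58, 0.61, 0.66, 0.70, 0.72, 0.75, 0.77, 0.78],
        "night3": [0.30, 0.39],  # fewer than 10 object exposures: rejected
    }

    # Apply the exposure cut, then average night by night.
    nightly_means = [mean(r) for r in ratios_by_night.values() if len(r) >= 10]

    # "Average" = the mean of nightly means, with night-to-night scatter.
    print(f"{mean(nightly_means):.3f} +/- {stdev(nightly_means):.4f}")

The +/- 0.0449 measures the scatter of those nightly means, not of
the individual files - which is why a single night can still swing
from 0.33 to 0.79.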
In general, we often get bound up in theoretical discussions about
things like lossy compression, rather than focusing on pragmatic
issues of usability and suitability. Meanwhile the LSST tidal wave
approaches, but there are going to be several smaller waves impacting
astronomy's shores first, including Pan-STARRS (however many
telescopes) and next-generation instruments like the One-Degree
Imager and the Dark Energy Camera.
Features like items 1-8 can all be addressed through coordinated
community action - it might as well be the FITS community. On the
other hand, the best way to understand the quantitative parameters of
the figure of merit - compression ratio, speed in, and speed out -
may be to focus not on static archival holdings, but rather on the
costs of bandwidth and latency encountered when moving the data
around. After all, isn't the point of the emerging Virtual
Observatory to keep the pixels in play, ever moving and interacting?
Even if we co-locate processing with data, the data have to shuttle
from a SAN across gigabit Ethernet or Fibre Channel to the Beowulf
cluster next door. As Arnold just pointed out, customer satisfaction
(and thus our job security, I might add) depends on the aggregate
response of our systems.
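One way to fold ratio, speed in, and speed out into a single
network-centric number is to cost out the full round trip. A minimal
sketch, with every speed an invented placeholder:

    def end_to_end_seconds(size_mb, ratio, comp_mb_s, decomp_mb_s, link_mb_s):
        # Time to compress, ship, and decompress one file. ratio is
        # compressed/original size; comp/decomp speeds are MB/s of
        # original pixels; link_mb_s is effective network bandwidth.
        return (size_mb / comp_mb_s            # compress at the source
                + size_mb * ratio / link_mb_s  # move the smaller file
                + size_mb / decomp_mb_s)       # decompress at the sink

    # A 140 MB Mosaic frame over a 10 MB/s link (all speeds invented):
    print(end_to_end_seconds(140, 0.59, comp_mb_s=30,
                             decomp_mb_s=60, link_mb_s=10))

The interesting behavior is that the ranking of codecs can flip as
link_mb_s shrinks: a slow, tight compressor earns its keep on a thin
pipe.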
I stumbled across a very interesting, very recent paper on lossless
floating-point compression:
http://www-static.cc.gatech.edu/~lindstro/papers/floatzip/paper.pdf
...so recent it has yet to appear in either author's online
publication list.
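I won't try to reproduce the authors' scheme here, but the family it
belongs to is easy to sketch: predict each value from a neighbor and
entropy-code the XOR of the IEEE bit patterns, which is mostly zero
bits when the prediction is good. A toy Python illustration (my own,
emphatically not the paper's method):

    import struct, zlib

    def xor_delta_pack(values):
        # Predict each double by its predecessor, XOR the 64-bit
        # patterns, and let zlib squeeze the runs of zero bits left
        # behind by good predictions.
        bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
        residuals = [bits[0]] + [a ^ b for a, b in zip(bits[1:], bits)]
        return zlib.compress(b"".join(struct.pack("<Q", r) for r in residuals))

    def xor_delta_unpack(blob):
        raw = zlib.decompress(blob)
        res = [struct.unpack("<Q", raw[i:i + 8])[0]
               for i in range(0, len(raw), 8)]
        bits = [res[0]]
        for r in res[1:]:
            bits.append(bits[-1] ^ r)
        return [struct.unpack("<d", struct.pack("<Q", b))[0] for b in bits]

If I read it correctly, the paper's contribution is a far better
predictor and coder than this toy, but the shape of the pipeline -
predict, difference, entropy-code - is the same.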
As far as I can tell, there is nothing about any of the algorithms
referenced that would keep them from being used with astronomical
data. The real question is how to turn academic advances into useful
tools for our community. The FITS tile compression convention is one
step toward greasing the rails.
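For concreteness, this is what the convention looks like through
astropy.io.fits, a later Python implementation of the same tile
compression support that cfitsio pioneered (a minimal sketch;
"rice.fits" is a placeholder):

    import numpy as np
    from astropy.io import fits

    # Rice-compress an image under the tile compression convention:
    # the pixels live in a binary table, but the file is still FITS.
    data = np.random.poisson(200, size=(2048, 2048)).astype(np.int32)
    hdul = fits.HDUList([fits.PrimaryHDU(),
                         fits.CompImageHDU(data, compression_type="RICE_1")])
    hdul.writeto("rice.fits", overwrite=True)

    # Readers that understand the convention decompress transparently.
    assert (fits.getdata("rice.fits") == data).all()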
Bill Pence wants to add Hcompress to the cfitsio support for tile
compression. Imagine, rather, supporting any and all of the
algorithms mentioned above - perhaps using some sort of
plug-in/component architecture. We're never going to identify a
single best
compression scheme for all our data. This was the subtext of the
tile compression proposal in the first place. It's time to follow
through to the logical conclusion. If any application could
transparently access data compressed a dozen different ways (perhaps
HDU by HDU in the same MEF), there would be no reason not to store
such heterogeneous representations or to convert the data on the fly
for task-specific purposes. A suite of layered benchmark
applications would provide the tools to make these decisions. Those
tools could even be automated to operate in adaptive ways within the
data handling components of our archives, pipelines, web services and
portals.
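A hedged sketch of what such a plug-in layer might look like, keying
codecs the way the convention already does (by ZCMPTYPE name);
everything below is invented for illustration:

    import zlib
    from typing import Callable, Dict, Tuple

    # Plug-in registry: ZCMPTYPE-style codec names map to
    # (compress, decompress) pairs contributed by separate modules.
    Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]
    CODECS: Dict[str, Codec] = {}

    def register(name: str, compress, decompress) -> None:
        CODECS[name] = (compress, decompress)

    def decompress_tile(name: str, tile: bytes) -> bytes:
        # Any application can read any HDU whose codec is registered,
        # whatever mix of schemes happened to write the file.
        return CODECS[name][1](tile)

    register("GZIP_1", zlib.compress, zlib.decompress)
    # register("RICE_1", rice.compress, rice.decompress)  # hypothetical

An adaptive pipeline could then time the registered codecs on a
sample tile and pick the winner HDU by HDU.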
Sounds like a nifty ADASS abstract to me :-) I'd already asked Bill
if he wanted to work on such a paper - anybody else want to pile on?
Rob