[fitsbits] (...) CHECKSUM convention
Joe Hourcle
oneiros at grace.nascom.nasa.gov
Mon Jul 6 10:40:54 EDT 2015
On Mon, 6 Jul 2015, Lucio Chiappetti wrote:
> On Sun, 5 Jul 2015, Rob Seaman wrote:
>
>> My use case is different. Many large-data projects have proposed
>> variations of duplicating processing at remote sites rather than processing
>> at one site and transporting the results to another. The project I was
>> describing involved a few million files that had previously been replicated
>> between two sites connected via an expensive network link. A sequence of
>> steps was necessary to update the pixels (happened to be a new compression
>> algorithm) and headers for both data stores. The goal was to produce
>> identical output files at each end. I found I needed to disable the
>> timestamp in the CFITSIO CHECKSUM to make the copies verbatim.
>
> I never embarked myself in projects of such size, and in general I did not
> care much of checksums even when downloading data from ftp sites.
>
> However concerning identical copies of files, I like the idea of them having
> the same timestamps. For instance for mirroring a development web site (where
> all pages are timestamped, and the timestamp shown via a SSI include
> directive) into a production one.
>
> The tool I use for this is rsync.
>
> I wonder whether there is a usage case for a specialized rsync for FITS
> (maybe in conjunction with compression), something acting at HDU instead of
> file level ...
I wrote a paper for the SABiD (Solar Astronomy Big Data) conference last
year, but withdrew it when I got frustrated with the IEEE submission
process:
Distributing Solar Data: Minimizing Wasted Bandwidth
http://dx.doi.org/10.5281/zenodo.16950
Standard rsync works okay, so long as the size of the headers doesn't
change (and the proposal to allocate extra blank blocks in the header
would make it more likely that the size stays the same).
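To illustrate why header size matters here: FITS headers are made of 80-byte cards padded out to whole 2880-byte blocks, so adding one card to a header that is already full grows it by an entire block and moves every data byte that follows. The sketch below only demonstrates that arithmetic on a synthetic header. The helper names (header_length, make_header) are made up for this example and do not come from any FITS library.

```python
BLOCK = 2880   # FITS block size in bytes
CARD = 80      # size of one header card in bytes


def header_length(raw: bytes) -> int:
    """Return the length of the first header in raw, padded to whole blocks.

    Raises ValueError if no END card is present in the bytes supplied.
    """
    for offset in range(0, len(raw) - CARD + 1, CARD):
        card = raw[offset:offset + CARD]
        if card[:8] == b"END     ":
            end = offset + CARD
            # Round up to the next multiple of BLOCK.
            return -(-end // BLOCK) * BLOCK
    raise ValueError("END card not found")


def make_header(n_cards: int) -> bytes:
    """Build a synthetic header: n_cards COMMENT cards plus END (illustration only)."""
    cards = [b"COMMENT filler".ljust(CARD) for _ in range(n_cards)]
    cards.append(b"END".ljust(CARD))
    body = b"".join(cards)
    # Pad with spaces to a whole number of blocks, as the FITS standard requires.
    return body.ljust(header_length(body), b" ")


if __name__ == "__main__":
    full = make_header(35)        # 35 cards + END exactly fill one block
    one_more = make_header(36)    # one extra card spills into a second block
    print(header_length(full), header_length(one_more))
```

A header with spare blank cards can absorb new keywords without changing length, which is the situation where the data portion stays at the same offset.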
You can do interesting things with the HTTP Range header to retrieve just
the headers, then transfer the data portion only if DATASUM has changed.
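A minimal sketch of that idea, under several assumptions: the server honours byte ranges (answers with status 206), the file is plain uncompressed FITS, and only the primary HDU is examined (each extension carries its own DATASUM and would need the same treatment). The function names are hypothetical, and the card parser is deliberately simplistic: it is good enough for DATASUM but not for string values that contain a slash.

```python
import urllib.request

BLOCK = 2880   # FITS block size in bytes
CARD = 80      # size of one header card in bytes


def fetch_range(url: str, first: int, last: int) -> bytes:
    """GET bytes first..last (inclusive) of url with an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={first}-{last}"})
    with urllib.request.urlopen(req) as resp:
        # 206 means the range was honoured; 200 means the whole file is coming.
        if resp.status != 206:
            raise RuntimeError(f"server ignored Range (status {resp.status})")
        return resp.read()


def parse_cards(raw: bytes):
    """Return (keyword -> value string, True if the END card was seen)."""
    cards = {}
    for offset in range(0, len(raw) - CARD + 1, CARD):
        card = raw[offset:offset + CARD].decode("ascii", errors="replace")
        keyword = card[:8].rstrip()
        if keyword == "END":
            return cards, True
        if card[8:10] == "= ":
            # Drop any trailing comment, then surrounding quotes and padding.
            value = card[10:].split("/", 1)[0].strip()
            cards[keyword] = value.strip("'").strip()
    return cards, False


def fetch_primary_header(url: str, max_blocks: int = 64) -> dict:
    """Fetch header blocks one at a time until the END card turns up."""
    raw = b""
    for i in range(max_blocks):
        raw += fetch_range(url, i * BLOCK, (i + 1) * BLOCK - 1)
        cards, complete = parse_cards(raw)
        if complete:
            return cards
    raise RuntimeError("END card not found within max_blocks")


def data_changed(local_cards: dict, remote_cards: dict) -> bool:
    """True if the data portion should be transferred again."""
    local = local_cards.get("DATASUM")
    remote = remote_cards.get("DATASUM")
    # With a missing DATASUM on either side, play safe and transfer.
    return local is None or remote is None or local != remote
```

A caller would parse the local copy's header with parse_cards, fetch the remote one with fetch_primary_header, and request the remaining bytes only when data_changed returns True.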
I've done some simple proof-of-concept stuff, but I've never packaged it
into a tool for other people's use, in part because I've found what a
horrible PITA it is to release software at NASA. There are also some
problems if the site serving the data is using the default mod_security
rules, which consider 'Range: bytes=0-2879' to be some sort of attack,
along with any other range starting at 0.
-Joe
-----
Joe Hourcle
Programmer/Analyst
Solar Data Analysis Center
Goddard Space Flight Center