[fitsbits] (...) CHECKSUM convention

Joe Hourcle oneiros at grace.nascom.nasa.gov
Mon Jul 6 10:40:54 EDT 2015



On Mon, 6 Jul 2015, Lucio Chiappetti wrote:

> On Sun, 5 Jul 2015, Rob Seaman wrote:
>
>> My use case is different.  Many large-data projects have proposed 
>> variations of duplicating processing at remote sites rather than processing 
>> at one site and transporting the results to another.  The project I was 
>> describing involved a few million files that had previously been replicated 
>> between two sites connected via an expensive network link.  A sequence of 
>> steps was necessary to update the pixels (happened to be a new compression 
>> algorithm) and headers for both data stores.  The goal was to produce 
>> identical output files at each end.  I found I needed to disable the 
>> timestamp in the CFITSIO CHECKSUM to make the copies verbatim.
>
> I have never embarked on projects of such size, and in general I have not 
> cared much about checksums, even when downloading data from FTP sites.
>
> However, concerning identical copies of files, I like the idea of them having 
> the same timestamps, for instance when mirroring a development web site (where 
> all pages are timestamped, and the timestamp is shown via an SSI include 
> directive) into a production one.
>
> The tool I use for this is rsync.
>
> I wonder whether there is a use case for a specialized rsync for FITS 
> (maybe in conjunction with compression), something acting at the HDU level 
> instead of the file level ...


I wrote a paper for the SABiD (Solar Astronomy Big Data) conference last 
year, but withdrew it when I got frustrated with the IEEE submission 
process:

 	Distributing Solar Data: Minimizing Wasted Bandwidth
 	http://dx.doi.org/10.5281/zenodo.16950

Standard rsync works okay, so long as the size of the headers doesn't 
change.  (The proposal to allocate extra blank blocks in the header 
would make it more likely that the size stays the same.)
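
To make the header-size point concrete: FITS headers are written as 80-byte 
cards packed into 2880-byte blocks, so the data unit only shifts when the 
card count (including END) crosses a multiple of 36.  A minimal sketch of 
that arithmetic in Python; the function names are made up for illustration:

```python
import math

BLOCK = 2880                      # FITS block size in bytes
CARD = 80                         # one header card
CARDS_PER_BLOCK = BLOCK // CARD   # 36 cards fit in one block


def header_bytes(n_cards):
    """On-disk size of a header holding n_cards cards, END card included."""
    return math.ceil(n_cards / CARDS_PER_BLOCK) * BLOCK


def data_offset_moves(cards_before, cards_after):
    """True if changing the card count shifts the start of the data unit."""
    return header_bytes(cards_before) != header_bytes(cards_after)
```

For example, going from 30 to 35 cards leaves the header at 2880 bytes, 
while going from 36 to 37 pushes it to 5760 and moves every data byte 
that follows.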

You can do interesting things with the HTTP Range header to retrieve just 
the headers, then only selectively transfer the data portion if DATASUM 
has changed.
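
A rough sketch of that idea in Python follows.  It is not the 
proof-of-concept code or anything from the paper, just an illustration 
under some assumptions: the server honours byte ranges, only the primary 
HDU is examined, and the local DATASUM is already known as a string.  All 
function names are hypothetical.

```python
import urllib.request

BLOCK = 2880  # FITS block size in bytes
CARD = 80     # one header card


def http_range_fetch(url):
    """Return fetch(start, end) that reads an inclusive byte range of url."""
    def fetch(start, end):
        req = urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:  # 206 = Partial Content
                raise RuntimeError("server did not honour the Range header")
            return resp.read()
    return fetch


def read_primary_header(fetch):
    """Fetch header blocks until END; return (raw values, header size)."""
    cards = {}
    offset = 0
    while True:
        block = fetch(offset, offset + BLOCK - 1)
        if len(block) < BLOCK:
            raise ValueError("truncated FITS header")
        offset += BLOCK
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            keyword = card[:8].strip()
            if keyword == "END":
                return cards, offset
            if card[8:10] == "= ":      # value indicator
                cards[keyword] = card[10:]


def needs_data_transfer(fetch, local_datasum):
    """True if the remote DATASUM is missing or differs from the local one."""
    cards, _header_size = read_primary_header(fetch)
    raw = cards.get("DATASUM")
    if raw is None:
        return True  # nothing to compare against, so fetch the data
    # Naive value parsing: drop the comment, then the quotes and padding.
    remote = raw.split("/")[0].strip().strip("'").strip()
    return remote != local_datasum
```

Usage would be along the lines of 
needs_data_transfer(http_range_fetch(url), local_datasum); passing the 
fetch function in separately also means the logic can be exercised against 
a local file or an in-memory buffer without touching the network.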

I've done some simple proof-of-concept work, but I've never packaged it as 
a tool for other people to use.  (In part, that's because I've found what a 
horrible PITA it is to release software at NASA.  There are also 
problems if the site serving the data uses the default mod_security 
rules, which treat 'Range: bytes=0-2879' as some sort of attack, along 
with any other range starting at 0.)

-Joe

-----
Joe Hourcle
Programmer/Analyst
Solar Data Analysis Center
Goddard Space Flight Center


