[fitsbits] start of Public Comment Period on the CHECKSUM convention
Rob Seaman
seaman at noao.edu
Sun Jul 5 13:35:09 EDT 2015
On Jul 2, 2015, at 3:14 AM, Mark Calabretta <mark at calabretta.id.au> wrote:
> On Fri, 26 Jun 2015 14:32:40 -0700 Rob Seaman <seaman at noao.edu> wrote:
>
>> The problem here is that one might want to reproduce a verbatim
>> file at a later date and the timestamp makes this impossible since the
>> checksum will differ precisely because of the timestamp.
>
> That's not the way I read it. If the HDU remains unchanged, then
> CHECKSUM and the date when it was computed should remain unchanged,
> as also DATE.
I didn't describe it well. Yes, for a specific file, the timestamps of various things (the checksum, min-max values, DATE itself) should remain unmodified if the pixels and/or metadata remain unchanged. Mark describes that case well.
My use case is different. Many large-data projects have proposed variations on duplicating processing at remote sites rather than processing at one site and transporting the results to another. The project I was describing involved a few million files that had previously been replicated between two sites connected by an expensive network link. A sequence of steps was necessary to update the pixels (in this case, applying a new compression algorithm) and headers in both data stores. The goal was to produce identical output files at each end. I found I needed to disable the timestamp that CFITSIO writes into the CHECKSUM comment field to make the copies verbatim.
This permitted verifying that the two original copies of the data matched, then updating them separately at each end, then verifying that the output files matched. Not to belabor the point, but if timestamps are updated, the MD5 or SHA digests would differ as well, even if all the rest of the data/metadata are verbatim duplicates.
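For what it's worth, "verbatim" here means byte-for-byte identical, which is also exactly what an MD5 or SHA digest over the whole file tests. A trivial sketch of that test in C (the function name is mine):

    #include <stdio.h>

    /* Return 1 if the two files are byte-for-byte identical ("verbatim"),
       else 0.  Any single-byte difference -- such as a timestamp embedded
       in a CHECKSUM comment field -- also changes any whole-file digest. */
    int files_identical(const char *path_a, const char *path_b)
    {
        FILE *a = fopen(path_a, "rb");
        FILE *b = fopen(path_b, "rb");
        int ca, cb, same = (a != NULL && b != NULL);

        if (same) {
            do {
                ca = getc(a);
                cb = getc(b);
            } while (ca == cb && ca != EOF);
            same = (ca == cb);
        }
        if (a) fclose(a);
        if (b) fclose(b);
        return same;
    }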
> One reason for recording the checksum date relates to the problem
> described by Richard van Nieuwenhoven:
>
>> A very very simplified example: There is a reader out there that will
>> just correct some special value in a fits file, but it does not support
>> the CHECKSUM. If that tool is used on a fits file with a checksum, 2
>> things will be broken. 1. it is not known if the checksum was correct
>> before the change and 2. afterwards the checksum is broken... So the
>> user has to know that the fits-file has a CHECKSUM and that the tool
>> does not support it ….
As with the many discussions about variance arrays over the years, the entire workflow must be checksum- or hash-aware to preserve traceability and the particular kind of trust it provides. This applies to MD5s as much as to the FITS checksum. One advantage of the FITS checksum is that delta corrections are easily computable as changes are made (though I'm unaware of anybody implementing this in a workflow). E.g., if a header, or even a single header keyword, is updated, the checksum of the old card image can be subtracted and the checksum of the new card image added to calculate the new total checksum.
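To make the arithmetic concrete: the FITS checksum is a 32-bit ones' complement sum with end-around carry, so subtracting a contribution just means adding its complement, and an 80-byte card image always occupies a whole number of aligned 32-bit words. A minimal sketch in C (the function names are mine, not CFITSIO's; the usual ones' complement caveat applies that 0 and 0xFFFFFFFF both represent zero):

    #include <stddef.h>
    #include <stdint.h>

    /* 32-bit ones' complement addition with end-around carry. */
    static uint32_t add1s(uint32_t a, uint32_t b)
    {
        uint64_t s = (uint64_t)a + b;
        return (uint32_t)((s & 0xFFFFFFFFu) + (s >> 32));
    }

    /* Ones' complement sum of one 80-byte card image,
       taken as 20 big-endian 32-bit words. */
    static uint32_t card_sum(const unsigned char card[80])
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < 80; i += 4)
            sum = add1s(sum, ((uint32_t)card[i]   << 24) |
                             ((uint32_t)card[i+1] << 16) |
                             ((uint32_t)card[i+2] <<  8) |
                              (uint32_t)card[i+3]);
        return sum;
    }

    /* Delta-update the HDU total when one card is replaced in place:
       subtract the old card's sum (add its complement), add the new. */
    uint32_t delta_checksum(uint32_t old_total,
                            const unsigned char old_card[80],
                            const unsigned char new_card[80])
    {
        return add1s(add1s(old_total, ~card_sum(old_card)),
                     card_sum(new_card));
    }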
> There are two possibilities when validating an HDU:
>
> A) The HDU's checksum equals -0.
>
> Congratulations, report that the HDU was validated. This should
> happen in the great majority of cases.
Yes, but note that the value of the checksum is also retained and can be individually checked (as in the use case above, comparing two entirely separate copies of the data).
Also note that the checksum of the data records is preserved separately, in DATASUM. This permits verifying output data files against original files whose headers have been updated. For instance, the data sums of the 70 image extensions of each archived DECam image are verified every morning against the original camera files *after* the headers have been updated. (Separate checks are done for the headers.)
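In CFITSIO that kind of sweep can be built around fits_verify_chksum(), which reports the data and header sums independently. A sketch (the loop and reporting are mine):

    #include <stdio.h>
    #include "fitsio.h"

    /* Check DATASUM (and CHECKSUM) in every HDU of a file.
       fits_verify_chksum() sets each flag to 1 (verified),
       0 (keyword absent), or -1 (mismatch). */
    int verify_all_hdus(const char *path)
    {
        fitsfile *fptr;
        int status = 0, nhdus = 0, hdutype, nbad = 0;

        if (fits_open_file(&fptr, path, READONLY, &status))
            return status;
        fits_get_num_hdus(fptr, &nhdus, &status);

        for (int i = 1; i <= nhdus && status == 0; i++) {
            int dataok, hduok;
            fits_movabs_hdu(fptr, i, &hdutype, &status);
            fits_verify_chksum(fptr, &dataok, &hduok, &status);
            if (dataok == -1) {
                fprintf(stderr, "%s HDU %d: DATASUM mismatch\n", path, i);
                nbad++;
            }
        }
        fits_close_file(fptr, &status);
        return status ? status : nbad;
    }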
> B) The HDU's checksum does not equal -0.
>
> Look at
> a) the date the checksum was computed,
> b) the date the HDU was written, as recorded in the DATE keyvalue.
>
> If (a) is earlier than (b) then it is reasonably safe to assume that
> the HDU was modified by naive software without updating CHECKSUM.
> Issue a warning that the CHECKSUM appears to be unreliable.
>
> If (a) is later than, or the same as (b), or if DATE is missing,
> then there could be a problem. Possibly the HDU was modified and
> rewritten by slack software that didn't update DATE, or possibly it
> really was corrupted. Issue a warning that the HDU was not validated.
> It's then up to a human to decide what to do based on the provenance
> of the FITS file. In most cases this should be straightforward.
Yes, this is a useful use case, but it is not the only one. My request is that the date/timestamp remain optional. This should be true of all processing timestamps (as opposed to science timestamps) in FITS or any other astronomical data format.
> However, because metadata should not be stored in the comment field,
> I would alter the CHECKSUM proposal to create separate keywords, say
> DATE-CHK and DATE-DSM, for the CHECKSUM and DATASUM dates.
I have no problem adding new (optional) keywords, and DATE-xxx is the proper model for this.
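For what it's worth, the procedure in B) then reduces to a small decision tree over the verification result and those keyvalues. A sketch, assuming the proposed DATE-CHK keyword and relying on the fact that FITS ISO-8601 date strings compare correctly as plain strings (the type and function names are mine):

    #include <string.h>

    typedef enum { HDU_VALID, HDU_CHECKSUM_STALE, HDU_NOT_VALIDATED } verdict;

    /* checksum_ok: nonzero if the HDU sum equals -0 (case A).
       date_chk, date: DATE-CHK and DATE keyvalues, or NULL if absent. */
    verdict classify_hdu(int checksum_ok, const char *date_chk,
                         const char *date)
    {
        if (checksum_ok)
            return HDU_VALID;

        /* Case B: checksum computed before the HDU was last written,
           so it was presumably modified by checksum-naive software. */
        if (date_chk != NULL && date != NULL && strcmp(date_chk, date) < 0)
            return HDU_CHECKSUM_STALE;   /* warn: CHECKSUM unreliable */

        return HDU_NOT_VALIDATED;        /* warn: possible corruption;
                                            refer to a human */
    }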
> Because obviously CHECKSUM could have, and should have been recomputed
> after those operations were performed.
Usage of checksums and hashes is not specific to FITS, of course. One can make an interesting analogy to git and other source-code management technologies. These have strengths and weaknesses similar to those discussed in this thread. It is non-trivial to convince git to preserve file timestamps, for instance. (At least, I have found it so.)
Rob