[fitsbits] Proposed Changes to the FITS Standard

Sun Aug 19 13:15:44 EDT 2007

Perhaps there is a consensus building to back off from an absolute  
ban on duplicate keywords.  Here's my reply to Bill's latest in any  
event - kind of a "Tao of FITS".  Apologies for the length:

> - keyword values are restricted to be a single value, not an array
> - logical keyword values must consist of a single T or F followed
>    only by a space or a slash character
> - integer and float keyword values must not contain embedded spaces
> - complex keyword values must be enclosed in parentheses
> - no other keywords may intervene between the mandatory keywords in
>    the primary array or extension
> - the TFORM keyword values must be upper case (e.g., F5.2, not f5.2)

These are all (except perhaps the last) rare occurrences.  In that  
case, a newly placed requirement is more like a clarification of the  
standard than a change.  Duplicate keywords, on the other hand, are a  
frequent occurrence (thus the interest in eliminating them :-).  "Once  
FITS, always FITS" may never have been in serious question before.

> Imposing a new requirement on software systems to read the last  
> instance of the keyword would likely have a lot of negative  
> repercussions.

No more so than imposing a new requirement to detect and act on  
duplicate keywords.  The difference is that outlawing duplicates  
doesn't fix those systems reading indeterminate values.

We might ask what an ideal FITS library or application should do on  
encountering various exceptions.  Whether or not duplicate keywords  
are outlawed, deprecated or ignored, feeding such input to our  
software will remain a frequent event.  We can't legislate moral  
behavior, rather only the consequences for detected immorality.

It seems to me that in this imperfect world it would be better if the  
major FITS software packages adopted a coherent behavior on  
encountering duplicate keywords.  A header with duplicate FITS  
keywords is not a bug.  Currently, it is perfectly legal FITS, if  
questionable practice.  This cannot now be rescinded (except with  
some form of HDU-level versioning, I still assert).  But even if  
duplicate keywords were illegal FITS, the question remains of what  
FITS software should do upon encountering them - and how our code  
should recognize the fact in the first place.
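
As a rough illustration of the recognition step, here is a minimal
Python sketch that scans a header block of 80-character card images
and reports keywords that occur more than once.  The function name
find_duplicate_keywords and the helper pad_card are made up for this
example and are not taken from any existing FITS library; the sketch
also assumes that only COMMENT, HISTORY and blank-keyword cards are
allowed to repeat.

```python
from collections import defaultdict

CARD_LEN = 80
# Commentary keywords may legitimately repeat, so they are not counted.
COMMENTARY = {"COMMENT", "HISTORY", ""}


def find_duplicate_keywords(header: str) -> dict[str, list[int]]:
    """Map each repeated keyword to the 0-based card indices where it occurs."""
    seen: dict[str, list[int]] = defaultdict(list)
    for start in range(0, len(header), CARD_LEN):
        keyword = header[start:start + 8].strip()
        if keyword == "END":
            break
        if keyword not in COMMENTARY:
            seen[keyword].append(start // CARD_LEN)
    return {k: v for k, v in seen.items() if len(v) > 1}


def pad_card(text: str) -> str:
    """Pad a card image to 80 characters (helper for this example only)."""
    return text.ljust(CARD_LEN)


header = "".join(pad_card(c) for c in [
    "SIMPLE  =                    T",
    "BITPIX  =                   16",
    "NAXIS   =                    0",
    "EXPTIME =                 30.0",
    "COMMENT first note",
    "COMMENT second note",
    "EXPTIME =                 60.0",
    "END",
])
print(find_duplicate_keywords(header))  # {'EXPTIME': [3, 6]}
```

The scan itself is cheap; the hard part is deciding what to do with
the result.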

> Requiring all software systems to follow the same behavior is not  
> practical, so the only sure way to prevent users from getting an  
> incorrect result when analyzing the file is to eliminate duplicate  
> keywords in the first place.

You cannot avoid the question of what software is required to do by  
outlawing data.  Those data can and will continue to be presented as  
input to our software.  Perhaps there is some notion that we'll  
require all archives and data providers to scrub their data.  This is  
at least as impractical a requirement as you describe for the  
software - more to the point, who will data providers turn to for  
software to perform such scrubbing?  One way or the other, if we  
tackle this issue our software will have to detect duplicate keyword  
instances and take some action as a result.

> There is less harm if the duplicated keywords all have the same  
> value, so maybe the wording of this requirement should be modified  
> to take this into account.

This strikes me as the sort of contingent action that indicates the  
primary action is ill conceived.  As far as the software is concerned,  
it is simply another requirement placed on top of the first.  Look for  
duplicate keyword names, then look for duplicate values - would the  
next step be a test for duplicate comments?

> some of your FOREIGN extensions have the order of these 2 keywords  
> reversed.

We'll look into the behavior you describe.  I would expect most  
extension types, including FOREIGN, to be able to conform to this  
stricter keyword ordering whether it is required or merely preferred.

In addition to clarifying the ordering of PCOUNT/GCOUNT, this may be  
a good time to state this more clearly for all the mandatory keywords  
(section 4.4).  In particular, the ordering of NAXISn is never  
explicitly restricted to increasing numerical order.  The only  
statement for any of the mandatory keywords is presented in table  
4.5, which suggests NAXISn be ordered, but never outright says it.
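
To make the ordering rule concrete, the following Python sketch checks
whether the NAXISn keywords in a header appear as NAXIS1, NAXIS2, ...
with no gaps.  The function names are hypothetical.  The sketch checks
only the order of the NAXISn keywords relative to one another; it does
not compare them against the value of NAXIS or check their position
relative to the other mandatory keywords.

```python
import re

# Matches NAXIS1, NAXIS2, ... but not the bare NAXIS keyword.
NAXISN = re.compile(r"NAXIS([1-9][0-9]*)$")


def naxisn_sequence(keywords: list[str]) -> list[int]:
    """Return the axis numbers of NAXISn keywords in the order they appear."""
    return [int(m.group(1)) for k in keywords if (m := NAXISN.match(k))]


def naxisn_is_ordered(keywords: list[str]) -> bool:
    """True if the NAXISn keywords appear as NAXIS1, NAXIS2, ... with no gaps."""
    seq = naxisn_sequence(keywords)
    return seq == list(range(1, len(seq) + 1))


print(naxisn_is_ordered(["SIMPLE", "BITPIX", "NAXIS", "NAXIS1", "NAXIS2"]))  # True
print(naxisn_is_ordered(["SIMPLE", "BITPIX", "NAXIS", "NAXIS2", "NAXIS1"]))  # False
```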

>>>   3. Embedded space characters are now forbidden within numeric
>>>      values in an ASCII Table (e.g.  "1 23 4.5"  is no longer
>>>      allowed to represent the decimal value 1234.5)
>>
>> Again - are there any examples of such usage in the field?
>
> No, as far as we know.  If there are any, then it is very likely  
> that most current software systems do not support embedded spaces  
> in the value and will silently read an incorrect value, or will  
> exit with an error.  Thus, it seems better to me to outlaw this  
> usage rather than just not recommend it or deprecate it.

Again, the question is whether it is more productive to attempt to  
outlaw something or to describe what steps software should take upon  
encountering the usage.  If there are no known instances, "outlawing"  
is equivalent to clarifying the standard.  This is likely such a  
case.  If there are many instances, I don't think we can escape from  
taking a position on what the software should do.
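
The two positions can be put side by side in a short Python sketch.
The function parse_ascii_field and its allow_embedded_spaces flag are
invented for this example; the sketch handles only plain decimal text
and makes no attempt to cover Fortran-style D exponents or other
details of real ASCII table fields.

```python
def parse_ascii_field(field: str, allow_embedded_spaces: bool = False) -> float:
    """Parse a numeric ASCII table field, optionally ignoring embedded blanks."""
    text = field.strip()
    if " " in text:
        if not allow_embedded_spaces:
            # "Outlaw" position: report the field rather than guess at a value.
            raise ValueError(f"embedded space in numeric field: {field!r}")
        # Lenient position: squeeze the blanks out before converting.
        text = text.replace(" ", "")
    return float(text)


print(parse_ascii_field("1 23 4.5", allow_embedded_spaces=True))  # 1234.5
```

Under the strict setting the same field raises ValueError, so at least
the value is never read silently as something else.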

> I don't really see any practical benefit to having a version  
> keyword.  Either the software will support a new requirement, or it  
> won't; the presence of a version (or DATE) keyword isn't really  
> helpful, except maybe to a human reading the header.

I don't understand.  The software would interpret the version to know  
if the new requirement should be enforced for a particular HDU.  In  
the absence of such versioning (by token or date), the software has  
to follow some sloppy heuristic to let the nuances of the data guide  
its behavior.  The other two new requirements on the table strike me  
as clarifications and can go forward without versioning, perhaps with  
some tweaking of the language.  I'm not sure about the EXTEND keyword.
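
As a sketch of what interpreting the version might look like in code:
the keyword name FITSVERS and the threshold BAN_VERSION below are
purely hypothetical.  No such keyword exists in the standard, and the
header is modelled as a plain dictionary for brevity.

```python
# FITSVERS and BAN_VERSION are hypothetical, used only for illustration.
VERSION_KEYWORD = "FITSVERS"
BAN_VERSION = 3


def must_reject_duplicates(header: dict[str, int]) -> bool:
    """Enforce the ban only for HDUs that declare a version at or above it."""
    version = header.get(VERSION_KEYWORD)
    # HDUs with no version keyword predate the ban and are grandfathered in.
    return version is not None and version >= BAN_VERSION


print(must_reject_duplicates({"BITPIX": 16}))                  # False
print(must_reject_duplicates({"BITPIX": 16, "FITSVERS": 3}))   # True
```

Without some such token, the same decision has to be made by guessing
from the contents of the header.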

I'm not a big fan of introducing versioning myself, but the clear  
implication of avoiding versioning is that duplicate keywords cannot  
be gracefully banned after the fact.  Indeed, consider a situation  
in which the choice had been made to ban them back in the FITS  
Dreamtime - exactly the same stringent software requirements would  
apply: detect instances and take application-dependent action.   
Our libraries and applications would be more complex now as a  
result.  (Arguably better, but certainly more complex.)  Banning  
duplicates doesn't avoid significant new software requirements, it  
mandates them.

> The proposed new statement ("Existing FITS files that conformed to  
> the latest version of the standard at the time the files were  
> created are expressly exempt from any new requirements imposed by  
> subsequent versions of the standard.") is, I think, mainly intended  
> as a political statement to reassure institutions that the FITS  
> committees are not imposing new unfunded mandates that require  
> modifications to existing FITS archives.  I don't see this  
> statement as having much relevance to the way software is implemented.

You can't avoid the unfunded mandate this way.  Any software seeking  
to follow the letter of the standard would still have to detect  
instances of duplicate keywords and take some action.  What  
statements like this do is to encourage folks to treat the standard  
as some floppy set of guidelines and conformance to the standard as  
an optional nicety for polite society.

A file either conforms to the FITS standard or it does not.  A ban on  
duplicate keywords is unenforceable unless it is paired with  
versioning.  The statement above would fail to impress a lawyer since  
it isn't paired with a way for either humans or computers to  
determine which files were grandfathered in.  Further, there is a  
sense of legal entrapment in promulgating such a new requirement with  
no realistic way to encourage instrument teams and others to redesign  
their systems to avoid duplicates.  For instance, the ICE/ccdacq  
software permits observers to enter their own file of keywords,  
perhaps including duplicates.  Users can trivially use IRAF hedit to  
add duplicates, etc.  Perhaps there is no way to duplicate a keyword  
with CFITSIO?  Who would enforce the ban?

In any event, the FITS standard should be kept free of political  
statements.

> This is missing the main point of this new requirement.  No current  
> software system that I am aware of (except for the FITS verifier  
> code) checks for duplicated keywords, so users have no idea which  
> of the duplicated keywords is being used by a particular program.   
> The software might be using the first, the 'next', or the last  
> instance of the keyword.

Well, as I said, iSTB throws an error if duplicate structural  
keywords are encountered.  After 10 million files, I don't think I've  
ever seen this particular error in BITPIX, NAXISn, PCOUNT, GCOUNT or  
XTENSION.  We did just happen to see duplicate SIMPLE keywords while  
commissioning a new instrument.  The problem was detected, reported  
and fixed.  On the other hand, there are numerous ongoing examples of  
duplicated user keywords.  It seems to me that applications should  
only be sensitive to header abnormalities that affect their own  
functionality.
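
A check restricted to the keywords an application actually depends on
might look like the Python sketch below.  This is not iSTB's
implementation; the STRUCTURAL set and the function names are
assumptions made for the example.

```python
from collections import Counter

# Keywords this hypothetical application treats as structural.
STRUCTURAL = {"SIMPLE", "BITPIX", "NAXIS", "PCOUNT", "GCOUNT", "XTENSION"}


def is_structural(keyword: str) -> bool:
    """True for the fixed structural keywords and for NAXISn."""
    return keyword in STRUCTURAL or (
        keyword.startswith("NAXIS") and keyword[5:].isdigit()
    )


def duplicated_structural(keywords: list[str]) -> list[str]:
    """Return the structural keywords that occur more than once, sorted."""
    counts = Counter(k for k in keywords if is_structural(k))
    return sorted(k for k, n in counts.items() if n > 1)


names = ["SIMPLE", "SIMPLE", "BITPIX", "NAXIS", "OBSERVER", "OBSERVER"]
print(duplicated_structural(names))  # ['SIMPLE']
```

The duplicated user keyword OBSERVER is ignored here because it does
not affect how the application reads the data.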

Instituting an absolute ban is meaningless unless all our software  
systems become aware of all possible duplicates.  We can't just dump  
the responsibility on the users to avoid creating them in the first  
place unless our own software that they are using to create or update  
the HDUs aids in that task.

This ban is attempting to avoid placing natural requirements on  
software by placing unnatural ones on the data.  Not only is it  
unenforceable - the software requirements just pop up again elsewhere.

> This could easily cause the user to derive incorrect scientific  
> results.  What is the best way to prevent this from happening?

This is the heart of the matter.  As Dick says, there is no single  
simple solution.  We should encourage data providers (and users) to  
avoid duplicate keywords.  We should understand why such keywords may  
be created in the first place.  Our major software packages should  
reach agreement on a common strategy should duplicates be encountered  
- whether that is to leave the behavior indeterminate, or to have the  
first instance or the last instance take precedence.  Applications  
should detect duplicates that affect their functionality, as with any  
other header peculiarity.  Libraries should provide routines and  
utility programs for validating HDUs against a wide variety of  
exceptions, including duplicate keywords.
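
The candidate strategies can be sketched as an explicit policy that a
library exposes to its callers.  The names DuplicatePolicy and resolve
are hypothetical, and the header is reduced to a list of
(keyword, value) pairs for the sake of the example.

```python
from enum import Enum


class DuplicatePolicy(Enum):
    FIRST = "first"   # first instance takes precedence
    LAST = "last"     # last instance takes precedence
    ERROR = "error"   # refuse to choose between instances


def resolve(cards: list[tuple[str, str]], keyword: str,
            policy: DuplicatePolicy) -> str:
    """Return the value of keyword under the given duplicate policy."""
    values = [value for name, value in cards if name == keyword]
    if not values:
        raise KeyError(keyword)
    if len(values) > 1 and policy is DuplicatePolicy.ERROR:
        raise ValueError(f"{keyword} appears {len(values)} times")
    return values[-1] if policy is DuplicatePolicy.LAST else values[0]


cards = [("EXPTIME", "30.0"), ("FILTER", "V"), ("EXPTIME", "60.0")]
print(resolve(cards, "EXPTIME", DuplicatePolicy.FIRST))  # 30.0
print(resolve(cards, "EXPTIME", DuplicatePolicy.LAST))   # 60.0
```

Whichever policy is chosen, making it explicit at least tells the
user which value a given program used.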

A duplicated keyword is just one of a long list of poor header  
construction techniques that can't be fixed simply by demanding they  
not occur.

> Seems to me we should focus on the root of the problem and  
> (formally at least) disallow duplicated keywords in a conforming  
> FITS file.  This doesn't mean software should automatically throw  
> out a file that inadvertently has a duplicated keyword.

"Formal" is the essence of a standard.  I guess the notion is that  
deprecation hasn't proven strong enough so perhaps an absolute ban  
might do the trick?  In the absence of practical consequences, what  
this really does is call the integrity of the standard into question.

> I think the seriousness of this problem depends on what keyword is  
> duplicated.  If it is just some observatory-specific keyword that  
> does not directly affect the scientific results, then it does not  
> matter very much, and data providers need not worry about it.  But  
> if a critical WCS keyword, or exposure time keyword is duplicated  
> in the file with different values, then surely the data providers  
> need to take responsibility and fix the problem.

Whether the issue is duplicate keywords or some other keyword  
misformatting, data providers are already under more pressure to fix  
significant occurrences than this technical change to the FITS  
standard would apply.  On the other hand, for the much more frequent  
case of unintentionally duplicating some non-critical keyword, this  
change would be outlawing files for no benefit and a lot of  
annoyance.  In either case, the software faces exactly the same  
requirements.

Rob