[fitsbits] Five FITS Proposals

Tue Oct 29 10:35:05 EDT 2013

On Mon, 28 Oct 2013, William Pence wrote:

> I encourage everyone who has an opinion on these matters to speak up,

I apologize for the fact I haven't done so earlier. I will now collect a 
mixed bag of comments (to Bill's "5 proposals", to some of other people's 
comments, to "perceived general shortcomings", to my own perception ...), 
so more apologies if you find it messy :-)

----------------------------------------------------------------------
I'll first start with my own comments to Bill's 5 proposals.

(1) the proposal to have long keyword names is substantially sound,
     especially as far as backward compatibility is concerned, and I
     would say also well motivated (consider for instance how contrived
     some WCS kwd names are).

     The limit to 54 char to leave space for a full precision double is
     somewhat arbitrary if proposal (1) is read together with (2), i.e.
     continuation kwds. Should we impose a limit at all ?

     In practical life I would be the first to recommend against usage
     of EXCESSIVELY long keyword names (mainly because I do not
     overemphasize the usage of "human readable commentary" kwds vs
     "computer readable" kwds used AND NEEDED by s/w or by specific s/w
     packages), but I do not think a hard limit is really necessary.
     Just trust common sense ?

(2) also this proposal for long kwd values is substantially sound as
     far as backward compatibility is concerned, and I admit the need
     for this is motivated. I am however not convinced this is
     necessarily the best solution.

     If it were, one should possibly (as done for other kwds, and as
     common in some language spec) define a maximum number of
     continuation kwds (e.g. from _1 to _999).

     Also I am not convinced that the backslash \ as last character of
     the string inside quotes is the best choice. I am afraid some
     readers or languages may be confused by the common usage of \ as
     escape, and interpret \' as an escaped prime, i.e. verbatim as a
     single prime.
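
     The concern can be illustrated with a minimal sketch. It assumes a
     hypothetical reader that applies C-style backslash escaping inside
     quoted values; find_closing_quote is an illustrative helper, not
     part of any FITS library.

```python
def find_closing_quote(text, backslash_escapes):
    """Return the index of the quote closing a string opened at text[0].

    This sketch ignores the FITS rule that a doubled quote ('') stands
    for an embedded quote; it only contrasts the two readings of a backslash.
    """
    i = 1
    while i < len(text):
        if backslash_escapes and text[i] == "\\":
            i += 2              # treat the next character as escaped
            continue
        if text[i] == "'":
            return i
        i += 1
    return -1                   # no closing quote found


# A value whose last character inside the quotes is a backslash.
card_value = "'first part of a long value\\'"

plain = find_closing_quote(card_value, backslash_escapes=False)
c_like = find_closing_quote(card_value, backslash_escapes=True)
print(plain, c_like)
```

     The plain reading finds the closing quote at the end of the value,
     while the escape-aware reading never finds one and would treat the
     string as unterminated.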

     Other mechanisms ? like some special character at the beginning
     (like "column 6" in Fortran77, or first character blank as in
     e-mail headers ?)

     Of course we should drop the unnecessary rule that keyword order
     carries no meaning. Really, I've never understood why one should
     not assume that readers preserve keywords in the order in which
     they appear.

     How many languages do we have where the NAME of the variable
     INSIDE the program changes dynamically according to the name of
     the kwd ?  I can think only of IDL (actually I have some IDL
     routines which read some non-FITS files of mine, which mimic FITS
     headers, into structures and substructures).

     For instance an image could be read in a structure a with the
     data array in a.data and the keywords in a.naxis1, a.naxis2 etc.
     while a table can be read in a structure b with some keywords
     directly under it (e.g. b.naxis2), while others give names to
     substructures of b.data (e.g. b.data.pinco will be the table
     column for TTYPE1='pinco' and b.data.panco for TTYPE2='panco')
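
     A rough Python analogue of that IDL layout, as a sketch only: the
     function name table_to_namespace and the dictionary standing in
     for a parsed header are assumptions made for illustration.

```python
from types import SimpleNamespace


def table_to_namespace(header, columns):
    """Keywords become attributes; each TTYPEn value names a column under .data."""
    b = SimpleNamespace(data=SimpleNamespace())
    for key, value in header.items():
        if key.startswith("TTYPE"):
            index = int(key[5:]) - 1          # TTYPE1 -> first column
            setattr(b.data, value, columns[index])
        else:
            setattr(b, key.lower(), value)
    return b


header = {"NAXIS2": 3, "TTYPE1": "pinco", "TTYPE2": "panco"}
b = table_to_namespace(header, [[1, 2, 3], [4.0, 5.0, 6.0]])
print(b.naxis2, b.data.pinco, b.data.panco)
```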

Anyhow proposals 1-2 match with what was historically done in cases like 
e-mail headers (which originally were in RFC822 Keyword: value in an 
80-char record, and now allow continuation lines), or in the 
transition from Fortran 77 to Fortran 90 and later.

(3) proposal for additional characters in kwd names

     Also this proposal goes in a sense similar to the Fortran 77 to
     Fortran 90 transition.

     I appreciate that mixed case (specially camelCase) may improve
     the legibility of long kwds (and that the dot may improve the
     "structuring" of some sort of hierarchical structure).

     On one hand the proposal may be "not enough", on the other it might
     be "too much".

     - not enough in the sense of not supporting other punctuation marks
       which are part of 7-bit ASCII
     - some people may argue, not enough in the sense of not supporting
       the other ISO-8859 character sets (8-bit extensions of ASCII), or
       even UNICODE (but these things, beyond formal "political
       correctness", and beyond the difficulties of encoding UTF8 strings
       in the current scheme, are needed if at all only for "documentary
       kwds", and better handled by "metadata extensions")

     - I may argue too much if we allow lower case but with case
       insensitivity ... it seems to me that is asking for trouble.

Anyhow proposals 1-3 go in the sense of "extended headers", which is 
something worth pursuing, maybe in other ways.

(4) version numbers.

     My first reaction would be "harmless but irrelevant".

     Considering that FITS is so flexible that FITS files written according
     to a specific convention may be so specific to a dedicated reader that
     a generic reader can do little more than listing the header or showing
     data in a trivial way ... it would be more appropriate if adoption of
     a specific convention (including those of proposals 1-3) is FLAGGED
     by a specific kwd (to be placed ideally just after the mandatory
     ones).

     Therefore the reader will either call a dedicated routine (or spawn
     a dedicated reader) or issue a "convention unsupported" message
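
     A minimal sketch of such a dispatch follows. The flag keyword
     CONVENT, the convention name LONGKWD and the reader functions are
     all hypothetical, invented here only to show the control flow.

```python
def read_generic(header):
    return "generic reader"


def read_with_long_keywords(header):
    return "reader aware of long keyword names"


# Hypothetical registry: convention name -> dedicated routine.
HANDLERS = {"LONGKWD": read_with_long_keywords}


def dispatch(header):
    convention = header.get("CONVENT")        # hypothetical flag keyword
    if convention is None:
        return read_generic(header)           # no convention flagged
    handler = HANDLERS.get(convention)
    if handler is None:
        return "convention unsupported: " + convention
    return handler(header)


print(dispatch({"SIMPLE": True}))
print(dispatch({"SIMPLE": True, "CONVENT": "LONGKWD"}))
print(dispatch({"SIMPLE": True, "CONVENT": "XYZ"}))
```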

     One cannot really expect every reader to read and handle every
     FITS file (like Xspec reading radio interferometry, or AIPS reading
     X-ray spectra) !

(5) convention for pre-allocating blank header space

     This (pseudo-)convention meets the real requirement to be able to
     append keywords to a file manipulated in place without the need to
     rewrite the entire file.

     I have actually perceived this as a limitation or an annoyance
     (in fact my own non-FITS FITS-mimicking files were arranged with
     (a) a fixed-size miniheader with magic number, size of data area
     and size of kwd area; (b) a data area (FITS-compatible but in native
     machine endianness) and (c) a keyword area at the end (in that case
     with 8-char named kwds with binary values, including array values)).

     Rob Seaman's mentioned IRAF imh+pix separate files were essentially
     another way to meet the same kind of need.

     So the need is real.

     The solution described is a standard-compatible and viable (but not
     general) solution ... but it is not and cannot be PART OF THE
     STANDARD since ... it does not make a file at all different, but
     just issues a USAGE prescription on readers and writers !
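
     How much room a header leaves for in-place additions can be
     estimated with a short sketch. It relies only on the 80-byte card
     and 2880-byte block sizes; the helper names card and
     free_card_slots are illustrative, and the sketch counts both blank
     cards before END and the unused slots after END in the last block.

```python
CARD = 80       # bytes per header card
BLOCK = 2880    # bytes per FITS block, i.e. 36 cards


def card(text):
    """Pad a keyword record to a full 80-byte card."""
    return text.ljust(CARD).encode("ascii")


def free_card_slots(header_bytes):
    """Count cards that could be written without growing the header."""
    cards = [header_bytes[i:i + CARD] for i in range(0, len(header_bytes), CARD)]
    end_index = next(i for i, c in enumerate(cards) if c[:8] == b"END     ")
    blank_before_end = sum(1 for c in cards[:end_index] if c.strip() == b"")
    padding_after_end = len(cards) - end_index - 1
    return blank_before_end + padding_after_end


header = (
    card("SIMPLE  =                    T")
    + card("BITPIX  =                    8")
    + card("NAXIS   =                    0")
    + card("") * 5                  # five pre-allocated blank cards
    + card("END")
).ljust(BLOCK, b" ")                # pad to a whole block

print(free_card_slots(header))
```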

----------------------------------------------------------------------
Perceived general shortcomings

Hmm ... since my own perception is reported below (and in some cases 
anticipated above), here I can only report what I "smell" from some of 
the postings I've seen (including Bob Hanisch's one with a different 
subject).

I may also endorse the following comment by Rob Seaman:

> I don't have a strong feeling either way about these, except that the
> evolution of FITS should be driven by the needs of FITS users, not by
> the critique of those who neither like or use FITS.

- One negative perception I've heard is "not enough stuff in the
   WCS e.g. about distortions".

   I would say this is a specialized comment which probably points to a
   real need which can be dealt with WITHIN FITS (like the rest of FITS,
   the WCS is flexible enough to allow handling new projections,
   conventions etc.), possibly with specialized solutions specific to a
   given subcommunity or experiment (and not necessarily the only
   solution: for instance some X-ray communities used to "linearize"
   photon positions instead of accumulating distorted unlinearized
   images).

- Another negative perception is about "missing UNICODE support".

   I do not consider this a very important issue (besides the formalities
   of "political correctness"), because the ability of entering strings
   in arbitrary scripts is likely to concern only "documentary kwds"
   (human-readable commentaries) more than kwd values used by processing
   s/w which are likely to be numeric.

   Anyhow support for Unicode could be fun. I had a quick look at the
   problems (in the context of Vatican usage of FITS for scanned
   manuscripts), mainly the fact that UTF-8 strings with a given length
   in "codepoints" (characters) have a different, "unpredictable" length
   in bytes.  This seems to indicate that support in header kwds is
   problematic, but support in binary tables could be possible. So
   this can also be a FITS issue, though a marginal one.
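
   The mismatch between codepoints and bytes is easy to show with a
   small sketch (the sample strings are arbitrary, chosen only for
   illustration):

```python
def utf8_lengths(text):
    """Return (number of codepoints, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


samples = [
    "codex",                                   # plain ASCII
    "c\u00f3dice",                             # one accented Latin letter
    "\u03b2\u03b9\u03b2\u03bb\u03af\u03bf\u03bd",   # a Greek word, 7 letters
]

for s in samples:
    print(s, utf8_lengths(s))
```

   A fixed-width field sized in bytes therefore holds a variable number
   of characters, which is what makes the fixed 80-byte header record
   awkward for such strings.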

- Another complaint concerns complex data models, tree arrangements,
   relations within different data units, and complex metadata.

   First of all I endorse this other comment by Rob Seaman

> Also, FITS originated as a data interchange format and has been 
> spectacularly successful at this, with a rate of adoption that is the 
> envy of other communities.  There have always been other formats that 
> are used in production workflows.

   To this I would add that each specific project may have specific needs,
   and also specific (not general) data models. The specific needs could be
   met by project-specific FITS conventions (which may look contrived to
   other users), or by project-specific non-FITS formats.

   Or also by "external arrangements". I've been often skeptical about
   complex Multi-Extension-FITS files with LOTS of extensions, and
   therefore I would be more skeptical if one tries to tie together in
   a single file (be it a MEF or a non-FITS file) what naturally belongs
   to many separate files which can be tied by an external database (or
   by a plain FITS table, like e.g. the XMM-Newton CIF [Calibration Index
   File]).

- Concerning header data or metadata, I would not overemphasize their
   need. I tend to draw a rather clear distinction between

   - (meta)data (usually numeric) which is NEEDED for the PROCESSING of the
     data. Therefore it has to be (primarily) computer-readable. This is
     what fits in a set of header keywords (either as reserved kwds or as
     part of a general or project-specific convention) ... but if bulky
     may be accommodated in dedicated extensions.

   - metadata (usually strings or long strings) which may be nice to have
     for informative or documentary purpose (in broad sense "commentary
     keywords"), but are intended mainly to be human-readable (although
     they may benefit from being standardized in computer-readable forms).
     These are less important, and in particular their absence or
     non-compliance with a standard form shall NOT stop processing s/w
     (and generic readers) from functioning !

----------------------------------------------------------------------
My own perception of shortcomings

I would say that the 2-3 shortcomings I've felt in FITS (one possible way 
to overcome those has been using a FITS-mappable non-FITS working format) 
are:

  - the fact that the header is located at front, and therefore one cannot
    easily add new kwds (typically a long HISTORY sequence) if one modifies
    the file in place (which is not always the dominant usage vs plain
    reading, creating an edited copy, or creating new files)

    An easy solution would be to segregate most of the keywords (mainly
    all the history or documentation type) to an extension at the end of
    the FITS file (such an HDU could even be data-less, or one could code
    kwds in KEY-VALUE pairs, as independently suggested)

  - the fact that keywords are NOT strongly typed, but only heuristically
    typed. I know that NAXIS1=768 has to be interpreted as an integer
    variable inside the s/w. But if I see ANYKWD=22 in one file or
    ANYKWD=23.57 in another, how can I know *from the file alone*
    that it has to be interpreted as a real ? And if I see LONGKWD=5.3
    in one file, LONGKWD=7.2E42 in another and LONGKWD=1.23456789876543
    in another, how can I know *from the file alone* it has to be
    interpreted as a double ?

    (on the other hand this weakness is a strength which allows to code
    precisions beyond current standard internal representations)

    Solutions may be to force a syntax (NAXIS=768, ANYKWD=22.,
    LONGKWD=5.3D0) associated to a type, or to define data dictionaries
    with lists of kwds.
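
    A minimal sketch of the heuristic typing described above, and of
    how a forced syntax removes the guesswork. The function infer_type
    and its type labels are illustrative assumptions, not any library's
    behaviour.

```python
import re

INTEGER = r"[+-]?\d+"
DOUBLE = r"[+-]?(\d+\.\d*|\.\d+|\d+)[Dd][+-]?\d+"
REAL = r"[+-]?(\d+\.\d*|\.\d+)([Ee][+-]?\d+)?|[+-]?\d+[Ee][+-]?\d+"


def infer_type(raw):
    """Guess a type label from the text of a keyword value alone."""
    text = raw.strip()
    if re.fullmatch(INTEGER, text):
        return "integer"
    if re.fullmatch(DOUBLE, text):
        return "double"
    if re.fullmatch(REAL, text):
        return "real"
    return "string"


for value in ["768", "22", "22.", "23.57", "7.2E42", "1.23456789876543", "5.3D0"]:
    print(value, infer_type(value))
```

    Note that 22 is guessed as an integer even if the keyword is meant
    to be real, and that 1.23456789876543 is not recognized as a double:
    only the forced forms 22. and 5.3D0 carry the type in the file.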

  - last but not least, the absence of array-valued keywords, for instance
    for arrays of coefficients. There are currently tricks like indexed kwds
    (KWDNAME1, KWDNAME2, ... KWDNAMEn)  or encoding in strings within
    quotes (ARRAYKWD='1.1,2.2,3.4,99.0')
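
    Both tricks can be unpacked on the reader side along these lines (a
    sketch only: the header is represented by a plain dictionary, and
    collect_indexed and parse_string_array are hypothetical helpers):

```python
import re


def collect_indexed(header, root):
    """Gather ROOT1, ROOT2, ... ROOTn into a list ordered by numeric index."""
    pattern = re.compile(re.escape(root) + r"(\d+)")
    found = {}
    for key, value in header.items():
        match = pattern.fullmatch(key)
        if match:
            found[int(match.group(1))] = value
    return [found[i] for i in sorted(found)]


def parse_string_array(value):
    """Decode a comma-separated list of numbers stored as a string value."""
    return [float(item) for item in value.split(",")]


header = {"KWDNAME2": 2.2, "KWDNAME1": 1.1, "KWDNAME10": 9.0, "OTHER": 5,
          "ARRAYKWD": "1.1,2.2,3.4,99.0"}
print(collect_indexed(header, "KWDNAME"))
print(parse_string_array(header["ARRAYKWD"]))
```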

----------------------------------------------------------------------
Counter-comments to other people's comments

Well, I guess most of them have already been anticipated, and one is 
deferred to the section below, where I comment on one of Rob Seaman's 
proposals (also covered by some comments of Preben Grosbol and Harro 
Verkouter).

----------------------------------------------------------------------
Conclusion and counter-proposals

On Thu, 24 Oct 2013, Rob Seaman wrote:

> A FITS file could then have a one-record data-less PDHU containing only 
> structural and logistical metadata (CHECKSUM, etc) followed by a 
> sequence of imaging or tabular data EHDUs and ending with a metadata 
> bin-table EHDU containing the equivalent of what are now expressed as 
> header keywords in the primary header.

> [...] could include support for non-ASCII characters (since it would be 
> a binary table), the preallocation requirement would be met by having 
> the metadata extension at the end of the file

> FITS then looks like:
>
>       1 primary header record (2880 bytes)
>       N binary tables containing data
>       1 binary table containing metadata
>
> This assumes that imaging data are tile-compressed (as why shouldn't
> they be? :-)

Well, I guess the latter is really going a little too far. After 
all good old plain images are what "the crew were much pleased when they 
found it to be / A map they could all understand" 
(http://www.poetryfoundation.org/poem/173165).

I also would be reluctant to force the PHDU to be dataless, or to be only 
1 2880-byte block long. Although I won't forbid it.

But the essential idea to confine the bulk of the metadata to the last HDU
is indeed matching what I had in mind.

Note that in general I tend to favour small and simple MEFs (with the 
least possible number of extensions) as WORKING FILES, and I regard 
linking together files necessary for a given analysis a task for some 
"data organizer", but I have nothing against packing all related files 
together for archiving/distribution.

So I imagine something more flexible (too flexible) like this :

    1 PHDU (with or without primary array, essential basic kwds)
    1 IHDU (index HDU, optional, see below)
    N data HDUs (binary tables or image extensions)
    N metadata HDUs (optional, see below)
    1 general metadata HDU

or perhaps this :

    1 PHDU (with or without primary array, essential basic kwds)
    1 IHDU (index HDU, optional, see below)
    N couples composed of
      1 data HDU (binary tables or image extensions) *
      1 metadata HDU (optional, see below)           *
    1 general metadata HDU

Each of the items marked with * (data HDU and associated metadata HDU) 
prepended with a barebones PHDU could be a standalone file (as working 
file). This is just a particular case of the general format.

Each working file has its own metadata HDU which can be modified or 
extended as the working file is manipulated. It is likely that when the 
working file is "completed" its content is frozen.

Each working file can be inserted in the archive file just stripping the 
unnecessary part of its dataless PHDU or merging the necessary kwds with 
its data HDU. The data and associated metadata HDUs are concatenated and 
inserted in the archive file at a position recorded in the IHDU.

The single-file metadata HDUs are optional, they are necessary only if 
there are peculiarities (mainly of documentary nature) which are different 
from one file to another. Otherwise the common information can go in the 
final metadata HDU.

Some inheritance rule shall be defined (in general a "metakwd" is read 
from the final metadata HDU and applies to all units, unless a specific 
metadata HDU overrides the metakwd of the same name).
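
Such an inheritance rule could be sketched as a chained lookup, where 
the specific metadata is searched first and the general metadata acts as 
fallback. The keyword names and values below are made up for 
illustration, and effective_metadata is a hypothetical helper.

```python
from collections import ChainMap


def effective_metadata(general, specific=None):
    """Specific metakwds override general ones of the same name."""
    return ChainMap(specific or {}, general)


general = {"OBSERVER": "someone", "TELESCOP": "sometel"}
specific = {"OBSERVER": "someone else"}

meta = effective_metadata(general, specific)
print(meta["OBSERVER"], meta["TELESCOP"])
```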

The metadata HDUs can be one of the following things (or even a 
combination of the two, with some smart usage of the Greenbank convention?
I am just thinking aloud):

  - a dataless HDU with just a long header comprising all kwds of
    documentary (or auxiliary or ancillary) nature (e.g. some extract
    of satellite HK for instance, as well as history and other comments)

  - a plain binary table with two columns, KEYWORD and VALUE

It has to be decided whether long-name, long-value, special-character kwds 
(i.e. proposals 1/2/3 + possibly more) will be allowed in any header, or 
only in the header of the metadata HDUs, or won't be allowed in any header 
but MUST be columns in the metadata binary table.

It has to be decided whether the metadata HDUs are considered just binary 
tables (in which case one might need to change the definition of binary 
tables, for instance to support UTF8 strings), or they are considered 
a NEW SEPARATE type  XTENSION='METADATA' (which could share most of 
BINTABLE plus additions).

What I call IHDU (index HDU) will be just a table listing all components 
(HDUs) of the big file. This can occur with a binary table with many 
columns, or with a dataless HDU with indexed keywords (again with some 
smart usage of the Greenbank convention), e.g. inventing

  EXTNUM0 = 0
  XTTYP0  = 'PHDU' (or 'IMAGE')           / special for PHDU
  XTNAM0  = 'primary'                     / special for PHDU
  XTLOC0  = 0                             / byte offset in file
  XTSIZ0  = nnnn                          / size in bytes
  EXTNUM1 = 1
  XTTYP1  = 'BINTABLE' (or 'METADATA' ?)  /
  XTNAM1  = 'indextable'                  / special for index HDU
  XTLOC1  = nnnn+1
  XTSIZ1  = pppp
  EXTNUM2 = 2
  XTTYP2  = 'IMAGE'                       / example IMAGE extension
  XTNAM2  = 'whatever'                    / its EXTNAME
  XTLOC2  = nnnn+pppp+1
  XTSIZ2  = qqqq
  EXTNUM3 = 3
  XTTYP3  = 'METADATA'                    / metadata of above IMAGE
  XTNAM3  = 'individual_metadata 2'       / tbd
  XTASS3  = 2                             / associated to item #2
  XTLOC3  = nnnn+pppp+qqqq+1
  XTSIZ3  = rrrr
  EXTNUM4 = 4
  XTTYP4  = 'BINTABLE'                    / example bintable extension
  XTNAM4  = 'spectrum'                    / its EXTNAME
  XTLOC4  = nnnn+pppp+qqqq+rrrr+1
  XTSIZ4  = ssss
  ...
  EXTNUMn = n                             / the last HDU
  XTTYPn  = 'METADATA'                    / metadata of entire file
  XTNAMn  = 'general_metadata'            / its EXTNAME
  XTASSn  = -1                            / associated to everything
  XTLOCn  = xxxx
  XTSIZn  = yyyy

Or in the table one makes columns NUM, EXTTYPE, EXTNAME, EXTLOCATion, 
EXTSIZE, EXTASSociation, listing separately HDUs and metadata HDUs.

Or if one groups the HDU and its associated metadata one may have columns 
NUM, EXTTYPE, EXTNAME, EXTLOCATion, EXTSIZE, METANUMB, METASIZE
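
Building such an index amounts to a running sum of HDU sizes. The sketch 
below is only an illustration: it assumes 0-based byte offsets (each HDU 
starts where the previous one ends) and sizes that are already whole 
multiples of 2880 bytes; build_index and the shortened column name 
EXTLOC are invented for the example.

```python
BLOCK = 2880    # FITS block size in bytes


def build_index(hdus):
    """hdus: list of (type, name, size in bytes); returns one row per HDU."""
    rows = []
    offset = 0
    for num, (xtype, name, size) in enumerate(hdus):
        if size % BLOCK != 0:
            raise ValueError("HDU size must be a multiple of 2880 bytes")
        rows.append({"NUM": num, "EXTTYPE": xtype, "EXTNAME": name,
                     "EXTLOC": offset, "EXTSIZE": size})
        offset += size              # next HDU starts right after this one
    return rows


index = build_index([
    ("PHDU", "primary", 2880),
    ("BINTABLE", "indextable", 5760),
    ("IMAGE", "whatever", 8640),
    ("METADATA", "general_metadata", 2880),
])
for row in index:
    print(row)
```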

This will deal with the requirements of Bill's proposals 1,2,3,5 (and 
possibly something else). It is not a full alternative, because 1,2 and 3 
may also be implemented (in metadata extensions only or everywhere).

Don't take it too seriously, it's just thinking aloud.

-- 
------------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html