[fitsbits] Five FITS Proposals
Lucio Chiappetti
lucio at lambrate.inaf.it
Tue Oct 29 10:35:05 EDT 2013
On Mon, 28 Oct 2013, William Pence wrote:
> I encourage everyone who has an opinion on these matters to speak up,
I apologize for the fact I haven't done so earlier. I will now collect a
mixed bag of comments (to Bill's "5 proposals", to some of other people's
comments, to "perceived general shortcomings", to my own perception ...),
so more apologies if you find it messy :-)
----------------------------------------------------------------------
I'll first start with my own comments to Bill's 5 proposals.
(1) the proposal to have long keyword names is substantially sound,
especially as far as backward compatibility is concerned, and I would
say also well motivated (see for instance how contrived some WCS kwd
names are).
The limit to 54 char to leave space for a full precision double is
somewhat arbitrary if proposal (1) is read together with (2), i.e.
continuation kwds. Should we impose a limit at all ?
In practical life I would be the first to recommend against usage
of EXCESSIVELY long keyword names (mainly because I do not
overemphasize the usage of "human readable commentary" kwds vs
"computer readable" kwds used AND NEEDED by s/w, or by specific s/w
packages), but I do not think a hard limit is really necessary.
Just trust common sense ?
(2) this proposal for long kwd values is also substantially sound as
far as backward compatibility is concerned, and I admit the need for
this is motivated. I am however not convinced this is necessarily
the best solution.
If it were, one should possibly (as done for other kwds, and as is
common in some language specs) define a maximum number of
continuation kwds (e.g. from _1 to _999).
Also I am not convinced that the backslash \ as last character of
the string inside quotes is the best choice. I am afraid some
readers or languages may be confused by the common usage of \ as
escape, and interpret \' as an escaped quote, i.e. verbatim as a
literal single quote.
Other mechanisms ? like some special character at the beginning
(like "column 6" in Fortran77, or first character blank as in
e-mail headers ?)
Of course we should then give up some of the unnecessary freedom
about keyword order. Really, I've never understood why one should
not assume that readers preserve keywords in the order in which
they appear in the header.
How many languages do we have where the NAME of the variable
INSIDE the program changes dynamically according to the name of
the kwd ? I can think only of IDL (actually I have some IDL
routines which read some non-FITS files of mine, which mimic FITS
headers, into structures and substructures).
For instance an image could be read into a structure a with the
data array in a.data and the keywords in a.naxis1, a.naxis2 etc.,
while a table could be read into a structure b with some keywords
directly under it (e.g. b.naxis2), while others give their name to
substructures of b.data (e.g. b.data.pinco will be the table
column for TTYPE1='pinco' and b.data.panco for TTYPE2='panco').
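As a rough illustration of the kind of mapping just described (in Python rather than IDL; the function name to_struct and the sample header are made up for this sketch and belong to no FITS library), keywords become attributes of the structure and the TTYPEn values give the names of the sub-fields of data:

```python
from types import SimpleNamespace

def to_struct(header, columns=None):
    """Build an object whose attribute names come from keyword names.

    header  : dict of keyword -> value (hypothetical, already parsed)
    columns : optional list of column data, one entry per TTYPEn keyword
    """
    struct = SimpleNamespace()
    data = SimpleNamespace()
    for key, value in header.items():
        if key.startswith("TTYPE") and columns is not None:
            # The VALUE of TTYPEn becomes the NAME of a sub-field of .data
            index = int(key[len("TTYPE"):]) - 1
            setattr(data, value, columns[index])
        else:
            # Any other keyword sits directly under the structure
            setattr(struct, key.lower(), value)
    if columns is not None:
        struct.data = data
    return struct

b = to_struct({"NAXIS2": 3, "TTYPE1": "pinco", "TTYPE2": "panco"},
              columns=[[1, 2, 3], [4.0, 5.0, 6.0]])
print(b.naxis2, b.data.pinco, b.data.panco)
```

The point is only that the names used inside the program are decided by the file content, which few languages make this easy.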
Anyhow proposals 1-2 match with what was historically done in cases like
e-mail headers (which originally were in RFC822 Keyword: value in an
80-char record, and now allow continuation lines), or in the
transition from Fortran 77 to Fortran 90 and later.
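On the backslash worry raised under (2), a small sketch may make the concern concrete. It assumes a value string ending in \ just before the closing quote, and compares a reader following the FITS rule (a quote inside a string is written as two quotes) with a hypothetical reader that treats \ as an escape character. Both regular expressions are illustrative only, not taken from any existing FITS reader.

```python
import re

# FITS rule: a single quote inside a string is written as two quotes ('')
FITS_STRING = re.compile(r"'((?:[^']|'')*)'")

# Hypothetical reader that treats backslash as an escape character
ESCAPING_STRING = re.compile(r"'((?:[^'\\]|\\.)*)'")

# Value field of a card whose string ends with a backslash
card_value = r"'first part of a long value\' / comment"

fits_match = FITS_STRING.match(card_value)
escaping_match = ESCAPING_STRING.match(card_value)

# The FITS-style reader sees a complete string ending in a backslash
print(fits_match.group(1))
# The escaping reader swallows \' and never finds the closing quote
print(escaping_match)
```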
(3) proposal for additional characters in kwd names
This proposal too goes in a direction similar to the Fortran 77 to
Fortran 90 transition.
I appreciate that mixed case (specially camelCase) may improve
the legibility of long kwds (and that the dot may improve the
"structuring" of some sort of hierarchical structure).
On one hand the proposal may be "not enough", on the other it might
be "too much".
- not enough in the sense of not supporting other punctuation marks
which are part of 7-bit ASCII
- some people may argue, not enough in the sense of not supporting
the ISO-8859 character sets (8-bit extensions of ASCII), or even
UNICODE (but these things, beyond formal "political
correctness", and beyond the difficulties of encoding UTF8 strings
in the current scheme, are needed if at all only for "documentary
kwds", and are better handled by "metadata extensions")
- I may argue too much if we allow lower case but with case
insensitivity ... it seems to me it's looking for trouble.
Anyhow proposals 1-3 go in the direction of "extended headers", which is
something worth pursuing, maybe in other ways.
(4) version numbers.
My first reaction would be "harmless but irrelevant".
Considering that FITS is so flexible that FITS files written according
to a specific convention may be so specific to a dedicated reader that
a generic reader can do little more than listing the header or showing
data in a trivial way ... it would be more appropriate if adoption of
a specific convention (including those of proposals 1-3) is FLAGGED
by a specific kwd (to be placed ideally just after the mandatory
ones).
The reader will then either call a dedicated routine (or spawn
a dedicated reader) or issue a "convention unsupported" message.
One cannot really expect that any reader should read and handle any
FITS file (like Xspec reading radio interferometry, or AIPS reading
X-ray spectra) !
(5) convention for pre-allocating blank header space
This (pseudo-)convention meets the real requirement to be able to
append keywords to a file manipulated in place without the need to
rewrite the entire file.
I have actually perceived this as a limitation or an annoyance
(in fact my own non-FITS FITS-mimicking files were arranged with
(a) a fixed-size miniheader with magic number, size of data area
and size of kwd area; (b) a data area (FITS-compatible but in native
machine endianness) and (c) a keyword area at the end (in that case
with 8-char named kwds with binary values, including array values)).
The IRAF imh+pix separate files mentioned by Rob Seaman were essentially
another way to meet the same kind of need.
So the need is real.
The solution described is a standard-compatible, viable (but not
general) solution ... but it is not and cannot be PART OF THE STANDARD
since ... it does not make a file at all different, but just
issues a USAGE prescription on readers and writers !
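Going back to the flagging idea under (4), the behaviour I have in mind for a reader could be sketched as follows. The keyword name CONVENTN, the convention name and the reader functions are all invented for the sake of the example. Nothing like this is defined anywhere.

```python
def read_generic(header):
    """Fallback: a generic reader can only list what it finds."""
    return "generic listing of %d keywords" % len(header)

def read_long_keywords(header):
    """Stand-in for a dedicated routine handling one convention."""
    return "handled by the long-keyword reader"

# Hypothetical registry: convention name -> dedicated routine
READERS = {"LONGKWD": read_long_keywords}

def open_fits(header, flag="CONVENTN"):
    """Dispatch on a flag keyword (the name CONVENTN is an assumption)."""
    convention = header.get(flag)
    if convention is None:
        # No convention flagged: plain FITS, use the generic reader
        return read_generic(header)
    reader = READERS.get(convention)
    if reader is None:
        raise NotImplementedError("convention unsupported: %s" % convention)
    return reader(header)

print(open_fits({"SIMPLE": True, "CONVENTN": "LONGKWD"}))
```

A file flagged with a convention the reader does not know stops with a clear message instead of being half-read.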
----------------------------------------------------------------------
Perceived general shortcomings
Hmm ... since my own perception is reported below (and in some cases
anticipated above), here I can only report what I "smell" from some of
the postings I've seen (including Bob Hanisch's one with a different
subject).
I may also endorse the following comment by Rob Seaman:
> I don't have a strong feeling either way about these, except that the
> evolution of FITS should be driven by the needs of FITS users, not by
> the critique of those who neither like or use FITS.
- One negative perception I've heard is "not enough stuff in the
WCS e.g. about distortions".
I would say this is a specialized comment which probably points to a
real need which can be dealt with WITHIN FITS (like the rest of FITS, the
WCS is so flexible that it allows handling new projections, conventions
etc.). Possibly with specialized solutions specific to a given
subcommunity or experiment (and not necessarily the only solution:
for instance some X-ray communities used to "linearize" photon
positions instead of accumulating distorted unlinearized images).
- Another negative perception is about "missing UNICODE support".
I do not consider this a very important issue (besides the formalities
of "political correctness"), because the ability to enter strings
in arbitrary scripts is likely to concern only "documentary kwds"
(human-readable commentaries) rather than kwd values used by processing
s/w, which are likely to be numeric.
Anyhow support for Unicode could be fun. I took a quick look at the
problems (in the context of Vatican usage of FITS for scanned
manuscripts), mainly the fact that UTF-8 strings with a given length
in "codepoints" (characters) have a different, "unpredictable" length
in bytes. This seems to indicate that support in header kwds is
problematic, but support in binary tables could be possible. So
this can also be a FITS issue, though a marginal one.
- Another complaint concerns complex data models, tree arrangements,
relations within different data units, and complex metadata.
First of all I endorse this other comment by Rob Seaman
> Also, FITS originated as a data interchange format and has been
> spectacularly successful at this, with a rate of adoption that is the
> envy of other communities. There have always been other formats that
> are used in production workflows.
To this I would add that each specific project may have specific needs,
and also specific (not general) data models. The specific needs could be
met by project-specific FITS conventions (which may look contrived to
other users), or by project-specific non-FITS formats.
Or also by "external arrangements". I've often been skeptical about
complex Multi-Extension-FITS files with LOTS of extensions, and
therefore I would be even more skeptical if one tries to tie together in
a single file (be it a MEF or a non-FITS file) what naturally belongs
to many separate files which can be tied by an external database (or
by a plain FITS table, like e.g. the XMM-Newton CIF [Calibration Index
File]).
- Concerning header data or metadata, I would not overemphasize their
need. I tend to draw a rather clear distinction between
- (meta)data (usually numeric) which is NEEDED for the PROCESSING of the
data. Therefore it has to be (primarily) computer-readable. This is
what fits in a set of header keywords (either as reserved kwds or as
part of a general or project-specific convention) ... but if bulky
may be accommodated in dedicated extensions.
- metadata (usually strings or long strings) which may be nice to have
for informative or documentary purposes (in a broad sense "commentary
keywords"), but are intended mainly to be human-readable (although
they may benefit from being standardized in computer-readable forms).
These are less important, and in particular their absence or
non-compliance with a standard form shall NOT stop processing s/w (and
generic readers) from functioning !
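Coming back to the Unicode point above, the "unpredictable" byte length is easy to show with a few lines of Python. The helper names lengths and fits_in_field are made up for this sketch, and the field width used is arbitrary. Three words with the same number of characters take a different number of bytes once encoded as UTF-8, which is what makes fixed-width header fields awkward.

```python
def lengths(text):
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

def fits_in_field(text, width):
    """True if the UTF-8 encoding of text fits a fixed field of `width` bytes."""
    return lengths(text)[1] <= width

# Same number of characters, different number of bytes:
# plain ASCII, Latin with one accented letter, Cyrillic
words = ["codice", "c\u00f3dice", "\u043a\u043e\u0434\u0435\u043a\u0441"]
for word in words:
    print(word, lengths(word), fits_in_field(word, 6))
```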
----------------------------------------------------------------------
My own perception of shortcomings
I would say that the 2-3 shortcomings I've felt in FITS (one possible way
to overcome those has been using a FITS-mappable non-FITS working format)
are:
- the fact that the header is located at the front, and therefore one cannot
easily add new kwds (typically a long HISTORY sequence) if one modifies
the file in place (which is not always the dominant usage vs plain
reading, creating an edited copy, or creating new files)
An easy solution would be to segregate most of the keywords (mainly all
those of history or documentation type) to an extension at the end of
the FITS file (such an HDU could even be data-less, or one could code
kwds in KEY-VALUE pairs, as independently suggested)
- the fact that keywords are NOT strongly typed, but only heuristically
typed. I know that NAXIS1=768 has to be interpreted as an integer
variable inside the s/w. But if I see ANYKWD=22 in one file or
ANYKWD=23.57 in another, how can I know *from the file alone*
that it has to be interpreted as a real ? And if I see LONGKWD=5.3
in one file, LONGKWD=7.2E42 in another and LONGKWD=1.23456789876543
in another, how can I know *from the file alone* it has to be
interpreted as a double ?
(on the other hand this weakness is a strength which allows coding
precisions beyond current standard internal representations)
Solutions may be to force a syntax (NAXIS=768, ANYKWD=22., LONGKWD=5.3D0)
associated to a type, or to define data dictionaries with lists of kwds.
- last but not least, the absence of array-valued keywords, for instance
for arrays of coefficients. There are currently tricks like indexed kwds
(KWDNAME1, KWDNAME2, ... KWDNAMEn) or encoding in strings within
quotes (ARRAYKWD='1.1,2.2,3.4,99.0')
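The last two points can be made concrete with a short sketch. The typing rule is just one possible reading of the "forced syntax" idea (no decimal point means integer, a decimal point means real, a D exponent means double); it is an assumption of this example, not what the standard prescribes. The two array helpers decode the two tricks mentioned. All function names and the keyword root COEF are invented.

```python
import re

def typed_value(text):
    """Map a value string to (type name, value) under an ASSUMED forced syntax."""
    text = text.strip()
    if re.fullmatch(r"[+-]?\d+", text):
        return "integer", int(text)
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)([Ee][+-]?\d+)?", text):
        return "real", float(text)
    if re.fullmatch(r"[+-]?(\d+\.?\d*|\.\d+)[Dd][+-]?\d+", text):
        # Python has no D exponent: rewrite it as E before converting
        return "double", float(text.upper().replace("D", "E"))
    raise ValueError("not a recognised numeric syntax: %r" % text)

def array_from_string(value):
    """Decode the 'comma-separated list inside a string' trick."""
    return [float(item) for item in value.split(",")]

def array_from_indexed(header, root):
    """Collect ROOT1, ROOT2, ... ROOTn from a dict of keywords, in index order."""
    pattern = re.compile(re.escape(root) + r"(\d+)$")
    found = {}
    for key, value in header.items():
        match = pattern.match(key)
        if match:
            found[int(match.group(1))] = value
    return [found[i] for i in sorted(found)]

print(typed_value("768"), typed_value("22."), typed_value("5.3D0"))
print(array_from_string("1.1,2.2,3.4,99.0"))
print(array_from_indexed({"COEF2": 0.5, "COEF1": 1.0, "NAXIS": 2}, "COEF"))
```

Note that with such a rule a value like 1.23456789876543 would be classed as a plain real unless the writer had added a D exponent, which is exactly the discipline the syntax would impose on writers.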
----------------------------------------------------------------------
Counter-comments to other people's comments
Well, I guess most of them have already been anticipated, and one is
deferred to the section below, where I comment on one of Rob Seaman's
proposals (also covered by some of Preben Grosbol's and Harro
Verkouter's comments).
----------------------------------------------------------------------
Conclusion and counter-proposals
On Thu, 24 Oct 2013, Rob Seaman wrote:
> A FITS file could then have a one-record data-less PDHU containing only
> structural and logistical metadata (CHECKSUM, etc) followed by a
> sequence of imaging or tabular data EHDUs and ending with a metadata
> bin-table EHDU containing the equivalent of what are now expressed as
> header keywords in the primary header.
> [...] could include support for non-ASCII characters (since it would be
> a binary table), the preallocation requirement would be met by having
> the metadata extension at the end of the file
> FITS then looks like:
>
> 1 primary header record (2880 bytes)
> N binary tables containing data
> 1 binary table containing metadata
>
> This assumes that imaging data are tile-compressed (as why shouldn't
> they be? :-)
Well, I guess the latter is really going a little too far. After
all good old plain images are what "the crew were much pleased when they
found it to be / A map they could all understand"
(http://www.poetryfoundation.org/poem/173165).
I also would be reluctant to force the PHDU to be dataless, or to be only
1 2880-byte block long. Although I won't forbid it.
But the essential idea to confine the bulk of the metadata to the last HDU
is indeed matching what I had in mind.
Note that in general I tend to favour small and simple MEFs (with the
least possible number of extensions) as WORKING FILES, and I regard
linking together files necessary for a given analysis a task for some
"data organizer", but I have nothing against packing all related files
together for archiving/distribution.
So I imagine something more flexible (too flexible) like this :
1 PHDU (with or without primary array, essential basic kwds)
1 IHDU (index HDU, optional, see below)
N data HDUs (binary tables or image extensions)
N metadata HDUs (optional, see below)
1 general metadata HDU
or perhaps this :
1 PHDU (with or without primary array, essential basic kwds)
1 IHDU (index HDU, optional, see below)
N couples composed of
1 data HDU (binary tables or image extensions) *
1 metadata HDUs (optional, see below) *
1 general metadata HDU
Each of the items marked with * (data HDU and associated metadata HDU)
prepended with a barebones PHDU could be a standalone file (as working
file). This is just a particular case of the general format.
Each working file has its own metadata HDU which can be modified or
extended as the working file is manipulated. It is likely that when the
working file is "completed" its content is frozen.
Each working file can be inserted in the archive file just stripping the
unnecessary part of its dataless PHDU or merging the necessary kwds with
its data HDU. The data and associated metadata HDUs are concatenated and
inserted in the archive file at a position recorded in the IHDU.
The single-file metadata HDUs are optional, they are necessary only if
there are peculiarities (mainly of documentary nature) which are different
from one file to another. Otherwise the common information can go in the
final metadata HDU.
Some inheritance rule shall be defined (in general a "metakwd" is read
from the final metadata HDU and applies to all units, unless a specific
metadata HDU overrides the metakwd of the same name).
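A minimal sketch of such an inheritance rule, assuming the metakwds of each metadata HDU have already been read into plain dictionaries (the function name effective_metadata and the sample keywords and values are made up):

```python
from collections import ChainMap

def effective_metadata(general, specific=None):
    """Look up metakwds in the specific metadata HDU first, then in the general one."""
    return ChainMap(specific or {}, general)

# Content of the final (general) metadata HDU
general = {"OBSERVER": "someone", "TELESCOP": "generic"}
# Content of the metadata HDU attached to one particular data HDU
specific = {"TELESCOP": "override"}

meta = effective_metadata(general, specific)
print(meta["TELESCOP"], meta["OBSERVER"])
```

A data HDU without its own metadata HDU simply sees the general values.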
The metadata HDUs can be one of the following two things (or even a
combination of the two, with some smart usage of the Green Bank
convention? I am just thinking aloud):
- a dataless HDU with just a long header comprising all kwds of
documentary (or auxiliary or ancillary) nature (e.g. some extract
of satellite HK for instance, as well as history and other comments)
- a plain binary table with two columns, KEYWORD and VALUE
It has to be decided whether long-name, long-value, special-character kwds
(i.e. proposals 1/2/3 + possibly more) will be allowed in any header, or
only in the header of the metadata HDUs, or won't be allowed in any header
but MUST be columns in the metadata binary table.
It has to be decided whether the metadata HDUs are considered just binary
tables (in which case one might need to change the definition of binary
tables, for instance to support UTF8 strings), or they are considered
a NEW SEPARATE type XTENSION='METADATA' (which could share most of
BINTABLE plus additions).
What I call IHDU (index HDU) will be just a table listing all components
(HDUs) of the big file. This can be done with a binary table with many
columns, or with a dataless HDU with indexed keywords (again with some
smart usage of the Green Bank convention), e.g. inventing
EXTNUM0 = 0
XTTYP0 = 'PHDU' (or 'IMAGE') / special for PHDU
XTNAM0 = 'primary' / special for PHDU
XTLOC0 = 0 / byte offset in file
XTSIZ0 = nnnn / size in bytes
EXTNUM1 = 1
XTTYP1 = 'BINTABLE' (or 'METADATA' ?) /
XTNAM1 = 'indextable' / special for index HDU
XTLOC1 = nnnn+1
XTSIZ1 = pppp
EXTNUM2 = 2
XTTYP2 = 'IMAGE' / example IMAGE extension
XTNAM2 = 'whatever' / its EXTNAME
XTLOC2 = nnnn+pppp+1
XTSIZ2 = qqqq
EXTNUM3 = 3
XTTYP3 = 'METADATA' / metadata of above IMAGE
XTNAM3 = 'individual_metadata 2' / tbd
XTASS3 = 2 / associated to item #2
XTLOC3 = nnnn+pppp+qqqq+1
XTSIZ3 = rrrr
EXTNUM4 = 4
XTTYP4 = 'BINTABLE' / example bintable extension
XTNAM4 = 'spectrum' / its EXTNAME
XTLOC4 = nnnn+pppp+qqqq+rrrr+1
XTSIZ4 = ssss
...
EXTNUMn = n / the last HDU
XTTYPn = 'METADATA' / metadata of entire file
XTNAMn = 'general_metadata' / its EXTNAME
XTASSn = -1 / associated to everything
XTLOCn = xxxx
XTSIZn = yyyy
Or in the table one makes columns NUM, EXTTYPE, EXTNAME, EXTLOCATion,
EXTSIZE, EXTASSociation, listing separately HDUs and metadata HDUs.
Or if one groups the HDU and its associated metadata one may have columns
NUM, EXTTYPE, EXTNAME, EXTLOCATion, EXTSIZE, METANUMB, METASIZE
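Just to show how little is needed to fill such an index table, here is a sketch computing the location column from the sizes. It assumes 0-based byte offsets, so each HDU starts exactly where the previous one ends (the keyword example above writes "+1", which would be a different counting choice). The function name build_index and the shortened column names are invented for the example.

```python
from itertools import accumulate

def build_index(hdus):
    """Build index rows from a list of (type, name, size_in_bytes) in file order."""
    sizes = [size for _, _, size in hdus]
    # Each HDU starts at the sum of the sizes of all previous HDUs
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        {"NUM": num, "EXTTYPE": typ, "EXTNAME": name,
         "EXTLOCAT": loc, "EXTSIZE": size}
        for num, ((typ, name, size), loc) in enumerate(zip(hdus, offsets))
    ]

index = build_index([
    ("PHDU", "primary", 2880),
    ("IMAGE", "whatever", 5760),
    ("METADATA", "general_metadata", 8640),
])
for row in index:
    print(row)
```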
This will deal with the requirements of Bill's proposals 1,2,3,5 (and
possibly something else). It is not a complete alternative, because 1,2
and 3 may also be implemented (in metadata extensions only or everywhere).
Don't take it too seriously, it's just thinking aloud.
--
------------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html