[fitsbits] Associating ancillary data with primary HDU

Fri May 23 10:19:08 EDT 2014

Hi Stein,

> We will need the capability to have multiple "data extensions" in a single file,

Ok, this is no problem.

> allowing an arbitrary set of keywords to vary with any subset of the data dimensions.

I might rather say something like "quantities" or "variables" than keywords.  "Keyword" means something specific in FITS.

> A typical example would be XPOSURE in observations where automatic exposure control is in effect.

So in this case you wouldn't necessarily be denoting the quantity as a FITS keyword token like "XPOSURE" but rather as "exposure_time", "time.exposure", or "exptime".  (In any event it isn't clear why such a keyword name would be truncated since "EXPOSURE" fits into 8 characters :-)

> I.e. we could have a data extension w/dimensions (x,y,time) = (120,120,500), with XPOSURE varying with time, but constant for each (x,y) plane.

A data cube seems reasonable.

> Our current suggestion is to store the values either in a bintable extension or in a regular extension (we prefer regular extensions). The extension in this case would contain an array w/dimensionality (1,1,500).

There's nothing irregular about a bintable ;-)  Say rather an IMAGE extension versus a BINTABLE.  For expressing a single such vector, the choice of HDU type makes no dramatic difference.
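
As a rough illustration of the (1,1,500) layout, here is a minimal NumPy sketch, not tied to any particular FITS library.  NumPy orders axes in reverse of FITS, so a FITS (x,y,time) = (120,120,500) array shows up with shape (500,120,120) and the (1,1,500) vector with shape (500,1,1).  The array names, the fake values, and the count-rate normalization are all assumptions made for the example.

```python
import numpy as np

# Hypothetical stand-ins for arrays read from two HDUs.
# FITS axes (x, y, time) = (120, 120, 500) appear reversed in NumPy.
cube = np.ones((500, 120, 120), dtype=np.float32)         # data cube
exposure = np.linspace(1.0, 2.0, 500, dtype=np.float32)   # one value per plane
exposure = exposure.reshape(500, 1, 1)                    # FITS (1, 1, 500)

# Broadcasting applies each plane's exposure time to every (x, y) pixel.
rate = cube / exposure

print(rate.shape)
```

The singleton axes are what let the vector line up with the cube without any explicit loop over planes.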

And, of course, a native IMAGE array might well be tile-compressed into a BINTABLE.  I'd strongly recommend thinking about compression early in the process.

> Keywords that are constant in time but vary in the (x,y) plane would be in extensions with dimensions (120,120,1), etc. This seems like an "obvious" solution to us.

This is a typical usage for masks, and a 3-D data cube might well have a 2-D mask projected along any of the axes.
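
A minimal sketch of the mask case, again in plain NumPy with reversed axis order, so a FITS (120,120,1) array has shape (1,120,120).  The flagged pixel and the use of NaN to mark masked values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical stand-ins; FITS (x, y, time) = (120, 120, 500)
# is NumPy shape (500, 120, 120).
cube = np.ones((500, 120, 120), dtype=np.float32)
mask = np.zeros((1, 120, 120), dtype=bool)   # FITS (120, 120, 1): constant in time
mask[0, 10, 20] = True                       # flag one bad pixel at y=10, x=20

# The 2-D mask broadcasts along the time axis, so the flagged pixel
# is replaced by NaN in every plane of the cube.
cleaned = np.where(mask, np.nan, cube)

print(int(np.isnan(cleaned).sum()))
```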

> This must also work with non-mandatory keywords, though. So there needs to be a way to signal that "this keyword is not actually missing, but you can find it in a separate (bintable or regular) extension".

Here's where you're losing me and likely others.  There's nothing magic about FITS keywords as a vehicle for science metadata.  And ultimately metadata and data are not intrinsically different concepts.  A complex structured data object - FITS or otherwise - will relate arrays and scalars, vectors and cubes.  For a particular instrument or pipeline the various quantities may be deemed to be dependent or independent variables whether or not they're expressed as keywords, columns or pixels/voxels.

I guess what you're saying is that sometimes such a file might have a scalar exposure time, constant across all planes of the time-series cube?  And in other instances the exposure time may vary plane-to-plane?  And that you don't want to be obligated to express a scalar as a vector?

> We do not wish to make it necessary to gobble up the entire file in order to search for such potential tabulated keywords. So we propose a keyword named TABULATE, TABULATD ("tabulated"), or TABULATK ("tabulated keywords") containing a comma-separated list of keywords that are handled with this mechanism.

This is really starting to sound like a job for a BINTABLE with a coherent purpose-designed schema.  FITS keywords are one simple way to lay out metadata.  But it isn't hard to exceed the comfortable mapping of a flat FITS header onto the job in question.

The BINTABLE could define some specific or perhaps general-purpose keywords for cases in which one or more of the columns should be omitted in favor of scalars in the table header.  Alternatively, just always express such quantities as column vectors.  FPACK will squeeze all the duplicated values down with a high compression factor.
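
To give a feel for how cheaply duplicated values compress, here is a small sketch using Python's zlib.  This is only a stand-in: it does not invoke fpack, which has its own choice of algorithms, and the 500-row column of identical exposure times is an assumed example.

```python
import struct
import zlib

# 500 identical big-endian float32 exposure times: a column of duplicates.
n_rows = 500
column = struct.pack(">%df" % n_rows, *([2.5] * n_rows))

# zlib is only a stand-in here; it is not what fpack runs internally.
packed = zlib.compress(column)

print(len(column), len(packed))
```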

Your code to read these files would otherwise be complicated by conditionals if sometimes the EXPTIME is a vector and other times a scalar.
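
A sketch of the branching a reader would need if exposure time may arrive in either form.  The function name `exposure_per_plane` is hypothetical, and where the value comes from (header keyword or table column) is left out of the example.

```python
import numpy as np

def exposure_per_plane(value, n_planes):
    """Return exposure times as a vector with one entry per plane.

    `value` may be a scalar (e.g. from a header keyword) or a
    per-plane sequence (e.g. from a table column).
    """
    arr = np.asarray(value, dtype=float)
    if arr.ndim == 0:
        # Scalar case: repeat the single value for every plane.
        return np.full(n_planes, float(arr))
    if arr.shape != (n_planes,):
        raise ValueError("expected one exposure time per plane")
    return arr

print(exposure_per_plane(2.5, 3))
print(exposure_per_plane([1.0, 2.0, 4.0], 3))
```

If the files always carry a vector, the scalar branch disappears and downstream code can treat every file the same way.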

> ...
> However, this would *require* a separate keyword extension for all tabulated keywords in *each* data extension,

Or the various elements of your schema could map to columns (or variable length arrays perhaps) in a single BINTABLE.

> In e.g. a fits file containing *many* spectral window extracts from a spectrometer, this could potentially mean *many* repetitions of the same keyword extension for e.g. XPOSURE, temperatures, etc!

Or a single compact but comprehensive BINTABLE.  Whatever the type of extensions you use, it becomes inefficient once the size of the data records falls below some threshold relative to the size of the header records.  And searching through the file becomes inefficient if there are large numbers of small HDUs.  That said, FITS files with hundreds of HDUs can be reasonably efficient to handle - if the types and purposes of the HDUs don't get too elaborate.
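
A back-of-the-envelope sketch of that threshold.  FITS headers and data units are each padded to whole 2880-byte blocks, and header cards are 80 bytes each.  The helper name `hdu_bytes` and the example HDU (40 cards including END, 500 float32 values) are assumptions for illustration.

```python
BLOCK = 2880   # FITS block size in bytes
CARD = 80      # bytes per header card

def hdu_bytes(n_cards, data_bytes):
    """Return (header_size, data_size) of one HDU after block padding."""
    header_blocks = (n_cards * CARD + BLOCK - 1) // BLOCK
    data_blocks = (data_bytes + BLOCK - 1) // BLOCK
    return header_blocks * BLOCK, data_blocks * BLOCK

# Hypothetical small HDU: 40 header cards and 500 float32 values.
header, data = hdu_bytes(40, 500 * 4)
print(header, data)
```

With these assumed numbers the padded header takes two blocks and the data only one, so the header occupies more of the file than the data it describes.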

> All of this could, of course, be built into a standard routine to read keywords.

FITS header keywords are suitable for simple metadata chores.  Your requirements don't sound that simple.  Some combination of BINTABLEs and mask-like IMAGE extensions should do the job, tied together with a relatively limited set of keywords that aren't being strained beyond their native capabilities.

> We believe that whatever we (a workgroup in an EU project called SOLARNET) adopt as a recommendation would "quite soon" be used by "quite a few" solar processing pipelines. Our recommendation is due in fall, but that won't necessarily be carved in stone, since the pipelines would not yet have been implemented and used.

So it sounds like you have some time - and that there is a premium on choosing a scalable format that is understandable to groups who weren't involved in implementing the format.

Rob Seaman
NOAO