[fitsbits] Associating ancillary data with primary HDU

Tue May 27 17:49:05 EDT 2014

Hi Stein,

You might be correct that using multiple image extensions for your data, 
and inventing some sort of ad hoc way to associate the various 
extensions with one another (e.g., via your proposed TABULATE keyword), 
may be the simplest solution that meets your requirements.  The 
Hierarchical grouping convention 
(http://fits.gsfc.nasa.gov/registry/grouping.html) does provide a more 
general mechanism for defining hierarchical associations of HDUs, but it 
is rather complex and probably has more features than you really need.
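
To make the single-keyword form of that idea concrete, here is a minimal sketch in Python of how a reader might decode a TABULATE value written in the "KEYWORD1[EXTNAME1],KEYWORD2[EXTNAME2],..." syntax quoted further down in this thread. It is only an illustration and is not part of CFITSIO or any other FITS library. The function name parse_tabulate and the extension names in the example are hypothetical, and the sketch assumes that EXTNAME values contain no commas or square brackets.

```python
import re

# One "KEYWORD[EXTNAME]" entry. FITS keyword names are at most 8 characters
# drawn from A-Z, 0-9, hyphen and underscore. The bracket syntax itself is
# only the proposal discussed in this thread, not an established convention.
_ENTRY = re.compile(r"^\s*([A-Z0-9_-]{1,8})\s*\[\s*([^\[\]]+?)\s*\]\s*$")


def parse_tabulate(value):
    """Map each tabulated keyword to the EXTNAME that holds its values.

    Assumption: entries are comma-separated and EXTNAME values contain
    neither commas nor square brackets.
    """
    mapping = {}
    for entry in value.split(","):
        if not entry.strip():
            continue  # tolerate a trailing comma or an empty value
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"malformed TABULATE entry: {entry!r}")
        keyword, extname = match.groups()
        mapping[keyword] = extname
    return mapping


# Hypothetical example: two data extensions could both name EXP_TIMES here,
# which is how one ancillary extension would be shared between them.
links = parse_tabulate("XPOSURE[EXP_TIMES],DETTEMP[TEMPS]")
print(links)
```

Because each image extension carries its own list, the same extension name may appear in several headers (a shared array) or in only one (a private array), which covers both of the cases raised below.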

One distinct advantage of using separate image extensions, instead of 
packing all the data into a binary table, is that CFITSIO (and other 
software packages as well) can read and write compressed versions of the 
image extension on the fly, without the user having to first uncompress 
the file.  While a similar amount of data compression can be achieved if 
the data is packed into a binary table, the whole table would need to be 
uncompressed (with funpack) before CFITSIO can read it.

Regardless of what solution you end up adopting, I hope you will 
publicize the details of the convention here on the fitsbits email list, 
and also submit the convention for inclusion in the registry on the FITS 
support office web site (http://fits.gsfc.nasa.gov/fits_registry.html).

-Bill

On 05/23/2014 04:36 PM, Stein Vidar Hagfors Haugan wrote:
> On 2014/05/23, at 16:31, William Pence <William.Pence at nasa.gov> wrote:
>
>> You say that you have a preference for regular extensions (which I assume means 'IMAGE' extension),
>
> Actually, it's not so much a preference one way or the other for *me*... the only one leaning (at least moderately) strongly towards IMAGE(*) extensions over BINTABLEs is Mats. Since he's my boss, I have to argue at least somewhat on his behalf ;-).
>
> *) Taking the hint from Rob Seaman that there's nothing irregular about BINTABLES ;-)
>
>> but binary tables offer a very natural way of associating variables by writing the values in different columns of the table.  In your example the main column in the table would contain the (120, 120, 500) data cube as a vector, then the 'XPOSURE' column could contain the vector of 500 exposure values, and other columns could contain (120, 120) vectors of values that vary in the X, Y plane.  This table would only have one row, but in principle you could store multiple observations in multiple rows (e.g., all the spectral window extracts from a single observation could be in one table).  With this arrangement it would then be easy for software to determine if a keyword such as XPOSURE has a scalar value (in which case it is written as a header keyword) or is a vector (in which case it is written as a vector column).  Everything associated with that observation is stored in a single binary table, so this completely eliminates the need to invent new complicated conventions for associating different extensions with each other.
>
> At least initially, it seems to make a lot of sense to use a BINTABLE, since the TTYPEn keywords would be a natural way to signal which "keywords" are scalar vs. arrays, in the header itself, without introducing any new keywords. But:
>
> - Someone might (already) use TTYPEn equal to a "keyword" name without intending this convention to be used.
>
> - I don't see a good way of making this work seamlessly for multiple IMAGE [data] extensions. We want the ability to both:
>
> a) Have multiple IMAGE extensions take their "array keywords" from a global pool, to share them, conserving space - e.g. XPOSURE time might be the same array for all IMAGE extensions. The use of one row per IMAGE extension, as you suggest, does not work for that.
>
> b) Have each IMAGE extension have a "private" instance of an "array keyword", for "keywords" that are *not* the same for all IMAGE extensions. For some instruments, XPOSURE could be identical for one collection of IMAGE extensions, but another group of IMAGE extensions could have a different array of XPOSURE values. Using a simple rule like TTYPEn="keyword-name" and a single row doesn't work for this case.
>
> Ah, such is life ;-).
>
> And it seems you're proposing to have the data cubes themselves as columns in the BINTABLE... that would not allow other keywords to be different for each data cube, right? Unless they are all stuffed into the same scheme (ugh)? Isn't this (at least one of the reasons) why the IMAGE extension was created? I suppose that's sort of not a requirement for using the BINTABLE approach in itself...?
>
>> As an additional enhancement, if you find that the vectors in multiple rows of the table have identical values, you could write the vector once into a variable length array column, then all the other multiple instances of that vector could point to that same vector, to save on disk space.
>
> Oh... I guess this *could* take care of it, using BINTABLES? I've never seen or heard of a variable-length array column, though, and certainly have no idea how to make one row's column value point to another row's value for the same column (if I've understood correctly). Actually, I've only ever used single-row tables, stuffing whatever dimensionality is required into each column.
>
> Bill - are there IDL routines that could handle such a scheme?
>
> And to all - is such a scheme something that's "widely used", in the sense that most common FITS libraries have this feature built in?
>
> We'd like to keep this relatively simple... and I'm not sure that last bit qualifies in that sense :-)
>
> Anyhow, all of the above issues are catered for by using e.g. TABULATE = "KEYWORD1[EXTNAME1],KEYWORD2[EXTNAME2],..." in each IMAGE extension, since the association is given explicitly for each IMAGE extension. It could, of course, also be given as "KEYWORD[colnum1],KEYWORD[colnum2]" etc, where colnum indicates the relevant BINTABLE column.
>
> Sincerely,
> Stein Haugan
>
>> -Bill Pence
>>
>>> On May 21, 2014, at 3:56 AM, Stein Vidar Hagfors Haugan <s.v.h.haugan at astro.uio.no> wrote:
>>>
>>> Dear all,
>>>
>>> [Terje: please read through to catch any inconsistencies/cut-and-paste errors etc ;-]
>>>
>>> As the originator of the original question, I'd like to elaborate a bit. Our situation is as follows:
>>>
>>> We will need the capability to have multiple "data extensions" in a single file, allowing an arbitrary set of keywords to vary with any subset of the data dimensions. A typical example would be XPOSURE in observations where automatic exposure control is in effect.
>>>
>>> I.e. we could have a data extension w/dimensions (x,y,time) = (120,120,500), with XPOSURE varying with time, but constant for each (x,y) plane.
>>>
>>> Our current suggestion is to store the values either in a bintable extension or in a regular extension (we prefer regular extensions). The extension in this case would contain an array w/dimensionality (1,1,500).
>>>
>>> Keywords that are constant in time but vary in the (x,y) plane would be in extensions with dimensions (120,120,1), etc. This seems like an "obvious" solution to us.
>>>
>>> This must also work with non-mandatory keywords, though. So there needs to be a way to signal that "this keyword is not actually missing, but you can find it in a separate (bintable or regular) extension".
>>>
>>> We do not wish to make it necessary to gobble up the entire file in order to search for such potential tabulated keywords. So we propose a keyword named TABULATE, TABULATD ("tabulated"), or TABULATK ("tabulated keywords") containing a comma-separated list of keywords that are handled with this mechanism.
>>>
>>> Using a mechanism where the keyword value is actually equal to the name of the relevant extension would be a bit messy - especially for string-valued keywords!
>>>
>>> So our idea is to have another keyword with a related name, such as TAB_EXTN or TABULATN, containing a comma-separated list of the *corresponding* extension names. Thus unique extensions *may* be specified for each data extension for some keywords, whereas a single extension may be specified for keywords that are identical for all data extensions (i.e. reused by multiple data extensions).
>>>
>>> I assume that Paul's method uses
>>>
>>>      EXTNAME[keyword-ext]==KW_NAME    &&     EXTVER[keyword-ext]==EXTVER[data-ext]
>>>
>>> to link the data extension keyword and the keyword extension.
>>>
>>> However, this would *require* a separate keyword extension for all tabulated keywords in *each* data extension, with no mechanism to save space by "reusing" a keyword extension - since the combination of EXTNAME and EXTVER is required [isn't it?] to be unique throughout the file.
>>>
>>> In e.g. a fits file containing *many* spectral window extracts from a spectrometer, this could potentially mean *many* repetitions of the same keyword extension for e.g. XPOSURE, temperatures, etc!
>>>
>>> I assume Paul's method requires EXTVER to be unique for each data extension, *requiring* a separate table for each data extension.
>>>
>>> Our convention could of course be modified to use a *single* keyword by introducing a "syntax" for the TABULATE keyword, such as "KEYWORD1[EXTNAME1],KEYWORD2[EXTNAME2],...". Or it could be modified in other ways, I suppose.
>>>
>>> Where linear interpolation of the keyword value is good enough, the convention could also be augmented by e.g. allowing an array with smaller dimensions than the data extensions (though *always* the same *number* of dimensions). E.g. (x,y,time) = (2,2,1) in the above example for a keyword varying in the spatial plane but constant for each exposure.
>>>
>>> All of this could, of course, be built into a standard routine to read keywords.
>>>
>>> We believe that whatever we (a workgroup in an EU project called SOLARNET) adopt as a recommendation would "quite soon" be used by "quite a few" solar processing pipelines. Our recommendation is due in fall, but that won't necessarily be carved in stone, since the pipelines would not yet have been implemented and used.
>>>
>>> Your thoughts?
>>>
>>> Sincerely,
>>> Stein Haugan
>>>
>>>> On 2014/05/09, at 22:24, William Thompson <William.T.Thompson at nasa.gov> wrote:
>>>>
>>>> To the general FITS community:
>>>>
>>>> I've been asked if there are any specific conventions for associating ancillary data with primary data arrays.  The specific application is one where the exposure time differs from pixel to pixel (something that can be done with Active Pixel Sensors), but which could easily apply to other parameters which vary between pixels.
>>>>
>>>> The simplest and most obvious approach would be to store the actual data in the primary HDU, and then store the exposure times in an extension with the same dimensionality.  For example, if the primary HDU had
>>>>
>>>> SIMPLE  =                    T
>>>> BITPIX  =                   16
>>>> NAXIS   =                    2
>>>> NAXIS1  =                 1024
>>>> NAXIS2  =                 1024
>>>> EXTEND  =                    T
>>>>
>>>> The extension would have
>>>>
>>>> XTENSION=              'IMAGE'
>>>> BITPIX  =                  -32
>>>> NAXIS   =                    2
>>>> NAXIS1  =                 1024
>>>> NAXIS2  =                 1024
>>>> EXTNAME =            'XPOSURE'
>>>>
>>>> In essence, this is similar to the Green Bank Convention, but applied to the individual pixels in a data array rather than to rows in a binary table.
>>>>
>>>> Is this a commonly used method for associating ancillary data with primary images?  Are there any additional conventions that are appropriate to this situation?  I tried looking in
>>>>
>>>> http://fits.gsfc.nasa.gov/fits_conventions.html
>>>>
>>>> but couldn't find anything that seemed relevant.
>>>>
>>>> One could also imagine binary tables where the primary data array is in one column, and the array of exposure times is in another column.  However, for the present application, the use of IMAGE extensions is far simpler, and more likely to be actually adopted.
>>>>
>>>> Thank you,
>>>>
>>>> Bill Thompson
>>>>
>>>>
>>>> --
>>>> William Thompson
>>>> NASA Goddard Space Flight Center
>>>> Code 671
>>>> Greenbelt, MD  20771
>>>> USA
>>>>
>>>> 301-286-2040
>>>> William.T.Thompson at nasa.gov
>>>
>>>
>>> _______________________________________________
>>> fitsbits mailing list
>>> fitsbits at listmgr.cv.nrao.edu
>>> http://listmgr.cv.nrao.edu/mailman/listinfo/fitsbits
>

-- 
____________________________________________________________________
Dr. William Pence    Astrophysicist     William.Pence at nasa.gov
NASA/GSFC Code 662     [Emeritus]       +1-301-286-4599 (voice)
Greenbelt MD 20771                      +1-301-286-1684 (fax)