[fitsbits] Associating ancillary data with primary HDU

Stein Vidar Hagfors Haugan s.v.h.haugan at astro.uio.no
Fri May 23 17:17:33 EDT 2014


Hi Rob,

On 2014/05/23, at 16:19, Rob Seaman <seaman at noao.edu> wrote:

> Hi Stein,
> 
>> We will need the capability to have multiple "data extensions" in a single file,
> 
> Ok, this is no problem.
> 
>> allowing an arbitrary set of keywords to vary with any subset of the data dimensions.
> 
> I might rather say something like "quantities" or "variables" than keywords.  "Keyword" means something specific in FITS.
> 
>> A typical example would be XPOSURE in observations where automatic exposure control is in effect.
> 
> So in this case you wouldn't necessarily be denoting the quantity as a FITS keyword token like "XPOSURE" but rather as "exposure_time", "time.exposure", or "exptime".  (In any event it isn't clear why such a keyword name would be truncated since "EXPOSURE" fits into 8 characters :-)

I'm using the word keyword, but it should of course be something else, or at least "keyword" in quotes. 

The reason is that we want to be able to express things that (for many solar observations) have been expressed *incorrectly* through a scalar-valued FITS keyword, such as quoting the mean or initial (or whatever) exposure time for a data array taken with automatic exposure control, which means the value could be way off, and there is no standard way to express this.

Typically, each project has "overcome" this problem by introducing its own scheme, often using binary tables with "appropriate" names that are understood by its own software. But there has been no formal association rule between the keyword name and the column holding the correct values. Also, the bintable columns have not necessarily had the same number of dimensions as the data array, but again, the project's own software can deal with that.

Oh, and XPOSURE is now recommended since some projects have used EXPOSURE for total exposure time and some have used it for single exposure times when stacking multiple exposures. XPOSURE is now defined to be the total exposure time.

>> I.e. we could have a data extension w/dimensions (x,y,time) = (120,120,500), with XPOSURE varying with time, but constant for each (x,y) plane.
> 
> A data cube seems reasonable.
> 
>> Our current suggestion is to store the values either in a bintable extension or in a regular extension (we prefer regular extensions). The extension in this case would contain an array w/dimensionality (1,1,500).
> 
> There's nothing irregular about a bintable ;-)  Say rather an IMAGE extension versus a BINTABLE.  For expressing a single such vector the choice of HDU type is not dramatically different.
> 
> And, of course, native IMAGE array might well be tile-compressed into a BINTABLE.  I'd strongly recommend thinking about compression early in the process.

Hmm... maybe. Are there "transparent", standardised ways of doing this? 

We have some issues with compression anyhow, since some visualisation programs memory-map the IMAGE array for fast access (without duplicating huge chunks of data).
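
For concreteness, a minimal sketch of the (1,1,500) case quoted above, using IDL Astronomy Library routines (file and variable names are made up):

  ; Append an IMAGE extension of dimensions (1,1,500) holding the
  ; per-frame XPOSURE values next to a (120,120,500) data cube.
  xposure = reform(exposure_times, 1, 1, 500)     ; one value per time step
  mkhdr, exthdr, xposure, /image                  ; minimal IMAGE extension header
  sxaddpar, exthdr, 'EXTNAME', 'XPOSURE'
  writefits, 'observation.fits', xposure, exthdr, /append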

>> Keywords that are constant in time but vary in the (x,y) plane would be in extensions with dimensions (120,120,1), etc. This seems like an "obvious" solution to us.
> 
> This is a typical usage for masks, and a 3-D data cube might well have a 2-D mask projected along any of the axes.
> 
>> This must also work with non-mandatory keywords, though. So there needs to be a way to signal that "this keyword is not actually missing, but you can find it in a separate (bintable or regular) extension".
> 
> Here's where you're losing me and likely others.  There's nothing magic about FITS keywords as a vehicle for science metadata.  And ultimately metadata and data are not intrinsically different concepts.  A complex structured data object - FITS or otherwise - will relate arrays and scalars, vectors and cubes.  For a particular instrument or pipeline the various quantities may be deemed to be dependent or independent variables whether or not they're expressed as keywords, columns or pixels/voxels.
> 
> I guess what you're saying is that sometimes such a file might have a scalar exposure time, constant across all planes of the time-series cube?  And in other instances the exposure time may vary plane-to-plane?  And that you don't want to be obligated to express a scalar as a vector?

Yes, because we're talking about a recommendation that should hold for as many types of instrument pipelines (within the solar observation community) as possible, allowing generic analysis and visualisation software to cope correctly with them all, for example by correcting image brightness for exposure time (or whatever).

And, of course, see my first comment about the use of the word "keyword".

>> We do not wish to make it necessary to gobble up the entire file in order to search for such potential tabulated keywords. So we propose a keyword named TABULATE, TABULATD ("tabulated"), or TABULATK ("tabulated keywords") containing a comma-separated list of keywords that are handled with this mechanism.
> 
> This is really starting to sound like a job for a BINTABLE with a coherent purpose-designed schema.  FITS keywords are one simple way to lay out metadata.  But it isn't hard to exceed the comfortable mapping of a flat FITS header onto the job in question.
> 
> The BINTABLE could define some specific or perhaps general-purpose keywords for cases in which one or more of the columns should be omitted in favor of scalars in the table header.  

But would such schemes really be any simpler than what we propose? I'm not sure I understood it...

> Alternately just always express such as column vectors.  FPACK will squeeze all the duplicated values down with a high compression factor.

I've never tried FPACK, but I can easily see that compression would take care of all that wasted space. That's great for the data repositories, but not so good for the users, who would want to decompress the data to work on them, and would not like having to compress them again while working on something else, then decompress them for a second look, and so on.

> Your code to read these files would otherwise be complicated by conditionals if sometimes the EXPTIME is a vector and other times a scalar.

Life *is* complicated, unfortunately: for some instruments, EXPTIME is constant for every pixel in a multi-extension file, while for others (at least theoretically) it might be different for each and every pixel. For languages like IDL, it would be fairly easy (with our proposed scheme) to say e.g.

  data = readfits(..., hdr)                         ; header returned in HDR
  exposure_time = generic_function(hdr, "XPOSURE")  ; scalar *or* array
  corrected_data = data/exposure_time

This could be made to work in two ways: since the generic_function knows the dimensionality of the data array (given in the header), it could *always* return an array, or it could return a scalar sometimes and an array at other times. Either way it still works.

Not all things are so simple, but most cases can be handled by writing the code such that it *requires* the value to be an array, and letting the generic_function always return such an array.
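
Just to show roughly what I mean, here is a sketch of how such a generic_function might look (purely illustrative, using IDL Astronomy Library routines; it assumes the routine is also handed the file name, that tabulated values live in an extension whose EXTNAME equals the "keyword" name, and of course the TABULATE keyword is only our proposal):

  function generic_function, filename, hdr, name
    ; Return NAME either as the scalar from the header, or, if NAME is
    ; listed in TABULATE, as the array stored in the extension whose
    ; EXTNAME equals NAME.
    compile_opt idl2
    tabulated = sxpar(hdr, 'TABULATE', count=found)    ; e.g. 'XPOSURE,TEMPS'
    list = (found gt 0) ? strtrim(strsplit(tabulated, ',', /extract), 2) : ['']
    if total(list eq name) gt 0 then begin
       fits_read, filename, values, exthdr, extname=name
       return, values                                  ; e.g. a (1,1,500) array
    endif
    return, sxpar(hdr, name)                           ; plain scalar keyword
  end

Making it *always* return an array would then just be a matter of rebinning the scalar case up to the data dimensions given in the header.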

>> ...
>> However, this would *require* a separate keyword extension for all tabulated keywords in *each* data extension,
> 
> Or the various elements of your schema could map to columns (or variable length arrays perhaps) in a single BINTABLE.

Again, I'm just wondering about the simplicity, and the general availability of library routines for that mapping (which I've never used nor heard of, which means that probably 99% of our "audience" haven't heard of it either).

>> In e.g. a fits file containing *many* spectral window extracts from a spectrometer, this could potentially mean *many* repetitions of the same keyword extension for e.g. XPOSURE, temperatures, etc!
> 
> Or a single compact but comprehensive BINTABLE.  Whatever the type of extensions you use, it becomes inefficient once the size of the data records falls below some threshold relative to the size of the header records.  And searching through the file becomes inefficient if there are large numbers of small HDUs.  That said, FITS files with hundreds of HDUs can be reasonably efficient to handle - if the types and purposes of the HDUs don't get too elaborate.

I agree (cf. my response to William) that a single BINTABLE combined with something like our TABULATE="KEYWORD1[colnum1],KEYWORD2[colnum2],..." might actually be more efficient.
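
Just to sketch what reading that back could look like (invented names throughout; this assumes a single BINTABLE extension called e.g. 'KEYWORDS' with one column per tabulated "keyword", and that hdr and filename come from the primary HDU):

  ; Parse e.g. TABULATE = 'XPOSURE[3],TEMP_CCD[5]' and read each column.
  tabulate = sxpar(hdr, 'TABULATE')
  entries  = strsplit(tabulate, ',', /extract)
  fxbopen, unit, filename, 'KEYWORDS', bthdr          ; open the bintable once
  for i = 0, n_elements(entries)-1 do begin
     parts  = stregex(entries[i], '^ *([A-Z0-9_-]+)\[([0-9]+)\] *$', $
                      /extract, /subexpr)
     name   = parts[1]
     colnum = fix(parts[2])
     fxbread, unit, values, colnum                    ; all values for this "keyword"
     print, name, ' has ', n_elements(values), ' tabulated values'
  endfor
  fxbclose, unit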

>> All of this could, of course, be built into a standard routine to read keywords.
> 
> FITS header keywords are suitable for simple metadata chores.  Your requirements don't sound that simple.  Some combination of BINTABLEs and mask-like IMAGE extensions should do the job, and tie these together with a relatively limited set of keywords that aren't being strained beyond their native capabilities.
> 
>> We believe that whatever we (a workgroup in an EU project called SOLARNET) adopt as a recommendation would "quite soon" be used by "quite a few" solar processing pipelines. Our recommendation is due in fall, but that won't necessarily be carved in stone, since the pipelines would not yet have been implemented and used.
> 
> So it sounds like you have some time - and that there is a premium on choosing a scalable format that is understandable to groups who weren't involved in implementing the format.

If by scalable you mean generic and/or efficient then yes, certainly. And yes, regarding understandability, it should not be too complex for the intended "user group".

Sincerely,
Stein Haugan




