[fitsbits] Associating ancillary data with primary HDU

Rob Seaman seaman at noao.edu
Fri May 23 18:24:53 EDT 2014


Hi Stein,

> Typically, each project has "overcome" this problem by introducing their own scheme. Often using binary tables with "appropriate" names, which are understood by their own software. But there has been no formal association rule between the keyword name and the column with the correct values. Also, the bintable columns have not necessarily had the same number of dimensions as the data array, but again, their own software can deal with that.

Right, but community-wide standardization can be applied to BINTABLEs as well as IMAGEs.

>> And, of course, a native IMAGE array might well be tile-compressed into a BINTABLE.  I'd strongly recommend thinking about compression early in the process.
> 
> Hmm... maybe. Are there "transparent", standardised ways of doing this?

Yes.  CFITSIO can both read and write compressed images and tables on the fly.  A compressed image looks like an uncompressed image.
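
For instance, here is a minimal sketch (not from any particular pipeline; the file name is made up, and a 2-D image is assumed) of reading a tile-compressed image with CFITSIO exactly as one would read a plain IMAGE.  fits_open_image() positions at the first image HDU whether or not it is compressed:

    /* Sketch: read a tile-compressed image transparently with CFITSIO.
     * The file name "obs_comp.fits" is hypothetical.
     * Compile with:  cc read_comp.c -lcfitsio  */
    #include <stdio.h>
    #include <stdlib.h>
    #include "fitsio.h"

    int main(void)
    {
        fitsfile *fptr;
        int status = 0, naxis = 0, anynul = 0;
        long naxes[2] = {1, 1}, firstpix[2] = {1, 1};
        double *pixels;

        /* Opens the first image HDU, compressed or not; tiles are
         * uncompressed on the fly as pixels are requested. */
        fits_open_image(&fptr, "obs_comp.fits", READONLY, &status);
        fits_get_img_dim(fptr, &naxis, &status);
        fits_get_img_size(fptr, 2, naxes, &status);

        pixels = malloc(naxes[0] * naxes[1] * sizeof(double));
        fits_read_pix(fptr, TDOUBLE, firstpix, naxes[0] * naxes[1],
                      NULL, pixels, &anynul, &status);

        fits_close_file(fptr, &status);
        fits_report_error(stderr, status);
        free(pixels);
        return status;
    }

Writing is symmetric: to first order, a call to fits_set_compression_type() before creating the image is all that changes on the output side.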

> We have some issues with it anyhow, though, since some visualisation programs do a memory-mapping of the IMAGE array for fast access (without duplicating huge data chunks).

There are some interesting discussions online about compression and memory mapping.  The tiled nature of FITS compression can help here, though a pragmatic choice for a pipeline workflow, or for a highly interactive application like you describe, might well be to uncompress the data first.  On the other hand, data spend most of their life cycle either sitting still in an archive or being moved over the network, and in both cases compression plays a role.

>> The BINTABLE could define some specific or perhaps general-purpose keywords for cases in which one or more of the columns should be omitted in favor of scalars in the table header.  
> 
> But would such schemes really be any simpler than what we propose? I'm not sure I understood it…

A table corresponds to a schema expressing a coherent data model.  That is generally better than spreading the different pieces across an arbitrary number of distinct FITS HDUs; an HDU should correspond to a single coherent data product such as a data array or a mask.
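
As a sketch of what I mean (the OBSTIME column and the ANCILLARY extension name are just placeholders; only XPOSURE comes from your example), the ancillary quantities become columns of one BINTABLE, one row per exposure:

    /* Sketch: one BINTABLE row per exposure; ancillary quantities are
     * columns rather than separate HDUs.  Names are illustrative only. */
    #include "fitsio.h"

    int write_ancillary_table(fitsfile *fptr, long nexp,
                              double *xposure, double *obstime, int *status)
    {
        char *ttype[] = {"XPOSURE", "OBSTIME"};
        char *tform[] = {"1D", "1D"};        /* double-precision scalars */
        char *tunit[] = {"s", "d"};

        fits_create_tbl(fptr, BINARY_TBL, nexp, 2, ttype, tform, tunit,
                        "ANCILLARY", status);
        fits_write_col(fptr, TDOUBLE, 1, 1, 1, nexp, xposure, status);
        fits_write_col(fptr, TDOUBLE, 2, 1, 1, nexp, obstime, status);
        return *status;
    }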

>> Alternately, just always express such quantities as column vectors.  FPACK will squeeze all the duplicated values down with a high compression factor.
> 
> I've never tried FPACK, but I can easily see that compression will take care of all that wasted space. That's great for the data repositories, but not so good for the users, who'd want to decompress the data to work on them, and would not like to have to compress them again while working on something else, then decompressing to have a second look, etc.

As I said, CFITSIO (for example) can read the compressed files directly.  The tiling allows efficient indexing into the data arrays or tables.
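
For instance (a sketch; the region coordinates are arbitrary), only the tiles overlapping a requested cutout need to be uncompressed:

    /* Sketch: read a 100x100 cutout from a tile-compressed image.
     * CFITSIO uncompresses only the tiles that overlap the region. */
    #include <stdio.h>
    #include "fitsio.h"

    int read_cutout(const char *fname, double *buf)
    {
        fitsfile *fptr;
        int status = 0, anynul = 0;
        long fpixel[2] = {101, 101};   /* lower-left corner (1-based) */
        long lpixel[2] = {200, 200};   /* upper-right corner          */
        long inc[2]    = {1, 1};       /* sampling step               */

        fits_open_image(&fptr, fname, READONLY, &status);
        fits_read_subset(fptr, TDOUBLE, fpixel, lpixel, inc,
                         NULL, buf, &anynul, &status);
        fits_close_file(fptr, &status);
        fits_report_error(stderr, status);
        return status;
    }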

> For languages like IDL, it would be fairly easy (with our proposed scheme) to say e.g.
> 
>  data = readfits(...)
>  exposure_time = generic_function(hdr,"XPOSURE")
>  corrected_data = data/exposure_time
> 
> This could be made to work in two ways: the generic_function will know the dimensionality of the data array (given in the header), and could *always* return an array. Or it could return a scalar sometimes and an array at other times. Still works. 
> 
> Not all things are so simple, but most cases can be handled by writing the code such that it *requires* the value to be an array, and letting the generic_function always return such an array.

Yes, this was implied by your dimensioning the duration array as [1,1,500].  It isn’t obvious why the generic function couldn’t read a table, however.  (Or read a tile-compressed array expressed as a binary table.)
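
A generic lookup along those lines might, in CFITSIO terms, look something like the sketch below (the ANCILLARY extension name is an assumed convention, and error handling is minimal).  The same fallback logic could be built in IDL from, e.g., sxpar() plus a table reader such as mrdfits():

    /* Sketch of a generic XPOSURE lookup: try the header keyword first,
     * and fall back to a BINTABLE column (extension name "ANCILLARY" is
     * an assumption) if the keyword is absent. */
    #include "fitsio.h"

    int get_xposure(fitsfile *fptr, long nrows, double *xposure, int *status)
    {
        double scalar;
        int colnum, anynul = 0;
        long i;

        fits_read_key(fptr, TDOUBLE, "XPOSURE", &scalar, NULL, status);
        if (*status == 0) {                /* scalar keyword: replicate it */
            for (i = 0; i < nrows; i++)
                xposure[i] = scalar;
            return 0;
        }
        if (*status == KEY_NO_EXIST) {     /* fall back to the table */
            *status = 0;
            fits_movnam_hdu(fptr, BINARY_TBL, "ANCILLARY", 0, status);
            fits_get_colnum(fptr, CASEINSEN, "XPOSURE", &colnum, status);
            fits_read_col(fptr, TDOUBLE, colnum, 1, 1, nrows, NULL,
                          xposure, &anynul, status);
        }
        return *status;
    }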

>> Or the various elements of your schema could map to columns (or variable length arrays perhaps) in a single BINTABLE.
> 
> Again, I'm just wondering about the simplicity, and the general availability of library routines for that mapping (which I've never used nor heard of, which means that probably 99% of our "audience" haven't heard of it either).

Maybe Bill could comment.  I would think that roughly as many users / projects use FITS tables as FITS images.

> If by scalable you mean generic and/or efficient then yes, certainly. And yes, regarding understandability, it should not be too complex for the intended "user group".

By scalable I meant able to gracefully express the data products that are likely to be encountered in real-world use cases.  For instance, you mentioned a need to support "a fits file containing *many* spectral window extracts from a spectrometer".  If your data model requires numerous instances of separate FITS extension types, scaling up to dozens or hundreds of spectral snapshots may require files with thousands of HDUs even at a relatively modest total size.  It might well be better to define a single table structure to contain these.
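
A sketch of such a structure (column and extension names are illustrative) would put one spectral window per row, with a variable-length array column holding the extract so that windows of different lengths still share a single HDU:

    /* Sketch: many spectral windows in one BINTABLE, one row per window.
     * The "1PE" form declares a variable-length float array column, so
     * windows of different lengths coexist in the same table. */
    #include "fitsio.h"

    int write_spectral_windows(fitsfile *fptr, long nwin,
                               float **spectra, long *npix, int *status)
    {
        char *ttype[] = {"WINDOW_ID", "SPECTRUM"};
        char *tform[] = {"1J", "1PE"};
        char *tunit[] = {"", "adu"};
        long row;

        fits_create_tbl(fptr, BINARY_TBL, 0, 2, ttype, tform, tunit,
                        "SPECWIN", status);
        for (row = 1; row <= nwin; row++) {
            int id = (int) row;
            fits_write_col(fptr, TINT, 1, row, 1, 1, &id, status);
            fits_write_col(fptr, TFLOAT, 2, row, 1, npix[row - 1],
                           spectra[row - 1], status);
        }
        return *status;
    }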

Rob




