[fitsbits] BINTABLE convention for >999 columns

Rob Seaman seaman at lpl.arizona.edu
Sat Jul 8 08:27:16 EDT 2017


Exactly.

1) Putting the horse back in front of the cart, what are the significant
use cases here? Do they rise to the level of requiring action at the
standards level, or is it more appropriate for the application to work
within the limitations of the current standard? The suggestion of
implementing this as a local convention goes against the sense of the
community for the past several years.

2) If there are important and broad enough use cases, what number of
columns is necessary to address them? Are we talking about a several
hundred column table that just barely tips over the 999 boundary, for
instance, after running through a pipeline that calculates derived
features? Or are we talking about a new million-column paradigm
entirely? And how many rows are typical in each case? Are these
multi-terabyte files?

3) And what is the motivation for using FITS in the first place? How
does the file format fit into a concept of operations from pipeline to
disk file to relational database (or whatever)?

4) Somebody mentioned streaming. Is this really just a serialization
question for a workflow? If so, might it best be addressed (even if all
the issues above point to using FITS and modifying FITS) by modifying
FITS in a more general way to improve serialization of all FITS data,
images as well as binary tables?

5) And Tom or somebody mentioned ASCII tables. Do we really need to try
to support structures of unlimited width in ASCII tables? Is this a
feature anybody would want to use? And is this a use for ASCII tables
that we would want to encourage? (For that matter, if we aren't going to
deprecate these, shouldn't they really evolve into UTF-8 tables? ;-)

6) Issues of 32-bit addressability have often come up in FITS. Again,
might it make more sense to address any such contingent issue here in a
more general fashion to "do it right"?

7) Somebody mentioned normalization. Generally a well-normalized
database will be split into smaller, more numerous tables, not a single
monolithic one. Tools to implement coherent normalization of FITS
schema, tables as well as the normal horse-trading of keywords between
primary and extension headers, would be very welcome. One does question
how frequently they would converge on million, or even thousand, column
tables.

8) On the other hand, if we do identify a significant class of extremely
wide tabular data structures of broad utility to the astronomical
community, perhaps the FITS community should entertain defining a new
extension type entirely? Among other things this would avoid placing the
burden for supporting the new format on the wide range of software
applications and libraries that will continue not to need such structures.

9) The limitations of current FITS headers have been mentioned. Again,
there are broader implications here. Might it be time to define a
general-purpose binary-table-based header data structure that directly
addresses all the issues that have previously been identified? Rather
than add some complex binary encoding scheme that doesn't provide an
arbitrary width solution? Just to be clear, it is already perfectly legal
FITS for "header" metadata to be written to a table defining long
keyword names, well-typed values, arbitrary length comments,
hierarchical structure, and so forth (leaving minimal structural
FITS header records in place for each extension).

10) It is a natural human characteristic to try to solve the problem put
in front of us. The first question is whether issues like these make
this a problem we should try to solve at all.

11) The contingent question would then be whether binary tables are the
right paradigm for a solution. The basic FITS extension rules are quite
general. If an ideal extension format were designed to contain the data
structures needed by the still-undescribed science use cases in
question, would it closely resemble the current binary table format?
Might local usage, or for that matter community uptake of a new
convention, be simplified by defining an entirely new format tailored
for this purpose?

Rob

--


On 7/8/17 1:56 AM, Maren Purves wrote:
> Walter,
>
> one problem here: there is only so much space that can be addressed
> on any computer system. In the days of VAXes we couldn't address disks
> larger than 4 GB (at least on the version we used here). Much later I
> remember we couldn't take more NDR data with UIST at UKIRT because
> we weren't able to write files bigger than a certain size (242 seconds on
> a 32 bit machine). Unless very wide tables are done in a way that somehow
> doesn't exceed the space that can be addressed there will always be a
> limit. One can work around these limitations one way or the other, but
> even if you go to the limit of enumerating address spaces that you're
> addressing (in as many spaces as you can address/enumerate),
> there will always be a limit. Whether anybody will reach that in our
> lifetimes is a different matter.
>
> Maren Purves,
> East Asian Observatory
>
> On Fri, Jul 7, 2017 at 10:21 PM, jaffe <jaffe at strw.leidenuniv.nl> wrote:
>
>     My view is either do it right or don't do it.
>
>     If the problem is more or less one-off from a single application
>     then you should use multiple standard tables, with the connection
>     between the tables intrinsic to the application and not part of any
>     standard.
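For concreteness, the "multiple standard tables" option could be as simple as the following Python sketch. It is only an illustration under stated assumptions: the function name and the shared "row_id" key column are hypothetical, and an application could equally rely on row order to connect the tables.

```python
MAX_BINTABLE_COLUMNS = 999  # limit imposed by the three-digit keyword index


def split_columns(names, key="row_id", limit=MAX_BINTABLE_COLUMNS):
    """Partition column names into groups that each fit one standard BINTABLE.

    Every group starts with the same (hypothetical) key column so that the
    application can re-join rows across the resulting tables.
    """
    if limit < 2:
        raise ValueError("limit must leave room for the key plus one column")
    per_table = limit - 1  # one slot in each table is reserved for the key
    return [[key] + list(names[i:i + per_table])
            for i in range(0, len(names), per_table)]


# A 1204-column table splits into two standard tables.
wide = [f"col{i}" for i in range(1, 1205)]
for group in split_columns(wide):
    print(len(group), group[1], "...", group[-1])
```

How the groups are tied together afterwards stays private to the application, which is the point of this option.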
>
>     If there is a general recognized need for very wide tables then there
>     should be a generalized solution, not limited in width (say by
>     using base 36 coding).  Such a solution might be a separate table
>     defining the table format parameters for the wide table, but there
>     are probably other elegant solutions.
>
>     Walter
>
>         Mark,
>
>         Where do these wide FITS tables (> 999 columns) that you are
>         proposing to support come from in the first place?  Are you just
>         trying to support conversion of other tabular formats that can
>         support more than 999 columns into FITS format?  If so, I don't
>         see the point since no other existing software will be able to
>         read them properly.
>
>         Also, will TOPCAT have the ability to insert or delete columns
>         within these wide FITS tables?  That is a rather complicated
>         process.
>
>         The main issue I see with your convention is that it only
>         provides a modest increase in the maximum number of columns from
>         999 to about 18000.  I'd prefer a convention that places no limit
>         on the number of columns.   One of the previous posters suggested
>         using the HIERARCH convention for encoding keywords like
>         'TFORM12345', which seems to me to be a more robust and easier to
>         understand convention than using base 26 encoded strings.
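To make the HIERARCH alternative concrete, here is a rough Python sketch of what such a card might look like. It assumes the usual layout of the HIERARCH convention (the literal keyword HIERARCH, the long keyword, then "=" and the value) and handles string values only. The function name is hypothetical and nothing here is checked against any particular FITS library.

```python
CARD_LENGTH = 80  # every FITS header card is exactly 80 characters


def hierarch_card(keyword, value, comment=""):
    """Format one header card with a long keyword under HIERARCH.

    Sketch only: the value is always written as a quoted string, and
    embedded quotes are not escaped.
    """
    card = f"HIERARCH {keyword} = '{value}'"
    if comment:
        card += f" / {comment}"
    if len(card) > CARD_LENGTH:
        raise ValueError("card does not fit in 80 characters")
    return card.ljust(CARD_LENGTH)


print(hierarch_card("TFORM12345", "D", "format for column 12345"))
```

The column index is then plain decimal, so there is no fixed ceiling on the number of columns other than what fits on a card.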
>
>         Regards,
>         Bill Pence
>
>             On Jul 7, 2017, at 7:09 AM, Mark Taylor
>             <M.B.Taylor at bristol.ac.uk> wrote:
>
>             Dear fitsbits,
>
>             I am considering a convention for storing table data in FITS
>             files where the number of columns exceeds the 999 limit
>             implicitly imposed by the standard BINTABLE extension type.
>             I have running code for this (available on request) and plan
>             to incorporate it in future releases of STIL/STILTS/TOPCAT so
>             that people can work with wide tables in FITS while using
>             those tools.  People using software that is unaware of this
>             convention would still see a legal BINTABLE but not the later
>             columns.
>
>             I'm posting the details here in case people want to comment,
>             or point out some major problem with the idea that I might
>             have overlooked, or tell me that there's already a convention
>             for this out there that I should be using instead.  Otherwise,
>             please feel free to ignore this post.  I'm not requesting that
>             any other software implements this, though if anyone wants to
>             I certainly don't object.
>
>             Mark
>
>             . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>
>             Extended column convention for FITS BINTABLE
>             --------------------------------------------
>
>             The BINTABLE extension type as described in the FITS Standard
>             (FITS Standard v3.0, sec 7.3) requires table column metadata
>             to be described using 8-character keywords of the form
>             XXXXXnnn, where XXXXX represents one of an open set of
>             mandatory, reserved or user-defined root keywords up to five
>             characters in length, for instance TFORM (mandatory), TUNIT
>             (reserved), TUCD (user-defined).  The nnn part is an integer
>             between 1 and 999 indicating the index of the column to which
>             the keyword in question refers.  Since the header syntax
>             confines this indexed part of the keyword to three digits,
>             there is an upper limit of 999 columns in BINTABLE extensions.
>
>             Note that the FITS/BINTABLE format does not entail any
>             restriction on the storage of column *data* beyond the 999
>             column limit in the data part of the HDU, the problem is just
>             that client software cannot be informed about the layout of
>             this data using the header cards in the usual way.
>
>             In some cases it is desirable to store FITS tables with a
>             column count greater than 999.  Whether that's a good idea is
>             not within the scope of this discussion.
>
>             To achieve this, I propose the following convention.
>
>             Definitions:
>
>             - 'BINTABLE columns' are those columns defined using the
>                  FITS BINTABLE standard
>
>             - 'Data columns' are the columns to be encoded
>
>             - N_TOT is the total number of data columns to be stored
>
>             - Data columns with (1-based) indexes from 999 to N_TOT
>                  inclusive are known as 'extended' columns.  Their data
>                  is stored within the 'container' column.
>
>             - BINTABLE column 999 is known as the 'container' column.
>                  It contains the byte data for all the 'extended' columns.
>
>             Convention:
>
>             - All column data (for columns 1 to N_TOT) is laid out in
>                  the data part of the HDU in exactly the same way as if
>                  there were no 999-column limit.
>
>             - The TFIELDS header is declared with the value 999.
>
>             - The container column is declared in the header with some
>                  TFORM999 value corresponding to the total field length
>                  required by all the extended columns ('B' is the obvious
>                  data type, but any legal TFORM value that gives the
>                  right width MAY be used).  The byte count implied by
>                  TFORM999 MUST be equal to the total byte count implied
>                  by all extended columns.
>
>             - Other XXXXX999 headers MAY optionally be declared to
>                  describe the container column in accordance with the
>                  usual rules, e.g. TTYPE999 to give it a name.
>
>             - The NAXIS1 header is declared in the usual way to give
>                  the width of a table row in bytes.  This is equal to
>                  the sum of all the BINTABLE columns as usual.  It is
>                  also equal to the sum of all the data columns, which
>                  has the same value.
>
>             - Headers for Data columns 1-998 are declared as usual,
>                  corresponding to BINTABLE columns 1-998.
>
>             - Keyword XT_ICOL indicates the index of the container column.
>                  It MUST be present with the integer value 999 to indicate
>                  that this convention is in use.
>
>             - Keyword XT_NCOL indicates the total number of data
>                  columns encoded.  It MUST be present with an integer
>                  value equal to N_TOT.
>
>             - Metadata for each extended column is encoded with keywords
>                  of the form XXXXXaaa, where XXXXX are the same keyword
>                  roots as used for normal BINTABLE extensions, and aaa
>                  is a 3-digit value in base 26 using the characters 'A'
>                  (0 in base 26) to 'Z' (25 in base 26), and giving the
>                  1-based data column index minus 999.  The sequence aaa
>                  MUST be exactly three characters long (leading 'A's are
>                  required).  Thus the formats for data columns 999, 1000,
>                  1001, etc are declared with the keywords TFORMAAA,
>                  TFORMAAB, TFORMAAC etc.
>
>             - This convention MUST NOT be used for N_TOT<=999.
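The width rule in the list above (the byte count implied by TFORM999 must equal the total implied by the extended columns) is easy to check mechanically. A minimal Python sketch follows, assuming the standard BINTABLE type codes and ignoring anything that follows the type letter. The function names are hypothetical and are not taken from STIL or any FITS library.

```python
import re

# Bytes per element for the BINTABLE TFORM type codes ('X' is handled
# separately because it counts bits, not bytes).
BYTES_PER_ELEMENT = {
    "L": 1, "B": 1, "A": 1, "I": 2, "J": 4, "K": 8,
    "E": 4, "D": 8, "C": 8, "M": 16, "P": 8, "Q": 16,
}

_TFORM = re.compile(r"^\s*(\d*)([LXBIJKAEDCMPQ])")


def tform_width(tform):
    """Return the number of bytes one row occupies for a TFORM value."""
    match = _TFORM.match(tform)
    if match is None:
        raise ValueError(f"unrecognised TFORM value: {tform!r}")
    repeat = int(match.group(1)) if match.group(1) else 1
    code = match.group(2)
    if code == "X":
        return (repeat + 7) // 8  # bit arrays are padded to whole bytes
    return repeat * BYTES_PER_ELEMENT[code]


def container_is_consistent(container_tform, extended_tforms):
    """Check that the container column is exactly as wide as its contents."""
    return tform_width(container_tform) == sum(
        tform_width(t) for t in extended_tforms)


print(tform_width("813I"))                                 # 813 two-byte integers
print(container_is_consistent("24B", ["D", "D", "2J"]))    # 8 + 8 + 8 bytes
```

A reader that finds the two widths unequal has no safe way to locate the extended columns, so treating the HDU as a plain 999-column BINTABLE seems the sensible fallback in that case.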
>
>             The resulting HDU is a completely legal FITS BINTABLE
>             extension.  Readers aware of this convention may use it to
>             extract column data and metadata beyond the 999-column limit.
>             Readers unaware of this convention will see 998 columns in
>             their intended form, and an additional (possibly large)
>             column 999 which contains byte data but which cannot be
>             easily interpreted.
>
>             This convention can therefore allow encoding of tables with
>             data column counts N_TOT up to 998+26^3 = 18574.
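The base-26 suffix and the resulting ceiling can be worked through in a few lines. This is a minimal Python sketch of the mapping as described in the proposal, with hypothetical function names; it is not Mark's implementation.

```python
import string

FIRST_EXTENDED = 999               # data column index written as 'AAA'
ALPHABET = string.ascii_uppercase  # 'A' = 0 ... 'Z' = 25


def extended_suffix(column_index):
    """Return the 3-letter base-26 suffix for a 1-based data column index."""
    offset = column_index - FIRST_EXTENDED
    if not 0 <= offset < 26 ** 3:
        raise ValueError(f"column {column_index} is outside the extended range")
    high, rest = divmod(offset, 26 * 26)
    middle, low = divmod(rest, 26)
    return ALPHABET[high] + ALPHABET[middle] + ALPHABET[low]


def column_from_suffix(suffix):
    """Inverse mapping: 'AAA' gives 999, 'AAB' gives 1000, and so on."""
    if len(suffix) != 3 or any(c not in ALPHABET for c in suffix):
        raise ValueError(f"not a valid 3-letter suffix: {suffix!r}")
    offset = 0
    for char in suffix:
        offset = offset * 26 + ALPHABET.index(char)
    return offset + FIRST_EXTENDED


# 998 ordinary columns plus 26^3 extended ones.
MAX_COLUMNS = (FIRST_EXTENDED - 1) + 26 ** 3

print("TFORM" + extended_suffix(1204))   # keyword for data column 1204
print(column_from_suffix("ZZZ"), MAX_COLUMNS)
```

The last suffix, ZZZ, decodes to the same column number as the stated maximum, which confirms the 998+26^3 arithmetic.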
>
>             An example header might look like this:
>
>               XTENSION= 'BINTABLE'           /  binary table extension
>               BITPIX  =                    8 /  8-bit bytes
>               NAXIS   =                    2 /  2-dimensional table
>               NAXIS1  =                 9229 /  width of table in bytes
>               NAXIS2  =                   26 /  number of rows in table
>               PCOUNT  =                    0 /  size of special data area
>               GCOUNT  =                    1 /  one data group
>               TFIELDS =                  999 /  number of columns
>               XT_ICOL =                  999 /  index of container column
>               XT_NCOL =                 1204 /  total columns including extended
>               TTYPE1  = 'posid_1 '           /  label for column 1
>               TFORM1  = 'J       '           /  format for column 1
>               TTYPE2  = 'instrument_1'       /  label for column 2
>               TFORM2  = '4A      '           /  format for column 2
>               TTYPE3  = 'edge_code_1'        /  label for column 3
>               TFORM3  = 'I       '           /  format for column 3
>               TUCD3   = 'meta.code.qual'
>                ...
>               TTYPE998= 'var_min_s_2'        /  label for column 998
>               TFORM998= 'D       '           /  format for column 998
>               TUNIT998= 'counts/s'           /  units for column 998
>               TTYPE999= 'XT_MORECOLS'        /  label for column 999
>               TFORM999= '813I    '           /  format for column 999
>               TTYPEAAA= 'var_min_u_2'        /  label for column 999
>               TFORMAAA= 'D       '           /  format for column 999
>               TUNITAAA= 'counts/s'           /  units for column 999
>               TTYPEAAB= 'var_prob_h_2'       /  label for column 1000
>               TFORMAAB= 'D       '           /  format for column 1000
>                ...
>               TTYPEAHW= 'var_prob_w_2'       /  label for column 1203
>               TFORMAHW= 'D       '           /  format for column 1203
>               TTYPEAHX= 'var_sigma_w_2'      /  label for column 1204
>               TFORMAHX= 'D       '           /  format for column 1204
>               TUNITAHX= 'counts/s'           /  units for column 1204
>               END
>
>             This general approach was suggested by William Pence on the
>             FITSBITS list in June 2012
>             (https://listmgr.nrao.edu/pipermail/fitsbits/2012-June/002367.html),
>             and by Francois-Xavier Pineau (CDS) in private conversation
>             in 2016.  The details have been filled in by Mark Taylor
>             (Bristol).  (F-X favours a different mechanism for encoding
>             the extended column metadata).
>
>             --
>             Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
>             m.b.taylor at bris.ac.uk  +44-117-9288776  http://www.star.bris.ac.uk/~mbt/
>
>             _______________________________________________
>             fitsbits mailing list
>             fitsbits at listmgr.nrao.edu
>             https://listmgr.nrao.edu/mailman/listinfo/fitsbits
