[fitsbits] BINTABLE/TABLE column count limitation
Mark Taylor
m.b.taylor at bristol.ac.uk
Thu Jun 7 06:02:45 EDT 2012
Bill,
thank you for your careful consideration of my proposal.
I agree there is a good chance that existing FITS software would
fail to make any sense of a wide (i.e. >999 column) table;
my suggestion that in some cases it might be able to do it was more
in the nature of an added bonus than a serious selling point.
So really I'm only expecting to come up with a convention which
- is legal FITS (except that it may not be legal BINTABLE)
- looks exactly like existing BINTABLE for narrow tables
- can encode wide tables in such a way that other convention-aware
software can decode them
- departs from the existing BINTABLE standard in ways which are
minor and easy to understand and implement
If it's legal BINTABLE as well that would be ideal, but that may
not be possible.
I hadn't thought of the issue of WCS keywords as raised by Arnold.
In mitigation of that, I'd make two points. First, the contexts in
which I've seen relatively wide (hundreds of columns) tables are not
contexts in which I've seen these WCS keywords be used, so the chance
of one of them appearing in column 1000+ of a table would seem to be
fairly small (though I'd be interested to hear of counterexamples).
Second, as far as I can tell (I don't have much experience
with the WCS part of FITS, so possibly I'm mistaken here), some
of those keywords (e.g. TPn_ka) already have the possibility
of a name-length overflow beyond the 8-character limit, so this
suggestion would not introduce a fundamentally new flaw in the table
WCS encoding scheme. The reason this existing flaw doesn't
routinely cause problems is presumably the same - WCS keywords
don't often crop up in high-numbered columns. The same may or
may not apply to non-standard (e.g. HEASARC) per-column keywords.
Having said all that, my suggestion does seem to be more problematic
than I originally thought.
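
To make the keyword-length issue concrete, here is a minimal Python sketch of the naming rule from the original proposal (quoted further down), together with a check against the 8-character keyword limit. The function names are hypothetical, and `wcs_tp_keyword` simply expands the TPn_ka template literally as written above, which is an assumption rather than a statement about how any WCS library builds these names.

```python
MAX_KEYWORD_LEN = 8  # FITS header keyword names are at most 8 characters


def column_keyword(stem: str, n: int) -> str:
    """Per-column keyword name under the originally proposed rule.

    Columns 1-999 keep the standard name; for columns 1000-9999 the
    5th letter of a 5-letter stem is dropped (TFORM -> TFOR) so that
    a 4-digit index still fits. 4-letter stems such as TDIM are unchanged.
    """
    if not 1 <= n <= 9999:
        raise ValueError("column index out of range for this scheme")
    if n >= 1000 and len(stem) == 5:
        stem = stem[:4]
    return f"{stem}{n}"


def wcs_tp_keyword(n: int, k: int, a: str = "") -> str:
    """Literal expansion of the TPn_ka template (assumed form)."""
    return f"TP{n}_{k}{a}"


def fits_in_header(keyword: str) -> bool:
    """True if the keyword name respects the 8-character limit."""
    return len(keyword) <= MAX_KEYWORD_LEN


# Table keywords stay within 8 characters under the proposed rule.
print(column_keyword("TFORM", 999), column_keyword("TFORM", 1000))

# A TPn_ka name can already overflow with 3-digit column indices.
print(wcs_tp_keyword(100, 100), fits_in_header(wcs_tp_keyword(100, 100)))
```

The last line illustrates the point made above: with this template the overflow does not need a 4-digit column index to occur, so wide tables make it more likely rather than introducing it.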
Your idea of a fake 'B' column 999 is quite nice; I hadn't thought of
that. The fact that existing readers can read the first 998 columns
may be a minor advantage, but the real selling point is that it
continues to be legal BINTABLE. On the downside, it's a bit complicated,
and still runs into the problem that some WCS and other keywords may
not work well with 4-digit column indices.
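
A rough Python sketch of the placeholder layout may help here. It assumes that columns 1-998 stay as ordinary columns and that everything from the 999th column onward is packed into a single fake 'B' column; the function names are hypothetical, and bit arrays (X) and variable-length descriptors (P, Q) are left out for brevity.

```python
import re

# Bytes per element for fixed-width BINTABLE data types.
BYTES_PER_ELEMENT = {
    "L": 1, "B": 1, "A": 1, "I": 2, "J": 4, "E": 4,
    "K": 8, "D": 8, "C": 8, "M": 16,
}

_TFORM = re.compile(r"^\s*(\d*)([A-Z])")


def tform_width(tform: str) -> int:
    """Width in bytes of one column, given a TFORM value like '8000B'."""
    m = _TFORM.match(tform)
    if m is None or m.group(2) not in BYTES_PER_ELEMENT:
        raise ValueError(f"unsupported TFORM: {tform!r}")
    repeat = int(m.group(1)) if m.group(1) else 1
    return repeat * BYTES_PER_ELEMENT[m.group(2)]


def placeholder_layout(tforms):
    """Return (standard_tforms, pseudo_tforms, naxis1).

    For narrow tables nothing changes. For wide tables the first 998
    columns are kept, and column 999 becomes an 'nB' placeholder whose
    width n is the summed width of all remaining (pseudo) columns.
    """
    tforms = list(tforms)
    if len(tforms) <= 999:
        return tforms, [], sum(tform_width(t) for t in tforms)
    real, pseudo = tforms[:998], tforms[998:]
    placeholder = f"{sum(tform_width(t) for t in pseudo)}B"
    standard = real + [placeholder]
    # NAXIS1 equals the summed width of the 999 standard columns,
    # which is what a convention-unaware reader would check.
    naxis1 = sum(tform_width(t) for t in standard)
    return standard, pseudo, naxis1


standard, pseudo, naxis1 = placeholder_layout(["D"] * 1998)
print(len(standard), standard[-1], len(pseudo), naxis1)
```

With 1000 double-precision pseudo columns the placeholder comes out as '8000B', matching the example in Bill's message below; TFIELDS would stay at 999 and a separate non-standard keyword would have to record the number of pseudo columns.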
Of the other suggestions raised in this thread, using multiple HDUs
is a possibility, but it has significant disadvantages, for instance
unsuitability for streaming I/O and the possibility of confusion
when multiple tables are or may be stored in the same MEF.
The one I'm inclining towards is using non-decimal digits for >999,
e.g. AAA=1000, AAB=1001, ABA=1026
(this was also suggested to me off-list by Thomas Robitaille)
It certainly isn't legal according to the existing BINTABLE definition,
but given a single rule about how to encode >999 values in three
characters, everything else follows. There is a bit more likelihood
of keyword collisions once alphabetic 'numbers' are in use
(e.g. TFORMATS = 'D', where ATS represents column 1512); I'm not
sure how serious a problem that would be. A different 3-digit
encoding could be used to minimise that problem, but that could
either get messy or have a smaller data range (26^3=17576).
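
The single encoding rule can be sketched in a few lines of Python. This assumes A=0 ... Z=25 with the most significant letter first, which is consistent with the three examples above (AAA=1000, AAB=1001, ABA=1026); the function names are hypothetical.

```python
import string

ALPHA = string.ascii_uppercase          # 'A'..'Z'
BASE = len(ALPHA)                       # 26
FIRST_ALPHA = 1000                      # AAA
LAST_ALPHA = FIRST_ALPHA + BASE**3 - 1  # ZZZ


def encode_column(n: int) -> str:
    """Column index -> keyword suffix (decimal up to 999, letters above)."""
    if 1 <= n <= 999:
        return str(n)
    if FIRST_ALPHA <= n <= LAST_ALPHA:
        hi, rest = divmod(n - FIRST_ALPHA, BASE * BASE)
        mid, lo = divmod(rest, BASE)
        return ALPHA[hi] + ALPHA[mid] + ALPHA[lo]
    raise ValueError(f"column index {n} not representable")


def decode_column(suffix: str) -> int:
    """Keyword suffix -> column index; inverse of encode_column."""
    if suffix.isascii() and suffix.isdigit():
        return int(suffix)
    if len(suffix) == 3 and all(c in ALPHA for c in suffix):
        value = 0
        for c in suffix:
            value = value * BASE + ALPHA.index(c)
        return FIRST_ALPHA + value
    raise ValueError(f"not a column suffix: {suffix!r}")


print(encode_column(1000), encode_column(1001), encode_column(1026))
print("TFORM" + encode_column(1512))   # the collision-prone keyword
print(LAST_ALPHA)                      # highest representable column
```

Under these assumptions the 26^3 alphabetic values sit on top of the 999 decimal ones, so the highest representable column index is 999 + 17576 = 18575.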
I'm starting to think that the complications may not make any of
these schemes suitable for eventual incorporation into the FITS
standard (though I'd be happy to be persuaded otherwise).
I may however implement one of them in STIL if I conclude that I
need a FITS-like format to store wide tables; this would effectively
be an internal data format with a close resemblance to FITS.
In that case other software that needed to do something similar
(perhaps there isn't any) would be free to implement the same
unofficial extension.
Mark
On Thu, 7 Jun 2012, William Pence wrote:
> Mark,
>
> I can think of several serious compatibility issues for existing FITS reading
> software with your proposed convention which effectively tries to support
> having more "pseudo" columns that occupy space in the table following the
> "standard" 999 column limit. Note that I'm assuming here that the proposal
> would only apply to binary tables, not to ASCII tables.
>
> 1. In order to preserve compatibility with existing FITS readers, and to
> conform to the FITS Standard, the value of the TFIELDS keyword must be equal
> to the number of "standard" columns in the table and must have a value between
> 0 and 999, inclusive. Even if an existing FITS reader doesn't immediately
> abort when it sees a TFIELDS value greater than 999, the reader would almost
> certainly fail when it could not find the TFORM1000 (and greater) keywords.
> CFITSIO, as an example, when it first opens a FITS binary table, must
> construct an internal structure that, among other things, gives the byte
> offset in the row to the start of every column in the table. Since CFITSIO
> would be unable to determine the widths of the columns beyond column 999, it
> would be forced to exit with a fatal file format error. To get around this
> problem, it would be necessary, I think, to continue to place a maximum limit
> of 999 on the value of the TFIELDS keyword, and then define a new non-standard
> keyword to specify the number of additional pseudo columns in the table (i.e.,
> the number of columns beyond the 999 standard columns).
>
> 2. The NAXIS1 keyword must give the physical width of the table in bytes,
> which would necessarily have to include the width of all the pseudo columns.
> However, the Standard also requires that the value of the NAXIS1 keyword be
> equal to the sum of the widths of all the individual standard columns (not
> including the widths of any pseudo columns). I suspect that many existing FITS
> readers perform a sanity check to ensure that this requirement is met, and if
> it isn't, abort with a fatal file format error (my CFITSIO code certainly
> does). The only way (or at least one way) I can see to reconcile these 2
> requirements so that existing FITS readers can read the table is to reserve
> one of the standard columns (most likely column 999) as a fictitious
> placeholder column of type 'B' with a vector width that is equal to the sum of
> the width of all the pseudo columns. In other words, this fake 999th standard
> column would reserve the total space needed by all the pseudo columns. FITS
> readers that do not understand this new convention would just interpret the
> 999th column as a wide 'B' column (e.g., '8000B') whereas knowledgeable FITS
> readers would know that this space is actually filled with the values of all
> the pseudo columns, as defined by the TFORnnnn keywords.
>
> Granted, a convention such as this could be defined, but it is not nearly as
> simple as implied in your proposal. It seems to me that this additional
> complexity would be a big drawback to winning wide-spread acceptance of the
> proposal.
>
> 3. Finally, as already mentioned by Arnold Rots, there are many more
> per-column keywords currently in use than the 9 listed in your proposal.
> There are roughly 40 additional per-column WCS keywords defined in the FITS
> Standard. In addition, there are an unknown number of other per-column
> keywords that have been defined in local conventions (the HEASARC's TLMINnnn
> and TLMAXnnn keywords are good examples). It would be very difficult to come
> up with a complete list of all these keywords.
>
> regards,
> Bill
>
> On 6/6/2012 6:57 AM, Mark Taylor wrote:
> > Hi FITS,
> >
> > There is an acknowledged limitation of 999 on the maximum number of
> > columns in a FITS table.
> >
> > The TABLE and BINTABLE extensions define the following per-column fields:
> >
> > TBCOLn (ASCII table only)
> > TDIMn (Binary table only)
> > TDISPn
> > TFORMn
> > TNULLn
> > TSCALn
> > TTYPEn
> > TUNITn
> > TZEROn
> >
> > to describe per-column metadata for the encoded table. Along with
> > the 8-character limitation on header card keywords, this limits the
> > number of columns that can be described to 999; TFORM999 is legal,
> > but TFORM1000 is not. The standard explicitly constrains the value of
> > the TFIELDS keyword to <=999 in acknowledgement of this limitation.
> >
> > With column counts from the large surveys in the hundreds, a couple of
> > table joins can have you hitting this restriction. I don't know what
> > the data from LSST etc will look like, but extrapolating survey
> > column counts over time would suggest that single tables in the
> > thousand-column range may be upon us soon. A user of TOPCAT
> > has recently reported encountering this in science use
> > (https://sympa.bris.ac.uk/sympa/arc/topcat-user/2012-06/msg00003.html),
> > so already it's not merely a theoretical problem.
> >
> > As a pragmatic solution, I suggest the following convention. Columns
> > before the 1000th in any table are described as per the existing
> > standard, but for columns 1000-9999, the 5th alphabetic character
> > of the Txxxx keyword (if present - no change required for TDIMn)
> > is removed to make space for an additional digit. The existing
> > constraint that the TFIELDS value shall be <=999 is also of
> > course relaxed to <=9999. Thus:
> >
> > XTENSION= 'BINTABLE'
> > ...
> > TFIELDS = 2112
> > ...
> > TFORM998= 'D'
> > TTYPE998= 'foo'
> > TFORM999= 'D'
> > TTYPE999= 'foo_err'
> > TFOR1000= 'D'
> > TTYP1000= 'bar'
> > TFOR1001= 'D'
> > TTYP1001= 'bar_err'
> > ...
> >
> > Under this rule, any table with fewer than 1000 columns looks exactly
> > the same as it does now. Columns >999 in wide tables will be unreadable
> > by software which is not aware of the convention, but such software
> > would be incapable of dealing with 1000+-column tables in any case.
> > Depending on implementation, non-aware software may be able to make
> > sense of the first 999 columns of wide tables.
> >
> > I am considering implementing this convention in the FITS I/O handlers
> > used by STIL (the Java table library used by TOPCAT and STILTS as well
> > as some other client- and server-side applications). If nothing else
> > this will enable STIL users to generate syntactically legal FITS files
> > (though containing illegal BINTABLE extensions) representing 1000+
> > column tables which they can use within STIL (e.g. to save/load tables
> > in TOPCAT), even if such files are not legible by other FITS table
> > applications, while I/O of tables with <1000 columns will be unaffected.
> >
> > However, if others choose to implement the same convention it could
> > become a de facto standard for wide tables, and possibly a candidate
> > for an update of the BINTABLE convention in a future version of the
> > FITS standard.
> >
> > Does anybody foresee problems with this suggestion, or want to suggest
> > a better alternative? The only possible backward compatibility issue
> > or unintended consequence I can think of is if there are already
> > keywords along the lines of TFORxxxx (x=[0-9]) in use in existing
> > table headers, but it seems rather unlikely. The other question is
> > whether 9999 is enough. 1e5-column tables are probably a little
> > way off, and extending this scheme to 5-digit column indices would
> > be problematic since TDIMn and TDISPn would both degenerate to TDInnnnn,
> > so I'd suggest punting that issue to future generations.
> >
> > Mark
> >
> > --
> > Mark Taylor Astronomical Programmer Physics, Bristol University, UK
> > m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/
> >
> > _______________________________________________
> > fitsbits mailing list
> > fitsbits at listmgr.cv.nrao.edu
> > http://listmgr.cv.nrao.edu/mailman/listinfo/fitsbits
> --
> ____________________________________________________________________
> Dr. William Pence William.Pence at nasa.gov
> NASA/GSFC Code 662 HEASARC +1-301-286-4599 (voice)
> Greenbelt MD 20771 +1-301-286-1684 (fax)
>
>
--
Mark Taylor Astronomical Programmer Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/