[fitsbits] Output array type when BZERO is an integer {External}

Wed Mar 13 15:18:35 EDT 2024

I am happy to discuss use cases for illustrative purposes, but I'd like to note that, having published a standard that explicitly states support for 64-bit signed and unsigned integers, we don't want to be in the position of saying anything to people running into issues that implies that they don't or shouldn't need these features.

* * *

As others have noted, the use of >32-bit integers for IDs is quite common in the field, so the case for needing these, at least, is pretty much ironclad.  Rubin will have tens of billions of objects and tens of trillions of observations of objects, each of which needs an ID, so this isn't even a matter of someone inefficiently dividing the integer into fields.

The Rubin baseline is currently envisioning using all 64 bits -- here there is some use of fields and therefore some entropy in the assignments.  I am presently uneasy about using the 64th bit precisely because of issues in the handling of unsigned integers in data chains and in various community standards.  There's user resistance to the alternative of representing 64-bit IDs as signed, I believe because of fears that users will assume that there's some deep significance to IDs being negative.

In image extensions, Rubin and my other project, SPHEREx, make heavy use of integer extensions that represent bit-fielded per-pixel flags describing a mix of things like quality assessments, application of algorithms, and other types of provenance, and we intend to have these in released data products.  There's a long history of this in the field.  (We will be releasing a FITS-convention-like documentation of the details of how we do this, including self-documenting headers.)

Thanks to the nice mechanisms for per-extension compression in FITS, it's possible to use wide integers for extensions containing these flags without greatly increasing the sizes of the resulting image files.

We are using more than 32 bits internally in SPHEREx; whether the final public data products have that many is still TBD.  Again, in deciding whether to push ourselves to keep the count below 32, it is important to us to have high confidence that interoperable community tooling (at least, the actively maintained tooling) will deliver the flags to users without an information-destroying pass through floating point.
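
To make the floating-point concern concrete, here is a minimal Python sketch. The bit positions and names (LOW_FLAG, HIGH_FLAG) are invented for illustration and are not the Rubin or SPHEREx flag layout; the only assumption is a reader that converts pixel values to 64-bit float, which has a 53-bit significand.

```python
# Hypothetical per-pixel flag words; bit assignments are invented for illustration.
LOW_FLAG = 1 << 0     # e.g. a quality bit in the lowest position
HIGH_FLAG = 1 << 60   # e.g. a provenance bit above bit 53

wide_flags = LOW_FLAG | HIGH_FLAG        # needs 61 bits
narrow_flags = LOW_FLAG | (1 << 31)      # fits in 32 bits

# Simulate a reader that passes the integer through 64-bit floating point.
wide_via_float = int(float(wide_flags))
narrow_via_float = int(float(narrow_flags))

low_bit_survives = bool(wide_via_float & LOW_FLAG)
high_bit_survives = bool(wide_via_float & HIGH_FLAG)

print("wide flags intact:  ", wide_via_float == wide_flags)
print("narrow flags intact:", narrow_via_float == narrow_flags)
print("low bit survives:   ", low_bit_survives)
print("high bit survives:  ", high_bit_survives)
```

In this sketch the 32-bit flag word survives the round trip, while the wide one silently loses its low bit, which is the kind of loss a flag consumer would not notice.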

Gregory

________________________________________
From: Seaman, Robert Lewis - (rseaman) <rseaman at arizona.edu>
Sent: Tuesday, March 12, 2024 15:00
To: Dubois-Felsmann, Gregory P.; Barrett, Paul
Cc: fitsbits at listmgr.nrao.edu
Subject: Re: [fitsbits] Output array type when BZERO is an integer

Hi Gregory and all,

As a matter of curiosity, do Rubin operations depend on 64-bit unsigned integers? What are example use cases for 64-bit integers (signed or unsigned) in the community? In the optical and infrared, I would hazard a guess that by far, the most prevalent raw and pipeline-reduced astronomical pixel data types are unsigned shorts, signed 32-bit integers, and 32-bit floating-point, but a greater diversity of data types must appear in binary tables.

Is there a more recent version of a FITS User’s Guide than https://archive.stsci.edu/fits/users_guide/ ? Or are there examples of such documentation tailored for particular observatories, projects, purposes, stakeholders?

Rob

On 3/12/24, 11:52 AM, "Dubois-Felsmann, Gregory P." wrote:

I think what we're hearing from the more experienced hands is that the standard isn't at all concerned with issues like mandating upconversion.  In general it's just specifying the mathematical, not the computational, operation, and it is never mandating a specific in-memory representation in a client.

But I think there is some guidance at the end of the BZERO section in 4.4.2.5: once the client software has made the (underspecified) determination that BZERO is from Table 11 and BSCALE has "the default value of 1.0", it says "the physical value is computed by adding the offset specified by the BZERO keyword to the native data type value that is stored in the FITS file".  Note that BSCALE is explicitly left out of this; i.e., in this case the standard isn't just relying on the mathematical no-op of multiplying by one, but is saying explicitly that the client should ignore BSCALE in the calculation.
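
As a rough illustration of that reading, the following Python sketch applies only the offset, in integer arithmetic, for the unsigned 64-bit case (stored type signed 64-bit, BZERO = 2**63). The function names are hypothetical and this is not taken from any FITS library.

```python
import struct

BZERO_UINT64 = 9223372036854775808  # 2**63, the unsigned 64-bit offset

def to_stored(physical):
    """Hypothetical writer side: unsigned 64-bit physical value -> signed stored value."""
    return physical - BZERO_UINT64

def to_physical(stored):
    """Reader side per the quoted text: add BZERO to the stored integer; BSCALE is not applied."""
    return stored + BZERO_UINT64

physical = 2**64 - 1                     # largest unsigned 64-bit value
stored = to_stored(physical)             # 2**63 - 1, fits a signed 64-bit field
raw = struct.pack(">q", stored)          # big-endian signed 64-bit, as stored in a FITS data unit
(read_back,) = struct.unpack(">q", raw)
recovered = to_physical(read_back)

print(stored, recovered, recovered == physical)
```

Because no multiplication by a floating-point BSCALE takes place, every value in the unsigned 64-bit range is recovered exactly.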

We haven't really talked about it, but all the same issues arise with regard to table columns, because of the similar definition of the currently very rarely used TSCALn and TZEROn keywords as having "default" values, rather than distinguishing between the "provided explicitly" and "not provided" cases.

Again, as a data publisher, I know the difference between "I am publishing an integer column that I expect users to see as integral" and "I am publishing a generic numeric column and I'm trying to save space by packing it into a 16-bit integer and scaling it".  It would be nice to have a spec that says somewhat more precisely what I'm supposed to do to convey the former message unambiguously, particularly because if it's a 64-bit integer I very much want to send client software a signal that I don't want it to be accidentally "promoted" to 64-bit float.

Obviously as a data provider I'm not going to troll my users by fiddling with my melodramatic 0.999999999999999s.  For signed integers I'm going to omit BZERO/TZEROn and BSCALE/TSCALn altogether and I'm confident that Paul's library will do what I want no matter what we say on this thread.

But if I'm trying to publish an unsigned integer, I am genuinely uncertain whether a client's behavior will depend on whether I say "9223372036854775808" or "9223372036854775808.", and if it does, the consequence is potentially information-destroying.  I will err on the side of caution and write the former, of course, but I'd rather have the standard on my side here.
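
The following Python sketch shows how such a dependence could arise. It models a purely hypothetical client (parse_bzero and apply_bzero are invented names, and the parsing rule is deliberately simplified) that treats a keyword value containing a decimal point as floating point and computes in that type.

```python
def parse_bzero(text):
    """Hypothetical client rule: no '.' or exponent means integer, otherwise float."""
    text = text.strip()
    if "." in text or "e" in text.lower():
        return float(text)
    return int(text)

def apply_bzero(stored, bzero):
    # An integer BZERO stays in exact integer arithmetic; a float BZERO
    # forces the stored value through 64-bit floating point first.
    if isinstance(bzero, int):
        return stored + bzero
    return float(stored) + bzero

stored = -2**63 + 1   # signed value on disk whose physical (unsigned) value is 1

exact = apply_bzero(stored, parse_bzero("9223372036854775808"))
lossy = apply_bzero(stored, parse_bzero("9223372036854775808."))

print("integer BZERO:", exact)
print("float BZERO:  ", lossy)
```

Under this assumed rule, the same stored value yields 1 with the integer spelling and 0.0 with the trailing decimal point, since the stored value is rounded to the nearest representable double before the offset is added.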

As a side remark, I hope we can discuss such things without the people who put so much work into what is already in the standard feeling denigrated, and avoiding value judgements about the quality of people's work.

Gregory

________________________________________
From: fitsbits <fitsbits-bounces at listmgr.nrao.edu> on behalf of Barrett, Paul via fitsbits <fitsbits at listmgr.nrao.edu>
Sent: Tuesday, March 12, 2024 07:57
To: Seaman, Robert Lewis - (rseaman)
Cc: fitsbits at listmgr.nrao.edu
Subject: Re: [fitsbits] Output array type when BZERO is an integer {External}

I'll ask this question one more time and then I'll let it go.

I understand that the default behaviour for BZERO and BSCALE creates a floating point array because of the typical upconversion rules. However, I'm not clear about the data type for the special case where BZERO is an integer. In this case, it appears that BZERO is added first to the integer array before converting it to a floating point array, because BSCALE = 1.0 implies upconversion. Is this correct?

As for your comments:

* I disagree with your first comment. FITS is used because of peer pressure. It is mandated by NASA. That means a large sector of the community HAS to use it.
* Yes, dynamic languages are dynamic enough. In the case of Julia, it can do everything that C/C++, FORTRAN, and Python can do. Think of Julia as Python with Numba built-in.

 -- Paul

On Tue, Mar 12, 2024 at 9:39 AM Seaman, Robert Lewis - (rseaman) via fitsbits <fitsbits at listmgr.nrao.edu> wrote:
Howdy,

It is always good to see a spirited FITS discussion! A few more peppy points:

  *   There is always an assertion that it would be preferable to use a “modern” format

     *   Yet projects often end up using FITS
     *   This choice does not result from peer pressure

  *   There is nothing magic about IEEE floating point or twos-complement integers

     *   Efficient (compressed) data representations may not even be binary (Rice is unary)
     *   Are dynamically typed languages dynamic enough?

  *   A tile-compressed image is a simple binary table

     *   My first encounter with FITS data (c. 1983) was writing a FITS image reader from scratch by consulting the original journal article(s) (possibly also my first encounter with C)
     *   I am confident young Rob could have written a reader for tile-compressed binary data with little more effort (or code) just from reading the current FITS standard

  *   FITS documentation is pretty good

     *   (Comments about other projects’ documentation omitted)

  *   Most FITS discussions/disagreements are about metadata

     *   Only a small minority of FITS metadata is strictly required to enforce the structure of each extension
     *   Science metadata (astronomical and computer science) would be legal (and trivial) to represent, using any schema you like, in a binary table structure, described in a convention or appendix or chapter of the standard
     *   Schemata could also include language-specific pragmas, for data-typing purposes or otherwise

  *   It is perhaps peer pressure that pushes projects to use 80-char ASCII header keywords in 2880-byte records

     *   Consider, rather, what is the optimal tiled representation for your project, and separately
     *   How can your project’s (and community) metadata best be represented in a schema realized as a binary table?

Rob

_______________________________________________
fitsbits mailing list
fitsbits at listmgr.nrao.edu
https://listmgr.nrao.edu/mailman/listinfo/fitsbits