[fitsbits] UTF-8 in BINTABLE String Columns {External}

Mark Taylor m.b.taylor at bristol.ac.uk
Sat Apr 4 04:25:11 EDT 2026


Arnold,

ASCII text is identical to the UTF-8 encoding of the same content;
this is one of the nice things about UTF-8.
So if there were such a thing as UTF-8 restricted to one byte per
character, it would be identical to ASCII.
However, UTF-8 is inherently a variable-length encoding, so each
character (strictly, each Unicode code point) may be encoded
as 1, 2, 3 or 4 bytes.
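
As a quick illustration of the variable length, here is a small Python
sketch (plain standard library, nothing FITS-specific; the sample
characters are arbitrary choices) that prints how many bytes UTF-8
uses for a few code points:

```python
# Byte length of single code points when encoded as UTF-8.
# The sample characters are arbitrary illustrations.
samples = ["A", "é", "€", "😀"]  # U+0041, U+00E9, U+20AC, U+1F600

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Collect the lengths so they can be inspected programmatically.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Pure ASCII text yields the same bytes under either encoding.
ascii_matches_utf8 = "NGC 1275".encode("ascii") == "NGC 1275".encode("utf-8")
```

The four samples take 1, 2, 3 and 4 bytes respectively, and the
ASCII-only string encodes to the same bytes either way.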

This means that any existing ASCII content would be read correctly
by a UTF-8 reader.  Similarly, UTF-8 content that happens to
contain only ASCII-friendly characters would be read correctly
by an ASCII-only reader.

So if TFORM='A' were redefined to mean UTF-8 instead of ASCII,
as proposed, then both old and new readers would still read text
composed only of ASCII characters correctly, but non-ASCII content
would be read correctly only by UTF-8-aware readers.
Old readers would see the non-ASCII bytes as illegal FITS and
presumably either reject the input or display garbled characters.
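
To make the reader behaviours concrete, here is a rough Python sketch.
The function read_field and the 8-byte field width are hypothetical,
not taken from any FITS library, and padding handling is simplified to
stripping trailing spaces. The "lenient" old reader is assumed to treat
each byte as Latin-1, which is only one possible way of garbling.

```python
def read_field(raw: bytes, encoding: str) -> str:
    """Decode a fixed-width character field (hypothetical helper).

    Simplification: only trailing space padding is stripped.
    """
    return raw.decode(encoding).rstrip(" ")

# Two 8-byte fields: one pure ASCII, one containing a 2-byte character.
ascii_field = b"Vega    "
utf8_field = "Möbius".encode("utf-8").ljust(8)  # 7 bytes + 1 space of padding

# ASCII-only content: an ASCII reader and a UTF-8 reader agree.
ascii_agrees = read_field(ascii_field, "ascii") == read_field(ascii_field, "utf-8")

# Non-ASCII content: a UTF-8-aware reader recovers the text.
new_reader = read_field(utf8_field, "utf-8")

# A strict old reader rejects the bytes above 0x7F.
try:
    read_field(utf8_field, "ascii")
    strict_result = "accepted"
except UnicodeDecodeError:
    strict_result = "rejected"

# A lenient old reader (assumed here to decode as Latin-1) shows mojibake.
lenient_result = read_field(utf8_field, "latin-1")

print(ascii_agrees, new_reader, strict_result, lenient_result)
```

The strict reader raises a decode error, while the Latin-1 reader
silently turns the two bytes of "ö" into two unrelated characters.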

Mark

On Thu, 2 Apr 2026, Arnold Rots via fitsbits wrote:

> Question: When UTF8 is specified, is it implicitly assumed that
> characters may occupy 1, 2, or 4 bytes,
> or is the intent to still restrict the characters to one byte?
> If the latter, I don't see too much of a problem. Old ASCII compliant files
> can still be read correctly;
> it's just old ASCII compliant readers that can't correctly interpret the
> new files. But that software can be updated.
> 
> Arnold H Rots
> 
> Research Associate
> 
> SAO/HEAD
> 
> Center for Astrophysics | Harvard & Smithsonian
> 
> Email: arots at cfa.harvard.edu
> 
> Office: +1 617 496 7701 | Cell: +1 617 721 6756
> 
> 60 Garden Street | MS 69 | Cambridge, MA 02138 | USA
> 
> 
> 
> On Sun, Mar 29, 2026 at 8:00 PM James Tocknell via fitsbits <
> fitsbits at listmgr.nrao.edu> wrote:
> 
> > I'm not sure there's any value in supporting UTF-16 or UTF-32;
> > https://utf8everywhere.org/ provides details as to why UTF-8 should be
> > the standard interchange format (basically, both take up more space and
> > encourage misconceptions about Unicode). Also, for things like paths on
> > Windows (as opposed to Unix systems, where it's 8-bit bytes in some
> > encoding), you can't rely on UTF-16 anyway (see
> > https://wtf-8.codeberg.page/).
> > Practically speaking, if something accepts ASCII it'll probably accept
> > UTF-8 (and someone has likely already slipped in Latin-1 unless people are
> > validating that the data is ASCII only); that is not true of UTF-16 or
> > UTF-32.
> >
> > James
> >
> > ________________________________________
> > From: fitsbits <fitsbits-bounces at listmgr.nrao.edu> on behalf of Barrett,
> > Paul via fitsbits <fitsbits at listmgr.nrao.edu>
> > Sent: Friday, 27 March 2026 1:00 AM
> > To: Francois-Xavier PINEAU
> > Cc: fitsbits at nrao.edu
> > Subject: Re: [fitsbits] UTF-8 in BINTABLE String Columns {External}
> >
> > Because this is somewhat of a breaking change, would it not be beneficial
> > in the long run to extend this to UTF-16 and UTF-32?
> >
> >  -- Paul
> >
> >
> > On Thu, Mar 26, 2026 at 9:44 AM Francois-Xavier PINEAU via fitsbits <
> > fitsbits at listmgr.nrao.edu> wrote:
> >
> > Dear fitsbits,
> >
> > # Background
> >
> > VOTable (v1.5) is closely compatible with the FITS Binary Table format:
> >
> > https://www.ivoa.net/documents/VOTable/20250116/REC-VOTable-1.5.html#tth_sEc2.3
> >
> > In the current draft of VOTable 1.6
> > (https://github.com/ivoa-std/VOTable/releases/download/auto-pdf-preview/VOTable-draft.pdf),
> > UTF-8 strings replace the previous ASCII-only strings.
> >
> > If FITS cannot store UTF-8, lossless round-trip conversion from VOTable to
> > FITS will no longer be possible.
> > Some limitations already exist (e.g., unsigned integer logical types), but
> > UTF-8 seems more critical.
> >
> > Personal use cases include the usage of HEALPix sorted and indexed
> > BINTABLEs to build on-the-fly HATS products
> > or intermediary HiPS catalogue representations from VizieR data (which
> > will contain more and more UTF-8).
> > * HATS: https://www.ivoa.net/documents/Notes/HATS/
> > * HiPS catalogue: https://www.ivoa.net/documents/HiPS/
> > * VizieR: https://vizier.u-strasbg.fr/
> >
> >
> > # Possible Solutions
> >
> > ## 1. Use UTF-8 in existing `TFORMn=rA`
> >
> > As in VOTable 1.6, interpret `r` as a count of bytes instead of characters.
> > May break truncation operations (TDISP) if a multi-byte UTF-8 character is
> > split.
> >
> > ## 2. Logical type "UTF-8" backed by a byte array
> >
> > TFORMn = rB
> > TLOGTn = 'UTF-8'  / LOGT stands for LOGical Type
> >
> > Unaware readers see a byte array; UTF-8 aware readers interpret it as a
> > string.
> > Introduces two string types in FITS (ASCII and UTF-8).
> >
> > ## 3. New TFORM type (e.g., `TFORMn=rU`)
> >
> > Definite breakage for current readers.
> >
> >
> > # Existing Implementations
> >
> >  * TOPCAT/STILTS (Java): Prototype supports Solutions 1 and 2 for
> > read/write (private communication with Mark Taylor).
> >  * fitstable (Rust): Supports Solutions 1 and 2 for reading
> > (https://github.com/cds-astro/cds-fitstable-rust).
> >  * VizieR: Appears to provide UTF-8 in TFORMn=rA columns (Solution 1).
> >  * ??
> >
> >
> > # Feedback Requested
> >
> > I am curious about:
> >  * other possible approaches
> >  * fitsbits opinions on the most practical solution
> >  * other people interested in having UTF-8 in BINTABLE columns
> >
> > Currently, Solution 1 seems the simplest and Solution 2 the safest,
> > but I welcome constructive comments and experience from the community.
> >
> > Best regards,
> >
> > --
> >
> > Francois-Xavier Pineau
> > Ingénieur de Recherche
> > Tél : +33 (0)3 68 85 24 14,
> > francois-xavier.pineau at astro.unistra.fr
> >
> > Centre de Données astronomiques de Strasbourg (CDS)
> > 11, rue de l'Université - E03
> >
> >
> >
> > _______________________________________________
> > fitsbits mailing list
> > fitsbits at listmgr.nrao.edu
> > https://listmgr.nrao.edu/mailman/listinfo/fitsbits
> >
> 

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          https://www.star.bristol.ac.uk/mbt/

