[fitsbits] UTF-8 in BINTABLE String Columns {External}
Arnold Rots
arots at cfa.harvard.edu
Thu Apr 2 14:05:51 EDT 2026
Question: When UTF-8 is specified, is it implicitly assumed that
characters may occupy 1 to 4 bytes,
or is the intent to still restrict the characters to one byte?
If the latter, I don't see too much of a problem. Old ASCII-compliant files
can still be read correctly;
it's just old ASCII-compliant readers that can't correctly interpret the
new files. But that software can be updated.
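
To make the byte-width question concrete, here is a minimal Python sketch.
It is not tied to any FITS library, and the names `utf8_width`,
`ascii_reader_accepts`, and `truncate_utf8` are hypothetical, chosen only
for illustration. It shows how many bytes a character takes in UTF-8, that
pure-ASCII data is byte-identical under both encodings, and one possible
way to cut a fixed-width field without splitting a multi-byte character
(assuming the input is valid UTF-8).

```python
def utf8_width(ch: str) -> int:
    """Number of bytes a single Unicode character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample character per width: "A", e-acute, euro sign, an emoji.
WIDTHS = [utf8_width(c) for c in ("A", "\u00e9", "\u20ac", "\U0001F600")]


def ascii_reader_accepts(raw: bytes) -> bool:
    """Mimic a strict ASCII-only reader: any byte >= 0x80 is rejected."""
    try:
        raw.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


def truncate_utf8(data: bytes, width: int) -> bytes:
    """Cut valid UTF-8 `data` to at most `width` bytes on a character boundary."""
    if len(data) <= width:
        return data
    end = width
    # Continuation bytes look like 10xxxxxx. If the first dropped byte is one,
    # the cut would split a character, so step back to that character's start.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


if __name__ == "__main__":
    print(WIDTHS)  # [1, 2, 3, 4]
    # Pure-ASCII text has the same bytes in ASCII and in UTF-8.
    print("NGC 1234".encode("ascii") == "NGC 1234".encode("utf-8"))  # True
    # A strict ASCII reader fails only on data that uses non-ASCII characters.
    print(ascii_reader_accepts("caf\u00e9".encode("utf-8")))  # False
    # A naive 4-byte cut of "cafe" with e-acute would split the last character.
    print(truncate_utf8("caf\u00e9".encode("utf-8"), 4))  # b'caf'
```

A reader that truncates by bytes this way may return fewer than `width`
bytes, so any padding convention would still need to be agreed on.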
Arnold H Rots
Research Associate
SAO/HEAD
Center for Astrophysics | Harvard & Smithsonian
Email: arots at cfa.harvard.edu
Office: +1 617 496 7701 | Cell: +1 617 721 6756
60 Garden Street | MS 69 | Cambridge, MA 02138 | USA
On Sun, Mar 29, 2026 at 8:00 PM James Tocknell via fitsbits <
fitsbits at listmgr.nrao.edu> wrote:
> I'm not sure there's any value in supporting UTF-16 or UTF-32;
> https://utf8everywhere.org/ explains why UTF-8 should be
> the standard interchange format (basically, both take up more space and
> encourage misconceptions about Unicode). Also, for things like paths on
> Windows (as opposed to Unix systems, where a path is 8-bit bytes in some
> encoding), you can't rely on well-formed UTF-16 anyway
> (see https://wtf-8.codeberg.page/).
> Practically speaking, if something accepts ASCII it'll probably accept
> UTF-8 (and someone has likely already slipped in Latin-1 unless people are
> validating that the data is ASCII only); that is not true of UTF-16 or
> UTF-32.
>
> James
>
> ________________________________________
> From: fitsbits <fitsbits-bounces at listmgr.nrao.edu> on behalf of Barrett,
> Paul via fitsbits <fitsbits at listmgr.nrao.edu>
> Sent: Friday, 27 March 2026 1:00 AM
> To: Francois-Xavier PINEAU
> Cc: fitsbits at nrao.edu
> Subject: Re: [fitsbits] UTF-8 in BINTABLE String Columns {External}
>
> Because this is somewhat of a breaking change, would it not be beneficial
> in the long run to extend this to UTF-16 and UTF-32?
>
> -- Paul
>
>
> On Thu, Mar 26, 2026 at 9:44 AM Francois-Xavier PINEAU via fitsbits <
> fitsbits at listmgr.nrao.edu> wrote:
>
> Dear fitsbits,
>
> # Background
>
> VOTable (v1.5) is closely compatible with the FITS Binary Table format:
>
> https://www.ivoa.net/documents/VOTable/20250116/REC-VOTable-1.5.html#tth_sEc2.3
>
> In the current draft of VOTable 1.6
> (https://github.com/ivoa-std/VOTable/releases/download/auto-pdf-preview/VOTable-draft.pdf),
> UTF-8 strings replace the previous ASCII-only strings.
>
> If FITS cannot store UTF-8, lossless round-trip conversion from VOTable to
> FITS will no longer be possible.
> Some limitations already exist (e.g., unsigned integer logical types), but
> UTF-8 seems more critical.
>
> Personal use cases include using HEALPix-sorted and indexed
> BINTABLEs to build on-the-fly HATS products
> or intermediary HiPS catalogue representations from VizieR data (which will
> contain more and more UTF-8).
> * HATS: https://www.ivoa.net/documents/Notes/HATS/
> * HiPS catalogue: https://www.ivoa.net/documents/HiPS/
> * VizieR: https://vizier.u-strasbg.fr/
>
>
> # Possible Solutions
>
> ## 1. Use UTF-8 in existing `TFORMn=rA`
>
> As in VOTable 1.6, interpret `r` as a number of bytes instead of characters.
> May break truncation operations (TDISPn) if a multi-byte UTF-8 character is
> split.
>
> ## 2. Logical type "UTF-8" backed by a byte array
>
> TFORMn = rB
> TLOGTn = 'UTF-8' / LOGT stands for LOGical Type
>
> Unaware readers see a byte array; UTF-8 aware readers interpret it as a
> string.
> Introduces two string types in FITS (ASCII and UTF-8).
>
> ## 3. New TFORM type (e.g., `TFORMn=rU`)
>
> Definite breakage for current readers.
>
>
> # Existing Implementations
>
> * TOPCAT/STILTS (Java): Prototype supports Solutions 1 and 2 for
> read/write (private communication with Mark Taylor).
> * fitstable (Rust): Supports Solutions 1 and 2 for reading
> (https://github.com/cds-astro/cds-fitstable-rust).
> * VizieR: Appears to provide UTF-8 in TFORMn=rA columns (Solution 1).
> * ??
>
>
> # Feedback Requested
>
> I am curious about:
> * other possible approaches
> * fitsbits opinions on the most practical solution
> * other people interested in having UTF-8 in BINTABLE columns
>
> Currently, Solution 1 seems the simplest and Solution 2 the safest,
> but I welcome constructive comments and experience from the community.
>
> Best regards,
>
> --
>
> Francois-Xavier Pineau
> Ingénieur de Recherche
> Tél : +33 (0)3 68 85 24 14,
> francois-xavier.pineau at astro.unistra.fr
>
> Centre de Données astronomiques de Strasbourg (CDS)
> 11, rue de l'Université - E03
>
>
>
> _______________________________________________
> fitsbits mailing list
> fitsbits at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/fitsbits
>