[fitsbits] UTF-8 in BINTABLE String Columns
Barrett, Paul
pebarrett at email.gwu.edu
Thu Mar 26 10:00:05 EDT 2026
Since this is already something of a breaking change, would it not be
beneficial in the long run to extend it to UTF-16 and UTF-32 as well?
-- Paul
On Thu, Mar 26, 2026 at 9:44 AM Francois-Xavier PINEAU via fitsbits <
fitsbits at listmgr.nrao.edu> wrote:
> Dear fitsbits,
>
>
> # Background
>
> VOTable (v1.5) is closely compatible with the FITS Binary Table format:
>
> https://www.ivoa.net/documents/VOTable/20250116/REC-VOTable-1.5.html#tth_sEc2.3
>
> In the current draft of VOTable 1.6,
>
> https://github.com/ivoa-std/VOTable/releases/download/auto-pdf-preview/VOTable-draft.pdf
>
> UTF-8 strings replace the previous ASCII-only strings.
>
> If FITS cannot store UTF-8, *lossless round-trip conversion from VOTable
> to FITS will no longer be possible*.
> Some limitations already exist (e.g., unsigned integer logical types), but
> UTF-8 seems more critical.
>
> My own use cases include using HEALPix-sorted and indexed BINTABLEs
> to build on-the-fly HATS products or intermediate HiPS catalogue
> representations from VizieR data (which will contain more and more
> UTF-8).
> * HATS: https://www.ivoa.net/documents/Notes/HATS/
> * HiPS catalogue: https://www.ivoa.net/documents/HiPS/
> * VizieR: https://vizier.u-strasbg.fr/
>
> # Possible Solutions
>
> ## 1. Use UTF-8 in existing `TFORMn=rA`
>
> As in VOTable 1.6, interpret `r` as a number of bytes rather than
> characters.
> This may break truncation operations (TDISPn) if a multi-byte UTF-8
> character is split.
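
To illustrate the truncation concern in Solution 1, here is a minimal Python sketch, assuming the field width `r` counts bytes and the input is valid UTF-8. The helper name `truncate_utf8` is hypothetical and not part of any FITS library. A naive byte slice can cut a multi-byte character in half, whereas backing up over continuation bytes (bit pattern `10xxxxxx`) keeps the result decodable.

```python
def truncate_utf8(data: bytes, width: int) -> bytes:
    """Cut `data` to at most `width` bytes without splitting a UTF-8 character.

    Hypothetical helper; assumes `data` is valid UTF-8 and `width` >= 0.
    """
    if len(data) <= width:
        return data
    end = width
    # If the first dropped byte is a continuation byte (10xxxxxx), the cut
    # falls inside a character: back up until it sits on a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = "Université".encode("utf-8")   # 11 bytes: the final "é" takes two
naive = encoded[:10]                      # ends with a dangling lead byte
safe = truncate_utf8(encoded, 10)         # drops the whole "é" instead

try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False
```

Here the naive slice no longer decodes, while the safe one yields `Universit`. A writer would still have to pad the shortened value back to `r` bytes.
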
>
> ## 2. Logical type "UTF-8" backed by a byte array
>
> TFORMn = rB
> TLOGTn = 'UTF-8' / LOGT stands for LOGical Type
>
> Unaware readers see a byte array; UTF-8 aware readers interpret it as a
> string.
> Introduces two string types in FITS (ASCII and UTF-8).
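
The reader side of Solution 2 might look like the sketch below. It assumes the header is available as a plain `dict`, that `TLOGTn` is the keyword proposed above (it is not in the current FITS standard), and that unused trailing bytes are NUL-padded. `read_cell` is a hypothetical helper, not an existing API.

```python
def read_cell(header: dict, col: int, raw: bytes):
    """Return a str for UTF-8 logical columns, raw bytes otherwise.

    Hypothetical sketch: `header` maps keyword names to values, and
    TLOGTn is the proposed (non-standard) logical-type keyword.
    """
    tform = header[f"TFORM{col}"].strip()
    logical = header.get(f"TLOGT{col}", "").strip().upper()
    if tform.endswith("B") and logical == "UTF-8":
        # Assumption: fields shorter than the column width are NUL-padded.
        return raw.rstrip(b"\x00").decode("utf-8")
    # A reader unaware of TLOGTn, or a column without it, sees plain bytes.
    return raw


raw = "Université".encode("utf-8") + b"\x00"          # 11 bytes + 1 pad byte
aware = read_cell({"TFORM1": "12B", "TLOGT1": "UTF-8"}, 1, raw)
unaware = read_cell({"TFORM1": "12B"}, 1, raw)
```

With the keyword present, the cell comes back as a string; without it, the same bytes are returned unchanged, which is the fallback behaviour described for unaware readers.
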
>
> ## 3. New TFORM type (e.g., `TFORMn=rU`)
>
> Definite breakage for current readers.
>
>
> # Existing Implementations
>
> * TOPCAT/STILTS (Java): Prototype supports Solutions 1 and 2 for
> read/write (private communication with Mark Taylor).
> * fitstable (Rust): Supports Solutions 1 and 2 for reading (
> https://github.com/cds-astro/cds-fitstable-rust).
> * VizieR: Appears to provide UTF-8 in TFORMn=rA columns (Solution 1).
> * ??
>
>
> # Feedback Requested
>
> I am curious about:
> * other possible approaches
> * fitsbits opinions on the most practical solution
> * other people interested in having UTF-8 in BINTABLE columns
>
> Currently, Solution 1 seems the simplest and Solution 2 the safest,
> but I welcome constructive comments and experience from the community.
>
> Best regards,
> --
>
> Francois-Xavier Pineau
> Research Engineer
> Tel: +33 (0)3 68 85 24 14,
> francois-xavier.pineau at astro.unistra.fr
>
> Centre de Données astronomiques de Strasbourg (CDS)
> 11, rue de l'Université - E03
>
>
> _______________________________________________
> fitsbits mailing list
> fitsbits at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/fitsbits
>