[fitsbits] FITS 'keyword dictionaries'

Norman Gray norman at astro.gla.ac.uk
Mon Apr 14 11:50:04 EDT 2014


Joe and all, hello.

I'm late jumping in.  There's lots I'd like to add, but to avoid a huge wall of text (what's below is merely... huge-ish...), I'll aim to be as concise as possible, hopefully without becoming telegraphically opaque.  I'm very happy to expand on anything here.

I was one of the editors of the IVOA 'Vocabularies' document that Joe quoted in his first message.  The goal of that document was (a) to suggest, with rationale, that SKOS is a Good Thing as far as shared vocabularies go, and (b) given that you're persuaded, to provide some good practices for designing and sharing such things within astronomy.

(It's not quite the same thing, but I'll give a plug here to the Unified Astronomy Thesaurus <http://astrothesaurus.org> which will also have a SKOS manifestation.)

On 2014 Apr 12, at 01:12, Joe Hourcle <oneiros at grace.nascom.nasa.gov> wrote:

> So, to build on the stuff that went around last week, and based on the recommendations from IVOA on using SKOS to represent vocabularies ... here's a short outline of what I'm thinking  ... this is for just trying to document things that already exist, so we don't have to modify existing FITS files.

I'd be very keen to help here, in any way I can.

> (and for those not familiar with SKOS ... it's the "Simple Knowledge Organization System", an ontology for describing thesauri and other controlled vocabularies;  see http://www.unc.edu/~prjsmith/skos_guide.html for a relatively short explanation of it)

One key thing about SKOS, and thesauri in general, is that it's distinct from an 'ontology'.  Thesauri are intended for general understanding, and for 'search' (broadly conceived); ontologies are more precise, and intended to support some level of machine 'understanding' (which sounds grand, but is in fact the very modest goal of allowing some level of meaning to survive machine processing).

So a (SKOS) thesaurus and an (OWL) ontology are significantly different things, but with only a little inventiveness, it's possible to synchronise them so that there's a one-to-one correspondence between thesaurus and ontology concepts, which lets you do the things you expect to do, in a principled way.

Note (avoids misunderstandings below): a SKOS concept (such as <http://www.ivoa.net/rdf/Vocabularies/UCD#Emir>, referring to the em.IR UCD) is a name for the _concept_ of infra-red radiation, as opposed to a name for a _class_.  There are no _instances_ of SKOS concepts, though there would be instances of a correspondingly named OWL Class (such as the class of telescopes).

Note 2: RDF is 'open world', meaning that there is never a closed list of keywords, nor any restrictions on what one can say about a particular SKOS concept (say) once it's been defined.

> Standardizing Documentation of FITS Headers:

>    Try to come up with a machine-actionable format that
>    can be used by software developers so they don't have to
>    do as much work to integrate each mission/instrument.

Spot on.

> Things to try to do:
> 
>    1. identify which keywords are supposed to be FITS standard,
>        which are discipline, mission, instrument, or processing
>        specific.
> 
>    2. Provide units if not explicitly given
> 
>    3. Identify range of possible values
> 
>    4. Provide expansion / explanation of enumerations
> 
>    5. Provide free text to describe how to interpret coded values
>        (eg, SDO 'QUALITY'), or for additional details.
> 
>    6. Identify the authoritative value from a group of similar values.
>        (thus identifying which values are derived, and have more error)

Excellent goals.

> 1. Map concepts in SKOS to recreate the various 'keyword dictionaries'
> 
>    KEYWORD:      skos:prefLabel

The SKOS way would be to make the URL <http://www.ivoa.net/rdf/Vocabularies/UCD#Emir> the 'keyword' -- the prefLabel is intended to be for human display, so that this concept might have different labels in @en and @fr, and so on.

For the example of TELESCOP, one could imagine a SKOS concept

@prefix fits: <http://fits.gsfc.nasa.gov/rdf/concepts#> .
@prefix fitsrel: <http://fits.gsfc.nasa.gov/rdf/relations#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

fits:Telescope  # that is, the name of this concept is <http://fits.gsfc.nasa.gov/rdf/concepts#Telescope>
  a skos:Concept; # ie, this is the name for the concept of a telescope
  skos:prefLabel "telescope"@en, "Teleskop"@de, "telescopio"@it;
  fitsrel:keyword "TELESCOP"; # name of the FITS keyword being described
  skos:definition "This gives the name or other identifier of a particular telescope".

or...

@prefix soho: <http://example.org/soho#>.

soho:SOHO
  a skos:Concept;
  skos:definition "The SOHO mission....".

soho:LASCO
  a skos:Concept;
  skos:definition "The LASCO package is...";
  skos:broader soho:SOHO.  # LASCO is associated with SOHO

etc, etc, etc -- there's huge flexibility here, in a couple of well-defined syntaxes with good multi-language parser support.
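
To illustrate the split between the concept URL (the stable 'keyword') and the prefLabel (for human display), here is a minimal Python sketch.  It assumes the fits:Telescope concept above has already been read into a plain dict; the dict layout and the function name are illustrative only, not part of any SKOS library.

```python
# Sketch only: a concept modelled as a plain dict rather than parsed RDF.
# The URI and labels mirror the fits:Telescope example above.
TELESCOPE = {
    "uri": "http://fits.gsfc.nasa.gov/rdf/concepts#Telescope",
    "keyword": "TELESCOP",
    "prefLabel": {"en": "telescope", "de": "Teleskop", "it": "telescopio"},
}


def display_label(concept, lang, fallback="en"):
    """Return the label for lang, else the fallback label, else the URI."""
    labels = concept["prefLabel"]
    return labels.get(lang) or labels.get(fallback) or concept["uri"]


print(display_label(TELESCOPE, "de"))
print(display_label(TELESCOPE, "fr"))  # no French label, so the fallback is used
```

Software keys on the URI throughout; the label is looked up only at the moment something has to be shown to a person.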

>    REFERENCE:    skos:exactMatch 
>    NAME:         skos:[exact|close]Match <associated UCD+> (once UCD is in SKOS)

The IVOA document mentions a SKOS version of the UCD1+ vocabulary: <http://www.ivoa.net/documents/REC/Semantics/Vocabularies-20091007.html#vocab-ucd1>.

>    EXAMPLE:      skos:example
>    DESCRIPTION:  skos:definition
> 
>    Extend SKOS where necessary:
> 
>    STATUS:       fitsKeyword:isRequired
>    DEFAULT:      fitsKeyword:defaultValue
>    INDEX:        fitsKeyword:indexMin & indexMax
>    HDU:          fitsKeyword:hdu
>    DATATYPE:     fitsKeyword:valueType
>    RANGE:        fitsKeyword:valueMin & valueMax ; stringLength
>    VALUE:        (covered by DATATYPE and RANGE)
>    UNITS:        fitsKeyword:units
>    COMMENT:      fitsKeyword:cardComment

These needn't be _extensions_ to SKOS, since RDF (which SKOS is based on) is intrinsically extensible.  RDF-reading applications must simply ignore properties which they don't recognise.
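
As a rough sketch of what 'ignore what you don't recognise' looks like in practice, assume the RDF has already been parsed into (subject, predicate, object) tuples.  The fits:Exptime concept, the other:calibrationNote property and the fitsKeyword:units name (taken from Joe's list above) are all hypothetical.

```python
# Sketch only: triples as (subject, predicate, object) tuples.
# The fitsKeyword: name follows Joe's proposal; nothing here is a real vocabulary.
KNOWN = {"skos:prefLabel", "skos:definition", "fitsKeyword:units"}

TRIPLES = [
    ("fits:Exptime", "skos:prefLabel", "exposure time"),
    ("fits:Exptime", "fitsKeyword:units", "s"),
    ("fits:Exptime", "other:calibrationNote", "see instrument handbook"),
]


def describe(subject, triples, known=KNOWN):
    """Collect the properties this reader understands; skip the rest silently."""
    found = {}
    for s, p, o in triples:
        if s == subject and p in known:
            found[p] = o
    return found


print(describe("fits:Exptime", TRIPLES))
```

The unrecognised property does no harm to this reader, and a more specialised application is free to make use of it.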

I'll mention in passing QUDT <http://www.qudt.org> -- a lot of work on being precise about units.

> 2. Use SKOS to describe semantic relationships [3]
> 
>    2.1 Publish SKOS keyword definitions for FITS standard keywords
>        + registered conventions
> 
>    2.2 For each 'data collection', publish a SKOS list to
>        assert which keywords you're using based on other
>        standards.
> 
>        To map between relationships in other schema:
> 
>            use 'skos:exactMatch' ("concepts can be used interchangeably" [4])
>            or, if you're compliant with the definition but adding additional
>            assertions 'skos:broadMatch' 
> 
>            If the concept is the same, but the way the information is
>            recorded is different (eg, different date format), use
>            'skos:closeMatch' ("sufficiently similar" [4])

Technical quibble: the relationship skos:broader, say, isn't a _semantic_ relationship, but a looser grouping relation.

Thus I might declare that

    fits:Telescope skos:narrower fits:MainMirror.

This is perfectly reasonable (if I'm talking about main mirrors I'm also talking about telescopes generically), but it is not asserting a subclass relationship (obviously).  This is one reason why thesauri and ontologies have to be designed slightly differently -- they're doing different jobs.

>        To map relationships *within* a given schema:
> 
>            To identify interchangeable keywords, use 'owl:sameAs';
>            for differences in encoding or units, use 'owl:equivalentClass'
> 
>            use 'skos:Collection' or 'skos:OrderedCollection' to group related
>            keywords (eg, the WCS keywords for a coordinate system).
> 
>    2.3 For enumerations (fields where there is a controlled list
>        of permitted values), create the list using 'skos:Collection'.
> 
>        May need an additional property to relate the keywords to its
>        permitted values.

Indeed.

>    Unresolved Issues : 
> 
>    1.  We need to discuss with members of the semantic community what
>        the implications are of using OWL classes or properties for
>        keywords.  We currently assume that keywords would be classes.

A key point: SKOS concepts are not classes; OWL Classes are.

Thus, consider:

    fitsthes:Telescope a skos:Concept.
    fitsont:Telescope a owl:Class.

fitsthes:Telescope is the name for the abstract concept of 'telescope'; fitsont:Telescope is the name for the class of all telescopes.  A particular telescope may be asserted to be an instance of fitsont:Telescope.

>    2.  This will not correctly identify when groups are using a keyword
>        incorrectly (but think they are using it correctly); may require
>        peer review of the documentation.

A concept can have any properties attached to it -- such as units, ranges, pointers to validation services, ... -- so the concept URL is potentially a very useful place to hang such things on.
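
A minimal sketch of hanging a range check on the concept, assuming the properties have been fetched and stored by concept URL.  The URL, the property names and the limits are all hypothetical.  A check like this catches only out-of-range values, not every misuse of a keyword.

```python
# Sketch only: properties attached to a concept, keyed by concept URL.
# The URL, property names and limits are all hypothetical.
CONCEPTS = {
    "http://example.org/fits#Exptime": {
        "keyword": "EXPTIME",
        "units": "s",
        "valueMin": 0.0,
        "valueMax": 86400.0,
    },
}


def check_header(header, concepts=CONCEPTS):
    """Return a list of complaints about header values outside declared ranges."""
    problems = []
    for url, props in concepts.items():
        key = props["keyword"]
        if key not in header:
            continue
        value = header[key]
        if not props["valueMin"] <= value <= props["valueMax"]:
            problems.append(f"{key}={value} outside range declared at {url}")
    return problems


print(check_header({"EXPTIME": -5.0}))
```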

>    3.  We may need to have validation to make sure that assertions
>        aren't made about ambiguous 'standard' fields, or a way to
>        flag potentially problematic keywords.
> 
>    4.  I am not aware of a way to easily inherit from an entire schema.
>        Eg, can you do :
>            <InstrumentSchema> skos:broadMatch <MissionSchema>
>        ... or do you need to explicitly name each keyword?

No, you can't do that within SKOS.  But it's merely a matter of scripting to generate all of the individual relations.
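
The scripting involved can be very small.  Here is a sketch in Python which emits one Turtle statement per shared keyword; the inst: and mission: prefixes and the keyword lists are hypothetical, and a real script would read the lists from the two schemas.

```python
# Sketch only: the prefixes and keyword lists are hypothetical.
INSTRUMENT_KEYWORDS = ["TELESCOP", "INSTRUME", "DETECTOR", "FILTER"]
MISSION_KEYWORDS = {"TELESCOP", "INSTRUME", "OBSERVER"}


def broad_match_triples(instrument_keys, mission_keys,
                        inst_prefix="inst", mission_prefix="mission"):
    """Yield one Turtle statement per keyword the two schemas share."""
    for key in instrument_keys:
        if key in mission_keys:
            yield f"{inst_prefix}:{key} skos:broadMatch {mission_prefix}:{key} ."


for line in broad_match_triples(INSTRUMENT_KEYWORDS, MISSION_KEYWORDS):
    print(line)
```

Generating the relations explicitly also leaves a per-keyword record which can be reviewed, or corrected by hand where the instrument's usage differs.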

>    5.  In the case of #4, is it better to explicitly state each
>        keyword?
> 
>    6.  Should look into what SKOS / VOTable work has been done, as
>        they may have some conventions for portions of this.

VOTable may not be quite the right layer for this.  There have been some very tentative gestures towards SKOS in the most recent VOTable revision <http://www.ivoa.net/documents/VOTable/20130920/REC-VOTable-1.3-20130920.html#ToC22>.  There's probably plenty of scope for more, but the VOTable editors are (reasonably) very conservative when it comes to extensions to the schema, and would want very clearly elaborated use-case drivers.

Earlier, Joe said:

> ps.  I'm not sure if I should also bring this up on VOTable mailing lists
> ... I would assume that a solution that could document both formats would
> be preferable.

It wouldn't do any harm to mention this there.  I think it's important to have more than one context in mind (such as FITS _and_ VOTable; perhaps HDF5 as well, why not), to ensure that a solution isn't accidentally specific to one format.  The detailed syntax of how one expresses these things within an HDU will be FITS-specific, but that's merely a question of syntax.

Erik Bray said:

> Most of this is "merely" a documentation problem, but I think it's one that can 
> and should be solved, and a standard means of documenting these things would 
> help a lot there.

Yes -- I think this is fundamentally a documentation issue.  But the goal of things like SKOS (and OWL, and RDF in general) is to make it easy to have human-readable documentation _and_ whatever machine-readable documentation is feasible.  That is (one view of) the whole focus of the RDF/SKOS/OWL/Linked-data/Semantic-Web efforts.

Tom Kuiper said:

> I remember seeing something like that a few years ago when I was working 
> on this issue but had forgotten.  As I recall, the suggestion was to 
> provide a URL for a repository with the detailed information.  I had 
> some concerns about that though.  For example, can we count on the 
> website being maintained?  A complete FITS header, at least, will exist 
> as long as the FITS file exists.

This is a very important point.  The answer is: choose your URLs carefully.

URLs at fits.gsfc.nasa.gov aren't going away any time soon, so that's one possibility.

Where are the mission's data products being made available long-term?  That's a good URL for this (human- and machine-readable) documentation to live at.

Should ADS get involved, and be persuaded (possibly through being given... money) to curate such things long-term?

(as Joe said:

> I think the important thing is that we need to be using some sort of persistent URLs to the documentation, rather than just a URL to some PI's website.  It might be possible that we could get AAS as a society to store & host the documentation along with the journal.  It's also possible that the NASA/ADS folks at SAO might consider this to be complementary to their work and agree to maintain either a repository or a redirection service.
) 

On this point, Demitri Muna said:

> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
> 
> Here, the XML namespace is described and documented at the URL provided. You are correct that these documents can't be stored at random ethereal web sites, but instead should be hosted specifically at http://fits.gsfc.nasa.gov, where the format lives (just as w3.org is where the web standards live).

I don't think the documentation necessarily all _has_ to be at fits.gsfc.nasa.gov, as long as proper consideration is given to where the URL is, and that consideration is part of the project's long-term planning and archiving.  This isn't to say it _shouldn't_ be at Goddard, but part of the 'open world' thing mentioned above is that technically things are quite flexible.

Also (still Demitri):

> The URL would be placed in the primary header using a specific keyword (e.g. FITSDICT) and then defined as part of the standard. For existing files, someone who is motivated can create a logic tree to determine which dictionary should be used given the headers present. This tree should be language independent.


RDF/SKOS/OWL has parsers and serializers in lots of languages.
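
As a sketch of how such a 'logic tree' might stay language independent, the rules can be kept as plain data with only a few lines of code to walk them.  FITSDICT is the example keyword name from Demitri's message; the rules and the example.org URLs are hypothetical.

```python
# Sketch only: FITSDICT is the example keyword name from Demitri's message;
# the fallback rules and URLs are hypothetical.
RULES = [
    # (keyword, expected value, dictionary URL), most specific rule first
    ("INSTRUME", "LASCO", "http://example.org/dict/soho-lasco"),
    ("TELESCOP", "SOHO", "http://example.org/dict/soho"),
]


def dictionary_for(header, rules=RULES):
    """Prefer an explicit FITSDICT card; otherwise take the first matching rule."""
    if "FITSDICT" in header:
        return header["FITSDICT"]
    for key, expected, url in rules:
        if header.get(key) == expected:
            return url
    return None


print(dictionary_for({"TELESCOP": "SOHO", "INSTRUME": "LASCO"}))
```

Because the rules are only a table, the same table could be published (in RDF, say) and walked by equivalent code in any language.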

The question of long-term preservation is key, but, without being glib, the answers may be easier than one expects.

And on the same topic Steve Allen, replying to Joe, said:

>> We present what we believe is needed for a machine-actionable
>> external file describing a given collection of FITS files.  We seek
>> comments from data producers, archives, and those writing software to
>> help develop a single, useful, implementable standard.
> 
> This effort has to reach all the way back to encourage the funding of
> the infrastructure that will be necessary for an instrument and
> telescope to supply reliable values for these standardized FITS keywords.

_Very_ true.  However this might be easier now that funders are getting the 'open data' bug, and are requiring projects to include data management and preservation (DMP) plans in bids.  Part of the quid pro quo is that such DMP plans have to be funded; indeed, for some funders, if you don't have funding for your DMP planning, your bid is taken to be implausible/under-costed.

Back to Tom:

> My inclination is to 
> create much more specific keywords.  The reason is that in radio 
> astronomy, at least, technology is now leading towards hardware which is 
> very adaptable to an observer's specific requirements.  The most 
> egregious example I can think of is the CASPER hardware.  We now have 
> ROACH-1 boards with KATADCs for radio astronomy at each of our three DSN 
> stations.

This is another important point.  Creating very specific keywords is useful because it allows software which understands a particular instrument to be given very precise information.  However it has a big and well-known interoperability cost.

But things like OWL (less so SKOS) allow you to say 'this is a KATADC, but that's a type of INSTRUME, so if that's all you, dear software author, know or care about, then go right ahead and interpret this value as one of those'.  That is, principled subclassing.
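
A sketch of that fallback, assuming the subclass links have been extracted into a simple child-to-parent dict.  The ex: class names are hypothetical, and a real OWL hierarchy can have multiple parents, which this ignores.

```python
# Sketch only: subclass links as a child -> parent dict (single inheritance
# for brevity). The class names are hypothetical.
SUBCLASS_OF = {
    "ex:KATADC": "ex:ADC",
    "ex:ADC": "ex:Instrument",
}


def most_specific_known(cls, understood, subclass_of=SUBCLASS_OF):
    """Walk up the subclass chain to the first class the software understands."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls in understood:
            return cls
        seen.add(cls)  # guard against accidental cycles in the data
        cls = subclass_of.get(cls)
    return None


print(most_specific_known("ex:KATADC", {"ex:Instrument"}))
```

Software which knows only about ex:Instrument gets a usable answer, while software which knows about ex:KATADC keeps the precise one.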

----

All the way through this (and yes, I deleted blocks of stuff to try to keep the length down!), I've described things in terms of RDF and SKOS and friends.  That's because I'm very confident that RDF is exactly the solution for this sort of problem -- which is no accident, because that's part of its design.

A problem is that RDF is poorly explained on the web: the Wikipedia article, for example <https://en.wikipedia.org/wiki/Resource_Description_Framework>, is not wrong, or particularly unclear, but it makes RDF look _terribly_ complicated.  Also, the RDF/XML syntax is truly hideous (and makes RDF look a bit like XML); the 'Turtle' syntax is a lot clearer, and the examples I've given above are in that syntax.  Also^2, the history of this stuff -- it's been 'researchy' for a long while -- means that it's still not easy to download something and start hacking around.  Because it's rather abstract, it can seem amorphous, which can obscure the fact that, in particular cases, it's generally very easy to see what's going on.

Despite all of that, it's basically very simple.

Enough!

All the best,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK




