[fitsbits] 'Dataset Identifications' postings (digest)

Thu Mar 25 05:30:24 EST 2004

Let me answer a bunch of messages in one go.

(By the way, if you have not already guessed, the messages tagged "LC's
Nospam ..." are by me as well; it depends on whether I use fitsbits or post
to the newsgroup. In general I use the mailing list for longer messages.)

From: Rob Seaman <seaman at noao.edu>
> Date: Tue, 23 Mar 2004 21:54:40 +0000 (UTC)

> > You will find the current list at:
> > 	http://vo.ads.harvard.edu/dv/facilities.txt
>
> A very interesting list.  ....

Indeed.

> There appears to be a confusion between a ground-based observing site
> and an observatory - perhaps this is a result of the list being compiled
> by our friends in the space-based astronomical community?

Maybe. That is probably why it looked quite natural to me ...
... although I have some reservations precisely about the "satellite"
subset, namely:

  is a Sa:spacecraft enough to define where the dataset is archived ?

  Surely yes if there is a single archive managed by a single space
  agency or its "contractor".

  Possibly yes if the satellite is a cooperation between different
  agencies, AND they have agreed to run the same pipeline AND to keep
  mirror sites

  May fail if different agencies, organizations or institutes decide to
  run different pipelines on the same data ! Resulting in two separate
  datasets stemming from the same raw data.

> In general an observatory is a political entity, a telescope is a facility,
> and a site like Kitt Peak is a piece of real estate that may be host
> multiple facilities from multiple observatories.  Depending on the details
> of contracts or other binding operating agreements, an observatory may
> "own" the data that result from a particular facility like a telescope,

I guess it does not matter at all who owns the data rights for the period
during which the data are not public. If the data are to be indexed, it
means they are public ... either in some official archive or possibly in
some private one.

> A dataset ID can be a relatively simple beast - perhaps as simple as
> a data source ID and a serial number.  But the full taxonomy of dataset
> provenance has to support many degrees of freedom.  At the very least:
>
>     Nation
>     Funding agency

Just for the sake of argument, a "funding agency" is not necessarily
associated with a single nation, at least this side of the Atlantic (ESA,
ESO) ... or of the Panama canal (ESO again :-) ).

>     Observatory
>     Consortium member ("partner")

The latter is hardly relevant to the identification of the dataset

>     Telescope
>     Instrument

these and the above are (loosely) covered by ORIGIN, TELESCOP, INSTRUME,
or other keywords which may be in the same FITS file, or (as said by
others already) in some database at the archive site

>     Date&Time
>     Proposal ID
>     PI and/or project ID

The latter two might be used inside the dataset identifier, or as pointers
to locate the data, internally by the archiving organization. But what is
"inside" is not our business. Similarly the date might be used in the
identifier, again none of our business.

I agree that usually an "observational" (i.e. not "multi-observation")
dataset may be linked to a single date, although the reverse is not
necessarily true. I forgot one case in the examples in my previous
posting, namely the third one below:

 - ground-based observatories typically observe one position of the sky
   with one instrument at one telescope at a time

 - space observatories often observe a position of the sky with SEVERAL
   coaxial instruments/telescopes on the satellite, although with
   different FoV sizes (and for me this is ONE dataset)

 - however sometimes there are non-coaxial instruments. I take the case
   of BeppoSAX, where during each OP (Observing Period) one had 2-3
   different FOTs (datasets) : one for the NFIs (Narrow Field Instruments)
   pointing along the Z axis, and one each for the two WFCs (Wide Field
   Cameras) pointing along +Y and -Y (maybe just one was on). I guess
   RXTE with the ASM has something similar.

-----------------------------------------------------------------------
From: Thomas McGlynn <tam at lheapop.gsfc.nasa.gov>
> Date: Wed, 24 Mar 2004 10:11:37 -0500

> [...] any specific syntax used.  E.g., in FITS today we have keywords
> ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
> the keyword is specified, but the format is completely undefined

Unfortunately some aspects of the semantics are also ill-defined (see the
discussions held at various times). Maybe it would be better to specify
the usage a bit more precisely.

Most details (including some I've raised) are indeed out of scope, though.

We should for instance state that the keyword is a string, and that the
first substring from the beginning to the first slash defines a namespace,
while the rest of the content is defined by the authority managing such
namespace.
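To make the namespace rule concrete, here is a minimal sketch in Python.
The function name split_dataset_id and the sample identifier are
hypothetical, invented only for illustration; nothing here is part of any
agreed convention.

```python
def split_dataset_id(value):
    """Split an identifier string into (namespace, rest) at the first slash.

    Only the namespace is interpreted here; the rest is left untouched,
    since its content is defined by the namespace authority.
    """
    namespace, slash, rest = value.partition("/")
    if not slash or not namespace:
        raise ValueError("no namespace prefix in %r" % value)
    return namespace, rest


# Made-up identifier: later slashes stay inside the opaque part.
print(split_dataset_id("XYZ/archive/obs-0001"))
```

The point of the design is that generic software needs to understand
nothing beyond the first slash.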

We should also indicate the intended usage, which is still not totally
clear to me (see below).

> So I see the discussion about where such a keyword would go,

I.e. in the primary header, in each extension header, or only in some
extension headers.

> whether we need a keyword that allows for multiple values
> (which DS_IDENT would not) as the kind of things we could

Do you mean multiple occurrences of the same keyword (like HISTORY or
COMMENT) or breaking a single long string value in continuation keywords ?

> to be at least an option for the id to be a vector value.  The
> later requirement mandates a shorter keyword (perhaps just DSID).

See below on "vector"

> However, I do not think that this is the appropriate forum
> for discussion of a particular syntax for the value of this keyword.

Except for the above notion of namespace, and for a possibility to define
that it should be a string contained in a SINGLE keyword (that would limit
its length to 68 characters).
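The 68-character figure follows from the layout of an 80-character header
card: 8 columns for the keyword, 2 for "= ", and 2 for the enclosing
quotes. A rough sketch of that arithmetic in Python follows; format_card is
a hypothetical helper written for this posting, not a routine from any
FITS library, and it does not validate the keyword characters.

```python
CARD_LENGTH = 80   # fixed length of a FITS header card
VALUE_START = 10   # columns 1-8 hold the keyword, columns 9-10 hold "= "
MAX_STRING = CARD_LENGTH - VALUE_START - 2   # minus the two quotes


def format_card(keyword, value):
    """Format a string-valued card, refusing values that need continuation."""
    if len(keyword) > 8:
        raise ValueError("keyword longer than 8 characters")
    escaped = value.replace("'", "''")   # a quote inside a string is doubled
    if len(escaped) > MAX_STRING:
        raise ValueError("value does not fit in a single card")
    # Short strings are padded to 8 characters inside the quotes.
    card = "%-8s= '%-8s'" % (keyword, escaped)
    return card.ljust(CARD_LENGTH)


print(MAX_STRING)
print(repr(format_card("DS_IDENT", "XYZ/obs-0001")))
```

Note that the limit applies to the escaped string, so an identifier
containing single quotes has less than 68 usable characters.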

From: Rob Seaman <seaman at noao.edu>
> Date: Wed, 24 Mar 2004 17:22:31 +0000 (UTC)

> It may well be that all astronomical semantic discussions should now
> happen under the happy VO umbrella.  Personally, I think FITS has too
> often skirted the difficult issues.  If we are to debate reserving
> DSIDnnnn for something called "dataset identifiers", isn't it
> appropriate to address what that means?  If not, why do we care if
> an obscure set of keyword names are reserved at all?

That would avoid the loose situation we have for ORIGIN etc.

> > My own read on this part of the discussion is that most people would
> > want to see the ID repeated in all relevant HDU's
>
> Yes.

My personal inclination (as an extremist Ockhamist) is that keywords shall
not be multiplied praeter necessitatem. So I would tend to put one (set
of) keyword(s) in the primary header if they apply to all the file, and to
put it in the extensions only when they differ.
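As a sketch of what this rule would ask of reading software, assume each
header is modelled as a plain Python dict (a simplification; effective_ids
is a hypothetical name). The fall-back to the primary header is exactly the
assumption under discussion, so a reader would have to apply it explicitly.

```python
def effective_ids(primary, extensions, keyword="DS_IDENT"):
    """Return the identifier that applies to each extension.

    An extension's own keyword wins; otherwise the value in the primary
    header is used (None if neither header carries the keyword).
    """
    default = primary.get(keyword)
    return [ext.get(keyword, default) for ext in extensions]


# Made-up values: the second extension differs from the rest of the file.
primary = {"DS_IDENT": "XYZ/obs-0001"}
extensions = [{}, {"DS_IDENT": "XYZ/obs-0002"}, {}]
print(effective_ids(primary, extensions))
```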

> > and that there probably needs to be at least an option for the id to
> > be a vector value.
>
> If by vector, you mean repeated keywords from the same or different ID
> families, I agree.  IDs are long strings.  Won't fit many in 80 chars.

It would also be possible to impose a syntax limitation that each
identifier must fit in the space of a single kwd (68 characters,
excluding the DS_IDENT= '...' around it).

If the given file (or HDU) "belongs" (or "refers" ? see below) to more
than one dataset at the same time and with equal rank, one could allow for
repeated DS_IDENT kwds (like COMMENT, HISTORY).

However one may need a sequence of DSIDnn if either :

 - the file "belongs" or "refers" to different datasets with some
   priority or ranking order

 - one wants to keep track of a history, i.e. this file belongs to
   the dataset I reduced (DSID01), I started my reduction from the
   result of the pipeline provided by the xyz archive centre (DSID02),
   which used the raw data of the given observation taken with the uvw
   telescope/satellite (DSID03)
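The history case can be sketched as follows, again with headers modelled as
plain dicts. The two-digit DSIDnn numbering starting at 01 is taken from
the example above; the function names and the identifiers are made up.

```python
import re


def history_keywords(chain):
    """Number a provenance chain, most recent dataset first, as DSIDnn."""
    if len(chain) > 99:
        raise ValueError("two digits allow at most 99 entries")
    return [("DSID%02d" % n, ident) for n, ident in enumerate(chain, start=1)]


def read_history(header):
    """Recover the chain from a header, in DSIDnn order."""
    numbered = []
    for key, value in header.items():
        match = re.fullmatch(r"DSID(\d\d)", key)
        if match:
            numbered.append((int(match.group(1)), value))
    return [value for _, value in sorted(numbered)]


# My reduction, then the archive pipeline output, then the raw observation.
chain = ["LC/reduced-001", "XYZ/pipeline-42", "UVW/raw-7"]
for keyword, ident in history_keywords(chain):
    print(keyword, ident)
```

Unlike repeated DS_IDENT kwds, the numbering keeps the order even if the
header is rewritten by software that reorders keywords.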

> > Why should an ID have the time?
>
> Astronomers have too often relied on convoluted filenames to convey the
> placement of a specific data file within some multidimensional parameter
> space.  Time is key to groundbased observations because access to our

Also for satellites. Time is relevant because it's related to scheduling.
But that does not mean it does (or does not) have to be part of the id.
See above. None of our business.

> > Why does it need a proposal ID, nation, agency?
>
> Our need for a dataset identifier is precisely to implement the
> proprietary policies of our current organization.  I am very supportive

The identifier will just say "go to this site to retrieve the dataset, if
possible". It's then up to the site to say "this dataset is not yet
public", to protect it with a password, or whatever.

From: Arnold Rots <arots at head.cfa.harvard.edu>
> Date: Wed, 24 Mar 2004 15:47:57 -0500 (EST)

> The scope of Tom's proposal is really quite limited:
>
> He is announcing the establishment of a convention that employs
> a keyword (DS_IDENT) or set of keywords (DS_IDiii).
> The intent is that the value of that keyword contains a label or key
> that will allow users to obtain a pointer to a particular volume in
> astronomical data space.  No less, but also no more.

  just a little bit more

> Within the space of data identifier strings only the subspace of
> strings starting with "ADS/" (case-insensitive!) is reserved.

I believe you should reserve also the fact that the first part of the id
is the namespace, and delegate all the rest to the namespace authority.
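In code, the reservation plus the delegation could look like the sketch
below. The name is_reserved is hypothetical; the only facts used are the
"ADS/" prefix and its case-insensitivity, as stated in the quoted text.

```python
RESERVED_NAMESPACE = "ADS"


def is_reserved(ident):
    """True if the identifier lies in the reserved "ADS/" subspace.

    The comparison ignores case and requires the slash, so a namespace
    that merely begins with the letters ADS is not caught.
    """
    namespace, slash, _ = ident.partition("/")
    return bool(slash) and namespace.upper() == RESERVED_NAMESPACE


print(is_reserved("ads/anything"), is_reserved("ADSX/anything"))
```

Everything after the slash is left to the authority owning the namespace.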

Maybe one should also add another kwd (DSAUTHOR) which points to a URL
of the namespace authority.

Or are we imagining something like the DNS with a set of "root
nameservers" ?

> and purposes.  For the Chandra Data Archive what you will get in
> response to the key is a URL that will allow you to request a download
> of data products associated with a particular observation - or maybe a
> set of observations.  If you try again next month, the files may be
> different: we may have reprocessed or decided to add some products to
> the package.

Hmmm ... I'm a bit worried by the fact that the dataset may change. Maybe
that's why it is not yet so clear to me what use a user will make of the
dataset identifier. Let me give some examples.

a) I read a paper, which tells me "the data used here belong to dataset
   xyz". I want to repeat the analysis of the SAME data myself, so I
   use the id to retrieve the data. Obviously here I want to get the
   SAME data, not a further and better version (do I ?).

   No FITS file is involved here on the user's end, though.

b) I retrieve the files, and I want to check they really belong to the
   correct dataset.

c) I have somehow obtained some files, and I want to know which
   observation they refer to, or to retrieve more files of the same
   dataset, or to find what papers have been published using them.

d) I do my analysis and produce some more files. These are private, but
   I may want to document that the starting point of the analysis was
   the given dataset. But DS_IDENT is not the right way: my data DO NOT
   belong to the dataset, so I need a separate history kwd ...

   ... if I ever distribute the data (I suppose I also have to quote
   the DS_IDENT in any paper I write, for the ADS to use it)

> Again, think of the dataset identifier as a key that allows the user
> to obtain a pointer to the dataset.  There is no need to encode any
> information in it - nor is that prohibited

Agreed

> The list of informational metadata that Rob provided looks to me more
> like metadata that ought to reside in a database.

(or in other keywords in the same file if desired)

-- 
----------------------------------------------------------------------
nospam at mi.iasf.cnr.it is a newsreading account used by several people to
avoid unwanted spam. Any mail sent to this address will be rejected.
Users can disclose their e-mail address in the article if they so wish.