[fitsbits] 'Dataset Identifications' postings (digest)

Wed Mar 24 10:11:37 EST 2004

I think there are two different things that are getting confused
in this E-mail discussion.  They are closely related, but I think
one is possible to address here, while the other requires a much
broader venue than this list can provide.

When I initiated this discussion I was asking if it would make
sense to reserve a keyword in FITS that would be used to specify
an identification of the datasets to which the file or HDU belonged.
While there is currently a specific format for that identification
being considered by some of us, I don't believe it is necessary
to tie the question of whether we define such a keyword
to any specific syntax.  E.g., in FITS today we have keywords
ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
the keyword is specified, but the format is completely undefined
(other than that it is a string). It is at that level that I believe
we could agree on using DS_IDENT (or any other value or values).

So I see the discussion about where such a keyword would go, and
whether we need a keyword that allows for multiple values
(which DS_IDENT would not), as the kind of thing we could
hope to hash out in a discussion here.  My own read on this
part of the discussion is that most people would want to see the
ID repeated in all relevant HDU's and that there probably needs
to be at least an option for the id to be a vector value.  The
latter requirement mandates a shorter keyword (perhaps just DSID).
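
The length argument can be made concrete.  FITS keyword names are limited
to eight characters, so DS_IDENT leaves no room for an index while DSID
leaves four.  A minimal sketch, assuming a DSIDn numbering style (the
helper names here are made up for illustration, not part of any proposal):

```python
FITS_KEYWORD_MAX = 8  # FITS keyword names are at most 8 characters


def max_index_digits(base):
    """Number of characters left for a numeric index after the base name."""
    return FITS_KEYWORD_MAX - len(base)


def indexed_keyword(base, index):
    """Return base + index as a keyword name, or raise if it is too long."""
    name = f"{base}{index}"
    if len(name) > FITS_KEYWORD_MAX:
        raise ValueError(f"{name!r} is longer than {FITS_KEYWORD_MAX} characters")
    return name


print(max_index_digits("DS_IDENT"))  # no room left for an index
print(max_index_digits("DSID"))      # room for a four-digit index
print(indexed_keyword("DSID", 12))
```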

However, I do not think that this is the appropriate forum
for discussion of a particular syntax for the value of this keyword.
I just don't think we can muster the kind of representation from
the scientific community that would be needed.  While the ADEC hopes
that our IDs will be useful and that others will adopt them, we
have no power to force such a change -- though the astronomy
journals may have a bit broader influence.  So if, for example,
NOAO were to adopt a different syntax and style for the dataset IDs, for
good and sufficient reasons of their own, then they could use the same
keyword or keywords and go ahead on their own.  It would be desirable
in this case if it were possible to distinguish the different syntaxes
used.  Regardless, I think it would be better to have a standard place to look
for the IDs than for software to have to search through a list of
keywords to see if there was an ADSID or ADECID or NOAOID or NRAOID or CDSID
or ....  The standard keyword[s] would say where to look and with only
a minimal level of collaboration we could make sure our different syntaxes
didn't interfere with one another.  If a new institution decided to create
some new id schema they would know where to put it, and I think the chance
that existing software could find and use that ID would be much greater.
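
As a rough sketch of what "a standard place to look" buys software:
assuming the reserved keywords were DSID and DSIDn (only one of the
spellings under discussion), a reader needs a single pattern rather than
a growing list of institution-specific names.  The header cards and ID
values below are invented for the example.

```python
import re

# Assumption for illustration: IDs live in DSID or DSIDn (n up to 4 digits).
DSID_PATTERN = re.compile(r"DSID\d{0,4}")


def find_dataset_ids(cards):
    """Collect the values of DSID / DSIDn keywords from (keyword, value) pairs."""
    return [value for keyword, value in cards if DSID_PATTERN.fullmatch(keyword)]


# Hypothetical header cards; the ID strings are placeholders, not real IDs.
cards = [
    ("TELESCOP", "example telescope"),
    ("DSID1", "ads/example-0001"),
    ("DSID2", "noao/example-0002"),
    ("OBSERVER", "unknown"),
]
print(find_dataset_ids(cards))
```

Software written this way would pick up an ID from a new institution
without modification, whatever syntax the value itself uses.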

That said, I'm not really disagreeing with Bob that discussion of the syntax
of the IDs is necessary.  All I'm saying is that I don't think we can
bring that discussion to a conclusion here.

It's easy enough to continue though, and I've added a couple
of more specific comments below. (:

		Tom

Rob Seaman wrote:

 > recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
 > (I imagine a thousand ADS dataset identifiers are sufficient for a
 > particular FITS HDU - are they?)

The basic idea of the IDs as they have been conceived of by the ADEC
is that they allow the establishment of individual namespaces.  So, if for
example NOAO doesn't like the naming scheme that is used, it would
be straightforward to create a set of noao/... ids that conformed
to what would be appropriate for your datasets.
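
A minimal sketch of how software might tell namespaces apart, assuming
purely for illustration that the namespace is whatever precedes the
first "/" in the ID (the actual syntax is not settled, and the function
name is hypothetical):

```python
def split_namespace(dataset_id):
    """Split 'namespace/rest' into (namespace, rest).

    Returns (None, dataset_id) when the ID carries no namespace prefix.
    """
    namespace, sep, rest = dataset_id.partition("/")
    if not sep:
        return None, dataset_id
    return namespace.lower(), rest


print(split_namespace("noao/example-0002"))
print(split_namespace("no-namespace-here"))
```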

 > A very interesting list.  Might I suggest that this list be itself
 > scrubbed and extended as part of this process?  There is a lot of
 > confusion about the organizations contained on the list.  For instance,
 > here are the overtly NOAO related entries:
 >
 >     KPNO.12m        Kitt Peak National Observatory/12 meter Telescope
 >     KPNO.2.1m       Kitt Peak National Observatory/2.1 meter Telescope
 >     KPNO.BT         Kitt Peak National Observatory/Bok Telescope
 >     KPNO.MAYALL     Kitt Peak National Observatory/Mayall Telescope
 >     KPNO.MDMHT      Kitt Peak National Observatory/MDM Hiltner Telescope
 >     KPNO.MDMMH      Kitt Peak National Observatory/MDM McGraw-Hill Telescope
 >     KPNO.MPT        Kitt Peak National Observatory/McMath-Pierce Telescope
 >     KPNO.SARA       Kitt Peak National Observatory/Southeastern Association
 >                          for Research in Astronomy Telescope
 >     KPNO.SWT        Kitt Peak National Observatory/Space Watch Telescope
 >     KPNO.WIYN       Kitt Peak National Observatory/WIYN,
 >                          Wisconsin-Indiana-Yale-NOAO Telescope
 >
 >     CTIO.1.5m       Cerro Tololo Inter-American Observatory/1.5 meter Telescope
 >     CTIO.2MASS      Cerro Tololo Inter-American Observatory/2MASS Telescope
 >     CTIO.VBT        Cerro Tololo Inter-American Observatory/Victor Blanco
 >                           Telescope
 >     CTIO.YALO       Cerro Tololo Inter-American Observatory/YALO,
 >                           Yale-AURA-Lisbon-OU Telescope
 >

The syntax that was suggested was observatoryLocation.telescope,
as the way of identifying datasets that would be most
straightforward for users.  This list was suggested by
someone at ApJ as I recall. There has been some discussion
about whether and how these should be tied to organizations.
One concern with organizational ties is that these ID's are
intended to be permanent.  So 50 years from now it may be irrelevant to
users that a particular telescope was for a time run by a given
organization, and it's certainly possible that control of a telescope
(and its data) will shift from one organization
to another over the course of its lifetime.  In the NASA world, that's actually
quite normal.

 > First, note that the "National Optical Astronomy Observatory" is not
 > mentioned yet NOAO is likely the legal owner of many data products
 > resulting from some of these facilities.
 >
 > Second, note:
 >
 >     1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
 >     the telescope),
 >     2) that data from KPNO.BT and KPNO.SWT is owned by the University
 >     of Arizona (or perhaps the state of Arizona),
 >     3) that data from KPNO.MPT is owned by the National Solar Observatory,
 >     4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
 >     MDM during the epoch of the observations in question,
 >     5) that data from KPNO.SARA is owned by the SARA consortium,
 >     6) that data from KPNO.WIYN is owned by the WIYN consortium, one
 >     member of which is NOAO,
 >     7) that there are two 2MASS telescopes and only one is at CTIO
 >     8) that CTIO.YALO was run by the - you guessed it - YALO consortium
 >     and has since ceased operations
 >
Right, our thought is that organizations will register as
responsible for particular dataset holdings.  So, e.g., the YALO consortium would
have registered as responsible for that holding and, when it ceased
operations, whoever inherited responsibility for the holding (if anyone)
could register as the responsible party.  Thus the granularity of the
dataset holdings needs to be small enough that a single party is likely to be
responsible for each.
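
To illustrate the idea (not any actual ADEC mechanism), here is a small
sketch in which the holding name stays fixed while the responsible party
can be re-registered.  The class, its methods, and the successor name
are all hypothetical.

```python
class HoldingRegistry:
    """Maps a permanent holding name to whoever is currently responsible."""

    def __init__(self):
        self._responsible = {}

    def register(self, holding, organization):
        """Record (or replace) the party responsible for a holding."""
        self._responsible[holding] = organization

    def responsible_for(self, holding):
        """Return the current responsible party, or None if nobody registered."""
        return self._responsible.get(holding)


registry = HoldingRegistry()
registry.register("CTIO.YALO", "YALO consortium")
# Later a (hypothetical) successor takes over; the holding name is unchanged.
registry.register("CTIO.YALO", "successor organization")
print(registry.responsible_for("CTIO.YALO"))
```

IDs built on the holding name remain valid across the handover; only
the registry entry changes.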

 > It is quite likely that I got some of those nuances wrong myself :-)
 >
 > There appears to be a confusion between a ground-based observing site
 > and an observatory - perhaps this is a result of the list being compiled
 > by our friends in the space-based astronomical community?
 >
No...  As I mentioned above, we didn't do this.  If we had, we surely wouldn't
have lumped all space observatories together!  It may be that rather
than KPNO and CTIO they should be KP and CT.  That certainly seems
reasonable to me.  I don't think this list is set in concrete
or even particularly old jello.

 > In general an observatory is a political entity, a telescope is a facility,
 > and a site like Kitt Peak is a piece of real estate that may host
 > multiple facilities from multiple observatories.  Depending on the details
 > of contracts or other binding operating agreements, an observatory may
 > "own" the data that result from a particular facility like a telescope,
 > instrument, archive or pipeline - or that ownership may devolve to a
 > specific member of some consortium.  In many cases, one imagines that
 > a funding agency or government or perhaps even the "people of the United
 > States of America" may ultimately own a particular data product.
 >
 > So, an example.  NOAO operates twin 8Kx8K mosaic wide field imagers
 > at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
 > Depending on the phase of the moon (quite literally :-) the resulting
 > data may be owned by NOAO or by some instrumentalities associated with
 > the University of Wisconsin, Indiana University, Yale University and
 > in the near future perhaps the University of Maryland.  Confounded with
 > this question of ownership is the issue of proprietary rights.  Time
 > on NOAO facilities is awarded competitively and the successful PIs are
 > rewarded with sole access for some period (typically 18 months).
 >

All of these issues are certainly complex, but in some sense they
are irrelevant.  Either the organizations can work out some
agreements about how data are named that can be put into
a dataset id, or they can't and it won't happen.   I don't
think we need to solve every problem to have a useful
capability.

 > A dataset ID can be a relatively simple beast - perhaps as simple as
 > a data source ID and a serial number.  But the full taxonomy of dataset
 > provenance has to support many degrees of freedom.  At the very least:
 >
 >     Nation
 >     Funding agency
 >     Observatory
 >     Consortium member ("partner")
 >     Telescope
 >     Instrument
 >     Date&Time
 >     Proposal ID
 >     PI and/or project ID
 >     ...
 >

Here I think you are confusing the metadata describing an observation
with the 'name' of an observation.  Why should an ID have the time?
One might choose to use the time in the ID.  But there is no reason
why it has to be done that way.  Why does it need a proposal ID, nation, agency?
Again you can choose to put them there, but I see no reason why the
general ID specification needs to require them.  We are not trying
to use the ID as a way of encapsulating the description of the
dataset, just a way to point to it.

 > The more I listen to myself talk, the more I convince (myself, anyway :-)
 > that a single DS_IDENT keyword is a very poor match to the underlying
 > requirements.  Not only might a single file belong to multiple datasets
 > certified by a particular entity (like ADS), but they may belong to
 > multiple other datasets certified by multiple other entities - and more
 > to the point, the design of the certification process will vary from one
 > to the next to the next.
 >
 > In particular, the NOAO Science Archive has been discussing the precise
 > questions of ownership and proprietary access and had already selected
 > a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
 > Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
 > ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
 > (most similar to the ADS scheme) the NOAO Proposal ID spanning all these
 > facilities.  Whatever we settle on will never fit within the confines of
 > any single keyword.  On the other hand, I'd love to *also* include an
 > ADSID tag to even further constrain the provenance.
 >

Agreeing on metadata fields is great, but I think it's
largely orthogonal to the question of whether we want a dataset id somewhere,
as indeed your last comment suggests.