[fitsbits] 'Dataset Identifications' postings (digest)

Rob Seaman seaman at noao.edu
Wed Mar 24 12:22:31 EST 2004


Tom McGlynn writes:

> I think there are two different things that are getting confused
> in this E-mail discussion.

That's precisely my point.  Perhaps you can first clarify whether you
and Arnold are talking about the same requirements and resulting proposal.
If I understood the discussion of ADS identifiers, these supply a very
rich namespace with "multi-mission" support.  More to the point, the ADS
identifiers benefit from network externality - the more headers contain
them from more projects at more institutions, the greater the value
of the identifiers to the community as a whole.

> However, I do not think that this is the appropriate forum for
> discussion of a particular syntax for the value of this keyword.

It may well be that all astronomical semantic discussions should now
happen under the happy VO umbrella.  Personally, I think FITS has too
often skirted the difficult issues.  If we are to debate reserving
DSIDnnnn for something called "dataset identifiers", isn't it
appropriate to address what that means?  If not, why do we care if
an obscure set of keyword names is reserved at all?

> My own read on this part of the discussion is that most people would
> want to see the ID repeated in all relevant HDU's

Yes.

> and that there probably needs to be at least an option for the id to
> be a vector value.  The latter requirement mandates a shorter keyword
> (perhaps just DSID).

If by vector, you mean repeated keywords from the same or different ID
families, I agree.  IDs are long strings; not many will fit in 80 chars.
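
To put a rough number on that: a FITS header card is 80 characters, of
which the keyword name takes 8, the "= " value indicator takes 2, and the
enclosing quotes take 2, leaving at most 68 characters of string.  A
minimal sketch in Python (used only for illustration; the ID lengths and
the one-character separator are assumptions, not part of any proposal):

```python
CARD_LENGTH = 80       # one FITS header card
KEYWORD_FIELD = 8      # columns 1-8 hold the keyword name
VALUE_INDICATOR = 2    # "= " in columns 9-10
QUOTES = 2             # opening and closing single quote


def max_string_value_length():
    """Longest string value that fits on a single card."""
    return CARD_LENGTH - KEYWORD_FIELD - VALUE_INDICATOR - QUOTES


def ids_per_card(id_length, separator_length=1):
    """How many IDs of a given (assumed) length fit in one string value.

    n IDs need n * id_length + (n - 1) * separator_length characters.
    """
    room = max_string_value_length()
    return (room + separator_length) // (id_length + separator_length)


print(max_string_value_length())  # 68
print(ids_per_card(30))           # 2
print(ids_per_card(40))           # 1
```

So an ID of 35 characters or more gets a card to itself, before any
allowance for a trailing comment.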

> However, I do not think that this is the appropriate forum for
> discussion of a particular syntax for the value of this keyword.

I think you must mean semantics, not syntax.  We can't very well express
an opinion on the contents of a keyword whose legal values aren't discussed.
I thought Arnold did a fine job of starting to lay down the ground rules.
There isn't much point to the mechanism if all the proposal states is
"any string value".

> While the ADEC hopes that our IDs will be useful and that others will
> adopt them, we have no power to force such a change -- though the
> astronomy journals may have a bit broader influence.

The FITS standards process is precisely the way to encourage conforming
usage.  Arnold's message described a mechanism that sounded very
useful.  NOAO is actively (very actively) pursuing a rich archive
facility for our large variety of high value astronomical data.  Any
mechanism that can leverage the value of our data will be gratefully
adopted.  If there are multiple astronomical naming conventions, we
may well support more than one.  Why shouldn't these separate IDs
with separate semantics resulting from the separate constraints of
separate requirements be hosted in separate keywords?

> So if, for example, NOAO were to adopt a different syntax and style
> for the dataset IDs, for good and sufficient reasons of their own, then
> they could use the same keyword or keywords and go ahead on their own.

There is an assumption here that the simple keyword being proposed will
successfully map onto an entirely different ID model.  I suspect the
NOAO IDs (that do need to include the details that Tom seems to find
unpersuasive) will require several keywords.  We'll likely populate a
large number of NSAxxxxx keywords with all sorts of info.  Not all
FITS keyword usage has to be explicitly covered under the standard
(although a community wide keyword dictionary would be gratefully
received).

> It would be desirable in this case if it was possible to distinguish
> the different syntaxes used.  Regardless I think it would be better
> to have a standard place to look for the IDs than for software to
> have to look for a list of keywords and see if there was ADSID or
> ADECID or NOAOID or NRAOID or CDSID or ...

So, you're basically suggesting that software loop over all DSIDnnnn
to locate all the dataset identifiers.  This may be a useful feature.
On the other hand, I haven't come up with a reason that I would need
to look for an identifier whose namespace I wasn't already interested
in.  A simple keyword query will return ADSID (posited to be of general
community wide interest) and a second query will return NOAOID (of
specific interest only to NOAO staff and users).  My software can
then generate a report or whatever that ties the two together.  Give
me a use case for needing to retrieve a long list of opaque identifiers
related to projects completely outside my bailiwick.

Meanwhile, the DSIDnnnn scheme will require a potentially very expensive
traversal of several keywords in every header being considered.  Imagine
mapping your header keywords to DB schema.  Isn't your DB simply going
to contain a column (perhaps a vector) named ADS_ID? Our DB will certainly
contain a column named something like NOAO_ID.  What is the value in
piling up a bunch of unrelated information under the same keyword
heading?
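
The mapping I have in mind might look something like the following
sketch.  It is Python with the standard sqlite3 module purely for
illustration; the table layout, the column names, and the placeholder
ID strings are all assumptions, and headers are modeled as plain dicts
rather than read from real FITS files:

```python
import sqlite3


def load_headers(conn, headers):
    """Store one row per file, one column per named ID keyword."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS header ("
        "filename TEXT PRIMARY KEY, ads_id TEXT, noao_id TEXT)"
    )
    for filename, cards in headers.items():
        # The keyword name alone says which column the value belongs in.
        conn.execute(
            "INSERT INTO header VALUES (?, ?, ?)",
            (filename, cards.get("ADSID"), cards.get("NOAOID")),
        )


# Hypothetical headers; the ID strings are placeholders, not real IDs.
headers = {
    "a.fits": {"ADSID": "<ADS ID string A>", "NOAOID": "<NOAO ID string A>"},
    "b.fits": {"NOAOID": "<NOAO ID string B>"},
}

conn = sqlite3.connect(":memory:")
load_headers(conn, headers)
rows = conn.execute(
    "SELECT filename FROM header WHERE ads_id IS NOT NULL"
).fetchall()
print(rows)  # [('a.fits',)]
```

With a shared DSIDnnnn keyword, the loader would first have to inspect
each value to decide which column it belongs in.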

The whole notion of keyword=value pairs is that the keyword identity
supplies some of the information.  When I want the date of an
observation, I query DATE-OBS, not a list of DATEnnnn keywords,
searching for a string matching "OBS/20040324T170325Z".  In effect,
the DSIDnnnn scheme asserts that our users won't find any direct use
for the dataset identifiers, otherwise we wouldn't make it so hard for
them to get at them.  Instead of:

    cl> hselect *.fits adsid yes
    "<this is an ADS ID string>"

they would have to do something like:

    cl> for (i=1; i<=3; i+=1) {
    >>> hselect ("*.fits", "dsid000"//i, yes) | match ADS
    >>> }
    "<this is an ADS ID string>"

I'm sure you can see the usage issues immediately.  Here are just two.
What about DSIDnnnn values that contain a substring matching one of the
supported naming authorities (or one that is added in the future)?
How is the search truncated when you don't know that there are exactly
three keywords to start?  Sure, a programmer can work around each of
these - but we add keywords for the benefit of our unsophisticated
users, too :-)
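
Both problems can be reproduced in a few lines.  This is a sketch in
Python, not CL, with a header modeled as a plain dict; the "ADS/" and
"NOAO/" prefixes and the ID strings are invented for the example, since
no value syntax has been agreed:

```python
def find_by_substring(header, authority):
    """Scan DSID0001, DSID0002, ... and keep values containing `authority`.

    This deliberately mirrors the naive loop-and-match approach: it
    matches on a bare substring and stops at the first missing keyword.
    """
    hits = []
    n = 1
    while True:
        key = "DSID%04d" % n
        if key not in header:
            break  # no way to know whether more keywords follow a gap
        if authority in header[key]:
            hits.append(header[key])
        n += 1
    return hits


# Hypothetical header: the second ID is not an ADS ID but happens to
# contain "ADS", and DSID0003 is absent.
header = {
    "DSID0001": "ADS/<hypothetical id>",
    "DSID0002": "NOAO/<hypothetical id for the ROADS survey>",
    "DSID0004": "ADS/<hypothetical id after a gap>",
}

hits = find_by_substring(header, "ADS")
print(len(hits))  # 2: one real match, one false match; DSID0004 is missed
```

Anchoring the match to a prefix would cure the false hit only if the
value syntax guaranteed such a prefix, which is exactly the part of the
proposal that has not been specified.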

This also raises the question of identifiers for individual FITS HDUs.
A particular FITS file or HDU may belong to multiple datasets.  A
particular HDU has a single identity, however.  Shouldn't part of
this discussion include how to supply a community wide identifier
for each separate FITS object?  Imagine starting with a dataset ID.
Doesn't that set ID have to coexist with some mechanism for referencing
all of its many members?
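
As a rough sketch of what that coexistence implies (Python, for
illustration only; the class, its method names, and the placeholder IDs
are hypothetical), the relationship is many-to-many and has to be
navigable in both directions:

```python
from collections import defaultdict


class Membership:
    """Many-to-many link between dataset IDs and per-HDU IDs."""

    def __init__(self):
        self._members = defaultdict(set)   # dataset ID -> HDU IDs
        self._datasets = defaultdict(set)  # HDU ID -> dataset IDs

    def add(self, dataset_id, hdu_id):
        """Record that one HDU belongs to one dataset."""
        self._members[dataset_id].add(hdu_id)
        self._datasets[hdu_id].add(dataset_id)

    def members_of(self, dataset_id):
        """All HDUs in a dataset, starting from the set ID."""
        return sorted(self._members.get(dataset_id, ()))

    def datasets_of(self, hdu_id):
        """All datasets a single HDU belongs to."""
        return sorted(self._datasets.get(hdu_id, ()))


links = Membership()
links.add("<dataset ID 1>", "<HDU ID a>")
links.add("<dataset ID 1>", "<HDU ID b>")
links.add("<dataset ID 2>", "<HDU ID a>")

print(links.members_of("<dataset ID 1>"))  # ['<HDU ID a>', '<HDU ID b>']
print(links.datasets_of("<HDU ID a>"))     # ['<dataset ID 1>', '<dataset ID 2>']
```

The header keywords can carry the second direction (an HDU listing its
datasets); the first direction needs an index somewhere outside any one
file.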

> Here I think you are confusing the metadata describing an observation
> with the 'name' of an observation.

Ah!  To return to Lucio's contribution:

>> My personal tendency (but I'm an end user and not an archive maintainer
>> in this context) would have been to put part of the information in
>> directory names and not in file names (e.g. for my own BeppoSAX
>> analysis I used to store files as [A']/[B']/[C]/[D].type,

A familiar issue is how to tie an archive's data stores together with
its metadata DB.  NOAO is specifically considering precisely this
directory tree structure for our raw data store and also how to tie
it into the resulting headers/DB.

> Why should an ID have the time?

Astronomers have too often relied on convoluted filenames to convey the
placement of a specific data file within some multidimensional parameter
space.  Time is key to groundbased observations because access to our
telescopes (and the resulting proprietary ties that bind) is distributed
via the calendar and clock.

> One might choose to use the time in the ID.  But there is no reason
> why it has to be done that way.

This is precisely why different dataset IDs might require very different
FITS support.  IDs generated for tying publications to data, for instance,
are likely going to be very different than IDs generated for tying data
objects to telescopes or archives.

> Why does it need a proposal ID, nation, agency?

Our need for a dataset identifier is precisely to implement the
proprietary policies of our current organization.  I am very supportive
of taking the very long term view of provenance.  Over the very long
term, perhaps the fact that an entity known as NOAO used to own the
data may no longer matter.  Perhaps a particular national observatory
will no longer exist because a particular nation will no longer exist :-)
(In his salad days, a college professor of mine set up the Iranian
National Observatory before the Shah fell...)  In the long term we're
all dead :-)

However, NOAO's current need is precisely to consider who owns what and
who may have access to data and precisely when.  Your mileage may vary -
which is what tells me that an ADS ID scheme should be placed in an
ADS branded keyword.  It isn't that I have an issue with ADS dataset
IDs - far from it.  I have an issue with a single style of dataset ID
co-opting the entire notion of placing data within sets.

Rob


