[fitsbits] 'Dataset Identifications' postings (digest)

Wed Mar 24 14:41:49 EST 2004

Rob Seaman wrote:

> Tom McGlynn writes:
> 
> 
>>I think there are two different things that are getting confused
>>in this E-mail discussion.
> 
> 
> That's precisely my point.
> ...
> If we are to debate reserving
> DSIDnnnn for something called "dataset identifiers", isn't it
> appropriate to address what that means?  If not, why do we care if
> an obscure set of keyword names are reserved at all?
>

Maybe you don't...  The FITS standard doesn't discuss
what observer or origin means other than in the broadest
terms.  In the context of this newsgroup I don't
think it is possible to get agreement beyond that.  As far
as whether it is possible to have a useful discussion without
including the syntax, I'd suggest that understanding where
the ID goes and whether it is a scalar or vector value are
important issues where there has been substantial discussion.

> The FITS standards process is precisely the way to encourage conforming
> usage.  Arnold's message described a mechanism that sounded very
> useful.  NOAO is actively (very actively) pursuing a rich archive
> facility for our large variety of high value astronomical data.  Any
> mechanism that can leverage the value of our data will be gratefully
> adopted.  If there are multiple astronomical naming conventions, we
> may well support more than one.  Why shouldn't these separate IDs
> with separate semantics resulting from the separate constraints of
> separate requirements be hosted in separate keywords?
>
> 
>>So if, for example, NOAO were to adopt a different syntax and style
>>for the dataset IDs, for good and sufficient reasons of their own, then
>>they could use the same keyword or keywords and go ahead on their own.
> 
> 
> There is an assumption here that the simple keyword being proposed will
> successfully map onto an entirely different ID model.  I suspect the
> NOAO IDs (that do need to include the details that Tom seems to find
> unpersuasive) will require several keywords.  We'll likely populate a
> large number of NSAxxxxx keywords with all sorts of info.  Not all
> FITS keyword usage has to be explicitly covered under the standard
> (although a community wide keyword dictionary would be gratefully
> received).
> 
> 
>>It would be desirable in this case if it was possible to distinguish
>>the different syntaxes used.  Regardless I think it would be better
>>to have a standard place to look for the IDs than for software to
>>have to look for a list of keywords and  see if there was ADSID or
>>ADECID or NOAOID or NRAOID or CDSID or ...
> 
> 
> So, you're basically suggesting that software loop over all DSIDnnnn
> to locate all the dataset identifiers.  This may be a useful feature.
> On the other hand, I haven't come up with a reason that I would need
> to look for an identifier whose namespace I wasn't already interested
> in.

You might not.  But suppose someone builds a general service
that transforms IDs (from any origin) into pointers.  Users might then build
clients of this service that use the links that are returned.  However,
if they don't know where the ID information is stored in the FITS header,
they have to pass the entire headers of each extension in the file for the
remote service to parse.  Going the other direction, when users ingest a
set of FITS files from heterogeneous sources, they may well want to extract
the dataset ID as they ingest the files.  They don't want to have to
update software to check for new keywords every time a new authority comes online.
It would be a lot easier if there is a nominal location for the dataset ID.
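
To illustrate the "nominal location" idea, here is a minimal Python sketch of
the ingest-side lookup.  It assumes the header has already been parsed into a
plain dict of keyword/value pairs, and that the IDs sit in consecutively
numbered DSID0001, DSID0002, ... keywords.  The four-digit, gap-free numbering
and the function name collect_dataset_ids are assumptions made for the
example, not part of any agreed convention.

```python
def collect_dataset_ids(header, prefix="DSID", max_ids=9999):
    """Return the values of DSID0001, DSID0002, ... in order.

    `header` is a plain dict mapping keyword names to values (an assumed
    stand-in for a parsed FITS header).  The scan stops at the first
    missing index, so the caller need not know the count in advance.
    """
    ids = []
    for n in range(1, max_ids + 1):
        key = f"{prefix}{n:04d}"  # e.g. DSID0001 (assumed numbering scheme)
        if key not in header:
            break
        ids.append(header[key])
    return ids


# Hypothetical header contents, for illustration only.
header = {
    "SIMPLE": True,
    "DSID0001": "<this is an ADS ID string>",
    "DSID0002": "<this is an NOAO ID string>",
    "OBSERVER": "someone",
}
print(collect_dataset_ids(header))
```

Under these assumptions the ingest code never names an individual authority,
so a new authority coming online requires no change here.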

>  A simple keyword query will return ADSID (posited to be of general
> community wide interest) and a second query will return NOAOID (of
> specific interest only to NOAO staff and users).  My software can
> then generate a report or whatever that ties the two together.  Give
> me a use case for needing to retrieve a long list of opaque identifiers
> related to projects completely outside my bailiwick.
> 
> Meanwhile, the DSIDnnnn scheme will require a potentially very expensive
> traversal of several keywords in every header being considered.  Imagine
> mapping your header keywords to DB schema.  Isn't your DB simply going
> to contain a column (perhaps a vector) named ADS_ID? Our DB will certainly
> contain a column named something like NOAO_ID.  What is the value in
> piling up a bunch of unrelated information under the same keyword
> heading?

While I'm sure there might be exceptions, I'd hope that generally there would
be a single set of IDs maintained by a single institution.  Having multiple
sites responsible for independent sets of IDs might occasionally be
necessary, but I don't think that's what we want to encourage.

We would certainly want the NOAO to be maintaining the IDs for the datasets
in its domain.  But the NOAO should not be constrained regarding the format of the IDs...
Which is why I don't wish to put any significant constraint on the syntax of the ID.

> 
> The whole notion of keyword=value pairs is that the keyword identity
> supplies some of the information.  When I want the date of an
> observation, I query DATE-OBS, not a list of DATEnnnn keywords,
> searching for a string matching "OBS/20040324T170325Z".  In effect,
> the DSIDnnnn scheme asserts that our users won't find any direct use
> for the dataset identifiers, otherwise we wouldn't make it so hard for
> them to get at them.  Instead of:
> 
>     cl> hselect *.fits adsid yes
>     "<this is an ADS ID string>"
> 
> they would have to do something like:
> 
>     cl> for (i=1; i<=3; i+=1) {
>     >>> hselect ("test*.fits", "dsid000"//i, yes) | match ADS
>     >>> }
>     "<this is an ADS ID string>"
> 

Not at all...  Forgive my lack of knowledge of IRAF, but if you have
an ADSID and an NOAOID then we have to coordinate them anyway or the user
is going to have to write code like

       if (thereIsAnADSID) then
            use the ADSID
       else if (thereIsAnNOAOID) then
            use the NOAOID
       else if (thereIsaCDSID) then
            use the CDSID
       ...
and every time we get a new ID authority we have to add another
test.

I much prefer
       if (thereIsaDSID) then
            call theDSIDResolver()

This doesn't eliminate the switch statement above.  It's
just moved into theDSIDResolver, but in a networked world
that's very likely to be a Web service that many users
invoke, so the impact of a new kind of ID is much less and
most people's software accommodates it with no changes.
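
A rough Python sketch of how the switch might live inside a single resolver
is given here.  Everything in it is hypothetical: the "AUTHORITY/local-part"
value syntax is invented purely for the example (nothing in this thread fixes
a syntax), and the handlers return placeholder strings where a real resolver
would call out to each authority's service.

```python
def resolve_ads(local_id):
    # Placeholder: a real handler would query the authority's service.
    return f"ADS:{local_id}"


def resolve_noao(local_id):
    # Placeholder, as above.
    return f"NOAO:{local_id}"


# The "switch statement" lives here, in one place.  Supporting a new
# authority means adding one entry; callers do not change.
RESOLVERS = {
    "ADS": resolve_ads,
    "NOAO": resolve_noao,
}


def the_dsid_resolver(dsid):
    """Dispatch a dataset ID to the handler for its authority.

    Assumes (for illustration only) that the ID looks like
    "AUTHORITY/local-part".
    """
    authority, sep, local_id = dsid.partition("/")
    if not sep:
        raise ValueError(f"no authority prefix in {dsid!r}")
    handler = RESOLVERS.get(authority)
    if handler is None:
        raise LookupError(f"unknown ID authority {authority!r}")
    return handler(local_id)


print(the_dsid_resolver("ADS/example-0001"))
```

Client code calls the_dsid_resolver() for whatever ID it finds and never
tests for individual authorities itself.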

E.g., I don't need to worry about handling resolution
of new object names as they are added to astronomical nomenclature.
I send NED and SIMBAD the strings and they do the resolution
for me.  The same would occur with the ID resolvers.  Short
of sending servers essentially the complete FITS headers, this
approach doesn't work if providers all use their own keywords
to store the IDs.

Of course if the NOAOID is used purely internally it is of no interest
to the discussion.  I am assuming that the NOAOID is an ID of interest
to users other than NOAO itself.  Everyone is always free to define
their own IDs for their internal usage.

> I'm sure you can see the usage issues immediately.  Here are just two.
> What about DSIDnnnn values that contain a substring matching one of the
> supported naming authorities (or one that is added in the future)?
> How is the search truncated when you don't know that there are exactly
> three keywords to start?  Sure, a programmer can work around each of
> these - but we add keywords for the benefit of our unsophisticated
> users, too :-)

The same issue crops up if you use different keywords or encode something
in the value.  The advantage of putting it in the value is that software
knows where in the header to find the information it needs to start with.

> 
> This also begs the question of identifiers for individual FITS HDUs.
> A particular FITS file or HDU may belong to multiple datasets.  A
> particular HDU has a single identity, however.  Shouldn't part of
> this discussion include how to supply a community wide identifier
> for each separate FITS object?  Imagine starting with a dataset ID.
> Doesn't that set ID have to coexist with some mechanism for referencing
> all of its many members?
>

The issue of identity has been extensively discussed in the Virtual Observatory
community.  The suggested ADEC convention is compatible with the outcome
of that discussion, but the outcome was essentially that this is not
something that can be solved generally.  Resolving the IDs into links
to the entire dataset is certainly something that we want.  The current
ADEC service does this.  If you were to build NOAO IDs I certainly
hope that you would provide such a service.

> 
>>Here I think you are confusing the metadata describing an observation
>>with the 'name' of an observation.
> 
> 
> Ah!  To return to Lucio's contribution:
> 
> 
>>>My personal tendency (but I'm an end user and not an archive maintainer
>>>in this context) would have been to put part of the information in
>>>directory names and not in file names (e.g. for my own BeppoSAX
>>>analysis I used to store files as [A']/[B']/[C]/[D].type,
> 
> 
> A familiar issue is how to tie an archive's data stores together with
> its metadata DB.  NOAO is specifically considering precisely this
> directory tree structure for our raw data store and also how to tie
> it into the resulting headers/DB.
>

> 
>>Why should an ID have the time?
> 
> 
> Astronomers have too often relied on convoluted filenames to convey the
> placement of a specific data file within some multidimensional parameter
> space.  Time is key to groundbased observations because access to our
> telescopes (and the resulting proprietary ties that bind) is distributed
> via the calendar and clock.
>
And I have no problem with including the time (or any string
a user chooses) in the dataset ID.  I just don't see why the
proposal needs to mandate it, or even worry about that level
of detail.

It's been my experience though that when building tables, it's very convenient to
have a simple unique key -- even if it's completely arbitrary -- rather
than building it up by concatenating enough elements in the table to
make each entry unique.  It sounds to me like that's what you are doing
here, but if it works for you that's fine with me.  I'm not trying to
suggest you use any given approach.
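
As a small illustration of the "simple unique key" point, the sketch below
uses Python's built-in sqlite3 module.  The table layout, column names, and
sample values are all invented for the example and do not describe any real
archive schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observations (
        dataset_id TEXT PRIMARY KEY,  -- simple, arbitrary unique key
        telescope  TEXT,
        proposal   TEXT,
        date_obs   TEXT,
        -- the descriptive columns can still be kept unique as a group
        UNIQUE (telescope, proposal, date_obs)
    )
    """
)
conn.execute(
    "INSERT INTO observations VALUES (?, ?, ?, ?)",
    ("ds-000001", "tel-A", "prop-42", "2004-03-24T17:03:25"),
)

# Lookups need only the one key, not a concatenation of the other columns.
row = conn.execute(
    "SELECT telescope, date_obs FROM observations WHERE dataset_id = ?",
    ("ds-000001",),
).fetchone()
print(row)
```

The descriptive columns remain available as metadata, but nothing has to
parse them back out of the key.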

> This is precisely why different dataset IDs might require very different
> FITS support.  IDs generated for tying publications to data, for instance,
> are likely going to be very different than IDs generated for tying data
> objects to telescopes or archives.

Maybe, but if we allow multiple IDs for a given element I don't see
why that matters.

> 
> 
>>Why does it need a proposal ID, nation, agency?
> 
> 
> Our need for a dataset identifier is precisely to implement the
> proprietary policies of our current organization.  I am very supportive
> of taking the very long term view of provenance.  Over the very long
> term, perhaps the fact that an entity known as NOAO used to own the
> data may no longer matter.  Perhaps a particular national observatory
> will no longer exist because a particular nation will no longer exist :-)
> (In his salad days, a college professor of mine set up the Iranian
> National Observatory before the Shah fell...)  In the long term we're
> all dead :-)

But we hope our data are not!  Again this seems to be saying that
we need to cram things into the data id so that it serves as a mini-description
of the dataset.  I'm not keen on that approach, but it seems
to be easy to accommodate within the very broad context I'm
suggesting is all we should try to agree on.

> However, NOAO's current need is precisely to consider who owns what and
> who may have access to data and precisely when.  Your mileage may vary -
> which is what says to me that an ADS ID scheme should be placed in an
> ADS branded keyword.  It isn't that I have an issue with ADS dataset
> IDs - far from it.  I have an issue with a single style of dataset ID
> coopting the entire notion of placing data within sets.
> 

I guess that's what confuses me most...  All I'm suggesting we
agree on (right now at least) is the keyword names.  I'm explicitly not
advocating for the ADEC style -- though I think it can accommodate
much if not all of what you would like to do.

	Regards,
	Tom