[fitsbits] 'Dataset Identifications' postings (digest)

Tue Mar 23 03:47:23 EST 2004

On 16 Mar 2004, Don Wells wrote a digest :

> From: Thomas McGlynn <tam at lheapop.gsfc.nasa.gov>

> There is an effort underway at several of the NASA archives to provide
> a standard dataset identifier for data that can be retrieved from the
> archives.  The initial motivation is that when authors publish [...]

Motivation understood and agreed.

> The keyword 'DS_IDENT' has been suggested. Does anyone have objections
> to this or do they know of systems that already use this keyword?

I believe this or any other unused name is fine.
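
For illustration only, here is a minimal Python sketch of how such a
keyword might look as a fixed-format 80-character FITS header card. The
helper name string_card and the value 'EXAMPLE-0001' are made-up
placeholders; nothing here is meant as a proposed identifier syntax.

```python
def string_card(keyword: str, value: str, comment: str = "") -> str:
    """Format a fixed-format FITS string card, padded to 80 characters."""
    if len(keyword) > 8:
        raise ValueError("FITS keyword names are at most 8 characters")
    # Single quotes inside the value are doubled; the string is padded to
    # at least 8 characters so the closing quote is at column 20 or later.
    quoted = "'" + value.replace("'", "''").ljust(8) + "'"
    card = f"{keyword.upper():<8}= {quoted}"
    if comment:
        card = f"{card} / {comment}"
    if len(card) > 80:
        raise ValueError("card does not fit in 80 characters")
    return card.ljust(80)


# Hypothetical identifier value, for illustration only.
card = string_card("DS_IDENT", "EXAMPLE-0001", "dataset identifier")
print(card.rstrip())
```

What the value string should contain is exactly the open question
discussed in the rest of this thread.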

------------------------------------------------------------------------
> From: seaman at noao.edu (Rob Seaman)

> NOAO (through "Save the bits") has three or four million discrete FITS
> images packaged up into MEF files for purposes of efficient and easy
> handling.  On the other hand, HEASARC's usage supplies an example
> involving one dataset that contains several files.

  Would the former be "one file 'originally from' many datasets but now
  actually a new dataset on its own" ? The latter seems more familiar
  to me. But I can imagine another case, i.e. data retrieved from a
  site with a database and containing part of a catalog.

> Personally, I think before we reserve "DS_IDENT" or any other keyword
> for the purpose of identifying datasets, we should define the concept
> of a "dataset".

Yes I think so.

Let me say what is my *understanding* of a "dataset" (which does not mean
it's something I propose as THE definition !) based on some past
experiences.

In the case of an X-ray satellite, typically one has a unit like an
observing proposal [A], which includes one or more pointings. The pointing
[B] occurs in a given time interval, and may involve SIMULTANEOUS
observations by more than one instrument [C]. For each instrument the
overall time may be divided into consecutive time intervals [D] in which a
given instrument configuration is used.  There may be many different
telemetry packet streams generated during each interval [D], roughly
speaking many different files ... not even FITS files.

At some stage they might be transformed into a group of many different
(FITS ?) files, which will be kept together as a dataset.

Just to give some examples, for the long forlorn Exosat satellite, the
observer received a half-inch tape called a FOT. There was one logical
FOT (maybe spanning several volumes) for each [A][B] combination, where
[B] was called the Observing Period (OP) and the [D] were called
"observations". There were many (non-FITS) files for each [C][D]
combination, but I would call the FOT itself "the dataset".  I don't
remember whether they originally had an identifier other than the name
of the target and the date. I heard that ESTEC much later had plans to
finally re-archive the data as FITS event lists, but I haven't followed
this.

For BeppoSAX, I'm the culprit who forced inheritance of the above
naming, with [A][B] being the OP, and [D] the observations. BeppoSAX had
FOTs (in the form of DAT cassettes with several non-FITS files) and they
were identified by the (sequential) OP number. A dataset was definitely
"the OP" or "the associated FOT". I would say rather "the OP", as ASDC
has also been archiving some reprocessed FITS event lists for online
access, grouped by OP.

For XMM-Newton the naming is different but the concept is similar.
Proposals [A] have a numeric prop-id. The [B] are called "observations"
here and have a 4-digit obs-id. The [D] are called "exposures". Until a
while ago, what observers were given was a CD associated with the
combination [A][B] ... and in fact the data were labelled with the
concatenation of prop-id and obs-id, e.g. 0065760201. Now they distribute
data online only, but the scheme has been retained. "The dataset" is the
ensemble of all (many!) (FITS) files pertaining to an [A][B].  I note
incidentally that, although no tapes are used, the "flat" naming scheme
is still used, with long horrible file names like
P0065760201M1S001EBLSLI0000.FIT.
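
As a rough sketch of the labelling just described, the Python fragment
below splits the 10-digit label into prop-id and obs-id and recovers the
label from a flat file name. It assumes (from the single example above)
that the prop-id is the six digits preceding the 4-digit obs-id, and that
a flat file name is one letter followed by the label; the rest of the file
name is deliberately left uninterpreted. The function names are invented
for this example.

```python
import re

# Assumption: one-letter prefix, then the 10-digit label (6-digit prop-id
# followed by 4-digit obs-id), then a remainder that is not decoded here.
_FLAT_NAME = re.compile(
    r"^(?P<prefix>[A-Z])(?P<prop_id>\d{6})(?P<obs_id>\d{4})(?P<rest>.*)$"
)


def split_label(label: str) -> tuple[str, str]:
    """Split a 10-digit dataset label into (prop-id, obs-id)."""
    if not re.fullmatch(r"\d{10}", label):
        raise ValueError(f"not a 10-digit label: {label!r}")
    return label[:6], label[6:]


def dataset_label_of(filename: str) -> str:
    """Return the 10-digit dataset label embedded in a flat file name."""
    match = _FLAT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised file name: {filename!r}")
    return match["prop_id"] + match["obs_id"]


print(split_label("0065760201"))
print(dataset_label_of("P0065760201M1S001EBLSLI0000.FIT"))
```

With this reading, all files sharing the same 10-digit label belong to
the same [A][B] dataset, whatever the remainder of the name encodes.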

My personal tendency (but I'm an end user and not an archive maintainer in
this context) would have been to put part of the information in directory
names and not in file names. E.g. for my own BeppoSAX analysis I used
to store files as [A']/[B']/[C]/[D].type, and I tend to use shorter names
also for my own XMM reduction, while "the dataset" as distributed by ESA
contains instead only two directories, one with the semi-raw FITS
reformatted data, and the other one with the pipeline products.

But that (flat or tree) arrangement leaves unchanged the definition of
which files constitute "a dataset".
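
A small Python sketch of this point: the same identifiers can be laid out
as a tree or flattened into one file name, and the dataset key ([A][B] in
this example) is recoverable either way. All identifier values are
placeholders, and the underscore separator in the flat name is an
assumption made only for this illustration.

```python
from pathlib import PurePosixPath


def tree_path(a, b, c, d, ftype):
    """Tree layout: identifiers become directory levels, A/B/C/D.type."""
    return PurePosixPath(a, b, c, f"{d}.{ftype}")


def flat_name(a, b, c, d, ftype):
    """Flat layout: the same identifiers joined into a single file name."""
    return f"{a}_{b}_{c}_{d}.{ftype}"


def dataset_key_from_tree(path):
    """The first two directory levels identify the dataset."""
    return path.parts[:2]


def dataset_key_from_flat(name):
    """The first two underscore-separated fields identify the dataset."""
    return tuple(name.split("_")[:2])


ids = ("propA", "op1", "instr1", "obs01", "evt")  # placeholder values
print(tree_path(*ids))
print(flat_name(*ids))
print(dataset_key_from_tree(tree_path(*ids)) ==
      dataset_key_from_flat(flat_name(*ids)))
```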

To go back to another old (but simple) example, in the case of the UV
satellite IUE, nobody cared about the proposal id [A] or the object [B]
when referring to a dataset. The "unit" was one exposure (one spectrum
with a given camera, only one camera being operative at any time), or
"image", which had identifiers [C][D], e.g. SWP11056. The data delivered
to the observer were a set of 4-5 files (originally non-FITS) for each
"image" (one raw image, and the steps and results of a pipeline). In
this case I would be inclined to consider this group of files as "the
dataset" (irrespective of the fact that more than one, unrelated, could
be placed on a tape).

I'm not terribly familiar with the way a ground site like ESO manages its
archives, but definitely a proposal [A] can refer to many targets [B], and
ultimately to units called "OBs" (Observing Blocks) which are split into
exposures. Exposures taken at different times may be associated (e.g. for
a multi-object spectrograph one can associate the exposure taken with a
given mask with the dark or lamp calibration taken later with the same
mask), so it's this association I'd call "the dataset".

In any case, I've been talking so far about raw, semi-raw or
standard-reduced data archived at the original observatory (or other site
in charge of archiving) pertaining to a pointing of an object at a given
time.
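
To make the difference in granularity concrete, here is a hedged Python
sketch in which the same kind of file records are grouped into "datasets"
using different subsets of the [A][B][C][D] labels: by [A][B] as in the
XMM-like case, or by [C][D] as in the IUE-like case. The record type, the
field names and all values are invented for the illustration and do not
correspond to any real archive.

```python
from collections import defaultdict
from typing import NamedTuple


class FileRecord(NamedTuple):
    proposal: str    # [A]
    pointing: str    # [B]
    instrument: str  # [C]
    interval: str    # [D]
    name: str        # file name (placeholder)


def group_into_datasets(records, key_fields):
    """Group file names by the chosen identifying fields."""
    datasets = defaultdict(list)
    for rec in records:
        key = tuple(getattr(rec, field) for field in key_fields)
        datasets[key].append(rec.name)
    return dict(datasets)


records = [
    FileRecord("A1", "B1", "C1", "D1", "f1.raw"),
    FileRecord("A1", "B1", "C1", "D1", "f1.out"),
    FileRecord("A1", "B1", "C2", "D2", "f2.raw"),
    FileRecord("A1", "B2", "C1", "D3", "f3.raw"),
]

# One dataset per proposal/pointing combination ([A][B]).
by_pointing = group_into_datasets(records, ("proposal", "pointing"))
# One dataset per instrument/exposure combination ([C][D]).
by_exposure = group_into_datasets(records, ("instrument", "interval"))

print(len(by_pointing), len(by_exposure))
```

The files are the same in both cases; only the choice of key decides what
counts as "a dataset", which is why the concept needs defining before a
keyword is reserved for it.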

More to come below ...

------------------------------------------------------------------------
> From: Jonathan McDowell <jcm at head.cfa.harvard.edu>

> suppose I have run a modelling tool to get the best deconvolved image
> fit simultaneously to ROSAT and Chandra data, and stored the
> result in the FITS file. [...]

> However, I would say to Thierry that the new file should indeed have a
> brand new dataset identifier - you have in this case created a new
> dataset. The traceability to the original observations should be done

  This is indeed a new case. In general I'm inclined to consider the
  result of any analysis (as opposed to plain "reduction") to be
  "private" data. One may keep them, but privately. What matters are the
  numbers in the published paper.

  But there might be cases indeed in which such data could be stored
  and made publicly available (forever ?) although not in a mission
  archive.

  OK, they are "a new dataset" but who names them ? Are we going to
  run into things like "official naming authorities", like the awful
  "certificates" and "self signed certificates" stuff ? Should we just
  delegate it to the journals and/or use the bibcode (somebody said
  something like that) ?

  There is at least one more, different case: databases and catalogues.
  E.g. I'm managing the database for the XMM-LSS survey (which is a
  survey done *with* XMM by a consortium using some GO time, but not
  *by* the XMM ESA project staff, hence "unofficial"). Our collaboration
  members (and later the public) can export catalogue subsets as FITS
  files. So far I've not worried about "dataset identification".

  Of course each RECORD in one of my tables which refers to the XMM data
  is associated with an XMM pointing (and its propid-obsid), but I'm not
  keeping this info explicit. And there are other tables containing
  non-X-ray data taken by us (with an optical telescope or with the VLA).
  There are tables which are authorized subsets of data taken by other
  consortia. There are tables which are pointers to NED or SIMBAD.

  Should I really worry here about traceability ? Or just say that the
  dataset is the XMM-LSS project (an ORIGIN keyword would be enough !) ?

----------------------------------------------------------------------------
Lucio Chiappetti - IASF/CNR - via Bassini 15 - I-20133 Milano (Italy)
----------------------------------------------------------------------------
L'Italia ripudia la guerra [...] come    Italy repudiates war [...] as a
mezzo di risoluzione delle controversie  way of resolution of  international
internazionali                           controversies
                [Art. 11 Constitution of the Italian Republic]
----------------------------------------------------------------------------
For more info : http://www.mi.iasf.cnr.it/~lucio/personal.html
----------------------------------------------------------------------------