[fitsbits] Questions about the 'REFERENC' keyword.

Thu Jan 31 10:26:20 EST 2013

Thank you all for the responses -- I've been tied up in meetings
and our annual 'recycle your posters' thing, so I apologize for
the delayed response.

I've merged together a few messages, rather than reply to them
all individually. 

On Jan 30, 2013, at 11:14 AM, Randy Thompson wrote:

> Hi Joe,
>   I just wanted to point out that the FITS standard you referenced
> at our web site archive.stsci.edu is for FITS Standard Version 2.0.
> You can find the latest (version 3.0) at
> http://fits.gsfc.nasa.gov/fits_standard.html.

I'm going to guess that Google pointed me to the older version, as
it had an HTML version available, while v.3 only has the PDF and
PS documents.

(I realized I had written a note about 'REFERENC' on one of my
posters from the AAS meeting, and this whole thing started in my
attempts to verify how it's intended to be used)

> I see there is more information on the REFERENC keyword in the latest
> version.

Thank you -- that leads me to believe that it might actually
be a 5th option, based on 'AUTHOR' in that same section:

	... This keyword is appropriate when the data originate
	in a published paper or are compiled from many sources.

The older wording for REFERENC then makes complete sense if it was
to be used for FITS files that contained data extracted from
journal articles, such as the data distributed by the VizieR
service.

(and in response to Thierry -- v.3 specifically recommends a
bibcode or DOI, so they're more than acceptable)

On Jan 30, 2013, at 10:55 AM, Eric Greisen wrote:

[trimmed]

> FITS files were and are expected to contain the data which they describe, so I am uncomfortable with the file containing solely some URL pointer.  Since URL's do not retain their value over time at all well the information may become obsolete anyway.   A URL should point to the complete FITS file that the user would want so that their reader can swallow it should they choose to download it.
> 
> I do understand the need for a modest sized data "catalog" to allow users to browse and then select the data files for downloading.  I suppose you could use dataless FITS files to describe each of the large-data FITS files.  But those dataless files would have to be legitimate FITS files (e.g. NAXIS=0 or NAXISj = 0) and so would not look precisely the same as the actual data.  However, I thought this was the role of the VO, or at NRAO, our own home-brew data catalog web browser.

I had thought of the NAXIS=0 part (the files won't pass the
FITS file verifier otherwise), but it's actually a bit of a
problem if someone wants to know which files are the full images
and which are of a more limited field of view.

And I don't work in astronomy -- I work on the Virtual Solar
Observatory, so most of the VO & NRAO service-based standards
just don't work for us as-is.  We've talked about using VOTable
for interchange, but our community favors FITS, and it's really not
worth my time to try to push for scientists to convert and learn
new tools.

(New tools that give them new functionality, sure ... but having
to learn a parallel set of tools that aren't IDL?  I'd have
better luck herding cats)

On Jan 30, 2013, at 11:37 AM, Lucio Chiappetti wrote:
> On Wed, 30 Jan 2013, Eric Greisen wrote:

[trimmed]

>> Joe Hourcle wrote:
> 
>>> 	3. A reference (citation, URL, DOI or bibcode) to a published
>>> 	   research article that uses the data.
>>> 	4. A URL to a website with documentation on using the
>>> 	   data
> 
>> When the keyword was invented, only one of the concepts listed above was
>> even conceivable - a published article citation.
> 
> I tend to agree with Eric. I.e. case (3). A literature reference, a 
> bibcode or a DOI should be long-lived. The bibcode or DOI could be 
> translated into an URL prepending the name of some server (like e.g. one 
> of the present ADS sites), but we have no guarantee these particular 
> servers will exist if one will look at the file 20 or 100 years in the 
> future (while the bibcode or DOI could probably be used to lookup 
> elsewhere). Of course if the expected lifespan of the file is much less, 
> no problem in using an URL.

Ideally, you'd use an indirect URL through some sort of resolver,
so that you'd be able to maintain persistence (e.g., the 'dx.doi.org'
resolver to express DOIs as URLs).
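
As a rough sketch of what I mean by going through a resolver (Python;
the function name and the example DOI are made up for illustration,
and only the 'dx.doi.org' host comes from the discussion above):

```python
from urllib.parse import quote

def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Express a bare DOI as a URL that goes through a resolver."""
    bare = doi.strip()
    # Accept "doi:10.1234/abc" as well as "10.1234/abc".
    if bare.lower().startswith("doi:"):
        bare = bare[4:]
    # Percent-encode anything that isn't safe in a URL path, keeping "/".
    return resolver.rstrip("/") + "/" + quote(bare, safe="/")

# Hypothetical DOI, purely for illustration.
print(doi_to_url("doi:10.1234/example.5678"))
```

The file would only need to store the DOI itself; the resolver host is
the part that can be swapped out if that particular server goes away.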

Right now, there's a cost to mint DOIs, and it's high enough that
it's unlikely that organizations would assign a DOI to each image.  I
don't think bibcodes are long enough to identify individual images,
either.

The general consensus in the data provenance & data citation fields
right now is to assign a DOI at collection level, and then some
additional granule identifier for each image / file / whatever.
The DataCite / EZID group is pushing for DOI+ARK, while ESIPFed
seems to prefer DOI+UUID.

> I believe case (4) may also be acceptable, although I'd expect such 
> information to show up in comments (if at all).
> 
>>> 	1. It's similar to the 'FITS Serialization' in VOTable, where you
>>> 	   don't have the data attached, and can instead give a URL:
>>> 	2. A URL to the archive or repository are available to
>>> 	   download from
>>> 
> 
> Not really clear to me ... (3) and (4) point to a document which REFERS TO 
> (INTO ?) the current file. But (1) and (2) seems to refer OUT of the 
> current file to elsewhere !

They all refer out ... it's just that some of them might refer
back as well.  (which gets into issues with which one gets created
first ... but I won't get into all that)

>> I do understand the need for a modest sized data "catalog" to allow
>> users to browse and then select the data files for downloading.
> 
> Hmm, my way to do this sort of things would be (actually is, I have a 
> couple of survey databases organized that way) to have a database 
> somewhere ('the catalog') and associated data products linked to a 
> particular database column.
> 
> If I do a query, it will return a list of "objects", and a list of 
> distinct available data products (in forms of URLs). The number of data 
> products need not to be related to the n records returned by the query.
> 
> For instance if the dataproduct is a thumbnail image around the object, or 
> a spectrum, there will be one for each of the n objects, indexed on the 
> object sequence number.
> 
> But if the dataproduct is an X-ray image of the entire field where the 
> objects are, it could be that the n objects are in just p << n fields, so 
> there will be p images, indexed on the field identifier.
> 
> The way I do it, the URL is constructed on the fly from templates stored 
> in an administrative database, replacing a placeholder with the value of 
> the associated index column.
> 
> Now what is proposed ? To defer the database search to the home of the 
> user which has retrieved the FITS catalog ? Using a custom FITS reader ?
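
To make sure I'm reading the template scheme correctly, here's a
minimal sketch of it in Python.  The function name, the '{id}'
placeholder, the column names and the example.org URLs are all my own
inventions for illustration, not anything from Lucio's actual system:

```python
def expand_urls(template, rows, column, placeholder="{id}"):
    """Build one URL per distinct value of the index column."""
    seen = []
    for row in rows:
        value = str(row[column])
        if value not in seen:      # keep first-seen order, drop repeats
            seen.append(value)
    return [template.replace(placeholder, v) for v in seen]

# Three objects (n = 3) that fall in two fields (p = 2).
rows = [
    {"object": 1, "field": "F01"},
    {"object": 2, "field": "F01"},
    {"object": 3, "field": "F02"},
]

# One thumbnail per object, indexed on the object sequence number ...
thumbs = expand_urls("http://example.org/thumb/{id}.fits", rows, "object")
# ... but only one full-field image per distinct field.
images = expand_urls("http://example.org/xray/{id}.fits", rows, "field")

print(len(thumbs), len(images))
```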

We have something really similar to what you describe.  What we have
right now is a federated catalog system -- there's a single point of
entry that can search across multiple missions & instruments:

	http://virtualsolar.org/

(there are actually several clients ... web-based, IDL, Python, etc.,
but they all use the same API)

We return an array of records (structs, objects, dicts, hashes; the
exact implementation varies per language) in response to a given
query, in which we've normalized or inferred values that might not've
been in the original file.  (e.g., we return the spectral range, which
might only be derivable from engineering metadata (CAMERA=3, FILTER=4),
or might not be included at all because it was fixed for that
instrument).

But we don't return all of the fields; in some cases, we're actually
scraping FTP sites and basing the records on decoding the instrument's
filenaming scheme and directory structure.

The researcher can then give us a list of what files they're
interested in, and we then tell them how to get the files.  (this
seems counter-productive until you consider that not all of the
data was available online, and so we'd have to send requests to
the various archives to retrieve the data from tape; this also
allows us to save the list of files downloaded, and recall it
years later, when the files might've been moved to a different
location)

But when you're dealing with large images and a lot of them (57k AIA
images per day at 16 megapixel, 12 bit), we don't want someone to
have to spend a few hours or days downloading files only to find
that they're not useful for what they're trying to do.
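
To put rough numbers on that, here's a back-of-the-envelope
calculation.  It assumes '16 megapixel' means 4096 x 4096 pixels and
ignores headers and compression; both are my assumptions, not measured
figures:

```python
IMAGES_PER_DAY = 57_000
PIXELS = 4096 * 4096        # assumption: "16 megapixel" = 4096 x 4096
BITS_PER_PIXEL = 12

def bytes_per_day(bits_per_pixel):
    """Raw pixel volume for one day of images, in bytes."""
    return IMAGES_PER_DAY * PIXELS * bits_per_pixel // 8

packed = bytes_per_day(BITS_PER_PIXEL)   # pixels packed at 12 bits each
padded = bytes_per_day(16)               # each pixel in a 16-bit integer

print(f"{packed / 1e12:.2f} TB/day packed, "
      f"{padded / 1e12:.2f} TB/day as 16-bit")
```

That works out to roughly 1.4 TB per day if the pixels were packed at
12 bits, or about 1.9 TB per day stored as 16-bit integers, before any
compression.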

So for that idea, the URLs don't necessarily need to be long-lived,
if they're being generated on demand.  I could see someone wanting
to store these long-term, though, so it might be that we need to
ensure that the URLs point back through our resolver (which would
redirect to the file's location).

>> I suppose you could use dataless FITS files to describe each of the 
>> large-data FITS files.
> 
> One does not need a dataless file (nor it is an optimal solution) to a
> make a "portable FITS database". The right way to go is probably a 
> BINTABLE where the *variable part* of the URLs is stored in some column 
> (the fixed template could be stored in a custom keyword, and handled by 
> the custom reader).  So a row in a BINTABLE not a multiple keyword.
> 
> Different story is if the remote user wants to see the dataproduct FITS 
> header before deciding whether to retrieve the bulky data portion.
> 
> This could be handled in different ways. One is to have the header and 
> data stored separately in their site, and having URLs pointing to a CGI 
> which either retrieves the dataless header, or merges header and data 
> before sending.
> 
> Another would be to have a custom client which can retrieve from the FITS 
> file until the END keyword of the header ignoring the data part. It is not 
> difficult to write e.g. a sequential FITS reader in Java (or, though 
> different, a Java client which retrieves an ftp file "from record n to 
> record m" ... I did it, although it naively reads unwanted records 1 to 
> n-1 and simply ignores them without outputting them).

I was thinking I'd attempt to figure out which tools people are using
for reading the files (most are using IDL, but we also have some GDL,
PDL (which I think uses cfitsio), MATLAB and Python) ... and then
add in support for the most used clients.

The BINTABLE approach is actually more difficult, because for the
SDO/AIA and SDO/HMI data (and for the upcoming IRIS data), the system
that they're using for managing the data actually has incomplete FITS
headers.  They're valid FITS, but only to the bare minimum ... all of
the scientific metadata is stored in a complex database, and you have
to use their software to reconstruct it.

We've hacked it once so that it streams out the result for a CGI,
rather than writing it out to disk ... and I could probably find a
way to get it to look for 'END', and abort, then put a wrapper
around it to insert some additional COMMENTS or other headers.
I'm not so confident in my ability to do it with multiple files at
once, however.
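
The "look for 'END' and abort" part, at least, is simple enough to
sketch.  This is a minimal Python version, not our actual wrapper; it
relies only on headers being written in 2880-byte blocks of 80-character
cards, and it assumes stream.read(n) returns n bytes unless the stream
has ended (true for a file or BytesIO, not necessarily for a raw
socket).  The helper names are mine:

```python
import io

BLOCK = 2880   # FITS block length in bytes
CARD = 80      # length of one header card

def read_header_cards(stream):
    """Read 2880-byte blocks until the END card; return the cards before it."""
    cards = []
    while True:
        block = stream.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("stream ended before an END card was found")
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            if card.rstrip() == "END":
                return cards       # stop here; the data is never read
            cards.append(card)

def make_card(key, value):
    """Fixed-format card: keyword, '= ', value right-justified to column 30."""
    return f"{key:<8}= {value:>20}".ljust(CARD)

# A tiny made-up file: one header block followed by one data block.
header = "".join([
    make_card("SIMPLE", "T"),
    make_card("BITPIX", "16"),
    make_card("NAXIS", "2"),
    make_card("NAXIS1", "2"),
    make_card("NAXIS2", "2"),
    "END".ljust(CARD),
]).ljust(BLOCK).encode("ascii")
stream = io.BytesIO(header + b"\0" * BLOCK)

cards = read_header_cards(stream)
print(len(cards), stream.tell())
```

In the demo, only the first 2880 bytes are consumed; the data block
after the header is left unread.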

One of the thoughts was to pre-generate the FITS headers, so that we
could more easily splice them into the files as we're serving,
eliminating our dependency on the complex system.  But a few months
back, I was talking to a couple of scientists, and they said that
they'd like to see the headers before downloading ... which is
why I had even thought of that when reading the 'REFERENC'
documentation.

And part of the goal was to try to make sure that we have support in
IDL / SolarSoft for these data-less files, as many of the missions
have custom reader software that our scientists are already used to.
The hope was that we could make it behave like you had the file ...
examine the headers to decide which ones were useful, and then when
you performed some operation that actually needed the data, it could
go out and retrieve it.
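
In Python terms (the real thing would need to be IDL / SolarSoft), the
behaviour I have in mind looks something like the sketch below.  The
class, the 'fetch' callable, the URLs and the header values are all
hypothetical, just to show the shape of the idea:

```python
class LazyImage:
    """Header is in hand immediately; data is fetched on first use."""

    def __init__(self, header, url, fetch):
        self.header = header     # dict of keyword -> value
        self.url = url           # where the full file lives
        self._fetch = fetch      # callable(url) -> data (hypothetical)
        self._loaded = False
        self._data = None

    @property
    def data(self):
        # Only go out and retrieve the file when something needs the data.
        if not self._loaded:
            self._data = self._fetch(self.url)
            self._loaded = True
        return self._data

# Stand-in for a real download, so the sketch runs offline.
downloads = []
def fake_fetch(url):
    downloads.append(url)
    return b"pixels for " + url.encode("ascii")

images = [
    LazyImage({"WAVELNTH": 171}, "http://example.org/a.fits", fake_fetch),
    LazyImage({"WAVELNTH": 304}, "http://example.org/b.fits", fake_fetch),
]

# Examine headers only; nothing is downloaded yet.
wanted = [img for img in images if img.header["WAVELNTH"] == 171]

# The first access to .data triggers the retrieval, and only for that file.
first = wanted[0].data
again = wanted[0].data     # served from memory, no second download
print(downloads)
```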

Or, with some of the data volumes we're talking about, you could flag
which ones were of interest and set them all to retrieve overnight.
(or not retrieve anything, if you decide that the images aren't
suitable for your purposes)

...

I'll work with the other programmers on our project to see if we
can put together a more formal proposal, and then I'll circulate
it on this list.

(I'll actually have two ... one for the 'data-less' FITS files,
and another one to propose some headers and guidelines for
unambiguously linking to documentation)

...

Thanks again for all of your responses,

-Joe

-----
Joe Hourcle
Programmer/Analyst
Solar Data Analysis Center
Goddard Space Flight Center