[fitsbits] Recommendations for long-term preservation of FITS files?
William Thompson
William.T.Thompson at nasa.gov
Fri Jan 6 15:00:27 EST 2012
Here are my comments on Joe Hourcle's checklist for solar physics FITS headers.
I don't believe that Joe is trying to define a large new list of standard
keywords to be used by all projects, but is instead trying to define broad
goals which can be achieved in the manner that best fits the data in question.
I will therefore point out where standards can be used to address the
questions, versus where instrument-specific keywords should be expected.
In a number of places, Joe refers to certain keywords as being overloaded.
This is usually because some instrument teams failed to correctly follow the
FITS standards as published in the literature, and applied their own
interpretations. However, we in the solar physics community should be evolving
towards full compliance with the FITS standard and consistency with the rest
of the astronomy community, not moving in a separate direction away from these
goals. Therefore, below I will recommend the use of standard keywords in the
published literature, even when Joe describes these keywords as being
"overloaded".
I've chosen to organize my comments in terms of rough categories of data, even
though Joe did not organize his checklist in this fashion, rather than
addressing the document point-by-point. The original checklist is replicated
at the bottom of this email, for comparison.
Some of the non-standard keywords that Joe refers to come from the following
document.
Howard, R. and Thompson, W., 2002, "Proposed keywords for SOHO",
http://stereo.gsfc.nasa.gov/~thompson/soho_keywords.pdf.
Bill Thompson
Category 1, Annotative Information
This consists of the following items in Joe's checklist:
Have I ...
... identified the file format?
... identified the responsible party?
... provided for support of the file?
* My belief is that most of this information is intended to be provided in
human-readable form rather than being computer readable. Thus, I would
expect most of this information to be addressed through COMMENT lines. For
example, the following lines identify the file as FITS, and give a reference
to the FITS standard:
COMMENT FITS (Flexible Image Transport System) format is defined in 'Astronomy
COMMENT and Astrophysics', volume 376, page 359; bibcode: 2001A&A...376..359H
* Much of the requested information is for links or citations to documentation,
such as the instrument description, keyword dictionaries, user guides, and
the like. The following text from the standard (Pence et al., 2010) about the
REFERENC keyword would also be applicable to these citations:
It is *recommended* that either the 19-digit bibliographic identifier
used in the Astrophysics Data System bibliographic databases
(http://adswww.harvard.edu/) or the Digital Object Identifier
(http://doi.org) be included in the value string when available (e.g.,
"1994A&AS..103..135A" or "doi:10.1006/jmbi.1998.2354").
as well as the associated footnote
This bibliographic convention (Schmitz et al. 1995) was initially
developed for use within NED (NASA/IPAC Extragalactic Database) and
SIMBAD (operated at CDS, Strasbourg, France).
* Note that the definition of the standard REFERENC keyword makes it clear that
it is intended for science publications based on the particular data in the
FITS file, and not for a general instrument paper.
* One standard keyword discussed in this section as "overloaded" is the ORIGIN
keyword. The non-standard SOHO keyword INSTITUT is also mentioned, because
some teams have misused it for the purpose that ORIGIN was intended for.
Even when the ORIGIN keyword has been used correctly, it usually just gives a
short abbreviation such as "GSFC" or "NRL". My recommendation is that the
standard ORIGIN keyword *should* be used to give the name of the organization
or institution responsible for creating the FITS file, with enough
information to be unambiguous, e.g.
ORIGIN = 'NASA Goddard Space Flight Center (GSFC), Greenbelt, MD, USA'
* Additional standard keywords addressing the goals of this section are AUTHOR
and OBSERVER. These are not generally applicable to solar data, but may be
relevant in specific cases.
* Part of this general category of information includes the following:
___ included important usage caveats in the file?
___ given a warning if it was quicklook data?
I would expect this level of information to be handled in a way particular to
a given instrument, and not easily generalizable to a specific set of
keywords for all cases.
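
As an illustration of the ORIGIN recommendation above, here is a minimal Python sketch that formats a fixed-format string card and checks that the fuller institution name still fits within a single 80-character card. The helper name string_card is hypothetical and not part of any FITS library; a real project would normally rely on an established FITS package instead.

```python
def string_card(keyword: str, value: str) -> str:
    """Format a fixed-format FITS string card, padded to 80 characters.

    Hypothetical helper for illustration only.
    """
    if len(keyword) > 8:
        raise ValueError(f"keyword {keyword!r} is longer than 8 characters")
    # Single quotes inside the value are doubled; short strings are padded
    # to 8 characters so the closing quote falls at or after column 20.
    quoted = "'" + value.replace("'", "''").ljust(8) + "'"
    # Keyword in columns 1-8, "= " in columns 9-10, opening quote in column 11.
    card = f"{keyword.upper():<8}= {quoted}"
    if len(card) > 80:
        raise ValueError(f"value for {keyword} does not fit in one 80-character card")
    return card.ljust(80)


card = string_card(
    "ORIGIN", "NASA Goddard Space Flight Center (GSFC), Greenbelt, MD, USA"
)
print(card.rstrip())
```

The length check matters because a more descriptive ORIGIN value can easily outgrow a single card if the address is spelled out in full.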
Category 2, Data Identification
This consists of the following items:
... identified the file?
... identified the observation?
* The FITS standard lists TELESCOP and INSTRUME as standard keywords which
pertain to this category. However, the definitions of these keywords are
broad enough to support a number of different interpretations. Howard and
Thompson (2002) add the keyword DETECTOR, and the STEREO project added the
keyword OBSRVTRY to distinguish between the two spacecraft (observatories)
making up that mission. The combination of these four keywords should be
enough to cover all possible situations.
* Joe also lists the non-standard keywords CAMERA (evidently a synonym for
TELESCOP, though one could also imagine it being a synonym for DETECTOR), and
SOURCE whose exact definition is unclear to me. Such keywords should be
tolerated as instrument specific, but should not be encouraged.
* Another non-standard but common keyword relevant here is FILENAME. Joe
describes this as being "the originally assigned filename", but I find this
wording problematic because "original" could suggest the name of an earlier
file before processing was applied, which I don't think is what Joe intended.
I suggest the following modified wording: "the unique name identifying this
data file".
* Joe also lists several keywords relating to time. Of these, the only
official keyword is DATE-OBS. The SOHO project adopted a convention using
underscores instead of dashes (i.e. DATE_OBS instead of DATE-OBS) because
that project adopted a Y2K-compliant format before the IAU/FWG. In addition,
in some places the DATE_... keywords are described as being corrected for the
difference in light travel time relative to Earth, a concept which is really
only valid for a spacecraft at the L1 Lagrange point. The use of
DATE_... keywords should be tolerated as SOHO mission specific, but should be
discouraged in favor of the DATE-... formulation. (Joe also lists the
keyword T_OBS, but I don't know where that comes from.) In particular,
missions and instruments should start adopting the FITS Time WCS standard
currently under development, which can be found at
http://hea-www.cfa.harvard.edu/~arots/TimeWCS/
As well as DATE-OBS, that paper also defines DATE-END and DATE-AVG.
* One of the items that Joe refers to is the data series. This is a concept
which seems to be very specific to the Solar Dynamics Observatory (SDO), and
I don't see how this could be easily generalized. For SDO, this appears to
be related to the "usage caveats" discussed above.
* The FITS standard includes additional standard keywords, such as OBJECT,
which may or may not be relevant.
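
The DATE-OBS, DATE-END and DATE-AVG keywords discussed above can be sketched in Python as follows. This is only an illustration under stated assumptions: DATE-OBS is taken to mark the start of the exposure, DATE-AVG is taken as the simple midpoint, the input time is treated as UTC, and the function name and sample values are hypothetical.

```python
from datetime import datetime, timedelta


def observation_times(start: datetime, exposure_s: float) -> dict:
    """Return DATE-OBS, DATE-AVG and DATE-END as ISO 8601 strings.

    Assumes `start` is the UTC start of the exposure (hypothetical helper).
    """
    end = start + timedelta(seconds=exposure_s)
    mid = start + timedelta(seconds=exposure_s / 2)

    def fmt(t: datetime) -> str:
        # CCYY-MM-DDThh:mm:ss.sss, the Y2K-compliant date-time form
        return t.isoformat(timespec="milliseconds")

    return {"DATE-OBS": fmt(start), "DATE-AVG": fmt(mid), "DATE-END": fmt(end)}


# Arbitrary sample values: a 10 second exposure
times = observation_times(datetime(2012, 1, 6, 15, 0, 27), 10.0)
for keyword, value in times.items():
    print(f"{keyword:<8}= '{value}'")
```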
Category 3, Data Processing Information
* The description of the processing that was applied to the data in the file is
traditionally done through a series of HISTORY statements, though
instrument-specific keywords may also be relevant.
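
As a rough sketch of the HISTORY approach, the following Python fragment splits a processing note into 80-character HISTORY cards, with the free text in columns 9-80. The function name, the sample note, and the pipeline name in it are hypothetical.

```python
import textwrap


def history_cards(text: str) -> list:
    """Split a processing note into 80-character HISTORY cards."""
    # The keyword occupies columns 1-8, leaving 72 characters of free text.
    return [f"HISTORY {line}".ljust(80) for line in textwrap.wrap(text, width=72)]


note = (
    "Dark current subtracted, flat field applied, and cosmic ray hits removed "
    "using a hypothetical pipeline named example_prep version 1.2"
)
cards = history_cards(note)
for c in cards:
    print(c.rstrip())
```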
Category 4, Coordinate information
* This category is already well covered by the WCS papers, reprints of which
are available from
http://www.atnf.csiro.au/people/mcalabre/WCS/index.html,
by the draft WCS time paper referenced earlier, and by my own papers on the
adaptation of WCS to solar data:
Thompson, W. T., 2006, "Coordinate systems for solar image data", A&A
449, 791-803 (2006A&A...449..791T)
Thompson, W. T., 2010, "Precision effects for solar image coordinates
within the FITS world coordinate system", A&A 515, A59
(2010A&A...515A..59T)
Alternative methods using non-standard keywords such as XCEN and YCEN in place
of the default keywords should very definitely be discouraged.
* Standard keywords for describing the position of a terrestrial observatory
are listed in the WCS time paper (Rots et al., 2012), namely
OBSGEO-X, OBSGEO-Y, OBSGEO-Z ITRS Cartesian coordinates in meters
or
OBSGEO-B Geodetic latitude in degrees
OBSGEO-L Geodetic longitude in degrees
OBSGEO-H Altitude in meters
* Thompson (2006) describes a number of keywords for describing the position of
a spacecraft in a variety of geocentric and heliocentric Cartesian coordinate
systems (Section 9.1), and in particular
HGLT_OBS Stonyhurst heliographic latitude in degrees
HGLN_OBS Stonyhurst heliographic longitude in degrees
CRLT_OBS Carrington heliographic latitude in degrees
CRLN_OBS Carrington heliographic longitude in degrees
DSUN_OBS Distance from sun center, in meters
* Pointing and other coordinate information within a FITS file has always been
implemented through the keywords CRPIXj, CRVALj, CDELTj, CTYPEj, and CROTAj,
where j is the axis number (Wells et al., 1981). The WCS papers add the
keywords CUNITj, and provide some additional clarification on how these
keywords should be defined.
* The keywords XCEN and YCEN can appear in the header, but these keywords
should not be considered a replacement for the standard keywords.
* Joe also refers to "FOV", but I think this is just a reference to the concept
of the field-of-view, and not to a specific keyword with that name. He also
mentions the size of the occulter. The concept of describing the
field-of-view of an instrument can either be simple, or very complex, and in
my opinion is not really well suited for being encoded within a few generic
keywords intended to be applied to all cases.
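
To illustrate why XCEN and YCEN are redundant with the standard keywords, here is a minimal Python sketch of the linear pixel-to-coordinate relation built from CRPIXj, CRVALj, CDELTj and CROTA2. It assumes a simple small-angle (linear) treatment and ignores the spherical projection that the full WCS formalism applies; the function names and header values are hypothetical.

```python
import math


def pixel_to_world(i, j, hdr):
    """Linear pixel-to-coordinate mapping from CRPIX, CRVAL, CDELT, CROTA2.

    Sketch only: ignores map projection, so valid for small fields of view.
    """
    rho = math.radians(hdr.get("CROTA2", 0.0))
    dx = hdr["CDELT1"] * (i - hdr["CRPIX1"])
    dy = hdr["CDELT2"] * (j - hdr["CRPIX2"])
    x = hdr["CRVAL1"] + dx * math.cos(rho) - dy * math.sin(rho)
    y = hdr["CRVAL2"] + dx * math.sin(rho) + dy * math.cos(rho)
    return x, y


def image_center(hdr):
    """Coordinates of the array center, i.e. what XCEN/YCEN usually hold."""
    # FITS pixels are numbered from 1, so the center is at (NAXIS + 1) / 2.
    return pixel_to_world((hdr["NAXIS1"] + 1) / 2, (hdr["NAXIS2"] + 1) / 2, hdr)


# Hypothetical header values, in arcseconds and pixels
hdr = {
    "NAXIS1": 1024, "NAXIS2": 1024,
    "CRPIX1": 512.5, "CRPIX2": 512.5,
    "CRVAL1": 100.0, "CRVAL2": -50.0,
    "CDELT1": 2.0, "CDELT2": 2.0,
    "CROTA2": 0.0,
}
xcen, ycen = image_center(hdr)
print(xcen, ycen)
```

Because the image center follows directly from the standard keywords, a reader never needs XCEN and YCEN to locate the image, whereas the reverse is not true once the image is rotated or the reference pixel is off-center.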
Category 5, Support and Provenance
* A number of the items in this category seem to repeat those in the
previous categories.
* The FITS standard is explicitly hardware and software independent. Thus,
there's no good reason why information such as the type of machine, or the
version of the operating system, should be required to be documented in the
FITS header. This distinguishes FITS from many other file formats, where it
can be important to know what version of the software was used to create the
file.
* Just in general, I recommend that the list of requirements be simplified to
what is truly needed.
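
The recommendations in the categories above could be turned into a very small header check, sketched here in Python. The keyword lists are my own selection drawn from this message, not an official requirement set, and the function name and sample header are hypothetical.

```python
# Standard keywords recommended in the comments above (illustrative selection)
RECOMMENDED = ("ORIGIN", "TELESCOP", "INSTRUME", "DATE-OBS")

# Non-standard keywords that are tolerated, but should not stand in for
# the standard keywords listed beside them.
COUNTERPARTS = {
    "DATE_OBS": ("DATE-OBS",),
    "XCEN": ("CRPIX1", "CRVAL1", "CDELT1"),
    "YCEN": ("CRPIX2", "CRVAL2", "CDELT2"),
}


def review_header(header: dict) -> list:
    """Return human-readable findings for a header given as a dict."""
    findings = []
    for key in RECOMMENDED:
        if key not in header:
            findings.append(f"missing standard keyword {key}")
    for key, standard in COUNTERPARTS.items():
        absent = [s for s in standard if s not in header]
        if key in header and absent:
            findings.append(f"{key} present without {', '.join(absent)}")
    return findings


# Hypothetical header using the SOHO-style underscore date keyword
sample = {
    "ORIGIN": "NASA Goddard Space Flight Center (GSFC), Greenbelt, MD, USA",
    "TELESCOP": "SOHO",
    "INSTRUME": "EIT",
    "DATE_OBS": "2012-01-06T15:00:27.000",
    "XCEN": 0.0,
    "CRPIX1": 512.5, "CRVAL1": 0.0, "CDELT1": 2.6,
}
findings = review_header(sample)
for line in findings:
    print(line)
```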
===============================================================================
Checklist for Solar Physics FITS Headers
Revision : 0.8.0, 2011 December 08
(early release; this needs a whole lot of work)
Have questions or suggestions? Send them to :
Joe Hourcle, joseph.a.hourcle at nasa.gov
Latest version available at:
http://sdac.virtualsolar.org/fits_headers/fits_checklist.txt
--
Glossary: (as used in this document) :
Observation : a reading from a sensor about its environment; a given
observation could exist in multiple processed forms, in multiple FITS files.
Series : a collection of files that come from the same instruments, that are
processed in similar ways.
---
Have I ...
... identified the file format?
___ given an obvious message that it's a FITS file?
___ given a reference to the FITS standard? (citation, bibcode, DOI)
___ given a reference to any FITS extensions used?
... identified the responsible party?
... identified the responsible organization?
___ given the common abbreviation for the institution?
(note -- ORIGIN is overloaded; need to check status of INSTITUT)
___ given the full name for the institution?
___ given other identifying information, particularly if there are other
similarly named groups.
___ provided a postal address?
___ identified the PI or other responsible person?
... provided for support of the file?
... told where to get the documentation on the file?
___ citation to articles describing the instrument?
___ citation or link to documentation on the data file?
___ using a DOI or other persistent ID?
(note : documentation checklist will be coming soon)
___ included important usage caveats in the file?
___ given a warning if it was quicklook data?
___ provided basic usage information in the file?
___ given acknowledgement text or citation to use?
(note : citation standards are also being worked on)
... told people how to get help or report problems?
___ provided an e-mail address?
___ provided a URL for the organization's website?
... identified the file?
___ identified the instrument that took the observation?
(INSTRUM, SOURCE, DETECTOR ... TELESCOP/CAMERA may be overloaded)
___ given instrument/source/etc names that are unique?
___ given a citation to articles describing the instrument?
... identified the (need a name ... data series? ... basically, stuff all of a
similar processing from the same source)
___ given an identifier to the (series) ?
(eg, aia.lev1 vs. aia.lev1_nrt)
___ given information about the "type" of data?
(intensity, magnetogram, IQUV, etc.)
... identified the file within the (series) ?
___ given a unique ID that can be referenced if a researcher has questions?
___ listed the originally assigned filename? (FILENAME)
... identified the observation?
___ given the time of the observation?
___ using DATE_OBS or DATE-OBS ?
___ using T_OBS ?
___ given DATE_END or exposure?
___ mentioned anything about the observing mode, if it varies by observation?
___ mentioned the filter position?
___ mentioned the polarizer position?
___ mentioned everything else that may vary?
... described the data in the file?
___ provided information about what type of processing was done in
human-readable form?
(eg, flat fielded, limb darkening, corrections for point spread)
... described the platform the instrument was on?
___ specified the location of the observatory?
... if a ground-based observatory:
___ given the coordinates in Lat/Long on Earth?
... if a spacecraft:
___ given the coordinates using the World Coordinate System?
(lat/long/altitude if earth-orbiting?)
... if an imager, magnetograph, ... :
___ specified the pointing of the instrument?
___ in terms of CRPIX, XCEN, YCEN, CROTA, FOV?
___ in Carrington Lat/Long ?
___ using the World Coordinate System?
... if ground based :
___ in RA & DEC?
... if a coronograph :
___ specified the size of the occulter?
Support
... provided units for all fields?
... provided full names for abbreviations?
... provided labels for all fields?
... added checksums to ensure the file hasn't been corrupted?
Provenance
... provided a unique ID to the source observation (so that we can identify
multiple processed forms of the same observation)
... provided information about any calibration used?
... version number or date of calibration
... provided information about processing applied?
... text description of type of processing (flat field, point spread applied,
limb darkening, etc.)
... provided names of the software used
... and their version number or last modified date
... mentioned any input variables used in the processing?
... mentioned any other input files?
... and their date / version number
... provided the platform used (eg, IDL, IRAF, DRMS)
... and the version?
... given information about the machine that did the processing?
... type of processor
... operating system
... and version
... machine name
(from the ESIP 'data management 101' talk by Bob Cook:
What does the data set describe?
Why was the data set created?
Who produced the data set?
Who prepared the metadata?
How was each parameter measured?