[fitsbits] Recommendations for long-term preservation of FITS files?

William Thompson William.T.Thompson at nasa.gov
Fri Jan 6 15:00:27 EST 2012


Here are my comments on Joe Hourcle's checklist for solar physics FITS headers.
I don't believe that Joe is trying to define a large new list of standard
keywords to be used by all projects, but is instead trying to define broad
goals which can be achieved in the manner that best fits the data in question.
I will therefore point out where standards can be used to address the
questions, versus where instrument-specific keywords should be expected.

In a number of places, Joe refers to certain keywords as being overloaded.
This is usually because some instrument teams failed to correctly follow the
FITS standards as published in the literature, and applied their own
interpretations.  However, we in the solar physics community should be evolving
towards full compliance with the FITS standard, and consistency with the rest
of the astronomy community, rather than diverging from these goals.
Therefore, below I will recommend the use of standard keywords in the
published literature, even when Joe describes these keywords as being
"overloaded".

Rather than addressing the document point-by-point, I've chosen to organize my
comments in terms of rough categories of data, even though Joe did not organize
his checklist in this fashion.  The original checklist is replicated at the
bottom of this email, for comparison.

Some of the non-standard keywords that Joe refers to come from the following
document.

Howard, R. and Thompson, W., 2002, "Proposed keywords for SOHO",
http://stereo.gsfc.nasa.gov/~thompson/soho_keywords.pdf.

Bill Thompson

		      Category 1, Annotative Information

This consists of the following items in Joe's checklist:

Have I ...
... identified the file format?
... identified the responsible party?
... provided for support of the file?

* My belief is that most of this information is intended to be provided in
   human-readable form rather than being computer readable.  Thus, I would
   expect most of this information to be addressed through COMMENT lines.  For
   example, the following lines identify the file as FITS, and give a reference
   to the FITS standard:

COMMENT   FITS (Flexible Image Transport System) format is defined in 'Astronomy
COMMENT   and Astrophysics', volume 376, page 359; bibcode: 2001A&A...376..359H
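
As an illustration only, COMMENT lines like the two above could be generated
from free text with a few lines of Python.  This is a minimal sketch using
only the standard library; the function name comment_cards and the choice to
start the text in column 11 (mirroring the example above) are my own
assumptions, not something required by the standard.

```python
import textwrap

CARD_LENGTH = 80          # every FITS header card is exactly 80 characters
COMMENT_TEXT_WIDTH = 70   # assumption: text occupies columns 11-80, as above


def comment_cards(text):
    """Wrap free text into a list of 80-character COMMENT cards."""
    if not text.isascii() or not text.isprintable():
        raise ValueError("FITS headers are restricted to printable ASCII")
    cards = []
    for line in textwrap.wrap(text, width=COMMENT_TEXT_WIDTH):
        # "COMMENT" plus three blanks fills columns 1-10.
        cards.append(("COMMENT   " + line).ljust(CARD_LENGTH))
    return cards


if __name__ == "__main__":
    note = ("FITS (Flexible Image Transport System) format is defined in "
            "'Astronomy and Astrophysics', volume 376, page 359; "
            "bibcode: 2001A&A...376..359H")
    for card in comment_cards(note):
        print(card)
```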

* Much of the requested information is for links or citations to documentation,
   such as the instrument description, keyword dictionaries, user guides, and
   the like.  The following text from the standard (Pence et al., 2010) about the
   REFERENC keyword would also be applicable to these citations:

	It is *recommended* that either the 19-digit bibliographic identifier
	used in the Astrophysics Data System bibliographic databases
	(http://adswww.harvard.edu/) or the Digital Object Identifier
	(http://doi.org) be included in the value string when available (e.g.,
	"1994A&AS..103..135A" or "doi:10.1006/jmbi.1998.2354").

   as well as the associated footnote

	This bibliographic convention (Schmitz et al. 1995) was initially
	developed for use within NED (NASA/IPAC Extragalactic Database) and
	SIMBAD (operated at CDS, Strasbourg, France).
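
A rough way to sanity-check such citation strings before writing them into a
header might look like the following.  This is only a shape check under my
own assumptions (19 characters, a leading four-digit year, no blanks for a
bibcode; a "doi:" prefix for a DOI, as in the quoted example).  It does not
confirm that the identifier actually resolves, and the function names are
hypothetical.

```python
def looks_like_bibcode(value):
    """Shape check only: 19 characters, leading 4-digit year, no blanks."""
    return (
        len(value) == 19
        and value[:4].isdigit()
        and not any(ch.isspace() for ch in value)
    )


def classify_reference(value):
    """Return 'doi', 'bibcode', or 'unknown' for a citation string."""
    if value.lower().startswith("doi:"):
        return "doi"
    if looks_like_bibcode(value):
        return "bibcode"
    return "unknown"


if __name__ == "__main__":
    for ref in ("1994A&AS..103..135A", "doi:10.1006/jmbi.1998.2354", "GSFC"):
        print(ref, "->", classify_reference(ref))
```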

* Note that the definition of the standard REFERENC keyword makes it clear that
   it is intended for science publications based on the particular data in the
   FITS file, and not for a general instrument paper.

* One standard keyword discussed in this section as "overloaded" is the ORIGIN
   keyword.  The non-standard SOHO keyword INSTITUT is also mentioned, because
   some teams have misused it for the purpose that ORIGIN was intended for.
   Even when the ORIGIN keyword has been used correctly, it usually just gives a
   short abbreviation such as "GSFC" or "NRL".  My recommendation is that the
   standard ORIGIN keyword *should* be used to give the name of the organization
   or institution responsible for creating the FITS file, with enough
   information to be unambiguous, e.g.

	ORIGIN  = 'NASA Goddard Space Flight Center (GSFC), Greenbelt, MD, USA'
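
To show that a full institution name still fits on a single header line, here
is a small sketch that formats a string-valued keyword as one 80-character
card.  The helper name string_card is hypothetical, and padding short values
to 8 characters is simply the customary fixed-format habit rather than
something I am claiming is required here.

```python
def string_card(keyword, value):
    """Format a string-valued keyword as one 80-character header card."""
    if len(keyword) > 8 or not keyword.isascii():
        raise ValueError("keyword names are limited to 8 ASCII characters")
    # Embedded single quotes are doubled; short values are blank-padded.
    quoted = "'" + value.replace("'", "''").ljust(8) + "'"
    card = keyword.upper().ljust(8) + "= " + quoted
    if len(card) > 80:
        raise ValueError("value is too long for a single card")
    return card.ljust(80)


if __name__ == "__main__":
    print(string_card(
        "ORIGIN",
        "NASA Goddard Space Flight Center (GSFC), Greenbelt, MD, USA",
    ))
```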

* Additional standard keywords addressing the goals of this section are AUTHOR
   and OBSERVER.  These are not generally applicable to solar data, but may be
   relevant in specific cases.

* Part of this general category of information includes the following:

		___ included important usage caveats in the file?
			___ given a warning if it was quicklook data?

   I would expect this level of information to be handled in a way particular to
   a given instrument, and not easily generalizable to a specific set of
   keywords for all cases.

			Category 2, Data Identification

This consists of the following items:

... identified the file?
... identified the observation?

* The FITS standard lists TELESCOP and INSTRUME as standard keywords which
   pertain to this category.  However, the definitions of these keywords are
   broad enough to support a number of different interpretations.  Howard and
   Thompson (2002) add the keyword DETECTOR, and the STEREO project added the
   keyword OBSRVTRY to distinguish between the two spacecraft (observatories)
   making up that mission.  The combination of these four keywords should be
   enough to cover all possible situations.

* Joe also lists the non-standard keywords CAMERA (evidently a synonym for
   TELESCOP, though one could also imagine it being a synonym for DETECTOR), and
   SOURCE whose exact definition is unclear to me.  Such keywords should be
   tolerated as instrument specific, but should not be encouraged.

* Another non-standard but common keyword relevant here is FILENAME.  Joe
   describes this as being "the originally assigned filename", but I find this
   wording problematic because "original" could suggest the name of an earlier
   file before processing was applied, which I don't think is what Joe intended.
   I suggest the following modified wording: "the unique name identifying this
   data file".

* Joe also lists several keywords relating to time.  Of these, the only
   official keyword is DATE-OBS.  The SOHO project adopted a convention using
   underscores instead of dashes (i.e. DATE_OBS instead of DATE-OBS) because
   that project adopted a Y2K-compliant format before the IAU/FWG.  In addition,
   in some places the DATE_... keywords are described as being corrected for the
   difference in light travel time relative to Earth, a concept which is really
   only valid for a spacecraft at the L1 Lagrange point.  The use of
   DATE_... keywords should be tolerated as SOHO mission specific, but should be
   discouraged in favor of the DATE-... formulation.  (Joe also lists the
   keyword T_OBS, but I don't know where that comes from.)  In particular,
   missions and instruments should start adopting the FITS Time WCS standard
   currently under development, which can be found at

	http://hea-www.cfa.harvard.edu/~arots/TimeWCS/

   As well as DATE-OBS, that paper also defines DATE-END and DATE-AVG.
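
As a sketch of how the three keywords relate, the following Python derives
them from a start time and an exposure duration.  I am assuming here that
DATE-OBS marks the start of the exposure, that the simple midpoint is an
acceptable DATE-AVG, and that times are UTC; the exact definitions should be
taken from the Time WCS paper.  The function names and the sample time are
made up for illustration.

```python
from datetime import datetime, timedelta


def fits_datetime(moment):
    """Format a naive (assumed UTC) datetime as YYYY-MM-DDThh:mm:ss.sss."""
    return moment.isoformat(timespec="milliseconds")


def observation_dates(start, exposure_seconds):
    """Return DATE-OBS, DATE-AVG and DATE-END strings for one exposure."""
    end = start + timedelta(seconds=exposure_seconds)
    midpoint = start + (end - start) / 2   # assumption: simple midpoint
    return {
        "DATE-OBS": fits_datetime(start),
        "DATE-AVG": fits_datetime(midpoint),
        "DATE-END": fits_datetime(end),
    }


if __name__ == "__main__":
    # Hypothetical 2.5 second exposure starting at midnight.
    for key, value in observation_dates(datetime(2011, 1, 1), 2.5).items():
        print(key, "=", value)
```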

* One of the items that Joe refers to is the data series.  This is a concept
   which seems to be very specific to the Solar Dynamics Observatory (SDO), and
   I don't see how this could be easily generalized.  For SDO, this appears to
   be related to the "usage caveats" discussed above.

* The FITS standard includes additional standard keywords, such as OBJECT,
   which may or may not be relevant.

		    Category 3, Data Processing Information

* The description of the processing that was applied to the data in the file is
   traditionally done through a series of HISTORY statements, though
   instrument-specific keywords may also be relevant.

		      Category 4, Coordinate Information

* This category is already well covered by the WCS papers, reprints of which
   are available from

	http://www.atnf.csiro.au/people/mcalabre/WCS/index.html,

   by the draft WCS time paper referenced earlier, and by my own papers on the
   adaptation of WCS to solar data:

	Thompson, W. T., 2006, "Coordinate systems for solar image data", A&A
	449, 791-803 (2006A&A...449..791T)

	Thompson, W. T., 2010, "Precision effects for solar image coordinates
	within the FITS world coordinate system", A&A 515, A59
	(2010A&A...515A..59T)

   Alternative methods using non-standard keywords such as XCEN and YCEN in place
   of the default keywords should very definitely be discouraged.

* Standard keywords for describing the position of a terrestrial observatory
   are listed in the WCS time paper (Rots et al., 2012), namely

	OBSGEO-X, OBSGEO-Y, OBSGEO-Z	ITRS Cartesian coordinates in meters

   or

	OBSGEO-B	Geodetic latitude in degrees
	OBSGEO-L	Geodetic longitude in degrees
	OBSGEO-H	Altitude in meters
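
The two forms are related by the usual geodetic-to-Cartesian conversion,
sketched below.  The ellipsoid constants are an assumption on my part (WGS84
values are used as defaults); the reference ellipsoid actually intended for
the OBSGEO-B/L/H keywords should be taken from the Time WCS paper and passed
in instead if it differs.  The function name is hypothetical.

```python
import math

WGS84_A = 6378137.0            # semi-major axis in meters (assumed default)
WGS84_INV_F = 298.257223563    # inverse flattening (assumed default)


def geodetic_to_cartesian(lat_deg, lon_deg, height_m,
                          a=WGS84_A, inv_f=WGS84_INV_F):
    """Convert geodetic latitude, longitude, altitude to Cartesian X, Y, Z.

    Inputs correspond to OBSGEO-B, OBSGEO-L, OBSGEO-H (degrees, degrees,
    meters); outputs correspond to OBSGEO-X, OBSGEO-Y, OBSGEO-Z in meters.
    """
    f = 1.0 / inv_f
    e2 = f * (2.0 - f)                       # first eccentricity squared
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Radius of curvature in the prime vertical.
    n = a / math.sqrt(1.0 - e2 * math.sin(lat) ** 2)
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - e2) + height_m) * math.sin(lat)
    return x, y, z
```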

* Thompson (2006) describes a number of keywords for describing the position of
   a spacecraft in a variety of geocentric and heliocentric Cartesian coordinate
   systems (Section 9.1), and in particular

	HGLT_OBS	Stonyhurst heliographic latitude in degrees
	HGLN_OBS	Stonyhurst heliographic longitude in degrees
	CRLT_OBS	Carrington heliographic latitude in degrees
	CRLN_OBS	Carrington heliographic longitude in degrees
	DSUN_OBS	Distance from sun center, in meters
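
For readers who want the observer position as a Cartesian vector, the
Stonyhurst values convert in the ordinary spherical-to-Cartesian way.  The
sketch below assumes the resulting axes are those of the Heliocentric Earth
Equatorial (HEEQ) system, with Z along the solar rotation axis and X toward
the zero of Stonyhurst longitude; the function name is hypothetical.

```python
import math


def stonyhurst_to_cartesian(hgln_deg, hglt_deg, dsun_m):
    """Convert HGLN_OBS, HGLT_OBS (degrees) and DSUN_OBS (meters) to X, Y, Z.

    Assumption: the output axes are HEEQ, i.e. Z along the solar north pole
    and X in the solar equatorial plane at Stonyhurst longitude zero.
    """
    lon = math.radians(hgln_deg)
    lat = math.radians(hglt_deg)
    x = dsun_m * math.cos(lat) * math.cos(lon)
    y = dsun_m * math.cos(lat) * math.sin(lon)
    z = dsun_m * math.sin(lat)
    return x, y, z
```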

* Pointing and other coordinate information within a FITS file has always been
   implemented through the keywords CRPIXj, CRVALj, CDELTj, CTYPEj, and CROTAj,
   where j is the axis number (Wells et al., 1981).  The WCS papers add the
   keywords CUNITj, and provide some additional clarification on how these
   keywords should be defined.
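
To make the roles of these keywords concrete, here is a minimal sketch of the
pixel-to-coordinate calculation for a two-dimensional image.  It is a linear
approximation that ignores the map projection, so it is only reasonable for
small fields of view; it assumes FITS-style 1-based pixel numbers and that
the rotation is carried in CROTA2.  The header is represented as a plain
Python dictionary, and the function name is hypothetical.

```python
import math


def pixel_to_world(i, j, header):
    """Linear pixel-to-coordinate transform from the classic keywords.

    i, j are 1-based pixel coordinates along axes 1 and 2.  The projection
    implied by CTYPEj is ignored (small-angle approximation).
    """
    rot = math.radians(header.get("CROTA2", 0.0))
    di = i - header["CRPIX1"]
    dj = j - header["CRPIX2"]
    x = (header["CRVAL1"]
         + header["CDELT1"] * math.cos(rot) * di
         - header["CDELT2"] * math.sin(rot) * dj)
    y = (header["CRVAL2"]
         + header["CDELT1"] * math.sin(rot) * di
         + header["CDELT2"] * math.cos(rot) * dj)
    return x, y
```

With this, the coordinates of the image center follow directly from the
standard keywords by evaluating the function at pixel ((NAXIS1+1)/2,
(NAXIS2+1)/2).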

* The keywords XCEN and YCEN can appear in the header, but these keywords
   should not be considered a replacement for the standard keywords.

* Joe also refers to "FOV", but I think this is just a reference to the concept
   of the field-of-view, and not to a specific keyword with that name.  He also
   mentions the size of the occulter.  Describing the field-of-view of an
   instrument can be either simple or very complex, and in my opinion is not
   really well suited to being encoded within a few generic keywords intended
   to be applied to all cases.

		      Category 5, Support and Provenance

* A number of the items in this category seem to repeat those in the previous
   categories.

* The FITS standard is explicitly hardware and software independent.  Thus,
   there's no good reason why information such as the type of machine, or the
   version of the operating system, should be required to be documented in the
   FITS header.  This distinguishes FITS from many other file formats, where it
   can be important to know what version of the software was used to create the
   file.

* Just in general, I recommend that the list of requirements be simplified to
   what is truly needed.

===============================================================================
Checklist for Solar Physics FITS Headers

Revision : 0.8.0, 2011 December 08

(early release; this needs a whole lot of work)


Have questions or suggestions?  Send them to :
	Joe Hourcle, joseph.a.hourcle at nasa.gov

Latest version available at:
	http://sdac.virtualsolar.org/fits_headers/fits_checklist.txt

--

Glossary: (as used in this document) :

	Observation : a reading from a sensor about its environment; a given observation could exist
	              in multiple processed forms, in multiple FITS files.
	Series      : a collection of files that come from the same instruments, that are processed in
	              similar ways.


---


Have I ...

... identified the file format?

	___ given an obvious message that it's a FITS file?
	
	___ given a reference to the FITS standard? (citation, bibcode, DOI)
	
		___ given a reference to any FITS extensions used?


... identified the responsible party?

	... identified the responsible organization?

		___ given the common abbreviation for the institution?

			(note -- ORIGIN is overloaded; need to check status of INSTITUT)

		___ given the full name for the institution?
	
		___ given other identifying information, particularly if there are other similarly named groups.
	
		___ provided a postal address?


	___ identified the PI or other responsible person?


... provided for support of the file?

	... told where to get the documentation on the file?
	
		___ citation to articles describing the instrument?
		
		___ citation or link to documentation on the data file?

			___ using a DOI or other persistent ID?

			(note : documentation checklist will be coming soon)

		___ included important usage caveats in the file?
		
			___ given a warning if it was quicklook data?
		
		___ provided basic usage information in the file?
		
		___ given acknowledgement text or citation to use?
		
			(note : citation standards are also being worked on)
		
	... told people how to get help or report problems?
	
		___ provided an e-mail address?

		___ provided a URL for the organization's website


... identified the file?

	___ identified the instrument that took the observation?
		(INSTRUM, SOURCE, DETECTOR ... TELESCOP/CAMERA may be overloaded)
		
		___ given instrument/source/etc names that are unique?
		
		___ given a citation to articles describing the instrument?
		
	... identified the (need a name ... data series? ... basically, stuff all of a similar processing from the same source)
	
		___ given an identifier to the (series) ?
			(eg, aia.lev1 vs. aia.lev1_nrt)
		
		___ given information about the "type" of data?
			(intensity, magnetogram, IQUV, etc.)
		
		
	... identified the file within the (series) ?
	
		___ given a unique ID that can be referenced if a researcher has questions?
		
			___ listed the originally assigned filename?  (FILENAME)

... identified the observation?

	___ given the time of the observation?
	
		___ using DATE_OBS or DATE-OBS ?
		
		___ using T_OBS ?

		___ given DATE_END or exposure?
	
	___ mentioned anything about the observing mode, if it varies by observation?
	
		___ mentioned the filter position?
		
		___ mentioned the polarizer position?
		
		___ mentioned everything else that may vary?

... described the data in the file?

	___ provided information about what type of processing was done in human-readable form?
		(eg, flat fielded, limb darkening, corrections for point spread)


... described the platform the instrument was on?

	___ specified the location of the observatory?
	
		... if a ground-based observatory:
		
			___ given the coordinates in Lat/Long on Earth?

		... if a spacecraft:
		
			___ given the coordinates using the World Coordinate System?
			
			(lat/long/altitude if earth-orbiting?)

	... if an imager, magnetograph, ... :
	
		___ specified the pointing of the instrument?
		
			___ in terms of CRPIX, XCEN, YCEN, CROTA, FOV?
			
			___ in Carrington Lat/Long ?

			___ using the World Coordinate System?

			... if ground based :

				___ in RA & DEC?

		... if a coronograph :
		
			___ specified the size of the occulter?



Support
... provided units for all fields?
... provided full names for abbreviations?
... provided labels for all fields?
... added checksums to ensure the file hasn't been corrupted?

Provenance

... provided a unique ID to the source observation (so that we can identify multiple processed forms of the same observation)
... provided information about any calibration used?
   ... version number or date of calibration
... provided information about processing applied?
   ... text description of type of processing (flat field, point spread applied, limb darkening, etc.)
   ... provided names of the software used
       ... and their version number or last modified date
   ... mentioned any input variables used in the processing?
   ... mentioned any other input files?
      ... and their date / version number
... provided the platform used (eg, IDL, IRAF, DRMS)
   ... and the version?
... given information about the machine that did the processing?
   ... type of processor
   ... operating system
      ... and version
   ... machine name



(from the ESIP 'data management 101' talk by Bob Cook:
	What does the data set describe?
	Why was the data set created?
	Who produced the data set?
	Who prepared the metadata?
	How was each parameter measured?




