[evla-sw-discuss] keeping track of data, including project hierarchy, etc.
Bryan Butler
bbutler at nrao.edu
Fri Mar 12 15:10:45 EST 2010
a smaller group of us has been discussing this at great length, and it's
time to get it out to the whole group. i'm including quite a bit of
background information here which may be old news to most of you, but i
want to be clear on why some of these things are necessary and not just
conveniences. comments are welcome, but note that we are in the process
of implementing much of this (and there is much of course that has
already been implemented), so comments need to be timely.
keeping track of our data is an important part of what we do, obviously.
one of the main goals of the "e2e" concept is to automate this, so
that, for instance, an astronomer never has to type in "AB1234" in a
search field to find their data. we just know, based on their
authentication, what data they have available. searching is also
important, to find other folks' data of course, so we have to keep track
of things that allow folks to search sensibly as well.
we track observations in a number of databases, which are often
embodiments of what we call "data models". we need to be sure that all
information for a given project is linked together in the right way in
these databases. for the sake of this discussion, there are the following:
. the proposal database - contains the proposal data including
authors, sources, resources and "sessions" (which are akin to
Scheduling Blocks; see next item). this is a MySQL database
currently maintained by OpenSky (we are in the process right
now of getting it mirrored here). when a proposal is submitted
(via the Proposal Submission Tool, or PST), it is given a "project
code", of the form VLA/YYT-nnn, where YY is the last two digits of
the current year, T is the trimester in that year, and nnn is a
number starting at 001 and incrementing for each proposal submitted.
so if you have access to the database, and have the project code, you
can get at all of the information in the database for that
proposal. for now, we also have a "legacy code" which is assigned at
submission time, of the form AB1234. those are intended to go
away, and probably soon.
Entries Created By: PST.
Entries Used By: PST; Project Builder Tool (PBT - creates entries in
the project database, see below); query servlet that returns
proposals for a given user, or users for a given proposal.
. the user database - contains information on our users, like name,
institution, contact information, etc. note right now that the
user database is contained within the proposal database, and
maintained by OpenSky. we are currently in the process of trying
to get them separated. each user is identified by what is called a
globalID.
Entries Created By: Portal (my.nrao.edu - an OpenSky tool); stub
entries can be created by the PST.
Entries Used By: most tools, at least indirectly, because folks
have to be authenticated against it.
. the project database - contains projects. each project is a
collection of what we call "Program Blocks" (PBs). PBs are
meant to distinguish between different telescopes or array
configurations. a PB is made up of a collection of Scheduling
Blocks (SBs). each SB is an atomic unit of observing, and is
made up of Scans. a Scan has one or more Subscans, depending
on what it is to do (called the Intent of the scan). a Subscan
is made up of a Source, a Resource (hardware setup), some timing
information, and some extra information about what the telescope
is to do (and some extra Subscan Intent information). each
project has a unique identifier in the database, as does each PB
and SB. in addition, SBs have a link to the parent PB, and PBs
have a link to the parent project, so you can find any of the
parents or children in any direction given one of the identifiers.
a project also has a link to the corresponding proposal which it
was generated from (the project code).
Entries Created By: PBT (for official approved science projects);
OPT (for test projects).
Entries Used By: Observation Scheduling Tool (OST).
. the execution block (EB) database - contains EBs. the EB is
meant to be the equivalent of the SB, but is what actually
occurred on the telescope vs. what was meant to occur. note
that the actual observing script is stored here. each EB has a
unique identifier and a pointer to the SB.
Entries Created By: OST.
Entries Used By: none, currently, but eventually we need a tool
that allows one to get at these beasts (or include it as part
of science archive access - see below).
. what might be called the "science archive database" - this holds
the two parts of the science output of the array: the binary
visibility data (which are files conforming to the Binary Data
Format [BDF] definition), and the metadata (which are files
conforming to the Science Data Model [SDM] definition). much of
what we're trying to clarify is what pointers (identifiers) go
in the SDM. the SDM of course has a pointer to the appropriate
BDF.
Entries Created By: BDF - Correlator Back End (CBE); SDM - Metadata
Capture and Format (MCAF).
Entries Used By: SDM Cataloger (SDMC); NGAS; Archive Access Tool
(AAT); science data filler into CASA (asdm2ms). there may be
others.
. what might be called the "science archive search database" -
which holds the elements of the SDM that are interesting to
astronomers and which they might like to search for data with.
Entries Created By: SDMC.
Entries Used By: AAT.
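since the project code is the key that ties a proposal to everything downstream, here is a minimal sketch of parsing and formatting the VLA/YYT-nnn form described in the proposal database entry above. the names (ProjectCode, parse_project_code, format_project_code) are made up for illustration and are not part of the PST. it assumes T is a single letter or digit and nnn is exactly three digits; neither is spelled out above.

```python
import re
from typing import NamedTuple


class ProjectCode(NamedTuple):
    """Pieces of a project code of the form VLA/YYT-nnn (hypothetical type)."""
    year: str        # YY, last two digits of the year
    trimester: str   # T, assumed here to be a single letter or digit
    number: int      # nnn, starts at 001 and increments per proposal


# Assumption: exactly two year digits, one trimester character, three digits.
_CODE_RE = re.compile(r"^VLA/(\d{2})([0-9A-Za-z])-(\d{3})$")


def parse_project_code(code: str) -> ProjectCode:
    """Split a project code into its parts, or raise ValueError."""
    match = _CODE_RE.match(code)
    if match is None:
        raise ValueError(f"not a VLA/YYT-nnn project code: {code!r}")
    year, trimester, number = match.groups()
    return ProjectCode(year, trimester, int(number))


def format_project_code(parts: ProjectCode) -> str:
    """Rebuild the string form, zero-padding the number to three digits."""
    return f"VLA/{parts.year}{parts.trimester}-{parts.number:03d}"
```

a legacy code like AB1234 does not match this pattern, so parse_project_code raises ValueError for it rather than guessing.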
so, we have done a pretty good job of defining the links between things
in the "pre-observing" elements of the system, but not so well in the
"post-observing". for post-observing, we're mostly concerned with what
goes into the SDM.
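to make the pre-observing links concrete, here is a rough sketch of the project -> PB -> SB linkage described for the project database, with parent links so you can walk in either direction from any identifier. this is an in-memory stand-in only; the class and field names are hypothetical and do not reflect the real schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Project:
    project_id: str
    project_code: str                       # link back to the proposal
    pb_ids: List[str] = field(default_factory=list)


@dataclass
class ProgramBlock:
    pb_id: str
    project_id: str                         # link to the parent project
    sb_ids: List[str] = field(default_factory=list)


@dataclass
class SchedulingBlock:
    sb_id: str
    pb_id: str                              # link to the parent PB


class ProjectDatabase:
    """Toy in-memory stand-in for the project database (hypothetical)."""

    def __init__(self) -> None:
        self.projects: Dict[str, Project] = {}
        self.pbs: Dict[str, ProgramBlock] = {}
        self.sbs: Dict[str, SchedulingBlock] = {}

    def add_project(self, project_id: str, project_code: str) -> None:
        self.projects[project_id] = Project(project_id, project_code)

    def add_pb(self, pb_id: str, project_id: str) -> None:
        self.pbs[pb_id] = ProgramBlock(pb_id, project_id)
        self.projects[project_id].pb_ids.append(pb_id)

    def add_sb(self, sb_id: str, pb_id: str) -> None:
        self.sbs[sb_id] = SchedulingBlock(sb_id, pb_id)
        self.pbs[pb_id].sb_ids.append(sb_id)

    def project_for_sb(self, sb_id: str) -> Project:
        """Walk upward: SB -> parent PB -> parent project."""
        pb = self.pbs[self.sbs[sb_id].pb_id]
        return self.projects[pb.project_id]

    def sbs_for_project(self, project_id: str) -> List[str]:
        """Walk downward: project -> PBs -> SBs."""
        project = self.projects[project_id]
        return [sb_id
                for pb_id in project.pb_ids
                for sb_id in self.pbs[pb_id].sb_ids]
```

given only an SB identifier, project_for_sb gets you to the project and from there to the project code, which is the kind of traversal the post-observing side needs to support too.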
please note that the SDM is jointly defined between us and ALMA. also,
the SDM is still what i would consider a "work in progress".
unfortunately, however, because it is jointly shared with ALMA, we are
not at liberty to make whatever changes we wish - we have to negotiate
them. sometimes that is trivial. sometimes not so much.
note also that the SDM is organized as a bunch of tables, which describe
various datasets associated with that SDM. a single SDM can contain
datasets from various telescopes, from a single telescope taken at
various times, etc. it's very flexible (you might argue _too_ flexible
and i might agree, but we are beyond discussing much of that at this point).
so in the SDM for each "dataset" (which is the result of the execution
of a single Scheduling Block) we have the following things that help
identify the larger scale "structure" into which that dataset fits:
   Table       Name            Type
   ---------   -------------   ---------
   Main        execBlockID     Tag
   ExecBlock   execBlockID*    Tag
               execBlockNum    int
               execBlockUID    EntityRef
               projectID       EntityRef
               observerName    string
               observingLog    string
   SBSummary   sBSummaryId*    Tag
               sbSummaryUID    EntityRef
               projectUID      EntityRef
               sbType          SBType
i believe the projectID in ExecBlock and projectUID in SBSummary are the
same thing (can we please get the names to be consistent? it should be
the UID version i think.)
as an aside, note that there is no direct place in the SDM to store a
project code. the project code will have to be retrieved by following
the link from the SDM to the project database (the projectID).
similarly a list of all observers associated with a project is not
stored directly, but rather has to be retrieved via
SDM->projectID->proposal database. there are other examples of this,
but we should get used to the idea of following multiple pointers
through to final data.
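as an illustration of following multiple pointers, here is a small sketch of the SDM -> project database -> proposal database chain. the databases are plain dictionaries here, and the key names ("project_code", "authors") are assumptions made for the example, not the real column names.

```python
from typing import Dict, List

# Hypothetical stand-ins: each "database" is just a dict keyed by identifier.
ProjectDB = Dict[str, Dict[str, str]]
ProposalDB = Dict[str, Dict[str, List[str]]]


def project_code_for(exec_block_row: Dict[str, str],
                     project_db: ProjectDB) -> str:
    """SDM ExecBlock row -> projectID -> project code."""
    project = project_db[exec_block_row["projectID"]]
    return project["project_code"]


def observers_for(exec_block_row: Dict[str, str],
                  project_db: ProjectDB,
                  proposal_db: ProposalDB) -> List[str]:
    """SDM ExecBlock row -> projectID -> project code -> proposal authors."""
    code = project_code_for(exec_block_row, project_db)
    return proposal_db[code]["authors"]
```

neither lookup reads anything from the SDM beyond the projectID; everything else comes from the databases it points into.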
so we need to figure out what to do with: execBlockUID, projectUID,
observerName, observingLog, sbSummaryUID, and sbType.
execBlockUID
we will use a UID which points to an exec block (EB) in our EB
database like:
entityId="uid://evla/ebdb/X172200"
("ebdb" -> EB database), with Xnnnnnn the unique EB database
identifier. we will set this via an intent in the script, of the
form: ExecBlockID="Xnnnnnn". we really only need that intent on the
first scan, but MCAF can just ignore it on subsequent scans that
use that same intent.
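a minimal sketch of building and taking apart UIDs of this form follows. the helper names (make_uid, parse_uid) are invented for illustration. it assumes the database part is lowercase letters only and that the identifier is "X" followed by letters or digits; the exact identifier alphabet is not specified here.

```python
import re
from typing import Tuple

# Assumption: uid://evla/<lowercase db name>/X<letters or digits>
_UID_RE = re.compile(r"^uid://evla/([a-z]+)/(X[0-9A-Za-z]+)$")
_IDENT_RE = re.compile(r"^X[0-9A-Za-z]+$")


def make_uid(database: str, identifier: str) -> str:
    """Build a UID string such as uid://evla/ebdb/X172200."""
    if not database.isalpha() or not database.islower():
        raise ValueError(f"bad database name: {database!r}")
    if _IDENT_RE.match(identifier) is None:
        raise ValueError(f"bad identifier: {identifier!r}")
    return f"uid://evla/{database}/{identifier}"


def parse_uid(uid: str) -> Tuple[str, str]:
    """Return (database, identifier) from a UID, or raise ValueError."""
    match = _UID_RE.match(uid)
    if match is None:
        raise ValueError(f"not an evla UID: {uid!r}")
    return match.group(1), match.group(2)
```

for example, make_uid("ebdb", "X172200") gives the entityId value shown above, and parse_uid tells a tool which database to go to with which identifier.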
projectUID (and projectID, if we can't get ALMA to make them consistent)
we will use a UID which points to a project in our project database
like:
entityId="uid://evla/pdb/X172200"
("pdb" -> project database), with Xnnnnnn the unique database
identifier. we will set this via an intent in the script, of the
form: ProjectID="Xnnnnnn". we really only need that intent on the
first scan, but MCAF can just ignore it on subsequent scans that
use that same intent.
observerName
MCAF will get this via an ObserverName intent. we will stuff the
PI or tester's name in this (or something benign for operations
scripts).
observingLog
i would like to lobby ALMA to get this changed to an EntityRef, and
then have our logs stored separately from the SDM. we could use
something like:
entityId="uid://evla/obslog/X172200"
we should really start thinking about putting our logs in a proper
database, even if it's just storing them as full PDF files (as a
blob) with just an ID. if it is too difficult to get ALMA to change
this, i think we should just stick the above entityId string into
the string value for this, and work it that way.
sbSummaryUID
we will use a UID which points to an SB in our project database
like:
entityId="uid://evla/pdbsb/X172200"
("pdbsb" -> project database SB), with Xnnnnnn the unique database
identifier. we will set this via an intent in the script, of the
form: SBID="Xnnnnnn". we really only need that intent on the
first scan, but MCAF can just ignore it on subsequent scans that
use that same intent.
sbType
this is an enumeration, with the following possibilities:
SBType.OBSERVATORY
SBType.OBSERVER
SBType.EXPERT
while i dislike those particular enumerations because i don't think
they describe the categories very well, i can live with them. i
think they should imply:
SBType.OBSERVATORY -> operations scripts
SBType.OBSERVER -> normal science observations
SBType.EXPERT -> test observations
we will set this via an intent in the script, of the form:
SBTYPE="enumValue". only the ones that are official projects in the
project database get an SBTYPE="OBSERVER". ones that folks cobble up
in the OPT to use as tests get SBTYPE="EXPERT". ones that we make
for operations will get SBTYPE="OBSERVATORY". we really only need
that intent on the first scan, but MCAF can just ignore it on
subsequent scans that use that same intent.
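to pull the pieces together, here is a rough sketch of the "first scan wins" handling of these intents and the mapping onto the SDM fields discussed above. this is not MCAF code: how MCAF actually receives intents is not covered here, so the sketch simply assumes each scan arrives as a list of Key="value" strings, and the function names are made up.

```python
from enum import Enum
from typing import Dict, List


class SBType(Enum):
    OBSERVATORY = "OBSERVATORY"   # operations scripts
    OBSERVER = "OBSERVER"         # normal science observations
    EXPERT = "EXPERT"             # test observations


# Intent names as given in the text above.
IDENTIFYING_INTENTS = ("ExecBlockID", "ProjectID", "SBID",
                       "ObserverName", "SBTYPE")


def collect_identifiers(scans: List[List[str]]) -> Dict[str, str]:
    """Keep the first value seen for each identifying intent.

    Later scans that repeat an intent are ignored, whatever value they carry.
    """
    found: Dict[str, str] = {}
    for intents in scans:
        for intent in intents:
            key, sep, value = intent.partition("=")
            if not sep or key not in IDENTIFYING_INTENTS:
                continue  # not one of the identifying intents
            found.setdefault(key, value.strip('"'))
    return found


def to_sdm_fields(found: Dict[str, str]) -> Dict[str, object]:
    """Map collected intents onto the SDM identifiers discussed above."""
    return {
        "execBlockUID": f"uid://evla/ebdb/{found['ExecBlockID']}",
        "projectUID": f"uid://evla/pdb/{found['ProjectID']}",
        "sbSummaryUID": f"uid://evla/pdbsb/{found['SBID']}",
        "observerName": found["ObserverName"],
        "sbType": SBType(found["SBTYPE"]),  # ValueError if not in the enum
    }
```

using setdefault means a repeated intent on a later scan never overwrites the value from the first scan, and an SBTYPE outside the three enumerated values fails loudly instead of ending up in the SDM.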