[evla-sw-discuss] keeping track of data, including project hierarchy, etc.

Fri Mar 12 15:10:45 EST 2010

a smaller group of us has been discussing this at great length, and it's 
time to get it out to the whole group.  i'm including quite a bit of 
background information here which may be old news to most of you, but i 
want to be clear on why some of these things are necessary and not just 
conveniences.  comments are welcome, but note that we are in the process 
of implementing much of this (and there is much of course that has 
already been implemented), so comments need to be timely.

keeping track of our data is an important part of what we do, obviously.
one of the main goals of the "e2e" concept is to automate this, so 
that, for instance, an astronomer never has to type in "AB1234" in a 
search field to find their data.  we just know, based on their 
authentication, what data they have available.  searching is also 
important, to find other folks' data of course, so we have to keep track 
of things that allow folks to search sensibly as well.

we track observations in a number of databases, which are often 
embodiments of what we call "data models".  we need to be sure that all 
information for a given project is linked together in the right way in 
these databases.  for the sake of this discussion, there are the following:

  . the proposal database - contains the proposal data including
    authors, sources, resources and "sessions" (which are akin to
    Scheduling Blocks; see next item).  this is a MySQL database
    currently maintained by OpenSky (we are in the process right
    now of getting it mirrored here).  when a proposal is submitted
    (via the Proposal Submission Tool, or PST), it is given a "project
    code", of the form VLA/YYT-nnn, where YY is the last two digits of
    the current year, T is the trimester in that year, and nnn is a
    number starting at 001 and incrementing for each proposal submitted.
    so if you have access to the database, and have the project code, you
    can get at all of the information in the database for that
    proposal.  for now, we also have a "legacy code" which is assigned at
    submission time, of a form like AB1234.  those are intended to go
    away, and probably soon.
    Entries Created By: PST.
    Entries Used By: PST; Project Builder Tool (PBT - creates entries in
       the project database, see below); query servlet that returns
       proposals for a given user, or users for a given proposal.

  . the user database - contains information on our users, like name,
    institution, contact information, etc.  note that right now the
    user database is contained within the proposal database, and
    maintained by OpenSky.  we are currently in the process of trying
    to get them separated.  each user is identified by what is called a
    globalID.
    Entries Created By: Portal (my.nrao.edu - an OpenSky tool); stub
       entries can be created by the PST.
    Entries Used By: most tools, at least indirectly, because folks
       have to be authenticated against it.

  . the project database - contains projects.  each project is a
    collection of what we call "Program Blocks" (PBs).  PBs are
    meant to distinguish between different telescopes or array
    configurations.  a PB is made up of a collection of Scheduling
    Blocks (SBs).  each SB is an atomic unit of observing, and is
    made up of Scans.  a Scan has one or more Subscans, depending
    on what it is to do (called the Intent of the scan).  a Subscan
    is made up of a Source, a Resource (hardware setup), some timing
    information, and some extra information about what the telescope
    is to do (and some extra Subscan Intent information).  each
    project has a unique identifier in the database, as does each PB
    and SB.  in addition, SBs have a link to the parent PB, and PBs
    have a link to the parent project, so you can find any of the
    parents or children in any direction given one of the identifiers.
    a project also has a link to the corresponding proposal which it
    was generated from (the project code).
    Entries Created By: PBT (for official approved science projects);
       OPT (for test projects).
    Entries Used By: Observation Scheduling Tool (OST).

  . the execution block (EB) database - contains EBs.  the EB is
    meant to be the equivalent of the SB, but is what actually
    occurred on the telescope vs. what was meant to occur.  note
    that the actual observing script is stored here.  each EB has a
    unique identifier and a pointer to the SB.
    Entries Created By: OST.
    Entries Used By: none, currently, but eventually we need a tool
       that allows one to get at these beasts (or include it as part
       of science archive access - see below).

  . what might be called the "science archive database" - this holds
    the two parts of the science output of the array: the binary
    visibility data (which are files conforming to the Binary Data
    Format [BDF] definition), and the metadata (which are files
    conforming to the Science Data Model [SDM] definition).  much of
    what we're trying to clarify is what pointers (identifiers) go
    in the SDM.  the SDM of course has a pointer to the appropriate
    BDF.
    Entries Created By: BDF - Correlator Back End (CBE); SDM - Metadata
       Capture and Format (MCAF).
    Entries Used By: SDM Cataloger (SDMC); NGAS; Archive Access Tool
       (AAT); science data filler into CASA (asdm2ms).  there may be
       others.

  . what might be called the "science archive search database" -
    which holds the elements of the SDM that are interesting to
    astronomers and which they might like to search for data with.
    Entries Created By: SDMC.
    Entries Used By: AAT.
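
as a rough illustration of the project database hierarchy described above (project -> PBs -> SBs, with each child linking back to its parent), here is a minimal python sketch.  the class names, field names and helper functions are all made up for this example and are not the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Project:
    project_id: str      # unique identifier in the project database
    project_code: str    # link back to the proposal the project came from
    program_blocks: List["ProgramBlock"] = field(default_factory=list)


@dataclass
class ProgramBlock:
    pb_id: str
    # parent link; left out of repr/compare to avoid circular recursion
    project: Optional[Project] = field(default=None, repr=False, compare=False)
    scheduling_blocks: List["SchedulingBlock"] = field(default_factory=list)


@dataclass
class SchedulingBlock:
    sb_id: str
    # parent link; left out of repr/compare to avoid circular recursion
    program_block: Optional[ProgramBlock] = field(
        default=None, repr=False, compare=False
    )


def add_pb(project: Project, pb_id: str) -> ProgramBlock:
    """Create a PB, link it to its parent project, and register it there."""
    pb = ProgramBlock(pb_id=pb_id, project=project)
    project.program_blocks.append(pb)
    return pb


def add_sb(pb: ProgramBlock, sb_id: str) -> SchedulingBlock:
    """Create an SB, link it to its parent PB, and register it there."""
    sb = SchedulingBlock(sb_id=sb_id, program_block=pb)
    pb.scheduling_blocks.append(sb)
    return sb


def project_of(sb: SchedulingBlock) -> Project:
    """Walk upward from an SB: SB -> parent PB -> parent project."""
    return sb.program_block.project
```

given any one identifier, the parent links let you walk up and the child lists let you walk down, which is the "any direction" property mentioned in the project database item.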

so, we have done a pretty good job of defining the links between things 
in the "pre-observing" elements of the system, but not so well in the 
"post-observing".  for post-observing, we're mostly concerned with what 
goes into the SDM.

please note that the SDM is jointly defined between us and ALMA.  also, 
the SDM is still what i would consider a "work in progress". 
unfortunately, however, because it is jointly shared with ALMA, we are 
not at liberty to make whatever changes we wish - we have to negotiate 
them.  sometimes that is trivial.  sometimes not so much.

note also that the SDM is organized as a bunch of tables, which describe 
various datasets associated with that SDM.  a single SDM can contain 
datasets from various telescopes, from a single telescope taken at 
various times, etc.  it's very flexible (you might argue _too_ flexible 
and i might agree, but we are beyond discussing much of that at this point).

so in the SDM for each "dataset" (which is the result of the execution 
of a single Scheduling Block) we have the following things that help 
identify the larger scale "structure" into which that dataset fits:

Table      Name          Type
Main       execBlockID   Tag
ExecBlock  execBlockID*  Tag
           execBlockNum  int
           execBlockUID  EntityRef
           projectID     EntityRef
           observerName  string
           observingLog  string
SBSummary  sBSummaryId*  Tag
           sbSummaryUID  EntityRef
           projectUID    EntityRef
           sbType        SBType

i believe the projectID in ExecBlock and projectUID in SBSummary are the 
same thing (can we please get the names to be consistent?  it should be 
the UID version i think.)

as an aside, note that there is no direct place in the SDM to store a 
project code.  the project code will have to be retrieved by following 
the link from the SDM to the project database (the projectID). 
similarly a list of all observers associated with a project is not 
stored directly, but rather has to be retrieved via 
SDM->projectID->project database->proposal database.  there are other 
examples of this, but we should get used to the idea of following 
multiple pointers through to final data.
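
to make the pointer-following concrete, here is a small python sketch.  the two dictionaries are stand-ins for the project and proposal databases, and every key, value, name and function in it is invented for illustration.  it also assumes that the trimester field T in the project code is a single alphanumeric character, which the description of the code format does not actually pin down.

```python
import re

# Stand-in "databases"; all contents are invented for this example.
PROJECT_DB = {"X172200": {"project_code": "VLA/101-001"}}
PROPOSAL_DB = {"VLA/101-001": {"authors": ["A. Observer", "B. Coauthor"]}}

# VLA/YYT-nnn: two-digit year, one trimester character (assumed to be a
# single alphanumeric), and a three-digit sequence number.
PROJECT_CODE = re.compile(r"VLA/(\d{2})([0-9A-Za-z])-(\d{3})")


def database_id(entity_id):
    """Take the trailing identifier from a UID like uid://evla/pdb/X172200."""
    return entity_id.rsplit("/", 1)[-1]


def project_code_for(sdm_project_uid):
    """First hop: SDM projectID -> project database -> project code."""
    return PROJECT_DB[database_id(sdm_project_uid)]["project_code"]


def observers_for(sdm_project_uid):
    """Second hop: project code -> proposal database -> author list."""
    return PROPOSAL_DB[project_code_for(sdm_project_uid)]["authors"]


def split_project_code(code):
    """Split VLA/YYT-nnn into (year digits, trimester, sequence number)."""
    match = PROJECT_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a project code: {code!r}")
    yy, trimester, nnn = match.groups()
    return yy, trimester, int(nnn)
```

neither the project code nor the observer list lives in the SDM itself here; each one is reached by following a pointer into the next database.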

so we need to figure out what to do with: execBlockUID, projectUID, 
observerName, observingLog, sbSummaryUID, and sbType.

execBlockUID

    we will use a UID which points to an exec block (EB) in our EB
    database like:

       entityId="uid://evla/ebdb/X172200"

    ("ebdb" -> EB database), with Xnnnnnn the unique EB database
    identifier.  we will set this via an intent in the script, of the
    form: ExecBlockID="Xnnnnnn".  we really only need that intent on the
    first scan, but MCAF can just ignore it on subsequent scans that
    repeat it.
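
a minimal python sketch of building and taking apart a UID of this shape follows.  the function names are hypothetical, and the pattern assumes the identifier is an "X" followed by letters or digits, which is a guess based on the single example above rather than a defined format.

```python
import re

# uid://evla/<database>/<identifier>; the identifier shape is an assumption.
UID_PATTERN = re.compile(r"uid://evla/([a-z]+)/(X[0-9A-Za-z]+)")


def make_uid(database, identifier):
    """Build a UID string, e.g. make_uid("ebdb", "X172200")."""
    uid = f"uid://evla/{database}/{identifier}"
    if UID_PATTERN.fullmatch(uid) is None:
        raise ValueError(f"cannot build a valid UID from {database!r}, {identifier!r}")
    return uid


def parse_uid(uid):
    """Return (database, identifier) for a UID string."""
    match = UID_PATTERN.fullmatch(uid)
    if match is None:
        raise ValueError(f"not an evla UID: {uid!r}")
    return match.group(1), match.group(2)
```

keeping the database name as a parameter means the same two helpers would work for any database that uses this UID layout.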

projectUID (and projectID, if we can't get ALMA to make them consistent)

    we will use a UID which points to a project in our project database
    like:

       entityId="uid://evla/pdb/X172200"

    ("pdb" -> project database), with Xnnnnnn the unique database
    identifier.  we will set this via an intent in the script, of the
    form: ProjectID="Xnnnnnn".  we really only need that intent on the
    first scan, but MCAF can just ignore it on subsequent scans that
    repeat it.

observerName

    MCAF will get this via an ObserverName intent.  we will stuff the
    PI or tester's name in this (or something benign for operations
    scripts).

observingLog

    i would like to lobby ALMA to get this changed to an EntityRef, and
    then have our logs stored separately from the SDM.  we could use
    something like:

       entityId="uid://evla/obslog/X172200"

    we should really start thinking about putting our logs in a proper
    database, even if it's just storing them as full PDF files (as a
    blob) with just an ID.  if it is too difficult to get ALMA to change
    this, i think we should just stick the above entityId string into
    the string value for this, and work it that way.

sbSummaryUID

    we will use a UID which points to an SB in our project database
    like:

       entityId="uid://evla/pdbsb/X172200"

    ("pdbsb" -> project database SB), with Xnnnnnn the unique database
    identifier.  we will set this via an intent in the script, of the
    form: SBID="Xnnnnnn".  we really only need that intent on the
    first scan, but MCAF can just ignore it on subsequent scans that
    repeat it.
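
here is a python sketch of the "set it on the first scan, ignore repeats" handling for the three ID intents (ExecBlockID, ProjectID, SBID).  this is not MCAF code.  the function name, the representation of a script as a list of per-scan intent strings, and the choice to silently keep the first value seen are all assumptions made for the example.

```python
import re

# Matches intents of the form Name="value".
INTENT = re.compile(r'(\w+)="([^"]*)"')

# Intent name -> database name used in the UID, as given in the text above.
UID_DATABASES = {"ExecBlockID": "ebdb", "ProjectID": "pdb", "SBID": "pdbsb"}


def collect_uids(scans):
    """Return {intent name: UID}, keeping the first value seen per intent.

    `scans` is a list of scans in order, each a list of intent strings.
    """
    uids = {}
    for intents in scans:
        for intent in intents:
            match = INTENT.fullmatch(intent)
            if match is None:
                continue  # not a Name="value" intent; skip it
            name, value = match.groups()
            if name in UID_DATABASES and name not in uids:
                uids[name] = f"uid://evla/{UID_DATABASES[name]}/{value}"
            # a repeat of an intent that is already set is simply ignored
    return uids
```

for example, a script whose first scan carries all three intents and whose later scans repeat ExecBlockID would still produce exactly one UID per intent.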

sbType

    this is an enumeration, with the following possibilities:

       SBType.OBSERVATORY
       SBType.OBSERVER
       SBType.EXPERT

    while i dislike those particular enumerations because i don't think
    they describe the categories very well, i can live with them.  i
    think they should imply:

       SBType.OBSERVATORY -> operations scripts
       SBType.OBSERVER -> normal science observations
       SBType.EXPERT -> test observations

    we will set this via an intent in the script, of the form:
    SBTYPE="enumValue".  only SBs that belong to official projects in the
    project database get an SBTYPE="OBSERVER".  ones that folks cobble up
    in the OPT to use as tests get SBTYPE="EXPERT".  ones that we make
    for operations will get SBTYPE="OBSERVATORY".  we really only need
    that intent on the first scan, but MCAF can just ignore it on
    subsequent scans that repeat it.
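
the mapping above could be sketched in python like this.  the enumeration values come from the list above, but the "origin" labels and the function name are hypothetical, chosen only to illustrate the rule.

```python
from enum import Enum


class SBType(Enum):
    OBSERVATORY = "OBSERVATORY"  # operations scripts
    OBSERVER = "OBSERVER"        # normal science observations
    EXPERT = "EXPERT"            # test observations


# Hypothetical labels for where a script came from.
ORIGIN_TO_SBTYPE = {
    "operations": SBType.OBSERVATORY,
    "project_database": SBType.OBSERVER,
    "opt_test": SBType.EXPERT,
}


def sbtype_intent(origin):
    """Return the SBTYPE intent string for a script of the given origin."""
    return f'SBTYPE="{ORIGIN_TO_SBTYPE[origin].value}"'
```

an unknown origin raises a KeyError rather than falling back to a default type, so a script can never end up labelled OBSERVER by accident.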