[evla-sw-discuss] keeping track of data, including project hierarchy, etc.

Fri Mar 12 15:35:34 EST 2010

some implications of this:

  - all observations should go through the scheduler.  *all*.  either
    that, or anything else which actually queues scripts to the executor
    needs to be taught how to make EB database entries, and to also be
    taught about SB database IDs.  we might think about making exceptions
    for directly submitted scripts - in which case nothing is guaranteed
    to make it to the archive, but i'm leery of that because we then have
    no tracers to data taken that way at all (at least potentially).
    basically, folks who are making scripts are going to have to get used
    to including the proper things in those scripts to guarantee that
    they get stored in the archive properly.  this used to be as simple
    as making something like a /.BBTST first card in the OBSERVE file.
    unfortunately we are well beyond those simple days now.  operators
    will have to be trained not to use AOI to submit scripts, but rather
    the OST.  OST will have to be made to support the things that AOI
    does (like operations scripts, antenna selection, etc.).

  - we should probably set up a project which contains all of the SBs for
    testers.  this should be done formally through the PST.  all testers
    should be given access to it.  not sure what the best way to do that
    is.  we could explicitly list them, or would it be possible to make a
    "testers group" and somehow give them access to that project?  each
    tester can make their own SBs within the project.  we might even
    split testers by giving each of them their own PB.  an alternative
    would be to create a project for each tester, but that seems a bit
    cumbersome.  comments and thoughts are welcome here.

  - similarly we should set up a project to hold regular operations
    scripts.  baselines, pointing, delays, etc.  what i'd like to do for
    these is add to the OPT the ability to bypass actually setting up
    all of the details of an SB, and instead have a toggle switch that
    indicates that this SB is really just a script, plus a pointer
    directly to that script.  each of those operations scripts will have
    to be modified so that they set the intents properly, so that when
    the OST schedules them, the right things get put into the EB DB, and
    the SDM.  note that this ability is also useful for the testers.

  - adam has made a vex2script program for VLBI observations.  this will
    have to be taught how to link to the SB database, and the OST will
    have to be taught how to execute it for the appropriate SBs (instead
    of model2script).

i'm sure there are others that i'm not thinking of just at this moment.

	-bryan

Bryan Butler wrote, On 3/12/10 13:10:
> a smaller group of us has been discussing this at great length, and it's 
> time to get it out to the whole group.  i'm including quite a bit of 
> background information here which may be old news to most of you, but i 
> want to be clear on why some of these things are necessary and not just 
> conveniences.  comments are welcome, but note that we are in the process 
> of implementing much of this (and there is much of course that has 
> already been implemented), so comments need to be timely.
> 
> keeping track of our data is an important part of what we do,
> obviously.  one of the main goals of the "e2e" concept is to automate
> this, so that, for instance, an astronomer never has to type in
> "AB1234" in a search field to find their data.  we just know, based on
> their authentication, what data they have available.  searching is
> also important, to find other folks' data of course, so we have to
> keep track of things that allow folks to search sensibly as well.
> 
> we track observations in a number of databases, which are often 
> embodiments of what we call "data models".  we need to be sure that all 
> information for a given project is linked together in the right way in 
> these databases.  for the sake of this discussion, they are:
> 
>   . the proposal database - contains the proposal data including
>     authors, sources, resources and "sessions" (which are akin to
>     Scheduling Blocks; see next item).  this is a MySQL database
>     currently maintained by OpenSky (we are in the process right
>     now of getting it mirrored here).  when a proposal is submitted
>     (via the Proposal Submission Tool, or PST), it is given a "project
>     code", of the form VLA/YYT-nnn, where YY is the last two digits of
>     the current year, T is the trimester in that year, and nnn is a
>     number starting at 001 and incrementing for each proposal submitted.
>     so if you have access to the database, and have the project code, you
>     can get at all of the information in the database for that
>     proposal.  for now, we also have a "legacy code" which is assigned at
>     submission time, of the form like AB1234.  those are intended to go
>     away, and probably soon.
>     Entries Created By: PST.
>     Entries Used By: PST; Project Builder Tool (PBT - creates entries in
>        the project database, see below); query servlet that returns
>        proposals for a given user, or users for a given proposal.
> 
>   . the user database - contains information on our users, like name,
>     institution, contact information, etc.  note that right now the
>     user database is contained within the proposal database, and
>     maintained by OpenSky.  we are currently in the process of trying
>     to get them separated.  each user is identified by what is called a
>     globalID.
>     Entries Created By: Portal (my.nrao.edu - an OpenSky tool); stub
>        entries can be created by the PST.
>     Entries Used By: most tools, at least indirectly, because folks
>        have to be authenticated against it.
> 
>   . the project database - contains projects.  each project is a
>     collection of what we call "Program Blocks" (PBs).  PBs are
>     meant to distinguish between different telescopes or array
>     configurations.  a PB is made up of a collection of Scheduling
>     Blocks (SBs).  each SB is an atomic unit of observing, and is
>     made up of Scans.  a Scan has one or more Subscans, depending
>     on what it is to do (called the Intent of the scan).  a Subscan
>     is made up of a Source, a Resource (hardware setup), some timing
>     information, and some extra information about what the telescope
>     is to do (and some extra Subscan Intent information).  each
>     project has a unique identifier in the database, as does each PB
>     and SB.  in addition, SBs have a link to the parent PB, and PBs
>     have a link to the parent project, so you can find any of the
>     parents or children in any direction given one of the identifiers.
>     a project also has a link to the corresponding proposal which it
>     was generated from (the project code).
>     Entries Created By: PBT (for official approved science projects);
>        OPT (for test projects).
>     Entries Used By: Observation Scheduling Tool (OST).
> 
>   . the execution block (EB) database - contains EBs.  the EB is
>     meant to be the equivalent of the SB, but is what actually
>     occurred on the telescope vs. what was meant to occur.  note
>     that the actual observing script is stored here.  each EB has a
>     unique identifier and a pointer to the SB.
>     Entries Created By: OST.
>     Entries Used By: none, currently, but eventually we need a tool
>        that allows one to get at these beasts (or include it as part
>        of science archive access - see below).
> 
>   . what might be called the "science archive database" - this holds
>     the two parts of the science output of the array: the binary
>     visibility data (which are files conforming to the Binary Data
>     Format [BDF] definition), and the metadata (which are files
>     conforming to the Science Data Model [SDM] definition).  much of
>     what we're trying to clarify is what pointers (identifiers) go
>     in the SDM.  the SDM of course has a pointer to the appropriate
>     BDF.
>     Entries Created By: BDF - Correlator Back End (CBE); SDM - Metadata
>        Capture and Format (MCAF).
>     Entries Used By: SDM Cataloger (SDMC); NGAS; Archive Access Tool
>        (AAT); science data filler into CASA (asdm2ms).  there may be
>        others.
> 
>   . what might be called the "science archive search database" -
>     which holds the elements of the SDM that are interesting to
>     astronomers and which they might like to search for data with.
>     Entries Created By: SDMC.
>     Entries Used By: AAT.
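
a minimal sketch of the project -> PB -> SB links described for the project database above, just to make the "find any parent or child given one identifier" idea concrete.  the class names, field names and identifier values here are made up for illustration and are not the actual database schema.

```python
from dataclasses import dataclass, field
from typing import List


# eq=False keeps identity-based comparison, so comparing two objects never
# walks recursively through the parent/child links.
@dataclass(eq=False)
class Project:
    project_id: str       # unique identifier in the project database
    project_code: str     # link back to the proposal (placeholder value below)
    pbs: List["ProgramBlock"] = field(default_factory=list)


@dataclass(eq=False)
class ProgramBlock:
    pb_id: str
    project: Project      # link to the parent project
    sbs: List["SchedulingBlock"] = field(default_factory=list)

    def __post_init__(self):
        # register with the parent so the link works in both directions
        self.project.pbs.append(self)


@dataclass(eq=False)
class SchedulingBlock:
    sb_id: str
    pb: ProgramBlock      # link to the parent PB

    def __post_init__(self):
        self.pb.sbs.append(self)


def project_of(sb: SchedulingBlock) -> Project:
    """Walk upward: SB -> PB -> project."""
    return sb.pb.project


def sbs_of(project: Project) -> List[SchedulingBlock]:
    """Walk downward: project -> PBs -> SBs."""
    return [sb for pb in project.pbs for sb in pb.sbs]


# hypothetical identifiers, only for the example
project = Project("X000001", "VLA/YYT-nnn")
pb_a = ProgramBlock("X000002", project)
sb_1 = SchedulingBlock("X000003", pb_a)
sb_2 = SchedulingBlock("X000004", pb_a)
```

given only sb_2, project_of(sb_2) gets back to the project (and from there to the project code), and sbs_of(project) lists every SB under it.
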
> 
> so, we have done a pretty good job of defining the links between things 
> in the "pre-observing" elements of the system, but not so well in the 
> "post-observing".  for post-observing, we're mostly concerned with what 
> goes into the SDM.
> 
> please note that the SDM is jointly defined between us and ALMA.  also, 
> the SDM is still what i would consider a "work in progress". 
> unfortunately, however, because it is jointly shared with ALMA, we are 
> not at liberty to make whatever changes we wish - we have to negotiate 
> them.  sometimes that is trivial.  sometimes not so much.
> 
> note also that the SDM is organized as a bunch of tables, which describe 
> various datasets associated with that SDM.  a single SDM can contain 
> datasets from various telescopes, from a single telescope taken at 
> various times, etc.  it's very flexible (you might argue _too_ flexible 
> and i might agree, but we are beyond discussing much of that at this point).
> 
> so in the SDM for each "dataset" (which is the result of the execution 
> of a single Scheduling Block) we have the following things that help 
> identify the larger scale "structure" into which that dataset fits:
> 
> Table      Name          Type
> Main       execBlockID   Tag
> ExecBlock  execBlockID*  Tag
>             execBlockNum  int
>             execBlockUID  EntityRef
>             projectID     EntityRef
>             observerName  string
>             observingLog  string
> SBSummary  sBSummaryId*  Tag
>             sbSummaryUID  EntityRef
>             projectUID    EntityRef
>             sbType        SBType
> 
> i believe the projectID in ExecBlock and projectUID in SBSummary are the 
> same thing (can we please get the names to be consistent?  it should be 
> the UID version i think.)
> 
> as an aside, note that there is no direct place in the SDM to store a 
> project code.  the project code will have to be retrieved by following 
> the link from the SDM to the project database (the projectID). 
> similarly a list of all observers associated with a project is not 
> stored directly, but rather has to be retrieved via 
> SDM->projectID->proposal database.  there are other examples of this, 
> but we should get used to the idea of following multiple pointers 
> through to final data.
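
as a rough illustration of following multiple pointers, here is a sketch that goes from the projectID in an SDM ExecBlock row to the project code, and from there to the people on the proposal.  the two dictionaries are stand-ins for the project and proposal databases, and every key, field name and value in them is an assumption made up for the example.

```python
# stand-in for the project database: identifier -> project record
PROJECT_DB = {
    "X172200": {"project_code": "VLA/YYT-nnn"},
}

# stand-in for the proposal database: project code -> proposal record
PROPOSAL_DB = {
    "VLA/YYT-nnn": {"authors": ["first author", "second author"]},
}


def db_key(uid: str) -> str:
    """Take the trailing identifier off a uid://evla/<db>/<id> string."""
    return uid.rsplit("/", 1)[-1]


def project_code_for(exec_block_row: dict) -> str:
    """SDM ExecBlock row -> projectID -> project database -> project code."""
    project = PROJECT_DB[db_key(exec_block_row["projectID"])]
    return project["project_code"]


def observers_for(exec_block_row: dict) -> list:
    """One more hop: project code -> proposal database -> authors."""
    return PROPOSAL_DB[project_code_for(exec_block_row)]["authors"]


# hypothetical ExecBlock row holding only the pointer, not the project code
row = {"projectID": "uid://evla/pdb/X172200"}
```

nothing about the project code or the observers lives in the row itself; both come from chasing the pointer through the other databases.
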
> 
> so we need to figure out what to do with: execBlockUID, projectUID, 
> observerName, observingLog, sbSummaryUID, and sbType.
> 
> execBlockUID
> 
>     we will use a UID which points to an exec block (EB) in our EB
>     database like:
> 
>        entityId="uid://evla/ebdb/X172200"
> 
>     ("ebdb" -> EB database), with Xnnnnnn the unique EB database
>     identifier.  we will set this via an intent in the script, of the
>     form: ExecBlockID="Xnnnnnn".  we really only need that intent on the
>     first scan, but MCAF can just ignore it on any subsequent scans
>     that repeat the same intent.
> 
> projectUID (and projectID, if we can't get ALMA to make them consistent)
> 
>     we will use a UID which points to a project in our project database
>     like:
> 
>        entityId="uid://evla/pdb/X172200"
> 
>     ("pdb" -> project database), with Xnnnnnn the unique database
>     identifier.  we will set this via an intent in the script, of the
>     form: ProjectID="Xnnnnnn".  we really only need that intent on the
>     first scan, but MCAF can just ignore it on any subsequent scans
>     that repeat the same intent.
> 
> observerName
> 
>     MCAF will get this via an ObserverName intent.  we will stuff the
>     PI or tester's name in this (or something benign for operations
>     scripts).
> 
> observingLog
> 
>     i would like to lobby ALMA to get this changed to an EntityRef, and
>     then have our logs stored separately from the SDM.  we could use
>     something like:
> 
>        entityId="uid://evla/obslog/X172200"
> 
>     we should really start thinking about putting our logs in a proper
>     database.  even if it's just storing them as full PDF files (as a
>     blob) with just an ID.  if it is too difficult to get ALMA to change
>     this, i think we should just stick the above entityId string into
>     the string value for this, and work it that way.
> 
> sbSummaryUID
> 
>     we will use a UID which points to an SB in our project database
>     like:
> 
>        entityId="uid://evla/pdbsb/X172200"
> 
>     ("pdbsb" -> project database SB), with Xnnnnnn the unique database
>     identifier.  we will set this via an intent in the script, of the
>     form: SBID="Xnnnnnn".  we really only need that intent on the
>     first scan, but MCAF can just ignore it on any subsequent scans
>     that repeat the same intent.
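
the UID strings above all share the shape uid://evla/<database>/<identifier>, so building and checking them can be sketched in a few lines.  the function names are hypothetical, and the pattern assumes the identifier is an "X" followed by letters or digits, which the text above does not actually pin down.

```python
import re

# database tags used in the entityId examples above
KNOWN_DATABASES = {"ebdb", "pdb", "pdbsb", "obslog"}

# assumption: identifier is "X" followed by one or more letters/digits
_UID_RE = re.compile(r"^uid://evla/([a-z]+)/(X[0-9A-Za-z]+)$")


def make_uid(database: str, identifier: str) -> str:
    """Build a uid://evla/<database>/<identifier> string."""
    if database not in KNOWN_DATABASES:
        raise ValueError(f"unknown database tag: {database!r}")
    uid = f"uid://evla/{database}/{identifier}"
    if _UID_RE.match(uid) is None:
        raise ValueError(f"malformed identifier: {identifier!r}")
    return uid


def parse_uid(uid: str) -> tuple:
    """Split a UID string back into (database, identifier)."""
    match = _UID_RE.match(uid)
    if match is None or match.group(1) not in KNOWN_DATABASES:
        raise ValueError(f"not a recognized UID: {uid!r}")
    return match.group(1), match.group(2)
```

for example, make_uid("pdbsb", "X172200") gives the sbSummaryUID string shown above, and parse_uid() on that string returns the database tag and identifier again.
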
> 
> sbType
> 
>     this is an enumeration, with the following possibilities:
> 
>        SBType.OBSERVATORY
>        SBType.OBSERVER
>        SBType.EXPERT
> 
>     while i dislike those particular enumerations because i don't think
>     they describe the categories very well, i can live with them.  i
>     think they should imply:
> 
>        SBType.OBSERVATORY -> operations scripts
>        SBType.OBSERVER -> normal science observations
>        SBType.EXPERT -> test observations
> 
>     we will set this via an intent in the script, of the form:
>     SBTYPE="enumValue".  only the ones that are official projects in the
>     project database get an SBTYPE="OBSERVER".  ones that folks cobble up
>     in the OPT to use as tests get SBTYPE="EXPERT".  ones that we make
>     for operations will get SBTYPE="OBSERVATORY".  we really only need
>     that intent on the first scan, but MCAF can just ignore it on any
>     subsequent scans that repeat the same intent.
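
to make the "only needed on the first scan, ignored afterwards" behavior concrete, here is a sketch of how the identifying intents might be collected from a script's scans.  this is not MCAF code: the scan representation (a list of intent strings per scan) and the function name are assumptions, and the ObserverName intent is assumed to use the same Key="value" form as the others.

```python
import re

# intents of the form Key="value"
_INTENT_RE = re.compile(r'^(\w+)="([^"]*)"$')

# the identifying intents discussed above
TRACKED = ("ExecBlockID", "ProjectID", "SBID", "SBTYPE", "ObserverName")

# allowed SBType enumeration values
SB_TYPES = {"OBSERVATORY", "OBSERVER", "EXPERT"}


def collect_identifiers(scans):
    """Return the first value seen for each tracked intent.

    `scans` is a list of scans, each a list of intent strings.  Repeats of
    an intent on later scans are ignored; other intents are skipped.
    """
    found = {}
    for intents in scans:
        for intent in intents:
            match = _INTENT_RE.match(intent)
            if match is None:
                continue
            key, value = match.groups()
            if key in TRACKED and key not in found:
                found[key] = value
    if "SBTYPE" in found and found["SBTYPE"] not in SB_TYPES:
        raise ValueError(f"unknown SBTYPE: {found['SBTYPE']!r}")
    return found


# hypothetical script: identifiers on the first scan, one repeated later
example_scans = [
    ['ExecBlockID="X172200"', 'ProjectID="X172201"', 'SBID="X172202"',
     'SBTYPE="EXPERT"', 'ObserverName="a tester"'],
    ['SBID="X172202"', 'SomeOtherIntent="whatever"'],
]
identifiers = collect_identifiers(example_scans)
```

the repeated SBID on the second scan changes nothing, and the unrelated intent is skipped.
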