[evla-sw-discuss] severity of alerts

Thu Jul 21 11:45:21 EDT 2005

All:

Comments inserted in text.

--Tom

On Wednesday 20 July 2005 22:58, Bryan Butler wrote:
> all,
>
> we've gotten to the point where we need to define a severity level for
> alerts.  the operators need this in order to tell the importance level
> of them as they arrive on the checker screen.
>
> i propose that we define an integer alert level from 0 to 5, with 0
> being the highest importance (issues of safety) and 5 being
> informational only.  if somebody can make a case for more granularity
> (do we need 10 levels?), that's fine.

The High Level Design says 4. It does not specifically address safety as a
special case. The 4 levels are Failure, Warning, Error, Info.
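
As a rough illustration only, the four levels named above could be written
as an enumeration. The class name, the helper function, and the choice of a
Python enum are assumptions made for this sketch, not something taken from
the High Level Design, and no ranking between the levels is implied.

```python
from enum import Enum


class AlertLevel(Enum):
    """Hypothetical names for the four levels listed above."""
    FAILURE = "Failure"
    WARNING = "Warning"
    ERROR = "Error"
    INFO = "Info"


def parse_level(text: str) -> AlertLevel:
    """Map a level name (case-insensitive) to an AlertLevel member."""
    return AlertLevel[text.strip().upper()]
```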

>
> the engineers will be going over each MIB and its monitor points and
> assigning this severity code to its alerts.  pat van buskirk is going to
> do the leg work of pestering the engineers on this.
>
> in addition to a severity level, an "action" for the operator has to be
> defined for each of these alerts.  this would be similar to the page at:
> http://www.vla.nrao.edu/operators/alarms/ for the VLA.

An operator action for each alert coming from the MIBs is not in keeping with
the Alert design in the High Level Design document. Alerts will be propagated
up the hierarchy of the system. Operators are at the top of the pyramid and
should only be presented with alert reports from subsystems (see pp. 55-57 of
the High Level Design). At the level of the MIBs, the important question is
what the alert means to the next higher level of the system (especially in
the context of alerts coming in from other sources) and how the next level up
will respond.
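
To make the idea of propagation concrete, here is a minimal sketch of a
subsystem that collects alerts from below and passes a single report upward.
Everything in it is hypothetical: the Alert and Subsystem names, the fields,
and especially the rule that any failure makes the whole report a failure.
The High Level Design, not this sketch, defines how a level actually
interprets and responds to the alerts it receives.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alert:
    source: str      # name of the MIB or subsystem raising the alert (hypothetical)
    condition: str   # "failure" or "warning"
    detail: str


@dataclass
class Subsystem:
    name: str
    pending: List[Alert] = field(default_factory=list)

    def receive(self, alert: Alert) -> None:
        """Store an alert from a lower level until the next report."""
        self.pending.append(alert)

    def report(self) -> Optional[Alert]:
        """Summarize pending alerts into one alert for the next level up."""
        if not self.pending:
            return None
        # Assumed rule: one failure below makes the report a failure.
        has_failure = any(a.condition == "failure" for a in self.pending)
        condition = "failure" if has_failure else "warning"
        sources = ", ".join(sorted({a.source for a in self.pending}))
        detail = f"{len(self.pending)} alert(s) from: {sources}"
        self.pending.clear()
        return Alert(self.name, condition, detail)
```

In this sketch the operator display would only ever see the Alert returned
by report(), never the individual MIB alerts.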

>
> once they have them defined, then we need to support them.  there are
> two ways that i see to do this:
>   1 - each MIB has coded into it these severity levels, just as it
>       has coded into it the levels at which alerts are triggered, and
>       when the alert is sent out, the severity code goes out with it;
>   2 - there is a lookup table which checker uses, given the MIB and
>       the monitor point/alert, to assign severity, and any program
>       that receives the alerts can use that lookup table to retrieve
>       the severity level.
>

Following the High Level Design, only two alert conditions apply to MIBs:
failure or warning. Failure means no longer functional; warning means
functional but no longer operating within normal limits. As a result, the
alert condition or severity is automatically known by each MIB and can be
easily included in the message. This is essentially option 1), but will never
require programming changes - redefinition of "normal" due to hardware changes
is handled automatically. Option 2) will never be needed at the MIB level.
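
A minimal sketch of how a MIB might derive the condition itself, assuming
the normal limits are held as data rather than in code. The function names,
the message fields, and the use of None to stand for "no usable reading" are
all assumptions made for illustration; the real MIB software and message
format may look quite different.

```python
from typing import Optional


def classify(value: Optional[float], low: float, high: float) -> Optional[str]:
    """Return "failure", "warning", or None for one monitor point reading.

    value is None when there is no usable reading, which this sketch
    treats as "no longer functional".
    """
    if value is None:
        return "failure"
    if value < low or value > high:
        return "warning"
    return None  # within normal limits, no alert


def build_alert(mib: str, point: str, value: Optional[float],
                low: float, high: float) -> Optional[dict]:
    """Build an alert message that carries its own condition, or None."""
    condition = classify(value, low, high)
    if condition is None:
        return None
    return {"mib": mib, "point": point, "condition": condition, "value": value}
```

Because low and high are passed in as data, changing what "normal" means
would not require touching this code.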

> the advantage to 1 is that it keeps the information closest to the MIB.
>   it also saves the "management" software upstream.  the disadvantage is
> that if you decide to change anything you have to modify all of those
> MIB images.  the advantage to 2 is that you avoid that MIB image
> modification, and can centralize everything (in a database or similar).
>   another advantage is that you can also include the "action" in this
> database, as well as flagging information.  since you are going to need
> these other things there, you might as well add a column for severity.
>
> i prefer the lookup table/database, but would like to hear other opinions.
>
> 	-bryan