[evla-sw-discuss] severity of alerts

Thu Jul 21 12:58:07 EDT 2005

All:

Questions/Comments on comments inserted.

--Tom

On Thursday 21 July 2005 10:21, Bryan Butler wrote:
> On 7/21/05 09:45, Tom Morgan wrote:
> > All:
> >
> > Comments inserted in text.
> >
> > --Tom
> >
> > On Wednesday 20 July 2005 22:58, Bryan Butler wrote:
> >> all,
> >>
> >> we've gotten to the point where we need to define a severity level for
> >> alerts.  the operators need this in order to tell the importance level
> >> of them as they arrive on the checker screen.
> >>
> >> i propose that we define an integer alert level from 0 to 5, with 0
> >> being the highest importance (issues of safety) and 5 being
> >> informational only.  if somebody can make a case for more granularity
> >> (do we need 10 levels?), that's fine.
> >
> > High Level Design says 4. It does not specifically address safety as a
> > special case. The 4 levels are Failure, Warning, Error, Info.
>
> in keeping with bruce's suggestion, we might need to extend this to 8
> states (syslog compatibility).  if we only use 4 of them, so be it.
>
> >> the engineers will be going over each MIB and its monitor points and
> >> assigning this severity code to its alerts.  pat van buskirk is going to
> >> do the leg work of pestering the engineers on this.
> >>
> >> in addition to a severity level, an "action" for the operator has to be
> >> defined for each of these alerts.  this would be similar to the page at:
> >> http://www.vla.nrao.edu/operators/alarms/ for the VLA.
> >
> > An operator action for each alert coming from the MIBS is not in keeping
> > with the Alert design in the High Level Design Doc. Alerts will be
> > propagated up the hierarchy of the system. Operators are at the top of
> > the pyramid and should only be presented with alert reports from
> > subsystems (see pp 55-57 of the High Level Design). At the level of the
> > MIBS, the important question is what the alert means to the next higher
> > level of the system (especially in the context of alerts coming in from
> > other sources) and how the next level up will respond.
>
> i must not have been very clear here.  i wasn't proposing that the MIB
> have any of the "action" information - that information has to be at a
> higher level (which was why i listed the combination of severity level,
> action, and flagging into some dB as an advantage to option 2 below).

So, this higher level will use MIB alert messages and the lookup table to
generate messages to the operator? If so, we will need to think in terms of a
higher level of context than single MIB alerts. Some component failures are
more important than others, some combinations of failures are very important,
some failures vary in importance depending on the needs of the current
observation, etc. Will this be in "Checker"?
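
To make the question concrete, here is a minimal sketch of what such a
lookup might look like. Everything in it is an assumption for illustration:
the names AlertInfo, ALERT_TABLE, lookup and group_severity, the MIB and
monitor point names, the action strings, and the 0-5 scale (taken from
Bryan's proposal, not yet agreed). The "most important severity wins" rule
for combinations is only a placeholder for whatever context-dependent logic
the higher level would actually need.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class AlertInfo:
    severity: int    # 0 = highest importance ... 5 = informational (proposed scale)
    action: str      # operator action text (hypothetical)
    flag_data: bool  # whether affected data should be flagged (hypothetical)


# Keyed by (MIB name, monitor point). All entries are invented examples.
ALERT_TABLE: Dict[Tuple[str, str], AlertInfo] = {
    ("example-mib-1", "temperature"): AlertInfo(1, "hypothetical action A", True),
    ("example-mib-2", "lock_status"): AlertInfo(3, "hypothetical action B", False),
}

# Fallback for alerts that have no table entry yet (an assumption).
DEFAULT = AlertInfo(5, "no action defined", False)


def lookup(mib: str, point: str) -> AlertInfo:
    """Return the table entry for one MIB alert, or DEFAULT if none exists."""
    return ALERT_TABLE.get((mib, point), DEFAULT)


def group_severity(alerts: Iterable[Tuple[str, str]]) -> Optional[int]:
    """Placeholder combination rule: the numerically lowest (most important)
    severity among the active alerts, or None if there are no alerts."""
    severities = [lookup(mib, point).severity for mib, point in alerts]
    return min(severities) if severities else None
```

A real version would have to replace group_severity with something that
knows about combinations of failures and the needs of the current
observation, which is exactly the part a flat table cannot express.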

>
> >> once they have them defined, then we need to support them.  there are
> >> two ways that i see to do this:
> >>   1 - each MIB has coded into it these severity levels, just as it
> >>       has coded into it the levels at which alerts are triggered, and
> >>       when the alert is sent out, the severity code goes out with it;
> >>   2 - there is a lookup table which checker uses, given the MIB and
> >>       the monitor point/alert, to assign severity, and any program
> >>       that receives the alerts can use that lookup table to retrieve
> >>       the severity level.
> >
> > Following the High Level Design, only two alert conditions apply to MIBS:
> > failure or warning. Failure means no longer functional, warning means
> > functional but no longer operating within normal limits. As a result, the
> > alert condition or severity is automatically known by each MIB and can be
> > easily included in the message. This is essentially option 1), but will
> > never require programming changes - redefinition of "normal" due to
> > hardware changes is handled automatically. Option 2) will never be needed
> > at the MIB level.
>
> well, we'll have to modify the HLD as needed here if we decide that more
> granularity is needed within the MIB itself.  again, i'm arguing for
> solution 2 below anyway, and in that case this is a non-issue.

Seems to me that a component can have only three states: operating, operating 
but not within normal limits, not operating. Therefore the MIB responsible 
for this component can only report these three possible conditions.
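
As a minimal sketch of that three-state view (the names ComponentState and
classify are hypothetical, as is the convention that a missing reading means
"not operating"):

```python
from enum import Enum
from typing import Optional


class ComponentState(Enum):
    OPERATING = "operating"
    OUT_OF_LIMITS = "operating but not within normal limits"
    NOT_OPERATING = "not operating"


def classify(value: Optional[float], low: float, high: float) -> ComponentState:
    """Map one monitor point reading onto the three states.

    Assumption: value is None when the component returns no reading at all.
    low and high are the current "normal" limits, so redefining "normal"
    changes only these two numbers, not the logic.
    """
    if value is None:
        return ComponentState.NOT_OPERATING
    if low <= value <= high:
        return ComponentState.OPERATING
    return ComponentState.OUT_OF_LIMITS
```

In the High Level Design terms quoted above, OUT_OF_LIMITS would correspond
to a warning and NOT_OPERATING to a failure.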

>
> >> the advantage to 1 is that it keeps the information closest to the MIB.
> >> it also saves the "management" software upstream.  the disadvantage is
> >> that if you decide to change anything you have to modify all of those
> >> MIB images.  the advantage to 2 is that you avoid that MIB image
> >> modification, and can centralize everything (in a database or similar).
> >> another advantage is that you can also include the "action" in this
> >> database, as well as flagging information.  since you are going to need
> >> these other things there, you might as well add a column for severity.
> >>
> >> i prefer the lookup table/database, but would like to hear other
> >> opinions.
> >>
> >> 	-bryan
> >>
> >>
> >> _______________________________________________
> >> evla-sw-discuss mailing list
> >> evla-sw-discuss at listmgr.cv.nrao.edu
> >> http://listmgr.cv.nrao.edu/mailman/listinfo/evla-sw-discuss