[evla-sw-discuss] severity of alerts

Thu Jul 21 14:14:37 EDT 2005

On 7/21/05 10:58, Tom Morgan wrote:
>> >> the engineers will be going over each MIB and its monitor points and
>> >> assigning this severity code to its alerts.  pat van buskirk is going to
>> >> do the leg work of pestering the engineers on this.
>> >>
>> >> in addition to a severity level, an "action" for the operator has to be
>> >> defined for each of these alerts.  this would be similar to the page at:
>> >> http://www.vla.nrao.edu/operators/alarms/ for the VLA.
>> >
>> > An operator action for each alert coming from the MIBS is not in keeping
>> > with the Alert design in the High Level Design Doc. Alerts will be
>> > propagated up the hierarchy of the system. Operators are at the top of
>> > the pyramid and should only be presented with alert reports from
>> > subsystems (see pp 55-57 of the High Level Design). At the level of the
>> > MIBS, the important question is what the alert means to the next higher
>> > level of the system (especially in the context of alerts coming in from
>> > other sources) and how the next level up will respond.
>>
>> i must not have been very clear here.  i wasn't proposing that the MIB
>> have any of the "action" information - that information has to be at a
>> higher level (which was why i listed the combination of severity level,
>> action, and flagging into some dB as an advantage to option 2 below).
> 
> So, this higher level will use MIB alert messages and the lookup table to 
> generate messages to the operator? 

yes.

> If so we will need to think in terms of a 
> higher level of context than single MIB alerts. Some component failures are 
> more important than others, some combinations of failures are very important, 
> some failures vary in importance depending on the needs of the current 
> observation, etc. 

yes - you are completely right here.  combinations are also important. 
but let's start with the simple first, then build up.  let's just define 
the severity levels per monitor point, and the "actions", and 
potentially some simple flags, then think about the combinations.
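
as a rough illustration of that first step, here is a minimal sketch of a per-monitor-point table in Python.  everything in it is an assumption made for illustration: the MIB and monitor point names, the three severity levels, the action strings, and the flag field are all hypothetical, and none of it is the actual Checker design.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Hypothetical severity scale; the real levels are still to be defined."""
    LOW = 1
    MODERATE = 2
    HIGH = 3


@dataclass(frozen=True)
class AlertRule:
    """What the higher level knows about one monitor point's alert."""
    severity: Severity
    action: str              # instruction for whoever handles the alert
    flag_data: bool = False  # whether affected data should be flagged


# Keyed by (MIB name, monitor point name). All entries are made up.
ALERT_TABLE = {
    ("example-mib-1", "supply_voltage"): AlertRule(
        Severity.LOW, "log and notify the responsible engineer"),
    ("example-mib-2", "lo_lock"): AlertRule(
        Severity.HIGH, "remove the antenna from the observation", flag_data=True),
}


def lookup(mib: str, monitor_point: str) -> Optional[AlertRule]:
    """Return the rule for an alert, or None if the table has no entry."""
    return ALERT_TABLE.get((mib, monitor_point))
```

a receiver of an alert would call something like lookup("example-mib-1", "supply_voltage") and get back the severity, action, and flag together.  combinations of alerts are deliberately left out of this sketch.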

> Will this be in "Checker" ?

yes, in my mind.  whether it is a separate "piece" (or program out in 
front) of Checker is TBD (by rich, i guess, since he's implementing all 
of this).  Checker is where errors are reported, so it might as well be 
where the hierarchy is implemented.

>> >> once they have them defined, then we need to support them.  there are
>> >> two ways that i see to do this:
>> >>   1 - each MIB has coded into it these severity levels, just as it
>> >>       has coded into it the levels at which alerts are triggered, and
>> >>       when the alert is sent out, the severity code goes out with it;
>> >>   2 - there is a lookup table which checker uses, given the MIB and
>> >>       the monitor point/alert, to assign severity, and any program
>> >>       that receives the alerts can use that lookup table to retrieve
>> >>       the severity level.
>> >
>> > Following the High Level Design, only two alert conditions apply to MIBS:
>> > failure or warning. Failure means no longer functional, warning means
>> > functional but no longer operating within normal limits. As a result, the
>> > alert condition or severity is automatically known by each MIB and can be
>> > easily included in the message. This is essentially option 1), but will
>> > never require programming changes - redefinition of "normal" due to
>> > hardware changes is handled automatically. Option 2) will never be needed
>> > at the MIB level.
>>
>> well, we'll have to modify the HLD as needed here if we decide that more
>> granularity is needed within the MIB itself.  again, i'm arguing for
>> solution 2 below anyway, and in that case this is a non-issue.
> 
> Seems to me that a component can have only three states: operating, operating 
> but not within normal limits, not operating. Therefore the MIB responsible 
> for this component can only report these three possible conditions.

severity is separate from state.  an engineer might want to know if, for 
instance, a particular voltage is out of range - so an alert is 
generated.  but it might have only very modest effect on the data coming 
out, so the severity is low.
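
to make the distinction concrete, here is a small Python sketch in which state and severity are independent fields of an alert.  the class names, the two severity levels, and the example monitor points are all hypothetical, chosen only to illustrate the point; the three states are the ones listed in the quoted text above.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """The three component states described in the quoted text."""
    OPERATING = "operating"
    OUT_OF_LIMITS = "operating but not within normal limits"
    NOT_OPERATING = "not operating"


class Severity(Enum):
    """Hypothetical scale for how much the condition affects the data."""
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Alert:
    mib: str
    monitor_point: str
    state: State        # what the component is doing
    severity: Severity  # how much it matters, assigned separately


def matters_to_operator(alert: Alert) -> bool:
    """Assumed policy for this sketch: only HIGH severity is escalated."""
    return alert.severity is Severity.HIGH


# Two alerts in the same state but with different severities.
voltage = Alert("example-mib-1", "supply_voltage",
                State.OUT_OF_LIMITS, Severity.LOW)
lo_lock = Alert("example-mib-2", "lo_lock",
                State.OUT_OF_LIMITS, Severity.HIGH)
```

both alerts report the same state, yet under the assumed policy only the second one would be escalated.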

	-bryan