[evla-sw-discuss] Alert server

Thu Jul 12 14:20:04 EDT 2007

This type of problem, I believe, is caused by trying to use what is  
essentially an open-loop logging system as the monitor portion of a  
control system.  I believe that to maintain reliable and accurate  
system state its components must be periodically polled.

The EVLA's method of having the 'patient report its own illness'  
seems simpler compared to a system where every component is polled -  
but that may not actually be the case.  For one, a server is required  
in the former to maintain an 'image' of system state.  Keeping that  
image accurate requires more complexity as illustrated by this  
problem.  Band-aid solutions will grow a fragile system that does  
polling anyway (see Barry's suggestions below).

I've wondered: if one were to walk out to an antenna right now and
cut power to a MIB, would we know it, or would we find out only when we
noticed that the antenna had stopped responding to the associated commands?

A way to fix the problem below without polling would be to have  
components periodically announce their 'wellness' as well as alerts.   
But this will get even more complicated and we still end up  
with just an image of state maintained by a single central thing.
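
To make the extra bookkeeping concrete, here is a rough Python sketch of
what the central image would need once components announce wellness.
The names (WellnessImage, announce, silent, max_silence_s) are made up
for illustration and are not taken from any EVLA code.

```python
import time
from typing import Dict, List, Optional


class WellnessImage:
    """Hypothetical central image fed by 'I am well' announcements."""

    def __init__(self, max_silence_s: float):
        # Assumed tuning knob: how long a component may stay quiet.
        self.max_silence_s = max_silence_s
        self.last_seen: Dict[str, float] = {}

    def announce(self, component: str, now: Optional[float] = None) -> None:
        # Record the time of the latest wellness message from a component.
        self.last_seen[component] = time.monotonic() if now is None else now

    def silent(self, now: Optional[float] = None) -> List[str]:
        # Components whose last announcement is older than the allowed gap.
        now = time.monotonic() if now is None else now
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.max_silence_s
        )
```

A component that has never announced itself does not show up in
silent() at all, so the image would still need a separate list of what
is supposed to exist.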

Rich argues that polling will require a process that knows about  
every component in the system.  This is where hierarchy comes in.   
VLA and EVLA Antenna objects are each experts on their own specific  
components and are able to provide an 'executive summary' of state to  
whoever, in turn, polls them.  Executive decisions such as flagging  
and antenna scheduling can be made without having to know what an  
'L8' even is.  The state of the system is maintained 'live' - in the  
system itself - no image or associated maintainer required.
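
A minimal Python sketch of the hierarchy I have in mind follows.  All of
the names here (Summary, Antenna, Array, antennas_to_flag, the antenna
IDs) are hypothetical, and the health checks are stand-ins for whatever
a real Antenna object would ask its own components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Summary:
    """The 'executive summary' an antenna hands to whoever polls it."""
    name: str
    usable: bool
    reasons: List[str] = field(default_factory=list)


class Antenna:
    """Expert on its own components; callers only see the summary."""

    def __init__(self, name: str, components: Dict[str, Callable[[], bool]]):
        self.name = name
        self.components = components  # component name -> health check

    def poll(self) -> Summary:
        reasons = []
        for comp, is_healthy in self.components.items():
            try:
                healthy = is_healthy()
            except OSError:
                # No answer from the component counts as unhealthy.
                healthy = False
            if not healthy:
                reasons.append(comp)
        return Summary(self.name, usable=not reasons, reasons=reasons)


class Array:
    """Polls antennas; knows nothing about what is inside them."""

    def __init__(self, antennas: List[Antenna]):
        self.antennas = antennas

    def poll(self) -> Dict[str, Summary]:
        return {ant.name: ant.poll() for ant in self.antennas}


def antennas_to_flag(array: Array) -> List[str]:
    # Executive decision made from summaries alone.
    return [name for name, s in array.poll().items() if not s.usable]
```

The flagging function never mentions an L8; it asks each antenna
whether it is usable, and the antenna works that out from live state
each time it is polled.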

Bruce once argued against polling: 'Would you want the fire  
department to call you every five minutes to see if your house is on  
fire?'.  The answer is yes! - albeit not manually by phone.  A house  
that is on fire (or the person that may or may not be in it) cannot  
be relied upon to contact the fire department.  By actively 'pinging'  
the house, the fire department will be guaranteed to know either 1)  
the state of the house or 2) that it cannot communicate with the  
house.  Either is valuable information to the monitor half of a  
control system.
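
In code terms, a poll has three possible outcomes rather than two.  A
small Python sketch, with hypothetical names (Status, PollResult, poll)
and a query callable standing in for the real network request:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    OK = "ok"
    FAULT = "fault"
    UNREACHABLE = "unreachable"


@dataclass
class PollResult:
    name: str
    status: Status
    detail: str = ""


def poll(name: str, query: Callable[[], str]) -> PollResult:
    """Ask one component for its state; a failed query is itself a result."""
    try:
        detail = query()
    except (TimeoutError, ConnectionError) as exc:
        # Outcome 2: we know that we cannot communicate with it.
        return PollResult(name, Status.UNREACHABLE, str(exc))
    # Outcome 1: we know its state (assumed here to be the string "ok"
    # when healthy, anything else describing the fault).
    status = Status.OK if detail == "ok" else Status.FAULT
    return PollResult(name, status, detail)
```

An open-loop system only ever sees the FAULT case, and only if the
component is still able to report it.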

What does anyone else think about this?  Periodic polling is a  
difficult thing for people to want to embrace but I believe it gives  
the most reliable and accurate system state and may possibly be the  
least complex in the long run.

Kevin

On Jul 11, 2007, at 3:08 PM, Barry Clark wrote:

> We've always been unclear on the concept of stale alerts (note the
> included message at the end of this).
>
> Now that we have an alert server, it should take care of things in a
> better way.  But, as illustrated by the incident below, it has merely
> gone from alerts being erroneously overlooked to alerts being
> erroneously preserved.
>
> The alert server is perfectly capable of going out to the monitor point
> and asking if the alert is still in force.  It should do so.  Question
> is when.  Could be done periodically, on a slow period.  Or, the
> Executor, whenever a new script starts, could send a REST message to the
> alert server, saying "Here is my ID.  Please check and see if any alerts
> you have for me are still valid."
>
>> From evlatests-bounces at donar.cv.nrao.edu  Wed Jul 11 14:14:22 2007
>> Date: Wed, 11 Jul 2007 14:14:04 -0600 (MDT)
>> From: Ken Sowinski <ksowinsk at nrao.edu>
>> To: evlatests at nrao.edu
>>
>> There was much confusion at the VLA today with regard to timing
>> between the arrays which ended with the CMP in a strange state
>> and having to be rebooted more than once.  This resulted in
>> stale "L8 out of sync" messages in the alert server causing
>> all data from VLA antennas to be flagged as bad.
>>
>> As a temporary measure Walter has kludged idcaf so that L8
>> alerts are not turned into flags.  However the alerts are
>> still there and no one I have talked to knows how to make
>> them go away.  We need either a little more distributed
>> knowledge about these parts of the system, or a system
>> with less remembered state.  I wonder if some (certainly not
>> all) of our problems with bad flagging may be related to this
>> kind of behavior.
>>
>> Notice of removal of these alerts would be appreciated so that
>> idcaf can be restored to its usual functionality.