[evla-sw-discuss] alerts - reliable delivery

Mon Mar 14 15:07:51 EST 2005

I'm disturbed that we do see missing alert packets.  This is a few orders
of magnitude greater loss than we were guesstimating on the basis of no
data.  If the number of missing packets grows as the square of the data
rate, as theory suggests it should, then when we complete the EVLA
conversion we will have 50 times the traffic and 2500 times the missing
alert packets.  Sounds like we might be in trouble.
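
To make the arithmetic explicit, here is a tiny Python sketch.  The
function name and the default exponent of 2 are just illustrative; the
quadratic growth is the assumption stated above, not something measured.

```python
def missing_alert_multiplier(traffic_multiplier, exponent=2.0):
    """Factor by which lost packets grow if loss scales as traffic ** exponent."""
    return traffic_multiplier ** exponent


# 50 times the traffic under the assumed quadratic law
print(missing_alert_multiplier(50))  # 2500.0
```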

In a real-time context, it is not clear to me that TCP is more reliable
than UDP.  This is especially so of the sort wherein you establish a 
connection for each message.  Yes, TCP does retries, but establishing a 
connection is a lot more complicated than sending a packet, and networks
tend to have rather bursty statistics: if one packet is dropped, it is
likely that a bunch of them will be dropped at about the same time.
So I rather think that the rate of failed connections would be about 
equal to the rate of dropped packets.  True, on the sending side you can 
add yet another level of retries, and probably eventually get through.  
But this can get complicated.

Because TCP often takes a little vacation to think about things, in a
real-time system we cannot have the connect-and-send in the same thread
as something needed to actually control the system.  We would need to
put it off on a thread of its own, with a message queue to feed it the
things it needs to send, which raises the question of how much room we
should allocate for the message queue, and what to do when it fills up
(e.g., when the Flagger is off the air).
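
As a rough sketch of what such a sender thread might look like, here is
one possibility in Python.  Everything in it is hypothetical: the class
name AlertSender, the queue size, and above all the drop-the-oldest
policy when the queue fills, which is only one of several answers to the
question above.  The send callable stands in for whatever blocking
connect-and-send we would actually use.

```python
import queue
import threading


class AlertSender:
    """Feeds alerts to a worker thread so the control thread never blocks."""

    def __init__(self, send, maxsize=100):
        self._send = send        # hypothetical blocking send (e.g. connect-and-send)
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0         # alerts discarded because the queue was full
        self.failed = 0          # alerts whose send raised an error
        threading.Thread(target=self._run, daemon=True).start()

    def post(self, alert):
        """Called from the control thread; returns without waiting on the network."""
        while True:
            try:
                self._q.put_nowait(alert)
                return
            except queue.Full:
                try:
                    self._q.get_nowait()   # assumed policy: discard the oldest alert
                    self._q.task_done()
                    self.dropped += 1
                except queue.Empty:
                    pass                   # worker emptied the queue first; just retry

    def _run(self):
        while True:
            alert = self._q.get()
            try:
                self._send(alert)
            except OSError:
                self.failed += 1           # give up on this alert; no further retries here
            finally:
                self._q.task_done()

    def flush(self):
        """Block until every queued alert has been sent or given up on."""
        self._q.join()
```

The dropped counter is there so that we can at least see how often the
queue overflows.  Whether discarding the oldest alert is the right thing
for Flagger is exactly the open question.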

Perhaps we should explore other possibilities a bit.  In particular, we
can take account of the difference between Checker and Flagger.  Since
Checker is almost entirely for human consumption, we can do things on
a human timescale.  For instance, we can have the Device reissue alerts,
say every three minutes (perhaps with the word "Still " inserted before
the message).  Then the Checker screens can simply delete from their
display anything uncancelled that is over three minutes old.  The other use
for Checker type alerts - answering "How often has this thing been glitching?" -
does not depend much on reliability.
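
The display side of this could be sketched roughly as follows in Python.
The class and method names are made up for illustration, and the alert
key (say, device plus monitor point name) is an assumption.  A reissued
"Still" message is treated the same as the original alert: it just
refreshes the timestamp.

```python
REISSUE_PERIOD_S = 180.0  # three minutes, as suggested above


class CheckerDisplay:
    """Keeps only alerts that were set or reissued within max_age_s seconds."""

    def __init__(self, max_age_s=REISSUE_PERIOD_S):
        self.max_age_s = max_age_s
        self._last_seen = {}   # alert key -> arrival time of latest alert or "Still"

    def on_alert(self, key, now):
        """Handle an alert-on or a reissued "Still" alert."""
        self._last_seen[key] = now

    def on_cancel(self, key):
        """Handle an alert-off; harmless if the alert-on was never seen."""
        self._last_seen.pop(key, None)

    def visible(self, now):
        """Purge stale entries and return the keys that should still be shown."""
        self._last_seen = {
            k: t for k, t in self._last_seen.items()
            if now - t <= self.max_age_s
        }
        return sorted(self._last_seen)
```

In practice we would probably want max_age_s somewhat longer than the
reissue period, so that one late or lost "Still" does not make an active
alert blink off the screen.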

For Flagger purposes we do need pretty good reliability.  One thing we 
could do would be to add, in the MP structure, the start and stop times
of the last time a flag was set.  Then, if Flagger receives a cancellation
flag without having received a corresponding set flag, it can query the
Device and reconstruct the whole sequence.  Once it gets a flag set, it
can then proceed to ask the Device, on a data-flagging time scale (say
every 10 seconds), whether the flag is still set.  By doing about the
right thing if it receives either a "set" or a "cancel", this should
result in roughly squaring the fraction of lost flags, since a flag is
then missed only when both of its messages are lost.
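
A minimal Python sketch of that Flagger logic, under some loud
assumptions: query_device is a hypothetical callable returning the
(start, stop) times of the most recent flag on a monitor point, with
stop set to None while the flag is still up.  The real MP structure and
query mechanism would of course differ.

```python
class Flagger:
    """Tracks flag intervals, recovering from a lost "set" or a lost "cancel"."""

    def __init__(self, query_device):
        # Hypothetical: query_device(mp) -> (start, stop) of the last flag,
        # where stop is None if the flag is still set.
        self._query = query_device
        self.open = {}        # mp -> start time of flags believed to be set
        self.intervals = []   # completed (mp, start, stop) flag intervals

    def on_set(self, mp, start):
        """A "set" message arrived."""
        self.open.setdefault(mp, start)

    def on_cancel(self, mp, stop):
        """A "cancel" message arrived, possibly without a matching "set"."""
        start = self.open.pop(mp, None)
        if start is None:
            # The "set" was lost: ask the Device for the times instead.
            start, stop = self._query(mp)
        self.intervals.append((mp, start, stop))

    def poll(self, mp):
        """Call every ~10 s for each open flag; catches a lost "cancel"."""
        if mp not in self.open:
            return
        start, stop = self._query(mp)
        if stop is not None:
            del self.open[mp]
            self.intervals.append((mp, start, stop))
```

Since the Device only holds the last start and stop times, this cannot
recover a flag whose "set" and "cancel" were both lost, nor one that was
overwritten by a newer flag before Flagger asked.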

This approach, where the unreliability is handled centrally, makes it
rather easier to instrument and to figure out what is happening than
attempting to handle it in the MIBs.

> The current scheme for alerts coming from EVLA and VLA antennas
> is to multicast an alert-on message once when a monitor point
> goes into an alert state, and to multicast an alert-off message
> once when a monitor point exits an alert state.  As I have
> mentioned on several occasions I do not consider this scheme
> robust w.r.t. dropped packets and other network glitches.  We
> already have examples of alert-off messages not being seen for
> a corresponding alert-on message even though direct query of
> the mib shows the monitor point to have exited the alert state.