[evlatests] Antenna network dropouts

James Robnett jrobnett at nrao.edu
Fri Feb 10 14:03:45 EST 2012


As most of you know, over the past month we've had a problem with MIBs
in antennas dropping off the network.  I believe we've identified and
fixed the problem.

Last week we began recording network traffic to and from the antennas
for review the next time an event occurred.  On Friday evening antenna
22 dropped off the network.  The packet record showed that at the time
of the outage the paging unit in antenna 22 sent 10x more traffic to
the paging server over a 45 second period than all other traffic to or
from all antennas for the surrounding hour.

From the time we started dumping packets in the middle of last week until
today, that pattern of traffic appeared only at the time antenna 22
dropped off the network, Friday February 3rd at 17:51.
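
For anyone who wants to repeat the comparison described above on a packet
record, here is a minimal sketch.  It assumes the capture has already been
reduced to (timestamp, source, destination, byte count) records; the record
layout, the function name burst_ratio and the host names used with it are
all hypothetical, not what we actually ran.

```python
from collections import namedtuple

# Hypothetical record layout: one entry per captured packet.
Packet = namedtuple("Packet", "ts src dst nbytes")


def burst_ratio(packets, src, dst, burst_start, burst_len=45.0, context=3600.0):
    """Compare one src->dst burst against everything else around it.

    Returns (bytes sent src->dst inside the burst window) divided by
    (all other bytes seen in a `context`-second window centred on the burst).
    """
    burst_end = burst_start + burst_len
    mid = burst_start + burst_len / 2.0
    ctx_start = mid - context / 2.0
    ctx_end = mid + context / 2.0

    burst_bytes = 0
    other_bytes = 0
    for p in packets:
        if not (ctx_start <= p.ts < ctx_end):
            continue  # outside the surrounding hour
        in_burst = (
            burst_start <= p.ts < burst_end and p.src == src and p.dst == dst
        )
        if in_burst:
            burst_bytes += p.nbytes
        else:
            other_bytes += p.nbytes

    if other_bytes == 0:
        return float("inf") if burst_bytes else 0.0
    return burst_bytes / other_bytes
```

A ratio near 10 for a 45 second window against the surrounding hour would
correspond to the pattern seen on antenna 22.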

For reasons unknown the paging unit starts attempting to reset a
non-existent TCP connection to the paging service as fast as its network
connection will allow.  The paging service runs on a standard Windows
server in the old correlator room; that service did not restart
properly after the December 23rd power outage and had been down ever since.

Midway through that period the switches begin flooding the traffic to
all ports on the same logical network as the transmitter.  This includes
all MIBs on antenna 22, both in the control building and in the antenna.
It does not include devices in other antenna networks, the M&C network
or the WIDAR network.

Once the flood of traffic ends most network devices recover gracefully
but the MIBs are unable to handle that kind of data rate and need to
be power cycled.

We've taken several steps after restarting the paging service.

1) The operators tested the paging system to various antennas and
confirmed it's operable once again and the units in the antenna
can communicate with the service in the control building.  We had
never seen this type of traffic from the paging units in previous
years and don't expect to again now that the service is started.

2) We've added the paging service to Nagios.  Nagios is an active
monitoring system that sends an alert when services on our Windows
and Linux servers are not running.   This was tested yesterday and
the computing division now receives emails whenever the service isn't
running.  If it fails in the future we'll know about it.

3) I believe Hichem is looking into making the MIB's network stack
more robust in the face of high data rates.
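
As an illustration of the kind of check described in step 2, here is a
minimal sketch of a Nagios-style probe.  It assumes the paging service can
be tested by opening a TCP connection to it; the actual check we configured
may work differently, and the host and port passed in are placeholders.
The only fixed convention used is the standard Nagios plugin exit codes
(0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp_service(host, port, timeout=5.0):
    """Return (exit_code, status_line) for a simple TCP connect probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"PAGING OK - {host}:{port} accepting connections"
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution failures.
        return CRITICAL, f"PAGING CRITICAL - {host}:{port} unreachable ({exc})"


def main(argv):
    """argv is [host, port]; prints one status line and returns the exit code."""
    if len(argv) != 2 or not argv[1].isdigit():
        print("PAGING UNKNOWN - usage: check_paging HOST PORT")
        return UNKNOWN
    code, message = check_tcp_service(argv[0], int(argv[1]))
    print(message)
    return code

# A wrapper script would finish with: sys.exit(main(sys.argv[1:]))
```

Nagios reads the first line of output and the exit code, so a failed
connection shows up as CRITICAL and triggers the email notification.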

James Robnett





