[evlatests] system recovery from power outage

Ken Sowinski ksowinsk at nrao.edu
Wed Oct 10 12:16:13 EDT 2012


In the hope that there are lessons for next time, here is a list
of the major problems I am aware of.

At the moment everything is working and stable except for a few
antennas which are out of service for maintenance work.

The most serious problem is that the correlator was not working
correctly because of timecode-related problems.  No timecode CRC
errors were reported, but the station boards were not able to
understand timecode and continually tried to reset the timing
FPGA.  This was true whether using timecode A or B.  This morning
in the light of day, it was found that the L350 was not synced to
the GPS second, so that the two components of timecode were not
consistent.  When this was fixed the station boards immediately
synced correctly to timecode.
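
A minimal sketch of why a clean CRC says nothing about whether the
contents make sense.  The frame layout below is made up purely for
illustration (a GPS second count plus the offset of the local
one-second tick from the GPS second boundary); it is not the real
timecode format, and all names are hypothetical.

```python
import struct
import zlib

# Hypothetical frame layout, for illustration only (NOT the real timecode):
#   4 bytes  GPS second count
#   4 bytes  offset of the local one-second tick from the GPS second, in us
#   4 bytes  CRC-32 over the two fields above
TOLERANCE_US = 1  # assumed alignment tolerance


def build_frame(gps_second: int, tick_offset_us: int) -> bytes:
    """Pack the two fields and append a CRC-32 computed over them."""
    payload = struct.pack(">II", gps_second, tick_offset_us)
    return payload + struct.pack(">I", zlib.crc32(payload))


def crc_ok(frame: bytes) -> bool:
    """Integrity check: did the bits arrive as they were sent?"""
    (crc,) = struct.unpack(">I", frame[8:12])
    return zlib.crc32(frame[:8]) == crc


def tick_aligned(frame: bytes) -> bool:
    """Semantic check: is the local tick aligned with the GPS second?"""
    _, tick_offset_us = struct.unpack(">II", frame[:8])
    return tick_offset_us <= TOLERANCE_US


# A frame built from an unsynced tick still carries a valid CRC.
frame = build_frame(gps_second=100, tick_offset_us=250_000)
print("CRC ok:", crc_ok(frame), " tick aligned:", tick_aligned(frame))
```

The point of the sketch is only that the two checks are independent:
a status display driven by the first one alone can stay green while
the second one fails.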

The warning here is that the crossbar boards were all green, as
were the station boards.  The former is to be expected, as the
crossbar boards don't deal with the semantics of timecode.  Bruce
is looking into why the timing FPGA was not more red.

Less serious, in that there were easy workarounds, were two
networking issues.  James will see to it that both are fixed
today.  Lag frames destined for CBE node three never made it
because the switch recognized the MAC addresses of the NICs in rack
three and treated packets for them specially.  The second problem
was that mchammer, which runs telcal, lost its connection to
lustre.

There were the usual correlator problems.  A number of baseline
boards did not come up cleanly and had to be restarted, most
likely because data into their RXPs was not stable or correct
enough to allow them to sync.  Delay module 0 in s102-t-3 (ea08-A)
deconfigured itself and the board had to be rebooted.

The only casualty at antennas I am aware of is that ea15-D304
cannot be turned on.


