[evlatests] Startup difficulties

Barry Clark bclark at nrao.edu
Wed Jan 2 11:43:30 EST 2008


> From phicks at nrao.edu  Wed Jan  2 07:31:34 2008
> 
>    Work Order		:  C121979
> 
>    Fault Code		:  FAILURE
> 
>    Work Requested		:  Following the shutdown over New
> Years, approx 22:55 IAT, running sysstartX, the D10 and Fringe rapidly
> output large amounts of data.  Every second or two, data sets from many
> previous integrations would output all at once.  Restarting the
> Executor, Idcaf, Operator Interface, and rebooting the System Controller
> did not fix it.  (After restarting Idcaf, the Executor began aborting
> scripts due to an Oracle table issue which Pat fixed.)  D10 stopped once
> or twice, Barry said Idcaf was sometimes not receiving data from the
> Executor and Idcaf was probably assuming the script had been aborted.
> No software alerts present.   Barry & James resolved the data issue by
> rebooting igloo and mchost.  (Note: after rebooting Igloo, Idcaf did not
> start automatically.)
> 

When the operator called me, the system was running Xsysstart, and he reported
that the operator displays were behaving as above.  I opened a window to
run D10, and found no data appearing.  About every ten minutes, data would
start to appear; ten-second records would come through for one or two
minutes and then stop again.  When no data were appearing, idcaf claimed to
be in 'idle' state, which is caused by not receiving u,v records from the
executor.
The executor appeared to be functioning normally, and indeed, when idcaf
wrote records, the data looked good, with normal amplitudes and stable phases,
so commands were being sent to antenna devices.

However, when the operator restarted the script, idcaf did not receive the
XML documents that should have accompanied this.

I had James look in Mchost, and he claimed there were no multicast packets
originating there.
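
A check of this kind can be sketched roughly as below.  This is only an
illustration, not what James actually ran: it joins a multicast group and
counts the datagrams that arrive in a fixed window.  The group address and
port in the usage comment are placeholders, not the real EVLA values, and
the function names are made up for the example.

```python
import socket
import struct
import time


def build_mreq(group: str, iface: str = "0.0.0.0") -> bytes:
    """Pack an ip_mreq: the group address followed by the local interface
    address (0.0.0.0 lets the kernel pick the interface)."""
    return socket.inet_aton(group) + socket.inet_aton(iface)


def count_multicast_packets(group: str, port: int, seconds: float) -> int:
    """Join `group` on `port` and count datagrams seen within `seconds`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        # Allow other listeners on the same host to share the port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        # Ask the kernel to deliver traffic addressed to this group.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        build_mreq(group))
        sock.settimeout(0.5)
        count = 0
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            try:
                sock.recvfrom(65535)
                count += 1
            except socket.timeout:
                pass  # nothing arrived in this half second; keep waiting
        return count
    finally:
        sock.close()


# Hypothetical usage (placeholder group and port):
#   n = count_multicast_packets("239.192.0.1", 50000, 10.0)
#   print("no multicast seen" if n == 0 else f"{n} packets in 10 s")
```

A count of zero on a listening host only shows that nothing was delivered
there; it does not by itself say whether the sender, the network, or the
receiving host dropped the traffic.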

It would appear that something was interfering specifically with multicast,
without affecting the UDP packets going to the antennas and to the correlator
controller.  (Both antennas and correlator controller have flywheels, so
moderate loss of data would not be noticed, but complete dropouts would
soon be noticed and things would stop working.)

Rebooting Mchost appears to have cured the problem.  (Igloo was rebooted
at the same time, although I did not suspect anything wrong over there.)

I haven't heard yet from Pat just what went wrong with the database, so
there is still a possibility that something screwed up in the long-lived
data caused the problem (duplicate nonzero DCS numbers produce slightly
different symptoms, but something along that line...).  However, I'm
thinking something more fundamental.


