[evla-sw-discuss] LTA RAM failures and CM

Ken Sowinski ksowinsk at nrao.edu
Thu Dec 5 13:43:14 EST 2019


It is becoming more obvious that we need a way to deal with the
increasing occurrences of LTA RAM failures.  These are the same RAM
chips as those which were replaced on all the delay daughter boards
and they are now failing at an increasing rate on baseline boards.

ConfigurationMapper has long had the capability to modify the usage of
correlator components based on entries in the CRM database but it has
never been used in earnest.  Relevant to this problem is that it is
possible to mark a row (or column) on the baseline board as bad to
prevent use of that row or column.  This facility has been tested and
shown to work.  One failure on a board can be bypassed in this way; if
no more than 24 antennas are in use, two independent failures can be
bypassed.  In fact more than one failure can be bypassed as long as all
the failures are on a row or column with the same ID.  If more
baseline boards than necessary are allocated to a subband, CM is able
to use those boards to provide resources when there are not enough
available rows/columns on a board with failed LTA RAMs.  This facility
has not yet been tested by us.
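
To make the bypass rules above concrete, here is a minimal sketch in
Python.  It is not how CM implements this.  The function name, the
(row, column) representation of a failed LTA RAM, and the reading that
retiring ID k removes both row k and column k are all assumptions made
for illustration.  The only facts taken from the text are that one ID
can be retired in general, and two when no more than 24 antennas are
in use.

```python
from itertools import combinations


def ids_to_retire(failures, antennas_in_use):
    """Return a smallest set of row/column IDs covering every failure.

    failures: iterable of (row_id, col_id) pairs, one per failed LTA RAM
              on a single baseline board (hypothetical representation).
    Returns None if the failures cannot all be bypassed on this board.
    """
    failures = list(failures)
    if not failures:
        return set()

    # From the text: one ID can be given up in general, two if no more
    # than 24 antennas are in use.
    budget = 2 if antennas_in_use <= 24 else 1

    # Only IDs that appear in some failure can help.
    candidates = sorted({i for pair in failures for i in pair})

    for n in range(1, budget + 1):
        for ids in combinations(candidates, n):
            # Assumption: retiring ID k bypasses any failure whose row
            # or column carries that ID.
            if all(r in ids or c in ids for r, c in failures):
                return set(ids)
    return None
```

With 27 antennas, ids_to_retire([(2, 5), (2, 7)], 27) finds the single
shared ID 2, while two failures with no ID in common return None and
would need either the 24 antenna limit or extra baseline boards.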

I suggest that we take the simplest path to begin with by allowing CM to
access the CRM database.  This will allow us to bypass one failure per
board and produce all valid correlation products.  If there are two failures
only one can be bypassed.  In the future, if we are pressed by increasing
failure rates, we can consider more sophisticated algorithms to add extra
baseline board pairs to the allocation list for each subband.  This will
only be possible for configurations which do not use all baseline boards.

The most serious side effect I am aware of is that some of our test
scripts which require all baseline board pairs will not work if they
include use of a baseline board which is marked as "bad".  We can
either live with this and create ad hoc VCI files as needed, or extend
the idea of an "expendable" subband to an attribute in the subBand
element of the VCI and teach CM to honor its intent.

Some things need to be verified to work as expected before we do this
routinely.
1.  Most important is that when m2s queries the CRM database to determine
     currently available baseline boards, it only records boards marked as "bad"
     and does not promote any component failure into a bad board condition.
     I once thought it did take any failure to be a board failure, but recent
     testing suggested that it did not.  A thoughtful answer or careful test
     is needed.
2.  Verify that CM will use extra baseline board pairs if they are allocated.
3.  Verify that the AllStationsMaxProd algorithm for autocorrelations
     continues to degrade to HalfStationsMaxProd if sufficient resources are
     not available.  This is relevant for test scripts because OPT does not
     generate VCI files using AllStationsMaxProd.
