[evla-sw-discuss] Widar board boot problem

James Robnett jrobnett at nrao.edu
Mon Oct 15 17:27:46 EDT 2012


As some of you know, the disk in widar-boot-2, which supports
WIDAR racks 5 through 8, died around noon today.  We've temporarily
redirected boards to widar-boot-1, but we're in a rather hybrid
state.

Boards that were already up still point at widar-boot-2.  As long
as they don't actually page any new files in, they're fine.  Normally
they don't, but eventually they will.

A few boards that tried to reboot after the disk failure have since
been rebooted and are running off widar-boot-1.  So we have a mix
of boards in racks 5 to 8: those booting off widar-boot-1, and
those blissfully unaware that any I/O they try to widar-boot-2
will fail, but that probably will never need to try.
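
To see which boards fall in which group, something like the sketch
below could help.  It is only an illustration: it assumes each board
exposes a Linux-style /proc/mounts with its root filesystem on NFS,
and the board names and export paths in the usage example are
hypothetical, not our real ones.

```python
def nfs_servers(mounts_text):
    """Return {mount_point: server} for NFS entries in /proc/mounts-style text."""
    servers = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fstype = fields[:3]
        # NFS devices look like "server:/export/path"
        if fstype.startswith("nfs") and ":" in device:
            servers[mount_point] = device.split(":", 1)[0]
    return servers


def group_by_server(board_mounts):
    """Group board names by the NFS server that backs their root filesystem."""
    groups = {}
    for board, text in board_mounts.items():
        server = nfs_servers(text).get("/", "unknown")
        groups.setdefault(server, []).append(board)
    return groups


# Hypothetical input: board name -> contents of that board's /proc/mounts
sample = {
    "board-a": "widar-boot-1:/export/root / nfs rw 0 0\nproc /proc proc rw 0 0\n",
    "board-b": "widar-boot-2:/export/root / nfs rw 0 0\n",
}
print(group_by_server(sample))
```

How the mounts text gets collected from each board is left out on
purpose, since that depends on how we normally log in to them.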

This is probably ok for a day or so but not a good plan long term.
Here's what we'd like to do.

1) Tomorrow or Wednesday (your pick) we shut down widar-boot-2, swap
in a new disk and sync it to widar-boot-1.  That should take about
2 hours.

The boards that are currently on widar-boot-2 should be fine while
we're doing the swap, but they might hang.  So we should find a window
when either we're not using them or losing them would be ok.

2) Once that's done we test a board, and if it boots we reboot all the
boards in racks 5 to 8.  The NFS filehandles will have changed, so
every board in those racks needs the reboot.  Otherwise we end up with
weird classes of boards: those booting off widar-boot-1, those that
had been rebooted and are running off widar-boot-2, and those that
hadn't been rebooted and are a train wreck waiting to happen.


3) In addition we'd like to replace the gbit SFP to mchammer.  It
caused some issues last week.  That probably takes about 5 seconds
and shouldn't even be noticeable except for a few dropped packets.
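
On the changed NFS filehandles in step 2: a board that was not
rebooted would typically see "stale file handle" (ESTALE) errors on
its old mount.  A minimal sketch of a check for that, assuming a
Python interpreter is available wherever it runs and that the path
passed in is on the NFS mount in question:

```python
import errno
import os


def is_stale(path):
    """Return True if stat() on path fails with ESTALE (stale NFS file handle)."""
    try:
        os.stat(path)
    except OSError as exc:
        # Other errors (missing file, permissions) are not treated as stale.
        return exc.errno == errno.ESTALE
    return False


print(is_stale("/"))
```

Note that on a hard-mounted NFS filesystem whose server is down, the
stat() call can block instead of returning an error, so this only
helps once the server is answering again.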

james


