[evla-sw-discuss] Testing CBE IB failures
Bryan Butler
bbutler at nrao.edu
Fri Mar 29 13:07:27 EDT 2019
I'm curious why we never saw this until recently though?
And, can we automate any of these IB diagnostics?
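
For what it's worth, here is a minimal sketch of what such automation might look like. It assumes the ibqueryerrors report keeps the per-port line format quoted further down in this thread, and that 65535 is the ceiling for these counters (that is the value they are pegged at in every report below). The function names and the SAMPLE text are made up for illustration; only the /usr/sbin/ibqueryerrors path comes from this thread.

```python
import re
import subprocess

# Assumption: the counters seen in this thread top out at 65535 (16 bits).
SATURATED = 0xFFFF

# Matches per-port lines such as:
#   GUID 0x2c90200477a08 port 13: [SymbolErrorCounter == 65535] ...
# The "port ALL" summary lines are skipped on purpose (no digits after "port").
PORT_LINE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\d+):\s*(.*)")
COUNTER = re.compile(r"\[(\w+) == (\d+)\]")

# Sample report text copied from the Mar 27 11:53 message in this thread.
SAMPLE = """\
Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
   GUID 0x2c90200477a08 port ALL: [SymbolErrorCounter == 65534] [LinkDownedCounter == 2] [PortRcvErrors == 679]
   GUID 0x2c90200477a08 port 13: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
   Link info: 3 13[ ] ==( Down/ Polling)==> [ ] "" ( )
   GUID 0x2c90200477a08 port 14: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1] [PortRcvErrors == 679]
   Link info: 3 14[ ] ==( Down/ Polling)==> [ ] "" ( )
"""


def parse_report(text):
    """Return {(guid, port): {counter_name: value}} for each per-port line."""
    ports = {}
    for line in text.splitlines():
        match = PORT_LINE.search(line)
        if match is None:
            continue
        guid, port, rest = match.groups()
        ports[(guid, int(port))] = {
            name: int(value) for name, value in COUNTER.findall(rest)
        }
    return ports


def saturated_counters(ports):
    """Return sorted (guid, port, counter_name) tuples pegged at the ceiling."""
    return sorted(
        (guid, port, name)
        for (guid, port), counters in ports.items()
        for name, value in counters.items()
        if value >= SATURATED
    )


def fetch_report(command=("/usr/sbin/ibqueryerrors",)):
    """Run ibqueryerrors and return its stdout (not exercised in this sketch)."""
    result = subprocess.run(list(command), capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    for guid, port, name in saturated_counters(parse_report(SAMPLE)):
        print(f"{guid} port {port}: {name} is saturated")
```

Run from cron against fetch_report() instead of SAMPLE, something like this could mail us when a port pegs a counter. It would not by itself tell a node shutdown apart from a bad cable or card.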
-Bryan
K. Scott Rowe wrote on 3/29/19 11:03:
> Paul and I noticed the SymbolErrorCounter on cbe-node-01 (Port 2)
> was at 65535.
>
> Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
> GUID 0x2c90200477a08 port 2: [PortRcvSwitchRelayErrors == 65535]
> Link info: 3 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 0x0002c903002829a7 23 1[ ] "cbe-node-01 mlx4_0" ( )
>
> So, I cleared the counters and shut down cbe-node-01 with "shutdown -h
> -H". Before I powered it off, it immediately went back to 65535 right
> about the time it stopped pinging.
>
> I am beginning to wonder if this is evidence of bad ports instead of
> bad cables or cards. We could do some tests by connecting one of
> these problematic nodes to the IB switch in the mccc rack.
>
> On Mar 27 11:53, K. Scott Rowe wrote:
> }Looking back in my history, I see the same thing when they accidentally
> }shut down cbe-node-12 (Port 13). So, maybe this is common when a node
> }is shut off.
> }
> }Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
> } GUID 0x2c90200477a08 port ALL: [SymbolErrorCounter == 65534] [LinkDownedCounter == 2] [PortRcvErrors == 679]
> } GUID 0x2c90200477a08 port 13: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
> } Link info: 3 13[ ] ==( Down/ Polling)==> [ ] "" ( )
> } GUID 0x2c90200477a08 port 14: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1] [PortRcvErrors == 679]
> } Link info: 3 14[ ] ==( Down/ Polling)==> [ ] "" ( )
> }
> }
> }On Mar 27 11:51, K. Scott Rowe wrote:
> }}And, as soon as Jeff shut down cbe-node-13 and cbe-node-31, the
> }}SymbolErrorCounter of both ports went to Max (65535).
> }}
> }}Perhaps this is a normal side-effect of powering off a node?
> }}
> }}Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
> }} GUID 0x2c90200477a08 port ALL: [SymbolErrorCounter == 65534] [LinkDownedCounter == 2]
> }} GUID 0x2c90200477a08 port 14: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
> }} Link info: 3 14[ ] ==( Down/ Polling)==> [ ] "" ( )
> }} GUID 0x2c90200477a08 port 32: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
> }} Link info: 3 32[ ] ==( Down/ Polling)==> [ ] "" ( )
> }}
> }}
> }}
> }}On Mar 27 11:46, K. Scott Rowe wrote:
> }}}I have now run iozone tests on both cbe-node-31 and cbe-node-13 and
> }}}have seen no errors. So I think they can be shut back down and
> }}}considered "fixed" and put back into the CBE at some later date.
> }}}
> }}}On Mar 27 11:27, K. Scott Rowe wrote:
> }}}}cbe-node-13 is up and no errors were reported on its port. I have
> }}}}started iozone tests and will run them for a little while.
> }}}}
> }}}}Apparently cbe-node-12's power cord was bumped in the process of
> }}}}working on cbe-node-13. Just a good reminder to be careful when
> }}}}doing such work.
> }}}}
> }}}}Strangely, we are still seeing PortXmitDiscards for m6036-widar Port 8
> }}}}and m6036-cb216b Port 8. I suppose we could reboot the switch itself
> }}}}to see if that clears things, otherwise I don't know what to do.
> }}}}
> }}}}On Mar 27 11:04, K. Scott Rowe wrote:
> }}}}}I have run three rounds of iozone on cbe-node-31, reading/writing to
> }}}}}/lustre/evla and have seen no errors in the ibqueryerrors report
> }}}}}for that port (Port 32). So, it may be there was a problem with
> }}}}}the cable to cbe-node-31.
> }}}}}
> }}}}}Meanwhile, I think the guys have accidentally shut down cbe-node-12 as
> }}}}}well as cbe-node-13. Jeff is going to try to straighten that out.
> }}}}}
> }}}}}On Mar 27 10:43, K. Scott Rowe wrote:
> }}}}}}They replaced the IB cables on cbe-node-13 and cbe-node-31 and booted both
> }}}}}}of them. I cleared the counters and saw the SymbolErrorCounter for
> }}}}}}cbe-node-13 (Port 14) quickly get back to 65535.
> }}}}}}
> }}}}}} Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
> }}}}}} GUID 0x2c90200477a08 port ALL: [SymbolErrorCounter == 65535] [PortRcvErrors == 124]
> }}}}}}
> }}}}}}So, Jeff is going to have them replace the card on cbe-node-13.
> }}}}}}
> }}}}}}Meanwhile, I will run some iozone tests on cbe-node-31 to see if I can
> }}}}}}produce any errors.
> }}}}}}
> }}}}}}On Mar 26 14:05, James Robnett wrote:
> }}}}}}}
> }}}}}}}For testing cbe-node-13 all I planned to do was:
> }}}}}}}
> }}}}}}}1) Clear the IB counters: /usr/sbin/ibqueryerrors -kK
> }}}}}}}2) Poll every minute or so with: /usr/sbin/ibqueryerrors -r -s PortXmitWait
> }}}}}}}
> }}}}}}}Once that's clearly working and no errors are showing up for
> }}}}}}}cbe-node-13, then run an iozone or bonnie test (or any other form
> }}}}}}}of test) to read/write between it and lustre.
> }}}}}}}
> }}}}}}}That *should* expose the issue that existed last week, given the
> }}}}}}}observation that as soon as cbe-node-13 tried to communicate with
> }}}}}}}lustre it began generating port errors. When it was idle it was
> }}}}}}}error free.
> }}}}}}}
> }}}}}}}James
> }
>