[evla-sw-discuss] [comms tkt #111047] Understand and correct CBE IB errors
James Robnett
jrobnett at nrao.edu
Fri Mar 22 11:48:13 EDT 2019
There was reportedly another VLASS failure overnight. In the 3 days
since I last cleared the counters we still only see issues on 3
interfaces. No new errors cropped up. I see a number of errors on port
14 (cbe-node-13). In addition I still see congestion
(PortXmitDiscards) on port 8 of both the widar switch and what I take
to be the westchamber IB switch (m6036-cb216b).
So I think we should do several things.
1) Decide whether we want to run diagnostics on cbe-node-13 or simply
turn it off. Either way I don't think we should go into the weekend
with it up.
2) Understand (Jeff, Martin and I?) what traffic could be flowing
between those two ports and possibly nuke it.
It may be that 2 is the real problem. My preference would be to only
make one change. Possibly we can spend the day deciding which one is best.
I've cleared the counters and re-run the query. After 15 minutes it
only shows problems on port 8 of both switches but I'm still suspicious
of port 14 (cbe-node-13), maybe not as much as those two port 8 errors
but still ...
James
cbe-master$ /usr/sbin/ibqueryerrors -r -s PortXmitWait
Errors for 0x7cfe900300f77680 "MF0;m6036-widar:SX6036/U1"
GUID 0x7cfe900300f77680 port ALL: [PortXmitDiscards == 140]
GUID 0x7cfe900300f77680 port 8: [PortXmitDiscards == 140]
Link info: 41 8[ ] ==( Down/ Polling)==> [ ] "" ( )
Errors for 0xec0d9a0300541040 "MF0;m6036-cb216b:SX6036/U1"
GUID 0xec0d9a0300541040 port ALL: [PortXmitDiscards == 140]
GUID 0xec0d9a0300541040 port 8: [PortXmitDiscards == 140]
Link info: 1 8[ ] ==( Down/ Polling)==> [ ] "" ( )
## Summary: 57 nodes checked, 2 bad nodes found
## 164 ports checked, 2 ports have errors beyond threshold
## Thresholds:
## Suppressed: PortXmitWait
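
For comparing runs after each counter reset, the report above can be reduced to a per-switch, per-port table. This is only a rough sketch: the function name parse_ibqueryerrors and the regular expressions are made up for illustration and assume the line layout shown in the output above ("Errors for ...", then "GUID ... port N: [Counter == value]"). The "port ALL" summary lines are skipped on purpose so totals are not double counted.

```python
import re

# Patterns assume the ibqueryerrors layout shown in the report above.
NODE_RE = re.compile(r'^\s*Errors for (0x[0-9a-fA-F]+) "([^"]*)"')
PORT_RE = re.compile(r'^\s*GUID (0x[0-9a-fA-F]+) port (\d+): (.*)$')
COUNTER_RE = re.compile(r'\[(\w+) == (\d+)\]')


def parse_ibqueryerrors(text):
    """Return {node_description: {port_number: {counter_name: value}}}."""
    report = {}
    node = None
    for line in text.splitlines():
        node_match = NODE_RE.match(line)
        if node_match:
            node = node_match.group(2)
            report.setdefault(node, {})
            continue
        port_match = PORT_RE.match(line)
        # "port ALL" lines do not match \d+ and are skipped here.
        if port_match and node is not None:
            port = int(port_match.group(2))
            counters = {
                name: int(value)
                for name, value in COUNTER_RE.findall(port_match.group(3))
            }
            report[node][port] = counters
    return report


SAMPLE = '''Errors for 0x7cfe900300f77680 "MF0;m6036-widar:SX6036/U1"
GUID 0x7cfe900300f77680 port ALL: [PortXmitDiscards == 140]
GUID 0x7cfe900300f77680 port 8: [PortXmitDiscards == 140]
Errors for 0xec0d9a0300541040 "MF0;m6036-cb216b:SX6036/U1"
GUID 0xec0d9a0300541040 port ALL: [PortXmitDiscards == 140]
GUID 0xec0d9a0300541040 port 8: [PortXmitDiscards == 140]
## Summary: 57 nodes checked, 2 bad nodes found
'''

report = parse_ibqueryerrors(SAMPLE)
for node_name, ports in report.items():
    for port_number, counters in sorted(ports.items()):
        print(node_name, port_number, counters)
```

Saving the parsed result from each run would make it easier to see whether port 14 shows up again or whether only the two port 8 entries keep growing.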