[evla-sw-discuss] Testing CBE IB failures
K. Scott Rowe
krowe at nrao.edu
Fri Mar 29 13:03:17 EDT 2019
Paul and I noticed the SymbolErrorCounter on cbe-node-01 (Port 2)
was at 65535.
Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
GUID 0x2c90200477a08 port 2: [PortRcvSwitchRelayErrors == 65535]
Link info: 3 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 0x0002c903002829a7 23 1[ ] "cbe-node-01 mlx4_0" ( )
So, I cleared the counters and shut down cbe-node-01 with "shutdown -h
-H". Before I powered it off, it immediately went back to 65535 right
about the time it stopped pinging.
I am beginning to wonder if this is evidence of bad ports instead of
bad cables or cards. We could do some tests by connecting one of
these problematic nodes to the IB switch in the mccc rack.
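
For scanning reports like the one above for pegged counters, here is a minimal Python sketch. It assumes the report text looks like the ibqueryerrors excerpts quoted in this thread, and it treats 65535 (the largest value a 16-bit counter can hold) as "saturated". The function name saturated_counters and the regular expressions are illustrative only, not part of any IB tool.

```python
import re

# Largest value a 16-bit counter can hold; readings at this value are
# treated as saturated (an assumption made for this sketch).
SATURATED = 0xFFFF  # 65535

# Patterns modelled on the report lines quoted in this thread.
PORT_LINE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\w+):\s*(.*)")
COUNTER = re.compile(r"\[(\w+)\s*==\s*(\d+)\]")


def saturated_counters(report):
    """Return (guid, port, counter) tuples for counters reading 65535."""
    hits = []
    for line in report.splitlines():
        # Drop mail quote markers ("}") and leading whitespace first.
        match = PORT_LINE.search(line.lstrip("} \t"))
        if not match:
            continue
        guid, port, rest = match.groups()
        if port == "ALL":
            continue  # skip the "ALL" line; report individual ports only
        for name, value in COUNTER.findall(rest):
            if int(value) == SATURATED:
                hits.append((guid, port, name))
    return hits


sample = """\
Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
 GUID 0x2c90200477a08 port ALL: [SymbolErrorCounter == 65534] [LinkDownedCounter == 2] [PortRcvErrors == 679]
 GUID 0x2c90200477a08 port 13: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
 GUID 0x2c90200477a08 port 14: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1] [PortRcvErrors == 679]
"""

for guid, port, name in saturated_counters(sample):
    print(f"{guid} port {port}: {name} is saturated")
```

A pegged counter only says "at least 65535 events since the last clear", so it is most useful right after clearing, when the time to saturate is the interesting number.
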
On Mar 27 11:53, K. Scott Rowe wrote:
}Looking back in my history, I see the same thing when they accidentally
}shut down cbe-node-12 (Port 13). So, maybe this is common when a node
}is shut off.
}
}Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
} GUID 0x2c90200477a08 port ALL: [SymbolErrorCounter == 65534] [LinkDownedCounter == 2] [PortRcvErrors == 679]
} GUID 0x2c90200477a08 port 13: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
} Link info: 3 13[ ] ==( Down/ Polling)==> [ ] "" ( )
} GUID 0x2c90200477a08 port 14: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1] [PortRcvErrors == 679]
} Link info: 3 14[ ] ==( Down/ Polling)==> [ ] "" ( )
}
}
}On Mar 27 11:51, K. Scott Rowe wrote:
}}And, as soon as Jeff shut down cbe-node-13 and cbe-node-31, the
}}SymbolErrorCounter of both ports went to Max (65535).
}}
}}Perhaps this is a normal side-effect of powering off a node?
}}
}}Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
}} GUID 0x2c90200477a08 port ALL: [SymbolErrorCounter == 65534] [LinkDownedCounter == 2]
}} GUID 0x2c90200477a08 port 14: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
}} Link info: 3 14[ ] ==( Down/ Polling)==> [ ] "" ( )
}} GUID 0x2c90200477a08 port 32: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
}} Link info: 3 32[ ] ==( Down/ Polling)==> [ ] "" ( )
}}
}}
}}
}}On Mar 27 11:46, K. Scott Rowe wrote:
}}}I have now run iozone tests on both cbe-node-31 and cbe-node-13 and
}}}have seen no errors. So I think they can be shut back down and
}}}considered "fixed" and put back into the CBE at some later date.
}}}
}}}On Mar 27 11:27, K. Scott Rowe wrote:
}}}}cbe-node-13 is up and no errors were reported on its port. I have
}}}}started iozone tests and will run them for a little while.
}}}}
}}}}Apparently cbe-node-12's power cord was bumped in the process of
}}}}working on cbe-node-13. Just a good reminder to be careful when
}}}}doing such work.
}}}}
}}}}Strangely, we are still seeing PortXmitDiscards for m6036-widar Port 8
}}}}and m6036-cb216b Port 8. I suppose we could reboot the switch itself
}}}}to see if that clears things, otherwise I don't know what to do.
}}}}
}}}}On Mar 27 11:04, K. Scott Rowe wrote:
}}}}}I have run three rounds of iozone on cbe-node-31, reading/writing to
}}}}}/lustre/evla and have seen no errors in the ibqueryerrors report
}}}}}for that port (Port 32). So, it may be there was a problem with
}}}}}the cable to cbe-node-31.
}}}}}
}}}}}Meanwhile, I think the guys have accidentally shut down cbe-node-12 as
}}}}}well as cbe-node-13. Jeff is going to try and straighten that out.
}}}}}
}}}}}On Mar 27 10:43, K. Scott Rowe wrote:
}}}}}}They replaced the IB cables on cbe-node-13 and cbe-node-31 and booted both
}}}}}}of them. I cleared the counters and saw cbe-node-13 (Port 14) quickly
}}}}}}get back to 65535 SymbolErrorCounter.
}}}}}}
}}}}}} Errors for 0x2c90200477a08 "MF0;m5030-widar:IS5030/U1"
}}}}}} GUID 0x2c90200477a08 port ALL: [SymbolErrorCounter == 65535] [PortRcvErrors == 124]
}}}}}}
}}}}}}So, Jeff is going to have them replace the card on cbe-node-13.
}}}}}}
}}}}}}Meanwhile, I will run some iozone tests on cbe-node-31 to see if I can
}}}}}}produce any errors.
}}}}}}
}}}}}}On Mar 26 14:05, James Robnett wrote:
}}}}}}}
}}}}}}}For testing cbe-node-13 all I planned to do was:
}}}}}}}
}}}}}}}1) Clear the IB counters: /usr/sbin/ibqueryerrors -kK
}}}}}}}2) Poll every minute or so with: /usr/sbin/ibqueryerrors -r -s PortXmitWait
}}}}}}}
}}}}}}}Once that's clearly working and no errors are showing up for
}}}}}}}cbe-node-13 then run an iozone or bonnie test (or any other form of
}}}}}}}test) to read/write between it and lustre.
}}}}}}}
}}}}}}}That *should* expose the issue that existed last week. The observation
}}}}}}}was that as soon as cbe-node-13 tried to communicate with lustre it
}}}}}}}began generating port errors. When it was idle it was error free.
}}}}}}}
}}}}}}}James
}
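
The clear-then-poll procedure at the bottom of the thread could be scripted along these lines. This is only a sketch: the command path and the flags (-kK, -r, -s PortXmitWait) are copied from the thread, while the function names, the one-minute default interval and the injectable run/sleep parameters are assumptions added here so the loop can be exercised without IB hardware.

```python
import subprocess
import time

IBQUERYERRORS = "/usr/sbin/ibqueryerrors"  # path as quoted in the thread


def clear_counters(run=subprocess.run):
    """Clear the IB counters using the -kK flags quoted in the thread."""
    return run([IBQUERYERRORS, "-kK"], capture_output=True, text=True)


def poll(rounds, interval_s=60, run=subprocess.run, sleep=time.sleep):
    """Collect `rounds` error reports, waiting `interval_s` between them.

    `run` and `sleep` are hypothetical injection points, so a fake
    runner can stand in for the real command during a dry run.
    """
    reports = []
    for i in range(rounds):
        result = run(
            [IBQUERYERRORS, "-r", "-s", "PortXmitWait"],
            capture_output=True,
            text=True,
        )
        reports.append(result.stdout)
        if i < rounds - 1:
            sleep(interval_s)  # no wait after the final report
    return reports


# Typical use on a host with the IB tools installed (not run here):
#   clear_counters()
#   for report in poll(rounds=10):
#       print(report)
```

Start the iozone or bonnie run only after a few quiet polls, so that any new errors can be tied to the lustre traffic rather than to the boot.
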