[evla-sw-discuss] MCCC Kernel upgrade and modification
James Robnett
jrobnett at nrao.edu
Mon Apr 7 17:35:31 EDT 2014
As some of you know, we suspect that the kernel on mccc and other
machines with a similar Ethernet chip has a bug. We see a pattern of
framing errors on servers with that chip and older kernels; servers
with that chip and a more recent RHEL 6.5 kernel do not show errors.
We can upgrade the OS on either mccctest or mccc next Wednesday. Doing
mccctest first delays updating mccc but reduces the risks inherent in
an OS upgrade. Doing mccc first increases the chances of resolving the
dropped delay models and out-of-order configurations sooner but
increases the risks from an OS upgrade.
I'll leave it to you guys to decide which you want. I suspect Kscott
will need to know by Wednesday of this week.
There's good reason to wait till Wednesday.
1) Kscott is planning to upgrade Gygax on Wednesday. That's the ssh
server and it also exhibits this framing error count. We should know by
later on Wednesday whether the OS upgrade fixes it.
2) I've been doing some snooping after talking to Bruce about plausible
scenarios where a race condition in the driver could result in corrupted
RX ring buffers (essentially freeing too early and corrupting the next
packet). Sure enough, there's a patch from last December that looks like
it could be related.
Furthermore, I dug into some of the counters and see that mccc is
reporting 212 drops and 3620 framing errors according to ifconfig.
The kernel more accurately (or at least more descriptively) reports
these as:
rxbds_empty: 3620
rx_discards: 212
The former is an RX buffer full state (not a framing error, despite what
ifconfig says); the latter is an actual RX discard.
There's a patch from December for that as well.
http://www.spinics.net/lists/netdev/msg260062.html
Basically the driver is misreporting ring full as overflow, which causes
ifconfig to report it as a framing error rather than a drop (buggers).
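For anyone who wants to pull the same two counters on another host, here is a
minimal Python sketch. It assumes the driver-level counters are read with
`ethtool -S <iface>`; the interface name passed to `read_stats` is whatever
applies on your machine, and the sample text contains only the two values
quoted above.

```python
import subprocess


def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` style output into a {name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # no "name: value" pair on this line
        value = value.strip()
        if value.isdigit():  # skips the "NIC statistics:" header line
            stats[name.strip()] = int(value)
    return stats


def read_stats(iface):
    """Run ethtool on a live interface (assumes ethtool is installed)."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_stats(result.stdout)


# Sample text built from the two counters quoted in this message.
SAMPLE = """NIC statistics:
     rxbds_empty: 3620
     rx_discards: 212
"""

stats = parse_ethtool_stats(SAMPLE)
for key in ("rxbds_empty", "rx_discards"):
    print(key, stats.get(key))
```

Reading the driver counters directly sidesteps the ifconfig labelling problem
described above.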
Last spring we played around with increasing the RX ring buffer to no
effect when chasing a different problem on mccc's old hardware. That's
the problem that was ultimately resolved by tweaking the driver interrupt
handler. I have not played with the ring buffer on this hardware.
I'd like to increase mccc's rx ring buffer tomorrow. This can be done
on the fly and should be non-disruptive.
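A rough sketch of how the resize could be prepared, assuming the usual
`ethtool -g` (show ring sizes) and `ethtool -G <iface> rx N` (set RX ring size)
commands. The interface name `eth0` and the ring sizes in the sample text are
hypothetical placeholders, not values taken from mccc.

```python
def parse_ring_params(text):
    """Parse `ethtool -g <iface>` output into {'max': {...}, 'current': {...}}."""
    params = {"max": {}, "current": {}}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Pre-set maximums"):
            section = "max"
        elif stripped.startswith("Current hardware settings"):
            section = "current"
        elif section and ":" in stripped:
            name, _, value = stripped.partition(":")
            value = value.strip()
            if value.isdigit():  # ignore non-numeric entries such as "n/a"
                params[section][name.strip()] = int(value)
    return params


def rx_resize_command(iface, params, requested):
    """Build the ethtool argv, clamping the request to the hardware maximum."""
    target = min(requested, params["max"]["RX"])
    return ["ethtool", "-G", iface, "rx", str(target)]


# Hypothetical sample output; real numbers depend on the NIC and driver.
SAMPLE_RING = """Ring parameters for eth0:
Pre-set maximums:
RX:             511
TX:             511
Current hardware settings:
RX:             200
TX:             511
"""

ring = parse_ring_params(SAMPLE_RING)
print(" ".join(rx_resize_command("eth0", ring, 4096)))
```

The command is only built and printed here, not executed, so it can be
reviewed before anyone runs it as root.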
I'll try to find a candidate system here in the AOC to tweak overnight
to see if it helps.
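One simple way to judge an overnight test is to snapshot the two counters
before and after and compare. The sketch below assumes each snapshot is a plain
dict of counter name to value; the "after" numbers in the example are made up
for illustration only.

```python
def counter_deltas(before, after, names=("rxbds_empty", "rx_discards")):
    """Return how much each named counter grew between two snapshots."""
    deltas = {}
    for name in names:
        old = before.get(name, 0)
        new = after.get(name, 0)
        # A smaller value suggests the counters were reset in between
        # (for example by a reboot), so count from zero in that case.
        deltas[name] = new - old if new >= old else new
    return deltas


before = {"rxbds_empty": 3620, "rx_discards": 212}  # values quoted above
after = {"rxbds_empty": 3620, "rx_discards": 215}   # hypothetical values
print(counter_deltas(before, after))
```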
James