[evla-sw-discuss] MCCC Kernel upgrade and modification

James Robnett jrobnett at nrao.edu
Mon Apr 7 17:35:31 EDT 2014


As some of you know, we suspect that the kernel on mccc and 
other machines with a similar ethernet chip has a bug.  We see a pattern 
of framing errors on servers with that chip and older kernels; servers 
with that chip and a more recent RHEL 6.5 kernel do not show errors.

We can upgrade the OS on either mccctest or mccc next Wednesday.  Doing 
mccctest first delays updating mccc but reduces risks inherent in doing 
an OS upgrade.  Doing mccc first increases the chances of resolving the 
dropped delay models and out of order configurations sooner but 
increases risks from an OS upgrade.

I'll leave it to you guys to decide which you want.  I suspect Kscott 
will need to know by Wednesday of this week.

There's good reason to wait until Wednesday:

1) Kscott is planning to upgrade Gygax on Wednesday.  That's the ssh 
server and it also shows these framing errors.  We should know by 
later on Wednesday whether the OS upgrade fixes it.

2) I've been doing some snooping after talking to Bruce about plausible 
scenarios where a race condition in the driver could result in corrupted 
RX ring buffers (essentially freeing too early and corrupting the next 
packet).  Sure enough there's a patch from last December that looks like 
it could be related.

Furthermore, I dug into some of the counters and see that mccc is 
reporting 212 drops and 3620 framing errors according to ifconfig.

The kernel more accurately (or at least more descriptively) reports 
these as:
rxbds_empty: 3620
rx_discards: 212

The former is an RX buffer full state (not a framing error, despite what 
ifconfig says); the latter is an actual RX discard.

There's a patch from December for that as well.
http://www.spinics.net/lists/netdev/msg260062.html
Basically the driver is misreporting ring full as overflow, which causes 
ifconfig to report it as a framing error rather than a drop (buggers).
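
For anyone who wants to pull those two counters out without relying on 
ifconfig's labels, here is a rough Python sketch.  It assumes the counters 
are read as "name: value" lines, the way `ethtool -S <iface>` prints 
driver statistics; the sample text and the function names are made up 
for illustration, only the two counter names and values come from mccc.

```python
import re

# Hypothetical sample in "ethtool -S" style; values are the mccc numbers above.
SAMPLE_STATS = """\
NIC statistics:
     rxbds_empty: 3620
     rx_discards: 212
"""


def parse_nic_stats(text):
    """Turn 'name: value' lines into a dict of ints, skipping header lines."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"^\s*([\w.\[\]-]+):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats


def summarize(stats):
    """Relabel the counters by what they mean rather than what ifconfig calls them."""
    return {
        "ring_full_events": stats.get("rxbds_empty", 0),  # ifconfig: "frame"
        "rx_discards": stats.get("rx_discards", 0),       # ifconfig: "dropped"
    }


if __name__ == "__main__":
    print(summarize(parse_nic_stats(SAMPLE_STATS)))
```

Feeding it real output would mean replacing SAMPLE_STATS with the text 
captured from the interface in question.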

Last spring we played around with increasing the RX ring buffer to no 
effect when chasing a different problem on mccc's old hardware.  That's 
the problem that was ultimately resolved by tweaking the driver interrupt 
handler.  I have not played with the ring buffer on this hardware.

I'd like to increase mccc's RX ring buffer tomorrow.  This can be done 
on the fly and should be non-disruptive.
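
As a rough sketch of what that change involves: `ethtool -g <iface>` 
reports the pre-set maximum and current ring sizes, and 
`ethtool -G <iface> rx N` sets a new RX size.  The Python below only 
parses a sample of the `-g` output and builds the command; it does not 
run anything.  The interface name, the 511/200 values and the choice to 
go straight to the maximum are assumptions for illustration, not mccc's 
actual settings.

```python
# Hypothetical "ethtool -g" output; the numbers are illustrative only.
SAMPLE_RING = """\
Ring parameters for eth0:
Pre-set maximums:
RX:             511
RX Mini:        0
RX Jumbo:       0
TX:             511
Current hardware settings:
RX:             200
RX Mini:        0
RX Jumbo:       0
TX:             511
"""


def parse_ring_params(text):
    """Split the output into 'max' and 'current' dicts of ring sizes."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Pre-set maximums"):
            current = "max"
        elif line.startswith("Current hardware settings"):
            current = "current"
        elif current and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            if value.isdigit():
                sections.setdefault(current, {})[key.strip()] = int(value)
    return sections


def rx_resize_command(iface, params):
    """Return the ethtool argv to raise RX to the maximum, or None if already there."""
    max_rx = params["max"]["RX"]
    cur_rx = params["current"]["RX"]
    if cur_rx >= max_rx:
        return None
    return ["ethtool", "-G", iface, "rx", str(max_rx)]


if __name__ == "__main__":
    print(rx_resize_command("eth0", parse_ring_params(SAMPLE_RING)))
```

Printing the command rather than executing it keeps the sketch safe to 
run anywhere; the actual change would still be made by hand as root.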

I'll try to find a candidate system here in the AOC to tweak overnight 
to see if it helps.

James


