[Difx-users] HPC using Ethernet vs Infiniband

Mark Kettenis kettenis at jive.eu
Tue Sep 21 04:30:47 EDT 2021


> Date: Mon, 20 Sep 2021 13:28:03 -0600 (MDT)
> From: Walter Brisken via Difx-users <difx-users at listmgr.nrao.edu>
> 
> Hi DiFX Users,

Hi Walter,

> In the not so distant future we at VLBA may be in a position to upgrade 
> the network backbone of the VLBA correlator.  Currently we have a 40 Gbps 
> Infiniband system dating back about 10 years.  At the time we installed that 
> system, Infiniband showed clear advantages, likely driven by RDMA capability 
> which offloads a significant amount of work from the CPU.  Now it seems 
> Ethernet has RoCE (RDMA over Converged Ethernet) which aims to do the same 
> thing.
> 
> 1. Does anyone have experience with RoCE?  If so, is this as easy to
> configure as the OpenMPI page suggests?  Any drawbacks of using it?

An important difference between Infiniband and Ethernet is that
Infiniband uses credit-based flow control.  This means that
Infiniband gives you pretty much guaranteed packet delivery, whereas
plain Ethernet will drop packets on the floor whenever a link or
switch buffer gets congested.  This is one of the reasons why
Infiniband still has better latency than something like RoCE.

> 2. Has anyone else gone through this decision process recently?  If so, any 
> thoughts or advice?

One thing that might be interesting for you to test is how important
RDMA is for your DiFX implementation.  You might be able to test this
on your existing Infiniband cluster by turning off RDMA.  I believe
that's possible in OpenMPI, but I'm not sure how.
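
That said, something along these lines might do it (I haven't tried
it myself, and the exact MCA parameter names depend on the OpenMPI
version and how it was built, so treat this as a sketch):

    # force OpenMPI onto plain TCP instead of the Infiniband/RDMA
    # transports (builds with the classic openib BTL):
    mpirun --mca btl self,tcp ...

    # builds that use UCX for Infiniband also need the UCX PML
    # bypassed:
    mpirun --mca pml ob1 --mca btl self,tcp ...

You may want to add the shared-memory BTL (vader or sm, depending on
the version) to the btl list if you run more than one process per
node.  Comparing a DiFX run with and without the RDMA transports
should give you a feel for how much the offload actually buys you.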

> 3. Has anyone run DiFX on an RoCE-based network?

None of the above, but in the upgrade of our SFXC cluster a few years
back we ditched Infiniband in favour of Ethernet, and I can't say I'm
entirely happy.  We're not doing RoCE though.

The problem we're facing is that even a small amount of packet loss
on a link seems to slow down the entire cluster.  Infiniband had its
issues too, but there a failure was usually more catastrophic and
therefore easier to diagnose.
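
For what it's worth, the per-interface drop counters on the Linux
hosts are a reasonable place to start looking for that kind of loss,
e.g.:

    ip -s link show dev <interface>
    ethtool -S <interface> | grep -i drop

(what ethtool reports depends on the NIC driver, so the exact counter
names will vary).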

The somewhat poor performance of our Ethernet solution may be down to
our own design choice of blade servers with integrated Ethernet
switches and dual links for extra bandwidth/redundancy.  A network
that is specifically designed for RoCE might fare better.

What all this may hint at is that for a RoCE solution you actually
want a dedicated Ethernet network to connect your machines.  Which
kinda takes the "C" out of RoCE.

Cheers,

Mark
