[Difx-users] {External} Re: HPC using Ethernet vs Infiniband

Greg Lindahl glindahl at cfa.harvard.edu
Fri Oct 1 00:31:40 EDT 2021


Hi Bill, funny to run into you in an astronomy context! (We know each other
from HPC, and have 126 mutual connections on LinkedIn!)

I'm, er, reasonably well known for having some HPC opinions, some common
and some unusual. See the bottom for a short summary of my involvement in
this industry.

In the DiFX case, I have not done an extensive analysis of what DiFX does,
but it appears that DiFX does a good job of overlapping compute and
communication, and that its messages are large. That makes it a relatively
easy workload for any network to handle. Nodes have a lot of compute cores
these days, but network bandwidths have grown enough to keep up with those
core counts.
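
To put rough numbers on why large messages make life easy -- a little Python,
with figures I'm assuming for the sketch, not measured DiFX values:

# Rough model: time per message = latency + size / bandwidth.
# Illustrative numbers only, not measured DiFX values.
latency_s = 5e-6              # assume a few microseconds of end-to-end latency
bandwidth_B_s = 100e9 / 8     # 100 Gb/s link, in bytes per second
msg_bytes = 1024 * 1024       # hypothetical 1 MiB message

transfer_s = msg_bytes / bandwidth_B_s
total_s = latency_s + transfer_s
print(f"serialization: {transfer_s * 1e6:.0f} us, "
      f"latency is {latency_s / total_s:.0%} of the total")
# At MiB-scale messages the per-message latency is a few percent of the total,
# so raw bandwidth matters far more than an exotic low-latency fabric.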

Now, the largest DiFX deployments I'm familiar with (Haystack and Bonn)
both have a somewhat unusual setup for I/O, in which a small number of
Mark6 recorder nodes have much greater network needs. Those nodes in
particular need to worry about efficiency, and so if you want to prototype
a new DiFX cluster, that's the one thing I would actually prototype -- it
only takes a couple of nodes to test.
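
If you do prototype it, a two-rank MPI streaming test between a would-be
recorder node and a compute node tells you most of what you need. Here's a
minimal sketch in Python using mpi4py -- my choice purely for brevity, since
DiFX itself is C++/MPI, but the measurement is the same idea. Launch one rank
on each of the two nodes:

# Minimal two-node bandwidth probe: rank 0 streams large messages to rank 1.
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

MSG_BYTES = 8 * 1024 * 1024     # large, DiFX-like messages (size assumed)
N_MSGS = 500
buf = bytearray(MSG_BYTES)

comm.Barrier()
t0 = time.perf_counter()
if rank == 0:
    for _ in range(N_MSGS):
        comm.Send([buf, MPI.BYTE], dest=1, tag=0)
elif rank == 1:
    for _ in range(N_MSGS):
        comm.Recv([buf, MPI.BYTE], source=0, tag=0)
comm.Barrier()
t1 = time.perf_counter()

if rank == 0:
    gbits = N_MSGS * MSG_BYTES * 8 / (t1 - t0) / 1e9
    print(f"one-way throughput: {gbits:.1f} Gb/s")

Whether that traffic actually rides on RoCE or quietly falls back to kernel
TCP depends on how your MPI runtime is configured -- more on that below.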

I suspect that at this point Adam is nodding and wants to say "Why, of
course! That's the whole point behind DiFX's design!" But it's nice to see
that these things still seem to hold after quite a lot of evolution in
computing gear since DiFX was first deployed.

Given current speeds, I suspect that at 100 gigabits you'll be fine with any
mechanism that bypasses the Linux kernel -- such as the libpsm3 / libucx
combination in OpenMPI that Jan Wagner just mentioned -- as long as that's
good enough for whatever storage bottleneck you have. And yes, in many
circumstances on Ethernet that means using RoCE.
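
One cheap way to confirm you really are on a kernel-bypass path, rather than
silently falling back to TCP, is a small-message ping-pong: a few microseconds
of half-round-trip time is roughly what RDMA-style transports give you, while
kernel TCP is typically tens of microseconds (those thresholds are rules of
thumb I'm assuming, not guarantees). Again a sketch with mpi4py, two ranks on
two nodes:

# Ping-pong latency check between two ranks on two nodes.
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N_ITERS = 10000
buf = bytearray(8)            # tiny message, so the timing is latency-dominated

comm.Barrier()
t0 = time.perf_counter()
for _ in range(N_ITERS):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1, tag=1)
        comm.Recv([buf, MPI.BYTE], source=1, tag=1)
    else:
        comm.Recv([buf, MPI.BYTE], source=0, tag=1)
        comm.Send([buf, MPI.BYTE], dest=0, tag=1)
t1 = time.perf_counter()

if rank == 0:
    half_rtt_us = (t1 - t0) / N_ITERS / 2 * 1e6
    print(f"half round-trip latency: {half_rtt_us:.1f} us")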

I'm currently working on some next generation recorder prototypes. Our
first bet is that SSDs will eventually become cost-effective, because
"spinning rust" has a bandwidth problem that is getting worse over time. I
suspect I will be able to record 64 Gbps over 100 Gbps links to SSD, and
also play back at similar rates. So I'm using a "white box" 100 gigabit
Ethernet switch in this prototype. It will be interesting to see how these
technology bets play out over time.
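
For what it's worth, the arithmetic behind the SSD bet is simple; the
per-drive sustained write rate below is an assumption of mine, not a spec for
any particular drive:

# Back-of-envelope: how many SSDs to sustain a 64 Gb/s recording stream?
target_gbps = 64
target_B_s = target_gbps * 1e9 / 8        # 8 GB/s of payload

per_drive_B_s = 1.5e9                     # assumed sustained write per NVMe drive
drives = target_B_s / per_drive_B_s
print(f"{target_B_s / 1e9:.0f} GB/s needs about {drives:.1f} drives "
      f"at {per_drive_B_s / 1e9:.1f} GB/s each, before RAID or headroom margins")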

-- greg

p.s. So, in my life between dropping out of astronomy grad school in 1995
and working for the EHT now, I was a founder at a startup named PathScale.
I was the system architect for the first 3 generations of the InfiniPath /
TrueScale / Omni-Path interconnect. It was initially InfiniBand, because we
didn't have enough money to also build a switch chip. But by the time Intel
bought the technology, they gave up pretending it was InfiniBand. Intel
recently spun this technology out to a new company named Cornelis. I'm
pleased to see that some of the ideas have finally leaked out to the broader
community.




On Tue, Sep 21, 2021 at 7:26 AM Bill Boas via Difx-users <
difx-users at listmgr.nrao.edu> wrote:

> Walter, Adam, et al.,
>
> The Open Fabrics Alliance, www.openfabrics.org, developed the software
> for both IB and RoCE. I suggest your questions may well get useful responses,
> both for and against, by contacting the Alliance.
>
> One useful, rarely mentioned fact is that at the physical cable level the
> SERDES for both Ethernet and IB is identical in the NVIDIA (née Mellanox)
> chips and adapter cards, and the physical cable latency difference is the
> serialization time for serial (Ethernet) vs parallel (IB).
>
> So the criteria to consider are primarily in the software distributions
> and host interfaces, mostly PCIe. Here the options to evaluate are NVIDIA,
> Cornelis (née Intel's Omni-Path, IB by another label), and most recently UCX
> and CXL, both follow-ons from IB and OpenFabrics, which incidentally is
> coming up on 20 years from its conception. There is also GigaIO, which is
> physically a PCIe fabric.
>
> Bill.
> Bill Boas
> ex-Co-Founder OpenFabrics Alliance
> M: 510-375-8840
>
> On Mon, Sep 20, 2021 at 4:05 PM Adam Deller via Difx-users <
> difx-users at listmgr.nrao.edu> wrote:
>
>> I've spoken to people about RoCE, but not sure if any of them have gone
>> ahead and taken the plunge on it yet.  I'll ask around to update myself.
>>
>> Cheers,
>> Adam
>>
>>
>>
>> On Tue, 21 Sept 2021 at 05:28, Walter Brisken via Difx-users <
>> difx-users at listmgr.nrao.edu> wrote:
>>
>>>
>>> Hi DiFX Users,
>>>
>>> In the not so distant future, we at the VLBA may be in a position to
>>> upgrade the network backbone of the VLBA correlator.  Currently we have a
>>> 40 Gbps Infiniband system dating back about 10 years.  At the time we
>>> installed that system, Infiniband showed clear advantages, likely driven
>>> by RDMA capability, which offloads a significant amount of work from the
>>> CPU.  Now it seems Ethernet has RoCE (RDMA over Converged Ethernet), which
>>> aims to do the same thing.
>>>
>>> 1. Does anyone have experience with RoCE?  If so, is this as easy to
>>> configure as the OpenMPI page suggests?  Any drawbacks of using it?
>>>
>>> 2. Has anyone else gone through this decision process recently?  If so,
>>> any thoughts or advice?
>>>
>>> 3. Has anyone run DiFX on an RoCE-based network?
>>>
>>>         -Walter
>>>
>>> -------------------------
>>> Walter Brisken
>>> NRAO
>>> Deputy Assistant Director for VLBA Development
>>> (505)-234-5912 (cell)
>>> (575)-835-7133 (office; not useful during COVID times)
>>>
>>>
>>
>>
>> --
>> !=============================================================!
>> A/Prof. Adam Deller
>> ARC Future Fellow
>> Centre for Astrophysics & Supercomputing
>> Swinburne University of Technology
>> John St, Hawthorn VIC 3122 Australia
>> phone: +61 3 9214 5307
>> fax: +61 3 9214 8797
>>
>> office days (usually): Mon-Thu
>> !=============================================================!
>>
>
>
> --
> Bill.
> Bill Boas
> 510-375-8840
>
>


-- 
Greg Lindahl
Software Architect, Event Horizon Telescope
Smithsonian Astrophysical Observatory
60 Garden Street | MS 66 | Cambridge, MA 02138

