<div dir="ltr"><div dir="ltr">Hi Bill, funny to run into you in an astronomy context! (We know each other from HPC, and have 126 mutual connections on LinkedIn!)<div><br></div><div>I'm, er, reasonably well known for having some HPC opinions, some common and some unusual. See the bottom for a short summary of my involvement in this industry.</div><div><br></div><div>In the DiFX case, I have not done an extensive analysis of what DiFX does, but it appears that DiFX does a good job of handling overlap of compute and communication, and that its messages are large. This means that it's a relatively easy job for any network to deal with. Nodes have a lot of compute cores these days, but still, bandwidths are now high enough for these core counts.</div><div><br></div><div>Now, the largest DiFX deployments I'm familiar with (Haystack and Bonn) both have a somewhat unusual setup for I/O, in which a small number of Mark6 recorder nodes have much greater network needs. Those nodes in particular need to worry about efficiency, and so if you want to prototype a new DiFX cluster, that's the one thing I would actually prototype -- it only takes a couple of nodes to test.</div><div><br></div><div>I suspect that at this point Adam is nodding and wants to say "Why, of course! That's the whole point behind DiFX's design!" But still, it's nice to see that these things seem to still be true after quite a lot of evolution of computing gear since DiFX was first deployed.</div><div><br></div><div>Given these current speeds, I suspect that at 100 gigabits, if you use any mechanism that bypasses the Linux kernel -- such as the libpsm3 / libucx combo in OpenMPI that Jan Wagner just mentioned -- I think you'll be fine, as long as that's good enough for whatever storage bottleneck you have. And yes, in many circumstances, on Ethernet that means using RoCE.</div><div><br></div><div>I'm currently working on some next generation recorder prototypes. Our first bet is that SSDs will eventually become cost-effective, because "spinning rust" has a bandwidth problem that is getting worse over time. I suspect I will be able to record 64 gbps over 100 gbps links to SSD, and also play back at similar rates. So I'm using a "white box" 100 gigabit ethernet switch in this prototype. It will be interesting to see how these technology bets play out over time.</div><div><br></div><div>-- greg</div><div><br></div><div>p.s. So, in my life between dropping out of astronomy grad school in 1995 and working for the EHT now, I was a founder at a startup named PathScale. I'm the system architect for the first 3 generations of the InfiniPath / TrueScale / Omni-Path interconnect. It was initially InfiniBand, because we didn't have enough money to also build a switch chip. But by the time Intel bought the technology, they gave up pretending. Intel recently spun this technology out to a new company named Cornelis. 
-- greg

p.s. In my life between dropping out of astronomy grad school in 1995 and working for the EHT now, I was a founder at a startup named PathScale, and I was the system architect for the first three generations of the InfiniPath / TrueScale / Omni-Path interconnect. It was initially InfiniBand, because we didn't have enough money to also build a switch chip, but by the time Intel bought the technology, they had given up pretending it was InfiniBand. Intel recently spun this technology out to a new company named Cornelis. I'm pleased to see that some of the ideas have finally leaked out to the broader community.


On Tue, Sep 21, 2021 at 7:26 AM Bill Boas via Difx-users <difx-users@listmgr.nrao.edu> wrote:

Walter, Adam, et al.,

The Open Fabrics Alliance, www.openfabrics.org, developed the software for both IB and RoCE. I suggest your questions may well get useful responses, both for and against, by contacting the Alliance.

One useful, rarely mentioned fact is that at the physical cable level the SERDES for both Ethernet and IB is identical in the NVIDIA (née Mellanox) chips and adapter cards, and the physical cable latency difference is the serialization time for serial (Ethernet) vs. parallel (IB).

So the criteria to consider are primarily in the software distributions and host interfaces, mostly PCIe. Here the options to evaluate are NVIDIA, Cornelis (née Intel's Omni-Path, IB by another label), and most recently UCX and CXL, both follow-ons from IB and OpenFabrics, which incidentally is coming up on 20 years from its conception. There is also GigaIO, which is physically a PCIe fabric.

Bill.
Bill Boas
ex-Co-Founder, OpenFabrics Alliance
M: 510-375-8840


On Mon, Sep 20, 2021 at 4:05 PM Adam Deller via Difx-users <difx-users@listmgr.nrao.edu> wrote:

I've spoken to people about RoCE, but I'm not sure whether any of them have gone ahead and taken the plunge yet. I'll ask around to update myself.

Cheers,
Adam


On Tue, 21 Sept 2021 at 05:28, Walter Brisken via Difx-users <difx-users@listmgr.nrao.edu> wrote:
Hi DiFX Users,

In the not-so-distant future we at the VLBA may be in a position to upgrade the network backbone of the VLBA correlator. Currently we have a 40 Gbps InfiniBand system dating back about 10 years. At the time we installed that system, InfiniBand showed clear advantages, likely driven by its RDMA capability, which offloads a significant amount of work from the CPU. Now it seems Ethernet has RoCE (RDMA over Converged Ethernet), which aims to do the same thing.

1. Does anyone have experience with RoCE? If so, is it as easy to configure as the OpenMPI page suggests? Any drawbacks to using it?

2. Has anyone else gone through this decision process recently? If so, any thoughts or advice?

3. Has anyone run DiFX on a RoCE-based network?

-Walter

-------------------------
Walter Brisken
NRAO
Deputy Assistant Director for VLBA Development
(505)-234-5912 (cell)
(575)-835-7133 (office; not useful during COVID times)
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px">!=============================================================!<br><div dir="ltr" style="font-size:12.8px">A/Prof. Adam Deller </div><div dir="ltr" style="font-size:12.8px">ARC Future Fellow</div></div><div style="font-size:12.8px">Centre for Astrophysics & Supercomputing </div><div dir="ltr" style="font-size:12.8px">Swinburne University of Technology <br>John St, Hawthorn VIC 3122 Australia</div><div style="font-size:12.8px">phone: +61 3 9214 5307</div><div style="font-size:12.8px">fax: +61 3 9214 8797</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">office days (usually): Mon-Thu<br>!=============================================================!</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr">Bill.<div>Bill Boas</div><div>510-375-8840</div><div><br></div></div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr">Greg Lindahl<br>Software Architect, Event Horizon Telescope<br>Smithsonian Astrophysical Observatory<br>60 Garden Street | MS 66 | Cambridge, MA 02138<br></div></div>