<div dir="ltr">Hi Walter, everyone,<div><br></div><div>Coincidentally, I've been at a hackathon at Pawsey in WA this week, working on a general purpose GPU FX correlator with Chris Phillips, Mark Kettenis, Jamie Stevens, Cherie Day, and a couple of GPU mentors.  Rather than porting DiFX (or SFXC), we decided to make a simplified correlator with less flexibility that would be easier to port and compare.  The result, "fxkernel", processes exactly one subband (upper sideband, 2 polarisations) for exactly 1 subintegration, producing all the polarisation products.  You give it pointers to voltage data, and it gives you back the subintegration's worth of visibilities.  It does implement full pre-F fringe rotation along with the usual post-F fractional sample correction, and uses a standard FFT for channelisation, just like DiFX.  I've checked the produced visibilities for correctness against DiFX using simulated data with 500 km baselines and it seems fine.  The only things it doesn't do, speed-wise, compared to DiFX are buffered FFTs and chunked cross-multiplies.</div><div><br></div><div>Testing fxkernel on my desktop under the same conditions that you describe (10x VLBA antennas) but benchmarking only the processing (so not including the time to load the voltage data from disk or write the visibilities out), I get a throughput of around 250 Mbps aggregate (=25 Mbps/VLBA antenna) when using one CPU core.  This is on a recent Core i7:</div><div><br></div><div>


<p class="gmail-p1" style="margin:0px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz</span></p><br>I didn't run a fake DiFX correlation on this machine, but when running DiFX in normal reading mode (with all the datastream nodes on the same machine) I was getting an aggregate throughput of ~100 Mbps or so.  This isn't a particularly fair comparison, but at a coarse level (looking at both my tests and the numbers you give), it looks like fxkernel gives reasonably comparable performance to DiFX, as one would expect.  It probably wins a bit on simplicity because so much logic has been bypassed, but on the other hand it is probably slightly less cache-optimal in the cross-multiply.
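<div><br></div><div>For anyone unfamiliar with the F-then-X structure described above, here is a toy sketch in plain Python.  This is not the fxkernel code: the function names, the naive DFT, and the made-up fringe rate are all just assumptions for illustration, it handles one baseline and one polarisation only, and it leaves out the post-F fractional sample correction entirely.</div>

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) DFT, standing in for the FFT used for channelisation."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fringe_rotate(samples, phase_model):
    """Pre-F step: remove the model phase phase_model(t) (radians) from sample t."""
    return [s * cmath.exp(-1j * phase_model(t)) for t, s in enumerate(samples)]


def fx_correlate(v1, v2, nchan):
    """F step then X step: channelise each segment, cross-multiply, average."""
    nseg = len(v1) // nchan
    acc = [0j] * nchan
    for seg in range(nseg):
        lo = seg * nchan
        f1 = dft(v1[lo:lo + nchan])
        f2 = dft(v2[lo:lo + nchan])
        for k in range(nchan):
            acc[k] += f1[k] * f2[k].conjugate()
    return [a / nseg for a in acc]


# Toy input: a complex tone centred on channel 3, seen by two "antennas".
nchan = 8
tone = [cmath.exp(2j * math.pi * 3 * t / nchan) for t in range(4 * nchan)]

# Hypothetical residual fringe rate (radians per sample) on the second antenna.
rate = 0.01
drifting = [s * cmath.exp(1j * rate * t) for t, s in enumerate(tone)]

# Stop the fringes before the F step, then correlate.
stopped = fringe_rotate(drifting, lambda t: rate * t)
vis = fx_correlate(tone, stopped, nchan)
```

<div>With the fringes stopped, all the power ends up in channel 3 with zero phase; skipping the fringe_rotate call leaves a phase slope across the integration instead.</div>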

</div><div><br></div><div>Anyway, the reason I bring this up is that we have just about finished the GPU port of fxkernel (when I left to go to the airport a couple of hours ago, it was producing fringes on the 500 km baseline simulated data, but not yet giving identical-enough results to fxkernel, so there must be a small bug still), so we will soon be in a position to publicise some numbers on the speed-up on the GPU.  It'll be a different dataset, but just looking at the runtime ratio for gpufxkernel vs fxkernel and knowing that fxkernel is pretty comparable to DiFX will give you an idea of how useful it might potentially be.  I'll leave Chris to email the results when available, but I'll put out a teaser by saying it is looking pretty worthwhile at this stage!</div><div><br></div><div>Cheers,</div><div>Adam</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 19 April 2018 at 12:19, Walter Brisken <span dir="ltr">&lt;<a href="mailto:wbrisken@lbo.us" target="_blank">wbrisken@lbo.us</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

Hi DiFX Users,<br>

<br>

Over the past few days I've had the chance to benchmark a few different CPU types.  The test was 5 seconds of VLBA data: 10 stations, Mark5B format, 2048 Mbps.  All machines being tested had separate datastream nodes running in "fake" mode, with essentially unlimited throughput. 40 Gbps Infiniband fabric was used.  The top two boxes are new and were on loan for the DiFX tests (thanks Casey Law!).  The third is one of our new Mark6 units attached to our correlator.  The last three entries represent the compute nodes we have on the DiFX cluster in Socorro.  With one exception, each box tested was a dual-CPU setup.<br>

<br>

I give two performance numbers for each tested CPU:<br>

<br>

1 core: the bit-rate per VLBA antenna (Mbps) that one core of the stated CPU type could digest.  2048 divided by this number is the number of cores, assuming perfect scalability, required to process VLBA data in real-time.<br>

<br>

1 box: the bit-rate per VLBA antenna (Mbps) that one server box can digest.  In each case one "DiFX thread" was spawned for each physical core in the box.<br>

<br>

And the results are...<br>

<br>

CPU             1 core  1 box   Notes<br>

---             ------  -----   -----<br>

2x Gold 6126    19.0    367     Excellent scalability w/ cores<br>

2x Gold 5115    13.5    242     Probably limited by TDP<br>

1x E5-2650v3    20.4    165     Unexpectedly good single-core performance<br>

2x E5-2670v2    19.5    272     Fastest in DiFX cluster; terrible scalability<br>

2x X5650        13.0    110     Medium fastest nodes in DiFX cluster<br>

2x E5520        10.7    64.6    Slowest CPUs in DiFX cluster<br>

<br>
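The arithmetic behind the "1 core" and "1 box" columns can be sketched in a few lines of Python.  The numbers are copied from the table above; the cores-per-box values are sockets times the per-CPU core counts listed further down, and the function names and the "scaling efficiency" ratio (1 box divided by cores times 1 core) are just one assumed way of putting a number on scalability.<br>

```python
import math

VLBA_RATE_MBPS = 2048  # per-antenna data rate used in the benchmark

# (label, physical cores per box, 1-core Mbps/antenna, 1-box Mbps/antenna)
BENCHMARKS = [
    ("2x Gold 6126", 24, 19.0, 367.0),
    ("2x Gold 5115", 20, 13.5, 242.0),
    ("1x E5-2650v3", 10, 20.4, 165.0),
    ("2x E5-2670v2", 20, 19.5, 272.0),
    ("2x X5650", 12, 13.0, 110.0),
    ("2x E5520", 8, 10.7, 64.6),
]


def cores_for_real_time(one_core_mbps, rate_mbps=VLBA_RATE_MBPS):
    """Whole cores needed for real time, assuming perfect scalability."""
    return math.ceil(rate_mbps / one_core_mbps)


def boxes_for_real_time(one_box_mbps, rate_mbps=VLBA_RATE_MBPS):
    """Whole boxes needed for real time at the measured per-box rate."""
    return math.ceil(rate_mbps / one_box_mbps)


def scaling_efficiency(cores, one_core_mbps, one_box_mbps):
    """Measured box throughput relative to cores x single-core throughput."""
    return one_box_mbps / (cores * one_core_mbps)


for label, cores, one_core, one_box in BENCHMARKS:
    print(
        f"{label:14s} cores needed: {cores_for_real_time(one_core):4d}  "
        f"boxes needed: {boxes_for_real_time(one_box):3d}  "
        f"scaling: {scaling_efficiency(cores, one_core, one_box):.2f}"
    )
```

<br>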

The main conclusion I can draw from this is that the Thermal Design Power (TDP) is a very good predictor of whole-CPU performance for a given process (e.g., 14nm or 22nm).  Above a certain clock*cores level, TDP per dollar may be the correct metric to use when selecting between CPU options for DiFX.  I'm not sure where the best value is.  I'd love to try something in the "Skylake-W" CPU line, especially Xeon W-2145.  That may be a good test of the TDP theory.<br>

<br>

Some details on each mentioned CPU:<br>

<br>

CPU             Cores   Clock   Process TDP     Release<br>

---             -----   -----   ------- ---     -------<br>

Gold 6126       12      2.6GHz  14nm    125W    Sep 2017<br>

Gold 5115       10      2.4GHz  14nm    85W     Sep 2017<br>

E5-2650v3       10      2.3GHz  22nm    105W    Sep 2014<br>

E5-2670v2       10      2.5GHz  22nm    115W    Sep 2013<br>

X5650           6       2.67GHz 32nm    95W     Mar 2010<br>

E5520           4       2.27GHz 45nm    80W     Mar 2009<br>

<br>

Xeon W-2145     8       3.7GHz  14nm    140W    Aug 2017<br>

<br>
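One rough way to look at the TDP idea is to divide each box's measured throughput by its total nominal TDP and compare within a process node.  The Python sketch below does that with the numbers from the two tables above; the function name is an assumption, and TDP here is the nominal rating, not measured power draw.<br>

```python
from collections import defaultdict

# (label, sockets, TDP per CPU in W, process node, 1-box Mbps/antenna)
BOXES = [
    ("2x Gold 6126", 2, 125, "14nm", 367.0),
    ("2x Gold 5115", 2, 85, "14nm", 242.0),
    ("1x E5-2650v3", 1, 105, "22nm", 165.0),
    ("2x E5-2670v2", 2, 115, "22nm", 272.0),
    ("2x X5650", 2, 95, "32nm", 110.0),
    ("2x E5520", 2, 80, "45nm", 64.6),
]


def mbps_per_watt(sockets, tdp_w, one_box_mbps):
    """Box throughput (Mbps per antenna) per watt of total nominal TDP."""
    return one_box_mbps / (sockets * tdp_w)


# Group by process node, since the claim is about CPUs on the same process.
by_process = defaultdict(list)
for label, sockets, tdp_w, process, one_box in BOXES:
    by_process[process].append((label, mbps_per_watt(sockets, tdp_w, one_box)))

for process, rows in by_process.items():
    for label, value in rows:
        print(f"{process}  {label:14s} {value:.2f} Mbps/antenna per W")
```

<br>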

        -Walter<br>

<br>

______________________________<wbr>_________________<br>

Difx-users mailing list<br>

<a href="mailto:Difx-users@listmgr.nrao.edu" target="_blank">Difx-users@listmgr.nrao.edu</a><br>

<a href="https://listmgr.nrao.edu/mailman/listinfo/difx-users" rel="noreferrer" target="_blank">https://listmgr.nrao.edu/mailm<wbr>an/listinfo/difx-users</a><br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr" style="font-size:12.8000001907349px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px">!=============================================================!<br>Dr. Adam Deller         </div><div dir="ltr" style="font-size:12.8px">ARC Future Fellow, Senior Lecturer</div><div style="font-size:12.8px">Centre for Astrophysics & Supercomputing </div><div dir="ltr" style="font-size:12.8px">Swinburne University of Technology    <br>John St, Hawthorn VIC 3122 Australia</div><div style="font-size:12.8px">phone: +61 3 9214 5307</div><div style="font-size:12.8px">fax: +61 3 9214 8797</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">office days (usually): Mon-Thu<br>!=============================================================!</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div>

</div>