[Difx-users] Some recent benchmarks
Adam Deller
adeller at astro.swin.edu.au
Thu Apr 19 05:16:57 EDT 2018
Hi Walter, everyone,
Coincidentally, I've been at a hackathon at Pawsey in WA this week, working
on a general purpose GPU FX correlator with Chris Phillips, Mark Kettenis,
Jamie Stevens, Cherie Day, and a couple of GPU mentors. Rather than
porting DiFX (or SFXC), we decided to make a simplified correlator with
less flexibility that would be easier to port and compare. The result,
"fxkernel", processes exactly one subband (upper sideband, x2
polarisations) for exactly 1 subintegration, producing all the polarisation
products. You give it pointers to voltage data, and it gives you back the
subintegration's worth of visibilities. It does implement full pre-F
fringe rotation along with the usual post-F fractional sample correction,
and uses a standard FFT for channelisation, just like DiFX. I've checked
the produced visibilities for correctness against DiFX using simulated data
with 500 km baselines and it seems fine. The only things it lacks
speed-wise compared to DiFX are buffered FFTs and chunked cross-multiplies.
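For anyone curious what those stages look like in practice, here is a minimal
single-baseline, single-polarisation sketch of the F and X steps described
above (pre-F fringe rotation, FFT channelisation, post-F fractional-sample
correction, cross-multiply). The function name, the simplified delay model,
and all parameters are my own illustration, not taken from fxkernel itself:

```python
import numpy as np

def fx_subint(v1, v2, nchan, rate_turns_per_samp, frac_delay_samp):
    """Correlate one subintegration of two single-pol voltage streams.

    v1, v2              : real-valued voltage samples (equal length)
    nchan               : number of spectral channels
    rate_turns_per_samp : residual fringe rate at station 2 (turns/sample)
    frac_delay_samp     : residual fractional-sample delay at station 2
    """
    nfft = 2 * nchan                  # real-to-complex FFT size
    nsub = len(v1) // nfft
    n = np.arange(nfft)
    vis = np.zeros(nchan, dtype=complex)
    for k in range(nsub):
        s1 = v1[k*nfft:(k+1)*nfft].astype(float)
        s2 = v2[k*nfft:(k+1)*nfft].astype(float)
        # pre-F fringe rotation: counter-rotate station 2 before the FFT
        phase = 2*np.pi * rate_turns_per_samp * (k*nfft + n)
        s2c = s2 * np.exp(-1j*phase)
        # F stage: channelise with a standard FFT (upper sideband channels)
        S1 = np.fft.rfft(s1)[:nchan]
        S2 = np.fft.fft(s2c)[:nchan]
        # post-F fractional-sample correction: linear phase across the band
        chan_freq = np.arange(nchan) / nfft       # cycles per sample
        S2 *= np.exp(-2j*np.pi * chan_freq * frac_delay_samp)
        # X stage: cross-multiply and accumulate
        vis += S1 * np.conj(S2)
    return vis / nsub
```

A real correlator would of course loop this over baselines and polarisation
pairs and use buffered FFTs, which is exactly the optimisation noted above.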
Testing fxkernel on my desktop under the same conditions that you describe
(10x VLBA antennas) but benchmarking only the processing (so not including
the time to load the voltage data from disk or write the visibilities out),
I get a throughput of around 250 Mbps aggregate (=25 Mbps/VLBA antenna)
when using one CPU core. This is on a recent Core i7:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
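To put that number in Walter's terms (2048 divided by the per-core rate gives
the cores needed for real-time, assuming perfect scalability), a quick
back-of-the-envelope sketch (function name is mine, just for illustration):

```python
def cores_for_realtime(per_core_mbps_per_antenna, record_rate_mbps=2048):
    """Cores needed to keep up with real time, assuming perfect scaling."""
    return record_rate_mbps / per_core_mbps_per_antenna

# fxkernel on the i7-6700: ~25 Mbps per VLBA antenna per core
print(cores_for_realtime(25))    # → 81.92, i.e. ~82 cores for real time
```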
I didn't run a fake DiFX correlation on this machine, but when running DiFX
in normal reading mode (with all the datastream nodes on the same machine)
I was getting an aggregate throughput of around 100 Mbps. This isn't a
particularly fair comparison, but at a coarse level (looking at both my
tests and the numbers you give), it looks like fxkernel gives reasonably
comparable performance to DiFX, as one would expect. It probably wins a
bit on simplicity because so much logic has been bypassed, but on the other
hand it is probably slightly less cache-optimal in the cross-multiply.
Anyway, the reason I bring this up is that we have just about finished
the GPU port of fxkernel (when I left to go to the airport a couple of
hours ago, it was producing fringes on the 500 km baseline simulated data,
but not yet giving identical-enough results to fxkernel, so there must be a
small bug still), so we will soon be in a position to publicise some
numbers on the speed-up on the GPU. It'll be a different dataset, but just
looking at the runtime ratio for gpufxkernel vs fxkernel and knowing that
fxkernel is pretty comparable to DiFX will give you an idea of how useful
it might be. I'll leave Chris to email the results when
available, but I'll put out a teaser by saying it is looking pretty
worthwhile at this stage!
Cheers,
Adam
On 19 April 2018 at 12:19, Walter Brisken <wbrisken at lbo.us> wrote:
>
> Hi DiFX Users,
>
> Over the past few days I've had the chance to benchmark a few different
> CPU types. The test was 5 seconds of VLBA data: 10 stations, Mark5B
> format, 2048 Mbps. All machines being tested had separate datastream nodes
> running in "fake" mode, with essentially unlimited throughput. 40 Gbps
> Infiniband fabric was used. The top two boxes are new and were on loan for
> the DiFX tests (thanks Casey Law!). The third is one of our new Mark6
> units attached to our correlator. The last three entries represent the
> compute nodes we have on the DiFX cluster in Socorro. With one exception,
> each box tested was a dual-CPU setup.
>
> I give two performance numbers for each tested CPU:
>
> 1 core: the bit-rate per VLBA antenna (Mbps) that one core of the stated
> CPU type could digest. 2048 divided by this number is the number of cores,
> assuming perfect scalability, required to process VLBA data in real-time.
>
> 1 box: the bit-rate per VLBA antenna (Mbps) that one server box can
> digest. In each case one "DiFX thread" was spawned for each physical core
> in the box.
>
> And the results are...
>
> CPU 1 core 1 box Notes
> --- ------ ----- -----
> 2x Gold 6126 19.0 367 Excellent scalability w/ cores
> 2x Gold 5115 13.5 242 Probably limited by TDP
> 1x E5-2650v3 20.4 165 Unexpectedly good single-core performance
> 2x E5-2670v2 19.5 272 Fastest in DiFX cluster; terrible scalability
> 2x X5650 13.0 110 Medium fastest nodes in DiFX cluster
> 2x E5520 10.7 64.6 Slowest CPUs in DiFX cluster
>
> The main conclusion I can draw from this is that the Thermal Design Power
> (TDP) is a very good predictor of whole-CPU performance for a given process
> (e.g., 14nm or 22nm). Above a certain clock*cores level, TDP per dollar
> may be the correct metric to use when selecting between CPU options for
> DiFX. I'm not sure where the best value is. I'd love to try something in
> the "Skylake-W" CPU line, especially Xeon W-2145. That may be a good test
> of the TDP theory.
>
> Some details on each mentioned CPU:
>
> CPU Cores Clock Process TDP Release
> --- ----- ----- ------- --- -------
> Gold 6126 12 2.6GHz 14nm 125W Sep 2017
> Gold 5115 10 2.4GHz 14nm 85W Sep 2017
> E5-2650v3 10 2.3GHz 22nm 105W Sep 2014
> E5-2670v2 10 2.5GHz 22nm 115W Sep 2013
> X5650 6 2.67GHz 32nm 95W Mar 2010
> E5520 4 2.27GHz 45nm 80W Mar 2009
>
> Xeon W-2145 8 3.7GHz 14nm 140W Aug 2017
>
> -Walter
>
> _______________________________________________
> Difx-users mailing list
> Difx-users at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>
--
!=============================================================!
Dr. Adam Deller
ARC Future Fellow, Senior Lecturer
Centre for Astrophysics & Supercomputing
Swinburne University of Technology
John St, Hawthorn VIC 3122 Australia
phone: +61 3 9214 5307
fax: +61 3 9214 8797
office days (usually): Mon-Thu
!=============================================================!