[Difx-users] Some recent benchmarks
Adam Deller
adeller at astro.swin.edu.au
Thu Apr 19 05:16:57 EDT 2018
Hi Walter, everyone,
Coincidentally, I've been at a hackathon at Pawsey in WA this week, working
on a general purpose GPU FX correlator with Chris Phillips, Mark Kettenis,
Jamie Stevens, Cherie Day, and a couple of GPU mentors. Rather than
porting DiFX (or SFXC), we decided to make a simplified correlator with
less flexibility that would be easier to port and compare. The result,
"fxkernel", processes exactly one subband (upper sideband, x2
polarisations) for exactly 1 subintegration, producing all the polarisation
products. You give it pointers to voltage data, and it gives you back the
subintegration's worth of visibilities. It does implement full pre-F
fringe rotation along with the usual post-F fractional sample correction,
and uses a standard FFT for channelisation, just like DiFX. I've checked
the produced visibilities for correctness against DiFX using simulated data
with 500 km baselines and it seems fine. The only things it lacks
speed-wise compared to DiFX are buffered FFTs and chunked cross-multiplies.
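For anyone curious what those stages look like in practice, here is a minimal
single-baseline, single-polarisation sketch of the F and X steps described
above (pre-F fringe rotation, FFT channelisation, post-F fractional-sample
correction, cross-multiply). The function name, the simplified delay model,
and all parameters are my own illustration, not taken from fxkernel itself:

```python
import numpy as np

def fx_subint(v1, v2, nchan, rate_turns_per_samp, frac_delay_samp):
    """Correlate one subintegration of two single-pol voltage streams.

    v1, v2              : real-valued voltage samples (equal length)
    nchan               : number of spectral channels
    rate_turns_per_samp : residual fringe rate at station 2 (turns/sample)
    frac_delay_samp     : residual fractional-sample delay at station 2
    """
    nfft = 2 * nchan                  # real-to-complex FFT size
    nsub = len(v1) // nfft
    n = np.arange(nfft)
    vis = np.zeros(nchan, dtype=complex)
    for k in range(nsub):
        s1 = v1[k*nfft:(k+1)*nfft].astype(float)
        s2 = v2[k*nfft:(k+1)*nfft].astype(float)
        # pre-F fringe rotation: counter-rotate station 2 before the FFT
        phase = 2*np.pi * rate_turns_per_samp * (k*nfft + n)
        s2c = s2 * np.exp(-1j*phase)
        # F stage: channelise with a standard FFT (upper sideband channels)
        S1 = np.fft.rfft(s1)[:nchan]
        S2 = np.fft.fft(s2c)[:nchan]
        # post-F fractional-sample correction: linear phase across the band
        chan_freq = np.arange(nchan) / nfft       # cycles per sample
        S2 *= np.exp(-2j*np.pi * chan_freq * frac_delay_samp)
        # X stage: cross-multiply and accumulate
        vis += S1 * np.conj(S2)
    return vis / nsub
```

A real correlator would of course loop this over baselines and polarisation
pairs and use buffered FFTs, which is exactly the optimisation noted above.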
Testing fxkernel on my desktop under the same conditions that you describe
(10x VLBA antennas) but benchmarking only the processing (so not including
the time to load the voltage data from disk or write the visibilities out),
I get a throughput of around 250 Mbps aggregate (=25 Mbps/VLBA antenna)
when using one CPU core. This is on a recent Core i7:
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
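To put that number in Walter's terms (2048 divided by the per-core rate gives
the cores needed for real-time, assuming perfect scalability), a quick
back-of-the-envelope sketch (function name is mine, just for illustration):

```python
def cores_for_realtime(per_core_mbps_per_antenna, record_rate_mbps=2048):
    """Cores needed to keep up with real time, assuming perfect scaling."""
    return record_rate_mbps / per_core_mbps_per_antenna

# fxkernel on the i7-6700: ~25 Mbps per VLBA antenna per core
print(cores_for_realtime(25))    # → 81.92, i.e. ~82 cores for real time
```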
I didn't run a fake DiFX correlation on this machine, but when running DiFX
in normal reading mode (with all the datastream nodes on the same machine)
I was getting an aggregate throughput of around 100 Mbps. This isn't a
particularly fair comparison, but at a coarse level (looking at both my
tests and the numbers you give), it looks like fxkernel gives reasonably
comparable performance to DiFX, as one would expect. It probably wins a
bit on simplicity because so much logic has been bypassed, but on the other
hand it is probably slightly less cache-optimal in the cross-multiply.
Anyway, the reason I bring this up is that we have just about finished
the GPU port of fxkernel (when I left to go to the airport a couple of
hours ago, it was producing fringes on the 500 km baseline simulated data,
but not yet giving identical-enough results to fxkernel, so there must be a
small bug still), so we will soon be in a position to publicise some
numbers on the speed-up on the GPU. It'll be a different dataset, but just
looking at the runtime ratio for gpufxkernel vs fxkernel and knowing that
fxkernel is pretty comparable to DiFX will give you an idea of how useful
it might be. I'll leave Chris to email the results when
available, but I'll put out a teaser by saying it is looking pretty
worthwhile at this stage!
Cheers,
Adam
On 19 April 2018 at 12:19, Walter Brisken <wbrisken at lbo.us> wrote:
>
> Hi DiFX Users,
>
> Over the past few days I've had the chance to benchmark a few different
> CPU types. The test was 5 seconds of VLBA data: 10 stations, Mark5B
> format, 2048 Mbps. All machines being tested had separate datastream nodes
> running in "fake" mode, with essentially unlimited throughput. 40 Gbps
> Infiniband fabric was used. The top two boxes are new and were on loan for
> the DiFX tests (thanks Casey Law!). The third is one of our new Mark6
> units attached to our correlator. The last three entries represent the
> compute nodes we have on the DiFX cluster in Socorro. With one exception,
> each box tested was a dual-CPU setup.
>
> I give two performance numbers for each tested CPU:
>
> 1 core: the bit-rate per VLBA antenna (Mbps) that one core of the stated
> CPU type could digest. 2048 divided by this number is the number of cores,
> assuming perfect scalability, required to process VLBA data in real-time.
>
> 1 box: the bit-rate per VLBA antenna (Mbps) that one server box can
> digest. In each case one "DiFX thread" was spawned for each physical core
> in the box.
>
> And the results are...
>
> CPU 1 core 1 box Notes
> --- ------ ----- -----
> 2x Gold 6126 19.0 367 Excellent scalability w/ cores
> 2x Gold 5115 13.5 242 Probably limited by TDP
> 1x E5-2650v3 20.4 165 Unexpectedly good single-core performance
> 2x E5-2670v2 19.5 272 Fastest in DiFX cluster; terrible scalability
> 2x X5650 13.0 110 Medium fastest nodes in DiFX cluster
> 2x E5520 10.7 64.6 Slowest CPUs in DiFX cluster
>
> The main conclusion I can draw from this is that the Thermal Design Power
> (TDP) is a very good predictor of whole-CPU performance for a given process
> (e.g., 14nm or 22nm). Above a certain clock*cores level, TDP per dollar
> may be the correct metric to use when selecting between CPU options for
> DiFX. I'm not sure where the best value is. I'd love to try something in
> the "Skylake-W" CPU line, especially Xeon W-2145. That may be a good test
> of the TDP theory.
>
> Some details on each mentioned CPU:
>
> CPU Cores Clock Process TDP Release
> --- ----- ----- ------- --- -------
> Gold 6126 12 2.6GHz 14nm 125W Sep 2017
> Gold 5115 10 2.4GHz 14nm 85W Sep 2017
> E5-2650v3 10 2.3GHz 22nm 105W Sep 2014
> E5-2670v2 10 2.5GHz 22nm 115W Sep 2013
> X5650 6 2.67GHz 32nm 95W Mar 2010
> E5520 4 2.27GHz 45nm 80W Mar 2009
>
> Xeon W-2145 8 3.7GHz 14nm 140W Aug 2017
>
> -Walter
>
> _______________________________________________
> Difx-users mailing list
> Difx-users at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>
--
!=============================================================!
Dr. Adam Deller
ARC Future Fellow, Senior Lecturer
Centre for Astrophysics & Supercomputing
Swinburne University of Technology
John St, Hawthorn VIC 3122 Australia
phone: +61 3 9214 5307
fax: +61 3 9214 8797
office days (usually): Mon-Thu
!=============================================================!