[Difx-users] Re: Infiniband device problems with Slurm/mpifxcorr

Adam Deller adeller at astro.swin.edu.au
Mon Aug 16 18:33:12 EDT 2021


Just to add to Walter's excellent description: while I've thankfully not
had to deal much with MPI's infiniband pickiness myself, I think that
Cormac Reynolds and/or Helge Rottmann have overcome similar problems on
their compute clusters in the past, so they might be able to chime in if
Walter's suggestions don't work.

Googling 'infiniband slurm mpi' yields a lot of hits, so I doubt you are
the only person who's come across similar issues!

Cheers,
Adam

On Tue, 17 Aug 2021 at 08:21, Walter Brisken via Difx-users <
difx-users at listmgr.nrao.edu> wrote:

>
> Hi Joe,
>
> I've not used Slurm or any process management layer above MPI, so I
> can't speak directly to your problem.
>
> I can give you one bit of advice on working out MPI problems though.
>
> mpifxcorr is a pretty complicated program, which can add complexity when
> sorting out MPI issues.  There is a program packaged with mpifxcorr called
> "mpispeed".  This program requires an even number of processes to be
> started at the same time.  All the program does is tell the odd-numbered
> processes to stream data as fast as possible to the even-numbered
> processes (1 goes to 2, 3 goes to 4, ...).  It exercises all of the
> machinery required to start the MPI processes without the dependencies on
> DiFX filesets, etc.
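>
> As a sketch only (mirroring the salloc/mpirun invocation from your
> message, and assuming mpispeed is installed alongside mpifxcorr on every
> node; check its usage message for any optional arguments), a minimal
> two-node test might look like:
>
>    salloc -N 2 mpirun -np 2 mpispeed
>
> If that streams data between the two processes without complaint, the MPI
> startup and transport machinery is probably fine and the problem is more
> likely in the mpifxcorr-specific setup.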
>
> On the MPI issue itself, a couple of things to try (some example
> commands are sketched below):
>
> 1. If some of your machines have multiple ethernet ports that are on
> different networks, then the routing tables need to be configured properly
> so you stay on the initiating network.
>
> 2. If you are using ssh between nodes, there can be subtle authentication
> issues that creep in.  Make sure you can ssh from the head node to each of
> the other nodes without entering a password or passphrase.
>
> 3. If you have infiniband, Open MPI will probably only try to use
> TCP/ethernet for the process management, not for IPC.  You might try
> removing the "--mca btl_tcp_if_include eth0" parameter.  You could even
> try forcibly excluding TCP with "--mca btl self,openib".  If you were
> explicitly using "btl_tcp_if_include" to work around network routing
> issues, see suggestion #1 above.
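>
> To make those suggestions concrete, here are some example commands (a
> sketch only; the host names, the IP address, and "eth0" are just the ones
> that appear in your messages):
>
>    # 1. See which interface and route a node would use to reach a peer:
>    ip route get 192.168.5.76
>
>    # 2. Check passwordless ssh from the head node to each compute node:
>    ssh nod50 hostname
>    ssh nod77 hostname
>
>    # 3. Drop the TCP interface pinning and let Open MPI choose transports:
>    salloc -N 7 mpirun -np 7 mpifxcorr ${EXPER}.input
>    # ...or explicitly limit the BTLs to self and openib:
>    salloc -N 7 mpirun -np 7 --mca btl self,openib mpifxcorr ${EXPER}.input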
>
>
> Hopefully something above helps a bit...
>
> -Walter
>
>
> -------------------------
> Walter Brisken
> NRAO
> Deputy Assistant Director for VLBA Development
> (505)-234-5912 (cell)
> (575)-835-7133 (office; not useful during COVID times)
>
> On Mon, 16 Aug 2021, Joe Skeens via Difx-users wrote:
>
> > Hi all,
> > I'm what you might call an MPI newbie, and I've been trying to run
> > mpifxcorr on a cluster with the Slurm scheduler and running into some
> > problems. In the cluster setup, there's an InfiniBand device that
> > handles communication between nodes, but the setup doesn't seem to
> > recognize/utilize it properly.
> >
> > For the command line prompt,
> > salloc -N 7 mpirun -np 7 mpifxcorr ${EXPER}.input
> >
> > I get:
> > WARNING: There is at least non-excluded one OpenFabrics device found,
> > but there are no active ports detected (or Open MPI was unable to use
> > them). This is most certainly not what you wanted. Check your cables,
> > subnet manager configuration, etc. The openib BTL will be ignored for
> > this job. Local host: nod50
> >
> > This leads to a fatal failure to connect between nodes (I think):
> >
> > WARNING: Open MPI failed to TCP connect to a peer MPI process. This
> > should not happen. Your Open MPI job may now fail. Local host: nod77
> > PID: 4410 Message: connect() to 192.168.5.76:1024 failed
> > Error: Operation now in progress (115)
> >
> > Notably, if I force connection through an ethernet device with the
> > command line prompt,
> >
> > salloc -N 7 mpirun -np 7 --mca btl_tcp_if_include eth0 mpifxcorr ${EXPER}.input
> >
> > mpifxcorr runs with no problem, although presumably at a large loss in
> > efficiency.
> >
> > This may be impossible to diagnose without knowing more about the
> > server/cluster architecture, but I figured I'd see if anyone else has
> > run into similar issues and found a solution. It's also entirely
> > possible I'm missing something obvious.
> >
> >
> > Thanks,
> >
> > Joe Skeens
> >
> >
> >
> >_______________________________________________
> Difx-users mailing list
> Difx-users at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>


-- 
!=============================================================!
A/Prof. Adam Deller
ARC Future Fellow
Centre for Astrophysics & Supercomputing
Swinburne University of Technology
John St, Hawthorn VIC 3122 Australia
phone: +61 3 9214 5307
fax: +61 3 9214 8797

office days (usually): Mon-Thu
!=============================================================!