[Difx-users] Infiniband device problems with Slurm/mpifxcorr

Walter Brisken wbrisken at nrao.edu
Mon Aug 16 18:21:12 EDT 2021


Hi Joe,

I've not used Slurm or any process-management layer above MPI, so I can't 
speak directly to your problem.

I can give you one bit of advice on working out MPI problems though.

mpifxcorr is a fairly complicated program, and that complexity gets in the 
way when sorting out MPI issues.  There is a program packaged with mpifxcorr 
called "mpispeed".  It requires an even number of processes to be started at 
the same time.  All it does is have the odd-numbered processes stream data 
as fast as possible to the even-numbered processes (1 goes to 2, 3 goes to 
4, ...).  It exercises all of the machinery required to start the MPI 
processes without any dependence on DiFX filesets, so it is a convenient 
way to isolate launcher and interconnect problems.  A minimal invocation is 
sketched below.
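
For example (the exact arguments mpispeed accepts may differ between DiFX 
versions, and the node and process counts below are only placeholders, so 
treat this as a sketch rather than a recipe):

salloc -N 8 mpirun -np 8 mpispeed

If the paired transfers run and report sensible rates, the Slurm/MPI launch 
machinery and the interconnect are probably fine and the problem lies 
elsewhere; if it fails the same way mpifxcorr does, you have a much smaller 
test case to debug.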

On the MPI issue itself, a couple of things to try:

1. If some of your machines have multiple ethernet ports that are on 
different networks, then the routing tables need to be configured properly 
so you stay on the initiating network.

2. If you are using ssh between nodes, there can be subtle authentication 
issues that creep in.  Make sure you can ssh from the head node to each of 
the other nodes without entering a password or passphrase (a quick check is 
sketched after this list).

3. If you have InfiniBand, Open MPI will probably only try to use 
TCP/ethernet for process management, not for IPC.  You might try removing 
the "--mca btl_tcp_if_include eth0" parameter.  You could even try forcing 
TCP out entirely with "--mca btl self,openib" (both variants are sketched 
after this list).  If you were explicitly using "btl_tcp_if_include" to 
work around network routing issues, see suggestion #1 above.
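
For the ssh check in #2, something as simple as the following should print 
each hostname without prompting (the node names here are just the two that 
appear in your log messages; the real list is whatever Slurm hands you):

for h in nod50 nod77; do ssh $h hostname; done

For #3, keeping your invocation otherwise unchanged, the two variants would 
look roughly like this (whether "openib" is still the right BTL name depends 
on your Open MPI version; newer releases route InfiniBand traffic through 
UCX instead):

salloc -N 7 mpirun -np 7 mpifxcorr ${EXPER}.input
salloc -N 7 mpirun -np 7 --mca btl self,openib mpifxcorr ${EXPER}.input

The first simply drops the btl_tcp_if_include restriction; the second tells 
Open MPI to use only the self and openib BTLs for message transport, so any 
fallback to TCP shows up as an immediate error rather than a silent 
slowdown.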


Hopefully something above helps a bit...

-Walter


-------------------------
Walter Brisken
NRAO
Deputy Assistant Director for VLBA Development
(505)-234-5912 (cell)
(575)-835-7133 (office; not useful during COVID times)

On Mon, 16 Aug 2021, Joe Skeens via Difx-users wrote:

> Hi all,
> I'm what you might call an MPI newbie, and I've been trying to run mpifxcorr on a cluster with the Slurm
> scheduler and running into some problems. In the cluster setup, there's an InfiniBand device that handles
> communication between nodes, but the setup doesn't seem to recognize/utilize it properly.
> 
> For the command line prompt,
> salloc -N 7 mpirun -np 7 mpifxcorr ${EXPER}.input
> 
> I get:
> WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports
> detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your
> cables, subnet manager configuration, etc. The openib BTL will be ignored for this job. Local host: nod50
> 
> This leads to a fatal failure to connect between nodes (I think):
> 
> WARNING: Open MPI failed to TCP connect to a peer MPI process. This should not happen. Your Open MPI job
> may now fail. Local host: nod77 PID: 4410 Message: connect() to 192.168.5.76:1024 failed Error:
> Operation now in progress (115)
> 
> Notably, if I force connection through an ethernet device with the command line prompt,
> 
> salloc -N 7 mpirun -np 7 --mca btl_tcp_if_include eth0 mpifxcorr ${EXPER}.input
> 
> mpifxcorr runs with no problem, although presumably at a large loss in efficiency.
> 
> This may be impossible to diagnose without knowing more about the server/cluster architecture, but I
> figured I'd see if anyone else has run into similar issues and found a solution. It's also entirely
> possible I'm missing something obvious.
> 
> 
> Thanks,
> 
> Joe Skeens
> 
> 

