[Difx-users] Infiniband device problems with Slurm/mpifxcorr

Joe Skeens jskeens1 at utexas.edu
Mon Aug 16 17:21:11 EDT 2021


Hi all,

I'm what you might call an MPI newbie. I've been trying to run mpifxcorr
on a cluster that uses the Slurm scheduler, and I'm running into some
problems. The cluster has an InfiniBand device that handles communication
between the nodes, but Open MPI doesn't seem to recognize or use it
properly.
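
(For context, the batch-script equivalent of what I'm doing would look
roughly like the sketch below. The node and task counts match the salloc
command I give next; the job name and any partition or module setup are
placeholders, not our actual configuration.)

#!/bin/bash
#SBATCH -N 7       # seven nodes, matching the salloc line below
#SBATCH -n 7       # seven tasks total, i.e. one MPI rank per node
#SBATCH -J difx    # placeholder job name

mpirun -np 7 mpifxcorr ${EXPER}.input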

With the command line
salloc -N 7 mpirun -np 7 mpifxcorr ${EXPER}.input

I get:
WARNING: There is at least non-excluded one OpenFabrics device found, but
there are no active ports detected (or Open MPI was unable to use them).
This is most certainly not what you wanted. Check your cables, subnet
manager configuration, etc. The openib BTL will be ignored for this job.
Local host: nod50

This leads to a fatal failure to connect between nodes (I think):

WARNING: Open MPI failed to TCP connect to a peer MPI process. This should
not happen. Your Open MPI job may now fail.
  Local host: nod77
  PID: 4410
  Message: connect() to 192.168.5.76:1024 failed
  Error: Operation now in progress (115)
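
(In case it helps with diagnosis: assuming the standard InfiniBand
userspace tools are installed on the compute nodes, these are the sorts of
checks I could run there to see what state the fabric is actually in. The
tool names are the usual ones, not anything specific to our cluster.)

# Check whether the HCA ports are actually up
ibstat          # port State should be "Active", physical state "LinkUp"
ibv_devinfo     # what the verbs layer (and hence the openib BTL) can see

# Check which interface owns the 192.168.5.x addresses the TCP BTL is using
ip -br addr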

Notably, if I force the connection through an Ethernet device with the
command line

salloc -N 7 mpirun -np 7 --mca btl_tcp_if_include eth0 mpifxcorr ${EXPER}.input

mpifxcorr runs with no problem, although presumably at a large loss in
efficiency.
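
(If the nodes have an IPoIB interface configured, I suppose I could also
point the TCP BTL at that instead of eth0, which would at least put the
traffic on the InfiniBand fabric even without the openib BTL. Here "ib0"
is just a guess at the interface name; the real name would come from
something like "ip -br addr".)

salloc -N 7 mpirun -np 7 --mca btl_tcp_if_include ib0 mpifxcorr ${EXPER}.input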

This may be impossible to diagnose without knowing more about the
server/cluster architecture, but I figured I'd see if anyone else has run
into similar issues and found a solution. It's also entirely possible I'm
missing something obvious.


Thanks,

Joe Skeens

