[Difx-users] Infiniband device problems with Slurm/mpifxcorr

Helge Rottmann rottmann at mpifr-bonn.mpg.de
Tue Aug 17 03:53:16 EDT 2021


Hi Joe,

The normal behaviour of OpenMPI is to inspect the possible routes between the nodes and then automatically choose the fastest
one available (in your case InfiniBand). While this usually works well, we have also run into situations where slow 1G Ethernet
was selected by OpenMPI despite InfiniBand being available. As Walter pointed out, this is typically related to problems
in the routing setup.
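If you want to see which transport OpenMPI actually picks, a verbose run can help. As a rough sketch (the verbosity level is just an example value):

salloc -N 7 mpirun -np 7 --mca btl_base_verbose 30 mpifxcorr ${EXPER}.input

The BTL messages printed at startup should show whether openib or tcp is being used between the nodes.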
Walter's suggestion of using
> --mca btl self,openib
> 
should take care of it.
Looking at the error message you got, it might indicate that only a single node (nod50) is unreachable via InfiniBand.
If the above fails and the error is indeed occurring on specific nodes only, you might want to check whether you can ping (or ssh) to those nodes.
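For example, a quick check could look like this (only a sketch; the node name nod50 and ${EXPER} are taken from your mails and may need adjusting):

salloc -N 7 mpirun -np 7 --mca btl self,openib mpifxcorr ${EXPER}.input
ping -c 3 nod50
ssh nod50 hostname

If nod50 does not answer the ping or the ssh, the problem lies with that node or the fabric setup rather than with OpenMPI itself.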

Cheers,
Helge

> On 17.08.2021 at 00:21, Walter Brisken via Difx-users <difx-users at listmgr.nrao.edu> wrote:
> 
> 
> Hi Joe,
> 
> I've not used Slurm or any process-management layer above MPI, so I can't directly address your problem.
> 
> I can give you one bit of advice on working out MPI problems though.
> 
> mpifxcorr is a pretty complicated program, and that complexity gets in the way when sorting out MPI issues.  There is a program packaged with mpifxcorr called "mpispeed".  This program requires an even number of processes to be started at the same time.  All it does is tell the odd-numbered processes to stream data as fast as possible to the even-numbered processes (1 goes to 2, 3 goes to 4, ...).  It exercises all of the machinery required to start the MPI processes without the dependencies on difx filesets, ...
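> As a rough sketch (the hostfile name is only a placeholder, and mpispeed may take further options), one could start it with an even number of processes like this:
> 
>   mpirun -np 6 --hostfile machines.txt mpispeed
> 
> If the transfers run at roughly the rate you expect from Infiniband, the MPI transport itself is fine and the problem lies elsewhere.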
> 
> On the MPI issue itself, a couple of things to try:
> 
> 1. If some of your machines have multiple ethernet ports that are on different networks, then the routing tables need to be configured properly so you stay on the initiating network.
> 
> 2. If you are using ssh between nodes, there can be subtle authentication issues that creep in.  Make sure you can ssh from the head node to each of the other nodes without entry of a password or passphrase.
> 
> 3. If you have infiniband, Open MPI will probably only try to use TCP/ethernet for the process management, not for IPC.  You might try removing the "--mca btl_tcp_if_include eth0" parameter.  You could even try forcefully excluding TCP with "--mca btl self,openib".  If you were explicitly using "btl_tcp_if_include" to work around network routing issues, see suggestion #1 above.  A sketch of commands for these three suggestions follows below.
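> As a sketch of the commands behind these three suggestions (the node name and peer address are taken from your error messages and may need adjusting):
> 
>   ip route get 192.168.5.76     # 1: shows which local interface is used to reach that peer
>   ssh nod77 hostname            # 2: should print the remote hostname without asking for a password or passphrase
>   salloc -N 7 mpirun -np 7 mpifxcorr ${EXPER}.input   # 3: same run, but without btl_tcp_if_include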
> 
> 
> Hopefully something above helps a bit...
> 
> -Walter
> 
> 
> -------------------------
> Walter Brisken
> NRAO
> Deputy Assistant Director for VLBA Development
> (505)-234-5912 (cell)
> (575)-835-7133 (office; not useful during COVID times)
> 
> On Mon, 16 Aug 2021, Joe Skeens via Difx-users wrote:
> 
>> Hi all,
>> I'm what you might call an MPI newbie, and I've been trying to run mpifxcorr on a cluster with the Slurm
>> scheduler and running into some problems. In the cluster setup, there's an InfiniBand device that handles
>> communication between nodes, but the setup doesn't seem to recognize/utilize it properly.
>> For the command line prompt,
>> salloc -N 7 mpirun -np 7 mpifxcorr ${EXPER}.input
>> I get:
>> WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports
>> detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your
>> cables, subnet manager configuration, etc. The openib BTL will be ignored for this job. Local host: nod50
>> This leads to a fatal failure to connect between nodes (I think):
>> WARNING: Open MPI failed to TCP connect to a peer MPI process. This should not happen. Your Open MPI job
>> may now fail. Local host: nod77 PID: 4410 Message: connect() to 192.168.5.76:1024 failed Error:
>> Operation now in progress (115)
>> Notably, if I force connection through an ethernet device with the command line prompt,
>> salloc -N 7 mpirun -np 7 --mca btl_tcp_if_include eth0 mpifxcorr ${EXPER}.input
>> mpifxcorr runs with no problem, although presumably at a large loss in efficiency.
>> This may be impossible to diagnose without knowing more about the server/cluster architecture, but I
>> figured I'd see if anyone else has run into similar issues and found a solution. It's also entirely
>> possible I'm missing something obvious.
>> Thanks,
>> Joe Skeens
> _______________________________________________
> Difx-users mailing list
> Difx-users at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/difx-users
------------------------------------------------------
Helge Rottmann
- Head VLBI Technology - 

Max-Planck-Institut für Radioastronomie
Auf dem Hügel 69
53121 Bonn
Germany

Tel: ++49 (0)228 525 123
------------------------------------------------------

