[Difx-users] DiFX mpirun problem

Chris.Phillips at csiro.au Chris.Phillips at csiro.au
Tue Aug 30 22:30:16 EDT 2016


Hi Stuart

This is probably an Open MPI issue rather than a DiFX one, caused by the hosts having multiple IP addresses.

Sorry, I don't have time for a full explanation.

Try editing or creating, in your home directory,

	.openmpi/mca-params.conf

And add something like:

btl_tcp_if_include=eth1


This assumes the IP you want is on eth1 on *all* machines. Change as appropriate. Read the Open MPI documentation for more details.
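
As a rough sketch of the above, the setting could be added from the shell as shown here. The helper name set_mca_param is made up for illustration, and eth1 is only an assumption about which interface carries the 163.7.128.x addresses; check that on each node first (for example with "ip -4 addr show").

```shell
# Hypothetical helper (not part of Open MPI or DiFX): append an MCA
# setting to the per-user parameter file unless it is already set there.
set_mca_param() {
    param="$1"
    value="$2"
    conf_dir="${HOME}/.openmpi"
    conf_file="${conf_dir}/mca-params.conf"
    mkdir -p "$conf_dir"
    if grep -q "^${param}[[:space:]]*=" "$conf_file" 2>/dev/null; then
        # Leave an existing setting alone rather than silently overriding it
        echo "${param} already set in ${conf_file}; edit it by hand" >&2
    else
        echo "${param}=${value}" >> "$conf_file"
    fi
}

# Assumption: eth1 is the interface with the wanted IP on every node.
set_mca_param btl_tcp_if_include eth1

# One-off alternative without the file: pass the same parameter to mpirun.
#   mpirun --mca btl_tcp_if_include eth1 -machinefile v534a_9.machines -np 12 mpifxcorr v534a_9.input
```

The file is read from the home directory of the user running mpirun, so if home directories are not shared between the nodes it probably needs to be present on each of them.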

Cheers
Chris


> On 31 Aug 2016, at 11:45 AM, Stuart Weston <nzobservers at gmail.com> wrote:
> 
> 
> Do IP addresses get added in when the code is compiled ?
> 
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -machinefile v534a_9.machines -np 12 mpifxcorr v534a_9.input
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node wark167
> [ww-flexbuf-01][[12885,1],6][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
> [ww-flexbuf-01][[12885,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
> [ww-flexbuf-01][[12885,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
> [ww-flexbuf-01][[12885,1],11][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
> 
> The correct IP address should be 163.7.128.11 and not 156.62.231.167.
> 
> I have checked "/etc/hosts" on both servers, and also stopped and restarted "rpcbind" just in case. I have tried putting the IP addresses in the machines file instead of the host names, but I still get the error.
> 
> I tried a very simple mpirun and that works, i.e.:
> 
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> cat hosts
> 163.7.128.194
> 163.7.128.11
> 
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -np 2 -hostfile hosts hostname
> ww-flexbuf-01
> wark167
> 
> Any ideas as to why it insists on picking up the wrong IP address ?
> 
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> cat v534a_9.machines
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.11
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -machinefile v534a_9.machines -np 12 mpifxcorr v534a_9.input
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node wark167
> [ww-flexbuf-01][[3498,1],6][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
> 
> 



