[Difx-users] DiFX mpirun problem

Stuart Weston nzobservers at gmail.com
Tue Aug 30 22:17:33 EDT 2016


It has just come to my attention:

I notice you are using a multicast group, DIFX_MESSAGE_GROUP=239.253.253.90.
Are you able to use unicast instead? We haven’t enabled multicasting out to
REANNZ.



Can I use unicast instead?

On Wed, Aug 31, 2016 at 1:45 PM, Stuart Weston <nzobservers at gmail.com>
wrote:

>
> Do IP addresses get compiled into the code?
>
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -machinefile v534a_9.machines
> -np 12 mpifxcorr v534a_9.input
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node wark167
> [ww-flexbuf-01][[12885,1],6][../../../../../../ompi/mca/btl/
> tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 156.62.231.167 failed: No route to host (113)
> [ww-flexbuf-01][[12885,1],5][../../../../../../ompi/mca/btl/
> tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 156.62.231.167 failed: No route to host (113)
> [ww-flexbuf-01][[12885,1],3][../../../../../../ompi/mca/btl/
> tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 156.62.231.167 failed: No route to host (113)
> [ww-flexbuf-01][[12885,1],11][../../../../../../ompi/mca/
> btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 156.62.231.167 failed: No route to host (113)
>
> The correct IP address should be 163.7.128.11 and not 156.62.231.167.
>
> I have checked "/etc/hosts" on both servers, and also stopped and restarted
> "rpcbind" just in case. I have tried putting the IP addresses in the
> machines file instead of the host names, but I still get the error.
>
> A very simple mpirun test works fine, i.e.:
>
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> cat hosts
> 163.7.128.194
> 163.7.128.11
>
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -np 2 -hostfile hosts hostname
> ww-flexbuf-01
> wark167
>
> Any ideas as to why it insists on picking up the wrong IP address?
>
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> cat v534a_9.machines
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.194
> 163.7.128.11
> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -machinefile v534a_9.machines
> -np 12 mpifxcorr v534a_9.input
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node ww-flexbuf-01
> About to run MPIInit on node wark167
> [ww-flexbuf-01][[3498,1],6][../../../../../../ompi/mca/btl/
> tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 156.62.231.167 failed: No route to host (113)
>
>
>
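
For anyone comparing the log against the machines file, here is a minimal sketch in Python of that check. It pulls every failed connect() target out of the mpirun output and reports the ones that are not listed in the machines file. The function name unexpected_targets and the regular expression are my own illustration, not part of DiFX or Open MPI, and the pattern assumes the error text has the form shown in the log above.

```python
import re

# Matches the "connect() to <IPv4> failed" text from the Open MPI TCP BTL
# errors quoted above. The pattern is an assumption based on that log only.
CONNECT_FAILURE = re.compile(r"connect\(\) to (\d{1,3}(?:\.\d{1,3}){3}) failed")


def unexpected_targets(log_text, machine_addresses):
    """Return failed connect() targets that are not in the machines file.

    Hypothetical helper: log_text is the captured mpirun output and
    machine_addresses is the list of addresses from the machines file.
    """
    expected = set(machine_addresses)
    result = []
    for addr in CONNECT_FAILURE.findall(log_text):
        # Keep first-seen order and drop duplicates.
        if addr not in expected and addr not in result:
            result.append(addr)
    return result


if __name__ == "__main__":
    log = (
        "connect() to 156.62.231.167 failed: No route to host (113)\n"
        "connect() to 156.62.231.167 failed: No route to host (113)\n"
    )
    machines = ["163.7.128.194", "163.7.128.11"]
    print(unexpected_targets(log, machines))  # prints ['156.62.231.167']
```

An address that shows up here but not in the machines file suggests that MPI is advertising a different network interface on the remote node. If that turns out to be the case (an assumption, not confirmed in this thread), Open MPI's btl_tcp_if_include MCA parameter is the usual way to restrict the TCP BTL to a given interface or subnet.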