[Difx-users] DiFX mpirun problem

Walter Brisken wbrisken at nrao.edu
Tue Aug 30 22:50:48 EDT 2016


I believe if you set DIFX_MESSAGE_GROUP to the IP address of a single 
recipient machine and the DIFX_MESSAGE_PORT to something that machine can 
listen on, that one machine will be able to see all of the DIFX messages. 
This probably is not what you want in practice as there is reason for 
multiple different machines to receive messages, but it still might buy 
you a bit of ground.
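
As a rough sketch of the unicast setup described above, the two variables could be set like this before launching the correlator. The address 163.7.128.194 (taken from the machines file quoted below and assumed here to be the one machine that should receive messages) and the port 50201 are illustrative assumptions, not values confirmed in this thread.

```shell
# Assumption: 163.7.128.194 is the single recipient machine.
export DIFX_MESSAGE_GROUP=163.7.128.194
# Assumption: any UDP port that machine can listen on; 50201 is illustrative.
export DIFX_MESSAGE_PORT=50201

# Show the destination that DiFX messages would be sent to.
echo "DiFX messages -> ${DIFX_MESSAGE_GROUP}:${DIFX_MESSAGE_PORT}"
```

With Open MPI, the variables may also need to reach the remote ranks, for example through mpirun's -x option (mpirun -x DIFX_MESSAGE_GROUP -x DIFX_MESSAGE_PORT ...); whether that is required depends on how the environment is set up on each node.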

I think Richard Dodson and/or Cormac Reynolds are the pioneers in using 
this, by necessity, when porting to some novel architecture.  Maybe one of 
them knows more about the actual performance of DiFX in such situations.

If so, perhaps a wiki page describing the reasons and consequences of this 
usage would be a good resource for the future!

 	-Walter


On Wed, 31 Aug 2016, Stuart Weston wrote:

> It has just come to my attention:
>
> I notice you are using a multicast group DIFX_MESSAGE_GROUP=239.253.253.90
> are you able to unicast instead? We haven't enabled multicasting out to
> REANNZ.
>
>
>
> Can I use unicast ?
>
> On Wed, Aug 31, 2016 at 1:45 PM, Stuart Weston <nzobservers at gmail.com>
> wrote:
>
>>
>> Do IP addresses get added in when the code is compiled ?
>>
>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -machinefile v534a_9.machines
>> -np 12 mpifxcorr v534a_9.input
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node wark167
>> [ww-flexbuf-01][[12885,1],6][../../../../../../ompi/mca/btl/
>> tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 156.62.231.167 failed: No route to host (113)
>> [ww-flexbuf-01][[12885,1],5][../../../../../../ompi/mca/btl/
>> tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 156.62.231.167 failed: No route to host (113)
>> [ww-flexbuf-01][[12885,1],3][../../../../../../ompi/mca/btl/
>> tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 156.62.231.167 failed: No route to host (113)
>> [ww-flexbuf-01][[12885,1],11][../../../../../../ompi/mca/
>> btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 156.62.231.167 failed: No route to host (113)
>>
>> The correct IP address should be 163.7.128.11 and not 156.62.231.167.
>>
>> I have checked "/etc/hosts" on both servers, and also stopped and
>> restarted "rpcbind" just in case. I have tried putting the IP addresses
>> in the machines file instead of the host names, but I still get the error.
>>
>> I tried a very simple mpirun and that works, i.e.:
>>
>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> cat hosts
>> 163.7.128.194
>> 163.7.128.11
>>
>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -np 2 -hostfile hosts hostname
>> ww-flexbuf-01
>> wark167
>>
>> Any ideas as to why it insists on picking up the wrong IP address ?
>>
>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> cat v534a_9.machines
>> 163.7.128.194
>> 163.7.128.194
>> 163.7.128.194
>> 163.7.128.194
>> 163.7.128.194
>> 163.7.128.194
>> 163.7.128.194
>> 163.7.128.11
>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -machinefile v534a_9.machines
>> -np 12 mpifxcorr v534a_9.input
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node ww-flexbuf-01
>> About to run MPIInit on node wark167
>> [ww-flexbuf-01][[3498,1],6][../../../../../../ompi/mca/btl/
>> tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 156.62.231.167 failed: No route to host (113)
>>
>>
>>
>
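
On the "No route to host" errors quoted above: Open MPI's TCP transport can try to connect over any interface it finds on a node, so if wark167 also has an interface at 156.62.231.167, restricting Open MPI to the intended network may be worth trying. This is a guess at the cause, not something confirmed in this thread, and the subnet 163.7.128.0/24 below is an assumption based on the two addresses in the hosts file.

```shell
# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables.
# Assumption: both hosts sit on 163.7.128.0/24; adjust to the real netmask.
export OMPI_MCA_btl_tcp_if_include=163.7.128.0/24

# Show the setting that subsequent mpirun invocations would pick up.
echo "btl_tcp_if_include=${OMPI_MCA_btl_tcp_if_include}"
```

The same parameter can be given on the command line instead, as in mpirun --mca btl_tcp_if_include 163.7.128.0/24 followed by the usual arguments.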

