[Difx-users] DiFX mpirun problem

Cormac Reynolds cormac.reynolds at gmail.com
Wed Aug 31 00:32:01 EDT 2016


hi,

Just a couple of additional points on this. Make sure you are running
DiFX-2.4.3 or later, or the loss of the multicast messages is likely to
cause the mpifxcorr process to hang.

On 31 August 2016 at 11:30,  <Chris.Phillips at csiro.au> wrote:
> Hi
>
> Walter is spot on with this.  Multicast addresses are all in a single block, so if you specify a “real” IP address, DIFX detects that and just uses unicast. So just set DIFX_MESSAGE_GROUP to the IP of the machine where you want to receive messages.
>
> Cormac is the expert on this now, needing it for LBA correlation.
>
> There are a few caveats - the main one is if you run simultaneous jobs you need to use a different port for each job.

espresso has smarts built in to make sure simultaneous jobs are not
using the same port, but currently that is only enabled for 'batch'
processing (i.e. the slurm job scheduler). It would be straightforward
to add that logic to the interactive mode if you are interested in
using that (just let me know).
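In the meantime the manual equivalent is simply to give each simultaneous
job its own port in the environment that mpifxcorr inherits. A rough sketch
(the addresses, port numbers and the second job's file names below are only
examples - I'm assuming 163.7.128.194 is the machine you want to receive
the messages):

# job 1
export DIFX_MESSAGE_GROUP=163.7.128.194   # a unicast address, so DiFX drops to unicast mode
export DIFX_MESSAGE_PORT=50201
mpirun -x DIFX_MESSAGE_GROUP -x DIFX_MESSAGE_PORT \
    -machinefile v534a_9.machines -np 12 mpifxcorr v534a_9.input

# job 2, started in another shell at the same time, gets its own port
export DIFX_MESSAGE_GROUP=163.7.128.194
export DIFX_MESSAGE_PORT=50202
mpirun -x DIFX_MESSAGE_GROUP -x DIFX_MESSAGE_PORT \
    -machinefile otherjob.machines -np 12 mpifxcorr otherjob.input

The -x flags are just one way of making sure the variables are visible on
the remote nodes; if your DiFX setup file already exports them everywhere
you can drop them.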


cheers,
Cormac.

>
> This is needed when your switch does not support multicast.
>
> Stuart: without *any* context given in your message, it is hard to know what you are trying to achieve. Often multicast is available within a cluster but not externally - in most cases this is all you need.
>
> errormon2 should support unicast also. If others have written difxmessage receivers, some trivial code changes are needed to allow unicast.
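(For anyone with their own difxmessage receiver, the change usually amounts
to only doing the multicast group join - the IP_ADD_MEMBERSHIP setsockopt -
when DIFX_MESSAGE_GROUP falls inside the 224.0.0.0/4 multicast range, and
otherwise just binding an ordinary UDP socket to DIFX_MESSAGE_PORT. Before
touching any code it is worth confirming the unicast messages actually
reach the receiving machine, e.g. with something like

sudo tcpdump -i any -n udp port 50201

on the recipient, where the port number is only the example value used
above - substitute whatever you set DIFX_MESSAGE_PORT to.)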
>
> Cheers
> Chris
>
>
>
>> On 31 Aug 2016, at 12:50 PM, Walter Brisken <wbrisken at nrao.edu> wrote:
>>
>>
>> I believe if you set DIFX_MESSAGE_GROUP to the IP address of a single recipient machine and the DIFX_MESSAGE_PORT to something that machine can listen on, that one machine will be able to see all of the DIFX messages. This probably is not what you want in practice as there is reason for multiple different machines to receive messages, but it still might buy you a bit of ground.
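(A quick sanity check that whatever port you pick is actually free on that
machine - 50201 here is just an example value:

ss -uln | grep 50201

No output means nothing else is bound to that UDP port yet.)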
>>
>> I think Richard Dodson and/or Cormac Reynolds are the pioneers in using this, by necessity, when porting to some novel architecture.  Maybe one of them knows more about the actual performance of DiFX in such situations.
>>
>> If so, perhaps a wiki page describing the reasons and consequences of this usage would be a good resource for the future!
>>
>>       -Walter
>>
>>
>> On Wed, 31 Aug 2016, Stuart Weston wrote:
>>
>>> It has just come to my attention:
>>>
>>> I notice you are using a multicast group DIFX_MESSAGE_GROUP=239.253.253.90 -
>>> are you able to use unicast instead? We haven't enabled multicasting out to
>>> REANNZ.
>>>
>>>
>>>
>>> Can I use unicast?
>>>
>>> On Wed, Aug 31, 2016 at 1:45 PM, Stuart Weston <nzobservers at gmail.com>
>>> wrote:
>>>
>>>>
>>>> Do IP addresses get added in when the code is compiled ?
>>>>
>>>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -machinefile v534a_9.machines
>>>> -np 12 mpifxcorr v534a_9.input
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node wark167
>>>> [ww-flexbuf-01][[12885,1],6][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
>>>> [ww-flexbuf-01][[12885,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
>>>> [ww-flexbuf-01][[12885,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
>>>> [ww-flexbuf-01][[12885,1],11][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
>>>>
>>>> The correct IP address should be 163.7.128.11 and not 156.62.231.167.
>>>>
>>>> I have checked "/etc/hosts" on both servers, and also stopped/started
>>>> "rpcbind" just in case. I have tried putting the IP addresses in the
>>>> machines file instead of the host names, but I still get the error.
>>>>
>>>> I tried with a very simple mpirun and that works fine, i.e.:
>>>>
>>>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> cat hosts
>>>> 163.7.128.194
>>>> 163.7.128.11
>>>>
>>>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -np 2 -hostfile hosts hostname
>>>> ww-flexbuf-01
>>>> wark167
>>>>
>>>> Any ideas as to why it insists on picking up the wrong IP address?
>>>>
>>>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> cat v534a_9.machines
>>>> 163.7.128.194
>>>> 163.7.128.194
>>>> 163.7.128.194
>>>> 163.7.128.194
>>>> 163.7.128.194
>>>> 163.7.128.194
>>>> 163.7.128.194
>>>> 163.7.128.11
>>>> oper at ww-flexbuf-01 DiFX-2.4.3 v534a> mpirun -machinefile v534a_9.machines
>>>> -np 12 mpifxcorr v534a_9.input
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node ww-flexbuf-01
>>>> About to run MPIInit on node wark167
>>>> [ww-flexbuf-01][[3498,1],6][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 156.62.231.167 failed: No route to host (113)
>>>>
>>>>
>>>>



-- 
----------------------------------------------------
Cormac Reynolds
email: cormac.reynolds at gmail.com
----------------------------------------------------


