[Difx-users] WARNING Could not open monitoring socket

Stuart Weston nzobservers at gmail.com
Thu Apr 21 22:58:41 EDT 2016


I have tee'd the output from mpirun and errormon2:

oper at ww-flexbuf-01 DiFX-2.4.3 hw04> cat mpirun.log
About to run MPIInit on node ww-flexbuf-01
About to run MPIInit on node ww-flexbuf-01
About to run MPIInit on node wark167
About to run MPIInit on node wark167
NOTE: difxmessage is in use.  If you are not running errormon/errormon2,
you are missing all the (potentially important) info messages!
[wark167 1]    INFO MPI Process 1 is running on host wark167
[wark167 3]    INFO MPI Process 3 is running on host wark167
[wark167 3]    INFO About to process the input file..
[wark167 1]    INFO About to process the input file..
[wark167 3]    INFO DIFX VERSION = DiFX-2.4.3
[wark167 1]    INFO DIFX VERSION = DiFX-2.4.3
[wark167 3]   DEBUG NS accumulate is 125 and max geom slip is 39.9397,
maxnsslip is 0
[wark167 3]   DEBUG NS accumulate is 125 and max geom slip is 25.2648,
maxnsslip is 0
[wark167 1]   DEBUG NS accumulate is 125 and max geom slip is 39.9397,
maxnsslip is 0
[wark167 1]   DEBUG NS accumulate is 125 and max geom slip is 25.2648,
maxnsslip is 0
[wark167 3]    INFO Receive socket opened; socket is -1
[wark167 3] WARNING Could not open command monitoring socket! Aborting
message receive thread.
[wark167 1]    INFO Receive socket opened; socket is -1
[wark167 1] WARNING Could not open command monitoring socket! Aborting
message receive thread.

oper at ww-flexbuf-01 DiFX-2.4.3 hw04> cat /tmp/errormon2.log
2016-04-22 14:56:26,955 DiFXAlert INFO    MPI[ 2] ww-flexbuf-01 hw04_1
  MPI Process 2 is running on host ww-flexbuf-01
2016-04-22 14:56:26,955 DiFXAlert INFO    MPI[ 0] ww-flexbuf-01 hw04_1
  MPI Process 0 is running on host ww-flexbuf-01
2016-04-22 14:56:26,956 DiFXAlert INFO    MPI[ 2] ww-flexbuf-01 hw04_1
  DIFX VERSION = DiFX-2.4.3
2016-04-22 14:56:26,956 DiFXAlert INFO    MPI[ 0] ww-flexbuf-01 hw04_1
  DIFX VERSION = DiFX-2.4.3
2016-04-22 14:56:26,957 DiFXAlert INFO    MPI[ 0] ww-flexbuf-01 hw04_1
  Receive socket opened; socket is 12
2016-04-22 14:56:26,957 DiFXAlert INFO    MPI[ 2] ww-flexbuf-01 hw04_1
  Receive socket opened; socket is 12

oper at wark167:/etc/network# cat /tmp/errormon2.log
2016-04-22 14:57:56,932 DiFXAlert INFO    MPI[ 2] ww-flexbuf-01 hw04_1
  MPI Process 2 is running on host ww-flexbuf-01
2016-04-22 14:57:56,933 DiFXAlert INFO    MPI[ 0] ww-flexbuf-01 hw04_1
  MPI Process 0 is running on host ww-flexbuf-01
2016-04-22 14:57:56,933 DiFXAlert INFO    MPI[ 2] ww-flexbuf-01 hw04_1
  DIFX VERSION = DiFX-2.4.3
2016-04-22 14:57:56,934 DiFXAlert INFO    MPI[ 0] ww-flexbuf-01 hw04_1
  DIFX VERSION = DiFX-2.4.3
2016-04-22 14:57:56,941 DiFXAlert INFO    MPI[ 2] ww-flexbuf-01 hw04_1
  Receive socket opened; socket is 12
2016-04-22 14:57:56,941 DiFXAlert INFO    MPI[ 0] ww-flexbuf-01 hw04_1
  Receive socket opened; socket is 12
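
To test the socket setup on each node outside mpifxcorr, a standalone sketch along the lines of the perror suggestion quoted below might look like the following. This is an illustration under stated assumptions, not the difxmessage code: the names is_multicast_group and open_mcast_verbose are hypothetical, the return codes are my own, and the multicast join step is a guess at what the library does after the steps shown in the quoted snippet.

```c
#define _DEFAULT_SOURCE   /* for inet_aton() and struct ip_mreq on glibc */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of difxmessage): returns 1 if `group` is an
 * IPv4 multicast address, 0 if it is a valid non-multicast address, and -1
 * if it cannot be parsed at all. */
int is_multicast_group(const char *group)
{
	struct in_addr a;

	if(inet_aton(group, &a) == 0)
	{
		return -1;
	}
	return IN_MULTICAST(ntohl(a.s_addr)) ? 1 : 0;
}

/* Hypothetical helper: open a UDP socket, bind it to `port` and join `group`.
 * Returns the socket descriptor, or a negative step code after printing the
 * reason for the failure to stderr. */
int open_mcast_verbose(const char *group, int port)
{
	struct sockaddr_in addr;
	struct ip_mreq mreq;
	int yes = 1;
	int sock;

	memset(&mreq, 0, sizeof(mreq));
	/* inet_aton() does not set errno, so report this failure by hand */
	if(inet_aton(group, &mreq.imr_multiaddr) == 0)
	{
		fprintf(stderr, "inet_aton: cannot parse '%s'\n", group);
		return -4;
	}
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);

	sock = socket(AF_INET, SOCK_DGRAM, 0);
	if(sock < 0)
	{
		perror("socket");
		return -1;
	}
	if(setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) < 0)
	{
		perror("setsockopt(SO_REUSEADDR)");
		close(sock);
		return -2;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons((unsigned short)port);
	if(bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
	{
		perror("bind");
		close(sock);
		return -3;
	}
	/* Assumed final step: join the multicast group on the default interface */
	if(setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0)
	{
		perror("setsockopt(IP_ADD_MEMBERSHIP)");
		close(sock);
		return -5;
	}
	return sock;
}
```

Calling open_mcast_verbose() from a small main() on each node, with the group from setup.bash (239.253.253.90) and whatever port the installation uses, should show which step fails on wark167, if any. The port has to be supplied by the caller; I have not assumed a value here.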




On Fri, Apr 22, 2016 at 1:25 PM, <Chris.Phillips at csiro.au> wrote:

> Hi Stuart
>
> I assume that the message was actually:
>
> "Could not open command monitoring socket! Aborting message receive
> thread."
>
> You really need to send the full output for us to have any chance of
> diagnosing this. When you say “nothing more” in errormon2, does ANYTHING
> appear there?
>
> The message receive thread is not important in most circumstances. However,
> if DiFX messaging is not working you will not get any logging messages.
>
> Which processes give this message and on which machines are they running?
>
> Unfortunately the difxmessage library does not report errors; it just
> returns if there were errors.
>
> I would suggest making a temporary change to difxmessage/multicast.c and
> recompiling it and mpifxcorr.
>
> Add some calls to perror before all the error returns in the routine
> openMultiCastSocket, e.g.:
>
>
>         /* Make UDP socket */
>         sock = socket(AF_INET, SOCK_DGRAM, 0);
>         if(sock < 0)
>         {
>                 /* perror appends ": " and the errno text itself */
>                 perror("Trying to create socket");
>                 return -1;
>         }
>
>         /* Allow reuse of port */
>         v = setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));
>         if(v < 0)
>         {
>                 perror("Setsockopt");
>                 return -2;
>         }
>
>         /* bind to receive address */
>         v = bind(sock, (struct sockaddr *)&addr, sizeof(struct sockaddr_in));
>         if(v < 0)
>         {
>                 perror("Binding to socket");
>                 return -3;
>         }
>
>         /* parse the multicast group address */
>         v = inet_aton(group, &mreq.imr_multiaddr);
>         if(!v)
>         {
>                 /* inet_aton does not set errno, so print the input instead */
>                 fprintf(stderr, "inet_aton: invalid address %s\n", group);
>                 return -4;
>         }
>
> I am pretty sure the problem is not the choice of multicast address: if the
> code cannot join the multicast group it should give a major warning.
>
> Just to double check - do you see the following message:
>
>  Unicast (XXXXX) difxMessage in use. Some functionallity may be reduced
>
> If you do, that's the problem.
>
> Cheers
> Chris
>
> On 22 Apr 2016, at 11:02 AM, Stuart Weston <nzobservers at gmail.com> wrote:
>
> I have two servers, each with 2 CPUs (6 cores each, hyperthreaded), so
> potentially I have 24 cores and 48 threads.
>
>
> mpirun starts mpifxcorr on both servers, but we get the “WARNING Could not
> open monitoring socket ! Aborting message receive thread” on the master.
> The processes seem to sit there and do nothing, with nothing more in
> errormon2.
>
>
> If I change the machines file I can run the same correlation on each
> server individually to completion, so DiFX has to be good.
>
>
> ww-flexbuf-01:/raid0/etransfer/hw04# cat machines
> ww-flexbuf-01
> wark167
>
> ww-flexbuf-01:/raid0/etransfer/hw04# cat threads
> NUMBER OF CORES:    6
> 2
> 2
> 2
> 2
> 2
> 2
>
>
> Note that on our network we have been asked to use a different multicast
> address, so in DIFXHOME/setup.bash I have set:
>
> DIFX_MESSAGE_GROUP=239.253.253.90
> DIFX_BINARY_GROUP=239.253.253.90
>
>
>
>
> Any ideas ?