[Difx-users] mpicorrdifx cannot be loaded correctly on more than a single node

Chris.Phillips at csiro.au Chris.Phillips at csiro.au
Wed Jun 28 19:03:16 EDT 2017


Hi Arash,

Without the full setup, it is basically impossible to debug such problems. We would need to see the .input file, and preferably the .v2d and .vex files as well, plus the full output from DIFX for the entire run (use errormon or errormon2).

Specifically, how many antennas are being correlated? Maybe you just have too few DIFX processes.
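
As a rough sanity check on the process count, here is a minimal Python sketch. It assumes the usual mpifxcorr layout of one manager process, one DataStream process per datastream and at least one Core process. The figure of 2 datastreams is only a guess from the "Sm-Sr" job name, and the function names are made up for illustration.

```python
def min_mpi_processes(n_datastreams, n_cores=1):
    """Smallest usable -np: 1 manager + 1 per datastream + core processes."""
    if n_datastreams < 1 or n_cores < 1:
        raise ValueError("need at least one datastream and one core process")
    return 1 + n_datastreams + n_cores


def core_processes(np_total, n_datastreams):
    """Core processes left over once the manager and datastreams are placed."""
    return np_total - 1 - n_datastreams


# Assumed example: 2 datastreams (a guess), -np 6 as in the mpirun command.
print("minimum -np:", min_mpi_processes(2))
print("core processes with -np 6:", core_processes(6, 2))
```

If the second number comes out at zero or below, there are too few processes for the job.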

Given you are running all the DataStream processes on a single node, I would suggest you probably have too many threads running per core. I doubt that is the problem you are seeing, though.

Actually maybe it is - startdifx may be being clever (I don't use it) and refusing to allocate more than 20 processes (ignore hyperthreads, they tend to be useless for DIFX). Try changing the number of threads to, say, 5.
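
To make the thread arithmetic concrete, here is a small Python sketch. It uses the figures from the original post (2 sockets, 10 cores per socket) and counts physical cores only, ignoring hyperthreads. The split of the 5 processes seen on fringes-difx0 into 3 non-core and 2 core processes, and the per-process thread counts, are assumptions for illustration, not values read from the .threads file.

```python
def physical_cores(sockets, cores_per_socket):
    """Physical cores on one node; hyperthreads are deliberately not counted."""
    return sockets * cores_per_socket


def node_load(other_processes, core_threads):
    """Busy threads on a node: one per non-core process plus all core threads."""
    return other_processes + sum(core_threads)


def is_oversubscribed(load, cores):
    """True if more threads want to run than there are physical cores."""
    return load > cores


cores = physical_cores(sockets=2, cores_per_socket=10)

# Hypothetical layout: 3 non-core processes and 2 core processes on the node.
for threads in (5, 20):
    load = node_load(other_processes=3, core_threads=[threads, threads])
    print(threads, "threads per core process:", load, "busy threads on",
          cores, "cores, oversubscribed:", is_oversubscribed(load, cores))
```

Under these assumptions, 5 threads per core process fits comfortably within the node, while 20 per core process does not.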

Cheers
Chris

________________________________________
From: Difx-users <difx-users-bounces at listmgr.nrao.edu> on behalf of Arash Roshanineshat <arashroshani92 at gmail.com>
Sent: Thursday, 29 June 2017 7:34 AM
To: difx-users at listmgr.nrao.edu
Cc: arash.roshanineshat at cfa.harvard.edu
Subject: [Difx-users] mpicorrdifx cannot be loaded correctly on more than a single node

Hi,

I was able to install difx, but it only runs on a single node of the cluster.

The *.machines and *.threads files are attached to this email.

Openmpi is installed on all nodes, and the difx folder and data folder are
shared among the nodes using an NFS filesystem. Difx works perfectly,
with correct output, on single machines.

executing "startdifx -v -f e17d05-Sm-Sr_1000.input" returns the
following error:

DIFX_MACHINES -> /home/arash/Shared_Examples/Example2/C.txt
Found modules:
Executing:  mpirun -np 6 --hostfile
/home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.machines --mca
mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.5
/home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.input
--------------------------------------------------------------------------
While computing bindings, we found no available cpus on
the following node:

   Node:  fringes-difx0

Please check your allocation.
--------------------------------------------------------------------------
Elapsed time (s) = 0.50417590141

and executing

$ mpirun -np 6 --hostfile
/home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.machines
/home/arash/difx/bin/mpifxcorr
/home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.input

seems to work, but when I observe the cpu usage I see only 6 cpus in
use: 5 on fringes-difx0 and 1 on fringes-difx1. I expected it to use
as many cpus as the number given in the "*.threads" file.
How can I solve this issue?

The specification of the cluster is Sockets=2, Cores per socket=10 and
Threads per core=2.

Best Regards

Arash Roshanineshat






