[Difx-users] (SOLVED) mpicorrdifx cannot be loaded correctly on more than a single node
Geoff Crew
gbc at haystack.mit.edu
Thu Jun 29 15:15:40 EDT 2017
I recently tried openmpi-1.10.6 and it didn't work immediately.
Currently we use openmpi-1.4.5 which is rather old.
My suspicion is that some of the details have changed between
openmpi 1.x versions with regard to network specification, &c.
(Figuring out what, exactly is on my things-to-do list....)
--
Geoff (gbc at haystack.mit.edu)
On Thu, Jun 29, 2017 at 02:29:47PM -0400, Arash Roshanineshat wrote:
> Hi All
>
> Thank you all for your help. I want to update you about what I did and
> what the result was.
>
> I tried executing "mpirun" on a target node using "--host" argument:
>
> user at fringes-difx0$ mpirun --host fringes-difx1 ...
>
> but it couldn't use more than 1 cpu. to use more than a cpu I had to
> pass "--oversubscribe" argument to "mpirun". Furthermore, although the
> node had 40 cpus (checked with "htop"), it couldn't use more than 20
> cpus. I passed 20 and I also tried passing 90 threads to "mpirun", but
> it didn't make any difference. only maximum of 20 of cpus were used.
>
> I installed SLURM on the cluster and nodes and done the same experiment
> using the following command:
>
> $ salloc -N 2 mpirun ~/difx/bin/mpispeed
>
> All 80 cpus (2 nodes) got involved in some processing. but no output, no
> update and no results. the output was just bunch of:
>
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx1
>
> and it stuck there and I had to Ctrl+C to get out.
>
> Based on Adam's suggestion and the experiences above, I decided to try
> other versions of "openmpi". The experiments above were done using
> Ubuntu 16.04, openmpi 1.10.2 and SLURM 15.08.7.
>
> I tried installing "Openmpi 2.1.1", but no success. I was getting an
> error about "pmi" component. Although pmi2.h was included in the package
> and I tried several ways to introduce it, I failed.
>
> So I purged and removed Openmpi and its libraries and decided to move to
> "mpich" (version 3.2) instead.
>
> "startdifx" didn't work again. Because it tries to execute a command
> which includes "mpirun" and "mca" and it seems "mpich" does not
> recognize "mca".
>
> However, the first try using difx and "mpirun" was successful and it
> used the whole system resources. It also worked on several nodes at the
> same time as well successfully. I executed a difx experiment and checked
> the result and it was correct.
>
> I would be grateful if you give your feedback and opinion. I also would
> be happy to track the problem I had with "openmpi".
>
> Best Regards
> Arash Roshanineshat
>
>
> On 06/28/2017 09:08 PM, Adam Deller wrote:
> >OK, then there is clearly a problem with the mpirun command generated by
> >startdifx, if it is giving the same error when you are only using a
> >single node. What is the mpirun command you have used previously which
> >*did* work, and on which machine are you running it?
> >
> >The other thing that might be an issue is mpirun simply getting confused
> >about how many processes it should be allowed to start on the node. See
> >https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php for a huge list of
> >options (specifically the "Mapping, ranking and binding" section). What
> >happens if you run the mpirun command manually, but add --report-bindings?
> >
> >Cheers,
> >Adam/
> >/
> >
> >On 29 June 2017 at 09:58, Arash Roshanineshat
> ><arash.roshanineshat at cfa.harvard.edu
> ><mailto:arash.roshanineshat at cfa.harvard.edu>> wrote:
> >
> > Hi Adam
> >
> > Thank you for your information.
> >
> > I disabled each node by putting 0 in the third column of cluster
> > configuration file (C.txt) one by one and I doubled checked to see
> > if the disabled node is removed from "threads" and "machines" file.
> >
> > However, the problem is still there with the same error message in
> > both cases.
> >
> > Yes, the nodes in the cluster are connected to each other using 40G
> > cables and the master node that I want it to run difx is connected
> > to my workstation using a regular LAN (RJ-45) cable. So the master
> > node has two interfaces up.
> >
> >
> > Best Regards
> > Arash Roshanineshat
> >
> >
> > On 06/28/2017 07:42 PM, Adam Deller wrote:
> >
> > Hi Arash,
> >
> > I'm fairly sure this is an openmpi issue and not a DiFX issue,
> > hence the number of threads should not be important - it is
> > barfing well before the stage of trying to start processing
> > threads. For some reason, openmpi thinks there are not enough
> > CPUs available to bind processes on your first machine (although
> > there should be, given that you're only allocating 5 processes
> > to it and it has 20 cores). I know Lupin Liu posted a similar
> > problem about 2 years ago, but when I Iook at that thread there
> > was never a resolution - perhaps Lupin can comment? (You can
> > search through the difx-users archive for "While computing
> > bindings", and you'll see it).
> >
> > If you change C.txt to only have one machine enabled (first
> > fringes-difx0, then fringes-difx1), does it work in both cases?
> > Do you have any funny networking infrastructure like infiniband
> > in parallel with ethernet? Sometimes mpi gets confused when
> > multiple interfaces are present.
> >
> > If you can easily do so, I also suggest trying a different
> > version of openmpi.
> >
> > Cheers,
> > Adam
> >
> > On 29 June 2017 at 09:21, Arash Roshanineshat
> > <arashroshani92 at gmail.com <mailto:arashroshani92 at gmail.com>
> > <mailto:arashroshani92 at gmail.com
> > <mailto:arashroshani92 at gmail.com>>> wrote:
> >
> > Thank you Chris.
> >
> > I reduced the number of threads to 5 in the cluster
> > configuration
> > file (C.txt) and the same problem is still there.
> >
> > As per your request, I have attached the files to this email.
> >
> > "errormon2" does not report anything when I execute
> > startdifx. I
> > mean the output of errormon2 is blank.
> >
> > It might be useful to say that, executing "mpispeed", as it
> > was
> > suggested in an archived mail-list email, using the following
> > command and the same configuration files, works correctly.
> > It uses 6
> > cpus in total in both cluster nodes and returns "done"
> > outputs.
> >
> > $ mpirun -np 6 --hostfile
> > /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.machines
> > /home/arash/difx/bin/mpispeed
> >
> > Regarding the large number of threads, If I have understood
> > correctly from NRAO's difx tutorial, MPI should handle the
> > threads(and hyperthreads) automatically. So I chose a large
> > number
> > to use whole system's resources and speed up the difx.
> >
> > Best Regards
> >
> > Arash Roshanineshat
> >
> >
> >
> > On 06/28/2017 07:03 PM, Chris.Phillips at csiro.au wrote:
> >
> > Hi Arash,
> >
> > Without the full setup, it is basically impossible to
> > debug such
> > problems. We would need to see the .input file and
> > preferably
> > the .v2d and .vex file also. Also the full output from
> > DIFX for
> > the entire run (use errormon or errormon2)
> >
> > Specifically how many antennas are being correlated?
> > Maybe you
> > just have too few DIFX processes.
> >
> > Given you are running all the DataStream processes in a
> > single
> > node, I would suggest you probably have too many
> > threads running
> > per core. I doubt that is the problem you are seeing
> > though.
> >
> > Actually maybe it is - stardifx maybe is being clever
> > (I don't
> > use it) and not being willing to allocate more than 20
> > processes
> > (ignore hyperthreads, they tend to be useless for
> > DIFX). Try
> > changing # threads to, say, 5.
> >
> > Cheers
> > Chris
> >
> > ________________________________________
> > From: Difx-users <difx-users-bounces at listmgr.nrao.edu
> > <mailto:difx-users-bounces at listmgr.nrao.edu>
> > <mailto:difx-users-bounces at listmgr.nrao.edu
> > <mailto:difx-users-bounces at listmgr.nrao.edu>>> on behalf of Arash
> > Roshanineshat <arashroshani92 at gmail.com
> > <mailto:arashroshani92 at gmail.com>
> > <mailto:arashroshani92 at gmail.com
> > <mailto:arashroshani92 at gmail.com>>>
> > Sent: Thursday, 29 June 2017 7:34 AM
> > To: difx-users at listmgr.nrao.edu
> > <mailto:difx-users at listmgr.nrao.edu>
> > <mailto:difx-users at listmgr.nrao.edu
> > <mailto:difx-users at listmgr.nrao.edu>>
> > Cc: arash.roshanineshat at cfa.harvard.edu
> > <mailto:arash.roshanineshat at cfa.harvard.edu>
> > <mailto:arash.roshanineshat at cfa.harvard.edu
> > <mailto:arash.roshanineshat at cfa.harvard.edu>>
> >
> > Subject: [Difx-users] mpicorrdifx cannot be loaded
> > correctly on
> > more than a single node
> >
> > Hi,
> >
> > I could install difx but it can only be run on a single
> > node
> > cluster.
> >
> > The *.machines and *.threads files are attached to this
> > email.
> >
> > Openmpi is installed on all nodes and difx folder and
> > data folder is
> > shared among the clusters using NFS filesystem. Difx
> > works perfectly
> > with correct output on single machines.
> >
> > executing "startdifx -v -f e17d05-Sm-Sr_1000.input"
> > returns the
> > following error:
> >
> > DIFX_MACHINES ->
> > /home/arash/Shared_Examples/Example2/C.txt
> > Found modules:
> > Executing: mpirun -np 6 --hostfile
> >
> > /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.machines
> > --mca
> > mpi_yield_when_idle 1 --mca rmaps seq
> > runmpifxcorr.DiFX-2.5
> >
> > /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.input
> >
> > --------------------------------------------------------------------------
> > While computing bindings, we found no available cpus on
> > the following node:
> >
> > Node: fringes-difx0
> >
> > Please check your allocation.
> >
> > --------------------------------------------------------------------------
> > Elapsed time (s) = 0.50417590141
> >
> > and executing
> >
> > $ mpirun -np 6 --hostfile
> >
> > /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.machines
> > /home/arash/difx/bin/mpifxcorr
> >
> > /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.input
> >
> > seems to be working but by observing the cpu usage, I
> > see only 6
> > cpus
> > involving "5 in fringes-difx0 and 1 in fringes-difx1".
> > I was
> > expecting
> > it to use the number of cpus equal to the number in
> > "*.threads"
> > file.
> > How can I solve this issue?
> >
> > the specification of the cluster is Socket=2, Core per
> > Socket=10 and
> > Threads per core=2.
> >
> > Best Regards
> >
> > Arash Roshanineshat
> >
> >
> >
> >
> >
> >
> > _______________________________________________
> > Difx-users mailing list
> > Difx-users at listmgr.nrao.edu <mailto:Difx-users at listmgr.nrao.edu>
> > <mailto:Difx-users at listmgr.nrao.edu
> > <mailto:Difx-users at listmgr.nrao.edu>>
> > https://listmgr.nrao.edu/mailman/listinfo/difx-users
> > <https://listmgr.nrao.edu/mailman/listinfo/difx-users>
> > <https://listmgr.nrao.edu/mailman/listinfo/difx-users
> > <https://listmgr.nrao.edu/mailman/listinfo/difx-users>>
> >
> >
> >
> >
> > --
> > !=============================================================!
> > Dr. Adam Deller
> > ARC Future Fellow, Senior Lecturer
> > Centre for Astrophysics & Supercomputing
> > Swinburne University of Technology
> > John St, Hawthorn VIC 3122 Australia
> > phone: +61 3 9214 5307 <tel:%2B61%203%209214%205307>
> > fax: +61 3 9214 8797 <tel:%2B61%203%209214%208797>
> >
> > office days (usually): Mon-Thu
> > !=============================================================!
> >
> >
> >
> >
> >--
> >!=============================================================!
> >Dr. Adam Deller
> >ARC Future Fellow, Senior Lecturer
> >Centre for Astrophysics & Supercomputing
> >Swinburne University of Technology
> >John St, Hawthorn VIC 3122 Australia
> >phone: +61 3 9214 5307
> >fax: +61 3 9214 8797
> >
> >office days (usually): Mon-Thu
> >!=============================================================!
>
> _______________________________________________
> Difx-users mailing list
> Difx-users at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>
> !DSPAM:5955478410446707818138!
More information about the Difx-users
mailing list