[Difx-users] (SOLVED) mpicorrdifx cannot be loaded correctly on more than a single node

Thu Jun 29 15:15:40 EDT 2017

I recently tried openmpi-1.10.6 and it didn't work immediately.
Currently we use openmpi-1.4.5 which is rather old.

My suspicion is that some of the details have changed between
openmpi 1.x versions with regard to network specification, &c.
(Figuring out what, exactly is on my things-to-do list....)

-- 

		Geoff (gbc at haystack.mit.edu)

On Thu, Jun 29, 2017 at 02:29:47PM -0400, Arash Roshanineshat wrote:
> Hi All
> 
> Thank you all for your help. I want to update you about what I did and 
> what the result was.
> 
> I tried executing "mpirun" on a target node using "--host" argument:
> 
> user at fringes-difx0$ mpirun --host fringes-difx1 ...
> 
> but it couldn't use more than 1 cpu. to use more than a cpu I had to 
> pass "--oversubscribe" argument to "mpirun". Furthermore, although the 
> node had 40 cpus (checked with "htop"), it couldn't use more than 20 
> cpus. I passed 20 and I also tried passing 90 threads to "mpirun", but 
> it didn't make any difference. only maximum of 20 of cpus were used.
> 
> I installed SLURM on the cluster and nodes and done the same experiment 
> using the following command:
> 
> $ salloc -N 2 mpirun ~/difx/bin/mpispeed
> 
> All 80 cpus (2 nodes) got involved in some processing. but no output, no 
> update and no results. the output was just bunch of:
> 
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx0
> About to run MPIInit on node fringes-difx1
> 
> and it stuck there and I had to Ctrl+C to get out.
> 
> Based on Adam's suggestion and the experiences above, I decided to try 
> other versions of "openmpi". The experiments above were done using 
> Ubuntu 16.04, openmpi 1.10.2 and SLURM 15.08.7.
> 
> I tried installing "Openmpi 2.1.1", but no success. I was getting an 
> error about "pmi" component. Although pmi2.h was included in the package 
> and I tried several ways to introduce it, I failed.
> 
> So I purged and removed Openmpi and its libraries and decided to move to 
> "mpich" (version 3.2) instead.
> 
> "startdifx" didn't work again. Because it tries to execute a command 
> which includes "mpirun" and "mca" and it seems "mpich" does not 
> recognize "mca".
> 
> However, the first try using difx and "mpirun" was successful and it 
> used the whole system resources. It also worked on several nodes at the 
> same time as well successfully. I executed a difx experiment and checked 
> the result and it was correct.
> 
> I would be grateful if you give your feedback and opinion. I also would 
> be happy to track the problem I had with "openmpi".
> 
> Best Regards
> Arash Roshanineshat
> 
> 
> On 06/28/2017 09:08 PM, Adam Deller wrote:
> >OK, then there is clearly a problem with the mpirun command generated by 
> >startdifx, if it is giving the same error when you are only using a 
> >single node.  What is the mpirun command you have used previously which 
> >*did* work, and on which machine are you running it?
> >
> >The other thing that might be an issue is mpirun simply getting confused 
> >about how many processes it should be allowed to start on the node.  See 
> >https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php for a huge list of 
> >options (specifically the "Mapping, ranking and binding" section).  What 
> >happens if you run the mpirun command manually, but add --report-bindings?
> >
> >Cheers,
> >Adam/
> >/
> >
> >On 29 June 2017 at 09:58, Arash Roshanineshat 
> ><arash.roshanineshat at cfa.harvard.edu 
> ><mailto:arash.roshanineshat at cfa.harvard.edu>> wrote:
> >
> >    Hi Adam
> >
> >    Thank you for your information.
> >
> >    I disabled each node by putting 0 in the third column of cluster
> >    configuration file (C.txt) one by one and I doubled checked to see
> >    if the disabled node is removed from "threads" and "machines" file.
> >
> >    However, the problem is still there with the same error message in
> >    both cases.
> >
> >    Yes, the nodes in the cluster are connected to each other using 40G
> >    cables and the master node that I want it to run difx is connected
> >    to my workstation using a regular LAN (RJ-45) cable. So the master
> >    node has two interfaces up.
> >
> >
> >    Best Regards
> >    Arash Roshanineshat
> >
> >
> >    On 06/28/2017 07:42 PM, Adam Deller wrote:
> >
> >        Hi Arash,
> >
> >        I'm fairly sure this is an openmpi issue and not a DiFX issue,
> >        hence the number of threads should not be important - it is
> >        barfing well before the stage of trying to start processing
> >        threads.  For some reason, openmpi thinks there are not enough
> >        CPUs available to bind processes on your first machine (although
> >        there should be, given that you're only allocating 5 processes
> >        to it and it has 20 cores).  I know Lupin Liu posted a similar
> >        problem about 2 years ago, but when I Iook at that thread there
> >        was never a resolution - perhaps Lupin can comment?  (You can
> >        search through the difx-users archive for "While computing
> >        bindings", and you'll see it).
> >
> >        If you change C.txt to only have one machine enabled (first
> >        fringes-difx0, then fringes-difx1), does it work in both cases? 
> >        Do you have any funny networking infrastructure like infiniband
> >        in parallel with ethernet?  Sometimes mpi gets confused when
> >        multiple interfaces are present.
> >
> >        If you can easily do so, I also suggest trying a different
> >        version of openmpi.
> >
> >        Cheers,
> >        Adam
> >
> >        On 29 June 2017 at 09:21, Arash Roshanineshat
> >        <arashroshani92 at gmail.com <mailto:arashroshani92 at gmail.com>
> >        <mailto:arashroshani92 at gmail.com
> >        <mailto:arashroshani92 at gmail.com>>> wrote:
> >
> >             Thank you Chris.
> >
> >             I reduced the number of threads to 5 in the cluster
> >        configuration
> >             file (C.txt) and the same problem is still there.
> >
> >             As per your request, I have attached the files to this email.
> >
> >             "errormon2" does not report anything when I execute
> >        startdifx. I
> >             mean the output of errormon2 is blank.
> >
> >             It might be useful to say that, executing "mpispeed", as it 
> >             was
> >             suggested in an archived mail-list email, using the following
> >             command and the same configuration files, works correctly.
> >        It uses 6
> >             cpus in total in both cluster nodes and returns "done" 
> >             outputs.
> >
> >             $ mpirun -np 6 --hostfile
> >             /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.machines
> >             /home/arash/difx/bin/mpispeed
> >
> >             Regarding the large number of threads, If I have understood
> >             correctly from NRAO's difx tutorial, MPI should handle the
> >             threads(and hyperthreads) automatically. So I chose a large
> >        number
> >             to use whole system's resources and speed up the difx.
> >
> >             Best Regards
> >
> >             Arash Roshanineshat
> >
> >
> >
> >             On 06/28/2017 07:03 PM, Chris.Phillips at csiro.au wrote:
> >
> >                 Hi Arash,
> >
> >                 Without the full setup, it is basically impossible to
> >        debug such
> >                 problems. We would need to see the .input file and
> >        preferably
> >                 the .v2d and .vex file also.  Also the full output from
> >        DIFX for
> >                 the entire run (use errormon or errormon2)
> >
> >                 Specifically how many antennas are being correlated?
> >        Maybe you
> >                 just have too few DIFX processes.
> >
> >                 Given you are running all the DataStream processes in a
> >        single
> >                 node, I would suggest you probably have too many
> >        threads running
> >                 per core. I doubt that is the problem you are seeing
> >        though.
> >
> >                 Actually maybe it is - stardifx maybe is being clever
> >        (I don't
> >                 use it) and not being willing to allocate more than 20
> >        processes
> >                 (ignore hyperthreads, they tend to be useless for
> >        DIFX).  Try
> >                 changing # threads to, say, 5.
> >
> >                 Cheers
> >                 Chris
> >
> >                 ________________________________________
> >                 From: Difx-users <difx-users-bounces at listmgr.nrao.edu
> >        <mailto:difx-users-bounces at listmgr.nrao.edu>
> >                 <mailto:difx-users-bounces at listmgr.nrao.edu
> >        <mailto:difx-users-bounces at listmgr.nrao.edu>>> on behalf of Arash
> >                 Roshanineshat <arashroshani92 at gmail.com
> >        <mailto:arashroshani92 at gmail.com>
> >                 <mailto:arashroshani92 at gmail.com
> >        <mailto:arashroshani92 at gmail.com>>>
> >                 Sent: Thursday, 29 June 2017 7:34 AM
> >                 To: difx-users at listmgr.nrao.edu
> >        <mailto:difx-users at listmgr.nrao.edu>
> >        <mailto:difx-users at listmgr.nrao.edu
> >        <mailto:difx-users at listmgr.nrao.edu>>
> >                 Cc: arash.roshanineshat at cfa.harvard.edu
> >        <mailto:arash.roshanineshat at cfa.harvard.edu>
> >                 <mailto:arash.roshanineshat at cfa.harvard.edu
> >        <mailto:arash.roshanineshat at cfa.harvard.edu>>
> >
> >                 Subject: [Difx-users] mpicorrdifx cannot be loaded
> >        correctly on
> >                 more than a     single node
> >
> >                 Hi,
> >
> >                 I could install difx but it can only be run on a single
> >        node
> >                 cluster.
> >
> >                 The *.machines and *.threads files are attached to this
> >        email.
> >
> >                 Openmpi is installed on all nodes and difx folder and
> >        data folder is
> >                 shared among the clusters using NFS filesystem. Difx
> >        works perfectly
> >                 with correct output on single machines.
> >
> >                 executing "startdifx -v -f e17d05-Sm-Sr_1000.input"
> >        returns the
> >                 following error:
> >
> >                 DIFX_MACHINES -> 
> >                 /home/arash/Shared_Examples/Example2/C.txt
> >                 Found modules:
> >                 Executing:  mpirun -np 6 --hostfile
> >                
> >        /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.machines
> >                 --mca
> >                 mpi_yield_when_idle 1 --mca rmaps seq 
> >        runmpifxcorr.DiFX-2.5
> >                
> >        /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.input
> >                
> >        --------------------------------------------------------------------------
> >                 While computing bindings, we found no available cpus on
> >                 the following node:
> >
> >                      Node:  fringes-difx0
> >
> >                 Please check your allocation.
> >                
> >        --------------------------------------------------------------------------
> >                 Elapsed time (s) = 0.50417590141
> >
> >                 and executing
> >
> >                 $ mpirun -np 6 --hostfile
> >                
> >        /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.machines
> >                 /home/arash/difx/bin/mpifxcorr
> >                
> >        /home/arash/Shared_Examples/Example2/e17d05-Sm-Sr_1000.input
> >
> >                 seems to be working but by observing the cpu usage, I
> >        see only 6
> >                 cpus
> >                 involving "5 in fringes-difx0 and 1 in fringes-difx1".
> >        I was
> >                 expecting
> >                 it to use the number of cpus equal to the number in
> >        "*.threads"
> >                 file.
> >                 How can I solve this issue?
> >
> >                 the specification of the cluster is Socket=2, Core per
> >        Socket=10 and
> >                 Threads per core=2.
> >
> >                 Best Regards
> >
> >                 Arash Roshanineshat
> >
> >
> >
> >
> >
> >
> >             _______________________________________________
> >             Difx-users mailing list
> >        Difx-users at listmgr.nrao.edu <mailto:Difx-users at listmgr.nrao.edu>
> >        <mailto:Difx-users at listmgr.nrao.edu
> >        <mailto:Difx-users at listmgr.nrao.edu>>
> >        https://listmgr.nrao.edu/mailman/listinfo/difx-users
> >        <https://listmgr.nrao.edu/mailman/listinfo/difx-users>
> >             <https://listmgr.nrao.edu/mailman/listinfo/difx-users
> >        <https://listmgr.nrao.edu/mailman/listinfo/difx-users>>
> >
> >
> >
> >
> >        -- 
> >        !=============================================================!
> >        Dr. Adam Deller
> >        ARC Future Fellow, Senior Lecturer
> >        Centre for Astrophysics & Supercomputing
> >        Swinburne University of Technology
> >        John St, Hawthorn VIC 3122 Australia
> >        phone: +61 3 9214 5307 <tel:%2B61%203%209214%205307>
> >        fax: +61 3 9214 8797 <tel:%2B61%203%209214%208797>
> >
> >        office days (usually): Mon-Thu
> >        !=============================================================!
> >
> >
> >
> >
> >-- 
> >!=============================================================!
> >Dr. Adam Deller
> >ARC Future Fellow, Senior Lecturer
> >Centre for Astrophysics & Supercomputing
> >Swinburne University of Technology
> >John St, Hawthorn VIC 3122 Australia
> >phone: +61 3 9214 5307
> >fax: +61 3 9214 8797
> >
> >office days (usually): Mon-Thu
> >!=============================================================!
> 
> _______________________________________________
> Difx-users mailing list
> Difx-users at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/difx-users
> 
> !DSPAM:5955478410446707818138!