<div dir="ltr"><div><div><div>OK, then there is clearly a problem with the mpirun command generated by startdifx, if it is giving the same error when you are only using a single node.  What is the mpirun command you have used previously which *did* work, and on which machine are you running it? <br><br></div>The other thing that might be an issue is mpirun simply getting confused about how many processes it should be allowed to start on the node.  See <a href="https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php">https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php</a> for a huge list of options (specifically the "Mapping, ranking and binding" section).  What happens if you run the mpirun command manually, but add  --report-bindings?<br><br></div>Cheers,<br></div>Adam<i><br></i></div><div class="gmail_extra"><br><div class="gmail_quote">On 29 June 2017 at 09:58, Arash Roshanineshat <span dir="ltr"><<a href="mailto:arash.roshanineshat@cfa.harvard.edu" target="_blank">arash.roshanineshat@cfa.harvard.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Adam<br>

<br>

Thank you for your information.<br>

<br>

I disabled each node in turn by putting 0 in the third column of the cluster configuration file (C.txt), and I double-checked that the disabled node had been removed from the "threads" and "machines" files.<br>

<br>

However, the problem is still there with the same error message in both cases.<br>

<br>

Yes, the nodes in the cluster are connected to each other using 40G cables, and the master node on which I want to run difx is connected to my workstation using a regular LAN (RJ-45) cable. So the master node has two interfaces up.<br>

<br>

<br>

Best Regards<br>

Arash Roshanineshat<span class=""><br>

<br>

<br>

On 06/28/2017 07:42 PM, Adam Deller wrote:<br>

</span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

Hi Arash,<br>

<br>

I'm fairly sure this is an openmpi issue and not a DiFX issue, hence the number of threads should not be important - it is barfing well before the stage of trying to start processing threads.  For some reason, openmpi thinks there are not enough CPUs available to bind processes on your first machine (although there should be, given that you're only allocating 5 processes to it and it has 20 cores).  I know Lupin Liu posted a similar problem about 2 years ago, but when I look at that thread there was never a resolution - perhaps Lupin can comment?  (You can search through the difx-users archive for "While computing bindings", and you'll see it).<br>
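
<br>

A minimal sketch for probing the binding failure, assuming Open MPI 1.8 or later, where the "--report-bindings" and "--bind-to none" options are available. The helper name "difx_mpirun_cmd" is hypothetical; it only prints the command so that it can be checked and then run by hand.<br>

```shell
#!/bin/sh
# Hypothetical helper: print an mpirun command line with binding diagnostics.
# Arguments: process count, machines (hostfile) path, input file path.
difx_mpirun_cmd() {
    np=$1
    machines=$2
    input=$3
    # --report-bindings prints where each rank was bound;
    # --bind-to none disables binding, to test whether binding is the culprit.
    printf 'mpirun -np %s --hostfile %s --report-bindings --bind-to none mpifxcorr %s\n' \
        "$np" "$machines" "$input"
}

difx_mpirun_cmd 6 e17d05-Sm-Sr_1000.machines e17d05-Sm-Sr_1000.input
```

If the printed command runs cleanly where the original one failed, that would point to the binding step rather than to DiFX itself.<br>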

<br>

If you change C.txt to only have one machine enabled (first fringes-difx0, then fringes-difx1), does it work in both cases?  Do you have any funny networking infrastructure like infiniband in parallel with ethernet?  Sometimes mpi gets confused when multiple interfaces are present.<br>
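
<br>

As a sketch of how to rule out interface confusion, Open MPI can be told which interface to use through the "btl_tcp_if_include" and "oob_tcp_if_include" MCA parameters. The interface name "eth1" and the helper name "mpi_iface_opts" are assumptions for illustration; the real interface name can be found with "ip addr".<br>

```shell
#!/bin/sh
# Hypothetical helper: print MCA options that pin Open MPI's TCP traffic
# to a single interface. The interface name is passed as the only argument.
mpi_iface_opts() {
    iface=$1
    # btl_tcp_if_include: interface used for MPI point-to-point traffic.
    # oob_tcp_if_include: interface used for the runtime's out-of-band channel.
    printf '%s\n' "--mca btl_tcp_if_include $iface --mca oob_tcp_if_include $iface"
}

# "eth1" stands in for the cluster-facing interface (an assumption).
echo "mpirun $(mpi_iface_opts eth1) -np 6 --hostfile e17d05-Sm-Sr_1000.machines mpifxcorr e17d05-Sm-Sr_1000.input"
```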

<br>

If you can easily do so, I also suggest trying a different version of openmpi.<br>

<br>

Cheers,<br>

Adam<br>

<br></span><div><div class="h5">

On 29 June 2017 at 09:21, Arash Roshanineshat <<a href="mailto:arashroshani92@gmail.com" target="_blank">arashroshani92@gmail.com</a> <mailto:<a href="mailto:arashroshani92@gmail.com" target="_blank">arashroshani92@gmail.c<wbr>om</a>>> wrote:<br>

<br>

    Thank you Chris.<br>

<br>

    I reduced the number of threads to 5 in the cluster configuration<br>

    file (C.txt) and the same problem is still there.<br>

<br>

    As per your request, I have attached the files to this email.<br>

<br>

    "errormon2" does not report anything when I execute startdifx. I<br>

    mean the output of errormon2 is blank.<br>

<br>

    It may be useful to note that executing "mpispeed", as was<br>

    suggested in an archived mailing-list email, using the following<br>

    command and the same configuration files, works correctly. It uses 6<br>

    cpus in total across both cluster nodes and returns "done" outputs.<br>

<br>

    $ mpirun -np 6 --hostfile<br>

    /home/arash/Shared_Examples/Ex<wbr>ample2/e17d05-Sm-Sr_1000.machi<wbr>nes<br>

    /home/arash/difx/bin/mpispeed<br>

<br>

    Regarding the large number of threads: if I have understood<br>

    correctly from NRAO's difx tutorial, MPI should handle the<br>

    threads (and hyperthreads) automatically. So I chose a large number<br>

    to use the whole system's resources and speed up difx.<br>

<br>

    Best Regards<br>

<br>

    Arash Roshanineshat<br>

<br>

<br>

<br>

    On 06/28/2017 07:03 PM, Chris.Phillips@csiro.au wrote:<br>

<br>

        Hi Arash,<br>

<br>

        Without the full setup, it is basically impossible to debug such<br>

        problems. We would need to see the .input file and preferably<br>

        the .v2d and .vex file also.  Also the full output from DIFX for<br>

        the entire run (use errormon or errormon2)<br>

<br>

        Specifically how many antennas are being correlated? Maybe you<br>

        just have too few DIFX processes.<br>

<br>

        Given you are running all the DataStream processes in a single<br>

        node, I would suggest you probably have too many threads running<br>

        per core. I doubt that is the problem you are seeing though.<br>

<br>

        Actually maybe it is - startdifx maybe is being clever (I don't<br>

        use it) and not being willing to allocate more than 20 processes<br>

        (ignore hyperthreads, they tend to be useless for DIFX).  Try<br>

        changing # threads to, say, 5.<br>

<br>

        Cheers<br>

        Chris<br>

<br>

        ______________________________<wbr>__________<br>

        From: Difx-users <<a href="mailto:difx-users-bounces@listmgr.nrao.edu" target="_blank">difx-users-bounces@listmgr.nr<wbr>ao.edu</a><br></div></div>

        <mailto:<a href="mailto:difx-users-bounces@listmgr.nrao.edu" target="_blank">difx-users-bounces@lis<wbr>tmgr.nrao.edu</a>>> on behalf of Arash<br>

        Roshanineshat <<a href="mailto:arashroshani92@gmail.com" target="_blank">arashroshani92@gmail.com</a><br>

        <mailto:<a href="mailto:arashroshani92@gmail.com" target="_blank">arashroshani92@gmail.c<wbr>om</a>>><span class=""><br>

        Sent: Thursday, 29 June 2017 7:34 AM<br></span>

        To: <a href="mailto:difx-users@listmgr.nrao.edu" target="_blank">difx-users@listmgr.nrao.edu</a> <mailto:<a href="mailto:difx-users@listmgr.nrao.edu" target="_blank">difx-users@listmgr.nra<wbr>o.edu</a>><br>

        Cc: <a href="mailto:arash.roshanineshat@cfa.harvard.edu" target="_blank">arash.roshanineshat@cfa.harvar<wbr>d.edu</a><br>

        <mailto:<a href="mailto:arash.roshanineshat@cfa.harvard.edu" target="_blank">arash.roshanineshat@cf<wbr>a.harvard.edu</a>><div><div class="h5"><br>

        Subject: [Difx-users] mpicorrdifx cannot be loaded correctly on<br>

        more than a     single node<br>

<br>

        Hi,<br>

<br>

        I was able to install difx, but it only runs on a single node of the<br>

        cluster.<br>

<br>

        The *.machines and *.threads files are attached to this email.<br>

<br>

        Openmpi is installed on all nodes, and the difx and data folders are<br>

        shared among the cluster nodes using NFS. Difx works perfectly,<br>

        with correct output, on single machines.<br>

<br>

        executing "startdifx -v -f e17d05-Sm-Sr_1000.input" returns the<br>

        following error:<br>

<br>

        DIFX_MACHINES -> /home/arash/Shared_Examples/Ex<wbr>ample2/C.txt<br>

        Found modules:<br>

        Executing:  mpirun -np 6 --hostfile<br>

        /home/arash/Shared_Examples/Ex<wbr>ample2/e17d05-Sm-Sr_1000.machi<wbr>nes<br>

        --mca<br>

        mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.5<br>

        /home/arash/Shared_Examples/Ex<wbr>ample2/e17d05-Sm-Sr_1000.input<br>

        ------------------------------<wbr>------------------------------<wbr>--------------<br>

        While computing bindings, we found no available cpus on<br>

        the following node:<br>

<br>

             Node:  fringes-difx0<br>

<br>

        Please check your allocation.<br>

        ------------------------------<wbr>------------------------------<wbr>--------------<br>

        Elapsed time (s) = 0.50417590141<br>

<br>

        and executing<br>

<br>

        $ mpirun -np 6 --hostfile<br>

        /home/arash/Shared_Examples/Ex<wbr>ample2/e17d05-Sm-Sr_1000.machi<wbr>nes<br>

        /home/arash/difx/bin/mpifxcorr<br>

        /home/arash/Shared_Examples/Ex<wbr>ample2/e17d05-Sm-Sr_1000.input<br>

<br>

        seems to work, but when observing the cpu usage I see only 6 cpus<br>

        in use (5 on fringes-difx0 and 1 on fringes-difx1). I was expecting<br>

        it to use a number of cpus equal to the number in the "*.threads"<br>

        file.<br>

        How can I solve this issue?<br>

<br>

        The specification of each node in the cluster is Sockets=2, Cores per<br>

        socket=10 and Threads per core=2.<br>
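
<br>

        For reference, a small sketch of the arithmetic behind that topology, with variable names chosen here purely for illustration. It gives the physical core count and the hardware thread count (including hyperthreads) for one node.<br>

```shell
#!/bin/sh
# Per-node CPU counts derived from the stated topology.
sockets=2
cores_per_socket=10
threads_per_core=2

# Physical cores: sockets times cores per socket.
physical_cores=$((sockets * cores_per_socket))
# Hardware threads: physical cores times threads per core (hyperthreading).
hardware_threads=$((physical_cores * threads_per_core))

echo "physical cores per node: $physical_cores"
echo "hardware threads per node: $hardware_threads"
```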

<br>

        Best Regards<br>

<br>

        Arash Roshanineshat<br>

<br>

<br>

<br>

<br>

<br>

<br>

    ______________________________<wbr>_________________<br>

    Difx-users mailing list<br></div></div>

    <a href="mailto:Difx-users@listmgr.nrao.edu" target="_blank">Difx-users@listmgr.nrao.edu</a> <mailto:<a href="mailto:Difx-users@listmgr.nrao.edu" target="_blank">Difx-users@listmgr.nra<wbr>o.edu</a>><br>

    <a href="https://listmgr.nrao.edu/mailman/listinfo/difx-users" rel="noreferrer" target="_blank">https://listmgr.nrao.edu/mailm<wbr>an/listinfo/difx-users</a><span class=""><br>

    <<a href="https://listmgr.nrao.edu/mailman/listinfo/difx-users" rel="noreferrer" target="_blank">https://listmgr.nrao.edu/mail<wbr>man/listinfo/difx-users</a>><br>

<br>

<br>

<br>

<br>

-- <br>

!=============================<wbr>==============================<wbr>==!<br>

Dr. Adam Deller<br>

ARC Future Fellow, Senior Lecturer<br>

Centre for Astrophysics & Supercomputing<br>

Swinburne University of Technology<br>

John St, Hawthorn VIC 3122 Australia<br>

phone: <a href="tel:%2B61%203%209214%205307" value="+61392145307" target="_blank">+61 3 9214 5307</a><br>

fax: <a href="tel:%2B61%203%209214%208797" value="+61392148797" target="_blank">+61 3 9214 8797</a><br>

<br>

office days (usually): Mon-Thu<br>

!=============================<wbr>==============================<wbr>==!<br>

</span></blockquote>

</blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr" style="font-size:12.8000001907349px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px">!=============================================================!<br>Dr. Adam Deller         </div><div dir="ltr" style="font-size:12.8px">ARC Future Fellow, Senior Lecturer</div><div style="font-size:12.8px">Centre for Astrophysics & Supercomputing </div><div dir="ltr" style="font-size:12.8px">Swinburne University of Technology    <br>John St, Hawthorn VIC 3122 Australia</div><div style="font-size:12.8px">phone: +61 3 9214 5307</div><div style="font-size:12.8px">fax: +61 3 9214 8797</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">office days (usually): Mon-Thu<br>!=============================================================!</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div>

</div>