<div dir="ltr">Thanks Jan!</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 30 Sept 2021 at 22:12, Jan Florian Wagner via Difx-users <<a href="mailto:difx-users@listmgr.nrao.edu">difx-users@listmgr.nrao.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi all,<br><br>in the end the issue was that CentOS libucx wasn't installed on any Mark6, and a few compute nodes were missing it as well.<br><br>With debug output during mpirun ("I_MPI_DEBUG=10 mpirun <args>") it turned out that MPI tried to use libfabric provider psm3 on Mark6s, and on compute nodes instead tried libfabric provider mlx that in turn depends on ucx. I installed the CentOS-provided libucx on all hosts that were lacking it.<br><br>With that, DiFX correlation started working for mixed Mark6 and file based correlation. For reference here are the settings:<br><br>export I_MPI_FABRICS=shm:ofi<br>export FI_PROVIDER=psm3<br># or: export FI_PROVIDER=verbs<br>export DIFX_MPIRUNOPTIONS="-print-rank-map -prepend-rank -perhost 1 -iface ib0" <br># or: export DIFX_MPIRUNOPTIONS="-gdb -print-rank-map -prepend-rank -perhost 1 -iface ib0" # parallel debug<br># or: export DIFX_MPIRUNOPTIONS="
With that, DiFX correlation started working for mixed Mark6 and file-based correlation. For reference, here are the settings:

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=psm3
# or: export FI_PROVIDER=verbs
export DIFX_MPIRUNOPTIONS="-print-rank-map -prepend-rank -perhost 1 -iface ib0"
# or: export DIFX_MPIRUNOPTIONS="-gdb -print-rank-map -prepend-rank -perhost 1 -iface ib0"  # parallel debug
# or: export DIFX_MPIRUNOPTIONS="-l vtune -collect hotspots -k sampling-mode=hw -print-rank-map -prepend-rank -perhost 1 -iface ib0"  # profiling

startdifx -v -f *.input

The standard 'startdifx' code needs one small change in the line that prepares the mpirun command. The command string, which for OpenMPI is

cmd = 'mpirun -np %d --hostfile %s.machines %s %s %s.input%s' % (...)

needs -np changed to -n and --hostfile changed to -machinefile for Intel MPI + Hydra, like this:

cmd = 'mpirun -n %d -machinefile %s.machines %s %s %s.input%s' % (...)

regards,
Jan
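If it helps, the change described above could also be applied in place with a one-liner along these lines. This is a hypothetical sketch: it assumes startdifx is on the PATH and that its format string matches the stock one quoted above; a .bak copy of the original is kept.

sed -i.bak "s/mpirun -np %d --hostfile/mpirun -n %d -machinefile/" $(which startdifx)   # hypothetical; keeps startdifx.bak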
On Tue, Sep 28, 2021 at 1:55 PM Jan Florian Wagner <jwagner105@googlemail.com> wrote:

Hi all,

has anyone tried out the Intel oneAPI 2021.3 packages? How has your experience been? In particular, did you get Intel MPI working?

I've installed oneAPI here and the DiFX components compile fine under the required Intel icc (C), icpc (C++) or Intel MPI mpicxx compilers, plus the Intel IPP 2021.3 library.

However, I cannot get MPI to work across compute nodes and Mark6 units. For example:

$ which mpirun
/opt/intel/oneapi/mpi/2021.3.1/bin/mpirun

($ export I_MPI_PLATFORM=auto)
$ mpirun -prepend-rank -n 6 -perhost 1 -machinefile intel.hostfile -bind-to none -iface ib0 mpifxcorr
[1] About to run MPIInit on node mark6-02
[0] About to run MPIInit on node mark6-01
[2] About to run MPIInit on node mark6-03
[5] About to run MPIInit on node node12.service
[3] About to run MPIInit on node node10.service
[4] About to run MPIInit on node node11.service
[1] Abort(1615503) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
[1] MPIR_Init_thread(138)........:
[1] MPID_Init(1169)..............:
[1] MPIDI_OFI_mpi_init_hook(1842):
[1] MPIDU_bc_table_create(336)...: Missing hostname or invalid host/port description in business card

The error is quite cryptic and I have not found much help elsewhere online. Maybe someone here has come across it?

Oddly, mpirun, or rather the MPI_Init() in mpifxcorr, works just fine when the machinefile contains only Mark6 units, or when it contains only compute nodes. Mixing compute nodes and Mark6 units leads to the above error. All hosts run the same CentOS 7.7.1908 and have Mellanox InfiniBand mlx4_0 as ib0.

many thanks,
regards,
Jan
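With the later resolution in mind, one way to narrow this kind of mixed-host failure down is to force a single libfabric provider on every host and turn up the debug output. This is a sketch rather than something from the run above; the mpirun arguments are the ones already shown, and the provider choice (verbs vs. psm3) is an assumption that depends on what is installed on both the Mark6 and the compute nodes.

export I_MPI_DEBUG=10      # Intel MPI: print fabric/provider selection per rank
export FI_LOG_LEVEL=warn   # libfabric: report provider setup problems
export FI_PROVIDER=verbs   # assumption: pick one provider available on all hosts (or psm3)
mpirun -prepend-rank -n 6 -perhost 1 -machinefile intel.hostfile -iface ib0 mpifxcorr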
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px"><div dir="ltr" style="font-size:12.8px">!=============================================================!<br><div dir="ltr" style="font-size:12.8px">A/Prof. Adam Deller </div><div dir="ltr" style="font-size:12.8px">ARC Future Fellow</div></div><div style="font-size:12.8px">Centre for Astrophysics & Supercomputing </div><div dir="ltr" style="font-size:12.8px">Swinburne University of Technology <br>John St, Hawthorn VIC 3122 Australia</div><div style="font-size:12.8px">phone: +61 3 9214 5307</div><div style="font-size:12.8px">fax: +61 3 9214 8797</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">office days (usually): Mon-Thu<br>!=============================================================!</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div>