Hi all,

in the end the issue was that the CentOS libucx package wasn't installed on any of the Mark6 units, and a few compute nodes were missing it as well.

With debug output during mpirun ("I_MPI_DEBUG=10 mpirun <args>") it turned out that MPI tried to use the libfabric provider psm3 on the Mark6s, while on the compute nodes it tried the libfabric provider mlx, which in turn depends on UCX. I installed the CentOS-provided libucx on all hosts that were lacking it.

With that, DiFX correlation started working for mixed Mark6 and file-based correlation. For reference, here are the settings:

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=psm3
# or: export FI_PROVIDER=verbs
export DIFX_MPIRUNOPTIONS="-print-rank-map -prepend-rank -perhost 1 -iface ib0"
# or: export DIFX_MPIRUNOPTIONS="-gdb -print-rank-map -prepend-rank -perhost 1 -iface ib0" # parallel debug
# or: export DIFX_MPIRUNOPTIONS="-l vtune -collect hotspots -k sampling-mode=hw -print-rank-map -prepend-rank -perhost 1 -iface ib0" # profiling

startdifx -v -f *.input

The standard 'startdifx' code needs one small change in the line that prepares the mpirun command. The command string, which for OpenMPI is

cmd = 'mpirun -np %d --hostfile %s.machines %s %s %s.input%s' % (...)

needed -np and --hostfile changed for Intel MPI + Hydra, like this:

cmd = 'mpirun -n %d -machinefile %s.machines %s %s %s.input%s' % (...)
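Not part of the stock code, but if one wanted to avoid hand-editing startdifx for each MPI flavour, a rough sketch could pick the template at runtime. The function name and the detection heuristic (matching "Intel"/"HYDRA" in the 'mpirun --version' output) are my assumptions, not anything taken from startdifx:

import subprocess

def mpirun_template():
    # Guess the MPI flavour from what 'mpirun --version' prints; the matched
    # strings ("Intel", "HYDRA") are assumptions, adjust to your installation.
    try:
        ver = subprocess.check_output(['mpirun', '--version'],
                                      stderr=subprocess.STDOUT).decode()
    except (OSError, subprocess.CalledProcessError):
        ver = ''
    if 'Intel' in ver or 'HYDRA' in ver.upper():
        # Intel MPI + Hydra flags, as in the modified startdifx line above
        return 'mpirun -n %d -machinefile %s.machines %s %s %s.input%s'
    # otherwise fall back to the stock OpenMPI-style command string
    return 'mpirun -np %d --hostfile %s.machines %s %s %s.input%s'

The returned template would then be filled in with the same arguments the existing cmd = ... line already uses.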
regards,
Jan

On Tue, Sep 28, 2021 at 1:55 PM Jan Florian Wagner <jwagner105@googlemail.com> wrote:

> Hi all,
>
> has anyone tried out the Intel oneAPI 2021.3 packages? How has your experience been? In particular, did you get Intel MPI working?
>
> I've installed oneAPI here and the DiFX components compile fine under the respectively required Intel icc (C), icpc (C++), or Intel MPI mpicxx compilers, plus the Intel IPP 2021.3 library.
>
> However I cannot get MPI to work across compute nodes and Mark6. For example:
>
> $ which mpirun
> /opt/intel/oneapi/mpi/2021.3.1/bin/mpirun
>
> ($ export I_MPI_PLATFORM=auto)
> $ mpirun -prepend-rank -n 6 -perhost 1 -machinefile intel.hostfile -bind-to none -iface ib0 mpifxcorr
> [1] About to run MPIInit on node mark6-02
> [0] About to run MPIInit on node mark6-01
> [2] About to run MPIInit on node mark6-03
> [5] About to run MPIInit on node node12.service
> [3] About to run MPIInit on node node10.service
> [4] About to run MPIInit on node node11.service
> [1] Abort(1615503) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
> [1] MPIR_Init_thread(138)........:
> [1] MPID_Init(1169)..............:
> [1] MPIDI_OFI_mpi_init_hook(1842):
> [1] MPIDU_bc_table_create(336)...: Missing hostname or invalid host/port description in business card
>
> The error is quite cryptic and I have not found much help elsewhere online.
>
> Maybe someone here has come across it?
>
> Oddly, mpirun or rather the MPI_Init() in mpifxcorr works just fine when the machinefile contains only Mark6 units, or when it contains only compute nodes.
>
> Mixing both compute and Mark6 leads to the above error. All hosts have the same CentOS 7.7.1908 and Mellanox Infiniband mlx4_0 as ib0...
>
> many thanks,
> regards,
> Jan