[Difx-users] Intel oneAPI and MPI

Jan Florian Wagner jwagner105 at googlemail.com
Thu Sep 30 08:11:14 EDT 2021


Hi all,

In the end the issue was that the CentOS libucx package was not installed
on any of the Mark6 units, and a few compute nodes were missing it as well.

With debug output enabled during mpirun ("I_MPI_DEBUG=10 mpirun <args>") it
turned out that MPI tried to use the libfabric provider psm3 on the Mark6
units, while on the compute nodes it tried the libfabric provider mlx, which
in turn depends on UCX. I installed the CentOS-provided libucx on all hosts
that were lacking it.
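
For anyone who wants to check a host quickly, here is a small sketch of my
own (not part of the DiFX tools); it assumes the UCX package ships the usual
libucp/libuct/libucs libraries, as the CentOS 7 'ucx' package should:

from ctypes.util import find_library

# The libfabric mlx provider needs UCX; if none of these resolve,
# the host is missing the library.
for lib in ('ucp', 'uct', 'ucs'):
    print(lib, '->', find_library(lib) or 'NOT FOUND')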

With that, DiFX correlation started working for mixed Mark6 and file-based
correlation. For reference, here are the settings:

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=psm3
# or: export FI_PROVIDER=verbs
export DIFX_MPIRUNOPTIONS="-print-rank-map -prepend-rank -perhost 1 -iface ib0"
# or: export DIFX_MPIRUNOPTIONS="-gdb -print-rank-map -prepend-rank -perhost 1 -iface ib0"  # parallel debug
# or: export DIFX_MPIRUNOPTIONS=" -l vtune -collect hotspots -k sampling-mode=hw -print-rank-map -prepend-rank -perhost 1 -iface ib0"  # profiling

startdifx -v -f *.input


The standard 'startdifx' code needs one small change in the line that
prepares the mpirun command. The command string, which for OpenMPI is

cmd = 'mpirun -np %d --hostfile %s.machines %s  %s %s.input%s' % (...)

needs its -np and --hostfile options changed for Intel MPI with the Hydra
launcher, like this:

cmd = 'mpirun -n %d -machinefile %s.machines %s  %s %s.input%s' % (...)
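
If you would rather not hard-code one dialect, below is a minimal sketch of
picking the option names at run time. This is my own suggestion, not stock
startdifx; it assumes that 'mpirun --version' mentions "Intel" only for
Intel MPI, which holds for the oneAPI mpirun:

import subprocess

def mpirun_dialect():
    # Intel MPI's Hydra mpirun prints "Intel(R) MPI Library ..." for
    # --version; OpenMPI prints "mpirun (Open MPI) ...".
    out = subprocess.run(['mpirun', '--version'],
                         capture_output=True, text=True).stdout
    return ('-n', '-machinefile') if 'Intel' in out else ('-np', '--hostfile')

npOpt, mfOpt = mpirun_dialect()
cmdFormat = 'mpirun %s %%d %s %%s.machines %%s  %%s %%s.input%%s' % (npOpt, mfOpt)
# then, exactly as before: cmd = cmdFormat % (...)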

regards,
Jan

On Tue, Sep 28, 2021 at 1:55 PM Jan Florian Wagner <jwagner105 at googlemail.com> wrote:

> Hi all,
>
> has anyone tried out the Intel oneAPI 2021.3 packages? How has your
> experience been? In particular, did you get Intel MPI working?
>
> I've installed oneAPI here, and the DiFX components compile fine under the
> required Intel icc (C), icpc (C++), and Intel MPI mpicxx compilers, plus
> the Intel IPP 2021.3 library.
>
> However, I cannot get MPI to work across compute nodes and Mark6 units. For
> example:
>
> $ which mpirun
> /opt/intel/oneapi/mpi/2021.3.1/bin/mpirun
>
> ($ export I_MPI_PLATFORM=auto)
> $ mpirun -prepend-rank -n 6 -perhost 1 -machinefile intel.hostfile
> -bind-to none -iface ib0 mpifxcorr
> [1] About to run MPIInit on node mark6-02
> [0] About to run MPIInit on node mark6-01
> [2] About to run MPIInit on node mark6-03
> [5] About to run MPIInit on node node12.service
> [3] About to run MPIInit on node node10.service
> [4] About to run MPIInit on node node11.service
> [1] Abort(1615503) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init:
> Other MPI error, error stack:
> [1] MPIR_Init_thread(138)........:
> [1] MPID_Init(1169)..............:
> [1] MPIDI_OFI_mpi_init_hook(1842):
> [1] MPIDU_bc_table_create(336)...: Missing hostname or invalid host/port
> description in business card
>
> The error is quite cryptic and I have not found much help elsewhere
> online.
>
> Maybe someone here has come across it?
>
> Oddly, mpirun, or rather the MPI_Init() in mpifxcorr, works just fine when
> the machinefile contains only Mark6 units, or when it contains only compute
> nodes.
>
> Mixing compute nodes and Mark6 units leads to the above error. All hosts
> run the same CentOS 7.7.1908 and have Mellanox InfiniBand mlx4_0 as ib0...
>
> many thanks,
> regards,
> Jan
>