[Difx-users] Intel oneAPI and MPI
Adam Deller adeller at astro.swin.edu.au
Thu Sep 30 09:36:17 EDT 2021

Thanks Jan!
On Thu, 30 Sept 2021 at 22:12, Jan Florian Wagner via Difx-users <
difx-users at listmgr.nrao.edu> wrote:
> Hi all,
>
> in the end the issue was that CentOS libucx wasn't installed on any Mark6,
> and a few compute nodes were missing it as well.
>
> With debug output enabled during mpirun ("I_MPI_DEBUG=10 mpirun <args>"), it
> turned out that MPI tried to use the libfabric provider psm3 on the Mark6s,
> whereas on the compute nodes it tried the libfabric provider mlx, which in
> turn depends on ucx. I installed the CentOS-provided libucx on all hosts
> that were lacking it.
>
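For anyone repeating this diagnosis, here is a rough Python sketch that pulls the libfabric provider names out of captured debug output. The line shape it matches ("... libfabric provider: <name>") and the two sample strings are assumptions for illustration, not a captured log, and find_providers is a made-up helper name. Check the pattern against your own I_MPI_DEBUG output first.

```python
import re

# Assumed line shape: "... libfabric provider: <name>".
# Verify this against your own I_MPI_DEBUG output before relying on it.
PROVIDER_RE = re.compile(r"libfabric provider:\s*(\S+)")


def find_providers(debug_text):
    """Return the sorted, de-duplicated provider names found in the text."""
    return sorted(set(PROVIDER_RE.findall(debug_text)))


# Illustrative strings only, standing in for two separately captured runs.
mark6_log = "[0] MPI startup(): libfabric provider: psm3\n"
compute_log = "[0] MPI startup(): libfabric provider: mlx\n"

if __name__ == "__main__":
    print("Mark6-only run:  ", find_providers(mark6_log))
    print("compute-only run:", find_providers(compute_log))
```

Since runs with only Mark6 units or only compute nodes start up fine, capturing one log from each and comparing the two lists is one way to spot a provider mismatch like the psm3/mlx one described above.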
> With that, DiFX correlation started working for mixed Mark6 and file-based
> correlation. For reference, here are the settings:
>
> export I_MPI_FABRICS=shm:ofi
> export FI_PROVIDER=psm3
> # or: export FI_PROVIDER=verbs
> export DIFX_MPIRUNOPTIONS="-print-rank-map -prepend-rank -perhost 1 -iface ib0"
> # or: export DIFX_MPIRUNOPTIONS="-gdb -print-rank-map -prepend-rank -perhost 1 -iface ib0"  # parallel debug
> # or: export DIFX_MPIRUNOPTIONS=" -l vtune -collect hotspots -k sampling-mode=hw -print-rank-map -prepend-rank -perhost 1 -iface ib0"  # profiling
>
> startdifx -v -f *.input
>
>
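If the job is launched from a script rather than an interactive shell, the same settings can be collected in one place. This is a minimal sketch, assuming startdifx is started from Python; intel_mpi_env and its parameters are hypothetical names, and only the three variables and their values come from the settings quoted above.

```python
import os


def intel_mpi_env(provider="psm3", iface="ib0", base_env=None):
    """Return a copy of the environment with the Intel MPI settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_FABRICS"] = "shm:ofi"
    env["FI_PROVIDER"] = provider  # "psm3", or "verbs" as the alternative
    env["DIFX_MPIRUNOPTIONS"] = (
        "-print-rank-map -prepend-rank -perhost 1 -iface %s" % iface
    )
    return env


# Possible use (not executed here):
#   import glob, subprocess
#   subprocess.run(["startdifx", "-v", "-f"] + glob.glob("*.input"),
#                  env=intel_mpi_env())
```

Returning a copy instead of modifying os.environ keeps the settings local to the one launch, which makes it easier to switch between psm3 and verbs when testing.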
> The standard 'startdifx' code needs one small change in the line that
> prepares the mpirun command. The command string, which for OpenMPI is
>
> cmd = 'mpirun -np %d --hostfile %s.machines %s  %s %s.input%s' % (...)
>
> needs -n and -machinefile in place of -np and --hostfile for Intel
> MPI+Hydra, like this:
>
> cmd = 'mpirun -n %d -machinefile %s.machines %s  %s %s.input%s' % (...)
>
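To keep one startdifx that works with both MPI flavours, the two command strings could be selected by a switch. A sketch only: the two templates are the ones quoted above, but build_mpirun_cmd, its argument names, and the order of the values filled in are my guesses, since the original argument tuple is elided as (...).

```python
# The two format strings quoted above, keyed by MPI flavour.
MPIRUN_TEMPLATES = {
    "openmpi": "mpirun -np %d --hostfile %s.machines %s  %s %s.input%s",
    "intel": "mpirun -n %d -machinefile %s.machines %s  %s %s.input%s",
}


def build_mpirun_cmd(flavor, nproc, job, options, program, suffix=""):
    """Fill in the template for the chosen flavour.

    The argument order is an assumption; the real tuple in startdifx
    is not shown in the original message.
    """
    template = MPIRUN_TEMPLATES[flavor]
    return template % (nproc, job, options, program, job, suffix)


if __name__ == "__main__":
    # Hypothetical job name and options, for illustration only.
    print(build_mpirun_cmd("intel", 6, "example_1", "-iface ib0", "mpifxcorr"))
```

An unknown flavour raises KeyError, which seems preferable to silently falling back to the OpenMPI flags.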
> regards,
> Jan
>
> On Tue, Sep 28, 2021 at 1:55 PM Jan Florian Wagner <
> jwagner105 at googlemail.com> wrote:
>
>> Hi all,
>>
>> has anyone tried out the Intel oneAPI 2021.3 packages? How has your
>> experience been? In particular, did you get Intel MPI working?
>>
>> I've installed oneAPI here and the DiFX components compile fine under the
>> respectively required Intel icc (C), icpc (C++), or Intel MPI mpicxx
>> compilers, plus the Intel IPP 2021.3 library.
>>
>> However I cannot get MPI to work across compute nodes and Mark6. For
>> example:
>>
>> $ which mpirun
>> /opt/intel/oneapi/mpi/2021.3.1/bin/mpirun
>>
>> ($ export I_MPI_PLATFORM=auto)
>> $ mpirun -prepend-rank -n 6 -perhost 1 -machinefile intel.hostfile -bind-to none -iface ib0 mpifxcorr
>> [1] About to run MPIInit on node mark6-02
>> [0] About to run MPIInit on node mark6-01
>> [2] About to run MPIInit on node mark6-03
>> [5] About to run MPIInit on node node12.service
>> [3] About to run MPIInit on node node10.service
>> [4] About to run MPIInit on node node11.service
>> [1] Abort(1615503) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
>> [1] MPIR_Init_thread(138)........:
>> [1] MPID_Init(1169)..............:
>> [1] MPIDI_OFI_mpi_init_hook(1842):
>> [1] MPIDU_bc_table_create(336)...: Missing hostname or invalid host/port description in business card
>>
>> The error is quite cryptic and I have not found much help elsewhere
>> online.
>>
>> Maybe someone here has come across it?
>>
>> Oddly, mpirun or rather the MPI_Init() in mpifxcorr works just fine when
>> the machinefile contains only Mark6 units, or when it contains only compute
>> nodes.
>>
>> Mixing both compute and Mark6 leads to the above error. All hosts have
>> the same CentOS 7.7.1908 and Mellanox Infiniband mlx4_0 as ib0...
>>
>> many thanks,
>> regards,
>> Jan
>>
> _______________________________________________
> Difx-users mailing list
> Difx-users at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>
-- 
!=============================================================!
A/Prof. Adam Deller
ARC Future Fellow
Centre for Astrophysics & Supercomputing
Swinburne University of Technology
John St, Hawthorn VIC 3122 Australia
phone: +61 3 9214 5307
fax: +61 3 9214 8797
office days (usually): Mon-Thu
!=============================================================!