[Difx-users] Error in running the startdifx command with DiFX software

Adam Deller adeller at astro.swin.edu.au
Thu Jun 22 03:53:32 EDT 2023


Hi De Wu,

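The "SIGPIPE detected on fd 13 - aborting" errors when running mpispeed are
caused by piping the output to head: head exits after printing its first 10
lines, which closes the pipe while mpirun is still writing to it.  Remove the
"| head" and you should see it run normally.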
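
For illustration, the same effect can be reproduced without MPI. The
following is a minimal Python sketch, not part of DiFX: the function name
run_producer_until_reader_quits and the stand-in producer are made up for
this example, and a POSIX system is assumed. It starts a child process that
prints lines indefinitely, reads only the first few lines (as head would),
then closes the pipe and reports how the child ended.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a long-running program such as mpispeed: it prints
# lines until it is killed. Python ignores SIGPIPE by default, so the default
# action is restored here to mimic an ordinary compiled program.
PRODUCER = """
import signal
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
i = 0
while True:
    print("line", i, flush=True)
    i += 1
"""


def run_producer_until_reader_quits(lines_to_keep=3):
    """Read a few lines from the child, then close the pipe like head does."""
    child = subprocess.Popen([sys.executable, "-c", PRODUCER],
                             stdout=subprocess.PIPE)
    kept = [child.stdout.readline().decode().rstrip()
            for _ in range(lines_to_keep)]
    child.stdout.close()          # the reader goes away, as when head exits
    return kept, child.wait()     # negative return code = killed by that signal


if __name__ == "__main__":
    kept, returncode = run_producer_until_reader_quits()
    print(kept)
    print("child killed by SIGPIPE:", returncode == -signal.SIGPIPE)
```

The child is not faulty here; it is stopped only because its reader went
away, which is why the message disappears once the output is no longer piped
to head.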

For running mpifxcorr, the obvious difference between your invocation of
mpispeed and mpifxcorr is the use of the various mca options.  What happens
if you add " --mca mpi_yield_when_idle 1 --mca rmaps seq" to your
mpispeed launch (before or after the -H localhost,localhost)?  If it
doesn't work, then probably one or the other of those options is the
problem, and you need to change startdifx to get rid of the offending
option when running mpirun.
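
To narrow down which option is at fault, one possible approach (my own
extension of the test above, not something startdifx does) is to try
mpispeed with no extra options, with each option alone, and with both. The
Python sketch below only builds and prints those command lines so they can be
run by hand; the names BASE, PROGRAM, MCA_OPTIONS and mpispeed_variants are
hypothetical, while the flags themselves are the ones quoted in this thread.

```python
import shlex

# Pieces of the mpispeed launch used earlier in this thread.
BASE = ["mpirun", "-H", "localhost,localhost"]
PROGRAM = ["mpispeed", "1000", "10s", "1"]

# The two MCA options that startdifx passes to mpirun.
MCA_OPTIONS = {
    "mpi_yield_when_idle": ["--mca", "mpi_yield_when_idle", "1"],
    "rmaps": ["--mca", "rmaps", "seq"],
}


def mpispeed_variants():
    """Return (label, argv) pairs: no options, each option alone, then both."""
    variants = [("none", BASE + PROGRAM)]
    for name, flags in MCA_OPTIONS.items():
        variants.append((name, BASE + flags + PROGRAM))
    both = [flag for flags in MCA_OPTIONS.values() for flag in flags]
    variants.append(("both", BASE + both + PROGRAM))
    return variants


if __name__ == "__main__":
    for label, argv in mpispeed_variants():
        print(f"{label:>20}: {shlex.join(argv)}")
```

If only one of the single-option variants fails, that option is the likely
culprit.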

If mpispeed still works with those options, try the following and see what
output each one produces:
1. Manually run
   mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input
2. Manually run
   mpirun -np 4 -H localhost,localhost,localhost,localhost --mca mpi_yield_when_idle 1 --mca rmaps seq runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input
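
If it helps to keep the output of both runs for comparison, the two commands
can be wrapped in a small script. This is only a sketch: run_and_report,
COMMON and COMMANDS are hypothetical names, the 300 second timeout is an
arbitrary choice, and the command lines are copied from the two steps above.
It merges stderr into stdout so that anything mpirun or mpifxcorr prints
before aborting is captured alongside the exit code.

```python
import subprocess
import sys


def run_and_report(argv, timeout=300):
    """Run argv with stderr merged into stdout; return (exit_code, output)."""
    result = subprocess.run(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True,
                            timeout=timeout)
    return result.returncode, result.stdout


# Arguments shared by both manual runs.
COMMON = ["--mca", "mpi_yield_when_idle", "1", "--mca", "rmaps", "seq",
          "runmpifxcorr.DiFX-2.6.2", "/vlbi/aov070/aov070_1.input"]

# Run 1 uses the machines file; run 2 names the hosts explicitly.
COMMANDS = {
    "hostfile": ["mpirun", "-np", "4", "--hostfile",
                 "/vlbi/aov070/aov070_1.machines"] + COMMON,
    "explicit hosts": ["mpirun", "-np", "4", "-H",
                       "localhost,localhost,localhost,localhost"] + COMMON,
}
```

On the machine where DiFX is installed, calling
run_and_report(COMMANDS["hostfile"]) and then
run_and_report(COMMANDS["explicit hosts"]) returns the exit code and the full
output of each run. If the two runs behave differently, the machines file is
the first thing to check.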

Cheers,
Adam

On Mon, 19 Jun 2023 at 18:02, 深空探测 via Difx-users <
difx-users at listmgr.nrao.edu> wrote:

> Hello,
>
> I recently reinstalled OpenMPI-1.6.5 and successfully ran the example
> program provided within the OpenMPI package. By executing the command
> "mpiexec -n 6 ./hello_c," I obtained the following output:
>
> ```
> wude at wude DiFX-2.6.2 examples> mpiexec -n 6 ./hello_c
> Hello, world, I am 4 of 6
> Hello, world, I am 2 of 6
> Hello, world, I am 0 of 6
> Hello, world, I am 1 of 6
> Hello, world, I am 3 of 6
> Hello, world, I am 5 of 6
> ```
>
> The program executed without any issues, displaying the expected output.
> Each line represents a separate process, showing the process number and the
> total number of processes involved.
>
> However, I encountered some difficulties when running the command "mpirun
> -H localhost,localhost mpispeed 1000 10s 1 | head." Although both nodes
> seem to run properly, there appear to be some errors in the output. Below
> is the output I received, with "wude" being my username:
>
> ```
> wude at wude DiFX-2.6.2 ~> mpirun -H localhost,localhost mpispeed 1000 10s 1
> | head
> Processor = wude
> Rank = 0/2
> [0] Starting
> Processor = wude
> Rank = 1/2
> [1] Starting
> [1] Recvd 0 -> 0 : 2740.66 Mbps curr : 2740.66 Mbps mean
> [1] Recvd 1 -> 0 : 60830.52 Mbps curr : 5245.02 Mbps mean
> [1] Recvd 2 -> 0 : 69260.57 Mbps curr : 7580.50 Mbps mean
> [1] Recvd 3 -> 0 : 68545.44 Mbps curr : 9747.65 Mbps mean
> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
> mpirun: killing job...
>
> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
> mpirun: killing job...
> ```
>
> I'm unsure whether you experience the same "mpirun: SIGPIPE detected on fd
> 13 - aborting mpirun: killing job..." message when running this command on
> your computer.
>
> Furthermore, when I ran the command "startdifx -v -f -n aov070.joblist,"
> the .difx file was not generated. Could you please provide some guidance or
> suggestions to help me troubleshoot this issue?
>
> Here is the output I received when running the command:
>
> ```
> wude at wude DiFX-2.6.2 aov070> startdifx -v -f -n aov070.joblist
> No errors with input file /vlbi/aov070/aov070_1.input
>
> Executing:  mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines --mca
> mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
> /vlbi/aov070/aov070_1.input
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> Elapsed time (s) = 82.2610619068
> ```
>
> Best regards,
>
> De Wu
>
> On Thu, 25 May 2023 at 08:42, Adam Deller <adeller at astro.swin.edu.au> wrote:
>
>> Hi De Wu,
>>
>> If I run
>>
>> mpirun -H localhost,localhost mpispeed 1000 10s 1
>>
>> it runs correctly as follows:
>>
>> adeller at ar313-adeller trunk Downloads> mpirun -H localhost,localhost
>> mpispeed 1000 10s 1 | head
>> Processor = <my host name>
>> Rank = 0/2
>> [0] Starting
>> Processor = <my host name>
>> Rank = 1/2
>> [1] Starting
>>
>> It seems like in your case, MPI is looking at the two identical host
>> names you've given and is deciding to only start one process, rather than
>> two. What if you run
>>
>> mpirun -n 2 -H wude,wude mpispeed 1000 10s 1
>>
>> ?
>>
>> I think the issue is with your MPI installation / the parameters being
>> passed to mpirun. Unfortunately as I've mentioned previously the behaviour
>> of MPI with default parameters seems to change from implementation to
>> implementation and version to version - you just need to track down what is
>> needed to make sure it actually runs the number of processes you want on
>> the nodes you want!
>>
>> Cheers,
>> Adam
>>
>>
>> On Wed, 24 May 2023 at 18:30, 深空探测 via Difx-users <
>> difx-users at listmgr.nrao.edu> wrote:
>>
>>> Hi All,
>>>
>>> I am writing to seek assistance regarding an issue I encountered while
>>> working with MPI on a CentOS 7 virtual machine.
>>>
>>> I have successfully installed openmpi-1.6.5 on the CentOS 7 virtual
>>> machine. However, when I attempted to execute the command "startdifx -f -n
>>> -v aov070.joblist," I received the following error message:
>>>
>>> "Environment variable DIFX_CALC_PROGRAM was set, so
>>> Using specified calc program: difxcalc
>>>
>>> No errors with input file /vlbi/corr/aov070/aov070_1.input
>>>
>>> Executing: mpirun -np 4 --hostfile /vlbi/corr/aov070/aov070_1.machines
>>> --mca mpi_yield_when_idle 1 --mca rmaps seq runmpifxcorr.DiFX-2.6.2
>>> /vlbi/corr/aov070/aov070_1.input
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------"
>>>
>>> To further investigate the MPI functionality, I wrote a Python program
>>> “mpi_hello_world.py” as follows:
>>>
>>> from mpi4py import MPI
>>>
>>> comm = MPI.COMM_WORLD
>>> rank = comm.Get_rank()
>>> size = comm.Get_size()
>>>
>>> print("Hello from rank", rank, "of", size)
>>>
>>> When I executed the command "mpiexec -n 4 python mpi_hello_world.py,"
>>> the output was as follows:
>>>
>>> ('Hello from rank', 0, 'of', 1)
>>> ('Hello from rank', 0, 'of', 1)
>>> ('Hello from rank', 0, 'of', 1)
>>> ('Hello from rank', 0, 'of', 1)
>>>
>>> Additionally, I attempted to test the MPI functionality using the
>>> "mpispeed" command with the following execution command: "mpirun -H
>>> wude,wude mpispeed 1000 10s 1".  “wude” is my hostname. However, I
>>> encountered the following error message:
>>>
>>> "Processor = wude
>>> Rank = 0/1
>>> Sorry, must run with an even number of processes
>>> This program should be invoked in a manner similar to:
>>> mpirun -H host1,host2,...,hostN mpispeed [<numSends>|<timeSend>s]
>>> [<sendSizeMByte>]
>>>     where
>>>         numSends: number of blocks to send (e.g., 256), or
>>>         timeSend: duration in seconds to send (e.g., 100s)
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------"
>>>
>>> I am uncertain about the source of these issues and would greatly
>>> appreciate your guidance in resolving them. If you have any insights or
>>> suggestions regarding the aforementioned errors and how I can rectify them,
>>> please let me know.
>>>
>>> Thank you for your time and assistance.
>>>
>>> Best regards,
>>>
>>> De Wu
>>> _______________________________________________
>>> Difx-users mailing list
>>> Difx-users at listmgr.nrao.edu
>>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>>>
>>
>>
>> --
>> !=============================================================!
>> Prof. Adam Deller
>> Centre for Astrophysics & Supercomputing
>> Swinburne University of Technology
>> John St, Hawthorn VIC 3122 Australia
>> phone: +61 3 9214 5307
>> fax: +61 3 9214 8797
>> !=============================================================!
>>

