[Difx-users] Error in running the startdifx command with DiFX software

深空探测 wude7826580 at gmail.com
Mon Jun 19 04:01:31 EDT 2023


Hello,

I recently reinstalled OpenMPI-1.6.5 and successfully ran the example
program provided with the OpenMPI package. Running the command
"mpiexec -n 6 ./hello_c" produced the following output:

```
wude at wude DiFX-2.6.2 examples> mpiexec -n 6 ./hello_c
Hello, world, I am 4 of 6
Hello, world, I am 2 of 6
Hello, world, I am 0 of 6
Hello, world, I am 1 of 6
Hello, world, I am 3 of 6
Hello, world, I am 5 of 6
```

The program executed without any issues and displayed the expected output:
each line comes from a separate process and shows that process's rank and
the total number of processes.
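
To double-check the new installation, one further test would be the following small Python 3 snippet (a minimal sketch, assuming the mpi4py module from my earlier mpi_hello_world.py test is still installed; it is not part of DiFX), which reports which MPI library mpi4py was built against so it can be compared with the reinstalled OpenMPI-1.6.5:

```
# Minimal sketch: report which MPI library mpi4py was built against, so it
# can be compared with the reinstalled OpenMPI-1.6.5 that mpiexec/mpirun use.
import mpi4py
from mpi4py import MPI

print("mpi4py build configuration:", mpi4py.get_config())  # e.g. the mpicc used at build time
print("MPI vendor seen by mpi4py:", MPI.get_vendor())       # e.g. ('Open MPI', (1, 6, 5))
print("MPI standard implemented:", MPI.Get_version())       # e.g. (2, 1) for OpenMPI-1.6.5
```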

However, I encountered some difficulties when running the command "mpirun
-H localhost,localhost mpispeed 1000 10s 1 | head". Although both ranks
appear to start properly, there seem to be some errors in the output. Below
is the output I received, with "wude" being my username:

```
wude at wude DiFX-2.6.2 ~> mpirun -H localhost,localhost mpispeed 1000 10s 1 | head
Processor = wude
Rank = 0/2
[0] Starting
Processor = wude
Rank = 1/2
[1] Starting
[1] Recvd 0 -> 0 : 2740.66 Mbps curr : 2740.66 Mbps mean
[1] Recvd 1 -> 0 : 60830.52 Mbps curr : 5245.02 Mbps mean
[1] Recvd 2 -> 0 : 69260.57 Mbps curr : 7580.50 Mbps mean
[1] Recvd 3 -> 0 : 68545.44 Mbps curr : 9747.65 Mbps mean
[wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
mpirun: killing job...

[wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
mpirun: killing job...
```

I'm unsure whether you experience the same "mpirun: SIGPIPE detected on fd
13 - aborting mpirun: killing job..." message when running this command on
your computer.
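
For reference, the following small Python 3 experiment (only a sketch of my assumption that the SIGPIPE comes from "head" closing the pipe after its first ten lines, not a confirmed diagnosis of mpirun's behaviour) should reproduce the same signal with an ordinary long-running writer:

```
# Minimal sketch (assumption, not a confirmed diagnosis of mpirun): a writer
# that keeps printing into a pipe receives SIGPIPE once "head" exits after
# its first 10 lines and closes the read end of the pipe.
import signal
import subprocess

writer = subprocess.Popen(["yes"], stdout=subprocess.PIPE)  # writes lines forever
reader = subprocess.Popen(["head"], stdin=writer.stdout)    # exits after 10 lines
writer.stdout.close()  # only "head" now holds the read end of the pipe
reader.wait()
writer.wait()
print("head exit status:", reader.returncode)  # 0
print("yes exit status:", writer.returncode)   # -13, i.e. killed by SIGPIPE
print("SIGPIPE is signal number", int(signal.SIGPIPE))
```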

Furthermore, when I ran the command "startdifx -v -f -n aov070.joblist",
the .difx file was not generated. Could you please provide some guidance or
suggestions to help me troubleshoot this issue?

Here is the output I received when running the command:

```
wude at wude DiFX-2.6.2 aov070> startdifx -v -f -n aov070.joblist
No errors with input file /vlbi/aov070/aov070_1.input

Executing:  mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
Elapsed time (s) = 82.2610619068
```
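
In case it is useful, here is a small sanity check for the machines file (a minimal sketch; the path and the -np value are simply taken from the mpirun command above), since with "--mca rmaps seq" OpenMPI places one rank per line of the hostfile:

```
# Minimal sketch: with "--mca rmaps seq" OpenMPI maps one rank per line of
# the hostfile, so the machines file must list at least as many usable hosts
# as -np requests. Path and rank count are taken from the command above.
NP = 4
MACHINES_FILE = "/vlbi/aov070/aov070_1.machines"

with open(MACHINES_FILE) as f:
    hosts = [line.split()[0] for line in f
             if line.strip() and not line.lstrip().startswith("#")]

print("hosts listed:", hosts)
print("usable lines:", len(hosts), "ranks requested:", NP)
if len(hosts) < NP:
    print("machines file has fewer entries than -np; mpirun is likely to fail")
```
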
Best regards,

De Wu

On Thu, 25 May 2023 at 08:42, Adam Deller <adeller at astro.swin.edu.au> wrote:

> Hi De Wu,
>
> If I run
>
> mpirun -H localhost,localhost mpispeed 1000 10s 1
>
> it runs correctly as follows:
>
> adeller at ar313-adeller trunk Downloads> mpirun -H localhost,localhost
> mpispeed 1000 10s 1 | head
> Processor = <my host name>
> Rank = 0/2
> [0] Starting
> Processor = <my host name>
> Rank = 1/2
> [1] Starting
>
> It seems like in your case, MPI is looking at the two identical host names
> you've given and is deciding to only start one process, rather than two.
> What if you run
>
> mpirun -n 2 -H wude,wude mpispeed 1000 10s 1
>
> ?
>
> I think the issue is with your MPI installation / the parameters being
> passed to mpirun. Unfortunately, as I've mentioned previously, the behaviour
> of MPI with default parameters seems to change from implementation to
> implementation and version to version - you just need to track down what is
> needed to make sure it actually runs the number of processes you want on
> the nodes you want!
>
> Cheers,
> Adam
>
>
> On Wed, 24 May 2023 at 18:30, 深空探测 via Difx-users <
> difx-users at listmgr.nrao.edu> wrote:
>
>> Hi All,
>>
>> I am writing to seek assistance regarding an issue I encountered while
>> working with MPI on a CentOS 7 virtual machine.
>>
>> I have successfully installed openmpi-1.6.5 on the CentOS 7 virtual
>> machine. However, when I attempted to execute the command "startdifx -f -n
>> -v aov070.joblist," I received the following error message:
>>
>> "Environment variable DIFX_CALC_PROGRAM was set, so
>> Using specified calc program: difxcalc
>>
>> No errors with input file /vlbi/corr/aov070/aov070_1.input
>>
>> Executing: mpirun -np 4 --hostfile /vlbi/corr/aov070/aov070_1.machines
>> --mca mpi_yield_when_idle 1 --mca rmaps seq runmpifxcorr.DiFX-2.6.2
>> /vlbi/corr/aov070/aov070_1.input
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>>
>> --------------------------------------------------------------------------"
>>
>> To further investigate the MPI functionality, I wrote a Python program
>> “mpi_hello_world.py” as follows:
>>
>> from mpi4py import MPI
>>
>> comm = MPI.COMM_WORLD
>> rank = comm.Get_rank()
>> size = comm.Get_size()
>>
>> print("Hello from rank", rank, "of", size)
>>
>> When I executed the command "mpiexec -n 4 python mpi_hello_world.py," the
>> output was as follows:
>>
>> ('Hello from rank', 0, 'of', 1)
>> ('Hello from rank', 0, 'of', 1)
>> ('Hello from rank', 0, 'of', 1)
>> ('Hello from rank', 0, 'of', 1)
>>
>> Additionally, I attempted to test the MPI functionality using the
>> "mpispeed" command with the following execution command: "mpirun -H
>> wude,wude mpispeed 1000 10s 1".  “wude” is my hostname. However, I
>> encountered the following error message:
>>
>> "Processor = wude
>> Rank = 0/1
>> Sorry, must run with an even number of processes
>> This program should be invoked in a manner similar to:
>> mpirun -H host1,host2,...,hostN mpispeed [<numSends>|<timeSend>s]
>> [<sendSizeMByte>]
>>     where
>>         numSends: number of blocks to send (e.g., 256), or
>>         timeSend: duration in seconds to send (e.g., 100s)
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>>
>> --------------------------------------------------------------------------"
>>
>> I am uncertain about the source of these issues and would greatly
>> appreciate your guidance in resolving them. If you have any insights or
>> suggestions regarding the aforementioned errors and how I can rectify them,
>> please let me know.
>>
>> Thank you for your time and assistance.
>>
>> Best regards,
>>
>> De Wu
>> _______________________________________________
>> Difx-users mailing list
>> Difx-users at listmgr.nrao.edu
>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>>
>
>
> --
> !=============================================================!
> Prof. Adam Deller
> Centre for Astrophysics & Supercomputing
> Swinburne University of Technology
> John St, Hawthorn VIC 3122 Australia
> phone: +61 3 9214 5307
> fax: +61 3 9214 8797
> !=============================================================!
>