[Difx-users] Error in running the startdifx command with DiFX software

深空探测 wude7826580 at gmail.com
Sun Jun 25 11:57:43 EDT 2023


Hi Adam,

As you suggested, I removed the "| head" from the command, and I was able
to run it successfully.

However, when I ran the command "mpirun -np 4 --hostfile
/vlbi/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq
runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input", the output was the
following message:

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
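
Since this command takes its nodes from the .machines file, one thing that
can be checked is whether that file lists at least as many host lines as the
4 processes requested with -np. The sketch below is only an illustration: it
assumes (not confirmed in this thread) that with "--mca rmaps seq" Open MPI
places one process per hostfile line, in order, and the helper names
count_hostfile_entries and check_hostfile are made up for this example.

```python
def count_hostfile_entries(text):
    """Return the host names found in hostfile text, one per non-empty line.

    Blank lines and '#' comments are skipped; anything after the host
    name on a line (for example 'slots=2') is ignored.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])
    return hosts


def check_hostfile(text, np_requested):
    """Return (ok, hosts); ok is True when there is at least one host line
    per requested process.

    ASSUMPTION: the sequential mapper needs one hostfile line per process.
    """
    hosts = count_hostfile_entries(text)
    return len(hosts) >= np_requested, hosts


# Hypothetical hostfile contents, used here only to show the behaviour.
sample = "# head node\nlocalhost\nlocalhost slots=1\n\nwude  # datastream\n"
ok, hosts = check_hostfile(sample, 4)
print(len(hosts), "host line(s) found:", hosts)
print("enough lines for -np 4" if ok else "fewer host lines than processes")
```

To apply it to the real job, the text could be read with
open("/vlbi/aov070/aov070_1.machines").read() and passed to check_hostfile
together with 4.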

Additionally, running the command "mpirun -np 4 -H
localhost,localhost,localhost,localhost --mca mpi_yield_when_idle 1 --mca
rmaps seq runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input" resulted in
the following error message:

--------------------------------------------------------------------------
There are no nodes allocated to this job.
--------------------------------------------------------------------------

It is quite puzzling that even when specifying only one localhost in the
command, I still receive this output. I have been considering the
possibility that this issue might be due to limitations in system
resources, node access permissions, or node configuration within the
CentOS7 virtual machine environment.
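
To narrow this down along the lines of the earlier suggestion (adding the
two mca options to the mpispeed launch to see which one, if either, is the
problem), the test commands can be generated systematically. The following
is a minimal sketch, not part of DiFX: the function name mpirun_variants is
hypothetical, it only prints the command lines and does not run them, and
it uses Python 3.

```python
import itertools
import shlex

# The two options that startdifx passes to mpirun in the output above.
MCA_OPTIONS = [
    ("mpi_yield_when_idle", "1"),
    ("rmaps", "seq"),
]


def mpirun_variants(program_args, hosts=("localhost", "localhost")):
    """Return mpirun command lines covering every subset of MCA_OPTIONS.

    The first variant has no mca options and the last one has both, so
    running them in order shows which option (if any) breaks the launch.
    """
    base = ["mpirun", "-np", str(len(hosts)), "-H", ",".join(hosts)]
    variants = []
    for r in range(len(MCA_OPTIONS) + 1):
        for subset in itertools.combinations(MCA_OPTIONS, r):
            cmd = list(base)
            for name, value in subset:
                cmd += ["--mca", name, value]
            variants.append(cmd + list(program_args))
    return variants


# Print the four mpispeed test commands so they can be pasted into a shell.
for cmd in mpirun_variants(["mpispeed", "1000", "10s", "1"]):
    print(" ".join(shlex.quote(part) for part in cmd))
```

If only the variants containing "--mca rmaps seq" fail, that would point at
the mapper option rather than at system resources or permissions of the
virtual machine.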

Thank you for your attention to this matter.

Best regards,

De Wu

Adam Deller <adeller at astro.swin.edu.au> wrote on Thu, 22 Jun 2023 at 15:53:

> Hi De Wu,
>
> The "SIGPIPE detected on fd 13 - aborting" errors when running mpispeed
> are related to piping the output to head.  Remove the "| head" and you
> should see it run normally.
>
> For running mpifxcorr, the obvious difference between your invocation of
> mpispeed and mpifxcorr is the use of the various mca options.  What happens
> if you add " --mca mpi_yield_when_idle 1 --mca rmaps seq" to your
> mpispeed launch (before or after the -H localhost,localhost)?  If it
> doesn't work, then probably one or the other of those options is the
> problem, and you need to change startdifx to get rid of the offending
> option when running mpirun.
>
> If running mpispeed still works with those options, what about the
> following:
> 1. manually run mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
> --mca mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
> /vlbi/aov070/aov070_1.input, see what output comes out
> 2. manually run mpirun -np 4 -H localhost,localhost,localhost,localhost
> --mca mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
> /vlbi/aov070/aov070_1.input, see what output comes out
>
> Cheers,
> Adam
>
> On Mon, 19 Jun 2023 at 18:02, 深空探测 via Difx-users <
> difx-users at listmgr.nrao.edu> wrote:
>
>> Hello,
>>
>> I recently reinstalled OpenMPI-1.6.5 and successfully ran the example
>> program provided within the OpenMPI package. By executing the command
>> "mpiexec -n 6 ./hello_c," I obtained the following output:
>>
>> ```
>> wude at wude DiFX-2.6.2 examples> mpiexec -n 6 ./hello_c
>> Hello, world, I am 4 of 6
>> Hello, world, I am 2 of 6
>> Hello, world, I am 0 of 6
>> Hello, world, I am 1 of 6
>> Hello, world, I am 3 of 6
>> Hello, world, I am 5 of 6
>> ```
>>
>> The program executed without any issues, displaying the expected output.
>> Each line represents a separate process, showing the process number and the
>> total number of processes involved.
>>
>> However, I encountered some difficulties when running the command "mpirun
>> -H localhost,localhost mpispeed 1000 10s 1 | head." Although both nodes
>> seem to run properly, there appear to be some errors in the output. Below
>> is the output I received, with "wude" being my username:
>>
>> ```
>> wude at wude DiFX-2.6.2 ~> mpirun -H localhost,localhost mpispeed 1000 10s
>> 1 | head
>> Processor = wude
>> Rank = 0/2
>> [0] Starting
>> Processor = wude
>> Rank = 1/2
>> [1] Starting
>> [1] Recvd 0 -> 0 : 2740.66 Mbps curr : 2740.66 Mbps mean
>> [1] Recvd 1 -> 0 : 60830.52 Mbps curr : 5245.02 Mbps mean
>> [1] Recvd 2 -> 0 : 69260.57 Mbps curr : 7580.50 Mbps mean
>> [1] Recvd 3 -> 0 : 68545.44 Mbps curr : 9747.65 Mbps mean
>> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
>> mpirun: killing job...
>>
>> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
>> mpirun: killing job...
>> ```
>>
>> I'm unsure whether you experience the same "mpirun: SIGPIPE detected on
>> fd 13 - aborting mpirun: killing job..." message when running this command
>> on your computer.
>>
>> Furthermore, when I ran the command "startdifx -v -f -n aov070.joblist,"
>> the .difx file was not generated. Could you please provide some guidance or
>> suggestions to help me troubleshoot this issue?
>>
>> Here is the output I received when running the command:
>>
>> ```
>> wude at wude DiFX-2.6.2 aov070> startdifx -v -f -n aov070.joblist
>> No errors with input file /vlbi/aov070/aov070_1.input
>>
>> Executing:  mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines --mca
>> mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
>> /vlbi/aov070/aov070_1.input
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> Elapsed time (s) = 82.2610619068
>> ```
>> Best regards,
>>
>> De Wu
>>
>> Adam Deller <adeller at astro.swin.edu.au> wrote on Thu, 25 May 2023 at 08:42:
>>
>>> Hi De Wu,
>>>
>>> If I run
>>>
>>> mpirun -H localhost,localhost mpispeed 1000 10s 1
>>>
>>> it runs correctly as follows:
>>>
>>> adeller at ar313-adeller trunk Downloads> mpirun -H localhost,localhost
>>> mpispeed 1000 10s 1 | head
>>> Processor = <my host name>
>>> Rank = 0/2
>>> [0] Starting
>>> Processor =<my host name>
>>> Rank = 1/2
>>> [1] Starting
>>>
>>> It seems like in your case, MPI is looking at the two identical host
>>> names you've given and is deciding to only start one process, rather than
>>> two. What if you run
>>>
>>> mpirun -n 2 -H wude,wude mpispeed 1000 10s 1
>>>
>>> ?
>>>
>>> I think the issue is with your MPI installation / the parameters being
>>> passed to mpirun. Unfortunately as I've mentioned previously the behaviour
>>> of MPI with default parameters seems to change from implementation to
>>> implementation and version to version - you just need to track down what is
>>> needed to make sure it actually runs the number of processes you want on
>>> the nodes you want!
>>>
>>> Cheers,
>>> Adam
>>>
>>>
>>> On Wed, 24 May 2023 at 18:30, 深空探测 via Difx-users <
>>> difx-users at listmgr.nrao.edu> wrote:
>>>
>>>> Hi  All,
>>>>
>>>> I am writing to seek assistance regarding an issue I encountered while
>>>> working with MPI on a CentOS 7 virtual machine.
>>>>
>>>> I have successfully installed openmpi-1.6.5 on the CentOS 7 virtual
>>>> machine. However, when I attempted to execute the command "startdifx -f -n
>>>> -v aov070.joblist," I received the following error message:
>>>>
>>>> "Environment variable DIFX_CALC_PROGRAM was set, so
>>>> Using specified calc program: difxcalc
>>>>
>>>> No errors with input file /vlbi/corr/aov070/aov070_1.input
>>>>
>>>> Executing: mpirun -np 4 --hostfile /vlbi/corr/aov070/aov070_1.machines
>>>> --mca mpi_yield_when_idle 1 --mca rmaps seq runmpifxcorr.DiFX-2.6.2
>>>> /vlbi/corr/aov070/aov070_1.input
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------"
>>>>
>>>> To further investigate the MPI functionality, I wrote a Python program
>>>> “mpi_hello_world.py” as follows:
>>>>
>>>> from mpi4py import MPI
>>>>
>>>> comm = MPI.COMM_WORLD
>>>> rank = comm.Get_rank()
>>>> size = comm.Get_size()
>>>>
>>>> print("Hello from rank", rank, "of", size)
>>>>
>>>> When I executed the command "mpiexec -n 4 python mpi_hello_world.py,"
>>>> the output was as follows:
>>>>
>>>> ('Hello from rank', 0, 'of', 1)
>>>> ('Hello from rank', 0, 'of', 1)
>>>> ('Hello from rank', 0, 'of', 1)
>>>> ('Hello from rank', 0, 'of', 1)
>>>>
>>>> Additionally, I attempted to test the MPI functionality using the
>>>> "mpispeed" command with the following execution command: "mpirun -H
>>>> wude,wude mpispeed 1000 10s 1".  “wude” is my hostname. However, I
>>>> encountered the following error message:
>>>>
>>>> "Processor = wude
>>>> Rank = 0/1
>>>> Sorry, must run with an even number of processes
>>>> This program should be invoked in a manner similar to:
>>>> mpirun -H host1,host2,...,hostN mpispeed [<numSends>|<timeSend>s]
>>>> [<sendSizeMByte>]
>>>>     where
>>>>         numSends: number of blocks to send (e.g., 256), or
>>>>         timeSend: duration in seconds to send (e.g., 100s)
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------"
>>>>
>>>> I am uncertain about the source of these issues and would greatly
>>>> appreciate your guidance in resolving them. If you have any insights or
>>>> suggestions regarding the aforementioned errors and how I can rectify them,
>>>> please let me know.
>>>>
>>>> Thank you for your time and assistance.
>>>>
>>>> Best regards,
>>>>
>>>> De Wu
>>>> _______________________________________________
>>>> Difx-users mailing list
>>>> Difx-users at listmgr.nrao.edu
>>>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>>>>
>>>
>>>
>>> --
>>> !=============================================================!
>>> Prof. Adam Deller
>>> Centre for Astrophysics & Supercomputing
>>> Swinburne University of Technology
>>> John St, Hawthorn VIC 3122 Australia
>>> phone: +61 3 9214 5307
>>> fax: +61 3 9214 8797
>>> !=============================================================!