[Difx-users] Error in running the startdifx command with DiFX software {External}

Adam Deller adeller at astro.swin.edu.au
Thu Jul 6 00:16:48 EDT 2023


Hi Wu,

calcif2 is the delay-generating program that requires the calcserver to be
running (which wasn't the case for you). Setting DIFX_CALC_PROGRAM=difxcalc
determines which program will be called by startdifx. But you were
trying to run calcif2 itself from the command line, so naturally this won't
work. If you run difxcalc wude_1.calc, it should work. And as you saw, if
you run startdifx after setting DIFX_CALC_PROGRAM=difxcalc, that also
works fine.
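
A minimal sketch of that selection logic may help make the distinction concrete. It is not the actual startdifx source: the function names are hypothetical, and the fallback to calcif2 when the variable is unset is an assumption inferred from the behaviour described in this thread.

```python
import os

# Assumed fallback when DIFX_CALC_PROGRAM is unset (inferred from this thread,
# not copied from startdifx).
ASSUMED_DEFAULT = "calcif2"


def pick_calc_program(env=None):
    """Return the delay-model program name a launcher would call."""
    env = os.environ if env is None else env
    value = env.get("DIFX_CALC_PROGRAM", "").strip()
    return value or ASSUMED_DEFAULT


def calc_command(calc_file, env=None):
    """Build the argument list for the chosen program and a .calc file."""
    return [pick_calc_program(env), calc_file]


if __name__ == "__main__":
    # Shows which program would be picked in the current shell environment.
    print(calc_command("wude_1.calc"))
```

The point is that the environment variable only affects a launcher that reads it. Typing calcif2 directly at the prompt bypasses the lookup entirely.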

Once you have run difxcalc (or calcif2) the .im file will be generated. If
you try to run difxcalc/calcif2 again once the .im file has been generated,
it won't run unless you force it (since it sees that the .im file has been
generated, so no need to re-generate it).
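
The skip-if-present behaviour can be pictured roughly as below. This is a sketch under stated assumptions, not the real difxcalc/calcif2 code: it assumes the .im file sits next to the .calc file with the same base name, and the helper names and the `force` parameter are hypothetical stand-ins for whatever force option the program provides.

```python
from pathlib import Path


def im_file_for(calc_file):
    """Return the .im path assumed to sit alongside a .calc file."""
    return Path(calc_file).with_suffix(".im")


def should_run_calc(calc_file, force=False):
    """True if the delay model still needs generating, or if forced."""
    return force or not im_file_for(calc_file).exists()


if __name__ == "__main__":
    # Reports whether wude_1.im is already present in the current directory.
    print(should_run_calc("wude_1.calc"))
```

Deleting or renaming the existing .im file has the same effect as forcing, since the check then finds nothing to reuse.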

So your remaining problem now is that MPI seems to think that you don't
have any available CPUs on your host.  Once again (I think this is the
third time I'm making this suggestion): please try running the mpirun
command *without* the --mca options.  I.e.,

mpirun -np 4 --hostfile wude_1.machines runmpifxcorr.DiFX-2.6.2 wude_1.input

You may also have success by adding --oversubscribe to the mpirun command
(although that is more of a band-aid getting around the fact that it seems
that openmpi isn't seeing how many CPUs are available).

If you can figure out which mpirun option is causing the problem, you can
then modify startdifx to remove the offending option permanently.
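
One way to narrow it down is to try the command with no --mca options first and then with each option on its own. The sketch below only builds the command lines to try and does not run anything. The helper name `mpirun_variants` is hypothetical; the option names and file names are the ones quoted in this thread.

```python
def mpirun_variants(np, hostfile, program, input_file, mca_options):
    """Yield (label, argv) pairs: no --mca options first, then one at a time."""
    base = ["mpirun", "-np", str(np), "--hostfile", hostfile]
    tail = [program, input_file]
    yield "no mca options", base + tail
    for name, value in mca_options:
        yield f"only {name}", base + ["--mca", name, value] + tail


if __name__ == "__main__":
    options = [("mpi_yield_when_idle", "1"), ("rmaps", "seq")]
    for label, argv in mpirun_variants(
        4, "wude_1.machines", "runmpifxcorr.DiFX-2.6.2", "wude_1.input", options
    ):
        # Print each candidate so it can be pasted into the shell and tried.
        print(f"{label}: {' '.join(argv)}")
```

If the bare command works and exactly one of the single-option variants fails, that option is the one to take out of startdifx.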

Cheers,
Adam

On Tue, 4 Jul 2023 at 17:30, 深空探测 <wude7826580 at gmail.com> wrote:

> Subject: Issue with DiFX Testing - RPC Errors and CPU Allocation
>
> Hi Adam,
>
> I apologize for the delay in getting back to you. I've been conducting
> tests with DiFX lately, and I encountered a few issues that I would
> appreciate your insight on.
>
> Initially, I faced problems running the `mpirun` command, but I managed to
> resolve them by reinstalling DiFX on a new CentOS7 system. Previously, I
> had installed `openmpi-1.6.5` in the `/usr/local` directory, but this time,
> I used the command `sudo yum install openmpi-devel` to install `openmpi`,
> and then I installed DiFX in the `/home/wude/difx/DIFXROOT` directory.
> Following this setup, the `mpirun` command started working correctly. I
> suspect that the previous installation in the system directory might have
> been causing the issues with `mpirun`.
>
> However, I encountered a new problem when running the command `calcif2
> wude_1.calc`. The output displayed the following error:
>
>
> ----------------------------------------------------------------------------------------
> calcif2 processing file 1/1 = wude_1
> localhost: RPC: Program not registered
> Error: calcif2: RPC clnt_create fails for host: localhost
> Error: Cannot initialize CalcParams
>
> ----------------------------------------------------------------------------------------
>
> Previously, I resolved a similar error by running the command: `export
> DIFX_CALC_PROGRAM=difxcalc`. However, when I tried the same solution this
> time, it didn't resolve the issue.
>
> Additionally, when running the command: `mpirun -np 4 --hostfile
> wude_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq
> runmpifxcorr.DiFX-2.6.2 wude_1.input`, the output displayed the following
> message:
>
>
> ---------------------------------------------------------------------------------------------------------------
> While computing bindings, we found no available CPUs on the following node:
>     Node: wude
> Please check your allocation.
>
> ---------------------------------------------------------------------------------------------------------------
>
> My hostname is "wude", and it seems like there are no available CPUs, but
> I can't determine the cause of this issue. Hence, I am reaching out to seek
> your guidance on this matter.
>
> Thank you for your time and support.
>
> Best regards,
>
> De Wu
>
> Adam Deller <adeller at astro.swin.edu.au> wrote on Mon, 26 Jun 2023 at 07:36:
>
>> Have you tried removing the --mca options from the command? E.g.,
>>
>> mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
>> runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input
>>
>> I have a suspicion that either the seq or rmaps option is not playing
>> nice, but it is easiest to just remove all the options and see if that
>> makes any difference.
>>
>> Cheers,
>> Adam
>>
>> On Mon, 26 Jun 2023 at 01:58, 深空探测 <wude7826580 at gmail.com> wrote:
>>
>>> Hi Adam,
>>>
>>> As you suggested, I removed the "| head" from the command, and I was
>>> able to run it successfully.
>>>
>>> However, when I executed the command "mpirun -np 4 --hostfile
>>> /vlbi/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq
>>> runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input", the output displayed
>>> the following message:
>>>
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> Additionally, running the command "mpirun -np 4 -H
>>> localhost,localhost,localhost,localhost --mca mpi_yield_when_idle 1 --mca
>>> rmaps seq runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input"
>>> resulted in the following error message:
>>>
>>>
>>> --------------------------------------------------------------------------
>>> There are no nodes allocated to this job.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> It is quite puzzling that even when specifying only one localhost in the
>>> command, I still receive this output. I have been considering the
>>> possibility that this issue might be due to limitations in system
>>> resources, node access permissions, or node configuration within the
>>> CentOS7 virtual machine environment.
>>>
>>> Thank you for your attention to this matter.
>>>
>>> Best regards,
>>>
>>> De Wu
>>>
>>> Adam Deller <adeller at astro.swin.edu.au> wrote on Thu, 22 Jun 2023 at 15:53:
>>>
>>>> Hi De Wu,
>>>>
>>>> The "SIGPIPE detected on fd 13 - aborting" errors when running mpispeed
>>>> are related to piping the output to head.  Remove the "| head" and you
>>>> should see it run normally.
>>>>
>>>> For running mpifxcorr, the obvious difference between your invocation
>>>> of mpispeed and mpifxcorr is the use of the various mca options.  What
>>>> happens if you add " --mca mpi_yield_when_idle 1 --mca rmaps seq" to your
>>>> mpispeed launch (before or after the -H localhost,localhost)?  If it
>>>> doesn't work, then probably one or the other of those options is the
>>>> problem, and you need to change startdifx to get rid of the offending
>>>> option when running mpirun.
>>>>
>>>> If running mpispeed still works with those options, what about the
>>>> following:
>>>> 1. manually run mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
>>>> --mca mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
>>>> /vlbi/aov070/aov070_1.input, see what output comes out
>>>> 2. manually run mpirun -np 4 -H localhost,localhost,localhost,localhost
>>>> --mca mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
>>>> /vlbi/aov070/aov070_1.input, see what output comes out
>>>>
>>>> Cheers,
>>>> Adam
>>>>
>>>> On Mon, 19 Jun 2023 at 18:02, 深空探测 via Difx-users <
>>>> difx-users at listmgr.nrao.edu> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I recently reinstalled OpenMPI-1.6.5 and successfully ran the example
>>>>> program provided within the OpenMPI package. By executing the command
>>>>> "mpiexec -n 6 ./hello_c", I obtained the following output:
>>>>>
>>>>> ```
>>>>> wude at wude DiFX-2.6.2 examples> mpiexec -n 6 ./hello_c
>>>>> Hello, world, I am 4 of 6
>>>>> Hello, world, I am 2 of 6
>>>>> Hello, world, I am 0 of 6
>>>>> Hello, world, I am 1 of 6
>>>>> Hello, world, I am 3 of 6
>>>>> Hello, world, I am 5 of 6
>>>>> ```
>>>>>
>>>>> The program executed without any issues, displaying the expected
>>>>> output. Each line represents a separate process, showing the process number
>>>>> and the total number of processes involved.
>>>>>
>>>>> However, I encountered some difficulties when running the command
>>>>> "mpirun -H localhost,localhost mpispeed 1000 10s 1 | head". Although both
>>>>> nodes seem to run properly, there appear to be some errors in the output.
>>>>> Below is the output I received, with "wude" being my username:
>>>>>
>>>>> ```
>>>>> wude at wude DiFX-2.6.2 ~> mpirun -H localhost,localhost mpispeed 1000
>>>>> 10s 1 | head
>>>>> Processor = wude
>>>>> Rank = 0/2
>>>>> [0] Starting
>>>>> Processor = wude
>>>>> Rank = 1/2
>>>>> [1] Starting
>>>>> [1] Recvd 0 -> 0 : 2740.66 Mbps curr : 2740.66 Mbps mean
>>>>> [1] Recvd 1 -> 0 : 60830.52 Mbps curr : 5245.02 Mbps mean
>>>>> [1] Recvd 2 -> 0 : 69260.57 Mbps curr : 7580.50 Mbps mean
>>>>> [1] Recvd 3 -> 0 : 68545.44 Mbps curr : 9747.65 Mbps mean
>>>>> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
>>>>> mpirun: killing job...
>>>>>
>>>>> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
>>>>> mpirun: killing job...
>>>>> ```
>>>>>
>>>>> I'm unsure whether you experience the same "mpirun: SIGPIPE detected
>>>>> on fd 13 - aborting mpirun: killing job..." message when running this
>>>>> command on your computer.
>>>>>
>>>>> Furthermore, when I ran the command "startdifx -v -f -n
>>>>> aov070.joblist", the .difx file was not generated. Could you please provide
>>>>> some guidance or suggestions to help me troubleshoot this issue?
>>>>>
>>>>> Here is the output I received when running the command:
>>>>>
>>>>> ```
>>>>> wude at wude DiFX-2.6.2 aov070> startdifx -v -f -n aov070.joblist
>>>>> No errors with input file /vlbi/aov070/aov070_1.input
>>>>>
>>>>> Executing:  mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
>>>>> --mca mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
>>>>> /vlbi/aov070/aov070_1.input
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Elapsed time (s) = 82.2610619068
>>>>> ```
>>>>> Best regards,
>>>>>
>>>>> De Wu
>>>>>
>>>>> Adam Deller <adeller at astro.swin.edu.au> wrote on Thu, 25 May 2023 at 08:42:
>>>>>
>>>>>> Hi De Wu,
>>>>>>
>>>>>> If I run
>>>>>>
>>>>>> mpirun -H localhost,localhost mpispeed 1000 10s 1
>>>>>>
>>>>>> it runs correctly as follows:
>>>>>>
>>>>>> adeller at ar313-adeller trunk Downloads> mpirun -H localhost,localhost
>>>>>> mpispeed 1000 10s 1 | head
>>>>>> Processor = <my host name>
>>>>>> Rank = 0/2
>>>>>> [0] Starting
>>>>>> Processor =<my host name>
>>>>>> Rank = 1/2
>>>>>> [1] Starting
>>>>>>
>>>>>> It seems like in your case, MPI is looking at the two identical host
>>>>>> names you've given and is deciding to only start one process, rather than
>>>>>> two. What if you run
>>>>>>
>>>>>> mpirun -n 2 -H wude,wude mpispeed 1000 10s 1
>>>>>>
>>>>>> ?
>>>>>>
>>>>>> I think the issue is with your MPI installation / the parameters
>>>>>> being passed to mpirun. Unfortunately as I've mentioned previously the
>>>>>> behaviour of MPI with default parameters seems to change from
>>>>>> implementation to implementation and version to version - you just need to
>>>>>> track down what is needed to make sure it actually runs the number of
>>>>>> processes you want on the nodes you want!
>>>>>>
>>>>>> Cheers,
>>>>>> Adam
>>>>>>
>>>>>>
>>>>>> On Wed, 24 May 2023 at 18:30, 深空探测 via Difx-users <
>>>>>> difx-users at listmgr.nrao.edu> wrote:
>>>>>>
>>>>>>> Hi  All,
>>>>>>>
>>>>>>> I am writing to seek assistance regarding an issue I encountered
>>>>>>> while working with MPI on a CentOS 7 virtual machine.
>>>>>>>
>>>>>>> I have successfully installed openmpi-1.6.5 on the CentOS 7 virtual
>>>>>>> machine. However, when I attempted to execute the command "startdifx -f -n
>>>>>>> -v aov070.joblist", I received the following error message:
>>>>>>>
>>>>>>> "Environment variable DIFX_CALC_PROGRAM was set, so
>>>>>>> Using specified calc program: difxcalc
>>>>>>>
>>>>>>> No errors with input file /vlbi/corr/aov070/aov070_1.input
>>>>>>>
>>>>>>> Executing: mpirun -np 4 --hostfile
>>>>>>> /vlbi/corr/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps
>>>>>>> seq runmpifxcorr.DiFX-2.6.2 /vlbi/corr/aov070/aov070_1.input
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>> process that caused that situation.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------"
>>>>>>>
>>>>>>> To further investigate the MPI functionality, I wrote a Python
>>>>>>> program “mpi_hello_world.py” as follows:
>>>>>>>
>>>>>>> from mpi4py import MPI
>>>>>>>
>>>>>>> comm = MPI.COMM_WORLD
>>>>>>> rank = comm.Get_rank()
>>>>>>> size = comm.Get_size()
>>>>>>>
>>>>>>> print("Hello from rank", rank, "of", size)
>>>>>>>
>>>>>>> When I executed the command "mpiexec -n 4 python
>>>>>>> mpi_hello_world.py", the output was as follows:
>>>>>>>
>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>>
>>>>>>> Additionally, I attempted to test the MPI functionality using the
>>>>>>> "mpispeed" command with the following execution command: "mpirun -H
>>>>>>> wude,wude mpispeed 1000 10s 1".  “wude” is my hostname. However, I
>>>>>>> encountered the following error message:
>>>>>>>
>>>>>>> "Processor = wude
>>>>>>> Rank = 0/1
>>>>>>> Sorry, must run with an even number of processes
>>>>>>> This program should be invoked in a manner similar to:
>>>>>>> mpirun -H host1,host2,...,hostN mpispeed [<numSends>|<timeSend>s]
>>>>>>> [<sendSizeMByte>]
>>>>>>>     where
>>>>>>>         numSends: number of blocks to send (e.g., 256), or
>>>>>>>         timeSend: duration in seconds to send (e.g., 100s)
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>> process that caused that situation.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------"
>>>>>>>
>>>>>>> I am uncertain about the source of these issues and would greatly
>>>>>>> appreciate your guidance in resolving them. If you have any insights or
>>>>>>> suggestions regarding the aforementioned errors and how I can rectify them,
>>>>>>> please let me know.
>>>>>>>
>>>>>>> Thank you for your time and assistance.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> De Wu
>>>>>>> _______________________________________________
>>>>>>> Difx-users mailing list
>>>>>>> Difx-users at listmgr.nrao.edu
>>>>>>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> !=============================================================!
>>>>>> Prof. Adam Deller
>>>>>> Centre for Astrophysics & Supercomputing
>>>>>> Swinburne University of Technology
>>>>>> John St, Hawthorn VIC 3122 Australia
>>>>>> phone: +61 3 9214 5307
>>>>>> fax: +61 3 9214 8797
>>>>>> !=============================================================!
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>

-- 
!=============================================================!
Prof. Adam Deller
Centre for Astrophysics & Supercomputing
Swinburne University of Technology
John St, Hawthorn VIC 3122 Australia
phone: +61 3 9214 5307
fax: +61 3 9214 8797
!=============================================================!