[Difx-users] Error in running the startdifx command with DiFX software

Adam Deller adeller at astro.swin.edu.au
Thu Jul 6 00:18:42 EDT 2023


Sorry, I just saw that you had done this (and reported in your second
email):

*Subsequently, I proceeded to run the command "mpirun -np 8 -machinefile
wude_1.machines mpifxcorr wude_1.input," and I was able to obtain the
".difx" files successfully.*

So if you edit the startdifx file, find where mpirun is being invoked, and
remove those --mca options, you should be fine.
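
For example, a quick way to find the relevant spot (a sketch; the exact line
will depend on your DiFX version):

grep -n "mca" `which startdifx`

Once the --mca options are stripped, the command that startdifx builds should
reduce to essentially the mpirun invocation you ran by hand above.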

Cheers,
Adam

On Thu, 6 Jul 2023 at 14:16, Adam Deller <adeller at astro.swin.edu.au> wrote:

> Hi Wu,
>
> calcif2 is the delay-generating program that requires the calcserver to be
> running (which wasn't the case for you). Setting DIFX_CALC_PROGRAM=difxcalc
> determines which program will be called by startdifx.  But you were trying
> to run calcif2 itself from the command line, so naturally this won't work.
> If you run difxcalc wude_1.calc, it should work.  And as you saw, if you
> run startdifx after setting DIFX_CALC_PROGRAM=difxcalc, that also works
> fine.
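>
> For example, a minimal sketch (run in the directory containing wude_1.calc):
>
> export DIFX_CALC_PROGRAM=difxcalc   # make startdifx call difxcalc instead of calcif2
> difxcalc wude_1.calc                # or run it directly; this writes wude_1.im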
>
> Once you have run difxcalc (or calcif2), the .im file will be generated. If
> you try to run difxcalc/calcif2 again once the .im file exists, it won't
> run unless you force it (it sees that the .im file has already been
> generated and so skips regenerating it).
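>
> One simple way to force regeneration (a sketch; the programs may also offer
> a force option of their own - check their usage messages):
>
> rm -f wude_1.im          # remove the existing delay model
> difxcalc wude_1.calc     # regenerate it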
>
> So your remaining problem now is that MPI seems to think that you don't
> have any available CPUs on your host.  Once again (I think this is the
> third time I'm making this suggestion): please try running the mpirun
> command *without* the --mca options.  I.e.,
>
> mpirun -np 4 --hostfile wude_1.machines runmpifxcorr.DiFX-2.6.2
> wude_1.input
>
> You may also have success by adding --oversubscribe to the mpirun command
> (although that is more of a band-aid working around the fact that openmpi
> apparently isn't seeing how many CPUs are available).
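>
> For example (a sketch only - whether --oversubscribe is accepted will depend
> on your OpenMPI version):
>
> mpirun -np 4 --oversubscribe --hostfile wude_1.machines runmpifxcorr.DiFX-2.6.2 wude_1.input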
>
> If you can figure out which mpirun option is causing the problem, you can
> then modify startdifx so that the offending option is always removed for
> you.
>
> Cheers,
> Adam
>
> On Tue, 4 Jul 2023 at 17:30, 深空探测 <wude7826580 at gmail.com> wrote:
>
>> Subject: Issue with DiFX Testing - RPC Errors and CPU Allocation
>>
>> Hi Adam,
>>
>> I apologize for the delay in getting back to you. I've been conducting
>> tests with DiFX lately, and I encountered a few issues that I would
>> appreciate your insight on.
>>
>> Initially, I faced problems running the `mpirun` command, but I managed
>> to resolve them by reinstalling DiFX on a new CentOS7 system. Previously, I
>> had installed `openmpi-1.6.5` in the `/usr/local` directory, but this time,
>> I used the command `sudo yum install openmpi-devel` to install `openmpi`,
>> and then I installed DiFX in the `/home/wude/difx/DIFXROOT` directory.
>> Following this setup, the `mpirun` command started working correctly. I
>> suspect that the previous installation in the system directory might have
>> been causing the issues with `mpirun`.
>>
>> However, I encountered a new problem when running the command `calcif2
>> wude_1.calc`. The output displayed the following error:
>>
>>
>> ----------------------------------------------------------------------------------------
>> calcif2 processing file 1/1 = wude_1
>> localhost: RPC: Program not registered
>> Error: calcif2: RPC clnt_create fails for host: localhost
>> Error: Cannot initialize CalcParams
>>
>> ----------------------------------------------------------------------------------------
>>
>> Previously, I resolved a similar error by running the command: `export
>> DIFX_CALC_PROGRAM=difxcalc`. However, when I tried the same solution this
>> time, it didn't resolve the issue.
>>
>> Additionally, when running the command: `mpirun -np 4 --hostfile
>> wude_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq
>> runmpifxcorr.DiFX-2.6.2 wude_1.input`, the output displayed the following
>> message:
>>
>>
>> ---------------------------------------------------------------------------------------------------------------
>> While computing bindings, we found no available CPUs on the following
>> node:
>>     Node: wude
>> Please check your allocation.
>>
>> ---------------------------------------------------------------------------------------------------------------
>>
>> My hostname is "wude", and it seems like there are no available CPUs, but
>> I can't determine the cause of this issue. Hence, I am reaching out to seek
>> your guidance on this matter.
>>
>> Thank you for your time and support.
>>
>> Best regards,
>>
>> De Wu
>>
>> On Mon, 26 Jun 2023 at 07:36, Adam Deller <adeller at astro.swin.edu.au> wrote:
>>
>>> Have you tried removing the --mca options from the command? E.g.,
>>>
>>> mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
>>> runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input
>>>
>>> I have a suspicion that either the seq or rmaps option is not playing
>>> nice, but it is easiest to just remove all the options and see if that
>>> makes any difference.
>>>
>>> Cheers,
>>> Adam
>>>
>>> On Mon, 26 Jun 2023 at 01:58, 深空探测 <wude7826580 at gmail.com> wrote:
>>>
>>>> Hi Adam,
>>>>
>>>> As you suggested, I removed the "| head" from the command, and I was
>>>> able to run it successfully.
>>>>
>>>> However, when executing the command "mpirun -np 4 --hostfile
>>>> /vlbi/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq
>>>> runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input", the output displayed
>>>> the following message:
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> Additionally, when running the command "mpirun -np 4 -H
>>>> localhost,localhost,localhost,localhost --mca mpi_yield_when_idle 1 --mca
>>>> rmaps seq runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input", I received
>>>> the following error message:
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> There are no nodes allocated to this job.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> It is quite puzzling that even when specifying only localhost (the local
>>>> machine) in the command, I still receive this output. I have been
>>>> considering the possibility that this issue might be due to limitations
>>>> in system resources, node access permissions, or node configuration
>>>> within the CentOS7 virtual machine environment.
>>>>
>>>> Thank you for your attention to this matter.
>>>>
>>>> Best regards,
>>>>
>>>> De Wu
>>>>
>>>> On Thu, 22 Jun 2023 at 15:53, Adam Deller <adeller at astro.swin.edu.au> wrote:
>>>>
>>>>> Hi De Wu,
>>>>>
>>>>> The "SIGPIPE detected on fd 13 - aborting" errors when running
>>>>> mpispeed are related to piping the output to head.  Remove the "| head" and
>>>>> you should see it run normally.
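>>>>>
>>>>> i.e. (dropping only the pipe):
>>>>>
>>>>> mpirun -H localhost,localhost mpispeed 1000 10s 1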
>>>>>
>>>>> For running mpifxcorr, the obvious difference between your invocation
>>>>> of mpispeed and mpifxcorr is the use of the various mca options.  What
>>>>> happens if you add " --mca mpi_yield_when_idle 1 --mca rmaps seq" to your
>>>>> mpispeed launch (before or after the -H localhost,localhost)?  If it
>>>>> doesn't work, then probably one or the other of those options is the
>>>>> problem, and you need to change startdifx to get rid of the offending
>>>>> option when running mpirun.
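>>>>>
>>>>> For example, a sketch of that test (reusing your earlier mpispeed
>>>>> arguments):
>>>>>
>>>>> mpirun -H localhost,localhost --mca mpi_yield_when_idle 1 --mca rmaps seq mpispeed 1000 10s 1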
>>>>>
>>>>> If mpispeed still works with those options, try the following:
>>>>> 1. manually run "mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
>>>>> --mca mpi_yield_when_idle 1 --mca rmaps seq runmpifxcorr.DiFX-2.6.2
>>>>> /vlbi/aov070/aov070_1.input" and see what output comes out;
>>>>> 2. manually run "mpirun -np 4 -H localhost,localhost,localhost,localhost
>>>>> --mca mpi_yield_when_idle 1 --mca rmaps seq runmpifxcorr.DiFX-2.6.2
>>>>> /vlbi/aov070/aov070_1.input" and see what output comes out.
>>>>>
>>>>> Cheers,
>>>>> Adam
>>>>>
>>>>> On Mon, 19 Jun 2023 at 18:02, 深空探测 via Difx-users <
>>>>> difx-users at listmgr.nrao.edu> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I recently reinstalled OpenMPI-1.6.5 and successfully ran the example
>>>>>> program provided within the OpenMPI package. By executing the command
>>>>>> "mpiexec -n 6 ./hello_c," I obtained the following output:
>>>>>>
>>>>>> ```
>>>>>> wude at wude DiFX-2.6.2 examples> mpiexec -n 6 ./hello_c
>>>>>> Hello, world, I am 4 of 6
>>>>>> Hello, world, I am 2 of 6
>>>>>> Hello, world, I am 0 of 6
>>>>>> Hello, world, I am 1 of 6
>>>>>> Hello, world, I am 3 of 6
>>>>>> Hello, world, I am 5 of 6
>>>>>> ```
>>>>>>
>>>>>> The program executed without any issues, displaying the expected
>>>>>> output. Each line represents a separate process, showing the process number
>>>>>> and the total number of processes involved.
>>>>>>
>>>>>> However, I encountered some difficulties when running the command
>>>>>> "mpirun -H localhost,localhost mpispeed 1000 10s 1 | head." Although both
>>>>>> nodes seem to run properly, there appear to be some errors in the output.
>>>>>> Below is the output I received, with "wude" being my username:
>>>>>>
>>>>>> ```
>>>>>> wude at wude DiFX-2.6.2 ~> mpirun -H localhost,localhost mpispeed 1000
>>>>>> 10s 1 | head
>>>>>> Processor = wude
>>>>>> Rank = 0/2
>>>>>> [0] Starting
>>>>>> Processor = wude
>>>>>> Rank = 1/2
>>>>>> [1] Starting
>>>>>> [1] Recvd 0 -> 0 : 2740.66 Mbps curr : 2740.66 Mbps mean
>>>>>> [1] Recvd 1 -> 0 : 60830.52 Mbps curr : 5245.02 Mbps mean
>>>>>> [1] Recvd 2 -> 0 : 69260.57 Mbps curr : 7580.50 Mbps mean
>>>>>> [1] Recvd 3 -> 0 : 68545.44 Mbps curr : 9747.65 Mbps mean
>>>>>> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
>>>>>> mpirun: killing job...
>>>>>>
>>>>>> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
>>>>>> mpirun: killing job...
>>>>>> ```
>>>>>>
>>>>>> I'm unsure whether you experience the same "mpirun: SIGPIPE detected
>>>>>> on fd 13 - aborting mpirun: killing job..." message when running this
>>>>>> command on your computer.
>>>>>>
>>>>>> Furthermore, when I ran the command "startdifx -v -f -n
>>>>>> aov070.joblist," the .difx file was not generated. Could you please provide
>>>>>> some guidance or suggestions to help me troubleshoot this issue?
>>>>>>
>>>>>> Here is the output I received when running the command:
>>>>>>
>>>>>> ```
>>>>>> wude at wude DiFX-2.6.2 aov070> startdifx -v -f -n aov070.joblist
>>>>>> No errors with input file /vlbi/aov070/aov070_1.input
>>>>>>
>>>>>> Executing:  mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
>>>>>> --mca mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
>>>>>> /vlbi/aov070/aov070_1.input
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Elapsed time (s) = 82.2610619068
>>>>>> ```
>>>>>> Best regards,
>>>>>>
>>>>>> De Wu
>>>>>>
>>>>>> On Thu, 25 May 2023 at 08:42, Adam Deller <adeller at astro.swin.edu.au> wrote:
>>>>>>
>>>>>>> Hi De Wu,
>>>>>>>
>>>>>>> If I run
>>>>>>>
>>>>>>> mpirun -H localhost,localhost mpispeed 1000 10s 1
>>>>>>>
>>>>>>> it runs correctly as follows:
>>>>>>>
>>>>>>> adeller at ar313-adeller trunk Downloads> mpirun -H
>>>>>>> localhost,localhost mpispeed 1000 10s 1 | head
>>>>>>> Processor = <my host name>
>>>>>>> Rank = 0/2
>>>>>>> [0] Starting
>>>>>>> Processor = <my host name>
>>>>>>> Rank = 1/2
>>>>>>> [1] Starting
>>>>>>>
>>>>>>> It seems like in your case, MPI is looking at the two identical host
>>>>>>> names you've given and is deciding to only start one process, rather than
>>>>>>> two. What if you run
>>>>>>>
>>>>>>> mpirun -n 2 -H wude,wude mpispeed 1000 10s 1
>>>>>>>
>>>>>>> ?
>>>>>>>
>>>>>>> I think the issue is with your MPI installation / the parameters
>>>>>>> being passed to mpirun. Unfortunately, as I've mentioned previously,
>>>>>>> the behaviour of MPI with default parameters seems to change from
>>>>>>> implementation to implementation and version to version - you just
>>>>>>> need to track down what is needed to make sure it actually runs the
>>>>>>> number of processes you want on the nodes you want!
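>>>>>>>
>>>>>>> For illustration only (the hostfile name and slot count here are
>>>>>>> assumptions, not taken from your setup): with OpenMPI you can make the
>>>>>>> slot count explicit and double-check which mpirun you are picking up:
>>>>>>>
>>>>>>> which mpirun && mpirun --version    # confirm which MPI is actually in use
>>>>>>> echo "wude slots=4" > myhosts       # hostfile granting 4 slots on host wude
>>>>>>> mpirun -np 2 --hostfile myhosts mpispeed 1000 10s 1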
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Adam
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 24 May 2023 at 18:30, 深空探测 via Difx-users <
>>>>>>> difx-users at listmgr.nrao.edu> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am writing to seek assistance regarding an issue I encountered
>>>>>>>> while working with MPI on a CentOS 7 virtual machine.
>>>>>>>>
>>>>>>>> I have successfully installed openmpi-1.6.5 on the CentOS 7 virtual
>>>>>>>> machine. However, when I attempted to execute the command "startdifx -f -n
>>>>>>>> -v aov070.joblist," I received the following error message:
>>>>>>>>
>>>>>>>> "Environment variable DIFX_CALC_PROGRAM was set, so
>>>>>>>> Using specified calc program: difxcalc
>>>>>>>>
>>>>>>>> No errors with input file /vlbi/corr/aov070/aov070_1.input
>>>>>>>>
>>>>>>>> Executing: mpirun -np 4 --hostfile
>>>>>>>> /vlbi/corr/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps
>>>>>>>> seq runmpifxcorr.DiFX-2.6.2 /vlbi/corr/aov070/aov070_1.input
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>> process that caused that situation.
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------"
>>>>>>>>
>>>>>>>> To further investigate the MPI functionality, I wrote a Python
>>>>>>>> program “mpi_hello_world.py” as follows:
>>>>>>>>
>>>>>>>> from mpi4py import MPI
>>>>>>>>
>>>>>>>> comm = MPI.COMM_WORLD
>>>>>>>> rank = comm.Get_rank()
>>>>>>>> size = comm.Get_size()
>>>>>>>>
>>>>>>>> print("Hello from rank", rank, "of", size)
>>>>>>>>
>>>>>>>> When I executed the command "mpiexec -n 4 python
>>>>>>>> mpi_hello_world.py," the output was as follows:
>>>>>>>>
>>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>>>
>>>>>>>> Additionally, I attempted to test the MPI functionality using the
>>>>>>>> "mpispeed" command with the following execution command: "mpirun -H
>>>>>>>> wude,wude mpispeed 1000 10s 1".  “wude” is my hostname. However, I
>>>>>>>> encountered the following error message:
>>>>>>>>
>>>>>>>> "Processor = wude
>>>>>>>> Rank = 0/1
>>>>>>>> Sorry, must run with an even number of processes
>>>>>>>> This program should be invoked in a manner similar to:
>>>>>>>> mpirun -H host1,host2,...,hostN mpispeed [<numSends>|<timeSend>s]
>>>>>>>> [<sendSizeMByte>]
>>>>>>>>     where
>>>>>>>>         numSends: number of blocks to send (e.g., 256), or
>>>>>>>>         timeSend: duration in seconds to send (e.g., 100s)
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>> process that caused that situation.
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------"
>>>>>>>>
>>>>>>>> I am uncertain about the source of these issues and would greatly
>>>>>>>> appreciate your guidance in resolving them. If you have any insights or
>>>>>>>> suggestions regarding the aforementioned errors and how I can rectify them,
>>>>>>>> please let me know.
>>>>>>>>
>>>>>>>> Thank you for your time and assistance.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> De Wu
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
!=============================================================!
Prof. Adam Deller
Centre for Astrophysics & Supercomputing
Swinburne University of Technology
John St, Hawthorn VIC 3122 Australia
phone: +61 3 9214 5307
fax: +61 3 9214 8797
!=============================================================!

