[Difx-users] Error in running the startdifx command with DiFX software

深空探测 wude7826580 at gmail.com
Thu Jul 6 22:51:51 EDT 2023


Hi Adam,

I wanted to give you an update on my earlier confusion about the "--mca"
option. I now have a clear understanding of it, and I can confirm that the
"startdifx" command executes successfully without the "--mca" options. I
apologize for not fully grasping the implications of those options, which
caused the recurring issues with mpirun.
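
For anyone who runs into the same thing, a quick way to see where startdifx
adds the "--mca" options (assuming startdifx is a plain-text script on the
PATH, as it is in my installation) is something like:

```
# Show every line of the startdifx script that mentions mpirun or mca;
# the exact variable names differ between DiFX versions, so this is only
# a way of finding the spot to edit, not a fix in itself.
grep -n -e mpirun -e mca "$(which startdifx)"
```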

For testing, I used the rdv70 dataset and followed the instructions in the
README file. Specifically, I ran the command
"diffDiFX.py reference_1.difx/DIFX_54656_074996.s0000.b0000
example_1.difx/DIFX_54656_074996.s0000.b0000 -i example_1.input" to compare
my own computed results (the example data) against the reference data. The
last two lines of the output were:

"At the end, 1320 records disagreed on the header.
After 1848 records, the mean percentage absolute difference is 67.67053685,
and the mean difference is 2.27732242 + 5.89398558 i."

Although the processing itself completed without any problems, there appear
to be substantial differences in the comparison results, and I am not sure
which step might have caused this.

Furthermore, after generating the 1234 experiment directory with the
"difx2mark4 -e 1234 example_1.difx" command, I ran "fourfit -pt -c ../1234
191-2050" inside the 1234 directory and encountered an error. The error
messages were as follows:

"fourfit: Invalid $block statement '$STATION A B BR-VLBA AXEL 2.0000 90.0
......
fourfit: Failure in locate_blocks()
fourfit: Low-level parse of
'/home/wude/difx/test_data/rdv70/1234/191-2050//4C39_25.2SN1CT' failed
fourfit: The above errors occurred while processing
fourfit: 191-2050//4C39_25.2SN1CT
fourfit: the top-level resolution is as follows: Error reading root for
file 191-2050/, skipping."

However, when I tested the tc016a.pulsar dataset and ran the command
"fourfit -pt -c ../1234 No0040", I successfully obtained a fringe plot.
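
For the failing rdv70 case, a simple first check is to look at the root file
that fourfit names in the error above and see how its $STATION block is laid
out (the path below is copied verbatim from the error message):

```
# Print the first part of the root file that fourfit failed to parse; the
# double slash is just how the path appeared in the error output.
head -n 40 '/home/wude/difx/test_data/rdv70/1234/191-2050//4C39_25.2SN1CT'
```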

Thank you for your time and support.

Best regards,

De Wu

On Thu, 6 Jul 2023 at 12:19, Adam Deller <adeller at astro.swin.edu.au> wrote:

> Sorry, I just saw that you had done this (and reported in your second
> email):
>
> *Subsequently, I proceeded to run the command "mpirun -np 8 -machinefile
> wude_1.machines mpifxcorr wude_1.input," and I was able to obtain the
> ".difx" files successfully.*
>
> So if you edit the startdifx file and find where mpirun is being invoked,
> and remove those --mca options, you should be fine.
>
> Cheers,
> Adam
>
> On Thu, 6 Jul 2023 at 14:16, Adam Deller <adeller at astro.swin.edu.au>
> wrote:
>
>> Hi Wu,
>>
>> calcif2 is the delay-generating program that requires the calcserver to
>> be running (which wasn't the case for you). Setting
>> DIFX_CALC_PROGRAM=difxcalc determines which program will be called by
>> startdifx. But you were trying to run calcif2 itself from the command
>> line, so naturally this won't work. If you run difxcalc wude_1.calc, it
>> should work. And as you saw, if you run startdifx after setting
>> DIFX_CALC_PROGRAM=difxcalc, that also works fine.
>>
>> Once you have run difxcalc (or calcif2), the .im file will be generated.
>> If you try to run difxcalc/calcif2 again after that, it won't do anything
>> unless you force it, since it sees that the .im file already exists and
>> doesn't need to be regenerated.
>>
>> So your remaining problem now is that MPI seems to think that you don't
>> have any available CPUs on your host.  Once again (I think this is the
>> third time I'm making this suggestion): please try running the mpirun
>> command *without* the --mca options.  I.e.,
>>
>> mpirun -np 4 --hostfile wude_1.machines runmpifxcorr.DiFX-2.6.2
>> wude_1.input
>>
>> You may also have success by adding --oversubscribe to the mpirun command
>> (although that is more of a band-aid for the fact that openmpi doesn't
>> seem to be seeing how many CPUs are available).
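
Combining those two suggestions gives a command along these lines (noting
that --oversubscribe may not be accepted by every OpenMPI version):

```
# No --mca options, with --oversubscribe added in case OpenMPI undercounts
# the CPUs available on the host.
mpirun -np 4 --oversubscribe --hostfile wude_1.machines runmpifxcorr.DiFX-2.6.2 wude_1.input
```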
>>
>> If you can figure out which mpirun option is causing the problem, you
>> will then be able to modify startdifx so that the offending option is
>> always removed.
>>
>> Cheers,
>> Adam
>>
>> On Tue, 4 Jul 2023 at 17:30, 深空探测 <wude7826580 at gmail.com> wrote:
>>
>>> Subject: Issue with DiFX Testing - RPC Errors and CPU Allocation
>>>
>>> Hi Adam,
>>>
>>> I apologize for the delay in getting back to you. I've been conducting
>>> tests with DiFX lately, and I encountered a few issues that I would
>>> appreciate your insight on.
>>>
>>> Initially, I faced problems running the `mpirun` command, but I managed
>>> to resolve them by reinstalling DiFX on a new CentOS 7 system. Previously, I
>>> had installed `openmpi-1.6.5` in the `/usr/local` directory, but this time,
>>> I used the command `sudo yum install openmpi-devel` to install `openmpi`,
>>> and then I installed DiFX in the `/home/wude/difx/DIFXROOT` directory.
>>> Following this setup, the `mpirun` command started working correctly. I
>>> suspect that the previous installation in the system directory might have
>>> been causing the issues with `mpirun`.
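
One simple check that the intended OpenMPI installation is the one actually
being picked up (both are standard commands, so this is just a sanity check):

```
# Confirm which mpirun is first on the PATH and which OpenMPI version it
# belongs to, to rule out a leftover /usr/local installation being used.
which mpirun
mpirun --version
```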
>>>
>>> However, I encountered a new problem when running the command `calcif2
>>> wude_1.calc`. The output displayed the following error:
>>>
>>>
>>> ----------------------------------------------------------------------------------------
>>> calcif2 processing file 1/1 = wude_1
>>> localhost: RPC: Program not registered
>>> Error: calcif2: RPC clnt_create fails for host: localhost
>>> Error: Cannot initialize CalcParams
>>>
>>> ----------------------------------------------------------------------------------------
>>>
>>> Previously, I resolved a similar error by running the command: `export
>>> DIFX_CALC_PROGRAM=difxcalc`. However, when I tried the same solution this
>>> time, it didn't resolve the issue.
>>>
>>> Additionally, when running the command: `mpirun -np 4 --hostfile
>>> wude_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq
>>> runmpifxcorr.DiFX-2.6.2 wude_1.input`, the output displayed the following
>>> message:
>>>
>>>
>>> ---------------------------------------------------------------------------------------------------------------
>>> While computing bindings, we found no available CPUs on the following
>>> node:
>>>     Node: wude
>>> Please check your allocation.
>>>
>>> ---------------------------------------------------------------------------------------------------------------
>>>
>>> My hostname is "wude", and it seems like there are no available CPUs,
>>> but I can't determine the cause of this issue. Hence, I am reaching out to
>>> seek your guidance on this matter.
>>>
>>> Thank you for your time and support.
>>>
>>> Best regards,
>>>
>>> De Wu
>>>
>>> On Mon, 26 Jun 2023 at 07:36, Adam Deller <adeller at astro.swin.edu.au> wrote:
>>>
>>>> Have you tried removing the --mca options from the command? E.g.,
>>>>
>>>> mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
>>>> runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input
>>>>
>>>> I have a suspicion that either the seq or rmaps option is not playing
>>>> nice, but it is easiest to just remove all the options and see if that
>>>> makes any difference.
>>>>
>>>> Cheers,
>>>> Adam
>>>>
>>>> On Mon, 26 Jun 2023 at 01:58, 深空探测 <wude7826580 at gmail.com> wrote:
>>>>
>>>>> Hi Adam,
>>>>>
>>>>> As you suggested, I removed the "| head" from the command, and I was
>>>>> able to run it successfully.
>>>>>
>>>>> However, when I executed the command "mpirun -np 4
>>>>> --hostfile /vlbi/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca
>>>>> rmaps seq runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input", the output
>>>>> displayed the following message:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Additionally, when I ran the command "mpirun -np 4 -H
>>>>> localhost,localhost,localhost,localhost --mca mpi_yield_when_idle 1 --mca
>>>>> rmaps seq runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input", it
>>>>> resulted in the following error message:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> There are no nodes allocated to this job.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> It is quite puzzling that even when specifying only the local host in
>>>>> the command, I still receive this output. I have been considering the
>>>>> possibility that this issue might be due to limitations in system
>>>>> resources, node access permissions, or node configuration within the
>>>>> CentOS 7 virtual machine environment.
>>>>>
>>>>> Thank you for your attention to this matter.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> De Wu
>>>>>
>>>>> On Thu, 22 Jun 2023 at 15:53, Adam Deller <adeller at astro.swin.edu.au> wrote:
>>>>>
>>>>>> Hi De Wu,
>>>>>>
>>>>>> The "SIGPIPE detected on fd 13 - aborting" errors when running
>>>>>> mpispeed are related to piping the output to head.  Remove the "| head" and
>>>>>> you should see it run normally.
>>>>>>
>>>>>> For running mpifxcorr, the obvious difference between your invocation
>>>>>> of mpispeed and mpifxcorr is the use of the various mca options.  What
>>>>>> happens if you add " --mca mpi_yield_when_idle 1 --mca rmaps seq" to your
>>>>>> mpispeed launch (before or after the -H localhost,localhost)?  If it
>>>>>> doesn't work, then probably one or the other of those options is the
>>>>>> problem, and you need to change startdifx to get rid of the offending
>>>>>> option when running mpirun.
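
Concretely, that test would look something like the following (assuming two
processes on the local host, as in the earlier mpispeed runs):

```
# mpispeed launched with the same --mca options that startdifx adds, to see
# whether one of them is what breaks the mpifxcorr launch.
mpirun -np 2 -H localhost,localhost --mca mpi_yield_when_idle 1 --mca rmaps seq mpispeed 1000 10s 1
```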
>>>>>>
>>>>>> If running mpispeed still works with those options, what about
>>>>>> the following:
>>>>>> 1. manually run mpirun -np 4 --hostfile
>>>>>> /vlbi/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps seq
>>>>>>  runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input, see what output comes
>>>>>> out
>>>>>> 2. manually run mpirun -np 4 -H
>>>>>> localhost,localhost,localhost,localhost --mca mpi_yield_when_idle 1 --mca
>>>>>> rmaps seq  runmpifxcorr.DiFX-2.6.2 /vlbi/aov070/aov070_1.input, see what
>>>>>> output comes out
>>>>>>
>>>>>> Cheers,
>>>>>> Adam
>>>>>>
>>>>>> On Mon, 19 Jun 2023 at 18:02, 深空探测 via Difx-users <
>>>>>> difx-users at listmgr.nrao.edu> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I recently reinstalled OpenMPI-1.6.5 and successfully ran the
>>>>>>> example program provided within the OpenMPI package. By executing the
>>>>>>> command "mpiexec -n 6 ./hello_c," I obtained the following output:
>>>>>>>
>>>>>>> ```
>>>>>>> wude at wude DiFX-2.6.2 examples> mpiexec -n 6 ./hello_c
>>>>>>> Hello, world, I am 4 of 6
>>>>>>> Hello, world, I am 2 of 6
>>>>>>> Hello, world, I am 0 of 6
>>>>>>> Hello, world, I am 1 of 6
>>>>>>> Hello, world, I am 3 of 6
>>>>>>> Hello, world, I am 5 of 6
>>>>>>> ```
>>>>>>>
>>>>>>> The program executed without any issues, displaying the expected
>>>>>>> output. Each line represents a separate process, showing the process number
>>>>>>> and the total number of processes involved.
>>>>>>>
>>>>>>> However, I encountered some difficulties when running the command
>>>>>>> "mpirun -H localhost,localhost mpispeed 1000 10s 1 | head." Although both
>>>>>>> nodes seem to run properly, there appear to be some errors in the output.
>>>>>>> Below is the output I received, with "wude" being my username:
>>>>>>>
>>>>>>> ```
>>>>>>> wude at wude DiFX-2.6.2 ~> mpirun -H localhost,localhost mpispeed 1000
>>>>>>> 10s 1 | head
>>>>>>> Processor = wude
>>>>>>> Rank = 0/2
>>>>>>> [0] Starting
>>>>>>> Processor = wude
>>>>>>> Rank = 1/2
>>>>>>> [1] Starting
>>>>>>> [1] Recvd 0 -> 0 : 2740.66 Mbps curr : 2740.66 Mbps mean
>>>>>>> [1] Recvd 1 -> 0 : 60830.52 Mbps curr : 5245.02 Mbps mean
>>>>>>> [1] Recvd 2 -> 0 : 69260.57 Mbps curr : 7580.50 Mbps mean
>>>>>>> [1] Recvd 3 -> 0 : 68545.44 Mbps curr : 9747.65 Mbps mean
>>>>>>> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
>>>>>>> mpirun: killing job...
>>>>>>>
>>>>>>> [wude:05649] mpirun: SIGPIPE detected on fd 13 - aborting
>>>>>>> mpirun: killing job...
>>>>>>> ```
>>>>>>>
>>>>>>> I'm unsure whether you experience the same "mpirun: SIGPIPE detected
>>>>>>> on fd 13 - aborting mpirun: killing job..." message when running this
>>>>>>> command on your computer.
>>>>>>>
>>>>>>> Furthermore, when I ran the command "startdifx -v -f -n
>>>>>>> aov070.joblist," the .difx file was not generated. Could you please provide
>>>>>>> some guidance or suggestions to help me troubleshoot this issue?
>>>>>>>
>>>>>>> Here is the output I received when running the command:
>>>>>>>
>>>>>>> ```
>>>>>>> wude at wude DiFX-2.6.2 aov070> startdifx -v -f -n aov070.joblist
>>>>>>> No errors with input file /vlbi/aov070/aov070_1.input
>>>>>>>
>>>>>>> Executing:  mpirun -np 4 --hostfile /vlbi/aov070/aov070_1.machines
>>>>>>> --mca mpi_yield_when_idle 1 --mca rmaps seq  runmpifxcorr.DiFX-2.6.2
>>>>>>> /vlbi/aov070/aov070_1.input
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>> process
>>>>>>> that caused that situation.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Elapsed time (s) = 82.2610619068
>>>>>>> ```
>>>>>>> Best regards,
>>>>>>>
>>>>>>> De Wu
>>>>>>>
>>>>>>> On Thu, 25 May 2023 at 08:42, Adam Deller <adeller at astro.swin.edu.au> wrote:
>>>>>>>
>>>>>>>> Hi De Wu,
>>>>>>>>
>>>>>>>> If I run
>>>>>>>>
>>>>>>>> mpirun -H localhost,localhost mpispeed 1000 10s 1
>>>>>>>>
>>>>>>>> it runs correctly as follows:
>>>>>>>>
>>>>>>>> adeller at ar313-adeller trunk Downloads> mpirun -H
>>>>>>>> localhost,localhost mpispeed 1000 10s 1 | head
>>>>>>>> Processor = <my host name>
>>>>>>>> Rank = 0/2
>>>>>>>> [0] Starting
>>>>>>>> Processor =<my host name>
>>>>>>>> Rank = 1/2
>>>>>>>> [1] Starting
>>>>>>>>
>>>>>>>> It seems like in your case, MPI is looking at the two identical
>>>>>>>> host names you've given and is deciding to only start one process, rather
>>>>>>>> than two. What if you run
>>>>>>>>
>>>>>>>> mpirun -n 2 -H wude,wude mpispeed 1000 10s 1
>>>>>>>>
>>>>>>>> ?
>>>>>>>>
>>>>>>>> I think the issue is with your MPI installation / the parameters
>>>>>>>> being passed to mpirun. Unfortunately as I've mentioned previously the
>>>>>>>> behaviour of MPI with default parameters seems to change from
>>>>>>>> implementation to implementation and version to version - you just need to
>>>>>>>> track down what is needed to make sure it actually runs the number of
>>>>>>>> processes you want on the nodes you want!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Adam
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 24 May 2023 at 18:30, 深空探测 via Difx-users <
>>>>>>>> difx-users at listmgr.nrao.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I am writing to seek assistance regarding an issue I encountered
>>>>>>>>> while working with MPI on a CentOS 7 virtual machine.
>>>>>>>>>
>>>>>>>>> I have successfully installed openmpi-1.6.5 on the CentOS 7
>>>>>>>>> virtual machine. However, when I attempted to execute the command
>>>>>>>>> "startdifx -f -n -v aov070.joblist," I received the following error message:
>>>>>>>>>
>>>>>>>>> "Environment variable DIFX_CALC_PROGRAM was set, so
>>>>>>>>> Using specified calc program: difxcalc
>>>>>>>>>
>>>>>>>>> No errors with input file /vlbi/corr/aov070/aov070_1.input
>>>>>>>>>
>>>>>>>>> Executing: mpirun -np 4 --hostfile
>>>>>>>>> /vlbi/corr/aov070/aov070_1.machines --mca mpi_yield_when_idle 1 --mca rmaps
>>>>>>>>> seq runmpifxcorr.DiFX-2.6.2 /vlbi/corr/aov070/aov070_1.input
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>> process that caused that situation.
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------"
>>>>>>>>>
>>>>>>>>> To further investigate the MPI functionality, I wrote a Python
>>>>>>>>> program “mpi_hello_world.py” as follows:
>>>>>>>>>
>>>>>>>>> from mpi4py import MPI
>>>>>>>>>
>>>>>>>>> comm = MPI.COMM_WORLD
>>>>>>>>> rank = comm.Get_rank()
>>>>>>>>> size = comm.Get_size()
>>>>>>>>>
>>>>>>>>> print("Hello from rank", rank, "of", size)
>>>>>>>>>
>>>>>>>>> When I executed the command "mpiexec -n 4 python
>>>>>>>>> mpi_hello_world.py," the output was as follows:
>>>>>>>>>
>>>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>>>> ('Hello from rank', 0, 'of', 1)
>>>>>>>>> ('Hello from rank', 0, 'of', 1)
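
The fact that every process reports rank 0 of size 1 may indicate (though I
am not certain) that mpi4py was built against a different MPI installation
than the mpiexec used to launch it; one way to check is:

```
# Print the MPI configuration mpi4py was built with, and the mpiexec that is
# actually on the PATH, to see whether they refer to the same installation.
python -c "import mpi4py; print(mpi4py.get_config())"
which mpiexec
mpiexec --version
```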
>>>>>>>>>
>>>>>>>>> Additionally, I attempted to test the MPI functionality using the
>>>>>>>>> "mpispeed" command with the following execution command: "mpirun -H
>>>>>>>>> wude,wude mpispeed 1000 10s 1".  “wude” is my hostname. However, I
>>>>>>>>> encountered the following error message:
>>>>>>>>>
>>>>>>>>> "Processor = wude
>>>>>>>>> Rank = 0/1
>>>>>>>>> Sorry, must run with an even number of processes
>>>>>>>>> This program should be invoked in a manner similar to:
>>>>>>>>> mpirun -H host1,host2,...,hostN mpispeed [<numSends>|<timeSend>s]
>>>>>>>>> [<sendSizeMByte>]
>>>>>>>>>     where
>>>>>>>>>         numSends: number of blocks to send (e.g., 256), or
>>>>>>>>>         timeSend: duration in seconds to send (e.g., 100s)
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>> process that caused that situation.
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------"
>>>>>>>>>
>>>>>>>>> I am uncertain about the source of these issues and would greatly
>>>>>>>>> appreciate your guidance in resolving them. If you have any insights or
>>>>>>>>> suggestions regarding the aforementioned errors and how I can rectify them,
>>>>>>>>> please let me know.
>>>>>>>>>
>>>>>>>>> Thank you for your time and assistance.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> De Wu
>>>>>>>>> _______________________________________________
>>>>>>>>> Difx-users mailing list
>>>>>>>>> Difx-users at listmgr.nrao.edu
>>>>>>>>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> !=============================================================!
>>>>>>>> Prof. Adam Deller
>>>>>>>> Centre for Astrophysics & Supercomputing
>>>>>>>> Swinburne University of Technology
>>>>>>>> John St, Hawthorn VIC 3122 Australia
>>>>>>>> phone: +61 3 9214 5307
>>>>>>>> fax: +61 3 9214 8797
>>>>>>>> !=============================================================!
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Difx-users mailing list
>>>>>>> Difx-users at listmgr.nrao.edu
>>>>>>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> !=============================================================!
>>>>>> Prof. Adam Deller
>>>>>> Centre for Astrophysics & Supercomputing
>>>>>> Swinburne University of Technology
>>>>>> John St, Hawthorn VIC 3122 Australia
>>>>>> phone: +61 3 9214 5307
>>>>>> fax: +61 3 9214 8797
>>>>>> !=============================================================!
>>>>>>
>>>>>
>>>>
>>>> --
>>>> !=============================================================!
>>>> Prof. Adam Deller
>>>> Centre for Astrophysics & Supercomputing
>>>> Swinburne University of Technology
>>>> John St, Hawthorn VIC 3122 Australia
>>>> phone: +61 3 9214 5307
>>>> fax: +61 3 9214 8797
>>>> !=============================================================!
>>>>
>>>
>>
>> --
>> !=============================================================!
>> Prof. Adam Deller
>> Centre for Astrophysics & Supercomputing
>> Swinburne University of Technology
>> John St, Hawthorn VIC 3122 Australia
>> phone: +61 3 9214 5307
>> fax: +61 3 9214 8797
>> !=============================================================!
>>
>
>
> --
> !=============================================================!
> Prof. Adam Deller
> Centre for Astrophysics & Supercomputing
> Swinburne University of Technology
> John St, Hawthorn VIC 3122 Australia
> phone: +61 3 9214 5307
> fax: +61 3 9214 8797
> !=============================================================!
>