[Difx-users] startdifx 2.5.4/2.6.3 gives benign segfault if using ssh -Y and not setting LC_CTYPE

Greg Lindahl glindahl at cfa.harvard.edu
Thu Mar 10 10:09:23 EST 2022


Since this failure is inside "mpirun" itself (the backtrace runs through
opal_finalize, so it happens as mpirun shuts down), it shouldn't be a bug
in difx -- "opal" is a part of OpenMPI. I'd recommend updating your OpenMPI
version; perhaps the bug is already fixed.

The dependence on the environment variables is something that I've seen
before -- the exact size of the text of the environment variables shifts
the addresses of the stack and other data in memory, so a latent memory
bug can crash with one environment and go unnoticed with another.
"ssh -X" and "ssh -Y" have different environment variables.
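
One way to see what actually differs between two logins is to save the
output of "env" from each session and compare the two. A minimal Python
sketch, assuming one NAME=value per line (multi-line values are ignored);
the function names are made up for illustration and are not part of DiFX
or OpenMPI:

```python
def parse_env(dump):
    """Parse the text output of `env` into a {name: value} dict."""
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that do not look like NAME=value
            env[name] = value
    return env


def env_block_size(env):
    """Approximate size in bytes of the environment strings.

    Each entry is counted as NAME=value plus a terminating NUL byte.
    """
    return sum(len(f"{k}={v}".encode()) + 1 for k, v in env.items())


def compare_envs(dump_a, dump_b):
    """Report which variables differ between two `env` dumps."""
    a, b = parse_env(dump_a), parse_env(dump_b)
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
        "size_delta": env_block_size(b) - env_block_size(a),
    }
```

For example, run "env > env_X.txt" in an "ssh -X" session and
"env > env_Y.txt" in an "ssh -Y" session, then pass the contents of the two
files to compare_envs(). The size_delta value is only a rough indication of
how much the environment text grew or shrank between the two logins.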

On Wed, Mar 9, 2022 at 5:23 AM Eskil Varenius via Difx-users <
difx-users at listmgr.nrao.edu> wrote:

> Dear DiFX users,
> I wanted to share an intriguing segfault error which has kept me puzzled
> for some time, in case someone else runs into the same thing or knows
> the reason. Strictly speaking it's (very likely) not a difx issue, but
> somehow related to the way I run difx.
>
> Problem: I try to correlate some r1-data using difx 2.5.4 or 2.6.3 (same
> behaviour with both; I did not test older versions). I connect from my
> laptop (OS X 12.0.1) to my server (Linux Mint 19.3) using "ssh -Y
> user at server" and then run "startdifx -n -f -v r11026_01.input".
> Everything runs fine, except that the last rows on screen are
>
> [...]
> start frame = 0
> end second = 61220
> end frame = 5048
> first frame offset = 0 bytes
> [gyller:30568] *** Process received signal ***
> [gyller:30568] Signal: Segmentation fault (11)
> [gyller:30568] Signal code: Address not mapped (1)
> [gyller:30568] Failing at address: 0xceec27309
> [gyller:30568] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fd2b0f23040]
> [gyller:30568] [ 1] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc201_hwloc_bitmap_free+0x9)[0x7fd2b136e2c9]
> [gyller:30568] [ 2] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f70b)[0x7fd2b136470b]
> [gyller:30568] [ 3] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc_base_free_topology+0x79)[0x7fd2b13671b9]
> [gyller:30568] [ 4] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f5a0)[0x7fd2b13645a0]
> [gyller:30568] [ 5] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(mca_base_framework_close+0x67)[0x7fd2b1339567]
> [gyller:30568] [ 6] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_finalize+0x83)[0x7fd2b130c113]
> [gyller:30568] [ 7] mpirun(+0xfbd)[0x561bed3d0fbd]
> [gyller:30568] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd2b0f05bf7]
> [gyller:30568] [ 9] mpirun(+0xd8a)[0x561bed3d0d8a]
> [gyller:30568] *** End of error message ***
> Segmentation fault (core dumped)
> Elapsed time (s) = 12.6550719738
>
> The segfault made me nervous. Investigating environment settings, Simon
> Casey and I found that the variable "LC_CTYPE" was not set to anything.
> Setting it with export LC_CTYPE="UTF-8" before running "startdifx" makes
> the problem go away.
>
> Another way to make the problem go away is to use "ssh" or "ssh -X"
> instead of "ssh -Y" to connect to my server. With this, there are no
> segfault errors, even without setting "LC_CTYPE". However, I need
> the "-Y" flag to get X-forwarding working for my current OS X setup.
> Technically, of course, I don't need that for running DiFX (which makes
> it more puzzling that it has an impact), but I do need it later for e.g.
> fourfit and similar. Either way, it's easy to work around this problem.
>
> Not sure what to make of this, but the error (if using ssh -Y and not
> setting LC_CTYPE) appears benign as far as the geodetic results go.
> Maybe this can save someone from doing the same investigation, if
> someone is nervous about the segfault :).
>
> Kind regards
> Eskil and Simon in Onsala
>
> _______________________________________________
> Difx-users mailing list
> Difx-users at listmgr.nrao.edu
> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>


-- 
Greg Lindahl
Software Architect, Event Horizon Telescope
Smithsonian Astrophysical Observatory
60 Garden Street | MS 66 | Cambridge, MA 02138