[Difx-users] startdifx 2.5.4/2.6.3 gives benign segfault if using ssh -Y and not setting LC_CTYPE

Eskil Varenius eskil.varenius at chalmers.se
Tue Mar 15 12:56:38 EDT 2022


Greg, all,

Good suggestion. We have now tested with OpenMPI version 4.1.2 and the 
problem appears to be gone. So the verdict seems to be: if someone 
runs into similar issues, try upgrading OpenMPI to the latest version.
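
For anyone who wants to check the installed version before upgrading, here is a minimal Python sketch. It assumes that "mpirun --version" prints a first line of the form "mpirun (Open MPI) 4.1.2", as Open MPI builds normally do. The helper names are made up for illustration and are not part of DiFX, and the 4.1.2 threshold is simply the version reported above as working, not a confirmed fix boundary.

```python
import re
import subprocess

# Version reported in this thread as no longer showing the segfault.
KNOWN_GOOD = (4, 1, 2)


def parse_openmpi_version(text):
    """Return (major, minor, patch) from `mpirun --version` output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_openmpi_version(mpirun="mpirun"):
    """Run `mpirun --version`; return None if mpirun is missing or not Open MPI."""
    try:
        result = subprocess.run(
            [mpirun, "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    return parse_openmpi_version(result.stdout + result.stderr)


if __name__ == "__main__":
    version = installed_openmpi_version()
    if version is None:
        print("Could not determine the Open MPI version")
    elif version < KNOWN_GOOD:
        print("Open MPI %d.%d.%d is older than 4.1.2; consider upgrading" % version)
    else:
        print("Open MPI %d.%d.%d is at least 4.1.2" % version)
```

Note that a passing check only says the version matches the one tested here; it does not prove the underlying bug is absent on a different system.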

Thanks

Eskil

On 2022-03-10 16:09, Greg Lindahl wrote:
> Since this failure is during the startup of "mpirun" it shouldn't be a 
> bug in difx -- "opal" is a part of OpenMPI. I'd recommend updating 
> your OpenMPI version, perhaps the bug is already fixed.
>
> The dependence on the environment variables is something that I've 
> seen before -- the exact size of the text of environment variables 
> moves the rest of the code around in memory. "ssh -X" and "ssh -Y" 
> have different environment variables.
>
> On Wed, Mar 9, 2022 at 5:23 AM Eskil Varenius via Difx-users 
> <difx-users at listmgr.nrao.edu> wrote:
>
>     Dear DiFX users,
>     I wanted to share an intriguing segfault error which has kept me
>     puzzled for some time, just in case someone else runs into the same
>     thing, or maybe knows the reason. Strictly speaking it's (very
>     likely) not a DiFX issue, but somehow related to the way I run DiFX.
>
>     Problem: I try to correlate some r1-data using DiFX 2.5.4 or 2.6.3
>     (same behaviour with both; I did not test older versions). I connect
>     from my laptop (OS X 12.0.1) to my server (Linux Mint 19.3) using
>     "ssh -Y user@server" and then run
>     "startdifx -n -f -v r11026_01.input". Everything runs fine, except
>     that the last rows on screen are
>
>     [...]
>     start frame = 0
>     end second = 61220
>     end frame = 5048
>     first frame offset = 0 bytes
>     [gyller:30568] *** Process received signal ***
>     [gyller:30568] Signal: Segmentation fault (11)
>     [gyller:30568] Signal code: Address not mapped (1)
>     [gyller:30568] Failing at address: 0xceec27309
>     [gyller:30568] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fd2b0f23040]
>     [gyller:30568] [ 1] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc201_hwloc_bitmap_free+0x9)[0x7fd2b136e2c9]
>     [gyller:30568] [ 2] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f70b)[0x7fd2b136470b]
>     [gyller:30568] [ 3] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc_base_free_topology+0x79)[0x7fd2b13671b9]
>     [gyller:30568] [ 4] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f5a0)[0x7fd2b13645a0]
>     [gyller:30568] [ 5] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(mca_base_framework_close+0x67)[0x7fd2b1339567]
>     [gyller:30568] [ 6] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_finalize+0x83)[0x7fd2b130c113]
>     [gyller:30568] [ 7] mpirun(+0xfbd)[0x561bed3d0fbd]
>     [gyller:30568] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd2b0f05bf7]
>     [gyller:30568] [ 9] mpirun(+0xd8a)[0x561bed3d0d8a]
>     [gyller:30568] *** End of error message ***
>     Segmentation fault (core dumped)
>     Elapsed time (s) = 12.6550719738
>
>     The segfault made me nervous. Investigating the environment settings,
>     Simon Casey and I found that the variable "LC_CTYPE" was not set to
>     anything. Setting it with export LC_CTYPE="UTF-8" before running
>     "startdifx" makes the problem go away.
>
>     Another way to make the problem go away is to use "ssh" or "ssh -X"
>     instead of "ssh -Y" to connect to my server. With this, there are no
>     segfault errors, even without setting "LC_CTYPE". However, I need
>     the "-Y" flag to get X forwarding working for my current OS X setup.
>     Technically, I of course don't need that for running DiFX (which
>     makes it more puzzling that it has an impact), but I do for e.g.
>     fourfit and similar tools later. So it's easy to work around this
>     problem.
>
>     Not sure what to make of this, but the error (if using ssh -Y and not
>     setting LC_CTYPE) appears benign as far as the geodetic results go.
>     Maybe this can save someone from doing the same investigation, if
>     someone is nervous about the segfault :).
>
>     Kind regards
>     Eskil and Simon in Onsala
>
>     _______________________________________________
>     Difx-users mailing list
>     Difx-users at listmgr.nrao.edu
>     https://listmgr.nrao.edu/mailman/listinfo/difx-users
>
>
>
> -- 
> Greg Lindahl
> Software Architect, Event Horizon Telescope
> Smithsonian Astrophysical Observatory
> 60 Garden Street | MS 66 | Cambridge, MA 02138
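
The LC_CTYPE workaround from the original report can be sketched in Python as follows. This is only an illustration under stated assumptions: env_with_lc_ctype and run_startdifx are hypothetical helper names, the value "UTF-8" and the startdifx flags are the ones quoted in the report, and nothing here is part of DiFX itself.

```python
import os
import shutil
import subprocess


def env_with_lc_ctype(base_env=None, value="UTF-8"):
    """Return a copy of the environment with LC_CTYPE set if it was unset or empty."""
    env = dict(os.environ if base_env is None else base_env)
    if not env.get("LC_CTYPE"):
        # Same value as `export LC_CTYPE="UTF-8"` in the original report.
        env["LC_CTYPE"] = value
    return env


def run_startdifx(input_file, base_env=None):
    """Run startdifx with the flags from the report, using the adjusted environment."""
    cmd = ["startdifx", "-n", "-f", "-v", input_file]
    return subprocess.run(cmd, env=env_with_lc_ctype(base_env), check=False)


if __name__ == "__main__":
    if shutil.which("startdifx") is None:
        print("startdifx not found on PATH")
    else:
        # File name taken from the report; replace with your own .input file.
        run_startdifx("r11026_01.input")
```

An existing non-empty LC_CTYPE is left alone, so the wrapper only changes behaviour in the situation described in the report (an "ssh -Y" session where the variable is unset). With OpenMPI 4.1.2 the workaround was reported as no longer necessary.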