[Difx-users] startdifx 2.5.4/2.6.3 gives benign segfault if using ssh -Y and not setting LC_CTYPE
Eskil Varenius
eskil.varenius at chalmers.se
Tue Mar 15 12:56:38 EDT 2022
Greg, all,
Good suggestion. We have now tested with openmpi version 4.1.2 and the
problem appears to be gone. So the verdict seems to be: if someone
runs into similar issues, try upgrading openmpi to the latest version.
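
For anyone wanting to check which version they have before upgrading, here is a minimal sketch in Python. It assumes that "mpirun --version" prints a first line of the form "mpirun (Open MPI) X.Y.Z"; the function names and the KNOWN_GOOD constant are illustrative only and not part of DiFX or Open MPI.

```python
import re
import subprocess

# Version with which the segfault no longer appeared in this thread.
KNOWN_GOOD = (4, 1, 2)


def parse_openmpi_version(text):
    """Extract (major, minor, patch) from `mpirun --version` output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_openmpi_version(mpirun="mpirun"):
    """Run `mpirun --version` and parse the version it reports."""
    result = subprocess.run(
        [mpirun, "--version"], capture_output=True, text=True, check=True
    )
    return parse_openmpi_version(result.stdout)


# Example use (requires mpirun on the PATH):
#   version = installed_openmpi_version()
#   if version is not None and version < KNOWN_GOOD:
#       print("Open MPI", version, "is older than", KNOWN_GOOD)
```

Comparing tuples keeps the ordering numeric, so (4, 1, 10) sorts after (4, 1, 2), which a plain string comparison would get wrong.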
Thanks
Eskil
On 2022-03-10 16:09, Greg Lindahl wrote:
> Since this failure is during the startup of "mpirun" it shouldn't be a
> bug in difx -- "opal" is a part of OpenMPI. I'd recommend updating
> your OpenMPI version, perhaps the bug is already fixed.
>
> The dependence on the environment variables is something that I've
> seen before -- the exact size of the text of environment variables
> moves the rest of the code around in memory. "ssh -X" and "ssh -Y"
> have different environment variables.
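
One way to see how two sessions differ is to save the output of "env" in each (for example once under "ssh -X" and once under "ssh -Y") and compare the two dumps. The following is a rough sketch only; the helper names are made up for illustration, and it assumes one NAME=value pair per line, so values containing newlines would be parsed incorrectly.

```python
def parse_env(dump):
    """Parse `env` output (one NAME=value per line) into a dict."""
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(a, b):
    """Return (names only in a, names only in b, names whose values differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed


def env_size(env):
    """Approximate size in bytes: each NAME=value string plus a trailing NUL."""
    return sum(len(f"{k}={v}".encode()) + 1 for k, v in env.items())


# Example use, with files written by `env > x_session.txt` and
# `env > y_session.txt` (hypothetical file names):
#   x = parse_env(open("x_session.txt").read())
#   y = parse_env(open("y_session.txt").read())
#   print(diff_env(x, y), env_size(x), env_size(y))
```

The size figure is only a rough indicator of how much the environment block differs between sessions; it does not by itself show where anything ends up in memory.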
>
> On Wed, Mar 9, 2022 at 5:23 AM Eskil Varenius via Difx-users
> <difx-users at listmgr.nrao.edu> wrote:
>
> Dear DiFX users,
> I wanted to share an intriguing segfault error which has kept me puzzled
> for some time, just in case someone else runs into the same thing, or
> maybe knows the reason. Strictly speaking it's (very likely) not a
> DiFX issue, but somehow related to the way I run difx.
>
> Problem: I try to correlate some r1-data using difx 2.5.4 or 2.6.3 (same
> behaviour with both; I did not test older versions). I connect from my
> laptop (OS X 12.0.1) to my server (Linux Mint 19.3) using "ssh -Y
> user at server" and then run "startdifx -n -f -v r11026_01.input".
> Everything runs fine, except that the last rows on screen are
>
> [...]
> start frame = 0
> end second = 61220
> end frame = 5048
> first frame offset = 0 bytes
> [gyller:30568] *** Process received signal ***
> [gyller:30568] Signal: Segmentation fault (11)
> [gyller:30568] Signal code: Address not mapped (1)
> [gyller:30568] Failing at address: 0xceec27309
> [gyller:30568] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fd2b0f23040]
> [gyller:30568] [ 1] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc201_hwloc_bitmap_free+0x9)[0x7fd2b136e2c9]
> [gyller:30568] [ 2] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f70b)[0x7fd2b136470b]
> [gyller:30568] [ 3] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc_base_free_topology+0x79)[0x7fd2b13671b9]
> [gyller:30568] [ 4] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f5a0)[0x7fd2b13645a0]
> [gyller:30568] [ 5] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(mca_base_framework_close+0x67)[0x7fd2b1339567]
> [gyller:30568] [ 6] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_finalize+0x83)[0x7fd2b130c113]
> [gyller:30568] [ 7] mpirun(+0xfbd)[0x561bed3d0fbd]
> [gyller:30568] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd2b0f05bf7]
> [gyller:30568] [ 9] mpirun(+0xd8a)[0x561bed3d0d8a]
> [gyller:30568] *** End of error message ***
> Segmentation fault (core dumped)
> Elapsed time (s) = 12.6550719738
>
> The segfault made me nervous. Investigating environment settings, Simon
> Casey and I found that the variable "LC_CTYPE" was not set to anything.
> Setting it with export LC_CTYPE="UTF-8" before running "startdifx" makes
> the problem go away.
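
The workaround above can also be applied per run instead of per shell. The sketch below is a hypothetical wrapper, not part of DiFX: it starts "startdifx" with the flags quoted in this message and sets LC_CTYPE to "UTF-8" only when the variable is missing or empty. The function names are invented for illustration.

```python
import os
import subprocess


def env_with_lc_ctype(base=None, value="UTF-8"):
    """Return a copy of the environment with LC_CTYPE set if missing or empty."""
    env = dict(os.environ if base is None else base)
    if not env.get("LC_CTYPE"):
        env["LC_CTYPE"] = value
    return env


def run_startdifx(input_file, extra_args=("-n", "-f", "-v")):
    """Run startdifx on input_file with LC_CTYPE guaranteed to be set."""
    cmd = ["startdifx", *extra_args, input_file]
    return subprocess.run(cmd, env=env_with_lc_ctype())


# Example use (requires startdifx on the PATH):
#   run_startdifx("r11026_01.input")
```

An existing LC_CTYPE value is left untouched, so the wrapper changes nothing in sessions where the variable is already set.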
>
> Another way to make the problem go away is to use "ssh" or "ssh -X"
> instead of "ssh -Y" to connect to my server. With this, there are no
> segfault errors, even without setting "LC_CTYPE". However, I need
> the "-Y" flag to get X-forwarding working for my current OS X setup.
> Technically, I of course don't need that for running DiFX (which makes
> it more puzzling that it has an impact), but I do for e.g. fourfit and
> similar tools later. So it's easy to work around this problem.
>
> Not sure what to make of this, but the error (if using ssh -Y and not
> setting LC_CTYPE) appears benign as far as the geodetic results go.
> Maybe this can save someone from doing the same investigation, if
> someone is nervous about the segfault :).
>
> Kind regards
> Eskil and Simon in Onsala
>
> --
> Greg Lindahl
> Software Architect, Event Horizon Telescope
> Smithsonian Astrophysical Observatory
> 60 Garden Street | MS 66 | Cambridge, MA 02138