[Difx-users] startdifx 2.5.4/2.6.3 gives benign segfault if using ssh -Y and not setting LC_CTYPE
Eskil Varenius
eskil.varenius at chalmers.se
Wed Mar 9 08:22:08 EST 2022
Dear DiFX users,
I wanted to share an intriguing segfault which has puzzled me for some
time, in case someone else runs into the same thing or knows the
reason. Strictly speaking it is (very likely) not a DiFX issue, but it
is somehow related to the way I run DiFX.
Problem: I try to correlate some r1-data using DiFX 2.5.4 or 2.6.3 (same
behaviour with both; I did not test older versions). I connect from my
laptop (OS X 12.0.1) to my server (Linux Mint 19.3) using "ssh -Y
user@server" and then run "startdifx -n -f -v r11026_01.input".
Everything runs fine, except that the last lines printed on screen are:
[...]
start frame = 0
end second = 61220
end frame = 5048
first frame offset = 0 bytes
[gyller:30568] *** Process received signal ***
[gyller:30568] Signal: Segmentation fault (11)
[gyller:30568] Signal code: Address not mapped (1)
[gyller:30568] Failing at address: 0xceec27309
[gyller:30568] [ 0]
/lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fd2b0f23040]
[gyller:30568] [ 1]
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc201_hwloc_bitmap_free+0x9)[0x7fd2b136e2c9]
[gyller:30568] [ 2]
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f70b)[0x7fd2b136470b]
[gyller:30568] [ 3]
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc_base_free_topology+0x79)[0x7fd2b13671b9]
[gyller:30568] [ 4]
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f5a0)[0x7fd2b13645a0]
[gyller:30568] [ 5]
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(mca_base_framework_close+0x67)[0x7fd2b1339567]
[gyller:30568] [ 6]
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_finalize+0x83)[0x7fd2b130c113]
[gyller:30568] [ 7] mpirun(+0xfbd)[0x561bed3d0fbd]
[gyller:30568] [ 8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd2b0f05bf7]
[gyller:30568] [ 9] mpirun(+0xd8a)[0x561bed3d0d8a]
[gyller:30568] *** End of error message ***
Segmentation fault (core dumped)
Elapsed time (s) = 12.6550719738
The segfault made me nervous. Investigating the environment settings,
Simon Casey and I found that the variable "LC_CTYPE" was not set to
anything. Setting it with export LC_CTYPE="UTF-8" before running
"startdifx" makes the problem go away.
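
For reference, the workaround can be sketched as below. This is a
minimal sketch assuming a POSIX shell such as bash; it only sets
LC_CTYPE when it is empty or unset, and "UTF-8" is simply the value
that worked in our case (other values were not tested).

```shell
# Sketch: make sure LC_CTYPE is set before launching startdifx.
# Only touch the variable if it is empty or unset.
if [ -z "${LC_CTYPE:-}" ]; then
    export LC_CTYPE="UTF-8"   # value from the report above; others untested
fi
echo "LC_CTYPE=${LC_CTYPE}"

# Then start the correlation as before, e.g.:
# startdifx -n -f -v r11026_01.input
```

Putting the "if" block in the shell startup file on the server would
apply it to every login, but that is a choice, not a requirement.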
Another way to make the problem go away is to use "ssh" or "ssh -X"
instead of "ssh -Y" to connect to my server. With either of these there
is no segfault, even without setting "LC_CTYPE". However, I need the
"-Y" flag to get X-forwarding working with my current OS X setup.
Technically, I of course don't need X-forwarding for running DiFX (which
makes it more puzzling that it has an impact), but I do need it for e.g.
fourfit and similar later. So it's easy to work around this problem.
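
For anyone wanting to repeat the comparison between sessions, one
possible approach is to dump the locale-related variables in each
session and diff the results. This is only a sketch: the helper name
"dump_locale_env" and the output file name are made up for
illustration, and the choice of variables to list is an assumption.

```shell
# Hypothetical helper: save locale-related variables of this session.
# $1 is the output file. Lists LANG, LANGUAGE, LC_* and DISPLAY if set.
dump_locale_env() {
    env | grep -E '^(LANG|LANGUAGE|LC_[A-Z_]+|DISPLAY)=' | sort > "$1" || true
}

# Example call; the file name is arbitrary.
dump_locale_env "${TMPDIR:-/tmp}/locale_env.txt"
```

Running this once after connecting with "ssh -X" and once with
"ssh -Y" (writing to two different files) and then comparing the files
with "diff" shows which of these variables differ between the sessions.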
Not sure what to make of this, but the error (if using ssh -Y and not
setting LC_CTYPE) appears benign as far as the geodetic results go.
Maybe this can save someone from doing the same investigation, if
someone is nervous about the segfault :).
Kind regards
Eskil and Simon in Onsala