[Difx-users] startdifx 2.5.4/2.6.3 gives benign segfault if using ssh -Y and not setting LC_CTYPE

Eskil Varenius eskil.varenius at chalmers.se
Wed Mar 9 08:22:08 EST 2022


Dear DiFX users,
I wanted to share an intriguing segfault which has kept me puzzled 
for some time, in case someone else runs into the same thing or 
knows the reason. Strictly speaking it is (very likely) not a 
DiFX issue, but it is somehow related to the way I run DiFX.

Problem: I try to correlate some r1-data using DiFX 2.5.4 or 2.6.3 (same 
behaviour with both; I did not test older versions). I connect from my 
laptop (OS X 12.0.1) to my server (Linux Mint 19.3) using "ssh -Y 
user@server" and then run "startdifx -n -f -v r11026_01.input". 
Everything runs fine, except that the last lines on screen are

[...]
start frame = 0
end second = 61220
end frame = 5048
first frame offset = 0 bytes
[gyller:30568] *** Process received signal ***
[gyller:30568] Signal: Segmentation fault (11)
[gyller:30568] Signal code: Address not mapped (1)
[gyller:30568] Failing at address: 0xceec27309
[gyller:30568] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fd2b0f23040]
[gyller:30568] [ 1] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc201_hwloc_bitmap_free+0x9)[0x7fd2b136e2c9]
[gyller:30568] [ 2] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f70b)[0x7fd2b136470b]
[gyller:30568] [ 3] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc_base_free_topology+0x79)[0x7fd2b13671b9]
[gyller:30568] [ 4] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f5a0)[0x7fd2b13645a0]
[gyller:30568] [ 5] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(mca_base_framework_close+0x67)[0x7fd2b1339567]
[gyller:30568] [ 6] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_finalize+0x83)[0x7fd2b130c113]
[gyller:30568] [ 7] mpirun(+0xfbd)[0x561bed3d0fbd]
[gyller:30568] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd2b0f05bf7]
[gyller:30568] [ 9] mpirun(+0xd8a)[0x561bed3d0d8a]
[gyller:30568] *** End of error message ***
Segmentation fault (core dumped)
Elapsed time (s) = 12.6550719738

The segfault made me nervous. Investigating the environment settings, Simon 
Casey and I found that the variable "LC_CTYPE" was not set to anything. 
Setting it with export LC_CTYPE="UTF-8" before running "startdifx" makes 
the problem go away.
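
A minimal sketch of this workaround, assuming a bash-like shell on the server. The helper name ensure_lc_ctype is made up for illustration; the value "UTF-8" is the one used in this post and may need adjusting for other systems:

```shell
# Hypothetical helper (not part of DiFX): set LC_CTYPE only when it is
# unset or empty, so an existing locale setting is left untouched.
ensure_lc_ctype() {
    if [ -z "${LC_CTYPE:-}" ]; then
        export LC_CTYPE="UTF-8"   # value reported to work in this post
    fi
}

ensure_lc_ctype
echo "LC_CTYPE=${LC_CTYPE}"

# Then run the correlation as usual, e.g.:
# startdifx -n -f -v r11026_01.input
```

Putting the function and the call in a shell startup file would apply it to every login, but that is a matter of taste; a plain export before startdifx does the same job.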

Another way to make the problem go away is to use "ssh" or "ssh -X" 
instead of "ssh -Y" to connect to my server. With this, there are no 
segfault errors, even without setting "LC_CTYPE". However, I need 
the "-Y" flag to get X-forwarding working with my current OS X setup. 
Technically, I of course don't need X-forwarding for running DiFX (which 
makes it more puzzling that it has an impact), but I do need it later for 
e.g. fourfit and similar tools. Either way, it's easy to work around this 
problem.
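
For anyone who wants to look for the difference between the two kinds of session, one possible approach is to dump the environment in each and compare the files. This is only a sketch: the helper name snapshot_env and the file names are made up for illustration, and the comparison may show differences other than LC_CTYPE:

```shell
# Hypothetical helper: write the current environment, sorted, to a file.
snapshot_env() {
    env | sort > "$1"
}

# In a session opened with "ssh -X":  snapshot_env "$HOME/env_X.txt"
# In a session opened with "ssh -Y":  snapshot_env "$HOME/env_Y.txt"
# Afterwards, compare the two files:
#   diff "$HOME/env_X.txt" "$HOME/env_Y.txt"
```

Variables that appear in only one of the files (for example the locale ones) are the first candidates to check.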

Not sure what to make of this, but the error (if using ssh -Y and not 
setting LC_CTYPE) appears benign as far as the geodetic results go. 
Maybe this can save someone from doing the same investigation, if 
someone is nervous about the segfault :).

Kind regards
Eskil and Simon in Onsala


