[Difx-users] startdifx 2.5.4/2.6.3 gives benign segfault if using ssh -Y and not setting LC_CTYPE

Adam Deller adeller at astro.swin.edu.au
Thu Mar 10 06:27:12 EST 2022


That is really weird.  I can't think of why a character set variable
override should need to be set to avoid a random MPI segfault! So I
don't have anything useful to add, but thanks for the heads up in case
anyone else runs into the same thing...

Cheers,
Adam

On Thu, 10 Mar 2022 at 00:23, Eskil Varenius via Difx-users <
difx-users at listmgr.nrao.edu> wrote:

> Dear DiFX users,
> I wanted to share an intriguing segfault that has kept me puzzled for
> some time, in case someone else runs into the same thing or knows the
> reason. Strictly speaking it's (very likely) not a DiFX issue, but
> somehow related to the way I run DiFX.
>
> Problem: I try to correlate some r1-data using DiFX 2.5.4 or 2.6.3 (same
> behaviour with both; I did not test older versions). I connect from my
> laptop (OS X 12.0.1) to my server (Linux Mint 19.3) using "ssh -Y
> user@server" and then run "startdifx -n -f -v r11026_01.input".
> Everything runs fine, except that the last lines on screen are:
>
> [...]
> start frame = 0
> end second = 61220
> end frame = 5048
> first frame offset = 0 bytes
> [gyller:30568] *** Process received signal ***
> [gyller:30568] Signal: Segmentation fault (11)
> [gyller:30568] Signal code: Address not mapped (1)
> [gyller:30568] Failing at address: 0xceec27309
> [gyller:30568] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fd2b0f23040]
> [gyller:30568] [ 1] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc201_hwloc_bitmap_free+0x9)[0x7fd2b136e2c9]
> [gyller:30568] [ 2] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f70b)[0x7fd2b136470b]
> [gyller:30568] [ 3] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc_base_free_topology+0x79)[0x7fd2b13671b9]
> [gyller:30568] [ 4] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f5a0)[0x7fd2b13645a0]
> [gyller:30568] [ 5] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(mca_base_framework_close+0x67)[0x7fd2b1339567]
> [gyller:30568] [ 6] /usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_finalize+0x83)[0x7fd2b130c113]
> [gyller:30568] [ 7] mpirun(+0xfbd)[0x561bed3d0fbd]
> [gyller:30568] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd2b0f05bf7]
> [gyller:30568] [ 9] mpirun(+0xd8a)[0x561bed3d0d8a]
> [gyller:30568] *** End of error message ***
> Segmentation fault (core dumped)
> Elapsed time (s) = 12.6550719738
>
> The segfault made me nervous. Investigating the environment settings,
> Simon Casey and I found that the environment variable "LC_CTYPE" was not
> set. Setting it with export LC_CTYPE="UTF-8" before running "startdifx"
> makes the problem go away.
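
A minimal shell sketch of the workaround described above, assuming a POSIX
shell on the server. The helper name ensure_lc_ctype is hypothetical (it is
not part of DiFX), and the value "UTF-8" is simply the one reported to work
in the message; whether it names a locale installed on a given system is
not checked here.

```shell
#!/bin/sh
# Hypothetical helper, not part of DiFX: export LC_CTYPE only when it is
# unset or empty, so an existing setting is never overwritten.
ensure_lc_ctype() {
    if [ -z "${LC_CTYPE:-}" ]; then
        # "UTF-8" is the value quoted in the original message.
        LC_CTYPE="UTF-8"
        export LC_CTYPE
    fi
}

ensure_lc_ctype
echo "LC_CTYPE=${LC_CTYPE}"

# Afterwards, run the correlation as in the original message:
#   startdifx -n -f -v r11026_01.input
```

Because an existing LC_CTYPE value is left untouched, the helper only
changes sessions where the variable is missing.
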
>
> Another way to make the problem go away is to use "ssh" or "ssh -X"
> instead of "ssh -Y" to connect to my server. With this, there are no
> segfaults, even without setting "LC_CTYPE". However, I need the "-Y"
> flag to get X-forwarding working for my current OS X setup. Technically,
> of course, I don't need X-forwarding to run DiFX (which makes it more
> puzzling that it has an impact), but I do need it later for e.g. fourfit
> and similar tools. Either way, the problem is easy to work around.
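
One way to carry out the kind of environment comparison mentioned above is
to dump the environment in each login session and diff the results. This is
only a sketch under stated assumptions: the function names dump_env and
compare_env and the file names in the usage comments are hypothetical, and
it does not explain why the segfault occurs.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the environments of two sessions.

# Write the current environment, sorted by variable name, to the given file.
dump_env() {
    env | sort > "$1"
}

# Print the lines that differ between two environment dumps.
# diff exits non-zero when the files differ, which is expected here.
compare_env() {
    diff "$1" "$2" || true
}

# Usage (run the first two commands in separate login sessions):
#   dump_env "$HOME/env_ssh_X.txt"    # after connecting with "ssh -X"
#   dump_env "$HOME/env_ssh_Y.txt"    # after connecting with "ssh -Y"
#   compare_env "$HOME/env_ssh_X.txt" "$HOME/env_ssh_Y.txt"
```

Variables such as DISPLAY or SSH_CONNECTION will normally differ between
sessions anyway, so the interesting lines are the ones that are unexpected,
for instance a locale variable present in one dump and absent in the other.
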
>
> Not sure what to make of this, but the error (if using ssh -Y and not
> setting LC_CTYPE) appears benign as far as the geodetic results go.
> Maybe this can save someone from doing the same investigation, if
> someone is nervous about the segfault :).
>
> Kind regards
> Eskil and Simon in Onsala
>


-- 
!=============================================================!
Prof. Adam Deller
Centre for Astrophysics & Supercomputing
Swinburne University of Technology
John St, Hawthorn VIC 3122 Australia
phone: +61 3 9214 5307
fax: +61 3 9214 8797
!=============================================================!