<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Greg, all,</p>
<p>Good suggestion. We have now tested with OpenMPI version 4.1.2
and the problem appears to be gone. So the verdict seems to
be: if someone runs into similar issues, try upgrading OpenMPI to
the latest version.</p>
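<p>A minimal sketch of how one might check whether an installation
is at least the version that worked here. It assumes the output of
"mpirun --version" contains a dotted version number, e.g. "mpirun
(Open MPI) 4.1.2"; the function names are hypothetical and not part
of DiFX or OpenMPI.</p>
<pre>
```python
import re


def parse_openmpi_version(version_text: str) -> tuple:
    """Extract (major, minor, patch) from text such as 'mpirun (Open MPI) 4.1.2'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no version number found in: " + repr(version_text))
    return tuple(int(part) for part in match.groups())


def is_at_least(version_text: str, minimum: tuple = (4, 1, 2)) -> bool:
    """True if the reported version is not older than `minimum`.

    The default, 4.1.2, is the version reported in this thread as no
    longer showing the segfault; it is not an official fix version.
    """
    return parse_openmpi_version(version_text) >= minimum
```
</pre>
<p>To use it, pass in the text printed by "mpirun --version" on the
correlator host, for example captured with subprocess.run.</p>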
<p>Thanks</p>
<p>Eskil</p>
<div class="moz-cite-prefix">On 2022-03-10 16:09, Greg Lindahl
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJT0TuNoZ9OozMf0XyGtXpWxfH1bDYx3kY9LDPqkmh8FKkAoaQ@mail.gmail.com">
<div dir="ltr">Since this failure is during the startup of
"mpirun" it shouldn't be a bug in DiFX -- "opal" is a part of
OpenMPI. I'd recommend updating your OpenMPI version; perhaps
the bug is already fixed.
<div><br>
</div>
<div>The dependence on the environment variables is something
that I've seen before -- the exact size of the text of
environment variables moves the rest of the code around in
memory. "ssh -X" and "ssh -Y" have different environment
variables.</div>
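<div>One way to look into this is to compare the environments of
the two sessions. A minimal sketch, assuming each session's
environment has been captured as a dictionary (for example from
os.environ); the function names are hypothetical, and the byte count
is only an approximation of the size of the environment block:</div>
<pre>
```python
import os


def environment_size(env=None) -> int:
    """Approximate size in bytes of the environment block.

    Each variable is counted as 'KEY=VALUE' plus one NUL terminator.
    """
    if env is None:
        env = os.environ
    return sum(
        len(f"{key}={value}".encode("utf-8", "surrogateescape")) + 1
        for key, value in env.items()
    )


def environment_diff(env_a, env_b) -> dict:
    """Map each variable that differs to (value_in_a, value_in_b).

    None marks a variable that is missing from that environment.
    """
    keys = set(env_a) | set(env_b)
    return {
        key: (env_a.get(key), env_b.get(key))
        for key in sorted(keys)
        if env_a.get(key) != env_b.get(key)
    }
```
</pre>
<div>Running this once in an "ssh -X" session and once in an "ssh
-Y" session would show which variables differ and by how many bytes
the total changes.</div>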
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Mar 9, 2022 at 5:23 AM
Eskil Varenius via Difx-users &lt;<a
href="mailto:difx-users@listmgr.nrao.edu"
moz-do-not-send="true" class="moz-txt-link-freetext">difx-users@listmgr.nrao.edu</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear
DiFX users,<br>
I wanted to share an intriguing segfault which has kept me puzzled <br>
for some time, in case someone else runs into the same thing or <br>
knows the reason. Strictly speaking it is (very likely) not a <br>
DiFX issue, but it is somehow related to the way I run DiFX.<br>
<br>
Problem: I am trying to correlate some R1 data using DiFX 2.5.4 or
2.6.3 (same <br>
behaviour with both; I did not test older versions). I connect
from my <br>
laptop (OS X 12.0.1) to my server (Linux Mint 19.3) using "ssh
-Y <br>
user@server" and then run "startdifx -n -f -v
r11026_01.input". <br>
Everything runs fine, except that the last lines on screen are<br>
<br>
[...]<br>
start frame = 0<br>
end second = 61220<br>
end frame = 5048<br>
first frame offset = 0 bytes<br>
[gyller:30568] *** Process received signal ***<br>
[gyller:30568] Signal: Segmentation fault (11)<br>
[gyller:30568] Signal code: Address not mapped (1)<br>
[gyller:30568] Failing at address: 0xceec27309<br>
[gyller:30568] [ 0] <br>
/lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7fd2b0f23040]<br>
[gyller:30568] [ 1] <br>
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc201_hwloc_bitmap_free+0x9)[0x7fd2b136e2c9]<br>
[gyller:30568] [ 2] <br>
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f70b)[0x7fd2b136470b]<br>
[gyller:30568] [ 3] <br>
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_hwloc_base_free_topology+0x79)[0x7fd2b13671b9]<br>
[gyller:30568] [ 4] <br>
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(+0x8f5a0)[0x7fd2b13645a0]<br>
[gyller:30568] [ 5] <br>
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(mca_base_framework_close+0x67)[0x7fd2b1339567]<br>
[gyller:30568] [ 6] <br>
/usr/local/openmpi_4.1.1_gcc/lib/libopen-pal.so.40(opal_finalize+0x83)[0x7fd2b130c113]<br>
[gyller:30568] [ 7] mpirun(+0xfbd)[0x561bed3d0fbd]<br>
[gyller:30568] [ 8] <br>
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd2b0f05bf7]<br>
[gyller:30568] [ 9] mpirun(+0xd8a)[0x561bed3d0d8a]<br>
[gyller:30568] *** End of error message ***<br>
Segmentation fault (core dumped)<br>
Elapsed time (s) = 12.6550719738<br>
<br>
The segfault made me nervous. Investigating the environment
settings, Simon <br>
Casey and I found that the environment variable "LC_CTYPE" was not set to
anything. <br>
Setting it with export LC_CTYPE="UTF-8" before running
"startdifx" makes <br>
the problem go away.<br>
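<br>
A minimal sketch of the same workaround as a small Python wrapper, <br>
which sets "LC_CTYPE" only when it is unset or empty and then runs <br>
the given command. The function names are hypothetical and not part <br>
of DiFX; "UTF-8" is simply the value that worked in the test above.<br>
<pre>
```python
import os
import subprocess


def env_with_lc_ctype(env=None, default="UTF-8") -> dict:
    """Return a copy of the environment with LC_CTYPE set if it was unset or empty."""
    new_env = dict(os.environ if env is None else env)
    if not new_env.get("LC_CTYPE"):
        new_env["LC_CTYPE"] = default
    return new_env


def run_with_lc_ctype(command) -> int:
    """Run `command` (a list of arguments) with LC_CTYPE guaranteed to be set.

    Returns the exit code of the command.
    """
    return subprocess.run(command, env=env_with_lc_ctype()).returncode
```
</pre>
For example: run_with_lc_ctype(["startdifx", "-n", "-f", "-v", "r11026_01.input"]).<br>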
<br>
Another way to make the problem go away is to use "ssh" or
"ssh -X" <br>
instead of "ssh -Y" to connect to my server. With this, there
are no <br>
segfaults, even without setting "LC_CTYPE".
However, I need <br>
the "-Y" flag to get X forwarding working with my current OS X
setup. <br>
Technically, I of course don't need X forwarding for running DiFX
(which makes <br>
it more puzzling that it has an impact), only for e.g. fourfit
and <br>
similar tools later on. So it's easy to work around this problem.<br>
<br>
Not sure what to make of this, but the error (if using ssh -Y
and not <br>
setting LC_CTYPE) appears benign as far as the geodetic
results go. <br>
Maybe this can save someone from doing the same investigation,
if <br>
someone is nervous about the segfault :).<br>
<br>
Kind regards<br>
Eskil and Simon in Onsala<br>
<br>
_______________________________________________<br>
Difx-users mailing list<br>
<a href="mailto:Difx-users@listmgr.nrao.edu" target="_blank"
moz-do-not-send="true" class="moz-txt-link-freetext">Difx-users@listmgr.nrao.edu</a><br>
<a href="https://listmgr.nrao.edu/mailman/listinfo/difx-users"
rel="noreferrer" target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">https://listmgr.nrao.edu/mailman/listinfo/difx-users</a><br>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr" class="gmail_signature">
<div dir="ltr">Greg Lindahl<br>
Software Architect, Event Horizon Telescope<br>
Smithsonian Astrophysical Observatory<br>
60 Garden Street | MS 66 | Cambridge, MA 02138<br>
</div>
</div>
</blockquote>
</body>
</html>