[evla-sw-discuss] CBE-NODE-20 back online
Martin Pokorny
mpokorny at nrao.edu
Tue Feb 28 12:27:57 EST 2017
On 02/28/2017 10:23 AM, Karyn Roberts wrote:
> Bob Huber will be onsite - if it indeed does require service, please
> have Bob bring this server back to the AOC for service as well. Can
> you confirm?
I suspect it's premature to do that. We should try to diagnose the issue
first; see if we can make any progress without moving the machine. It's
been a lingering issue for some time, and I want to make sure that it's
not forgotten.
>
> Regards,
>
> Karyn Roberts
>
>
> On 02/28/2017 10:22 AM, Martin Pokorny wrote:
>> Since we're on the subject of bad CBE nodes, we've kept node 29 out of
>> use for some time, as well. The problems with it are different than
>> node 20, as I recall, but I don't remember the details. It's an
>> obvious failure, I believe, in that the normal CBE software clearly
>> fails to run when that machine is used. Paul would know for certain,
>> but the problem doesn't seem to affect the yuppi software. We should
>> remind ourselves what's wrong with that node, and fix it so it's
>> possible to bring it fully back into service.
>>
>> On 02/27/2017 03:38 PM, Paul Demorest wrote:
>>> It actually died within minutes once we started running processing on
>>> it. In case it's helpful for debugging, this mode had 128 MB/s coming
>>> in via two of the 1gig correlator interfaces (p2p1/p2p2), and 0.5 MB/s
>>> being written out to lustre.
>>>
>>> -Paul
>>>
>>> On 2017-02-27 15:20, Karyn Roberts wrote:
>>>> Will do - this is as we expected, though I must admit I'm surprised it
>>>> only lasted a day. I'll send the team out tomorrow to retrieve the
>>>> system for repair. Thanks for letting me know!
>>>>
>>>> Regards,
>>>> Karyn
>>>>
>>>>
>>>> On 02/27/2017 03:16 PM, Vivek Dhawan wrote:
>>>>> Node 20 was up, Paul tried some pulsar processing on it, and it
>>>>> promptly died. It now seems to have rebooted itself. But we cannot
>>>>> depend on it during actual observing, so I vote for a real fix.
>>>>>
>>>>> We will avoid using it until then. Priority is moderately high -
>>>>> a few days without it is OK.
>>>>>
>>>>> On Mon, February 27, 2017 11:20, Karyn Roberts wrote:
>>>>> | The server CBE-NODE-20 is back online - after speaking with K.Scott
>>>>> | we're not sure for how long - it has crashed twice before and we
>>>>> | expect it to crash again under similar circumstances. Upon the next
>>>>> | crash we will remove the server to the AOC for service.
>>>>> |
>>>>> | Regards,
>>>>> |
>>>>> | Karyn
>>>>> |
>>>>> | _______________________________________________
>>>>> | evla-sw-discuss mailing list
>>>>> | evla-sw-discuss at listmgr.nrao.edu
>>>>> | https://listmgr.nrao.edu/mailman/listinfo/evla-sw-discuss
>>>>> |
>>>>>
>>>
>>
>>
--
Martin