[Difx-users] FxManager: Error in launching writethread!!

江悟 jiangwu at shao.ac.cn
Thu May 25 06:56:56 EDT 2017


Hi all,

I think Bill Chen found the problem. Many difxlog threads are opened and stay open while the correlation is running, so when the number of jobs is large it is easy to hit the operating system's limit on the number of threads.
Is there a 'key' or setting that shuts down the difxlog threads promptly once they are no longer needed?

Cheers,
Wu

-----Original Message-----
From: "Chen Bill" <billchen001 at gmail.com>
Sent: Wednesday, May 24, 2017
To: "江悟" <jiangwu at shao.ac.cn>
Cc: "Adam Deller" <adeller at astro.swin.edu.au>, difxusers <difx-users at listmgr.nrao.edu>
Subject: Re: [Difx-users] FxManager: Error in launching writethread!!


Hi All,


I looked into this, and I think the cause is too many difxlog processes. In this case about 800 scans need to be processed, so 800 difxlog processes keep running until all the work is done. A default Linux configuration allows only 1024 processes per user.
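
To confirm the diagnosis on a running system, one could count the live difxlog processes. The following is only a minimal sketch, assuming a Linux-style /proc filesystem; the helper name count_processes and its proc_root parameter are hypothetical and not part of DiFX.

```python
import os


def count_processes(name, proc_root="/proc"):
    """Count processes whose command name equals `name`.

    Assumes a Linux-style /proc layout, where every numeric directory
    is a PID and its `comm` file holds the command name.
    """
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # skip non-PID entries such as "self"
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            # The process exited between listdir() and open(),
            # or its entry is not readable; ignore it.
            continue
    return count
```

Calling count_processes("difxlog") on the head node while the correlation is running would show whether the count grows with the number of scans.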


I just wonder whether it is possible to close the difxlog process when a scan finishes.
As a workaround, we can raise the per-user process limit ("nproc") to a large number, but I think the better fix is to change the code so that fewer difxlog processes are kept running.
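
The limit in question can be inspected from Python before a large run is started. This is a rough sketch under stated assumptions: it uses the standard resource module (Unix only), and the function name nproc_headroom and its arguments are made up for illustration. On Linux, RLIMIT_NPROC counts threads as well as processes for a user, which would be consistent with a thread launch failing once many difxlog processes are alive.

```python
import resource


def nproc_headroom(expected_extra, in_use=0):
    """Return how many tasks would remain under the soft RLIMIT_NPROC
    after adding `expected_extra` tasks to `in_use` existing ones.

    Returns None when the limit is unlimited. A negative result means
    the planned run would exceed the limit.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - in_use - expected_extra
```

For example, nproc_headroom(800) estimates what is left after 800 extra difxlog processes; the in_use argument has to be supplied by the caller, since this sketch does not measure it.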


Jiangwu, please correct me if I am mistaken.


 


Thanks,
Bill Chen
www.simplehpc.com


On Tue, May 23, 2017 at 1:27 PM, 江悟 <jiangwu at shao.ac.cn> wrote:

Hi Chris and Adam,

Attached are the .v2d, .input, and .vex files I used. I was running errormon2 when the error occurred; please check the last line of errormon2.log. Unfortunately, when I used errormon this morning while re-correlating the same scans, no error was reported.

Regards,
Wu

-----Original Message-----
From: "Adam Deller" <adeller at astro.swin.edu.au>
Sent: Tuesday, May 23, 2017
To: "江悟" <jiangwu at shao.ac.cn>
Cc: difxusers <difx-users at listmgr.nrao.edu>
Subject: Re: FxManager: Error in launching writethread!!


I don't recall seeing this before.  Might it be an error with pthreads generally?  Does it still occur if you're running a much smaller correlation?


Cheers,
Adam


On 23 May 2017 at 12:03, 江悟 <jiangwu at shao.ac.cn> wrote:

Hi all,

Recently I ran difx (version 2.4.1) and came across the following error:
FxManager: Error in launching writethread!!

I was using 100 cores with 4 threads each and 1 separate head node, as set in the v2d file. The visbufferlength was set to 80. There were 3 stations. The raw data was on a RAID with a parallel file system, while the visibility output was collected and recorded on the local disk of the head node. The other correlation parameters were:
  tInt = 0.131072
  subintNS = 8192000
  nChan = 512
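
As a quick sanity check on these values, the integration time can be compared with the subintegration length. This sketch assumes tInt is given in seconds and subintNS in nanoseconds (as the parameter name suggests); the variable names are illustrative only.

```python
from decimal import Decimal

# Values quoted above; the units are an assumption (seconds, nanoseconds).
T_INT_S = Decimal("0.131072")
SUBINT_NS = 8_192_000

# Convert with Decimal to avoid binary floating-point rounding.
t_int_ns = int(T_INT_S * 1_000_000_000)

# Number of whole subintegrations per integration, plus any leftover.
subints, remainder_ns = divmod(t_int_ns, SUBINT_NS)
print(subints, remainder_ns)  # prints "16 0"
```

Under that assumption, each integration spans exactly 16 subintegrations, so the timing parameters themselves look consistent.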

I also checked the memory of the head node; peak memory usage was below 7%, so I don't know what is causing this error. Have you seen this error before, and could you please help identify it?

Thanks a lot.

Best regards,
Wu Jiang










--

!=============================================================!
Dr. Adam Deller         
ARC Future Fellow, Senior Lecturer
Centre for Astrophysics & Supercomputing 
Swinburne University of Technology    
John St, Hawthorn VIC 3122 Australia
phone: +61 3 9214 5307
fax: +61 3 9214 8797


office days (usually): Mon-Thu
!=============================================================!





_______________________________________________
Difx-users mailing list
Difx-users at listmgr.nrao.edu
https://listmgr.nrao.edu/mailman/listinfo/difx-users







