[Difx-users] Multiple Heads in same subnet - one waits for the other to finish ?

Walter Brisken wbrisken at lbo.us
Wed Apr 19 23:59:48 EDT 2017


Hi Stuart,

Sorry if this is a repeat question...  What command line are you supplying 
to start the difx jobs?  If you are using startdifx and supplying both 
jobs at once, it will certainly wait for one to finish before the next 
begins.  You can get around this by running two separate startdifx 
programs in separate shells.  I do this routinely when I need parallel 
correlation.  When doing this there is no specific need for separate head 
nodes either, unless your output data rate is very high.
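
As a rough sketch of the "two separate startdifx programs" approach, the 
same effect can be had from a single shell by backgrounding one launcher 
process per job.  The function name run_parallel and the LAUNCHER variable 
are hypothetical, introduced here only for illustration, and the sketch 
assumes startdifx accepts a job's .input file as its argument:

```shell
#!/bin/sh
# Hypothetical override for illustration; defaults to the real startdifx.
LAUNCHER="${LAUNCHER:-startdifx}"

# Start one launcher process per job so that no job queues behind another.
run_parallel() {
    pids=""
    for job in "$@"; do
        # One process per job, backgrounded, instead of one process given all jobs.
        "$LAUNCHER" "$job" &
        pids="$pids $!"
    done
    rc=0
    for pid in $pids; do
        # Collect each exit status; report failure if any job failed.
        wait "$pid" || rc=1
    done
    return $rc
}
```

With the job names from this thread, that would be called as 
"run_parallel hw03_1.input hw03_2.input".  Opening two terminals and 
running startdifx once in each does the same thing by hand.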

-W

On Thu, 20 Apr 2017, Adam Deller wrote:

> Hi Stuart,
>
> On 20 April 2017 at 12:52, Stuart Weston <stuart.weston at aut.ac.nz> wrote:
>
>> Hi Adam,
>>
>>
>>
>> Job2 - yes it starts and waits for job1 to finish. Yes it writes to difxlog
>>
>>
>>
>
> Can you include the bit of difxlog that is written up until the point at
> which the job pauses?
>
> If you think the default difxlog isn't including enough info and you want
> to get debug info, you can start a separate errormon2 process before
> kicking off the mpifxcorr jobs:
>
> errormon2 --split 6
>
> will give you everything down to the debug level.
>
> Cheers,
> Adam
>
>
>> File-based correlation - yes, mk5 files
>>
>>
>>
>> Do we need a higher level of debug to see why it pauses? Should I use
>> "--bind-to none"?
>>
>>
>>
>> Stuart
>>
>>
>>
>> *From:* adeller at gmail.com [mailto:adeller at gmail.com] *On Behalf Of *Adam
>> Deller
>> *Sent:* Thursday, 20 April 2017 2:46 p.m.
>> *To:* Stuart Weston <stuart.weston at aut.ac.nz>
>> *Cc:* Difx-users at listmgr.nrao.edu
>> *Subject:* Re: [Difx-users] Multiple Heads in same subnet - one waits for
>> the other to finish ?
>>
>>
>>
>> Hi Stuart,
>>
>>
>>
>> So I'm assuming that job2 does actually start and something gets written
>> to the difxlog, then it pauses until job1 finishes, and then it fires up
>> and runs to completion?  If that is the case, can you post the job2 difxlog
>> as it stands during the "pause" phase?  That might give a clue as to what
>> it is waiting for.
>>
>>
>>
>> Also is this file-based correlation?
>>
>>
>>
>> Cheers,
>>
>> Adam
>>
>>
>>
>> On 20 April 2017 at 12:37, Stuart Weston <stuart.weston at aut.ac.nz> wrote:
>>
>> I have two head nodes, each head node has 6 workers.
>>
>>
>>
>> I split the job up into two groups of files, the idea being Head-1 does
>> scans/files 1-6 and Head-2 does scans/files 7-11.
>>
>>
>>
>> I create two separate input files with different file lists etc. Also two
>> separate thread and machine files appropriate to the two different groups
>> of ip addresses.
>>
>>
>>
>> head-1, worker-1-1, worker-1-2 ... worker-1-6
>>
>> head-2, worker-2-1 ... worker-2-6
>>
>>
>>
>> So I set two jobs running in parallel:
>>
>>
>>
>> Head-1 > mpirun -machinefile machines-1 -np 5 mpifxcorr hw03_1.input
>>
>> Head-2 > mpirun -machinefile machines-2 -np 5 mpifxcorr hw03_2.input
>>
>>
>>
>> Now all machines are in the same subnet. I am guessing some communication
>> is going on, as Head-2 seems to wait while Head-1 processes files 1-6; once
>> Head-1 has finished, Head-2 gets busy doing files 7-11.
>>
>>
>>
>> Is there any way to have Head-1 and Head-2 running at the same time, i.e.
>> so that Head-2 doesn't wait for Head-1 to finish?
>>
>> Stuart Weston Bsc (Hons), MPhil (Hons), MInstP
>>
>> Mobile: 021 713062
>>
>> Skype: stuart.d.weston
>> Email:  stuart.weston at aut.ac.nz
>>
>> http://www.atnf.csiro.au/people/Stuart.Weston/index.html
>>
>> Software Engineer
>> Institute for Radio Astronomy & Space Research (IRASR)
>> School of Computing & Mathematical Sciences
>> Faculty of Creative Technologies
>> Auckland University of Technology, New Zealand.
>>
>> http://www.irasr.aut.ac.nz/
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Difx-users mailing list
>> Difx-users at listmgr.nrao.edu
>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>>
>>
>>
>>
>>
>> --
>>
>> !=============================================================!
>> Dr. Adam Deller
>>
>> ARC Future Fellow, Senior Lecturer
>>
>> Centre for Astrophysics & Supercomputing
>>
>> Swinburne University of Technology
>> John St, Hawthorn VIC 3122 Australia
>>
>> phone: +61 3 9214 5307
>>
>> fax: +61 3 9214 8797
>>
>>
>>
>> office days (usually): Mon-Thu
>> !=============================================================!
>>
>
>
>
>

-- 
-------------------------
Walter Brisken
Director
Long Baseline Observatory
(505)-234-5912

