[daip] "parallel" AIPS issue

Eric Greisen egreisen at nrao.edu
Wed Mar 31 11:57:41 EDT 2010


Joseph Lazio wrote:
> Hi, 
> 
> I'm trying to use AIPS in a simple (stupid?) parallelization mode,
> which is leading to paralyzation.
> 
> AIPS 31DEC08
> 
> Background:
> 
> I have a bunch of FITS u-v data files.  I'd like to read them into
> AIPS (FITLD), image them (IMAGR), and do some simple statistics
> (IMSTAT).  I've written a little script to do so, and in a variety of
> tests, it has proven robust.
> 
> 
> Parallelization:
> 
> I have a driver script that sets up some variables, then spawns
> multiple instances of AIPS.  Each instance of AIPS uses a different
> user number.  For example, 
> day 1 -> AIPS id 10001
> day 2 -> AIPS id 10002
> day 3 -> AIPS id 10003
> ...
> 
> If I do this by hand, using command-line editing, everything works,
> e.g., 
> 
> % run-aips.sh  10001 & <CR>
> <up-arrow>
> % run-aips.sh  10002 & <CR>
> <up-arrow>
> % run-aips.sh  10003 & <CR>
> <up-arrow>
> ...
> 
> 
> However, if I use the OS to spawn the runs "simultaneously," all kinds
> of apparently random errors start appearing.  AIPS will declare 'TASK
> ACTIVE' for FITLD, even though for the AIPS id in question, it isn't.
> Occasionally in the output log, I see errors in FTLIN, which are
> reported from GTPARM.
> 
> 
> 
> My conclusion is that, somehow, when I use the OS to spawn the runs
> simultaneously, the various instances of AIPS are attempting to access 
> the TD* accounting file simultaneously and having access issues.  If
> I'm correct, this appears to be a fundamental design issue of AIPS,
> and not something that could be changed simply.
> 
> Agree?  Comments?
> 
> -- Joe
> 

I think that what has happened is that more than one of the jobs is 
assigned the same AIPS number.  When that happens, everything will go 
to hell.  The process of assigning an AIPS number involves looking in 
/tmp for files named e.g. AIPS1.nnnn, where nnnn is a process number.  
Then the system is checked to see if process nnnn is actually running.  
If not, the file is deleted and a new AIPS1.mmmm, where mmmm is the 
current process number, is created.  If the AIPS1 process is running, 
then it checks AIPS2, etc.  When you do it by hand, enough time passes 
that the first one gets AIPS1 in the file name and gets the AIPS1 
process fully started before the 2nd one does its checking.  When they 
are simultaneous, the first one makes its file, but its AIPS1 is not 
yet running when the 2nd does its checking - so in the end you get 2 
processes both named AIPS1, and they compete for resources.  Try having 
the OS do a delay - or I suppose we could add some command-line control 
that says "give me AIPS number N or fail", with 0 the default meaning 
try all numbers starting at 1.  We do something like that with 
tv=local:? already.
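[Editor's note: the check-then-create gap Eric describes can be sketched in a
few lines.  This is a toy model, not actual AIPS code - the file scan, the
pid bookkeeping, and the name claim_aips_number are all invented for
illustration - but the race is the same: the lock file exists, yet its owner
is not yet visible as a running process, so the file looks stale.]

```python
import os
import tempfile

def claim_aips_number(tmpdir, pid, live_pids):
    """Toy model of the assignment scheme: look for a file named
    AIPS<n>.<pid>; if the pid recorded in it is no longer a live
    process, recycle that number, otherwise check AIPS<n+1>."""
    n = 1
    while True:
        existing = [f for f in os.listdir(tmpdir)
                    if f.startswith("AIPS%d." % n)]
        if not existing or int(existing[0].split(".")[1]) not in live_pids:
            for f in existing:                      # looks stale: recycle it
                os.remove(os.path.join(tmpdir, f))
            open(os.path.join(tmpdir, "AIPS%d.%d" % (n, pid)), "w").close()
            return n
        n += 1                                      # number in use, try next

# Staggered (by hand): 111's AIPS1 process is fully up before 222
# checks, so 222 correctly moves on to AIPS2.
d = tempfile.mkdtemp()
live = set()
first = claim_aips_number(d, 111, live)
live.add(111)                 # 111 is now visibly running
second = claim_aips_number(d, 222, live)
print(first, second)          # -> 1 2

# Simultaneous: 111 has made its file, but its process is not yet
# visible when 222 checks, so 222 deletes the "stale" file and both
# end up claiming AIPS1.
d = tempfile.mkdtemp()
first = claim_aips_number(d, 111, set())
second = claim_aips_number(d, 222, set())
print(first, second)          # -> 1 1
```

A delay between spawns, as Eric suggests, closes the gap by making sure
each instance is visibly running before the next one does its check.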

Note that I have added to 31DEC10 the option to start batch jobs on 
other computers (in the local LAN) and to add and remove the AIPS disks 
of another computer from within AIPS.  These are designed to allow the 
simplest sorts of parallelization.  Bill (in Obit) and I (in AIPS) have 
both found that cache limitations come into play pretty fast, meaning 
that you do not win big by running a whole bunch of IMAGRs at once on 
the same machine - even if the RAM can handle all of the files.

Eric Greisen
