[daip] "parallel" AIPS issue
Eric Greisen
egreisen at nrao.edu
Wed Mar 31 11:57:41 EDT 2010
Joseph Lazio wrote:
> Hi,
>
> I'm trying to use AIPS in a simple (stupid?) parallelization mode,
> which is leading to paralyzation.
>
> AIPS 31DEC08
>
> Background:
>
> I have a bunch of FITS u-v data files. I'd like to read them into
> AIPS (FITLD), image them (IMAGR), and do some simple statistics
> (IMSTAT). I've written a little script to do so, and in a variety of
> tests, it has proven robust.
>
>
> Parallelization:
>
> I have a driver script that sets up some variables, then spawns
> multiple instances of AIPS. Each instance of AIPS uses a different
> user number. For example,
> day 1 -> AIPS id 10001
> day 2 -> AIPS id 10002
> day 3 -> AIPS id 10003
> ...
>
> If I do this by hand, using command-line editing, everything works,
> e.g.,
>
> % run-aips.sh 10001 & <CR>
> <up-arrow>
> % run-aips.sh 10002 & <CR>
> <up-arrow>
> % run-aips.sh 10003 & <CR>
> <up-arrow>
> ...
>
>
> However, if I use the OS to spawn the runs "simultaneously," all kinds
> of apparently random errors start appearing. AIPS will declare 'TASK
> ACTIVE' for FITLD, even though for the AIPS id in question, it isn't.
> Occasionally in the output log, I see errors in FTLIN, which are
> reported from GTPARM.
>
>
>
> My conclusion is that, somehow, when I use the OS to spawn the runs
> simultaneously, the various instances of AIPS are attempting to access
> the TD* accounting file simultaneously and having access issues. If
> I'm correct, this appears to be a fundamental design issue of AIPS,
> and not something that could be changed simply.
>
> Agree? Comments?
>
> -- Joe
>
I think that what has happened is that more than one of the jobs is
assigned the same AIPS number. When that happens everything will go to
hell. The process of assigning an AIPS number involves looking in /tmp
for files named e.g. AIPS1.nnnn where nnnn is a process number. Then
the system is checked to see if nnnn is actually running. If not, the
file is deleted and a new AIPS1.mmmm where mmmm is the current process
number is created. If the AIPS1 process is running then it checks AIPS2
etc. When you do it by hand, enough time passes that the first one gets
AIPS1 in the file name and gets the AIPS1 process fully started before
the 2nd one does its checking. When they are simultaneous, the first
one makes its file but its AIPS1 is not yet running when the 2nd does
its checking - so in the end you get 2 processes both named AIPS1 and
they compete for resources. Try having the OS do a delay - or I suppose
we could add some command line control that says "give me aips number N
or fail" with 0 the default meaning try all starting at 1. We do
something like that with tv=local:? already.
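
The "have the OS do a delay" workaround could look roughly like the sketch below. It assumes the run-aips.sh wrapper from the original message takes the AIPS user number as its only argument. The names staggered_launch, LAUNCH_CMD and DELAY are made up for illustration, and the 10-second default is only a guess that would need tuning until each instance has its AIPS number before the next one starts checking.

```shell
#!/bin/sh
# Hypothetical driver: start one AIPS job per user number, pausing between
# launches so each instance can claim its AIPS number before the next looks.
# LAUNCH_CMD and DELAY are illustrative names, not AIPS settings.
LAUNCH_CMD=${LAUNCH_CMD:-./run-aips.sh}
DELAY=${DELAY:-10}    # seconds between launches; an assumed value, tune it

staggered_launch() {
    first=yes
    for id in "$@"; do
        # Pause before every launch except the first one.
        if [ "$first" = yes ]; then
            first=no
        else
            sleep "$DELAY"
        fi
        "$LAUNCH_CMD" "$id" &
    done
    wait    # block until every background job has finished
}

# Example: staggered_launch 10001 10002 10003
```

This only makes the overlap unlikely; it does not remove the underlying race, so a delay that is too short can still produce two processes with the same AIPS number.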

Note that I have added to 31DEC10 the option to start batch jobs on
other computers (in the local LAN) and to add and remove the AIPS disks
of another computer from within AIPS. These are designed to allow the
simplest sorts of parallelization. Bill in Obit and I in AIPS have both
found that cache limitations come into play pretty fast, meaning that you
do not win big by running a whole bunch of IMAGRs at once on the same
machine - even if the RAM can handle all of the files.
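
Given that, a driver script might cap how many jobs run at once on one machine rather than starting them all. The sketch below runs jobs in batches. It assumes the run-aips.sh wrapper from the original message; batched_launch, MAX_JOBS, LAUNCH_CMD and DELAY are hypothetical names, and the defaults are placeholders rather than measured values.

```shell
#!/bin/sh
# Hypothetical driver: run at most MAX_JOBS AIPS jobs at a time, in batches,
# with a pause between launches inside a batch. All names and defaults here
# are illustrative assumptions, not AIPS settings.
LAUNCH_CMD=${LAUNCH_CMD:-./run-aips.sh}
MAX_JOBS=${MAX_JOBS:-2}    # assumed cap; find the useful value by timing runs
DELAY=${DELAY:-10}         # seconds between launches; an assumed value

batched_launch() {
    running=0
    for id in "$@"; do
        if [ "$running" -ge "$MAX_JOBS" ]; then
            wait           # let the whole current batch finish
            running=0
        fi
        # Stagger launches within a batch.
        if [ "$running" -gt 0 ]; then
            sleep "$DELAY"
        fi
        "$LAUNCH_CMD" "$id" &
        running=$((running + 1))
    done
    wait                   # wait for the final batch
}

# Example: batched_launch 10001 10002 10003 10004 10005
```

Batching is the simplest portable form of a cap. A batch waits for its slowest job before the next batch starts, so it can leave the machine partly idle.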
Eric Greisen