[Difx-users] Debugging a Nightmare problem

Richard Dodson richard.dodson at uwa.edu.au
Thu Sep 15 05:41:18 EDT 2016


Dear All

 Just a contribution to this thread listing the issues and how they were
solved. These are probably more for KaVA users than anyone else ..

 i) format=VDIF was not sufficient in the V2D file; format=VDIF/1312/2 was
required.

    Without this the correlation would not even start
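
    For reference, the frame numbers quoted elsewhere in this thread fit
together as sketched below. This is only an illustration: it assumes the 1312
in VDIF/1312/2 is the full frame size including a standard 32-byte VDIF header
(which is consistent with the VDIF_1280-1024-8-2 name used with m5d further
down), and the function name is made up for this note.

```python
VDIF_HEADER_BYTES = 32  # assumed: standard (non-legacy) VDIF header length


def vdif_frame_summary(frame_bytes, mbps, nchan, nbit):
    """Relate a VDIF frame size to the numbers quoted in this thread."""
    payload_bytes = frame_bytes - VDIF_HEADER_BYTES
    samples_per_frame = payload_bytes * 8 // nbit
    return {
        "payload_bytes": payload_bytes,
        "samples_per_frame": samples_per_frame,
        "samples_per_channel": samples_per_frame // nchan,
        "frames_per_second": mbps * 1_000_000 // (payload_bytes * 8),
        # mark5access-style name: VDIF_<payload>-<Mbps>-<nchan>-<nbit>
        "m5_format": f"VDIF_{payload_bytes}-{mbps}-{nchan}-{nbit}",
    }


if __name__ == "__main__":
    for key, value in vdif_frame_summary(1312, 1024, 8, 2).items():
        print(f"{key:20s} {value}")
```

    With frame_bytes=1312 this gives a 1280-byte payload, 5120 samples per
frame (640 per channel for 8 channels) and 100000 frames per second, which
matches the vdifsummary and m5d figures quoted below.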

 ii) There are a number of VDIF format issues in KaVA which we will now
ensure are addressed in KASI. These (only?) come up with modes other than
the default 1024-16-2 (16x16MHz).
     The symptom was that there are fringes only from IF1 and IF8, and
these are at reduced SNR.
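
     As a reminder of what the mode names imply, here is a rough sketch of the
rate-channels-bits arithmetic. It assumes real (not complex) Nyquist-sampled
data, so each channel is sampled at twice its bandwidth; the helper name is
hypothetical.

```python
def bandwidth_per_channel_mhz(mbps, nchan, nbit):
    """Per-channel bandwidth implied by a <Mbps>-<nchan>-<nbit> mode.

    Assumes real sampling at the Nyquist rate (sample rate = 2 * bandwidth).
    """
    sample_rate_msps = mbps / (nchan * nbit)
    return sample_rate_msps / 2


if __name__ == "__main__":
    for mbps, nchan, nbit in [(1024, 16, 2), (1024, 8, 2), (1024, 1, 2)]:
        bw = bandwidth_per_channel_mhz(mbps, nchan, nbit)
        print(f"{mbps}-{nchan}-{nbit}: {nchan} x {bw:g} MHz")
```

     Under that assumption 1024-16-2 is 16x16MHz, 1024-8-2 is 8x32MHz (the
mode used here), and 1024-1-2 is a single 256MHz channel.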

  Thanks to all those who gave their time to help!!

       Richard

On Fri, Aug 26, 2016 at 10:28 AM, Walter Brisken <wbrisken at nrao.edu> wrote:

>
> Agreed.  -W
>
> On Fri, 26 Aug 2016, Chris.Phillips at csiro.au wrote:
>
> > Hi Richard
> >
> >
> > Are you sure "VLBA" stats would be 25%? That sounds very dubious to
> > me.... For 2 bit data you want the 17/33% ratio for optimal SNR. Unless it
> > is just a side effect of how the data is decoded (e.g. the S2 used to report
> > 50%, but that was just how it was counting the bits - if you report 4
> > numbers they should be in the 17/33/33/17 ratio).
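
To put rough numbers on the ratios discussed above, the sketch below computes
the expected fraction of samples in each of the four 2-bit states. It assumes
zero-mean Gaussian noise and a symmetric high/low threshold; the 0.98 sigma
value is the commonly quoted near-optimal threshold and is an assumption here,
not something stated in this thread. The function name is illustrative only.

```python
import math


def two_bit_state_fractions(threshold_sigma):
    """Expected fractions [-high, -low, +low, +high] for Gaussian noise.

    threshold_sigma is the magnitude threshold in units of the RMS voltage.
    """
    # Probability that |x| is below the threshold for a unit Gaussian.
    p_low_total = math.erf(threshold_sigma / math.sqrt(2))
    p_low_each = p_low_total / 2
    p_high_each = (1 - p_low_total) / 2
    return [p_high_each, p_low_each, p_low_each, p_high_each]


if __name__ == "__main__":
    for t in (0.98, 0.6745):
        percents = [round(100 * f, 1) for f in two_bit_state_fractions(t)]
        print(f"threshold {t} sigma -> {percents}")
```

A threshold near 0.98 sigma gives roughly the 17/33/33/17 split, while equal
25% occupancy corresponds to a threshold near 0.67 sigma.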
> >
> >
> >
> > Cheers
> >
> > Chris
> >
> >
> > ________________________________
> > From: Difx-users <difx-users-bounces at listmgr.nrao.edu> on behalf of Richard Dodson <richard.dodson at uwa.edu.au>
> > Sent: Friday, 26 August 2016 12:11 PM
> > To: Walter Brisken
> > Cc: Jan Wagner; difxusers; Adam Deller
> > Subject: Re: [Difx-users] Debugging a Nightmare problem
> >
> > Hi Walter
> >
> >  Back in KASI now, so I can attach the VEX, etc files.
> >  There certainly is a good chance that these contain an error (as well?)
> >
> >  I think I see a problem in the Bpasses (made w. m5spec) of two of the
> > KVN datasets (KY & KU, but not KT), that is a very high DC term. This (as I
> > recall) can be generated by a wrongly set sample transition level, but can
> > also come from an incorrect byte order. Is this right? I wonder if the
> > conversion from M5B to VDIF for KVN includes by default the required VERA
> > MSB/LSB bit order change?
> >
> >  These questions I will discuss today, with the correlator team.
> >
> >   m5stat.py generates nice numbers for all VDIF datafiles -- but I am
> > not sure if this should report the (stated) --HiMag to +HiMag stats (i.e.
> > approx 15,35,35,15%) (which the historical/Australians of us might call
> > VSOP format) or the VLBA stats (25,25,25,25%). Any ideas?
> >  VERA has VLBA stats, KVN has VSOP stats.
> >
> >     ATB
> >       Richard
> >
> > On Wed, Aug 24, 2016 at 8:30 PM, Walter Brisken <wbrisken at nrao.edu> wrote:
> >
> > It sounds like the data file has a problem.  You might want to use
> > printVDIFheader to diagnose.  Could you send the output of that for the
> > first few frames (and if you see a jump in threadId somewhere in the file
> > you might capture that as well)?  It might also be good to send around the
> > .vex, .v2d and .input files so we can see if everything hangs together.
> >
> > -W
> >
> > On Wed, 24 Aug 2016, Adam Deller wrote:
> >
> >> Hi Richard,
> >>
> >> Walter and/or Chris may be interested in the diagnosis a little further
> >> down.
> >>
> >> On Wed, Aug 24, 2016 at 11:01 AM, Richard Dodson <richard.dodson at uwa.edu.au> wrote:
> >>
> >>> Hi Adam
> >>>
> >>>  As I say this is a mess. The first TianMa, ATCA & KaVA observations. The
> >>> Australians, the Chinese and the Japanese all have their own `unique'
> >>> systems. Then these (except ATCA) have been extracted for the KJCC and
> >>> hardware correlated. This is the data that was exported for use with DiFX.
> >>>
> >>> At least 3 conversions between this file and the sky, all of which could
> >>> be wrong.
> >>>
> >>> The BW should be 32MHz. 8IFs of L pol. T6 and KaVA with different
> >>> sidebands. ATCA with 64MHz and dual pol (so only 50% coverage).
> >>>
> >>> So VDIF_1280-1024-8-2  is what I have been using. You say "which you
> >>> supply to the v2d file". In which place? As the FORMAT field? I have used
> >>> VDIF -- is this wrong?
> >>>
> >>
> >> No, you're right.  I misremembered where the format string for the unpacker
> >> gets generated (it is actually generated internally to DiFX, based on the
> >> format [VDIF] and the other information like number of bits, frame size,
> >> and number of subbands that are supplied elsewhere in the vex file and
> >> placed in the input file by vex2difx.)
> >>
> >>
> >>>
> >>> As an aside the conversion to VDIF was wrong (in invalid flag, day(!) and
> >>> no of sidebands). These I _think_ I have fixed, but using tools I don't
> >>> understand.
> >>>
> >>
> >> OK, so the header indeed thinks that there are 8 channels, so that is good.
> >>
> >> But when using m5d with VDIF_1280-1024-8-2, after the second frame it
> >> starts complaining of errors.  But if one tells m5d that the format is
> >> VDIF_1280-1024-1-2 (which means basically identical payload, but it is just
> >> 5120 samples from one channel in every packet, rather than 640 samples from
> >> each of 8 channels in every packet) then it works fine.  I think there might
> >> be a bug in the mark5access validator: if I run with valgrind I get:
> >>
> >> ==18198== Invalid read of size 4
> >> ==18198==    at 0x4E887A2: mark5_format_vdif_validate (format_vdif.c:3989)
> >> ==18198==    by 0x4E39C88: mark5_stream_next_frame (mark5_stream.c:166)
> >> ==18198==    by 0x4E8664F: vdif_decode_8channel_2bit_decimation1 (format_vdif.c:1131)
> >> ==18198==    by 0x4018AD: decode_short (m5d.c:165)
> >> ==18198==    by 0x4018AD: main (m5d.c:502)
> >> ==18198==  Address 0x58399f8 is 6 bytes after a block of size 2,626 alloc'd
> >> ==18198==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> >> ==18198==    by 0x4017AF: decode_short (m5d.c:142)
> >> ==18198==    by 0x4017AF: main (m5d.c:502)
> >>
> >> No such error is seen if I say this is 1 channel data.
> >>
> >> Not sure if the problem is specific to m5d or indicative of a wider bug in
> >> format_vdif for multichannel data.  You could try running mpifxcorr under
> >> valgrind and see if a similar error is caught.
> >>
> >> As an aside, in all cases, m5d reports MJD/seconds of 0/0.  But the header
> >> is clearly fine, because printVDIF shows the correct dates and times.  Not
> >> sure if this is important or not - I'm guessing probably not, it is probably
> >> just not being set properly before being printed out.  I doubt this is the
> >> root cause of the issue.
> >>
> >> I'm pretty sure similar data (8 channel, 2 bit real data) has been
> >> successfully used in mpifxcorr before, so I'm a bit puzzled to be honest.
> >> But I'm short on time right now to investigate further.
> >>
> >> Cheers,
> >> Adam
> >>
> >>
> >>> I made spectra, (m5spec) of a few of the files and they looked OK.
> >>>
> >>> I will get back to this later (tonight?). Juggling events. I'll check
> >>> bandpasses for a number of possible setups. If the Bpass looks right I will
> >>> bet the correct filters have been used.
> >>>
> >>>    Thanks for the help. I am at sea at the mo'
> >>>
> >>>          Richard
> >>>
> >>> On Wed, Aug 24, 2016 at 4:20 PM, Adam Deller <deller at astron.nl> wrote:
> >>>
> >>>> Hi Richard,
> >>>>
> >>>> I have a few observations for you:
> >>>>
> >>>> * Nothing strange in the file at a first glance - countVDIFPackets and
> >>>> printVDIF are happy with it.  It is 2 bit data.  Frame size is 1312 bytes,
> >>>> and the number of frames per second indicates that this is 1 Gbps data.
> >>>> * Using printVDIFheader tells me there are 8 channels in the single VDIF
> >>>> thread.  Combined with the other info, that implies the bandwidth per
> >>>> subband is 32 MHz? So then the format name (which you supply to the v2d
> >>>> file and hence the .input file) should be VDIF_1280-1024-8-2, I think.
> >>>>
> >>>> However, I then get funny results when I try to unpack the data using m5d
> >>>> and that format name.  It's happy for a while, and then starts to give
> >>>> unpack errors (which one usually gets if one mucks up the format name).  If
> >>>> I instead say the number of channels is 1 (so VDIF_1280-1024-1-2), which
> >>>> would mean a single 256 MHz wide channel, then it unpacks happily.
> >>>>
> >>>> So what's the deal with the number of subbands?  I think something is
> >>>> wrong somewhere, either 8 has been written into the header where 1 should
> >>>> have been, or something else like that.
> >>>>
> >>>> Cheers,
> >>>> Adam
> >>>>
> >>>> On Wed, Aug 24, 2016 at 4:31 AM, Richard Dodson <richard.dodson at uwa.edu.au> wrote:
> >>>>
> >>>>> Hi Adam
> >>>>>
> >>>>> vdifsummary seems to be a file in ~/Util in oper at KASI. I guess it is
> >>>>> something that Jan wrote. I will check.
> >>>>>
> >>>>> countVDIF is slow (took all night to finish) & I should have looked at
> >>>>> thread 1 not 0 (correct?). It is now running for 1. Nothing to note so far
> >>>>> eg:
> >>>>>
> >>>>> For thread 1, at second 39896, read 29300000 frames, spotted 0 missing frames
> >>>>> The start of the VDIF file (1GB) is at:
> >>>>>  http://ict.icrar.org/store/staff/rdodson/k16mk02f_ktn_start.vdif
> >>>>>
> >>>>>   Thanks for your help
> >>>>>      Richard
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Aug 22, 2016 at 6:18 PM, Adam Deller <deller at astron.nl> wrote:
> >>>>>
> >>>>>> Hi Richard,
> >>>>>>
> >>>>>> Looks like there is a problem mid-file, and when it tries to re-sync
> >>>>>> the header it finds is corrupted.  I can suggest a couple of things to try:
> >>>>>>
> >>>>>> You can run countVDIFpackets (a utility in vdifio) which is probably
> >>>>>> slower than vdifsummary (what utility is this?  I'm not aware of a
> >>>>>> "vdifsummary", there is a "vsum"...?) and is pretty basic but actually does
> >>>>>> check every packet, and prints a message every time a problem is seen.
> >>>>>> That might give you some extra clues, so I'd try that first.  And if you
> >>>>>> really want to get blasted away by lots of logging, you can use printVDIF,
> >>>>>> which prints a little summary of each and every packet header.  You could
> >>>>>> pipe that to grep to look for anomalies.
> >>>>>>
> >>>>>> Looks like the problem is very early in the file, so if you dd the
> >>>>>> first second or so and put it on an ftp server somewhere, I could also take
> >>>>>> a look.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Adam
> >>>>>>
> >>>>>> On Mon, Aug 22, 2016 at 10:57 AM, Richard Dodson <richard.dodson at uwa.edu.au> wrote:
> >>>>>>
> >>>>>>> Dear All
> >>>>>>>
> >>>>>>>  I have one of the usual nightmare twisted DiFX correlation problems.
> >>>>>>>
> >>>>>>>  I am trying to use DiFX on VDIF data which has been copied off the
> >>>>>>> VERA OCTAVE systems (and similar) and converted.
> >>>>>>>
> >>>>>>>   The problem is almost certainly in the data copying -- but I need to
> >>>>>>> provide some feedback on what is wrong for it to be fixed.
> >>>>>>>
> >>>>>>>   The first problem that I found was in the VDIF file: all the invalid
> >>>>>>> flags were set, the number of channels was wrong and the date was wrong by
> >>>>>>> 1 day. :(
> >>>>>>>
> >>>>>>>   Jan has a program to fix all of these :) -- but he is not around to
> >>>>>>> check if I have used this correctly :( :(
> >>>>>>>
> >>>>>>>    After these fixes the correlation runs, but the data file is empty.
> >>>>>>> What messages should I be checking to work out what is happening? I append
> >>>>>>> some messages which look suspicious but don't convey any information to me
> >>>>>>> ...
> >>>>>>>
> >>>>>>>         All the best
> >>>>>>>             Richard
> >>>>>>>
> >>>>>>> Comments:
> >>>>>>>   vdifsummary reports seem OK
> >>>>>>>
> >>>>>>> # vdifsummary /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
> >>>>>>> [1:1] check k16mk02f_kava_miz.vdif -> Good! it is a VDIF data scan -> add to 1
> >>>>>>> k16mk02f_kava_miz.vdif   4,108,790,400,000   31317 sec( 8:41:57)
> >>>>>>> 57467 Mar 20 2016y080d 11:00:03 - 19:41:59  1312 100000
> >>>>>>> 3,827 GB(=  3.7 TB)(= 4,108,790,400,000 B)
> >>>>>>>
> >>>>>>> Log messages which might be relevant:
> >>>>>>>
> >>>>>>> 2016-08-22 16:30:32,548 DiFXAlert INFO    MPI[ 1] compute-0-28.local k16mk02f_1   Datastream 1 has opened file index 0, which was /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
> >>>>>>>
> >>>>>>> 2016-08-22 16:30:32,548 DiFXAlert VERBOSE MPI[ 2] compute-0-28.local k16mk02f_1   input.bad() is 0, input.fail() is 0
> >>>>>>>
> >>>>>>> 2016-08-22 16:30:32,700 DiFXAlert ERROR   MPI[ 1] compute-0-28.local k16mk02f_1   Lost Sync on segment 1! Will attempt to resync. Deltatime was -1.13239e+09
> >>>>>>>
> >>>>>>> 2016-08-22 16:30:32,701 DiFXAlert INFO    MPI[ 1] compute-0-28.local k16mk02f_1   Config has changed!
> >>>>>>>
> >>>>>>> 2016-08-22 16:30:32,702 DiFXAlert INFO    MPI[ 1] compute-0-28.local k16mk02f_1   After regaining sync, the frame start day is 70573, the frame start seconds is 70631, the frame start ns is -2147483648, readscan is 2, readseconds is 1132388471, readnanoseconds is -2147483648
> >>>>>>>         note the 2^31 values !!!!
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> !=============================================================!
> >>>>>> Dr. Adam Deller
> >>>>>> Ph  +31 521595785 / Fax +31 521595101
> >>>>>> Staff Astronomer, Astronomy Group
> >>>>>> ASTRON, Oude Hoogeveensedijk 4
> >>>>>> 7991 PD Dwingeloo, The Netherlands
> >>>>>> !=============================================================!
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> -------------------------
> >>>>> Dr Richard Dodson,
> >>>>> International Centre for Radio Astronomy Research
> >>>>> University of Western Australia
> >>>>> P: +8 6488 7842 E: richard.dodson at icrar.org
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >
> >
> >
> >
>



-- 
-------------------------
Dr Richard Dodson,
International Centre for Radio Astronomy Research
University of Western Australia
P: +8 6488 7842 E: richard.dodson at icrar.org