[Difx-users] Debugging a Nightmare problem

Richard Dodson richard.dodson at uwa.edu.au
Wed Aug 24 08:41:06 EDT 2016


On Wed, Aug 24, 2016 at 8:59 PM, Walter Brisken <wbrisken at nrao.edu> wrote:

>
> That non-compliance should not affect any downstream decoding.  The note
> should be used to improve the quality of VDIF data being produced though,
> and could hint at a problem in some data conversion (e.g., failing to
> leave 16 bytes of zeros).
>
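
For anyone who wants to test for this condition directly, here is a minimal Python sketch. It is not part of vdifio or DiFX, and the function name is made up for illustration. It assumes the standard 32-byte non-legacy VDIF header of eight little-endian 32-bit words, with the EDV in the top byte of word 4, and reports whether a header claims EDV 0 while carrying non-zero bytes in the extended region (the rest of word 4 plus words 5 to 7).

```python
import struct

VDIF_HEADER_BYTES = 32  # non-legacy header: eight little-endian 32-bit words


def edv0_has_nonzero_extension(header: bytes) -> bool:
    """Hypothetical helper: True if EDV is 0 but the extended header is not all zero."""
    if len(header) < VDIF_HEADER_BYTES:
        raise ValueError("need at least 32 bytes of VDIF header")
    words = struct.unpack("<8I", header[:VDIF_HEADER_BYTES])
    if (words[0] >> 30) & 0x1:
        # Legacy frames have a 16-byte header and no extended region at all.
        return False
    edv = words[4] >> 24
    # Everything after the EDV byte: low 24 bits of word 4, then words 5-7.
    user_data = (words[4] & 0x00FFFFFF, words[5], words[6], words[7])
    return edv == 0 and any(user_data)
```

Reading the first 32 bytes of a file opened in binary mode and passing them to this function should be enough to reproduce the check for the first frame.
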
> One generally difficult case to handle (not likely relevant here, but just
> in case...) is when multiple data sources are used to generate separate
> threads but these threads don't all start and/or stop at the same time.
> This can lead to long periods where incomplete thread sets are present and
> it can confuse indexing into the file.


To which I note that VERA generates separate files for each IF, which have
been remerged here. A possible issue; we will discuss this with Dr Oh on
Friday back in Daejeon.

But the KVN files did not work with just those three antennas.....

        R+


> -W
>
> On Wed, 24 Aug 2016, Adam Deller wrote:
>
> > Hi Walter,
> >
> > I had used printVDIFheader earlier (to get the nChan) and thought there
> > were no problems.  But:
> >
> > deller at bunker trunk richard-vdif> printVDIFheader k16mk02f_ktn_start.vdif | more
> > Error: non-compliant VDIF data: this data has EDV set to 0 but the extended header is not identically 0
> > Set framesize to be 1312 bytes, based on first frame found
> > FrameNum Epoch  Seconds  Frame Thread Length Chans Bits L I C EDV
> >       0    32  6865203      0      1   1312     8    2 0 0 0   0
> >       1    32  6865203      1      1   1312     8    2 0 0 0   0
> >       2    32  6865203      2      1   1312     8    2 0 0 0   0
> >       3    32  6865203      3      1   1312     8    2 0 0 0   0
> >       4    32  6865203      4      1   1312     8    2 0 0 0   0
> >       5    32  6865203      5      1   1312     8    2 0 0 0   0
> >       6    32  6865203      6      1   1312     8    2 0 0 0   0
> > ....
> >
> > I'd looked at the header printouts, which seemed fine, but missed the
> > first line, where it notes that the file is non-compliant (non-zero
> > values in the extended header region).  I wouldn't have thought that
> > would lead to a problem, since all the data in the important part of
> > the header is apparently right, but maybe it is tripping up the
> > validate routine in vdifformat?
> >
> > Cheers,
> > Adam
> >
> > On Wed, Aug 24, 2016 at 1:30 PM, Walter Brisken <wbrisken at nrao.edu>
> wrote:
> >
> >>
> >> It sounds like the data file has a problem.  You might want to use
> >> printVDIFheader to diagnose.  Could you send the output of that for
> >> the first few frames (and if you see a jump in threadId somewhere in
> >> the file, you might capture that as well)?  It might also be good to
> >> send around the .vex, .v2d and .input files so we can see if
> >> everything hangs together.
> >>
> >> -W
> >>
> >>
> >> On Wed, 24 Aug 2016, Adam Deller wrote:
> >>
> >> Hi Richard,
> >>>
> >>> Walter and/or Chris may be interested in the diagnosis a little further
> >>> down.
> >>>
> >>> On Wed, Aug 24, 2016 at 11:01 AM, Richard Dodson <
> >>> richard.dodson at uwa.edu.au>
> >>> wrote:
> >>>
> >>> Hi Adam
> >>>>
> >>>>  As I say this is a mess. The first TianMa, ATCA & KaVA observations.
> >>>> The Australians, the Chinese and the Japanese all have their own
> >>>> `unique' systems. Then these (except ATCA) have been extracted for the
> >>>> KJJCC and hardware correlated. This is the data that was exported for
> >>>> use with DiFX.
> >>>>
> >>>> At least 3 conversions between this file and the sky, all of which
> >>>> could be wrong.
> >>>>
> >>>> The BW should be 32MHz. 8IFs of L pol. T6 and KaVA with different
> >>>> sidebands. ATCA with 64MHz and dual pol (so only 50% coverage).
> >>>>
> >>>> So VDIF_1280-1024-8-2  is what I have been using. You say "which you
> >>>> supply to the v2d file". In which place? As the FORMAT field? I have
> >>>> used VDIF -- is this wrong?
> >>>>
> >>>>
> >>> No, you're right.  I misremembered where the format string for the
> >>> unpacker gets generated (it is actually generated internally to DiFX,
> >>> based on the format [VDIF] and the other information like number of
> >>> bits, frame size, and number of subbands that are supplied elsewhere
> >>> in the vex file and placed in the input file by vex2difx.)
> >>>
> >>>
> >>>
> >>>> As an aside the conversion to VDIF was wrong (in invalid flag, day(!)
> >>>> and no of sidebands). These I _think_ I have fixed, but using tools I
> >>>> don't understand.
> >>>>
> >>>>
> >>> OK, so the header indeed thinks that there are 8 channels, so that is
> >>> good.
> >>>
> >>> But when using m5d with VDIF_1280-1024-8-2, after the second frame it
> >>> starts complaining of errors.  But if one tells m5d that the format is
> >>> VDIF_1280-1024-1-2 (which means a basically identical payload, just
> >>> with 5120 samples from one channel in every packet, rather than 640
> >>> samples from each of 8 channels in every packet) then it works fine.
> >>> I think there might be a bug in the mark5access validator: if I run
> >>> with valgrind I get:
> >>>
> >>> ==18198== Invalid read of size 4
> >>> ==18198==    at 0x4E887A2: mark5_format_vdif_validate (format_vdif.c:3989)
> >>> ==18198==    by 0x4E39C88: mark5_stream_next_frame (mark5_stream.c:166)
> >>> ==18198==    by 0x4E8664F: vdif_decode_8channel_2bit_decimation1 (format_vdif.c:1131)
> >>> ==18198==    by 0x4018AD: decode_short (m5d.c:165)
> >>> ==18198==    by 0x4018AD: main (m5d.c:502)
> >>> ==18198==  Address 0x58399f8 is 6 bytes after a block of size 2,626 alloc'd
> >>> ==18198==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> >>> ==18198==    by 0x4017AF: decode_short (m5d.c:142)
> >>> ==18198==    by 0x4017AF: main (m5d.c:502)
> >>>
> >>> No such error is seen if I say this is 1 channel data.
> >>>
> >>> Not sure if the problem is specific to m5d or indicative of a wider
> >>> bug in format_vdif for multichannel data.  You could try running
> >>> mpifxcorr under valgrind and see if a similar error is caught.
> >>>
> >>> As an aside, in all cases, m5d reports MJD/seconds of 0/0.  But the
> >>> header is clearly fine, because printVDIF shows the correct dates and
> >>> times.  Not sure if this is important or not - I'm guessing probably
> >>> not, it is probably just not being set properly before being printed
> >>> out.  I doubt this is the root cause of the issue.
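
As a cross-check on the header times, the following Python sketch converts the epoch and seconds fields from the printVDIFheader output quoted in this thread (epoch 32, seconds 6865203) into a UTC date and an MJD/seconds pair. It assumes the usual VDIF convention that the reference epoch counts half-years from 2000-01-01 UTC, and it ignores leap seconds, which is harmless here because none fall between 1 January and 20 March 2016. The function names are illustrative only and do not come from mark5access or vdifio.

```python
from datetime import datetime, timedelta, timezone


def vdif_time(ref_epoch: int, seconds: int) -> datetime:
    """Convert VDIF reference epoch + seconds to a UTC datetime (no leap seconds)."""
    year = 2000 + ref_epoch // 2
    month = 1 if ref_epoch % 2 == 0 else 7  # even epochs start 1 Jan, odd 1 Jul
    return datetime(year, month, 1, tzinfo=timezone.utc) + timedelta(seconds=seconds)


def mjd_and_seconds(t: datetime) -> tuple:
    """Split a UTC datetime into (integer MJD, seconds of day)."""
    mjd_zero = datetime(1858, 11, 17, tzinfo=timezone.utc)
    delta = t - mjd_zero
    return delta.days, delta.seconds


start = vdif_time(32, 6865203)
print(start.isoformat(), mjd_and_seconds(start))
```

For these values the result is 2016-03-20 11:00:03 UTC, MJD 57467, which agrees with the start time and MJD in the vdifsummary line further down the thread. So a 0/0 from m5d would have to come from somewhere other than the header contents.
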
> >>>
> >>> I'm pretty sure similar data (8 channel, 2 bit real data) has been
> >>> successfully used in mpifxcorr before, so I'm a bit puzzled to be
> >>> honest.  But I'm short on time right now to investigate further.
> >>>
> >>> Cheers,
> >>> Adam
> >>>
> >>>
> >>> I made spectra (m5spec) of a few of the files and they looked OK.
> >>>>
> >>>> I will get back to this later (tonight?). Juggling events. I'll check
> >>>> bandpasses for a number of possible setups. If the bandpass looks
> >>>> right I will assume the correct filters have been used.
> >>>>
> >>>>    Thanks for the help. I am at sea at the mo'
> >>>>
> >>>>          Richard
> >>>>
> >>>> On Wed, Aug 24, 2016 at 4:20 PM, Adam Deller <deller at astron.nl>
> wrote:
> >>>>
> >>>> Hi Richard,
> >>>>>
> >>>>> I have a few observations for you:
> >>>>>
> >>>>> * Nothing strange in the file at a first glance - countVDIFPackets
> >>>>> and printVDIF are happy with it.  It is 2 bit data.  Frame size is
> >>>>> 1312 bytes, and the number of frames per second indicates that this
> >>>>> is 1 Gbps data.
> >>>>> * Using printVDIFheader tells me there are 8 channels in the single
> >>>>> VDIF thread.  Combined with the other info, that implies the bandwidth
> >>>>> per subband is 32 MHz? So then the format name (which you supply to
> >>>>> the v2d file and hence the .input file) should be VDIF_1280-1024-8-2,
> >>>>> I think.
> >>>>>
> >>>>> However, I then get funny results when I try to unpack the data using
> >>>>> m5d and that format name.  It's happy for a while, and then starts to
> >>>>> give unpack errors (which one usually gets if one mucks up the format
> >>>>> name).  If I instead say the number of channels is 1 (so
> >>>>> VDIF_1280-1024-1-2), which would mean a single 256 MHz wide channel,
> >>>>> then it unpacks happily.
> >>>>>
> >>>>> So what's the deal with the number of subbands?  I think something is
> >>>>> wrong somewhere, either 8 has been written into the header where 1
> >>>>> should have been, or something else like that.
> >>>>>
> >>>>> Cheers,
> >>>>> Adam
> >>>>>
> >>>>> On Wed, Aug 24, 2016 at 4:31 AM, Richard Dodson <
> >>>>> richard.dodson at uwa.edu.au> wrote:
> >>>>>
> >>>>> Hi Adam
> >>>>>>
> >>>>>> vdifsummary seems to be a file in ~/Util in oper as KASI. I guess
> >>>>>> it is something that Jan wrote. I will check.
> >>>>>>
> >>>>>> countVDIF is slow (took all night to finish) &  I should have
> >>>>>> looked at thread 1 not 0 (correct?). It is now running for 1.
> >>>>>> Nothing to note so far, e.g.:
> >>>>>>
> >>>>>> For thread 1, at second 39896, read 29300000 frames, spotted 0 missing frames
> >>>>>>
> >>>>>> The start of the VDIF file (1GB) is at:
> >>>>>>  http://ict.icrar.org/store/staff/rdodson/k16mk02f_ktn_start.vdif
> >>>>>>
> >>>>>>   Thanks for your help
> >>>>>>      Richard
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Aug 22, 2016 at 6:18 PM, Adam Deller <deller at astron.nl>
> wrote:
> >>>>>>
> >>>>>> Hi Richard,
> >>>>>>>
> >>>>>>> Looks like there is a problem mid-file, and when it tries to
> >>>>>>> re-sync the header it finds is corrupted.  I can suggest a couple
> >>>>>>> of things to try:
> >>>>>>>
> >>>>>>> You can run countVDIFpackets (a utility in vdifio) which is
> >>>>>>> probably slower than vdifsummary (what utility is this?  I'm not
> >>>>>>> aware of a "vdifsummary", there is a "vsum"...?) and is pretty basic
> >>>>>>> but actually does check every packet, and prints a message every
> >>>>>>> time a problem is seen.  That might give you some extra clues, so
> >>>>>>> I'd try that first.  And if you really want to get blasted away by
> >>>>>>> lots of logging, you can use printVDIF, which prints a little
> >>>>>>> summary of each and every packet header.  You could pipe that to
> >>>>>>> grep to look for anomalies.
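
As an alternative to grepping printVDIF output, a per-packet scan along these lines can be sketched in Python. This is not how countVDIFpackets or printVDIF are implemented; it is a rough illustration that assumes fixed-size frames starting at byte 0, a non-legacy little-endian header, and a known number of frames per second per thread. It makes no attempt to resynchronise after a corrupt frame, and all names are made up for this example.

```python
import struct
from typing import BinaryIO, Iterator, List, Tuple


def iter_headers(stream: BinaryIO, frame_bytes: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (seconds, frame number, thread ID, header length in bytes) per frame."""
    while True:
        frame = stream.read(frame_bytes)
        if len(frame) < frame_bytes:
            return  # end of file or trailing partial frame
        w0, w1, w2, w3 = struct.unpack("<4I", frame[:16])
        yield w0 & 0x3FFFFFFF, w1 & 0x00FFFFFF, (w3 >> 16) & 0x3FF, (w2 & 0x00FFFFFF) * 8


def find_anomalies(stream: BinaryIO, frame_bytes: int, frames_per_second: int) -> List[str]:
    """Report frames whose length field is wrong or whose per-thread counter jumps."""
    last = {}  # thread ID -> running frame count of the previous frame seen
    notes = []
    for index, (sec, num, thread, length) in enumerate(iter_headers(stream, frame_bytes)):
        if length != frame_bytes:
            notes.append(f"frame {index}: header length {length} != {frame_bytes}")
        count = sec * frames_per_second + num
        if thread in last and count != last[thread] + 1:
            notes.append(f"frame {index}: thread {thread} jumped by {count - last[thread]}")
        last[thread] = count
    return notes
```

For the file discussed here one would open it in binary mode and call the scan with a frame size of 1312 and 100000 frames per second. An empty list would mean every thread's frame counter advanced by exactly one each time.
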
> >>>>>>>
> >>>>>>> Looks like the problem is very early in the file, so if you dd the
> >>>>>>> first second or so and put it on an ftp server somewhere, I could
> >>>>>>> also take a look.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Adam
> >>>>>>>
> >>>>>>> On Mon, Aug 22, 2016 at 10:57 AM, Richard Dodson <
> >>>>>>> richard.dodson at uwa.edu.au> wrote:
> >>>>>>>
> >>>>>>> Dear All
> >>>>>>>>
> >>>>>>>>  I have one of the usual nightmare twisted DiFX correlation
> >>>>>>>> problems.
> >>>>>>>>
> >>>>>>>>  I am trying to use DiFX on VDIF data which has been copied off
> >>>>>>>> the VERA OCTAVE systems (and similar) and converted.
> >>>>>>>>
> >>>>>>>>   The problem is almost certainly in the data copying -- but I
> >>>>>>>> need to provide some feedback on what is wrong for it to be fixed.
> >>>>>>>>
> >>>>>>>>   The first problem that I found was in the VDIF file: all the
> >>>>>>>> invalid flags were set, the number of channels was wrong and the
> >>>>>>>> date was wrong by 1 day. :(
> >>>>>>>>
> >>>>>>>>   Jan has a program to fix all of these :) -- but he is not
> >>>>>>>> around to check if I have used this correctly :( :(
> >>>>>>>>
> >>>>>>>>    After these fixes the correlation runs, but the data file is
> >>>>>>>> empty. What messages should I be checking to work out what is
> >>>>>>>> happening? I append some messages which look suspicious but don't
> >>>>>>>> convey any information to me ...
> >>>>>>>>
> >>>>>>>>         All the best
> >>>>>>>>             Richard
> >>>>>>>>
> >>>>>>>> Comments:
> >>>>>>>>   vdifsummary reports seem OK
> >>>>>>>>
> >>>>>>>> # vdifsummary /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
> >>>>>>>> [1:1] check k16mk02f_kava_miz.vdif -> Good! it is a VDIF data scan -> add to 1
> >>>>>>>> k16mk02f_kava_miz.vdif   4,108,790,400,000   31317 sec( 8:41:57)
> >>>>>>>> 57467 Mar 20 2016y080d 11:00:03 - 19:41:59  1312 100000
> >>>>>>>> 3,827 GB(=  3.7 TB)(= 4,108,790,400,000 B)
> >>>>>>>>
> >>>>>>>> Log messages which might be relevant:
> >>>>>>>>
> >>>>>>>> 2016-08-22 16:30:32,548 DiFXAlert INFO    MPI[ 1] compute-0-28.local k16mk02f_1   Datastream 1 has opened file index 0, which was /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
> >>>>>>>>
> >>>>>>>> 2016-08-22 16:30:32,548 DiFXAlert VERBOSE MPI[ 2] compute-0-28.local k16mk02f_1   input.bad() is 0, input.fail() is 0
> >>>>>>>>
> >>>>>>>> 2016-08-22 16:30:32,700 DiFXAlert ERROR   MPI[ 1] compute-0-28.local k16mk02f_1   Lost Sync on segment 1! Will attempt to resync. Deltatime was -1.13239e+09
> >>>>>>>>
> >>>>>>>> 2016-08-22 16:30:32,701 DiFXAlert INFO    MPI[ 1] compute-0-28.local k16mk02f_1   Config has changed!
> >>>>>>>>
> >>>>>>>> 2016-08-22 16:30:32,702 DiFXAlert INFO    MPI[ 1] compute-0-28.local k16mk02f_1   After regaining sync, the frame start day is 70573, the frame start seconds is 70631, the frame start ns is -2147483648, readscan is 2, readseconds is 1132388471, readnanoseconds is -2147483648
> >>>>>>>>         note the 2^31 values !!!!
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Difx-users mailing list
> >>>>>>>> Difx-users at listmgr.nrao.edu
> >>>>>>>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> !=============================================================!
> >>>>>>> Dr. Adam Deller
> >>>>>>> Ph  +31 521595785 / Fax +31 521595101
> >>>>>>> Staff Astronomer, Astronomy Group
> >>>>>>> ASTRON, Oude Hoogeveensedijk 4
> >>>>>>> 7991 PD Dwingeloo, The Netherlands
> >>>>>>> !=============================================================!
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> -------------------------
> >>>>>> Dr Richard Dodson,
> >>>>>> International Centre for Radio Astronomy Research
> >>>>>> University of Western Australia
> >>>>>> P: +8 6488 7842 E: richard.dodson at icrar.org
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
>



-- 
-------------------------
Dr Richard Dodson,
International Centre for Radio Astronomy Research
University of Western Australia
P: +8 6488 7842 E: richard.dodson at icrar.org