[Difx-users] Debugging a Nightmare problem

Richard Dodson richard.dodson at uwa.edu.au
Thu Aug 25 22:37:49 EDT 2016


Hi Chris
  You are right -- the 50% came from counting the bitstream channels, not
from the actual values, as reported in m5stat.py
  More questions for the guys downstairs.
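
  For reference, a rough sketch of where the two sets of numbers come from, assuming Gaussian noise and a high/low threshold near 0.98 sigma (my assumption, the usual near-optimal 2-bit setting). The function names and the sign/magnitude bit convention are made up for illustration; this is not what m5stat.py does internally.

```python
import math

def gaussian_cdf(x):
    """Cumulative distribution of a zero-mean, unit-variance Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def four_level_fractions(threshold_sigma):
    """Fractions of samples in the states (-Hi, -Lo, +Lo, +Hi)."""
    high = 1.0 - gaussian_cdf(threshold_sigma)  # one tail beyond the threshold
    low = 0.5 - high                            # between zero and the threshold
    return (high, low, low, high)

def bitstream_fractions(fracs):
    """Fraction of ones in the sign and magnitude bitstreams.

    Assumes sign bit = 1 for positive samples and magnitude bit = 1 for
    the two high states (an illustrative convention only).
    """
    neg_hi, neg_lo, pos_lo, pos_hi = fracs
    return pos_lo + pos_hi, neg_hi + pos_hi

fracs = four_level_fractions(0.98)  # assumed threshold, in units of sigma
sign_ones, mag_ones = bitstream_fractions(fracs)
print(["%.1f%%" % (100 * f) for f in fracs])
print("sign bit set: %.1f%%, magnitude bit set: %.1f%%"
      % (100 * sign_ones, 100 * mag_ones))
```

  With these assumptions the four states come out at roughly 16/34/34/16%, while the sign bitstream alone is set half the time, which is one way a 50% figure can appear from counting bits. An even 25/25/25/25 split would need a threshold near 0.67 sigma instead.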

  My guess then is that the swap between LSB & MSB is not done at all.
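
  A minimal sketch of what I mean by the swap, assuming it amounts to exchanging the two bits inside every 2-bit sample, with four samples packed per byte. Both function names are hypothetical, and the actual M5B to VDIF converter may do something different.

```python
def swap_bit_pairs(data: bytes) -> bytes:
    """Swap the two bits inside every 2-bit sample of each byte."""
    return bytes(((b & 0xAA) >> 1) | ((b & 0x55) << 1) for b in data)

def count_states(data: bytes):
    """Histogram of the four 2-bit codes (0..3), four samples per byte."""
    counts = [0, 0, 0, 0]
    for b in data:
        for shift in (0, 2, 4, 6):
            counts[(b >> shift) & 0x3] += 1
    return counts

raw = bytes([0b01010100])      # made-up payload byte: three 01 codes, one 00
swapped = swap_bit_pairs(raw)
print(count_states(raw), count_states(swapped))
```

  The swap exchanges codes 01 and 10 and leaves 00 and 11 alone. If the data are read as offset binary, that trades the two low-magnitude states, so a symmetric four-state histogram looks the same either way and would not by itself show whether the swap was applied.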

On Fri, Aug 26, 2016 at 11:24 AM, Chris.Phillips at csiro.au <
Chris.Phillips at csiro.au> wrote:

> Hi Richard
>
>
> Are you sure "VLBA" stats would be 25%? That sounds very dubious to me.
> For 2-bit data you want the 17/33% ratio for optimal SNR, unless it is just
> a side effect of how the data is decoded (e.g. the S2 used to report 50%, but
> that was just how it was counting the bits - if you report 4 numbers they
> should be in the 17/33/33/17 ratio).
>
>
>
> Cheers
>
> Chris
>
>
> ------------------------------
> *From:* Difx-users <difx-users-bounces at listmgr.nrao.edu> on behalf of
> Richard Dodson <richard.dodson at uwa.edu.au>
> *Sent:* Friday, 26 August 2016 12:11 PM
> *To:* Walter Brisken
> *Cc:* Jan Wagner; difxusers; Adam Deller
> *Subject:* Re: [Difx-users] Debugging a Nightmare problem
>
> Hi Walter
>
>   Back in KASI now, so I can attach the VEX, etc files.
>   There certainly is a good chance that these contain an error (as well?)
>
>   I think I see a problem in the Bpasses (made w. m5spec) of two of the
> KVN datasets (KY & KU, but not KT): a very high DC term. This (as I
> recall) can be generated by a wrongly set sample transition level, but can
> also come from an incorrect byte order. Is this right? I wonder if the
> conversion from M5B to VDIF for KVN includes by default the required VERA
> MSB/LSB bit order change?
>
>   These questions I will discuss today, with the correlator team.
>
>    m5stat.py generates nice numbers for all VDIF datafiles -- but I am not
> sure if this should report the (stated) --HiMag to +HiMag stats (i.e.
> approx 15,35,35,15%) (which the historical/Australians of us might call
> VSOP format) or the VLBA stats (25,25,25,25%). Any ideas?
>   VERA has VLBA stats, KVN has VSOP stats.
>
>      ATB
>        Richard
>
> On Wed, Aug 24, 2016 at 8:30 PM, Walter Brisken <wbrisken at nrao.edu> wrote:
>
>>
>> It sounds like the data file has a problem.  You might want to use
>> printVDIFheader to diagnose.  Could you send the output of that for the
>> first few frames (and if you see a jump in threadId somewhere in the file
>> you might capture that as well)?  It might also be good to send around the
>> .vex, .v2d and .input files so we can see if everything hangs
>> together.
>>
>> -W
>>
>> On Wed, 24 Aug 2016, Adam Deller wrote:
>>
>> > Hi Richard,
>> >
>> > Walter and/or Chris may be interested in the diagnosis a little further
>> > down.
>> >
>> > On Wed, Aug 24, 2016 at 11:01 AM, Richard Dodson <
>> richard.dodson at uwa.edu.au>
>> > wrote:
>> >
>> >> Hi Adam
>> >>
>> >>  As I say this is a mess. The first TianMa, ATCA & KaVA observations.
>> The
>> >> Australians, the Chinese and the Japanese all have their own `unique'
>> >> systems. Then these (except ATCA) have been extracted for the KJJCC and
>> >> hardware correlated. This is the data that was exported for use with
>> DiFX.
>> >>
>> >> At least 3 conversions between this file and the sky, all of which
>> could
>> >> be wrong.
>> >>
>> >> The BW should be 32MHz. 8IFs of L pol. T6 and KaVA with different
>> >> sidebands. ATCA with 64MHz and dual pol (so only 50% coverage).
>> >>
>> >> So VDIF_1280-1024-8-2  is what I have been using. You say "which you
>> >> supply to the v2d file". In which place? As the FORMAT field? I have
>> used
>> >> VDIF -- is this wrong?
>> >>
>> >
>> > No, you're right.  I misremembered where the format string for the
>> unpacker
>> > gets generated (it is actually generated internally to DiFX, based on
>> the
>> > format [VDIF] and the other information like number of bits, frame size,
>> > and number of subbands that are supplied elsewhere in the vex file and
>> > placed in the input file by vex2difx.)
>> >
>> >
>> >>
>> >> As an aside the conversion to VDIF was wrong (in invalid flag, day(!)
>> and
>> >> no of sidebands). These I _think_ I have fixed, but using tools I don't
>> >> understand.
>> >>
>> >
>> > OK, so the header indeed thinks that there are 8 channels, so that is
>> good.
>> >
>> > But when using m5d with VDIF_1280-1024-8-2, after the second frame it
>> > starts complaining of errors.  But if one tells m5d that the format is
>> > VDIF_1280-1024-1-2
>> > (which means basically identical payload, but it is just 5120 samples
>> from
>> > one channel in every packet, rather than 640 samples from each of 8
>> > channels in every packet) then it works fine.  I think there might be a
>> bug
>> > in the mark5access validator: if I run with valgrind I get:
>> >
>> > ==18198== Invalid read of size 4
>> > ==18198==    at 0x4E887A2: mark5_format_vdif_validate
>> (format_vdif.c:3989)
>> > ==18198==    by 0x4E39C88: mark5_stream_next_frame (mark5_stream.c:166)
>> > ==18198==    by 0x4E8664F: vdif_decode_8channel_2bit_decimation1
>> > (format_vdif.c:1131)
>> > ==18198==    by 0x4018AD: decode_short (m5d.c:165)
>> > ==18198==    by 0x4018AD: main (m5d.c:502)
>> > ==18198==  Address 0x58399f8 is 6 bytes after a block of size 2,626
>> alloc'd
>> > ==18198==    at 0x4C2AB80: malloc (in
>> > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
>> > ==18198==    by 0x4017AF: decode_short (m5d.c:142)
>> > ==18198==    by 0x4017AF: main (m5d.c:502)
>> >
>> > No such error is seen if I say this is 1 channel data.
>> >
>> > Not sure if the problem is specific to m5d or indicative of a wider bug
>> in
>> > format_vdif for multichannel data.  You could try running mpifxcorr
>> under
>> > valgrind and see if a similar error is caught.
>> >
>> > As an aside, in all cases, m5d reports MJD/seconds of 0/0.  But the
>> header
>> > is clearly fine, because printVDIF shows the correct dates and times.
>> Not
>> > sure if this is important or not - I'm guessing probably not, it is
>> probably
>> > just not being set properly before being printed out.  I doubt this is
>> the
>> > root cause of the issue.
>> >
>> > I'm pretty sure similar data (8 channel, 2 bit real data) has been
>> > successfully used in mpifxcorr before, so I'm a bit puzzled to be
>> honest.
>> > But I'm short on time right now to investigate further.
>> >
>> > Cheers,
>> > Adam
>> >
>> >
>> >> I made spectra, (m5spec) of a few of the files and they looked OK.
>> >>
>> >> I will get back to this later (tonight?). Juggling events. I'll check
>> >> bandpasses for a number of possible setups. If the Bpass looks right I will
>> >> bet the correct filters have been used.
>> >>
>> >>    Thanks for the help. I am at sea at the mo'
>> >>
>> >>          Richard
>> >>
>> >> On Wed, Aug 24, 2016 at 4:20 PM, Adam Deller <deller at astron.nl> wrote:
>> >>
>> >>> Hi Richard,
>> >>>
>> >>> I have a few observations for you:
>> >>>
>> >>> * Nothing strange in the file at a first glance - countVDIFPackets and
>> >>> printVDIF are happy with it.  It is 2 bit data.  Frame size is 1312
>> bytes,
>> >>> and the number of frames per second indicates that this is 1 Gbps
>> data.
>> >>> * Using printVDIFheader tells me there are 8 channels in the single
>> VDIF
>> >>> thread.  Combined with the other info, that implies the bandwidth per
>> >>> subband is 32 MHz? So then the format name (which you supply to the
>> v2d
>> >>> file and hence the .input file) should be VDIF_1280-1024-8-2, I think.
>> >>>
>> >>> However, I then get funny results when I try to unpack the data using
>> m5d
>> >>> and that format name.  It's happy for a while, and then starts to give
>> >>> unpack errors (which one usually gets if one mucks up the format
>> name).  If
>> >>> I instead say the number of channels is 1 (so VDIF_1280-1024-1-2),
>> which
>> >>> would mean a single 256 MHz wide channel, then it unpacks happily.
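
A quick sketch of the arithmetic behind these format names, assuming the fields are VDIF_<payload bytes>-<Mbps>-<channels>-<bits>, a 32-byte VDIF header, and real (Nyquist) sampling. The function name is hypothetical and this is not mark5access code.

```python
def describe_vdif_format(name, header_bytes=32):
    """Derive frame and bandwidth numbers from a name like VDIF_1280-1024-8-2."""
    prefix, fields = name.split("_", 1)
    if prefix != "VDIF":
        raise ValueError("expected a VDIF_<payload>-<Mbps>-<nchan>-<nbit> name")
    payload_bytes, mbps, nchan, nbit = (int(x) for x in fields.split("-"))
    samples_per_frame = payload_bytes * 8 // nbit
    return {
        "frame_bytes": payload_bytes + header_bytes,
        "frames_per_second": mbps * 1_000_000 // (payload_bytes * 8),
        "samples_per_frame": samples_per_frame,
        "samples_per_channel_per_frame": samples_per_frame // nchan,
        # Real sampling assumed: 2 samples per second per Hz of bandwidth.
        "bandwidth_mhz_per_channel": mbps / (nchan * nbit * 2),
    }

for name in ("VDIF_1280-1024-8-2", "VDIF_1280-1024-1-2"):
    print(name, describe_vdif_format(name))
```

Under those assumptions both names give 1312-byte frames at 100000 frames per second, matching what the file reports. They differ only in how the 5120 samples per frame are divided: 8 channels of 32 MHz, or a single channel of 256 MHz.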
>> >>>
>> >>> So what's the deal with the number of subbands?  I think something is
>> >>> wrong somewhere, either 8 has been written into the header where 1
>> should
>> >>> have been, or something else like that.
>> >>>
>> >>> Cheers,
>> >>> Adam
>> >>>
>> >>> On Wed, Aug 24, 2016 at 4:31 AM, Richard Dodson <
>> >>> richard.dodson at uwa.edu.au> wrote:
>> >>>
>> >>>> Hi Adam
>> >>>>
>> >>>> vdifsummary seems to be a file in ~/Util in oper at KASI. I guess it is
>> >>>> something that Jan wrote. I will check.
>> >>>>
>> >>>> countVDIF is slow (took all night to finish) &  I should have looked
>> at
>> >>>> thread 1 not 0 (correct?). It is now running for 1. Nothing to note
>> so far
>> >>>> eg:
>> >>>>
>> >>>> For thread 1, at second 39896, read 29300000 frames, spotted 0
>> missing
>> >>>> frames
>> >>>> The start of the VDIF file (1GB) is at:
>> >>>>  http://ict.icrar.org/store/staff/rdodson/k16mk02f_ktn_start.vdif
>> >>>>
>> >>>>   Thanks for your help
>> >>>>      Richard
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Aug 22, 2016 at 6:18 PM, Adam Deller <deller at astron.nl>
>> wrote:
>> >>>>
>> >>>>> Hi Richard,
>> >>>>>
>> >>>>> Looks like there is a problem mid-file, and when it tries to re-sync
>> >>>>> the header it finds is corrupted.  I can suggest a couple of things
>> to try:
>> >>>>>
>> >>>>> you can run countVDIFpackets (a utility in vdifio) which is probably
>> >>>>> slower than vdifsummary (what utility is this?  I'm not aware of a
>> >>>>> "vdifsummary", there is a "vsum"...?) and is pretty basic but
>> actually does
>> >>>>> check for every packet, and prints a message every time a problem
>> is seen.
>> >>>>> That might give you some extra clues, so I'd try that first.  And
>> if you
>> >>>>> really want to get blasted away by lots of logging, you can use
>> printVDIF,
>> >>>>> which prints a little summary of each and every packet header.  You
>> could
>> >>>>> pipe that to grep to look for anomalies.
>> >>>>>
>> >>>>> Looks like the problem is very early in the file, so if you dd the
>> >>>>> first second or so and put it on an ftp server somewhere, I could
>> also take
>> >>>>> a look.
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Adam
>> >>>>>
>> >>>>> On Mon, Aug 22, 2016 at 10:57 AM, Richard Dodson <
>> >>>>> richard.dodson at uwa.edu.au> wrote:
>> >>>>>
>> >>>>>> Dear All
>> >>>>>>
>> >>>>>>  I have one of the usual nightmare twisted DiFX correlation
>> problems.
>> >>>>>>
>> >>>>>>  I am trying to use DiFX on VDIF data which has been copied off the
>> >>>>>> VERA OCTAVE systems (and similar) and converted.
>> >>>>>>
>> >>>>>>   The problem is almost certainly in the data copying -- but I
>> need to
>> >>>>>> provide some feedback on what is wrong for it to be fixed
>> >>>>>>
>> >>>>>>   The first problem that I found was in the VDIF file: all the
>> invalid
>> >>>>>> flags were set, the number of channels was wrong and the date was
>> wrong by
>> >>>>>> 1 day. :(
>> >>>>>>
>> >>>>>>   Jan has a program to fix all of these :) -- but he is not around
>> to
>> >>>>>> check if I have used this correctly :( :(
>> >>>>>>
>> >>>>>>    After these fixes the correlation runs, but the data file is
>> empty.
>> >>>>>> What messages should I be checking to work out what is happening?
>> I append
>> >>>>>> some messages which look suspicious but don't convey any
>> information to me
>> >>>>>> ...
>> >>>>>>
>> >>>>>>         All the best
>> >>>>>>             Richard
>> >>>>>>
>> >>>>>> Comments:
>> >>>>>>   vdifsummary reports seem OK
>> >>>>>>
>> >>>>>> # vdifsummary /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
>> >>>>>> [1:1] check k16mk02f_kava_miz.vdif -> Good! it is a VDIF data scan
>> ->
>> >>>>>> add to 1
>> >>>>>> k16mk02f_kava_miz.vdif   4,108,790,400,000   31317 sec( 8:41:57)
>> >>>>>> 57467 Mar 20 2016y080d 11:00:03 - 19:41:59  1312 100000
>> >>>>>> 3,827 GB(=  3.7 TB)(= 4,108,790,400,000 B)
>> >>>>>>
>> >>>>>> Log messages which might be relevant:
>> >>>>>>
>> >>>>>> 2016-08-22 16:30:32,548 DiFXAlert INFO    MPI[ 1]
>> compute-0-28.local
>> >>>>>> k16mk02f_1   Datastream 1 has opened file index 0, which was
>> >>>>>> /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
>> >>>>>>
>> >>>>>> 2016-08-22 16:30:32,548 DiFXAlert VERBOSE MPI[ 2]
>> compute-0-28.local
>> >>>>>> k16mk02f_1   input.bad() is 0, input.fail() is 0
>> >>>>>>
>> >>>>>> 2016-08-22 16:30:32,700 DiFXAlert ERROR   MPI[ 1]
>> compute-0-28.local
>> >>>>>> k16mk02f_1   Lost Sync on segment 1! Will attempt to resync.
>> Deltatime was
>> >>>>>> -1.13239e+09
>> >>>>>>
>> >>>>>> 2016-08-22 16:30:32,701 DiFXAlert INFO    MPI[ 1]
>> compute-0-28.local
>> >>>>>> k16mk02f_1   Config has changed!
>> >>>>>>
>> >>>>>> 2016-08-22 16:30:32,702 DiFXAlert INFO    MPI[ 1]
>> compute-0-28.local
>> >>>>>> k16mk02f_1   After regaining sync, the frame start day is 70573,
>> the frame
>> >>>>>> start seconds is 70631, the frame start ns is -2147483648,
>> readscan is 2,
>> >>>>>> readseconds is 1132388471, readnanoseconds is -2147483648
>> >>>>>>         note the 2^31 values !!!!
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> Difx-users mailing list
>> >>>>>> Difx-users at listmgr.nrao.edu
>> >>>>>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> !=============================================================!
>> >>>>> Dr. Adam Deller
>> >>>>> Ph  +31 521595785 / Fax +31 521595101
>> >>>>> Staff Astronomer, Astronomy Group
>> >>>>> ASTRON, Oude Hoogeveensedijk 4
>> >>>>> 7991 PD Dwingeloo, The Netherlands
>> >>>>> !=============================================================!
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> -------------------------
>> >>>> Dr Richard Dodson,
>> >>>> International Centre for Radio Astronomy Research
>> >>>> University of Western Australia
>> >>>> P: +8 6488 7842 E: richard.dodson at icrar.org
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >
>>
>
>
>
>



-- 
-------------------------
Dr Richard Dodson,
International Centre for Radio Astronomy Research
University of Western Australia
P: +8 6488 7842 E: richard.dodson at icrar.org