[Difx-users] Debugging a Nightmare problem

Adam Deller deller at astron.nl
Wed Aug 24 06:04:47 EDT 2016


Hi Richard,

Walter and/or Chris may be interested in the diagnosis a little further
down.

On Wed, Aug 24, 2016 at 11:01 AM, Richard Dodson <richard.dodson at uwa.edu.au>
wrote:

> Hi Adam
>
>  As I say this is a mess. The first TianMa, ATCA & KaVA observations. The
> Australians, the Chinese and the Japanese all have their own `unique'
> systems. Then these (except ATCA) have been extracted for the KJJCC and
> hardware correlated. This is the data that was exported for use with DiFX.
>
> At least 3 conversions between this file and the sky, all of which could
> be wrong.
>
> The BW should be 32MHz. 8IFs of L pol. T6 and KaVA with different
> sidebands. ATCA with 64MHz and dual pol (so only 50% coverage).
>
> So VDIF_1280-1024-8-2  is what I have been using. You say "which you
> supply to the v2d file". In which place? As the FORMAT field? I have used
> VDIF -- is this wrong?
>

No, you're right.  I misremembered where the format string for the unpacker
gets generated (it is actually generated internally to DiFX, based on the
format [VDIF] and the other information like number of bits, frame size,
and number of subbands that are supplied elsewhere in the vex file and
placed in the input file by vex2difx.)
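
To illustrate how those pieces combine into a name like the one you have been
using, here is a minimal Python sketch. It is not the DiFX code; the function
name vdif_format_name is made up for illustration, and it assumes a standard
32-byte (non-legacy) VDIF header and that the "1312 100000" in your
vdifsummary output means 1312-byte frames at 100000 frames per second.

```python
def vdif_format_name(frame_bytes, frames_per_sec, n_chan, n_bit, header_bytes=32):
    """Build a 'VDIF_<payload bytes>-<Mbps>-<channels>-<bits>' style name.

    Hypothetical helper for illustration only; header_bytes=32 assumes a
    standard (non-legacy) VDIF header.
    """
    payload_bytes = frame_bytes - header_bytes
    # Data rate counts payload bits only, expressed in units of 10^6 bits/s.
    mbps = payload_bytes * 8 * frames_per_sec // 1_000_000
    return f"VDIF_{payload_bytes}-{mbps}-{n_chan}-{n_bit}"


# 1312-byte frames, 100000 frames/s, 8 channels, 2 bits per sample
name = vdif_format_name(1312, 100000, 8, 2)
print(name)
```

With those inputs the sketch reproduces VDIF_1280-1024-8-2, so the name is at
least self-consistent with the frame size and frame rate.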


>
> As an aside, the conversion to VDIF was wrong (in the invalid flag, the
> day(!) and the number of sidebands). These I _think_ I have fixed, but
> using tools I don't understand.
>

OK, so the header indeed thinks that there are 8 channels, so that is good.

But when using m5d with VDIF_1280-1024-8-2, it starts complaining of errors
after the second frame.  If one instead tells m5d that the format is
VDIF_1280-1024-1-2 (which means a basically identical payload, just 5120
samples from one channel in every packet rather than 640 samples from each
of 8 channels in every packet), then it works fine.  I think there might be
a bug in the mark5access validator: if I run with valgrind I get:

==18198== Invalid read of size 4
==18198==    at 0x4E887A2: mark5_format_vdif_validate (format_vdif.c:3989)
==18198==    by 0x4E39C88: mark5_stream_next_frame (mark5_stream.c:166)
==18198==    by 0x4E8664F: vdif_decode_8channel_2bit_decimation1
(format_vdif.c:1131)
==18198==    by 0x4018AD: decode_short (m5d.c:165)
==18198==    by 0x4018AD: main (m5d.c:502)
==18198==  Address 0x58399f8 is 6 bytes after a block of size 2,626 alloc'd
==18198==    at 0x4C2AB80: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18198==    by 0x4017AF: decode_short (m5d.c:142)
==18198==    by 0x4017AF: main (m5d.c:502)

No such error is seen if I say this is 1 channel data.
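
For reference, the sample counts and bandwidths implied by the two format
names can be worked out with a short Python sketch. This is only arithmetic
on the name, not anything taken from mark5access; describe_format is a
hypothetical helper, and the bandwidth figure assumes real-sampled (Nyquist)
data.

```python
def describe_format(name):
    """Derive per-frame sample counts from 'VDIF_<payload>-<Mbps>-<nchan>-<nbit>'.

    Hypothetical helper; assumes real-sampled data, so the bandwidth per
    channel is half the per-channel sample rate.
    """
    prefix, fields = name.split("_", 1)
    if prefix != "VDIF":
        raise ValueError(f"not a VDIF format name: {name}")
    payload_bytes, mbps, n_chan, n_bit = (int(x) for x in fields.split("-"))
    samples_per_frame = payload_bytes * 8 // n_bit
    return {
        "samples_per_frame": samples_per_frame,
        "samples_per_channel": samples_per_frame // n_chan,
        # Mbps / bits per sample = Msamples/s in total; split over the
        # channels and halve to get MHz of bandwidth per channel.
        "bandwidth_mhz": mbps / (n_bit * n_chan * 2),
    }


for fmt in ("VDIF_1280-1024-8-2", "VDIF_1280-1024-1-2"):
    print(fmt, describe_format(fmt))
```

Both names describe the same 5120 samples per frame; they differ only in
whether those are treated as 640 samples from each of 8 channels of 32 MHz
or as one 256 MHz channel.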

Not sure if the problem is specific to m5d or indicative of a wider bug in
format_vdif for multichannel data.  You could try running mpifxcorr under
valgrind and see if a similar error is caught.

As an aside, in all cases, m5d reports MJD/seconds of 0/0.  But the header
is clearly fine, because printVDIF shows the correct dates and times. Not
sure if this is important or not - I'm guessing probably not; it is probably
just not being set properly before being printed out.  I doubt this is the
root cause of the issue.
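
If you want an independent check of the header time without relying on m5d,
a minimal Python sketch along these lines should do it. It is not the
vdifio or mark5access code: parse_vdif_header is a hypothetical name, and
the bit layout is my reading of the standard (non-legacy) VDIF header with
little-endian 32-bit words. The example header is synthetic, built to mimic
the first frame of your file (2016 day 080 at 11:00:03, 8 channels, 2 bits,
1312-byte frames); the thread ID of 1 is an assumption.

```python
import struct
from datetime import date

MJD_ZERO_ORDINAL = date(1858, 11, 17).toordinal()  # MJD 0


def parse_vdif_header(header):
    """Decode the first four words of a VDIF frame header (sketch only)."""
    w0, w1, w2, w3 = struct.unpack("<4I", header[:16])
    seconds = w0 & 0x3FFFFFFF        # seconds since the reference epoch
    ref_epoch = (w1 >> 24) & 0x3F    # half-years counted from 2000-01-01
    epoch = date(2000 + ref_epoch // 2, 1 if ref_epoch % 2 == 0 else 7, 1)
    return {
        "invalid": bool(w0 >> 31),
        "legacy": bool((w0 >> 30) & 1),
        "mjd": epoch.toordinal() - MJD_ZERO_ORDINAL + seconds // 86400,
        "sec_of_day": seconds % 86400,
        "frame_in_second": w1 & 0xFFFFFF,
        "n_chan": 1 << ((w2 >> 24) & 0x1F),      # stored as log2(channels)
        "frame_bytes": (w2 & 0xFFFFFF) * 8,      # stored in units of 8 bytes
        "bits_per_sample": ((w3 >> 26) & 0x1F) + 1,
        "thread_id": (w3 >> 16) & 0x3FF,
    }


# Synthetic header: epoch 32 (2016-01-01), 79 days + 11:00:03 into the epoch.
example = struct.pack(
    "<4I",
    79 * 86400 + 11 * 3600 + 3,   # word 0: valid, non-legacy, seconds
    32 << 24,                     # word 1: reference epoch 32, frame 0
    (3 << 24) | (1312 // 8),      # word 2: log2(8 channels), frame length
    (1 << 26) | (1 << 16),        # word 3: 2 bits per sample, thread 1
)
info = parse_vdif_header(example)
print(info)
```

Reading the first 16 bytes of each frame from the file and passing them to
the function gives the same kind of per-frame summary, which can then be
compared against what printVDIF reports.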

I'm pretty sure similar data (8 channel, 2 bit real data) has been
successfully used in mpifxcorr before, so I'm a bit puzzled to be honest.
But I'm short on time right now to investigate further.

Cheers,
Adam


> I made spectra, (m5spec) of a few of the files and they looked OK.
>
> I will get back to this later (tonight?). Juggling events. I'll check
> bandpasses for a number of possible setups. If the bandpass looks right I
> will bet the correct filters have been used.
>
>    Thanks for the help. I am at sea at the mo'
>
>          Richard
>
> On Wed, Aug 24, 2016 at 4:20 PM, Adam Deller <deller at astron.nl> wrote:
>
>> Hi Richard,
>>
>> I have a few observations for you:
>>
>> * Nothing strange in the file at a first glance - countVDIFpackets and
>> printVDIF are happy with it.  It is 2 bit data.  Frame size is 1312 bytes,
>> and the number of frames per second indicates that this is 1 Gbps data.
>> * Using printVDIFheader tells me there are 8 channels in the single VDIF
>> thread.  Combined with the other info, that implies the bandwidth per
>> subband is 32 MHz? So then the format name (which you supply to the v2d
>> file and hence the .input file) should be VDIF_1280-1024-8-2, I think.
>>
>> However, I then get funny results when I try to unpack the data using m5d
>> and that format name.  It's happy for a while, and then starts to give
>> unpack errors (which one usually gets if one mucks up the format name).  If
>> I instead say the number of channels is 1 (so VDIF_1280-1024-1-2), which
>> would mean a single 256 MHz wide channel, then it unpacks happily.
>>
>> So what's the deal with the number of subbands?  I think something is
>> wrong somewhere, either 8 has been written into the header where 1 should
>> have been, or something else like that.
>>
>> Cheers,
>> Adam
>>
>> On Wed, Aug 24, 2016 at 4:31 AM, Richard Dodson <
>> richard.dodson at uwa.edu.au> wrote:
>>
>>> Hi Adam
>>>
>>> vdifsummary seems to be a file in ~/Util in oper as KASI. I guess it is
>>> something that Jan wrote. I will check.
>>>
>>> countVDIF is slow (took all night to finish) &  I should have looked at
>>> thread 1 not 0 (correct?). It is now running for 1. Nothing to note so far
>>> eg:
>>>
>>> For thread 1, at second 39896, read 29300000 frames, spotted 0 missing
>>> frames
>>> The start of the VDIF file (1GB) is at:
>>>  http://ict.icrar.org/store/staff/rdodson/k16mk02f_ktn_start.vdif
>>>
>>>   Thanks for your help
>>>      Richard
>>>
>>>
>>>
>>>
>>> On Mon, Aug 22, 2016 at 6:18 PM, Adam Deller <deller at astron.nl> wrote:
>>>
>>>> Hi Richard,
>>>>
>>>> Looks like there is a problem mid-file, and when it tries to re-sync
>>>> the header it finds is corrupted.  I can suggest a couple of things to try:
>>>>
>>>> you can run countVDIFpackets (a utility in vdifio) which is probably
>>>> slower than vdifsummary (what utility is this?  I'm not aware of a
>>>> "vdifsummary", there is a "vsum"...?) and is pretty basic but actually does
>>>> check for every packet, and prints a message every time a problem is seen.
>>>> That might give you some extra clues, so I'd try that first.  And if you
>>>> really want to get blasted away by lots of logging, you can use printVDIF,
>>>> which prints a little summary of each and every packet header.  You could
>>>> pipe that to grep to look for anomalies.
>>>>
>>>> Looks like the problem is very early in the file, so if you dd the
>>>> first second or so and put it on an ftp server somewhere, I could also take
>>>> a look.
>>>>
>>>> Cheers,
>>>> Adam
>>>>
>>>> On Mon, Aug 22, 2016 at 10:57 AM, Richard Dodson <
>>>> richard.dodson at uwa.edu.au> wrote:
>>>>
>>>>> Dear All
>>>>>
>>>>>  I have one of the usual nightmare twisted DiFX correlation problems.
>>>>>
>>>>>  I am trying to use DiFX on VDIF data which has been copied off the
>>>>> VERA OCTAVE systems (and similar) and converted.
>>>>>
>>>>>   The problem is almost certainly in the data copying -- but I need to
>>>>> provide some feedback on what is wrong for it to be fixed
>>>>>
>>>>>   The first problem that I found was in the VDIF file: all the invalid
>>>>> flags were set, the number of channels was wrong and the date was wrong by
>>>>> 1 day. :(
>>>>>
>>>>>   Jan has a program to fix all of these :) -- but he is not around to
>>>>> check if I have used this correctly :( :(
>>>>>
>>>>>    After these fixes the correlation runs, but the data file is empty.
>>>>> What messages should I be checking to work out what is happening? I append
>>>>> some messages which look suspicious but don't convey any information to me
>>>>> ...
>>>>>
>>>>>         All the best
>>>>>             Richard
>>>>>
>>>>> Comments:
>>>>>   vdifsummary reports seem OK
>>>>>
>>>>> # vdifsummary /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
>>>>> [1:1] check k16mk02f_kava_miz.vdif -> Good! it is a VDIF data scan ->
>>>>> add to 1
>>>>> k16mk02f_kava_miz.vdif   4,108,790,400,000   31317 sec( 8:41:57)
>>>>> 57467 Mar 20 2016y080d 11:00:03 - 19:41:59  1312 100000
>>>>> 3,827 GB(=  3.7 TB)(= 4,108,790,400,000 B)
>>>>>
>>>>> Log messages which might be relevant:
>>>>>
>>>>> 2016-08-22 16:30:32,548 DiFXAlert INFO    MPI[ 1] compute-0-28.local
>>>>> k16mk02f_1   Datastream 1 has opened file index 0, which was
>>>>> /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
>>>>>
>>>>> 2016-08-22 16:30:32,548 DiFXAlert VERBOSE MPI[ 2] compute-0-28.local
>>>>> k16mk02f_1   input.bad() is 0, input.fail() is 0
>>>>>
>>>>> 2016-08-22 16:30:32,700 DiFXAlert ERROR   MPI[ 1] compute-0-28.local
>>>>> k16mk02f_1   Lost Sync on segment 1! Will attempt to resync. Deltatime was
>>>>> -1.13239e+09
>>>>>
>>>>> 2016-08-22 16:30:32,701 DiFXAlert INFO    MPI[ 1] compute-0-28.local
>>>>> k16mk02f_1   Config has changed!
>>>>>
>>>>> 2016-08-22 16:30:32,702 DiFXAlert INFO    MPI[ 1] compute-0-28.local
>>>>> k16mk02f_1   After regaining sync, the frame start day is 70573, the frame
>>>>> start seconds is 70631, the frame start ns is -2147483648, readscan is 2,
>>>>> readseconds is 1132388471, readnanoseconds is -2147483648
>>>>>         note the 2^31 values !!!!
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> !=============================================================!
>>>> Dr. Adam Deller
>>>> Ph  +31 521595785 / Fax +31 521595101
>>>> Staff Astronomer, Astronomy Group
>>>> ASTRON, Oude Hoogeveensedijk 4
>>>> 7991 PD Dwingeloo, The Netherlands
>>>> !=============================================================!
>>>>
>>>
>>>
>>>
>>> --
>>> -------------------------
>>> Dr Richard Dodson,
>>> International Centre for Radio Astronomy Research
>>> University of Western Australia
>>> P: +8 6488 7842 E: richard.dodson at icrar.org
>>>
>>
>>
>>
>>
>
>
>
>



-- 
!=============================================================!
Dr. Adam Deller
Ph  +31 521595785 / Fax +31 521595101
Staff Astronomer, Astronomy Group
ASTRON, Oude Hoogeveensedijk 4
7991 PD Dwingeloo, The Netherlands
!=============================================================!