[Difx-users] Debugging a Nightmare problem

Walter Brisken wbrisken at nrao.edu
Wed Aug 24 07:30:25 EDT 2016


It sounds like the data file has a problem.  You might want to use 
printVDIFheader to diagnose.  Could you send the output of that for the 
first few frames (and if you see a jump in threadId somewhere in the file 
you might capture that as well)?  It might also be good to send around the 
.vex, .v2d and .input files so we can see if everything hangs 
together.
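
If printVDIFheader is not to hand, a rough stand-in for the threadId check
could look like the Python below.  It is only a sketch: it assumes the
standard VDIF header bit layout, fixed-size frames (1312 bytes, the size
reported further down this thread) and a file that starts on a frame
boundary.  The function names are made up for illustration and are not
part of vdifio.

```python
import struct


def parse_vdif_header(words):
    """Decode the first four 32-bit words of a (non-legacy) VDIF header."""
    w0, w1, w2, w3 = words
    return {
        "invalid": (w0 >> 31) & 0x1,             # invalid-data flag
        "legacy": (w0 >> 30) & 0x1,              # legacy (16-byte) header flag
        "seconds": w0 & 0x3FFFFFFF,              # seconds from reference epoch
        "ref_epoch": (w1 >> 24) & 0x3F,          # half-years since 2000
        "frame_num": w1 & 0xFFFFFF,              # frame number within the second
        "nchan": 1 << ((w2 >> 24) & 0x1F),       # stored as log2(nchan)
        "frame_bytes": (w2 & 0xFFFFFF) * 8,      # stored in units of 8 bytes
        "bits": ((w3 >> 26) & 0x1F) + 1,         # stored as bits/sample - 1
        "thread_id": (w3 >> 16) & 0x3FF,
    }


def find_thread_jumps(data, frame_bytes=1312):
    """Return (frame_index, previous_thread, new_thread) for each threadId change.

    Assumes every frame is frame_bytes long and data starts on a frame boundary.
    """
    jumps = []
    prev = None
    for index, offset in enumerate(range(0, len(data) - 15, frame_bytes)):
        hdr = parse_vdif_header(struct.unpack_from("<4I", data, offset))
        if prev is not None and hdr["thread_id"] != prev:
            jumps.append((index, prev, hdr["thread_id"]))
        prev = hdr["thread_id"]
    return jumps
```

Feeding it the start of the file, e.g.
find_thread_jumps(open(path, "rb").read(1312 * 10000)), should give an
empty list if the thread ID never changes.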

-W

On Wed, 24 Aug 2016, Adam Deller wrote:

> Hi Richard,
>
> Walter and/or Chris may be interested in the diagnosis a little further
> down.
>
> On Wed, Aug 24, 2016 at 11:01 AM, Richard Dodson <richard.dodson at uwa.edu.au>
> wrote:
>
>> Hi Adam
>>
>>  As I say this is a mess. The first TianMa, ATCA & KaVA observations. The
>> Australians, the Chinese and the Japanese all have their own `unique'
>> systems. Then these (except ATCA) have been extracted for the KJJCC and
>> hardware correlated. This is the data that was exported for use with DiFX.
>>
>> At least 3 conversions between this file and the sky, all of which could
>> be wrong.
>>
>> The BW should be 32MHz. 8IFs of L pol. T6 and KaVA with different
>> sidebands. ATCA with 64MHz and dual pol (so only 50% coverage).
>>
>> So VDIF_1280-1024-8-2  is what I have been using. You say "which you
>> supply to the v2d file". In which place? As the FORMAT field? I have used
>> VDIF -- is this wrong?
>>
>
> No, you're right.  I misremembered where the format string for the unpacker
> gets generated (it is actually generated internally to DiFX, based on the
> format [VDIF] and the other information like number of bits, frame size,
> and number of subbands that are supplied elsewhere in the vex file and
> placed in the input file by vex2difx.)
>
>
>>
>> As an aside the conversion to VDIF was wrong (in invalid flag, day(!) and
>> no of sidebands). These I _think_ I have fixed, but using tools I don't
>> understand.
>>
>
> OK, so the header indeed thinks that there are 8 channels, so that is good.
>
> But when using m5d with VDIF_1280-1024-8-2, after the second frame it
> starts complaining of errors.  But if one tells m5d that the format is
> VDIF_1280-1024-1-2
> (which means basically identical payload, but it is just 5120 samples from
> one channel in every packet, rather than 640 samples from each of 8
> channels in every packet) then it works fine.  I think there might be a bug
> in the mark5access validator: if I run with valgrind I get:
>
> ==18198== Invalid read of size 4
> ==18198==    at 0x4E887A2: mark5_format_vdif_validate (format_vdif.c:3989)
> ==18198==    by 0x4E39C88: mark5_stream_next_frame (mark5_stream.c:166)
> ==18198==    by 0x4E8664F: vdif_decode_8channel_2bit_decimation1
> (format_vdif.c:1131)
> ==18198==    by 0x4018AD: decode_short (m5d.c:165)
> ==18198==    by 0x4018AD: main (m5d.c:502)
> ==18198==  Address 0x58399f8 is 6 bytes after a block of size 2,626 alloc'd
> ==18198==    at 0x4C2AB80: malloc (in
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==18198==    by 0x4017AF: decode_short (m5d.c:142)
> ==18198==    by 0x4017AF: main (m5d.c:502)
>
> No such error is seen if I say this is 1 channel data.
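
The arithmetic behind the two format names can be checked with a few lines
of Python.  This is a sketch under stated assumptions, not mark5access
code: it reads the name as VDIF_<payload bytes>-<Mbps>-<channels>-<bits>,
takes Mbps as 10^6 bits per second, and assumes real (Nyquist) sampled
data.  The function name is hypothetical.

```python
def describe_vdif_format(name, real_sampled=True):
    """Derive per-frame and per-channel numbers from a name like 'VDIF_1280-1024-8-2'."""
    prefix, params = name.split("_", 1)
    if prefix != "VDIF":
        raise ValueError("expected a name starting with 'VDIF_': %r" % name)
    payload_bytes, mbps, nchan, nbit = (int(x) for x in params.split("-"))

    samples_per_frame = payload_bytes * 8 // nbit          # all channels together
    frames_per_sec = mbps * 1_000_000 // (payload_bytes * 8)
    sample_rate = mbps * 1_000_000 / (nchan * nbit)        # samples/s per channel
    # Real-sampled data needs two samples per Hz of bandwidth (assumption).
    bandwidth_mhz = sample_rate / (2 if real_sampled else 1) / 1e6

    return {
        "samples_per_frame": samples_per_frame,
        "samples_per_channel": samples_per_frame // nchan,
        "frames_per_sec": frames_per_sec,
        "bandwidth_mhz": bandwidth_mhz,
    }
```

With these assumptions VDIF_1280-1024-8-2 gives 640 samples per channel
per frame and 32 MHz per channel, while VDIF_1280-1024-1-2 gives 5120
samples and 256 MHz; both give 100000 frames per second.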
>
> Not sure if the problem is specific to m5d or indicative of a wider bug in
> format_vdif for multichannel data.  You could try running mpifxcorr under
> valgrind and see if a similar error is caught.
>
> As an aside, in all cases, m5d reports MJD/seconds of 0/0.  But the header
> is clearly fine, because printVDIF shows the correct dates and times. Not
> sure if this is important or not - I'm guessing probably not, it is probably
> just not being set properly before being printed out.  I doubt this is the
> root cause of the issue.
>
> I'm pretty sure similar data (8 channel, 2 bit real data) has been
> successfully used in mpifxcorr before, so I'm a bit puzzled to be honest.
> But I'm short on time right now to investigate further.
>
> Cheers,
> Adam
>
>
>> I made spectra, (m5spec) of a few of the files and they looked OK.
>>
>> I will get back to this later (tonight?). Juggling events. I'll check
>> bandpasses for a number of possible setups. If the Bpass looks right I will
>> assume the correct filters have been used.
>>
>>    Thanks for the help. I am at sea at the mo'
>>
>>          Richard
>>
>> On Wed, Aug 24, 2016 at 4:20 PM, Adam Deller <deller at astron.nl> wrote:
>>
>>> Hi Richard,
>>>
>>> I have a few observations for you:
>>>
>>> * Nothing strange in the file at a first glance - countVDIFPackets and
>>> printVDIF are happy with it.  It is 2 bit data.  Frame size is 1312 bytes,
>>> and the number of frames per second indicates that this is 1 Gbps data.
>>> * Using printVDIFheader tells me there are 8 channels in the single VDIF
>>> thread.  Combined with the other info, that implies the bandwidth per
>>> subband is 32 MHz? So then the format name (which you supply to the v2d
>>> file and hence the .input file) should be VDIF_1280-1024-8-2, I think.
>>>
>>> However, I then get funny results when I try to unpack the data using m5d
>>> and that format name.  It's happy for a while, and then starts to give
>>> unpack errors (which one usually gets if one mucks up the format name).  If
>>> I instead say the number of channels is 1 (so VDIF_1280-1024-1-2), which
>>> would mean a single 256 MHz wide channel, then it unpacks happily.
>>>
>>> So what's the deal with the number of subbands?  I think something is
>>> wrong somewhere, either 8 has been written into the header where 1 should
>>> have been, or something else like that.
>>>
>>> Cheers,
>>> Adam
>>>
>>> On Wed, Aug 24, 2016 at 4:31 AM, Richard Dodson <
>>> richard.dodson at uwa.edu.au> wrote:
>>>
>>>> Hi Adam
>>>>
>>>> vdifsummary seems to be a file in ~/Util in oper as KASI. I guess it is
>>>> something that Jan wrote. I will check.
>>>>
>>>> countVDIF is slow (took all night to finish) &  I should have looked at
>>>> thread 1 not 0 (correct?). It is now running for 1. Nothing to note so far
>>>> eg:
>>>>
>>>> For thread 1, at second 39896, read 29300000 frames, spotted 0 missing
>>>> frames
>>>> The start of the VDIF file (1GB) is at:
>>>>  http://ict.icrar.org/store/staff/rdodson/k16mk02f_ktn_start.vdif
>>>>
>>>>   Thanks for your help
>>>>      Richard
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Aug 22, 2016 at 6:18 PM, Adam Deller <deller at astron.nl> wrote:
>>>>
>>>>> Hi Richard,
>>>>>
>>>>> Looks like there is a problem mid-file, and when it tries to re-sync
>>>>> the header it finds is corrupted.  I can suggest a couple of things to try:
>>>>>
>>>>> you can run countVDIFPackets (a utility in vdifio) which is probably
>>>>> slower than vdifsummary (what utility is this?  I'm not aware of a
>>>>> "vdifsummary", there is a "vsum"...?) and is pretty basic but actually does
>>>>> check for every packet, and prints a message every time a problem is seen.
>>>>> That might give you some extra clues, so I'd try that first.  And if you
>>>>> really want to get blasted away by lots of logging, you can use printVDIF,
>>>>> which prints a little summary of each and every packet header.  You could
>>>>> pipe that to grep to look for anomalies.
>>>>>
>>>>> Looks like the problem is very early in the file, so if you dd the
>>>>> first second or so and put it on an ftp server somewhere, I could also take
>>>>> a look.
>>>>>
>>>>> Cheers,
>>>>> Adam
>>>>>
>>>>> On Mon, Aug 22, 2016 at 10:57 AM, Richard Dodson <
>>>>> richard.dodson at uwa.edu.au> wrote:
>>>>>
>>>>>> Dear All
>>>>>>
>>>>>>  I have one of the usual nightmare twisted DiFX correlation problems.
>>>>>>
>>>>>>  I am trying to use DiFX on VDIF data which has been copied off the
>>>>>> VERA OCTAVE systems (and similar) and converted.
>>>>>>
>>>>>>   The problem is almost certainly in the data copying -- but I need to
>>>>>> provide some feedback on what is wrong for it to be fixed
>>>>>>
>>>>>>   The first problem that I found was in the VDIF file: all the invalid
>>>>>> flags were set, the number of channels was wrong and the date was wrong by
>>>>>> 1 day. :(
>>>>>>
>>>>>>   Jan has a program to fix all of these :) -- but he is not around to
>>>>>> check if I have used this correctly :( :(
>>>>>>
>>>>>>    After these fixes the correlation runs, but the data file is empty.
>>>>>> What messages should I be checking to work out what is happening? I append
>>>>>> some messages which look suspicious but don't convey any information to me
>>>>>> ...
>>>>>>
>>>>>>         All the best
>>>>>>             Richard
>>>>>>
>>>>>> Comments:
>>>>>>   vdifsummary reports seem OK
>>>>>>
>>>>>> # vdifsummary /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
>>>>>> [1:1] check k16mk02f_kava_miz.vdif -> Good! it is a VDIF data scan ->
>>>>>> add to 1
>>>>>> k16mk02f_kava_miz.vdif   4,108,790,400,000   31317 sec( 8:41:57)
>>>>>> 57467 Mar 20 2016y080d 11:00:03 - 19:41:59  1312 100000
>>>>>> 3,827 GB(=  3.7 TB)(= 4,108,790,400,000 B)
>>>>>>
>>>>>> Log messages which might be relevant:
>>>>>>
>>>>>> 2016-08-22 16:30:32,548 DiFXAlert INFO    MPI[ 1] compute-0-28.local
>>>>>> k16mk02f_1   Datastream 1 has opened file index 0, which was
>>>>>> /lustre/kjcc/k16mk02f/MIZ/k16mk02f_kava_miz.vdif
>>>>>>
>>>>>> 2016-08-22 16:30:32,548 DiFXAlert VERBOSE MPI[ 2] compute-0-28.local
>>>>>> k16mk02f_1   input.bad() is 0, input.fail() is 0
>>>>>>
>>>>>> 2016-08-22 16:30:32,700 DiFXAlert ERROR   MPI[ 1] compute-0-28.local
>>>>>> k16mk02f_1   Lost Sync on segment 1! Will attempt to resync. Deltatime was
>>>>>> -1.13239e+09
>>>>>>
>>>>>> 2016-08-22 16:30:32,701 DiFXAlert INFO    MPI[ 1] compute-0-28.local
>>>>>> k16mk02f_1   Config has changed!
>>>>>>
>>>>>> 2016-08-22 16:30:32,702 DiFXAlert INFO    MPI[ 1] compute-0-28.local
>>>>>> k16mk02f_1   After regaining sync, the frame start day is 70573, the frame
>>>>>> start seconds is 70631, the frame start ns is -2147483648, readscan is 2,
>>>>>> readseconds is 1132388471, readnanoseconds is -2147483648
>>>>>>         note the 2^31 values !!!!
>>>>>>
>>>>>> _______________________________________________
>>>>>> Difx-users mailing list
>>>>>> Difx-users at listmgr.nrao.edu
>>>>>> https://listmgr.nrao.edu/mailman/listinfo/difx-users
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> !=============================================================!
>>>>> Dr. Adam Deller
>>>>> Ph  +31 521595785 / Fax +31 521595101
>>>>> Staff Astronomer, Astronomy Group
>>>>> ASTRON, Oude Hoogeveensedijk 4
>>>>> 7991 PD Dwingeloo, The Netherlands
>>>>> !=============================================================!
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -------------------------
>>>> Dr Richard Dodson,
>>>> International Centre for Radio Astronomy Research
>>>> University of Western Australia
>>>> P: +8 6488 7842 E: richard.dodson at icrar.org
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


