[evla-sw-discuss] Re: EVLA communications

Thu Mar 3 13:39:06 EST 2005

Doug,

Thank you for the update and the included comments.  In general I agree
with your comments.  I've embedded a few responses below, but I do not
attempt to address all of your points at this time.

Bill

Doug Tody wrote:
> Bill -
> 
> Fred needs the NRAO VO plan ASAP (early next week at the latest) to
> prepare for the upcoming Visiting Committee meeting.  I need to get back
> to this tomorrow and deal with it before I can spend any more time on
> EVLA communications.
> 
> Rather than delay your report any further I suggest you make any changes
> you deem desirable based on our discussions, and send your report out
> for further review and comment.  I will try to get more formal written
> comments back to you within several days.
> 
> My comments on EVLA communications infrastructure are generally favorable
> although I still have some concerns.  Brief comments follow.  In general
> I am pleased to see the level of detail to which the system design has
> been carried, and the emphasis on asynchronous messaging to maintain
> system state.
> 
>     o    I think EVLA communications can be broken down into as many as
>     three areas:
> 
>     1) MIB and antenna level.  This may require lower level, real
>     time communications.  The platform (MIB) requirements are severe
>     and may (probably do) require custom software.  The problem
>     is sufficiently constrained that a special purpose solution is
>     possible, achievable, and may be simplest at this level.
> 
>     2) Telescope level (executor, telemetry, etc).  It might be
>     nice if the artificial constraints of the MIB could be eliminated
>     here, presenting a higher level interface to M&C.  A higher level
>     interface providing reliable communications is possible and could
>     simplify the system.

One of the reasons I favored a model for the Observation Executor that
separates it into two layers, the lower layer being responsible for
direct communication with the hardware, was to keep the MIB interface
and similar interfaces from rising too high in the system.  We have now
made the decision to explore a design that places an antenna server
layer between the Executor and the actual devices.  This decision helps
to confine the low-level MIB interface to the lower levels of the system.
It also makes possible the use of a more reliable, higher-level approach
to communications between the antenna server layer and the next layer up
- the Executor.
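
As a rough illustration only, the layering described above might be
sketched in Python as follows.  Every name here (FakeMib, AntennaServer,
point_to, the "acu" device key) is hypothetical and chosen for the
example; none of it is taken from the actual EVLA design.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol


class Device(Protocol):
    """Hypothetical low-level device interface (stands in for a MIB)."""

    def set_point(self, name: str, value: float) -> None: ...


@dataclass
class FakeMib:
    """In-memory stand-in for one MIB-controlled device."""

    points: Dict[str, float] = field(default_factory=dict)

    def set_point(self, name: str, value: float) -> None:
        self.points[name] = value


@dataclass
class AntennaServer:
    """Hides per-device details behind antenna-level operations."""

    antenna_id: str
    devices: Dict[str, Device]

    def point_to(self, az_deg: float, el_deg: float) -> None:
        # One antenna-level request fans out to device-level set-points.
        acu = self.devices["acu"]  # "acu" is an assumed device key
        acu.set_point("az", az_deg)
        acu.set_point("el", el_deg)


class Executor:
    """Talks only to antenna servers, never to devices directly."""

    def __init__(self, servers: Dict[str, AntennaServer]) -> None:
        self.servers = servers

    def slew_all(self, az_deg: float, el_deg: float) -> None:
        for server in self.servers.values():
            server.point_to(az_deg, el_deg)
```

The point of the sketch is only that the Executor never sees set_point;
the device-level interface stops at the antenna server.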

There will be a ripple effect from the decision to interpose an antenna
server layer between the Executor and actual devices.  For example, it
may have an effect on how some screens access some data, especially
screens interested in an overview that includes information from more
than one device.  (It should be mentioned that this decision also, in
some ways, decreases the overall efficiency of the system.)  Not all of
the implications of this decision are clear yet.  Some will be positive
and some will be negative.  I am hoping that the net effect will be on
the positive side.

>     3) E2E/dataflow level.  This includes dynamic scheduling, data
>     capture, and so forth.  The project model, scheduling block,
>     project metadata, archive, etc. are important at this level.
>     This is where we can most benefit from commonality with ALMA.

Agreed.
> 
>     Possibly 1 and 2 can be combined but this may be too much ground
>     to cover with one approach, given the limitations of the MIB.
> 
>     o    ACS compliance is most important at level 3.  I would say that
>     ACS compliance is not an important issue for level 1-2.  We would
>     like EVLA to be able to function in a basic mode independently of
>     ALMA (which is complex and difficult to control) at level 1-2.
>     Basic telescope operations, similar to the current VLA, require
>     only level 1-2.  ACS compliance is desirable for level 3 but is
>     not necessarily required.  An alternative would be to provide a
>     separate communications infrastructure for communicating with the
>     archive and with DC.  This is TBD.  We don't have to decide this
>     for now; it can be deferred to the next phase of system design
>     where we consider E2E and the overlap with ALMA.

The basic thrust of the report (EVLA M & C Communications Infrastructure)
was to address whether to use asynchronous messaging or an RPC-type
approach for the distribution of monitor and control information (not
visibility data) within the EVLA Monitor and Control system - what you
are calling levels 1 & 2 of the overall EVLA software system.  That is
really the only issue the report seeks to address.  We take ACS to be,
fundamentally, an RPC-type approach.  The report recommends that
asynchronous messaging be the foundation of our approach within the
M & C system, i.e. levels 1 & 2.
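
To make the distinction concrete, here is a toy, in-process sketch of
asynchronous messaging in Python.  It is not the report's design; the
MessageBus class, the topic string, and the message fields are all
assumptions made for the example.  The essential property is that
publish() returns without waiting for any subscriber to act, whereas an
RPC-type call would block until the callee replied.

```python
import queue
from collections import defaultdict
from typing import DefaultDict, List


class MessageBus:
    """Toy in-process bus: publishers enqueue and return immediately."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[queue.Queue]] = defaultdict(list)

    def subscribe(self, topic: str) -> queue.Queue:
        # Each subscriber gets its own inbox and drains it at its own pace.
        inbox: queue.Queue = queue.Queue()
        self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic: str, message: dict) -> int:
        # No reply is awaited; returns the number of inboxes written to.
        inboxes = self._subscribers[topic]
        for inbox in inboxes:
            inbox.put(message)
        return len(inboxes)
```

A monitor screen and an archiver could each call subscribe() on the same
topic and receive every message independently of one another.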
> 
>     o    I am still concerned about using an unreliable communications
>     protocol (IP, UDP, multicast) generally within the system.  It is
>     possible, given the highly constrained execution environment and
>     hardware configuration, that IP/UDP/multicast will be sufficiently
>     reliable at some level within EVLA.  However this is hard to
>     guarantee given that these are fundamentally unreliable protocols.

The report is not really meant to address this issue.  I added a brief
section on these topics only because they seem to be of concern.
They will be addressed in much greater detail in an M&C design
document that exists now only as an outline.
> 
>     Lab tests may well indicate no problems but I would not rely
>     upon these in designing a complex system to be used for 20 years
>     with a wide range of loading conditions.  Network congestion
>     (resulting in switch overflow and lost datagrams) or CPU loading
>     (resulting in transmit/receive buffer overflow and datagram discard)
>     may result in lost packets.  Multiple copies of datagrams or
>     datagrams delivered out of order are also possible.  Fragmentation
>     concerns will limit the size of datagrams and require more complex
>     protocols to avoid fragmentation.  Streaming large dataflows may
>     require special care - but this is what TCP was designed for.
>     Depending upon the protocol these cases may be recoverable and
>     not a serious problem, however my impression with EVLA is that in
>     most or all cases a reliable protocol is desirable and preferable.

Concerning fragmentation: even for UDP datagrams, _applications_ do not
bear the burden of reassembling fragmented datagrams.  Reassembly is
handled at the IP level, but only at the destination, so there is a
danger that an entire datagram will be discarded because one of its
fragments was lost or did not arrive within the allowed reassembly time.
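
The arithmetic behind that danger can be sketched as follows.  This
assumes IPv4 with no header options, a 1500-byte Ethernet MTU, and
independent loss of each fragment; those are simplifying assumptions for
the example, not measurements of the EVLA network, and the function
names are made up.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def max_unfragmented_payload(mtu: int = 1500) -> int:
    """Largest UDP payload that fits in a single IP packet."""
    return mtu - IPV4_HEADER - UDP_HEADER


def fragment_count(udp_payload: int, mtu: int = 1500) -> int:
    """Number of IP packets needed to carry one UDP datagram."""
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil((udp_payload + UDP_HEADER) / per_fragment)


def datagram_loss(udp_payload: int, fragment_loss: float, mtu: int = 1500) -> float:
    """Chance the whole datagram is lost if each fragment is lost
    independently with probability fragment_loss."""
    n = fragment_count(udp_payload, mtu)
    return 1.0 - (1.0 - fragment_loss) ** n
```

Under these assumptions an 8192-byte payload needs six fragments, so a
1% per-packet loss rate becomes roughly a 5.9% datagram loss rate, while
a payload of 1472 bytes or less is not fragmented at all.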

Currently we limit packets containing monitor and alert data to a size
that is somewhat under the MTU of our network, which means they do not
experience fragmentation when distributed within the AOC or VLA.
However, I know this fact does not address your concerns.  I will not
comment further here, except to say that we are designing the system to
be tolerant of dropped packets and are trying to limit the context in
which unreliable protocols are used.  This issue will be addressed in a
much more extensive manner as the design and documentation of the design
proceed.
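
One common way to tolerate dropped, duplicated, or reordered datagrams
for periodically refreshed monitor data is to tag each datagram with a
per-source sequence number and keep only the newest value.  The sketch
below shows that general technique; it assumes such a sequence number
exists and is not a description of the actual EVLA packet format.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class MonitorReceiver:
    """Keeps the newest value per source; counts gaps instead of failing."""

    latest: Dict[str, Tuple[int, float]] = field(default_factory=dict)
    dropped: int = 0

    def on_datagram(self, source: str, seq: int, value: float) -> bool:
        """Return True if the datagram was accepted as the newest sample."""
        last = self.latest.get(source)
        if last is not None:
            if seq <= last[0]:
                # Duplicate or late arrival: a newer sample is already held.
                return False
            # Count the skipped sequence numbers, then carry on.
            self.dropped += seq - last[0] - 1
        self.latest[source] = (seq, value)
        return True
```

A lost sample costs nothing beyond staleness, because the next datagram
from the same source supersedes it.  Note that a late arrival is still
counted in the gap that preceded it.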
> 
>     o    I agree that ACS is too complex for use within EVLA M&C (although
>     we may want to use it above the level of the Executor).  There is
>     no obvious solution to use for level 2 communications, but at least
>     we have reduced this to a purely engineering decision.  CORBA is the
>     "obvious" solution (and has been used successfully for several large
>     telescopes) but is complex and difficult to control, and probably
>     overkill for this application.  ICE is an interesting alternative
>     to CORBA and may be worth further investigation.  D-BUS, PVM, etc.,
>     are problematic for this application.  IP/UDP/multicast would be
>     fine as low level protocols if they could be encapsulated in an
>     interface which provided reliable communications and flow control.
>     There is software (e.g., TIPC) which appears to provide this,
>     however there is nothing sufficiently widely used to be worth
>     a clear recommendation.  There are commercial products which
>     provide secure multicast but for a system of this type I would
>     not recommend using anything other than open source software for
>     which EVLA can control the source and system integration.

I fully agree with your recommendation that we confine ourselves to open
source software.  I also want to avoid writing elaborate in-house
packages to address infrastructure issues.
> 
>     o    XML/RPC is fine for simple RPC but that is really all it is
>     good for.  In general for a distributed system something more
>     complex is required which supports asynchronous messaging.
>     Messaging is at least as important as RPC, and in fact it is
>     more fundamental as RPC can be implemented on top of asynchronous
>     messaging.  Control and maintaining state in a distributed system
>     is in general best done by a combination of requests (RPC in some
>     form) and broadcast of asynchronous state-change events to allow
>     multiple subscribers to track and respond to the change of state
>     of a subsystem.  Reliable communication is critical or you have
>     to work hard at the protocol level to make up for the lack of it.
> 
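
To illustrate the point that RPC can be implemented on top of
asynchronous messaging, here is a minimal single-process sketch in
Python.  Requests and replies travel as ordinary messages, matched by a
correlation id.  The class, method, and field names are hypothetical,
and a real system would run serve_one() in a separate process or thread.

```python
import itertools
import queue
from typing import Any, Callable, Dict


class RpcOverMessages:
    """Request/reply built from two one-way message queues."""

    def __init__(self, handlers: Dict[str, Callable[..., Any]]) -> None:
        self.requests: queue.Queue = queue.Queue()
        self.replies: queue.Queue = queue.Queue()
        self._handlers = handlers
        self._ids = itertools.count(1)
        self._pending: Dict[int, Any] = {}

    def send_request(self, method: str, *args: Any) -> int:
        # Asynchronous half: enqueue the request and return at once.
        corr_id = next(self._ids)
        self.requests.put({"id": corr_id, "method": method, "args": args})
        return corr_id

    def serve_one(self) -> None:
        # Server side: handle one request and post the reply as a message.
        msg = self.requests.get_nowait()
        result = self._handlers[msg["method"]](*msg["args"])
        self.replies.put({"id": msg["id"], "result": result})

    def wait_reply(self, corr_id: int, timeout: float = 1.0) -> Any:
        # Blocking here is what turns the message pair into an "RPC".
        # Raises queue.Empty if no reply arrives within the timeout.
        while corr_id not in self._pending:
            reply = self.replies.get(timeout=timeout)
            self._pending[reply["id"]] = reply["result"]
        return self._pending.pop(corr_id)
```

Because replies are matched by id, a caller may have several requests
outstanding and collect the answers in any order.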
> The best solution may be to use inherently unreliable protocols such as
> IP/UDP/multicast only 1) at a low level in the system where everything
> can be fully controlled, and hence made reliable via elimination, or 2)
> where they can be encapsulated within a reliable protocol.  A reliable
> TCP-based protocol is desirable whenever a large amount of data needs
> to be moved and buffering and flow control are desirable.  If the data
> rate is low and MIB address space limitations are not an issue it is not
> clear why you should bother with anything other than a reliable protocol.
> Commonality with ALMA is important only above the level of the Executor.
> 
>     - Doug

While I have much more to say on these matters, this posting is not the
correct venue.  I will only point out that we are basically in agreement.

Bill