[evla-sw-discuss] Re: EVLA communications
Doug Tody
dtody at nrao.edu
Thu Mar 3 00:20:27 EST 2005
Bill -
Fred needs the NRAO VO plan ASAP (early next week at the latest) to
prepare for the upcoming Visiting Committee meeting. I need to get back
to this tomorrow and deal with it before I can spend any more time on
EVLA communications.
Rather than delay your report any further I suggest you make any changes
you deem desirable based on our discussions, and send your report out
for further review and comment. I will try to get more formal written
comments back to you within several days.
My comments on EVLA communications infrastructure are generally favorable
although I still have some concerns. Brief comments follow. In general
I am pleased to see the level of detail to which the system design has
been carried, and the emphasis on asynchronous messaging to maintain
system state.
o I think EVLA communications can be broken down into as many as
three areas:
1) MIB and antenna level. This may require lower level, real
time communications. The platform (MIB) requirements are severe
and may (probably do) require custom software. The problem
is sufficiently constrained that a special purpose solution is
possible, achievable, and may be simplest at this level.
2) Telescope level (executor, telemetry, etc). It might be
nice if the artificial constraints of the MIB could be eliminated
here, presenting a higher level interface to M&C. A higher level
interface providing reliable communications is possible and could
simplify the system.
3) E2E/dataflow level. This includes dynamic scheduling, data
capture, and so forth. The project model, scheduling block,
project metadata, archive, etc. are important at this level.
This is where we can most benefit from commonality with ALMA.
Possibly 1 and 2 can be combined but this may be too much ground
to cover with one approach, given the limitations of the MIB.
o ACS compliance is most important at level 3. I would say that
ACS compliance is not an important issue for level 1-2. We would
like EVLA to be able to function in a basic mode independently of
ALMA (which is complex and difficult to control) at level 1-2.
Basic telescope operations, similar to the current VLA, require
only level 1-2. ACS compliance is desirable for level 3 but is
not necessarily required. An alternative would be to provide a
separate communications infrastructure for communicating with the
archive and with DC. This is TBD. We don't have to decide this
for now; it can be deferred to the next phase of system design
where we consider E2E and the overlap with ALMA.
o I am still concerned about using an unreliable communications
protocol (IP, UDP, multicast) generally within the system. It is
possible, given the highly constrained execution environment and
hardware configuration, that IP/UDP/multicast will be sufficiently
reliable at some level within EVLA. However this is hard to
guarantee given that these are fundamentally unreliable protocols.
Lab tests may well indicate no problems but I would not rely
upon these in designing a complex system to be used for 20 years
with a wide range of loading conditions. Network congestion
(resulting in switch overflow and lost datagrams) or CPU loading
(resulting in transmit/receive buffer overflow and datagram discard)
may result in lost packets. Multiple copies of datagrams or
datagrams delivered out of order are also possible. Fragmentation
concerns will limit the size of datagrams and require more complex
protocols to avoid fragmentation. Streaming large dataflows may
require special care - but this is what TCP was designed for.
Depending upon the protocol these cases may be recoverable and
not a serious problem, however my impression with EVLA is that in
most or all cases a reliable protocol is desirable and preferable.
o I agree that ACS is too complex for use within EVLA M&C (although
we may want to use it above the level of the Executor). There is
no obvious solution to use for level 2 communications, but at least
we have reduced this to a purely engineering decision. CORBA is the
"obvious" solution (and has been used successfully for several large
telescopes) but is complex and difficult to control, and probably
overkill for this application. ICE is an interesting alternative
to CORBA and may be worth further investigation. D-BUS, PVM, etc.,
are problematic for this application. IP/UDP/multicast would be
fine as low level protocols if they could be encapsulated in an
interface which provided reliable communications and flow control.
There is software (e.g., TIPC) which appears to provide this,
however there is nothing sufficiently widely used to be worth
a clear recommendation. There are commercial products which
provide secure multicast but for a system of this type I would
not recommend using anything other than open source software for
which EVLA can control the source and system integration.
o XML/RPC is fine for simple RPC but that is really all it is
good for. In general for a distributed system something more
complex is required which supports asynchronous messaging.
Messaging is at least as important as RPC, and in fact it is
more fundamental as RPC can be implemented on top of asynchronous
messaging. Control and maintaining state in a distributed system
is in general best done by a combination of requests (RPC in some
form) and broadcast of asynchronous state-change events to allow
multiple subscribers to track and respond to the change of state
of a subsystem. Reliable communication is critical or you have
to work hard at the protocol level to make up for the lack of it.
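To illustrate the point that RPC can be implemented on top of
asynchronous messaging, here is a minimal sketch in Python. Everything
in it is an assumption made for illustration: the in-process MessageBus,
the topic names, the "antenna1" subsystem and the message fields are
hypothetical and are not part of any EVLA design. A request is just a
message carrying an id and a reply topic; the reply is a second message
correlated by that id, and state changes are broadcast to any number of
subscribers.

```python
import itertools
from collections import defaultdict, deque


class MessageBus:
    """Hypothetical in-process bus: publish() queues, pump() delivers."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._queue = deque()

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Asynchronous: the sender never waits for delivery.
        self._queue.append((topic, message))

    def pump(self):
        # Deliver queued messages (and any they trigger) until none remain.
        while self._queue:
            topic, message = self._queue.popleft()
            for handler in list(self._subscribers[topic]):
                handler(message)


class RpcClient:
    """RPC built from two asynchronous messages: request and correlated reply."""

    _ids = itertools.count(1)

    def __init__(self, bus, name):
        self._bus = bus
        self._reply_topic = f"reply.{name}"
        self._pending = {}
        bus.subscribe(self._reply_topic, self._on_reply)

    def call(self, topic, payload, on_result):
        request_id = next(self._ids)
        self._pending[request_id] = on_result
        self._bus.publish(
            topic,
            {"id": request_id, "reply_to": self._reply_topic, "payload": payload},
        )

    def _on_reply(self, message):
        callback = self._pending.pop(message["id"], None)
        if callback is not None:
            callback(message["result"])


class Subsystem:
    """Hypothetical subsystem: answers requests, broadcasts state changes."""

    def __init__(self, bus, name):
        self._bus = bus
        self._name = name
        self.state = "idle"
        bus.subscribe(f"{name}.set_state", self._on_set_state)

    def _on_set_state(self, message):
        old, self.state = self.state, message["payload"]
        # Reply to the requester, then notify every subscriber of the change.
        self._bus.publish(message["reply_to"], {"id": message["id"], "result": "ok"})
        self._bus.publish(f"{self._name}.state_changed", {"old": old, "new": self.state})


def demo():
    bus = MessageBus()
    Subsystem(bus, "antenna1")
    client = RpcClient(bus, "operator")
    results, log_a, log_b = [], [], []
    # Two independent subscribers track the same subsystem's state.
    bus.subscribe("antenna1.state_changed", log_a.append)
    bus.subscribe("antenna1.state_changed", log_b.append)
    client.call("antenna1.set_state", "tracking", results.append)
    bus.pump()
    return results, log_a, log_b
```

The sketch assumes the bus itself delivers every message exactly once
and in order; over an unreliable transport that guarantee would have to
be supplied by the layer underneath.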
The best solution may be to use inherently unreliable protocols such as
IP/UDP/multicast only 1) at a low level in the system where everything
can be fully controlled, and hence made reliable by eliminating the
sources of loss, or 2) where they can be encapsulated within a reliable
protocol. A reliable
TCP-based protocol is desirable whenever a large amount of data needs
to be moved and buffering and flow control are desirable. If the data
rate is low and MIB address space limitations are not an issue it is not
clear why you should bother with anything other than a reliable protocol.
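As a rough sketch of what encapsulating datagrams within a reliable
protocol involves on the receiving side, the Python fragment below
handles the three failure cases mentioned earlier: duplicates,
out-of-order delivery, and loss. It is only an illustration under
stated assumptions: the class and method names are hypothetical, the
sender is assumed to number datagrams from 0, and sequence-number
wraparound, timers, retransmission and flow control are all omitted.

```python
class ReorderingReceiver:
    """Hypothetical receiver: delivers payloads in order, exactly once."""

    def __init__(self):
        self._next_seq = 0   # next sequence number owed to the application
        self._held = {}      # datagrams that arrived ahead of a gap
        self.delivered = []  # payloads handed to the application, in order

    def on_datagram(self, seq, payload):
        # Duplicate: already delivered, or already held for later delivery.
        if seq < self._next_seq or seq in self._held:
            return
        self._held[seq] = payload
        # Release every datagram that is now contiguous with what was delivered.
        while self._next_seq in self._held:
            self.delivered.append(self._held.pop(self._next_seq))
            self._next_seq += 1

    def missing(self):
        # Sequence numbers known to be lost or late: the gaps below the
        # highest held datagram. A real protocol would request these again.
        if not self._held:
            return []
        return [s for s in range(self._next_seq, max(self._held))
                if s not in self._held]


def demo():
    rx = ReorderingReceiver()
    # Arrival order includes a reordering (2 before 1), two duplicates,
    # and a loss (3 and 4 never arrive before 5).
    arrivals = [(0, "a"), (2, "c"), (2, "c"), (1, "b"), (5, "f"), (0, "a")]
    for seq, payload in arrivals:
        rx.on_datagram(seq, payload)
    return rx.delivered, rx.missing()
```

Even this small fragment needs per-sender state and buffering, and it
still lacks the retransmission and flow control that TCP already
provides.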
Commonality with ALMA is important only above the level of the Executor.
- Doug