[evla-sw-discuss] Re: EVLA communications

Doug Tody dtody at nrao.edu
Thu Mar 3 00:20:27 EST 2005


Bill -

Fred needs the NRAO VO plan ASAP (early next week at the latest) to
prepare for the upcoming Visiting Committee meeting.  I need to get back
to this tomorrow and deal with it before I can spend any more time on
EVLA communications.

Rather than delay your report any further I suggest you make any changes
you deem desirable based on our discussions, and send your report out
for further review and comment.  I will try to get more formal written
comments back to you within several days.

My comments on EVLA communications infrastructure are generally favorable
although I still have some concerns.  Brief comments follow.  In general
I am pleased to see the level of detail to which the system design has
been carried, and the emphasis on asynchronous messaging to maintain
system state.

     o	I think EVLA communications can be broken down into as many as
 	three areas:

 	1) MIB and antenna level.  This may require lower-level, real-time
 	communications.  The platform (MIB) requirements are severe
 	and may (probably do) require custom software.  The problem
 	is sufficiently constrained that a special-purpose solution is
 	possible, achievable, and may be simplest at this level.

 	2) Telescope level (executor, telemetry, etc.).  It might be
 	nice if the artificial constraints of the MIB could be eliminated
 	here, presenting a higher-level interface to M&C.  A higher-level
 	interface providing reliable communications is possible and could
 	simplify the system.

 	3) E2E/dataflow level.  This includes dynamic scheduling, data
 	capture, and so forth.  The project model, scheduling block,
 	project metadata, archive, etc. are important at this level.
 	This is where we can most benefit from commonality with ALMA.

 	Possibly 1 and 2 can be combined but this may be too much ground
 	to cover with one approach, given the limitations of the MIB.

     o	ACS compliance is most important at level 3.  I would say that
 	ACS compliance is not an important issue for levels 1-2.  We would
 	like EVLA to be able to function in a basic mode independently of
 	ALMA (which is complex and difficult to control) at levels 1-2.
 	Basic telescope operations, similar to the current VLA, require
 	only levels 1-2.  ACS compliance is desirable for level 3 but is
 	not necessarily required.  An alternative would be to provide a
 	separate communications infrastructure for communicating with the
 	archive and with DC.  This is TBD.  We don't have to decide this
 	for now; it can be deferred to the next phase of system design
 	where we consider E2E and the overlap with ALMA.

     o	I am still concerned about using an unreliable communications
 	protocol (IP, UDP, multicast) generally within the system.  It is
 	possible, given the highly constrained execution environment and
 	hardware configuration, that IP/UDP/multicast will be sufficiently
 	reliable at some level within EVLA.  However, this is hard to
 	guarantee given that these are fundamentally unreliable protocols.

 	Lab tests may well indicate no problems but I would not rely
 	upon these in designing a complex system to be used for 20 years
 	with a wide range of loading conditions.  Network congestion
 	(resulting in switch overflow and lost datagrams) or CPU loading
 	(resulting in transmit/receive buffer overflow and datagram discard)
 	may result in lost packets.  Multiple copies of datagrams or
 	datagrams delivered out of order are also possible.  Fragmentation
 	concerns will limit the size of datagrams and require more complex
 	protocols to avoid fragmentation.  Streaming large dataflows may
 	require special care - but this is what TCP was designed for.
 	Depending upon the protocol, these cases may be recoverable and
 	not a serious problem; however, my impression with EVLA is that in
 	most or all cases a reliable protocol is preferable.

     o	I agree that ACS is too complex for use within EVLA M&C (although
 	we may want to use it above the level of the Executor).  There is
 	no obvious solution to use for level 2 communications, but at least
 	we have reduced this to a purely engineering decision.  CORBA is the
 	"obvious" solution (and has been used successfully for several large
 	telescopes) but is complex and difficult to control, and probably
 	overkill for this application.  ICE is an interesting alternative
 	to CORBA and may be worth further investigation.  D-BUS, PVM, etc.,
 	are problematic for this application.  IP/UDP/multicast would be
 	fine as low level protocols if they could be encapsulated in an
 	interface which provided reliable communications and flow control.
 	There is software (e.g., TIPC) which appears to provide this,
 	however there is nothing sufficiently widely used to be worth
 	a clear recommendation.  There are commercial products which
 	provide secure multicast but for a system of this type I would
 	not recommend using anything other than open source software for
 	which EVLA can control the source and system integration.

     o	XML-RPC is fine for simple RPC but that is really all it is
 	good for.  In general for a distributed system something more
 	complex is required which supports asynchronous messaging.
 	Messaging is at least as important as RPC, and in fact it is
 	more fundamental as RPC can be implemented on top of asynchronous
 	messaging.  Controlling and maintaining state in a distributed system
 	is in general best done by a combination of requests (RPC in some
 	form) and broadcast of asynchronous state-change events to allow
 	multiple subscribers to track and respond to the change of state
 	of a subsystem.  Reliable communication is critical or you have
 	to work hard at the protocol level to make up for the lack of it.
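
To illustrate the last point, here is a minimal Python sketch of RPC
layered on top of asynchronous messaging, combined with broadcast of
state-change events to several subscribers.  It is not EVLA code: the
Bus, RpcClient, and Subsystem classes, the topic names, and the message
fields are all hypothetical, and the in-process Bus simply assumes
reliable, ordered delivery.

```python
import itertools
from collections import defaultdict, deque


class Bus:
    """In-process stand-in for a reliable, ordered asynchronous message bus."""

    def __init__(self):
        self._subs = defaultdict(list)   # topic -> list of handlers
        self._queue = deque()            # pending (topic, message) pairs

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, msg):
        # Asynchronous: the message is queued, not delivered immediately.
        self._queue.append((topic, msg))

    def run(self):
        # Deliver queued messages in FIFO order until none remain.
        while self._queue:
            topic, msg = self._queue.popleft()
            for handler in list(self._subs[topic]):
                handler(msg)


class RpcClient:
    """RPC built from two one-way messages matched by a request id."""

    def __init__(self, bus, reply_topic):
        self._bus = bus
        self._reply_topic = reply_topic
        self._ids = itertools.count(1)
        self._pending = {}               # request id -> result callback
        bus.subscribe(reply_topic, self._on_reply)

    def call(self, topic, method, args, on_result):
        req_id = next(self._ids)
        self._pending[req_id] = on_result
        self._bus.publish(topic, {"id": req_id, "reply_to": self._reply_topic,
                                  "method": method, "args": args})

    def _on_reply(self, msg):
        callback = self._pending.pop(msg["id"], None)
        if callback is not None:
            callback(msg["result"])


class Subsystem:
    """Handles requests and broadcasts a state-change event on each change."""

    def __init__(self, bus, name):
        self._bus = bus
        self.name = name
        self.state = "idle"
        bus.subscribe(name + ".request", self._on_request)

    def _on_request(self, msg):
        if msg["method"] == "set_state":
            old, self.state = self.state, msg["args"]["state"]
            self._bus.publish(self.name + ".state",
                              {"source": self.name, "old": old, "new": self.state})
            result = "ok"
        else:
            result = "unknown method"
        self._bus.publish(msg["reply_to"], {"id": msg["id"], "result": result})


def demo():
    bus = Bus()
    Subsystem(bus, "antenna1")
    client = RpcClient(bus, "client.reply")
    results, seen_a, seen_b = [], [], []
    bus.subscribe("antenna1.state", seen_a.append)   # two independent
    bus.subscribe("antenna1.state", seen_b.append)   # state subscribers
    client.call("antenna1.request", "set_state", {"state": "tracking"},
                results.append)
    bus.run()
    return results, seen_a, seen_b
```

The requester gets its reply through the same messaging path that the
two subscribers use to follow the subsystem's state, so no separate RPC
transport is needed.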

The best solution may be to use inherently unreliable protocols such as
IP/UDP/multicast only 1) at a low level in the system where everything
can be fully controlled, and hence made reliable by eliminating the
sources of loss, or 2) where they can be encapsulated within a reliable
protocol.  A reliable TCP-based protocol is desirable whenever a large
amount of data needs to be moved and buffering and flow control are
desirable.  If the data rate is low and MIB address space limitations
are not an issue, it is not clear why you should bother with anything
other than a reliable protocol.
Commonality with ALMA is important only above the level of the Executor.
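
As a rough indication of what encapsulating datagrams in a reliable
interface involves, the Python sketch below shows only the receive-side
bookkeeping: dropping duplicates, restoring order, and reporting gaps.
The OrderedReceiver class and the per-datagram sequence number are
assumptions made for illustration; a real layer would also need
retransmission, timeouts, and flow control, which TCP already provides.

```python
class OrderedReceiver:
    """Reorders and de-duplicates datagrams tagged with sequence numbers."""

    def __init__(self, first_seq=0):
        self._next = first_seq   # next sequence number to deliver
        self._held = {}          # out-of-order datagrams awaiting a gap fill

    def receive(self, seq, payload):
        """Accept one datagram; return payloads now deliverable, in order."""
        if seq < self._next or seq in self._held:
            return []            # duplicate: already delivered or already held
        self._held[seq] = payload
        delivered = []
        while self._next in self._held:
            delivered.append(self._held.pop(self._next))
            self._next += 1
        return delivered

    def missing(self):
        """Sequence numbers known to be lost or late (gaps below held data)."""
        if not self._held:
            return []
        return [s for s in range(self._next, max(self._held))
                if s not in self._held]
```

For example, if datagram 2 arrives before datagram 1, receive() holds it
and missing() reports [1], which is the point at which a sender would
have to be asked to retransmit.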

 	- Doug


