Interprocessor Communication Mechanisms

Extends:
Scalable System Architectures (#3)

Description:
(1) General Area:
Current and future embedded systems consist of multiple networked
processors, either on the same chip (CMP) or on multiple chips.
This cluster concerns the hardware and software mechanisms that allow
such processors to communicate with each other. Initial emphasis is
less on communicating via coherent caches and more on explicit
communication mechanisms (e.g. remote DMA, remote queues, message
send/receive) and on programming models and runtime libraries in
support of software-controlled and deterministic-performance
communication. However, one of our most exciting targets is the
likely development of hardware primitives that can support equally well
*both* the shared-memory and the message-passing communication paradigms.
(2) Specific Topics:
(a) Tightly-coupled, light-weight, on-chip network interfaces:
multiple processors and accelerator engines integrated on the same chip
need to communicate with each other via the network-on-chip (NoC)
at low latency and low cost. The challenge is to initiate and complete
convenient primitive operations like RDMA and remote enqueue/dequeue
within a few clock cycles each --similar to L1 cache access-- and to
organize the network interface so that it only uses minimal buffer
memory, dynamically shared with the processor's own memory space.
(b) Events triggering network-interface (NI) actions: the challenge
is to develop a low-cost and general-purpose mechanism whereby packet
arrival events at an NI can trigger new RDMA or message-send actions.
The resulting hardware should efficiently support synchronization
operations and hardware-assisted software cache coherence (see (d) below).
(c) Flexible memory addressing: routing a memory access request
(load/store) from an issuing processor or NI to its target SRAM block
via the NoC is a process that potentially involves (i) virtual-to-physical
address translation; and/or (ii) cache tag verification; and/or
(iii) network routing tables or algorithms. The challenge is to
integrate these mechanisms with each other and to provide configurable
hardware that can access memory in different ways at different times
and for different purposes. Examples of use include: direct access to
cache areas for running software coherence protocols or for achieving
deterministic performance; copy-on-write primitives in support of
checkpointing, or incremental-communication pipelines, or transactional
memory.
(d) Integrated NI support for shared-memory and message-passing:
using the above mechanisms (a), (b), (c), it may become possible
to treat message-send like non-cacheable stores with write-combine,
message-receive like loads with pre-fetching, and cache coherence
misses like message-generating events that initiate software protocol
processing on a nearby coherence co-processor that uses remote queues
and remote DMA to service the misses, thus combining the efficiency
of hardware with the flexibility, adaptability, and potential for
deterministic performance guarantees of software.
(e) Runtime system: at this level the research will take two complementary
directions. On one hand, effort will be placed on using the new
hardware mechanisms inside the runtime system itself. Thread spawning
and synchronization at the runtime system level can use RDMA and remote
enqueue/dequeue operations directly. Using such mechanisms will bring
neighboring cores much closer to each other in access time than is
possible in current multicore implementations.
On the other hand, the runtime system can also expose these mechanisms to
the programming model. In this way, applications can also use the
mechanisms directly to access hardware accelerators, which will offer
much higher performance than general-purpose cores for specific algorithms.
(f) Support from the programming model: the challenge is to allow the
detection of specific software constructs and algorithms that are
suitable for mapping to hardware accelerators using RDMA and remote
enqueue/dequeue operations. It should also be possible for the
programmer to annotate memory operations that should go through such
mechanisms in order to accelerate data accesses.
The final goal is for the compiler to have enough information about
the algorithms of the application to determine this automatically.
These kinds of architectures will require programmers to learn new
ways of programming and of structuring data in their applications.
(g) Evaluation using simulation: a simulation environment needs to be
developed to evaluate these hardware mechanisms and software techniques.
The simulation environment will allow execution on several cores and
accelerators, showing the benefits of the program transformations and
the underlying hardware mechanisms. For this topic, we plan to use the
Unisim simulation infrastructure developed in the context of HiPEAC.
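
As a rough illustration of topic (b) above, the following C fragment
models in software how a packet-arrival event at an NI might trigger a
pre-armed RDMA action. It is only a sketch under stated assumptions:
the names ni_t, ni_arm, ni_packet_arrival and rdma_desc are
hypothetical and do not come from the proposal, the "RDMA" is modeled
as a local memory copy, and each event slot is assumed to fire once
and then disarm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NI_MAX_EVENTS 4   /* hypothetical number of event slots per NI */

typedef struct ni ni_t;
typedef void (*ni_action_fn)(ni_t *ni, void *arg);

/* One event slot: when armed, the next matching arrival runs the action. */
typedef struct {
    int armed;
    ni_action_fn action;
    void *arg;
} ni_event;

struct ni {
    ni_event events[NI_MAX_EVENTS];
};

/* Descriptor of a modeled RDMA transfer (a plain memory copy here). */
typedef struct {
    void *dst;
    const void *src;
    size_t len;
} rdma_desc;

/* Action: carry out the modeled RDMA described by arg. */
static void ni_rdma_action(ni_t *ni, void *arg)
{
    (void)ni;                       /* unused in this simple model */
    rdma_desc *d = arg;
    memcpy(d->dst, d->src, d->len);
}

/* Start with every event slot disarmed. */
static void ni_init(ni_t *ni)
{
    for (size_t i = 0; i < NI_MAX_EVENTS; i++) {
        ni->events[i].armed = 0;
        ni->events[i].action = NULL;
        ni->events[i].arg = NULL;
    }
}

/* Arm slot `id` so that the next arrival on it triggers `action(arg)`. */
static void ni_arm(ni_t *ni, size_t id, ni_action_fn action, void *arg)
{
    assert(id < NI_MAX_EVENTS);
    ni->events[id].armed = 1;
    ni->events[id].action = action;
    ni->events[id].arg = arg;
}

/* Model a packet arriving for slot `id`.
 * Returns 1 if an armed action was triggered, 0 otherwise.
 * The slot is one-shot: it is disarmed before the action runs. */
static int ni_packet_arrival(ni_t *ni, size_t id)
{
    assert(id < NI_MAX_EVENTS);
    ni_event *e = &ni->events[id];
    if (!e->armed)
        return 0;
    e->armed = 0;
    e->action(ni, e->arg);
    return 1;
}
```

In this model, software arms a slot with ni_arm and a later call to
ni_packet_arrival stands in for the hardware event. Because an action
receives the NI itself, it could re-arm its own slot or arm another
one, which is one way that chains of dependent transfers might be
expressed.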
(3) Background in HiPEAC:
This cluster is an offshoot of the "Scalable System Architectures"
cluster: the key ideas for the above research topics were developed
within that cluster, especially the ideas of the NI being closely
related to the cache controller and of multi-party remote dequeue
being related to atomic fetch-and-increment.
Because the title "scalable systems" is quite wide and encompasses
many diverse subtopics, it was decided to split that cluster into
two new clusters, one being this cluster and the other dealing
with interconnection networks (including NoC).
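
The relation between multi-party dequeue and atomic fetch-and-increment
mentioned above can be illustrated with a small software analogue in
C11. This is a minimal sketch, not the cluster's hardware design: the
names mp_queue, mp_queue_init and mp_dequeue are hypothetical, and the
queue is assumed to be filled completely before any party starts
dequeuing, so concurrent enqueue is not handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_SLOTS 8   /* hypothetical capacity */

/* Software model of a queue that several parties dequeue from. */
typedef struct {
    int items[QUEUE_SLOTS];
    size_t count;         /* number of valid items, fixed after init */
    atomic_size_t head;   /* index of the next slot to hand out */
} mp_queue;

/* Fill the queue with n items; must complete before dequeuing begins. */
static void mp_queue_init(mp_queue *q, const int *src, size_t n)
{
    assert(n <= QUEUE_SLOTS);
    for (size_t i = 0; i < n; i++)
        q->items[i] = src[i];
    q->count = n;
    atomic_init(&q->head, 0);
}

/* Claim the next item. The atomic fetch-and-increment on `head` gives
 * every caller a distinct slot index, so no two parties can receive
 * the same item. Returns 1 and stores the item in *out on success,
 * or 0 if the claimed index is past the last valid item. */
static int mp_dequeue(mp_queue *q, int *out)
{
    size_t slot = atomic_fetch_add(&q->head, 1);
    if (slot >= q->count)
        return 0;
    *out = q->items[slot];
    return 1;
}
```

Each dequeuer pays for exactly one atomic operation and never retries,
which is the property that makes fetch-and-increment a natural building
block for a dequeue operation shared among several parties.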

Nature:
Research cluster

Total funding:
Requested: € 50000

Including fellowship funding:
Requested: € 16000

Description of how the funding will be used:
(a) Travel:
* An average of 4 cluster members attending each of 4 HiPEAC cluster
meetings per year for 18 months =
4 persons * 6 meetings * 1000 Euro/person-meeting ~= 24000 Euro
* One long-term visit (~month-long) by Georgi Gaydadjiev (Delft) to FORTH:
~= 3000 Euro
* Two long-term visits (1.5 months each) by the fellow to partner sites:
2 * 3500 Euro/visit = 7000 Euro
(b) Fellowship:
* Stamatis Kavadias (FORTH): 1/5/2007 - 31/8/2008 (16 months):
16 months * 1000 Euro/month = 16000 Euro
(this is a continuation of fellowship/cluster #447, 1/5/2006-30/4/2007)

Cluster duration: March 2007 - August 2008 (18 months)

Duration of the funding:
Requested: 18 month(s)

Participating members:
* KATEVENIS Manolis (FORTH, member)
* BILAS Angelos (FORTH, member)
* PNEVMATIKATOS Dionisios (FORTH, member)
* GAYDADJIEV Georgi (Delft University of Technology, member)
* BEREKOVIC Mladen (Delft University of Technology, member)
* MARTORELL Xavier (UPC, member)
* RAMIREZ Alex (UPC, member)
* NAVARRO Nacho (UPC, member)
* GIL Marisa (UPC, member)
* SOURDIS Ioannis (Delft University of Technology, phd student)
* RICO Alejandro (UPC, phd student)
* STENSTROM Per (Chalmers University of Technology, member)
* DUATO Jose (University Politecnica de Valencia, member)
* KAVVADIAS Stamatis (FORTH, phd student)

Other people collaborating:
* Kees Goossens (DELFT)
* Giuseppe Desoli (ST Microelectronics)
* David Rodenas (UPC, phd student)
* Felipe Cabarcas (UPC, phd student)