Embedded multithreaded processors for hard and soft real-time systems


952 / Accepted / Finalized / Final evaluation

A significant share of real-time applications either need high performance or could offer substantially more functionality on higher-performance hardware. Such high-performance real-time applications range from video coding and decoding in TV sets, set-top boxes, DVD recorders, etc., which have so-called soft real-time requirements, to automotive, railway, and avionic applications and all kinds of industrial control systems, which have hard real-time requirements: a single missed deadline may cause a fatal accident. HDTV and new video coding/decoding standards require very high-performance embedded processors. Automotive applications also demand higher-performance processors in order to achieve further fuel savings and greater safety.

From an architectural point of view, there is a huge difference between soft and hard real-time systems, although both have to cope with increasingly performance-demanding applications. In soft real-time systems, the aim of the architects is to build a system that offers a guaranteed bandwidth to each real-time task. In hard real-time systems, the architecture must guarantee that a task will always have finished by a given point in time. On the soft real-time side, the hardware must ensure that a percentage of the resources is allocated to each task; in such a system, any dynamic decision towards this goal is welcome. On the hard real-time side, the hardware must ensure that a deadline is always met, and this must be proved statically; thus any dynamic decision must be chosen carefully, as it might increase the difficulty of the static analysis. The hardware and software of such real-time systems must allow a worst-case execution time (WCET) analysis that guarantees that execution will finish within a specific deadline. The cost criterion requires an effective WCET analysis, i.e. the real execution time (RET) should be close to the calculated WCET.
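
The relation between a statically computed WCET bound and an observed RET can be illustrated with a minimal sketch. It assumes a hypothetical straight-line task made of three basic blocks whose latency depends on the hardware state (cache hit or miss, correct or wrong branch prediction). All block names and cycle counts are invented for illustration and do not describe a real processor or analysis tool.

```python
# Hypothetical per-block latencies in cycles, indexed by
# (cache_hit, branch_predicted). All numbers are invented for illustration.
BLOCKS = {
    "load_input":   {(True, True): 4,  (True, False): 9,
                     (False, True): 24, (False, False): 29},
    "filter":       {(True, True): 10, (True, False): 15,
                     (False, True): 30, (False, False): 35},
    "store_output": {(True, True): 6,  (True, False): 11,
                     (False, True): 26, (False, False): 31},
}


def wcet_bound(blocks):
    """Sum, over all blocks, the latency of the worst hardware state.

    An analysis that cannot exclude any state must assume the worst one
    for every block, which makes the bound safe but pessimistic.
    """
    return sum(max(latencies.values()) for latencies in blocks.values())


def execution_time(blocks, states):
    """Execution time of one concrete run, given the state seen by each block."""
    return sum(blocks[name][state] for name, state in states.items())


def meets_deadline(blocks, deadline):
    """Hard real-time check: the bound, not an observed run, must fit the deadline."""
    return wcet_bound(blocks) <= deadline


# One observed run in which every access hits and every branch is predicted.
best_run = {name: (True, True) for name in BLOCKS}
ret = execution_time(BLOCKS, best_run)
bound = wcet_bound(BLOCKS)
print(f"WCET bound = {bound} cycles, RET = {ret} cycles, ratio = {bound / ret:.2f}")
```

With these invented numbers the bound is 95 cycles, while the observed run takes 20 cycles: the bound is safe, yet nearly five times the observed time, which is the kind of gap the cost criterion aims to shrink.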

Processors in current embedded systems are characterised by a simple architecture, encompassing short pipelines and in-order execution. These types of pipelines ease the computation of the worst-case execution time. On the other hand, current embedded systems place an increasing demand on the performance of the processor. This requires the design of embedded processors using performance-enhancing hardware features that are currently reserved for high-performance general-purpose processors. Examples of such hardware features are the cache hierarchy, branch prediction, pipeline buffers, out-of-order superscalar execution, simultaneous multithreading, and chip multiprocessing. However, adding these high-performance features to embedded processors causes problems with the analysability of the code: either the WCET of a program is no longer computable, due to the dynamic interaction between different threads in simultaneous multithreading, or the computed WCET is very far from the RETs (real execution times), as in the case of caches and branch prediction.

Besides, embedded processors must be low in cost. Thus, obtaining as much performance as possible from each resource is desirable. Multithreaded processors are a viable option for embedded systems due to their good performance/cost and performance/energy consumption ratios. Multithreaded processors range from simultaneous multithreaded (SMT) processors, in which most processor resources are shared among threads, to chip multiprocessors (CMP), in which every thread has its own dedicated processor resources and only the highest levels of the cache hierarchy are shared. Recent evolutions of these architectures have also produced CMP/SMT processors, i.e., chip multiprocessors in which every core is an SMT. More resource sharing leads to a lower cost, but also makes it harder to meet the real-time constraints of tasks.

Objectives of the Cluster
------------------------------

The main objective of the cluster is to investigate the usage of multithreaded processors (SMT and CMP) in soft and hard real-time systems. This objective can be divided into the following sub-objectives:

- We will investigate the application of multithreading in embedded processors in order to determine the best trade-off between performance and resource sharing. On the one hand, a high degree of resource sharing (SMT) implies that chips are smaller and that the performance per resource is higher, but it also causes strong interference between threads, which makes execution times more variable. On the other hand, reduced resource sharing (as in CMPs) leads to much lower execution time variability, but implies that many hardware resources are duplicated, increasing the area and cost of the chip.

- We will also investigate how the performance of multithreaded processors should be measured, as there is currently no clear procedure for doing so. This objective is two-fold. On the one hand, we will investigate which metrics to use. On the other hand, we will study the simulation methodologies currently used in high-performance multithreaded systems and try to determine their applicability to real-time systems, proposing new ones if necessary.

- Static WCET analysis has to consider the worst case that could happen in order to ensure that a deadline will NEVER be violated. Thus, static analysis has to consider many states (e.g. whether a branch is correctly predicted or not, whether there is a cache miss or not, etc.) in addition to modelling all these dynamic features. As a result, static analysis largely overestimates the execution time, because it cannot be restricted to a single state but has to deal with several states and assume the worst. We therefore have to devise hardware and software techniques that increase performance while still permitting an effective WCET analysis. Dynamic schemes will certainly be necessary, but these schemes will have to be easily analyzable. We are going to consider various approaches to increasing performance while keeping the complexity of the static analysis reasonable; it is probably a combination of them that will allow us to meet both objectives. As an example, a multithreaded processor may be considered in which threads do not share resources, or in which resource sharing can be easily modelled and does not introduce overestimation. Complex and dynamic choices that the static analysis cannot model precisely are avoided in such an architectural solution. Another example is a chip multiprocessor (multi-core) in which each core is simple enough to be statically modelled. Memory access conflicts might be easier to analyze because of the simplicity of the processors. However, a general conclusion is that hard real-time requirements hinder and often exclude the use of high-performance features commonly known in the processor architecture community.

- Research on hard real-time architectures is needed because the applications (mainly in automotive, aeronautics and space) are rendering current embedded hardware solutions obsolete. Indeed, these applications will require much more performance than today while still needing a static WCET analysis to determine and guarantee the maximum execution time of a task. Another point that hinders static analysis is the sharing of resources such as memory, either between several processors or between a processor and a DMA. The presence of such features makes the static analysis extremely complex: it takes a lot of time to model the hardware and to redo the static analysis each time a program is modified, and it also needs a lot of memory. So, the whole system should be considered and not the processor in isolation since, depending on what resource sharing happens in the system, some solutions for the processor architecture might or might not be tolerable.

- We also plan to investigate a new paradigm for real-time systems that replaces the hard real-time demand with a “rate-critical systems” demand. This relaxed real-time demand only guarantees that nearly all RETs are within a bound called the LET (longest execution time). The rate-critical systems demand requires a quality-of-service measure expressing either the expected number of RETs exceeding the LET (e.g. once per minute, hour, day, year, etc. during run-time) or the (very small) likelihood that the LET will be exceeded. The LET could be determined by simulation, e.g. as the longest execution time observed over a large number of simulation runs, or by probability calculations based on queuing theory and on the probabilities of usage of the high-performance features. The technique of “extreme analysis” may be of use; more investigation is needed. The advantage of the rate-critical systems paradigm is that standard high-performance features could be applied more easily within an embedded processor, and new scheduling schemes can be used that guarantee an IPC (instructions per cycle) bandwidth but not a WCET. The current drawback of the paradigm is on the application side: not all applications that need hard real-time guarantees will be sufficiently flexible to replace the hard real-time requirement with the weaker rate-critical systems requirement. More discussion with application developers is necessary.
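
One possible reading of the simulation-based approach in the last item can be sketched as follows. The sketch picks the LET as the smallest observed RET that at most a target fraction of the simulated runs exceed, and then converts that fraction into an expected number of exceedances per hour. The function names, the sample data, and the activation rate are hypothetical. An empirical quantile is used here in place of queuing-theory or extreme-value methods, and it says nothing about runs more extreme than those simulated.

```python
import math

# Hypothetical RETs (in cycles) from ten simulation runs; invented data.
SAMPLE_RETS = [100, 102, 98, 105, 101, 99, 103, 140, 100, 180]


def estimate_let(rets, max_exceed_fraction):
    """Return the smallest observed RET that at most the given fraction
    of the observed runs exceed (an empirical quantile).

    With max_exceed_fraction == 0 this is simply the longest observed RET.
    """
    if not rets:
        raise ValueError("need at least one observed execution time")
    if not 0.0 <= max_exceed_fraction < 1.0:
        raise ValueError("max_exceed_fraction must be in [0, 1)")
    ordered = sorted(rets)
    # Number of runs that are allowed to exceed the LET.
    allowed = math.floor(max_exceed_fraction * len(ordered))
    return ordered[len(ordered) - 1 - allowed]


def exceed_fraction(rets, let):
    """Fraction of observed runs whose execution time exceeds the LET."""
    return sum(1 for r in rets if r > let) / len(rets)


def expected_exceedances_per_hour(fraction, activations_per_second):
    """Quality-of-service figure: expected LET overruns per hour of run-time."""
    return fraction * activations_per_second * 3600


let = estimate_let(SAMPLE_RETS, 0.2)
frac = exceed_fraction(SAMPLE_RETS, let)
print(f"LET = {let} cycles, observed exceedance fraction = {frac}")
```

On this tiny invented sample, allowing 20% of the runs to overrun gives a LET of 105 cycles, whereas allowing none gives the longest observed RET, 180 cycles. A realistic study would need far more runs and a far smaller target fraction.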


Research cluster

Requested: € 43200

Requested: € 13200

- 3 Cluster meetings (2p from UPC, 3p from Augsburg, 2p from Toulouse, 2p from Karlsruhe, 1p from Rapita): 3 x 10 x 1000 = 30000 EUR

- 1 Student fellowship at UPC (12 months): 12 x 1100 = 13200 EUR


Requested: 12 month(s)

VALERO Mateo (UPC) (--member--)
UNGERER Theo (University of Augsburg) (--member--)
SAINRAT Pascal (CNRS) (--member--)
CAZORLA Francisco (Barcelona Supercomputing Center) (--member--)
RAMIREZ Alex (UPC) (--member--)
BRINKSCHULTE Uwe (University of Karlsruhe) (--member--)
LANDET Cédric (IRIT) (--phd student--)
UHRIG Sascha (University of Augsburg) (--colleague--)
KLUGE Florian (University of Augsburg) (--phd student--)

Glenn Farrall, Infineon
Guillem Bernat, Rapita Systems
Rafael Zalman, Infineon
Knut Hufeld, Infineon
Christine Rochange, CNRS