Reliable Embedded Processors

Following performance and power, reliability has emerged as the latest challenge in microarchitecture. Various developments have combined to make reliability a concern: the soft-error rate is projected to increase with scaling; variability due to non-deterministic placement of dopant atoms and channel length is increasing design margins; better-than-worst-case design techniques for power/performance require error detection/correction; aggressive application of power-saving mechanisms such as clock- and Vdd-gating is increasing voltage droops; the verification manpower budget is becoming a significant part of the design effort; and oxide breakdown and electromigration are decreasing processor lifetimes.

This cluster will address this challenge in a synergistic way, for example by involving the compiler layer as well as the architectural layer. Building a resilient processor framework involves three steps: fault modelling, injection and detection; resiliency through recovery from failure; and verification of resiliency through experiments. These are detailed below.

1. Fault modelling, injection and detection: We need to model the faults that are most likely to occur in future-generation processors (such as those with sub-45nm feature sizes). Such faults can be categorized into three classes: permanent, intermittent and transient. Permanent faults usually appear as stuck-at (either at 1 or 0) logic levels and need to be modelled as such. Intermittent and transient faults occur due to process variations and environmental factors and are of temporary duration; therefore the fault duration as well as the fault rate needs to be modelled per processor block. We will use circuit area analysis, as well as other methods such as those based on activity factors, to establish the fault types and rates for each processor block.
Furthermore, depending on the cause of the fault (alpha particles or solar rays, thermal spikes, di/dt noise), its spatial distribution could be different, and this will be modelled as well: standalone faults manifest themselves as single event upsets, causing single bit flips. In contrast, clustered faults affect a larger area of the chip.

2. Recovery methods: At the hardware level, we propose the use of both space redundancy (multiple threads, duplex and triplex processing units coupled with novel checker and voter mechanisms) and time redundancy (checkpointing). The most efficient method will depend on the fault type and characteristics. For example, intermittent faults usually occur at the same location, while the distribution of transient faults is more random. Intermittent faults could be eliminated by isolating and replacing (through duplicated hardware or software) the involved block. However, disconnect-and-replace is not a viable recovery policy for a transient fault; a more apt policy would be multiple-thread redundancy.

3. Experiments and verification: The resilient processor will be compared with a baseline processor that has no recovery capability. The comparison will be done by injecting faults that simulate the impact of various operating conditions as well as process variations on the system. The reliability of the baseline and resilient processors will then be compared by observing their ability to continue operating correctly in the presence of faults. The performance/power impact of the proposed schemes will also be studied.

Deliverables: Fault modelling, injection and detection modules will be developed and integrated with an existing simulator such as SimpleScalar. The simulator will be extended to take into account the performance and power implications of time/space redundancy.
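
The three steps above (fault injection, recovery through space redundancy, and a baseline comparison) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: a toy 32-bit adder stands in for a processor block, the function names are hypothetical and are not part of SimpleScalar, and the bitwise majority voter is a textbook triplex voter rather than the novel checker and voter mechanisms proposed here.

```python
import random

WORD_BITS = 32                      # assumed word width of the toy unit
MASK = (1 << WORD_BITS) - 1


def inject_transient(value, rng):
    """Transient fault: flip one randomly chosen bit (single event upset)."""
    return value ^ (1 << rng.randrange(WORD_BITS))


def inject_stuck_at(value, bit, level):
    """Permanent fault: force one bit to a fixed logic level (stuck-at 0 or 1)."""
    if level:
        return value | (1 << bit)
    return value & ~(1 << bit) & MASK


def vote(a, b, c):
    """Bitwise majority voter over three redundant results (triplex)."""
    return (a & b) | (a & c) | (b & c)


def fraction_correct(inputs, fault_rate, seed, triplex):
    """Run a toy adder under transient faults; return the share of correct results.

    fault_rate is the per-operation probability that a result is corrupted.
    With triplex=True the operation is executed three times and voted on;
    with triplex=False it models the baseline without recovery.
    """
    rng = random.Random(seed)       # seeded so that experiments are repeatable

    def unit(x, y):
        result = (x + y) & MASK
        if rng.random() < fault_rate:
            result = inject_transient(result, rng)
        return result

    correct = 0
    for x, y in inputs:
        golden = (x + y) & MASK     # fault-free reference result
        if triplex:
            out = vote(unit(x, y), unit(x, y), unit(x, y))
        else:
            out = unit(x, y)
        if out == golden:
            correct += 1
    return correct / len(inputs)


if __name__ == "__main__":
    workload = [(i, 2 * i + 1) for i in range(200)]
    print("baseline:", fraction_correct(workload, 0.05, seed=1, triplex=False))
    print("triplex: ", fraction_correct(workload, 0.05, seed=1, triplex=True))
```

In this sketch the voter masks any single corrupted copy and fails only when at least two copies flip the same bit position. A real study would also have to charge the triplex configuration for its extra executions, which is the performance/power cost mentioned in step 3.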
As a further deliverable, analyses and optimizations oriented towards maximizing reliability (as opposed to maximizing performance or power savings) will be developed at the compiler layer.

Research cluster
Requested: € 26400
Requested: € 14400

We will use the funding to cover our collaboration expenses in 2007. The visits will usually involve two people, as we plan to bring some graduate students to these collaboration meetings. We want to have at least two meetings in 2007 with all cluster members (Ankara, Barcelona, Edinburgh), and we assume that each meeting will cost at least 1000 euros per person. The total cost adds up to 12000 euros, which will be divided equally among the 3 HiPEAC members.

In order to develop the simulation infrastructure, we plan to employ a PhD student at UPC - Barcelona Supercomputing Center (BSC). This student will cost 1200 euros per month (14400 euros in 12 months).

Requested: 12 month(s)

VALERO Mateo (UPC) (--member--)
ERGIN Oguz (TOBB Economics and Technology University) (--member--)
CRISTAL Adrián (Barcelona Supercomputing Center) (--colleague--)
JONES Timothy (Edinburgh University) (--colleague--)
UNSAL Osman (Barcelona Supercomputing Center) (--member--)