ISCA 2002
Dynamic Fine-Grain Leakage Reduction Using Leakage-Biased Bitlines
Seongmoo Heo, Kenneth Barr, Mark Hampton, and Krste Asanović
Computer Architecture Group, MIT LCS

Leakage Power
- Growing impact of leakage power
  - Leakage power increases as transistor lengths and threshold voltages scale down
  - The power budget limits the use of fast, leaky transistors
- Challenge: how to maintain performance scaling in the face of increasing leakage power?

Leakage Reduction Techniques
- Static: design-time Selection of Slow Transistors (SSST) for non-critical paths
  - Replace fast transistors with slow ones on non-critical paths
  - Tradeoff between delay and leakage power
- Dynamic: run-time Deactivation of Fast Transistors (DDFT) for critical paths
  - DDFT switches critical-path transistors between inactive and active modes
- Observation: critical paths dominate leakage after applying SSST techniques
  - Example: PowerPC 750. 5% of transistor width is low Vt, but these transistors account for >50% of total leakage
  - DDFT could give large leakage savings

Existing DDFT Circuit Techniques
- Body biasing
  - Vt is increased by the reverse-biased body effect
  - Large transition time and wakeup latency due to well capacitance and resistance
  - [Figure: transistor with gate, drain, source, and body terminals; Vbody > Vdd]
- Power gating
  - Sleep transistor between supply and virtual supply lines
  - Increased delay due to the sleep transistor
  - [Figure: Vdd, sleep signal, sleep transistor, virtual Vdd, logic cells]
- Sleep vector
  - Input vector which minimizes leakage
  - Increased delay due to the mux, and active energy due to spurious toggles after applying the sleep vector

Fine-Grain DDFT Techniques
- Fine-grain DDFT techniques
  - Have to turn off small pieces of an active processor for short periods of time
  - Difficult to turn off large pieces for long periods
- Requirements of fine-grain DDFT techniques
  - Circuits with low active delay penalty, low energy moving in and out of sleep, and fast wakeup time
  - Micro-architectural scheduling to keep sleep periods as long and as frequent as possible
- Compare to coarse-grain DDFT techniques
  - The O.S. puts the whole processor to sleep for a long time; this doesn't save power when running code
  - Low steady-state leakage is the only concern

Highlights of This Work
- We introduce metrics for comparing fine-grain dynamic deactivation techniques
  - Steady-state leakage, transition time, fixed transition energy, breakeven time
- We present a new circuit-level leakage reduction technique, Leakage-Biased Bitlines (LBB)
  - Low deactivation energy and fast wakeup
- We save leakage power of the I-cache and the multiported regfile with LBB
  - I-cache: idle subbank deactivation
  - Multiported regfile: idle read port and dead register deactivation

Outline
1. Methodology and DDFT Metrics
2. Cache Leakage Saving
   - Idle subbank deactivation
3. Multiported Regfile Leakage Saving
   - Dead reg deactivation (Horizontal)
   - Idle read port deactivation (Vertical)
4. Conclusion

Methodology
- Process technology
  - 180nm DVT process modeled after 0.18um TSMC LVT and MVT processes
  - Scaled to 130, 100, and 70nm processes based on the SIA roadmap
  - Optimistic/pessimistic leakage prediction: 2x/4x increase of leakage current density (nA/um)
- Evaluation with SimpleScalar
  - Modified to model a unified physical register file
  - 4-issue, 100 integer physical regs, 16KB/4-way/32-B block I-cache and D-cache, unified L2 cache
  - SPECint95 refs
- Energy measurements
  - Hspice simulation for the 180nm process, scaled to the other processes accordingly

Metrics for Fine-Grain DDFT Techniques
- Steady-state leakage
- Transition time
- Fixed transition energy
- Break-even time
- Wakeup latency
- Active delay and power
- [Figure: leakage current and cumulative leakage energy versus time across active and sleep periods, original leakage versus DDFT leakage, with transition time, fixed transition energy, steady-state leakage, break-even time, and length of sleep marked]

L1 Cache and Multiported Regfile
- Good targets for fine-grain DDFT techniques
- Timing-critical
  - Contrast: the L2 cache is a better target for SSST (long-channel or HVT transistors)
- Large leakage current
  - Cache: large number of fast transistors
  - Multiported regfile: ever-increasing number of registers and ports
  - The Alpha 21464 register file is 5x larger than the 64KB data cache

LBB for Caches
- Modern cache structure
  - Subbanks with hierarchical bitlines
    - To save active power
    - To reduce delay
    - To reduce bitline noise
  - Local bitlines (32-bit cells) are disconnected from the senseamp by a local-global switch
  - [Figure: global bitline, local bitline, local-global switch, senseamp]
- LBB for caches: if a subbank is not in use, turn off the precharge transistors and delay precharging

Cache: Dual Vt SRAM cell
- HVT transistors are colored green in the original figure
- Bitline leakage depends on the stored value
- [Figure: dual-Vt SRAM cell with wordline WL, local bitlines BIT and BIT_BAR, and global bitlines GLOBAL BIT and GLOBAL BIT_BAR, with an "Our Target" label]

Leakage-Biased Bitlines (LBB)
- [Figure: floating bitlines for different stored values, labeled "Forcing 1", "Forcing 0", and "Forcing ?", with outcomes "Stay at 1", "Discharge to 0", and "Discharge to an intermediate value between 0 and 1"]
- LBB lets bitlines float by turning off the local HVT NMOS precharge transistors
  - No static current draw because the local bitline is isolated
- LBB uses leakage itself to bias bitlines to the voltage which minimizes leakage!
- A good fine-grain dynamic technique
  - Minimal transition energy: same number of precharges (delayed precharge)
  - Minimal transition time: wakeup latency is only that of the precharge phase

LBB versus Sleep Vector
- LBB finds the minimal leakage state; it is always better than sleep vectors
- [Figure: leakage power (uW) of a 32x16B SRAM subbank versus zero percentage (%), for Original, Sleep Vector 1, Sleep Vector 0, and LBB]

Cumulative Leakage Energy
- 32-row x 32B SRAM subbank (optimistic leakage current used; 75% zero assumed)
- [Figure: energy (pJ) versus length of sleep (0 to 500 cycles), Original versus LBB, one panel for 180nm and one for 70nm]
- Dynamic energy cost: need to replace the lost charge
  - The LBB curve rises quickly at the beginning
- Breakeven time decreases with scaling
  - 180nm: 200 cycles; 70nm: less than a cycle
  - Active energy scales down faster than leakage energy

Performance Issues for LBB Caches
- Subbank must be precharged before use
  - Case 1 (best): subbank decode and precharge happen before the more complex word-line decode, therefore no penalty
  - Case 2 (worst): add an additional pipeline stage for precharge
    - One-cycle increase in branch misprediction penalty
- Focus on the I-cache because any latency increase can be partly hidden by branch prediction

I-Cache Subbank Deactivation
- [Figure: leakage energy saving (%) at the 70nm process per benchmark (comp, gcc, go, jpeg, li, m88k, perl, vort, avg), pessimistic and optimistic predictions]
- [Figure: total energy saving (%) at the 70nm process per benchmark]
- [Figure: leakage energy saving (%) across processes (180nm, 130nm, 100nm, 70nm)]
- [Figure: total energy saving (%) across processes (180nm, 130nm, 100nm, 70nm)]
- Case 2 (worst) assumption (adding an additional pipeline stage)
  - 2.5% IPC decrease on average

Multiported Regfile Cell
- 8R, 4W unbalanced DVT reg cell
- HVT transistors are colored green in the original figure
- Simplified but active/leakage power-aware baseline
- [Figure: register cell with signals READ[0:7], WRITE[0:3], WRITEB[0:3], WWL[0:3], RWL[0:7]]

LBB for Multiported Regfiles
- Turn off the precharge transistor on idle subbank read ports
- Leakage current discharges bitlines to 0 if any bits are holding 1
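
The breakeven time used in the cumulative leakage energy comparison can be sketched with a simple linear model. This is a minimal illustration, not the authors' Hspice methodology: it assumes a fixed transition energy and constant per-cycle leakage in both modes, and the function names and numbers are hypothetical.

```python
def breakeven_cycles(active_leak_pj, sleep_leak_pj, transition_pj):
    """Sleep length (cycles) at which deactivation starts to save energy.

    Assumed linear model: staying active costs active_leak_pj per cycle;
    sleeping costs a fixed transition_pj plus sleep_leak_pj per cycle.
    """
    saved_per_cycle = active_leak_pj - sleep_leak_pj
    if saved_per_cycle <= 0:
        return float("inf")  # sleeping never pays off
    return transition_pj / saved_per_cycle


def net_saving_pj(sleep_cycles, active_leak_pj, sleep_leak_pj, transition_pj):
    """Energy saved (negative means lost) by sleeping for sleep_cycles."""
    return sleep_cycles * (active_leak_pj - sleep_leak_pj) - transition_pj


# Illustrative numbers only, not measurements from the slides.
print(breakeven_cycles(1.0, 0.25, 150.0))    # 200.0
print(net_saving_pj(100, 1.0, 0.25, 150.0))  # -75.0
```

Under this model, a sleep period shorter than the breakeven time costs energy, which is consistent with total energy savings turning negative when the breakeven time is long.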
Dead Register Deactivation
- Horizontal technique
- Dead registers = registers in the free list
- If all registers in a subbank are dead, all read ports in the subbank are turned off by LBB
- No performance penalty since there is ample time to re-precharge between allocation and write
- [Figure: Subbank 1 with Readport 0, Readport 1, Readport 2]

NMOS Sleep Transistor (NST)
- Alternative horizontal DDFT
- Turns off dead registers using NMOS sleep transistors (NST)
- Advantage: registers can be turned off individually
- Disadvantage: increased read access time
  - Delay penalty set to 5% (tradeoff between delay and leakage)
- [Figure: Register 1 with Readport 0, Readport 1, Readport 2]
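
The dead register deactivation rule (a subbank can be turned off by LBB only when every register in it is on the free list) can be sketched as follows. This is a minimal sketch under assumptions not stated in the slides: registers are numbered from 0, each subbank holds a contiguous block of `regs_per_bank` registers, and the function name is hypothetical.

```python
def dead_subbanks(free_list, num_regs, regs_per_bank):
    """Return indices of subbanks whose registers are all on the free list.

    Assumes subbank b holds registers b*regs_per_bank .. (b+1)*regs_per_bank - 1.
    """
    free = set(free_list)
    dead = []
    for bank in range(num_regs // regs_per_bank):
        first = bank * regs_per_bank
        regs = range(first, first + regs_per_bank)
        if all(r in free for r in regs):
            dead.append(bank)  # every register is dead: read ports can float
    return dead


# Hypothetical 8-register file split into two 4-register subbanks.
print(dead_subbanks([0, 1, 2, 3, 5], num_regs=8, regs_per_bank=4))  # [0]
```

A single live register keeps its whole subbank active under LBB, whereas NST can turn registers off individually.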
Idle Readport Deactivation
- Vertical technique
- Read ports are idle when fewer than the maximum number of instructions are issued in a superscalar machine
- Idle read ports are deactivated by LBB
- No performance penalty since it is known whether a read port is needed before it is known which register will be accessed in the pipeline
- [Figure: Readport 0, Readport 1, Readport 2]

Comparison of DDFTs
32 x 32-b regfile subbank (75% zero assumed; optimistic leakage current used)

Process (nm)   Original (uW)   SV steady-state (uW)   LBB steady-state (uW)   NST steady-state (uW)
180            177.9           2.0                    2.0                     1.8
130            214.1           2.4                    2.4                     2.2
100            263.6           3.0                    3.0                     2.7
70             276.7           3.1                    3.1                     2.9

[Figure: energy (pJ) versus length of sleep (0 to 1000 cycles), one panel for 180nm and one for 70nm, comparing Original, Sleep Vector, NMOS Sleep Transistor, and Leakage-Biased Bitlines]

Comparison of DDFTs, Blowup: 70nm
[Figure: energy (pJ) versus length of sleep (0 to 50 cycles) at 70nm, comparing Original, Sleep Vector, NMOS Sleep Transistor, and Leakage-Biased Bitlines]

Dead Register/Subbank Deactivation Policies
- Free list policies for NST (NMOS Sleep Transistor): queue and stack
  - Queue: conventional
  - Stack: keeps some regs dead for longer
    - 2.4-10% greater savings than queue at 70nm
    - Benefit increases as feature sizes shrink
- Subbank allocation policy for LBB: stack
  - Allocate a new subbank only when the previous bank is empty of dead registers

Dead Reg Deactivation (Horizontal)
- [Figure: leakage energy savings (%) at the 70nm process per benchmark (li, m88k, perl, comp, vort, gcc, go, jpeg, avg); colored bars are optimistic, white bars are pessimistic]
- [Figure: total energy savings (%) at the 70nm process per benchmark]
- [Figure: leakage energy savings (%) and total energy savings (%) versus process (180, 130, 100, 70 nm) for NST Queue, NST Stack, LBB 16 regs/bank, and LBB 8 regs/bank]
- NST stack is better than NST queue; LBB stack is better than either NST policy

Read Port Deactivation (Vertical)
- [Figure: leakage energy saving (%) at the 70nm process per benchmark (comp, gcc, go, jpeg, li, m88k, perl, vort, avg)]
- [Figure: total energy saving (%) at the 70nm process per benchmark]
- [Figure: leakage energy saving (%) across processes (180nm, 130nm, 100nm, 70nm)]
- [Figure: total energy saving (%) across processes (180nm, 130nm, 100nm, 70nm)]
- More energy saving for wider-issue processors
- Read port deactivation can be combined with dead subbank deactivation

Conclusion
- Most leakage power is in critical paths
  - Dynamic leakage reduction (DDFT) desired
- LBB allows fine-grain dynamic leakage reduction with zero or minimal performance penalty
  - 0% performance penalty for multiported regfiles
- Sleep time can be improved by changing micro-architectural scheduling policies
  - Stack better than queue for free list policy
- Follow-on work: leakage-biased domino logic to save leakage power in critical ALUs [VLSI Symposium 2002]

Acknowledgments
- Thanks to Christopher Batten, Ronny Krashinsky, Rajesh Kumar, and anonymous reviewers
- Funded by DARPA PAC/C award F3060200-2-0562, NSF CAREER award CCR0093354, and a donation from Infineon Technologies

DDFT Examples
- Steady-state leakage power
  - Body biasing: less than 5% (depends on Vbody)
  - Power gating: less than 5% (depends on sleep transistor)
  - Sleep vector: less than 50% (depends on the circuit)
- Transition time, wakeup latency
  - Body biasing: 0.1~100us
  - Power gating: less than a cycle
  - Sleep vector: less than a cycle
- Transition energy, breakeven time
  - Body biasing: well cap switching energy
  - Power gating: sleep transistor gate cap switching energy
  - Sleep vector: active energy consumed due to spurious toggling after sleep vector
- Delay impact
  - Body biasing: no
  - Power gating: yes, due to sleep transistor
  - Sleep vector: yes, due to mux
- Etc.
  - Power gating: area for sleep transistor and virtual supplies
  - Sleep vector: finding the sleep vector is hard
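
The conclusion that a stack free list beats a queue can be illustrated with a toy model. This is a minimal sketch and not the SimpleScalar evaluation from the slides: it assumes a hypothetical workload in which each register is freed immediately after it is allocated, and it counts how many distinct registers ever leave the free list (the rest stay dead the whole time).

```python
from collections import deque


def distinct_regs_used(policy, num_regs, num_ops):
    """Count registers touched under a 'stack' or 'queue' free list policy.

    Toy workload (an assumption): allocate one register, then free it at once.
    """
    free = deque(range(num_regs))
    used = set()
    for _ in range(num_ops):
        # Stack reuses the most recently freed register; queue takes the oldest.
        reg = free.pop() if policy == "stack" else free.popleft()
        used.add(reg)
        free.append(reg)  # freed registers always return at the same end
    return len(used)


print(distinct_regs_used("stack", num_regs=8, num_ops=20))  # 1
print(distinct_regs_used("queue", num_regs=8, num_ops=20))  # 8
```

In this model the stack policy leaves seven of the eight registers dead throughout, while the queue policy cycles through all of them, so no register stays dead for long.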