ISCA 2002 Dynamic Fine-Grain Leakage Reduction Using Leakage-Biased

ISCA 2002
Dynamic Fine-Grain Leakage Reduction Using Leakage-Biased Bitlines
Seongmoo Heo, Kenneth Barr, Mark Hampton, and Krste Asanović
Computer Architecture Group, MIT LCS

Leakage Power
  • Growing impact of leakage power
      • Increase of leakage power due to scaling of transistor lengths and threshold voltages
      • Power budget limits use of fast leaky transistors
  • Challenge: How to maintain performance scaling in the face of increasing leakage power?

Leakage Reduction Techniques
  • Static: Design-time Selection of Slow Transistors (SSST) for non-critical paths
      • Replace fast transistors with slow ones on non-critical paths
      • Tradeoff between delay and leakage power
  • Dynamic: Run-time Deactivation of Fast Transistors (DDFT) for critical paths
      • DDFT switches critical path transistors between inactive and active modes
  • Observation: Critical paths dominate leakage after applying SSST techniques
      • Example: PowerPC 750. 5% of transistor width is low Vt, but these transistors account for >50% of total leakage.
  • DDFT could give large leakage savings

Existing DDFT Circuit Techniques
  • Body Biasing
      [Figure: transistor with gate, drain, source, and body terminals; Vbody > Vdd]
      • Vt increase by reverse-biased body effect
      • Large transition time and wakeup latency due to well cap and resistance
  • Power Gating
      [Figure: Vdd, sleep signal, virtual Vdd, logic cells]
      • Sleep transistor between supply and virtual supply lines
      • Increased delay due to sleep transistor
  • Sleep Vector
      [Figure: inputs forced to 0, 0]
      • Input vector which minimizes leakage
      • Increased delay due to mux and active energy due to spurious toggles after applying sleep vector

Fine-Grain DDFT Techniques
  • Fine-grain DDFT techniques: have to turn off small pieces of an active processor for short periods of time
      • Difficult to turn off large pieces for long periods
  • Requirements of fine-grain DDFT techniques
      • Circuits with low active delay penalty, low energy moving in and out of sleep, and fast wakeup time
      • Micro-architectural scheduling to keep the sleep time as long and often as possible
  • Compare to coarse-grain DDFT techniques
      • O.S. puts whole processor to sleep for a long time; doesn't save power when running code
      • Low steady-state leakage is the only concern.

Highlights of This Work
  • We introduce metrics for comparing fine-grain dynamic deactivation techniques

      • Steady-state leakage, transition time, fixed transition energy, breakeven time
  • We present a new circuit-level leakage reduction technique, Leakage-Biased Bitlines (LBB)
      • Low deactivation energy and fast wakeup
  • We save leakage power of I-Cache and multiported regfile by LBB
      • I-cache: idle subbank deactivation
      • Multiported regfile: idle read ports and dead register deactivation

Outline
  1. Methodology and DDFT Metrics
  2. Cache Leakage Saving
       Idle subbank deactivation
  3. Multiported Regfile Leakage Saving
       Dead reg deactivation (Horizontal)
       Idle read port deactivation (Vertical)
  4. Conclusion

Methodology
  • Process Technology
      • 180nm DVT process modeled after 0.18um TSMC LVT and MVT processes
      • Scaled to 130, 100, and 70nm processes based on SIA roadmap
      • Optimistic/pessimistic leakage prediction: 2x/4x increase of leakage current density (nA/um)
  • Evaluation with SimpleScalar
      • Modified to model unified physical register file
      • 4 issue, 100 integer physical regs, 16KB/4-Way/32-B block I-Cache and D-Cache, Unified L-2 Cache
      • SPECint95 refs
  • Energy measurements
      • Hspice simulation for 180nm process and scaled to other processes accordingly

Metrics for Fine-Grain DDFT Techniques
  [Figure: leakage current and leakage energy, original versus DDFT applied. Axes: time, length of sleep. Labeled quantities: transition time, steady-state sleep leakage, wakeup latency, fixed active transition energy, break-even time, active delay and power.]

L1 Cache and Multiported Regfile
  • Good targets for fine-grain DDFT techniques
  • Timing-critical
      • Contrast: L2 cache is a better target for SSST (long channel or HVT transistors)
  • Large leakage current
      • Cache: Large number of fast transistors
      • Multiported Regfile: Ever increasing number of registers and ports
      • Alpha 21464 register file is 5x larger than 64KB data cache

LBB for Caches
  • Modern cache structure
      • Subbank: hierarchical bitlines
          • To save active power
          • To reduce delay
          • To reduce bitline noise
      [Figure: global bitline, local bitline, local-global switch, senseamp]
      • Local bitlines (32-bit cells) disconnected from senseamp by local-global switch.
  • LBB for Caches: If a subbank is not in use, turn off precharge transistors and delay precharging.

Cache: Dual Vt SRAM cell
  [Figure, shown in four animation steps: SRAM cell with GLOBAL BIT, BIT, GLOBAL BIT_BAR, BIT_BAR, and WL, annotated with stored 0/1 values. HVT transistors: green-colored. Labels: Our Target; Forcing 1, Forcing 0, Forcing ?]
  • Bitline leakage depends on the stored value

Leakage-Biased Bitlines (LBB)
  [Figure: three bitline cases: stay at 1, discharge to 0, discharge to an intermediate value between 0 and 1]
  • LBB lets bitlines float by turning off the local HVT NMOS precharge transistors
      • No static current draw because local bitline isolated
  • LBB uses leakage itself to bias bitlines to the voltage which minimizes leakage!
  • A good fine-grain dynamic technique
      • Minimal transition energy: Same number of precharges (delayed precharge)
      • Minimal transition time: Wakeup latency is only that of precharge phase

LBB versus Sleep Vector
  • LBB finds the minimal leakage state. Always better than sleep vectors
  [Figure: Leakage Power of 32x16B SRAM subbank. Vertical axis: Leakage Power (uW). Horizontal axis: Zero Percentage (%). Curves: Original, Sleep Vector 1, Sleep Vector 0, LBB.]

Cumulative Leakage Energy
  32-row x 32B SRAM subbank (optimistic leakage current used. 75% zero assumed)
  [Figures: 180nm and 70nm. Vertical axis: Energy (pJ). Horizontal axis: Length of Sleep (cycles). Curves: Original, LBB.]
  • Dynamic energy cost: Need to replace the lost charge
      - LBB curve increases fast in the beginning
  • Decrease of breakeven time
      - 180nm: 200 cycles, 70nm: less than a cycle
      - Active energy scales down faster than leakage energy

Performance Issues for LBB Caches
  • Subbank must be precharged before use
      • Case 1 (best): subbank decode and precharge happen before more complex word-line decode, therefore no penalty.
      • Case 2 (worst): add additional pipeline stage for precharge
          • One cycle increase in branch misprediction penalty
  • Focus on I-Cache because any latency increase can be partly hidden by branch prediction

I-Cache Subbank Deactivation
  [Figures: leakage energy saving and total energy saving at the 70nm process, per benchmark (comp, gcc, go, jpeg, li, m88k, perl, vort, avg), under pessimistic and optimistic predictions; leakage energy saving and total energy saving across the 180nm, 130nm, 100nm, and 70nm processes. Vertical axes: Percentage (%).]
  • Case 2 (worst) assumption (adding additional pipeline stage)
  • 2.5% IPC decrease on average

Multiported Regfile Cell
  • 8R, 4W unbalanced DVT reg cell
  [Figure: register cell with READ[0:7], WRITE[0:3], WRITEB[0:3], WWL[0:3], RWL[0:7]; x4, x4, x8. HVT transistors: green-colored.]
  • Simplified but active/leakage power-aware baseline

LBB for Multiported Regfiles
  • LBB for Multiported Regfiles: Turn off the precharge transistor on idle subbank read ports
  • Leakage current discharges bitlines to 0 if any bits are holding 1.
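
The breakeven-time metric used for the sleep-energy comparisons can be illustrated with a small calculation. The sketch below assumes a deliberately simple linear model that is not taken from the slides: moving in and out of sleep costs one fixed transition energy, and leakage per cycle is constant in both the original and the sleeping state. The function names and the numbers in the usage lines are hypothetical.

```python
import math


def cumulative_energy_pj(sleep_cycles, leak_pj_per_cycle, transition_pj=0.0):
    """Energy spent over a sleep period: fixed transition cost plus leakage."""
    return transition_pj + leak_pj_per_cycle * sleep_cycles


def breakeven_cycles(transition_pj, orig_leak_pj_per_cycle, sleep_leak_pj_per_cycle):
    """Sleep length at which the deactivated circuit starts saving energy.

    Assumed linear model (not from the slides): solve
    transition + sleep_leak * t == orig_leak * t for t.
    """
    saving_per_cycle = orig_leak_pj_per_cycle - sleep_leak_pj_per_cycle
    if saving_per_cycle <= 0:
        return math.inf  # sleeping never pays off in this model
    return transition_pj / saving_per_cycle


# Hypothetical numbers, chosen only to exercise the functions.
t_star = breakeven_cycles(transition_pj=8.0,
                          orig_leak_pj_per_cycle=0.5,
                          sleep_leak_pj_per_cycle=0.25)
```

Under this model, a sleep period shorter than `t_star` costs more energy than staying active, which is why a low transition energy matters as much as a low steady-state sleep leakage.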

Dead Register Deactivation
  • Horizontal technique
  • Dead registers = Registers in free list
  • If all registers in a subbank are dead, all read ports in the subbank are turned off by LBB
  • No performance penalty since there is ample time to reprecharge between allocation and write.
  [Figure: Subbank 1 with Readport 0, Readport 1, Readport 2]

NMOS Sleep Transistor (NST)
  • Alternative horizontal DDFT
  • To turn off dead registers using NMOS sleep transistors (NST)
  • Advantage: registers can be turned off individually
  • Disadvantage: increased read access time
      • Set delay penalty to 5% (tradeoff between delay and leakage)
  [Figure: Register 1 with Readport 0, Readport 1, Readport 2]
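
A minimal sketch of the dead-subbank check described above: a subbank's read ports may be turned off only when every register in it is on the free list. The contiguous register-to-subbank mapping and the function name are assumptions made for illustration, not details from the slides.

```python
def dead_subbanks(free_list, num_regs, regs_per_subbank):
    """Return the indices of subbanks whose registers are all on the free list.

    Assumption (hypothetical mapping): registers are numbered 0..num_regs-1
    and subbank b holds registers b*regs_per_subbank .. (b+1)*regs_per_subbank-1.
    """
    if num_regs % regs_per_subbank != 0:
        raise ValueError("num_regs must be a multiple of regs_per_subbank")
    free = set(free_list)
    dead = []
    for bank in range(num_regs // regs_per_subbank):
        start = bank * regs_per_subbank
        # A single live register keeps the whole subbank active.
        if all(reg in free for reg in range(start, start + regs_per_subbank)):
            dead.append(bank)
    return dead
```

With NST, by contrast, each register on the free list could be turned off on its own, so no per-subbank check would be needed.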

Idle Readport Deactivation
  • Vertical technique
  • Idle read ports when fewer than max # of instructions are issued in a superscalar machine
  • Idle read ports deactivated by LBB
  • No performance penalty since it is known whether a read port is needed before it is known which register will be accessed in the pipeline.
  [Figure: Readport 0, Readport 1, Readport 2]

Comparison of DDFTs
  32 x 32-b Regfile subbank (75% zero assumed. Optimistic leakage current used.)

  Process Tech. (nm)   Original (uW)   SV steady-state (uW)   LBB steady-state (uW)   NST steady-state (uW)
  180                  177.9           2.0                    2.0                     1.8
  130                  214.1           2.4                    2.4                     2.2
  100                  263.6           3.0                    3.0                     2.7
  70                   276.7           3.1                    3.1                     2.9

  [Figures: 180nm and 70nm. Vertical axis: Energy (pJ). Horizontal axis: Length of Sleep (cycles), 0 to 1000. Curves: Original, Sleep Vector, NMOS Sleep Transistor, Leakage-Biased Bitlines.]

Comparison of DDFTs
  [Figure: Blowup: 70nm. Vertical axis: Energy (pJ). Horizontal axis: Length of Sleep (cycles), 0 to 50. Curves: Original, Sleep Vector, NMOS Sleep Transistor, Leakage-Biased Bitlines.]

Dead Register/Subbank Deactivation Policies
  • Free list policies for NST (NMOS Sleep Transistor): queue and stack
      • queue: conventional
      • stack: keeps some regs dead for longer
          • 2.4-10% greater savings than queue at 70nm
          • Benefit increases as feature sizes shrink
  • Subbank allocation policy for LBB: stack
      • Allocate a new subbank only when the previous bank is empty of dead registers

Dead Reg Deactivation (Horizontal)
  [Figures: leakage energy savings and total energy savings at the 70nm process, per benchmark (comp, gcc, go, jpeg, li, m88k, perl, vort, avg); leakage energy savings and total energy savings versus process (180, 130, 100, 70 nm). Vertical axes: percent (%). Series: NST Queue, NST Stack, LBB 16 regs/bank, LBB 8 regs/bank. Colored: optimistic. White: pessimistic.]
  • NST stack better than NST queue, LBB stack better than either NST

Read Port Deactivation (Vertical)
  [Figures: leakage energy saving and total energy saving at the 70nm process, per benchmark (comp, gcc, go, jpeg, li, m88k, perl, vort, avg), under pessimistic and optimistic predictions; leakage energy saving and total energy saving across the 180nm, 130nm, 100nm, and 70nm processes. Vertical axes: Percentage (%).]
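
The idle-port count behind vertical deactivation can be sketched as follows. The example assumes the 8-read-port cell from the regfile cell slide, that each issued instruction claims one read port per source operand, and that ports are assigned in order starting from port 0. The assignment rule and the function name are hypothetical, not taken from the slides.

```python
def idle_read_ports(source_operands_per_instr, total_ports=8):
    """Return the read ports left unused by the instructions issued this cycle.

    source_operands_per_instr: one entry per issued instruction, giving how
    many register source operands it reads (assumed 0, 1, or 2).
    Assumption: ports are handed out in order, so the idle ports are the
    highest-numbered ones.
    """
    needed = sum(source_operands_per_instr)
    if needed > total_ports:
        raise ValueError("more source operands than read ports")
    return list(range(needed, total_ports))


# Two instructions issued, reading 2 and 1 registers: ports 3..7 can float.
idle_now = idle_read_ports([2, 1])
```

Because the number of issued instructions is known before the register indices are decoded, a check like this can run early enough to re-precharge any port that is about to be used.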

  • More energy saving for wider issue processors
  • Readport deactivation can be combined with dead subbank deactivation.

Conclusion
  • Most leakage power is in critical paths
      • Dynamic leakage reduction (DDFT) desired
  • LBB allows fine-grain dynamic leakage reduction with zero or minimal performance penalty.
      • 0% performance penalty for multiported regfiles
  • Sleep time can be improved by changing micro-architectural scheduling policies.
      • Stack better than queue for free list policy
  • Follow-on work: Leakage-biased domino logic to save leakage power in critical ALUs [VLSI Symposium 2002]

Acknowledgments
  • Thanks to Christopher Batten, Ronny Krashinsky, Rajesh Kumar, and anonymous reviewers
  • Funded by DARPA PAC/C award F3060200-2-0562, NSF CAREER award CCR0093354, and a donation from Infineon Technologies.

DDFT Examples
  Steady-state leakage power
      Body Biasing: Less than 5% (depends on Vbody)
      Power Gating: Less than 5% (depends on sleep transistor)
      Sleep Vector: Less than 50% (depends on the circuit)
  Transition time, wakeup latency
      Body Biasing: 0.1~100us
      Power Gating: Less than a cycle
      Sleep Vector: Less than a cycle
  Transition energy, breakeven time
      Body Biasing: Well cap switching energy
      Power Gating: Sleep transistor gate cap switching energy
      Sleep Vector: Active energy consumed due to spurious toggling after sleep vector
  Delay impact
      Body Biasing: No
      Power Gating: Yes. Due to sleep transistor
      Sleep Vector: Yes. Due to mux
  Etc
      Power Gating: Area for sleep transistor and virtual supplies
      Sleep Vector: Finding sleep vector is hard
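
The conclusion's point that a stack free list beats a queue can be illustrated with a toy simulation. The workload below is hypothetical and not from the slides: one register is allocated per step, and once `live_target` registers are live the oldest one is freed. Both policies accumulate the same total dead time; the stack concentrates it in a few long intervals, which is what matters once a breakeven time is taken into account.

```python
from collections import deque


def dead_intervals(policy, num_regs, live_target, steps):
    """Return the length, in steps, of every dead interval under a policy.

    policy: "queue" (FIFO free list) or "stack" (LIFO free list).
    Hypothetical workload: allocate one register per step and free the
    oldest live register whenever more than live_target are live.
    """
    if policy not in ("queue", "stack"):
        raise ValueError("policy must be 'queue' or 'stack'")
    free = deque(range(num_regs))            # all registers start dead
    dead_since = {reg: 0 for reg in free}    # step at which each became dead
    live = deque()                           # live registers, oldest first
    intervals = []
    for t in range(steps):
        # Queue takes the longest-dead register, stack the most recently freed.
        reg = free.popleft() if policy == "queue" else free.pop()
        intervals.append(t - dead_since.pop(reg))
        live.append(reg)
        if len(live) > live_target:
            old = live.popleft()
            free.append(old)
            dead_since[old] = t + 1
    # Close the intervals of registers that are still dead at the end.
    intervals.extend(steps - since for since in dead_since.values())
    return intervals


def dead_time_beyond(intervals, breakeven):
    """Total dead time spent in intervals at least `breakeven` steps long."""
    return sum(length for length in intervals if length >= breakeven)


queue_iv = dead_intervals("queue", num_regs=8, live_target=4, steps=100)
stack_iv = dead_intervals("stack", num_regs=8, live_target=4, steps=100)
```

Comparing `dead_time_beyond(queue_iv, 10)` with `dead_time_beyond(stack_iv, 10)` shows the effect: under the queue every register is recycled before it has been dead for long, while under the stack the registers at the bottom stay dead for the whole run.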
