Reconfigurable Microprocessors Lih Wen Koh 05s1 COMP4211 presentation


18 May 2005

Presentation Overview
  • Current Research Direction
  • Related Work
  • Experiments
  • What Next?

Current Research Direction
  • Wide superscalar, out-of-order execution processor cores exploit ILP.
  • But true data dependencies are inherent in application programs.
  • MIPS R10k, NetBurst, AMD etc. use a bypass network to forward just-computed results, allowing back-to-back issue of dependent instructions.
  • The complexity of the bypass network grows quadratically w.r.t. issue width.

Hardware Components of MIPS R10000 [Yeager96]
  • Fetch: Instruction Predecode, Branch History Table, Instruction Cache, Instruction TLB
  • Decode: Active List (32 entries), Instruction Decode, Register Map Tables (1 for Int, 1 for FP), Free Register Lists (1 for Int, 1 for FP)
  • Issue: Integer Queue (16 entries), Mem Queue (16 entries), FP Queue (16 entries)
  • Write: Integer Registers / Bypass (64 x 64 bits), FP Registers / Bypass (64 x 64 bits)
  • Execute: Address + Data TLB, ALU1, ALU2, FP +, FP *, Data Cache
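
The quadratic growth of the bypass network noted under Current Research Direction can be illustrated with a simple counting model. This is only a sketch under stated assumptions: every result produced in a cycle may be forwarded to every operand input of every functional unit, each unit has two operand inputs, and results stay forwardable for a fixed number of stages. The function name and parameters are hypothetical and do not come from the slides.

```python
def bypass_paths(issue_width, bypass_stages=1, operands_per_unit=2):
    """Count result-to-operand forwarding paths in a full bypass network.

    Assumed model (not from the slides): one functional unit per issue slot,
    each with `operands_per_unit` inputs, and results that remain
    forwardable for `bypass_stages` pipeline stages.
    """
    producers = issue_width * bypass_stages       # results that may need forwarding
    consumers = issue_width * operands_per_unit   # operand inputs that may receive one
    return producers * consumers


# Doubling the issue width quadruples the number of paths.
for width in (2, 4, 8):
    print(width, bypass_paths(width))
```

Under these assumptions the path count is proportional to the square of the issue width, which is the growth the slide refers to.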

Current Research Direction
  • Observation 1: Multi-cycle broadcast
      Wire delays are accounted for in Intel NetBurst.
      Allows a higher processor clock frequency at the cost of reduced IPC.
  • Observation 2: The FP execution unit is idle most of the time, even in FP-intensive applications (5-10%) [Sassone04]

[Figure: Proportion of Functional Unit Type Requested (0% to 100%), one chart for SPEC2000 applications and one for MediaBench applications. Legend: Int_ALU, Int_MULT/DIV, FP_ALU, FP_MULT/DIV, Rd/Wr Ports. SPEC2000 applications shown: 164.gzip (graphic, log, program, random and source inputs), 176.gcc, 181.mcf, 197.parser, 168.wupwise, 171.swim, 173.applu, 179.art, 183.equake, 188.ammp, 301.apsi.]

Literature Survey
  • [Epalza04]: Dynamic Reallocation of Functional Units in Superscalar Processors
  • Switches the execution mode of idle floating-point units so that they act as four additional integer ALUs.
  • The added bypass networks add 1 cycle of latency to the modified FPU.
  • 19% speedup for SPECint2000
  • 3.5% speedup for SPECfp2000
  • Issues: Need to improve control for mode switching

Plans
Other patterns:

[Figure: candidate dependence patterns built from producer nodes (x, x1, x2, x3), consumer nodes (y, y1, y2, y3) and second-level consumers (z, z1, z2). Each node is annotated with its possible input sources: the register file or an earlier node in the pattern. For example, a node y fed by x1, x2 and x3 may take an input from the register file or from node x1, x2 or x3; in a chain x, y, z, the second input of node z may come from the register file, node x or node y.]

Related Work
  • [Palacharla97] Dependence-based (FIFO queues + clustered execution units)

Related Work
  • Extension to rePLay framework [Yehia04]

Experiment: Chaining pairs of dependent instructions
  • [Intel01] Double-speed ALUs
  • [Vassiliadis96] 3-1 Interlock Collapsing ALUs

[Figure: chained ALU. Operands come from the register file. A normal integer ALU (4 stages carry-lookahead adder, 1 stage logic operations) produces the result of the first instruction in the dependent sequence. A 3-1 interlock collapsing ALU (control, 1 stage mux, 4 stages carry-save adder and logic operations, carry-lookahead adder + logic operations) produces the result of the second instruction in the dependent sequence.]

Experiment: Chaining pairs of dependent instructions
ruu_fetch()

    Instruction Fetch Queue (IFQ)
ruu_dispatch()
    Register Update Unit (RUU); F_MEM instructions also enter the Load/Store Queue (LSQ)
    Ready Queue, entered once operands are ready (RUU) or the EA is ready (LSQ)
ruu_issue()
    Issue if the requested functional unit is not busy: IntALUs, Int Mult/Div, Rd/Wr Ports, FP Adders, FP Mult/Div, Chained ALU
    Event Queue
ruu_writeback()
    Instruction WriteBack (Broadcast/Bypass Logic)
    Branch misprediction? If so, recover.
ruu_commit()
    Instruction Commit

Modifications to sim-outorder for SimpleScalar PISA, in ruu_issue():
    if the requested functional unit is IntALU
       && the list of in-flight instructions waiting only on the result of this instruction is non-empty
       && the chained ALU is not busy
    => schedule this instruction and the first obtained dependent instruction to the chained ALU

Experiment: Chaining pairs of dependent instructions
  • 2 CIALUs are sufficient.
  • IPC improvement of ~8%, solely due to savings of broadcast cycles.
  • Reduces utilization of IALUs by ~50%.
  • Reduces up to 45% of queue entries waiting for a result.
  • Up to 25% speedup when broadcast cycles = 4.

[Figure: Speedup on IPC for MediaBench Applications (fetch-decode-issue-commit width = 8, ruu:size = 32, #ialu = 8, #cialu = 2). Y-axis: 0% to 25%. Series: broadcast_delay = 1, 2, 3, 4.]

[Figure: Speedup on IPC for MediaBench Applications (fetch-decode-issue-commit width = 8, ruu:size = 32, broadcast delay = 1 cycle). Y-axis: 0% to 25%. Series: every combination of #IntALU = 2, 4, 8 with #CIntALU = 1, 2, 3, 4. X-axis: MediaBench applications, including rawcaudio, rawdaudio, epic, unepic, cjpeg, djpeg and mpeg2encode.]
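
The chaining rule added to ruu_issue() in the experiment above can be sketched in a few lines. This is a minimal illustration, not the actual sim-outorder patch: the `Instr` class, the `sole_waiters` list, the `issue` function and the `free_units` dictionary are all hypothetical names introduced here, and the real simulator tracks dependences and resources differently.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    fu: str  # requested functional unit class, e.g. "IntALU"
    # Hypothetical bookkeeping: in-flight instructions that are waiting
    # only on the result of this instruction.
    sole_waiters: list = field(default_factory=list)


def issue(instr, chained_alu_busy, free_units):
    """Return (unit, instructions scheduled on it) for one ready instruction."""
    # Chaining rule from the slide: IntALU request, at least one dependent
    # waiting only on this result, and the chained ALU is free.
    if instr.fu == "IntALU" and instr.sole_waiters and not chained_alu_busy:
        dependent = instr.sole_waiters[0]  # "first obtained" dependent
        return ("ChainedALU", [instr, dependent])
    # Otherwise fall back to the normal rule: issue if the unit is not busy.
    if free_units.get(instr.fu, 0) > 0:
        return (instr.fu, [instr])
    return (None, [])


# Usage: sub depends only on add, so the pair goes to the chained ALU.
add = Instr("add", "IntALU")
sub = Instr("sub", "IntALU")
add.sole_waiters.append(sub)
unit, scheduled = issue(add, chained_alu_busy=False, free_units={"IntALU": 1})
print(unit, [i.name for i in scheduled])
```

Issuing the pair together is what saves the broadcast cycles: the dependent instruction no longer waits for the first result to be broadcast before it can be selected.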

What Next?
  • Chaining sequences of 3 dependent instructions, and other patterns out of the 80.
  • Architectural impact of adding chained units: complexity of the local bypass network etc.
  • Replace chained units by xALUs converted from the CSA trees in an FP multiply/divide unit.
      Need to explore the hardware circuits of FP multiply/divide.
  • Develop an adaptive configuration scheme to best match the interconnections of the swappable xALUs to the patterns of in-flight instructions.
      Need to determine the most frequent subset of patterns.

References
  [Vassiliadis96] James Phillips and Stamatis Vassiliadis. High-Performance 3-1 Interlock Collapsing ALUs.
  [Yeager96] Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 1996.
  [Palacharla97] Subbarao Palacharla, Norman P. Jouppi, J.E. Smith. Complexity-Effective Superscalar Processors. ISCA 1997.
  [Intel01] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, Q1 2001.
  [Epalza04] Marc Epalza, Paolo Ienne, Daniel Mlynek. Dynamic Reallocation of Functional Units in Superscalar Processors. 9th Asia-Pacific Computer Systems Architecture Conference (ACSAC), 2004.
  [Yehia04] Sami Yehia and Olivier Temam. From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance without ILP or Speculation.
  [Sassone04] Peter G. Sassone and D. Scott Wills. Multicycle Broadcast Bypass: Too Readily Overlooked. Proceedings of the Workshop on Complexity-Effective Design (WCED), May 2004.

Thank You

Overview of Research Topic
Goal of this research: investigate the feasibility and potential benefit of effective, automated runtime compilation and execution of software binaries on reconfigurable microprocessors.

[Figure: software binaries running on a reconfigurable microprocessor (superscalar processor + reconfigurable logic), annotated with the following steps.]
  • Software binaries initially execute only on the superscalar processor.
  • Monitor the execution of the coupled system.
  • Profile committed instructions to identify critical code regions.
  • Identify and extract suitable instructions from critical code regions for collapsing into complex, atomic instructions.
  • Assembly-to-hardware mapping of collapsed instructions.
  • On the next execution of the transformed critical region, load the configuration for the reconfigurable logic.
  • Transfer execution from the superscalar processor to the reconfigurable unit.
  • Continue normal execution of binary code following the transformed critical code region.

Motivations
  • Improved execution performance by exploiting parallelism and redundancy in hardware.
  • Adaptation of hardware resources based on the dynamic behaviour of programs.
  • Availability of a runtime profile allows exploitation of runtime optimizations otherwise difficult to exploit at compile time.
  • Compilation at the binary level allows execution of legacy software binaries.
  • Runtime compilation allows transparent migration of software code to hardware.
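
The profiling step in the research overview (profile committed instructions to identify critical code regions) could be prototyped along the following lines. This is only a sketch under assumptions that are not in the slides: a "region" is taken to be a fixed-size, aligned block of instruction addresses, and a region is called critical when it accounts for at least a chosen share of committed instructions. The names `region_of_pc`, `find_critical_regions`, `region_bytes` and `min_share` are hypothetical.

```python
from collections import Counter


def region_of_pc(pc, region_bytes=256):
    """Map an instruction address to the start of its aligned region (assumed scheme)."""
    return pc - (pc % region_bytes)


def find_critical_regions(committed_pcs, min_share=0.10, region_bytes=256):
    """Return (region_start, share) pairs for regions with at least
    `min_share` of all committed instructions, most frequent first."""
    counts = Counter(region_of_pc(pc, region_bytes) for pc in committed_pcs)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (region, n / total)
        for region, n in counts.most_common()
        if n / total >= min_share
    ]


# Usage: a toy commit trace dominated by a loop around address 0x100.
trace = [0x100] * 6 + [0x104] * 2 + [0x200] * 2
for region, share in find_critical_regions(trace, min_share=0.5):
    print(hex(region), share)
```

A real system would more likely profile at the granularity of basic blocks or loops; the fixed-size region here only keeps the example short.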
