Methods - Penn Engineering


Simulating a $2M Commercial Server on a $2K PC
Alaa Alameldeen, Milo Martin, Carl Mauer, Kevin Moore, Min Xu, Daniel Sorin, Mark D. Hill, & David A. Wood
Multifacet Project, Computer Sciences Department, University of Wisconsin-Madison
February 2003
(C) 2003 Multifacet Project, University of Wisconsin-Madison

Summary
  Context
    Commercial server design is important
    Multifacet project seeks improved designs
    Must evaluate alternatives
  Commercial Servers
    Processors, memory, disks: $2M
    Run large multithreaded transaction-oriented workloads
    Use commercial applications on commercial OS
  To Simulate on $2K PC
    Scale & tune workloads: keep L2 miss rates, etc.
    Manage simulation complexity: separate timing & function
    Cope with workload variability: use randomness & statistics

Outline
  Context
    Commercial Servers
    Multifacet Project
  Workload & Simulation Methods
  Separate Timing & Functional Simulation
  Cope with Workload Variability
  Summary

Why Commercial Servers?
  Many (Academic) Architects
    Desktop computing
    Wireless appliances
  We focus on servers

    (Important Market)
    Performance Challenges
    Robustness Challenges
    Methodological Challenges

3-Tier Internet Service (diagram)
  PCs w/ soft state
  LAN / SAN
  Servers running applications for business rules
  LAN / SAN
  Servers running databases for hard state
  Label: Multifacet Focus

Multifacet: Commercial Server Design
  Wisconsin Multifacet Project
    Directed by Mark D. Hill & David A. Wood
    Sponsors: NSF, WI, Compaq, IBM, Intel, & Sun
    Current Contributors: Alaa Alameldeen, Brad Beckman, Nikhil Gupta, Pacia Harper, Jarrod Lewis, Milo Martin, Carl Mauer, Kevin Moore, Daniel Sorin, & Min Xu
    Past Contributors: Anastassia Ailamaki, Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, & Anne Condon
  Analysis
    Want 4-64 processors
    Many cache-to-cache misses
    Neither snooping nor directories ideal
  Multifacet Designs
    Snooping w/ multicast [ISCA99] or unordered network [ASPLOS00]
    Bandwidth-adaptive [HPCA02] & token coherence [ISCA03]

Outline
  Context
  Workload & Simulation Methods

    Select, scale, & tune workloads
    Transition workload to simulator
    Specify & test the proposed design
    Evaluate design with simple/detailed processor models
  Separate Timing & Functional Simulation
  Cope with Workload Variability
  Summary

Multifacet Simulation Overview (diagram)
  Components: Full Workloads; Commercial Server (Sun E6000); Scaled Workloads; Memory Protocol Generator (SLICC); Pseudo-Random Protocol Checker; Full System Functional Simulator (Simics); Memory Timing Simulator (Ruby); Processor Timing Simulator (Opal)
  Groupings: Workload Development; Protocol Development; Timing Simulator
  Virtutech Simics; rest is Multifacet software

Select Important Workloads
  (diagram: Full Workloads)
  Online Transaction Processing: DB2 w/ TPC-C-like
  Java Server Workload: SPECjbb
  Static web content serving: Apache
  Dynamic web content serving: Slashcode
  Java-based Middleware: (soon)

Setup & Tune Workloads (on real hardware)
  (diagram: Full Workloads, Commercial Server (Sun E6000))
  Tune workload, OS parameters
  Measure transaction rate, speed-up, miss rates, I/O
  Compare to published results

Scale & Re-tune Workloads
  (diagram: Commercial Server (Sun E6000), Scaled Workloads)
  Scale down for PC memory limits
  Retain similar behavior (e.g., L2 cache miss rate)
  Re-tune to achieve higher transaction rates (OLTP: raw disk, multiple disks, more users, etc.)

Transition Workloads to Simulation
  (diagram: Scaled Workloads, Full System Functional Simulator (Simics))
  Create disk dumps of tuned workloads
  In simulator: boot OS, start, & warm application
  Create Simics checkpoint (snapshot)

Specify Proposed Computer Design
  (diagram: Memory Protocol Generator (SLICC), Memory Timing Simulator (Ruby))
  Coherence Protocol (control tables: states X events)
  Cache Hierarchy (parameters & queues)
  Interconnect (switches & queues)
  Processor (later)
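A "states X events" control table can be illustrated with a small sketch. The Python below is not SLICC syntax and is not the Multifacet protocol; it is a toy MSI-style table with hypothetical state, event, and action names, and it omits the transient states a real protocol needs. It only shows the idea of looking up (state, event) to get actions and a next state.

```python
from enum import Enum


class State(Enum):
    I = "Invalid"
    S = "Shared"
    M = "Modified"


class Event(Enum):
    LOAD = "Load"
    STORE = "Store"
    OTHER_GETS = "OtherGetS"  # another cache requests a readable copy
    OTHER_GETX = "OtherGetX"  # another cache requests a writable copy


# Control table: (state, event) -> (list of actions, next state).
# Action names are illustrative labels only.
TABLE = {
    (State.I, Event.LOAD): (["issue_GetS"], State.S),
    (State.I, Event.STORE): (["issue_GetX"], State.M),
    (State.I, Event.OTHER_GETS): ([], State.I),
    (State.I, Event.OTHER_GETX): ([], State.I),
    (State.S, Event.LOAD): (["hit"], State.S),
    (State.S, Event.STORE): (["issue_GetX"], State.M),
    (State.S, Event.OTHER_GETS): ([], State.S),
    (State.S, Event.OTHER_GETX): (["invalidate"], State.I),
    (State.M, Event.LOAD): (["hit"], State.M),
    (State.M, Event.STORE): (["hit"], State.M),
    (State.M, Event.OTHER_GETS): (["send_data", "writeback"], State.S),
    (State.M, Event.OTHER_GETX): (["send_data", "invalidate"], State.I),
}


def step(state, event):
    """Look up one table entry and return (actions, next_state)."""
    return TABLE[(state, event)]


def missing_entries():
    """List (state, event) pairs the table does not cover."""
    return [(s, e) for s in State for e in Event if (s, e) not in TABLE]
```

Keeping the protocol as data makes it easy to check mechanically that every state/event pair is covered, which is one reason a table-driven specification is convenient.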

Test Proposed Computer Design
  (diagram: Pseudo-Random Protocol Checker, Memory Timing Simulator (Ruby))
  Randomly select write action & later read check
  Massive false-sharing for interaction
  Perverse network stresses design
  Transient error & deadlock detection
  Sound but not complete
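The "random write action, later read check" idea can be sketched as follows. This is a minimal single-threaded illustration, not the Multifacet checker: `ToyMemory` is a hypothetical stand-in for the memory system under test, and its `drop_every` parameter injects an artificial bug so the checker has something to find. False sharing and network perturbation are not modeled.

```python
import random


class ToyMemory:
    """Hypothetical stand-in for the memory system under test."""

    def __init__(self, drop_every=0):
        self.data = {}
        self.drop_every = drop_every  # injected bug: lose every Nth write
        self.writes = 0

    def write(self, addr, value):
        self.writes += 1
        if self.drop_every and self.writes % self.drop_every == 0:
            return  # buggy path: the write is silently lost
        self.data[addr] = value

    def read(self, addr):
        return self.data.get(addr, 0)


def random_test(memory, steps=10_000, num_addrs=8, seed=0):
    """Return None if no mismatch was seen, else a description of the first one."""
    rng = random.Random(seed)
    expected = {}
    for i in range(steps):
        addr = rng.randrange(num_addrs)  # few addresses, so they are reused often
        if rng.random() < 0.5:
            value = i + 1  # unique, non-zero value per write
            memory.write(addr, value)
            expected[addr] = value
        else:
            got, want = memory.read(addr), expected.get(addr, 0)
            if got != want:
                return f"step {i}: addr {addr} read {got}, expected {want}"
    return None
```

As on the slide, a reported mismatch is a real error (sound), but a clean run does not prove the design correct (not complete).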

Simulate with Simple Blocking Processor
  (diagram: Scaled Workloads, Full System Functional Simulator (Simics), Memory Timing Simulator (Ruby))
  Warm-up caches or sometimes sufficient (SafetyNet)
  Run for fixed number of transactions
    Some transaction partially done at start
    Other transactions partially done at end
  Cope with workload variability (later)

Simulate with Detailed Processor
  (diagram: Scaled Workloads, Full System Functional Simulator (Simics), Memory Timing Simulator (Ruby), Processor Timing Simulator (Opal))
  Accurate (future) timing & (current) function
  Simulation complexity decoupled (discussed soon)
  Same transaction methodology & work variability issues

Simulation Infrastructure & Workload Process
  (diagram: Full Workloads, Commercial Server (Sun E6000), Scaled Workloads, Memory Protocol Generator (SLICC), Pseudo-Random Protocol Checker, Full System Functional Simulator (Simics), Memory Timing Simulator (Ruby), Processor Timing Simulator (Opal))
  Select important workloads: run, tune, scale, & re-tune
  Specify system & pseudo-randomly test
  Create warm workload checkpoint
  Simulate with simple or detailed processor
  Fixed #transactions, manage simulation complexity (next), cope with workload variability (next next)

Outline
  Context
  Simulation Infrastructure & Workload Process
  Separate Timing & Functional Simulation

    Simulation Challenges
    Managing Simulation Complexity
    Timing-First Simulation
    Evaluation
  Cope with Workload Variability
  Summary

Challenges to Timing Simulation
  Execution-driven simulation is getting harder
  Micro-architecture complexity
    Multiple in-flight instructions
    Speculative execution
    Out-of-order execution
  Thread-level parallelism
    Hardware Multi-threading
    Traditional Multi-processing

Challenges to Functional Simulation
  Commercial workloads have high functional fidelity demands
  (diagram: application complexity vs. simulated target system)
  Target Application: Kernels, SPEC Benchmarks, Web Server, Database, Operating System
  (Simulated) Target System: Processor, RAM, MMU, Status Registers, Real Time Clock, Serial Port, Terminal, I/O MMU Controller, DMA Controller, IRQ Controller, PCI Bus, Graphics Card, Ethernet Controller, CDROM, SCSI Controller, SCSI Disk, SCSI Disk, Fiber Channel Controller

Managing Simulator Complexity
  Timing and Functional Simulator Integrated (SimOS)
    - Complex
  Functional-First (Trace-driven)
    Functional Simulator, Timing Simulator
    - Timing feedback
  Timing-Directed
    Timing Simulator (Complete Timing, No? Function), Functional Simulator (No Timing, Complete Function)
    + Timing feedback
    - Tight Coupling
    - Performance?
  Timing-First (Multifacet)
    Timing Simulator (Complete Timing, Partial Function), Functional Simulator (No Timing, Complete Function)
    + Timing feedback
    + Using existing simulators
    + Software development advantages

Timing-First Simulation
  Timing Simulator
    does functional execution of user and privileged operations
    does speculative, out-of-order multiprocessor timing simulation
    does NOT implement functionality of full instruction set or any devices
  Functional Simulator
    does full-system multiprocessor simulation
    does NOT model detailed micro-architectural timing
  (diagram: Timing Simulator with CPU, Cache, Network, and add/load instructions flowing through Execute and Commit; Functional Simulator with CPU, System, RAM; Verify step at Commit)

Timing-First Operation
  As instruction retires, step CPU in functional simulator
  Verify instruction's execution
  Reload state if timing simulator deviates from functional
    Loads in multi-processors
    Instructions with unidentified side-effects
    NOT loads/stores to I/O devices
  (diagram: same as previous slide, with a Reload path from the Functional Simulator back to the Timing Simulator)

Benefits of Timing-First
  Supports speculative multi-processor timing models
  Leverages existing simulators
  Software development advantages
    Increases flexibility and reduces code complexity
    Immediate, precise check on timing simulator
  However:
    How much performance error is introduced in this approach?
    Are there simulation performance penalties?
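The retire / step / verify / reload loop described on the Timing-First Operation slide can be sketched in a few lines. This is a toy illustration under stated assumptions, not TFsim: the three-instruction ISA, the cycle costs, and `RELOAD_PENALTY` are all hypothetical, and the deviation is caused by an instruction (`popc`) that the timing model deliberately does not implement.

```python
RELOAD_PENALTY = 10  # assumed cost, in cycles, of reloading state after a deviation


class FunctionalSim:
    """Complete function, no timing (toy stand-in for a full-system simulator)."""

    def __init__(self, program, memory):
        self.program, self.mem = program, dict(memory)
        self.regs, self.pc = [0] * 4, 0

    def step(self):
        op, *args = self.program[self.pc]
        if op == "addi":
            rd, rs, imm = args
            self.regs[rd] = self.regs[rs] + imm
        elif op == "load":
            rd, addr = args
            self.regs[rd] = self.mem.get(addr, 0)
        elif op == "popc":  # population count
            rd, rs = args
            self.regs[rd] = bin(self.regs[rs]).count("1")
        self.pc += 1


class TimingSim:
    """Complete timing, partial function: 'popc' is not implemented here."""

    def __init__(self, program, memory):
        self.program, self.mem = program, dict(memory)
        self.regs, self.pc, self.cycles = [0] * 4, 0, 0

    def retire(self):
        op, *args = self.program[self.pc]
        if op == "addi":
            rd, rs, imm = args
            self.regs[rd] = self.regs[rs] + imm
            self.cycles += 1
        elif op == "load":
            rd, addr = args
            self.regs[rd] = self.mem.get(addr, 0)
            self.cycles += 3
        else:
            self.cycles += 1  # unknown instruction: timed, but treated as a no-op
        self.pc += 1

    def reload(self, functional):
        self.regs, self.pc = list(functional.regs), functional.pc
        self.cycles += RELOAD_PENALTY


def run_timing_first(program, memory):
    functional, timing = FunctionalSim(program, memory), TimingSim(program, memory)
    deviations = 0
    while timing.pc < len(program):
        timing.retire()    # timing simulator runs ahead and retires an instruction
        functional.step()  # functional simulator executes the same instruction
        if (timing.regs, timing.pc) != (functional.regs, functional.pc):
            deviations += 1
            timing.reload(functional)  # functional state is the reference
    return timing.regs, timing.cycles, deviations


PROGRAM = [("addi", 0, 0, 7), ("popc", 1, 0), ("load", 2, 100), ("addi", 3, 1, 1)]
```

Calling `run_timing_first(PROGRAM, {100: 42})` ends with the timing simulator's registers matching the functional simulator's, one recorded deviation, and the reload penalty folded into the cycle count.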

Evaluation
  Our implementation, TFsim, uses:
    Functional simulator: Virtutech Simics
    Timing simulator: implemented in less than one person-year
  Evaluated using OS-intensive commercial workloads
    OS Boot: > 1 billion instructions of Solaris 8 startup
    OLTP: TPC-C-like benchmark using a 1 GB database
    Dynamic Web: Apache serving message board, using code and data similar to
    Static Web: Apache web server serving static web pages
    Barnes-Hut: Scientific SPLASH-2 benchmark

Measured Deviations
  Less than 20 deviations per 100,000 instructions (0.02%)
  (figure)

If the Timing Simulator Modeled Fewer Events
  (figure)

Analysis of Results
  Runs full-system workloads!
  Timing performance impact of deviations
    Worst case: less than 3% performance error
  Overhead of redundant execution
    18% on average for uniprocessors
    18% (2 processors) up to 36% (16 processors)
  (figure: Total Execution Time, split into Functional Simulator and Timing Simulator)

Performance Comparison
  Absolute simulation performance comparison, in kilo-instructions committed per second (KIPS)

                              RSIM                             TFsim                                   Comparison
  Target Application          SPLASH-2 Kernels                 SPLASH-2 Kernels                        match
  (Simulated) Target System   Out-of-Order MP SPARC V9         Out-of-Order MP Full-system SPARC V9    close
  Host Computer               400 MHz SPARC running Solaris    1.2 GHz Pentium running Linux           different

  RSIM Scaled: 107 KIPS
  Uniprocessor TFsim: 119 KIPS

Timing-First Conclusions
  Execution-driven simulators are increasingly complex
  How to manage complexity?
  Our answer: Timing-First Simulation
    Timing Simulator: Complete Timing, Partial Function
    Functional Simulator: No Timing, Complete Function
  Introduces relatively little performance error (worst case: 3%)
  Has low overhead (18% uniprocessor average)
  Rapid development time

Outline
  Context
  Workload Process & Infrastructure
  Separate Timing & Functional Simulation
  Cope with Workload Variability
    Variability in Multithreaded Workloads
    Coping in Simulation
    Examples & Statistics
  Summary

What is Happening Here?
  (figure: OLTP)

What is Happening Here?
  How can slower memory lead to faster workload?
  Answer: Multithreaded workload takes different path
    Different lock race outcomes
    Different scheduling decisions
  (1) Does this happen for real hardware?
  (2) If so, what should we do about it?

One Second Intervals (on real hardware)
  (figure: OLTP)

60 Second Intervals (on real hardware)
  (figure: OLTP)
  16-day simulation

Coping with Workload Variability
  Running (simulating) long enough not appealing
  Need to separate coincidental & real effects
  Standard statistics on real hardware
    Variation within base system runs vs. variation between base & enhanced system runs
  But deterministic simulation has no "within" variation
  Solution with deterministic simulation
    Add pseudo-random delay on L2 misses
    Simulate base (enhanced) system many times
    Use simple or complex statistics

Coincidental (Space) Variability
  (figure)
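The "add pseudo-random delay on L2 misses, then simulate many times" recipe can be sketched with a toy model. Everything here is an assumption made for illustration: `simulate` is not a real simulator, the lock race between two threads is a crude stand-in for the path divergence the slides describe, and the latencies and `max_perturb` range are made-up numbers.

```python
import random


def simulate(l2_latency, num_transactions=200, max_perturb=4, seed=0):
    """Toy 'simulator': returns average cycles per transaction.

    Each L2 miss costs l2_latency plus a pseudo-random 0..max_perturb cycles.
    Two threads race for a lock; who wins changes the path taken.
    """
    rng = random.Random(seed)
    cycles = 0
    for _ in range(num_transactions):
        a = sum(l2_latency + rng.randint(0, max_perturb) for _ in range(3))
        b = sum(l2_latency + rng.randint(0, max_perturb) for _ in range(3))
        if a <= b:
            cycles += a + 20               # thread A wins the lock race
        else:
            cycles += b + 20 + l2_latency  # thread B wins: one extra miss
    return cycles / num_transactions


def run_many(l2_latency, runs=20):
    """Simulate the same configuration many times with different seeds."""
    return [simulate(l2_latency, seed=s) for s in range(runs)]
```

With `max_perturb=0` every seed yields the same result, which mirrors the point that a deterministic simulation has no within-configuration variation; with the perturbation enabled, `run_many` produces a sample that standard statistics can be applied to.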

Wrong Conclusion Ratio
  WCR (16,32) = 18%
  WCR (16,64) = 7.5%
  WCR (32,64) = 26%

More Generally: Use Standard Statistics
  As one would for a measurement of a live system
  Confidence Intervals
    95% confidence intervals contain true value 95% of the time
    Non-overlapping confidence intervals give statistically significant conclusions
  Use ANOVA or Hypothesis Testing: even better!
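A small sketch of both ideas, under assumptions: the wrong conclusion ratio is taken here to be the fraction of single-run pairs (one run per configuration) whose ordering contradicts the ordering of the sample means, and the confidence interval uses a normal approximation (for only a handful of runs a Student-t quantile would give a wider interval). The sample values are made up and lower is assumed to be better (e.g., cycles per transaction).

```python
import math
import statistics
from itertools import product


def confidence_interval(samples, confidence=0.95):
    """Normal-approximation confidence interval for the mean of samples."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


def wrong_conclusion_ratio(base_runs, enhanced_runs):
    """Fraction of single-run comparisons that disagree with the comparison of means."""
    enhanced_better = statistics.mean(enhanced_runs) < statistics.mean(base_runs)
    wrong = sum(
        1 for b, e in product(base_runs, enhanced_runs) if (e < b) != enhanced_better
    )
    return wrong / (len(base_runs) * len(enhanced_runs))


# Made-up cycles-per-transaction samples for two configurations.
BASE = [100, 104, 98, 102, 101]
ENHANCED = [97, 103, 95, 99, 96]
```

For these made-up samples, a single-run comparison picks the wrong winner in 5 of the 25 pairings, and the two 95% intervals still overlap, so five runs per configuration would not be enough to claim a significant difference.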

Confidence Interval Example
  (figure: ROB)
  Estimate #runs to get non-overlapping confidence intervals

Also Time Variability (on real hardware)
  (figure: OLTP)
  Therefore, select checkpoint(s) carefully
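One rough way to estimate the number of runs needed for non-overlapping confidence intervals is sketched below. It assumes both configurations have the same standard deviation and the same number of runs, and it uses a normal approximation; the function name and the example numbers are hypothetical, not taken from the talk.

```python
import math
import statistics


def runs_needed(mean_a, mean_b, stdev, confidence=0.95):
    """Smallest n such that two intervals of half-width z*stdev/sqrt(n) do not overlap.

    Non-overlap requires 2 * z * stdev / sqrt(n) < |mean_a - mean_b|.
    """
    gap = abs(mean_a - mean_b)
    if gap == 0:
        raise ValueError("means are equal; intervals will always overlap")
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.floor((2 * z * stdev / gap) ** 2) + 1
```

Because n grows with the square of stdev/gap, halving the difference between two designs (or doubling the run-to-run variation) roughly quadruples the number of simulations required.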

Workload Variability Summary
  Variability is a real phenomenon for multi-threaded workloads
    Runs from same initial conditions are different
  Variability is a challenge for simulations
    Simulations are short
    Wrong conclusions may be drawn
  Our solution accounts for variability
    Multiple runs, confidence intervals
    Reduces wrong conclusion probability

Talk Summary
  Simulations of $2M Commercial Servers must
    Complete in reasonable time (on $2K PCs)
    Handle OS, devices, & multithreaded hardware
    Cope with variability of multithreaded software
  Multifacet
    Scale & tune transactional workloads
    Separate timing & functional simulation
    Cope w/ workload variability via randomness & statistics
  References
    Simulating a $2M Commercial Server on a $2K PC [Computer03]
    Full-System Timing-First Simulation [Sigmetrics02]
    Variability in Architectural Simulations [HPCA03]

Other Multifacet Methods Work
  Specifying & Verifying Coherence Protocols
    [SPAA98], [HPCA99], [SPAA99], & [TPDS02]
  Workload Analysis & Improvement
    Database systems [VLDB99] & [VLDB01]
    Pointer-based [PLDI99] & [Computer00]
    Middleware [HPCA03]
  Modeling & Simulation Methods
    Commercial workloads [Computer02] & [HPCA03]
    Decoupling timing/functional simulation [Sigmetrics02]
    Simulation generation [PLDI01]
    Analytic modeling [Sigmetrics00] & [TPDS TBA]
    Micro-architectural slack [ISCA02]

Backup Slides

One Ongoing/Future Methods Direction
  Middleware Applications
    Memory system behavior of Java Middleware [HPCA 03]
    Machine measurements
    Full-system simulation
  Future Work: Multi-Machine Simulation
    Isolate middle-tier from client emulators and database
    Understand fundamental workload behaviors
    Drives future system design

ECperf vs. SPECjbb
  (figure: Cache-to-Cache Transfers (%) vs. Touched Cache Lines (KB), 0 to 1024 KB, for ECperf and SPECjbb)
  Different cache-to-cache transfer ratios!

Online Transaction Processing (OLTP)

DB2 with a TPC-C-like workload. The TPC-C benchmark is widely used to evaluate system performance for the on-line transaction processing market. The benchmark itself is a specification that describes the schema, scaling rules, transaction types, and transaction mix, but not the exact implementation of the database. TPC-C defines five transaction types, all related to an order-processing environment. Performance is measured by the number of New Order transactions performed per minute (tpmC). Our OLTP workload is based on the TPC-C v3.0 benchmark. We use IBM's DB2 V7.2 EEE database management system and an IBM benchmark kit to build the database and emulate users. We build an 800 MB, 4000-warehouse database on five raw disks and an additional dedicated database log disk. We scaled down the size of each warehouse by maintaining the reduced ratios of 3 sales districts per warehouse, 30 customers per district, and 100 items per warehouse (compared to 10, 30,000 and 100,000 required by the TPC-C specification). Each user randomly executes transactions according to the TPC-C transaction mix specifications, and we set the think and keying times for users to zero. A different database thread is started for each user. We measure all completed transactions, even those that do not satisfy the timing constraints of the TPC-C benchmark specification.
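The scaled-down ratios above imply some simple totals, worked out in the sketch below. Only the four input numbers and the 800 MB figure come from the text; the function and field names are hypothetical, and "item entries" is simply warehouses times items per warehouse, which may not correspond to a specific table in the real schema.

```python
WAREHOUSES = 4000
DISTRICTS_PER_WAREHOUSE = 3
CUSTOMERS_PER_DISTRICT = 30
ITEMS_PER_WAREHOUSE = 100
DATABASE_MB = 800


def scaled_counts(warehouses=WAREHOUSES):
    """Totals implied by the scaled-down per-warehouse ratios."""
    districts = warehouses * DISTRICTS_PER_WAREHOUSE
    return {
        "districts": districts,
        "customers": districts * CUSTOMERS_PER_DISTRICT,
        "item_entries": warehouses * ITEMS_PER_WAREHOUSE,
        "mb_per_warehouse": DATABASE_MB / warehouses,
    }
```

With these ratios the database holds 12,000 districts and 360,000 customers, and averages about 0.2 MB per warehouse, which illustrates how many small warehouses keep the total within a PC's memory budget.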

Java Server Workload (SPECjbb)
Java-based middleware applications are increasingly used in modern e-business settings. SPECjbb is a Java benchmark emulating a 3-tier system with emphasis on the middle-tier server business logic. SPECjbb runs in a single Java Virtual Machine (JVM) in which threads represent terminals in a warehouse. Each thread independently generates random input (tier 1 emulation) before calling transaction-specific business logic. The business logic operates on the data held in binary trees of Java objects (tier 3 emulation). The specification states that the benchmark does no disk or network I/O. We used Sun's HotSpot 1.4.0 Server JVM and Solaris's native thread implementation. The benchmark includes driver threads to generate transactions. We set the system heap size to 1.8 GB and the new object heap size to 256 MB to reduce the frequency of garbage collection. Our experiments used 24 warehouses, with a data size of approximately 500 MB.

Static Web Content Serving: Apache
Web servers such as Apache represent an important enterprise server application. Apache is a popular open-source web server used in many internet/intranet settings. In this benchmark, we focus on static web content serving. We use Apache 2.0.39 for SPARC/Solaris 8 configured to use pthread locks and minimal logging at the web server. We use the Scalable URL Request Generator (SURGE) as the client. SURGE generates a sequence of static URL requests which exhibit representative distributions for document popularity, document sizes, request sizes, temporal and spatial locality, and embedded document count. We use a repository of 20,000 files (totalling ~500 MB), and use clients with zero think time. We compiled both Apache and SURGE using Sun's WorkShop C 6.1 with aggressive optimization.

Dynamic Web Content Serving: Slashcode
Dynamic web content serving has become increasingly important for web sites that serve large amounts of information. Dynamic content is used by online stores, instant news, and community message board systems. Slashcode is an open-source dynamic web message posting system used by the popular message board system. We used Slashcode 2.0, Apache 1.3.20, and Apache's mod_perl module 1.25 (with perl 5.6) on the server side. We used MySQL 3.23.39 as the database engine. The server content is a snapshot from the site, containing approximately 3000 messages with a total size of 5 MB. Most of the run time is spent on dynamic web page generation. We use a multi-threaded user emulation program to emulate user browsing and posting behavior. Each user independently and randomly generates browsing and posting requests to the server according to a transaction mix specification. We compiled both server and client programs using Sun's WorkShop C 6.1 with aggressive optimization.
