Communication Modeling for System-Level Design

Andrew B. Kahng#,* [email protected]
Kambiz Samadi* [email protected]
CSE# and ECE* Departments, UCSD
November 24, 2008

Outline
- Motivation
- Communication Synthesis for Network-on-Chip
- Network-on-Chip Architecture Modeling
  - Buffered Interconnect Model
  - Router Power and Area Model
- Bus Architecture Modeling
- Conclusions

ISSOC-2008

Motivation
- Focus of design process is shifting from computation to communication
- Device and interconnect performance scaling mismatches cause breakdown of traditional across-chip communication
- System-level designers require accurate yet simple models to bridge planning and implementation stages
- Today's system-level performance and power modeling suffers from:
  - Ad hoc selection of models
  - Poor balance between accuracy and simplicity
  - Lack of model extensibility across future technology nodes
- Due to design performance / power constraints, early-stage design exploration has become crucial
- Our goal: develop accurate models that are easily usable in system-level design early in the design cycle

Communication Synthesis for Network-on-Chip
- Given
  - An input specification as a set of communication constraints
  - A library of communication components
  - An objective function (e.g., power, area, delay)
- Find
  - A network-on-chip implementation as a composition of library components that
    - Satisfies the specification
    - Minimizes the cost function
- Communication Synthesis Infrastructure (COSI)
  - Based on the Platform-Based Design methodology
  - Takes specification and library descriptions in XML format
  - Produces a variety of outputs, including a cycle-accurate SystemC implementation of the optimal network-on-chip

Constraint-Driven Communication Synthesis
[Figure: constraint-driven synthesis flow. Labels: application; point-to-point specification; constraints propagation; on-chip communication library; performance / cost abstractions; synthesis; implementation; synthesis result.]

Buffered Interconnect Model
- Components
  - Repeater delay model: separate models for repeater intrinsic delay, output slew, and input capacitance
  - Wire delay model: accounts for coupling capacitance impact on wire delay
  - Repeater power model: accounts for sub-threshold and gate leakages
  - Repeater area model: derived from existing cell layouts (can be extrapolated)
  - Wire area model: derived from wire width and spacing (can be extrapolated)
- Inputs for repeater delay calculation
  - Delay and slew values for a set of input slew and load capacitance values (obtained from Liberty / SPICE)
  - Input capacitance for different repeater sizes (Liberty, PTM)
- Inputs for wire delay calculation
  - Wire dimensions (ITRS/PTM, LEF, ITF)
  - Inter-wire spacings for global and intermediate layers (ITRS/PTM, LEF, ITF)
- Inputs for power calculation
  - Input capacitance (Liberty, PTM)
  - Wire parasitics (computed in wire delay calculation)
[Figure: Technology parameter extraction flows. Automatic extraction from .lib, LEF/ITF, MASTAR, the ITRS Interconnect Chapter, SPICE simulation, and PTM. Device labels (min. inverter): Rd, Cin, Ioff, tintrinsic. Interconnect labels: T, HILD, ILD, Wmin, Smin; tiers (L, I, SG, G): local, intermediate, semi-global, global.]

Repeater and Wire Models
- Intrinsic delay model: $i(slew_{in})$
- Drive resistance model: $r(slew_{in})$
- $delay = i(slew_{in}) + r(slew_{in}) \cdot C_L$
- $r(s) = f(size, slew_{in})$
- $slew_{out} = f(slew_{in}, C_L)$
- wire delay = Elmore
- Repeater area and power are linear with repeater size
- Predictions extend down to 16nm
- Delay model is within 15% of PrimeTime

Impact on System-Level Design
- Testcases
  - VPROC: video processor with 42 cores and 128-bit datawidth
  - dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth

  SoC     Node   Dynamic Power (mW)   Leakage Power (mW)   Total Area (mm^2)
                 Orig.     Prop.      Orig.     Prop.      Orig.     Prop.
  VPROC   90nm   117.3     364.8      38.1      99.6       0.37      0.346
  VPROC   65nm   51.1      179.9      69.9      86.7       0.217     0.223
  dVOPD   90nm   63.4      88         14.2      32.5       0.141     0.162
  dVOPD   65nm   27.3      73.2       25.7      33.2       0.082     0.085

- Original model (Orig.) underestimates power compared to the proposed model (Prop.)

  SoC     Node   Avg # hops         Max # hops
                 Orig.    Prop.     Orig.    Prop.
  VPROC   90nm   3.09     3.01      4        5
  VPROC   65nm   3.1      3.42      4        6
  dVOPD   90nm   1.76     1.76      3        3
  dVOPD   65nm   1.76     1.91      3        4

- Original model is very optimistic in delay; this becomes more critical as technology scales and the chip size becomes larger
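The functional forms on the Repeater and Wire Models slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides give $i(slew_{in})$ and $r(slew_{in})$ as fitted functions without their coefficients, so the linear forms, the inverse scaling of drive resistance with size, and all names and numbers below (`RepeaterModel`, `i0`, `i1`, `r0`, `r1`, `stage_delay`) are hypothetical placeholders. The wire term uses the standard Elmore delay of a distributed RC line terminated by a lumped sink capacitance.

```python
from dataclasses import dataclass


@dataclass
class RepeaterModel:
    """Toy repeater delay model; every coefficient is a hypothetical placeholder."""
    i0: float  # intrinsic delay at zero input slew (s)
    i1: float  # intrinsic-delay sensitivity to input slew (s per s)
    r0: float  # drive resistance of a size-1 repeater at zero input slew (ohm)
    r1: float  # drive-resistance sensitivity to input slew (ohm per s)

    def intrinsic_delay(self, slew_in: float) -> float:
        # i(slew_in): assumed linear in input slew
        return self.i0 + self.i1 * slew_in

    def drive_resistance(self, size: float, slew_in: float) -> float:
        # r = f(size, slew_in): assumed linear in slew, inversely proportional to size
        return (self.r0 + self.r1 * slew_in) / size

    def delay(self, size: float, slew_in: float, c_load: float) -> float:
        # delay = i(slew_in) + r(slew_in) * C_L
        return self.intrinsic_delay(slew_in) + self.drive_resistance(size, slew_in) * c_load


def elmore_wire_delay(r_wire: float, c_wire: float, c_sink: float) -> float:
    # Elmore delay of a distributed RC wire driving a lumped sink capacitance
    return r_wire * (0.5 * c_wire + c_sink)


def stage_delay(model: RepeaterModel, size: float, slew_in: float,
                r_wire: float, c_wire: float, c_sink: float) -> float:
    # The repeater sees the wire capacitance plus the next stage's input capacitance
    repeater = model.delay(size, slew_in, c_wire + c_sink)
    return repeater + elmore_wire_delay(r_wire, c_wire, c_sink)


if __name__ == "__main__":
    model = RepeaterModel(i0=10e-12, i1=0.1, r0=1000.0, r1=0.0)
    print(stage_delay(model, size=2.0, slew_in=20e-12,
                      r_wire=100.0, c_wire=20e-15, c_sink=5e-15))
```

In a real flow the coefficients would come from fitting Liberty or SPICE characterization data, as the Buffered Interconnect Model slide describes; the output-slew model $slew_{out} = f(slew_{in}, C_L)$ is left out of this sketch.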
ORION2.0: Accurate NoC Router Models
[Figure: ORION2.0 model. Input groups: architectural parameters; circuit implementation & buffering scheme; technology parameters. Input labels: # of ports; # of buffers; # of xbar ports; # of VC; voltage, frequency; SRAM and register FIFO; MUX-tree and matrix crossbar; different arbitration scheme; hybrid buffering scheme; interconnect parameters; device parameters; scaling factors for future technologies. Outputs: dynamic power, leakage power, and area for FIFO, arbiter, crossbar, clock, and link. New features are marked "NEW!".]
- Built on top of ORION1.0
- Provides previously missing power subcomponents
- Provides significant accuracy improvement vs. ORION1.0
- Uses our automatic flows to obtain technology inputs
- To appear in DATE-2009 (A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi)

Validation and Significance Assessment
- Validation: two Intel NoC chips, (1) Intel 80-core Teraflops and (2) Intel SCC
  - ORION2.0 offers significant accuracy improvement
  - %diff (total power) and %diff (total area) for ORION v1.0 and v2.0, values in the order extracted from the slide:
      Intel 80-core: -85.3, -80.9, -19.4, -23.6
      Intel SCC:     +202.4, +20.47, +31.87, +26.37
- System-level impact: COSI-OCC
  - ORION2.0 models lead to a better-performing NoC: (1) fewer hops and (2) fewer routers
  - Relative power due to an additional port is not as high in ORION2.0 as in ORION1.0

  SoC     P (mW)            A (mm^2)
          v1.0     v2.0     v1.0     v2.0
  VPROC   0.875    0.924    2.043    2.329
  dVOPD   0.412    0.486    1.217    1.343

  Remaining columns (v1.0 / v2.0), values in the order extracted from the slide:
      # routers:            33, 18, 25, 16
      max. # router ports:  8, 6, 12, 6
      max. # hops:          6, 11, 5, 10

AMBA Models
- Signal bus modeling: system-level interconnect model (described earlier)
- Logic modeling (multiplexers, decoders, and arbiter):
  - Block latency is based on a gate delay model (cf. Carloni et al. ASPDAC08)
  - Dynamic power is computed after measuring the switching capacitance
  - Leakage power is computed from average device leakages
  - Area is computed from cell areas of logic gates

AMBA Modeling and Bus vs. NoC Study
- Delay, power, area models within 11% of physical implementation
- Functional forms verified against physical implementation of AMBA-AHB controller
- Bus vs. NoC study enables design space explorations of heterogeneous communication fabrics
[Figure: AMBA Model. Inputs: technology & design style (min. width, spacing, thickness; dielectric thickness, constant; device drive res, cap, leakage; width/spacing, buffering scheme); floorplan (location of all masters, slaves; bit widths of all masters, slaves; optionally, locations of arbiter, decoder, and multiplexers); transaction (read and write; length; address progression). Outputs: area, delay, dynamic power, leakage power.]

Conclusions and Future Directions
- Accurate models can drive effective system-level exploration
- Reproducible methodology for extracting inputs to models
- Modeling at different levels of abstraction
  - protocol encapsulation (e.g., hand-shaking for AMBA bus allocation)
  - buses, pipelined rings (e.g., EIB in IBM Cell)
  - routers, network interfaces
  - FIFOs, queues, crossbar switches (ORION2.0)
- Extending to other technologies
  - 3D IC integration (e.g., TSV modeling, multi-layer router modeling)

Backup Slides

Communication Synthesis Key Elements
- Specification of input constraints
  - Set of IP cores: area and interface
  - End-to-end communication requirements between pairs of IP cores: latency and throughput
- Characterization of library of components
  - Interface types, max number of ports
  - Max capacities: bandwidth, latency, max distance
  - Performance and cost model
- Component instantiation and parallel composition
  - Rename, set parameters of library components
  - Composition based on algebra on quantities (including type compatibility)

Communication Synthesis Example
- Synthesis of optimal network-on-chip
  - Return a valid composition that meets input constraints and minimizes the objective function (e.g., power dissipation)
[Figure: (Original Specification), Platform Instance 1, Platform Instance 2]

COSI: Communication Synthesis Infrastructure
- COSI is a public-domain software package for NoC synthesis
- http://embedded.eecs.berkeley.edu/cosi/
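The "satisfy the specification, minimize the cost function" formulation from the communication synthesis slides can be illustrated with a deliberately tiny Python sketch. It is not COSI's algorithm or API: it treats every requirement as an independent point-to-point flow and picks the cheapest feasible library link for each one, with no routers or link sharing. All names (`Flow`, `LinkType`, `synthesize`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One end-to-end communication constraint (hypothetical units)."""
    src: str
    dst: str
    bandwidth: float  # required throughput
    distance: float   # distance between the two cores


@dataclass(frozen=True)
class LinkType:
    """One library component with its capacities and cost (hypothetical)."""
    name: str
    max_bandwidth: float
    max_distance: float
    power: float  # cost used as the objective function


def synthesize(flows, library):
    """Choose, per flow, the lowest-power link type that meets its constraints."""
    choice = {}
    total_power = 0.0
    for flow in flows:
        feasible = [link for link in library
                    if link.max_bandwidth >= flow.bandwidth
                    and link.max_distance >= flow.distance]
        if not feasible:
            raise ValueError(f"no library component satisfies {flow.src}->{flow.dst}")
        best = min(feasible, key=lambda link: link.power)
        choice[(flow.src, flow.dst)] = best.name
        total_power += best.power
    return choice, total_power


if __name__ == "__main__":
    library = [LinkType("narrow", 1.0, 2.0, 1.0), LinkType("wide", 4.0, 2.0, 3.0)]
    flows = [Flow("cpu", "mem", 0.5, 1.0), Flow("mem", "dsp", 2.0, 1.5)]
    print(synthesize(flows, library))
```

A real synthesis flow would also instantiate routers, share links among flows, and check latency; the sketch only shows how constraints, a component library, and an objective function fit together.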
Dynamic and Leakage Power Models
- Dynamic Power: Switching Capacitance
  - Clock power: $P_{clk} = C_{clk} V_{dd}^2 f$, with $C_{clk} = C_{sram\text{-}fifo} + C_{pipeline\text{-}registers} + C_{register\text{-}fifo} + C_{wiring}$
  - Physical links: due to charging and discharging of capacitive load; $P_d = C_{load} V_{dd}^2 f$, with $C_{load} = C_{ground} + C_{coupling} + C_{input}$
  - Register-based FIFO: implemented as shift registers
  - Other components: we use ORION 1.0 models
- Leakage Power: Subthreshold and Gate
  - From 65nm and beyond, gate leakage becomes significant
  - $I_{sub}(i,s)$ and $I_{gate}(i,s)$ are subthreshold and gate leakage currents per unit transistor width for a specific technology
  - $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component $i$ at input state $s$ for subthreshold and gate leakage, respectively
  - Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF

  $$I_{leak}(Block) = \sum_{i} \sum_{s} Prob(i,s) \cdot \left( W_{sub}(i,s) \cdot I_{sub}(i,s) + W_{gate}(i,s) \cdot I_{gate}(i,s) \right)$$

Area Model
- As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops Chip)
- Gate area model by Yoshida et al. (DAC04)
- Link area model by Carloni et al. (ASPDAC08)
- We model FIFO, crossbar switch, and arbiter areas using the adopted gate area model

  $$Area_{arbiter} = Area_{NOR2x1} \cdot 2(R-1)R + Area_{DFF} \cdot \frac{R(R-1)}{2} + Area_{INVx1} \cdot R$$

[Figure: Matrix Arbiter]
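The power and area formulas on these two slides translate directly into code. The sketch below is a plain transcription, not ORION2.0 source: the function and class names are made up for illustration, `R` is assumed to be the number of requesters of the matrix arbiter (the slide does not define it), and the numbers in the usage section are arbitrary.

```python
from dataclasses import dataclass


def dynamic_power(c_switched: float, vdd: float, freq: float) -> float:
    # P = C * Vdd^2 * f, used for both clock power and link power
    return c_switched * vdd ** 2 * freq


def link_capacitance(c_ground: float, c_coupling: float, c_input: float) -> float:
    # C_load = C_ground + C_coupling + C_input
    return c_ground + c_coupling + c_input


@dataclass(frozen=True)
class LeakageState:
    """One (component i, input state s) term of the leakage sum."""
    prob: float    # Prob(i, s)
    w_sub: float   # effective width for subthreshold leakage
    i_sub: float   # subthreshold leakage current per unit width
    w_gate: float  # effective width for gate leakage
    i_gate: float  # gate leakage current per unit width


def block_leakage_current(states) -> float:
    # I_leak(Block) = sum over i, s of Prob * (W_sub * I_sub + W_gate * I_gate)
    return sum(s.prob * (s.w_sub * s.i_sub + s.w_gate * s.i_gate) for s in states)


def matrix_arbiter_area(r: int, area_nor2: float, area_dff: float, area_inv: float) -> float:
    # Area_arbiter = A_NOR2x1 * 2(R-1)R + A_DFF * R(R-1)/2 + A_INVx1 * R
    # r is assumed to be the number of requesters (not defined on the slide)
    return area_nor2 * 2 * (r - 1) * r + area_dff * (r * (r - 1) / 2) + area_inv * r


if __name__ == "__main__":
    c_load = link_capacitance(1e-13, 2e-13, 5e-15)
    print(dynamic_power(c_load, vdd=1.0, freq=1e9))
    print(matrix_arbiter_area(4, area_nor2=1.0, area_dff=2.0, area_inv=0.5))
```

The leakage sum is kept generic over (component, state) pairs so that the same function covers INVx1, NAND2x1, NOR2x1, and DFF once their per-state widths and probabilities are known.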