CS252 Graduate Computer Architecture Lecture 12 Vector Processing (Cont) Branch Prediction John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252 http://www-inst.eecs.berkeley.edu/~cs252 Review: Vector Programming Model Scalar Registers Vector Registers r15 v15 r0 v0 [0] [1] [2] [VLRMAX-1]

Vector Length Register Vector Arithmetic Instructions ADDV v3, v1, v2 Vector Load and Store Instructions LV v1, r1, r2 Base, 3/5/2007 r1 VLR v1 v2 + + [0] [1] + + +

+ v3 v1 [VLR-1] Vector Register Stride, r2 cs252-S07, Lecture 12 Memory 2 Review: Vector Unit Structure Functional Unit Vector Registers Elements 0, 4, 8, Elements 1, 5, 9, Elements 2, 6, 10,

Elements 3, 7, 11, Lane Memory Subsystem 3/5/2007 cs252-S07, Lecture 12 3 Review: Vector Stripmining Problem: Vector registers have finite length Solution: Break loops into pieces that fit into vector registers, Stripmining ANDI R1, N, 63 # N mod 64 for (i=0; i

MTC1 VLR, R1 loop: LV V1, RA DSLL R2, R1, 3 # Do remainder # Multiply by 8 DADDU RA, RA, R2 # Bump pointer LV V2, RB DADDU RB, RB, R2 64 elements ADDV.D V3, V1, V2 SV V3, RC DADDU RC, RC, R2 DSUBU N, N, R1 # Subtract elements LI R1, 64 MTC1 VLR, R1 # Reset full length cs252-S07, 12 BGTZLecture N, loop # Any more to do? 4 Vector Reductions Problem: Loop-carried dependence on reduction variables sum = 0; for (i=0; i

sum += A[i]; # Loop-carried dependence on sum Solution: Re-associate operations if possible, use binary tree to perform reduction # Rearrange as: sum[0:VL-1] = 0 # for(i=0; i1) 3/5/2007 Vector of VL partial sums Stripmine VL-sized chunks Vector sum vector register # Halve vector length # Halve no. of partials cs252-S07, Lecture 12 5 Novel Matrix Multiply Solution Consider the following: /* Multiply a[m][k] * b[k][n] to get c[m][n] */

for (i=1; i

prod[0:31] = b_vector[0:31]*a_scalar; /* Vector-vector add into results. */ sum[0:31] += prod[0:31]; } /* Unit-stride store of vector of results. */ c[i][j:j+31] = sum[0:31]; } } 3/5/2007 cs252-S07, Lecture 12 7 How Pick Vector Length? Longer good because: 1) Hide vector startup 2) lower instruction bandwidth 3) tiled access to memory reduce scalar processor memory bandwidth needs 4) if know max length of app. is < max vector length, no strip mining overhead 5) Better spatial locality for memory access Longer not much help because: 1) diminishing returns on overhead savings as keep doubling number of element 2) need natural app. vector length to match physical register length, or no help (lots of short vectors in modern codes!) 3/5/2007

cs252-S07, Lecture 12 8 How Pick Number of Vector Registers? More Vector Registers: 1) Reduces vector register spills (save/restore) 20% reduction to 16 registers for su2cor and tomcatv 40% reduction to 32 registers for tomcatv others 10%-15% 2) Aggressive scheduling of vector instructinons: better compiling to take advantage of ILP Fewer: 1) Fewer bits in instruction format (usually 3 fields) 2) Easier implementation 3/5/2007 cs252-S07, Lecture 12 9 Context switch overhead: Huge amounts of state! Extra dirty bit per processor If vector registers not written, dont need to save on context switch

Extra valid bit per vector register, cleared on process start Dont need to restore on context switch until needed 3/5/2007 cs252-S07, Lecture 12 10 Exception handling: External Interrupts? If external exception, can just put pseudo-op into pipeline and wait for all vector ops to complete Alternatively, can wait for scalar unit to complete and begin working on exception code assuming that vector unit will not cause exception and interrupt code does not use vector unit 3/5/2007 cs252-S07, Lecture 12 11 Exception handling: Arithmetic Exceptions Arithmetic traps harder Precise interrupts => large performance loss! Alternative model: arithmetic exceptions set vector flag registers, 1 flag bit per element Software inserts trap barrier instructions from

SW to check the flag bits as needed IEEE Floating Point requires 5 flag bits 3/5/2007 cs252-S07, Lecture 12 12 Exception handling: Page Faults Page Faults must be precise Instruction Page Faults not a problem Could just wait for active instructions to drain Also, scalar core runs page-fault code anyway Data Page Faults harder Option 1: Save/restore internal vector unit state Freeze pipeline, dump vector state perform needed ops Restore state and continue vector pipeline Option 2: expand memory pipeline to check addresses before send to memory + memory buffer between address check and registers multiple queues to transfer from memory buffer to registers; check last address in queues before load 1st element from buffer. Per Address Instruction Queue (PAIQ) which sends to TLB and memory while in parallel go to Address Check Instruction Queue (ACIQ) When passes checks, instruction goes to Committed Instruction Queue (CIQ) to be there when data returns. On page fault, only save intructions in PAIQ and ACIQ

3/5/2007 cs252-S07, Lecture 12 13 Multimedia Extensions Very short vectors added to existing ISAs for micros Usually 64-bit registers split into 2x32b or 4x16b or 8x8b Newer designs have 128-bit registers (Altivec, SSE2) Limited instruction set: no vector length control no strided load/store or scatter/gather unit-stride loads must be aligned to 64/128-bit boundary Limited vector register length: requires superscalar dispatch to keep multiply/add/load units busy loop unrolling to hide latencies increases register pressure Trend towards fuller vector support in microprocessors 3/5/2007 cs252-S07, Lecture 12 14 Vector for Multimedia? Intel MMX: 57 additional 80x86 instructions (1st since 386)

similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64bits reuse 8 FP registers (FP and MMX cannot mix) short vector: load, add, store 8 8-bit operands + Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ... use in drivers or added to library routines; no compiler 3/5/2007 cs252-S07, Lecture 12 15 MMX Instructions Move 32b, 64b Add, Subtract in parallel: 8 8b, 4 16b, 2 32b opt. signed/unsigned saturate (set to max) if overflow Shifts (sll,srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b Multiply, Multiply-Add in parallel: 4 16b Compare = , > in parallel: 8 8b, 4 16b, 2 32b sets field to 0s (false) or 1s (true); removes branches Pack/Unpack Convert 32b<> 16b, 16b <> 8b

Pack saturates (set to max) if number is too large 3/5/2007 cs252-S07, Lecture 12 16 Administrivia Exam: Wednesday 3/14 Location: TBA TIME: 5:30 - 8:30 This info is on the Lecture page (has been) Meet at LaVals afterwards for Pizza and Beverages CS252 Project proposal due by Monday 3/5 Need two people/project (although can justify three for right project) Complete Research project in 8 weeks Typically investigate hypothesis by building an artifact and measuring it against a base case Generate conference-length paper/give oral presentation Often, can lead to an actual publication. 3/5/2007 cs252-S07, Lecture 12 17

Operation & Instruction Count: RISC v. Vector Processor (from F. Quintana, U. Barcelona.) Spec92fp Program swim256 hydro2d nasa7 su2cor tomcatv wave5 mdljdp2 Operations (Millions) Instructions (M) RISC Vector R / V RISC Vector 115 95 1.1x 115 0.8 58 40 1.4x 58 0.8 69 41 1.7x

69 2.2 51 35 1.4x 51 1.8 15 10 1.4x 15 1.3 27 25 1.1x 27 7.2 32 52 0.6x 32 15.8 R/V 142x 71x 31x 29x 11x 4x 2x

Vector reduces ops by 1.2X, instructions by 20X 3/5/2007 cs252-S07, Lecture 12 18 Common Vector Metrics R: MFLOPS rate on an infinite-length vector vector speed of light Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger (Rn is the MFLOPS rate for a vector of length n) N1/2: The vector length needed to reach one-half of R a good measure of the impact of start-up NV: The vector length needed to make vector mode faster than scalar mode measures both start-up and speed of scalars relative to vectors, quality of connection of scalar unit to vector unit 3/5/2007 cs252-S07, Lecture 12 19 Vector Execution Time Time = f(vector length, data dependicies, struct. hazards) Initiation rate: rate that FU consumes vector elements

(= number of lanes; usually 1 or 2 on Cray T-90) Convoy: set of vector instructions that can begin execution in same clock (no struct. or data hazards) Chime: approx. time for a vector operation m convoys take m chimes; if each vector length is n, then they take approx. m x n clock cycles (ignores overhead; good approximization for long vectors) 1: LV V1,Rx ;load vector X 2: MULV V2,F0,V1 ;vector-scalar mult. LV V3,Ry ;load vector Y 3: ADDV V4,V2,V3 ;add 4: SV 3/5/2007 Ry,V4 ;store the result 4 convoys, 1 lane, VL=64 => 4 x 64 = 256 clocks

(or 4 clocks per result) cs252-S07, Lecture 12 20 Memory operations Load/store operations move groups of data between registers and memory Three types of addressing Unit stride Contiguous block of information in memory Fastest: always possible to optimize this Non-unit (constant) stride Harder to optimize memory system for all possible strides Prime number of data banks makes it easier to support different strides at full bandwidth Indexed (gather-scatter) Vector equivalent of register indirect Good for sparse arrays of data Increases number of programs that vectorize 3/5/2007 cs252-S07, Lecture 12 32 21 Interleaved Memory Layout Vector Processor

Unpipelined DRAM Unpipelined DRAM Unpipelined DRAM Unpipelined DRAM Unpipelined DRAM Unpipelined DRAM Unpipelined DRAM Unpipelined DRAM Addr Addr Addr Addr Addr Mod 8 Mod 8 Mod 8 Mod 8 Mod 8 =0 =2 =3

=4 =1 Addr Addr Addr Mod 8 Mod 8 Mod 8 =7 =6 =5 Great for unit stride: Contiguous elements in different DRAMs Startup time for vector operation is latency of single read What about non-unit stride? Above good for strides that are relatively prime to 8 Bad for: 2, 4 3/5/2007 cs252-S07, Lecture 12 22 How to get full bandwidth for Unit Stride? Memory system must sustain (# lanes x word) /clock No. memory banks > memory latency to avoid stalls m banks m words per memory lantecy l clocks if m < l, then gap in memory pipeline: clock: 0 ll+1 l+2 l+m- 1 l+m2 l word: -- 01 2 m-1 -- m

may have 1024 banks in SRAM If desired throughput greater than one word per cycle Either more banks (start multiple requests simultaneously) Or wider DRAMS. Only good for unit stride or large data types More banks/weird numbers of banks good to support more strides at full bandwidth can read paper on how to do prime number of banks efficiently 3/5/2007 cs252-S07, Lecture 12 23 Avoiding Bank Conflicts Lots of banks int x[256][512]; for (j = 0; j < 512; j = j+1) for (i = 0; i < 256; i = i+1) x[i][j] = 2 * x[i][j]; Even with 128 banks, since 512 is multiple of 128, conflict on word accesses SW: loop interchange or declaring array not power of 2 (array padding) HW: Prime number of banks

3/5/2007 bank number = address mod number of banks address within bank = address / number of words in bank modulo & divide per memory access with prime no. banks? address within bank = address mod number words in bank bank number? easy if 2N words per bank cs252-S07, Lecture 12 24 Fast Bank Number Chinese Remainder Theorem As long as two sets of integers ai and bi follow these rules bi x mod ai, 0 bi ai, 0 x a 0 a1 a 2 and that ai and aj are co-prime if i j, then the integer x has only one solution (unambiguous mapping): bank number = b0, number of banks = a0 (= 3 in example) address within bank = b1, number of words in bank = a1 (= 8 in example) N word address 0 to N-1, prime no. banks, words power of 2 Bank Number: Address within Bank:

0 1 2 3 4 5 6 7 3/5/2007 Seq. Interleaved 0 1 2 0 3 6 9 12 15 18 21 1 4 7 10 13 16 19

22 2 5 8 11 14 17 20 23 cs252-S07, Lecture 12 Modulo Interleaved 0 1 2 0 9 18 3 12 21 6 15 16 1 10 19 4

13 22 7 8 17 2 11 20 5 14 23 25 Vectors Are Inexpensive Scalar N ops per cycle 2) circuitry HP PA-8000 4-way issue reorder buffer: 850K transistors 3/5/2007

Vector N ops per cycle 2) circuitry T0 vector micro 24 ops per cycle 730K transistors total incl. 6,720 5-bit register number comparators only 23 5-bit register number comparators No floating point cs252-S07, Lecture 12 26

Vectors Lower Power Single-issue Scalar One instruction fetch, decode, dispatch per operation Arbitrary register accesses, adds area and power Loop unrolling and software pipelining for high performance increases instruction cache footprint All data passes through cache; waste power if no temporal locality One TLB lookup per load or store Vector One inst fetch, decode, dispatch per vector Structured register accesses

Smaller code for high performance, less power in instruction cache misses Bypass cache One TLB lookup per group of loads or stores Move only necessary data across chip boundary Off-chip access in whole cache 3/5/2007lines cs252-S07, Lecture 12 27 Superscalar Energy Efficiency Even Worse Vector Superscalar 3/5/2007 Control logic grows quadratically with issue width

Control logic consumes energy regardless of available parallelism Speculation to increase visible parallelism wastes energy Control logic grows linearly with issue width Vector unit switches off when not in use Vector instructions expose parallelism without speculation Software control of speculation when desired: Whether to use vector mask or compress/expand for conditionals cs252-S07, Lecture 12 28 Vector Applications Limited to scientific computing? Multimedia Processing (compress., graphics, audio synth, image proc.)

Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort) Lossy Compression (JPEG, MPEG video and audio) Lossless Compression (Zero removal, RLE, Differencing, LZW) Cryptography (RSA, DES/IDEA, SHA/MD5) Speech and handwriting recognition Operating systems/Networking (memcpy, memset, parity, checksum) Databases (hash/join, data mining, image/video serving) Language run-time support (stdlib, garbage collection) even SPECint95 3/5/2007 cs252-S07, Lecture 12 29 Older Vector Machines Machine Cray 1 Cray XMP Cray YMP Cray C-90 Cray T-90 Conv. C-1 Conv. C-4

Fuj. VP200 Fuj. VP300 NEC SX/2 NEC SX/3 3/5/2007 Year Clock Regs Elements FUs LSUs 1976 80 MHz 8 64 6 1 1983 120 MHz 8 64 8 2 L, 1 S 1988 166 MHz 8 64 8 2 L, 1 S 1991 240 MHz 8 128 8 4 1996 455 MHz 8 128 8 4 1984 10 MHz

8 128 4 1 1994 133 MHz 16 128 3 1 1982 133 MHz 8-256 32-1024 3 2 1996 100 MHz 8-256 32-1024 3 2 1984 160 MHz 8+8K 256+var 16 8 1995 400 MHz 8+8K 256+var 16 8 cs252-S07, Lecture 12 30 Newer Vector Computers Cray X1 MIPS like ISA + Vector in CMOS NEC Earth Simulator Fastest computer in world for 3 years; 40 TFLOPS 640 CMOS vector nodes

3/5/2007 cs252-S07, Lecture 12 31 Key Architectural Features of X1 New vector instruction set architecture (ISA) Much larger register set (32x64 vector, 64+64 scalar) 64- and 32-bit memory and IEEE arithmetic Based on 25 years of experience compiling with Cray1 ISA Decoupled Execution Scalar unit runs ahead of vector unit, doing addressing and control Hardware dynamically unrolls loops, and issues multiple loops concurrently Special sync operations keep pipeline full, even across barriers Allows the processor to perform well on short nested loops Scalable, distributed shared memory (DSM) architecture Memory hierarchy: caches, local memory, remote memory Low latency, load/store access to entire machine (tens of TBs) 3/5/2007 Processors support 1000s of outstanding refs with flexible addressing Very high bandwidth network Coherence protocol, addressing andLecture synchronization optimized for DM cs252-S07,

12 32 Cray X1E Mid-life Enhancement Technology refresh of the X1 (0.13m) ~50% faster processors Scalar performance enhancements Doubling processor density Modest increase in memory system bandwidth Same interconnect and I/O Machine upgradeable Can replace Cray X1 nodes with X1E nodes 3/5/2007 cs252-S07, Lecture 12 33 ESS configuration of a general purpose supercomputer 1. Processor Nodes (PN) Total number of processor nodes is 640. Each processor node consists of eight vector processors of 8 GFLOPS and 16GB shared memories. Therefore, total numbers of processors is 5,120 and total peak performance and main memory of the system are 40 TFLOPS and 10 TB, respectively. Two nodes are installed into one cabinet, which size is 40x56x80. 16 nodes are in a cluster. Power consumption per cabinet is approximately 20 KVA.

2) Interconnection Network (IN): Each node is coupled together with more than 83,000 copper cables via single-stage crossbar switches of 16GB/s x2 (Load + Store). The total length of the cables is approximately 1,800 miles. 3) Hard Disk. Raid disks are used for the system. The capacities are 450 TB for the systems operations and 250 TB for users. 4) Mass Storage system: 12 Automatic Cartridge Systems (STK PowderHorn9310); total storage capacity is approximately 1.6 PB. 3/5/2007 From Horst D. Simon, NERSC/LBNL, May 15, 2002, ESS Rapid Response Meeting 34 cs252-S07, Lecture 12 Earth Simulator 3/5/2007 cs252-S07, Lecture 12 35 Earth Simulator Building 3/5/2007 cs252-S07, Lecture 12 36

ESS complete system installed 4/1/2002 3/5/2007 cs252-S07, Lecture 12 37 Vector Summary Vector is alternative model for exploiting ILP If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than Out-of-order machines Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations Fundamental design issue is memory bandwidth With virtual address translation and caching Will multimedia popularity revive vector architectures? 3/5/2007 cs252-S07, Lecture 12 38