FFTs, Portability, & Performance

Steven G. Johnson, MIT
Matteo Frigo, ITA Software (formerly MIT LCS)

Fast Fourier Transforms (FFTs) [Gauss (1805), Cooley & Tukey (1965), & others]
Fast, O(n log n) algorithm(s) to compute the Discrete Fourier Transform (DFT) of n points. Widely used in many fields, including to solve partial differential equations: e.g. electromagnetic eigenmodes of photonic crystals.

I couldn't find any FFT code that was simultaneously fast (near-optimal), flexible, and portable, so with M. Frigo I wrote a new one:

FFTW (starting 1997) (fftw.org): the Fastest Fourier Transform in the West

What's the fastest algorithm for _____? (computer science = math + time = math + $)
1. Find best asymptotic complexity: naïve DFT to FFT, O(n²) to O(n log n)
2. Find best exact operation count? Flops ~5 n log n [Cooley-Tukey, 1965, radix 2] to ~4 n log n [Yavne, 1968, split radix]. But for the last 15+ years, flop count (varies by ~20%) no longer determines speed (varies by a factor of ~10).

2 (second try): Find the variant/implementation that runs fastest. Hardware-dependent, unstable answer! Better to change the question.

A question with a more stable answer: what's the simplest/smallest set of algorithmic steps whose compositions ~always span the ~fastest algorithm?

FFTW: the Fastest Fourier Transform in the West
C library for real & complex FFTs (arbitrary size/dimensionality), plus parallel versions for threads & MPI
Computational kernels (80% of code) automatically generated
Self-optimizes for your hardware (picks the best composition of steps) = portability + performance
Free software: http://www.fftw.org/

FFTW performance: power-of-two sizes, double precision

[Benchmark plots: speed on an 833 MHz Alpha EV6, 2 GHz AMD Opteron, 2 GHz PowerPC G5, and 500 MHz UltraSPARC IIe]

FFTW performance: non-power-of-two sizes, double precision
Unusual: non-power-of-two sizes receive as much optimization as powers of two, because we let the code do the optimizing.

FFTW performance: double precision, 2.8 GHz Pentium IV, 2-way SIMD (SSE2), powers of two and non-powers-of-two
Exploiting CPU-specific SIMD instructions (rewriting the code) is easy, because we let the code write itself.

Why is FFTW fast? Three unusual features:
1. The plan is executed with explicit recursion: enhances locality.
2. The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose compiler.
3. FFTW implements many FFT algorithms: a planner picks the best composition by measuring the speed of different combinations.

FFTW is easy to use:

{
    complex x[n];
    plan p;
    p = plan_dft_1d(n, x, x, FORWARD, MEASURE);
    ...
    execute(p); /* repeat as needed */
    ...
    destroy_plan(p);
}

Key fact: usually, many transforms of the same size are required.

Cooley-Tukey FFTs [1965, or Gauss, 1805]
n = pq: a 1d DFT of size n = ~a 2d DFT of size p x q.
First, DFT the columns, size q (non-contiguous); multiply by n twiddle factors; transpose (contiguous); finally, DFT the columns, size p (non-contiguous).

[Diagram: p x q array, column DFTs, twiddles, transpose to q x p]

Cooley-Tukey FFTs [1965, or Gauss, 1805]
1d DFT of size n = pq
= ~2d DFT of size p x q (+ phase rotation by twiddle factors)
= recursive DFTs of sizes p and q
O(n²) to O(n log n): a divide-and-conquer algorithm.

[Diagram: size-p DFTs, twiddles, size-q DFTs]

Cooley-Tukey is Naturally Recursive
[Diagram: a size-8 DFT with p = 2 (radix 2) splits into two size-4 DFTs, each splitting into two size-2 DFTs]
But the traditional implementation is non-recursive: a breadth-first traversal, log2 n passes over the whole array.

Traditional cache solution: Blocking
[Same diagram] Breadth-first, but with blocks of size = cache. Requires a program specialized for the cache size.

Recursive Divide & Conquer is Good [Singleton, 1967]
(depth-first traversal) [Same diagram] The sub-problems eventually become small enough to fit in cache, no matter what size the cache is.

Cache Obliviousness
A cache-oblivious algorithm does not know the cache size: it can be optimal for any machine & for all levels of cache simultaneously. Cache-oblivious algorithms exist for many other problems, too [Frigo et al. 1999], all via the recursive divide & conquer approach.

The Codelet Generator: a domain-specific FFT compiler
Generates fast hard-coded C for FFTs of arbitrary size
Necessary to give the planner a large space of codelets to experiment with (any factorization)
Exploits modern CPUs' deep pipelines & large register sets

Allows easy experimentation with different optimizations & algorithms: CPU-specific hacks (SIMD) feasible.

The Codelet Generator: written in Objective Caml [Leroy, 1998], an ML dialect
Abstract FFT algorithm (Cooley-Tukey: n = pq; Prime-Factor: gcd(p,q) = 1; Rader: n prime)
-> Symbolic graph (dag)
-> Simplifications (powerful enough to e.g. derive a real-input FFT from a complex-FFT algorithm, and even to find new algorithms)
-> Cache-oblivious scheduling (cache = registers)
-> Optimized C code (or other language)

The Generator Finds Good/New FFTs

Symbolic Algorithms are Easy

Cooley-Tukey in OCaml [slide: the Cooley-Tukey recurrence transcribes directly into a few lines of OCaml]

Simple Simplifications
Well-known optimizations: algebraic simplification (e.g. a + 0 = a), constant folding, common-subexpression elimination.

Symbolic Pattern Matching in OCaml
The following actual code fragment is solely responsible for simplifying multiplications:

stimesM = function
  | (Uminus a, b) -> stimesM (a, b) >>= suminusM
  | (a, Uminus b) -> stimesM (a, b) >>= suminusM
  | (Num a, Num b) -> snumM (Number.mul a b)
  | (Num a, Times (Num b, c)) ->
      snumM (Number.mul a b) >>= fun x -> stimesM (x, c)
  | (Num a, b) when Number.is_zero a -> snumM Number.zero
  | (Num a, b) when Number.is_one a -> makeNode b
  | (Num a, b) when Number.is_mone a -> suminusM b
  | (a, b) when is_known_constant b && not (is_known_constant a) ->
      stimesM (b, a)
  | (a, b) -> makeNode (Times (a, b))

(Common-subexpression elimination is implicit via memoization and monadic programming style.)

Simple Simplifications, continued
FFT-specific optimizations: network transposition (transpose + simplify + transpose); _________________ negative constants.

A Quiz: Is One Faster? Both compute the same thing, and have the same number of arithmetic operations:

Version 1:
a = 0.5 * b;
c = 0.5 * d;
e = 1.0 + a;
f = 1.0 - c;
(faster: no separate load for -0.5; 10-15% speedup)

Version 2:
a = 0.5 * b;
c = -0.5 * d;
e = 1.0 + a;
f = 1.0 + c;

Non-obvious transformations require experimentation.

Quiz 2: Which is Faster? Accessing a strided array

inside a codelet (amid dense numeric code):

    array[stride * i]
vs. array[strides[i]]    (using a precomputed stride array: strides[i] = stride * i)

The first is faster, of course! Except on brain-dead architectures, namely Intel Pentia, where integer multiplication conflicts with floating-point: there the precomputed table wins, up to ~20% speedup.

Machine-specific hacks are feasible if you just generate special code: stride precomputation, SIMD instructions (SSE, Altivec, 3dNow!), fused multiply-add instructions.

What does the planner compose?

(new in FFTW 3, spring 2003) The Cooley-Tukey algorithm presents many choices: which factorization? what order? memory reshuffling? Find simple steps that combine without restriction to form many different algorithms.

Steps solve a problem, specified as DFT(v, n): multi-dimensional vector loops v of multi-dimensional transforms n {sets of (size, input/output strides)}.

Some Composable Steps (out of ~16):
SOLVE: directly solve a small DFT by a codelet
CT-FACTOR[r]: radix-r Cooley-Tukey step = r (loop) sub-problems of size n/r
VECLOOP: perform one vector loop (can choose any loop, i.e. loop reordering)
INDIRECT: DFT = copy + in-place DFT (separates copy/reordering from DFT)
TRANSPOSE: solve in-place m x n transpose

Dynamic Programming: the assumption of optimal substructure. Try all applicable steps:
DFT(16) = fastest of: CT-FACTOR[2]: 2 DFT(8); CT-FACTOR[4]: 4 DFT(4)
DFT(8) = fastest of: CT-FACTOR[2]: 2 DFT(4); CT-FACTOR[4]: 4 DFT(2); SOLVE[1,8]
If exactly the same problem appears twice, assume that we can re-use the plan.

Many Resulting Algorithms:
INDIRECT + TRANSPOSE gives in-place DFTs; bit-reversal = product of transpositions, no separate bit-reversal pass [Johnson (unrelated) & Burrus (1984)]
VECLOOP can push the topmost loop to the leaves: vector FFT algorithm [Swarztrauber (1987)]
CT-FACTOR then VECLOOP(s) gives a breadth-first FFT: erases the iterative/recursive distinction

Actual Plan for size 2^19 = 524288 (2 GHz Pentium IV, double precision, out-of-place):
CT-FACTOR[4] (buffered variant)
  CT-FACTOR[32] (buffered variant)
    VECLOOP(b) x32
      CT-FACTOR[64]   (~2000 lines of hard-coded C!)
        INDIRECT
          VECLOOP(b) x64
            VECLOOP(a) x4
              COPY[64]
          VECLOOP(a) x4
            SOLVE[64, 64]
(INDIRECT + VECLOOP(b): huge improvements for large 1d sizes.)
Unpredictable: (automated) experimentation is the only solution.

Planner Unpredictability (double precision, power-of-two sizes, 2 GHz PowerPC G5)
FFTW 3 heuristic: pick the plan with fewest adds + multiplies + loads/stores. The classic strategy of minimizing operations fails badly. Another test: use the plan from another machine, e.g. a Pentium IV? Lose 20-40%.

We've Come a Long Way
1965: Cooley & Tukey, IBM 7094, 36-bit single precision: size-2048 DFT in 1.2 seconds
2003: FFTW 3 + SIMD, 2 GHz Pentium IV, 64-bit double precision: size-2048 DFT in 50 microseconds
(24,000x speedup, = 30% improvement per year)

We've Come a Long Way?
In the name of performance, computers have become complex & unpredictable. Optimization is hard: simple heuristics (e.g. fewest flops) no longer work. The solution is to avoid the details, not embrace them: (recursive) composition of simple modules + feedback (self-optimization). High-level languages (not C) & code generation are a powerful tool for high performance.

FFTW Homework Problems?
Try an FFTPACK-style back-and-forth solver
Implement vector-radix for multi-dimensional n
Pruned FFTs: VECLOOP that skips zeros
Better heuristic planner: some sort of optimization of per-solver costs?
Implement convolution solvers
