FFTs, Portability, & Performance Steven G. Johnson, MIT Matteo Frigo, ITA Software (formerly MIT LCS) Fast Fourier Transforms (FFTs) [Gauss (1805), Cooley & Tukey (1965), & others] Fast, O(n log n) algorithm(s) to compute the Discrete Fourier Transform (DFT) of n points Widely used in many fields, including to solve partial differential equations: e.g. electromagnetic eigenmodes of photonic crystals I couldn't find any FFT code that was simultaneously: fast (near-optimal), flexible, and portable so, with M. Frigo, I wrote a new one:

(starting 1997) (fftw.org) the Fastest Fourier Transform in the West What's the fastest algorithm for _____? (computer science = math + time = math + $) 1 Find best asymptotic complexity naïve DFT to FFT: O(n²) to O(n log n) 2

Find best exact operation count? flops ~ 5 n log n [Cooley-Tukey, 1965, radix 2] to ~ 4 n log n [Yavne, 1968, split radix] but for the last 15+ years: flop count (varies by ~20%) no longer determines speed (varies by a factor of ~10) What's the fastest algorithm for _____? (computer science = math + time = math + $) 1 Find best asymptotic complexity naïve DFT to FFT: O(n²) to O(n log n)

2 Find variant/implementation that runs fastest hardware-dependent unstable answer! Better to change the question A question with a more stable answer? What's the simplest/smallest set of algorithmic steps whose compositions ~always span the ~fastest algorithm? the Fastest

Fourier Transform in the West FFTW C library for real & complex FFTs (arbitrary size/dimensionality) (+ parallel versions for threads & MPI) Computational kernels (80% of code) automatically generated Self-optimizes for your hardware (picks best composition of steps) = portability + performance free software: http://www.fftw.org/ FFTW performance power-of-two sizes, double precision

833 MHz Alpha EV6 2 GHz AMD Opteron 2 GHz PowerPC G5 500 MHz Ultrasparc IIe FFTW performance non-power-of-two sizes, double precision unusual: non-power-of-two sizes receive as much optimization as powers of two 833 MHz Alpha EV6

2 GHz AMD Opteron because we let the code do the optimizing FFTW performance double precision, 2.8 GHz Pentium IV: 2-way SIMD (SSE2) powers of two exploiting CPU-specific SIMD instructions (rewriting the code) is easy non-powers-of-two

because we let the code write itself Why is FFTW fast? three unusual features 1 FFTW implements many FFT algorithms: A planner picks the best composition by measuring the speed of different combinations. 2 The resulting plan is executed

with explicit recursion: enhances locality 3 The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose compiler FFTW is easy to use { fftw_complex x[n]; fftw_plan p; p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_MEASURE); ...

fftw_execute(p); /* repeat as needed */ ... fftw_destroy_plan(p); } Key fact: usually, many transforms of the same size are required. Why is FFTW fast? three unusual features 1 FFTW implements many FFT algorithms: A planner picks the best composition

by measuring the speed of different combinations. 2 The resulting plan is executed with explicit recursion: enhances locality 3 The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose compiler Cooley-Tukey FFTs

[ 1965, or Gauss, 1805 ] n = pq 1d DFT of size n: = ~2d DFT of size p × q multiply by n twiddle factors transpose = contiguous first, DFT the columns, size q

(non-contiguous) finally, DFT the columns, size p (non-contiguous) Cooley-Tukey FFTs [ 1965, or Gauss, 1805 ] 1d DFT of size n: n = pq = ~2d DFT of size p × q

(+ phase rotation by twiddle factors) = Recursive DFTs of sizes p and q O(n²) → O(n log n) (divide-and-conquer algorithm) Cooley-Tukey FFTs [ 1965, or Gauss, 1805 ] size-p DFTs
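Written out, the decomposition is the standard decimation index algebra: with n = pq, re-index the input as j = j₁ + p·j₂ and the output as k = q·k₁ + k₂, so that

```latex
X[q k_1 + k_2] \;=\; \sum_{j_1=0}^{p-1} \omega_p^{\,j_1 k_1}\,
  \underbrace{\omega_n^{\,j_1 k_2}}_{\text{twiddle}}
  \left( \sum_{j_2=0}^{q-1} x[j_1 + p\, j_2]\;\omega_q^{\,j_2 k_2} \right),
\qquad \omega_m = e^{-2\pi i/m}.
```

The inner sums are the p DFTs of size q (the columns), the ω_n factors are the n twiddle multiplications, and the outer sums are the q DFTs of size p.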

size-q DFTs twiddles Cooley-Tukey is Naturally Recursive Size 8 DFT p = 2 (radix 2) Size 4 DFT Size 2 DFT Size 2 DFT

Size 4 DFT Size 2 DFT Size 2 DFT But the traditional implementation is non-recursive, breadth-first traversal: log₂ n passes over the whole array Traditional cache solution: Blocking Size 8 DFT p = 2 (radix 2)

Size 4 DFT Size 2 DFT Size 2 DFT Size 4 DFT Size 2 DFT Size 2 DFT breadth-first, but with blocks of size = cache requires a program specialized for the cache size

Recursive Divide & Conquer is Good [Singleton, 1967] (depth-first traversal) Size 8 DFT p = 2 (radix 2) Size 4 DFT Size 2 DFT Size 2 DFT Size 4 DFT

Size 2 DFT Size 2 DFT eventually small enough to fit in cache no matter what size the cache is Cache Obliviousness A cache-oblivious algorithm does not know the cache size; it can be optimal for any machine & for all levels of cache simultaneously Such algorithms exist for many other problems, too [Frigo et al. 1999] all via the recursive divide & conquer approach

Why is FFTW fast? three unusual features 1 FFTW implements many FFT algorithms: A planner picks the best composition by measuring the speed of different combinations. 2 The resulting plan is executed with explicit recursion: enhances locality 3 The base cases of the recursion are codelets:

highly-optimized dense code automatically generated by a special-purpose compiler The Codelet Generator a domain-specific FFT compiler Generates fast hard-coded C for FFTs of arbitrary size Necessary to give the planner a large space of codelets to experiment with (any factorization). Exploits modern CPUs' deep pipelines & large register sets.

Allows easy experimentation with different optimizations & algorithms. CPU-specific hacks (SIMD) feasible The Codelet Generator written in Objective Caml [Leroy, 1998], an ML dialect Abstract FFT algorithm (Cooley-Tukey: n=pq, Prime-Factor: gcd(p,q) = 1, Rader: n prime) → Symbolic graph (dag)

Simplifications powerful enough to e.g. derive a real-input FFT from the complex FFT algorithm and even find new algorithms → Cache-oblivious scheduling (cache = registers) → Optimized C code (or other language) The Generator Finds Good/New FFTs Symbolic Algorithms are Easy

Cooley-Tukey in OCaml Simple Simplifications Well-known optimizations: Algebraic simplification, e.g. a + 0 = a Constant folding Common-subexpression elimination Symbolic Pattern Matching in OCaml The following actual code fragment is solely responsible for simplifying multiplications:

stimesM = function
  | (Uminus a, b) -> stimesM (a, b) >>= suminusM
  | (a, Uminus b) -> stimesM (a, b) >>= suminusM
  | (Num a, Num b) -> snumM (Number.mul a b)
  | (Num a, Times (Num b, c)) ->
      snumM (Number.mul a b) >>= fun x -> stimesM (x, c)
  | (Num a, b) when Number.is_zero a -> snumM Number.zero
  | (Num a, b) when Number.is_one a -> makeNode b
  | (Num a, b) when Number.is_mone a -> suminusM b
  | (a, b) when is_known_constant b && not (is_known_constant a) ->
      stimesM (b, a)
  | (a, b) -> makeNode (Times (a, b))

(Common-subexpression elimination is implicit via memoization and the monadic programming style.) Simple Simplifications Well-known optimizations: Algebraic simplification, e.g. a + 0 = a
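As a rough illustration in the deck's other language, here is a hypothetical C rendering of the same multiplication rules on a toy expression type (the type, constructors, and `stimes` are all invented for this sketch; the real generator is the OCaml above, where CSE comes for free via memoization):

```c
#include <stdlib.h>

/* Toy expression type (hypothetical; the generator's real dag is in OCaml). */
typedef enum { NUM, VAR, TIMES, UMINUS } Kind;
typedef struct Expr {
    Kind kind;
    double num;         /* valid when kind == NUM */
    const char *name;   /* valid when kind == VAR */
    struct Expr *a, *b; /* children: TIMES uses a,b; UMINUS uses a */
} Expr;

static Expr *mk(Kind k) { Expr *e = calloc(1, sizeof *e); e->kind = k; return e; }
static Expr *num(double v) { Expr *e = mk(NUM); e->num = v; return e; }
static Expr *var(const char *n) { Expr *e = mk(VAR); e->name = n; return e; }
static Expr *neg(Expr *a) { Expr *e = mk(UMINUS); e->a = a; return e; }

/* stimes mirrors stimesM's pattern-match, case for case. */
static Expr *stimes(Expr *a, Expr *b)
{
    if (a->kind == UMINUS) return neg(stimes(a->a, b));   /* (-a)*b -> -(a*b) */
    if (b->kind == UMINUS) return neg(stimes(a, b->a));   /* a*(-b) -> -(a*b) */
    if (a->kind == NUM && b->kind == NUM)
        return num(a->num * b->num);                      /* constant folding */
    if (a->kind == NUM && b->kind == TIMES && b->a->kind == NUM)
        return stimes(num(a->num * b->a->num), b->b);     /* fold nested consts */
    if (a->kind == NUM && a->num == 0.0) return num(0.0); /* 0*b -> 0 */
    if (a->kind == NUM && a->num == 1.0) return b;        /* 1*b -> b */
    if (a->kind == NUM && a->num == -1.0) return neg(b);  /* (-1)*b -> -b */
    if (b->kind == NUM && a->kind != NUM) return stimes(b, a); /* const first */
    Expr *e = mk(TIMES); e->a = a; e->b = b; return e;
}
```

Because `stimes` only ever builds already-simplified nodes, algebraic identities never survive into the generated code.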

Constant folding Common-subexpression elimination FFT-specific optimizations: Network transposition (transpose + simplify + transpose) _________________ negative constants A Quiz: Is One Faster? Both compute the same thing, and have the same number of arithmetic operations:

a = 0.5 * b;
c = 0.5 * d;
e = 1.0 + a;
f = 1.0 - c;

Faster: no separate load for -0.5 (10–15% speedup)

a = 0.5 * b;
c = -0.5 * d;
e = 1.0 + a;
f = 1.0 + c;

Non-obvious transformations require experimentation Quiz 2: Which is Faster? accessing a strided array

inside a codelet (amid dense numeric code): array[stride * i] versus array[strides[i]], using a precomputed stride array: strides[i] = stride * i The multiply is faster, of course! except on brain-dead architectures namely, Intel Pentia: integer multiplication conflicts with floating-point

up to ~20% speedup Machine-specific hacks are feasible if you just generate special code: stride precomputation, SIMD instructions (SSE, Altivec, 3dNow!), fused multiply-add instructions Why is FFTW fast? three unusual features 1 FFTW implements many FFT algorithms: A planner picks the best composition

by measuring the speed of different combinations. 2 The resulting plan is executed with explicit recursion: enhances locality 3 The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose compiler What does the planner compose?

(new in FFTW 3, spring 2003) The Cooley-Tukey algorithm presents many choices: which factorization? what order? memory reshuffling? Find simple steps that combine without restriction to form many different algorithms. Steps solve a problem, specified as DFT(v, n): multi-dimensional vector loops v of multi-dimensional transforms n {sets of (size, input/output strides)} Some Composable Steps (out of ~16) SOLVE Directly solve a small DFT by a codelet CT-FACTOR[r] Radix-r Cooley-Tukey step

r (loop) sub-problems of size n/r VECLOOP Perform one vector loop (can choose any loop, i.e. loop reordering) INDIRECT DFT = copy + in-place DFT (separates copy/reordering from DFT) TRANSPOSE solve in-place m × n transpose Dynamic Programming the assumption of optimal substructure Try all applicable steps: DFT(16) = fastest of:

CT-FACTOR[2]: 2 × DFT(8) CT-FACTOR[4]: 4 × DFT(4) DFT(8) = fastest of: CT-FACTOR[2]: 2 × DFT(4) CT-FACTOR[4]: 4 × DFT(2) SOLVE[1,8] If exactly the same problem appears twice, assume that we can re-use the plan. Many Resulting Algorithms INDIRECT + TRANSPOSE gives in-place DFTs, bit-reversal = product of transpositions
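A toy sketch of this dynamic programming in C, assuming an invented arithmetic cost model in place of FFTW's actual runtime measurements (FFTW times real plans; `plan_cost`, `MAXN`, and the cost formula here are illustrative only). The memo table plays the role of plan re-use for repeated subproblems.

```c
/* cost(n) = min over radices r dividing n of the cost of a CT-FACTOR[r]
   step: r sub-plans of size n/r plus ~n twiddle work, with small sizes
   SOLVEd directly at a modeled O(n^2) cost.  Valid for 1 <= n <= MAXN. */
#define MAXN 4096
static double memo[MAXN + 1];    /* zero-initialized: 0 means "not planned" */

static double plan_cost(int n)
{
    if (memo[n] > 0) return memo[n];      /* re-use repeated subproblems */
    double best = (double)n * n;          /* SOLVE: direct-DFT cost model */
    for (int r = 2; r <= n / 2; ++r)
        if (n % r == 0) {
            double c = r * plan_cost(n / r) + n;   /* CT-FACTOR[r] */
            if (c < best) best = c;
        }
    return memo[n] = best;
}
```

Even this crude model reproduces the qualitative conclusion: factored plans beat direct solves as n grows, and the best radix is not obvious in advance.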

no separate bit-reversal pass [ Johnson (unrelated) & Burrus (1984) ] VECLOOP can push the topmost loop to the leaves vector FFT algorithm [ Swarztrauber (1987) ] CT-FACTOR then VECLOOP(s) gives a breadth-first FFT, erases the iterative/recursive distinction Actual Plan for size 2¹⁹ = 524288 (2 GHz Pentium IV, double precision, out-of-place) CT-FACTOR[4] (buffered variant) CT-FACTOR[32] (buffered variant) VECLOOP(b) x32 CT-FACTOR[64] (~2000 lines of hard-coded C!)

INDIRECT INDIRECT + VECLOOP(b) VECLOOP(b) x64 VECLOOP(a) x4 = COPY[64] VECLOOP(a) x4 SOLVE[64, 64] (huge improvements for large 1d sizes) Unpredictable: (automated) experimentation is the only solution.

Planner Unpredictability double-precision, power-of-two sizes, 2 GHz PowerPC G5 FFTW 3 heuristic: pick the plan with the fewest adds + multiplies + loads/stores Classic strategy: minimize ops fails badly another test: Use the plan from: another machine?

e.g. Pentium-IV? lose 20–40% We've Come a Long Way 1965 Cooley & Tukey, IBM 7094, 36-bit single precision: size 2048 DFT in 1.2 seconds 2003 FFTW3+SIMD, 2 GHz Pentium-IV, 64-bit double precision: size 2048 DFT in 50 microseconds (a 24,000x speedup) (= 30% improvement per year) We've Come a Long Way? In the name of performance, computers have become complex & unpredictable. Optimization is hard: simple heuristics (e.g. fewest flops)

no longer work. The solution is to avoid the details, not embrace them: (Recursive) composition of simple modules + feedback (self-optimization) High-level languages (not C) & code generation are a powerful tool for high performance. FFTW Homework Problems? Try an FFTPACK-style back-and-forth solver Implement Vector-Radix for multi-dimensional n Pruned FFTs: a VECLOOP that skips zeros Better heuristic planner: some sort of optimization of per-solver costs? Implement convolution solvers