# FFT in Hardware and Software - University of Calgary

FFT in Hardware and Software Background Core Algorithm Original Algorithm, the DFT, O(n2) complexity New Algorithm, the FFT (Fast Fourier Transform), O(nlog2(n)) depending on implementation. DFT Computation A summation over the whole input array for every single element in the output array. A VERY computationally inefficient algorithm to implement. X () DFT ( x[n]) x[n]e n jn [1] FFT Computation A much more computationally efficient algorithm Works using the divide and conquer principle. First developed by Cooley and Tukey in 1965! DFT vs. FFT (Number of Operations) Problem Size (N) Standard DFT FFT % of DFT (smaller is better) (smaller is better) (smaller is better)

2 4 1 25 4 16 4 25 8 64 12 19 16 256 32 13 32 1024 80 8 64 4096 192 5

128 16384 448 3 256 65536 1024 2 512 262144 2304 1 1024 1048576 5120 <1 DFT vs. FFT Thousands 1200 1000 Percent of DFT Computation Time (Smaller is Better) 800 600 30% 400

200 0 0 200 400 600 800 1000 1200 Problem Size Thousands Computations Required Nearly Linear Growth of FFT (Smaller is Better) 6 5 4 Percent of DFT Computation Time Computations Required Exponential Growth of DFT (Smaller is Better) 25% 20% 15% 10% 5% 0% 0 3 200 400

600 Problem Size 2 1 0 0 200 400 600 Problem Size 800 1000 800 1000 1200 FFT Butterfly Operations Butterfly arrangement of computations Repeated on successive pairs of input data Then half as many times on alternating pairs Then half again as many times on every fourth element The Butterfly Simple operations repeated many times xe[n] X[n] WnN xo[n] X[n+N/2] -WnN

W nk N e j 2nk N 8-point FFT Demonstration The Entire Calculation Output Input Array x[0] + + X[0] x[4] + + X[1] x[2] + + X[2] x[6] + + X[3]

x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] + + X[0]

x[4] + + X[1] x[2] + + X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6]

x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2] x[6] + +

X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] + +

X[0] x[4] + + X[1] x[2] + + X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3] + +

X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2] x[6] +

+ X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] +

+ X[0] x[4] + + X[1] x[2] + + X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3] +

+ X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2] x[6]

+ + X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0]

+ + X[0] x[4] + + X[1] x[2] + + X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3]

+ + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2]

x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output

x[0] + + X[0] x[4] + + X[1] x[2] + + X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5]

x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] + + X[0] x[4] + + X[1] x[2] + +

X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array

Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2] x[6] + + X[3] x[1] + + X[4] x[5] + +

X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration Input Array Output x[0] + + X[0] x[4] + + X[1] x[2] +

+ X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition

Why Hardware? Even more speed for FFT Extremely parallelizable A whole layer can be done in two FPGA clock cycles 1 multiply cycle 1 add cycle (Assuming sufficient multipliers) Hardware Problems Complexity Input speed Output speed If the FPGA takes 24.4ns but takes 20s to transfer the input data, what gain is there? i.e. 24.4ns + 20s + 20s = ~40s! Mitigation of Hardware Problems Use a faster bus AMD Opterons Hypertransport 20.8 GB/s (166.4 Gb/s) per Link (V. 3) Modules that fit into an AMD 64-bit Opteron Socket http:// www.drccomputer.com/pages/modules.html xilinx based module http://www.xtremedatainc.com/xd1000_brief.html - altera based module Mitigation of Hardware Problems Put the FPGA on the die with the DSP Need silicon vendor support FPGA can access memory on a very wide bus (i.e. 128 bits per cycle) Implement the entire project in FPGA Time consuming to program Possibly insufficient room on the FPGA 8-point FFT Demonstration In Hardware Input Array

Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5]

x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration In Hardware Input Array Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2]

x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration In Hardware Input Array

Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2] x[6] + + X[3] x[1] + + X[4] x[5] + + X[5]

x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition 8-point FFT Demonstration In Hardware Input Array Output x[0] + + X[0] x[4] + + X[1] x[2] + + X[2]

x[6] + + X[3] x[1] + + X[4] x[5] + + X[5] x[3] + + X[6] x[7] + + X[7] Multiplication by W factor + Addition Why Not Software? Each butterfly must be done sequentially Only slight parallelism enabled by a DSP

like the TigerSHARC Each Butterfly can be done in 2 cycles (after optimization). Results of Testing Linear Profiling of FFT Algorithm in C++ Stage Cycle count Time 8-point 32-point 256-point 8-point 32-point 256-point Initialization 21 25 25 35.07ns 41.75ns 41.75ns Computation 6922 1135 1.895 s 11.559 s

290.950 s Butterfly 91 174222 151.97ns Results of Testing Profiling of VHDL on FPGA Butterfly takes 24.377ns to execute 62% is computational, 38% is routing on FPGA Product Offerings Most DSP Vendors Many FPGA Vendors (IP Intellectual Property) Microcontroller Vendors (i.e. Blackfin) FFTW The Fastest Fourier Transform in the West AMD Math Core Library Intel Library Highly Optimized for the expected hardware Published Results The Radix 4 version delivers a 1 K points complex processing time of 25 microseconds at 200-MHz system speeds and uses only about 10 percent of the resources in a mid-range Stratix device. The Radix 2 is half the size of the Radix 4 and offers a 1 K points complex processing time of 50 microseconds at 200MHz system speeds. Additional versions of the new cores are under development. [6] FFT IP Core Published Results [7] FFT/IFFT length Texas Instruments C6713 Single 4DSP FFT core

(Smaller is Better) Quad 4DSP FFT core (Smaller is Better) 256 12.3s 3.68s 920ns 512 27.3s 6.24s 1.56s 1024 60.2s 11.4s 2.85s References [1] Signals Systems and Transforms [2] James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297301 (1965). [3] http://www.drccomputer.com/pages/modules.html - xilinx based module [4] http://www.xtremedatainc.com/xd1000_brief.html altera based module [5] http://www.amd.com/us-en/Processors/DevelopWithAMD /0,,30_2252_2353,00.html [6] http://www.us.design-reuse.com/news/news5650.html [7] http://www.4dsp.com/fft.htm

## Recently Viewed Presentations

• The High Career Dream: A career counselling modelThe impact of a career counselling model and the implications for HE Guidance Services . Orlaith Tunney, [email protected] Careers Adviser. Trinity College Dublin. Warwick University, 22nd May 2014
• Serotonin Dopamine Antagonism (SDA) Dopamine antagonism is not as robust at the D2 receptor and therefore has a much lower risk for EPS symptoms than the conventional antipsychotics. Serotonin receptor antagonism may actually release dopamine in the frontal regions of...
• WA Hospital Medication Chart (WA HMC) ... If there are more than one WA HMC or WA Paediatric NIMC in use, then this must be indicated by filling in the appropriate numbers in the space provided. If additional charts are...
• IMUL (signed integer multiply ) multiplies an 8-, 16-, or 32-bit signed operand by either AL, AX, or EAX. Preserves the sign of the product by sign-extending it into the upper half of the destination register. The Overflow flag indicates...
• Facts of the Sacco- Vanzetti Case:. At about 3pm on the afternoon of April 15, 1920, Frederick Parmenter, a paymaster, and his guard, Alessandro Bardelli, were killed by two men, armed with pistols at the Slater and Morrill Shoe Factory...
• 1 Leape LL, Brennan TA, Laird N, Lawthers Ag, Localio AR, Barnes BA. The nature of adverse events in hospitalized patients: results of the Har­vard medical practice study II. N Engl J Med 1991;324:377­84. 2 Wilson RM, Harrison BT, Gibberd...
• The number of pulleys and their arrangement can reduce the amount of effort force needed to raise a particular load. The number of lifting strands will determine the advantage to using a pulley system 1 Lifting Strand 2 Lifting Strands...
• To be mean 3. To take attendance 4. So Julia doesn't have to lecture as much 5. No reason The final exam is _____. 1. optional 2. required What % of the course grade is in-class participation? 0% 15% 45%...