Wireless Networking and Communications Group Reducing Complexity in Signal Processing Algorithms for Communication Receiver and Image Display Software Brian L. Evans Prof. Brian L. Evans 27 July 2010 Seminar at the American University of Beirut Outline Embedded digital systems Generating sinusoidal waveforms Discrete-time filters Multicarrier equalizers Image halftoning algorithms Conclusion 20 0 20 0 4 205 0 20 0 6 7200 8200

20 1 9 0 2 Embedded Digital Systems Often work on application-specific tasks In consumer products (2008 units) 1200M 300M 100M 100M cell phones PCs digital cameras DVD players 70M DSL modems 55M cars/light trucks 30M gaming consoles (2007) iPhone has six programmable processors (2008) Embedded programmable processors Inexpensive with small area and volume Predictable off-chip input/output (I/O) rates Low power (TI C5504 45mW @ 300MHz) Limited onchip memory Fixed-point

arithmetic Embedded Digital Systems Memory access in processors External I/O: block data transfers to/from on-chip memory Internal I/O: on-chip memory to CPU registers using data buses (e.g. TI C6000 processor has two 32-bit data buses) Common word sizes for signal processing software 64-bit floating-point for desktop computing (e.g. Matlab) 32-bit floating-point for pro-audio and sonar beamforming 16-bit fixed-point for speech, consumer audio, image proc. IEEE floating-point operations Handles many special cases (e.g. +, - and not a number) Add, multiply, divide have comparable hardware complexity 4 Embedded Digital Systems Fixed-point operations Multiplication based on addition operations Division takes 1-2 instructions per bit of accuracy Multiplication can consume much dynamic power Truncate constants for power savings

56% Multiplier used in TI C64 processors [Han, Evans & Swartzlander, 2005] 5 Generating Sinusoidal Waveforms Sample continuous-time cosine signal at rate fs cos( 2 f 0 t ) u (t ) f s cos(0 n) u[n] Discrete-time fixed frequency 0 = 2 f0 / fs Example: f0 = 1200 Hz and fs = 8000 Hz, 0 = 3/10 Discrete-time realization drops fs term in front of cosine Math library call to cos function in C Uses double-precision floating-point arithmetic No standard in C for internal implementation Generally meant for high-accuracy desktop calculations Call to gsl_sf_cos_e in GNU scientific library 1.8 20 multiply, 30 add, 2 divide, 2 power calculations/output 6 Generating Sinusoidal Waveforms Difference equation with input x[n] and output y[n] y[n] = (2 cos 0) y[n-1] - y[n-2] + x[n] - (cos 0) x[n-1] From inverse z-transform of z-transform of cos(0 n) u[n] Impulse response gives cos(0 n) u[n] Initial

conditions 2 multiplications and 3 adds per output value are all zero Buildup in error as n increases due to feedback Lookup table pre-compute samples offline Discrete-time frequency 0 = 2 f0 / fs = 2 N / L All common factors between integers N and L removed = 2 k = 2 (N / L) n n = L store L samples Entries in either floating-point or fixed-point format Table would contain N periods of the cosine 7 Generating Sinusoidal Waveforms Signal quality vs. implementation complexity in generating cos(0 n) u[n] with 0 = 2 N / L Method MACs/ ROM RAM Quality in Quality in sample (words) (words) floating pt. fixed point C math library call 30 22 1

Second Best N/A Difference equation 2 2 3 Worst Second Best Lookup table 0 L 0 Best Best

MAC Multiplication-accumulation RAM Random Access Memory (writeable) ROM Read-Only Memory 8 Discrete-Time Filters Finite impulse response (FIR) filter x[k-1] x[k] h[0] z-1 z-1 z-1 h[1] h[2] h[M-1] S

y[k] Impulse response h[k] has finite extent k = 0,, M-1 M1 y[k ] h[m] x[k m] m 0 Discrete-time convolution 9 Discrete-Time Filters Infinite impulse response (IIR) filter Biquad building block: 2 poles and 0-2 zeros x[k] v[k] S a1 a2 Unit Delay v[k-1] Unit Delay v[k-2] b0 b1

S y[k] Biquad is short for biquadratic transfer function is ratio of two quadratic polynomials b2 Y ( z ) V ( z ) Y ( z ) b0 b1 z 1 b2 z 2 H ( z) X ( z ) X ( z ) V ( z ) 1 a1 z 1 a2 z 2 Generally, coefficients a1, a2, b0, b1, b2 are real-valued 10 Discrete-Time Filters Implementation complexity (1) FIR Filters Higher Minimum order design Stable?

Parks-McClellan (Remez exchange) algorithm (2) Always Linear phase If impulse response is symmetric or antisymmetric about midpoint IIR Filters Lower (sometimes by factor of four) Elliptic design algorithm May become unstable when implemented (3) No, but phase may made approximately linear over passband (or other band) (1) For same piecewise constant magnitude specification (2) Algorithm to estimate minimum order for Parks-McClellan algorithm by Kaiser may be off by 10%. Search for minimum order is often needed. (3) Algorithms can tune design to implementation target to minimize risk 11 Discrete-Time Filters Keep roots computed by filter design algorithms Polynomial deflation (rooting) reliable in floating-point Polynomial inflation (expansion) may degrade roots

Choice of IIR filter structure matters Direct form IIR structures expand zeros and poles, and may become unstable for large order filters (order > 12) Cascade of biquads expands zeros and poles in each biquad Minimum order design not always most efficient Efficiency depends on target implementation Consider power-of-two coefficient design Efficient designs may require search of design space 12 Halftime: AUB Summer 2005 EECE 503 Real-Time DSP Lab Embedded digital systems Generating sinusoidal waveforms Discrete-time filters Multicarrier equalizers Image halftoning algorithms Conclusion Channel Equalization Channel degrades transmitted signal Nonlinear distortion, e.g. amplitude nonlinearities Linear distortion, e.g. convolution by channel impulse response

Additive noise, e.g. thermal (Gaussian) and impulsive Equalization compensates linear distortion Spreading/attenuation in time Magnitude/phase distortion in frequency Message bit stream Transmitter Channel Equalizer Received bit stream Receiver 14 Multicarrier Modulation Divide channel into narrowband subchannels Discrete multitone modulation magnitude Baseband transmission based on fast Fourier transform (FFT) Each subchannel carries single-carrier transmission Standardized for digital subscriber line (DSL) communication channel carrier

Subchannels are 4.3 kHz wide in DSL systems subchannel frequency Channel Equalization Equalizer Shortens channel impulse response (time domain eq.) Compensates phase/ magnitude distortion (freq. domain eq.) xk Channel nk yk Equalizerrk ek w + + h + Ideal Channel Training signal

Receiver generates xk z- g Discretized Baseband System Single carrier system g is scalar constant FIR filter w performs time and frequency domain equalization Multicarrier system g is FIR filter of length n+1 Time domain equalizer (w) then FFT & freq. domain equalizer Equalization in DSL receivers increases bit rate by 10x 16 Multicarrier Equalization Maximum shortening SNR time domain equalizer Minimize energy leakage outside shortened channel length For each position of window [Melsa, Younce & Rohrs, 1996] energy inside window after TEQ max SSNR in dB max 10 log10 w w energy outside window after TEQ w T Bw n1 samples subject to w T Bw 1

max 10 log10 T w Aw w channel impulse Cholesky decomposition of B leads to optimal eigensolution Computationally-intensive: O(Lw3) Floating-point multiplications/divisions Restricts TEQ length to be less than n+1 response effective channel impulse response Time Domain Equalizer Design 6.6 Bit Rate (Mbps) 6.5 6.4 6.3

SymMinSSNR SymMMSE 6.2 MinSSNR SymMinISI 6.1 MMSE MinISI 6 MDS DualPath TEQ length of 17 Data rates averaged over eight standard DSL test lines [Martin et al., 2006] 5.9 5.8 5.7 5.6 4.5

5 5.5 6 6.5 7 Training complexity in log10(multiply-add operations) Most efficient floating-point versions of algorithms used 18 Time Domain Equalizer Design Unified framework [Martin et al., 2006] w opt w T Bw arg max T w Aw w subject to w T Bw 1

A and B are square (Lw Lw) and depend on choice of Constraint prevents trivial non-practical solution w = 0 Find eigenvector for largest generalized eigenvalue Iterative Methods Formulation Solve Bw Aw e.g. solve w B 1Aw Power method Solve Bw k 1 Aw k then w k 1 w k 1 / w k 1 Alternating k 1 w k Aw k then w k 1 w k 1 / w Tk 1Bw k 1 w Lagrangian w k 1 w k Bw k Aw k w Tk Bw k division-free 20 iterations to converge for 17-tap MSSNR TEQ design

19 Digital Image Halftoning For display on devices with fewer bits of gray/color resolution than original image Grayscale: 8-bit image to 1-bit image Color: 24-bit RGB image to 12-bit RGB display Produces artifacts x(m) b(m) Each pixel in original image is 8-bit unsigned intensity in [0, 255] Original Image For display, 0 is black and 255 is white Threshold at Mid-Gray 20 Quantization with Feedback Consider Average output 4-bit data value on 2-bit

display (unsigned) (10+10+10+11) = 1001 Feedback 4-bit resolution quantization at DC !error 4 Input Noise shaping signal 2 To display 2device bits increases 2 words Truncating from 42to noise by ~12dB Feedback removes noise at DC & increases HF noise 1 sample delay For constant input 1001 = 9

Periodic Adder Inputs Time Upper Lower 1 1001 00 2 1001 01 3 1001 10 4 1001 11 Sum 1001 1010 1011 1100 Output to display 10 10 10 11 Added noise 12 dB (2 bits) f

Error Diffusion Halftoning Quantize each pixel Diffuse filtered quantization error to future pixels difference x(m) threshold u(m) + b(m) _ _ h(m) shape error Original current pixel 7/16 + 3/16 5/16 1/16

e(m) compute error Halftone [Floyd & Steinberg, 1976] error filter weights Halftone Spectrum Error Diffusion Halftoning Artifact Sharpening False textures Model Compensation Added Complexity Sharpness 1 multiplication and control 1 addition per pixel

Deterministic bit Nonlinear flipping quantization 1 comparison per pixel Linear Deterministic bit flipping quantizer (DBF) [Damera-Venkata & Evans, 2001] Thresholds input to black (0) or white (255) Flip quantized value about mid-gray (128) Reduces false textures in mid-grays Implemented with two comparisons DBF(x) 255 x1 128 x2 x 23 Sharpness Control Model quantizer as gain plus noise [Kite, Evans & Bovik, 1997] Signal transfer function models sharpening Ks 2 for Floyd-Steinberg STF Bs z Ks X ( z ) 1 K s 1 H z

Noise transfer function models noise-shaping Kn = 1 NTF Bn z 1 H (z ) N (z ) 2 Ks = 2 1 1 1 Pass low and enhance high frequencies 1 1 Pass high frequency noise Plots for ideal lowpass H()

24 Sharpness Control Adjust by threshold modulation [Eschbach & Knox, 1991] Scale image by gain L and add it to quantizer input L x(m) + u(m) _ _ h(m) b(m) + e(m) 1 1 Ks FlattenLsignal 1transfer function L ( 1,0] since [Kite, K

Evans s 1& Bovik, 2000] Ks Ks 25 Results Floyd-Steinberg DBF quantizer Unsharpened Original 26 Conclusion Processor architecture Decrease data sizes to reduce on-chip memory usage and increase data bus efficiency Truncate multiplicand constants to reduce power Compute signal values by recursion or lookup table Algorithm design Keep offline design results in full precision until end Order of calculations matters in implementation Exploit problem structure in developing fixed-point algorithms

Linearize nonlinear systems to leverage linear system methods Many other ways to reduce complexity exist 27 Invitations Panel discussion on graduate studies Tomorrow (Wednesday) 1:30 2:30 pm in this room (RCR) Panelists: Prof. Zaher Dawy (AUB), Prof. Imad El-Hajj (AUB) and Prof. Brian Evans (UT Austin) IEEE Workshop on Signal Processing Systems Early October 2011 Short walk from the AUB campus Organizers include Prof. Magdy Bayoumi (Univ. of Louisiana at Lafayette), Prof. Brian Evans (UT Austin), Dean Ibrahim Hajj (AUB) and Prof. Mohammad Mansour (AUB) 28 Thank You! Digital Signal Processors Market ~1/3 of $25B embedded digital signal processing market 2007 cholesterol lowering Pzifer Lipitor sales: $13B Applications (2007) Wireless

Source: Forward Concepts Consumer Video Automotive Wireline Computer DSP Processor Market 9 8 Annual 7 Revenue 6 5 4 3 Billions of Dollars 2 1 0 1999 2001 2003

2005 2007 70 Share 60 TI Freescale Agere Analog Dev Philips Other 50 40 30 20 10 0 2004 2005 2006 2007

Source: Forward Concepts Introduction Screening (Masking) Methods Periodic thresholds to binarize image Periodic application leads to aliasing (gridding effect) Clustered dot screening is more resistant to ink spread Dispersed dot screening has higher spatial resolution Blue larger masks (e.g. 1 by 1) Clustered dot mask Dispersed dot mask index Threshold Lookup 1 , 3 , 5 , 7 , 9 , 11 , 13 , 15 , 17 , 19 , 21 , 23 , 25 , 27 , 29 , 31 * 256 Table 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 Linear Gain Model for Quantizer Extend sigma-delta modulation analysis to 2-D Linear gain model for quantizer in 1-D [Ardalan and Paulos, 1988] Linear gain model for grayscale image [Kite, Evans, Bovik, 1997] us(m) Ks us(m) Ks u(m) b(m)

{ Signal Path n(m) un(m) un(m) + n(m) Noise Path Error diffusion is modeled as linear, shift-invariant Signal transfer function (STF): quantizer acts as scalar gain Spatial Domain Original Image Threshold at Mid-Gray Dispersed Dot Screening Clustered Dot Floyd Steinberg Error Diffusion Stucki Error Diffusion Magnitude Spectra

Original Image Threshold at Mid-Gray Dispersed Dot Screening Clustered Dot Screening Floyd Steinberg Error Diffusion Stucki Error Diffusion Human Visual System Modeling Contrast at particular spatial frequency for visibility Bandpass: non-dim backgrounds [Manos & Sakrison, 1974; 1978] Lowpass: high-luminance office settings with low-contrast images [Georgeson & G. Sullivan, 1975] Exponential decay [Nssen, 1984] Modified lowpass version [e.g. J. Sullivan, Ray & Miller, 1990] Angular dependence: cosine

function [Sullivan, Miller & Pios, 1993] Analysis and Modeling Linear Gain Model for Quantizer Best linear fit for Ks between quantizer input u(m) and halftone b(m) Image Floyd Stuck Jarvi 2 K s arg min u (m) b(m) i s barba 2.01 3.62 3.76 Ks m u(m) 1 m 1 E u (m) 2 u 2 (m) 2 E u 2 (m m ra boats

lena mandr ill Avera ge 1.98 2.09 2.03 2.03 4.28 4.49 3.38 3.94 4.93 5.32 3.45 4.37 Stable for Floyd-Steinberg Can use average value to estimate Ks from only error filter Sharpening: proportional to Ks [Kite, Evans & Bovik, 2000] Value of Ks: Floyd Steinberg < Stucki < Jarvis Weighted SNR using unsharpened halftone 36