Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Ion Mandoiu CSE Department, University of Connecticut Joint work with Justin Kennedy and Bogdan Pasaniuc Outline Introduction Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion Single Nucleotide Polymorphisms Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) ataggtccCtatttcgcgcCgtatacacgggActata ataggtccGtatttcgcgcCgtatacacgggTctata ataggtccCtatttcgcgcCgtatacacgggTctata

High density in the human genome: 1 107 SNPs out of total 3 109 base pairs 3 Haplotypes and Genotypes Diploids: two homologous copies of each chromosome Haplotype: description of SNP alleles on a chromosome One inherited from mother and one from father 0/1 vector: 0 for major allele, 1 for minor Genotype: description of alleles on both chromosomes

0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles 011100110 + 001000010 021200210 two haplotypes per individual genotype Why SNP Genotypes? Identification and fine mapping of disease-related genes Methods: Linkage analysis, allele-sharing, association studies Genotype data: large pedigrees, sibling pairs, trios, unrelated 5 Genotyping Errors A real problem despite advances in genotyping technology

[Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times Error types Systematic errors (e.g., assay failure) detected by departure from HWE [Hosking et al. 2004] For pedigree data some errors detected as Mendelian Inconsistencies (MIs) Undetected errors E.g., if mother/father/child are all heterozygous, any error is Mendelian consistent Only ~30% detectable as MIs for trios [Gordon et al. 1999]

Effects of Undetected Genotyping Errors Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based) Errors as low as .1% can increase Type I error rates in haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker04] 1% errors decrease power by 10-50% for linkage, and by 520% for association [Douglas et al. 00, Abecasis et al. 01] Related Work Improved genotype calling algorithms

[Di et al. 05, Rabbee&Speed 06, Nicolae et al. 06] Explicit modeling in analysis methods [Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02,Cheng 06] Computationally complex Separate error detection step [Douglas et al. 00, Abecasis et al. 02, Becker et al. 06] Detected errors can be retyped, imputed, or ignored in downstream analyses Outline Introduction Likelihood Sensitivity Approach to Error Detection

HMM-Based Algorithms Experimental Results Conclusion Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] Mother 012102 Father 0 1 1 1 0 0 h1 0 1 0 1 0 1 h2 022102 0 0 0 1 0 1 h3 0 1 1 1 0 0 h4 Child 022102 0 1 1 1 0 0 h1 0 0 0 1 0 1 h3 L(T ) MAX p( h1 ) p(h2 ) p( h3 ) p( h4 )

Likelihood of best phasing for original trio T Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] Mother 012102 Father 010101 h1 011100 h2 Child 0 2 2? 1 0 2 0 0 0 1 0 0 h 3 0 1 1 1 0 1 h 4 022102 0 1 0 1 0 1 h 1 0 0 0 1 0 0 h 3 L(T ) MAX p( h1 ) p(h2 ) p( h3 ) p( h4 )

L(T ' ) MAX p( h'1 ) p( h'2 ) p( h'3 ) p(h'4 ) Likelihood of best phasing for original trio T Likelihood of best phasing for modified trio T Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] Mother Father 012102 022102 Child 0 2 2? 1 0 2 Large change in likelihood suggests likely error Flag genotype as an error if L(T)/L(T) > R, where R is the detection threshold (e.g., R=104) Implementation in FAMHAP

[Becker et al. 06] Window-based algorithm For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50) Find most likely trio phasings by pruned search over the H4 quadruples of frequent haplotypes Flag genotype as an error if L(T)/L(T) > R for at least one window Mother 201012 1 02210... Father 201202 2 10211... Child 000120 2

21021... Limitations of FAMHAP Implementation Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values False positives caused by nearby errors (due to the use of multiple short windows) Our approach: HMM model of haplotype diversity all haplotypes are represented + no need for short windows Alternate likelihood functions scalable runtime Outline Introduction

Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion HMM Model (Figure from Rastas et al. 07) Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05] Unlike [Scheet&Stephens 06], recombination ratios not modeled explicitly Block-free model, paths with high transition probability correspond to founder haplotypes HMM Training

Previous works use EM training of HMM based on unrelated genotype data Our 2-step algorithm exploits pedigree info Step 1: Infer haplotypes using pedigree-aware algorithm based on entropy-minimization Step 2: train HMM based on inferred haplotypes, using Baum-Welch Complexity of Computing Maximum Phasing Probability For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of O(f-) unless ZPP=NP, where f is the number of founders For trios, hard to approx. within O(f1/4 -) Reductions from the clique problem Alternate Likelihood Functions

Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths Efficient Computation of Viterbi Probability for Trios For a fixed trio, Viterbi paths can be found using a 4path version of Viterbis algorithm time O ( NK 8 )in K3 speed-up by factoring common terms: V ( j 1; q1 , q2 , q3 , q4 ) E ( j 1; q1 , q2 , q3 , q4 ) max q '4Q j {Pre3 ( j; q1 , q2 , q3 , q'4 ) ( q'4 , q4 )} Where: E ( j 1; q1 , q2 , q3 , q4 ) = maximum probability of emitting ( q1states , q2 , q3 , q4 ) SNP genotypes at locus j+1 from

= transition probability Overall Runtimes Viterbi probability Probability of Viterbi haplotypes Likelihoods of all 3N modified trios can be computed within O ( NK 5 ) time using forward-backward algorithm Overall runtime for M trios O ( MNK 5 ) Obtain haplotypes from standard traceback, then compute

haplotype probabilities using forward algorithms O ( M ( NK 5 N 2 K )) Overall runtime Total trio probability Similar pre-computation speed-up & forward-backward algorithm 5 Overall runtimeO ( MNK ) Outline Introduction Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion Datasets Real dataset [Becker et al. 2006]

35 SNP loci on chromosome 16 covering a region of 91kb 551 trios Synthetic datasets 35 SNPs, 30-551 trios Preserved missing data pattern of real dataset Haplotypes assigned to trios based on frequencies inferred from real dataset 1% error rate, four error insertion models Random allele Random genotype

Heterozygous-to-homozygous Homozygous-to-heterozygous Experimental Setup Two strategies for handling MIs Set all three individuals to unknown prior to error detection, or Set child only to unknown (preserving parents original data) Two testing strategies Test one SNP genotype: ViterbiProb-1, ViterbiHaps-1, TotalProb-1 Simultaneously test three SNP genotypes at the same

locus: ViterbiProb-3, ViterbiHaps-3, TotalProb-3 Comparison with FAMHAP (Random Allele Errors) 1 0.9 Sensitivity 0.8 TrioProb-1 0.7 ViterbiHaps-1 0.6 ViterbiProb-1 FAMHAP-1 0.5 TrioProb-3

0.4 ViterbiHaps-3 0.3 ViterbiProb-3 FAMHAP-3 0.2 0.1 0 0.0001 0.001 0.01 FP rate 0.1 1 Children vs. Parents (Random Allele Errors) 1

0.9 0.8 Sensitivity 0.7 0.6 TrioProb-1-P 0.5 TrioProb-1-C 0.4 0.3 0.2 0.1 0 0 0.005 0.01 0.015 FP rate

0.02 0.025 Error Model Comparison (TrioProb-1 Parents) 1 0.9 Sensitivity 0.8 0.7 Random Allele 1% 0.6 Random Genotype 1% 0.5 Heterozygous-toHomozygous 1% Homozygous-toHeterozygous 1%

0.4 0.3 0.2 0.1 0 0 0.005 0.01 0.015 FP rate 0.02 0.025 TrioProb-1 Results on Real Dataset Threshold Parents Children Total

Total Signals True Positives False Positives 2 3 4 2 3 4 2 3 4 80 15 9 9 9 8 2 1 1 27 21 17 11 10 10 3 3 1 107 36 26 20 19 18 5 4

2 Unknown 2 3 69 5 13 8 82 13 [Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3 23 SNP genotypes were identified as true errors 41*3-23=100 resequenced SNP genotypes agree with original calls Predictive value for R=104 is between 18/26=69% and 24/26=92%, compared to 23/41=56% for FAMHAP-3 4 0

6 6 Pedigree Info vs. Sample Size Effect 1 0.9 0.8 Sensitivity 0.7 551-TrioProb-1-T 129-TrioProb-1-T 30-TrioProb-1-T 551-Unrelated-ViterbiProb-1 0.6 0.5 0.4 0.3 0.2 0.1 0 0

0.005 0.01 0.015 FP rate 0.02 0.025 0.03 Unrelated vs. Trio Likelihood Sensitivity no error error 100000 10000 Unrelated ViterbiProb-1 Likelihood ratios (children)

1000 100 10 1 no error error 100000 10000 Trio ViterbiProb-1 Likelihood ratios (children) 1000 100 10

>5 4.9 4.7 4.5 4.3 4.1 3.9 3.7 3.5 3.3 3.1 2.9 2.7

2.5 2.3 2.1 1.9 1.7 1.5 1.3 1.1 0.9 0.7 0.5 0.3 0.1

1 Combining Likelihood Functions (Children, Random Allele Model) 1 0.95 0.9 FAMHAP Unrelated Duo Trio MinUT MinDT MinUDT Majority MinUD 0.85 0.8 0.75 0.7 0.65 0.6 0.55

0.5 0 0.002 0.004 0.006 0.008 0.01 Combining Likelihood Functions (Parents, Random Allele Model) 1 0.95 0.9 FAMHAP Unrelated Duo Trio MinUT MinDT MinUDT

Majority MinUD 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0 0.002 0.004 0.006 0.008 0.01 Outline Introduction

Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion Conclusion Proposed efficient methods for error detection in trio genotype data based on a HMM model of haplotype diversity Significantly improved detection accuracy compared to FAMHAP High sensitivity even for very low FP rates Runtime linear in #SNPs and #trios Ongoing work Iterative error detection

Integration with genotype calling algorithms Fix MIs using likelihood before error detection Correct errors with high likelihood ratio, then recompute likelihood ratios (possibly after re-phasing and HMM re-training) Combine low level intensity data with haplotype-based likelihoods Most useful when less pedigree info is available (unrelated, sibling pairs w/o parent genotypes, parents in trios) Locus specific thresholds, p-values Via simulations similar to [Douglas et al. 00] Questions?