s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Sequence alignment Statistics Scores Algorithms 2003/04 Alessandro Bogliolo Outline s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE 1. Alignment likelihood and score models 1. 2. 3. 4. Problem statement Log-odds ratio Probabilistic derivation of a score/substitution matrix Characterization of a score matrix 2. Algorithms 1. 2. 3. 4. Problem statement Dynamic pogramming Exact algorithms Heuristics 2003/04 Alessandro Bogliolo Problem statement s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Given two aligned sequences, evaluate the relative likelihood that the two sequences are related rather than unrelated. Since sequence alignment aims at finding relations between sequences, the

above mentioned likelihood can be used as a score to be assigned to the alignment Given two sequences, and a scoring function, find the maximum-score alignment between the two sequences The scoring function depends on the nature of the sequences under comparison and on the relation under investigation The maximum score (provided by the best alignment) is a measure of similarity between the two sequences 2003/04 Alessandro Bogliolo Random model s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Assumptions: Given two sequences X and Y of N characters Each symbol a (e.g., nucleotide, amino acid, ) occurs with a given probability pa in the sequences, independent of its position and of its neighbors The two sequences are unrelated The probability of each alignment between X and Y under the random model is N N P( X , Y | Random) p xi p y j i 1 2003/04 Alessandro Bogliolo j 1 Match model s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Assumptions: Given two sequences X and Y of N characters Aligned pairs of symbols a and b in X and Y occur with a joint probability pab, independent of their position and neighbors pab=papb if the aligned pairs are unrelated pabpapb if the aligned pairs are related (e.g., they are derived from the same residue in a common ancestor) The probability of the alignment between X and Y under the match model is

N P ( X , Y | Match) p xi yi i 1 2003/04 Alessandro Bogliolo Relative likelihood Odds ratio: s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE N P ( X , Y | Match) P( X , Y | Random) p i 1 N xi yi N pxi p yi i 1 N i 1 p xi yi p xi p yi i 1 Log-odds ratio: logarithm of the odds ratio p xi yi P ( X , Y | Match) N log S ( X , Y ) log px p y P ( X , Y | Random) i 1 i i Can be used as a score for alignment, expressed as the sum of individual scores associated with each pair of aligned symbols N S ( X , Y ) s ( xi , yi ) i 1

p xi yi s( xi , yi ) log px p y i i 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Relative likelihood (interpretation) Odds ratio of an individual alignment : p xi y i p xi p yi Conditional probability of character yi given the alignment with xi p( yi | xi ) p xi yi p xi p xi yi p xi p yi p( yi | xi ) p yi The odds ratio is greater than 1 if and only if the conditional probability of yi given xi is higher than the marginal probability of yi 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Score (substitution) matrix Individual scores associated with each pair of symbols taken from the same finite alphabet can be represented on a symmetric matrix, called score matrix or substitution matrix A R N D

C Q A 5 -2 -1 -2 -1 -1 R -2 7 -1 -2 -4 1 N -1 -1 7 2 -2 0 D -2 -2 2 8 -4 0 C -1 -4 -2 -4 13 -3 Q -1 1 0 0 -3 7 E -1 0 0 2 -3 2 G 0 -3

0 -1 -3 -2 H -2 0 1 -1 -3 1 I -1 -4 -3 -4 -2 -3 L -2 -3 -4 -4 -2 -2 K -1 3 0 -1 -3 2 M -1 -2 -2 -4 -2 0 F -3 -3 -4 -5 -2 -4 P -1 -3 -2 -1 -4 -1 S

1 -1 1 0 -1 0 T 0 -1 0 -1 -1 -1 W -3 -3 -4 -5 -5 -1 Y -2 -1 -2 -3 -3 -1 V 0 -3 -3 -4 -1 -3 A C G T A C G T 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 BLOSUM50 BLOSUM50 The additive nature of log-odds metrics is exploited by alignment algorithms 2003/04 Alessandro Bogliolo Gap penalties s t i

ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE ACCGTTAC-GA Gap: sequence of g missing characters AC--TTACCGA inserted in a string to achieve alignment Gaps are assigned with two kinds of negative scores: Gap-open penalty (-d): negative score associated with the initiation of a gap (i.e., with the first missing character) Gap-extension penalty (-e): negative score associated with each additional missing character (usually e

E ECNOLOGIE DELL NFORMAZIONE The aim of sequence alignment is to assess the likelihood that the two sequences are related rather then unrelated The odds ratio is based on P(X,Y|Match) What we are interested in is P(Match|X,Y) According to Bayes: P( X ,Y | Match )P( Match ) P( Match | X ,Y ) P( X ,Y ) 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Significance of scores (2) P( X ,Y | Match )P( Match ) P( X ,Y ) P( X ,Y | M )P( M ) P( X ,Y | M )P( M ) P( X ,Y | R )P( R ) P( X ,Y | M ) P( M ) P( X ,Y | R ) P( R ) P( X ,Y | M ) P( M ) 1 P( X ,Y | R ) P( R ) P( Match | X ,Y ) PP( (AA) )PP( (AA| B | B)P)P( (BB) ) PP( (AA| B | B)P)P( (BB) ) AAXX,Y,Y BBMatch Match( (MM) ) BBRandom Random( (RR) ) PP( (XX,Y,Y| |MM) ) Score S ( X , Y ) log Score S ( X ,Y ) log P( X ,Y | R )

P( X ,Y | R ) PP( (MM) ) SS0 log log P( R ) 0 P( R ) eeSS( (XX,Y,Y))SS0 0 PP((Match Match||XX,Y ,Y)) 1 e SS( (XX,Y,Y))SS0 0 1 e P(Match|X,Y) can be directly derived from the score S(X,Y) obtained during alignment 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Characterizing score models Trivial approach: given a sample of confirmed alignment, use relative frequencies to estimate probabilities pab and pa for each pair of symbols a and b and f(g) for each gap size g Issues: Obtaining a good random sample of confirmed alignments Use sequence pairs for which the same given relation holds (e.g., if they diverged from a common ancestor, the score matrix may depend on the time since divergence) 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE PAM Matrices [Dayhoff, Schwartz, Orcutt, 1978] Idea: 1. Rather than characterizing the probability pab of finding a aligned to b, characterize the substitution probability p(b|a,t) that symbol a is replaced by b in time t p( b | a ,t ) pb

2. Characterize substitution probabilities for a short (unit) time p( b | a ,t ) pab ( t ) / pa 3. 4. s( a ,b | t ) log Use hypothetical phylogenetic trees to establish unit-time substitution relations among available sequences Derive long-time substitution matrices S(t) from unit-time substitution matrices S(1) by means of matrix multiplication S(n)=S(1)n PAM1: expected number of substitutions 1% PAM250: most widely used matrix 2003/04 Alessandro Bogliolo Characterized Characterizedon onaligned aligned sequences sequencesthat thatdiffer differfor foratat most most1% 1%ofofcharacters characters Derived Derivedfrom fromPAM1 PAM1 PAM Matrices s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE [Dayhoff, Schwartz, Orcutt, 1978] Time-dependent substitution matrices: 1. S(1): probability substitution matrix characterized on aligned sequences that differ in no more than 1% of positions Taken as the unit-time substitution matrix

Entry S(1)ab=p(b|a,1): probability of a to be replaced by b in a time unit 2. S(2): probability substitution matrix in 2 time units Entry S(2)ab=p(b|a,2) Matrix composition: p( b | a ,2 ) S(2)=S(1)S(1)=S(1)2 p( i | a ,1 ) p( b | i ,1 ) a i Alphabet Probability Probabilityofofgetting gettingfrom fromaatotobb inin22steps, steps,through throughany anypossible possible intermediate intermediatesymbol symboli i 2003/04 Alessandro Bogliolo i b BLOSUM Matrices s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE [Pearson, 1996] Idea: 1. Estimate substitutions probabilities by means of relative frequencies computed on a set of confirmed alignments between sequences with more than L% identical residues BLOSUM50 (L=50%): mainly used for alignment with gaps

BLOSUM62 (L=62%): mainly used for ungapped alignment 2003/04 Alessandro Bogliolo Outline s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE 1. Alignment likelihood and score models 1. 2. 3. 4. Problem statement Log-odds ratio Probabilistic derivation of a score/substitution matrix Characterization of a score matrix 2. Algorithms 1. 2. 3. 4. Problem statement Dynamic pogramming Exact algorithms Heuristics 2003/04 Alessandro Bogliolo Problem statement s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Given two sequences, and a scoring (cost) function, find the maximum-score (minimum-cost) alignment between the two sequences If sequences X and Y under alignment have length N, the number of possible alignments (with gaps) is 2 N ( 2 N )! 2 2 N 2

N N ( N! ) 2003/04 Alessandro Bogliolo Dynamic programming s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Given an additive alignment score (cost), optimal alignments betweens two sequences of N characters can be found in O(N2) by means of dynamic programming Dynamic programming incrementally compute the optimal score between the first i characters of X and the first j characters of Y Each step takes constant time The total number of steps if O(N2) 2003/04 Alessandro Bogliolo Global alignment s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE [Needleman-Wunsch, 1970] Given two strings X and Y of M and N characters, a matrix F is constructed with M+1 rows and N+1 columns, indexed by i and j Entry F(i,j) contains the score of the best alignment between X(1:i) and Y(1:j) F(i,j) can be recursively computed from top left F(0,0) to bottom right F(M,N) F ( 0 ,0 ) 0 F ( i 1, j 1 ) s( xi , y j ) F ( i , j ) max F ( i 1, j ) d F ( i , j 1 ) d 2003/04 Alessandro Bogliolo

s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Global alignment [Needleman-Wunsch, 1970] Example: BLOSUM50, linear gap penalty d=8 P A W H E A E 0 -8 -16 -24 -32 -40 -48 -56 H -8 -2 -10 -18 -14 -22 -30 -38 E -16 -9 -3 -11 -18 -8 -16 -24 A -24 -17 -4 -6 -13 -16 -3 -11 G A -32 -40 -25 -33

-12 -20 -7 -15 -8 -1-9 -16 -9 -11 -11 -6 -12 W -48 -42 -28 -5 -13 -12 -12 -14 G -56 -49 -36 -13 -7 -15 -12 -15 2003/04 Alessandro Bogliolo H -64 -57 -44 -21 -3 -7 -15 -12 E -72 -65 -52 -29 -11 3 -5 -9 E -80 -73 -60 -37 -19 -5 2 1 s t i

ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Global alignment (traceback) [Needleman-Wunsch, 1970] The value in the last entry of the matrix F(M,N) is the best score. The corresponding best alignment can be found by means of traceback on matrix F Starting from F(M,N), at each step we move back from a cell to the one from which the best score was derived Traceback steps can be driven either by pointers annotated during matrix construction, or by looking back at the scores There may be more than 1 best alignments 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Global alignment (traceback) [Needleman-Wunsch, 1970] HEAGAWGHE-E --P-AW-HEAE Example: BLOSUM50, linear gap penalty d=8 P A W H E A E 0 -8 -16 -24 -32 -40 -48 -56 H -8 -2 -10 -18 -14 -22 -30 -38

E -16 -9 -3 -11 -18 -8 -16 -24 A -24 -17 -4 -6 -13 -16 -3 -11 G -32 -25 -12 -7 -8 -16 -11 -6 A -40 -33 -20 -15 -9 -9 -11 -12 W -48 -42 -28 -5 -13 -12 -12 -14 G -56 -49 -36 -13 -7 -15 -12 -15 2003/04 Alessandro Bogliolo

H -64 -57 -44 -21 -3 -7 -15 -12 E -72 -65 -52 -29 -11 3 -5 -9 E -80 -73 -60 -37 -19 -5 2 1 Overlap matches s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE (allow ends to slide with no penalty) Align strings that either contain each other or overlap Similar to global alignment, but gaps at the two ends of each string are not penalized Alignment can start from any cell in the top or left border of the matrix F (0, j ) 0 F ( j ,0) 0 Alignment can end in any cell in the right or bottom border max F (i, N ) S ( X , Y ) max i F (M , j) max j

Recursive equation is unchanged 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Overlap matches (allow ends to slide with no penalty) GAWGHEE PAW-HEA Example: BLOSUM50, linear gap penalty d=8 P A W H E A E 0 0 0 0 0 0 0 0 H 0 -2 -2 -3 10 2 -2 0 E 0 -1 -2 -5 2 16 8 4 A 0 -1

4 -4 6 8 21 13 G 0 -2 -1 1 -6 0 13 18 A 0 -1 3 -4 -1 7 5 12 W 0 -4 -4 18 10 2 3 4 G 0 -2 -4 10 16 8 2 4 2003/04 Alessandro Bogliolo H 0 -2 -4 2 20 16 8 2 E 0 -1

-3 6 12 26 18 14 E 0 -1 -2 -6 4 18 25 24 Local alignment s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE [Smith-Waterman, 1981] Looks for best alignment between subsequences of X and Y There are two main differences with respect to Needleman-Wunsch: 1. Negative scores are not allowed F ( i 1, j 1 ) s( xi , y j ) F ( i 1, j ) d F ( i , j ) max F ( i , j 1 ) d 0 2. The score of the best local alignment is the highest score in the matrix (that can be enywhere) 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Local alignment [Smith-Waterman, 1981] AWGHE

AW-HE Example: BLOSUM50, linear gap penalty d=8 P A W H E A E 0 0 0 0 0 0 0 0 H 0 0 0 0 10 2 0 0 E 0 0 0 0 2 16 8 0 A 0 0 5 0 0 8 21 13 G 0 0 0 2 0 0 13 18 A

0 0 5 0 0 0 5 12 W 0 0 0 20 12 4 0 4 G 0 0 0 12 18 10 4 0 2003/04 Alessandro Bogliolo H 0 0 0 0 22 18 10 4 E 0 0 0 0 14 28 20 16 E 0 0 0 0 6 20 27 26 s

t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Repeated matches (asymmetric) Sequence X contains a domain, or motif We look for multiple local matches of X in sequence Y, with score higher than a positive threshold T F ( 0, j 1 ) F ( 0, j ) max F ( i , j 1 ) T max i F ( i 1, j 1 ) s( xi , y j ) F ( i 1, j ) d F ( i , j ) max F ( i , j 1 ) d F ( 0 , j ) 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Repeated matches HEAGAWGHEE HEA.AW-HE. Example: BLOSUM50, linear gap penalty d=8, T=20 P A W H E A E 0 0 0 0 0 0 0 0 H 0 0 0

0 10 2 0 0 E 0 0 0 0 2 16 8 0 A 0 0 5 0 0 8 21 13 G 1 1 1 2 1 1 13 18 A 1 1 6 1 1 1 6 12 W 1 1 1 21 13 5 1 4 G 1 1 1 13 19

11 5 1 2003/04 Alessandro Bogliolo H 1 1 1 5 23 19 11 5 E 3 3 3 3 15 29 21 17 E 9 9 9 9 9 21 28 27 9 Comparing alignments Global alignment Overlap Local alignment Repeated matches 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE Heuristics s t i

ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE The issue of performance Dynamic programming requires O(MN) steps Data base size (e.g. TrMBL) 314,641,655 amino acids in november 2003 Grows very fast Query length: 300 a.a. Complexity: 1011 cells Computation time (at 107 cells per second): 3 hours per query Solution Heuristic techniques trading off accuracy for performance Return not-only the best match to compensate for inaccuracy 2003/04 Alessandro Bogliolo BLAST s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE [Altschul et al., 1990] Use short high-score matching words as seeds from which to extend the alignment Create a table of short words (i.e., 3r, 11bp) that match the query with high score Scan the database searching for words in the table Extend the matching region (without gaps) stopping at the maximum scoring extension 2003/04 Alessandro Bogliolo FASTA s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE [Pearson and Lipman, 1988] Search for short high-score matching words arranged on the same diagonal 1. Use a lookup table to locate short matching words

(i.e., 2r, 6bp) 2. Search for diagonals with many matching words (this can be done by sorting matches on the difference of their indices i-j) 3. Extend the matching seeds in the best diagonals (without gaps) 4. Use gaps to join ungapped regions 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE FASTA [Pearson and Lipman, 1988] Example: BLOSUM50, linear gap penalty d=8, seed_size=1, min_score=6 1. Locate seeds with over threshold score P A W H E A E 0 0 0 0 0 0 0 0 H 0 -2 -2 -3 10 0 -2 0 E 0 -1 -1 -3 0 6 -1 6 A

0 -1 5 -3 -2 -1 5 -1 G 0 -2 0 -3 -2 -3 0 -3 A 0 -1 5 -3 -2 -1 5 -1 W 0 -4 -3 15 -3 -3 -3 -3 2003/04 Alessandro Bogliolo G 0 -2 0 -3 -2 -3 0 -3 H 0 -2 -2 -3 10 0 -2 0 E

0 -1 -1 -3 0 6 -1 6 E 0 -1 -1 -3 0 6 -1 6 s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE FASTA [Pearson and Lipman, 1988] Example: BLOSUM50, linear gap penalty d=8, seed_size=1, seed_size=1 min_score=6 2. Find best diagonal with multiple supporting seeds P A W H E A E 0 0 0 0 0 0 0 0 H 0 -2 -2 -3 10 0 -2 0

E 0 -1 -1 -3 0 6 -1 6 6 A 0 -1 5 -3 -2 -1 5 -1 G 0 -2 0 -3 -2 -3 0 -3 A 0 -1 5 -3 -2 -1 5 -1 W 0 -4 -3 15 -3 -3 -3 -3 16 2003/04 Alessandro Bogliolo G 0 -2 0 -3 -2

-3 0 -3 H 0 -2 -2 -3 10 0 -2 0 E E 0 0 -1 -1 -1 -1 -3 -3 0 0 6 6 -1 6-1 6 166 6 21 s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE FASTA [Pearson and Lipman, 1988] Example: BLOSUM50, linear gap penalty d=8, seed_size=1, seed_size=1 min_score=6 3. Maximum ungapped extension P A W H E A E 0 0 0 0 0 0 0

0 H 0 -2 -2 -3 10 0 -2 0 E 0 -1 -1 -3 0 6 -1 6 A 0 -1 5 -3 -2 -1 5 -1 G 0 -2 0 -3 -2 -3 0 -3 A 0 -1 5 -3 -2 -1 5 -1 W 0 -4 -3 15 -3 -3 -3 -3

G 0 -2 0 -3 -2 -3 0 -3 H 0 -2 -2 -3 10 0 -2 0 E 0 -1 -1 -3 0 6 -1 6 E 0 -1 -1 -3 0 6 -1 6 23 2003/04 Alessandro Bogliolo s t i ISTITUTO DI CIENZE E ECNOLOGIE DELL NFORMAZIONE FASTA [Pearson and Lipman, 1988] Example: BLOSUM50, linear gap penalty d=8, seed_size=1, seed_size=1 min_score=6 3. Possibly add gaps P A W

H E A E 0 0 0 0 0 0 0 0 H 0 -2 -2 -3 10 0 -2 0 E 0 -1 -1 -3 0 6 -1 6 A 0 -1 5 -3 -2 -1 5 -1 G 0 -2 0 -3 -2 -3 0 -3 A 0 -1 5 -3 -2 -1 5

-1 W 0 -4 -3 15 -3 -3 -3 -3 2003/04 Alessandro Bogliolo G 0 -2 0 -8 -2 -3 0 -3 H 0 -2 -2 -3 10 0 -2 0 E 0 -1 -1 -3 0 6 -1 6 E 0 -1 -1 -3 0 6 -1 6 28