DNA Bioinformatics
Hwa A. Lim: CompBio, bioinformatique, bio-informatics (bio/informatics), bioinformatics
[Figure: Number of entries in PDB by year, 1985 to 2010; vertical axis 0 to 50k]
DNA
DNA 3000 400.0
100 2.5
1300
DNA, RNA: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U)
DNA 20
Gly G    Ser S    Ala A
Thr T    Val V    Asn N
Ile I    Gln Q    Leu L
Tyr Y    Phe F    His H
Pro P    Asp D    Met M
Glu E    Trp W    Lys K
Cys C    Arg R

N-gram
CRF
N-gram, binary profile, N-nary profile; SVM
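LSA

Dong et al. N-gram Statistics and Linguistic Features Analysis of Whole Genome Protein Sequences. Journal of Harbin Institute of Technology. 2004

N-gram: the sequence SVYDA yields the 3-grams SVY, VYD, YDA.

Zipf's law for N-grams: with x_r the frequency of the N-gram of rank r and C a constant,
\[ x_r = \frac{C}{r^{\alpha}}, \qquad \log x_r = c - \alpha \log(r), \quad c = \log C \]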
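The N-gram decomposition and the rank-frequency relation above can be illustrated with a short Python sketch. The helper names (`ngrams`, `rank_frequency`, `zipf_slope`) are hypothetical and not taken from Dong et al.; the slope is estimated by ordinary least squares in log-log space, which is one simple choice among several and is assumed here only for illustration.

```python
from collections import Counter
import math


def ngrams(sequence, n=3):
    """Return all overlapping n-grams of a sequence, left to right."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def rank_frequency(sequences, n=3):
    """Count n-grams over all sequences; return (rank, count) pairs, rank 1 = most frequent."""
    counts = Counter(g for s in sequences for g in ngrams(s, n))
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


def zipf_slope(rank_counts):
    """Least-squares slope of log(count) against log(rank).

    Under x_r = C / r**alpha the slope equals -alpha.
    Needs at least two distinct ranks.
    """
    xs = [math.log(r) for r, _ in rank_counts]
    ys = [math.log(c) for _, c in rank_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


print(ngrams("SVYDA"))  # ['SVY', 'VYD', 'YDA']
```

On real proteome data one would pass the whole set of sequences to `rank_frequency` and inspect how close the fitted slope is to a straight line in the log-log plot.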
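CRF

A R N D C Q E G H I L K M F P S T W Y V ...
0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

CRF: linear chain over the labels y_{i-1}, y_i, y_{i+1}, conditioned on X = (x_1, x_2, ..., x_{i-1}, x_i, x_{i+1}, ..., x_n)

\[ p(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \Big) \]
\[ \phantom{p(Y \mid X)} = \frac{1}{Z(X)} \exp\Big( \sum_{i=1}^{n} \sum_{k} \big[ \lambda_k t_k(y_{i-1}, y_i, X, i) + \mu_k s_k(y_i, X, i) \big] \Big) \]
\[ Z(X) = \sum_{Y} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \Big) \]

Transition features t_k(y_{i-1}, y_i, X, i) and state features s_k(y_i, X, i):

\[ t_{y,y'}(y_{i-1}, y_i, X, i) = \begin{cases} 1 & \text{if } y_{i-1} = y \text{ and } y_i = y' \\ 0 & \text{otherwise} \end{cases} \]
\[ s^{pro}_{y,aa}(y_i, x_k, i) = \begin{cases} \mathrm{scale}(\mathrm{PSSM}(x_k, aa)) & \text{if } y_i = y \\ 0 & \text{otherwise} \end{cases} \]
\[ s^{ASA}_{y}(y_i, x_k, i) = \begin{cases} \mathrm{ASA}(x_k) & \text{if } y_i = y \\ 0 & \text{otherwise} \end{cases} \]
\[ s^{con}_{y}(y_i, x_k, i) = \begin{cases} \mathrm{grade}(x_k)/10 & \text{if } y_i = y \\ 0 & \text{otherwise} \end{cases} \]

CRF: SMC1HD:SCC1-C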
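To make the linear-chain CRF probability above concrete, here is a minimal Python sketch that evaluates p(Y | X) by brute-force enumeration of Z(X). Everything in it is an illustrative assumption: two labels (1 read as "interface residue", 0 as "non-interface"), a single ASA-style state feature per label, indicator transition features, and made-up weights. Practical CRF implementations compute Z(X) with the forward algorithm rather than by enumeration.

```python
import itertools
import math


def score(labels, x, trans_w, state_w):
    """Unnormalised log-score of one labelling: weighted state plus transition features."""
    total = 0.0
    for i, y in enumerate(labels):
        # State feature: contributes x[i] (e.g. an ASA value) when the label is y.
        total += state_w[y] * x[i]
        if i > 0:
            # Transition feature: indicator on the label pair (y_{i-1}, y_i).
            total += trans_w[(labels[i - 1], y)]
    return total


def crf_probability(labels, x, trans_w, state_w, label_set=(0, 1)):
    """p(Y | X) = exp(score(Y)) / Z(X), with Z(X) summed over every possible labelling."""
    z = sum(
        math.exp(score(candidate, x, trans_w, state_w))
        for candidate in itertools.product(label_set, repeat=len(x))
    )
    return math.exp(score(tuple(labels), x, trans_w, state_w)) / z


# Hypothetical toy input: three residues with one real-valued feature each.
x = [0.1, 0.8, 0.6]
trans_w = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 0.5}
state_w = {0: -1.0, 1: 1.0}

for labelling in itertools.product((0, 1), repeat=len(x)):
    print(labelling, round(crf_probability(labelling, x, trans_w, state_w), 4))
```

Enumeration costs |labels|^n evaluations, so it is only useful for checking a formula on very short sequences.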
CRF: Ribosomal subunit 30S
CRF: Streptococcal pyrogenic enterotoxin C (SpeC)

SCOP: family, superfamily, fold
N-grams, Binary profiles, N-nary profiles

Binary profiles

Amino acid sequence:
QTSVSPSKVILPRGGSVLVTCSTSCDQPKLLGIETPLPKKELLLPGNN

Multiple sequence alignment:
EI.IH.P.A.I.....LR...P..I...RKTTF.L..V.N.E.VS-R.P.W..FL...D...EIN.L.................
.V.IH.TEAF.......Q.P..S..EDEN...L..NWM.D..S-S.H.W.LFK..DIG.R.L.FE..GTT

Steps: frequency profile (PSI-BLAST), then binary profile (frequency threshold 0.17), then amino acid combination.

Column 1
Frequency profile: A: 0.03 C: 0.002 D: 0.26 E: 0.06 F: 0.01 G: 0.01 H: 0.07 I: 0.01 K: 0.01 L: 0.05 M: 0.02 N: 0.01 P: 0.18 Q: 0.02 R: 0.11 S: 0.03 T: 0.03 V: 0.02 W: 0.02 Y: 0.03
Binary profile: A: 0 C: 0 D: 1 E: 0 F: 0 G: 0 H: 0 I: 0 K: 0 L: 0 M: 0 N: 0 P: 1 Q: 0 R: 0 S: 0 T: 0 V: 0 W: 0 Y: 0
Amino acid combination: DP

Column 2
Frequency profile: A: 0.06 C: 0.004 D: 0.04 E: 0.03 F: 0.03 G: 0.02 H: 0.02 I: 0.2 K: 0.03 L: 0.18 M: 0.01 N: 0.05 P: 0.02 Q: 0.02 R: 0.06 S: 0.002 T: 0.05 V: 0.17 W: 0.002 Y: 0.002
Binary profile: A: 0 C: 0 D: 0 E: 0 F: 0 G: 0 H: 0 I: 1 K: 0 L: 1 M: 0 N: 0 P: 0 Q: 0 R: 0 S: 0 T: 0 V: 1 W: 0 Y: 0
Amino acid combination: ILV
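The thresholding step of the binary-profile example above can be sketched in a few lines of Python. The function names are hypothetical, and the comparison is assumed to be inclusive (frequency >= 0.17), because V at exactly 0.17 is set to 1 in the second column of the slide.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_column(text):
    """Parse 'A: 0.03 C: 0.002 ...' into a dict residue -> frequency."""
    tokens = text.replace(":", " ").split()
    return {tokens[i]: float(tokens[i + 1]) for i in range(0, len(tokens), 2)}


def binary_profile(freqs, threshold=0.17):
    """Map one frequency column to a 0/1 vector in AMINO_ACIDS order."""
    return [1 if freqs.get(aa, 0.0) >= threshold else 0 for aa in AMINO_ACIDS]


def combination(bits):
    """Concatenate the residues whose bit is set, giving the amino acid combination."""
    return "".join(aa for aa, bit in zip(AMINO_ACIDS, bits) if bit)


# The two frequency columns shown on the slide.
col1 = parse_column(
    "A: 0.03 C: 0.002 D: 0.26 E: 0.06 F: 0.01 G: 0.01 H: 0.07 I: 0.01 K: 0.01 L: 0.05 "
    "M: 0.02 N: 0.01 P: 0.18 Q: 0.02 R: 0.11 S: 0.03 T: 0.03 V: 0.02 W: 0.02 Y: 0.03"
)
col2 = parse_column(
    "A: 0.06 C: 0.004 D: 0.04 E: 0.03 F: 0.03 G: 0.02 H: 0.02 I: 0.2 K: 0.03 L: 0.18 "
    "M: 0.01 N: 0.05 P: 0.02 Q: 0.02 R: 0.06 S: 0.002 T: 0.05 V: 0.17 W: 0.002 Y: 0.002"
)

print(combination(binary_profile(col1)))  # DP
print(combination(binary_profile(col2)))  # ILV
```

Each amino acid combination then plays the role of a "word", so a protein becomes a sequence of such words that can be counted and fed to an SVM.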
N-nary profiles

Amino acid sequence:
QTSVSPSKVILPRGGSVLVTCSTSCDQPKLLGIETPLPKKELLLPGNN

Multiple sequence alignment:
EI.IH.P.A.I.....LR...P..I...RKTTF.L..V.N.E.VS-R.P.W..FL...D...EIN.L.................
.V.IH.TEAF.......Q.P..S..EDEN...L..NWM.D..S-S.H.W.LFK..DIG.R.L.FE..GTT

Steps: protein sequence frequency profiles (PSI-BLAST), then N-nary profiles with N=10.

Column 1
Frequency profile: A: 0.03 C: 0.002 D: 0.26 E: 0.06 F: 0.01 G: 0.01 H: 0.07 I: 0.01 K: 0.01 L: 0.05 M: 0.02 N: 0.01 P: 0.18 Q: 0.02 R: 0.11 S: 0.03 T: 0.03 V: 0.02 W: 0.02 Y: 0.03
N-nary profile: A: 0 C: 0 D: 2 E: 0 F: 0 G: 0 H: 0 I: 0 K: 0 L: 0 M: 0 N: 0 P: 1 Q: 0 R: 1 S: 0 T: 0 V: 0 W: 0 Y: 0

Column 2
Frequency profile: A: 0.06 C: 0.004 D: 0.04 E: 0.03 F: 0.03 G: 0.02 H: 0.02 I: 0.2 K: 0.03 L: 0.18 M: 0.01 N: 0.05 P: 0.02 Q: 0.02 R: 0.06 S: 0.002 T: 0.05 V: 0.17 W: 0.002 Y: 0.002
N-nary profile: A: 0 C: 0 D: 0 E: 0 F: 0 G: 0 H: 0 I: 2 K: 0 L: 1 M: 0 N: 0 P: 0 Q: 0 R: 0 S: 0 T: 0 V: 1 W: 0 Y: 0
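The N-nary discretisation in the example above can be reproduced with the following Python sketch. The rule used, floor(frequency x N), is an assumption inferred from the slide's numbers (0.26 maps to 2, 0.18 and 0.11 map to 1, 0.07 maps to 0 for N=10); the exact rule in Liu et al. may differ, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_column(text):
    """Parse 'A: 0.03 C: 0.002 ...' into a dict residue -> frequency."""
    tokens = text.replace(":", " ").split()
    return {tokens[i]: float(tokens[i + 1]) for i in range(0, len(tokens), 2)}


def nary_profile(freqs, n=10):
    """Discretise each frequency to an integer level via floor(f * n).

    The small epsilon guards against floating-point round-off just below an integer.
    """
    return {aa: int(freqs.get(aa, 0.0) * n + 1e-9) for aa in AMINO_ACIDS}


def nonzero(profile):
    """Keep only the residues with a non-zero level, for compact display."""
    return {aa: level for aa, level in profile.items() if level}


col1 = parse_column(
    "A: 0.03 C: 0.002 D: 0.26 E: 0.06 F: 0.01 G: 0.01 H: 0.07 I: 0.01 K: 0.01 L: 0.05 "
    "M: 0.02 N: 0.01 P: 0.18 Q: 0.02 R: 0.11 S: 0.03 T: 0.03 V: 0.02 W: 0.02 Y: 0.03"
)
col2 = parse_column(
    "A: 0.06 C: 0.004 D: 0.04 E: 0.03 F: 0.03 G: 0.02 H: 0.02 I: 0.2 K: 0.03 L: 0.18 "
    "M: 0.01 N: 0.05 P: 0.02 Q: 0.02 R: 0.06 S: 0.002 T: 0.05 V: 0.17 W: 0.002 Y: 0.002"
)

print(nonzero(nary_profile(col1)))  # {'D': 2, 'P': 1, 'R': 1}
print(nonzero(nary_profile(col2)))  # {'I': 2, 'L': 1, 'V': 1}
```

Compared with the binary profile, the integer levels keep some of the frequency information that a single 0/1 threshold discards (R at 0.11 is visible here but not in the binary profile).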
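Chi-square statistic of a term t with respect to a class c:
\[ \chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \]
\[ \chi^2_{avg}(t) = \sum_{i=1}^{m} \Pr(c_i)\, \chi^2(t, c_i) \]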
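A small Python sketch of the chi-square score above follows. The slide does not define A, B, C and D; the sketch assumes the usual contingency-table reading from text categorisation (A: sequences in class c that contain t; B: sequences outside c that contain t; C: sequences in c without t; D: sequences outside c without t; N = A + B + C + D). Function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square of a term against one class from its 2x2 contingency counts."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence of association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_avg(tables, priors):
    """Class-prior-weighted average: sum_i Pr(c_i) * chi2(t, c_i)."""
    return sum(p * chi_square(*table) for table, p in zip(tables, priors))


# Hypothetical counts: the term is perfectly tied to class 1 and unrelated to class 2.
print(chi_square(20, 0, 0, 20))   # 40.0
print(chi_square(10, 10, 10, 10)) # 0.0
print(chi_square_avg([(20, 0, 0, 20), (10, 10, 10, 10)], [0.5, 0.5]))  # 20.0
```

Terms (for example N-grams or profile words) can then be ranked by their average score and the top-scoring ones kept as features.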
LSA: singular value decomposition of the word-document matrix W,
\[ W \approx U S V^{T} \]
ROC50
ROC50 (cont.)

1 EMBL: http://www.embl-heidelberg.de
2 GenBank: http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
3 DDBJ: http://www.ddbj.nig.ac.jp/
GDB: http://www.gdb.org/
Ensembl: http://www.ensembl.org/
MGD: http://www.informatics.jax.org/
SGD: http://genome-www.stanford.edu/Saccharomyces/
dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
dbSTS: http://www.ncbi.nlm.nih.gov/dbSTS/
UniGene: http://www.ncbi.nlm.nih.gov/UniGene/

PIR: http://pir.georgetown.edu/
SWISS-PROT: http://www.expasy.ch/sprot/sprot-top.html
TrEMBL: http://www.ebi.ac.uk/trembl/
UniProt (includes PIR, SWISS-PROT, TrEMBL): http://www.uniprot.org/
PDB: http://www.rcsb.org/pdb/home/home.do
MMDB (PDB): http://130.14.29.110/Structure/MMDB/mmdb.shtml

dbSNP: http://www3.ncbi.nlm.nih.gov/SNP/
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
DSSP: http://www.sander.embl-heidelberg.de/dssp/
HSSP: http://www.sander.embl-heidelberg.de/hssp/
OMIM: http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=OMIM
PRINTS: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
EPD: http://www.epd.isb-sib.ch/
TRRD: http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/
TRANSFAC: http://transfac.gbf.de/
GO: http://www.geneontology.org/
PubMed: http://www.ncbi.nlm.nih.gov/
BODYMAP: http://bodymap.ims.u-tokyo.ac.jp/
PROSITE: http://www.expasy.ch/prosite/
DBCat: http://www.infobiogen.fr/services/dbcat/
EMBNet, APBioNet: http://www.cbi.pku.edu.cn/chinese/mirrors.html
The Canadian Bioinformatics Resource: http://www.cbr.nrc.ca/
Human Genome Working Draft: http://genome.ucsc.edu/
TIGR (The Institute for Genomic Research): http://www.tigr.org/
Celera: http://www.celera.com/

(Model) Organism specific information:
Yeast: http://genome-www.stanford.edu/Saccharomyces/
Arabidopsis: http://www.tair.org/
Mouse: http://www.jax.org/
Fruitfly: http://www.fruitfly.org/
Nematode: http://www.wormbase.org/

Nucleic Acids Research Database Issue: http://nar.oupjournals.org/ (First issue every year)

Database interfaces: Genbank/EMBL/DDBJ, Medline, SwissProt, PDB
Sequence alignment: BLAST, FASTA
Multiple sequence alignment: Clustal, MultAlin, DiAlign
PSI-Blast
Gene finding: Genscan, GenomeScan, GeneMark, GRAIL
Protein domain analysis and identification: Pfam, BLOCKS, ProDom
Pattern identification/characterization: Gibbs Sampler, AlignACE, MEME
Protein folding prediction: PredictProtein, SWISS-MODEL
Sun
Dong QW, Wang XL, Lin L. N-gram Statistics and Linguistic Features Analysis of Whole Genome Protein Sequences. Journal of Harbin Institute of Technology. 2004.
Li MH, Lin L, Wang XL, Liu T. Protein-protein interaction site prediction based on conditional random fields. Bioinformatics (2007).
Dong QW, Wang XL, Lin L. Application of Latent Semantic Analysis to Protein Remote Homology Detection. Bioinformatics. 22, 285-290 (2006).
Liu B, Lin L, Wang XL, Dong QW, Wang X. A discriminative method for protein remote homology detection based on N-nary profiles. BIRD08 (2008).