3 Pages
English

Algorithms in bioinformatics (CSI 5126)=1Please don't print these lecture notes unless you really need

-

Learn all about the services we offer

Description

66MotivationN Y1Algorithms in bioinformatics (CSI 5126) I How to characterize motifs?A AI How to compare motifs?N-N’IN-N’ How to select amongst several motifs?Marcel TurcotteR(turcotte@site.uottawa.ca)N-N’ aN, R and Y are de ned by IUPAC-IUBN-N’School of Information Technology and Engineering R is A or G , i.e. [AG ]University of Ottawa N-N’ Y is T or C , i.e. [TC ]CanadaN-N’N is any, i.e. [ACGT ]N-N’ 3’ 0November 27, 2009 N is the complementNa5’ N www.chem.qmul.ac.uk/iubmb/misc/naseq.html1Please don’t print these lecture notes unless you really need to!Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformatics Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsInformation theory UncertaintyThe information content measures the reduction of theuncertainty (also called entropy) after some message has beenI Information is based on the notion uncertainty about anreceived. In the case of motifs, the interpretation is \how much event | what symbol do you expect to nd at a giveninformation is gained by knowing that a sequence segmentposition of the sequence?matches a given motif".I Uncertainty is de ned as follows,Merriam-Webster Online about \entropy": MXH = P log Pi 2 i1 : ... usually considered to be a measure of thei=1system’s disorder ...3 : Chaos, disorganization, randomness.Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformatics Marcel Turcotte (turcotte@site ...

Subjects

Formal sciences

Informations

1 Algorithms in bioinformatics (CSI 5126)
Marcel Turcotte (turcotte@site.uottawa.ca)
School of Information Technology and Engineering University of Ottawa Canada
November 27, 2009
1 Please don’t print these lecture notes unless you really need to! Marcel Turcotte (turcotte@site.uottawa.ca)CSI 5126. Algorithms in bioinformatics
Information theory
Theinformation contentmeasures the reduction of the uncertainty(also calledentropy) after some message has been received. In the case of motifs, the interpretation is “how much information is gained by knowing that a sequence segment matches a given motif”.
Merriam-Webster Online about “entropy”: 1 : ... usually considered to be a measure of the system’s disorder ... 3 : Chaos, disorganization, randomness.
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Consider a sample space that hastwo outcomes, one occurring with probabilityp, and the other outcome occurring with probability 1p.
The above picture shows how the entropy varies as a function ofp.
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Thenentropy is maximal when theMoutcomes are equally likely, and zero when only one outcome out ofMoccurs. Consider the case where all the outcomes are equiprobable, 1 Piall= fori1. . .M. M P Pilog2Pi= i=1..M P 1 1 log = 2 i=1..M MM 1 1 M×log = 2 M M 1 log =logM 2 2 M
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Motivation
N Y A A N-N’ N-N’ R N-N’ N-N’ N-N’ N-N’ N-N’ 3’ N 5’ N
How to characterize motifs? I I How to compare motifs? How to select amongst several motifs? I
a N,RandYare deﬁned by IUPAC-IUB RisAorG, i.e. [AG] YisTorC, i.e. [TC] Nis any, i.e. [ACGT] 0 Nis the complement
a www.chem.qmul.ac.uk/iubmb/misc/naseq.html
Marcel Turcotte (turcotte@site.uottawa.ca)
Uncertainty
CSI 5126. Algorithms in bioinformatics
I Informationis based on the notionuncertainty about an event— what symbol do you expect to ﬁnd at a given position of the sequence? I Uncertainty is deﬁned as follows,
M X Pg H=ilo2Pi i=1
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
In particular, you can clearly see that theentropy is I maximumwhen theevents are all equiprobable, its value is then logMbits, whereMis the number of outcomes (the 2 cardinality of the sample spaceM=|S|). Here, the entropy maximum is log2 = 1 bit. 2 Notice also that theentropy approaches zero, whenever the I probability of one of the events approaches 1(and hence, the probabilities of the other events approach 0). This models quite well the concept of uncertainty (entropy). I When all the outcomes are equiprobable you can’t predict the result of an experiment, but any bias towards one of the outcomes reduces the uncertainty.
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Finally, consider the case whereone outcome occurs with probability 1, and the otherM1 outcomes occur with probability 0. X (1×+ 0log 1×log 0) 2 2 i=1..M,Pi6=1 X (1×0 +0) i=1..M,Pi6=1 the uncertainty is zero as expected.
It is customary to let0log 0= 0. 2
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Information content
Theinformation contentis deﬁned as,
I=HbeforeHafter
i.e. the diﬀerence of entropy between two probability distributions.
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Information content: solid characters
When a regular expression contains asingle character, sayC, then the amount of information gained is maximal log4 = 2 bits. 2
I=HbeforeHafter= 20 = 2
Marcel Turcotte (turcotte@site.uottawa.ca)
Information content
CSI 5126. Algorithms in bioinformatics
The information content for an expression will be thesum of the information content for each term, I= 2 + 1 + 2 + 0 = 5. G[GA]C[ACGT]
Marcel Turcotte (turcotte@site.uottawa.ca)
Mutual information
RNA comparative information content between two sites
CSI 5126. Algorithms in bioinformatics
sequence analysisoften uses mutual to measure thedegree of association (columns)of an alignment.
M(I,J) =H(I) +H(J)H(I,J) where X H(I,J) =P(i=α,j=β) logP(i=α,j=β) αβ
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Information content: wildcards
Consider the case of DNA strings, Σ ={A,C,G,T}, where I all four bases are equiprobable, i.e.Pi= 0.25. Considering a wild card, [ACGT], no information is gained. I
I=HbeforeHafter
4 4 X X 1 11 1 I= (− ×log )(− ×log )= 22 = 0 2 2 4 44 4 1 1
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Information content: character class
In the case of acharacter classcontaining two elements, [AG],
4 X 1 11 1 I= (− ×log )(0)])) + 2(0 log[2( log 2 22 4 42 2 1 1 bit of information is gained.
Marcel Turcotte (turcotte@site.uottawa.ca)
Exercise
CSI 5126. Algorithms in bioinformatics
Consider an organism whose genome has the following nucleotide 1 1 1 1 frequencies:PA=,PC=,PG=,PT= .Calculate the 6 3 3 6 information content of the following expressionG[GA]C[ACGT].
I=I+I+I+I G[GA]C[ACGT]G[GA]C[AGT]
Marcel Turcotte (turcotte@site.uottawa.ca)
Sequence logo
CSI 5126. Algorithms in bioinformatics
www.lecb.ncifcrf.gov/˜toms/sequencelogo.html weblogo.berkeley.edu
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Marcel Turcotte (turcotte@site.uottawa.ca)
Information Content
CSI 5126. Algorithms in bioinformatics
N Y Information content forunpairedpositions: I A A 2×2 bits (for As) + 2×1 (R,Y) = 6 bits. N-N’ 6 We expect a match every 2= 64 I N-N’ nucleotides by chance. R N-N’Information content for abase pairis also 2 I N-N’bits, total forpairedpositions = 14 bits. N-N’Sequence + 2D= 6 bits + 14 bits = 20 N-N’bits. 20 N-N’ 3’I 1 spurious hits per 2'1Mnucleotides N MS2 genome has 3569 bp. I 5’ N (bacteriophage related to R17) Sequence based search ﬁnds 38 matches; 37 I of which are spurious. Sequence + secondary structure ﬁnds one I match. Marcel Turcotte (turcotte@site.uottawa.ca)CSI 5126. Algorithms in bioinformatics
R17 complete genome
>gi|212383464|gb|EF108465.1| Enterobacteria phage MS2 isolate R17, complete genome GGGTGGGACCCCTTTCGGGGTCCTGCTCAACTTCCTGTCGAGCTAATGCCATTTTTAATGTCTTTAGCGA GACGCTACCATGGCTATCGCTGTAGGTAGCCGGAATTCCATTCCTAGGAGGTTTGACCTATGCGAGCTTT TAGTGCCCTTGATAAGGAGAGCAAGACCTTCGTCCCTTCCATTCGCGTTTACGCGAACGGTGAGACCGAA GATAACTCATTCTCTTTAAAATATCGCTCGAACTGGACTCCCGGTCGTTTTAACTCGACTGGGGCCAGAA CGAAGCAGTGGCACTATCCCTCCCCGTATTCGCGGGGGGCGTTAAGTGTCACGTCGATAGATCAAGGTGC CTATAAGCGCAGTGGGTCATCGTGGGGTCGCCCGTACGAGGAGAAAGCCGGTTTTGGCTTCTCTCTCGAC GCACGCTCCTGCTACAGCCTCTTCCCTGTAAGCCAGAACTTGACTTACATCGAAGTGCCGCAGAACGTTG CGAATCGGGCGTCGACCGAAGTCCTGCAAAAGGTTACCCAAGGTAATTTCAACCTTGGTGTCGCCCTAGC AGAGGCCAGATCGACAGCCTCACAACTCGCGACGCAAACCATTGCGCTCGTGAAGGCGTACACTGCCGCT CGTCGCGGTAATTGGCGCCAGGCGCTCCGCTACCTCGCCCTAAACGAAGATCGAAAATTTCGATCAAAAC ACGTGGCCGGCAGGTGGTTGGAGTTGCAGTTCGGATGGTTACCACTAATGAGTGATATCCAAGGTGCATA TGAGATGCTTACGAAGGTTCACCTTCAAGAGTTTCTTCCTATGAGAGCCGTACGTCAAGTTGGTACTAAC ATTAAGTTAGATGGCCGCTTGGCGTATCCAGCTGCAAACTTCCAGACAACGTGCAACATATCACGACGTA TCGTGATATGGTTTTACATAAACGATGCACGTTTGGCATGGTTGTCGTCTCTAGGTATCTTGAACCCACT AGGTATAGTGTGGGAAAAGGTGCCTTTCTCATTCGTTGTCGACTGGCTCCTACCTGTAGGTAACATGCTC GAGGGCCTTACGGCTCCCGTTGGATGCTCCTACATGTCAGGAACAGTTACTGACGTAATAACGGGTGAGT CCATCATAAGCGTTGACGCTCCCTATGGGTGGACTGTGGAGAGACAGGGCACTGCTAAGGCCCATGTTTC AGCCATGCATCGAGGGGTACAATCCGTATGGCCAACAACTGGCGCATACGTAAAGTCTCCTTTCTCGATG GTCCATACCTTAGATGCGTTAGCATTAATCAGGCAACGGCTCTCTAAATAGAGCCCTCAACCGGAGTTTG AAGCATGGCTTCTAACTTTACTCAGTTTGTTCTCGTCGACAATGGCGGAACTGGCGACGTGACTGTCGCT CCAAGCAACTTCGCTAACGGGGTCGCTGAATGGATCAGCTCTAACTCGCGCTCACAGGCTTACAAAGTAA CCTGTAGCGTTCGTCAGAGCTCTGCGCAGAATCGCAAATACACCATTAAAGTCGAGGTGCCTAAGGTGGC AACTCAGACTGTTGGTGGTGTAGAGCTTCCTGTAGCCGCATGGCGTTCGTACTTAAATATGGAATTAACT ATTCCAATTTTCGCTACGAACTCCGATTGCGAGCTTATTGTTAAGGCAATGCAAGGTCTCCTAAAAGATG GAAACCCGATTCCCTCAGCAATCGCAGCAAACTCCGGCATCTACTAATAGATGCCGGCCATTCAAACATG
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
˚ Bacteriophage MS2 (PDB 2MS2, 2.8 A resolution)
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
FYI: Claude Shannon – Father of the Information Age
Half hour video presenting Claude Shannon’s work. “This fascinating program explores his life and the major inﬂuence his work had on today’s digital world through interviews with his friends and colleagues.” (includes comments from Andrew Viterbi, Ian Blake, and others) www.ucsd.tv/search-details.asp?showID=6090 See also:cm.bell-labs.com/cm/ms/what/shannonday/paper.html
Marcel Turcotte (turcotte@site.uottawa.ca)
What is R17?
CSI 5126. Algorithms in bioinformatics
I R17is a bacterial phage (bacteriophage), a virus that parasitizes a bacterium; I R17 is a single strand RNA virus, 3,569 bp; Acoat proteinis a protein that makes up the envelop I (capsid) of a virus; I Encodes 4 proteins: assembly protein, coat protein, lysis protein, replicase; I R17 coat protein binds the hairpin structure, showed on the previous screen, and represses the translation of itsreplicase. Functional recognition of fragmented operator sites by R17/MS2 coat protein, a translational repressor. Fouts DE, True HL, Celander DW. Nucleic Acids Res. 1997 Nov 15;25(22):4464-73.
Marcel Turcotte (turcotte@site.uottawa.ca)
R17 complete genome (cont.)
CSI 5126. Algorithms in bioinformatics
AGGATTACCCATGTCGAAGACAACAAAGAAGTTCAACTCTTTATGTATTGATCTTCCTCGCGATCTTTCT CTCGAAATTTACCAATCAATTGCTTCTGTCGCTACTGGAAGCGGTAATCCGCACAGTGACGACTTTACAG CAATTGCTTACCTAAGGGACGAATTGCTTACAAAGCATCCGACTTTAGGCTCTGGTAATGACGAGGCGAC CCGTCGCACCTTAGCTATTGCTAAGCTGCGGGAGGCGAATGATCGGTGCGGCCAGATAAATAGAGAAGGT TTCTTACATGACAAATCCTTGTCATGGGATCCGGATGTTTTACAAACCAGCATCCGTAGCCTTATCGGCA ATCTTCTCTCTGGCTACCAATCGTCGTTGTTTGGGCAATGCACGTTCTCCAACGGTGCCTCTATGGGGCA CAAGTTGCAGGATGCAGCGCCTTACAAGAAGTTCGCTGAACAAGCAACCGTTACCCCCCGCGCTCTGAGA GCGGCTCTATTGGTCCGAGACCAATGTGCGCCGTGGATCAGACACGCGGTCCACTATAACGAGTCATATG AATTTAGGCTCGTTGTAGGGAACGGAGTGTTTACAGTTCCGAAGAATAATAAAATAGATCGGGCTGCCTG TAAGGAGCCTGATATGAATATGTACCTCCAGAAAGGGGTCGGCGCTTTTATTAGACGCCGGCTCAAATCC GTTGGTATAGACCTGAATGATCAGTCGATCAACCAGCGTCTAGCTCAGCAAGGCAGCGCAGATGGTTCGC TTGCGACGATAGACTTATCGTCCGCGTCCGATTCCATTTCCGACCGCCTAGTGTGGAATTTCCTTCCACC TGAGCTATATTCATATCTTGATCGTATCCGCTCGCACTACGGAATCGTAGATGGCGAGACGATACGATGG GAACTATTTTCCACGATGGGAAATGGGTTTACATTTGAGCTAGAGTCCATGATATTCTGGGCAATAGTCA AAGCAACCCAAATCCATTTTGGTAACGCCGGAACCATAGGCATCTACGGGGACGATATCATATGTCCCAG TGAGATTGCACCCCGTGTGCTAGAGGCACTTGCCTACTACGGTTTTAAACCGAATCTTCGTAAAACGTTC GTGTCCGGGCTCTTTCGCGAGAGCTGCAGCGCGCACTTTTACCGTGGTGTCGATGTCAAACCGTTTTACA TCAAGAAACCTGTTGACAACCTCTTTGCCCTGATGCTGATATTAAATCGGCTACGGGGTTGGGGGGTTGT CGGAGGTATGTCAGATCCACGCCTTTACAAGGTGTGGGTACGACTCTCCTCCCAGGTACCTTCGATGTTC TTCGGTGGGACGGACCTCGCTGCCGACTACTACGTAGTCAGCCCGCCTACGGCGGTCTCGGTATATACCA AGACTCCGTATGGACGGCTGCTCGCGGATACCCGTACCTCGGGTTTCCGTCTTGCTCGTATCGCTCGAGA ACGCAAGTTCTTCAGCGAAAAGCATGACAGTGGTCGCTACATAGCGTGGTTCCATACTGGAGGTGAAATC ACCGACAGTATGAAGTCCGCCGGCGTGCGCATCATGCGCACTTCGGAGTGGCTAACGCCGGTTCCCACAT TCCCTCAGGAGTGTGGGCCAGCGAGCTCTCCTCGGTAGCTGACCGAGGGACCCCCGTAAACGGGGTGGGT GTGCTCGAAAGAGCACGGGTCCGCGAAAGCGGTGGCTCCACCGAAAGGTGGGCGGGCTTCGGCCCAGGGA CCTCCCCCTGAAGAGAGGACCCGGGATTCTCCCGATTTGGTAACTAGCTGCTTGGCTAGTGACCACCCA
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics
Please don’t print these lecture notes unless you really need to!
Marcel Turcotte (turcotte@site.uottawa.ca)
CSI 5126. Algorithms in bioinformatics