Algorithms in bioinformatics (CSI 5126)=1Please don

Algorithms in bioinformatics (CSI 5126)=1Please don't print these lecture notes unless you really need

-

English
82 Pages
Read
Download
Downloading requires you to have access to the YouScribe library
Learn all about the services we offer

Description

1Algorithms in bioinformatics (CSI 5126)Marcel Turcotte(turcotte@site.uottawa.ca)School of Information Technology and EngineeringUniversity of OttawaCanadaOctober 2, 20091Please don’t print these lecture notes unless you really need to!Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsPlanI String algorithmsI MotivationI NotationI Glimpse of Boyer-Moore string matching algorithmI Su x treesI De nitionI Na ve construction algorithmI Daniela Cernea’s Java libraryMarcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsMotivationI The simplest model of a macromolecule is a string. Yet, thislevel of abstraction is su cient for a considerably largenumber of applicationsI The size of the biological databases has been doubling every12 to 18 months for the last few yearsI Millions of queries are made to (static) databasesI Instances of exact and approximate string matchingproblems are solved as sub-tasks of several bioinformaticsapplications, such as the DNA assembly processI Natural transition between approximate string matching andmolecular sequence alignmentsMarcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsExamples of Problems on Strings1. Exact string matching: nding all occurrences of a string in atext2. Approximate string matching: nd all positions in a text wherea pattern occurs, allowing for a certain number of mismatches3. Longest common ...

Subjects

Informations

Published by
Reads 28
Language English
Report a problem
1PlirpthttnesaenodnoresuteelestuecylenerlaysuolnseurcocelT!Maredtotis@ettocrut(ettSI)Ccaa.awttuoe.oibnofnitamrsci2651lg.Aitorsihm
1
Algorithms in bioinformatics (CSI 5126)
School of Information Technology and Engineering University of Ottawa Canada
October 2, 2009
Marcel Turcotte (turcotte@site.uottawa.ca)
oibiinmstimaornf
I
String algorithms IMotivation INotation IGlimpse of Boyer-Moore string matching algorithm ISuffix trees IDefinition INa¨ıve construction algorithm IDaniela Cernea’s Java library
Plan
sc.uotsite.ca)tawa21.6SC5IirhtlAogtte@urcote(trcotleuTaMcr
Marutcrtt(eruocecTlawttuoe.it@steotglA.6215ISC)ac.arotimhisbnoiniofrmatics
Motivation
IThe simplest model of a macromolecule is a string. Yet, this level of abstraction is sufficient for a considerably large number of applications IThe size of the biological databases has been doubling every 12 to 18 months for the last few years IMillions of queries are made to (static) databases IInstances of exact and approximate string matching problems are solved as sub-tasks of several bioinformatics applications, such as the DNA assembly process INatural transition between approximate string matching and molecular sequence alignments
215IlA.6irogsmhtbiinnfoimaorcstitoett(ruoctt@eiste.uottawa.ca)CSMarcTuelrc
1. all occurrences of a string in a findingExact string matching: text 2. all positions in a text whereApproximate string matching: find a pattern occurs, allowing for a certain number of mismatches 3.Longest common substring 4.Sequence similarities and differencescomparison: highlight between twosequences 5.Regular expression matching 6. motif discovery, like repeats,Structural pattern matching: tandems and palindromes
Efficient data structures, such as suffix trees, are often at the heart of efficient algorithms.
Examples of Problems on Strings
tocrt(etocru@ettrcMaTuel21.6SC5IirhtlAog.uotsite.ca)tawa
A string,S, is an ordered list of characters written contiguously from left to right. |S|denotes the length of the stringS. represents the empty string. S(i) denotes theith character ofS. The index 1 denotes the first character ofS.
Notation
scoibiinmstimaornf
The set of all characters is called thealphabet, often denoted Σ or A all our applications the alphabet is a finite set of symbols,. In although it varies in size from one application to the other: IDNA alphabet: {A,C,G,T} IProtein alphabet: {A, ..,Z} \ {B,J,O,U,X,Z} ISolvent accessibility, Inside (hydrophobic), Surface (hydrophilic): {I,S} Istructure states of proteins, Helix, Strand and Coil:Secondary {H,E,C}
s
Notation (cont.)
MruTlecrabnisnioimrofcita51SI.A26orlghmitti.eouttwa.aacC)cotte(turcotte@s
nismioibrofnitamcs
Notation
S[i..j] denotes the (contiguous)substringofSthat starts at positioniand stops at positionj,S(i)S(i+ 1). . .S(j); also called afactor. S[1..i] is theprefixofS. S[i..|S|] is thesuffixofS. A substring, prefix or suffix isproperif it’s not the entire string (and it is not empty). We say thatSis asubsequenceofT, if there exists an increasing set of indices ofT,i1<i2 < . .< .im, such that S=T[i1]T[i2]. . .T[im other words, the string]. InScan be obtained by deleting zero or more characters ofT. E.g.tieis a subsequence ofotherwise. We say that two charactersmatchif they are the same; otherwise we say it’s amismatch.
TuelrcMa@ettocrut(ettocr.ca)tawa.uotsiteirhtlAog21.6SC5I
cismrta
Notation (cont.)
be a text
LetPdenote a pattern (query, for now a string) andT (think of it as a database), in general|P|<<|T|.
ISC)ac.aglA.6215sihmitorfoinionbtte(turccelTurco.eouttwatoets@tiraM
MraecTlmrtaniofbnoimhisorit.Alg5126)CSIac.awattou.etis@teotrctue(ttcour
Initial motivation
Problem:Given a patternPand a textT, determine ifPoccurs inT. (boolean find(P, T)) Problem:Given a patternPand a textT, find all occurrences of PinT. (int[] findall(P, T)) P=string T= Algorithms on text (strings) have long been studied in computer science, and computation on molecular sequence data (strings) is at the heart of computational molecular biology. Present and potential algorithms forstringcomputation provide a significant intersection between computer science and molecular biology. How do you approach such problem?
cis
arMlTce@site.uottawa.caruoctt(eutcrtoettamrofnioibnismhitorlg.A2651SI)C
Na¨ıve algorithm
Move a window of size|P|is moved along the text (T). In the worst case, for every starting location, 1. . .|T|, all the symbols of the pattern (|P|) must be considered. Therefore requiring|T| × |P|comparisons.
ics
rmatinfonbiohmsirotiA.gl1562C)IScaa.awttuoe.it@settocrut(ettocruMarcelTsic
} return count;
while ( ! done ) { offset++; if ( offset == lp ) { done = true; count++; } else { done = p.charAt( offset ) != t.charAt( pos+offset ); } }
Na¨ıve algorithm (cont.)
public static int findall( String p, String t ) { int lp = p.length(), lt = t.length(), count = 0; for ( int pos=0; pos <= lt-lp; pos++ ) {
}
int offset = 0; boolean done = p.charAt( offset ) != t.charAt( pos+offset );