174 Pages
English

# A polyhedral approach to sequence alignment problems [Elektronische Ressource] / von Knut Reinert

Learn all about the services we offer

Description

A Polyhedral Approachto Sequence Alignment ProblemsDissertationzur Erlangung des Gradesdes Doktors der Ingenieurswissenschaften (Dr.-Ing.)der Technischen Fakult atder Universit at des SaarlandesvonKnut ReinertSaarbruc ken5. August 1999Datum des Kolloqiums: 5. August 1999Dekan der technischen Fakult at:Professor Dr. Wolfgang PaulGutachter:Professor Dr. Kurt Mehlhorn, MPI fur Informatik, Saarbruc ken Dr. John Kececioglu, University of Georgia, Athens, USA23AbstractWe study two problems in sequence alignment both from a theoretical and a practicalpoint of view. For the rst time in sequence alignment, we use tools from combina-torial optimization to develop branch-and-cut algorithms that solve these problemse cien tly. The Generalized Maximum Trace formulation captures several forms ofmultiple sequence alignment problems in a common framework, among them is theoriginal formulation of Maximum Trace. The Structural Maximum Trace Problemcaptures the comparison of RNA molecules on the basis of their primary sequenceand their secondary structure. For both problems we derive a characterization interms of graphs which we use to reformulate the problems in terms of integer linearprograms. We then study the polytopes (or convex hulls of all feasible solutions)associated with the integer linear program for both problems.

Subjects

##### Informatik

Informations

A Polyhedral Approach
to Sequence Alignment Problems
Dissertation
des Doktors der Ingenieurswissenschaften (Dr.-Ing.)
der Technischen Fakult at
der Universit at des Saarlandes
von
Knut Reinert
Saarbruc ken
5. August 1999Datum des Kolloqiums: 5. August 1999
Dekan der technischen Fakult at:
Professor Dr. Wolfgang Paul
Gutachter:
Professor Dr. Kurt Mehlhorn, MPI fur Informatik, Saarbruc ken Dr. John Kececioglu, University of Georgia, Athens, USA
23Abstract
We study two problems in sequence alignment both from a theoretical and a practical
point of view. For the rst time in sequence alignment, we use tools from combina-
torial optimization to develop branch-and-cut algorithms that solve these problems
e cien tly. The Generalized Maximum Trace formulation captures several forms of
multiple sequence alignment problems in a common framework, among them is the
original formulation of Maximum Trace. The Structural Maximum Trace Problem
captures the comparison of RNA molecules on the basis of their primary sequence
and their secondary structure. For both problems we derive a characterization in
terms of graphs which we use to reformulate the problems in terms of integer linear
programs. We then study the polytopes (or convex hulls of all feasible solutions)
associated with the integer linear program for both problems. For each polytope we
derive several classes of facet-de ning inequalities and show that for some of these
classes the corresponding separation problem can be solved in polynomial time. This
leads to a polynomial time algorithm for pairwise sequence alignment that is not
based on dynamic programming. Moreover, for multiple sequences the branch-and-cut
algorithms for both sequence alignment problems are able to solve to optimality
instances that are beyond the range of present dynamic programming approaches.
Zusammenfassung
Wir betrachten zwei Sequenz-Alignment-Probleme von einem theoretischen und prak-
tischen Standpunkt aus. Dabei nutzen wir Methoden der kombinatorischen Opti-
mierung, um Branch-and-Cut-Algorithmen zu entwickeln, die diese Probleme e zien t
l osen. Das sogenannte Generalized-Maximum-Trace-Problem beinhaltet verschiedene
Arten von multiplen Sequenz-Alignment in einem einheitlichen Rahmen, darunter auch
das ursprunglic he Maximum-Trace-Problem. Das sogenannte Structural-Maximum-
Trace-Problem beschreibt den Vergleich von RNA-Molekulen, basierend auf deren
Prim ar- und Sekund arstruktur. Wir leiten fur beide Probleme eine graphentheoretis-
che Formulierung her, welche wir dann zur De nition ganzzahliger linearer Programme
benutzen. Wir untersuchen die Polytope (d.h. die konvexen Hullen aller zul assigen
L osungen), die mit den ganzzahligen, linearen Programmen assoziiert sind. Fur jedes
Polytop leiten wir mehrere Klassen facettende nierender Ungleichungen her und zeigen,
da fur einige dieser Klassen das entsprechende Separationsproblem in Polynomial-
zeit gel ost werden kann. Dies impliziert unter anderem einen Polynomialzeitalgorith-
mus zum paarweisen Sequenzvergleich, welcher nicht auf dem Prinzip der dynami-
schen Programmierung beruht. Darub er hinaus sind die vorgestellten Branch-and-
Cut-Algorithmen in der Lage, Probleminstanzen einer Gr o e optimal zu l osen, die
mit Verfahren, welche auf dynamischer Programmierung beruhen, nicht gel ost werden
k onnen.Acknowledgments
The work on this thesis was carried out during the years 1994-1999 at the Max-Planck-
Institut fur Informatik in Prof. Dr. Kurt Mehlhorn’s group. The MPI always provided a
very pleasant working environment. The technical facilities were excellent and the large
number of guests and researchers that visited our group created a stimulating research
atmosphere. I was able to get to know many of these people during my stay at MPI.
This is the place to say \thank you" to a lot of them. Foremost I would like to thank my
advisor Dr. Hans-Peter Lenhof. He certainly was the person who piqued my interest in
Computational Biology and during the past years he had the greatest in uence on the
directions of my research. We often had long discussions that taught me a lot, even if
they sometimes were controversial. His enthusiasm and vision for the essential things
make him a distinguished researcher. Apart from him, Kurt Mehlhorn and Dr. Petra
Mutzel always had time to listen and to discuss problems. Petra taught me a lot about
combinatorial optimization and Kurt is simply an inexhaustible source of information
about any problem you can think of. Prof. Dr. John Kececioglu formulated the original
version of one of the problems I address in this thesis. We invited him several times
to the MPI, and each of his stays was valuable and fruitful for me, not only from
a scienti c point of view. Our way of thinking about problems is very much alike.
Another person I had the honor to work with was Dr. Martin Vingron who has a very
likeable personality and a great style of working on problems. In addition, he was \a
living library" to me at the times we met or worked together. During the rst year at
MPI, Dr. Phil Bradford was my o ce mate and we had a lot of fun together (we actually
also wrote a paper). Anybody who knows Phil knows what I mean. The contributions
of some of my friends and fellow PhD students cannot be counted here, but nevertheless
I shall try. They were room mates, o ce mates, correctors, discussion partners, and
they were always willing to cheer me up when the going got tough. Thank you Michael,
Gunnar, Rudiger, Ralf, Oliver, Rene, Hannah, Volker, Susan and all the others. Also
many researchers that stayed at the MPI or were editors for some of the papers that
address the problems in the thesis made valuable remarks on some topics. Among all of
them I am particulary thankful to Prof. Dr. Pavel Pevzner and Prof. Dr. Naveen Garg.
Last, but by far not least, I want to thank my wife, Birgit, who always supported me
during this time, proofread the thesis and | being the librarian at the MPI | always
provided me with the papers and books I needed.Contents
1 Introduction 1
2 Mathematical Preliminaries 17
2.1 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Polyhedral Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Integral Polytopes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Polyhedral Combinatorics . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Independence Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.8 Stringology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Detecting Similarity { Alignments 29
3.1 Conventional Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 Pairwise Alignments . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Multiplets . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Structural Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Detecting Similarity { Traces 45
4.1 Conventional Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.1 Pairwise Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.2 Multiple Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Structural Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Gapped Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56CONTENTS
5 The GMT and SMT Problem 59
5.1 Problem De nition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Dynamic Programming based Algorithms . . . . . . . . . . . . . . . . . 61
5.2.1 The MT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.2 The SMT . . . . . . . . . . . . . . . . . . . . . . . . . 64
6 The Combinatorial Optimization Approach 67
6.1 The Generic Branch-and-Cut Algorithm . . . . . . . . . . . . . . . . . . 68
6.1.1 A Cutting Plane Approach . . . . . . . . . . . . . . . . . . . . . 68
6.1.2 Branch-and-Bound . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.3 Branch-and-Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 The GMT Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2.1 A Characterization of the GMT Problem as ILP . . . . . . . . . 77
6.2.2 The Structure of the GMT Polytope . . . . . . . . . . . . . . . . 78
6.2.3 Bounds for the GMT Problem . . . . . . . . . . . . . . . . . . . 88
6.2.4 Computational Results for the GMT Problem . . . . . . . . . . . 91
6.3 The SMT Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3.1 A Characterization of the SMT Problem as ILP . . . . . . . . . 102
6.3.2 The Structure of the SMT Polytope . . . . . . . . . . . . . . . . 104
6.3.3 Bounds for the SMT Problem . . . . . . . . . . . . . . . . . . . . 111
6.3.4 Computational Results for the SMT Problem . . . . . . . . . . . 115
7 Discussion 137
8 Deutsche Zusammenfassung 141
9 Glossary 151
Bibliography 155
Index 161
iiiChapter 1
Introduction