215 Pages
English
Gain access to the library to view online
Learn more

Communication theory applied to selected problems in computational genetics [Elektronische Ressource] / Janis Dingel

Gain access to the library to view online
Learn more
215 Pages
English

Subjects

Informations

Published by
Published 01 January 2009
Reads 30
Language English
Document size 2 MB

Exrait

¨ ¨TECHNISCHE UNIVERSITAT MUNCHEN
Lehrstuhl fu¨r Nachrichtentechnik
Communication theory applied to selected problems
in computational genetics
Janis Dingel
Vollst¨andiger Abdruck der von der Fakult¨at fu¨r Elektrotechnik und Informationstechnik
der Technischen Universit¨at Mu¨nchen zur Erlangung des akademischen Grades eines
Doktor–Ingenieurs
genehmigten Dissertation.
Vorsitzender: Univ.–Prof. Dr.–Ing. J. Ebersp¨acher
Pru¨fer der Dissertation:
1. Univ.–Prof. Dr.–Ing. Dr.–Ing. E.h. J. Hagenauer (i.R.)
2. Prof. O. Milenkovic, Ph.D.,
University of Illinois at Urbana Champaign, USA
DieDissertationwurdeam25.06.2009beiderTechnischenUniversit¨atMu¨ncheneingereicht
unddurchdieFakult¨atfu¨rElektro-undInformationstechnikam06.10.2009angenommen.DEDICATED TO
LONI ELISABETH DINGEL
b d23.05.1950 - 23.01.2010Preface
This thesis is the result of my work as a research and teaching assistant at the Institute
for Communications Engineering of the Technische Universit¨at Mu¨nchen. There are a
number of people I would like to thank.
First of all, I would like to thank Professor Dr.–Ing. Dr.–Ing. E.h. Joachim Hagenauer for
givingmetheopportunitytoworkinhisgroupandforhiscontinuoussupportthroughout
the course of my research. His interest and advice have been essential to this work.
Interdisciplinary, collaborative work can only be successful with strong partners from
the other discipline involved. Therefore, I am very grateful to PD Dr. Jakob Mu¨ller,
Dr. Vladimir Kuryshev, and Dr. Ju¨rgen Zech for their invaluable comments and our
fruitful discussions during our regular ComInGen group meetings, and their patience in
explaining biology to me.
Chapter 7 of this work is the result of my collaboration with Professor Olgica Milenkovic
fromtheUniversityofIllinoisatUrbanaChampaign. MyvisitattheCoordinatedScience
Laboratory was an unforgettable experience from which I have benefited a lot. I would
like to thank Professor Milenkovic for this invitation, her advice, and for serving as the
co-examiner of this thesis.
I am very thankful to all my former colleagues and students for the enjoyable atmosphere
and the inspiring working environment at the LNT. I thank Pavol Hanus and Dr. Jo-
hanna Weindl for our scientific discussions and for proofreading this work. Special thanks
to Stehphan Hellerbrand, Manfred Danzer, Martin Kontny and Professor Dr.-Ing. ha-
bil. Gu¨nter S¨oder for making the work in the system administration group enjoyable. I
would also like to thank Professor Dr.–Ing. Norbert Hanik for the possibility to finish my
work at the Institute for Communications Engineering.
Finally I would like to thank my family, my friends, and especially Laura for her love and
support.
Janis DingelMunich, February 2010Contents
1 Introduction 1
2 Information theory and statistical methods 5
2.1 Mathematical notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Entropy, divergence and mutual information . . . . . . . . . . . . . 7
2.2.2 Transmission channels . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Example: insertion and deletion channels . . . . . . . . . . . . . . . 13
2.3 Coding theory and error correction . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Basic principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Example: convolutional codes . . . . . . . . . . . . . . . . . . . . . 17
2.4 Markov processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Discrete time Markov process (DTMP) . . . . . . . . . . . . . . . . 18
2.4.2 Continuous time Markov process (CTMP) . . . . . . . . . . . . . . 21
2.4.3 Properties of continuous time Markov processes . . . . . . . . . . . 24
2.4.4 Example: continuous evolution of a quarternary symbol . . . . . . . 26
2.4.5 Hidden Markov processes . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Basic principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 Example: ML distance of temporally diverged sequences . . . . . . 31
2.5.3 Properties of the MLE . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Expectation maximization algorithm . . . . . . . . . . . . . . . . . . . . . 33
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36vi CONTENTS
3 Communication principles of the living cells 37
3.1 The physical layer - DNA and RNA . . . . . . . . . . . . . . . . . . . . . . 37
3.1.1 Genomic material as digital signals . . . . . . . . . . . . . . . . . . 37
3.1.2 From DNA to RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Transmission and maintenance of DNA . . . . . . . . . . . . . . . . . . . . 41
3.2.1 DNA replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Error control mechanisms in the cell . . . . . . . . . . . . . . . . . 42
3.2.3 Characterization of genetic errors . . . . . . . . . . . . . . . . . . . 43
3.3 Information packets - the genes . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 What is a gene? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Protein coding genes and the genetic code . . . . . . . . . . . . . . 45
3.4 Network layer: gene interaction . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Bioinformatics 51
4.1 Analysis of two DNA sequences . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Sequence homology and scoring schemes . . . . . . . . . . . . . . . 52
4.1.2 Models of sequence evolution . . . . . . . . . . . . . . . . . . . . . 54
4.1.3 Gaps and pairwise sequence alignment . . . . . . . . . . . . . . . . 57
4.2 Analysis of multiple sequences . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Evolution model and phylogenetic trees . . . . . . . . . . . . . . . . 58
4.2.2 Felsenstein algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.3 Multiple sequence alignment . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Identification of highly conserved DNA sequences 65
5.1 Background and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Identification of phylogenetic systems . . . . . . . . . . . . . . . . . . . . . 67
5.2.1 Substitution process estimation . . . . . . . . . . . . . . . . . . . . 67
5.2.2 Tree reconstruction and evolutionary distance . . . . . . . . . . . . 70
5.3 Extended models of evolution . . . . . . . . . . . . . . . . . . . . . . . . . 72CONTENTS vii
5.3.1 A space-time process accounts for variable rates . . . . . . . . . . . 72
5.3.2 Models of rate heterogeneity . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Detection of potentially functional elements . . . . . . . . . . . . . . . . . 74
5.4.1 Identification problem and overview of existing methods . . . . . . 76
5.4.2 KuLcons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.3 Run time of the algorithm . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.1 Small sample sizes and error distribution . . . . . . . . . . . . . . . 84
5.5.2 Sliding window estimation of a Markov gamma process . . . . . . . 85
5.5.3 In silico comparison of methods . . . . . . . . . . . . . . . . . . . . 87
5.5.4 Application to ENCODE data . . . . . . . . . . . . . . . . . . . . . 89
5.5.5 Extendingψ to model insertion and deletion events . . . . . . . . . 93
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.8 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6 On the DNA code reverse engineering problem 101
6.1 Is there an error correcting code in the DNA? . . . . . . . . . . . . . . . . 101
6.1.1 DNA error correction from a coding theoretic perspective . . . . . . 102
6.1.2 Theoretical and experimental evidence . . . . . . . . . . . . . . . . 104
6.1.3 Limitations of Battail’s model . . . . . . . . . . . . . . . . . . . . . 105
6.1.4 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 The code reverse engineering problem . . . . . . . . . . . . . . . . . . . . . 107
6.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2.2 EM encoder tap estimation . . . . . . . . . . . . . . . . . . . . . . 109
6.2.3 Transformation into the log-likelihood domain . . . . . . . . . . . . 111
6.2.4 Derivation of the gradient . . . . . . . . . . . . . . . . . . . . . . . 113
6.2.5 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.2.6 Inference via global optimization . . . . . . . . . . . . . . . . . . . 116
6.2.7 Extension to quarternary alphabet . . . . . . . . . . . . . . . . . . 117viii CONTENTS
6.2.8 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.3 Application to DNA sequence data . . . . . . . . . . . . . . . . . . . . . . 119
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.5 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7 List-decoding methods for algebraic gene network models 125
7.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.2 System and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.2.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.2.2 Ordinary differential equations: . . . . . . . . . . . . . . . . . . . . 129
7.2.3 Stochastic master equations . . . . . . . . . . . . . . . . . . . . . . 130
7.2.4 Boolean network models . . . . . . . . . . . . . . . . . . . . . . . . 131
7.2.5 Probabilistic models . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2.6 Polynomial dynamical systems model . . . . . . . . . . . . . . . . . 133
7.3 Reverse engineering frameworks . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3.1 Deterministic case and the Laubenbacher-Stigler algorithm . . . . . 134
7.3.2 Accounting for stochasticity and noise . . . . . . . . . . . . . . . . 136
7.4 Tools from coding theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4.1 Polynomial codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4.2 List-decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.5 The Pellikaan-Wu list-decoder . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.6 The reverse engineering algorithm . . . . . . . . . . . . . . . . . . . . . . . 142
7.6.1 Microarrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.6.2 Microarray data preprocessing . . . . . . . . . . . . . . . . . . . . . 143
7.6.3 Constructing lists of approximating polynomials . . . . . . . . . . . 143
7.6.4 Reconstruction bounds . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.6.5 Synthetic networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.7 Application to the E. coli network . . . . . . . . . . . . . . . . . . . . . . . 147
7.7.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.7.2 E. coli SOS response network . . . . . . . . . . . . . . . . . . . . . 148