Entropy functions and rare events in disordered systems by transfer matrix calculations and Monte Carlo sampling [Elektronische Ressource] / von Stefan Wolfsheimer

Entropy functions and rare events in disordered systems by transfer matrix calculations and Monte Carlo sampling [Elektronische Ressource] / von Stefan Wolfsheimer

English
183 Pages
Read
Download
Downloading requires you to have access to the YouScribe library
Learn all about the services we offer

Description

Entropy FunctionsandRare Eventsin Disordered Systemsby Transfer Matrix CalculationsandMonte Carlo SamplingVon der Fakulta¨t V (Mathematik und Naturwissenschaften) der Carl von OssietzkyUniversita¨t Oldenburg zur Erlangung des Grades einesDoktors der Naturwissenschaften (Dr. rer. nat.)angenommene DissertationvonStefan Wolfsheimergeboren am 20.09.1977 in Frankfurt am Main2Gutachter: Prof. Dr. Alexander K. HartmannZweitgutachter: Prof. Dr. Andreas EngelTag der Disputation: 27.02.20092ContentsZusammenfassung / Abstract 11 Introduction 32 Monte Carlo methods 72.1 Simple sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 The Metropolis-Hastings algorithm . . . . . . . . . . . . . . . . . . . 82.3 The dynamics of the N-fold way . . . . . . . . . . . . . . . . . . . . 92.4 Parallel tempering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.5 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.5.1 Equilibration . . . . . . . . . . . . . . . . . . . . . . . . . . 122.5.2 Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.6 Sampling of rare events I:Importance sampling and reweighting . . . 132.6.1 Estimation of relative normalization constants . . . . . . . . . 162.6.2 Illustration: Reweighting probability distributions . . . . . . . 172.7 Sampling of rare events II:Generalized ensemble methods . . . . . . . 182.7.1 Wang-Landau sampling . . . . . . . . . . . . . . . . . . . .

Subjects

Informations

Published by
Published 01 January 2009
Reads 4
Language English
Document size 5 MB
Report a problem

Entropy Functions
and
Rare Events
in Disordered Systems
by Transfer Matrix Calculations
and
Monte Carlo Sampling
Von der Fakulta¨t V (Mathematik und Naturwissenschaften) der Carl von Ossietzky
Universita¨t Oldenburg zur Erlangung des Grades eines
Doktors der Naturwissenschaften (Dr. rer. nat.)
angenommene Dissertation
von
Stefan Wolfsheimer
geboren am 20.09.1977 in Frankfurt am Main2
Gutachter: Prof. Dr. Alexander K. Hartmann
Zweitgutachter: Prof. Dr. Andreas Engel
Tag der Disputation: 27.02.2009
2Contents
Zusammenfassung / Abstract 1
1 Introduction 3
2 Monte Carlo methods 7
2.1 Simple sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 The Metropolis-Hastings algorithm . . . . . . . . . . . . . . . . . . . 8
2.3 The dynamics of the N-fold way . . . . . . . . . . . . . . . . . . . . 9
2.4 Parallel tempering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Equilibration . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.2 Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Sampling of rare events I:Importance sampling and reweighting . . . 13
2.6.1 Estimation of relative normalization constants . . . . . . . . . 16
2.6.2 Illustration: Reweighting probability distributions . . . . . . . 17
2.7 Sampling of rare events II:Generalized ensemble methods . . . . . . . 18
2.7.1 Wang-Landau sampling . . . . . . . . . . . . . . . . . . . . 20
2.7.2 Optimized ensembles . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Sampling of rare events III:Evaluation of the number of potential moves 24
2.8.1 The density of states by transition matrix estimates . . . . . . 24
2.8.2 The ParQ algorithm . . . . . . . . . . . . . . . . . . . . . . . 25
3 Sequence alignment 27
3.1 Notation of sequence alignment . . . . . . . . . . . . . . . . . . . . 29
3.2 Scoring models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 ThePAM family . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 TheBLOSUM family . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3 Position specific scoring for transmembrane proteins using theSLIM family 35
3.3 Optimal alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Global alignment . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2 Local alignment . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Finite-temperature local alignment . . . . . . . . . . . . . . . . . . . 41
3.5 The linear-logarithmic phase transition . . . . . . . . . . . . . . . . . 42
3.6 Thermodynamics of local alignments by biological examples . . . . . 44
3.6.1 Strong homologs . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6.2 Weak homologs . . . . . . . . . . . . . . . . . . . . . . . . . 47
34 Contents
4 Statistics of local sequence alignment 53
4.1 Karlin-Altschul-Dembo theory and beyond . . . . . . . . . . . . . . 54
4.2 Sampling of rare events in the sequence space . . . . . . . . . . . . . 57
4.3 Statistics of two i.i.d. sequences . . . . . . . . . . . . . . . . . . . . 60
4.4 The biological example revisited . . . . . . . . . . . . . . . . . . . . 62
4.5 Statistics of position dependent alignment for transmembrane protein models 65
4.5.1 Fixed queries versus random subjects . . . . . . . . . . . . . 68
4.5.2 Random queries and position specific scoring . . . . . . . . . 71
4.6 Phase diagram and statistics of finite-temperature alignment . . . . . 76
4.7 Concluding discussion . . . . . . . . . . . . . . . . . . . . . . . . . 80
5 RNA secondary structure prediction 81
5.1 Notation of RNA secondary structures . . . . . . . . . . . . . . . . . 82
5.2 The pair-energy model . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 The free-energy model . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4 The molten-glass transition . . . . . . . . . . . . . . . . . . . . . . . 91
6 Minimum-free-energy distribution of RNA secondary structures 93
6.1 Sequence models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2 Simulation method . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3 The minimum-free-energy distributions . . . . . . . . . . . . . . . . 96
6.3.1 Entropy and thermodynamics of large deviations . . . . . . . 98
6.3.2 Comparison between random and natural RNA sequences . . 102
6.4 Discussion and outlook . . . . . . . . . . . . . . . . . . . . . . . . . 104
7 Complex state spaces and glassy Monte Carlo dynamics 107
7.1 Markov chain Monte Carlo sampling of secondary structures . . . . . 108
7.1.1 The ParQ simulation . . . . . . . . . . . . . . . . . . . . . . 111
7.1.2 Flat-histogram and optimized ensembles . . . . . . . . . . . . 112
7.2 Convergence properties of the Monte Carlo algorithms . . . . . . . . 113
7.3 Correlation between algorithmic and structural complexity . . . . . . 116
7.3.1 Ratio of number of first excitations and ground states . . . . . 116
7.3.2 Ultrametricity of the phase space . . . . . . . . . . . . . . . . 117
7.3.3 Distribution of tunneling times of the flat histogram random walk120
7.3.4 Are all ground states visited with equal probability? . . . . . . 123
7.4 Rate of convergence in extended state spaces . . . . . . . . . . . . . 125
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A Additional algorithms 131
A.1 Stochastic backtracing for local alignment . . . . . . . . . . . . . . . 131
A.2 Pair probabilities of RNA secondary structures and hierarchical backtracing131
A.3 The clustering method . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.4 Statistical significance of the Bhattacharyya distance measure . . . . . 134
B The +/- J spin glass: Algebraic tunneling times? 137
B.1 The Edwards-Anderson Hamiltonian . . . . . . . . . . . . . . . . . . 137
B.2 Extension of the state space . . . . . . . . . . . . . . . . . . . . . . . 139
B.3 Performance in the extended state space . . . . . . . . . . . . . . . . 141
C Fit parameters 147
4CONTENTS 5
D List of acronyms 153
Bibliography 154
Lebenslauf und Vero¨ffentlichungen 171
Stellungnahme 175
Danksagung 177
56 Contents
60. Zusammenfassung 1
Zusammenfassung
In der vorliegenden Arbeit wurden Fragestellungen der Bioinformatik mit Monte Carlo
Verfahren der statistischen Physik behandelt.
Beim Vergleich von molekularen Sequenzen (Sequenz Alignment) verwendet man
¨statistische Tests, um die Signifikanz beobachteter Ahnlichkeiten zu quantifizieren.
Verteilung der optimalen lokalen Alignment-Scores u¨ber Zufallssequenzen, insbeson-
dere seltene Ereignisse, sind ein wichtiger Bestandteil solcher Tests. Ich erweiterte eine
bereits bestehende Arbeit, in der große Abweichungen von der theoretisch vorherge-
sagten Gumbel-Verteilung gefunden wurde, auf weitere biologisch relevante Protein-
und Score-Modelle. In den meisten Fa¨llen konnten die Abweichungen durch eine
heuristisch modifizierte Gumbel-Verteilung beschrieben werden. Sie sind teilweise so
¨groß, dass einige signifikante Ahnlichkeiten in der bisherigen Praxis nicht als solche
klassifiziert werden. Dies kann eintreten, wenn man ein Suchergebnis weiter verfein-
ern mo¨chte. Zuerst betrachtete ich die Verteilung bezu¨glich des Standardmodells fu¨r
Proteinsequenzen fu¨r verschiedene Alignment-Parameter. In einem zweiten Schritt un-
tersuchte ich ein Modell, das Transmembran-Proteine beschreibt. Außerdem studierte
ich Verteilungen freier Energien kanonischer Alignment-Ensembles. Die temperat-
urabha¨ngige Form dieser Verteilungen deutete ich anhand des linear-logarithmischen
Phasenu¨bergangs, der in diesem Modell auftritt.
In einer a¨hnlichen Weise untersuchte ich RNA Sekunda¨rstrukturen. Hier wurde die
Verteilung der minimalen freien Energie ebenfalls u¨ber Zufallssequenzen bestimmt.
Mit diesen Verteilungen konnte ich biologische RNA Sequenzen gegen Zufallsmodelle
vergleichen. Dazu betrachtete ich mikrokanonische Sequenzensembles und verglich
deren statistische Eigenschaften mit biologischen RNA Sequenzen aus einer Daten-
bank.
Auch fu¨r eine Studie der Monte Carlo Dynamik in komplexen Energielandschaften
betrachtete ich RNA Sekunda¨rstrukturen. Diese stellen fu¨r solche Fragestellungen ein
ideales Modell dar, da es, im Gegensatz zu vielen anderen Modellen, exakt behan-
delt werden kann und gleichzeit komplexe glassartige Eigenschaften besitzt. Ich ver-
glich dynamische Eigenschaften verschiedener Monte Carlo Algorithmen mit statis-
chen Eigenschaften, die durch Transfermatrix-Berechnungen zuga¨nglich sind.
Abstract
This thesis treats problems from bioinformatics with Monte Carlo methods from sta-
tistical physics.
Methods to compare molecular sequences (sequence alignment) make use of sta-
tistical tests to assess the significance of observed similarities. Distributions of optimal
alignment scores over random sequences, particularly rare events, are integral parts of
such tests. I extended an existing work where large deviations from the asymptoti-
cally predicted Gumbel distribution were found to further biologically relevant scoring
and protein models. In most cases, deviations could be described by an heuristically
modified Gumbel distribution. In some cases the deviations are so large that, in the
previous praxis, some significant similarities are not properly classified, in particular
when one wishes to refine a certain search result. First, I studied the score distribution
for the standard protein model for different alignment parameters. In a second step, I
investigated a model which describes transmembrane proteins.
Furthermore I studied free-energy distributions of canonical alignment ensembles.
12 0. Zusammenfassung
I explained the temperature dependence of the shapes of these distributions with argu-
ments of the linear-logarithmic phase transition that occurs in this model.
In a similar way, I studied RNA secondary structures. I obtained the minimum-free-
energy distribution over random sequences. This distribution allowed me to compare
biological RNA sequences against random models. For this purpose, I considered mi-
crocanonical sequence ensembles and compared their statistical properties to those of
biological RNA sequences taken from a database.
I also used RNA secondary structures for a study of Monte Carlo dynamics in
complex energy landscapes. This model is an ideal system for such purposes, because,
in contrast to many other models, it can be treated exactly and, on the other side,
it exhibits complex glassy properties. I compared dynamical properties of different
Monte-Carlo algorithms to static properties which can be probed with transfer matrix
calculations.
2Chapter 1
Introduction
Computational molecular biology, or bioinformatics [CB05, RDM98], has arisen from
different scientific fields, like molecular biology, computer science, probability theory
and statistics. Since recently also many physicists have studied problems from bioin-
formatics and figured out that many bioinformatics models have an equivalent or a
similar description in physics and vice versa. Just as the recent tendency of increasing
interchange between computer science and physics [HR01, HW05], this point of view
allows one to interchange methods and concepts or even results between bioinformatics
and physics [LV02].
Analysis and classification of molecular biological data are challenging tasks in
bioinformatics. Most data is stored in form of biological sequences in large databases
(for example UniProt [Uni]). A biological sequence, also called primary structure,
is the linearly ordered chain of monomers of a biopolymer, such as deoxyribonucleic
acid (DNA), ribosomal nucleic acid (RNA) or proteins. They are usually encoded as
character strings over finite alphabets, four letter alphabets in the case of DNA or RNA
or the 20 letter amino acid alphabet for protein sequences.
It is commonly assumed that related organisms, so called homologs, share similari-
ties on the molecular level. For this reason sequence comparison is a fundamental tool
to detect homological relationships. Common search tools, like BLAST (Basic Lo-
cal Alignment Search Tool) [BLA], are used to search a given query sequence against
huge databases. In most cases search algorithms are based on local pairwise sequence
alignment [CB05, RDM98]. It quantitatively measures similarities between a pair of
sequences and detects corresponding regions in both sequences. The approach uses the
dynamic programing paradigm [CLR02] (commonly known as transfer matrix calcu-
lations in physical literature). The algorithms return a raw similarity score that quanti-
fies the similarity between the input objects. A more detailed introduction to sequence
alignment is given in Chapter 3.
Unfortunately, the raw score is hard to interpret because one does not know the
absolute scale of the score. An interpretation becomes possible when specifying a
probabilistic null model for the input: Then the similarity score becomes a random
variableS whose probabilitiesProb(S = s) under the null model can be determined.
Sometimes this can be done analytically [KA90, KD92, DKZ94], but usually one has
to apply numerical simulations [AG96, ABOH01, RO99, ABOH01]. The p-value as-
signed to an observed scores is defined aspval(s) := Prob(S≥s) in the null model
and−logpval(s) is a measure of surprise (and hence a universally normalized score)
for s. It is one fundamental problem in bioinformatics to find Prob(S) for a given
34 1. Introduction
comparison method, a given scoring scheme, and a given null model.
Since true homological relationships usually exhibit large scores, the rare-event tail
of the score distribution is particularly interesting. Rare-event tails are usually hardly
accessible with naive “simple sampling” methods. Similar problems can be found
in statistical mechanics, where one is interested in tails of ground-state-energy distri-
butions of disordered systems with quenched disorder (for example [Pal03, ABM04,
MG06, KKH06]). These are models with random interactions, where each realization
of random interactions induces a physical ensemble on its own.
A fruitful solution to the problem of probing the rare-event tail of such distributions
is to reinterpret the ensemble of realizations as a physical ensemble and make use of
methods to compute the microcanonical entropy function, i.e. the logarithm of the
density of states (the number of micro states for a given energy). Because feasible
exact methods are not available in most cases, such problems are approached by Monte
Carlo simulations, such as parallel tempering combined with reweighting techniques
or generalized ensemble methods.
A few years ago Hartmann applied such a method to the alignment problem [Har02,
HR04] and figured out that the accurate score distribution strongly differs from the an-
alytically predicted score distribution in the rare-event tail. Unfortunately these results
have not been considered in current database search tools, presumably because it was
applied for only one case so far.
It is one aim of this thesis to extend these results to a broader range of scoring
and protein models. Under the standard protein-sequence model, the effects of vary-
ing scoring parameters was studied. In a second step, the score statistics for a special
class of proteins that are hardly described by the standard model was considered. Fi-
nally, the statistics of a finite-temperature version of the local alignment algorithm
[Miy95, KL00] was investigated. The Monte Carlo algorithms that were used here
are introduced in Chapter 2. The results for the local-alignment-score statistics are
discussed in Chapter 4.
Another important problem in bioinformatics, molecular biology and biophysics
is the prediction of the spatial conformation, the tertiary structure, of molecules from
primary sequences. In contrast to the tertiary structure, a secondary structure describes
the conformation on a topological level, i.e. the set of paired monomers.
Such higher order structures are important because they determine the molecule’s
function. The protein folding problem (the prediction of the three dimensional structure
from the amino acid sequence) is probably the most prominent example. Beside protein
structures, also RNA structures play an important role in living organisms. In order to
fulfill a certain biological function, the molecule’s structure is assumed to sit in a global
minimum of the free energy in the structure space for a fixed sequence. Hence, many
RNA structure prediction methods are based on free energy minimization. Fortunately,
RNA structure prediction turned out to be much simpler than protein folding, because
secondary structures without so called pseudo knots (topologically crossing pairs) de-
scribe the essential features already quite well [TB99]. Neglecting pseudo knots allows
us to perform free-energy minimization in polynomial time by dynamic programing
+(transfer matrix calculations) [dG68, NPGK78, ZS81, Zuk89, HFS 94, MSZT99].
Such algorithms are explained in Chapter 5.
Because biological RNA sequences are products of evolutionary processes they
can hardly be seen as purely random objects. They rather sit in a local minimum of
the map from the space of sequences to minimum free-energy [FHS99, CFKK05].
Statistical evidence of the non-randomness of a given sequence is often measured by
the so calledz-score. That is the distance of the minimum-free-energy value from the
4