Data structures and algorithms for analysis of alternative splicing with RNA-seq data [Elektronische Ressource] / Marcel H. Schulz
151 Pages
English


Published 01 January 2010

Excerpt

Data Structures and Algorithms
for Analysis of Alternative Splicing
with RNA-Seq Data
Marcel H. Schulz
June 2010
Dissertation submitted for the degree of
Doctor of Natural Sciences (Dr. rer. nat.)
at the Department of Mathematics and Computer Science
of Freie Universität Berlin
Reviewers:
Prof. Dr. Martin Vingron
Prof. Dr. Jens Stoye
1. Referee: Prof. Dr. Martin Vingron
2. Prof. Dr. Jens Stoye
Date of doctoral defense: 26 August 2010
Preface
The first part of the thesis in Chapter 3 was published in the journal Science as part
of a collaborative study that addressed the first application of RNA-Seq experiments
to two human cell lines in 2008 [147]. The methods for prediction and quantification
of alternative isoforms and their application, presented in Chapters 3 and 4, appeared
in the journal Nucleic Acids Research in 2010 [129]. The last part, about the de novo
transcriptome assembly method Oases, has not been published yet, but a manuscript
is in preparation. The successful application of Oases to human, fly, and worm
RNA-Seq data will be published in the report about the RGASP competition this year.
My contributions to these papers were the design of statistical methods and the
analysis of alternative splicing with junction reads for the Science paper. I designed,
implemented, and analyzed the CASI and DASI methods. I was involved in the
conception and analysis of the POEM method, the analysis of the exon array data, as well
as primer design for the PCR experiments and their evaluation. I implemented a
preliminary R version of the initial steps of the transcriptome assembler, addressing
loci and trivial transcript reconstruction. I developed the theory in Section 5.1 and
performed the analysis with Oases in Sections 5.3 and 5.4. I was involved in the algorithm
design in Section 5.3 and the error analysis of the Oases software. I did the transcriptome
assembly and parts of the downstream analysis for the RGASP submissions.
There are a number of other contributions that are unfortunately not in the thesis. I
have implemented linear time algorithms for the construction of variable order Markov
chains and the first algorithm for the score distribution computation for ontological
similarity searches, presented at the WABI conference in 2008 and 2009 [142, 141].
I also designed and implemented a linear time truncated suffix tree algorithm [140].
Further, I was involved as a co-author in projects about clinical diagnostics with
ontologies [81], basepair-precise breakpoint detection of human structural variations
in resequencing data [27, 170], the influence of highly conserved sequence elements
on gene expression [132, 54], and algorithms for frequency pattern mining [164].
Acknowledgements
I am grateful to my supervisor Martin Vingron for his ideas,
support, initiation of collaborations, and especially for his suggestion to start working
with de Bruijn graphs. He created a wonderful atmosphere in his group that
I have enjoyed throughout the years. I also want to thank him and the International
Max Planck Research School for Computational Biology and Scientific Computing
for funding my PhD time and my stay in Cambridge. I would also like to thank Hugues
Richard, who supervised me for the project about statistical methods with RNA-Seq
data (Chapters 3 and 4). My knowledge about statistics, R, and many other important
subjects in life has grown considerably due to his influence. I acknowledge his writing
of the functions for the POEM algorithm, help with its design and analysis, primer
design, exon array analysis, and application of POEM on the RGASP data.
I want to thank Daniel R. Zerbino for sharing his expertise about assembly algorithms,
which has been an important contribution to the project. In particular, I acknowledge
his conception for the treatment of cycles, adaptations for the transitive reduction
algorithm of Myers, and running ABySS. I am very grateful to Daniel that he agreed
to implement the ideas described in Section 5.2 in the Oases software, which made
it possible to have results in time for the RGASP competition. I further want to
thank Ewan Birney for financing my stay in Cambridge, hosting me at the EBI, and
inspiring and motivating discussions about transcriptome assembly.
I want to thank Marc Sultan for conducting the RNA-Seq, PCR, and exon array
experiments, help with primer design, and many interesting discussions about wet
lab biology. Additionally, I thank Marie-Laure Yaspo for financing the experiments,
as well as Hans Lehrach, Asja Nürnberger, Sabine Schrinner, Daniela Balzereit, and
Emilie Dagand, who conducted or supervised parts of the RNA-Seq, PCR, and exon array
experiments for Chapters 3 and 4. Further, I would like to thank Stefan Haas for
several discussions about alternative splicing and help with EST analysis, paper writing,
and primer design. I further thank David Weese for help with setting up a pipeline for
the RGASP competition and other interesting projects we did together. Thanks to
Axel Rasche for sharing his knowledge about exon array analysis and preprocessing
the human exon array data. Thanks to Andreas Klingenhoff, Alon Magen, Dmitri
Parkhomchuk, Matthias Scherf, and Martin Seifert for preprocessing and
preparation of data for the Science paper. Thanks to Cole Trapnell, Ali Mortazavi, and Dian
Trout for sharing the mouse C2C12 data.
I want to thank Knut Reinert and Peter N. Robinson for constant support and
guidance. Thanks to Jens Stoye for reviewing my thesis and to Roland Krause for joining my
"Verteidigungskommission" (defense committee) on very short notice. I would like to thank Hannes Luz
for support throughout my PhD and especially in the last minutes, as well as Kirsten
Kelleher for support. Also thanks to Anne-Kathrin Emde, Stefan Haas, Marta Luksza,
Alena Mysickova, Hugues Richard, Christian Rödelsperger, Marc Sultan, Ewa
Szczurek, David Weese, and Daniel R. Zerbino for great comments from
proofreading. In addition, I would like to thank Sebastian Bauer, Sebastian Köhler, Jonathan
Göke, Tobias Rausch, Christian Rödelsperger, Stefan Roepcke, Silke Stahlberg,
and all so far unmentioned members of the Vingron department for help, parties, and
other interesting projects I was involved in. I want to thank Markus Bauer, Florian
Markowetz, Ole Schulz-Trieglaff, as well as the EBI pre- and postdocs, for sharing
many private parties and taking care of me during my stay in Cambridge.
Finally, I want to express my deepest gratitude to my parents and my girlfriend Susi.
Their long-lasting support and love are the foundation of all my achievements. I love
you.
Contents
1 Introduction
1.1 DNA, Gene Expression, and Alternative Splicing
1.2 DNA sequencing
1.3 Methods for Detection of Alternative Splicing
1.4 Thesis Organization
2 Sequences and Graphs in Computational Biology
2.1 Definitions
2.1.1 Strings
2.1.2 Graphs
2.1.3 Sensitivity and Specificity
2.2 Sequence Alignment
2.3 Data structures and Algorithms for Genome Assembly
2.3.1 Genome Assembly with the Overlap-Layout-Consensus Paradigm
2.3.2 Assembly Using the Eulerian Path Paradigm
2.3.3 The Velvet Genome Assembler
2.4 Data Structures and Algorithms for EST Assembly and Analysis
2.4.1 EST Assembly
2.4.2 Splicing Graphs
3 Prediction of Alternative Isoforms
3.1 Prediction of Alternative Splicing Events from RNA-Seq Data
3.1.1 A General Stochastic Count Model for Transcriptome Analysis
3.2 Prediction of Alternative Splicing Events with Exon Junction Read Evidence
3.2.1 Reference based Spliced Alignment of RNA-Seq Reads
3.2.2 Application to Human RNA-Seq data
3.3 Prediction of Alternative Isoforms with Exon Expression Levels
3.3.1 Alternative Exon Usage within a Condition
3.3.2 Exon Usage between two Conditions
4 Quantification of Alternative Isoforms
4.1 From Gene to Transcript Expression Levels
4.2 Quantification of Transcript Levels
4.2.1 Simulations
4.2.2 Proportion Estimation with Junction Reads
4.3 Application to Human RNA-Seq Data
4.3.1 Experimental Validation
5 De Novo Assembly of Transcripts considering Alternative Isoforms
5.1 De Novo Assembly of Transcript Sequences
5.1.1 Problem Statement
5.1.2 Transcript de Bruijn Graphs
5.1.3 Recognition of Alternative Exon Events in Transcript de Bruijn Graphs
5.2 Oases: a de novo Transcriptome Assembler Based on Transcript de Bruijn Graphs
5.2.1 Error Correction and Collapsing
5.2.2 Scaffolding of Loci
5.2.3 Recognition of Trivial Structures
5.2.4 Prediction of Full Length Transcript Sequences
5.2.5 Merged Assemblies
5.2.6 Transcript Confidence Scores
5.2.7 Prediction of Alternative Exon Events
5.3 Influence of Repeats, Domains and Paralogs
5.4 Application to Paired-End RNA-Seq data
5.4.1 Data Sets
5.4.2 Influence of Parameter k
5.4.3 Comparison with ABySS
5.4.4 with Cufflinks
6 Discussion
Bibliography
Notation and Definitions
Zusammenfassung (Summary in German)
Summary
Software Availability
Appendix
Ehrenwörtliche Erklärung (Declaration of Honor)