174 Pages
English
Gain access to the library to view online
Learn more

Selected communications theoretic aspects in genetics [Elektronische Ressource] / Pavol Hanus

Gain access to the library to view online
Learn more
174 Pages
English

Description

¨ ¨TECHNISCHE UNIVERSITAT MUNCHENLehrstuhl fu¨r NachrichtentechnikSelected Communications Theoretic Aspectsin GeneticsPavol HanusVollst¨andiger Abdruck der von der Fakult¨at fu¨r Elektrotechnik und Informationstechnikder Technischen Universit¨at Mu¨nchen zur Erlangung des akademischen Grades einesDoktor–Ingenieursgenehmigten Dissertation.Vorsitzender: Univ.–Prof. Dr.–Ing. Jo¨rg Ebersp¨acherPru¨fer: 1. Univ.–Prof. Dr.–Ing. Dr.–Ing. E.h. Joachim Hagenauer2. Univ.–Prof. Dr.–Ing. Martin Bossert, Universit¨at UlmDie Dissertation wurde am 21.04.2010 bei der Technischen Universit¨at Mu¨nchen einge-reicht und durch die Fakult¨at fu¨r Elektrotechnik und Informationstechnik am 13.07.2010angenommen.PrefaceThis thesis was written during my time as a research assistant at the Institute for Com-munications Engineering (LNT) at the Technische Universit¨at Mu¨nchen (TUM). Manypeople, whom I am very grateful to, have contributed in different ways to the success ofthis work.Inthe firstplace, Iwould like tothankmy supervisor Prof.Dr.-Ing.Dr.-Ing.E.h.JoachimHagenauer forhisguidance andsupport. Hegave me the opportunitytoconduct researchin the novel interdisciplinary field of information and communication theory in molecularbiology. The work at his institute was a very fruitful and enriching experience. Theinterdisciplinary and open nature of my research gave me the opportunity to be creativeand learn a lot.

Subjects

Informations

Published by
Published 01 January 2010
Reads 24
Language English
Document size 6 MB

Exrait

TECHNISCHEUNIVERSITA¨TMU¨NCHEN
Lehrstuhlfu¨rNachrichtentechnik

SelectedCommunicationsTheoreticAspects
inGenetics

PavolHanus

Vollsta¨ndigerAbdruckdervonderFakulta¨tfu¨rElektrotechnikundInformationstechnik
derTechnischenUniversita¨tMu¨nchenzurErlangungdesakademischenGradeseines

genehmigtenDissertation.

Doktor–Ingenieurs

Vorsitzender:Univ.–Prof.Dr.–Ing.Jo¨rgEberspa¨cher
Pru¨fer:1.Univ.–Prof.Dr.–Ing.Dr.–Ing.E.h.JoachimHagenauer
2.Univ.–Prof.Dr.–Ing.MartinBossert,Universita¨tUlm

DieDissertationwurdeam21.04.2010beiderTechnischenUniversita¨tMu¨ncheneinge-
reichtunddurchdieFakulta¨tfu¨rElektrotechnikundInformationstechnikam13.07.2010
angenommen.

-

oT

ym

parents

-

iv

Contents

3.3.1DNADamageRepairandErrorCorrection..............22
3.3.2DNAMismatchRepairandErrorCorrection.............22
3.4MolecularBiology-GeneticInformationProcessing.............23
3.4.1Transcription..............................24
3.4.2Splicing.................................25
3.4.3Translation...............................26
3.5GeneticInformationProcessingandEngineering...............27
3.5.1TranscriptionandErrorCorrection..................27
3.5.2TranslationandErrorCoding.....................27
3.5.3GeneExpressionandMarkerSynchronization............28
3.6EvolutionaryGenetics-GeneticInformationTransmission.........28
3.6.1GenomeEvolution...........................29
3.6.2TypesofMutations...........................29
3.7GeneticInformationTransmissionandEngineering.............32
3.7.1SubstitutionChannel-ContinuousTimeMarkovProcess......32
3.7.2NucleotideEvolutionisSIMOTransmission.............36
3.7.3EstimationontheTree-FelsensteinAlgorithm...........37
3.7.4SubstitutionSpaceTimeModel....................38
3.7.5SummaryoftheNucleotideEvolutionModel.............39
3.8Summary....................................39

ISourceCodingAspectsinGenetics

14

4CompressioninEngineering43
4.1EntropyCoding.................................44
4.1.1HumanCoding............................45
4.1.2ArithmeticCoding...........................46
4.2UniversalStatisticalPredictionBasedCoding................49
4.2.1AdaptiveArithmeticCoding......................49

Contents

vii

4.2.2ContextTreeWeighting........................51
4.2.3DecompositionContextTreeWeighting................55
4.3UniversalDictionaryBasedCoding......................56
4.3.1SlidingWindowLempelZiv-LZ77..................56
4.3.2TreeStructuredLempelZiv-LZ78..................57
4.4Summary....................................58

5CompressioninGenetics59
5.1DNASequenceCompression..........................60
5.1.1DNACompressionAlgorithms.....................60
5.1.2PerformanceComparisonofDNACompressors...........62
5.2MultipleGenomeAlignment..........................63
5.2.1MotivationforMultipleGenomeAlignments.............64
5.2.2TheMultipleGenomeAlignmentProblem..............64
5.2.3PairwiseAlignment-SmithWatermanAlgorithm..........65
5.2.4MultipleSequenceAlignment(MSA).................67
5.2.5WholeGenomeAlignmentDatasets..................68
5.3MultipleSequenceAlignmentCompression..................68
5.3.1CompressionofNucleotides......................69
5.3.2CompressionofGaps..........................76
5.3.3MultipleSequenceAlignmentCompressor(MSAc)..........78
5.3.4PerformanceandComparison.....................79
5.4MultipleAlignmentFormatFilesCompression................81
5.4.1MultipleAlignmentFormat(MAF)..................81
5.4.2CompressionofMAFSupplementaryData..............81
5.4.3PerformanceandComparison.....................84
5.5Summary....................................87

iiiv

IIMarkerSynchronizationinGenetics

Contents

98

6MarkerSynchronizationinEngineering91
6.1SynchronizationModel.............................91
6.1.1ThresholdSynchronizer.........................92
6.1.2MaximizationSynchronizer......................92
6.2LikelihoodFunctionsforMarkerSynchronization..............93
6.2.1OptimalGeneralLogLikelihoodRatioFunction...........93
6.2.2LikelihoodFunctionforAWGNChannel...............96
6.2.3LikelihoodFunctionsforDiscreteMemorylessChannels.......99
6.2.4LikelihoodFunctionsforSubstitutionChannelModels.......99
6.3MarkerPerformanceandSyncwordChoice..................100
6.3.1ThresholdBasedSynchronization...................101
6.3.2MaximizationBasedSynchronization.................110
6.4Summary....................................112

7MarkerSynchronizationinGenetics115
7.1SequenceSpecicBinding...........................116
7.1.1ProteinDNABinding.........................116
7.1.2BindingMotifs.............................116
7.1.3PositionwiseIndependenceofBindingSites..............120
7.2BindingSiteInferenceandSynchronization..................121
7.2.1LikelihoodFunctionforBindingSiteInference............121
7.2.2ComparisontotheSynchronizationLikelihoodFunction......123
7.2.3VicinityExtendedBindingSiteInference...............124
7.2.4LimitationsofBindingSiteInference.................126
7.3MolecularSynchronization...........................127
7.3.1BindingSiteDetection.........................127
7.3.2ParallelstoThresholdSynchronization................128

Contents

xi

7.3.3GeneralPropertiesofMolecularMarkers...............128
7.3.4SynchronizationPerformanceofTranscriptionMarkers.......130
7.4Summary....................................133

8Conclusion

APublications

153

931

BDerivations143
B.1MatrixExponential...............................143
B.2SynchronizationSuccessProbability......................143

Nomenclature

Bibliography

541

151

Zusammenfassung

EswerdenneueAnwendungenderKommunikations-undInformationstheorieaufProb-
lemeindermolekularenBiologiebehandelt.ImerstenTeilwirdQuellencodierunggenutzt
umdieSpeicheranforderungenvongenomweitenSequenzalignmentDatensa¨tzenzuver-
ringern.EinhochezienterKompressionsalgorithmusbasierendaufstatistischenMod-
ellenderEvolutionundTechnikenausderbina¨renBildcodierungwirdvorgeschlagen.
ImzweitenTeilwerdenParallelenzwischenderMarkerSynchronisationu¨berverrauschte
Kana¨leundderProtein-DNABindungsstellensuchestudiert.StatistischeBindungsstellen
ModelleundInferenztechnikenwerdenausinformationstheoretischerSichtanalysiertund
erweitert.Synchronisationseigenschaftenvonausgewa¨hltenmolekularenMarkernwerden
evaluiertundEvidenzfu¨rSelektionsdruckzugunstenezienterMarkergefunden.

Abstract

Thisthesiscoversnovelapplicationsofconceptsfromcommunicationsengineeringtoprob-
lemsinmolecularbiology.Intherstpartthefocusisplacedonapplyingsourcecoding
techniquestoreducethestoragerequirementofmultiplegenomealignmentdatasetsused
incomparativegenomics.Ahighlyecientlosslesscompressionalgorithmusingwell
establishedmodelsofgenomeevolutionandbinaryimagecompressiontechniquesisin-
troduced.Thesecondpartstudiesparallelsbetweensequencespecicproteinbindingon
themolecularlevelandmarkersynchronization.Theengineeringconceptofthreshold
basedmarkersynchronizationovernoisychannelsisrevisedandextended.Bindingsite
modelsandinsilicoinferencetechniquesarereviewedusinginformationtheory.Synchro-
nizationpropertiesofselectedmolecularmarkersareanalysedandevidencefornatural
selectionpressuretowardsgoodmarkersisfound.

2

Chapter1Introduction

arethatforalongtimeitwasbelievedthatproteincodinggenesaremoreorlessthe
onlyfunctionalregionsinthegenomeandthefactthatonlyaverylimitedamountof
sequencedatawasavailable.Recentadvancesinhigh-throughputsequencingtechnology
havemadeitpossibletosequencewholegenomesofcomplexspecies.Thecompletion
oftherstdraftofthehumangenomein2001[LLB+01]hasrevealedthatonlyseveral
percentofitisproteincodingandthattherearelessproteincodinggenesthanexpected.
Thesequencingofothervertebrategenomescouldbecompletedsoonafterandcompar-
ativegenomicsstudieshaveuncoveredthatthereexistmanyhighlyconservedputatively
functionalregionsunderstrongevolutionarypressurenotcodingforproteins[BPM+04].
Theirfunctionisstilllargelyunknown,butithasbeensuggestedthattheymightplay
aroleingeneregulation[Mat07],whichgovernswhenandinwhatamountsparticular
genesgetexpressedintoproteins.
Therecentndingshaveraisedmanynewopenquestionsaboutthegeneticinformation
storedinthegenomes,itsfunctionality,storageandprocessing.Togetherwiththeex-
ponentialgrowthofsequencedataavailableinpublicdatabases,thishasreignitedthe
interestofinformationtheoristsandcommunicationsengineersformolecularbiology.To
nameafewcontributions:G.Battailhaspostulatedthehypothesisthaterrorcorrecting
codesareusedontheDNAsequencelevel[Bat97].J.Hagenaueretal.appliedinforma-
tiontheorytogenemappingofcomplexdiseases[HDG+04,SGD+07]andtoclassica-
tionofgeneticsequencedata[DHHM05]aswellastocomparativegenomics[DHL+08].
O.Milenkovicetal.introducedmethodsfromchannelcodingtogeneregulation[MV04].
W.Szpankowskietal.usedsourcecodingtechniquestoidentifyproteincodingre-
gions[SRS05].Theinitialresearchresultsconrmthatmodellingandanalysingtheway
naturedealswithgeneticinformationfromacommunicationsengineeringperspective
canleadtoitsbetterunderstanding.Thenewinsightsgeneratedbythisinterdisciplinary
researchmightevenhavethepotentialtohelpadvancefuturecommunicationssystems.
Furthermore,methodsfromcommunicationsengineeringcanbeusedtoreducethestorage
requirementoftheexponentiallygrowingsequencedatasets.
Thisthesiscoversnovelapplicationsofconceptsfromcommunicationsengineeringto
problemsinmolecularbiology.Ithastwoparts.InPartIthefocusisplacedonap-
plyingsourcecodingtechniquestoreducethestoragerequirementofmultiplegenome
alignmentdatasetsusedincomparativegenomics.Suchalignmentsrepresentoneofthe
largestsequencedatasetsusedinmolecularbiology.Ahighlyecientlosslesscompression
algorithmformultiplegenomealignmentsisintroduced.Ituseswellestablishedmodels
ofgenomeevolutionandtechniquesfrombinaryimagecompression.PartIIofthisthe-
sisstudiestheparallelsbetweensequencespecicbindingonthemolecularleveland
thresholdbasedmarkersynchronization.Natureusesspecicsequencepatternstomark
thetargetsitesfordierentregulatoryproteinsandtodistinguishinformationcarrying
regionse.g.genes.Theengineeringconceptofthresholdbasedmarkersynchronization
overnoisychannelsisrevisedandextended.Bindingsitemodelsandinsilicoinference
techniquesarestudiedandreviewedusinganinformationtheoreticframework.Synchro-
nizationpropertiesofmolecularmarkersareanalysedandevidenceforselectionpressure
towardsgoodmarkersisfound.Thestructureofthisthesisisasfollows.

Fundamentals

3

Chapter2introducesthefundamentalsfromstatisticsandinformationtheoryrequired
inlaterchaptersinthenotationusedthroughoutthiswork.
Chapter3providesthenecessarybackgroundonmolecularbiologyandgeneticsfornon-
biologists.Topicsfromstructuralbiology,molecularbiologyandevolutionarygeneticsare
covered.ThedierentaspectsofgeneticinformationstorageintheDNA,itsprocessing
duringgeneexpressionanditstransmissionfromgenerationtogenerationarepresented
fromtheperspectiveofacommunicationsengineer.ErrorcorrectionimplicationsofDNA
damageandmismatchrepairaretreated.Establishedstatisticalmodelsofevolutionare
explainedindetailandusedtomodeltheevolutionarymutationchannelinlaterchapters.

PItra

Chapter4dealswithsourcecodinginengineering.Entropycodingisexplainedwith
focusonarithmeticcoding.Theconceptofuniversalstatisticalpredictionbasedcoding
ispresentednext.TheContextTreeWeightingalgorithmservesasarepresentative.Itis
veryecientatadaptivelycompressingMarkovtypedependencies.Universaldictionary
basedcodingisdescribedaswell.Conceptsandtechniquesrequiredbythemultiple
alignmentcompressorproposedinChapter5ofthisthesisareexplainedinmoredetail.
Chapter5treatscompressionofDNAsequencedatasets.First,abriefoverviewofDNA
sequencecompressorsisprovided.Subsequently,anovelcompressionschemeformultiple
genomealignmentdatasetsisintroduced.Theconstructionandstatisticalregularitiesof
suchalignmentdatasetsareexplained.Theproposedalgorithmisahighlyecientlossless
predictionbasedcompressorcombiningpredictionsfromstatisticalmodelsofevolution
andbinaryimagecompression.Thealgorithmisshowntobenearlyoptimalunderthe
usedmodel.Withrespecttotheusagescenario,assumingacentralproviderofthedataset,
thealgorithmisdesignedtoachievehighcompressionatthecostofincreasedencoder
complexity.Inaddition,itisaccountedforthepossibilitytodecompressindividualblocks.
Therefore,thealgorithmcouldalsobeusedtoreducethestoragerequirementofrespective
databaseservers.

PartII

Chapter6dealswithmarkersynchronizationinengineering.Thefocusisplacedon
thresholdbasedmarkersynchronizationovernoisychannels.Atthetransmitterspeci