223 Pages
English

Similarity Search in Medical Data [Elektronische Ressource] / Katrin Haegler. Betreuer: Christian Böhm

-

Gain access to the library to view online
Learn more

Description

Similarity Search in MedicalDataKatrin HaeglerMunchen 2011iiSimilarity Search in MedicalDataDissertation im Fach Informatikan der Fakultat fur Mathematik, Informatik und Statistikder Ludwig-Maximilians-Universitat Munc henvorgelegt vonKatrin HaeglerErstgutachter: Prof. Dr. Christian BohmZweitgutachter: Prof. Dr. Martin WiesmannTag der Abgabe: 23. Mai 2011Tag der mund lichen Prufung: 08. November 2011Munchen, den 23. Mai 2011iiContentsContents iiiAbstract viiiGerman Abstract xAcknowledgements xiii1 Introduction 11.1 Outline Of The Thesis . . . . . . . . . . . . . . . . . . . . . . 42 Algorithmic Fundamentals 92.1 Information Theoretic Measures . . . . . . . . . . . . . . . . . 92.1.1 Akaike Information Criterion . . . . . . . . . . . . . . 102.1.2 Bayesian Criterion . . . . . . . . . . . . . 112.1.3 Minimum Description Length . . . . . . . . . . . . . . 112.2 Uncertain Data Mining . . . . . . . . . . . . . . . . . . . . . . 122.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.3.1 Partitioning Clustering . . . . . . . . . . . . . . . . . . 152.3.2 Hierarchical . . . . . . . . . . . . . . . . . . 202.3.3 Density-based Clustering . . . . . . . . . . . . . . . . . 222.3.4 Parameter-free . . . . . . . . . . . . . . . . 232.3.5 Quality Measures . . . . . . . . . . . . . . . . . . . . . 252.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 26iii2.4 Outlier Detection . . . . . . . . . . . . . . .

Subjects

Informations

Published by
Published 01 January 2011
Reads 9
Language English
Document size 13 MB

Similarity Search in Medical
Data
Katrin Haegler
Munchen 2011iiSimilarity Search in Medical
Data
Dissertation im Fach Informatik
an der Fakultat fur Mathematik, Informatik und Statistik
der Ludwig-Maximilians-Universitat Munc hen
vorgelegt von
Katrin Haegler
Erstgutachter: Prof. Dr. Christian Bohm
Zweitgutachter: Prof. Dr. Martin Wiesmann
Tag der Abgabe: 23. Mai 2011
Tag der mund lichen Prufung: 08. November 2011Munchen, den 23. Mai 2011
iiContents
Contents iii
Abstract viii
German Abstract x
Acknowledgements xiii
1 Introduction 1
1.1 Outline Of The Thesis . . . . . . . . . . . . . . . . . . . . . . 4
2 Algorithmic Fundamentals 9
2.1 Information Theoretic Measures . . . . . . . . . . . . . . . . . 9
2.1.1 Akaike Information Criterion . . . . . . . . . . . . . . 10
2.1.2 Bayesian Criterion . . . . . . . . . . . . . 11
2.1.3 Minimum Description Length . . . . . . . . . . . . . . 11
2.2 Uncertain Data Mining . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Partitioning Clustering . . . . . . . . . . . . . . . . . . 15
2.3.2 Hierarchical . . . . . . . . . . . . . . . . . . 20
2.3.3 Density-based Clustering . . . . . . . . . . . . . . . . . 22
2.3.4 Parameter-free . . . . . . . . . . . . . . . . 23
2.3.5 Quality Measures . . . . . . . . . . . . . . . . . . . . . 25
2.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 26
iii2.4 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 Depth-based Approach . . . . . . . . . . . . . . . . . . 28
2.4.2 Distance-based Approach . . . . . . . . . . . . . . . . . 29
2.4.3 Density-based Approach . . . . . . . . . . . . . . . . . 30
2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Classi cation . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 Bayes Classi er . . . . . . . . . . . . . . . . . . . . . . 35
2.5.2 Nearest Neighbor Classi er . . . . . . . . . . . . . . . . 37
2.5.3 Decision Tree Classi er . . . . . . . . . . . . . . . . . . 39
2.5.4 Support Vector Machine Classi er . . . . . . . . . . . . 41
2.5.5 Cross Validation . . . . . . . . . . . . . . . . . . . . . 42
2.5.6 Quality Measures . . . . . . . . . . . . . . . . . . . . . 43
2.5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 Parameter-free Outlier Detection Using Data Compression 47
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.1 Depth-based Outlier Detection . . . . . . . . . . . . . . 50
3.2.2 Distance-based Outlier . . . . . . . . . . . . 51
3.2.3 Density-based Outlier Detection . . . . . . . . . . . . . 51
3.2.3.1 Local Outlier Factor . . . . . . . . . . . . . . 52
3.2.3.2 Local Correlation Integral . . . . . . . . . . . 52
3.2.4 Data Mining And MDL . . . . . . . . . . . . . . . . . 53
3.3 Outlier Detection Via Minimum Description Length . . . . . . 55
3.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.2 Independent Component Analysis . . . . . . . . . . . . 58
3.3.2.1 Data Preprocessing . . . . . . . . . . . . . . . 59
3.3.2.2 Identi cation Of Independent Components . . 61
3.3.3 Generalized Normal Distribution . . . . . . . . . . . . 63
3.3.4 ICA With GND Linkage . . . . . . . . . . . . . . . . . 64
3.3.5 GND Parameter Estimation . . . . . . . . . . . . . . . 65
iv3.3.5.1 Scale Parameter Estimation . . . . . . . . . . 66
3.3.5.2 Location Parameter Estimation . . . . . . . . 66
3.3.5.3 Shape Parameter Estimation . . . . . . . . . 67
3.3.6 Coding Cost Determination . . . . . . . . . . . . . . . 67
3.3.7 Outlier Score And Outlier Detection . . . . . . . . . . 68
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 71
3.4.2 Outlier Detection Results . . . . . . . . . . . . . . . . 71
3.4.2.1 Outlier Detection Using MDL . . . . . . . . . 72
3.4.2.2 Local Outlier Factor . . . . . . . . . . . . . . 73
3.4.2.3 Local Correlation Integral . . . . . . . . . . . 75
3.4.3 Outlier Score Visualization . . . . . . . . . . . . . . . . 75
3.4.4 Experimental Data . . . . . . . . . . . . . . . . . . . . 76
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4 Similarity Search In Uncertain Data 81
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4 Searching Uncertain Data Using GMMs . . . . . . . . . . . . . 88
4.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.2 Non-axis Parallel Gaussian Mixture Model . . . . . . . 93
4.4.3 Parallel GMM Approximation . . . . . . . . . 96
4.4.4 Clustering Of Non-axis Parallel Gaussians . . . . . . . 102
4.4.4.1 Rotation Angles By Givens Rotations . . . . 104
4.4.4.2 Angle Cluster Representative . . . . 107
4.4.4.3 Coordinate System Update . . . . . . . . . . 111
4.4.5 Object Identi cation . . . . . . . . . . . . . . . . . . . 112
4.4.5.1 Object Identi cation Using nGMMs . . . . . 112
4.4.5.2 Acceleration By Rotation And Approximation 115
4.4.5.3 k-Most Likely Identi cation Query . . . . . . 116
v4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.5.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 120
4.5.1.1 Data Property Benchmarks . . . . . . . . . . 121
4.5.1.2 Performance . . . . . . . . . . . . . . . . . . 124
4.5.2 Real World Data . . . . . . . . . . . . . . . . . . . . . 126
4.5.2.1 Biometric Identi cation . . . . . . . . . . . . 126
4.5.2.2 Meteorologic Data . . . . . . . . . . . . . . . 128
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5 Similarity Search Based Glioma Grading 135
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.2 Grading Of Brain Tumors . . . . . . . . . . . . . . . . . . . . 137
5.2.1 Glioma Grading Based On Similarity Search . . . . . . 137
5.2.1.1 Motion Correction . . . . . . . . . . . . . . . 139
5.2.1.2 Perfusion Map Generation . . . . . . . . . . . 140
5.2.1.3 Image Standardization . . . . . . . . . . . . . 146
5.2.1.4 Co-registration . . . . . . . . . . . . . . . . . 149
5.2.1.5 Tumor ROI Voxel Extraction . . . . . . . . . 149
5.2.1.6 Noise Filtering By Outlier Detection . . . . . 150
5.2.1.7 GMM Creation . . . . . . . . . . . . . . . . . 151
5.2.1.8 k-MLIQ Of Non-axis Parallel GMMs . . . . . 155
5.2.2 Comparison Methods . . . . . . . . . . . . . . . . . . . 157
5.2.2.1 Glioma Grading Based On Contrast Enhance-
ment . . . . . . . . . . . . . . . . . . . . . . . 157
5.2.2.2 Axis-parallel k-MLIQ Search . . . . . . . . . 157
5.2.2.3 k-Nearest Neighbor Search . . . . . . . . . . . 158
5.2.3 Quality Measures . . . . . . . . . . . . . . . . . . . . . 159
5.3 Glioma Grading Results . . . . . . . . . . . . . . . . . . . . . 160
5.3.1 Patient Population . . . . . . . . . . . . . . . . . . . . 160
5.3.2 MR Imaging . . . . . . . . . . . . . . . . . . . . . . . . 161
5.3.3 Approximation k-MLIQ Search . . . . . . . . . . . . . 162
vi5.3.4 Glioma Grading Based On Conventional MRI Se-
quences . . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.3.5 Axis-parallel k-MLIQ Search . . . . . . . . . . . . . . . 164
5.3.6 k-Nearest Neighbor Search . . . . . . . . . . . . . . . . 164
5.3.7 ROC Plot Results . . . . . . . . . . . . . . . . . . . . . 164
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6 Conclusion 169
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.1.1 Introduction And Algorithmic Fundamentals . . . . . . 169
6.1.2 Parameter-free Outlier Detection Using Data Compres-
sion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.1.3 Similarity Search In Uncertain Data . . . . . . . . . . . 171
6.1.4y Search Based Glioma Grading . . . . . . . . 171
List of Figures 173
List of Tables 185
List of Algorithms 187
List of Abbreviations 189
Bibliography 193
viiAbstract
The ongoing automation in our modern information society leads to a tremen-
dous rise in the amount as well as complexity of collected data. In medical
imaging for example the electronic availability of extensive data collected as
part of clinical trials provides a remarkable potentiality to detect new relevant
features in complex diseases like brain tumors. Using data mining applica-
tions for the analysis of the data raises several problems. One problem is
the localization of outstanding observations also called outliers in a data set.
In this work a technique for parameter-free outlier detection, which is based
on data compression and a general data model which combines the Gener-
alized Normal Distribution (GND) with independent components, to cope
with existing problems like parameter settings or implicit data distribution
assumptions, is proposed.
Another problem in many modern applications amongst others in medical
imaging is the e cient similarity search in uncertain data. At present, an
adequate therapy planning of newly detected brain tumors assumedly of glial
origin needs invasive biopsy due to the fact that prognosis and treatment,
both vary strongly for benign, low-grade, and high-grade tumors. To date
di erentiation of tumor grades is mainly based on the expertise of neuroradi-
ologists examining contrast-enhanced Magnetic Resonance Images (MRI). To
assist neuroradiologist experts during the di erentiation between tumors of
di erent malignancy we proposed a novel, e cient similarity search technique
for uncertain data. The feature vector of an object is thereby not exactly
known but is rather de ned by a Probability Density Function (PDF) like a
Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel distributions, hence, correlations between di erent features are not
considered in these similarity searches. In this work a novel, e cient simi-
larity search technique for general GMMs without independence assumption
is presented. The actual components of a GMM are approximated in a
conservative but tight way. The conservativity of the approach leads to a
viii