149 Pages
English

Applicability domain of QSAR models [Electronic resource] / Iurii Sushko


Published 01 January 2011

TECHNISCHE UNIVERSITÄT MÜNCHEN
Lehrstuhl für Genomorientierte Bioinformatik
Applicability domain of QSAR models
Iurii Sushko
Complete reprint of the dissertation approved by the Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung
und Umwelt of the Technische Universität München for the award of the academic degree of
Doctor of Natural Sciences (Doktor der Naturwissenschaften).
Chair: Univ.-Prof. Dr. Langosch
Examiners of the dissertation:
1. Univ.-Prof. Dr. H.-W. Mewes
2. Univ.-Prof. Dr. K. Suhre
(Ludwig-Maximilians-Universität München)
The dissertation was submitted to the Technische Universität München on 24.11.2010 and accepted by the
Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt on 17.02.2011.

Acknowledgements
I would like to express my gratitude to my research colleagues, who were always
willing to help me with advice and gave me the opportunity to work in a friendly and
creative atmosphere: Anil Pandey, Robert Körner, Sergii Novotarskyi, Matthias Rupp,
Simona Kovarich, Stefan Brandmaier, Wolfram Teetz, Eva Schlosser, Vlad Kholodovych and
Ahmed Abdelaziz. I thank Benoit Mathieu for his advice.
I would like to thank my thesis advisor, Dr. Igor Tetko, for his help and creativity,
for introducing me to the scientific way of thinking and for his ideas, which have
contributed significantly to my thesis work.
I am very grateful to my supervisor, Prof. Hans-Werner Mewes, for giving me the
opportunity to work on my thesis at the Institute of Bioinformatics and Systems Biology and for
supporting my research. I am also very grateful to Prof. Karsten Suhre for his interest in my
work.
I would like to thank my family back in Ukraine, my mother Valentyna Sushko, my
father Alexander Sushko and my brother Ievgenii Sushko, for supporting me.
Iurii Sushko

Abstract
In recent decades, computational models have gained popularity for predictions of
biological activities and physicochemical properties. This new and rapidly developing field
of research is referred to as QSAR/QSPR (Quantitative Structure-Activity/Property
Relationship) and is especially applicable in drug design and in environmental risk
assessment (ecotoxicology), where screening of large datasets of compounds is required.
The major limitation of computational models is the questionable reliability of their
predictions. Computational models are not guaranteed to give equally accurate predictions across
the whole chemical space; in other words, they have a limited domain of
applicability. At present, the lack of a proper definition of the applicability domain (AD) of
a model is one of the major issues restraining the practical application of computational
models. The problem of AD assessment is addressed in this work.
The work introduces a methodology for AD assessment and presents a
comprehensive benchmarking analysis of existing and new approaches. Practical AD
assessment is demonstrated in a number of studies on the prediction of such properties as
mutagenicity (Ames test), toxicity (growth inhibition concentration), lipophilicity and
cytochrome inhibition. It is shown that the AD approaches make it possible to estimate the prediction
accuracy for every compound individually and, thereby, to identify highly accurate
predictions with an accuracy close to that of experimental measurements. All the introduced
AD methods are implemented as part of a new platform for chemical modeling (OCHEM)
and are publicly available online at http://ochem.eu.

Table of Contents
1 Introduction
1.1 Motivation
1.2 Thesis roadmap
2 Methodology
2.1 QSAR research
2.1.1 Overview
2.1.2 Molecular descriptors
2.1.3 Machine learning methods
2.1.4 Meta-learning techniques
A. Model ensembles and bagging
B. LIBRARY model correction
2.1.5 Validation of models
2.1.6 Prediction accuracy
A. Regression models
B. Classification models
2.1.7 Detection of statistical significancy
2.1.8 Representation of molecules
2.2 Applicability domain of QSAR models
2.2.1 Basic definitions
2.2.2 Distances to models
A. Leverage
B. Standard deviation of the ensemble predictions (STD)
C. Tanimoto similarity
D. Correlation of prediction vectors (CORREL)
E. Rounding effect (CLASS-LAG)
F. Concordance of a classification ensemble
G. Rounding effect and standard deviation combined (STD-PROB)
H. Descriptor-based and property-based DMs
2.2.3 Analysis of prediction accuracy
A. Accuracy averaging
B. Estimation of prediction accuracy
2.2.4 Comparison of applicability domains
A. Discriminative power of DM
B. Fitness of probability distribution
2.2.5 Interpretation of applicability domains
2.3 Analyzed datasets
2.3.1 Datasets of experimental measurements
A. Ames test dataset
B. T. pyriformis toxicity dataset
C. Platinum complexes lipophilicity dataset
D. CYP450 inhibitors dataset
2.3.2 Datasets of chemical compounds
A. Enamine dataset
B. EINECS dataset
C. HPV dataset
2.4 Summary
3 Online chemical modeling environment – OCHEM
3.1 Motivation
3.2 The database of experimental measurements
3.2.1 Structure overview
3.2.2 Sources of information
3.2.3 Data access and management
3.3 Modeling framework
3.3.1 Overview
3.3.2 Calculation of models
3.3.3 Descriptors
3.3.4 Conditions of experiments
3.3.5 Configuration of the machine learning methods
3.3.6 Model calculation
3.3.7 Distributed calculations
3.3.8 Analysis and management of models
3.3.9 Application of models
3.3.10 Applicability domain assessment
A. DMs and accuracy averaging
B. Estimation of the prediction accuracy
3.4 Implementation aspects
3.5 Summary and outlook
4 Benchmarking studies
4.1 Prediction of Ames mutagenicity
4.1.1 Ames test and mutagenicity
4.1.2 Methods and datasets
A. QSAR approaches
B. Applicability domain assessment
C. Benchmarking criteria
4.1.3 Results and analysis
A. Comparison of distances to model
B. Analysis of the qualitative AD measures
C. Ability to estimate the prediction accuracy
D. Interpretation of the AD
E. Data variability analysis
F. Reliability of predictions vs. variability of experimental measurements
G. Reliable predictions for ENAMINE, EINECS and HPV databases
4.1.4 Summary
4.2 Toxicity against T. Pyriformis
4.2.1 Introduction
4.2.2 Methods
A. QSAR approaches
B. Applicability domain assessment
C. Benchmarking criteria
4.2.3 Results
A. Analysis of individual models
B. Comparison of distances to models
C. Ability to estimate the prediction accuracy
D. Interpretation of the AD
E. Reliable predictions for HPV, EINECS and ENAMINE databases
4.2.4 Summary
5 Applications
5.1 Lipophilicity of Pt complexes
5.1.1 Introduction
5.1.2 Methods
A. Dataset and the variability of measurements
B. QSAR approaches and AD assessment
5.1.3 Results
A. Comparison of the QSAR approaches
B. Assessment of prediction accuracy and applicability domain
C. Interpretation of the AD
5.1.4 Summary
5.2 Cytochrome P450 inhibition
5.2.1 Introduction and methods
5.2.2 Results
A. QSAR modeling
B. AD assessment
C. Interpretation of the AD
D. Reliable predictions for HPV, EINECS and ENAMINE datasets
5.2.3 Summary
6 Discussion
A. Prediction accuracy of QSARs is variable
B. Ensembles of models improve AD assessment
C. Property-based DMs instead of descriptor-based DMs
D. Distances to models are universal
E. Which compounds are well predicted?
F. Accuracy of experimental measurements is achievable with QSARs
G. More diverse measurements for better models
7 Conclusions and outlook
List of abbreviations
Alphabetical Index
List of Figures
List of Tables
References
Appendix
Curriculum vitae
Publication record