198 Pages
English

Statistical Issues in Machine Learning [electronic resource]: Towards Reliable Split Selection and Variable Importance Measures / submitted by: Carolin Strobl



Published 01 January 2008

Statistical Issues in Machine Learning –
Towards Reliable Split Selection and
Variable Importance Measures

Dissertation
at the Institut für Statistik
of the Fakultät für Mathematik, Informatik und Statistik
of the Ludwig-Maximilians-Universität München

Submitted by: Carolin Strobl
Munich, 26 May 2008

First reviewer: Prof. Dr. Thomas Augustin
Second reviewer: Prof. Dr. Gerhard Tutz
External reviewer: Prof. Dr. Kurt Hornik
Oral examination (Rigorosum): 2 July 2008

Abstract
Recursive partitioning methods from machine learning are widely applied in many scientific
fields such as genetics and bioinformatics. The present work is concerned with the two main
problems that arise in recursive partitioning, instability and biased variable selection, from a
statistical point of view. With respect to the first issue, instability, this work covers the entire
scope of methods, from standard classification trees through robustified classification trees to
ensemble methods such as TWIX, bagging and random forests. While ensemble methods prove to
be much more stable than single trees, they also lose most of their interpretability. Therefore an
adaptive cutpoint selection scheme is suggested with which a TWIX ensemble reduces to a single
tree if the partition is sufficiently stable. With respect to the second issue, variable selection
bias, the statistical sources of this artifact in single trees and a new form of bias inherent in
ensemble methods based on bootstrap samples are investigated. For single trees, one unbiased
split selection criterion is evaluated and another one is newly introduced here. Based on the results
for single trees and further findings on the effects of bootstrap sampling on association measures,
it is shown that, in addition to using an unbiased split selection criterion, subsampling instead of
bootstrap sampling should be employed in ensemble methods to be able to reliably compare the
variable importance scores of predictor variables of different types. The statistical properties and
the null hypothesis of a test for the random forest variable importance are critically investigated.
Finally, a new, conditional importance measure is suggested that allows for a fair comparison in
the case of correlated predictor variables and better reflects the null hypothesis of interest.

Zusammenfassung
Die Anwendung von Methoden des rekursiven Partitionierens aus dem maschinellen Lernen ist
in vielen Forschungsgebieten, wie z.B. in der Genetik und Bioinformatik, weit verbreitet. Die
vorliegende Arbeit setzt sich aus statistischer Sicht mit den zwei Hauptproblemen des rekursiven
Partitionierens, Instabilität und verzerrter Variablenselektion, auseinander. Im Hinblick auf das
erste Thema, die Instabilität, wird das gesamte Methodenspektrum von herkömmlichen Klassi-
fikationsbäumen über robustifizierte Klassifikationsbäume und Ensemble-Methoden wie TWIX,
Bagging und Random Forests in dieser Arbeit abgedeckt. Ensemble-Methoden erweisen sich im
Vergleich zu einzelnen Klassifikationsbäumen als deutlich stabiler, verlieren aber auch größtenteils
ihre Interpretierbarkeit. Deshalb wird ein adaptives Bruchpunkt-Selektionskriterium vorgeschla-
gen, mit dem ein TWIX-Ensemble auf einen einzelnen Klassifikationsbaum reduziert wird, falls
die Partition stabil genug ist. Im Hinblick auf das zweite Thema, die verzerrte Variablenselektion,
werden die statistischen Ursachen für dieses Artefakt in einzelnen Bäumen und eine neue Form
von Verzerrung, die in Ensemble-Methoden auftritt, die auf Bootstrap-Stichproben beruhen, un-
tersucht. Für einzelne Bäume wird ein unverzerrtes Selektionskriterium evaluiert und ein anderes
hier neu eingeführt. Anhand der Ergebnisse für einzelne Bäume und weiterer Untersuchungen zu
den Auswirkungen von Bootstrap-Stichprobenverfahren auf Assoziationsmaße wird gezeigt, dass
neben der Verwendung von unverzerrten Selektionskriterien Teilstichprobenverfahren anstelle
von Bootstrap-Stichprobenverfahren in Ensemble-Methoden verwendet werden sollten, um die
Variable-Importance-Werte von Prädiktorvariablen unterschiedlicher Art zuverlässig vergleichen
zu können. Die statistischen Eigenschaften und die Nullhypothese eines Tests für die Variable
Importance von Random Forests werden kritisch untersucht. Abschließend wird eine neue, be-
dingte Variable Importance vorgeschlagen, die im Fall von korrelierten Prädiktorvariablen einen
fairen Vergleich erlaubt und die interessierende Nullhypothese besser widerspiegelt.

Contents
Scope of this work
1. Introduction
1.1 Classification trees
1.1.1 Split selection and stopping rules
1.1.2 Prediction and interpretation
1.1.3 Variable selection bias and instability
1.2 Robust classification trees and ensemble methods
1.3 Characteristics and caveats
1.3.1 “Small n large p” applicability
1.3.2 Out-of-bag error estimation
1.3.3 Missing value handling
1.3.4 Randomness and stability
2. Variable selection bias in classification trees
2.1 Entropy estimation
2.1.1 Binary splitting
2.1.2 k-ary splitting
2.2 Multiple comparisons in cutpoint selection
2.3 Summary
3. Evaluation of an unbiased variable selection criterion
3.1 Optimally selected statistics
3.2 Simulation studies
3.2.1 Null case
3.2.2 Power case I
3.2.3 Power case II
3.3 Application to veterinary data
3.3.1 Variable selection ranking
3.3.2 Selected splitting variables
3.4 Summary
4. Robust and unbiased variable selection in k-ary splitting
4.1 Classification trees based on imprecise probabilities
4.1.1 Total impurity criteria
4.1.2 Split selection procedure
4.1.3 Characteristics of the total impurity criterion TU2
4.2 Empirical entropy measures in split selection
4.2.1 Estimation bias for the empirical Shannon entropy
4.2.2 Effects in classification trees based on imprecise probabilities
4.2.3 Suggested corrections based on the IDM
4.3 Simulation study
4.4 Summary
5. Adaptive cutpoint selection in TWIX ensembles
5.1 Building TWIX ensembles
5.1.1 Instability of cutpoint selection in recursive partitioning
5.1.2 Selecting extra cutpoints
5.2 A new, adaptive criterion for selecting extra cutpoints
5.2.1 Adding virtual observations
5.2.2 Recomputation of the split criterion
5.3 Behavior of the adaptive criterion
5.3.1 Application to olives data
5.3.2 Simulation study
5.4 Outlook on credal prediction and aggregation schemes
5.4.1 Credal prediction rules
5.4.2 Aggregation schemes
5.5 Summary
6. Unbiased variable importance in random forests and bagging
6.1 Random forest variable importance measures
6.2 Simulation studies
6.2.1 Null case
6.2.2 Power case
6.3 Sources of variable importance bias
6.3.1 Variable selection bias in individual classification trees
6.3.2 Effects induced by bootstrapping
6.4 Application to C-to-U conversion data
6.5 Summary
7. Statistical properties of Breiman and Cutler’s test
7.1 Investigating the current test
7.1.1 The power
7.1.2 The construction of the z-score
7.1.3 Specifying the null hypothesis
7.2 Summary
8. Conditional variable importance
8.1 Variable selection in random forests
8.1.1 Simulation design
8.1.2 Illustration of variable selection
8.2 A second look at the permutation importance
8.2.1 Background: Types of independence
8.2.2 A new, conditional permutation scheme
8.2.3 Simulation results
8.3 Application to peptide-binding data
8.4 Summary
9. Conclusion and outlook
Bibliography

Scope of this work
This work is concerned with a selection of statistical methods based on the principle of
recursive partitioning: classification and regression trees (termed classification trees in the
following for brevity, while most results apply straightforwardly to regression trees), robust
classification trees and ensemble methods based on classification trees.
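The operation shared by all of these methods is an exhaustive search for the best split of each predictor. As a rough illustration only (a minimal sketch of a CART-style Gini cutpoint search, not code from this work), the search for a numeric predictor can be written as follows; note that every midpoint between distinct values is tried, which is precisely the multiple-comparisons behavior behind the variable selection bias studied later:

```python
import numpy as np

def gini(y):
    # Gini impurity of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Exhaustive cutpoint search for one numeric predictor.

    Every midpoint between consecutive distinct values of x is a
    candidate; the cutpoint maximizing the impurity reduction wins.
    Returns (cutpoint, gain).
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    best_cut, best_gain = None, -np.inf
    for i in range(1, n):
        if x[i] == x[i - 1]:
            continue  # no cutpoint between identical values
        left, right = y[:i], y[i:]
        # impurity reduction of the parent node by this binary split
        gain = gini(y) - (i / n) * gini(left) - ((n - i) / n) * gini(right)
        if gain > best_gain:
            best_cut, best_gain = (x[i - 1] + x[i]) / 2, gain
    return best_cut, best_gain
```

Because a predictor with many distinct values offers many more candidate cutpoints than, say, a binary one, a naive criterion of this kind tends to prefer such variables even when all predictors are uninformative.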
From a practical point of view these methods have become extremely popular in many
applied sciences, including genetics and bioinformatics, epidemiology, medicine in general,
psychiatry, psychology and economics, within a short period of time – primarily because
they “work so well”. From a statistical point of view, on the other hand, recursive parti-
tioning methods are rather unusual in many respects; for example, they do not rely on any
parametric distribution assumptions.
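A characteristic example of this nonparametric spirit, and the device at the center of the later chapters, is the permutation variable importance. As an illustrative sketch (a simplified version with hypothetical names, not the author's implementation), importance is measured as the drop in accuracy after permuting one predictor column, which destroys its association with the response while leaving its marginal distribution intact:

```python
import numpy as np

def permutation_importance(predict, X, y, var, n_perm=100, seed=0):
    """Mean drop in accuracy after randomly permuting column `var`.

    `predict` is any fitted classifier (a function mapping X to
    predicted labels); no distributional assumptions are needed.
    """
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)  # accuracy on intact data
    drops = []
    for _ in range(n_perm):
        Xp = X.copy()
        Xp[:, var] = rng.permutation(Xp[:, var])  # break the association
        drops.append(baseline - np.mean(predict(Xp) == y))
    return float(np.mean(drops))
```

In random forests the accuracy drop is computed tree by tree on the out-of-bag observations and then averaged over the ensemble; the sketch above uses a single fitted predictor for brevity.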
Leo Breiman, one of the most influential researchers in this field, promoted “algorithmic
models” like classification trees and ensemble methods in the late years of his career,
after he had left academia to work as a consultant and found that current
statistical practice has “Led to irrelevant theory and questionable scientific conclusions;
Kept statisticians from using more suitable algorithmic models; Prevented statisticians
from working on exciting new problems” (Breiman, 2001b, pp. 199–200).
Today, the scientific discussion about the legitimacy of algorithmic models in statistics
continues, as illustrated by the contribution of Hand (2006) in Statistical Science with the
provocative title “Classifier Technology and the Illusion of Progress” and the multitude of
comments it triggered. Of these comments, the most consensual one may be
the reply of Jerome Friedman, another highly influential researcher in the field of statistical