Sparse methods for machine learning: Theory and algorithms
Francis Bach, Guillaume Obozinski
Willow project, INRIA - École Normale Supérieure
ECML-PKDD 2010 - Tutorial

Supervised learning and regularization
Data: x_i ∈ X, y_i ∈ Y, i = 1, ..., n
Minimize with respect to a function f: X → Y:

    Σ_{i=1}^n ℓ(y_i, f(x_i)) + (λ/2) ||f||²

The first term is the error on the data (which loss ℓ, which function space?); the second is the regularization (which norm?).
Two theoretical/algorithmic issues:
1. Loss
2. Function space / norm
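With the quadratic loss and a squared Euclidean norm on a linear predictor, the objective above has a closed-form minimizer. A minimal NumPy sketch (names such as `fit_ridge` and `lam` are illustrative, not from the tutorial):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize sum_i (y_i - w.x_i)^2 + (lam/2) * ||w||^2 in closed form."""
    n, p = X.shape
    # Setting the gradient to zero gives the normal equations:
    # (2 X^T X + lam I) w = 2 X^T y
    return np.linalg.solve(2 * X.T @ X + lam * np.eye(p), 2 * X.T @ y)

# Synthetic illustration: recover a linear model from noisy data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(50)
w_hat = fit_ridge(X, y, lam=1.0)
```

With moderate λ and low noise, `w_hat` lands close to `w_true`; the ℓ2 penalty shrinks coefficients toward zero but does not set any of them exactly to zero.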
Usual losses
Regression: y ∈ ℝ, prediction ŷ = f(x), quadratic cost ℓ(y, f) = (1/2)(y − ŷ)² = (1/2)(y − f)²
Classification: y ∈ {−1, 1}, prediction ŷ = sign(f(x)), loss of the form ℓ(y, f) = φ(yf)
"True" cost: φ(yf) = 1_{yf < 0}
Usual convex costs: hinge, square, logistic
[Figure: the 0-1, hinge, square, and logistic losses plotted as functions of the margin yf]
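The losses in the figure can be written directly as functions of the margin yf (using y ∈ {−1, 1}, so the square loss (y − f)² equals (1 − yf)²). A small sketch; the function names are mine, not the tutorial's:

```python
import numpy as np

def zero_one(yf):
    """'True' cost: 1 if the margin is negative, else 0."""
    return (yf < 0).astype(float)

def hinge(yf):
    """Hinge loss: max(0, 1 - yf), used by SVMs."""
    return np.maximum(0.0, 1.0 - yf)

def square(yf):
    """Square loss on the margin: (1 - yf)^2, since y^2 = 1."""
    return (1.0 - yf) ** 2

def logistic(yf):
    """Logistic loss: log(1 + exp(-yf))."""
    return np.log1p(np.exp(-yf))

margins = np.array([-1.0, 0.0, 1.0, 2.0])
losses = {f.__name__: f(margins) for f in (zero_one, hinge, square, logistic)}
```

All three convex costs upper-bound the 0-1 cost (after suitable scaling) and are zero or small for large positive margins, which is what makes them usable surrogates.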
Regularizations
Main goal: avoid overfitting
Two main lines of work:
1. Euclidean and Hilbertian norms (i.e., ℓ2-norms)
   - Possibility of non-linear predictors
   - Non-parametric supervised learning and kernel methods
   - Well-developed theory and algorithms (see, e.g., Wahba, 1990; Schölkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004)
2. Sparsity-inducing norms
   - Usually restricted to linear predictors on vectors: f(x) = w⊤x
   - Main example: ℓ1-norm ||w||_1 = Σ_{i=1}^p |w_i|
   - Performs model selection as well as regularization
   - Theory and algorithms "in the making"
ℓ2 vs. ℓ1 - Gaussian hare vs. Laplacian tortoise
- First-order methods (Fu, 1998; Beck and Teboulle, 2009)
- Homotopy methods (Markowitz, 1956; Efron et al., 2004)
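The first-order approach can be sketched as ISTA (iterative soft thresholding, in the spirit of Beck and Teboulle, 2009). A minimal NumPy sketch, with illustrative names, not the tutorial's reference code:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Solve min_w (1/2)||y - Xw||^2 + lam * ||w||_1 by proximal gradient."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)           # gradient of the smooth part
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Synthetic illustration: a 3-sparse ground truth in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.05 * rng.standard_normal(100)
w_hat = lasso_ista(X, y, lam=1.0)
```

Each iteration is one gradient step on the smooth quadratic part followed by soft thresholding, which sets small coordinates exactly to zero; that is the mechanism by which the ℓ1 penalty produces sparse solutions.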
Lasso - Two main recent theoretical results
1. Support recovery condition (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Yuan and Lin, 2007): the Lasso is sign-consistent if and only if there are low correlations between relevant and irrelevant variables.
2. Exponentially many irrelevant variables (Zhao and Yu, 2006; Wainwright, 2009; Bickel et al., 2009; Lounici, 2008; Meinshausen and Yu, 2008): under appropriate assumptions, consistency is possible as long as log p = O(n).
Going beyond the Lasso
- ℓ1-norm for linear feature selection in high dimensions
- Lasso usually not applicable directly:
  - Non-linearities
  - Dealing with structured sets of features
  - Sparse learning on matrices
Outline
- Sparse linear estimation with the ℓ1-norm
  - Convex optimization and algorithms
  - Theoretical results
- Groups of features
- Non-linearity: multiple kernel learning
- Sparse methods on matrices
  - Multi-task learning
  - Matrix factorization (low-rank, sparse PCA, dictionary learning)
- Structured sparsity
  - Overlapping groups and hierarchies
Why ℓ1-norms lead to sparsity?
Example: quadratic problem in 1D, i.e.

    min_{x ∈ ℝ} (1/2)x² − xy + λ|x|

Piecewise quadratic function with a kink at zero.
Derivative at 0+: g+ = λ − y, and at 0−: g− = −λ − y
- x = 0 is the solution iff g+ ≥ 0 and g− ≤ 0 (i.e., |y| ≤ λ)
- x ≥ 0 is the solution iff g+ ≤ 0 (i.e., y ≥ λ); then x = y − λ
- x ≤ 0 is the solution iff g− ≥ 0 (i.e., y ≤ −λ); then x = y + λ
Solution: x* = sign(y)(|y| − λ)+ = soft thresholding
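The closed form can be checked numerically: soft thresholding matches a brute-force minimization of the 1-D objective on a fine grid. A small sketch (names are illustrative):

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form minimizer from the slide: sign(y) * (|y| - lam)_+."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def minimize_1d(y, lam):
    """Brute-force minimizer of (1/2)x^2 - xy + lam|x| on a fine grid."""
    xs = np.linspace(-5.0, 5.0, 200001)
    obj = 0.5 * xs**2 - xs * y + lam * np.abs(xs)
    return xs[np.argmin(obj)]

# The two agree for y on either side of the threshold lam = 1.
for y in [-2.0, -0.5, 0.0, 0.7, 3.0]:
    assert abs(minimize_1d(y, lam=1.0) - soft_threshold(y, 1.0)) < 1e-3
```

Note the flat region: for any |y| ≤ λ the minimizer is exactly 0, which is the 1-D picture behind the sparsity of ℓ1-regularized solutions.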