Learning with sparsity-inducing norms

Francis Bach

INRIA - Ecole Normale Supérieure

MLSS 2008 - Ile de Ré, 2008

Supervised learning and regularization

• Data: x_i ∈ X, y_i ∈ Y, i = 1,...,n

• Minimize with respect to function f ∈ F:

    Σ_{i=1}^n ℓ(y_i, f(x_i)) + (λ/2) ‖f‖²

Error on data + Regularization

Loss & function space ? Norm ?

• Two issues:

– Loss

– Function space / norm
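
A minimal numerical sketch (not from the slides) of the objective above, under the assumptions that f is linear, f(x) = w⊤x, the loss is the quadratic cost, and the norm is the Euclidean norm; with these choices the minimizer has the familiar ridge-regression closed form. The function name ridge_fit and the data are illustrative.

    # Sketch: minimize sum_i 1/2 (y_i - w'x_i)^2 + (lam/2) ||w||_2^2.
    # Setting the gradient to zero gives (X'X + lam*I) w = X'y (ridge regression).
    import numpy as np

    def ridge_fit(X, y, lam):
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 5))
    w_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
    y = X @ w_true + 0.1 * rng.standard_normal(50)
    print(ridge_fit(X, y, lam=1.0))   # close to w_true for small noise and small lam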

Usual losses [SS01, STC04]

• Regression: y ∈ R, prediction ŷ = f(x)

– quadratic cost ℓ(y, f(x)) = (1/2) (y − f(x))²

• Classification: y ∈ {−1, 1}, prediction ŷ = sign(f(x))

– loss of the form ℓ(y, f(x)) = ℓ(y f(x))

– “True” cost: ℓ(y f(x)) = 1_{y f(x) < 0}

– Usual convex costs:

  [Figure: 0−1, hinge, square, and logistic losses plotted as functions of y f(x)]
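
A short sketch of the losses appearing in the figure, written as functions of the margin m = y f(x); the exact normalization of each curve in the original plot is assumed, not taken from the slides.

    # The four costs from the figure legend, as functions of the margin m = y*f(x).
    import numpy as np

    def zero_one(m):  return (m < 0).astype(float)       # "true" 0-1 cost
    def hinge(m):     return np.maximum(0.0, 1.0 - m)    # hinge (SVM) loss
    def square(m):    return (1.0 - m) ** 2              # (y - f(x))^2 = (1 - m)^2 since y^2 = 1
    def logistic(m):  return np.log1p(np.exp(-m))        # logistic loss

    m = np.linspace(-3, 4, 8)
    for loss in (zero_one, hinge, square, logistic):
        print(loss.__name__, np.round(loss(m), 2))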

Regularizations

• Main goal: control the “capacity” of the learning problem

• Two main lines of work

1. Use Hilbertian (RKHS) norms

– Non parametric supervised learning and kernel methods

– Well developed theory [SS01, STC04, Wah90]

2. Use “sparsity inducing” norms

– main example: ℓ_1-norm ‖w‖_1 = Σ_{i=1}^p |w_i|

– Perform model selection as well as regularization

– Often used heuristically

• Goal of the course: Understand how and when to use sparsity-inducing norms

Why ℓ_1-norms lead to sparsity?

• Example 1: quadratic problem in 1D, i.e. min_{x∈R} (1/2) x² − xy + λ|x|

• Piecewise quadratic function with a kink at zero

– Derivative at 0+: g_+ = λ − y, and at 0−: g_− = −λ − y

– x = 0 is the solution iff g_+ ≥ 0 and g_− ≤ 0 (i.e., |y| ≤ λ)

– x ≥ 0 is the solution iff g_+ ≤ 0 (i.e., y ≥ λ) ⇒ x* = y − λ

– x ≤ 0 is the solution iff g_− ≥ 0 (i.e., y ≤ −λ) ⇒ x* = y + λ

• Solution: x* = sign(y) (|y| − λ)_+ = soft thresholding
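
A minimal sketch of the closed form derived above; soft_threshold is an illustrative name, and the brute-force grid search is only a numerical sanity check of the formula for a few values of y.

    # Soft thresholding: the minimizer of 1/2 x^2 - x*y + lam*|x| over x in R.
    import numpy as np

    def soft_threshold(y, lam):
        return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

    lam = 0.5
    for y in (-2.0, -0.3, 0.0, 0.4, 1.7):
        grid = np.linspace(-5.0, 5.0, 20001)
        objective = 0.5 * grid**2 - grid * y + lam * np.abs(grid)
        assert abs(grid[np.argmin(objective)] - soft_threshold(y, lam)) < 1e-3
        print(y, "->", soft_threshold(y, lam))   # values with |y| <= lam are mapped exactly to 0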

Why ℓ_1-norms lead to sparsity?

• Example 2: isotropic quadratic problem

• min_{x∈R^p} Σ_{i=1}^p (1/2) x_i² − Σ_{i=1}^p x_i y_i + λ ‖x‖_1 = min_{x∈R^p} (1/2) x⊤x − x⊤y + λ ‖x‖_1

• solution: x_i* = sign(y_i) (|y_i| − λ)_+

• decoupled soft thresholding
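
Since the problem decouples across coordinates, the scalar operator from the previous slide can simply be applied componentwise; names and numerical values below are illustrative.

    # Decoupled soft thresholding: x_i* = sign(y_i)(|y_i| - lam)_+ for each coordinate.
    import numpy as np

    def soft_threshold(y, lam):
        return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

    y = np.array([3.0, -0.2, 0.7, -1.5, 0.05])
    print(soft_threshold(y, lam=0.5))   # coordinates with |y_i| <= 0.5 become exactly 0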

Why ℓ_1-norms lead to sparsity?

• Example 3: general quadratic problem

– coupled soft thresholding

• Geometric interpretation

– NB: Penalizing is “equivalent” to constraining

Course Outline

1. ℓ_1-norm regularization

• Review of nonsmooth optimization problems and algorithms

• Algorithms for the Lasso (generic or dedicated)

• Examples

2. Extensions

• Group Lasso and multiple kernel learning (MKL) + case study

• Sparse methods for matrices

• Sparse PCA

3. Theory - Consistency of pattern selection

• Low and high dimensional setting

• Links with compressed sensing

ℓ_1-norm regularization

• Data: covariates x_i ∈ R^p, responses y_i ∈ Y, i = 1,...,n, given in vector y ∈ R^n and matrix X ∈ R^{n×p}

• Minimize with respect to loadings/weights w ∈ R^p:

    Σ_{i=1}^n ℓ(y_i, w⊤x_i) + λ ‖w‖_1

Error on data + Regularization

• Including a constant term b?

• Assumptions on loss:

– convex and differentiable in the second variable

– NB: with the square loss ⇒ basis pursuit (signal processing) [CDS01], Lasso (statistics/machine learning) [Tib96]
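
A sketch (under stated assumptions, not the algorithms covered later in the course) of how soft thresholding reappears for the coupled problem: iterative soft thresholding / proximal gradient for the square-loss case min_w (1/2) ‖y − Xw‖² + λ ‖w‖_1. The step size 1/L and the fixed iteration count are illustrative choices.

    # Iterative soft thresholding (ISTA) for 1/2 ||y - Xw||^2 + lam * ||w||_1.
    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=500):
        L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the smooth part's gradient
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = X.T @ (X @ w - y)             # gradient of 1/2 ||y - Xw||^2
            w = soft_threshold(w - grad / L, lam / L)
        return w

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    w_true = np.zeros(20)
    w_true[:3] = [2.0, -1.0, 0.5]
    y = X @ w_true + 0.01 * rng.standard_normal(100)
    print(np.round(lasso_ista(X, y, lam=5.0), 2))   # most coefficients come out exactly 0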

A review of nonsmooth convex analysis and optimization

• Analysis: optimality conditions

• Optimization: algorithms

– First order methods

– Second order methods

• Books: Boyd & Vandenberghe [BV03], Bonnans et al. [BGLS03],

Nocedal & Wright [NW06], Borwein & Lewis [BL00]