Canonical Correlation

a Tutorial

Magnus Borga

January 12, 2001

Contents

1 About this tutorial
2 Introduction
3 Definition
4 Calculating canonical correlations
5 Relating topics
  5.1 The difference between CCA and ordinary correlation analysis
  5.2 Relation to mutual information
  5.3 Relation to other linear subspace methods
  5.4 Relation to SNR
    5.4.1 Equal noise energies
    5.4.2 Correlation between a signal and the corrupted signal
A Explanations
  A.1 A note on correlation and covariance matrices
  A.2 Affine transformations
  A.3 A piece of information theory
  A.4 Principal component analysis
  A.5 Partial least squares
  A.6 Multivariate linear regression

1 About this tutorial

This is a printable version of a tutorial in HTML format. The tutorial may be modified at any time, as will this version. The latest version of this tutorial is available at http://people.imt.liu.se/~magnus/cca/.

2 Introduction

Canonical correlation analysis (CCA) is a way of measuring the linear relationship between two multidimensional variables. It finds two bases, one for each variable, that are optimal with respect to correlations and, at the same time, it finds the corresponding correlations. In other words, it finds the two bases in which the correlation matrix between the variables is diagonal and the correlations on the diagonal are maximized. The dimensionality of these new bases is equal to or less than the smallest dimensionality of the two variables.

An important property of canonical correlations is that they are invariant with respect to affine transformations of the variables. This is the most important difference between CCA and ordinary correlation analysis, which depends strongly on the basis in which the variables are described.

CCA was developed by H. Hotelling [10]. Although it is a standard tool in statistical analysis, where canonical correlation has been used for example in economics, medical studies, meteorology and even in the classification of malt whisky, it is surprisingly unknown in the fields of learning and signal processing. Some exceptions are [2, 13, 5, 4, 14].

For further details and applications in signal processing, see my PhD thesis [3]

and other publications.

3 Definition

Canonical correlation analysis can be defined as the problem of finding two sets of basis vectors, one for $x$ and the other for $y$, such that the correlations between the projections of the variables onto these basis vectors are mutually maximized.

Let us look at the case where only one pair of basis vectors is sought, namely the pair corresponding to the largest canonical correlation. Consider the linear combinations $x = x^T \hat{w}_x$ and $y = y^T \hat{w}_y$ of the two variables respectively. This means that the function to be maximized is

    \rho = \frac{E[xy]}{\sqrt{E[x^2]\,E[y^2]}}
         = \frac{E[\hat{w}_x^T x y^T \hat{w}_y]}{\sqrt{E[\hat{w}_x^T x x^T \hat{w}_x]\;E[\hat{w}_y^T y y^T \hat{w}_y]}}
         = \frac{\hat{w}_x^T C_{xy} \hat{w}_y}{\sqrt{\hat{w}_x^T C_{xx} \hat{w}_x \;\hat{w}_y^T C_{yy} \hat{w}_y}}.    (1)

The maximum of $\rho$ with respect to $\hat{w}_x$ and $\hat{w}_y$ is the maximum canonical correlation. The subsequent canonical correlations are uncorrelated for different solutions, i.e.

    E[x_i x_j] = \hat{w}_{xi}^T C_{xx} \hat{w}_{xj} = 0
    E[y_i y_j] = \hat{w}_{yi}^T C_{yy} \hat{w}_{yj} = 0     for i \neq j.    (2)
    E[x_i y_j] = \hat{w}_{xi}^T C_{xy} \hat{w}_{yj} = 0

The projections onto $\hat{w}_x$ and $\hat{w}_y$, i.e. $x$ and $y$, are called canonical variates.

4 Calculating canonical correlations

Consider two random variables $x$ and $y$ with zero mean. The total covariance matrix

    C = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix}
      = E\left[ \begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}^{T} \right]    (3)

is a block matrix where $C_{xx}$ and $C_{yy}$ are the within-sets covariance matrices of $x$ and $y$ respectively, and $C_{xy} = C_{yx}^T$ is the between-sets covariance matrix.

The canonical correlations between $x$ and $y$ can be found by solving the eigenvalue equations

    C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} \hat{w}_x = \rho^2 \hat{w}_x
    C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy} \hat{w}_y = \rho^2 \hat{w}_y    (4)

where the eigenvalues $\rho^2$ are the squared canonical correlations and the eigenvectors $\hat{w}_x$ and $\hat{w}_y$ are the normalized canonical correlation basis vectors. The number of nonzero solutions to these equations is limited to the smallest dimensionality of $x$ and $y$. E.g., if the dimensionality of $x$ is 8 and that of $y$ is 5, the maximum number of canonical correlations is 5.

Only one of the eigenvalue equations needs to be solved, since the solutions are related by

    C_{xy} \hat{w}_y = \rho \lambda_x C_{xx} \hat{w}_x
    C_{yx} \hat{w}_x = \rho \lambda_y C_{yy} \hat{w}_y    (5)

where

    \lambda_x = \lambda_y^{-1} = \sqrt{ \frac{\hat{w}_y^T C_{yy} \hat{w}_y}{\hat{w}_x^T C_{xx} \hat{w}_x} }.    (6)

5 Relating topics

5.1 The difference between CCA and ordinary correlation analysis

Ordinary correlation analysis is dependent on the coordinate system in which the variables are described. This means that even if there is a very strong linear relationship between two multidimensional signals, this relationship may not be visible in an ordinary correlation analysis if one coordinate system is used, while in another coordinate system the same linear relationship would give a very high correlation.

CCA finds the coordinate system that is optimal for correlation analysis, and the eigenvectors of equation (4) define this coordinate system.

Example: Consider two normally distributed two-dimensional variables $x$ and $y$ with unit variance. Let $y_1 + y_2 = x_1 + x_2$. It is easy to confirm that the correlation matrix between $x$ and $y$ is

    R_{xy} = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}.    (7)

This indicates a relatively weak correlation of 0.5, despite the fact that there is a perfect linear relationship (in one dimension) between $x$ and $y$.

A CCA on this data shows that the largest (and only) canonical correlation is one, and it also gives the direction $(1\;\;1)^T$ in which this perfect linear relationship lies. If the variables are described in the bases given by the canonical correlation basis vectors (i.e. the eigenvectors of equation (4)), the correlation matrix between the variables is

    R_{xy} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}.    (8)
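The example above can be checked numerically by solving the eigenvalue equation (4) on sampled data. A minimal sketch, assuming NumPy; the particular construction of $y$ below (with auxiliary noise $v$) is one hypothetical way to satisfy $y_1 + y_2 = x_1 + x_2$ while keeping unit-variance, uncorrelated components:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# x: two unit-variance, uncorrelated components.
x = rng.standard_normal((n, 2))

# y: built so that y1 + y2 = x1 + x2 exactly, while each component of y
# still has unit variance (v is auxiliary noise with variance 1/2).
s = x[:, 0] + x[:, 1]
v = rng.standard_normal(n) * np.sqrt(0.5)
y = np.column_stack([s / 2 + v, s / 2 - v])

# Within- and between-sets covariance matrices (zero mean by construction).
Cxx = x.T @ x / n
Cyy = y.T @ y / n
Cxy = x.T @ y / n

# Ordinary correlation matrix between x and y: every entry is close to 0.5.
R = Cxy / np.sqrt(np.outer(np.diag(Cxx), np.diag(Cyy)))

# Eigenvalue equation (4): eigenvalues are the squared canonical correlations.
M = np.linalg.inv(Cxx) @ Cxy @ np.linalg.inv(Cyy) @ Cxy.T
vals, vecs = np.linalg.eig(M)
order = np.argsort(vals.real)[::-1]
rho = np.sqrt(np.clip(vals.real[order], 0, None))
wx = vecs[:, order[0]].real

print(np.round(R, 2))           # every entry close to 0.5
print(np.round(rho[0], 2))      # largest canonical correlation: 1.0
print(np.round(wx / wx[0], 2))  # direction proportional to (1, 1)
```

Since $y_1 + y_2 = x_1 + x_2$ holds exactly in the sample, the largest canonical correlation is exactly one even though every ordinary correlation is only 0.5.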

5.2 Relation to mutual information

There is a relation between correlation and mutual information. Since information is additive for statistically independent variables and the canonical variates are uncorrelated, the mutual information between $x$ and $y$ is the sum of the mutual information between the variates $x_i$ and $y_i$ if there are no higher-order statistical dependencies than correlation (second-order statistics). For Gaussian variables this means

    I(x, y) = \frac{1}{2} \log \prod_i \frac{1}{1 - \rho_i^2}
            = -\frac{1}{2} \sum_i \log\left(1 - \rho_i^2\right).    (9)

Kay [13] has shown that this relation, plus a constant, holds for all elliptically symmetrical distributions of the form

    Q\left( (z - \bar{z})^T C^{-1} (z - \bar{z}) \right), \quad z = \begin{pmatrix} x \\ y \end{pmatrix}.    (10)
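For Gaussian variables, the sum over canonical correlations in equation (9) can be checked against the closed-form Gaussian mutual information (equation (30) in appendix A.3). A sketch, assuming NumPy, on a randomly chosen covariance for a 3-dimensional $x$ and a 2-dimensional $y$:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random positive-definite joint covariance for a 3-D x and a 2-D y.
A = rng.standard_normal((5, 5))
C = A @ A.T + 5 * np.eye(5)
Cxx, Cxy, Cyy = C[:3, :3], C[:3, 3:], C[3:, 3:]

# Canonical correlations from the eigenvalue equation (4).
M = np.linalg.inv(Cxx) @ Cxy @ np.linalg.inv(Cyy) @ Cxy.T
rho = np.sqrt(np.sort(np.linalg.eigvals(M).real)[::-1][:2])

# Equation (9): mutual information as a sum over canonical correlations ...
I_cca = -0.5 * np.sum(np.log(1 - rho ** 2))

# ... equals the closed-form Gaussian expression (equation (30)).
I_gauss = 0.5 * np.log(
    np.linalg.det(Cxx) * np.linalg.det(Cyy) / np.linalg.det(C))

print(np.isclose(I_cca, I_gauss))  # True
```

The agreement follows from the determinant identity $|C| = |C_{xx}|\,|C_{yy}| \prod_i (1 - \rho_i^2)$.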

5.3 Relation to other linear subspace methods

Instead of the two eigenvalue equations in (4), we can formulate the problem in one single eigenvalue equation:

    B^{-1} A \hat{w} = \rho \hat{w}    (11)

where

    A = \begin{bmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{bmatrix}, \quad
    B = \begin{bmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{bmatrix}
    \quad \text{and} \quad
    \hat{w} = \begin{pmatrix} \hat{w}_x \\ \hat{w}_y \end{pmatrix}.    (12)

Solving the eigenproblem in equation (11) with slightly different matrices $A$ and $B$ will give the solutions to principal component analysis (PCA), partial least squares (PLS) and multivariate linear regression (MLR). The matrices are listed in table 1.

            A                                        B

    PCA:    C_{xx}                                   I
    PLS:    [ 0 C_{xy} ; C_{yx} 0 ]                  I
    CCA:    [ 0 C_{xy} ; C_{yx} 0 ]                  [ C_{xx} 0 ; 0 C_{yy} ]
    MLR:    [ 0 C_{xy} ; C_{yx} 0 ]                  [ C_{xx} 0 ; 0 I ]

Table 1: The matrices A and B for PCA, PLS, CCA and MLR.
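The unified formulation can be sanity-checked for the CCA row of the table: the eigenvalues of $B^{-1}A$ come in $\pm\rho$ pairs whose squares match the squared canonical correlations of equation (4). A minimal sketch, assuming NumPy and a hypothetical random covariance:

```python
import numpy as np

rng = np.random.default_rng(2)

# A random positive-definite total covariance matrix for a 3-D x and a 2-D y.
T = rng.standard_normal((5, 5))
C = T @ T.T + 5 * np.eye(5)
Cxx, Cxy, Cyy = C[:3, :3], C[:3, 3:], C[3:, 3:]

# CCA row of table 1: A holds the between-sets blocks, B the within-sets blocks.
A = np.block([[np.zeros((3, 3)), Cxy], [Cxy.T, np.zeros((2, 2))]])
B = np.block([[Cxx, np.zeros((3, 2))], [np.zeros((2, 3)), Cyy]])

# Equation (11): eigenvalues of B^{-1}A come in +/- rho pairs (plus zeros).
ev = np.sort(np.linalg.eigvals(np.linalg.inv(B) @ A).real)[::-1]

# Squared canonical correlations from the eigenvalue equation (4).
rho2 = np.sort(np.linalg.eigvals(
    np.linalg.inv(Cxx) @ Cxy @ np.linalg.inv(Cyy) @ Cxy.T).real)[::-1][:2]

print(np.allclose(ev[:2] ** 2, rho2))  # True: the two formulations agree
```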

5.4 Relation to SNR

Correlation is strongly related to the signal-to-noise ratio (SNR), which is a more commonly used measure in signal processing. Consider a signal $s$ and two noise signals $n_1$ and $n_2$, all having zero mean(1) and all being uncorrelated with each other. Let $S = E[s^2]$ and $N_i = E[n_i^2]$ be the energies of the signal and the noise signals respectively. Then the correlation between $x = a(s + n_1)$ and $y = b(s + n_2)$ is

    \rho = \frac{ab\,E[s^2]}{\sqrt{a^2\left(E[s^2] + E[n_1^2]\right) \; b^2\left(E[s^2] + E[n_2^2]\right)}}
         = \frac{S}{\sqrt{(S + N_1)(S + N_2)}}.    (13)

Note that the amplification factors $a$ and $b$ do not affect the correlation or the SNR.

5.4.1 Equal noise energies

In the special case where the noise energies are equal, i.e. $E[n_1^2] = E[n_2^2] = N$, equation (13) can be written as

    \rho = \frac{S}{S + N}.    (14)

This means that the SNR can be written as

    \frac{S}{N} = \frac{\rho}{1 - \rho}.    (15)

(1) The assumption of zero mean is for convenience. A nonzero mean does not affect the SNR or the correlation.

Here it should be noted that the noise affects the signal twice, so this relation between SNR and correlation is perhaps not so intuitive. The relation is illustrated in figure 1 (top).

5.4.2 Correlation between a signal and the corrupted signal

Another special case is when $N_1 = 0$ and $N_2 = N$. Then the correlation between a signal and a noise-corrupted version of that signal is

    \rho = \sqrt{ \frac{S}{S + N} }.    (16)

In this case, the relation between SNR and correlation is

    \frac{S}{N} = \frac{\rho^2}{1 - \rho^2}.    (17)

This relation between correlation and SNR is illustrated in figure 1 (bottom).
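Both special cases can be verified by simulation. A sketch assuming NumPy, with hypothetical energies $S = 1$ and $N = 0.25$ (i.e. SNR $= 4$); the amplification factors 3.0 and 0.5 are arbitrary and, as stated above, do not matter:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Unit-energy signal and two noise signals with energy N = 0.25 (SNR = 4).
S, N = 1.0, 0.25
s = rng.standard_normal(n)
n1 = rng.standard_normal(n) * np.sqrt(N)
n2 = rng.standard_normal(n) * np.sqrt(N)

def corr(u, v):
    """Sample correlation for zero-mean signals."""
    return np.mean(u * v) / np.sqrt(np.mean(u * u) * np.mean(v * v))

# Equal noise energies (eq. 14): rho = S/(S+N), so SNR = rho/(1-rho).
rho = corr(3.0 * (s + n1), 0.5 * (s + n2))  # amplification does not matter
print(round(rho, 2))                         # close to 0.8
print(round(rho / (1 - rho), 1))             # close to the true SNR, 4

# Signal vs. corrupted signal (eq. 16): rho = sqrt(S/(S+N)),
# so SNR = rho^2 / (1 - rho^2).
rho_c = corr(s, s + n2)
print(round(rho_c ** 2 / (1 - rho_c ** 2), 1))  # close to 4 as well
```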

A Explanations

A.1 A note on correlation and covariance matrices

In neural network literature, the matrix $C$ in equation (3) is often called a correlation matrix. This can be a bit confusing, since $C$ does not contain the correlations between the variables in a statistical sense, but rather the expected values of the products between them. The correlation between $x_i$ and $x_j$ is defined as

    \rho_{ij} = \frac{ E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)] }{ \sqrt{ E[(x_i - \bar{x}_i)^2] \, E[(x_j - \bar{x}_j)^2] } },    (18)

see for example [1], i.e. the covariance between $x_i$ and $x_j$ normalized by the geometric mean of the variances of $x_i$ and $x_j$. Hence, the correlation is bounded, $-1 \le \rho_{ij} \le 1$. In this tutorial, correlation matrices are denoted $R$.

The diagonal terms of $C$ are the second-order origin moments, $E[x_i^2]$, of $x$. The diagonal terms in a covariance matrix are the variances, or the second-order central moments, $E[(x_i - \bar{x}_i)^2]$, of $x$.

The maximum likelihood estimator of $\rho_{ij}$ is obtained by replacing the expectation operator in equation (18) by a sum over the samples. This estimator is sometimes called the Pearson correlation coefficient after K. Pearson [16].
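The sample estimator of equation (18) can be written in a few lines. A sketch assuming NumPy; the data below is hypothetical:

```python
import numpy as np

def pearson(x, y):
    """Sample estimate of equation (18): the covariance normalized by the
    geometric mean of the variances, with expectations replaced by sums."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc * xc) * np.sum(yc * yc))

rng = np.random.default_rng(4)
a = rng.standard_normal(1000)
b = 0.5 * a + rng.standard_normal(1000)

r = pearson(a, b)
# Bounded as stated above, and identical to NumPy's built-in estimator.
print(abs(r) <= 1.0, np.isclose(r, np.corrcoef(a, b)[0, 1]))  # True True
```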

A.2 Affine transformations

An affine transformation is simply a translation of the origin followed by a linear transformation. In mathematical terms, an affine transformation of $x \in R^n$ is a map $F: R^n \to R^n$ of the form

    F(x) = Ax + b,

where $A$ is a linear transformation of $R^n$ and $b$ is a translation vector in $R^n$.

A.3 A piece of information theory

Consider a discrete random variable $x$:

    x \in \{ x_i \}, \quad i \in \{1, \ldots, N\}.    (19)

(There is, in practice, no limitation in $x$ being discrete, since all measurements have finite precision.) Let $P(x_i)$ be the probability of $x_i$ for a randomly chosen $x$. The information content in the vector (or symbol) $x_i$ is defined as

    I(x_i) = \log \frac{1}{P(x_i)} = -\log P(x_i).    (20)

If the base 2 is used for the logarithm, the information is measured in bits. The definition of information has some appealing properties. First, the information is 0 if $P(x_i) = 1$; if the receiver of a message knows that the message will be $x_i$, he does not get any information when he receives it. Secondly, the information is always positive; it is not possible to lose information by receiving a message. Finally, the information is additive, i.e. the information in two independent symbols is the sum of the information in each symbol:

    I(x_i, y_k) = -\log P(x_i, y_k) = -\log\left( P(x_i) P(y_k) \right) = I(x_i) + I(y_k)    (21)

if $x$ and $y$ are statistically independent.

The information measure considers each instance of the stochastic variable, but it does not say anything about the stochastic variable itself. This can be accomplished by calculating the average information of the stochastic variable:

    H(x) = \sum_{i=1}^{N} P(x_i) I(x_i) = -\sum_{i=1}^{N} P(x_i) \log P(x_i).    (22)

$H(x)$ is called the entropy of $x$ and is a measure of the uncertainty about $x$.

Now we introduce a second discrete random variable $y$, which, for example, can be an output signal from a system with $x$ as input. The conditional entropy [18] of $x$ given $y$ is

    H(x|y) = -\sum_{i,k} P(x_i, y_k) \log P(x_i | y_k).    (23)

The conditional entropy is a measure of the average information in $x$ given that $y$ is known. In other words, it is the remaining uncertainty about $x$ after observing $y$.(2) The average mutual information between $x$ and $y$ is defined as the average information about $x$ gained when observing $y$:

    I(x, y) = H(x) - H(x|y).    (24)

(2) Shannon [1948] originally used the term rate of transmission. The term mutual information was introduced later.
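Equations (22)-(24) can be illustrated on a small discrete example. A sketch assuming NumPy; the 2x2 joint distribution $P$ below is hypothetical:

```python
import numpy as np

# Joint distribution P(x, y) on a 2x2 alphabet (rows: x, columns: y).
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
Px, Py = P.sum(axis=1), P.sum(axis=0)   # marginals

# Entropy in bits, equation (22).
H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))

# Conditional entropy H(x|y), equation (23): P/Py gives P(x|y) column-wise.
H_x_given_y = -np.sum(P * np.log2(P / Py))

# Mutual information, equation (24), and its symmetric form H(x)+H(y)-H(x,y).
I = H(Px) - H_x_given_y
I_sym = H(Px) + H(Py) - H(P.ravel())

print(round(I, 3), np.isclose(I, I_sym))  # 0.278 True
```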

The mutual information can be interpreted as the difference between the uncertainty about $x$ and the remaining uncertainty about $x$ after observing $y$. In other words, it is the reduction in uncertainty about $x$ gained by observing $y$. Inserting equation (23) into equation (24) gives

    I(x, y) = H(x) - H(x|y) = H(x) + H(y) - H(x, y) = H(y) - H(y|x) = I(y, x),    (25)

which shows that the mutual information is symmetric.

Now let $x$ be a continuous random variable. Then the differential entropy $h(x)$ is defined as [18]

    h(x) = -\int p(x) \log p(x) \, dx,    (26)

where $p(x)$ is the probability density function of $x$ and the integral is over all dimensions in $x$. The average information in a continuous variable would of course be infinite, since there is an infinite number of possible outcomes. This can be seen if the discrete entropy definition (eq. 22) is calculated in the limit as $x$ approaches a continuous variable:

    \lim_{\Delta x \to 0} H(x) = -\lim_{\Delta x \to 0} \sum_i p(x_i)\,\Delta x \log\left( p(x_i)\,\Delta x \right) = h(x) - \lim_{\Delta x \to 0} \log \Delta x,    (27)

where the last term approaches infinity when $\Delta x$ approaches zero [8]. But since mutual information considers a difference in entropy, the infinite term vanishes and continuous variables can be used to simplify the calculations. The mutual information between the continuous random variables $x$ and $y$ is then

    I(x, y) = h(x) + h(y) - h(x, y) = \int_{R^{N_x}} \int_{R^{N_y}} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy,    (28)

where $N_x$ and $N_y$ are the dimensionalities of $x$ and $y$ respectively.

Consider the special case of Gaussian distributed variables. The differential entropy of an $N$-dimensional Gaussian variable $x$ is

    h(x) = \frac{1}{2} \log\left( (2\pi e)^N |C| \right),    (29)

where $C$ is the covariance matrix of $x$ [3]. This means that the mutual information between two $N$-dimensional Gaussian variables $x$ and $y$ is

    I(x, y) = \frac{1}{2} \log \frac{ |C_{xx}| \, |C_{yy}| }{ |C| },    (30)

where

    C = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix};

$C_{xx}$ and $C_{yy}$ are the within-sets covariance matrices and $C_{xy} = C_{yx}^T$ is the between-sets covariance matrix. For more details on information theory, see for example [7].

A.4 Principal component analysis

Principal component analysis (PCA) is an old tool in multivariate data analysis. It was used already in 1901 [17]. The principal components are the eigenvectors of the covariance matrix. The projection of data onto the principal components is sometimes called the Hotelling transform after H. Hotelling [9], or the Karhunen-Loève transform (KLT) after K. Karhunen [12] and M. Loève [15]. This transformation is an orthogonal transformation that diagonalizes the covariance matrix.
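The diagonalization property can be shown in a few lines. A sketch assuming NumPy; the mixing matrix is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

# Zero-mean data with an anisotropic covariance (hypothetical mixing matrix).
X = rng.standard_normal((1000, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X -= X.mean(axis=0)

C = X.T @ X / len(X)              # sample covariance matrix
vals, vecs = np.linalg.eigh(C)    # principal components = eigenvectors of C

# Projecting onto the principal components diagonalizes the covariance.
Y = X @ vecs
print(np.allclose(Y.T @ Y / len(Y), np.diag(vals), atol=1e-8))  # True
```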

A.5 Partial least squares

Partial least squares (PLS) was developed in econometrics in the 1960s by Herman Wold. It is most commonly used for regression in the field of chemometrics [19]. PLS is basically the singular value decomposition (SVD) of a between-sets covariance matrix. For an overview, see for example [6] and [11].

In PLS regression, the principal vectors corresponding to the largest principal values are used as a new, lower-dimensional basis for the signal. A regression of $y$ onto $x$ is then performed in this new basis. As in the case of PCA, the scaling of the variables affects the solutions of the PLS.
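The SVD view of PLS can be sketched directly: the first pair of PLS basis vectors are the directions of maximal covariance between the two sets. A minimal sketch assuming NumPy, on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data: y depends linearly on x plus a little noise.
X = rng.standard_normal((500, 4))
Y = X @ rng.standard_normal((4, 3)) + 0.1 * rng.standard_normal((500, 3))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# PLS directions: SVD of the between-sets covariance matrix C_xy.
Cxy = X.T @ Y / len(X)
U, svals, Vt = np.linalg.svd(Cxy)

# The leading pair (wx, wy) maximizes wx^T C_xy wy over unit vectors,
# and the achieved covariance is the largest singular value.
wx, wy = U[:, 0], Vt[0]
print(np.isclose(wx @ Cxy @ wy, svals[0]))  # True
```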

A.6 Multivariate linear regression

Multivariate linear regression (MLR) is the problem of finding a set of basis vectors $\hat{w}_{yi}$ and corresponding regressors $\beta_i$ in order to minimize the mean square error of the vector $y$:

    \xi = E\left[ \| y - W x \|^2 \right],    (31)

where $M = \dim(y)$. The basis vectors are described by the matrix $W = C_{yx} C_{xx}^{-1}$, which is also known as the Wiener filter. A low-rank approximation to this problem can be defined by minimizing

    \xi = E\left[ \Big\| y - \sum_{i=1}^{k} \beta_i \hat{w}_{yi} \hat{w}_{xi}^T x \Big\|^2 \right],    (32)

where $k < M$ and the orthogonal basis vectors $\hat{w}_{yi}$ span the subspace of $y$ which gives the smallest mean square error given the rank $k$. The bases $\hat{w}_{xi}$ and $\hat{w}_{yi}$ are given by the solutions to

    \begin{bmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{bmatrix} \hat{w}
    = \lambda \begin{bmatrix} C_{xx} & 0 \\ 0 & I \end{bmatrix} \hat{w},    (33)

which can be recognized as equation (11) with $A$ and $B$ from the lower row in table 1.
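The full-rank solution of equation (31) can be checked numerically: the Wiener filter built from sample covariances coincides with the least-squares solution, so any other linear predictor gives a larger mean square error. A sketch assuming NumPy, on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: a 2-D y generated linearly from a 3-D x plus noise.
X = rng.standard_normal((2000, 3))
Y = X @ np.array([[1.0, 0.5],
                  [0.0, 2.0],
                  [-1.0, 0.0]]) + 0.1 * rng.standard_normal((2000, 2))

Cxx = X.T @ X / len(X)
Cyx = Y.T @ X / len(X)

# Equation (31): the Wiener filter W = C_yx C_xx^{-1} minimizes the MSE.
W = Cyx @ np.linalg.inv(Cxx)

err_wiener = np.mean(np.sum((Y - X @ W.T) ** 2, axis=1))
err_other = np.mean(np.sum((Y - X @ (W + 0.1).T) ** 2, axis=1))
print(err_wiener < err_other)  # True: perturbing W can only increase the error
```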