Tutorial using the free software

Performing a spatial Principal Component

Analysis using the R software

T. JOMBART

November 2006

For any questions or comments, please send an email to:

jombart@biomserv.univ-lyon1.fr.

Contents

1 Introduction 2

2 Building a connection network 4

3 Performing the analyses 5

3.1 Testing for the existence of spatial structures . . . . . . . . . . . . . 6

3.2 Identifying and displaying the spatial structures . . . . . . . . . . . 8

3.2.1 The Principal Component Analysis . . . . . . . . . . . . . . 8

3.2.2 The spatial Principal Component Analysis . . . . . . . . . . 10

1T. JOMBART, A sPCA tutorial using R

1 Introduction

Thistutorialshowshowtoperformaspatial Principal Component Analysis (sPCA,

Jombart et al., submitted), starting from a data set of allelic frequencies of geo-

referenced individuals or groups of individuals. The second illustration presented

inthepreviouslycitedarticleisreproducedbelow. Itinvolvesgroupsofindividuals

(cat colonies) data, but the process is exactly the same using individuals data.

Before starting, please download the ﬁles Nancy cats.rda and map.ppm at

http://pbil.univ-lyon1.fr/R/html/additifs.html. The ﬁrst ﬁle is a data

set with R format (’rda’), and the second one is a picture of the sampled area.

We assume that both ﬁles were saved in your R working directory (which can be

known and set using the commands getwd() and setwd()). Several R packages

are also used to perform the analyses. These can be installed using the following

commands:

> install.packages("ade4", dep = TRUE)

> install.packages("spdep", dep = TRUE)

> install.packages("pixmap", dep = TRUE)

We ﬁrst load the data object:

> load("Nancy_cats.rda")

> names(Nancy_cats)

[1] "xy" "tab" "sampsizes"

This list contains a matrix of geographic coordinates of the 17 stray cats colonies

(Nancy_cats$xy), a table of allelic frequencies for 9 microsatellite markers

(Nancy_cats$tab), and the number of cats sampled for each colonies

(Nancy_cats$sampsizes, not used in this tutorial). We rename these objects

respectively to xy and tab. We also load the map of the sampled area, stored as

a ppm ﬁle, using the pixmap library.

> xy = Nancy_cats$xy

> colnames(xy) <- c("x", "y")

> rownames(xy) <- 1:nrow(xy)

> head(xy)

x y

1 263.3498 171.10939

2 183.5028 122.40790

3 391.1050 254.70148

4 458.6121 41.72336

5 182.7769 219.08398

6 335.2121 344.83557

2T. JOMBART, A sPCA tutorial using R

> tab = Nancy_cats$tab

> tab[1:5, 1:6]

L1.01 L1.02 L1.03 L1.04 L1.05 L1.06

P01 0.00000000 0.00000000 0.1000000 0.3000000 0.10000000 0.05000000

P02 0.00000000 0.27272727 0.0000000 0.5454545 0.04545455 0.00000000

P03 0.04166667 0.16666667 0.1666667 0.2083333 0.00000000 0.08333333

P04 0.02173913 0.08695652 0.1304348 0.4130435 0.00000000 0.06521739

P05 0.00000000 0.00000000 0.0000000 0.5333333 0.00000000 0.10000000

> library(pixmap)

> map <- read.pnm("map.ppm")

> plot(map, main = "Sampled area and positions \nof the colonies")

At this step, we need to express the spatial relationships between the colonies

in order to include spatial information in further analyses. This is achieved by

building a connection network (also called ’neighbouring graph’), which deﬁnes

which colonies are neighbours or not.

3T. JOMBART, A sPCA tutorial using R

2 Building a connection network

As we have no information about the real connectivity between these stray cat

colonies, we use an algorithm to build their connection network. The R software

oﬀers many tools to perform this task (see Table 1). A remaining diﬃculty is

that these tools are spread throughout diﬀerent packages, the main of which are

presented below. Note that each package produces diﬀerent output types (called

classes in R framework); however, this diﬃculty is easily circumvented by ’trans-

lation’ functions. For instance, having an object of the class titi, it will often be

possible to get an object of the class toto using a function titi2toto().

We choose Gabriel criterion (Gabriel and Sokal, 1969). The connection

network is built using the function gabrielneigh, whose argument is a matrix of

the geographic coordinates of the points.

> library(spdep)

Package SparseM (0.71) loaded. To cite, see citation("SparseM")

> gab.graph <- gabrielneigh(Nancy_cats$xy)

> class(gab.graph)

[1] "Graph" "Gabriel"

Note that we could add or remove connections of the Gabriel graph using the

function fix.nb of the spdep package. The presentation of usual algorithms other

than Gabriel criterion (Table 1) is beyond the scope of this document. For more

details, a starting point can be found in (Legendre and Legendre, 1998, p.

752-756), although R documentation may be suﬃcient.

Table 1: Usual algorithms used to build connection networks, and the corresponding func-

tions in the R software, precising their library, class and the associated conversion functions.

Name of the algorithm Function Library Class Useful conversion functions

Delaunay triangulation tri2nb tripack nb nb2neig

Gabriel graph gabrielneigh spdep graph graph2nb, nb2neig

Relative neighbours relativeneigh spdep graph graph2nb, nb2neig

K nearest neighbours knearneigh spdep knn knn2nb, nb2neig

Neighbourhood by distance dnearneigh spdep nb nb2neig

Minimum spanning tree mstree ade4 neig neig2nb

Among the many R classes related to the connection networks, we are partic-

ularly interested in the the following: nb, which can be viewed as the reference

4T. JOMBART, A sPCA tutorial using R

class for connection networks, listw, which is used in computations (Moran’s I

and sPCA), and neig which is used for plotting the results in ade4 graphical func-

tions. As said previously, many functions allow to convert one class into another

(Table 1).

In this example, we need graph2nb to convert a graph object into the a nb

object, required for further operations. We call the resulting object cn.gab.nb.

Note that the argument sym=TRUE speciﬁes that spatial connections are symetrical

(i is connected to j implies that j is connected to i). We also proceed to other

conversions, latter involved in computations and graphical display.

> library(spdep)

> library(ade4)

> cn.gab.nb <- graph2nb(gab.graph, sym = TRUE)

> class(cn.gab.nb)

[1] "nb"

> cn.gab.listw <- nb2listw(cn.gab.nb)

> class(cn.gab.listw)

[1] "listw" "nb"

> cn.gab.neig <- nb2neig(cn.gab.nb)

> class(cn.gab.neig)

[1] "neig"

3 Performing the analyses

Once the connection network is deﬁned, we can proceed to the analyses, which

can be detailed into two main steps. First, the existence of spatial structures is

assessed in the raw data, using a spatial autocorrelation test. If this test detects

the presence of spatial patterns, then these patterns are searched and displayed

using the sPCA.

5T. JOMBART, A sPCA tutorial using R

3.1 Testing for the existence of spatial structures

The procedure proposed by Smouse and Peakall (1999) is used to test the

existence of spatial autocorrelation. Basically, as this test is based on Moran’s

I, it would not be able to diﬀerenciate between the absence of structures, and a

mixture of global and local structures. As such a case is less likely to occur in a

single locus, we perform the test for each locus separately.

Smouse and Peakall (1999) procedure was generalized by Ollier (2004),

and is available in the function multispati.randtest (package: ade4). We have

to split the main allelic frequencies table in order to have a table for each locus.

Here are the numbers of alleles per marker:

> table(substr(colnames(tab), 1, 2))

L1 L2 L3 L4 L5 L6 L7 L8 L9

11 18 10 9 12 8 16 12 12

> blocs <- as.numeric(table(substr(colnames(tab), 1, 2)))

Then we use the ade4 function ktab.data.frame to create one table per marker.

> tab_loc <- ktab.data.frame(tab, blocs, tabnames = paste("loc",

+ 1:9, sep = "."))

> names(tab_loc)[1:9]

[1] "loc.1" "loc.2" "loc.3" "loc.4" "loc.5" "loc.6" "loc.7" "loc.8" "loc.9"

Now, we perform the spatial autocorrelation test for each of the nine tables. Note

that here, the function dudi.pca is only used by the procedure to get the inertia

of each table.

> SPtest <- lapply(1:9, function(i) multispati.randtest(dudi.pca(tab_loc[[i]],

+ center = T, scale = F, scannf = F), cn.gab.listw, nrep = 9999))

> names(SPtest) <- names(tab_loc)[1:9]

> SPtest

$loc.1

Monte-Carlo test

Observation: -0.1228694

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

Simulated p-value: 0.7608

$loc.2

Monte-Carlo test

Observation: 0.1051725

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

6T. JOMBART, A sPCA tutorial using R

Simulated p-value: 0.0421

$loc.3

Monte-Carlo test

Observation: 0.06423195

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

Simulated p-value: 0.0858

$loc.4

Monte-Carlo test

Observation: 0.01707664

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

Simulated p-value: 0.2046

$loc.5

Monte-Carlo test

Observation: -0.1462053

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

Simulated p-value: 0.8424

$loc.6

Monte-Carlo test

Observation: -0.2374583

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

Simulated p-value: 0.9757

$loc.7

Monte-Carlo test

Observation: -0.09831204

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

Simulated p-value: 0.6916

$loc.8

Monte-Carlo test

Observation: 0.008750039

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

Simulated p-value: 0.1854

$loc.9

Monte-Carlo test

Observation: -0.1030899

Call: multispati.randtest(dudi = dudi.pca(tab_loc[[i]], center = T,

scale = F, scannf = F), listw = cn.gab.listw, nrepet = 9999)

Based on 9999 replicates

Simulated p-value: 0.6401

Thep-valuescorrespondtotheprobabilitythatateststatistic(roughlyinterpreted

like Moran’s I) as high as the observed one occurs by chance. Thus, the second

locus is signiﬁcantly positively autocorrelated, whereas the locus 6 displays nega-

tive spatial autocorrelation. As a consequence, the raw data contain both global

7T. JOMBART, A sPCA tutorial using R

and local structures.

3.2 Identifying and displaying the spatial structures

3.2.1 The Principal Component Analysis

Nowwecanstartthemultivariateanalysesusingthe ade4library(Chessel et al.,

2004), which oﬀers functions for most of the multidimensional analyses, genetic

datahandlingandgraphicalrepresentations. First, weperformacentredPrincipal

Component Analysis of the allelic frequencies table, using the function dudi.pca

(see ?dudi for the other ordination methods). Note that the term ’dudi’ refers

to a the duality diagram, which is a general framework for ordination methods

(Escoufier, 1987). We tell the function to perform a centred (center=T) but

not scaled PCA (scale=F), not to display the screeplot (scannf=F), and directly

choose to retain the three highest eigenvalues (nf=3).

> pca1 <- dudi.pca(tab, center = T, scale = F, scannf = F, nf = 3)

> pca1

Duality diagramm

class: pca dudi

$call: dudi.pca(df = tab, center = T, scale = F, scannf = F, nf = 3)

$nf: 3 axis-components saved

$rank: 16

eigen values: 0.1118 0.1064 0.0945 0.08141 0.07786 ...

vector length mode content

1 $cw 108 numeric column weights

2 $lw 17 numeric row weights

3 $eig 16 numeric eigen values

data.frame nrow ncol content

1 $tab 17 108 modified array

2 $li 17 3 row coordinates

3 $l1 17 3 row normed scores

4 $co 108 3 column coordinates

5 $c1 108 3 column normed scores

other elements: cent norm

We then display the PCA results using the s.value function, which allows to

represent a quantitative variable on a map using black (positive values) and white

(negative values) squares whose sizes correspond to the absolute values. The row

scores, notedφ inJombart et al. (submitted), are stored in the matrix pca1$li.

For instance,φ is the ﬁrst column (pca1$li[,1]),φ is in (pca1$li[,2]), etc.1 2

Wegiveoptionalargumentstothefunctioninordertochooseabackgroundpicture

(pix=map), to add a connection network (neig=nb2neig(cn.gab.nb)), and not to

include x y axes (addaxes=F), origin (include.ori=F), grid (grid=F) or legend

(cleg=0). The two last arguments are used to choose the subtitles and their size.

8'

T. JOMBART, A sPCA tutorial using R

> par(mfrow = c(2, 2))

> invisible(lapply(1:3, function(i) s.value(xy, pca1$li[, i], pix = map,

+ neig = nb2neig(cn.gab.nb), include.ori = F, grid = F, addaxes = F,

+ cleg = 0, sub = paste("PCA scores", (1:3)[i]), csub = 1.3)))

Are these scores displaying spatial structures? It seems not, but their spatial

autocorrelation can be quantiﬁed and tested using Moran’s I. The library spdep

proposes a simple permutation test for this statistic: the arguments of moran.mc

are the tested variable, the connection network and the number of simulations.

> lapply(1:3, function(i) moran.mc(pca1$li[, i], cn.gab.listw,

+ nsim = 9999))

[[1]]

Monte-Carlo simulation of Moran s I

data: pca1$li[, i]

weights: cn.gab.listw

number of simulations + 1: 10000

9'

'

T. JOMBART, A sPCA tutorial using R

statistic = -0.2048, observed rank = 2453, p-value = 0.7547

alternative hypothesis: greater

[[2]]

Monte-Carlo simulation of Moran s I

data: pca1$li[, i]

weights: cn.gab.listw

number of simulations + 1: 10000

statistic = 0.0727, observed rank = 7653, p-value = 0.2347

alternative hypothesis: greater

[[3]]

Monte-Carlo simulation of Moran s I

data: pca1$li[, i]

weights: cn.gab.listw

number of simulations + 1: 10000

statistic = -0.054, observed rank = 5256, p-value = 0.4744

alternative hypothesis: greater

We conclude that none of the scores are signiﬁcantly spatially structured. Hence,

the PCA was unable to display the spatial patterns existing in the data.

3.2.2 The spatial Principal Component Analysis

We can therefore proceed to the sPCA. The function performing this analysis is

called multispati, for ’multivariate spatial’ analysis. This function requires the

original analysis (here, pca1), and a list of spatial weights of the class listw,

containing the same information as a row-standardized connection matrix L. We

tell the function not to display the screeplot (scannf=F), and to keep 2 global

(nfposi=2) as well as 2 local (nfnega=2) structures. The output is an object of

the class multispati.

> sPca1 <- multispati(pca1, cn.gab.listw, scannf = F, nfposi = 2,

+ nfnega = 2)

> sPca1

Multispati object

class: multispati

$call: multispati(dudi = pca1, listw = cn.gab.listw, scannf = F, nfposi = 2,

nfnega = 2)

$nfposi: 2 axis-components saved

$nfnega: 2 axis-components saved

Positive eigenvalues: 0.05265 0.03444 0.02745 0.01776 0.01127 ...

Negative eigenvalues: -0.05201 -0.03989 -0.03434 -0.02702 -0.02374 ...

10