A genome-wide view of mutation rate co-variation using multivariate analyses



Published 01 January 2011

Excerpt

Ananda et al. Genome Biology 2011, 12:R27
http://genomebiology.com/2011/12/3/R27
RESEARCH Open Access
A genome-wide view of mutation rate
co-variation using multivariate analyses
Guruprasad Ananda^1,2, Francesca Chiaromonte^1,3*† and Kateryna D Makova^1,4*†
Abstract
Background: While the abundance of available sequenced genomes has led to many studies of regional
heterogeneity in mutation rates, the co-variation among rates of different mutation types remains largely
unexplored, hindering a deeper understanding of mutagenesis and genome dynamics. Here, utilizing primate and
rodent genomic alignments, we apply two multivariate analysis techniques (principal components and canonical
correlations) to investigate the structure of rate co-variation for four mutation types and simultaneously explore the
associations with multiple genomic features at different genomic scales and phylogenetic distances.
Results: We observe a consistent, largely linear co-variation among rates of nucleotide substitutions, small
insertions and small deletions, with some non-linear associations detected among these rates on chromosome X
and near autosomal telomeres. This co-variation appears to be shaped by a common set of genomic features,
some previously investigated and some novel to this study (nuclear lamina binding sites, methylated non-CpG sites
and nucleosome-free regions). Strong non-linear relationships are also detected among genomic features near the
centromeres of large chromosomes. Microsatellite mutability co-varies with other mutation rates at finer scales, but
not at 1 Mb, and shows varying degrees of association with genomic features at different scales.
Conclusions: Our results allow us to speculate about the role of different molecular mechanisms, such as
replication, recombination, repair and local chromatin environment, in mutagenesis. The software tools developed
for our analyses are available through Galaxy, an open-source genomics portal, to facilitate the use of multivariate
techniques in future large-scale genomics studies.
* Correspondence: fxc11@psu.edu; kdm16@psu.edu
† Contributed equally
1 Center for Medical Genomics, Penn State University, University Park, PA 16802, USA
Full list of author information is available at the end of the article
© 2011 Ananda et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
Deciphering the mechanisms of mutagenesis is central to our understanding of evolution and critical for studies of human genetic diseases. The availability of a multitude of sequenced genomes and their alignments provides an opportunity to study mutations on a genome-wide scale in many species, including humans. There is now substantial evidence for within-genome variation in mutation rates; in particular, regional variation in nucleotide substitution rates, insertion and deletion (indel) rates, and microsatellite mutability have been documented across the human genome [1-10]. However, notwithstanding the attention it has received in the literature, the causative mechanisms underlying regional mutation rate variation remain elusive.

Biochemical processes, including replication and recombination, have been suggested as potential contributors to mutation rate variation. For instance, replication likely determines the differences in nucleotide substitution rates among chromosomal types - nucleotide substitution rates are highest on chromosome Y, intermediate on autosomes, and lowest on chromosome X (for example, [10,11]), consistent with the relative number of germline cell divisions and thus DNA replication rounds for each of these chromosome types [12,13]. Local male recombination rate has been shown to be a significant determinant of regional nucleotide substitution rate variation [10], supporting the potential mutagenic nature of recombination and/or biased gene conversion [1,6,10]. Rates of small deletions have been found to be associated with replication-related genomic features, and rates of small insertions with recombination-related features [8]. Finally, the role of replication slippage in
determining variation in mutability among microsatellite loci has been recently corroborated [9]. Other factors - for example, the predominance of aberrant DNA repair mechanisms like non-homologous end-joining at subtelomeric regions [14], and yet unexplored mutagenic mechanisms potentially acting at telomeres [10] - might influence regional variation in mutation rates as well.

Genome-wide information on three additional genomic features has recently become available. Nuclear lamina binding regions are thought to represent a repressive chromatin environment and are concentrated in the proximity of centromeres [15]; their impact on local mutation rates has not been investigated to date. An abundance of methylated sites at non-CpG DNA locations in human embryonic stem cells was revealed by a recent study [16], suggesting alternative roles for DNA methylation in CpG and non-CpG contexts. Although the function of methylation in generating mutations at CpG locations has been extensively researched [2,6,8-10], no study to date has looked at the potential impact of the non-CpG methylome on the genome and its mutagenesis; in particular, methylated non-CpG cytosines may also elevate mutation rates. Finally, recent predictions of the density of nucleosome-free regions based on MNase digestion [17] can be used to understand the influence of local chromatin structure on mutation rates. Assessing the contribution of these three novel genomic features to mutation rate variation is of obvious and immediate interest.

In addition to varying regionally, rates of different mutations frequently co-vary with each other. Co-variation was observed between rates of nucleotide substitutions (estimated at ancestral repeats and four-fold degenerate sites), large deletions and insertions of transposable elements [2]. In a separate study, co-variation was observed between rates of nucleotide substitutions and both small insertions and small deletions [8]. What causes regional co-variation in the rates of different mutation types? While explanations based on selection have been considered [18], they are not satisfactory because mutation rates also co-vary in presumably neutrally evolving portions of the genome [2]. Shared local genomic landscapes might be responsible for the co-variation of these rates and, on a purely mechanistic basis, one mutation type might be physically associated with another one (for example, indel-induced nucleotide substitutions) [19], causing the corresponding rates to co-vary. However, these hypotheses have never been extensively explored. Notably, while a number of studies have documented regional variation and co-variation of rates of mutations of several types, they have mostly relied on correlation and univariate regression analyses, which relate mutation rates only in a pair-wise fashion, and attempt to explain their variation (as a function of genomic features) one at a time [2,3,5,8-10,18,20-22]. A better understanding of the structure and causes of mutation rate co-variation, which is crucial for studies of mutagenesis, can be achieved only through more sophisticated data analysis approaches.

This is exactly what we pursued in the current study, where we jointly investigated multiple mutation rates alongside several plausible explanatory genomic features, shedding light on the interplay between mutagenesis and the genomic landscape in which it occurs. In more detail, we used multivariate analysis techniques to characterize the co-variation structure of four rates (nucleotide substitutions, insertions, deletions, and microsatellite repeat number alterations) and explore their joint relationship with several genomic landscape variables. First, we applied principal component analysis (PCA) to mutation rates computed along the genome. Next, we linked rates to genomic landscape variables using canonical correlation analysis (CCA). Finally, we applied non-linear versions of these multivariate techniques, kernel-PCA (kPCA) and kernel-CCA (k-CCA), to investigate the presence of non-linear associations. We conducted our analyses on two mutually exclusive neutral subgenomes - one repetitive (ancestral repeats (ARs)) and one unique (non-coding non-repetitive (NCNR) sequences), and three genomic scales (1-Mb, 0.5-Mb, and 0.1-Mb) using human-orangutan comparisons, and repeated them for two additional phylogenetic distances using human-macaque and mouse-rat comparisons, to understand if and how the structure of mutation rate co-variation and the contribution of various genomic features may differ among them.

Importantly, we have made the suite of software tools implemented for this research publicly available, with the aim of improving reproducibility and facilitating future studies of mutation rates and other genome-wide data. We integrated our software into a modular tool set in Galaxy [23], a free and easy-to-use web-based genomics portal that has already established a substantial community of users.

Results
To investigate co-variation in rates of nucleotide substitutions, small insertions, small deletions, and microsatellite repeat number alterations, we identified all such mutations in the human-orangutan alignments, using macaque as an outgroup to distinguish insertions from deletions. Our rationale for using human-orangutan comparisons is that, since their divergence is greater than that of human and chimpanzee, it is expected to be less affected by biases due to ancestral polymorphisms [24]. We limited our analysis to human-specific mutations occurring after the human-orangutan split in two supposedly neutrally evolving subgenomes; ARs [2]
and NCNR sequences [11]. These have been successfully used for evaluating neutral variation in other studies [2,8,10,11,25-27]. Human-specific mutations were chosen because of the high quality of the human genome sequence and its annotation. The AR subgenome consisted of all transposable elements that were inserted in the human genome prior to the human-macaque divergence (thus excluding L1PA1-A7, L1HS, and AluY). The NCNR subgenome was constructed by excluding genes and 5-kb flanking regions around them (thus removing known coding and regulatory elements), other computationally predicted and/or experimentally validated functional elements (see Materials and methods), and all repeats identified by RepeatMasker [28] (excluding mononucleotide microsatellites). This minimizes potential effects of selection and avoids overlap with the AR subgenome.

Next, the human genome was broken into 1-Mb windows, which has been proposed as the natural variation scale for both mammalian nucleotide substitution and indel rates [8,25]. For each 1-Mb window, restricting attention to the AR (and separately NCNR) portion of the window, we computed rates of nucleotide substitutions, small (≤ 30-bp) insertions, small (≤ 30-bp) deletions and mononucleotide microsatellite repeat number alterations (Table 1; see Materials and methods). Moreover, for each 1-Mb window we aggregated genomic features to be used as predictors (Table 2; see Materials and methods). Relationships among mutation rates, and between mutation rates and genomic features, were explored using multivariate analysis techniques, including PCA, CCA, and non-linear versions of both methods. All computations were performed using a suite of tools developed in Galaxy (see Materials and methods).

To verify whether our findings were consistent over different genomic scales and phylogenetic distances, we produced and analyzed analogous data for the NCNR subgenome considering 0.5-Mb and 0.1-Mb genomic windows, as well as human-macaque alignments (here insertions and deletions were distinguished using marmoset as the outgroup) and mouse-rat alignments (here we studied mouse-specific mutations and distinguished insertions and deletions using guinea pig as the outgroup). Below, we focus on AR and NCNR subgenome results obtained with 1-Mb windows and human-orangutan alignments. Findings for, and comparisons with, other genomic scales/phylogenetic distances analyzed for the NCNR subgenome are provided in the next-to-last subsection of the Results, the Discussion, and in Additional file 1.

Mutation rate co-variation
PCA was used to characterize co-variation among the four mutation rates in terms of orthogonal components, each representing a linear combination of the rates. PCA was run on the correlation matrix (that is, after standardizing the rates) and resulted in two significant components (eigenvalues greater than 1) [29], which accounted for approximately three-quarters of the total variance (Table S1 in Additional file 1). Loadings (eigenvectors), which capture the correlation between each principal component and the rates, were then used to interpret the co-variation structure. Results were largely similar between the AR and NCNR subgenomes (Figure 1).

The first principal component suggested that the strongest co-variation in the genome occurs among insertion, deletion and substitution rates. Insertion and deletion rates exhibited large and concordant loadings for this component in both subgenomes (Figure 1; Table S2 in Additional file 1), indicating a strong positive association between these two mutation rates. Substitution rate also had a large loading for the first principal component in both subgenomes, indicating its association with indel rates.

Microsatellite mutability, which was absent from the first principal component, was the only strong loading in the second principal component in both subgenomes (Figure 1; Table S2 in Additional file 1), suggesting that the variation in this rate is largely orthogonal to the others, and thus that the genomic forces driving microsatellite mutability might be distinct from those driving indel and substitution rates (see below). Interestingly, a marked negative correlation was observed between substitution rates and the number of orthologous microsatellites per 1-Mb window (Figure S1 in Additional file 1). Thus, microsatellite mutability and microsatellite
Table 1 Mutation rates investigated in the present study
Type | Measurement | Alignment used
Insertion rate | Insertions/bp | Human-orangutan-macaque
Deletion rate | Deletions/bp | Human-orangutan-macaque
Nucleotide substitution rate | Substitutions/bp | Human-orangutan
Mononucleotide microsatellite mutability | Mutability/bp | Human-orangutan
Mutation rates, which are used as input to PCA and as response set in CCA, are listed, along with the measurement unit and alignments used for their estimation.
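The PCA step described in the "Mutation rate co-variation" subsection (PCA on the correlation matrix of the four rates in Table 1, keeping components with eigenvalues greater than 1) can be illustrated with a minimal sketch. This is not the authors' Galaxy tool: the function names (`correlation_matrix`, `jacobi_eigen`, `pca_on_correlation`) are hypothetical, the per-window rates are synthetic stand-ins rather than the paper's data, and a small Jacobi eigen-solver is used only so the example runs with the Python standard library.

```python
import math
import random


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def jacobi_eigen(matrix, max_sweeps=100, tol=1e-18):
    """Eigen-decomposition of a small symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, eigenvectors) sorted by decreasing eigenvalue;
    eigenvectors[m] is the unit vector paired with eigenvalues[m].
    """
    k = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
        if off < tol:
            break
        for p in range(k - 1):
            for q in range(p + 1, k):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for i in range(k):  # rotate columns p and q of a and v
                    a[i][p], a[i][q] = c * a[i][p] - s * a[i][q], s * a[i][p] + c * a[i][q]
                    v[i][p], v[i][q] = c * v[i][p] - s * v[i][q], s * v[i][p] + c * v[i][q]
                for j in range(k):  # rotate rows p and q of a
                    a[p][j], a[q][j] = c * a[p][j] - s * a[q][j], s * a[p][j] + c * a[q][j]
    order = sorted(range(k), key=lambda m: a[m][m], reverse=True)
    return [a[m][m] for m in order], [[v[i][m] for i in range(k)] for m in order]


def pca_on_correlation(columns):
    """PCA on standardized variables; also counts components with eigenvalue > 1."""
    eigenvalues, eigenvectors = jacobi_eigen(correlation_matrix(columns))
    n_significant = sum(1 for lam in eigenvalues if lam > 1.0)
    return eigenvalues, eigenvectors, n_significant


# Synthetic stand-in for per-window rates (assumption, NOT the paper's data):
# three rates share a common factor, the fourth varies independently.
random.seed(0)
shared = [random.gauss(0, 1) for _ in range(500)]
ins_rate = [f + random.gauss(0, 0.5) for f in shared]
del_rate = [f + random.gauss(0, 0.5) for f in shared]
sub_rate = [f + random.gauss(0, 0.5) for f in shared]
ms_mutability = [random.gauss(0, 1) for _ in range(500)]

eigenvalues, eigenvectors, n_significant = pca_on_correlation(
    [ins_rate, del_rate, sub_rate, ms_mutability])
print("eigenvalues:", [round(lam, 3) for lam in eigenvalues])
print("PC1 eigenvector (INS, DEL, SUB, MS):", [round(x, 3) for x in eigenvectors[0]])
```

Because the input is a correlation matrix, the eigenvalues sum to the number of variables (four here), so each eigenvalue divided by four is the share of total variance carried by that component. In practice a numerical library routine would replace the hand-written eigen-solver.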
Table 2 Genomic features investigated in the present study
Feature | Measurement (per Mb) | Source
GC content | Percentage of G and C bases | 'GC Percent' track from the UCSC Genome Browser
CpG islands | Count | 'CpG island' track from the UCSC Genome Browser
Non-CG methyl-cytosines | Count | [16]
LINE | Count | 'RepeatMasker' track from the UCSC Genome Browser
SINE | Count | 'RepeatMasker' track from the UCSC Genome Browser
Nuclear lamina | Number of LaminB1 interaction sites with positive intensity | 'NKI LaminB1' track from the UCSC Genome Browser
Telomere | Distance in bp | 'Gap' track from the UCSC Genome Browser
Female recombination rate (1 Mb) | Centimorgan (cM) | 'Recomb rate' track from the UCSC Genome Browser
Male recombination rate (1 Mb) | Centimorgan (cM) | 'Recomb rate' track from the UCSC Genome Browser
Recombination rate (0.5 Mb and 0.1 Mb) | Centimorgan (cM) | [82]
SNP | Count | 'SNPs 129' track from the UCSC Genome Browser
Replication timing | Time through S-phase | [33]
Nucleosome-free regions | Coverage | [17]
Coding exons | Coverage | 'UCSC Genes' track from the UCSC Genome Browser
Conserved elements | Coverage | '28-way most conserved' track from the UCSC Genome Browser
Genomic features, used as predictors in CCA, are listed along with their measurement unit and source. LINE, long interspersed repetitive elements; SINE, short interspersed repetitive elements.
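To make concrete how a response set (the rates in Table 1) and a predictor set (the features in Table 2) enter CCA, the following is a minimal sketch that computes canonical correlations only. It is not the authors' implementation: all function names are hypothetical, the data are synthetic, and the route taken here (Cholesky whitening of each set followed by an eigen-decomposition of a small Gram matrix) is one standard textbook formulation chosen so the example needs only the Python standard library. It assumes both covariance matrices are positive definite.

```python
import math
import random


def center(columns):
    """Subtract the mean from each column."""
    centered = []
    for col in columns:
        mean = sum(col) / len(col)
        centered.append([v - mean for v in col])
    return centered


def cross_products(a_cols, b_cols):
    """Matrix of inner products between two lists of centered columns."""
    return [[sum(x * y for x, y in zip(a, b)) for b in b_cols] for a in a_cols]


def cholesky(m):
    """Lower-triangular L with L L^T = m (m symmetric positive definite)."""
    k = len(m)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            acc = m[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def forward_solve(low, rhs):
    """Solve low @ Z = rhs for Z (low is lower-triangular, rhs a list of rows)."""
    k, width = len(low), len(rhs[0])
    z = [[0.0] * width for _ in range(k)]
    for i in range(k):
        for c in range(width):
            acc = rhs[i][c] - sum(low[i][t] * z[t][c] for t in range(i))
            z[i][c] = acc / low[i][i]
    return z


def jacobi_eigenvalues(matrix, max_sweeps=100, tol=1e-20):
    """Eigenvalues of a small symmetric matrix (cyclic Jacobi), largest first."""
    k = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
        if off < tol:
            break
        for p in range(k - 1):
            for q in range(p + 1, k):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for i in range(k):  # rotate columns p and q
                    a[i][p], a[i][q] = c * a[i][p] - s * a[i][q], s * a[i][p] + c * a[i][q]
                for j in range(k):  # rotate rows p and q
                    a[p][j], a[q][j] = c * a[p][j] - s * a[q][j], s * a[p][j] + c * a[q][j]
    return sorted((a[m][m] for m in range(k)), reverse=True)


def canonical_correlations(responses, predictors):
    """Canonical correlations between two sets of columns, largest first."""
    y, x = center(responses), center(predictors)
    syy, sxx, syx = cross_products(y, y), cross_products(x, x), cross_products(y, x)
    whitened = forward_solve(cholesky(syy), syx)         # Ly^-1 Syx
    whitened_t = [list(row) for row in zip(*whitened)]   # its transpose
    m_t = forward_solve(cholesky(sxx), whitened_t)       # M^T, M = Ly^-1 Syx Lx^-T
    p = len(responses)
    # Eigenvalues of M M^T are the squared canonical correlations.
    gram = [[sum(row[i] * row[j] for row in m_t) for j in range(p)] for i in range(p)]
    return [math.sqrt(min(1.0, max(0.0, lam))) for lam in jacobi_eigenvalues(gram)]


# Synthetic illustration (assumption, not the paper's data): one response
# tracks the first predictor closely, the other is unrelated noise.
random.seed(1)
g1 = [random.gauss(0, 1) for _ in range(300)]
g2 = [random.gauss(0, 1) for _ in range(300)]
g3 = [random.gauss(0, 1) for _ in range(300)]
resp1 = [v + 0.1 * random.gauss(0, 1) for v in g1]
resp2 = [random.gauss(0, 1) for _ in range(300)]
cc = canonical_correlations([resp1, resp2], [g1, g2, g3])
print("canonical correlations:", [round(r, 3) for r in cc])
```

With one variable in each set the single canonical correlation reduces to the absolute Pearson correlation, which is a convenient sanity check. Loadings and significance tests are outside the scope of this sketch.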
birth/death rates appear to have different dynamics in the genome.

Non-linear relationships between certain mutation types (for example, substitutions and insertions [8]) have been observed by pair-wise comparisons in earlier studies. Investigating non-linear associations (for example, one rate first increasing but then decreasing as another increases; one rate exhibiting more than proportional growth as another increases; one rate 'leveling off' in its growth as another increases) is of interest because
[Figure 1: two biplot panels, "AR PCA components (1-Mb; human-orangutan)" and "NCNR PCA components (1-Mb; human-orangutan)"; x-axis: Component 1; y-axis: Component 2; loading vectors labeled INS, DEL, SUB and MS]
Figure 1 Biplots of the first two PCA components for our four mutation rates, as obtained from the AR and NCNR subgenomes along the human-orangutan branch for 1-Mb windows. Black dots represent projected observations (that is, projected windows). The vectors labeled INS, DEL, SUB, and MS depict loadings for insertion rate, deletion rate, substitution rate, and mononucleotide microsatellite mutability, respectively. See Tables S1 and S2 in Additional file 1 for summary statistics.
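Beyond the linear PCA summarized in Figure 1, this subsection also compares linear and kernel PCA scores by regressing the first kernel principal component on the first two linear components and treating poorly fitted windows as outliers. A minimal sketch of that regression step follows. It assumes the three score vectors have already been computed, uses made-up scores, and flags outliers with a residual cutoff of three residual standard errors; that cutoff and the function names are assumptions for illustration, since the actual criterion is given in Materials and methods, which this excerpt does not include.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def regress_and_flag(kpc1, pc1, pc2, threshold=3.0):
    """Least-squares fit kpc1 ~ 1 + pc1 + pc2; return (R^2, outlier indices).

    A window is flagged when its absolute residual exceeds `threshold`
    residual standard errors (an assumed rule, for illustration only).
    """
    rows = [[1.0, u, w] for u, w in zip(pc1, pc2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, kpc1)) for i in range(3)]
    beta = solve_linear(xtx, xty)  # normal equations
    residuals = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, kpc1)]
    mean_y = sum(kpc1) / len(kpc1)
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in kpc1)
    r_squared = 1.0 - ss_res / ss_tot
    sigma = math.sqrt(ss_res / (len(kpc1) - 3))
    outliers = [i for i, e in enumerate(residuals) if abs(e) > threshold * sigma]
    return r_squared, outliers


# Made-up scores for 30 windows (assumption): the "kernel" score is an exact
# linear function of the two linear scores, except for window 7.
pc1_scores = [i / 10 for i in range(30)]
pc2_scores = [math.sin(i) for i in range(30)]
kpc1_scores = [1 + 2 * u - w for u, w in zip(pc1_scores, pc2_scores)]
kpc1_scores[7] += 5.0

r_squared, outliers = regress_and_flag(kpc1_scores, pc1_scores, pc2_scores)
print("R^2 =", round(r_squared, 3), "outlier windows:", outliers)
```

Mapping the flagged window indices back to chromosome coordinates is what allows such outliers to be located along the genome.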
they can be suggestive of connections and constraints linking different mutation types. However, questions concerning the strength of such non-linearities, especially when considered as a multiple (as opposed to pair-wise) phenomenon, and whether they tend to occur in particular genomic locations or contexts, have never been addressed directly. To investigate the existence of non-linear associations among multiple mutation rates, we applied kPCA, a variant of PCA that utilizes kernel mapping (see Materials and methods) to compute principal components in a high dimensional space non-linearly related to the original space [30]. While results (Figures S2 and S3 in Additional file 1) were similar to the PCA results described above (with the first principal component dominated by insertion, deletion, and substitution rates, and the second dominated by microsatellite mutability), the scores produced by linear PCA and kPCA for 1-Mb windows, although associated, were not in complete agreement (Figure S4 in Additional file 1). Comparing linear and non-linear PCA scores provides a means to identify genomic regions where neutral mutation rates are co-varying differently from the rest of the genome. We regressed the strongest 'non-linear signal' (scores from the first kernel principal component) onto the 'linear signals' that emerged as significant in the data (scores from the first and second principal components; Table S3 in Additional file 1). The R² value was 76%, implying that, for the most part, the non-linear signal could be recapitulated by the linear signals. The windows where the non-linear signal was poorly recapitulated by the linear signals were identified as outliers of the regression (see Materials and methods), and a vast majority of them were found to be located either on chromosome X (55% for AR, 64% for NCNR sequences) or at subtelomeric regions of autosomes (Figure 2A; 58% and 45% of autosomal windows in AR and NCNR sequences, respectively, were located within ≤15% of the chromosomal length from the telomeres; see also Figures S5A and S6A in Additional file 1).

Mutation rate co-variation and genomic landscape
Linking mutation rates and their co-variation to the genomic landscape is crucial for understanding its effects on mutagenesis and thus drawing inferences on potential causal mechanisms. To achieve this, we employed CCA. This is a multivariate technique that, given two sets of variables (for example, responses and predictors), extracts pairs of components (each comprising a linear combination in the response space, and a linear combination in the predictor space) that are maximally correlated to one another - like PCA, subsequent pairs have orthogonal response components, and orthogonal predictor components [31]. This provides a way of simultaneously associating multiple mutation rates (responses, Table 1) to multiple genomic features (predictors, Table 2).

We used the four mutation rates introduced above as our response set, and formed a predictor set that included genomic features shown to associate with mutation rates in previous studies (GC content, recombination rates, number of CpG islands, proximity to telomere, replication timing, number of long interspersed repetitive elements (LINEs), number of short interspersed repetitive elements (SINEs), density of SNPs, density of coding exons and density of conserved elements) [2,5,6,8-10], as well as features not formerly considered (number of nuclear lamina binding sites, abundance of non-CG methyl-cytosines, and density of nucleosome-free regions; Table 2). Some of these genomic features are correlated (for example, GC content and replication timing [32,33]), and one can investigate their co-variation structure through PCA as was done for the mutation rates (PCA results for genomic features are reported in Figure S7 and Tables S4 and S5 in Additional file 1). However, our focus here is not on identifying leading components of the local variation in genomic landscape, but rather leading components of its effects on mutation rates - to this end, extracting CCA components is more effective and easier to interpret than correlating principal components extracted separately for mutation rates and genomic features.

CCA yielded four canonical component pairs in the NCNR subgenome and four in the AR subgenome. The correlations observed for these pairs were 0.6955, 0.5043, 0.3906 and 0.1043 for the NCNR subgenome, and 0.7338, 0.5336, 0.3287 and 0.0534 for the AR subgenome. Based on P-values from Rao's F approximation test [34] (see Materials and methods), all four NCNR pairs and the first three AR pairs were significant (P-values < 2.2e-16, < 2.2e-16, < 2.2e-16, and 0.0116 for NCNR, and < 2e-16, < 2e-16, < 2e-16, and 0.7637 for AR; Table S6 in Additional file 1). Remarkably, the first three AR and NCNR response components described very similar patterns (although differing in order; see below). Loadings, which capture the correlations between canonical components belonging to each pair and the rates (in the response space) or the genomic features (in the predictor space), were then used for interpretation.

The first AR response component and the second NCNR response component were very similar to one another (and similar to the first principal component); they showed strong and concordant loadings for insertion rates, deletion rates and substitution rates (Figure 3). Thus, these components render a direction of strong co-variation for indel and substitution rates. The corresponding predictor components in both subgenomes showed strong loadings for GC content, number of CpG
[Figure 2: three panels plotting windows along chromosomes 1-22 and X: (a) Mapping PCA signals on the genome; (b) Mapping CCA response-space signals on the genome; (c) Mapping CCA predictor-space signals on the genome. x-axis: Chromosome; legend (window type): linearity, non-linearity, centromere]
Figure 2 Genome-wide locations of windows driving non-linear signals in the data. (a-c) Black circles denote windows without marked
non-linearity. Green and blue circles denote windows displaying mutation rate non-linearity in PCA (a) and CCA in the response space (b). Red
circles denote windows displaying genomic feature non-linearity in CCA in the predictor space (c). Yellow triangles represent the location of the
centromeres on each of the chromosomes.
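The proximity criteria the text uses to summarize these maps (windows within 10% of the chromosomal length from a telomere, or within 15% of the chromosomal length from the centromere) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of the window midpoint, the use of chromosome ends as stand-ins for telomeres, and the treatment of the centromere as a single coordinate are all assumptions made here, as are the example coordinates.

```python
def near_telomere(start, end, chrom_len, max_frac=0.10):
    """True if the window midpoint lies within max_frac of the chromosome
    length from either chromosome end (assumed stand-in for the telomeres)."""
    mid = (start + end) / 2
    return min(mid, chrom_len - mid) <= max_frac * chrom_len


def near_centromere(start, end, chrom_len, centromere_pos, max_frac=0.15):
    """True if the window midpoint lies within max_frac of the chromosome
    length from the centromere (assumed to be a single coordinate)."""
    mid = (start + end) / 2
    return abs(mid - centromere_pos) <= max_frac * chrom_len


def fraction_flagged(flags):
    """Share of windows for which the flag is True (0.0 for an empty list)."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical 1-Mb windows on a 100-Mb chromosome with its centromere at 45 Mb.
CHROM_LEN = 100_000_000
CENTROMERE = 45_000_000
windows = [(2_000_000, 3_000_000), (44_000_000, 45_000_000),
           (70_000_000, 71_000_000), (95_000_000, 96_000_000)]

telomeric = [near_telomere(s, e, CHROM_LEN) for s, e in windows]
centromeric = [near_centromere(s, e, CHROM_LEN, CENTROMERE) for s, e in windows]
```

Applying `fraction_flagged` to the flags of the outlying windows only gives a percentage of the kind quoted in the text; the thresholds are passed as parameters so that the 10% and 15% cutoffs can be varied.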
islands, non-CpG methylated sites, SINEs and density of coding exons (all displaying a positive association with the responses), as well as number of nuclear lamina binding sites and density of nucleosome-free regions (both negatively associated with the responses). Therefore, the first AR and second NCNR canonical component pairs suggest that nucleosome-free regions with many nuclear lamina binding sites, low GC content, fewer SINEs and fewer coding exons are less prone to insertions, deletions and nucleotide substitutions (Figure 3). Male recombination rate (positively associated with the responses), as well as distance from telomere and density of conserved elements (both negatively associated with the responses) appear alongside all of the above-mentioned genomic features as strong contributors to the second NCNR predictor component.
The second AR response component and the first NCNR response component were similar to one another, and both had dominant nucleotide substitution rate loadings (Figure 3). Thus, these components render a direction of strong nucleotide substitution rate variation. The corresponding predictor components in both subgenomes had strong positive loadings for recombination rates, and strong negative loadings for distance to telomere. The predictor component in the NCNR subgenome also had a strong positive loading for GC content.
The third AR and NCNR response components showed strong loadings for deletion rates (Figure 3). In addition, the NCNR component also displayed a strong loading for insertion rates. Thus, these components render a direction of deletion rate variation in both subgenomes, additionally depicting a negative co-variation between indel rates in the NCNR subgenome. In both subgenomes, the corresponding predictor component had negative loadings for GC content, female recombination rate, SINE counts, and density of conserved elements. Additionally,
[Figure 2 axis: position along the chromosome (0 to 2.5e+08). Figure 3 plot labels: GC, CpG, msMut, nCGm, LINE, SINE, sub, NLp, telo.]
Ananda et al. Genome Biology 2011, 12:R27 Page 7 of 18
http://genomebiology.com/2011/12/3/R27
[Figure 3 panels, each showing predictors (X) and responses (Y): AR CV-1, AR CV-2, AR CV-3; NCNR CV-1, NCNR CV-2, NCNR CV-3, NCNR CV-4.]
Figure 3 Helioplots for CCA performed on the AR and NCNR sub-genomes along the human-orangutan branch for 1-Mb windows. The
labels on the plots are as follows: CV, canonical variate; GC, GC content; CpG, number of CpG islands; nCGm, number of non-CpG
methylcytosines; LINE, number of LINE elements; SINE, number of SINE elements; NLp, number of nuclear lamina associated regions; telo, distance to
the telomere; fRec and mRec, female and male recombination rates; SNPd, SNP density; RepT, replication time; nucFree, density of
nucleosome-free regions; cExon, coverage by coding exons; mostCons, coverage by most conserved elements. Red bars indicate positive loadings, and blue
bars negative loadings. See Table S6 in Additional file 1 for summary statistics.
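To make the quantities plotted in these helioplots more concrete, the sketch below computes canonical correlations, canonical variates, and loadings for a predictor matrix X and a response matrix Y. It is a generic linear CCA written with NumPy (QR decomposition followed by an SVD), not the authors' Galaxy tool. The function names, the definition of a loading as the Pearson correlation between a variable and a canonical variate, and the simulated data are assumptions made for illustration.

```python
import numpy as np


def cca(X, Y):
    """Linear CCA via QR + SVD. Returns canonical correlations and the
    canonical variates (scores) for X and Y, strongest pair first."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)              # center each predictor
    Yc = Y - Y.mean(axis=0)              # center each response
    Qx, Rx = np.linalg.qr(Xc)            # orthonormal basis of predictor space
    Qy, Ry = np.linalg.qr(Yc)            # orthonormal basis of response space
    U, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    A = np.linalg.solve(Rx, U)           # weights for the predictors
    B = np.linalg.solve(Ry, Vt.T)        # weights for the responses
    return np.clip(s, 0.0, 1.0), Xc @ A, Yc @ B


def loadings(V, scores):
    """Pearson correlation of every variable in V with every canonical variate."""
    V = np.asarray(V, dtype=float)
    Vz = (V - V.mean(axis=0)) / V.std(axis=0)
    Sz = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return Vz.T @ Sz / len(V)


# Simulated example: 3 hypothetical predictors, 2 hypothetical responses.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
Y = np.column_stack([
    X[:, 0] + 0.1 * rng.normal(size=400),   # response driven by predictor 0
    rng.normal(size=400),                   # response unrelated to X
])
can_corr, x_scores, y_scores = cca(X, Y)
x_loadings = loadings(X, x_scores)          # rows: predictors, columns: variates
```

In this simulated setting the first canonical pair captures the link between predictor 0 and the first response, so predictor 0 carries the dominant loading on the first predictor variate. The sign of a canonical variate is arbitrary, which is why only the relative signs of loadings within a pair are interpretable.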
in the NCNR subgenome, the third predictor component had sizeable positive loadings for density of nucleosome-free regions, and negative loadings for density of coding exons.
Finally, although not significant in the AR subgenome, the fourth response components in both the AR and NCNR subgenomes had dominant microsatellite mutability loadings (Figure 3). Thus, these components render a direction of strong microsatellite mutation rate variation. The marginal correlations between these and the corresponding predictor components (0.104 and still significant in NCNR, 0.053 and non-significant in AR), and the smaller number of predictors with sizeable loadings, confirm a lesser role of genome landscape features in explaining microsatellite mutability [9]. Nevertheless, it is important to note a positive association between microsatellite mutability and the density of CpG islands, and a negative association between microsatellite mutability and counts of methylated non-CpG sites.
Non-linear relationships between mutation rates and genomic landscape variables have been noted in previous studies, and usually investigated through pair-wise comparisons (for example, biphasic effect of GC content on substitution rates [10]). Investigating non-linear associations between mutations and genomic context can provide crucial insights into mutagenesis mechanisms. Here, we are interested in detecting and interpreting non-linear signals linking multiple mutation rates to multiple genomic features, and in locating these signals along the genome. We applied kCCA, a variant of CCA that uses kernel mapping to compute canonical components in high dimensional spaces non-linearly related to response and predictor spaces [35]. Plotting linear CCA and kCCA scores against one another (Figure S8 in Additional file 1) suggested non-linearity in the association of mutation rates to the genomic landscape, comprising a small non-linearity in mutation rates, and a more noticeable one in genomic features. To further explore this, we regressed the strongest ‘non-linear signals’ in response and predictor space (scores from the first kernel CCA response and predictor components) onto significant ‘linear signals’ (scores from significant linear CCA response and predictor components; Table S7 in Additional file 1). For the response space (mutation rates), the dominant
non-linear signal was almost entirely recapitulated by significant linear signals (R² higher than 99% for both AR and NCNR sequences). However, for the predictor space (genomic features), significant linear signals could account for merely 1% of the variance of the dominant non-linear signal. Thus, when considering signals associating mutation rates and genomic landscape features, non-linearities displayed by the latter are much stronger than those displayed by the former.
We again used outliers from the regressions to identify genomic locations ‘driving’ non-linearity in mutation rates and genomic features - that is, windows for which non-linear signals were poorly recapitulated by linear ones (see Materials and methods). In the case of the responses, non-linearity was minimal (R² above 99%; Table S7 in Additional file 1), but, interestingly, results paralleled those obtained with PCA signals. The majority of outlying loci were on chromosome X (64% for AR - Figure S5B in Additional file 1; 52% for NCNR sequences - Figure S6B in Additional file 1) or near autosomal telomeres (Figure 2B; 42% and 62% of autosomal windows in AR and NCNR sequences, respectively, were located within a distance ≤10% of the chromosomal length from the telomeres; see also Figures S5B and S6B in Additional file 1). These are regions of the genome where mutation rates are sizably lower (chromosome X) or higher (telomeres) than autosomal averages. In the case of the genomic features, the non-linearity was very marked (R² of merely 1%; Table S7 in Additional file 1), and a vast majority of the loci driving this strong non-linearity were concentrated around the centromeres of large chromosomes (Figure 2C; 49% and 51% of such windows in AR and NCNR sequences, respectively, were within a distance of ≤15% of the chromosomal length from the centromere; see also Figures S5C and S6C in Additional file 1).
Consistency across genomic scales and phylogenetic distances
To verify whether our findings could be reproduced over different genomic scales and phylogenetic distances, in addition to the 1-Mb windows and human-orangutan comparison investigated above, we repeated our analyses considering 0.5-Mb and 0.1-Mb genomic windows as well as human-macaque and mouse-rat comparisons. Interestingly, the mutation rate co-variation structure remained largely consistent across all three genomic scales and all three phylogenetic distances (Figure 1; Figures S9 to S17 in Additional file 1). Nevertheless, we did observe some differences. For instance, while microsatellite mutability varied orthogonally to indel and substitution rates at the 1-Mb scale, a co-variation (at best moderate) linking microsatellite mutability to the three rates was shown by the PCA at smaller scales (0.5 Mb and 0.1 Mb). CCA results also captured this co-variation, with SINE counts and GC content being the major contributors (both negative; Figures S13 to S16 in Additional file 1).
Considering multiple window sizes also provided insights into the scale at which various genomic features affect the structure of mutation rate co-variation. For instance, replication timing, SNP density and density of nucleosome-free regions become significant predictors of microsatellite mutability at smaller scales (Figures S13 to S17 in Additional file 1). These associations are noted here for the first time, as previous studies only considered microsatellite mutability at scales of 1 Mb or larger [9]. Further, the association of mutation rates with genomic features showed some differences between the rodent branch and the two primate branches (Figure S17 in Additional file 1). For instance, the effect of recombination on mutation rates was found to be substantial in the primate comparisons, and barely marginal in the rodent comparison. Such differences are expected given the fact that primates and rodents are known to differ in both genomic landscape characteristics and mutation rates [36].
Toolset in Galaxy
Comparative genomic studies like ours often process enormous amounts of sequence and alignment data, the storing and handling of which poses big challenges. Having data and software tools on a single platform can substantially facilitate genome-wide analyses and improve reproducibility of results (see, for instance, a workflow for the present study in Figure 4). To disseminate the software developed for our project to the research community, we used Galaxy [23] - a free, open-source genomics portal with a consistent and easy-to-use interface capable of handling vast amounts of data. Galaxy stores all sequences and alignments locally, and provides a multitude of software tools organized in different sections. The ones we developed (Table 3) are available under the ‘Regional variation’, ‘Multiple regression’, and ‘Multivariate analysis’ sections, and include software for alignment data preprocessing, identification of mutations and computation of rates, aggregation of genomic variables, and statistical analyses (more details are provided in the Materials and methods).
Discussion
In this study we investigate regional co-variation among mutation rates in largely neutrally evolving parts of the human genome (the AR and NCNR subgenomes), and its association with features of the genomic landscape. For the first time, the structure and causes of mutation rate co-variation were studied via a multivariate approach considering several mutation types and a large
Figure 4 Galaxy workflow developed for estimating mutation rates and computing principal components. A similar workflow (not
shown) was implemented to compute canonical correlation component pairs. MAF, multiple alignment format.
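As a rough illustration of the principal-components step in this workflow, the sketch below runs PCA on a windows-by-rates matrix. It is not the Galaxy tool itself: the function name `pca_on_rates`, the choice to standardize each rate (that is, PCA on the correlation matrix), and the simulated data are assumptions made for illustration.

```python
import numpy as np


def pca_on_rates(rates):
    """PCA of a (windows x mutation rates) matrix after standardizing columns.

    Returns loadings (one column per component), scores, and the share of
    variance explained by each component, strongest component first."""
    X = np.asarray(rates, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-score each rate
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)   # principal axes in Vt
    explained = s**2 / np.sum(s**2)
    loadings = Vt.T
    scores = Z @ loadings
    return loadings, scores, explained


# Simulated rates for 500 windows: three rates share a common signal,
# the fourth (standing in for microsatellite mutability) varies independently.
rng = np.random.default_rng(0)
shared = rng.normal(size=500)
rates = np.column_stack([
    shared + 0.3 * rng.normal(size=500),   # substitution rate
    shared + 0.3 * rng.normal(size=500),   # insertion rate
    shared + 0.3 * rng.normal(size=500),   # deletion rate
    rng.normal(size=500),                  # microsatellite mutability
])
loadings, scores, explained = pca_on_rates(rates)
```

With data built this way, the first component loads on the three co-varying rates with a common sign, while the independent fourth rate contributes little to it. Standardizing the columns keeps rates measured on different scales from dominating the components merely through their units.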
number of genomic features jointly. Notably, the similarity in results obtained for the AR and NCNR subgenomes lends support to the notion of a common denominator shaping mutagenesis in both repetitive and unique parts of the genome.
Association of insertion, deletion and substitution rates, and its causes
As indicated by the first principal component of our PCA analysis, the strongest co-variation in the genome is among insertion, deletion, and substitution rates. While this association has been suggested by previous pair-wise analyses [8,37], here we are able to speculate about its causes using the CCA results. The first AR and second NCNR canonical component pairs (Figure 3) suggest that the co-variation of indel and substitution rates is shaped by a common set of genomic features. Some of these features have been found to affect rates of individual mutation types in previous studies; in particular, GC content, number of CpG islands and SINEs,
Table 3 ‘Regional variation’, ‘multiple regression’ and ‘multivariate analysis’ toolsets in Galaxy
Data pre-processing tools
  Make windows: To partition genome into windows of a user-specified size
  Feature coverage: To apportion various genomic features in genomic windows
  Filter nucleotides: To identify and mask low-quality nucleotides from alignments based on a quality score cutoff specified by the user
  Mask CpG/non-CpG sites: To identify and mask CpG/non-CpG-containing sites from alignments
Tools for identifying mutations and computing their rates
  Fetch Indels: To identify insertions and deletions from three-way alignments using a user-specified outgroup
  Estimate indel rates: To estimate indel rates by aggregating insertions and deletions in genomic regions specified by the user
  Fetch substitutions: To identify nucleotide substitutions from pair-wise alignments
  Estimate substitution rates: To estimate substitution rate according to Jukes-Cantor JC69 model
  Extract orthologous microsatellites: To fetch microsatellites using SPUTNIK, and detect orthologous repeats
  Estimate microsatellite mutability: To estimate microsatellite mutability by grouping (and sub-grouping) repeats based on their size, unit and motif
Multiple regression tools
  Perform linear regression: To construct a linear regression model using the user-selected predictors and response variables
  Best-subsets regression: To examine all of the linear regression models that can be created from all possible combinations of the predictor variables
  Compute RCVE: To compute RCVE (relative contribution to variance) for all possible variable subsets
Multivariate analysis tools
  PCA: To perform PCA on a set of variables
  CCA: To perform CCA on two sets of variables
  Kernel PCA: To perform kernel PCA on a set of variables, using a user-specified kernel
  Kernel CCA: To perform kernel CCA on two sets of variables, using a user-specified kernel
RCVE, relative contribution to variability explained.
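Table 3 states that substitution rates are estimated under the Jukes-Cantor JC69 model. The standard JC69 correction converts the observed proportion p of differing sites into an expected number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below applies that formula to a pair of aligned sequences; it is a minimal illustration rather than the Galaxy tool, and the function name and the rule of skipping any site that is not A, C, G or T in both sequences are assumptions made here.

```python
import math


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """JC69 substitutions per site between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: gaps, masked and ambiguous sites are not compared.
        if a in valid and b in valid:
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = mismatches / compared            # observed proportion of differences
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# One difference in ten comparable sites gives p = 0.1.
example_distance = jc69_distance("ACGTACGTAC", "ACGTACGTAA")
```

The corrected distance is slightly larger than the raw proportion p because JC69 allows for multiple substitutions at the same site; the correction grows as p approaches its saturation value of 0.75.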
and density of coding exons have been shown to associate positively with indel rate and substitution rate variation [2,5,8,10]. Other genomic features are investigated here for the first time; we show that non-CpG methyl-cytosines, nuclear lamina binding sites and nucleosome-free regions are significant contributors to mutation rate co-variation, suggesting a role for non-CpG methylation, nuclear lamina association, and chromatin structure in mutagenesis.
The positive effect of GC content, density of coding exons and non-CpG methyl-cytosines on mutation rates underlines the role of methylation in creating mutation hotspots [38,39], while the negative effect of number of nuclear lamina binding sites and density of nucleosome-free regions suggests that regions associated with the lamina and/or having compact chromatin structures are less prone to mutations. Distance from telomere appears alongside all of the above-mentioned genomic features as a strong contributor to the second NCNR predictor canonical component, with a negative association with the responses, which emphasizes peculiar mutagenic mechanisms acting near telomeres [6,8,10,40]. Notably, the number of nuclear lamina binding sites is positively associated with the distance to telomere in this component; in agreement with another study [15], this indicates that lamina binding regions might be less mutable when they are located at a distance from the telomeres.
The first AR and second NCNR canonical component pairs suggest that genomic regions with many nuclear lamina binding sites, a high density of nucleosome-free regions, low GC content, low exon density, and fewer SINEs are less prone to insertions, deletions and nucleotide substitutions (Figure 3). Regions associated with nuclear lamina constitute a strongly repressive chromatin environment [15], low-GC and gene-poor regions are known to possess compact chromatin structure and higher concentration of indels [41-43], and the preferential retention of SINEs in GC-rich regions has also been linked to the chromatin structure (SINE integration may be facilitated by chromatin decondensation in GC-rich regions) [44]. Further, these component pairs show the density of nucleosome-free regions to be positively associated with nuclear lamina counts, and negatively associated with GC content, density of CpG islands and coding exons. In all, the picture is one of nucleosome-free regions characterized by a compact chromatin structure.
In summary, the first AR and second NCNR CCA component pairs suggest that methylation and chromatin structure may have a dominant role in the strong co-variation of indel rates and substitution rate - typifying an inverse relationship between compact chromatin structure and proneness of DNA to indels and substitutions. This can perhaps be attributed to the low rate of lesion formation in compact chromatin regions [45] and to the differences in repair mechanisms between different chromatin environments [46].
The third AR and NCNR CCA component pairs depict deletion rate variation, with the third NCNR CCA component pairs also indicating a negative association between insertion and deletion rates (Figure 3). The corresponding predictor components have negative loadings for GC content, SINE counts and density of conserved elements (the latter only for the AR subgenome). GC-poor regions are known to be late-replicating [32,33] and more prone to replication errors [47], which accounts for the elevated mutation rates; our observation therefore supports a role of replication in generating deletions. Furthermore, we confirm the negative association between SINE counts and deletion rates observed previously [8,21]. The positive association of GC content and density of coding exons with insertion rates, and their negative association with deletion rates, point to genomic regions that tolerate more insertions than deletions; such regions were indeed found to be present in GC-rich, gene-rich isochores in Venter’s genome by a recent study [43]. The negative association of the density of conserved elements with deletion rates reiterates a previous observation about conserved and functional regions being depleted of small deletions [8].
A set of features comprising male and female recombination rates and distance to telomere was identified as affecting substitution rates through the second AR and the first NCNR CCA component pair (Figure 3). These again reflect the role of recombination in contributing to substitution rate variation [1,2,6,10,48], and reiterate the presence of mutagenic mechanisms acting near telomeres that can lead to elevated nucleotide substitution rates [10]. Alternatively, or additionally, telomeres might possess fixation biases, for example, due to biased gene conversion [49]. The strong positive loading for GC content in the NCNR subgenome is a possible consequence of recombination-associated mismatch repair, which is GC-biased in mammals [48,50,51].
Microsatellite mutability and its genomic determinants
Our results suggest that microsatellite mutability is driven by different factors than indel and substitution rates. Indeed, microsatellite mutability was the only significant contributor to the second PCA component, indicating a variation largely orthogonal to that of the other three mutation rates. No association between microsatellite mutability (computed here for mononucleotide microsatellites only) and substitution rate was found also in another recent study [9]. The presence of a negative correlation between microsatellite density and substitution rates (Figure S1 in Additional file 1) confirms the findings of Zhu and colleagues [52], and