Statistical consultation with the R software
Why Muto & Osawa (1987) plot is not good?
J.R. Lobry
14 March 2006
Professor Alexander N. Gorban asked me by e-mail on Sun, 23 Oct
2005 14:00:17 +0100 for the following legitimate clarification:
About your third remark: "As pointed by Sueoka (in 1988
IIRC) the Muto & Osawa representation is very bad because
the x-axis and y-axis variables are not independent..."
From my point of view, here we meet one of the standard
misunderstanding of the "independency" notion. Because
we touch this point third time, could you, please, give the
definition of independency that you with Sueoka use here.
In the standard sense all the variables we are talking about
are not independent.
I will try here to make this important point clear in a reproducible
way.
Table of contents
1 Muto & Osawa (1987) plot
2 Sueoka's (1988) critique
3 Simulation
3.1 G+C content in coding sequences
3.2 G+C content in intergenic spaces
3.3 CDS versus intergenic spaces
3.4 Genome G+C content
4 The part/whole problem
5 Comments by Professor Noboru Sueoka
6 Comments by Professor Alexander N. Gorban
6.1 First attempt for a definition
6.2 Second attempt for a definition
6.3 a priori information
6.4 External argument for using G+C content
7 Extra e-mail
References
R software version 2.2.0, 2005-10-06, qre, compiled on 2006-03-14
Maintenance: S. Penel, URL: http://pbil.univ-lyon1.fr/R/querep/qre.pdf
1 Muto & Osawa (1987) plot
The following figure is a screen copy of figure 2 from [Muto and Osawa,
1987].
This is a very famous plot in the field of molecular evolution. According
to the Science Citation Index (31-OCT-2005), the paper itself has been cited
241 times, which is a very high citation rate for this field. Moreover, the figure
has been reproduced, or adapted, on page 222 of [Li and Graur, 1991] and on page
1414 of [Graur and Li, 2000] (1).
(1) To be completed; the figure has surely been reproduced in many more places.
2 Sueoka's (1988) critique
One year later [Sueoka, 1988], Noboru Sueoka wrote:
So the question is why total G+C is not an ideal variable.
3 Simulation
First of all, let's fix the seed of the random number generator used in
R, just to allow for reproducibility:
set.seed(1071966)
Let n denote the total number of species under study:
(n <- 500)
[1] 500
So, in the following simulations we have 500 species.
3.1 G+C content in coding sequences
Now, suppose that cds denotes the G+C content in coding sequences. We
draw it here at random from a beta distribution:
cds <- rbeta(n = n, shape1 = 2, shape2 = 2)
hist(cds, col = grey(0.8), xlab = "G+C content", main = paste("Distribution of G+C content\n",
"in the coding sequence of", n, "species"))
3.2 G+C content in intergenic spaces
Now, suppose that itg denotes the G+C content in intergenic spaces. We
again draw it at random from a beta distribution, independently
of the previous sampling for the coding sequences:
itg <- rbeta(n = n, shape1 = 2, shape2 = 2)
hist(itg, col = grey(0.8), xlab = "G+C content", main = paste("Distribution of G+C content\n",
"in the intergenic spaces of", n, "species"))
3.3 CDS versus intergenic spaces
Let’s check that the G+C content in CDS is independent from the G+C
content in intergenic spaces :
plot(x = cds, y = itg, xlab = "G+C content in coding sequences",
ylab = "G+C content in intergenic spaces", las = 1, main = "G+C content in CDS and intergenic spaces")
This seems to be OK, at least the random number generator is not too bad
here. The G+C content in CDS is independent of the G+C content in intergenic
space: the knowledge of one does not help much to predict the other one. If
this were genuine data we would perhaps do something like this:
cor.test(itg, cds)
Pearson's product-moment correlation
data: itg and cds
t = 1.9357, df = 498, p-value = 0.05347
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.001283992 0.172797471
sample estimates:
cor
0.08641633
to say that, at a critical level of 5%, the experimental data do not allow us to
reject the null hypothesis that the linear correlation coefficient is equal to zero.
3.4 Genome G+C content
In bacterial chromosomes, about 80% of space is devoted to coding sequences
and the remaining 20% to intergenic spaces. Let gen denote the genome G+C
content in our simulation:
gen <- 0.8 * cds + 0.2 * itg
hist(gen, col = grey(0.8), xlab = "G+C content", xlim = c(0, 1),
main = paste("Distribution of G+C content\n", "in the genome of",
n, "species"))
Now, let’s use the genome G+C content as a predictive variable, as in Muto
& Osawa plots :
plot(x = gen, y = cds, xlab = "Genome G+C content", ylab = "G+C content in CDS",
las = 1, main = "CDS versus genome G+C content")
What a nice correlation! Let’s quantify this :
cor(gen, cds)
[1] 0.9714057
cor(gen, cds)^2
[1] 0.943629
This means that about 94.4% of the variability in CDS is accounted for by
the variability in the genome. Is it significant?
cor.test(gen, cds)
Pearson's product-moment correlation
data: gen and cds
t = 91.3035, df = 498, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9660022 0.9759609
sample estimates:
cor
0.9714057
Yes, at a critical level of 5%, experimental data are clearly in contradiction
with the null hypothesis that the linear correlation coefficient is equal to zero.
If you consider the p-value here, we have a very highly significant result.
The result is statistically significant but biologically meaningless: all it
shows is that CDS contribute heavily to the genomic G+C content.
More systematically, we can explore the effect of the proportion of CDS on the
squared linear correlation coefficient:
npoints <- 200
props <- seq(from = 0, to = 1, length = npoints)
rs <- sapply(props, function(x) cor(x * cds + (1 - x) * itg, cds)^2)
plot(props, rs, xlab = "Proportion of CDS in genomes", las = 1,
ylab = expression(r^2), type = "l", lwd = 2, main = expression(paste("Influence of the proportion of CDS on ",
r^2)))
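The shape of this curve does not depend on the particular random draw. As a sketch, assume (an idealization that is not stated in the text) that cds and itg are uncorrelated and share the same variance \sigma^2, which is what their identical beta distributions imply in expectation. Writing p for the proportion of CDS, the expected squared correlation follows directly:

```latex
\begin{align*}
\mathrm{gen} &= p\,\mathrm{cds} + (1-p)\,\mathrm{itg} \\
\operatorname{Cov}(\mathrm{gen},\mathrm{cds}) &= p\,\sigma^2 \\
\operatorname{Var}(\mathrm{gen}) &= \bigl(p^2 + (1-p)^2\bigr)\,\sigma^2 \\
r^2(p) &= \frac{\operatorname{Cov}(\mathrm{gen},\mathrm{cds})^2}
               {\operatorname{Var}(\mathrm{gen})\,\operatorname{Var}(\mathrm{cds})}
        = \frac{p^2}{p^2 + (1-p)^2}
\end{align*}
```

For p = 0.8 this gives r^2 = 0.64/0.68, about 0.941, close to the simulated value of 0.9436 obtained earlier; the small gap is consistent with sampling fluctuations, for instance the nonzero sample correlation between cds and itg.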
And just for fun, here is the effect of the proportion of CDS on the result of
testing the null hypothesis that the linear correlation coefficient is equal to
zero:
props <- seq(from = 0, to = 1, length = npoints)
pvals <- sapply(props, function(x) cor.test(x * cds + (1 - x) *
itg, cds)$p.value)
plot(props, pvals, xlab = "Proportion of CDS in genomes", las = 1,
ylab = "p-value", type = "l", lwd = 2, main = "Influence of the proportion of CDS on p-values")
abline(h = 0.05, col = "red")
text(0.5, 0.07, expression(alpha == 0.05), col = "red")
This means that with our simulated dataset, as soon as the proportion
of CDS in genomes is greater than about 10%, we have to reject the null
hypothesis. The two variables are not independent, but this is just by construction.
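The order of magnitude of this threshold can also be approximated without reading it off the plot. The sketch below is an idealization: it assumes that cds and itg are exactly uncorrelated with equal variances, so that the expected correlation between cds and the genome value is p / sqrt(p^2 + (1 - p)^2) for a CDS proportion p. The names t_crit, r_crit and p_crit are introduced here for illustration only.

```r
# Critical value of |r| for a two-sided test at alpha = 0.05 with n species.
# cor.test() relies on t = r * sqrt(df / (1 - r^2)) with df = n - 2,
# which we invert to get the smallest significant |r|.
n <- 500
alpha <- 0.05
df <- n - 2
t_crit <- qt(1 - alpha / 2, df = df)
r_crit <- t_crit / sqrt(t_crit^2 + df)

# Idealized assumption (not taken from the simulated data): independent parts
# with equal variances, so that r(p) = p / sqrt(p^2 + (1 - p)^2).
# Solving r(p) = r_crit for p gives the critical proportion of CDS.
p_crit <- r_crit / (r_crit + sqrt(1 - r_crit^2))

print(c(r_crit = r_crit, p_crit = p_crit))
```

Under these assumptions the critical proportion comes out at roughly 8%, the same order of magnitude as the value quoted from the plot. In any particular sample the crossing point shifts with the sample correlation between cds and itg.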
4 The part/whole problem
Turning back to the initial question, the problem is not related to the
definition of independency, but to what is known as the part/whole problem in
allometric studies. I have taken the following quote from Jim Moore's site at
http://weber.ucsd.edu/~jmoore/courses/allometry/allometry.html :
Properly speaking, we are wrong to correlate brain weight with total
body weight, because total body weight includes brain weight
and so artificially strengthens the correlation between the "two"
variables. Primate brains aren't usually such a large proportion of the
body weight that this bias is important, but if you were interested
in e.g., muscle mass, clearly you'd want to compare muscle mass
not with body weight, but with the weight of everything left with
muscle removed (messy research, that). And it may turn out to be
an important problem for brain allometry...
A direct translation to the Muto & Osawa plot:
1. We are wrong to correlate G+C content in first codon positions with
genome G+C because first codon positions make up about
80/3 ≈ 27% of the genome.
2. We are wrong to correlate G+C content in second codon positions with
genome G+C because second codon positions make up about
80/3 ≈ 27% of the genome.
3. We are wrong to correlate G+C content in third codon positions with
genome G+C because third codon positions make up about
80/3 ≈ 27% of the genome.
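To get a feel for the size of this effect, the simulation of the previous section can be extended to codon positions. This is a minimal sketch under assumptions that are not in the original text: the three codon positions and the intergenic spaces are drawn independently from the same beta distribution, and the names pos1, pos2, pos3 and w are hypothetical.

```r
set.seed(1071966)  # same seed value as in the simulation section; any value would do
n <- 500

# Independent G+C contents for the three codon positions and intergenic spaces
pos1 <- rbeta(n, shape1 = 2, shape2 = 2)
pos2 <- rbeta(n, shape1 = 2, shape2 = 2)
pos3 <- rbeta(n, shape1 = 2, shape2 = 2)
itg  <- rbeta(n, shape1 = 2, shape2 = 2)

# Each codon position makes up 80% / 3 of the genome, intergenic spaces 20%
w <- c(pos1 = 0.8 / 3, pos2 = 0.8 / 3, pos3 = 0.8 / 3, itg = 0.2)
gen <- w[["pos1"]] * pos1 + w[["pos2"]] * pos2 +
  w[["pos3"]] * pos3 + w[["itg"]] * itg

# Observed part/whole correlation for the first codon position
r_obs <- cor(gen, pos1)

# Expected value for independent parts with equal variances:
# weight of the part divided by the square root of the sum of squared weights
r_exp <- w[["pos1"]] / sqrt(sum(w^2))

print(c(observed = r_obs, expected = r_exp))
```

The expected part/whole correlation is about 0.53 even though the parts are independent by construction, so a share of 27% is already enough to produce a sizeable spurious correlation.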
The plot is not good because it artificially strengthens the correlation
between the "two" variables. The simulation in the previous section demonstrates
that even when the two part variables are independent, the part/whole
comparison yields a spurious correlation. What is interesting with actual data is that
part variables are not independent, but to see and quantify this we first have to
remove the spurious part of the correlations.
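One way to remove the spurious part, in the spirit of the allometry quote, is to compare a part with the rest of the whole rather than with the whole itself. The following sketch assumes that the proportion of CDS is known and equal to 0.8 for every species, as in the simulation; the helper rest_of_whole and the variable p_cds are hypothetical names introduced here.

```r
set.seed(1071966)
n <- 500
p_cds <- 0.8  # assumed known proportion of coding sequences in each genome

cds <- rbeta(n, shape1 = 2, shape2 = 2)
itg <- rbeta(n, shape1 = 2, shape2 = 2)
gen <- p_cds * cds + (1 - p_cds) * itg

# G+C content of what is left of the whole once the part is removed
rest_of_whole <- function(whole, part, p_part) {
  (whole - p_part * part) / (1 - p_part)
}

rest <- rest_of_whole(gen, cds, p_cds)  # equals itg here, up to rounding error

# Part versus whole (inflated by construction) and part versus rest
print(c(part_vs_whole = cor(cds, gen), part_vs_rest = cor(cds, rest)))
```

With real genomes the part/rest correlation need not vanish, and it is that remaining correlation which carries the biological information.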
5 Comments by Professor Noboru Sueoka
From an e-mail on Mon, 21 Nov 2005 12:03:01 -0700, Professor Noboru
Sueoka wrote:
I enjoyed your discourse about the Muto and Osawa (1987) paper. It
answers A.N. Gorban's question very well and I am sure clears some
people's minds, although the word "independent" is the best one for
the part/whole problem. It is surprising to find that so many people
are confused about this point. There must be a logical symbol for
the situation, where two variables are exclusive or inclusive.
I have no solution right now.
6 Comments by Professor Alexander N. Gorban
From an e-mail on 21 Nov 2005 12:08:20.0588 (UTC), Professor Alexander
N. Gorban wrote:
Do you know now that I am very boring mathematician, and your
first remark about "x-axis and y-axis dependency" could not make
me happy as well as the second "whole-part" remark, and the
exercises with random numbers generators. I need statements with exact
sense, sorry about it.
I always appreciate attempts to make thing more clear.
You wrote : ”Turning back to the initial question, the problem is not
related to the definition of independency, but to what is know as the
part/whole problem in allometric studies.”
Hence, it seems now that you do not support literally your initial
statement about the Muto & Osawa representation: "As pointed by
Sueoka (in 1988 IIRC) the Muto & Osawa representation is
very bad because the x-axis and y-axis variables are not
independent..."
I still do support it, but I would now write this as "because the x-axis and
the y-axis are obviously not a priori independent" to avoid confusion.
You use an important quotation : ”Properly speaking, we are wrong
to correlate brain weight with total body weight, because total body