36 Pages
English

Submitted to the Annals of Statistics arXiv: math PR

Gain access to the library to view online
Learn more

Description

Niveau: Supérieur, Doctorat, Bac+8
Submitted to the Annals of Statistics arXiv: math.PR/0000000 NEEDLES AND STRAWS IN A HAYSTACK: POSTERIOR CONCENTRATION FOR POSSIBLY SPARSE SEQUENCES By Ismael Castillo ? and Aad van der Vaart We consider full Bayesian inference in the multivariate normal mean model in the situation that the mean vector is sparse. The prior distribution on the vector of means is constructed hierarchically by first choosing a collection of nonzero means and next a prior on the nonzero values. We consider the posterior distribution in the frequen- tist set-up that the observations are generated according to a fixed mean vector, and are interested in the posterior distribution of the number of nonzero components and the contraction of the posterior distribution to the true mean vector. We find various combinations of priors on the number of nonzero coefficients and on these coeffi- cients that give desirable performance. We also find priors that give suboptimal convergence, for instance Gaussian priors on the nonzero coefficients. We illustrate the results by simulations. 1. Introduction. Suppose that we observe a vector X = (X1, . . . , Xn) in Rn such that (1.1) Xi = ?i + ?i, i = 1, . . . , n, for independent standard normal random variables ?i and an unknown vector of means ? = (?1, . . . , ?n). We are interested in Bayesian inference on ?, in the situation that this vector is possibly sparse.

  • square

  • inf ?

  • gaussian sequence

  • such full

  • bayesian inference

  • pin decrease

  • full posterior

  • rate over


Subjects

Informations

Published by
Reads 28
Language English

SubmittedtotheAnnalsofStatistics
arXiv:
math.PR/0000000

NEEDLESANDSTRAWSINAHAYSTACK:POSTERIOR
CONCENTRATIONFORPOSSIBLYSPARSESEQUENCES
∗ByIsmae¨lCastilloandAadvanderVaart
WeconsiderfullBayesianinferenceinthemultivariatenormal
meanmodelinthesituationthatthemeanvectorissparse.Theprior
distributiononthevectorofmeansisconstructedhierarchicallyby
firstchoosingacollectionofnonzeromeansandnextaprioronthe
nonzerovalues.Weconsidertheposteriordistributioninthefrequen-
tistset-upthattheobservationsaregeneratedaccordingtoafixed
meanvector,andareinterestedintheposteriordistributionofthe
numberofnonzerocomponentsandthecontractionoftheposterior
distributiontothetruemeanvector.Wefindvariouscombinations
ofpriorsonthenumberofnonzerocoefficientsandonthesecoeffi-
cientsthatgivedesirableperformance.Wealsofindpriorsthatgive
suboptimalconvergence,forinstanceGaussianpriorsonthenonzero
coefficients.Weillustratetheresultsbysimulations.

1.Introduction.
Supposethatweobserveavector
X
=(
X
1
,...,X
n
)
in
R
n
suchthat
(1.1)
X
i
=
θ
i
+
ε
i
,i
=1
,...,n,
forindependentstandardnormalrandomvariables
ε
i
andanunknownvector
ofmeans
θ
=(
θ
1
,...,θ
n
).WeareinterestedinBayesianinferenceon
θ
,in
thesituationthatthisvectorispossibly
sparse
.
Non-Bayesianapproachestothisproblemhaverecentlybeenconsidered
bymanyauthors.Golubev[13]obtainedresultsformodelselectionmethods
andthresholdestimatorsforthemean-squaredrisk.Birge´andMassart[4]
treatedthemodelwithintheirgeneralcontextofmodelselectionbypenal-
izedleastsquares.Abramovichetal.in[1]studiedtheperformanceofthe
FalseDiscoveryRatemethod.TheearlierworkbyDonohoandJohnstone
[10]canbeviewedasstudyingtheproblemwithinan
`
r
context.Manyau-
thors(seee.g.[3],[22],[21]andreferencescitedthere)haveinvestigatedthe
connectiontotheLASSOorsimilarmethods.
MethodswithaBayesianconnectionwerestudiedbyGeorgeandFoster
[12],Zhang[20],JohnstoneandSilverman[16,17],Abramovich,Grinshtein

WorkpartlysupportedbyaPostdoctoralfellowshipfromtheVUUniversityAmster-
madAMS2000subjectclassifications:
Primary62G05,62G20
Keywordsandphrases:
Bayesianestimators,Sparsity,Gaussiansequencemodel,Mix-
turepriors,Asymptotics,Contraction.
1imsart-aosver.2007/12/10file:spa-revised.texdate:November8,2011

2
I.CASTILLOANDA.W.VANDERVAART
andPensky[2],andJiangandZhang[15].Thepapers[12]and[16]con-
sideredanempiricalBayesmethod,consistingofmodellingtheparameters
θ
1
,...,θ
n
a-prioriasindependentlydrawnfromamixtureofaDiracmea-
sureat0andacontinuousdistribution,determininganappropriatemixing
weightbythemethodof(restricted)marginalmaximumlikelihood,and
finallyemployingtheposteriormedianormean.Thesecondpaper[2]moti-
vatedpenalties,appliedinapenalizedminimumcontrastscheme,byprior
distributionsontheparameters,andderivedestimatorsforthenumberof
nonzero
θ
i
andthe
θ
i
itself.Thefirstisaposteriormode,buttheestimator
for
θ
,called“Bayesiantestimation”,doesnotseemitselfBayesian.(Infact,
theGaussianpriorforthenon-zeroparametersin[2]willbeseentoperform
suboptimallyinourfullyBayesianset-up.)Thepapers[20]and[15]obtain
sharpresultson(nonparametric)empiricalBayesestimators.
Otherrelatedpapersinclude[19],[6],[7],[14],[15],[5].
Apenalizedminimumcontrastestimatorcanoftenbeviewedasthemode
oftheposteriordistribution,anditishelpfultointerpretepenaltiesaccord-
ingly.However,theBayesianapproachyieldsafullposteriordistribution,
whichisarandomprobabilitydistributionontheparameterspace.Ithas
bothalocationandaspread,andcanbemarginalizedtogiveposterior
distributionsforanyfunctionsoftheparametervectorofinterest.Itisthis
objectthatwestudyinthispaper.SuchfullBayesianinferencewasrecently
consideredbyScottandBerger[18],whodiscussedvariousaspectsnotcov-
eredinthepresentpaper,butnoconcentrationresults.Oneexampleofour
resultsisthatthebeta-binomialpriorsin[18],combinedwithmoderately
toheavytailedpriorsonthenonzeromeans,yieldoptimalrecovery.
Sparsity
canbedefinedinvariousways.Perhapsthemostnaturaldefini-
tionistheclassof
nearlyblack
vectors,definedas
`
0
[
p
n
]=
{
θ

R
n
:#(1

i

n
:
θ
i
6
=0)

p
n
}
.
Here
p
n
isagivennumber,whichintheoreticalinvestigationsistypically
assumedtobe
o
(
n
),as
n
→∞
.Sparsitymayalsomeanthatmanymeans
aresmall,butpossiblynotexactlyzero.Definitionsthatmakethisprecise
use
strong
or
weak
`
s
-balls
,typicallyfor
s

(0
,
2).Thesearedefinedas,with
θ
[1]

θ
[2]
≥∙∙∙≥
θ
[
n
]
thenonincreasingpermutationofthecoordinatesof
θ
=(
θ
1
,...,θ
n
),
nX
n
o
`
s
[
p
n
]=
θ

R
n
:1
|
θ
i
|
s

p
ns
nn1=in
n
1
s

p
n

s
o
m
s
[
p
n
]=
θ

R
:
n
1

m
i
a

x
n
i
|
θ
[
i
]
|≤
n.
imsart-aosver.2007/12/10file:spa-revised.texdate:November8,2011

SPARSITYANDBAYESPOSTERIORMEASURE
3
Becausethenonzerocoefficientsin
`
0
[
p
n
]arenotquantitativelyrestricted,
thereisnoinclusionrelationshipbetweenthisspaceandtheweakandstrong
balls,althoughresultsforthelattercanbeobtainedbyprojectingthem
into
`
0
[
p
n
].Ontheotherhand,forany
s>
0wehavetheinclusion
`
s
[
p
n
]

m
s
[
p
n
].
Theextentofthesparsity,measuredbytheconstant
p
n
,isassumedun-
known.OurBayesianapproachstartsbyputtingaprior
π
n
onthisnumber,
agivenprobabilitymeasureontheset
{
0
,
1
,
2
,...,n
}
.Nextwecomplete
thistoaprioronthesetofallpossiblesequences
θ
=(
θ
1
,...,θ
n
)in
R
n
,
bygivenadraw
p
from
π
n
choosingarandomsubset
S
⊂{
1
,...,n
}
of
cardinality
p
,andchoosingthecorrespondingcoordinates(
θ
i
:
i

S
)from
adensity
g
S
on
R
S
andsettingtheremainingcoordinates(
θ
i
:
i

S
c
)equal
tozero.Giventhisprior,Bayes’ruleyieldstheposteriordistributionof
θ
asusual.Weinvestigatethepropertiesofthisposteriordistribution,inits
dependenceonthepriorsonthedimensionandonthenonzerocoefficients,
inthenonBayesianset-upwhere
X
follows(1.1)with
θ
equaltoafixed,
“true”parameter
θ
0
.
Ifthetrueparametervector
θ
0
belongsto
`
0
[
p
n
],thenitisdesirablethat
theposteriordistributionconcentratesmostofitsmassonnearlyblack
vectors.Onemainresultofthepaperisthatthisisthecaseprovidedthe
priorprobabilities
π
n
{
p
}
decreaseexponentiallyfastwiththedimension
p
.
Thequalityofthereconstructionofthefullvector
θ
canbemeasuredby
variousdistances.AnaturaloneistheEuclideandistance,withsquare
nXk
θ

θ
0
k
2
=(
θ
i

θ
i
0
)
2
.
1=iIftheindicesofthe
p
n
nonzerocoordinatesofavectorinthemodel
`
0
[
p
n
]
wereknowna-priori,thenthevectorcouldbeestimatedwithmeansquare
erroroftheorder
p
n
.In[11]itisshownthat,as
n,p
n
→∞
with
p
n
=
o
(
n
),
2in
ˆ
fsup
P
n,θ
k
θ
ˆ

θ
k
=2
p
n
log(
n/p
n
)1+
o
(1)
.
θθ

`
0
[
p
n
]
Heretheinfimumistakenoverallestimators
θ
ˆ=
θ
ˆ(
X
)and
P
n,θ
denotes
takingtheexpectationundertheassumptionthat
X
is
N
n
(
θ,I
)-distributed.
Inotherwords,thesquareminimaxrateover
`
0
[
p
n
]is
p
n
log(
n/p
n
),meaning
thattheunknownidentityofthenonzeromeansneedstoleadonlytoa
logarithmicloss.
TheBayesianapproachispresumablyadoptedfortheintuitionprovided
bypriormodelling,andisnotnecessarilydirectedatattainingminimax

imsart-aosver.2007/12/10file:spa-revised.texdate:November8,2011

4
I.CASTILLOANDA.W.VANDERVAART
rates.However,fortheoreticalinvestigationitisnaturaltotakethemin-
imaxrateasabenchmark,anditisofparticularinteresttoknowwhich
priorsyieldaposteriordistributionthatconcentratesmostofitsmasson
ballsaround
θ
0
ofsquareradiusoforder
p
n
log(
p
n
/n
),orcloserelativesas
p
n
(log
n
)
r
thatloose(only)alogarithmicfactor.Asecondmainresultof
thepaperisthattheminimaxrateisattainedformanycombinationsof
priors.Itsufficesthatthepriors
π
n
decreaseexponentiallywithdimension,
andgivesufficientweighttothetruelevelofsparsity:forsome
c>
0,
(1.2)
π
n
(
p
n
)
&
exp

cp
n
log(
n/p
n
)
.
Furthermore,thepriorsonthenonzerocoordinatesshouldhavetailsthat
arenotlighterthanLaplace,andsatisfyanumberofothertechnicalprop-
erties.Ifinequality(1.2)failsthentherateofcontr

actionm

aybeslower
thanminimax;weshowthatitisnotslowerthanlog1

n
(
p
n
).(Theword
“contraction”isinlinewithotherliteratureonnonparametricBayesianpro-
cedures;withthepresentchoiceofmetrics(whichgrowwith
n
)therates
actuallyincreasetoinfinity.)
Moregenerally,weconsiderreconstructionrelativetothe
`
q
metricfor
0
<q

2,defined(without
q
th-root)by
nX(1.3)
d
q
(
θ,θ
0
)=
|
θ
i

θ
0
i
|
q
.
1=iFor
q<
2this“metric”ismoresensitivetosmallvariationsinthecoordi-
natesthanthesquareEuclideanmetric,whichis
d
2
.(For
q

1thedefinition
givesatruemetric
d
q
;for1
<q

2itdoesnot.)From[11]theminimax
rateover
`
0
[
p
n
]for
d
q
isknowntobeoftheorder
(1.4)
r
n

,q
=
p
n
log
q/
2
(
n/p
n
)
.
Weshowthattheposterior“contraction”rateattainsthisorderundercon-
ditionsasintheprecedingparagraph,andmoregenerallycharacterizethe
rateintermsoflog(1

n
(
p
n
)).
Besidesnearlyblackvectorsweconsiderratesofcontractionif
θ
0
isina
weak
`
s
-ball.Theminimaxrateover
m
s
[
p
n
]relativeto
d
q
isgivenby(see
)]01[sp(1.5)
µ
n

,s,q
=
n
n
log
(
q

s
)
/
2
(
n/p
n
)
.
nThisisshowntobealsotherateofposteriorcontractionunderslightly
strongerconditionsonthepriorsthanbefore:thepriorondimensionmust
imsart-aosver.2007/12/10file:spa-revised.texdate:November8,2011

SPARSITYANDBAYESPOSTERIORMEASURE
5
decreaseslightlyfasterthanexponential.Underthesameconditionswe
alsoshowthattheposteriordistributionhasexponentialconcentration,and
thereforecontractsalsointhestrongersenseof(any,Euclidean)moments.
Asummaryoftheseresultsisthatgoodpriorsforthedimensiondecrease
atexponentialor,perhapsbetter,slightlyfasterrate,andgoodpriorson
thenonzeromeanshavetailsthatareheavierthanLaplace.Wealsoshow
thatpriorswithlightertails,suchastheGaussian,attainsignificantlylower
contractionratesattrueparametervectors
θ
0
thatarenotclosetothe
origin.
Thestructureofthearticleisasfollows.InSection2westatethemain
concentrationresults.Apracticalalgorithm,simulationsandsomepictures
arepresentedinSection3.Proofsaregatheredattheendofthepaperand
inthesupplementaryAppendix[9].
1.1.
Notation.
Wedenoteby
a

b
and
a

b
theminimumandmaximum
oftworealnumbers
a,b
,andwrite
a
.
b
if
a

Cb
forauniversalconstant
C
.Thenotation
,
means“equalbydefinitionto”.Wecall
support
ofa
vector
θ
=(
θ
1
,...,θ
n
)

R
n
thesetofindicesofnonzerocoordinates,and
denotethisby
S
θ
=
{
i
∈{
1
,...,n
}
:
θ
i
6
=0
}
.Weset
θ
S
=(
θ
i
:
i

S
),and
let
|
S
|
bethecardinalityofaset
S
⊂{
1
,...,n
}
.
2.Mainresults.
ThroughoutthepaperweconsiderapriorΠ
n
on
R
n
constructedinthreesteps:
(P1)A
dimension
p
ischosenaccordingtoapriorprobabilitymeasure
π
n
ontheset
{
0
,
1
,
2
,...,n
}
.
(P2)Given
p
asubset
S

⊂{
1
,...,n
}
ofsize
|
S
|
=
p
ischosenuniformlyat
nrandomfromthe
p
subsetsofsize
p
.
(P3)Given(
p,S
)avector
θ
S
=(
θ
i
:
i

S
)ischosenfromaprobability
distributionwithLebesguedensity
g
S
on
R
p
(if
p

1)andthisis
extendedto
θ

R
n
bysettingtheremainingcoordinates
θ
S
c
equalto
.0Forsimplicityweusethesamedensity
g
S
foreverysetofagivendimension
|
S
|
,andwilldenotethisalsoby
g
|
S
|
.
GiventhepriorΠ
n
,Bayes’ruleyieldsthe
posteriordistribution
B
7→
Π
n
(
B
|
X
),theconditionaldistributionof
θ
given
X
iftheconditionaldistri-
butionof
X
given
θ
istakenequaltothenormaldistribution
N
n
(
θ,I
).The
probabilityΠ
n
(
B
|
X
)ofaBorelset
B

R
n
undertheposteriordistribution

imsart-aosver.2007/12/10file:spa-revised.texdate:November8,2011

6
I.CASTILLOANDA.W.VANDERVAART
canbewritten
X
n
π
(
p
)
XZYY

nn

φ
(
X
i

θ
i
)
φ
(
X
i
)
g
S
(
θ
S
)

S
p
=0
p
|
S
|
=
p
(
θ
S
,
0)

Bi

Si

/S
(2.1)
X
n
π
(
p
)
XZYY
.

nn

φ
(
X
i

θ
i
)
φ
(
X
i
)
g
S
(
θ
S
)

S
p
=0
p
|
S
|
=
pi

Si

/S
Here(
θ
S
,
0)isthevectorin
R
n
formedbyaddingcoordinates
θ
i
=0to
θ
S
=(
θ
i
:
i

S
),atthepositionsleftopenby
S
⊂{
1
,...,n
}
(inthecorrect
orderofthecoordinatesandnotattheend,asthenotationsuggests).This
expressionissomewhatunwieldy;weconsidercomputationinSection3.
Theposteriordistributioncanbemarginalizedtoobtaintheposteriordis-
tributionofanarbitrarytransformation
f
(
θ
)bychoosingtheset
B
equal
to
B
=
{
θ
:
f
(
θ
)

B
0
}
.
Theposteriordistributionisarandomprobabilitydistributionon
R
n
,
whichwestudyundertheassumptionthatthevector
X
=(
X
1
,...,X
n
)is
distributedaccordingtoamultivariatenormaldistributionwithmeanvector
θ
0
andcovariancematrixtheidentity.Welet
P
n,θ
0
T
denotetheexpected
valueofafunction
T
=
T
(
X
)underthisdistribution.
Weshallbeinterestedintwoaspectsoftheposteriordistribution:its
dimensionalityanditsabilitytorecoverthemeanvector
θ
.Becausethe
conditionsaresimplerinthecasethatthenonzerocoordinatesareindepen-
dentundertheprior,inthefirsttworesultsweassumethatthedensities
g
S
in(P3)areofproductform.Concreteexamplesofpriorsasin(P1)and(P3)
thatsatisfytheconditionsimposedinthetheoremsaregiveninSection2.5.
2.1.
Dimensionality.
Inthecontextof
`
0
[
p
n
]-classes,wesaythatthe
prior
π
n
ondimensionhas
exponentialdecrease
if,forsomeconstants
C>
0
and
D<
1,
(2.2)
π
n
(
p
)


n
(
p

1)
,p>Cp
n
.
Theorem
2.1(Dimension)
.
If
π
n
hasexponentialdecrease
(2.2)
and
g
S
isaproductof
|
S
|
copiesofaunivariatedensitywithmeanzeroandfinite
secondmoment,thenthereexists
M
suchthat,as
p
n
,n
→∞
,
sup
P
n,θ
0
Π
n
θ
:
|
S
θ
|
>Mp
n
|
X

0
.
θ
0

`
0
[
p
n
]
Forreasonablepriors,wemayhopethattheposteriordistributionspreads
massinthe
p
n
-dimensionalsubspacethatsupportsatruemeanvector

imsart-aosver.2007/12/10file:spa-revised.texdate:November8,2011

SPARSITYANDBAYESPOSTERIORMEASURE
7
θ
0

`
0
[
p
n
].Thetheoremshowsthattheposteriordistribution“overshoots”
thisspacebysubspacesofdimensionatmostamultipleof
p
n
.Becausethe
overshootcanhavearandomdirection,thisdoesnotmeanthattheposte-
riordistributionconcentratesoverallonafixed
Mp
n
-dimensionalsubspace.
Thetheoremshowsthatitconcentratesalong
Mp
n
-dimensionalcoordinate
planes,butitssupportwillbefarfromconvex.
Obviouslytheposteriordistributionwillconcentrateonlow-dimensional
subspacesifthehigherdimensionalspacesreceivelittlemassundertheprior
π
n
.Bythetheoremexponentialdecreaseissufficient.Thenextstepisto
showthatexponentialdecreaseisnottooharsh:itiscompatiblewithgood
reconstructionofthefullmeanvector
θ
.Thisthenofcourserequiresalower
boundonthepriormassgiventothespacesof“correct”dimension;for
instance(1.2).
2.2.
Recovery.
Goodrecoveryrequiresalsoappropriatepriordensities
g
S
onthenonzerocoordinates.Becausethestatisticalproblemofrecovering
θ
froma
N
p
(
θ,I
)distributedobservationisequivariantin
θ
,wemayhope
thatthelocationofthenonzerocoordinatesof
θ
0
doesnotplayaroleinits
recoveryrate.Thenon-Bayesianproceduresconsideredinforinstance[13]
indeedfulfillthisexpectation.However,aBayesianprocedure(withproper
priors)necessarilyfavourscertainregionsoftheparameterspace.Depending
onthechoiceofpriors
g
S
in(P3)thismayleadtoashrinkageeffect,even
inthe“average”recoveryoftheparameteras
n
→∞
,yieldingsuboptimal
behaviourfortrueparameters
θ
0
thatarefarfromtheorigin.Thisshrinkage
effectcanbepreventedbychoosingpriors
g
S
withsufficientlyheavytails.
Againwefirstconsiderthecaseofindependentcoordinates.Inthefol-
lowingtheoremweassumethat
g
S
isaproductof
|
S
|
densitiesoftheform
e
h
,forafunction
h
:
R

R
satisfying
(2.3)

h
(
x
)

h
(
y
)