CS535D Project: Bayesian Logistic Regression through Auxiliary Variables
Mark Schmidt

Abstract

This project deals with the estimation of Logistic Regression parameters. We first review the binary logistic regression model and the multinomial extension, including standard MAP parameter estimation with a Gaussian prior. We then turn to the case of Bayesian Logistic Regression under this same prior. We review the canonical approach of performing Bayesian Probit Regression through auxiliary variables, and extensions of this technique to Bayesian Logistic Regression and Bayesian Multinomial Regression. We then turn to the task of feature selection, outlining a trans-dimensional MCMC approach to variable selection in Bayesian Logistic Regression. Finally, we turn to the case of estimating MAP parameters and performing Bayesian Logistic Regression under L1 penalties and other sparsity-promoting priors.

1 Introduction

In this project, we examined the highly popular Logistic Regression model. This model has traditionally been appealing due to its performance in classification, the potential to use its outputs as probabilistic estimates since they are in the range [0, 1], and the interpretation of the coefficients in terms of the 'log-odds' ratio [1]. It is especially popular in biostatistical applications, where binary classification tasks occur frequently [1]. In this first part of the report, we review this model, its multi-class generalization, and standard methods of performing maximum likelihood (ML) or maximum a posteriori (MAP) parameter estimation under a zero-mean Gaussian prior for the regression coefficients.

We then turn to the case of obtaining Bayesian posterior density estimates of the regression coefficients. In particular, we examine the recently proposed extensions of the Bayesian Probit Regression auxiliary variable model to the Logistic Regression and Multinomial Regression scenarios. Finally, we turn to the challenging task
of incorporating feature selection into these models, focusing on trans-dimensional sampling methods, and MAP and/or Bayesian estimation under priors that encourage sparsity.

1.1 Binary Logistic Regression Model

We use X to denote the n by p design matrix, containing p features measured for n instances. We use y to denote the length-n class label vector, where the values take on either +1 or -1, corresponding to the class label for each instance. Finally, we will use w to represent the length-p vector of parameters of the model. Primarily for ease of presentation, we will not address the 'bias' term w_0 in this document, but all techniques herein are easily modified to include a bias term. Under the standard (binary) Logistic Regression model, we express the probability that an instance i belongs to the class +1 as:

    \pi(y_i = +1 | x_i, w) = \frac{1}{1 + \exp(-w^T x_i)}    (1)

For binary responses, we can compute the probability of the 'negative' class using the sum rule of probability: \pi(y_i = -1 | x_i, w) = 1 - \pi(y_i = +1 | x_i, w). We typically assume independent Gaussian priors with means of 0 and variance of v on the coefficients of the model:

    w_i \sim N(0, v)    (2)

To perform MAP parameter estimation, we take the log of the likelihood (1) over all examples, times the prior (2) over all parameters (ignoring the normalizing constant), to give the following objective function:

    f = -\sum_{i=1}^n \log(1 + \exp(-y_i w^T x_i)) - \frac{1}{2v} w^T w    (3)

From this expression, we see that the Maximum Likelihood estimate is obtained by setting v to \infty. Differentiating the above with respect to w, we obtain the following expressions for the gradient and Hessian (using \sigma to denote the sigmoid function \sigma(x) = 1/(1 + \exp(-x))):
    g = \sum_{i=1}^n (1 - \sigma(y_i w^T x_i)) y_i x_i - \frac{w}{v}    (4)

    H = -\sum_{i=1}^n \sigma(w^T x_i)(1 - \sigma(w^T x_i)) x_i x_i^T - \frac{1}{v} I_p    (5)

We note that the Hessian is negative-definite, and subsequently that the original function is (log-)concave, indicating that any local maximizer of this objective will be a global maximizer. A simple method to maximize this objective is to repeat Newton iterations, starting from an initial value of w, until the norm of the gradient is sufficiently small (noting that the gradient will be 0 at a maximizer). This results in a simple fixed-point iterative update as follows:

    w = w - \alpha H_m^{-1} g    (6)

where H_m is a modification of the Hessian to be sufficiently negative-definite, or a negative-definite approximation to the (inverse) Hessian (see [2]). The step size \alpha can be set to 1, but convergence may be hastened by using line search methods satisfying sufficient descent conditions (see [3] or [4]). We have implemented an approach of this type making use of Matlab's 'fminunc' function in the directory LOGREG, and an example calling this code is included as exampleLOGREG.m. Other methods for optimizing this objective are discussed and compared in [4].

1.2 Multinomial Logistic Regression Model

The binary Logistic Regression model has a natural extension to the case where the number of classes, K, is greater than 2. This is done using the softmax generalization [1]:

    \pi(y_{i,k} | x_i, w) = \frac{\exp(w_k^T x_i)}{\sum_{j=1}^K \exp(w_j^T x_i)}    (7)

In this case, we have a matrix of target labels y that is n by K, and y(i,j) is set to +1 if instance i has class j, and 0 otherwise. The weights are now expanded to a p by K matrix, and we now have
an individual weight vector corresponding to each class. Note that writing the class probabilities in this way makes it clear that we are employing an exponential family distribution. By observing that the normalizing denominator enforces that the probabilities summed over the classes must be equal to 1, we can set the parameter vector for one of the classes to be a vector of zeros. Using this, we can see that in this case the softmax likelihood will be identical to the binary logistic regression case when we have two classes. Note also that the coefficients used in a softmax function retain their interpretability in terms of changes to the log-odds, but that these changes are now relative to the class whose parameters are set to zero [1]. Again assuming an independent Gaussian prior on the elements of w, we can write the multi-class penalized log-likelihood for use in MAP estimation as follows:

    f = \sum_{i=1}^n \left[ w_{y_i}^T x_i - \log\left(\sum_{k=1}^K \exp(w_k^T x_i)\right) \right] - \frac{1}{2v} \sum_{j=1}^K w_j^T w_j    (8)

Above, we have introduced y_i as an indicator to select the appropriate column of w for the instance i. We see that the log-likelihood term has the familiar (numerator-denominator) form; subsequently, we expect the gradient and Hessian to contain moments of the distribution. If we use SM(i,k) to denote the softmax probability of instance i for class k, and \delta_{i=j} as the Kronecker delta function for i and j, we express the gradient for the parameters of class k, and the Hessian for the parameters of classes k and j, as follows:

    g_k = \sum_{i=1}^n \left[ x_i (y_{i,k} - SM(i,k)) \right] - \frac{w_k}{v}    (9)

    H_{kj} = -\sum_{i=1}^n x_i x_i^T \left[ SM(i,k)(\delta_{k=j} - SM(i,j)) \right] - \frac{\delta_{k=j}}{v} I_p    (10)

The Hessian remains negative-definite, but now has (pK)^2 elements instead of p^2, making computing and/or inverting the Hessian much more expensive. It is noteworthy that in the softmax case we can (and will) have a higher degree of correlation between variables than we did for the binary case, since we
have additional correlation between the classes. We have implemented MAP estimation for multinomial regression, making use of Matlab's 'fminunc' function (and hence using updates based on the gradient and an inverse Hessian approximation, as discussed for the binary case), in the directory 'MLOGREG', and an example calling this code is included as exampleMLOGREG.m. Note that this code is not vectorized, so it could be made much more efficient.

2 Bayesian Auxiliary Variable Methods

Above, we have described in detail the logistic and multinomial regression models, and overviewed some straightforward methods to perform MAP parameter estimation in such models under a Gaussian prior. However, we would much rather be doing Bayesian parameter estimation in these models, in order to obtain posterior distributions of the model parameters. We now turn to Bayesian methods of estimating posterior distributions in logistic regression models. In particular, we will focus on the Gibbs sampling method employing auxiliary variables and joint updates to the regression coefficients and auxiliary variables, proposed in [5].

2.1 Binary Probit Regression

As discussed in class, we can derive conjugate priors for the logistic regression likelihood function, but they are not terribly intuitive. Fortunately, we can transform the model into an equivalent formulation that includes auxiliary variables, and admits standard conjugate priors to the likelihood function. This method is an extension of the well-known auxiliary variable method for Binary Probit Regression of [6]. Before discussing logistic regression, we will first review the simpler Bayesian Binary Probit Regression model, as presented in [5]. Using \Phi to denote the Gaussian cumulative distribution function, Binary Probit Regression uses the following likelihood:

    \pi(y_i = 1 | x_i) = \Phi(x_i^T w)    (11)
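As a quick numerical illustration of (11) (a Python sketch rather than the report's Matlab; the function name is our own), the symmetry 1 - \Phi(t) = \Phi(-t) lets the likelihood of either class label y_i \in {+1, -1} be written compactly as \Phi(y_i x_i^T w):

```python
import numpy as np
from scipy.stats import norm

def probit_log_likelihood(X, y, w):
    """Log-likelihood of Binary Probit Regression, eq. (11).
    For y_i in {+1, -1}: P(y_i | x_i, w) = Phi(y_i * x_i^T w),
    since 1 - Phi(t) = Phi(-t)."""
    return np.sum(norm.logcdf(y * (X @ w)))
```

Using `norm.logcdf` rather than `np.log(norm.cdf(...))` avoids underflow when x_i^T w is far into the tail.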
Since there is no conjugate prior to the Gaussian cumulative distribution function, we introduce a set of n auxiliary variables z_i with:

    z_i = x_i^T w + \epsilon_i    (12)

Here, \epsilon_i \sim N(0, 1), and y_i takes the value of 1 iff z_i is positive. Introducing the auxiliary variables gives an equivalent model, but this model is more amenable to sampling, since w is removed from the likelihood. In the particular case of a Gaussian prior on w, it admits a straightforward Gibbs sampling strategy, where z_i is sampled from independent (univariate) truncated Gaussian distributions, and w can be sampled from a multivariate Gaussian distribution. Specifically, a straightforward Gibbs sampling strategy with \pi(w) \sim N(b, v) can be implemented using the following [5]:

    z_i | w \sim N(x_i^T w, 1) I(z_i > 0)  if y_i = 1    (13)
    z_i | w \sim N(x_i^T w, 1) I(z_i \le 0)  otherwise    (14)
    w | z, y \sim N(B, V)    (15)
    B = V (v^{-1} b + X^T z)    (16)
    V = (v^{-1} + X^T X)^{-1}    (17)

As before, we will assume that the mean b of the prior on the regression coefficients is zero. Following from this, we note above that B solves the normal equations in Least Squares estimation, and that V is the corresponding inverse Hessian (corresponding to the inverse covariance, or precision, matrix). The Gibbs sampler produced by the above strategy is trivial to implement (given currently available linear algebra software), since it only requires sampling from a multivariate Gaussian and truncated univariate Gaussians. Unfortunately, as discussed in class, this straightforward Gibbs sampling approach is inefficient, since the elements of w are highly correlated with the elements of z.
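To make the simplicity concrete, the following is a minimal Python sketch of the sampler defined by (13)-(17), with b = 0 as assumed above. It is illustrative only, not the report's Matlab implementation in 'PROBREGSAMP', and scipy's `truncnorm` stands in for the rejection sampler used there:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, v, n_iter, rng):
    """Gibbs sampler for Bayesian Binary Probit Regression, eqs. (13)-(17),
    with prior w ~ N(0, v I) (i.e. b = 0); y takes values in {+1, -1}."""
    n, p = X.shape
    V = np.linalg.inv(np.eye(p) / v + X.T @ X)        # eq. (17)
    L = np.linalg.cholesky(V)
    w = np.zeros(p)
    samples = np.empty((n_iter, p))
    for t in range(n_iter):
        m = X @ w
        # Truncate z_i to (0, inf) when y_i = +1, to (-inf, 0] otherwise.
        # truncnorm's bounds are standardized: (0 - loc) / scale = -m here.
        a = np.where(y > 0, -m, -np.inf)
        b = np.where(y > 0, np.inf, -m)
        z = truncnorm.rvs(a, b, loc=m, scale=1.0, random_state=rng)
        B = V @ (X.T @ z)                              # eq. (16) with b = 0
        w = B + L @ rng.standard_normal(p)             # draw w ~ N(B, V), eq. (15)
        samples[t] = w
    return samples
```

On synthetic data generated from the probit model, the posterior mean recovers the signs of the true coefficients after a short burn-in.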
To combat the correlation inherent in w and z in the above model, [5] proposed a method to update w and z jointly, by using the product rule to decompose the joint probability of the model as follows:

    \pi(w, z | y) = \pi(z | y) \pi(w | z)    (19)

The proposed method samples each z_i from a Gaussian distribution with means and variances derived from a leave-one-out marginal predictive density (see (5) in [5]), updating the conditional means after each update to z_i, then sampling w from its conditional normal distribution after all of the z_i have been sampled. Although results are presented showing that this joint updating strategy offers an advantage, we will not review it in detail, since there exists a simpler method to facilitate joint updating in logistic regression. Nevertheless, we have implemented this block updating scheme (based on the pseudocode presented in the paper) in the directory 'PROBREGSAMP'. For our implementations, we used a simple rejection sampling approach for sampling from truncated distributions, where each rejected sample restricts the sampling density envelope. An example showing how to run this code is at examplePROBREGSAMP.

2.2 Binary Logistic Regression

Beginning from the binary Bayesian Probit Regression model above, [5] propose to perform binary Bayesian Logistic Regression by replacing the independent Gaussian priors on the noise terms \epsilon_i with independent logistic distributions. Unfortunately, this significantly complicates the simple sampling strategies above. To facilitate straightforward sampling of this model, [5] introduce an additional set of auxiliary variables \lambda_{1:n}, and modify the noise function to be a scale mixture of normals with a marginal logistic distribution, as follows (where KS denotes the Kolmogorov-Smirnov distribution):

    \epsilon_i \sim N(0, \lambda_i)    (20)
    \lambda_i = (2\psi_i)^2    (21)
    \psi_i \sim KS    (22)

If we temporarily view \lambda as constant, we see that this is identical to the Probit model above, except that each value of z_i has an individual term \lambda_i for its noise variance. Subsequently, we know how to sample from this model for fixed \lambda: it simply involves using Weighted Least Squares instead of Least Squares (and the associated inverse Hessian), and sampling from individual truncated normals that have different variances. Subsequently, we can implement a Gibbs sampler if we are able to sample from the KS distribution. Fortunately, [5] outline a rejection sampling method to simulate from the KS distribution, using the Generalized Inverse Gaussian as the sampling density. Thus, we can implement a straightforward Gibbs sampler for binary Bayesian Logistic Regression using this rejection sampling approach, in addition to the following:

    z_i | w, \lambda \sim N(x_i^T w, \lambda_i) I(z_i > 0)  if y_i = 1    (23)
    z_i | w, \lambda \sim N(x_i^T w, \lambda_i) I(z_i \le 0)  otherwise    (24)
    w | z, y, \lambda \sim N(B, V)    (25)
    B = V (v^{-1} b + X^T W z)    (26)
    V = (v^{-1} + X^T W X)^{-1}    (27)
    W = diag(\lambda)^{-1}    (28)

Two approaches are presented in [5] to perform block sampling of the parameters. The first uses the same strategy as in the Probit model, where w and z are updated jointly in the same way, using the above modifications to the conditionals, followed by an update to the additional auxiliary variables \lambda. We implemented this strategy (based on the pseudocode from the paper) in the directory 'LOGREGSAMP'; exampleLOGRESAMP is an example script calling this function.
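For fixed \lambda, the w-update in (25)-(28) is just a Bayesian weighted least squares draw. A minimal Python sketch (the helper name and the choice b = 0 are ours, not from [5]):

```python
import numpy as np

def draw_w(X, z, lam, v, rng):
    """One Gibbs draw of w | z, lambda, eqs. (25)-(28), prior w ~ N(0, v I).
    lam holds the per-instance noise variances lambda_i; W = diag(lambda)^{-1}."""
    p = X.shape[1]
    XtW = X.T / lam                                # X^T W, i.e. weights 1/lambda_i
    V = np.linalg.inv(np.eye(p) / v + XtW @ X)     # eq. (27)
    B = V @ (XtW @ z)                              # eq. (26) with b = 0
    L = np.linalg.cholesky(V)
    return B + L @ rng.standard_normal(p)
```

With all lambda_i = 1 this reduces to the unweighted probit draw of (15)-(17), and with a diffuse prior (large v) the draws center on the weighted least squares solution.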
The second (and the authors' preferred) strategy for block sampling presented in [5] updates z and \lambda jointly, followed by an update to w. Sampling w and \lambda remains identical in this approach, but sampling z_i becomes easier. In this approach, z_i | w follows a truncated logistic distribution with mean x_i^T w and a scale of 1. Not only does this obviate the need for computing marginal predictive densities, the inverse of the cumulative distribution function of the logistic distribution has a closed form, and is subsequently trivial to sample from (although we again used a simple adaptive rejection sampling technique in our implementation). We implemented this strategy (based on the pseudocode from the paper) in the directory LOGREGSAMP; exampleLOGRESAMP2 is an example script calling this function.

2.3 Multinomial Logistic Regression

Unlike the Probit Regression case, the binary Logistic Regression sampling techniques above have a trivial extension to the multi-class scenario. In addition to having a y and w variable for each class, as we saw in Section 1, we now have a z and \lambda vector for each class. The Gibbs sampler presented in [5] simply loops over the classes, performing the binary logistic regression sampling technique for the current class while keeping all other classes fixed. We implemented this strategy (based on the pseudocode from the paper) in the directory MLOGREGSAMP; exampleMLOGRESAMP is an example script calling this function. Unfortunately, we found that this does not make an especially effective sampling strategy: the technique stays in areas of the distribution that are far from the MAP estimate, and did not produce accurate classification results. We hypothesize that this is due to several factors. The first factor is simply the larger number of parameters in this model. Another factor is that, as discussed previously, there can inherently be a much higher degree of correlation in the softmax case than in the binary scenario. Finally, we note that the sampling strategy of looping over the classes and running the binary sampler is not especially clever about dealing with these correlations, since it requires separate sampling of the z values for each class in addition to the w values for each class; a joint update would likely improve the performance. Before moving on to feature selection, I would like to outline some extensions of the above models
that I would have liked to explore, if I had more time. One idea with significant potential for improving the sampling strategies is to integrate out parameters. Given the high degree of correlation between variables (especially in the multi-class case), this would likely improve the sampling strategies significantly. Another area of exploration is to not view the covariance or hyper-parameters as fixed, and to explore posterior estimates with priors on these distributions. This is especially relevant from the point of view of model generalization, since the covariance and hyper-parameters can significantly affect the classification performance of the model.

3 Feature Selection

A major appeal of Logistic Regression, besides its intuitive multi-class generalization, is the interpretation of its coefficients. As discussed in [1], researchers often explore different combinations of the features in order to produce a parsimonious regression model that still provides effective prediction performance. In this section, we discuss automated approaches to this feature selection problem. We first present an extension to the above models that incorporates feature selection through trans-dimensional sampling. We then turn our focus to priors that encourage sparsity in the final model.

3.1 Trans-Dimensional Sampling

Focusing on the binary logistic regression scenario, one method to incorporate feature selection into the procedure is to add yet another set of auxiliary variables, \gamma_{1:p}. Specifically, if the binary variable \gamma_i is set to 1, then the corresponding variable is included in the model; if \gamma_i is set to 0, then the corresponding variable is excluded from the model (i.e., set to 0). [5] proposes this model, and suggests using the model presented earlier for binary logistic regression with these auxiliary variables, with joint updates to {z, \lambda} and to {\gamma, w}. They propose that \gamma | z can be sampled using (Reversible-Jump) Metropolis-Hastings steps. Specifically, sampling z, \lambda, and w remains the same (but using only the active covariate set), and we accept a trans-dimensional step from \gamma to \gamma^* (under a symmetric proposal) using the following acceptance probability:
    \alpha = \min\left\{1, \frac{|V_{\gamma^*}|^{1/2} |v_{\gamma}|^{1/2} \exp(0.5 B_{\gamma^*}^T V_{\gamma^*}^{-1} B_{\gamma^*})}{|V_{\gamma}|^{1/2} |v_{\gamma^*}|^{1/2} \exp(0.5 B_{\gamma}^T V_{\gamma}^{-1} B_{\gamma})}\right\}    (30)

[5] uses a simple proposal distribution: they flip the value of a randomly chosen element of \gamma. We implemented sampling from the above model in the case of binary logistic regression (with feature selection) in the directory LOGREGSAMPFS; an example running this routine is exampleLOGREGSAMPFS.

3.2 Priors Encouraging Sparsity

Although the above strategy to incorporate feature selection into the model is a simple extension of the logistic regression model, it has major drawbacks. Specifically, updating single components causes very slow exploration of the space of 2^p variables. As discussed in class, we could jointly update correlated components to significantly improve the results. An alternate strategy, especially relevant when p is very large, is to use priors that encourage sparsity.

3.3 MAP Estimation of the Logistic LASSO

The LASSO prior advocated in [7] (but utilized earlier under the name 'Basis Pursuit' [8]) is currently a popular strategy for enforcing sparsity in the weights of the regression coefficients. From the point of view of optimization, the LASSO prior consists of using a scaled value of the L1-norm of the weights as the penalty/regularization function, instead of the squared L2-norm discussed earlier. Specifically, our objective function becomes:

    f = -\sum_{i=1}^n \log(1 + \exp(-y_i w^T x_i)) - \frac{1}{v} ||w||_1    (31)

Although the above objective is still concave, a major disadvantage of this objective function is that it is non-differentiable at points where any w_i is zero. Hence, we need to use slightly less generic optimization approaches for finding the MAP estimates. Furthermore, we cannot use efficient methods such as the one presented in [9] for Least Squares estimation under an L1 penalty in order to optimize the logistic regression likelihood function. The most widely used method for optimizing the logistic