Benchmarking SAT Solvers for Bounded Model Checking
15 Pages
English
Downloading requires you to have access to the YouScribe library
Learn all about the services we offer

Benchmarking SAT Solvers for Bounded Model Checking

Downloading requires you to have access to the YouScribe library
Learn all about the services we offer
15 Pages
English

Description

Benchmarking SAT Solvers for Bounded
Model Checking
Emmanuel Zarpas
IBM Haifa Research Laboratory,
zarpas@il.ibm.com
Abstract. Modern SAT solvers are highly dependent on heuristics.
Therefore, benchmarking is of prime importance in evaluating the per-
formances of different solvers. However, relevant benchmarking is not
necessarily straightforward. We present our experiments using the IBM
CNF Benchmark on several SAT solvers. Using the results, we attempt
to define guidelines for a relevant benchmarking methodology, using SAT
solvers for real life BMC applications.
1 Introduction
Over the past decade, formal verification via model checking has evolved from a
theoretical concept to a production-level technique. It is being actively used in
chip design projects across the industry, where formal verification engineers can
now tackle the verification of large industrial hardware designs. Bounded Model
Checking (BMC) [1] has recently enabled the verification of problems that used
to be beyond the reach of formal verification. However, BMC performance relies
heavily on the performance of the underlying SAT solver.
We conducted several experiments using the IBM CNF benchmark [12, 11]
to evaluate SAT tools. Often, we found discrepancies between our results and
the experimental results found in the literature. For example, our results are not
necessarily in line with the results of the SAT03 contest [4]. The main goal of this
paper is to outline a methodology for benchmarking SAT ...

Subjects

Informations

Published by
Reads 42
Language English

Exrait

BenchmarkingSATSolversforBoundedModelCheckingEmmanuelZarpasIBMHaifaResearchLaboratory,azprsai@.lbi.mocmAbstract.ModernSATsolversarehighlydependentonheuristics.Therefore,benchmarkingisofprimeimportanceinevaluatingtheper-formancesofdifferentsolvers.However,relevantbenchmarkingisnotnecessarilystraightforward.WepresentourexperimentsusingtheIBMCNFBenchmarkonseveralSATsolvers.Usingtheresults,weattempttodefineguidelinesforarelevantbenchmarkingmethodology,usingSATsolversforreallifeBMCapplications.1IntroductionOverthepastdecade,formalverificationviamodelcheckinghasevolvedfromatheoreticalconcepttoaproduction-leveltechnique.Itisbeingactivelyusedinchipdesignprojectsacrosstheindustry,whereformalverificationengineerscannowtackletheverificationoflargeindustrialhardwaredesigns.BoundedModelChecking(BMC)[1]hasrecentlyenabledtheverificationofproblemsthatusedtobebeyondthereachofformalverification.However,BMCperformancereliesheavilyontheperformanceoftheunderlyingSATsolver.WeconductedseveralexperimentsusingtheIBMCNFbenchmark[12,11]toevaluateSATtools.Often,wefounddiscrepanciesbetweenourresultsandtheexperimentalresultsfoundintheliterature.Forexample,ourresultsarenotnecessarilyinlinewiththeresultsoftheSAT03contest[4].ThemaingoalofthispaperistooutlineamethodologyforbenchmarkingSATsolversonBoundedModelCheckingproblems.Thepaperisorganizedasfollows:InSection2,wepresentexperimentalresultsforsomefamousSATsolvers.Section3presentsanewversionoftheIBMBenchmarkanddiscussescorrelationoftheCNFcharacteristicsandsolversperformance.Section4containsadiscussiononSATbenchmarkingforBMC,wherewetrytolearnfromourexperimentalresultsinlightofotherexperimentssuchastheSATcontests.Section5concludesthepaper.2ExperimentswithIBM2002Benchmark:SATSolverComparison[O7]u,rzeCxhpaeriImIe(nztCshcaom2p0a0re4.d5.f1o3u)r[f3a],mBoeurskSMAiTn5s6o1lv[e2r]sa:nzdChSiaegeI(v2400[81]..2.W17evuesresdiotnh)eF.BacchusandT.Walsh(Eds.):SAT2005,LNCS3569,pp.340–354,2005.cSpringer-VerlagBerlinHeidelberg2005
BenchmarkingSATSolversforBoundedModelChecking3412002versionoftheIBMCNFBenchmark[12],whoseCNFsaregeneratedthroughBMCfromtheIBMFormalVerificationBenchmarkLibrary[11].Thislibraryisacollectionofmodelsfromreal-lifeindustrialhardwareverificationprojects.Foreachmodel,weusedtheCNFformulas(SATdat.k.cnf)forboundwithk=1,10,15,20,25,30,35,40,45,50.Thetime-outwassetat10,000secondsonaworkstationwith867841XIntel(R)Xeon(TM)CPU2.40GHz,with512KBfirstlevelcacheand2.5GBphysicalmemory.Weusedthedefaultconfigurationforeachengine.2.1zChaIvs.BerkMin561WeranzChaffI(2001.21.17version)andBerkMin561onthe2002versionoftheIBMCNFBenchmark.Table1isasummaryoftheresultsobtained.MoreTable1.Theresultsaredisplayedinsecondsandincludethetimeout10,000secondszChaffBerkMin561Totaltime(10000sectime-out)344765414094First(#ofCNFwheretheengineisthefastest)298131#ofunsolvedproblems(timeoutreached)2530+(#ofCNFwheretheengineisthefastestbymorethanaminuteand20%)6732Firstbymodel(#ofmodelwheretheengineisthefastest)2818completeexperimentalresultsarepresentedinTable8,whereforeachmodelinthetable,wepresentthesumoftheresultsofSATdat.k.cnfwithk=1,10,15,20,25,30,35,40,45,50(i.e.,theBMCtranslationofeachmodelforthedifferentk).Formoredetails,seethecompleteresultsin[13].BecauseSATsolversdonotalwaysbehaveinhomogeneousmanner(Cf.detailedresultsin[13]ortable2),thecompleteresultsanalysisshouldnotbedisregarded.TheresultsshowthatzChaffandBerkMin561achievedcloseresults.OnsomeCNFs,zChaffrunsfaster,andonothers,BerkMin561runsfaster.Inmostcases,thedifferencesintheirperformanceisnotverysignificant,thehighestspeedisnotfasterbymorethanoneminuteor20%.However,whilezChaffseemstoperformslightlybetteroverall,BerkMin561getsbetterresultsthanzChaffontheUNSATCNFs.Fromtheseresults,itisnotpossibletoconcludethatBerkMin561performsbetterthanzChaff.ThisisnotconsistentwithwhatcanbereadintheliteratureorwiththefinalresultsoftheSAT03contest(Cf.section2.4).ThisisnotaverysatisfyingconclusionandthepreviousmetricsdonottellusinasimplewayhowmuchfasterzChaffisthanBerkmin5612.Weneedsome12BerkMin561wassecondintheSAT03contestindustrialbenchmarkscategoryintheSAT03contest;thewinnerForkliftisnotpubliclyavailable.Itisveryimportanttobeabletogiveclearandconcisevaluesforresultsdissemi-nation.
342E.ZarpasTable2.TheresultsaredisplayedinsecondsfortheCNFsfrom18rulemodel.Thetime-outwassetto10000seconds.zChaffperformstheworstfork=15,20,35;Berk-Min561performstheworstfork=30,40,50;Siegev4hasthebestperformance,exceptfork=15,2518ruleCNFResultzChaBerkMinSiegeSATdat.k1.cnfunsat000SATdat.k10.cnfunsat110SATdat.k15.cnfunsat1345SATdat.k20.cnfunsat654741SATdat.k25.cnfunsat951109251SATdat.k30.cnfsat700755557SATdat.k35.cnfsattime-out22301730SATdat.k40.cnfsat55108060247SATdat.k45.cnfsattime-outtime-out959SATdat.k50.cnfsat5010time-out1370additionalmetricsinordertocomparetheseSATsolvers.Thearithmeticmeanofthespeedupseemsirrelevant.Let’ssaywewanttocomparetheperformancesofsolversAandB.IfthespeedupofAvs.Bis10ontest1and0.0001ontest2,thearithmeticmeanofthespeedup(Avs.B)wouldbe5andthemeanofthespeedup(Bvs.A)wouldbe5000.Thisisclearlynotrelevant.Theglobalspeedup(TotaltimeA/TotalTimeB)isgoodtohavebutthe“longruns”havefarmoreweightthantheshortones.Themedianismuchmoreinteresting.howeveritdoesnottakeintoaccounttheextremevalues(e.g,ifontest1thespeedupisequaltomedian+10ortomedian+0.1,itdoesnotchangethemedianvalue).Ontheotherhand,themedianinsensivitytoextremevaluesmakesitlessdependentonthetime-outs.Thegeometricalmeanseemstobeagoodsolution.Itdefinitelytakesintoaccounttheweightofthe“negativespeedup”(i.e.,aspeedupoflessthan1)3.Wefeltthesevaluesshouldbecomputedthesevaluesonlyonarelevantsubsetofthebenchmark.Forexample,wediscardedallthecaseswherethefourSATsolverstime-out.Inaddition,wediscardedcasesforwhichthefoursolvers(zChaI,Berkmin561,Siegev4,zChaII)runinlessthanveseconds.Therationalisthatitisdifficulttoaccuratelycompareverysmallruntimesandthattheseverysmallruntimesarearguablynotrelevant.TheresultsdisplayedinTable3(b)showthatBerkmin561isnearlytwotimesslowerthanzChaffI,butthatBerkminbehavessomehowbetterinthe“long”cases.2.2Siegev4Siegewashors-concoursfortheSAT03contest,thereforeitdidnotparticipateinthesecondstage.Nevertheless,theSiegeresultswereprettygoodforthe3Notethatthelogarithmofthegeometricalmeanisequaltothearithmeticalmeanofthelogarithms.
BenchmarkingSATSolversforBoundedModelChecking343Table3.Witha10000sec.time-outTotalTime#time-outs(inseconds)zChaI344,76425berkmin561414,03530Siegev419723914zChaII389,30431(a)SpeedupzBCerhkamffinIzCSiheagfefIzzCChhaaffffIIIgmeeodmiaenan00..455122..980711..2239global0.753.580.75(b)firststageandSiegehasareputationforbeingoneofthebestSATsolversavailable.WeranSiegev4[8]onthebenchmarkusing123456789asaseed.Table3displaystheoverallconclusions.FormoredetailsseeTable8and[13].Siegev4isthefastestin298cases(CNFs)outof498.Inmanydicultcases,Siegev4isfastestbyanorderofmagnitude,ormore.Insomecases,Siegev4performedsignicantlyworsethanzChaorBerkMin561.Forexample,for26rule,Siegev4isslowerthanzChaffbyanorderofmagnitudeandslowerthanBerkMinbytwoordersofmagnitude.Inaddition,withinthetime-out,severalCNFscanbesolvedonlybySiegev44.Inconclusion,weseethatSiegev4performssignificantlybetter(roughlyspeaking2.5timesfaster)thanzChaffI,Berkmin561andzChaffIIontheIBMCNF2002benchmark.2.3zChaIITheindustrialcategoryoftheSAT04competition[6]waswonbyzChaffII(zChaff2004.5.13).However“black-box”solverssuchasForklift(the2003edi-tionwinner)were“hors-concours”andassuchnotallowedtoenterthesecondstageofthecompetition.WefocushereontheperformanceofzChaffIIvszChaffI.zChaffIIisfasterthan:zChafffor229CNFs(outof442),Berkmin561for208CNFs(outof442),Siegev4for82CNFs(outof442).Inaddition,zChaIIreachedtime-outmoreoftenthanzChaffI(forsixmorecases).Table3dis-playsasyntheticviewoftheresults.ThezChaffIIcodewasleftunchanged,soitusedrandomseeds.Siegev4isroughlythreetimefasterthanzChaI.Figure1givesusagraphicviewoftheperformanceofzChaffIIvs.zChaffI.Theover-allperformancedifferencebetweenzChaffIandzChaffIIonourbenchmarkissmall5,eventhoughthetwosolverscanbehaveverydifferentlyinspecificcases.45Siegeisarandomizedsolver,therefore,itcouldbearguedthatthecomparisonisnotfairsincethezChaffandBerkMin561,versionsweusedwerenotrandomized.Nevertheless,evenadeterministicsolvercanbeluckyandprovidingSiegewithaseedofSiegemakesitdeterministic.ItshouldbenotethatzChaffIIisrandomized,sotheresultscouldbedifferentforotherruns.ItwouldbeinterestingtohaveastatisticaldescriptionoftherangeofzChaffIIperformances.
344E.Zarpas0000100010010110.11101001000100001.010.0zchaff I runtime08070605040302100niB(a)(b)Fig.1.Histogram(b)givesthedistributionofthedecimallogarithmofzChaffI/zChaffIIspeedupwith10000sectime-out2.4ComparisonwithSATContestResultsInordertounderstandwhetherornotourresultsareconsistentwiththoseoftheSAT03contest6(fordetailsresultssee[5]),wehadalookattheresultsofthefirstandsecondstagesofthecompetition.Firststage.Lookingattheresults7fortheindustrialcategory[5],wenotethefollowing:Series13rule8isover-represented(Cf.Table4).Besides,sinceallsolverstime-outonmostoftheCNFsfromthisseries,itisnotverymeaningful.Exceptforseries13ruleand11rule,theseriesfromtheIBMCNFbench-markareeasyforzCha,BerkMin561,andSiegev1.Resultsforrule07arenotconsistentwithourownexperiences.Thisdis-crepancyisprobablycausedbythe“Lisasyndrome”[4]:CNFswereshuffledforSAT03andsolverperformancecandifferdramaticallybetweenashuffledCNFandtheoriginal.TheSAT03contestresultsforBerkMin561,Forklift,Siegev1,andzChaon(shuffled)seriesfromtheIBMCNFBenchmarkaresummarizedinTable4.Ifresultsfromseries07ruleand13rulearediscarded,zChaffshowsbetterresultsthanBerkMin561.678InthefirststageoftheSAT04competition,zChaffIandzChaffIIhaveverycloseresultsandthesolversthatoutperformedthem(Forklift,Oepir)arenotavailableforexperimentation.TheSiegeversionusedfortheSAT03contestisSiegev1.ThenamesoftheSAT03seriesappearsinitalicsinordertodifferentiatethemfromtheseriesweusedinourexperiments.FortheexactcompositionoftheSAT03series,see[5].
BenchmarkingSATSolversforBoundedModelChecking345Table4.PartialresultsfromfirststageSAT03contestBerkMin561forkliftSiege1zChaffTotal#ofSolvedBenchmarks112112112101TotalCPUtimeneeded(sec)105000103000103000114000without13rule(sec)15105184084160without13ruleand07rule(sec)1410424268553WebelievethatthedifferencesinresultsforthefirststageofSAT03contestresultsforIBMthebenchmarkandourexperimentsareduetoclauseshufflinginSAT03andtothefactthattheSAT03experimentusedasmallertest-bedofCNFsfromtheIBMbenchmark.Secondstage.Duringthesecondstageofthecompetition,solverswererankedaccordingtothenumberofCNFstheycouldsolvefromarestrictedbenchmark.Forkliftwasrankedfirst,Berkmin561second,andzChaffIsixthontheindustrialbenchmark9.Clearly,therespectivezChaffIandBerkMin561rankingdonotcorrespondtoourexperimentalresults(seeTable3).However,inthesecondstage,allsolvers“timed-out”ontheIBMbenchmarksselected.Inotherwords,therankingofthesolversselectedforthesecondstageofthecompetitiondidnottakeintoaccountperformanceontheIBMbenchmarks.ThisprobablyexplainswhyourevaluationofzChaffandBerkMin561ontheIBMCNFBenchmarkgivesresultsthatarenotinlinewiththesecondstageresults(onindustrialbenchmarks)oftheSAT03contest.3ExperimentswithIBM2004Benchmark:SearchforCorrelation3.1DescriptionoftheIBM2004BenchmarkBoundedModelCheckingtranslationtechnologyiscontinuouslyevolving.There-fore,were-generatedtheIBMCNFBenchmarkfromtheIBMModelBenchmarkusinganalternativeBMCtool.Weranthenewtranslationforthefollowingboundsk=0..10,k=11..15,k=16..20,k=21..25,...,k=95..100.The2004CNFsareavailableon-line.Inthispaper,werefertothe2002and2004versionsastheoldandthenewbenchmarks,respectively.Table5and6presentstatisticaldescriptionsoftheoldandthenewbench-marks.Inthesetwotables,averageisthearithmeticalmean,STDEVisthestandarddeviation,clavuasresistheratiobetweenthenumberofclausesandthenumberofvariables,%unitisthepercentofunit(i.e.,oflengthone)clausesinaCNF,%binisthepercentofclausesoflengthtwo,%teristhepercentofclauses9Siegewashors-concours,thereforenotselectedforthesecondstage.
436E.ZarpasTable5.Old(2002)benchmarkvarclausesclavuarses%unit%bin%ter%l=4%l>4average80,167343,8264.120.375.514.23.56.5median54,857220,1804.080.277.012.23.56.6STDEV81,924388,1560.520.36.47.71.01.3max636,0893,172,1075.421.684.555.35.29.1min3,64514,6812.48040.84.30.10.1Table6.New(2004)Benchmarkvarclausesclavuarses%unit%bin%ter%l=4%l>4average73,414305,3013.990.674.515.03.66.3median50,897195,6123.980.476.113.13.66.6STDEV75,553349,2480.450.66.17.30.91.2max565,8892,760,5025.484.684.551.64.99.1min3,60614,1042.55046.54.40.10.1oflengththree,%l=4isthepercentofclausesoflength4and%l>4isthepercentofclauseslongerthan4.ThesetwotablestellusthatthenewbenchmarkCNFsareroughly10%smallerthantheoldbenchmarkCNFs.However,theyencompassroughlythesameproportionoflengthtwo,three,andfourclauses.Thehigherproportionofunitclausesinthenewbenchmarkisduetospecificoptimizations.Wealsoseethattheratioclavuarsesisslightlylowerinthenewbenchmark(seethenextsubsectionforadiscussionontherelevanceofthisratio).Figure2displaysthehistogramof%bin,wheretwomodels(07and131)haveanonstandardpercentofbinaryclausesandareinthe45%-55%andnotinthe70%-85%.WelookedatsomestatisticsforthestructureoftheCNFsofthenewbench-mark.Wefoundthatforanymodelofthebenchmark,therearefourrealnum-bersa,b,c,dsuchthatforalltheCNFsofthemodel,#(variables)ak+band#(clauses)ck+d,withkbeingtheboundusedtogeneratetheCNFs.Inamoregeneralway,foranymodelandforanyintegernstrictlygreaterthan0,therearetworealnumberseandfsuchthatforallCNFsofthemodel#(clausesoflengthn)ek+f(ofcourseeandfcanbenull,forexampleinmostmodels,thenumberofunitclausesisaconstant).Thecorrelations(foradiscussionaboutstatistictechniquesforNP-completeproblemssee[10])be-tweentheseriesfromthebenchmarkandtheseriespredictedwiththepreviousequationsareabout0.99.ThisisnotsurprisingsincetheCNFsaregeneratedfromthemodelinawayessentiallylineartothebound.Inaddition,wefoundoutthat,inourbenchmark,thelengthofthelongestclauseisthesameforallCNFsgeneratedfromthesamemodel.Figure2(b)displaysthehistogramofthelongestclausesinthenewbenchmark.Notethattwomodels,07and131,whichhavethegreaterlongestclauses,arealsotheoneswitha“nonstandard”percentofbinaryclauses.