Published by | technische_universitat_munchen |
Published | 01 January 2011 |
Language | English |
Why and How to Control Cloning in Software Artifacts
Elmar Juergens
Institut für Informatik der Technischen Universität München
Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).
Chair: Univ.-Prof. Bernd Brügge, Ph.D.
Examiners of the dissertation:
1. Univ.-Prof. Dr. Dr. h.c. Manfred Broy
2. Univ.-Prof. Dr. Rainer Koschke, Universität Bremen
The dissertation was submitted to the Technische Universität München on 07.10.2010 and accepted by the Fakultät für Informatik on 19.02.2011.
Abstract
The majority of the total life cycle costs of long-lived software arises after its first release, during software maintenance. Cloning, the duplication of parts of software artifacts, hinders maintenance: it increases size, and thus effort for activities such as inspections and impact analysis. Changes need to be performed to all clones, instead of to a single location only, thus increasing effort. If individual clones are forgotten during a modification, the resulting inconsistencies can threaten program correctness. Cloning is thus a quality defect.
The software engineering community recognized the negative consequences of cloning over a decade ago. Nevertheless, it abounds in practice, across artifacts, organizations and domains. Cloning thrives, since its control is not part of software engineering practice. We are convinced that this has two principal reasons: first, the significance of cloning is not well understood. We do not know the extent of cloning across different artifact types and the quantitative impact it has on program correctness and maintenance efforts. Consequently, we do not know the importance of clone control. Second, no comprehensive method exists that guides practitioners through tailoring and organizational change management required to establish successful clone control. Lacking both a quantitative understanding of its harmfulness and comprehensive methods for its control, cloning is likely to be neglected in practice.
This thesis contributes to both areas. First, we present empirical results on the significance of cloning. Analysis of differences between code clones in productive software revealed over 100 faults. More specifically, every second modification to code that was done in unawareness of its clones caused a fault, demonstrating the impact of code cloning on program correctness. Furthermore, analysis of industrial requirements specifications and graph-based models revealed substantial amounts of cloning in these artifacts, as well. The size increase caused by cloning affects inspection efforts: for one specification, by an estimated 14 person days; for a second one, by over 50%. To avoid such impact on program correctness and maintenance efforts, cloning must be controlled.
Second, we present a comprehensive method for clone control. It comprises detector tailoring to improve accuracy of detected clones, and assessment to quantify their impact. It guides organizational change management to successfully integrate clone control into established maintenance processes, and root cause analysis to prevent the creation of new clones. To operationalize the method, we present a clone detection workbench for code, requirements specifications and models that supports all these steps. We demonstrate the effectiveness of the method, including its tools, through an industrial case study, where it successfully reduced cloning in the participating system.
Finally, we identify the limitations of clone detection and control. Through a controlled experiment, we show that clone detection approaches are unsuited to detect behaviorally similar code that has been developed independently and is thus not the result of copy & paste. Its detection remains an important topic for future work.
Acknowledgements
I have spent the last four years as a researcher at the Lehrstuhl for Software & Systems Engineering at Technische Universität München of Prof. Dr. Dr. h.c. Manfred Broy. I want to express my gratitude to Manfred Broy for the freedom and responsibility I was granted and for his guidance and advice. I have enjoyed, and still enjoy, working in the challenging and competitive research environment he creates. I want to thank Prof. Dr. rer. nat. Rainer Koschke for agreeing to co-supervise this thesis. I am grateful for inspiring discussions on software cloning, but also for the hospitality and interest, both by him and his group, that I experienced during my visit in Bremen. My view of the social aspects of research, which formed in the large, thematically heterogeneous group of Manfred Broy, was enriched by the glimpse into the smaller, more focused group of Rainer Koschke.
I am very grateful to my colleagues. Their support, both on the scientific and on the personal level, was vital for the success of this thesis and, not least, for my personal development during the last four years. I am grateful to Silke Müller for schedule magic. To Florian Deissenboeck for being an example worth following and for both his encouragement and outright criticism. To Benjamin Hummel for his merit and creativity in producing ideas, and for his productivity and effectiveness in their realization. To Martin Feilkas for his ability to overview and simplify complicated situations and for reliability and trust come what may. To Stefan Wagner for his guidance and example in scientific writing and empirical research. To Daniel Ratiu for the sensitivity, carefulness and depth he shows during scientific discussions (and outside of them). To Lars Heinemann for being the best colleague I ever shared an office with and for the tolerance he exhibited doing so. To Markus Herrmannsdörfer for his encouragement and his pragmatic, uncomplicated way that makes collaboration productive and fun. To Markus Pizka for raising my interest in research and for encouraging me to start my PhD thesis. Working with all of you was, and still is, a privilege.
Research, understanding and idea generation benefit from collaboration. I am grateful for joint paper projects to Sebastian Benz, Michael Conradt, Florian Deissenboeck, Christoph Domann, Martin Feilkas, Jean-François Girard, Nils Göde, Lars Heinemann, Benjamin Hummel, Klaus Lochmann, Benedikt Mas y Parareda, Michael Pfaehler, Markus Pizka, Daniel Ratiu, Bernhard Schaetz, Jonathan Streit, Stefan Teuchert and Stefan Wagner. In addition, this thesis benefited from the feedback of many. I am thankful for proof-reading drafts to Florian Deissenboeck, Martin Feilkas, Nils Göde, Lars Heinemann, Benjamin Hummel, Klaus Lochmann, Birgit Penzenstadler, Daniel Ratiu and Stefan Wagner. And to Rebecca Tiarks for help with the Bellon Benchmark.
The empirical parts of this work could not have been realized without the continuous support of our industrial partners. I want to thank everybody I worked with at ABB, MAN, LV 1871 and Munich Re Group. I particularly thank Munich Re Group, especially Rainer Janßen and Rudolf Vaas, for the long-term collaboration with our group that substantially supported this dissertation.
Most of all, I want to thank my family for their unconditional support (both material and immaterial) not only during my dissertation, but during all of my education. I am deeply grateful to my parents, my brother and, above all, my wife Sofie.
»A man’s gotta do what a man’s gotta do«
Fred MacMurray in The Rains of Ranchipur
»A man’s gotta do what a man’s gotta do«
Gary Cooper in High Noon
»A man’s gotta do what a man’s gotta do«
George Jetson in The Jetsons
»A man’s gotta do what a man’s gotta do«
John Cleese in Monty Python’s Guide to Life
Contents
1 Introduction
1.1 Problem Statement
1.2 Contribution
1.3 Contents
2 Fundamentals
2.1 Notions of Redundancy
2.2 Software Cloning
2.3 Notions of Program Similarity
2.4 Terms and Definitions
2.5 Clone Metrics
2.6 Data-flow Models
2.7 Case Study Partners
2.8 Summary
3 State of the Art
3.1 Impact on Program Correctness
3.2 Extent of Cloning
3.3 Clone Detection Approaches
3.4 Clone Assessment and Management
3.5 Limitations of Clone Detection
4 Impact on Program Correctness
4.1 Research Questions
4.2 Study Design
4.3 Study Objects
4.4 Implementation and Execution
4.5 Results
4.6 Discussion
4.7 Threats to Validity
4.8 Summary
5 Cloning Beyond Code
5.1 Research Questions
5.2 Study Design
5.3 Study Objects
5.4 Implementation and Execution
5.5 Results
5.6 Discussion
5.7 Threats to Validity
5.8 Summary
6 Clone Cost Model
6.1 Maintenance Process
6.2 Approach
6.3 Detailed Cost Model
6.4 Simplified Cost Model
6.5 Discussion
6.6 Instantiation
6.7 Summary
7 Algorithms and Tool Support
7.1 Architecture
7.2 Preprocessing
7.3 Detection Algorithms
7.4 Postprocessing
7.5 Result Presentation
7.6 Comparison with other Clone Detectors
7.7 Maturity and Adoption
7.8 Summary
8 Method for Clone Assessment and Control
8.1 Overview
8.2 Clone Detection Tailoring
8.3 Assessment of Impact
8.4 Root Cause Analysis
8.5 Introduction of Clone Control
8.6 Continuous Clone Control
8.7 Validation of Assumptions
8.8 Evaluation
8.9 Summary
9 Limitations of Clone Detection
9.1 Research Questions
9.2 Study Objects
9.3 Study Design
9.4 Implementation and Execution
9.5 Results
9.6 Discussion
9.7 Threats to Validity
9.8 Summary
10 Conclusion
10.1 Significance of Cloning
10.2 Clone Control
11 Future Work
11.1 Management of Simions
11.2 Clone Cost Model Data Corpus
11.3 Language Engineering
11.4 Cloning in Natural Language Documents
11.5 Code Clone Consolidation
Bibliography
1 Introduction
Software maintenance accounts for the majority of the total life cycle costs of successful software systems [21, 80, 184]. Half of the maintenance effort is not spent on bug fixing or adaptations to changes of the technical environment, but on evolving and new functionality. Maintenance thus preserves and increases the value that software provides to its users. Reducing the number of changes that get performed during maintenance threatens to reduce this value. Instead, to lower the total life cycle costs of software systems, the individual changes need to be made simpler. An important goal of software engineering is thus to facilitate the construction of systems that are easy, and thus more economic, to maintain.
Software comprises a variety of artifacts, including requirements specifications, models and source code. During maintenance, all of them are affected by change. In practice, these artifacts often contain substantial amounts of duplicated content. Such duplication is referred to as cloning.
Figure 1.1: Cloning in use case documents
Cloning hampers maintenance of software artifacts in several ways. First, it increases their size and thus effort for all size-related activities such as inspections: inspectors simply have to work through more content. Second, changes that are performed to an artifact often also need to be performed to its clones, causing effort for their location and consistent modification. If, e.g., different use case documents contain duplicated interaction steps for system login, they all have to be adapted if authentication is changed from password to keycard entry. Moreover, if not all clones of an artifact are modified consistently, inconsistencies can occur that can result in faults in deployed software. If, e.g., a developer fixes a fault in a piece of code but is unaware of its clones, the fault fails to be removed from the system. Each of these effects of cloning contributes to increased software life cycle costs. Cloning is, hence, a quality defect.
Figure 1.2: Cloning threatens program correctness
The negative impact of cloning becomes tangible through examples from real-world software. We studied inspection effort increase due to cloning in 28 industrial requirements specifications. For the largest specification, the estimated inspection effort increase is 110 person hours, or almost 14 person days. For a second specification, it even doubles due to cloning.¹
The effort increase due to the necessity to perform multiple modifications is illustrated in Figure 1.1, which depicts cloning in 150 use cases from an industrial business information system. Each black rectangle represents a use case, its height corresponding to the length of the use case in lines. Each colored stripe depicts a specification clone; stripes with the same color indicate clones with similar text. If a change is made to a colored region, it may need to be performed multiple times, increasing modification effort accordingly.
Finally, Figure 1.2² illustrates the consequences of inconsistent modifications to cloned code for program correctness: a missing null check has only been fixed in one clone, the other still contains the defect and can crash the system at runtime.
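
The inconsistent fix described above can be made concrete with a small sketch. The following Python fragment is not the Sysiphus code shown in Figure 1.2; the function names and the labeling logic are hypothetical and only illustrate how a missing None check (the Python counterpart of a null check) can be fixed in one clone while the defect survives in the other.

```python
from typing import Callable, Optional


def format_user_label(name: Optional[str]) -> str:
    """Clone A (hypothetical): the missing None check has been fixed here."""
    if name is None:
        return "<unknown>"
    return name.strip().title()


def format_owner_label(name: Optional[str]) -> str:
    """Clone B (hypothetical): copied before the fix, the check is still missing."""
    return name.strip().title()  # raises AttributeError when name is None


def crashes_on_missing_name(formatter: Callable[[Optional[str]], str]) -> bool:
    """Return True if the given formatter fails for a missing name."""
    try:
        formatter(None)
    except AttributeError:
        return True
    return False
```

Both functions behave identically for regular input, so the remaining defect only shows up when a name is missing, for example via `crashes_on_missing_name(format_owner_label)`.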
1.1 Problem Statement
Different groups in the software engineering community have independently recognized that cloning can negatively impact engineering efforts. Redundancy in requirements specifications, including cloning, is considered as an obstacle for modifiability [100] and listed as a major problem in automotive requirements engineering [230]. Cloning in source code is deemed as an indicator for bad design [17, 70, 175]. In response, the investigation of cloning has grown into an active area in the software engineering research community [140, 201], yielding, e.g., numerous detection approaches and a better understanding of the origin and evolution of cloning in source code.
¹ The study is presented in detail in Chapter 5.
² The code example is taken from the open source project Sysiphus.
Nevertheless, cloning abounds in practice. Researchers report that between 8% and 29%, in some cases even more than 60%, of the source code in industrial and open source systems has been duplicated at least once [6, 62, 157]. Cloning in source code has been reported for different programming languages and application domains [140, 201]. Despite these facts, hardly any systematic measures to control cloning are taken in practice. Given its known extent and negative impact on real-world software, we consider this apparent lack of applied measures for clone control as precarious.
Based on our experiences from four years of close collaboration on software cloning with our industrial partners, we see two principal reasons for this: first, the significance of cloning is insufficiently understood; second, we lack a comprehensive method that guides practitioners in establishing continuous clone control. We detail both reasons below.
Significance of Cloning. The extent of cloning in software artifacts is insufficiently understood. While numerous studies have revealed cloning in source code, hardly anything is known about cloning in other artifacts, such as requirements specifications and models.
Even more importantly, the quantitative impact of cloning on program correctness and maintenance effort is unclear. While existing research has demonstrated its impact qualitatively, we cannot quantify it in terms of faults or effort increase. Consequently, we do not know how harmful cloning, and how important clone control, really is in practice.
Clone Control. To be effective, clone control needs to be applied continuously, both to prevent the creation of new clones and to create awareness of existing clones during code modification. Continuous application requires accurate results. However, existing tools produce large amounts of false positives. Since inspection of false positives is a waste of effort, and repeated inspection even more so, they inhibit continuous clone control. We lack commonly accepted criteria for clone relevance and techniques to achieve accurate results. Furthermore, to have long term success, clone control must be part of the maintenance process. Its integration requires changes to established habits. Unfortunately, existing approaches for clone management are limited to technical topics and ignore organizational issues.
To operationalize clone control, comprehensive tool support is required that supports all of its steps. Existing tools, however, typically focus on individual aspects, such as clone detection or change propagation, or are limited to source code and thus cannot be applied to specifications or models. Furthermore, most detection approaches are not both incremental and scalable. They thus cannot provide real-time results for large evolving software artifacts. Dedicated tool support is thus required for clone control.
Problem. We need a better understanding of the quantitative impact of cloning on software engineering and a comprehensive method and tool support for clone control.
1.2 Contribution
This dissertation contributes to both areas, as detailed below.
Significance of Cloning. We present empirical studies and an analytical cost model to demonstrate the significance of cloning and, consequently, the importance of clone control.
First, we present a large scale case study investigating the impact of cloning on program correctness. Through the analysis of inconsistently maintained clones, 107 faults were discovered in industrial and open source software, including 17 critical ones that could result in system crashes or data loss; not a single system was without faults in inconsistently modified cloned code. Every second change to cloned code that was made in unawareness of cloning was faulty. This demonstrates that unawareness of cloning significantly impacts program correctness and thus demonstrates the importance of controlling code cloning in practice. The case study was carried out with Munich Re and LV 1871.
Second, we present two large industrial case studies that investigate cloning in requirements specifications and Matlab/Simulink models. They demonstrate that the extent and impact of cloning are not limited to source code. For these artifacts, manual inspections are commonly used for quality assurance. The cloning-induced size increase translates to higher inspection efforts: for one of the analyzed specifications by an estimated 14 person days; for a second one it more than doubles. To avoid these consequences, cloning needs to be controlled for requirements specifications and graph-based models, too. This work is the first to investigate cloning in requirements specifications and graph-based models. The case studies were carried out, among others, with Munich Re, Siemens and MAN Nutzfahrzeuge Group.
Third, we present an analytical cost model that quantifies the impact of code cloning on maintenance activities and field faults. It complements the above empirical studies by making our observations and assumptions about the impact of code cloning on software maintenance explicit. The cost model provides a foundation for assessment and trade-off decisions. Furthermore, its explicitness offers an objective basis for scientific discourse about the consequences of cloning.
Clone Control. We present a comprehensive method for clone control and tool support to operationalize it in practice.
We introduce a method for clone assessment and control that provides detailed steps for the assessment of cloning in software artifacts and for the control of cloning during software engineering. It comprises detector tailoring to achieve accurate detection results; assessment to evaluate the significance of cloning for a software system; change management to successfully adapt established processes and habits; and root cause analysis to prevent creation of excessive amounts of new clones. The method has been evaluated in a case study with Munich Re in which continuous clone control was performed over the course of one year and succeeded in reducing code cloning.
To operationalize the method, we introduce industrial-strength tool support for clone assessment and control. It includes novel clone detection algorithms for requirements specifications, graph-based models and source code. The proposed index-based detection algorithm is the first approach that is at the same time incremental, distributed and scalable to very large code bases. Since the tool support has matured beyond the stage of a research prototype, several companies have included it into their development or quality assessment processes, including ABB, Bayerisches Landeskriminalamt, BMW, Capgemini sd&m, itestra GmbH, Kabel Deutschland, Munich Re and Wincor Nixdorf. It is available as open source for use by both industry and the research community.
Finally, this thesis presents a controlled experiment that shows that existing clone detectors, and their underlying approaches, are limited to copy & paste. They are unsuited to detect behaviorally similar code of independent origin. The experiment was performed on over 100 behaviorally similar programs that were produced independently by 400 students through implementation of a single specification. Quality control thus cannot rely on clone control to manage such redundancies. Our empirical results indicate, however, that they do occur in practice. Their detection thus remains an important topic for future work.
As stated above, software comprises various artifact types. All of them can be affected by cloning. We are convinced that it should be controlled for all artifacts that are subject to maintenance. However, the set of all artifacts described in the literature is large, beyond what can be covered in depth in a dissertation. In this work, we thus focus on three artifact types that are central to software engineering: requirements specifications, models and source code. Among them, source code is arguably the most important: maintenance simply cannot avoid it. Even projects that have (sensibly or not) abandoned maintenance of requirements specifications and models still have to modify source code. Consequently, it is the artifact type that receives most attention in this thesis.
1.3 Contents
The remainder of this thesis is structured as follows:
Chapter 2 discusses different notions of redundancy, defines the terms used in this thesis and introduces the fundamentals of software cloning. Chapter 3 discusses related work and outlines open issues, providing justification for the claims made in the problem statement.
The following chapters present the contributions of the thesis in the same order as they are listed in Section 1.2. Chapter 4 presents the study on the impact of unawareness of cloning on program correctness. Chapter 5 presents the study on the extent and impact of cloning in requirements specifications and Matlab/Simulink models. Chapter 6 presents the analytical clone cost model. Chapter 7 outlines the architecture and functionality of the proposed clone detection workbench. Chapter 8 introduces the method for clone assessment and control and its evaluation. Chapter 9 reports on the controlled experiment on the capabilities of clone detection in behaviorally similar code of independent origin.
Finally, Chapter 10 summarizes the thesis and Chapter 11 provides directions for future research.
Previously Published Material
Parts of the contributions presented in this thesis have been published in [53–55, 57, 97, 110–117].
2 Fundamentals
This chapter introduces the fundamentals of this thesis. The first part discusses different notions of the term redundancy that are used in computer science. It then introduces software cloning and other notions of program similarity in the context of these notions of redundancy. The later parts of the chapter introduce terms, metrics and artifact types that are central to the thesis and the industrial partners that participated in the case studies.
2.1 Notions of Redundancy
Redundancy is the fundamental property of software artifacts underlying software cloning research. This section outlines and compares different notions of redundancy used in computer science. It provides the foundation to discuss software cloning, the form of redundancy studied in this thesis.
2.1.1 Duplication of Problem Domain Information
In several areas of computer science, redundancy is defined as duplication of problem domain knowledge in the representation. We use the term “problem domain knowledge” with a broad meaning: it not only refers to the concepts, processes and entities from the business domain of a software artifact. Instead, we employ it to include all concepts implemented by a program or represented in an artifact. These can, e.g., include data structures and algorithms and comprise both structural and behavioral aspects.
Normal Forms in Relational Databases. Intuitively, a database contains redundancy if a single fact from the problem domain is stored multiple times in the database. If compared to a database without redundancy, this has several disadvantages:
Size increase: Representation of information requires space. Storing a single fact multiple times thus increases the size of a database and thus costs for storage or algorithms whose runtime depends on database size.
Update anomaly: If information changes, e.g., through evolution of the problem domain, all locations in which it is stored in the database need to be changed accordingly. A single change in the problem domain thus requires multiple modifications in the database. The fact that a single change requires multiple modifications is referred to as update anomaly and increases modification effort. Furthermore, if not all locations are updated, inconsistencies can creep into the database.
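
The update anomaly can be illustrated with a minimal sketch. The table layout, the customer names and the helper functions below are hypothetical and not taken from the thesis; the rows simply store one problem domain fact (the city of a customer) once per order, so that a partial update leaves the data inconsistent.

```python
# Denormalized rows: the city of a customer is stored once per order (illustrative data).
orders = [
    {"order_id": 1, "customer": "Miller", "city": "Munich"},
    {"order_id": 2, "customer": "Miller", "city": "Munich"},
    {"order_id": 3, "customer": "Smith", "city": "Bremen"},
]


def update_city_in_row(rows, order_id, new_city):
    """Update a single row only, as happens when the other copies are forgotten."""
    for row in rows:
        if row["order_id"] == order_id:
            row["city"] = new_city


def inconsistent_customers(rows):
    """Return the customers for which more than one city is stored."""
    cities = {}
    for row in rows:
        cities.setdefault(row["customer"], set()).add(row["city"])
    return sorted(name for name, stored in cities.items() if len(stored) > 1)


# The customer moves, but only one of the two redundant copies is changed.
update_city_in_row(orders, 1, "Berlin")
```

After the partial update, `inconsistent_customers(orders)` reports the customer whose city is now stored with two different values. In a schema that stores the city once per customer, the same change would touch a single location.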
Relational database design advocates normal forms to reduce redundancy in databases [129]. Normal forms are properties of database schemas that, when violated, indicate multiple storage of information from the problem domain in the database. Normal forms are defined as properties on the schemas [129], not of the data entries stored in the database. Database schema design thus propagates a top-down approach to discover and avoid redundancy in databases: through analysis of the properties of the schema, not through analysis of similarity in the data.
Logical Redundancy in Programs. In his PhD thesis, Daniel Ratiu defines logical redundancy for programs [190]. Intuitively, according to his definitions, a program contains redundancy if facts from the problem domain are implemented multiple times in the program. Just as for databases, if compared to a program without redundancy, this has several disadvantages:
Size increase: Implementation of a fact from the problem domain requires space in the program and thus increases program size. For software maintenance, this can increase efforts for size-related activities such as inspections.
Update anomaly: Similarly to the update anomaly in databases, if a fact in the problem domain changes, all of its implementations need to be adapted accordingly, creating effort for their location and consistent modification. Again, if modification is not performed consistently to all instances, inconsistencies can be introduced into the program.
Just as for databases, redundancy is defined independent of the actual representation of the data. Redundant program fragments thus can, but do not need to, look syntactically similar.
Whereas schemas provide models of the problem domain for database systems, there is no comparable model of the problem domain of programs. Ratiu suggests to use ontologies as models of the problem domain [190]. Since they are typically not available, they have to be created to detect redundancy.
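
The observation that redundant fragments need not look alike can be sketched as follows. The pricing rule and both function names are invented for illustration and do not stem from Ratiu's work: the two functions encode the same hypothetical domain fact (a base fee of 3.0 plus 2.0 per kilogram) in syntactically unrelated ways.

```python
def shipping_cost_table(weight_kg: int) -> float:
    """Implementation 1 (hypothetical): the pricing rule encoded as a lookup table."""
    table = {1: 5.0, 2: 7.0, 3: 9.0}
    return table[weight_kg]


def shipping_cost_formula(weight_kg: int) -> float:
    """Implementation 2 (hypothetical): the same pricing rule encoded as a formula."""
    return 3.0 + 2.0 * weight_kg
```

If the pricing rule changes, both functions have to be located and adapted, although a comparison of their syntax gives no hint that they belong together.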
2.1.2 Representation Size Excess
In information theory [166], minimal description length research [89] and data compression [205], redundancy is defined as size excess. Intuitively, data contains redundancy if a shorter representation for it can be found from which it can be reproduced without loss of information.
The notion of redundancy as size excess translates to compression potential. Any property of an artifact which can be exploited for compression thus increases its size excess. Since, according to Grünwald [89], any regularity can in principle be exploited to compress an artifact, all regularity increases size excess.
Compression potential not only depends on the artifact but also on the employed compression scheme. The theoretical limit is given by the Kolmogorov complexity of an artifact, defined as the size of the smallest program that produces the artifact. Unfortunately, it is not computable [89, 156]. Hence, to employ compression potential as a metric for redundancy in practice, less powerful but efficiently computable compression schemes are employed, such as general purpose compressors like gzip or GenCompress.
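
As a rough sketch of compression potential as a redundancy metric, the following fragment measures the fraction of size that a general purpose compressor saves. It uses Python's `zlib` module as a stand-in for gzip (both are based on DEFLATE); the function name `compression_potential` and the sample data are assumptions made for this illustration.

```python
import random
import zlib


def compression_potential(data: bytes) -> float:
    """Fraction of the size saved by zlib; higher values indicate more size excess."""
    if not data:
        return 0.0
    compressed = zlib.compress(data, 9)
    return 1.0 - len(compressed) / len(data)


# Highly regular input: the same line repeated many times.
duplicated = b"if (user == null) { return; }\n" * 200

# Input without exploitable regularity: pseudo-random bytes of the same length.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(duplicated)))
```

Comparing `compression_potential(duplicated)` with `compression_potential(noise)` shows the expected contrast: the repeated text compresses to a small fraction of its size, whereas the pseudo-random bytes offer practically no compression potential. The metric says nothing about why the regular input is compressible.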
Regularity in data representation can have different sources. Duplicated fragments of problem domain knowledge exhibit the same structure and thus represent regularity. Regularity, however, does not need to stem from problem domain knowledge duplication. Inefficient encoding of the alphabet of a language into a binary representation introduces regularity that can be exploited for compression, as is, e.g., done by Huffman coding [95].
Similarly, language grammars are a source of regularity, since they enforce syntax rules to which all artifacts written in a language adhere. Again, this regularity can be exploited for compression, as is, e.g., done by syntax-based coding [35].
Redundancy in terms of representation size excess thus corresponds to compression potential of an artifact. Regularity in the data representation provides compression potential, independent of its source: from the point of view of compression, it is of no importance if the regularity stems from problem domain knowledge duplication or inefficient coding. This notion of redundancy thus does not differentiate between different sources of regularity.
2.1.3 Discussion
There are fundamental differences between the two notions of redundancy. Whereas normal forms and logical program redundancy are defined in terms of duplication of information from the problem domain in the representation, size excess is defined on the representation alone. This is explicit in the statement from Grünwald [89], »We only have the data«; no interpretation in terms of the problem domain is performed. This has two implications:
Broader applicability: Since no interpretation in terms of the problem domain is required, it can be applied to arbitrary data. This is obvious for data compression that is entirely agnostic of the information encoded in the files it processes. However, it can also be applied to data we know how to interpret, but for which no suitable machine readable problem domain models are available, as, e.g., programs for which we do not have complete ontologies.
Reduced conclusiveness w.r.t. domain knowledge duplication: Since different sources of regularity can create representation size excess, it is no conclusive indicator for problem domain knowledge duplication. This needs to be taken into account by approaches that search for redundancy on the representation alone to discover problem domain knowledge duplication.
The relationship between the two notions of redundancy is sketched in the diagram in Figure 2.1. The left set represents redundancy in the sense of duplicated domain knowledge, the right set redundancy in terms of representation size excess. Their intersection represents duplicated domain knowledge that is sufficiently representationally similar to be compressible by the employed compression scheme.
The diagram assumes an imperfect compression scheme. For a perfect compressor, problem domain knowledge duplication would be entirely contained in representation size excess, since a perfect compressor would know how to exploit it for compression, even if it is syntactically different. However, no such compressor exists and, since Kolmogorov complexity is not computable, never will.
Figure 2.1: Relationship of different notions of redundancy
2.1.4 Superfluousness
Apart from problem domain duplication and representation size excess, a third notion of redundancy is used in some areas of computer science: superfluousness.
Several examples for this type of redundancy exist in the literature. In compiler construction, statements are considered as redundant if they are unreachable [134]. If the unreachable statements are removed, the code still exhibits the same observable behavior.¹ Second, if a usage perspective is adopted, statements are redundant if they are not required by the users of the software, e.g., because the feature they implement has become obsolete. Based on the actual need of the users, the software still exhibits the same behavior if the features that will never be used again are removed. A third example can be found in logic: a knowledge base of propositional formulas is redundant if it contains parts that can be inferred from the rest of it [158]. The removal of these parts does not change the models of the knowledge base, i.e., the variable assignments that evaluate to true.
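
The logic example can be sketched with a brute-force check over truth assignments. The representation of formulas as Python functions, the names `models` and `is_redundant`, and the three sample formulas are assumptions made for this illustration; the approach enumerates all assignments and is therefore only practical for a handful of variables.

```python
from itertools import product


def models(formulas, variables):
    """Return all assignments (as tuples of booleans) that satisfy every formula."""
    result = set()
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(formula(assignment) for formula in formulas):
            result.add(values)
    return result


def is_redundant(knowledge_base, index, variables):
    """A formula is redundant if removing it leaves the set of models unchanged."""
    rest = knowledge_base[:index] + knowledge_base[index + 1:]
    return models(rest, variables) == models(knowledge_base, variables)


VARIABLES = ["a", "b"]
KNOWLEDGE_BASE = [
    lambda v: v["a"],                  # a
    lambda v: (not v["a"]) or v["b"],  # a implies b
    lambda v: v["b"],                  # b, which follows from the two formulas above
]
```

In this knowledge base, the third formula can be removed without changing the models, whereas removing the first one admits additional assignments.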
Superfluousness is fundamentally different from the other notions of redundancy. Whereas duplication of problem domain information and representation size excess indicate that the representation can be compacted without loss of information, superfluousness indicates which information can be lost since it is not required for a certain purpose. This notion of redundancy is outside the scope of this thesis.
2.2 Software Cloning
This section introduces software cloning and compares it with the notions of redundancy introduced above. A more in-depth discussion of research in software cloning and in clone detection is provided in Chapter 3.
2.2.1 Cloning as Problem Domain Knowledge Duplication
Programs encode problem domain information. Duplicating a program fragment can thus create duplication of encoded problem domain knowledge. Since program fragment duplication preserves syntactic structure, the duplicates are also similar in their representation.
1Disregardingeffectsduetoapotentiallysmallermemoryfootprint.
Clones are similar regions in artifacts. They are not limited to source code, but can occur in other artifact types such as models or texts, as well. In the literature, different definitions of similarity are employed [140, 201], mostly based on syntactic characteristics. Their notion of redundancy is thus, strictly speaking, agnostic of the problem domain. In contrast, in this thesis, we require clones to implement one or more common problem domain concepts, thus turning clones into an instance of logical program redundancy as defined by Ratiu [190]. Cloning thus exhibits the negative impact of logical program redundancy (cf., Section 2.1.1).
The common concept implementation gives rise to change coupling: when the concept changes, all of its implementations (the clones) need to be changed. In addition, we require clones to be syntactically similar. While syntactic similarity is not required for change coupling, existing clone detection approaches rely on syntactic similarity to detect clones. In terms of Figure 2.1, this requirement limits clones to the intersection of the two sets. Hence, we employ the term clone to denote syntactically similar artifact regions that contain redundant encodings of one or more problem domain concepts. While syntactic similarity can be determined automatically, redundant concept implementation cannot.
For the sake of clarity, we differentiate between clone candidates, clones and relevant clones. Clone candidates are results of a clone detector run: syntactically similar artifact regions. Clones have been inspected manually and are known to implement common problem domain concepts. However, not all clones are relevant for all tasks: while for change propagation, all clones are relevant, for program compaction, e.g., only those are relevant that can be removed. In case only a subset of the clones in a system is relevant for a certain task, we refer to them as relevant clones.
A clone group is a set of clones. Clones in a single group are referred to as siblings; a clone’s artifact region is similar to the artifact regions of all its siblings. We employ these terms for clone candidates, clones and relevant clones.
2.2.2 Causes for Cloning
Clones are typically created by copy & paste. Many different causes can trigger the decision to copy, paste (and possibly modify) an artifact fragment. Several authors have analyzed the causes for cloning in code [123, 131, 140, 201]. We differentiate here between causes inherent to software engineering and causes originating in the maintenance environment and the maintainers.
Inherent Causes
Creating software is a difficult, intellectually challenging task. Inherent causes for cloning are those that originate in the inherent complexity of software engineering [25]: even ideal processes and tools cannot eliminate them completely.
One inherent reason is that creating reusable abstractions is hard. It requires a detailed understanding of the commonalities and differences among their instances. When implementing a new feature that is similar to an existing one, their commonalities and differences are not always clear. Cloning can be used to quickly generate implementations that expose them. Afterwards, remaining commonalities can be consolidated into a shared abstraction. A second reason is that understanding the impact of a change is hard for large software. An exploratory prototypical implementation of the change is one way to gain understanding of its impact. For it, an entire subsystem can be cloned and
modified for experimental purposes. After the impact has been determined, a substantiated decision can be taken on whether to integrate or merge the changes into the original code. After exploration is finished, clones can be removed.
In both cases, cloning is used as a means to speed up implementation to quickly gain additional information. Once the information is obtained, clones can be consolidated.
Maintenance Environment
The maintenance environment comprises the processes, languages and tools employed to maintain the software system. Maintainers can decide to clone code to work around a problem in the maintenance environment.
Processes can cause cloning. First, to reuse code, an organization needs a reuse process that governs its evolution and quality assurance. Missing or unsuitable reuse processes hinder maintainers in sharing code. In response, they reuse code through duplication. Second, short-sighted project management practices can trigger cloning. Examples include productivity measurement of LOC/day, or constant time pressure that encourages short-term solutions in ignorance of their long-term consequences. In response, maintainers duplicate code to reduce pressure from project management. Third, to make code reusable in a new context, it sometimes needs to be adapted. Poor quality assurance techniques can make the consequences of the necessary changes difficult to validate. In response, maintainers duplicate the code and make the necessary change to the duplicate to avoid the risk of breaking the original code.
Limitations in languages or tools can cause cloning. First, the creation of a reusable abstraction often requires the introduction of parameters. Language limitations can prohibit the necessary parameterization. In response, maintainers duplicate the parts that cannot be parameterized suitably. Second, reusable functionality is often encapsulated in functions or methods. On hot code paths of performance critical applications, method calls can impose a performance penalty. If the compiler cannot perform suitable inlining to allow for reuse without this penalty, maintainers inline the methods manually through duplication of their bodies.
Finally, besides inherent and maintenance environment causes, maintainers can decide to clone code for intrinsic reasons. For example, the long-term consequences of cloning can be unclear, or maintainers might lack the skills required to create reusable abstractions.
All non-inherent causes for cloning share two characteristics: even while cloning might be a successful short-term technique to circumvent its cause, its negative impact on software maintenance still holds; in addition, as long as their cause is not rectified, the clones cannot be consolidated. These causes can thus lead to gradual accumulation of clones in systems.
2.2.3 Clone Detection as Search for Representational Similarity
The goal of clone detection is to find clones, i.e., duplicated problem domain knowledge in the program. Unfortunately, clone detection has no access to models of the problem domain. To circumvent this, clone detection searches for similarity in the program representation. This has two implications for detection result quality:
Recall: duplicated problem domain knowledge that is not sufficiently representationally similar does not get detected. This limits the recall of detected w.r.t. total duplicated problem domain knowledge. The magnitude of this effect is difficult to quantify in practice, since the amount of all duplicated domain knowledge in a set of artifacts is typically unknown. Computing the recall of a clone detector in terms of how much of this it can detect is thus unfeasible in practice.
Precision: Since similarity in the program representation can, but does not need to be created by problem domain knowledge duplication, not all detected clone candidates contain duplicated problem domain knowledge. All program fragments that are sufficiently syntactically similar to be detected as clones, but do not implement common problem domain knowledge, are false positives that reduce precision. This typically occurs if clone detection removes all links to the problem domain, e.g., through normalization, which removes identifiers that reference domain concepts. Artifact regions that exhibit little syntactic variation are then likely to be identified as clone candidates, even though they share no relationship on the level of the problem domain concepts they implement.
Code Clone and Clone Candidate Classification
Code clones and clone candidates for source code can be classified into different types. Clone types impose syntactic constraints on the differences between siblings [19, 140]: type 1 is limited to differences in layout and comments, type 2 further allows literal changes and identifier renames and type 3 in addition allows statement changes, additions or deletions. The clone types form a hierarchy: type-3 clones contain type-2 clones, which contain type-1 clones. Type-2 clones (including type-1 clones) are also referred to as ungapped clones.
For clones in other artifact types than source code, no clone type classifications have been established so far. However, similar syntactic criteria could be used to create classifications for clones in data-flow models [86] and requirements specifications.
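The three clone types can be illustrated with a small token-based classifier. This is a sketch under stated assumptions, not the detection approach of this thesis: it operates on Python fragments via the standard `tokenize` module, the function names `tokens` and `clone_type` are hypothetical, and the similarity threshold of 0.7 for type-3 clones is an arbitrary choice for illustration.

```python
import difflib
import io
import keyword
import tokenize


def tokens(source, normalize):
    """Token strings of a fragment; comments and blank lines are dropped."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # layout and comments do not matter for any clone type
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            result.append(tokenize.tok_name[tok.type])  # keep structure only
        elif (normalize and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            result.append("ID")   # identifier renames are tolerated
        elif normalize and tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")  # literal changes are tolerated
        else:
            result.append(tok.string)
    return result


def clone_type(a, b, min_similarity=0.7):
    """Return 1, 2 or 3 for the most specific clone type, or None."""
    if tokens(a, False) == tokens(b, False):
        return 1
    norm_a, norm_b = tokens(a, True), tokens(b, True)
    if norm_a == norm_b:
        return 2
    ratio = difflib.SequenceMatcher(None, norm_a, norm_b).ratio()
    return 3 if ratio >= min_similarity else None


ORIGINAL = "total = base + 500  # gross salary\nprint(total)\n"
TYPE1 = "total = base+500\nprint( total )\n"
TYPE2 = "pay = wage + 750\nprint(pay)\n"
TYPE3 = "pay = wage + 750\nlog(pay)\nprint(pay)\n"
UNRELATED = "while True:\n    pass\n"

for other in (TYPE1, TYPE2, TYPE3, UNRELATED):
    print(clone_type(ORIGINAL, other))  # 1, 2, 3, None
```

The classifier only looks at syntax. Whether two fragments it reports also implement a common problem domain concept, as required for clones in this thesis, still has to be judged manually.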
2.2.4 Clone Management, Assessment and Control
Software clone management comprises all activities of “looking after and making decisions about consequences of copying and pasting” [141], including the prevention of clone creation and the consistent maintenance and removal of existing clones.
Software clone assessment, as employed by this thesis, is an activity that detects clones in software artifacts and quantifies their impact on engineering activities.
Software clone control, as employed by this thesis, is part of the process of quality control [48]. Quality control compares the actual quality of a system against its quality requirements and takes necessary actions to correct the difference. The quality requirement for clone control is twofold: first, to keep the amount of clones in a system low; second, to alleviate the negative consequences of existing clones in a system. Consequently, clone control analyzes the results of clone assessment and takes necessary actions to reduce the amount of clones and to simplify the maintenance of remaining clones. Clone control is thus a continuous process that is performed as part of quality control and that employs activities from clone management.
2.3 Notions of Program Similarity
Programs encode problem domain knowledge in different ways. Data structures encode properties of concepts; identifiers define and reference domain entities and algorithms implement behavior and processes from a problem domain. Duplication of problem domain information in the code can lead to different types of program similarity.
Many different notions of program similarity exist [228]. In this section, we differentiate between representational and behavioral similarity of code. Both representational and behavioral similarity can represent problem domain knowledge duplication.
2.3.1 Program-Representation-based Similarity
Numerous clone detection approaches have been suggested [140, 201]. All of them statically search a suitable program representation for similar parts. Amongst other things, they differ in the program representation they work on and the search algorithms they employ². Consequently, each approach has a different notion of similarity between the code fragments it can detect as clones.
The employed notions comprise textual, metrics and feature-based similarity [228]. From a theoretical perspective, they can be generalized into the notion of normalized information distance [155]. Since normalized information distance is based on the uncomputable Kolmogorov complexity, it cannot be employed directly. Instead, existing approaches use simpler notions that are efficiently computable. We classify them by the type of behavior-invariant variation they can compensate when recognizing equivalent code fragments and by the differences they tolerate between similar code fragments.
Text-based approaches detect clones that are equal on the character level. Token-based approaches can perform token-based filtering and normalization. They are thus robust against reformatting, documentation changes or renaming of variables, classes or methods. Abstract syntax tree (AST)-based approaches can perform grammar-level normalization and are thus furthermore robust against differences in optional keywords or parentheses. Program dependence graph (PDG)-based approaches are somewhat independent of statement order and are thus robust against reordering of commutative statements. In a nutshell, existing approaches exhibit varying degrees of robustness against changes to duplicated code that do not change its behavior.
Some approaches also tolerate differences between code fragments that change behavior. Most approaches employ some normalization that removes or replaces special tokens and can make code that exhibits different behavior look equivalent to the detection algorithm. Moreover, several approaches compute characteristic vectors for code fragments and use a distance threshold between vectors to identify clones. Depending on the approach, characteristic vectors are computed from metrics, e.g., function-level size and complexity [139, 170] or AST fragments [16, 106]. Furthermore, ConQAT [115] detects clones that differ up to an absolute or relative edit distance.
In a nutshell, notions of representational similarity as employed by state of the art clone detection approaches differ in the types of behavior-invariant changes they can compensate and the amount of further deviation they allow between code fragments. The amount of deviation that can be tolerated in practice is, however, severely limited by the amount of false positives it produces.
² Please refer to Section 3.3 for a comprehensive overview of existing clone detection approaches.
Left:
    int x, y, z;
    z = x * y;
Right:
    int x’ = x;
    z = 0;
    while (x’ > 0) {
        z += y;
        x’ -= 1;
    }
    while (x’ < 0) {
        z -= y;
        x’ += 1;
    }
Figure 2.2: Code that is behaviorally equal but not representationally similar.
2.3.2 Behavioral Similarity
Besides their representational aspects, programs can be compared based on their behavior. Behavioral program similarity is not employed by existing clone detectors³. However, we introduce behavioral notions of program similarity since we employ them later to reason about the limitations of clone detection (cf., Chapter 9).
Several notions of behavioral or semantic similarity have been suggested [228]. In this work, we focus on similarity in terms of I/O behavior. We choose this notion for several reasons. It is more robust against transformations than, e.g., execution curve similarity [228] or strong program schema equivalence [98, 203]. Furthermore, it is habitually employed in the specification of interactive systems [26] and best captures our intuition.
For a piece of code (i.e., a sequence of statements) we call all variables written by this code its output variables and all variables which are read and do have an impact on the outputs its input variables. Each of the variables has a type which is uniquely determined from the context of the code. We can then interpret this code as a function from valuations of input variables to valuations of output variables, which is trivially state-less (and thus side-effect free), as we captured all global variables in the input and output variables.
We call two pieces of code behaviorally equal, iff they have the same sets of input and output variables (modulo renaming) and are equal with respect to their function interpretation. So, for each input valuation they have to produce the same outputs. An example⁴ of code that is behaviorally equal but not representationally similar is shown in Figure 2.2.
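The function interpretation can be sketched by translating the two fragments of Figure 2.2 into functions from input valuations to output valuations and comparing their outputs. This is an illustrative Python translation under assumptions: the function names are hypothetical, Python integers do not overflow as C integers would, and comparing outputs on a finite sample only provides evidence for behavioral equality, not a proof.

```python
def multiply_direct(x, y):
    """Left fragment: inputs x and y, output z."""
    return x * y


def multiply_by_addition(x, y):
    """Right fragment: the same inputs and output, computed by repeated addition."""
    counter = x  # copy of the input, so that x itself is not modified
    z = 0
    while counter > 0:
        z += y
        counter -= 1
    while counter < 0:
        z -= y
        counter += 1
    return z


def same_io_behavior(f, g, inputs):
    """True if both functions produce the same output for every sampled input."""
    return all(f(*args) == g(*args) for args in inputs)


sample = [(x, y) for x in range(-20, 21) for y in range(-20, 21)]
print(same_io_behavior(multiply_direct, multiply_by_addition, sample))  # True
```

A token-based or AST-based comparison of the two function bodies would find almost nothing in common, which is exactly the gap between representational and behavioral similarity discussed here.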
³ While there are some approaches that refer to themselves as semantic clone detection, e.g., PDG based approaches, we argue that they use a representational notion of similarity, since the PDG is a program representation.
⁴ Variable x’ on the right side is introduced to avoid modification of the input variable x.
For practical purposes, often not only strictly equal pieces of code are relevant, but also similar ones. We call such similar code a simion. Simions are behaviorally similar code fragments where behavioral similarity is defined w.r.t. input/output behavior. The specific definition of similarity is task-specific. One definition would be to allow different outputs for a bounded number of inputs. This would capture code with isolated differences (e.g., errors), for example in boundary cases. Another one could tolerate systematic differences, such as different return values for errors, or the infamous “off by one” errors. A further definition of similarity is compatibility in the sense that one simion may replace another in a specific context.
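The first of these definitions, different outputs for a bounded number of inputs, can be sketched as follows. The two `clamp` variants, the sampled input range and the bound `max_differences` are hypothetical choices for illustration; as before, a finite sample can only approximate the comparison over all inputs.

```python
def differing_inputs(f, g, inputs):
    """Inputs for which the two pieces of code produce different outputs."""
    return [args for args in inputs if f(*args) != g(*args)]


def is_simion(f, g, inputs, max_differences):
    """Behaviorally similar if outputs differ for a bounded number of inputs."""
    return len(differing_inputs(f, g, inputs)) <= max_differences


def clamp(value, low, high):
    return max(low, min(value, high))


def clamp_with_boundary_error(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    if value == high:
        return high - 1  # isolated error in a boundary case
    return value


sample = [(value, 0, 10) for value in range(-5, 16)]
print(differing_inputs(clamp, clamp_with_boundary_error, sample))  # [(10, 0, 10)]
print(is_simion(clamp, clamp_with_boundary_error, sample, max_differences=1))  # True
```

With `max_differences=0` the check degenerates to behavioral equality on the sample, and the two functions are no longer accepted.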
The detection of simions that are not representationally similar is beyond the scope of this thesis.
Simion versus Clone
Most definitions of software clones denote a common origin of the cloned code fragments [227], as is also the case in biology: Haldane coined the term “clone” from the Greek word for twig, branch [90]. We want to be able to investigate code similarities independent of their mode of creation, however. Using a term that in most of its definitions implies duplication from a single ancestor as a mode of creation is thus counter-intuitive. We thus deliberately introduce the term “simion” to avoid confusion.
For the sake of clarity, we relate the term to those definitions of “clone” that are most closely related: accidental clones denote code fragments that have not been created by copy & paste [1]. Their similarity results typically from constraints or interaction-protocols imposed by the same libraries or APIs they use. However, while they are similar w.r.t. those constraints or protocols, they do not need to be similar on the behavioral level⁵. Semantic clones denote code fragments whose program dependence graph fragments are isomorphic [73]. Since the program dependence graphs are abstractions of the program semantics, and thus do not capture them precisely, they can, but do not need to have similar behavior. Type-4 clones as defined by [200] as “two or more code fragments that perform the same computation but are implemented by different syntactic variants” are comparable to simions. However, we prefer a term that does not include the word “clone” as this implies that one similar instance is derived from another which is not the case if they have been developed independently.
⁵ In other words, even though the code of two UI dialogs looks similar in parts since the same widget toolkit is used, the dialogs can look and behave very differently.
2.4 Terms and Definitions
This section introduces further terms that are central to this thesis.
Software Artifacts
A software artifact is a file that is created and maintained during the life cycle of a software system. It is part of the system or captures knowledge about it. Examples include requirements specifications, models and source code. From the point of view of analysis, an artifact is regarded as a collection of atomic units. For natural language texts, these units can be words or sentences. For source code, tokens or statements. For data-flow models such as Matlab/Simulink, atomic units are basic model blocks such as addition or multiplication blocks. The type of data structure according to which the atomic units are arranged varies between artifact types. Requirements specifications and source code are considered as sequences of units. Data-flow models as graphs of units.
We use the term requirements specification according to IEEE Std 830-1998 [100] to denote a specification for a particular software product, program, or set of programs that performs certain functions in a specific environment. A single specification can comprise multiple individual documents.
We use the term use case to refer to a requirements specification written in use case form. Use cases describe the interaction between the system and a stakeholder under various conditions [37]. We assume use cases to be in text form.
We use the term data-flow model to refer to models as used in the embedded domain, such as Matlab/Simulink or ASCET models. A single data-flow model can comprise multiple physical model files.
Size Metrics
Lines of code (LOC) denote the sum of the lines of code of all source files, including comments and blank lines. Source statements (SS) are the number of all source code statements, not taking commented or blank lines and code formatting into account. For models, size metrics typically refer to blocks or elements, instead of lines or statements. The number of blocks denotes the size of a Matlab/Simulink model in terms of atomic elements. The redundancy free source statements (RFSS) are the number of source statements, if cloned source statements are only counted once. RFSS thus estimates the size of a system from which all clones are perfectly removed.
Failure and Fault
We use the term failure to denote an incorrect output of a software visible to the user. A fault is the cause in the source code of a potential failure.
Method
We employ the term method according to Balzert⁶ to denote “a systematic, justified procedure to accomplish specified goals”.
2.5 Clone Metrics
The case studies and methods presented in the following chapters employ several clone-related metrics. They are defined and illustrated in the following. The metrics are employed in this or in similar form by several clone detection approaches [140, 201].
2.5.1 Example
To make the metrics more tangible, we use a running example. Figure 2.3 shows the structure of the example artifacts and their contained clones.
⁶ Translated from German by the author.
Figure 2.3: Running example
The example contains three artifact files A-C and three candidate clone groups a-c. Candidate clone group a has three candidate clones, covering all artifacts. Group b has two candidate clones, covering artifacts A and B. Group c has four candidate clones, with c1 and c2 located in artifacts A and B respectively, and c3 and c4 located in artifact C. Groups b and c overlap. Dimensions of the artifacts and the candidate clone groups are depicted in Table 2.1.
Table 2.1: Dimensions
          A     B     C     a     b     c
Length    60    100   40    5     40    10
We interpret the example for source code, requirements specifications and models below. Length is measured in lines for source code and requirements and in model elements for models. The primary difference in the case of models is that their clones are not consecutive file regions, but subgraphs of the model graph. A visualization of the models and their candidate clones would thus look less linear than Figure 2.3.
Source Code
Artifacts A to C are textual source code files in Java. Artifacts A and B implement business logic for a business information system. A implements salary computation for employees, B implements salary computation for freelancers. C contains utility methods that compute salaries.
The candidate clones of candidate clone group a contain import statements that are located at the start of the Java files. Clone group b contains the basic salary computation functionality. Clone group c contains a tax computation routine which is used both for salary computation of employees and freelancers and in the utility methods of file C.
Java import statements map between local type names used in a file and fully qualified type names employed by the compiler. Modern IDEs automate management of import statements. They are thus not modified manually during typical software maintenance tasks. Redundancy in import statements thus does not affect maintenance effort.
Requirements Specifications
Artifacts A to C are use case documents. Document A describes use case “Create employee account”, document B use case “Create freelancer account”. Document C describes use case “create customer” and contains primary and alternative scenarios.
The clones of clone group a contain document headers that are common to all use case documents. Clone group b contains preconditions, steps and postconditions of generic account creation. Clones of clone group c contain postconditions that hold both after account creation and for both the primary and alternative scenario of customer creation.
Data-Flow Models
Artifacts A to C are Matlab/Simulink files that are part of a single model. Each file represents a separate subsystem. Whereas the clones of clone groups b and c encode similar PID controllers, the clone candidates of candidate clone group a only comprise connector blocks; they are thus not relevant for maintenance.
Relevance
From a maintenance perspective, candidate clone group a is not relevant. In the source code case, it contains import statements that are automatically maintained by modern IDEs: no manual import statement maintenance takes place that could benefit from knowledge of clone relationships. In the requirements specification case, it contains a document header that does not get maintained manually in each document. Changes to the header are automatically replicated for all documents by the text processor used to edit the requirements specifications. In the model case, the connectors establish the syntactic subsystem interface. Consistency of changes to it is enforced by the compiler. Similarly, no manual maintenance takes place that could make use of knowledge about clone relations. The candidate clones in group a are thus not relevant clones for the task of software maintenance. The remaining clone groups, however, are relevant.
2.5.2 Metric Template
Each metric is introduced following a fixed template. Its definition defines the metric and specifies its scale and range. Its determination describes whether the value for the metric can be determined fully automatically by a tool or whether human judgement is required. Its example paragraph computes the metric for the example artifacts and clone groups.
The role of the metrics for clone assessment, and thus the interpretation of their values for software engineering activities, is described in detail in Chapter 8. This section thus only briefly summarizes the purpose of each metric.
2.5.3 Clone Counts
Definition 1. Clone group count is the number of clone groups detected for a system. Clone count is the total number of clones contained in them.
Clone counts are used during clone assessment to reveal how many parts of the system are affected by cloning. Both counts have a ratio scale and range between [0, ∞[.
Determination
Both counts are trivially determined automatically.
Example
For the example, the clone group count is 3, the clone count is 9. If clone group a is removed, clone group count is reduced to 2 and clone count to 6.
2.5.4 Overhead
Definition 2. Overhead is the size increase of a system due to cloning.
Overhead is used in the evaluation of the cloning-induced effort increase in size-related activities. It is measured in relative and absolute terms:
\[ \mathit{overhead\_rel} = \frac{\mathit{size}}{\mathit{redundancy\ free\ size}} - 1 \]
If the size is > 0, the redundancy free size can never be 0. Overhead is thus always defined for all artifacts of size > 0. The subtraction of 1 from the ratio size / redundancy free size makes overhead_rel quantify only the size excess.
\[ \mathit{overhead\_abs} = \mathit{size} - \mathit{redundancy\ free\ size} \]
Both have a ratio scale and range between [0, ∞[.
Determination
Overhead is computed on the clone groups detected for a system. The accuracy of the overhead metric thus depends on the accuracy of the clones on which it is computed.
Example
To compute overhead for source code, we employ statements as basic units. The redundancy free source statements (RFSS) for artifact A are computed as the sum of:
- 15 statements that are not covered by any clone: they account for 15 RFSS for file A.
- The 5 statements that are covered by clone a1 occur 3 times altogether. They thus only account for 5 · 1/3 = 1 2/3 RFSS for file A.
- The 30 statements that are covered by clone b1 but not by clone c1 occur 2 times. They thus only account for 30 · 1/2 = 15 RFSS for file A.
- The 10 statements that are covered by both clones b1 and c1 occur 4 times. They thus account for 10 · 1/4 = 2 1/2 RFSS for file A.
In all, file A thus has 15 + 1 2/3 + 15 + 2 1/2 = 34 1/6 RFSS. Since file A has 60 statements altogether, overhead = 60 / (34 1/6) − 1 = 75.6%.
RFSS for artifacts A-C is 130, corresponding overhead is overhead = 200/130 − 1 = 53.8%. If clone group a is excluded since it is not relevant to maintenance, RFSS increases to 140 and overhead decreases to 42.9%.
To compute overhead for other artifacts, we choose different artifact elements as basic units. For requirements specifications, we employ sentences as basic units; for models, model elements. Overhead for them is computed analogously.
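The overhead example can be recomputed with a short script. This is a sketch under assumptions: the artifact sizes, clone lengths and the overlap of groups b and c follow the running example, but the start offsets of the clones are invented, since the exact positions are only shown in Figure 2.3. The union-find bookkeeping in `unit_classes` is one possible way to count each cloned unit once and is not claimed to be how the tooling of this thesis computes RFSS.

```python
from collections import Counter
from fractions import Fraction

SIZES = {"A": 60, "B": 100, "C": 40}
# Clone groups as lists of (artifact, start offset, length); offsets are assumed.
GROUPS = {
    "a": [("A", 0, 5), ("B", 0, 5), ("C", 0, 5)],
    "b": [("A", 20, 40), ("B", 60, 40)],
    "c": [("A", 50, 10), ("B", 90, 10), ("C", 10, 10), ("C", 30, 10)],
}


def unit_classes(sizes, groups):
    """Map every unit (artifact, index) to a representative of its clone class."""
    parent = {(art, i): (art, i) for art, n in sizes.items() for i in range(n)}

    def find(unit):
        while parent[unit] != unit:
            parent[unit] = parent[parent[unit]]  # path halving
            unit = parent[unit]
        return unit

    for clones in groups.values():
        first_art, first_start, length = clones[0]
        for art, start, _ in clones[1:]:
            for k in range(length):  # unit k of a clone mirrors unit k of its sibling
                parent[find((art, start + k))] = find((first_art, first_start + k))
    return {unit: find(unit) for unit in parent}


def rfss(sizes, groups, artifact=None):
    """Redundancy free size: a unit that occurs n times counts as 1/n."""
    classes = unit_classes(sizes, groups)
    occurrences = Counter(classes.values())
    return sum(Fraction(1, occurrences[rep])
               for (art, _), rep in classes.items()
               if artifact is None or art == artifact)


def overhead_rel(size, redundancy_free_size):
    return Fraction(size) / redundancy_free_size - 1


total = sum(SIZES.values())
without_a = {name: clones for name, clones in GROUPS.items() if name != "a"}
print(rfss(SIZES, GROUPS, "A"))                                       # 205/6
print(round(float(overhead_rel(60, rfss(SIZES, GROUPS, "A"))), 3))    # 0.756
print(rfss(SIZES, GROUPS), rfss(SIZES, without_a))                    # 130 140
print(round(float(overhead_rel(total, rfss(SIZES, GROUPS))), 3))      # 0.538
print(round(float(overhead_rel(total, rfss(SIZES, without_a))), 3))   # 0.429
```

Exact fractions are used so that 34 1/6 appears as 205/6 rather than as a rounded float.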
2.5.5 Clone Coverage
Definition 3. Clone coverage is the probability that an arbitrarily chosen element in a system is covered by at least one clone.
Clone coverage is used during clone assessment to estimate the probability that a change to one statement needs to be made to additional statements due to cloning. It is defined as follows, where cloned size is the number of units covered by at least one clone, and size is the number of all units:
\[ \mathit{coverage} = \frac{\mathit{cloned\ size}}{\mathit{size}} \]
Clone coverage has a ratio scale and ranges between [0, 1].
Determination
Just as overhead, coverage is computed on the clone groups detected for a system. The accuracy of the coverage metric thus depends on the accuracy of the underlying clones.
Example
Just as for overhead, we employ source statements as basic units to compute coverage for source code. The cloned size for artifact A is computed as follows:
- Clone a1 accounts for 5 cloned statements.
- Clone b1 accounts for 40 cloned statements.
- Clone c1 spans 10 statements. However, all of them are also spanned by clone b1. Clone c1 does thus not account for additional cloned statements.
The cloned size for artifact A is thus 5 + 40 = 45. Since A has a size of 60, its coverage is 45/60 = 75%. If clone group a is ignored since it is not relevant for maintenance, coverage for A is reduced to 40/60 = 66.7%.
The coverage for all three artifacts is 115/200 = 57.5%, if clone group a is included, else 100/200 = 50%.
For artifact types other than source code, basic units are chosen differently, but coverage is computed analogously.
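The coverage values of the example can be checked with a few lines of code. As in the other sketches, the lengths come from the running example while the start offsets of the clones are assumptions; the function names `cloned_size` and `coverage` are hypothetical. Collecting covered units in a set ensures that statements spanned by both b1 and c1 are counted only once.

```python
from fractions import Fraction

SIZES = {"A": 60, "B": 100, "C": 40}
# Clone groups as lists of (artifact, start offset, length); offsets are assumed.
GROUPS = {
    "a": [("A", 0, 5), ("B", 0, 5), ("C", 0, 5)],
    "b": [("A", 20, 40), ("B", 60, 40)],
    "c": [("A", 50, 10), ("B", 90, 10), ("C", 10, 10), ("C", 30, 10)],
}


def cloned_size(groups, artifact):
    """Number of units of an artifact that are covered by at least one clone."""
    covered = set()
    for clones in groups.values():
        for art, start, length in clones:
            if art == artifact:
                covered.update(range(start, start + length))
    return len(covered)


def coverage(sizes, groups, artifacts=None):
    """Cloned size divided by size, for the given artifacts or for all of them."""
    artifacts = list(sizes) if artifacts is None else artifacts
    cloned = sum(cloned_size(groups, art) for art in artifacts)
    return Fraction(cloned, sum(sizes[art] for art in artifacts))


without_a = {name: clones for name, clones in GROUPS.items() if name != "a"}
print(coverage(SIZES, GROUPS, ["A"]), coverage(SIZES, without_a, ["A"]))  # 3/4 2/3
print(coverage(SIZES, GROUPS), coverage(SIZES, without_a))                # 23/40 1/2
```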
2.5.6 Precision
Definition 4. Precision is the fraction of clone groups that are relevant to software maintenance, or a specific task, for which clone information is employed. It can be computed on clones or clone groups.
Based on the sets of candidate clone groups and relevant clone groups, it is defined as follows:
\[ \mathit{precision\_CG} = \frac{|\{\text{relevant clone groups}\} \cap \{\text{candidate clone groups}\}|}{|\{\text{candidate clone groups}\}|} \]
Precision based on clones, precision_C, is computed analogously. Both precision metrics have ratio scales and range between [0, 1].
Determination
Precision is determined through developer assessments of samples of the detected clones. For each clone group, developers assess whether it is relevant for software maintenance, that is, whether changes to the clones are expected to be coupled. To achieve reliable and repeatable manual clone assessments, explicit relevance criteria are required.
Since in practice the set of detected clones is often too large to be feasibly assessed entirely, precision is typically determined on a representative sample of the candidate clone groups.
Example
In the example, clone group a is not relevant for software maintenance. The remaining clone groups are relevant. Consequently, precision_CG = 2/3, precision_C = 6/9 = 2/3.
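The two precision values of the example follow directly from the set definition. In this sketch the clone names a1 to c4 and the variable names are illustrative; the relevance judgement (groups b and c) is taken from the example and would in practice come from manual developer assessment.

```python
from fractions import Fraction


def precision(candidates, relevant):
    """Fraction of candidates that are also relevant."""
    candidates, relevant = set(candidates), set(relevant)
    if not candidates:
        raise ValueError("precision is undefined without candidates")
    return Fraction(len(relevant & candidates), len(candidates))


candidate_groups = {
    "a": ["a1", "a2", "a3"],
    "b": ["b1", "b2"],
    "c": ["c1", "c2", "c3", "c4"],
}
relevant_groups = {"b", "c"}  # result of the manual assessment in the example

candidate_clones = [c for clones in candidate_groups.values() for c in clones]
relevant_clones = [c for g in relevant_groups for c in candidate_groups[g]]

precision_cg = precision(candidate_groups, relevant_groups)
precision_c = precision(candidate_clones, relevant_clones)
print(precision_cg, precision_c)  # 2/3 2/3
```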
Figure 2.4: Examples: Discrete saturated PI-controller and PID-controller
2.6 Data-flow Models
Model-based development methods [188] (development of software not on the classical code level but with more abstract models specific to the domain) are gaining importance in the domain of embedded systems. These models are used to automatically generate production code⁷. In the automotive domain, already up to 80% of the production code deployed on embedded control units can be generated from models specified using domain-specific formalisms like Matlab/Simulink [118].
These models are taken from control engineering. Block diagrams (similar to data-flow diagrams) consisting of blocks and lines are used in this domain as structured description of these systems. Thus, blocks correspond to functions (e.g., integrators, filters) transforming input signals to output signals, lines to signals exchanged between blocks. The description techniques specifically addressing data-flow systems are targeting the modeling of complex stereotypical repetitive computations, with computation schemes largely independent of the computed data and thus containing little or no aspects of control flow. Typical applications of those models are, e.g., signal processing algorithms.
Recently, tools for this domain (with Matlab/Simulink [169] or ASCET-SD as examples) are used for the generation of embedded software from models of systems under development. To that end, these block diagrams are interpreted as descriptions of time- (and value-) discrete control algorithms. By using tools like TargetLink [58], these descriptions are translated into the computational part of a task description; by adding scheduling information, these descriptions are then combined (often using a real-time operating system) to implement an embedded application.
Figure 2.4 shows two examples of simple data-flow systems using the Simulink notation. Both models are feedback controllers used to keep a process variable near a specified value. Both models transform a time- and value-discrete input signal In into an output signal Out, using different types of basic function blocks: gains (indicated by triangles, e.g., P and I), adders (indicated by circles, with + and − signs stating the addition or subtraction of the corresponding signal value), one-unit delays (indicated by boxes with 1/z, e.g., I-Delay), constants (indicated by boxes with numerical values, e.g., Max), comparisons (indicated by boxes with relations, e.g., Compare), and switches (indicated by boxes with forks, e.g., Set).
⁷ The term “model-based” is often also used in the context of incomplete specifications that mainly serve documentation purposes. Here however, we focus on models that are employed for full code generation.
Systems are constructed by using instances of these types of basic blocks. When instantiating basic blocks, depending on the block type, different attributes are defined, e.g., constants get assigned a value, or comparisons are assigned a relation. For some blocks, even the possible input signals are declared. For example, for an adder, the number of added signals is defined, as well as the corresponding signs. By connecting them via signal lines, (basic) blocks can be combined to form more complex blocks, allowing the hierarchic decomposition of large systems into smaller subsystems.
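How basic blocks combine into a more complex block can be sketched in a few lines of code. This is a generic discrete saturated PI-controller built from gain, adder, unit delay, comparison and switch behavior; it is an assumption-laden illustration and does not reproduce the exact wiring or parameter values of Figure 2.4. The class names and the gains 2.0 and 0.5 with a maximum of 3.25 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gain:
    """Basic block: multiplies its input signal by a constant factor."""
    factor: float

    def output(self, signal: float) -> float:
        return self.factor * signal


@dataclass
class UnitDelay:
    """Basic block: outputs the value it received one time step earlier."""
    state: float = 0.0

    def output(self) -> float:
        return self.state

    def update(self, signal: float) -> None:
        self.state = signal


class SaturatedPI:
    """Complex block composed of basic blocks connected by signals."""

    def __init__(self, p: float, i: float, maximum: float):
        self.p = Gain(p)
        self.i = Gain(i)
        self.i_delay = UnitDelay()
        self.maximum = maximum  # constant block

    def step(self, value: float) -> float:
        # Adder: integral part accumulates via the one-unit delay.
        integral = self.i.output(value) + self.i_delay.output()
        self.i_delay.update(integral)
        # Adder: sum of proportional and integral part.
        raw = self.p.output(value) + integral
        # Comparison and switch: saturate the output at the maximum.
        return self.maximum if raw > self.maximum else raw


controller = SaturatedPI(p=2.0, i=0.5, maximum=3.25)
print([controller.step(1.0) for _ in range(3)])  # [2.5, 3.0, 3.25]
```

Two such controllers that differ only in their parameter values would be natural clone candidates in a data-flow model, since they consist of the same block types connected in the same way.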
2.7 Case Study Partners
This section gives a short overview of the companies or organizations that participated in one or more of the case studies.
Munich Re Group
The Munich Re Group is one of the largest re-insurance companies in the world and employs more than 47,000 people in over 50 locations. For their insurance business, they develop a variety of individual supporting software systems.
Lebensversicherung von 1871 a.G.
The Lebensversicherung von 1871 a.G. (LV 1871) is a Munich-based life-insurance company. The LV 1871 develops and maintains several custom software systems for mainframes and PCs.
Siemens AG
Siemens AG is the largest engineering company in Europe. The specification used here was obtained from the business unit dealing with industrial automation.
MOST Cooperation
MOST Cooperation is a partnership of car manufacturers and component suppliers that defined an automotive multimedia protocol. Key partners include Audi, BMW and Daimler.
MAN Nutzfahrzeuge Group
MAN Nutzfahrzeuge Group is a Germany-based international supplier of commercial vehicles and transport systems, mainly trucks and buses. It has over 34,000 employees world-wide of which 150 work on electronics and software development. Hence, the focus is on embedded systems in the automotive domain.
2.8 Summary
This chapter introduced clones as a form of logical redundancy and compared it with other notions of redundancy used in computer science. Based thereon, it defined the central terms and metrics employed in this thesis. Besides, the chapter introduced the companies that took part in industrial case studies that are presented in later chapters.
3 State of the Art
This chapter summarizes existing work in the research area of software cloning in support of the claims made in the thesis statement (cf., Section 1.1). More specifically, it summarizes work on the impact of cloning on software engineering and on approaches for its assessment and control¹.
The structure of this chapter reflects the organization of this thesis: Section 3.1 outlines work on the impact of cloning on program correctness. Section 3.2 outlines work on the extent of cloning in different software artifact types. Section 3.3 outlines existing clone detection approaches and argues why novel ones had to be developed. Section 3.4 outlines work on clone assessment and management. Finally, Section 3.5 outlines work on the limitations of clone detection.
Each section summarizes existing work, outlines open issues and points to the chapters in this thesis that contribute to their resolution.
3.1 Impact on Program Correctness
It is widely accepted that cloning can, in principle, impede maintenance through its induced increase in artifact size and necessity of multiple, consistent updates required for a single change in problem domain information. However, there is no consensus in the research community on how harmful cloning is in practice. A survey on the harmfulness of cloning by Hordijk et al. [93] concludes that “a direct link between duplication and changeability has not been proven yet, but not rejected either”. Consequently, a number of researchers have performed empirical studies to better understand the extent of the impact on maintenance efforts and, especially, on program correctness.
Clone Related Bugs. Li et al. [157] present an approach to detect bugs based on inconsistent renaming of identifiers between clones. Jiang, Su and Chiu [159] analyze different contexts of clones, such as missing if statements. Both papers report the successful discovery of bugs in released software. In [4], [237], [216] and [7], individual cases of bugs or inconsistent bug fixes discovered by analysis of clone evolution are reported for open source software. These studies thus confirm cases where inconsistencies between clones indicated bugs, supporting the claim for negative impact of clones for program correctness.
¹ A comprehensive overview of software cloning research in general is beyond the scope of this thesis. Please refer to Koschke [140] and Roy and Cordy [201] for detailed surveys.
Clone Evolution. Indication for the harmfulness of cloning for maintainability or correctness is given by several researchers. Lague et al. [149] report inconsistent evolution of a substantial amount of clones in an industrial telecommunication system. Monden et al. [178] report a higher revision number for files with clones than for files without in a 20 year old legacy system, possibly indicating lower maintainability. In [132,133], Kim et al. report that many changes to code clones occur in a coupled fashion, indicating additional maintenance effort due to multiple change locations. Thummalapenta, Aversano, Cerulo and Di Penta [4,216] report that high proportions of bug fixes occur for clones that show late propagations, i.e., inconsistent changes that are later made consistent, indicating that cloning delayed the removal of bugs from the system, or that the inconsistencies introduced bugs that were later repaired. Lozano and Wermelinger [163,193] report that maintenance effort may increase when a method has clones.
In contrast, doubt that consequences of cloning are unambiguously harmful is raised by several recent research results. Krinke [147] reports that only half the clones in several open source systems evolved consistently and that only a small fraction of inconsistent clones becomes consistent again through later changes, potentially indicating a larger degree of independence of clones than hitherto believed. Geiger et al. [76] report that a relation between change couplings and code clones could, contrary to expectations, not be statistically verified. Lozano and Wermelinger [163] report that no systematic relationship between code cloning and changeability could be established. In [148], Krinke reports that in a set of open source systems, cloned code is more stable than non-cloned code and concludes that it thus cannot be assumed to require more maintenance costs in general.
Bettenburg et al. [20] analyzed the impact of inconsistent changes to clones on program correctness. Instead of analyzing individual changes, they analyzed only released software versions. Of the analyzed bugs in the two systems, only 1.3% and 2.3% were found to be due to inconsistent changes to code clones, indicating a small impact of cloning on program correctness. Rahman et al. [189] analyze the relation between code cloning and bugs and report that, in the analyzed systems, cloned code contains fewer bugs than non-cloned code.
Due to the diversity of the results produced by the studies on clone evolution, it is hard to draw conclusions w.r.t. the harmfulness of cloning. This is emphasized by the results from Göde [83], who analyzes evolution of type-1 clones in 9 open source systems to validate findings from previous studies on clone evolution. He reports that the ratio of consistent and inconsistent changes to cloned code varies substantially between the analyzed systems, making conclusions difficult.
Cloning Patterns. Through cloning patterns, Kapser and Godfrey [123] contrast motivation and impact of cloning as a design decision with alternative solutions. They report that cloning can be a justifiable or even beneficial development action in special situations, i.e., where severe language limitations or code ownership issues prohibit generic solutions. Notably, however, while they argue that lack of, or problems associated with, alternative solutions can make up for them, they emphasize that for all cloning patterns the negative impact of cloning still holds.
Summary. The effect of cloning on maintainability and correctness is thus not clear. Furthermore, the above listed publications suffer from one or more shortcomings that limit the transferability of the reported findings.
- Many studies employ clone detectors in their default configuration without adapting them to the analyzed systems or tasks [4,7,76,147,148,163,189]. As a consequence, no differentiation is made, e.g., between clone candidates in hand-maintained or generated code, although clone candidates in generated code are irrelevant for maintenance. The employed notion of “clone” is thus purely syntactic and its task-related precision unclear. For example, for one of the analyzed systems, Krinke reports that more than half of the detected clones were in code generated by a parser generator [148]. However, they were not excluded from the study, thus diluting its conclusiveness w.r.t. the impact of cloning.
- Instead of manual inspection of the actual inconsistent clones to evaluate impact for maintenance and correctness, indirect measures are used [4,76,83,147–149,163,178]. For example, change coupling, the ratio between consistent and inconsistent evolution of clones or code stability are analyzed, instead of actual maintenance efforts or faults. Indirect measures are inherently inaccurate and can easily lead to misleading results: unintentional differences and faults, e.g., while unknown to developers, exhibit the same evolution pattern as intentionally independent evolution and are thus prone to misclassification. Furthermore, inconsistencies that are faults that have not yet been discovered, or have been fixed in different ways, can incorrectly be classified as intentional independent evolution.
- Apart from their inaccuracy, the interpretation of the indirect measures is disputable. This is apparent for the measure of code stability as an indicator for maintainability. On the one hand, higher stability of cloned versus non-cloned code could be interpreted as an indicator for lower maintenance costs of cloned code, as, e.g., done by [148]; fewer changes could mean less costs. On the other hand, it can be interpreted as an indicator for lower maintainability (developers might shirk changing cloned code due to the increased effort), indicating higher overall maintenance costs! Support for the latter interpretation is, e.g., given by Glass [81], who reports more changes for more maintainable applications than for unmaintainable code, simply because development exploits the fact that changes are easier to make.
- The analyzed systems are too small (20 kLOC) to be representative [132,133] or omit analysis of industrial software [4,7,76,83,132,133,147,148,163,189].
- The analyses specifically focus on faults introduced during creation [157,159] or evolution [7] of clones, inhibiting quantification of inconsistencies in general. Or, in the case of [20], only look at bugs in released software, thus ignoring efforts for testing, debugging and fixing of clone-related bugs introduced and fixed during development.
Additional empirical research outside these limitations is required to better understand the impact of cloning [140,201]. In particular, the impact of cloning on program correctness is insufficiently understood.
Problem. It is still not well understood how strongly unawareness of cloning during maintenance affects program correctness. However, as this is the central motivation driving the development of clone management tools, we consider this precarious.
Contribution. Chapter 4 presents a large scale case study that studies the impact of unawareness of cloning on program correctness. It employs developer rating of the actual inconsistent clones instead of indirect measures, the study objects are both open source and industrial systems, and inconsistencies have been analyzed independently of their mode of creation. It does, hence, not suffer from the above mentioned shortcomings.
3.2 Extent of Cloning
Cloning has been studied intensely for source code. Little work, however, has been done on cloning in other artifact types. This section outlines existing work on the extent of cloning in different artifact types.
Source Code. The majority of the research in the area of software cloning focuses on source code. Both for the evaluation of detection approaches and for the analysis of the impact of cloning, a substantial number of results for different code bases have been published [1,3,4,7,33,60,83,84,110,115,133,140,147,148,157,159,161,162,164,178,189,193,195,198,199,201,216]. They comprise source code from systems of different size and age, from different domains, development teams and written in different programming languages. While the amount of detected cloning varies, these studies convincingly show that cloning can occur in source code independent of domain, programming language or developing organization. The studies that have been performed in the course of this thesis support this observation.
Requirements Specifications. The negative effects of cloning in programs, in principle, also apply to cloning in software requirements specifications (SRS). As SRS are read and changed often (e.g., during requirements elicitation, software design, and test case specification), redundancy is considered an obstacle to requirements modifiability [100] and listed, for instance, as a major problem in automotive requirements engineering [230].
In general, structuring of requirements and manual inspection (based, e.g., on the criteria of [100]) are used for quality assessment concerning redundancy. As it requires human action, it does introduce subjectiveness and causes high expenses. In addition, approaches exist to mechanically analyze other quality attributes of natural language requirements specifications, especially ambiguity-related issues like weak phrases, lack of imperative, or readability metrics as in, e.g., [28,66,101,233]. However, redundancy has not been in the focus of analysis tools.
Algorithms for commonalities detection in documents have been developed in several other areas. Clustering algorithms for document retrieval, such as [231], search for documents on topics similar to those of a reference document. Plagiarism detection algorithms, like [44,165], also address the detection of commonalities between documents. However, while these approaches search for commonalities between a specific document and a set of reference documents, clone detection also needs to consider clones within a single document. Furthermore, we are not aware of studies that apply them to requirements specifications to discover requirements cloning.
Models. Up to now, little work has been done on clone detection in model-based development. Consequently, we have little information on how likely real-world models contain clones, and thus, how important clone detection and management is for model-based development.
In [160], Liu et al. propose a suffix-tree based algorithm for clone detection in UML sequence diagrams. They evaluated their approach on sequence diagrams from two industrial projects from a single company, discovering 15% of duplication in the set of 35 sequence diagrams in the first and 8% of duplication in the 15 sequence diagrams of the second project.
In [186] and [180], Pham et al. and Nguyen et al. present clone detection approaches for Matlab/Simulink models. Their evaluation is limited to freely available models from MATLAB Central, though, which mainly serve educational purposes. It thus does not allow conclusions about the amount of cloning in industrial Matlab/Simulink models.
Summary. Although requirements have a pivotal role in software engineering, and even though redundancy has long been recognized as an obstacle for requirements modification [100], to the best of our knowledge, no analysis of cloning in requirements specifications has been published (except for the work published as part of this thesis). We thus do not know whether cloning occurs in requirements and needs to be controlled.
Although model-based development is gaining importance in industry [188], except for the analysis of cloning in sequence diagrams, no studies on cloning in models have been published (except for the work published as part of this thesis). We thus do not know how relevant clone detection and management is for model-based development.
Problem. Substantial research has analyzed cloning in source code. However, very little research has been carried out on cloning in other software artifacts. It is thus unclear whether cloning primarily occurs in source code, or also needs to be controlled for other software artifacts such as requirements specifications and models.
Contribution. To advance our knowledge of the extent and impact of cloning in other artifacts, Chapter 5 presents a large scale industrial case study on cloning in requirements specifications that analyzes extent and impact of cloning in 28 specifications from 11 companies. It indicates that cloning does abound in some specifications and gives indications for its negative impact. The chapter furthermore presents an industrial case study on cloning in Matlab/Simulink models that demonstrates that cloning does occur in industrial models; clone detection and management are, hence, also beneficial for requirements specifications and in model-based development.
3.3 Clone Detection Approaches
Both empirical research on the significance of cloning and methods for clone assessment and control require clone detectors. In its first part, this section gives a general overview of existing code clone detection approaches. Then, it presents approaches for real-time clone detection of type-2 and eager detection of type-3 clones in source code and clone detection in graph-based models in detail and identifies their shortcomings. This section thus motivates and justifies the development of novel detection approaches that are presented in Chapter 7.
3.3.1 Code Clone Detection
The clone detection community has proposed very many different approaches, the vast majority of them for source code. They differ in the program representation they operate on and in the search algorithm they employ to find clones. We structure them here according to their underlying program representation. This section focuses on code clone detection. Approaches for other artifacts are presented in Section 3.3.4.
Text-based clone detection operates on a text-representation of the source code and is thus language independent. As a result, text-based detection tools typically cannot differentiate between semantics changing and semantics invariant changes. Approaches include [41,61,62,108,167,202].
Token-based clone detection operates on a token stream produced from the source code by a scanner. It is thus language dependent, since a scanner encodes language-specific information. However, compared to parsers or compilers, scanners are comparatively easy to produce and robust against compile errors. Token-based clone detection allows token-type specific normalization, such as removal of comments or renaming of literals and identifiers. It is thus robust against certain semantics invariant changes to source code. Approaches include [6,14,85,85,88,113,121,157,210,220].
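The token-type specific normalization described above can be illustrated with a minimal sketch. It uses Python's standard `tokenize` module as the scanner; the function name `normalized_tokens` and the placeholder symbols `ID` and `LIT` are assumptions made for this illustration and are not taken from any of the cited tools.

```python
import io
import keyword
import tokenize

# Layout and comment tokens that the normalization simply drops.
SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}

def normalized_tokens(source: str) -> list[str]:
    """Scan Python source and return a normalized token sequence."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue  # remove comments and layout
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")   # rename identifiers
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")  # rename literals
        else:
            result.append(tok.string)  # keep keywords and operators
    return result

a = "total = price * 2  # double it\n"
b = "amount = cost * 10\n"
print(normalized_tokens(a) == normalized_tokens(b))
```

After normalization the two statements yield the same token sequence, so a detector that compares token sequences would treat them as clone candidates even though identifiers, literals and the comment differ.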
AST-based clone detection operates on the (abstract) syntax tree produced from the source code by a parser. It thus requires more language-specific infrastructure than token-based detection, but can be made robust against further classes of program variation, such as different concrete syntaxes for the same abstract syntax element. Approaches include [16,29,36,65,67,106,142,182,213,226].
Metrics-based approaches cut the program into fragments (e.g., methods) and compute a metric vector (containing, e.g., lines of code, nesting depth, number of paths, and number of calls to other functions) for each. Fragments with similar vectors are then considered clones. Since the metrics abstract from syntactic features of the source code, these approaches are also robust against certain types of differences between clones. Approaches include [138,139,170].
PDG-based approaches operate on the program dependence graph (PDG) and search it for isomorphic subgraphs. On the one hand, they are robust against further types of program variation that cannot be easily detected by other approaches, such as statement reordering. On the other hand, they make the highest demands w.r.t. available programming language infrastructure to create a PDG. Approaches include [73,137,146].
Assembler-based approaches employ techniques from the above approaches but operate on the assembler or intermediate language code produced by the compiler, instead of on source code. On the one hand, they are robust against program variation removed during compilation, such as interchangeable loop constructs. On the other hand, they have to deal with redundancy created by the compiler through replacement of a single higher-level language statement, like a loop, through a series of lower level language statements. Approaches include [45,204] for assembler and [213] for .NET intermediate language code.
Each program representation the detection approaches operate on represents a different trade-off between several factors: language-independence, robustness against program variation and performance being among the most important. Increasing sophistication of program representation (text, token, AST, PDG) increases robustness against program variation, since more information for normalization and similarity computation is available. However, at the same time it decreases language independence and performance.
Hybrid approaches have, consequently, been proposed that attempt to combine the advantages of individual approaches. Wrangler [154] employs a hybrid token/AST-based approach that exploits the performance of token-based clone detection and employs the AST to make sure that the detected clones represent syntactically well-formed program entities that are amenable to certain refactoring techniques. KClone [105] first operates on the token level to exploit the performance of token-based clone detection and then operates on a graph-based representation to increase recall.
3.3.2 Real-Time Clone Detection
Clone management tools rely on accurate cloning information to indicate cloning relationships in the IDE while developers maintain code. To remain useful, cloning information must be adapted continuously as the software system under development evolves. For this, detection algorithms need to be able to very rapidly adapt results to changing code, even for very large code bases. We classify existing approaches based on their scalability and their ability to rapidly update detection results to changes to the code.
Eager Algorithms. As outlined in Section 3.3.1, a multitude of clone detection approaches have been proposed. Independent of whether they operate on text [41,62,202], tokens [6,113,121], ASTs [16,106,142] or program dependence graphs [137,146], and independent of whether they employ textual differencing [41,202], suffix-trees [6,113,121], subtree hashing [16,106], anti-unification [30], frequent itemset mining [157], slicing [137], isomorphic subgraph search [146] or a combination of different phases [105], they operate in an eager fashion: the entire system is processed in a single step by a single machine.
The scalability of these approaches is limited by the amount of resources available on a single machine. The upper size limit on the amount of code that can be processed varies between approaches, but is insufficient for very large code bases. Furthermore, if the analyzed source code changes, eager approaches require the entire detection to be rerun to achieve up-to-date results. Hence, these approaches are neither incremental nor sufficiently scalable.
Incremental or Real-time Detection. Göde and Koschke [85,85] proposed the first incremental clone detection approach. They employ a generalized suffix-tree that can be updated efficiently when the source code changes. The amount of effort required for the update only depends on the size of the change, not the size of the code base. Unfortunately, generalized suffix-trees require substantially more memory than read-only suffix-trees, since they require additional links that are traversed during the update operations. Since generalized suffix-trees are not easily distributed across different machines, the memory requirements represent the bottleneck w.r.t. scalability. Consequently, the improvement in incremental detection comes at the cost of reduced scalability and distribution.
Yamashina et al. [126] propose a tool called SHINOBI that provides real-time cloning information to developers inside the IDE. Instead of performing clone detection on demand (and incurring waiting times for developers), SHINOBI maintains a suffix-array on a server from which cloning information for a file opened by a developer can be retrieved efficiently. Unfortunately, the authors do not approach suffix-array maintenance in their work. Real-time cloning information hence appears to be limited to an immutable snapshot of the software. We thus have no indication that their approach works incrementally.
Nguyen et al. [182] present an AST-based incremental clone detection approach. They compute characteristic vectors for all subtrees of the parse tree of a code file. Clones are then detected by searching for similar vectors. If the analyzed software changes, vectors for modified files are simply recomputed. As the algorithm is not distributed, its scalability is limited by the amount of memory available on a single machine. Furthermore, AST-based clone detection requires parsers. Unfortunately, parsers for legacy languages such as PL/I or COBOL are often hard to obtain [150]. However, according to our experience (cf. Chapter 4), such systems often contain substantial amounts of cloning. Clone management is hence especially relevant for them.
Scalable Detection. Livieri et al. [162] propose a general distribution model that distributes clone detection across many machines to improve scalability. Their distribution model partitions source code into pieces small enough (e.g., 15 MB) to be analyzed on a single machine. Clone detection is then performed on all pairs of pieces. Different pairs can be analyzed on different machines. Finally, results for individual pairs are composed into a single result for the entire code base. Since the number of pairs of pieces increases quadratically with system size, the analysis time for large systems is substantial. The increase in scalability thus comes at the cost of response time.
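The quadratic growth mentioned above can be made concrete with a small sketch. It only counts the detection jobs such a partitioning scheme would create; the function names are hypothetical, and the decision to pair each piece with itself (to find clones inside a piece) is an assumption of this sketch, not a detail confirmed by the text.

```python
import math
from itertools import combinations_with_replacement

def pieces_needed(system_mb: float, piece_mb: float = 15.0) -> int:
    """Number of pieces when the code base is cut into chunks of at most piece_mb."""
    return max(1, math.ceil(system_mb / piece_mb))

def piece_pairs(num_pieces: int) -> list[tuple[int, int]]:
    """All unordered pairs of pieces, including each piece paired with itself."""
    return list(combinations_with_replacement(range(num_pieces), 2))

for size_mb in (150, 1500):
    n = pieces_needed(size_mb)
    print(size_mb, "MB ->", n, "pieces ->", len(piece_pairs(n)), "pair jobs")
```

Under these assumptions a tenfold increase in system size (150 MB to 1500 MB) raises the number of pair jobs from 55 to 5050, roughly a hundredfold increase, which is why analysis time grows so quickly even when the jobs are spread over many machines.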
Summary. We require clone detection approaches that are both incremental and scalable to efficiently support clone control of large code bases.
Problem. Eager clone detection is not incremental. The limited memory available on a single machine furthermore restricts its scalability. Novel incremental detection approaches come at the cost of scalability, and vice versa. In a nutshell, no existing approach is both incremental and scalable to very large code bases.
Contribution. Chapter 7 introduces index-based clone detection as a novel detection approach for type-1 and type-2 clones that is both incremental and scalable to very large code bases. It extends practical applicability of clone detection to areas that were previously unfeasible since the systems were too large or since response times were unacceptably long. It is available for use by others as open source software.
3.3.3 Detection of Type-3 Clones
The case study that investigates impact of unawareness of cloning on program correctness (cf. Section 3.1) requires an approach to detect type-3 (cf. Section 2.2.3) clones in source code. We classify existing approaches for type-3 clone detection in source code according to the program representation they operate on and outline their shortcomings.
Text. In NICAD, normalized code fragments are compared textually in a pairwise fashion [202]. A similarity threshold governs whether text fragments are considered as clones.
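A threshold-governed pairwise comparison of this kind can be sketched as follows. This is not NICAD's implementation: the whitespace-only normalization, the use of `difflib.SequenceMatcher` as the similarity measure, and the default threshold of 0.7 are all assumptions chosen to keep the illustration short.

```python
from difflib import SequenceMatcher

def normalize(fragment: str) -> list[str]:
    """Collapse whitespace and drop blank lines (a stand-in for real normalization)."""
    return [" ".join(line.split()) for line in fragment.splitlines() if line.strip()]

def similarity(a: str, b: str) -> float:
    """Line-based similarity in [0, 1] between two normalized fragments."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_clone_pair(a: str, b: str, threshold: float = 0.7) -> bool:
    """Report the pair as a clone candidate if similarity reaches the threshold."""
    return similarity(a, b) >= threshold

frag_a = "x = 1\ny = 2\nz = x + y\nprint(z)\n"
frag_b = "x = 1\ny  =  2\nz = x - y\nprint(z)\n"
print(similarity(frag_a, frag_b), is_clone_pair(frag_a, frag_b))
```

Three of the four normalized lines match, so the similarity is 0.75: the pair counts as a clone candidate at a threshold of 0.7 but not at 0.8, which shows how strongly the threshold governs the result.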
Token. Ueda et al. [220] propose post-processing of the results of token-based detection of exact clones that composes type-3 clones from neighboring ungapped clones. In [157], Li et al. present the tool CP-Miner, which searches for similar basic blocks using frequent subsequence mining and then combines basic block clones into larger clones.
Abstract Syntax Tree. Baxter et al. [16] hash subtrees into buckets and perform pairwise comparison of subtrees in the same bucket. Jiang et al. [106] propose the generation of characteristic vectors for subtrees. Instead of pairwise comparison, they employ locality sensitive hashing for vector clustering, allowing for better scalability than [16]. In [65], tree patterns that provide structural abstraction of subtrees are generated to identify cloned code.
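The idea of grouping subtrees into buckets before comparing them can be sketched with Python's standard `ast` module. This is a simplified illustration, not the algorithm of [16]: the structural fingerprint `shape`, the restriction to statement nodes, and the `min_nodes` cut-off are assumptions of this sketch, and the fingerprint string itself serves as the bucket key, so no pairwise verification inside a bucket is needed here.

```python
import ast
from collections import defaultdict

def shape(node: ast.AST) -> str:
    """Structural fingerprint that ignores identifier names and literal values."""
    children = ",".join(shape(child) for child in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"

def bucket_subtrees(source: str, min_nodes: int = 5) -> list[list[int]]:
    """Group statement subtrees by fingerprint; return line numbers of shared shapes."""
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.stmt):
            continue
        if sum(1 for _ in ast.walk(node)) < min_nodes:
            continue  # skip tiny subtrees, which would match trivially
        buckets[shape(node)].append(node.lineno)
    return [lines for lines in buckets.values() if len(lines) > 1]

code = "a = b + 1\nc = d + 2\ne = f - 3\n"
print(bucket_subtrees(code))
```

The first two assignments share a bucket because they differ only in names and literal values, while the third uses a different operator and therefore lands in a bucket of its own.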
Program Dependence Graph. Krinke [146] proposes a search algorithm for similar subgraph identification. Komondoor and Horwitz [137] propose slicing to identify isomorphic PDG subgraphs. Gabel, Jiang and Su [73] use a modified slicing approach to reduce the graph isomorphism problem to tree similarity.
Summary. We require a type-3 clone detection algorithm to study the impact of unawareness of cloning on program correctness.
Problem. The existing approaches provided valuable inspiration for the algorithm presented in this thesis. However, none of them was applicable to study the impact of unawareness of cloning on program correctness, for one or more of the following reasons:
- Tree [16,65,106] and graph [73,137,146] based approaches require the availability of suitable context free grammars for AST or PDG construction. While feasible for modern languages such as Java, this poses a severe problem for legacy languages such as COBOL or PL/I, where suitable grammars are not available. Parsing such languages still represents a significant challenge [62,150].
- Due to the information loss incurred by the reduction of variable size code fragments to constant-size numbers or vectors, the edit distance between inconsistent clones cannot be controlled precisely in feature vector [106] and hashing based [16] approaches.
- Idiosyncrasies of some approaches threaten recall. In [220], inconsistent clones cannot be detected if their constituent exact clones are not long enough. In [73], inconsistencies might not be detected if they add data or control dependencies, as noted by the authors.
- Scalability to industrial-size software of some approaches has been shown to be infeasible [137,146] or is at least still unclear [65,202].
- For most approaches, implementations are not publicly available.
Contribution. Chapter 7 presents a novel algorithm to detect type-3 clones in source code. In contrast to the above approaches, it supports both modern and legacy languages including COBOL and PL/I, allows for precise control of similarity in terms of edit distance on program statements, is sufficiently scalable to analyze industrial-size projects in reasonable time and is available for use by others as open source software.
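To illustrate what "edit distance on program statements" means as a similarity criterion, the following sketch computes the standard Levenshtein distance between two statement sequences. It is not the detection algorithm of Chapter 7, which is not described in this section; the function name and the example statements are made up for illustration.

```python
def statement_edit_distance(a: list[str], b: list[str]) -> int:
    """Minimum number of statement insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, stmt_a in enumerate(a, start=1):
        curr = [i]
        for j, stmt_b in enumerate(b, start=1):
            cost = 0 if stmt_a == stmt_b else 1
            curr.append(min(prev[j] + 1,          # delete stmt_a
                            curr[j - 1] + 1,      # insert stmt_b
                            prev[j - 1] + cost))  # keep or replace
        prev = curr
    return prev[-1]

clone_a = ["read x", "if x > 0", "y = x", "write y"]
clone_b = ["read x", "y = x", "write y"]
print(statement_edit_distance(clone_a, clone_b))
```

The two fragments differ by a single missing `if` statement, so their distance is 1. A detector parameterized with a maximum edit distance would report the pair whenever that limit is at least 1, which gives the kind of precise control that constant-size vectors or hashes cannot provide.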
3.3.4 Detection of Clones in Models
To analyze the extent of cloning in Matlab/Simulink models, and to assess and control existing clones in them during maintenance, we need a suitable clone detection algorithm. In this section, we discuss related work in clone detection on models and outline shortcomings.
Model-based Clone Detection. Up to now, little work has been done on clone detection in model-based development. In [160], Liu et al. propose a suffix-tree based algorithm for clone detection in UML sequence diagrams. They exploit the fact that parallelism-free sequence diagrams can be linearized in a canonical fashion, since a unique topological order for them exists. This way, they effectively reduce the problem of finding common subgraphs to the simpler problem of finding common substrings. However, since a unique, similarity preserving topological order cannot be established for Matlab/Simulink models, their approach is not applicable to our case.
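The reduction from common subgraphs to common substrings can be sketched for two already linearized diagrams. The sketch uses a simple dynamic program instead of the suffix tree of [160], and the message labels are invented for illustration; it only shows why a canonical linear order makes the problem easy.

```python
def longest_common_run(a: list[str], b: list[str]) -> list[str]:
    """Longest contiguous run of elements that occurs in both sequences."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

diagram_1 = ["login", "query", "render", "logout"]
diagram_2 = ["init", "query", "render", "close"]
print(longest_common_run(diagram_1, diagram_2))
```

Because each diagram is a sequence, the shared fragment `query, render` is found by plain sequence comparison. Without a unique, similarity preserving order, as is the case for data-flow models, the same fragment could be linearized differently in each model and would be missed.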
A problem which could be considered as the dual of the clone detection problem is described by Kelter et al. in [128], where they try to identify the differences between UML models (usually applied to different versions of a single model). In their approach they rely on calculating pairs of matching elements (i.e., classes, operations, etc.) based on heuristics including the similarity of names, and exploiting the fact that UML is represented as a rooted tree in the XMI used as storage format, making it inappropriate for our context.
In [186], Pham et al. present a clone detection approach for Matlab/Simulink. It builds on the approach presented in this thesis and was, thus, not available to us when we developed it.
Graph-based Clone Detection. Graph-based approaches for code clone detection could, in principle, also be applied to Matlab/Simulink. In [137], Komondoor and Horwitz propose a combination of forward and backward program slicing to identify isomorphic subgraphs in a program dependence graph. Their approach is difficult to adapt to Matlab/Simulink models, since their application of slicing to identify similar subgraphs is very specific to program dependence graphs.
In [146], Krinke also proposes an approach that searches for similar subgraphs in program dependence graphs. Since the search algorithm does not rely on any program dependence graph specific properties, it is in principle also applicable to model-based clone detection. However, Krinke employs a rather relaxed notion of similarity that is not sensitive to topological differences between subgraphs. Since topology plays a crucial role in data-flow languages, we consider this approach to be sub-optimal for Matlab/Simulink models.
Graph Theory. Probably the most closely related problem in graph theory is the well known NP-complete Maximum Common Subgraph problem. An overview of algorithms is presented by Bunke et al. [31]. Most practical applications of this problem seem to be studied in chemoinformatics [191], where it is used to find similarities between molecules. However, while typical molecules considered there have up to about 100 atoms, many Matlab/Simulink models consist of thousands of blocks and thus make the application of exact algorithms as applied in chemoinformatics infeasible.
Summary. We require a clone detection algorithm for Matlab/Simulink models to investigate the extent of cloning in industrial Matlab/Simulink models.
Problem. While the existing approaches for clone detection in graphs and models provided valuable inspiration, none is suitable to study the extent of cloning in industrial Matlab/Simulink models.
Contribution. Chapter 7 presents a novel clone detection approach for data-flow models that is suitable for Matlab/Simulink and scales to industrial-size models.
3.4 Clone Assessment and Management
This section outlines work related to clone management; to be comprehensive, we interpret this to comprise all work that employs clone detection results to support software maintenance.
3.4.1 Clone Assessment
Clone detection tools produce clone candidates. Just because the syntactic criteria for type-x clone candidates are satisfied, they do not necessarily represent duplication of problem domain knowledge. Hence, they are not necessarily relevant for software maintenance. If precision is interpreted as task relevance, existing clone detection approaches, hence, produce substantial amounts of false positives. Clone assessment needs to achieve high precision to get conclusive cloning information.
The existence of false positives in produced clone candidates has been reported by several researchers. Kapser and Godfrey report between 27% and 65% of false positives in case studies investigating cloning in open source software [122]. Burd and Bailey [32] compared three clone detection and two plagiarism detection tools using a single small system as study object. Through subjective assessments, 38.5% of the detected clones were rejected as false positives. A more comprehensive study was conducted by Bellon et al. [19]. Six clone detectors were compared using eight different subject systems. A sample of the detected clones was judged manually by Bellon. It was found that, depending on the detection technique, a large amount of false positives is among the detected clones. Tiarks et al. [217] categorized type-3 clones detected by different state-of-the-art clone detectors according to their differences. Before categorization, they manually excluded false positives. They found that up to 75% of the clones were false positives.
Walenstein et al. [229] reveal caveats involved in manual clone assessment. Lack of objective clone relevance criteria results in low inter-rater reliability. Similar results are reported by Kapser et al. [124]. Their work emphasizes the need for measurement of inter-rater reliability to make sure objective clone relevance criteria are used.
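One common way to quantify inter-rater reliability is Cohen's kappa, which corrects the observed agreement of two raters for the agreement expected by chance. The sketch below is an illustration only: the text does not state which statistic [124] or [229] use, kappa as written here handles exactly two raters, and the ratings are invented.

```python
from collections import Counter

def cohens_kappa(rater_1: list, rater_2: list) -> float:
    """Chance-corrected agreement between two raters over the same clone candidates."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("both raters must judge the same non-empty set of candidates")
    n = len(rater_1)
    observed = sum(x == y for x, y in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    labels = counts_1.keys() | counts_2.keys()
    expected = sum(counts_1[label] * counts_2[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# 1 = relevant clone, 0 = false positive (hypothetical ratings)
judge_a = [1, 1, 0, 0]
judge_b = [1, 0, 0, 0]
print(cohens_kappa(judge_a, judge_b))
```

Here the judges agree on three of four candidates (observed agreement 0.75), but half of that agreement is expected by chance, giving a kappa of 0.5. Values near zero indicate that the relevance criteria are effectively subjective.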
Some work has been done on tailoring clone detectors to improve their accuracy: Kapser and Godfrey propose to filter clones based on the code regions they occur in. They report that such filters can successfully remove false positives in regions of stereotype code without substantially affecting recall [122]. In addition, all clone detection tools expose parameters whose valuations influence result accuracy. For some individual tools and systems, their effect on the quantity of detected clones has been reported [121]. However, we are not aware of systematic methods on how result accuracy can be improved.
Summary. Unfortunately, there is no common, agreed-upon understanding of the criteria that determine the relevance of clones for software maintenance. This is reflected in the multitude of different definitions of software clones in the literature [140,201]. This lack of relevance criteria introduces subjectivity into clone judgement [124,229], making objective conclusions difficult. The negative consequences become obvious in the study done by Walenstein et al. [229]: three judges independently performed manual assessments of clone relevance; since no objective relevance criteria were given, judges applied subjective criteria, rating only 5 out of 317 candidates consistently. Obviously, such low agreement is unsuited as a basis for improvement of clone detection result accuracy.
Problem. Clone detection tools produce substantial amounts of false positives, threatening the correctness of research conclusions and the adoption of clone detection by industry. However, we lack explicit criteria that are fundamental to make unbiased assessments of detection result accuracy; consequently, we lack methods for its improvement.
Contribution. Chapter 8 introduces clone coupling as an explicit criterion for the relevance of code clones for software maintenance. It outlines a method for clone detection tailoring that employs clone coupling to improve result accuracy. The results of two industrial case studies indicate that developers can estimate clone coupling consistently and correctly and show the importance of tailoring for result accuracy.
3.4.2 Clone Management
In [141], Koschke provides a comprehensive overview of the current work on clone management. He follows Lague et al. [149] and Giesecke [78] in dividing clone management activities into three areas: preventive management aims to avoid creation of new clones; compensative management aims to alleviate impact of existing clones and corrective management aims to remove clones.
Clone Prevention: The earlier problems in source code are identified, the easier they are to fix. This also holds for code clones. In [149], Lague et al. propose to prevent the creation of new clones by analyzing code that gets committed to the central source code repository. If a change adds a clone, it needs to pass a special approval process before it is allowed into the system.
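The commit-time check described above can be illustrated with a minimal sketch. It is not the analysis used by Lague et al.; it assumes a much simpler technique (hashing whitespace-normalized windows of consecutive lines), and the function names and the window size of 5 lines are hypothetical choices made only for illustration.

```python
import hashlib


def normalize(line):
    # Collapse whitespace so that pure formatting changes do not hide a copy.
    return " ".join(line.split())


def window_hashes(lines, window=5):
    """Hash every run of `window` consecutive non-blank, normalized lines."""
    cleaned = [normalize(line) for line in lines if line.strip()]
    hashes = set()
    for i in range(len(cleaned) - window + 1):
        chunk = "\n".join(cleaned[i:i + window])
        hashes.add(hashlib.sha1(chunk.encode("utf-8")).hexdigest())
    return hashes


def change_adds_clone(repository_files, added_lines, window=5):
    """Return True if the added lines repeat a window already in the repository.

    `repository_files` is a list of files, each given as a list of lines.
    """
    known = set()
    for lines in repository_files:
        known |= window_hashes(lines, window)
    return bool(known & window_hashes(added_lines, window))
```

A change for which `change_adds_clone` returns True would be routed to the approval process; a change shorter than the window is never flagged, which is a limitation of this simplified scheme.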
Several processes [5, 51, 177] employ manual reviews of changes before the software can go into production. The LEvD process [51] we employ for the development of ConQAT, e.g., requires all code changes to be reviewed before a release. Manual review is supported by analysis tools, including clone detection. Clones thus draw attention during reviews and are, in most cases, marked as review findings that need to be consolidated by the original author. While this scheme does not prevent clones from being introduced into the source code repository, it does prevent them from being introduced into the released code base.
Existing clone prevention focuses on the clones, not on their root causes. However, while causes for cloning remain, maintainers are likely to continue to create clones. To be effective, clone prevention hence needs to analyze, and rectify, the causes for cloning.
Clone Compensation: Clone indication tools point out areas of cloned code to the developer during maintenance of code in an IDE. Their goal is to increase developer awareness of cloning and thus make unintentionally inconsistent changes less likely. Examples include [46, 59, 60, 92, 94, 102, 103, 218]. Real-time clone detection approaches have been proposed to quickly deliver up-to-date clone information for evolving software to clone indication tools [126, 235].
Linked editing tools replicate modifications made to one clone to its siblings [218]. They thus promise to reduce the modification overhead caused by cloning and the likelihood to make unintentionally inconsistent modifications. A similar idea is implemented by CReN [102] that consistently renames identifiers in cloned code.
Both clone indication and linked editing tools operate on the source code level. In a large system, clone comprehension, and thus clone compensation, can be supported through tools that offer interactive visualizations at different levels of abstraction. Examples include [219], [238] and [125].
Besides supporting comprehension of clones in a single system version, clone tracking tools aim to support comprehension of the evolution of clones in a system. Several tools to analyze the evolution of cloning have been proposed, including [60, 83, 85, 85, 132, 133, 181, 216]. In [91], Harder and Göde discuss that clone tracking and management face obstacles and raise costs in practice.
Clone Removal: Several authors have investigated corrective clone management. Fanta and Rajlich [68] report on an industrial case study in which certain clone types were removed manually from a C++ system. They identify the lack of dedicated tool support for clone removal as an obstacle for clone consolidation. Such tool support is proposed by other authors: Komondoor [136] investigates automated clone consolidation through procedure extraction. Baxter et al. [16] propose to generate C++ macro bodies as abstractions for clone groups and macro invocations to replace the clones. In [8], Balazinska et al. present an approach that consolidates clones through application of the strategy design pattern [74]; in their later paper [9], the same authors present an approach to support system refactoring to remove clones. In a more recent paper, the idea to suggest refactorings based on the results from clone detection is elaborated by Li and Thompson in [154] for the programming language Erlang.
Several authors have identified language limitations as one reason for cloning [140, 201]. To counter this, some authors have investigated further means to remove cloning. Murphy-Hill et al. study clone removal using traits [179]. Basit et al. study clone removal in C++ using a static metaprogramming language [15].
Organizational Change Management: Existing research in clone management primarily deals with technical challenges. But, to achieve adoption, and thus impact on software engineering practice, further obstacles have to be overcome. In his keynote speech published in [40], Jim Cordy outlines barriers in adoption of program comprehension techniques, including clone detection, by his industrial partners. Cordy does not mention technical challenges and immaturity of existing approaches, but instead business risks, management structures and social or cultural issues as central barriers to adoption. His reports confirm that adoption of clone detection or management approaches by industry faces challenges beyond the capabilities of the employed tools. Work of other researchers confirms challenges in research adoption beyond technical issues [38, 69, 209].
Introducing clone management to reduce the negative impact of cloning on maintenance efforts and program correctness is not a problem that can be solved simply by installing suitable tools. Instead, it requires changes of the work habits of developers. To be successful, introduction of clone management must thus overcome obstacles that arise when established processes and habits are to be changed.
Challenges faced when changing professional habits are not specific to the introduction of clone management. Instead, they are faced by all changes to development processes, including the introduction of development or quality analysis tools. Furthermore, they are not limited to changes to the development process, but instead permeate all organizational changes. This has been realized long ago: management literature contains a substantial body of knowledge on how to successfully coerce established habits into new paths [43, 130, 143–145, 152, 153], some dating back to the 1940s.
Summary: The research community produced substantial work on clone management, targeting prevention, compensation and removal of cloning. Much of this work focuses on a single management aspect, for example clone indication or tracking. However, the challenges faced by successful clone management are not limited to developing appropriate tools. Instead, they require both an understanding of the causes for cloning and changes to existing processes and developers' behavior.
Changing established behavior is hard. Work in organizational change management has shown that it encounters obstacles that need to be addressed for changes to succeed in the long term. This is confirmed by reports on reluctance to adopt clone management [40] and other quality analysis approaches [38] in industry.
Problem: Successful introduction of clone management requires changes to established processes and habits. Existing work on clone management, however, focuses primarily on tools for individual management tasks. This does not facilitate organizational change management. Without it, though, clone management approaches are unlikely to achieve long-term success in practice.
Contribution: Chapter 8 presents a method to introduce clone control into a software maintenance project. It adapts results from organizational change management to the domain of software cloning. Furthermore, it documents causes of cloning and their solutions for effective clone prevention. The chapter presents a long-term industrial case study that shows that the method can be employed to successfully introduce clone control, and reduce the amount of cloning, in practice.
3.5 Limitations of Clone Detection
Several studies investigate which clones certain detection approaches can find. However, to understand limitations of clone management in practice, we must understand which duplication they cannot find. This section outlines research on detection of program similarity beyond cloning created by copy & paste.
Simion Detection: Several authors dealt with the problem of finding behaviorally similar code, although often only for a specific kind of similarity.
An early paper on the subject by Marcus and Maletic [167] deals with the detection of so-called high-level concept clones. Their approach is based on reducing code chunks (usually methods or files) to token sets, and performing latent semantic indexing (LSI) and clustering on these sets to find parts of code that use the same vocabulary. The paper reports on finding multiple list implementations in a case study, but does not quantify the number of clones found or the precision of the approach. Limitations are identified especially in the case of missing or misleading comments, as these are included in the clone search.
The work of Kawrykow and Robillard [127] aims at finding methods in a Java program which reimplement functions available in libraries (APIs) used by the program. To this end, methods are reduced to the set of classes, methods, and fields used, which are extracted from the byte-code, and then matched pairwise to find similar methods. Additional heuristics are employed to reduce the false positive rate. Application to ten open source projects identified 405 "imitations" of API methods with an average precision of 31% (worst precision 4%). Since the entire set of all "imitations" of the methods is unclear, the recall is unknown.
Nguyen et al. [183] apply a graph mining algorithm to a normalized control/data-flow graph to find "usage patterns" of objects. The focus of their work is not the detection of cloning, but rather of similar but inconsistent patterns, which hint at bugs. The precision of this process is about 20% (footnote 2). Again, w.r.t. simion detection, recall is unclear.
The paper [107] by Jiang et al. introduces an approach that can be summarized as dynamic equivalence checking. The basic idea is that if two functions are different, they will return different results on the same random input with high probability. Their tool, called EQMINER, detects functionally equivalent functions in C code dynamically by executing them on random inputs. Using this tool, they find 32,996 clusters of similar code in a subset of about 2.8 million lines of the Linux kernel. Using their clone detector Deckard they report that about 58% of the behaviorally similar code discovered is syntactically different. Since no systematic inspection of the clusters is reported, no precision numbers are available. Again, due to several practical limitations of the approach (e.g., randomization of return values to external API calls), the recall w.r.t. simion detection is unclear.
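The idea of dynamic equivalence checking can be illustrated with a small sketch. It is not EQMINER: it assumes pure Python functions over small integer arguments instead of C code fragments, and the names `fingerprint` and `cluster_by_behavior` as well as the input range and trial count are hypothetical. Functions that agree on all shared random inputs end up in the same cluster.

```python
import random
from collections import defaultdict


def fingerprint(func, inputs):
    """Record the outcome of func on each shared input; exceptions count as outcomes."""
    results = []
    for args in inputs:
        try:
            results.append(repr(func(*args)))
        except Exception as exc:
            results.append("raised " + type(exc).__name__)
    return tuple(results)


def cluster_by_behavior(funcs, arity, trials=50, seed=0):
    """Group functions that produce identical outcomes on the same random inputs."""
    rng = random.Random(seed)
    inputs = [tuple(rng.randint(-100, 100) for _ in range(arity))
              for _ in range(trials)]
    clusters = defaultdict(list)
    for func in funcs:
        clusters[fingerprint(func, inputs)].append(func.__name__)
    return sorted(sorted(names) for names in clusters.values())


def max_branch(x, y):
    return x if x > y else y


def max_arith(x, y):
    # Syntactically unrelated to max_branch, but behaviorally identical on integers.
    return (x + y + abs(x - y)) // 2


def min_builtin(x, y):
    return min(x, y)
```

Calling `cluster_by_behavior([max_branch, max_arith, min_builtin], arity=2)` groups the two maximum implementations together although they share no syntax. As with the original approach, agreement on random inputs only suggests equivalence; it does not prove it.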
In [1], Al-Ekram et al. search for cloning between different open-source systems using a token-based clone detector. They report that, to their surprise, they found little behaviorally similar code across different systems, although the systems offered related functionality. The clones they did find were typically in areas where the use of common APIs imposed a certain programming style, thereby limiting program variation. However, since the absolute amount of behaviorally similar code between the different systems is unknown, it is unclear whether the small amount of detected behaviorally similar clones is due to their absence in the analyzed systems, or due to limitations of clone detection.
Footnote 2: When including "code that could be improved for readability and understandability" as flaws, the paper reports near 40% precision.
In [12, 13], Basit and Jarzabek propose approaches to detect higher-level similarity patterns in software. Their approach employs conventional clone detection and groups detected clones according to different relation types, such as call relationships between the clones. While their approach helps to comprehend detected clones through inferring structure, it does not detect more redundancy than conventional clone detection, since it builds on it. It does thus not improve our understanding of the limitations of clone detection w.r.t. simion detection.
Algorithm Recognition: The goal of algorithm recognition [2, 176, 232] is to automatically recognize different forms of a known algorithm in source code. Just as clone detection, it has to cope with program variation. The most fundamental difference w.r.t. similar code detection is that for algorithm recognition as proposed by [176, 232], the algorithms to be recognized need to be known in advance.
Summary: Existing work on comparison of clone detection approaches [19, 196, 197, 200] has shed light on the capabilities of clone detection to detect clones created through copy & paste & modify. However, we know little about the limitations of clone detection w.r.t. discovery of behaviorally similar code that is not a result of copy & paste but has been created independently.
Problem: We do not know how structurally different independently developed code with similar behavior actually is. As a result, it is unclear to which extent real-world programs contain redundancy that cannot be attributed to copy & paste, although intuition tells us that large projects are expected to contain multiple implementations of the same functionality. As a consequence, we do not know if we can discover simions that result from independent implementation of redundant requirements on the code level.
Contribution: Chapter 9 presents the results of a controlled experiment that analyzes the amount of program variation in over 100 implementations of a single specification that were produced independently by student teams. It shows that existing clone detection approaches (not only existing detectors) are poorly suited to detect simions that have not been created by copy & paste, emphasizing the need to avoid creation of simions in the first place.
4 Impact on Program Correctness
Much of the research in clone detection and management is based on the assumption that unawareness of cloning during maintenance threatens program correctness. This assumption, however, has not been validated empirically. We do not know how well aware of cloning developers are, and conversely, how strongly a lack of awareness impacts correctness. The impact of cloning on program correctness is, hence, insufficiently understood. The importance of cloning, and clone management, thus remains unclear.
This chapter analyzes the impact of unawareness of cloning on program correctness through a large industrial case study. It thus contributes to the better understanding of the impact of cloning and the importance to perform clone detection and clone management in practice. Parts of the content of this chapter have been published in [115].
4.1 Research Questions
We summarize the study using the goal definition template as proposed in [234]:
Analyze cloning in source code
for the purpose of characterization and understanding
with respect to its impact on program correctness
from the viewpoint of software developer and maintainer
in the context of industrial and open source projects
Therefore, a set of industrial and open source projects are used as study objects. In detail, we investigate the following 4 research questions:
RQ1 Are clones changed independently?
The first question investigates whether type-3 clones appear in real-world systems. Besides whether we can find them, it explores if they constitute a significant part of the total clones of a system. It does not make sense to analyze inconsistent changes to clones if they are a rare phenomenon.
RQ2 Are type-3 clones created unintentionally?
Having established that there are type-3 clones in real systems, we analyze whether they have been created intentionally or not. It can be sensible to change a clone so that it becomes a type-3 clone, if it has to conform to different requirements than its siblings. On the other hand, unintentional differences can indicate problems that were not fixed in all siblings.
Figure 4.1: Clone group sets
RQ3 Can type-3 clones be indicators for faults?
After establishing these prerequisites, we can determine whether the type-3 clones are indicators for faults in real systems.
RQ4 Do unintentional differences between type-3 clones indicate faults?
This question determines the importance of clone management in practice. Are unintentionally created type-3 clones likely to indicate faults? If so, the reduction of unintentionally inconsistent modifications can reduce the likelihood of errors. If not, clone management is less useful in practice.
4.2 Study Design
We analyze the sets of clone groups as shown in Fig. 4.1: the outermost set contains all clone groups C in a system; IC denotes the set of type-3 clone groups; UIC denotes the set of type-3 clone groups whose differences are unintentional; the differences between the siblings are not wanted. The subset F of UIC comprises those type-3 clone groups with unintentional differences that indicate a fault in the program. We focus on clone groups, instead of on individual clones, since differences between clones are revealed only by comparison, and thus in the context of a clone group, and not apparent in the individual clones when regarded in isolation. Furthermore, we do not distinguish between created and evolved type-3 clones: for the question of faultiness, it does not matter when the differences have been introduced.
The independent variables in the study are development team, programming language, functional domain, age and size. The dependent variables are explained below.
RQ1 investigates the existence of type-3 clones in real-world systems. To answer it, we analyze the size of set IC with respect to the size of set C. We apply our type-3 clone detection approach (cf. Section 7.3.4) to all study objects, perform manual assessment of the detected clones to eliminate false positives and calculate the type-3 clone ratio |IC|/|C|.
theRQs2izeinofvtheestigsetsatesUICwhetandherIC.type-3Theclonessetsarearecreatedpopulatedbyshounintentionallywing.eachToansweridenti®edit,wetype-3compareclone
toThisdevgiveselopersustheofthesystemunintentionallyandaskinginconsistentthemtocloneratertheatio|difUIC|ferences/|IC|.asintentionalorunintentional.
RQ3 investigates whether type-3 clones indicate faults. To answer it, we compute the size of set F in relation to the size of IC. The set F is, again, populated by asking developers of the respective system. Their expert opinion classifies the clones into faulty and non-faulty. We only analyze type-3 clones with unintentional differences. Our faulty inconsistent clone ratio |F|/|IC| is thus a lower bound, as potential faults in intentionally different type-3 clones are not considered.
Based on this ratio, we create a hypothesis to answer RQ3. We need to make sure that the fault density in the inconsistencies is higher than in randomly picked lines of code. This leads to the hypothesis H:
The fault density in the inconsistencies is higher than the average fault density.
As we do not know the actual fault densities of the analyzed systems, we need to resort to average values. The span of available numbers is large because of the high variation in software systems. Endres and Rombach [64] give 0.1–50 faults per kLOC as a typical range. For the fault density in the inconsistencies, we use the number of faults divided by the logical lines of code of the inconsistencies. We refrain from testing the hypothesis statistically because of the low number of data points as well as the large range of typical defect densities.
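The fault density measure used for hypothesis H is simple arithmetic and can be sketched as follows. The function names are hypothetical, the typical range is the one quoted from Endres and Rombach [64], and the numbers in the usage note are made up for illustration rather than taken from the study.

```python
# Typical fault density range in faults per kLOC, as quoted from [64].
TYPICAL_RANGE = (0.1, 50.0)


def fault_density_per_kloc(faults, logical_lines):
    """Number of faults per 1000 logical lines of inconsistent clone code."""
    if logical_lines <= 0:
        raise ValueError("logical_lines must be positive")
    return faults * 1000.0 / logical_lines


def compare_to_typical(density, typical=TYPICAL_RANGE):
    """Classify a density as below, within, or above the typical range."""
    low, high = typical
    if density < low:
        return "below"
    if density > high:
        return "above"
    return "within"
```

For example, 12 faults in 300 inconsistent logical lines give a density of 40 faults per kLOC, which `compare_to_typical` places within the typical range. Because that range spans more than two orders of magnitude, such a comparison is indicative only, which matches the decision not to test H statistically.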
RQ4 investigates whether unintentionally different type-3 clones indicate faults. To answer it, we compute the size of set F in relation to the size of set UIC. Again, the faulty unintentionally inconsistent clones ratio |F|/|UIC| is a lower bound, as potential faults in intentionally different clones are not considered.
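The four ratios follow directly from the nesting of the sets F, UIC, IC and C. The sketch below makes that nesting explicit; the function name and the clone group identifiers in the usage note are hypothetical and serve only as an illustration.

```python
def clone_ratios(C, IC, UIC, F):
    """Compute the ratios for RQ1 to RQ4 from sets of clone group identifiers."""
    # The study design requires F to be a subset of UIC, UIC of IC, and IC of C.
    if not (F <= UIC <= IC <= C):
        raise ValueError("expected F within UIC within IC within C")
    if not UIC:
        raise ValueError("UIC (and hence IC and C) must be non-empty")
    return {
        "RQ1 |IC|/|C|": len(IC) / len(C),
        "RQ2 |UIC|/|IC|": len(UIC) / len(IC),
        "RQ3 |F|/|IC|": len(F) / len(IC),
        "RQ4 |F|/|UIC|": len(F) / len(UIC),
    }
```

With ten clone groups `C = set(range(10))`, of which `IC = {0, 1, 2, 3, 4}`, `UIC = {0, 1}` and `F = {0}`, the ratios are 0.5, 0.4, 0.2 and 0.5. Since F only contains groups that were rated unintentional, the RQ3 and RQ4 ratios are lower bounds, as stated above.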
4.3 Study Objects
Since we required the willingness of developers to participate in clone inspections and clone detection tailoring, we had to rely on our contacts with industry in our choice of study objects. However, we chose systems with different characteristics to increase generalizability of the results.
We chose 2 companies and 1 open source project as sources of software systems. We chose systems written in different languages, by different teams in different companies and with different functionalities. The objects included 3 systems written in C#, a Java system as well as a long-lived COBOL system. All of them are in production. For non-disclosure reasons, we gave the commercial systems names from A to D. An overview is shown in Table 4.1.
Although systems A, B and C are all owned by Munich Re, they were each developed by different organizations. They provide substantially different functionality, ranging from damage prediction, over pharmaceutical risk management to credit and company structure administration. The systems support between 10 and 150 expert users each. System D is a mainframe-based contract management system written in COBOL employed by about 150 users. The open source system Sysiphus (footnote 1) is developed at the Technische Universität München (but the author of this thesis has not been involved in its development). It constitutes a collaboration environment for distributed software development projects. We included an open source system because, as the clone detection tool is also freely available, the results can be externally replicated (footnote 2). This is not possible with the detailed confidential results of the commercial systems.
Table 4.1: Summary of the analyzed systems
System    Organization  Language  Age (years)  Size (kLOC)
A         Munich Re     C#        6            317
B         Munich Re     C#        4            454
C         Munich Re     C#        2            495
D         LV 1871       COBOL     17           197
Sysiphus  TUM           Java      8            281
4.4 Implementation and Execution
RQ1: For all systems, our clone detector ConQAT was executed by a researcher to identify type-3 clone candidates. On a 1.7 GHz notebook, the detection took between one and two minutes for each system. The detection was configured to not cross method boundaries, since experiments showed that type-3 clones that cross method boundaries in many cases did not capture semantically meaningful concepts. This is also noted for type-2 clones in [142] and is even more pronounced for type-3 clones. In COBOL, sections in the procedural division are the counterpart of Java or C# methods; clone detection for COBOL was limited to these.
For the C# and Java systems, the algorithm was parameterized to use 10 statements as minimal clone length, a maximum edit distance of 5, a maximal gap ratio (i.e., the ratio of edit distance and clone length) of 0.2 and the constraint that the first 2 statements of two clones must be equal. Due to the verbosity of COBOL [62], minimal clone length and maximal edit distance were doubled to 20 and 10, respectively. Generated code that is not subject to manual editing was excluded from clone detection, since incomplete manual updates obviously cannot occur. Normalization of identifiers and constants was tailored as appropriate for the analyzed language, to allow for renaming of identifiers while avoiding too high false positive rates. These settings were determined to represent the best combination of precision and recall during cursory experiments on the analyzed systems, for which random samples of the detected clones were assessed manually.
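To make the interplay of these parameters concrete, the following sketch checks whether two already normalized statement sequences satisfy them. It is not ConQAT's detection algorithm, which searches for such pairs efficiently; it only restates the thresholds as a predicate. Taking the shorter sequence as the clone length and requiring a non-zero edit distance (to exclude ungapped clones) are assumptions of this sketch, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionParams:
    min_clone_length: int = 10   # statements
    max_edit_distance: int = 5
    max_gap_ratio: float = 0.2   # edit distance divided by clone length
    equal_prefix: int = 2        # leading statements that must be equal


# For COBOL, minimal length and maximal edit distance were doubled.
COBOL_PARAMS = DetectionParams(min_clone_length=20, max_edit_distance=10)


def edit_distance(a, b):
    """Levenshtein distance between two statement sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def is_type3_candidate(a, b, params=DetectionParams()):
    """Check a pair of normalized statement sequences against the thresholds."""
    length = min(len(a), len(b))  # assumption: clone length is the shorter sequence
    if length < params.min_clone_length:
        return False
    if a[:params.equal_prefix] != b[:params.equal_prefix]:
        return False
    distance = edit_distance(a, b)
    return (0 < distance <= params.max_edit_distance
            and distance / length <= params.max_gap_ratio)
```

With the default parameters, two 12-statement sequences that differ in one statement qualify (gap ratio 1/12), whereas three differing statements do not (gap ratio 0.25), even though the edit distance of 3 is below the absolute limit of 5. For short clones the gap ratio is therefore the binding constraint.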
The detected clone candidates were then manually rated by the author to remove false positives, i.e., code fragments that, although identified as clone candidates by the detection algorithm, have no semantic relationship. Type-3 and ungapped (type-1 and type-2) clone group candidates were treated differently: all type-3 clone group candidates were rated, producing the set of type-3 clone groups IC. Since the ungapped clone groups were not required for further steps of the case study, instead of rating all of them, a random sample of 25% was rated, and false positive rates then extrapolated to determine the number of ungapped clones.
Footnote 1: http://sysiphus.in.tum.de/
Footnote 2: http://wwwbroy.in.tum.de/~ccsm/icse09/
RQs 2, 3 and 4: The type-3 clone groups were presented to the developers of the respective systems using ConQAT's clone inspection viewer. The developers rated whether the clone groups were created intentionally or unintentionally. If a clone group was created unintentionally, the developers also classified it as faulty or non-faulty. For the Java and C# systems, all type-3 clone groups were rated by the developers. For the COBOL system, rating was limited to a random sample of 68 out of the 151 type-3 clone groups, since the age of the system and the fact that the original developers were not available for rating increased rating effort. Thus, for the COBOL case, the results for RQ2 and RQ3 were computed based on this sample. In cases where intentionality or faultiness could not be determined, e.g., because none of the original developers could be accessed for rating, the inconsistencies were treated as intentional and non-faulty.
4.5 Results
RQ1: The quantitative results of our study are summarized in Table 4.2. Except for the COBOL system D, the precision values are smaller for type-3 clone groups than for ungapped clone groups. This is not unexpected, since type-3 clone groups allow for more deviation. The high precision results of system D result from the rather conservative clone detection parameters chosen due to the verbosity of COBOL. For system A, stereotype database access code of semantically unrelated objects gave rise to lower precision values. About half of the clones (52%) are strict type-3 clones: their clones differ beyond identifier names and literal or constant values. Therefore, RQ1 can be answered positively: clones are changed independently, resulting in type-3 clones in their systems.
Table 4.2: Summary of the study results
Project                                  A     B     C     D     Sysiphus  Sum   Mean
Precision ungapped clone groups          0.88  1.00  0.96  1.00  0.98      -     0.96
Precision type-3 clone groups            0.61  0.86  0.80  1.00  0.87      -     0.83
Clone groups |C|                         286   160   326   352   303       1427  -
Type-3 clone groups |IC|                 159   89    179   151   146       724   -
Unintent. diff. type-3 groups |UIC|      51    29    66    15    42        203   -
Faulty clone groups |F|                  19    18    42    5     23        107   -
RQ1 |IC|/|C|                             0.56  0.56  0.55  0.43  0.48      -     0.52
RQ2 |UIC|/|IC|                           0.32  0.33  0.37  0.10  0.29      -     0.28
RQ3 |F|/|IC|                             0.12  0.20  0.23  0.03  0.16      -     0.15
RQ4 |F|/|UIC|                            0.37  0.62  0.64  0.33  0.55      -     0.50
Inconsistent logical lines               442   197   797   1476  459       3371  -
Fault density (faults per kLOC)          43    91.4  52.7  3.4   50.1      -     48.1
Figure 4.2: Different UI behavior: right side does not use operations (Sysiphus)
RQ2: From these type-3 clones, over a quarter (28%) has been introduced unintentionally. Hence, RQ2 can also be answered positively: type-3 clones are created unintentionally in many cases. Only system D exhibits a lower value, with only 10% of unintentionally created type-3 clones. With about three quarters of intentional changes, this shows that cloning and changing code seems to be a frequent pattern during development and maintenance.
RQ3: At least 3–23% of the differences represented a fault. Again, the by far lowest number comes from the COBOL system. Ignoring it, the total ratio of faulty type-3 clone groups goes up to 18%. This constitutes a significant share that needs consideration. To judge hypothesis H, we also calculated the fault densities. They lie in the range of 3.4–91.4 faults per kLOC. Again, system D is an outlier. Compared to reported fault densities in the range of 0.1 to 50 faults per kLOC and considering that all systems are not only delivered but even have been productive for several years, we consider our results to support hypothesis H. On average, the inconsistencies contain more faults than average code. Hence, RQ3 can also be answered positively: type-3 clones can be indicators for faults in real systems.
Although not central to our research questions, the detection of faults almost automatically raises the question of their severity. As the fault effect costs are unknown for the analyzed systems, we cannot provide a full-fledged severity classification. However, we provide a partial answer by categorizing the found faults:
Critical: faults that lead to potential system crash or data loss. One example for a fault in this category is shown in Figure 1.2 in Chapter 1. Here, one clone of the affected clone group performs a null-check to prevent a null-pointer dereference, whereas the other does not. Other examples we encountered are index-out-of-bounds exceptions, incorrect transaction handling and missing rollbacks.
User-visible: faults that lead to unexpected behavior visible to the end user. Fig. 4.2 shows an example: in one clone, the performed operation is not encapsulated in an operation object and, hence, is handled differently by the undo mechanism. Further examples we found are incorrect end user messages, inconsistent default values as well as different editing and validation behavior in similar user forms and dialogs.
Non-user-visible: faults that lead to unexpected behavior not visible to the end user. Examples we identified include unnecessary object creation, minor memory leaks, performance issues like missing break statements in loops and redundant re-computations of cached values; differences in exception handling, different exception and debug messages or different log levels for similar cases.
Of the 107 faults found, 17 were categorized as critical, 44 as user-visible and 46 as non-user-visible faults. Since all analyzed systems are in production, the relatively smaller number of critical faults coincides with our expectations.
RQ4: While the numbers are similar for the C# and Java projects, rates of unintentional inconsistencies and thus faults are comparatively low for project D, which is a legacy system written in COBOL. To a certain degree, we attribute this to our conservative assessment strategy of treating inconsistencies whose intentionality and faultiness could not be unambiguously determined as intentional and non-faulty. Furthermore, interviewing the current maintainers of the systems revealed that cloning is such a common pattern in COBOL systems that searching for duplicates of a piece of code is an integral part of their maintenance process. Compared to the developers of the other projects, the COBOL developers were thus more aware of clones in the system.
The row |F|/|UIC| in Table 4.2 accounts for this difference in "clone awareness". It reveals that, while the rates of unintentional changes are lower for project D, the ratio of unintentional changes leading to a fault is in the same range for all projects. From our results, it seems that about every second to third unintentional change to a clone leads to a fault.
4.6 Discussion
Even considering the threats to validity discussed below, the results of the study show convincingly that clones can lead to faults. The inconsistencies between clones are often not justified by different requirements but can be explained by developer mistakes.
While the ratio of unintentionally inconsistent changes varied strongly between systems, we consistently found across all study objects that unintentionally inconsistent changes are likely to indicate faults: on average, in roughly every second case. We consider this a strong indication that clone management is useful in practice, since it can reduce the likelihood of unintentionally inconsistent changes.
4.7 Threats to Validity
We discuss how we mitigated threats to internal and external validity of our studies.
4.7.1 Internal Validity
We did not analyze the evolution histories of the systems to determine whether the inconsistencies have been introduced by incomplete changes to the system and not by random similarities of unrelated code. This has two reasons: (1) We want to analyze all type-3 clones, also the ones that have been introduced directly by copy and modification in a single commit. Those might not be visible in the repository. (2) The industrial systems do not have complete development histories. We confronted this threat by manually analyzing each potential type-3 clone.
The comparison with the average fault density is not a perfect way to determine whether the inconsistencies are more fault-prone than a random piece of code. A comparison with the actual fault densities of the systems or actual checks for faults in random code lines would better suit this purpose. However, the actual fault densities are not available to us because of incomplete defect databases. Checking for faults in random code lines is practically not possible. We would need the developers' time and willingness for inspecting random code. As the potential benefit for them is low, the motivation would be low and hence the results would be unreliable.
As we ask the developers for their expert opinion on whether an inconsistency is intentional or unintentional and faulty or non-faulty, a threat is that the developers do not judge this correctly. One case is that the developer assesses something that is faulty incorrectly as non-faulty. This case only reduces the chances to positively answer the research questions. The second case is that the developers rate something as faulty which is no fault. We mitigated this threat by only rating an inconsistency as faulty if the developer was entirely sure. Otherwise it was postponed and the developer consulted colleagues who knew the corresponding part of the code better. Inconclusive candidates were ranked as intentional and non-faulty. Again, only the probability to answer the research question positively was reduced.
The configuration of the clone detection tool has a strong influence on the detection results. We calibrated the parameters based on a pre-study and our experience with clone detection in general. The configuration also varies over the different programming languages encountered, due to their differences in features and language constructs. However, this should not strongly affect the detection of type-3 clones because we took great care to configure the tool in a way that the resulting clones are sensible.
We also pre-processed the type-3 clones that we presented to the developers to eliminate false positives. This could mean that we excluded clones that were faulty. However, this again only reduced the chances that we could answer our research question positively.
Our definition of clones and clone groups does not prevent different groups from overlapping with each other; a group with two long clones can, e.g., overlap with a group with four shorter clones, as, e.g., groups b and c in the example in Section 2.5.1. Substantial overlap between clone groups could potentially distort the results. This did, however, not occur in the study, since there was no substantial overlap between clone groups in IC. For system A, e.g., 89% of the cloned statements did not occur in any other clone. Furthermore, overlap was taken into account when counting faults: even if a faulty statement occurred in several overlapping clone groups, it was only counted as a single fault.
4.7.2 External Validity
The projects were obviously not sampled randomly from all possible software systems but we relied on our connections with the developers of the systems. Hence, the set of systems is not entirely representative. The majority of the systems is written in C# and analyzing 5 systems in total is not a high number. However, all 5 systems have been developed by different development organizations and the C# systems are technically different (2 web, 1 rich client) and provide substantially different functionalities. We further mitigated this threat by also analyzing a legacy COBOL system as well as an open source Java system.
4.8 Summary
This chapter presented the results of a large case study on the impact of cloning on program correctness. In the five analyzed systems, 107 faults were discovered through the analysis of unintentionally inconsistent changes to cloned code. Of them, 17 were classified as critical by the system developers; 44 could cause undesired program behavior that was visible to the user.
We observed two effects concerning the maintenance of clones. First, the awareness of cloning varied substantially across the systems. Some developer teams were more aware of the existing clones than others, resulting in different likelihoods of unintentionally inconsistent changes to cloned code. Second, the impact of unawareness of cloning was consistent. On average, every second unintentional inconsistency indicated a fault in the software. In a nutshell, while the amount of unawareness of cloning varied between systems, it had a consistently negative impact.
The study results emphasize the negative impact of a lack of awareness of cloning during maintenance. Consequently, they emphasize the importance of clone control. Since every second unintentionally inconsistent change created a fault (or failed to remove a fault from the system), clone control can provide substantial value if it manages to decrease the likelihood of such changes, by decreasing the extent and increasing the awareness of cloning.
5 Cloning Beyond Code
The previous chapter has shown that unawareness of clones in source code negatively affects program correctness. Cloning has, however, not been investigated in other artifact types. It is thus unclear whether clones occur and should be controlled in other artifacts, too.
We conjecture that cloning can occur in all artifacts created and maintained during software engineering, including non-code artifacts, and that engineers need to be aware of clones when using them.
This chapter presents a large case study on clones in requirements specifications and data-flow models. It investigates the extent of clones in these artifacts and its impact on engineering activities. It demonstrates that cloning can occur in non-code artifacts and gives indication for its negative impact. Parts of the content of this chapter have been published in [54, 57, 111].
5.1 Research Questions
We summarize the study using the goal definition template as proposed in [234]:
Analyze cloning in requirements specifications and models
for the purpose of characterization and understanding
with respect to its extent and impact on engineering activities
from the viewpoint of requirements engineer and quality assessor
in the context of industrial projects
Therefore, a set of specifications and models from industrial projects is used as study objects. We further detail the objectives of the study using five research questions. The first four questions target requirements specifications, the fifth targets data-flow models.
RQ 5 How accurately can clone detection discover cloning in requirements specifications?
We need an automatic detection approach for a large-scale study of cloning in requirements specifications. This question investigates whether existing clone detectors are appropriate, or if new approaches need to be developed. It provides the basis for the study of the extent and impact of requirements cloning.
RQ 6 How much cloning do real-world requirements specifications contain?
The amount of cloning in requirements specifications determines the relevance of this study. If they contain little or no cloning, it is unlikely to have a strong impact on maintenance.
RQ 7 What kind of information is cloned in requirements specifications?
The kind of information that is cloned influences the impact of cloning on maintenance. Is cloning limited to, or especially frequent for, a specific kind of information contained in requirements specifications?
RQ 8 Which impact does cloning in requirements specifications have?
Cloning in code is known to have a negative impact on maintenance. Can it also be observed for cloning in specifications? This question determines the relevance of cloning in requirements specifications for software maintenance.
RQ 9 How much cloning do real-world Matlab/Simulink Models contain?
As for code and requirements specifications, the amount of cloning is an indicator of the importance of clone detection and clone management for real-world Matlab/Simulink models.
5.2 Study Design
A requirements specification is interpreted as a single sequence of words. In case it comprises multiple documents, individual word lists are concatenated to form a single list for the requirements specification. Normalization is a function that transforms words to remove subtle syntactic differences between words with similar denotation. A normalized specification is a sequence of normalized words. A specification clone candidate is a (consecutive) substring of the normalized specification with a certain minimal length, appearing at least twice.
For specification clone candidates to be considered as clones, they must convey semantically similar information and this information must refer to the system described. Examples of clones are duplicated use case preconditions or system interaction steps. Examples of false positives are duplicated document headers or footers or substrings that contain the last words of one and the first words of the subsequent sentence without conveying meaning.
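The definitions above can be illustrated with a minimal Python sketch. It is not the algorithm used by ConQAT: the normalization (lowercasing and stripping punctuation) and the brute-force window index are assumptions chosen for brevity, and the function names are hypothetical. The sketch reports every word sequence of a given minimal length that appears at least twice in the normalized specification.

```python
import re
from collections import defaultdict


def normalize(word):
    """Assumed normalization: lowercase and drop non-alphanumeric characters."""
    return re.sub(r"[^0-9a-z]", "", word.lower())


def clone_candidates(documents, min_length=20):
    """Map each window of `min_length` normalized words that occurs at
    least twice to the list of its start positions."""
    # The word lists of all documents are concatenated into one sequence.
    words = [normalize(w) for doc in documents for w in doc.split()]
    words = [w for w in words if w]  # drop tokens that normalize to nothing
    positions = defaultdict(list)
    for start in range(len(words) - min_length + 1):
        window = tuple(words[start:start + min_length])
        positions[window].append(start)
    return {w: p for w, p in positions.items() if len(p) >= 2}


# Invented example text; a short minimal length keeps the example small.
docs = [
    "The user must be logged in. The system shows the start page.",
    "The user must be logged in! The system shows an error page.",
]
found = clone_candidates(docs, min_length=6)
```

Overlapping windows belong to the same longer duplicate; a real detector would merge them into maximal clones and would still need the manual relevance check described above.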
RQs 5 to 8 The study uses content analysis of specification documents to answer the research questions. For further explorative analyses, the content of source code is also analyzed. Content analysis is performed using ConQAT as clone detection tool as well as manually.
First, we assign requirements specifications to pairs of researchers for analysis. Assignment is randomized to reduce any potential bias that is introduced by the researchers. Clone detection is performed on all documents of a specification.
Next, the researcher pairs perform clone detection tailoring for each specification. For this, they manually inspect detected clones for false positives. Filters are added to the detection configuration so that these false positives no longer occur. The detection is re-run and the detected clones are analyzed. This is repeated until no false positives are found in a random sample of the detected clone groups. To answer RQ 5, precision before and after tailoring, categories of false positives and times required for tailoring are recorded.
The results of the tailored clone detection comprise a report with all clones and clone metrics that are used to answer RQ 6: clone coverage, number of clone groups and clones, and overhead. Overhead is measured in relative and absolute terms. Standard values for reading and inspection speeds from the literature are used to quantify the additional effort that this overhead causes. Overhead and cloning-induced efforts are used to answer RQ 8.
For each specification, we qualitatively analyze a random sample of clone groups for the kind of information they contain. We start with an initial categorization from an earlier study [57] and extend it, when necessary, during categorization (formally speaking, we thus employ a mixed theory-based and grounded theory approach [39]). If a clone contains information that can be assigned to more than one category, it is assigned to all suitable categories. The resulting categorization of cloned information in requirements specifications is used to answer RQ 7. To ensure a certain level of objectiveness, inter-rater agreement is measured for the resulting categorization.
In many software projects, SRS are no read-only artifacts but undergo constant revisions to adapt to ever changing requirements. Such modifications are hampered by cloning as changes to duplicated text often need to be carried out in multiple locations. Moreover, if the changes are unintentionally not performed to all affected clones, inconsistencies can be introduced in the SRS that later on create additional efforts for clarification. In the worst case, they make it to the implementation of the software system, causing inconsistent behavior of the final product. Studies show that this occurs in practice for inconsistent modifications to code clones [115]. We thus expect that it can also happen in SRS. Hence, besides the categories, further noteworthy issues of the clones noticed during manual inspection are documented, such as inconsistencies in the duplicated specification fragments. This information is used for additional answers to RQ 8.
Moreover, on selected specifications, content analysis of the source code of the implementation is performed: we investigate the code corresponding to specification clones to classify whether the specification cloning resulted in code cloning, duplicated functionality without cloning, or was resolved through the creation of a shared abstraction. These effects are only given qualitatively. Further quantitative analysis is beyond the scope of this thesis.
In the final step, all collected data is analyzed and interpreted to answer the research questions. An overview of the steps of the study is given in Fig. 5.1.
RQ 9 We used the clone detection approach presented in Sec. 7.3.5 to detect clones in Matlab/Simulink models. To capture the extent of cloning in models, we recorded clone counts and coverage.
5.3 Study Objects
RQs 5 to 8 We use 28 requirements specifications as study objects from the domains of administration, automotive, convenience, finance, telecommunication, and transportation. The specified systems include software development tools, business information systems, platforms, and embedded systems. The specifications are written in English or German; their scope ranges from a part to the entire set of requirements of the software systems they describe. For non-disclosure reasons, the systems are named A to Z, AB and AC. An overview is given in Table 5.1. The specifications were obtained from different organizations¹, including Munich Re Group, Siemens AG and the MOST Cooperation.
[Figure 5.1: Study design overview. Flowchart steps: random assignment of spec.; run clone detection tool; inspect detected clones; false positives? (yes: add filter; no: categorize clones); independent re-categorization; analysis of further effects; data analysis & interpretation.]
The specifications mainly contain natural language text. If present, other content, such as images or diagrams, was ignored during clone detection. Specifications N, U and Z are Microsoft Excel documents. Since they are not organized as printable pages, no page counts are given for them. The remaining specifications are either in Adobe PDF or Microsoft Word format. In some cases, these specifications are generated from requirements management tools. To the best of our knowledge, the duplication encountered in the specifications is not introduced during generation.
Obviously, the specifications were not sampled randomly, since we had to rely on our relationships with our partners to obtain them. However, we selected specifications from different companies for different types of systems in different domains to increase generalizability of the results.
RQ 9 We employed a model provided by MAN Nutzfahrzeuge Group. It implements the major part of the power train management system. To allow for adaption to different variants of trucks and buses, it is heavily parameterized. The model consists of more than 20,000 TargetLink blocks that are distributed over 71 Simulink files. Such files are typical development/modelling units for Simulink/TargetLink.
¹ Due to non-disclosure reasons, we cannot list all 11 companies from which specifications were obtained.
Table 5.1: Study objects
Spec   Pages       Words     Spec   Pages       Words
A        517      41,482     O        184      18,750
B      1,013     130,968     P         45       6,977
C        133      18,447     Q         33       5,040
D        241      37,969     R        109      15,462
E        185      37,056     S        144      24,343
F         42       7,662     T         40       7,799
G         85      10,076     U        n/a      43,216
H        160      19,632     V        448      95,399
I         53       6,895     W        211      31,670
J         28       4,411     X        158      19,679
K         39       5,912     Y        235      49,425
L        535      84,959     Z        n/a      13,807
M        233      46,763     AB     3,100     274,489
N        n/a     103,067     AC       696      81,410
Sum    8,667   1,242,765
5.4 Execution and Implementation
This section details how the study design was implemented and executed on the study objects.
RQs 5 and 6 Clone detection and metric computation is performed using the tool ConQAT as described in Sec. 3.3. Detection used a minimal clone length of 20 words. This threshold was found to provide a good balance between precision and recall during precursory experiments that applied clone detection tailoring.
Precision is determined by measuring the percentage of the relevant clones in the inspected sample. Clone detection tailoring is performed by creating regular expressions that match the false positives. Specification fragments that match these expressions are then excluded from the analysis. A maximum number of 20 randomly chosen clone groups is inspected in each tailoring step, to keep manual effort within feasible bounds, if more than 20 clone groups are found for a specification; else, false positives are removed manually and no further tailoring is performed.
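The tailoring step and the precision measure can be sketched as follows. This is an illustration under assumptions, not ConQAT's filter configuration: the function names, the sample fragments and the footer pattern are invented, and relevance ratings are assumed to be plain booleans recorded during manual inspection.

```python
import re


def apply_filters(fragments, filter_patterns):
    """Exclude fragments that match any tailoring regular expression."""
    compiled = [re.compile(p, re.IGNORECASE) for p in filter_patterns]
    return [f for f in fragments if not any(c.search(f) for c in compiled)]


def precision(relevance_ratings):
    """Share of inspected clone groups that were rated relevant."""
    if not relevance_ratings:
        return None  # no clones detected, so no precision value is given
    return sum(relevance_ratings) / len(relevance_ratings)


# Hypothetical fragments; the page footer text is invented.
fragments = [
    "Copyright 2009 Example Corp. All rights reserved. Page 12 of 80",
    "The user enters the contract number and confirms the dialog",
    "Copyright 2009 Example Corp. All rights reserved. Page 13 of 80",
]
kept = apply_filters(fragments, [r"copyright \d{4} .* page \d+ of \d+"])
sample_precision = precision([True, False, True, True])
```

In this toy run a single expression removes both footer fragments, which mirrors the low tailoring times reported when false positives stem from page decorations.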
RQ 7 If more than 20 clone groups are found for a specification, the manual classification is performed on a random sample of 20 clone groups; else, all clone groups for a specification are inspected. During inspection, the categorization was extended by 8 categories, 1 was changed, none were removed. To improve the quality of the categorization results, categorization is performed together by a team of 2 researchers for each specification. Inter-rater agreement is determined by calculating Cohen’s Kappa for 5 randomly sampled specifications from which 5 clone groups each are independently re-categorized by 2 researchers.
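For reference, Cohen's Kappa for two raters can be computed as in the following sketch. It assumes that each rater assigns exactly one category per clone group; the study allows multiple categories per clone, and how such multiple assignments entered the calculation is not stated here, so the sketch covers only the single-label case. The ratings are invented.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters assigning one category per item."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same, non-empty item list")
    n = len(ratings_a)
    # Observed agreement: share of items with identical categories.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Invented ratings for four clone groups.
rater_1 = ["UI", "UI", "Reference", "Reference"]
rater_2 = ["UI", "Reference", "Reference", "Reference"]
kappa = cohens_kappa(rater_1, rater_2)
```

Here the raters agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, so Kappa is 0.5.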
RQ 8 Overhead metrics are computed as described in Section 2.5.4. The additional effort for reading is calculated using the data from [87], which gives an average reading speed of 220 words per minute. For the impact on inspections performed on the requirements specifications, we refer to Gilb and Graham [79], who suggest 1 hour per 600 words as inspection speed. This additional effort is calculated both for each specification and as the mean over all.
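The effort calculation is a simple division of the overhead by the two speeds. The sketch below uses the speeds given in the text; the function names and the example overhead of 6,600 words are made up for illustration.

```python
READING_SPEED_WORDS_PER_MINUTE = 220    # average reading speed from [87]
INSPECTION_SPEED_WORDS_PER_HOUR = 600   # inspection speed from [79]


def additional_reading_minutes(overhead_words):
    """Extra clock minutes one reader spends on the cloned overhead."""
    return overhead_words / READING_SPEED_WORDS_PER_MINUTE


def additional_inspection_hours(overhead_words):
    """Extra clock hours one inspector spends on the cloned overhead."""
    return overhead_words / INSPECTION_SPEED_WORDS_PER_HOUR


# Hypothetical specification with 6,600 words of overhead.
minutes = additional_reading_minutes(6600)   # 30.0
hours = additional_inspection_hours(6600)    # 11.0
```

Both figures are per person; every additional reader or inspector incurs the same overhead again.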
To analyze the impact of specification cloning on source code, we use a convenience sample of the study objects. We cannot employ a random sample, since for many study objects, the source code is unavailable or traceability between SRS and source code is too poor. Of the systems with sufficient traceability, we investigate the 5 clone groups with the longest and the 5 with the shortest clones as well as the 5 clone groups with the least and the 5 with the most instances. The requirements’ IDs in these clone groups are traced to the code and compared to clone detection results on the code level. ConQAT is used for code clone detection.
RQ 9 The detection approach outlined in Section 7.3.5 was adjusted to Simulink models. For the normalization labels we used the type; for some of the blocks that implement several similar functions we added the value of the attribute that distinguishes them (e.g., for the Trigonometry block this is an attribute deciding between sine, cosine, and tangent). Numeric values, such as the multiplicative constant for gain, were removed. This way, detection can discover partial models which could be extracted as library blocks where such constants could be made parameters of the new library block. From the clones found, we discarded all those consisting of less than 5 blocks, as this is the smallest amount we still consider to be relevant at least in some cases. Furthermore, we implemented a weighting scheme that assigns a weight to each block type, with a default of 1. Infrastructure blocks (e.g., terminators and multiplexers) were assigned a weight of 0, while blocks having a functional meaning (e.g., integration or delay blocks) were weighted with 3. The weight of a clone is the sum of the weights of its blocks. Clones with a weight less than 8 were also discarded, which ensures that small clones are considered only if their functional portion is large enough.
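The size and weight filter described above can be sketched as follows. The thresholds and weights are taken from the text; the concrete block type names in the weight table and the two example clones are assumptions for illustration, as is the representation of a clone as a list of block type names.

```python
MIN_BLOCKS = 5       # clones with fewer blocks are discarded
MIN_WEIGHT = 8       # clones with a lower weight are discarded
DEFAULT_WEIGHT = 1

# Assumed type names: infrastructure blocks weigh 0, functional blocks 3.
BLOCK_WEIGHTS = {
    "Terminator": 0,
    "Mux": 0,
    "Integrator": 3,
    "UnitDelay": 3,
}


def clone_weight(block_types):
    """Weight of a clone: the sum of the weights of its blocks."""
    return sum(BLOCK_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in block_types)


def is_relevant(block_types):
    """Keep a clone only if it is large enough and heavy enough."""
    return (len(block_types) >= MIN_BLOCKS
            and clone_weight(block_types) >= MIN_WEIGHT)


wiring_clone = ["Mux", "Mux", "Mux", "Terminator", "Gain", "Sum"]
control_clone = ["Integrator", "UnitDelay", "Gain", "Sum", "Mux"]
```

The first example has six blocks but a weight of only 2 and is dropped; the second has five blocks and a weight of 8 and is kept.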
5.5 Results
This section presents results ordered by research question.
5.5.1 RQ 5: Detection Tailoring and Accuracy
RQ 5 investigates whether redundancy in real-world requirements specifications can be detected with existing approaches.
Precision values and times required for clone detection tailoring are depicted in Table 5.2. Tailoring times do not include setup times and duration of the first detection run. If no clones are detected for a specification (i.e., Q and T), no precision value is given. While for some specifications no tailoring is necessary at all, e.g., E, F, G or S, the worst precision value without tailoring is as low as 2% for specification O. In this case, hundreds of clones containing only the page footer cause
Table 5.2: Study results: tailoring
S   Prec. bef.   Tail. min   Prec. after     S   Prec. bef.   Tail. min   Prec. after
BA58%27%1530100%100%PO48%2%208100%100%
DC99%45%255100%99%RQ40%n/a41100%n/a
EF100%100%24100%100%ST100%n/a21100%n/a
HG100%97%102100%97%VU59%85%56100%85%
IJ100%71%28100%100%XW100%96%136100%100%
KL96%52%262100%96%YZ100%97%17100%100%
NM100%44%234100%100%AABC30%48%3314100%100%
the large amount of false positives. For 8 specifications (A, C, M, O, P, R, AB, and AC), precision values below 50% are measured before tailoring. The false positives contain information from the following categories:
Document meta data comprises information about the creation process of the document. This includes author information and document edit histories or meeting histories typically contained at the start or end of a document.
Indexes do not add new information and are typically generated automatically by text processors. Encountered examples comprise tables of content or subject indexes.
Page decorations are typically automatically inserted by text processors. Encountered examples include page headers and footers containing lengthy copyright information.
Open issues document gaps in the specification. Encountered examples comprise “TODO” statements or tables with unresolved questions.
Specification template information contains section names and descriptions common to all individual documents that are part of a specification.
Some of the false positives, such as document headers or footers, could possibly be avoided by accessing requirements information in a more direct form than done by text extraction from requirements specification documents.
Precision was increased substantially by clone detection tailoring. Precision values for the specifications are above 85%, average precision is 99%. The time required for tailoring varies between 1 and 33 minutes across specifications. Low tailoring times occurred when either no false positives were encountered, or they could very easily be removed, e.g., through exclusion of page footers by adding a single simple regular expression. On average, 10 minutes were required for tailoring.
5.5.2 RQ 6: Amount of SRS Cloning
RQ 6 investigates the extent of cloning in real-world requirements specifications. The results are shown in columns 2–4 of Table 5.3. Clone coverage varies widely: from specifications Q and T, in which not a single clone of the required length is found, to specification H containing about two-thirds of duplicated content. 6 out of the 28 analyzed specifications (namely A, F, G, H, L, Y) have a clone coverage above 20%. The average specification clone coverage is 13.6%. Specifications A, D, F, G, H, K, L, V and Y even have more than one clone per page. No correlation between specification size and cloning is found. (Pearson’s coefficient for clone coverage and number of words is -0.06, confirming a lack of correlation.)
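The reported coefficient can be re-computed with a few lines of Python. The sketch below implements the textbook formula for Pearson's coefficient; the per-specification word counts and clone coverage values are transcribed from Tables 5.1 and 5.3, and that transcription should be treated as an assumption of this sketch rather than as an authoritative data set.

```python
from math import sqrt

# (words, clone coverage in %) per specification.
SPECS = {
    "A": (41482, 35.0), "B": (130968, 8.9), "C": (18447, 18.5),
    "D": (37969, 8.1), "E": (37056, 0.9), "F": (7662, 51.1),
    "G": (10076, 22.1), "H": (19632, 71.6), "I": (6895, 5.5),
    "J": (4411, 1.0), "K": (5912, 18.1), "L": (84959, 20.5),
    "M": (46763, 1.2), "N": (103067, 8.2), "O": (18750, 1.9),
    "P": (6977, 5.8), "Q": (5040, 0.0), "R": (15462, 0.7),
    "S": (24343, 1.6), "T": (7799, 0.0), "U": (43216, 15.5),
    "V": (95399, 11.2), "W": (31670, 2.0), "X": (19679, 12.4),
    "Y": (49425, 21.9), "Z": (13807, 19.6), "AB": (274489, 12.1),
    "AC": (81410, 5.4),
}


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


words = [w for w, _ in SPECS.values()]
coverage = [c for _, c in SPECS.values()]
r = pearson(words, coverage)
print(round(r, 2))  # -0.06
```

A value this close to zero means that specification size explains practically none of the variation in clone coverage.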
Table 5.3: Study results: cloning
Spec   Clone cov.   Clone groups   Clones   Overhead relative   Overhead words
A         35.0%          259          914         32.6%             10,191
B          8.9%          265          639          5.3%              6,639
C         18.5%           37           88         11.5%              1,907
D          8.1%          105          479          6.9%              2,463
E          0.9%            6           12          0.4%                161
F         51.1%           50          162         60.6%              2,890
G         22.1%           60          262         20.4%              1,704
H         71.6%           71          360        129.6%             11,083
I          5.5%            7           15          3.0%                201
J          1.0%            1            2          0.5%                 22
K         18.1%           19           55         13.4%                699
L         20.5%          303          794         14.1%             10,475
M          1.2%           11           23          0.6%                287
N          8.2%          159          373          5.0%              4,915
O          1.9%            8           16          1.0%                182
P          5.8%            5           10          3.0%                204
Q          0.0%            0            0          0.0%                  0
R          0.7%            2            4          0.4%                 56
S          1.6%           11           27          0.9%                228
T          0.0%            0            0          0.0%                  0
U         15.5%           85          237         10.8%              4,206
V         11.2%          201          485          7.0%              6,204
W          2.0%           14           31          1.1%                355
X         12.4%           21           45          6.8%              1,253
Y         21.9%          181          553         18.2%              7,593
Z         19.6%           50          117         14.2%              1,718
AB        12.1%          635        1,818          8.7%             21,993
AC         5.4%           65          148          3.2%              2,549
Avg       13.6%                                   13.5%
Sum                    2,631        7,669                          100,178
Fig. 5.2 depicts the distribution of clone lengths in words (a) and of clone group cardinalities (b), i.e., the number of times a specification fragment has been cloned². Short clones are more frequent than long clones. Still, 90 clone groups have a length greater than 100 words. The longest detected group comprises two clones of 1049 words each, describing similar input dialogs for different types of data.
Clone pairs are more frequent than clone groups of cardinality 3 or higher. However, aggregated across all specifications, 49 groups with cardinality above 10 were detected. The largest group encountered contains 42 clones that contain domain knowledge about roles involved in contracts.
5.5.3 RQ 7: Cloned Information
RQ 7 investigates which kind of information is cloned in real-world requirements specifications. The categories of cloned information encountered in the study objects are:
Detailed Use Case Steps: Description of one or more steps in a use case on how a user interacts with the system, such as the steps required to create a new customer account in a system.
Reference: Fragment in a requirements specification that refers to another document or another part of the same document. Examples are references in a use case to other use cases or to the corresponding business process.
UI: Information that refers to the (graphical) user interface. The specification of which buttons are visible on which screen is an example for this category.
Domain Knowledge: Information about the application domain of the software. An example are details about what is part of an insurance contract for a software that manages insurance contracts.
Interface Description: Data and message definitions that describe the interface of a component, function, or system. An example is the definition of messages on a bus system that a component reads and writes.
Pre-Condition: A condition that has to hold before something else can happen. A common example are pre-conditions for the execution of a specific use case.
Side-Condition: Condition that describes the status that has to hold during the execution of something. An example is that a user has to remain logged in during the execution of a certain functionality.
Configuration: Explicit settings for configuring the described component or system. An example are timing parameters for configuring a transmission protocol.
Feature: Description of a piece of functionality of the system on a high level of abstraction.
Technical Domain Knowledge: Information about the used technology for the solution and the technical environment of the system, e.g., used bus systems in an embedded system.
Post-Condition: Condition that describes what has to hold after something has been finished. Analogous to the pre-conditions, post-conditions are usually part of use cases to describe the system state after the use case execution.
Rationale: Justification of a requirement. An example is the explicit demand by a certain user group.
[Figure 5.2: Distribution of clone lengths and clone group cardinalities. Panel (a): number of clone groups by clone length in words; panel (b): number of clone groups by cardinality of clone group.]
² The rightmost value in each diagram aggregates data that is outside its range. For conciseness, the distributions are given for the union of detected clones across specifications and not for each one individually. The general observations are, however, consistent across specifications.
We document the distribution of clone groups to the categories for the sample of categorized clone groups. 404 clone groups are assigned 498 times (multiple assignments are possible). The quantitative results of the categorization are depicted in Fig. 5.3. The highest number of assignments is to category “Detailed Use Case Steps” with 100 assignments. “Reference” (64) and “UI” (63) follow. The least number of assignments is to category “Rationale” (8).
[Figure 5.3: Quantitative results for the categorization of cloned information (number of assignments per category).]
The random sample for inter-rater agreement calculation consists of the specifications L, R, U, Z, and AB. From each specification, 5 random clones are inspected and categorized. As one specification only has 2 clone groups, in total 22 clone groups are inspected. We measure the inter-rater agreement using Cohen’s Kappa with a result of 0.67; this is commonly considered as substantial agreement. Hence, the categorization is good enough to ensure that independent raters categorize the cloned information similarly, implying a certain degree of completeness and suitability.
5.5.4 RQ 8: Impact of SRS Cloning
RQ 8 investigates the impact of SRS cloning with respect to (1) specification reading, (2) specification modification and (3) specification implementation.
Specification Reading Cloning in specifications obviously increases specification size and, hence, affects all activities that involve reading the specification documents. As Table 5.4 shows, the average overhead of the analyzed SRS is 3,578 words which, at typical reading speed of 220 words per minute [87], translates to additional 16 minutes spent on reading for each document.
While this does not appear to be a lot, one needs to consider that quality assurance techniques like inspections assume a significantly lower processing rate. For example, [79] considers 600 words per hour as the maximum rate for effective inspections. Hence, the average additional time spent on inspections of the analyzed SRS is expected to be about 6 hours. In a typical inspection meeting with 3 participants, this amounts to 2.25 person days. For specification AB with an overhead of 21,993 words, effort increase is expected to be greater than 13 person days if three inspectors are applied.
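The person-day figures follow from the inspection rate, the number of inspectors and the length of a working day. The sketch below assumes an 8-hour person day, which is consistent with 3 participants times 6 hours giving 2.25 person days; the function name is hypothetical.

```python
INSPECTION_SPEED_WORDS_PER_HOUR = 600  # maximum effective rate from [79]
HOURS_PER_PERSON_DAY = 8               # assumption, not stated in the text


def inspection_person_days(overhead_words, inspectors=3):
    """Additional inspection effort caused by cloned text, in person days."""
    hours_per_inspector = overhead_words / INSPECTION_SPEED_WORDS_PER_HOUR
    return hours_per_inspector * inspectors / HOURS_PER_PERSON_DAY


average = inspection_person_days(3578)    # average overhead of the SRS
spec_ab = inspection_person_days(21993)   # overhead of specification AB
```

With the unrounded inspection time of about 5.96 hours, the average comes out slightly below the 2.25 person days quoted above, which are based on the rounded value of 6 hours; for specification AB the result is above 13 person days, as stated.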
Table 5.4: Study results: impact
S     overhead [words]   read.³ [m]   insp.⁴ [h]     S     overhead [words]   read.³ [m]   insp.⁴ [h]
A          10,191           46.3         17.0        O            182            0.8          0.3
B           6,639           30.2         11.1        P            204            0.9          0.3
C           1,907            8.7          3.2        Q              0            0.0          0.0
D           2,463           11.2          4.1        R             56            0.3          0.1
E             161            0.7          0.3        S            228            1.0          0.4
F           2,890           13.1          4.8        T              0            0.0          0.0
G           1,704            7.7          2.8        U          4,206           19.1          7.0
H          11,083           50.4         18.5        V          6,204           28.2         10.3
I             201            0.9          0.3        W            355            1.6          0.6
J              22            0.1          0.0        X          1,253            5.7          2.1
K             699            3.2          1.2        Y          7,593           34.5         12.7
L          10,475           47.6         17.5        Z          1,718            7.8          2.9
M             287            1.3          0.5        AB        21,993          100.0         36.7
N           4,915           22.3          8.2        AC         2,549           11.6          4.2
Avg         3,578           16.3          6.0
³ Additional reading effort in clock minutes.
⁴ Additional inspection effort in clock hours.
Specification Modification To explore the extent of inconsistencies in our specifications, we analyze the comments that were documented during the inspection of the sampled clones for each specification set. They refer to duplicated specification fragments that are longer than the clones detected by the tool. The full length of the duplication is not found by the tool due to small differences between the clones that often result from inconsistent modification.
An example for such a potential inconsistency can be found in the publicly available MOST specification (M). The function classes “Sequence Property” and “Sequence Method” have the same parameter lists. They are detected as clones. The following description is also copied, but one ends with the sentence “Please note that in case of elements, parameter Flags is not available”. In the other case, this sentence is missing. Whether these differences are defects in the requirements or not could only be determined by consulting the requirements engineers of the system. This further step remains for future work.
Specification Implementation With respect to the entirety of the software development process, it is important to understand which impact SRS cloning has on development activities that use SRS as an input, e.g., system implementation and test. For the inspected 20 specification clone groups and their corresponding source code, we found 3 different effects:
1. The redundancy in the requirements is not reflected in the code. It contains shared abstractions that avoid duplication.
2. The code that implements a cloned piece of an SRS is cloned, too. In this case, future changes to the cloned code cause additional efforts as modifications must be reflected in all clones. Furthermore, changes to cloned code are error-prone as inconsistencies may be introduced accidentally (cf., Chapter 4).
3. Code of the same functionality has been implemented multiple times. The redundancy of the requirements thus does exist in the code as well but has not been created by copy & paste. This case exhibits similar problems as case 2 but creates additional efforts for the repeated implementation. Moreover, this type of redundancy is harder to detect as existing clone detection approaches cannot reliably find code that is functionally similar but not the result of copy & paste, as shown in Chapter 9.
5.5.5 RQ 9: Amount of Model Cloning
We found 166 clone pairs in the models which resulted in 139 clone groups after clustering and resolving inclusion structures. Of the 4762 blocks used for the clone detection, 1780 were included in at least one clone (coverage of about 37%). We consider this a substantial amount of cloning that indicates the necessity to control cloning during maintenance of Matlab/Simulink models.
As shown in Table 5.5, only about 25% of the clones were within one modeling unit (i.e., a single Simulink file), which was to be expected as such clones are more likely to be found in a manual review process as opposed to clones between modeling units, which would require both units to be reviewed by the same person within a small time frame. Tables 5.7 and 5.5 give an overview of the clone groups found.
Table 5.5: Number of files/modelling units the clone groups were affecting
Number of models   Number of clone groups
1                   43
2                   81
3                   12
4                    3
Table 5.6: Number of clone groups for clone group cardinality
Cardinality of clone group   Number of clone groups
2                            108
3                             20
4                             10
5                              1
Table 5.7: Number of clone groups for clone size
Clone size   Number of clones
5–10          76
11–15         35
16–20         17
> 20          11
Table 5.7 shows how many clones have been found for some size ranges. The largest clone had a size of 101 and a weight of 70. Smaller clones are more frequent than larger clones, as can also be observed for clones in source code or requirements specifications.
5.6 Discussion
RQs 5 to 8: Cloning in Requirements Specifications The results from the case study show that cloning in the sense of copy & paste is common in real-world requirements specifications. Here we interpret these results and discuss their implications.
According to the results of RQ 6, the amount of cloning encountered is significant, although it differs between specifications. The large amount of detected cloning is further emphasized, since our approach only locates identical parts of the text. Other forms of redundancy, such as specification fragments that have been copied but slightly reworded in later editing steps, or that are entirely reworded yet convey the same meaning, are not included in these numbers.
The results for RQ 7 illustrate that cloning is not confined to a specific kind of information. On the contrary, we found that duplication can, amongst others, be found in the description of use cases, the application domain and the user interface but also in parts of documents that merely reference other documents. Our case study only yields the absolute number of clones assigned to a category. As we did not investigate which amount of a SRS can be assigned to the category, we cannot deduce if cloning is more likely to occur in one category than another. Hence, we currently assume that clones are likely to occur in all parts of SRS.
The relatively broad spectrum of findings illustrates that cloning in SRS can be successfully avoided. SRS E, for example, is large and yet exhibits almost no cloning.
The most obvious effect of duplication is the increased size (cf., RQ 8), which could often be avoided by cross-references or different organization of the specifications. Size increase affects all (manual) processing steps performed on the specifications, such as restructuring or translating them to other languages, and especially reading. Reading is emphasized here, as the ratio of persons reading to those writing a specification is usually large, even larger than in source code. The activities that involve reading include specification reviews, system implementation, system testing and contract negotiations. They are typically performed by different persons that are all affected by the overhead. While the additional effort for reading has been assumed to be linear in the presentation of the results, one could even argue that the effort is larger, as human readers are not efficient with word-wise comparisons, which are required to check presumably duplicated parts to find potential subtle differences between them that could otherwise lead to errors in the final system.
Furthermore, inconsistent changes of the requirements clones can introduce errors in the specification and thus often in the final system. Based on the inconsistencies we encountered, we strongly suspect that there is a real threat that inconsistent maintenance of duplicated SRS introduces errors in practice. However, since we did not validate that the inconsistencies are in fact errors, our results are not conclusive; future research on this topic is required. Nevertheless, the inconsistencies probably cause overhead during further system development due to clarification requests from developers spotting them.
Our observations show, moreover, that specification cloning can lead to cloned or, even worse, reimplemented parts of code. Often these duplications cannot even be spotted by the developers, as they only work on a part of the system, whose sub-specification might not even contain clones when viewed in isolation.
Redundancy is hard to identify in SRS as common quality assurance techniques like inspections often analyze the different parts of a specification individually and are, hence, prone to miss duplication. The results for RQ 5 show that existing clone detection approaches can be applied to identify cloned information in SRS in practice. However, it also shows that a certain amount of clone detection tailoring is required to increase detection precision. As the effort required for the tailoring steps is below one person hour for each specification document in the case study, we do not consider this to be an obstacle for the application of clone detection during SRS quality assessment in practice.
RQ 9: Cloning in Models Manual inspection of the detected clones showed that many of them are relevant for practical purposes. Besides the “normal” clones, which at least should be documented to make sure that bugs are always fixed in both places, we also found two models which were nearly entirely identical. Additionally, some of the clones are candidates for the project’s library, as they included functionality that is likely to be useful elsewhere. Another source of clones is the limitation of TargetLink that scaling (i.e., the mapping to concrete data types) cannot be parameterized, which leaves duplication as the only way for obtaining different scalings.
The main problem we encountered is the large number of false positives as more than half of the clones found are obviously clones according to our definition but would not be considered relevant by a developer (e.g., large Mux/Demux constructs). While weighting the clones was a major step in improving this ratio (without weighting there were about five times as many clones, but mostly consisting of irrelevant constructs) this still is a major area of potential improvement for the usability of our approach.
5.7 Threats to Validity
In this section, we discuss threats to the validity of the study results and how we mitigated them.
5.7.1 Internal Validity
RQs 5 & 6 The results can be influenced by individual preferences or mistakes of the researchers that performed clone detection tailoring. We mitigated this risk by performing clone tailoring in pairs to reduce the probability of errors and improve objectivity.
Precision was determined on random samples instead of on all detected clone groups. While this can potentially introduce inaccuracy, sampling is commonly used to determine precision and it has been demonstrated that even small samples can yield precise estimates [19, 116].
While a lot of effort was invested into understanding detection precision, we know less about detection recall. First, if regular expressions used during tailoring are too aggressive, detection recall can be reduced. We used pair-tailoring and comparison of results before and after tailoring to reduce this risk. Furthermore, we have not investigated false negatives, i.e., the amount of duplication contained in a specification and not identified by the automated detector. The reasons for this are the difficulty of clearly defining the characteristics of such clones (having a semantic relation but little syntactic commonality) and the effort required to find them manually. The reported extent of cloning is thus only a lower bound for redundancy. While the investigation of detection recall remains important future work, our limited knowledge about it does not affect the validity of the detected clones and the conclusions drawn from them.
RQ 7 The categorization of the cloned information is subjective to some degree. We again mitigated this risk by pairing the researchers as well as by analyzing the inter-rater agreement. All researchers were in the same room during categorization. This way, newly added categories were immediately available to all researchers.
RQ8Thecalculationofadditionaleffortduetooverheadcanbeinaccurateiftheuseddatafrom
theliteraturedoesnot®ttotheeffortsneededataspeci®ccompany.Astheusedvalueshavebeen
con®rmedinmanystudies,however,theresultsshouldbetrustworthy.
Weknowlittleabouthowreadingspeedsdifferforclonedversusnon-clonedtext.Ontheone
hand,onecouldexpectthatclonedtextcanbereadmoreswiftly,sincesimilartexthasbeenread
before.Ontheotherhand,weoftennoticedthatreadingclonedtextcanbealotmoretimecon-
sumingthanreadingnon-clonedtext,sincethediscoveryandcomprehensionofsubtledifferences
istedious.Lackingprecisedata,wetreatedclonedandnon-clonedtextuniformlywithrespectto
readingefforts.Furtherresearchcouldhelptobetterquantifyreadingeffortsforclonedspeci®cation
fragments.
RQ 9. The detection results contain false positives. Both reported clone counts and coverage are thus not perfectly accurate. However, manual inspections revealed a substantial amount of clones relevant for maintenance. While the clone counts and coverage metrics might be inaccurate, the conclusion that clone management is relevant for maintenance of the models holds and is shared by the developers.

5.7.2 External Validity

RQs 5 to 8. The practice of requirements engineering differs strongly between different domains, companies, and even projects. Hence, it is unclear whether the results of this study can be generalized to all existing instances of requirements specifications. However, we investigated 28 sets of requirements specifications from 11 organizations with over 1.2 million words and almost 9,000 pages. The specifications come from several different companies, from different domains (ranging from embedded systems to business information systems) and with various age and depth. Therefore, we are confident that the results are applicable to a wide variety of systems and domains.

RQ 9. While the analyzed model is large, it is from a single company only. The generalizability of the results is thus unclear; future work is required to develop a better understanding of cloning across models of different size, age and developing organization. However, we are optimistic that the results are at least transferable to other models in the automotive domain, since they are consistent with cloning we saw in models at other companies in the automotive domain. Unfortunately, due to non-disclosure reasons, we are not able to publish them here.

5.8 Summary

This chapter presented a case study on the extent and impact of cloning in requirements specifications and Matlab/Simulink models.

We have analyzed cloning in 28 industrial requirements specifications from 11 different companies. The extent of cloning varies substantially; while some specifications contain none or very few clones, others contain very many. We have seen indication for negative impact of requirements cloning on engineering efforts. Due to size increase, cloning significantly raises the effort for activities that involve reading of SRS, e.g., inspections. In the worst encountered case, the effort for an inspection involving three persons increases by over 13 person days. In addition, just as for source code, modification of duplicated information is costly and error prone; we saw indication that unintentionally inconsistent modifications can also happen to specification clones.

Besides requirements specifications, we have analyzed cloning in a large industrial Matlab/Simulink model. Again, substantial amounts of cloning were discovered. While the results contained false positives, developers agreed that many of the detected clones are relevant for maintenance. As for code, awareness of cloning is thus required to avoid unintentionally inconsistent modifications.

Furthermore, the studies indicate that cloning in requirements specifications can cause redundancy in source code, both in terms of code clones and independent implementation of behaviorally similar functionality. Since models are often used as specifications, we assume that this effect can also occur for cloning in them.

We conclude that the results from the studies support our conjecture: cloning does occur in non-code artifacts as well. Since it can also negatively impact software engineering activities, we conclude that clone control needs to reach beyond code to requirements specifications and models.

6 Clone Cost Model

A thorough understanding of the costs caused by cloning is a necessary foundation to evaluate alternative clone management strategies. Do expected maintenance cost reductions justify the effort required for clone removal? How large are the potential savings that clone management tools can provide? We need a clone cost model to answer these questions.

This chapter presents an analytical cost model that quantifies the impact of cloning in source code on maintenance efforts and field faults. Furthermore, it presents the results from a case study that instantiates the cost model for 11 industrial software systems and estimates maintenance effort increase and potential benefits achievable through clone management tool support. Parts of the content of this chapter have been published in [110].

6.1 Maintenance Process

This section introduces the software maintenance process on which the cost model is based. It qualitatively describes the impact of cloning for each process activity and discusses potential benefits of clone management tools. The process is loosely based on the IEEE 1219 standard [99] that describes the activities carried out on single change requests (CRs) in a waterfall fashion. The successive execution of activities that, in practice, are typically carried out in an interleaved and iterated manner, serves the clarity of the model but does not limit its application to waterfall-style processes.

Analysis (A) studies the feasibility and scope of the change request to devise a preliminary plan for design, implementation and quality assurance. Most of it takes place on the problem domain. Analysis is not impacted by code cloning, since code does not play a central part in it. Possible effects of cloning in requirements specifications, which could in principle affect analysis, are beyond the scope of this model.

Location (L) determines a set of change start points. It thus performs a mapping from problem domain concepts affected by the CR to the solution domain. Location does not contain impact analysis, that is, consequences of modifications of the change start points are not analyzed. Location involves inspection of source code to determine change start points. We assume that the location effort is proportional to the amount of code that gets inspected.

Cloning increases the size of the code that needs to be inspected during location and thus affects location effort. We are not aware of tool support to alleviate the impact of code cloning on location.

Design (D) uses the results of analysis and location as well as the software system and its documentation to design the modification of the system. We assume that design is not impacted by cloning. This is a conservative assumption, since for a heavily cloned system, design could attempt to avoid modifications of heavily cloned areas.

Impact Analysis (IA) uses the change start points from location to determine where changes in the code need to be made to implement the design. The change start points are typically not the only places where modifications need to be performed: changes to them often require adaptations in use sites. We assume that the effort required for impact analysis is proportional to the number of source locations that need to be determined.

If the concept that needs to be changed is implemented redundantly in multiple locations, all of them need to be changed. Cloning thus affects impact analysis, since the number of change points is increased by cloned code. Tool support (clone indication) simplifies impact analysis of changes to cloned code. Ideal tool support could reduce the effect of cloning on impact analysis to zero.

Implementation (Impl) realizes the designed change in the source code. We differentiate between two classes of changes to source code. Additions add new source code to the system without changing existing code. Modifications alter existing source code and are performed to the source locations determined by impact analysis. We assume that the effort required for implementation is proportional to the amount of code that gets added or modified.

We assume that adding new code is unaffected by cloning in existing code. Implementation is still affected by cloning, since modifications to cloned code need to be performed multiple times. Linked editing tools could, ideally, reduce the effects of cloning on implementation to zero.

Quality Assurance (QA) comprises all testing and inspection activities carried out to validate that the modification satisfies the change request. We assume a smart quality assurance strategy: only code affected by the change is processed. We do not limit the maintenance process to a specific quality assurance technique. However, we assume that quality assurance steps are systematically applied, e.g., all changes are inspected or testing is performed until a certain test coverage is achieved on the affected system parts. Consequently, we assume that quality assurance effort is proportional to the amount of code on which quality assurance is performed.

We differentiate two effects of cloning on quality assurance. First, cloning increases the change size and thus the amount of modified code that needs to be quality assured. Second, just as modified code, added code can contain cloning. This also increases the amount of code that needs to be quality assured and hence the required effort. We are not aware of tool support that can substantially alleviate the impact of cloning on quality assurance.

Other (O) comprises further activities, such as delivery and deployment, user support or change control board meetings. Since code does not play a central part in these activities, they are not affected by cloning.

6.2 Approach

This section outlines the underlying cost modeling approach.

Relative Cost Model. Many factors influence maintenance productivity [22,23,211]: the type of system and domain, development process, available tools and experience of developers, to name just a few. Since these factors vary substantially between projects, they need to be reflected by cost estimation approaches to achieve accurate absolute results. The more factors a cost model comprises, the more effort is required for its creation, its factor lookup tables, and for its instantiation in practice. If an absolute value is required, such effort is unavoidable.

The assessment of the impact of cloning differs from the general cost estimation problem in two important aspects. First, we compare efforts for two systems (the actual one and the hypothetical one without cloning) for which most factors are identical, since our maintenance environment does not change. Second, relative effort increase w.r.t. the cloning-free system is sufficient to evaluate the impact of cloning. Since we do not need an absolute result value in terms of costs, and since most factors influencing maintenance productivity remain constant in both settings, they do not need to be contained in our cost model. In a nutshell, we deliberately chose a relative cost model to keep its number of parameters and involved instantiation effort at bay.

Clone Removability. The cost model is not limited to clones that can be removed by the means of the available abstraction mechanisms, since negative impact of clones is independent of their removability. In addition, even if no clone can be removed, the model can be used to assess possible improvements achievable through application of clone management tools.

Cost Model Structure. The model assumes each activity of the maintenance process to be completed. It is thus not suitable to model partial change request implementations that are aborted at some point.

The total maintenance effort $E$ is the sum of the efforts of individual change requests:

$$E = \sum_{cr \in CR} e(cr)$$

The scope of the cost model is determined by the population of the set $CR$: to compute the maintenance effort for a time span $t$, it is populated with all change requests that are realized in that period. Alternatively, if the total lifetime maintenance costs are to be computed, $CR$ is populated with all change requests ever performed on the system. The model can thus scale to different project scopes.

The effort of a single change request $cr \in CR$ is expressed by $e(cr)$. It is the sum of the efforts of the individual activities performed during the realization of the $cr$. The activity efforts are denoted as $e_X$, where $X$ identifies the activity. Each activity from Section 6.1 contributes to the effort of a change request. For brevity, we omit $(cr)$ in the following:

$$e = e_A + e_L + e_D + e_{IA} + e_{Impl} + e_{QA} + e_O$$

To model the impact of cloning on maintenance efforts, we split $e$ into two components: inherent effort $e^{i}$ and cloning induced effort overhead $e^{c}$. Inherent effort $e^{i}$ is independent of cloning. It captures the effort required to perform an activity on a hypothetical version of the software that does not contain cloning. Cloning induced effort overhead $e^{c}$, in contrast, captures the effort penalty caused by cloning. Total effort is expressed as the sum of the two:

$$e = e^{i} + e^{c}$$

The increase in efforts due to cloning, $\Delta e$, is captured by $\frac{e^{i} + e^{c}}{e^{i}} - 1$, or simply $\frac{e^{c}}{e^{i}}$. The cost model thus expresses cloning induced overhead relative to the inherent effort required to realize a change request. The increase in total maintenance efforts due to cloning, $\Delta E$, is proportional to the average effort increase per change request and thus captured by the same expression.

6.3 Detailed Cost Model

This section introduces a detailed version of the clone cost model. Its first section introduces cost models for the individual process activities. The following sections employ them to construct models for maintenance effort and remaining fault count increase and the possible benefits of clone management tool support. We initially assume that no clone management tools are employed.

6.3.1 Activity Costs

The activities Analysis, Design, and Other are not impacted by cloning. Their cloning induced effort overhead, $e^{c}$, is thus zero. Their total efforts hence equal their inherent efforts.

Location effort depends on code size. Cloning increases code size. We assume that, on average, increase of the amount of code that needs to be inspected during location is proportional to the cloning induced size increase of the entire code base. Size increase is captured by overhead:

$$e^{c}_{L} = e^{i}_{L} \cdot \mathit{overhead}$$

Impact analysis effort depends on the number of change points that need to be determined. Cloning increases the number of change points. We assume that $e^{c}_{IA}$ is proportional to the cloning-induced increase in the number of source locations. This increase is captured by overhead:

$$e^{c}_{IA} = e^{i}_{IA} \cdot \mathit{overhead}$$

Implementation effort comprises both addition and modification effort: $e_{Impl} = e_{ImplAdd} + e_{ImplMod}$. We assume that effort required for additions is unaffected by cloning in existing source code. We assume that the effort required for modification is proportional to the amount of code that gets modified, i.e., the number of source locations determined by impact analysis. Its cloning induced overhead is, consequently, affected by the same increase as impact analysis: $e^{c}_{Impl} = e^{i}_{ImplMod} \cdot \mathit{overhead}$. The modification ratio $\mathit{mod}$ captures the modification-related part of the inherent implementation effort: $e^{i}_{ImplMod} = e^{i}_{Impl} \cdot \mathit{mod}$. Consequently, $e^{c}_{Impl}$ is:

$$e^{c}_{Impl} = e^{i}_{Impl} \cdot \mathit{mod} \cdot \mathit{overhead}$$

Quality Assurance effort depends on the amount of code on which quality assurance gets performed. Both modifications and additions need to be quality assured. Since the measure overhead captures size increase of both additions and modifications, we do not need to differentiate between them, if we assume that cloning is, on average, similar in modified and added code. The increase in quality assurance effort is hence captured by the overhead measure:

$$e^{c}_{QA} = e^{i}_{QA} \cdot \mathit{overhead}$$

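6.3.2 Maintenance Effort Increase

Based on the models for the individual activities, we model cloning induced maintenance effort $e^{c}$ for a single change request like this:

$$e^{c} = \mathit{overhead} \cdot (e^{i}_{L} + e^{i}_{IA} + e^{i}_{Impl} \cdot \mathit{mod} + e^{i}_{QA})$$

The relative cloning induced overhead is computed as follows:

$$\Delta e = \frac{\mathit{overhead} \cdot (e^{i}_{L} + e^{i}_{IA} + e^{i}_{Impl} \cdot \mathit{mod} + e^{i}_{QA})}{e^{i}_{A} + e^{i}_{L} + e^{i}_{D} + e^{i}_{IA} + e^{i}_{Impl} + e^{i}_{QA} + e^{i}_{O}}$$

This model computes the relative effort increase in maintenance costs caused by cloning. It does not take impact of cloning on program correctness into account. This is done in the next section.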
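
The relative overhead formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `InherentEfforts`, the function `relative_effort_increase` and the numeric inputs are hypothetical names and values chosen for the example, not part of the model's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InherentEfforts:
    """Inherent (cloning-free) effort per activity, in any consistent unit."""
    analysis: float
    location: float
    design: float
    impact_analysis: float
    implementation: float
    quality_assurance: float
    other: float

    def total(self) -> float:
        """Inherent effort of the whole change request."""
        return (self.analysis + self.location + self.design
                + self.impact_analysis + self.implementation
                + self.quality_assurance + self.other)


def relative_effort_increase(efforts: InherentEfforts,
                             overhead: float, mod: float) -> float:
    """Cloning induced effort relative to inherent effort, without tool support."""
    # Only location, impact analysis, the modification share of
    # implementation, and quality assurance are affected by cloning.
    affected = (efforts.location + efforts.impact_analysis
                + efforts.implementation * mod + efforts.quality_assurance)
    return overhead * affected / efforts.total()


# Hypothetical effort shares (percent of inherent effort) and parameters.
example = InherentEfforts(analysis=5, location=8, design=16, impact_analysis=5,
                          implementation=26, quality_assurance=22, other=18)
delta_e = relative_effort_increase(example, overhead=0.40, mod=0.63)
print(f"relative effort increase: {delta_e:.1%}")
```

Because the result is a ratio, the unit of the effort values cancels out; person hours and percentage shares give the same answer.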

6.3.3 Fault Increase

Quality assurance is not perfect. Even if performed thoroughly, faults may remain unnoticed and cause failures in production. Some of these faults can, in principle, be introduced by inconsistent updates to cloned code. Cloning can thus affect the number of faults in released software. This can have economic consequences that are not captured by the above model. This section introduces a model that does.

Quality assurance can be decomposed into two sub-activities: fault detection and fault removal. We assume that, independent of the quality assurance technique, the effort required to detect a single fault in a system depends primarily on its fault density. We furthermore assume that average fault removal effort for a system is independent of the system's size and fault density. These assumptions allow us to reason about the number of remaining faults in similar systems of different size but equal fault densities. If a QA procedure is applied with the same amount of available effort per unit of size, we expect a similar reduction in defect density, since the similar defect densities imply equal costs for fault location per unit. For these systems, the same number of faults can thus be detected and fixed per unit. For two systems A and B, with B having twice the size and available QA effort, we expect a similar reduction of fault density. However, since B is twice as big, the same fault density means twice the absolute number of remaining faults.

A system that contains cloning and its hypothetical version without cloning are such a pair of similar systems. We assume that fault density is similar between cloned code and non-cloned code, since cloning duplicates both correct and faulty statements. Besides system size, cloning thus also increases the absolute number of faults contained in a system. If the amount of effort available for quality assurance is increased by overhead w.r.t. the system without cloning, the same reduction in fault density can be achieved. However, the absolute number of faults is still larger by overhead.

This reasoning assumes that developers are entirely ignorant of cloning. That is, if a fault is fixed in one clone, it is not immediately fixed in any of its siblings. Instead, faults in siblings are expected to be detected independently. Empirical data confirms that inconsistent bug fixes do frequently occur in practice [115]. However, it also confirms that clones are often maintained consistently. Both assuming entirely consistent or entirely inconsistent evolution is thus not realistic.

In practice, a certain amount of the defects that are detected in cloned code are hence fixed in some of the sibling clones. This reduces the cloning induced overhead in remaining fault counts. However, unless all faults in clones are fixed in all siblings, resulting fault counts remain higher than in systems without cloning.

The missratio captures the amount of clones that are unintentionally modified inconsistently. It hence captures the share of cloned faults that are not removed once a fault is detected in their sibling. The increase in fault counts due to cloning can hence be quantified as follows:

$$\Delta F = \mathit{overhead} \cdot \mathit{missratio}$$

To compute missratio, a time window for which changes to clones are investigated is required. To quantify increase in remaining faults, we choose a time window that starts with the initiation of the first change request, and ends with the realization of the last change request in $CR$. This way, missratio reflects that increased effort available for quality assurance allows for individual detection of faults contained in sibling clones, if their fix was missed in previous detections. The sequence of inconsistent modification and late propagation that occurs in such a case is, since all of them occurred inside the time window, observed as a single consistent modification. Hence, missratio only captures those faults that slip through quality assurance. It is thus different from UICR and FUICR.
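
As a small worked example of the fault increase formula, the following sketch multiplies the two factors and checks their ranges. The function name `fault_increase` and the input values are hypothetical; in particular, the missratio of 0.25 is an invented illustration, not a measured value.

```python
def fault_increase(overhead: float, missratio: float) -> float:
    """Relative increase in remaining faults: overhead * missratio."""
    if overhead < 0:
        raise ValueError("overhead must be non-negative")
    if not 0.0 <= missratio <= 1.0:
        raise ValueError("missratio is a share and must lie in [0, 1]")
    return overhead * missratio


# Hypothetical inputs: 40% size overhead, a quarter of cloned faults missed.
delta_f = fault_increase(overhead=0.40, missratio=0.25)
print(f"relative fault increase: {delta_f:.1%}")

# A missratio of zero (every fix reaches all siblings) removes the increase.
assert fault_increase(overhead=0.40, missratio=0.0) == 0.0
```

The two boundary cases match the reasoning above: with a missratio of 1, developers are entirely ignorant of cloning and the fault count grows by the full overhead; with a missratio of 0, clones evolve entirely consistently.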

6.3.4 Tool Support

Clone management tools can alleviate the impact of cloning on maintenance efforts. We adapt the detailed model to quantify the impact of clone management tools. We evaluate the upper bound of what two different types of clone management tools can achieve.

Clone Indication makes cloning relationships in source code available to developers, for example through clone bars in the IDE that mark cloned code regions. Examples for clone indication tools include ConQAT and CloneTracker [60]. Optimal clone indication thus lowers the effort required for clone discovery to zero. It thus simplifies impact analysis, since no additional effort is required to locate affected clones. Assuming perfect clone indicators, $e^{c}_{IA}$ is reduced to zero, yielding this cost model:

$$\Delta e = \frac{\mathit{overhead} \cdot (e^{i}_{L} + e^{i}_{Impl} \cdot \mathit{mod} + e^{i}_{QA})}{e^{i}_{A} + e^{i}_{L} + e^{i}_{D} + e^{i}_{IA} + e^{i}_{Impl} + e^{i}_{QA} + e^{i}_{O}}$$

Linked Editing replicates edit operations performed on one clone to its siblings. Prototype linked editing tools include Codelink [218] and CReN [102]. Optimal linked editing tools thus lower the overhead required for consistent modifications of cloned code to zero. Since linked editors typically also provide clone indication, they also simplify impact analysis. Their application yields the following model:

$$\Delta e = \frac{\mathit{overhead} \cdot (e^{i}_{L} + e^{i}_{QA})}{e^{i}_{A} + e^{i}_{L} + e^{i}_{D} + e^{i}_{IA} + e^{i}_{Impl} + e^{i}_{QA} + e^{i}_{O}}$$

We do not think that clone management tools can substantially reduce the overhead cloning causes for quality assurance. If the amount of changed code is larger due to cloning, more code needs to be processed by quality assurance activities. We do not assume that inspections or test executions can be simplified substantially by the knowledge that some similarities reside in the code, since faults might still lurk in the differences.

However, we are convinced that clone indication tools can substantially reduce the impact that cloning imposes on the number of faults that slip through quality assurance. If a single fault is found in cloned code, clone indicators can point to all the faults in the sibling clones, assisting in their prompt removal. We assume that perfect clone indication tools reduce the cloning induced overhead in faults after quality assurance to zero.
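
The three variants of the detailed model (no tools, ideal clone indication, ideal linked editing) differ only in which inherent efforts remain in the numerator. The sketch below makes that explicit. The enum `ToolSupport`, the dictionary keys and all numbers are assumptions made for this illustration.

```python
from enum import Enum


class ToolSupport(Enum):
    NONE = "none"
    CLONE_INDICATION = "clone indication"
    LINKED_EDITING = "linked editing"


def relative_effort_increase(inherent, overhead, mod, tool=ToolSupport.NONE):
    """Upper-bound estimate of the relative effort increase under ideal tools.

    `inherent` maps the activity keys A, L, D, IA, Impl, QA, O to their
    inherent efforts in any consistent unit.
    """
    # Location and quality assurance are not helped by either tool type.
    affected = inherent["L"] + inherent["QA"]
    if tool is ToolSupport.NONE:
        # Without tools, affected clones must be located manually.
        affected += inherent["IA"]
    if tool is not ToolSupport.LINKED_EDITING:
        # Without linked editing, modifications are repeated per clone.
        affected += inherent["Impl"] * mod
    return overhead * affected / sum(inherent.values())


# Hypothetical effort shares (percent of inherent effort) and parameters.
inherent = {"A": 5, "L": 8, "D": 16, "IA": 5, "Impl": 26, "QA": 22, "O": 18}
results = {tool: relative_effort_increase(inherent, 0.40, 0.63, tool)
           for tool in ToolSupport}
for tool, value in results.items():
    print(f"{tool.value:>16}: {value:.1%}")
```

Since linked editing is assumed to include clone indication, its estimate can never exceed the clone indication estimate for the same inputs.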

6.4 Simplified Cost Model

This section introduces a simplified cost model. While less generally applicable than the detailed model, it is easier to apply.

Due to its number of factors, the detailed model requires substantial effort to instantiate in practice: each of its nine factors needs to be determined. Except for overhead, all of them quantify maintenance effort distribution across individual activities. Since in practice the activities are typically interleaved, without clear transitions between them, it is difficult to get exact estimates on, e.g., how much effort is spent on location and how much on impact analysis.

The individual factors of the detailed model are required to make trade-off decisions. We need to distinguish between, e.g., impact analysis and location to evaluate the impact that clone indication tool support can provide, since impact analysis benefits from clone indication, whereas location does not. Before evaluating trade-offs between clone management alternatives, however, a simpler decision needs to be taken: whether to do anything about cloning at all. Only then is it reasonable to invest the effort to determine accurate parameter values. If the cost model is not employed to assess clone management tool support, many of the distinctions between different factors are obsolete. We can thus aggregate them to reduce the number of factors and hence the effort involved in model instantiation. Written slightly differently, the detailed model is:

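$$\Delta e = \mathit{overhead} \cdot \frac{e^{i}_{L} + e^{i}_{IA} + e^{i}_{Impl} \cdot \mathit{mod} + e^{i}_{QA}}{e}$$

The fraction is the ratio of effort required for code comprehension ($e^{i}_{L} + e^{i}_{IA}$), modification of existing code ($e^{i}_{Impl} \cdot \mathit{mod}$) and quality assurance ($e^{i}_{QA}$) w.r.t. the entire effort required for a change request. We introduce the new parameter cloning-affected effort (CAE) for it:

$$\mathit{CAE} = \frac{e^{i}_{L} + e^{i}_{IA} + e^{i}_{Impl} \cdot \mathit{mod} + e^{i}_{QA}}{e}$$

If CAE is determined as a whole (without its constituent parameters), this simplified model provides a simple way to evaluate the impact of cloning on maintenance efforts:

$$\Delta e = \mathit{overhead} \cdot \mathit{CAE}$$

6.5 Discussion
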
The cost model is based on a series of assumptions. It can sensibly be applied only for projects that satisfy them. We list and discuss them here to simplify their evaluation.

We assume that the significant part of the cost models for the maintenance process activities are linear functions on the size of the code that gets processed. For example, we assume that location effort is primarily determined by and proportional to the amount of code that gets inspected during location. In some situations, activity cost models might be more complicated. For example, if an activity has a high fixed setup cost, the cost model should include a fixed factor; diseconomy of scale could increase effort w.r.t. size in a superlinear fashion. In such cases, the respective part of the cost model needs to be adapted appropriately. COCOMO II [23], e.g., uses a polynomial function to adapt size to diseconomy of scale.

We assume that changes to clones are coupled to a substantial degree. The cost model thus needs to be instantiated on tailored clone detection results. In case clones are uncoupled, e.g., because they are false positives or because parts of the system are no longer maintained, the model is not applicable.

We assume that each modification to a clone in one clone group requires the same amount of effort. We ignore that subsequent implementations of a single change to multiple clone instances could get cheaper, since the developer gets used to that clone group. We are not aware of empirical data for these costs. Future work is, thus, required to better understand changes in modification effort across sibling clones. Since in practice, however, most clone groups have size 2, the inaccuracy introduced by this simplification should be moderate.

6.6 Instantiation

This section describes how to instantiate the cost model and presents a large industrial case study.

6.6.1 Parameter Determination

This section describes how the parameter values can be determined to instantiate the cost model.

Overhead Computation. Overhead is computed on the clones detected for a system. It captures cloning induced size increase independent of whether the clones can be removed with means of the programming language (cf. Section 2.5.4). This is intended: the negative impact of cloning on maintenance activities is independent of whether the clones can be removed.
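
A minimal sketch of the overhead computation follows. It assumes that the definition referenced in Section 2.5.4 amounts to the ratio of actual size to redundancy-free size, minus one; this reading is an assumption of the example, chosen because it agrees with the kSS, kRFSS and overhead values reported for the case study in Table 6.3. The function name `size_overhead` is hypothetical.

```python
def size_overhead(statements: float, redundancy_free_statements: float) -> float:
    """Cloning induced size increase relative to the redundancy-free size.

    Assumed definition: overhead = size / redundancy-free size - 1.
    """
    if redundancy_free_statements <= 0:
        raise ValueError("redundancy-free size must be positive")
    if statements < redundancy_free_statements:
        raise ValueError("size cannot be smaller than redundancy-free size")
    return statements / redundancy_free_statements - 1.0


# Sizes in thousand source statements, as reported for two study objects.
print(f"{size_overhead(21, 15):.1%}")   # system B
print(f"{size_overhead(15, 6):.1%}")    # system A
```

Both inputs must come from tailored detection results; a detector run with many false positives shrinks the redundancy-free size and inflates the overhead.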

The accuracy of the overhead value is determined by the accuracy of the clones on which it is computed. Unfortunately, many existing clone detection tools produce high false positive rates; Kapser and Godfrey [122] report between 27% and 65%, Tiarks et al. [217] up to 75% of false positives detected by state-of-the-art tools. False positives exhibit some level of syntactic similarity, but no common concept implementation and hence no coupling of their changes. They thus do not impede software maintenance and must be excluded from overhead computation.

To achieve accurate clone detection results, and thus an accurate overhead value, clone detection needs to be tailored. Tailoring removes code that is not maintained manually, such as generated or unused code, since it does not impede maintenance. Exclusion of generated code is important, since generators typically produce similar-looking files for which large amounts of clones are detected. Furthermore, tailoring adjusts detection so that false positives due to overly aggressive normalization are avoided. This is necessary so that, e.g., regions of Java getters, which differ in their identifiers and have no conceptual relationship, are not erroneously considered as clones by a detector that ignores identifier names. According to our experience [115], after tailoring, clones exhibit change coupling, indicating their semantic relationship through redundant implementation of a common concept. Clone detection tailoring is covered in detail in Section 8.2.

Determining Activity Efforts. The distribution of the maintenance efforts depends on many factors, including the maintenance process employed, the maintenance environment, the personnel and the tools available [211]. To receive accurate results, the parameters for the relative efforts of the individual activities thus need to be determined for each software system individually.

Coarse effort distributions can be taken from project calculation, by matching engineer wages against maintenance process activities. This way, the relative analysis effort, e.g., is estimated as the share of the wages of the analysts w.r.t. all wages. As we cannot expect engineer roles to match the activities of our maintenance process exactly, we need to refine the distribution. This can be done by observing development efforts for change requests to determine, e.g., how much effort analysts spend on analysis, location and design, respectively. To be feasible, such observations need to be carried out on representative samples of the engineers and of the change requests. Stratified sampling can be employed to improve representativeness of results: sampled CRs can be selected according to the change type distribution, so that representative amounts of perfective and other CRs are analyzed.

The parameter CAE for the simplified model is still simpler to determine. Effort $e$ is the overall person time spent on a set of change requests. It can often be obtained from billing systems. Furthermore, we need to determine person hours spent on quality assurance, working with code and spent exclusively developing new code. This can, again, be done by observing developers working on CRs.

The modification ratio can, in principle, also be determined by observing developers and differentiating between additions and modifications. If available, it can alternatively be estimated from change request type statistics.

Literature Values for Activity Efforts offer a simple way to instantiate the model. Unfortunately, the research community still lacks a thorough understanding of how the activity costs are distributed across maintenance activities [211]. Consequently, results based on literature values are less accurate. They can however serve for a coarse approximation based on which a decision can be taken, whether effort for more accurate determination of the parameters is justified.

Several researchers have measured effort distribution across maintenance activities. In [194], Rombach et al. report measurement results for three large systems, carried out over the course of three years and covering around 10,000 hours of maintenance effort. Basili et al. [10] analyzed 25 releases each of 10 different projects, covering over 20,000 hours of effort. Both studies work on data that was recorded during maintenance. Yeh and Jeng [236] performed a questionnaire-based survey in Taiwan. Their data is based on 97 valid responses received for 1000 questionnaires distributed across Taiwan's software engineering landscape. The values of the three studies are depicted in Table 6.1.

Table 6.1: Effort distribution

Activity            [194]   [10]   [236]   Estimate
Analysis              -     13%    26%       5%
Location              -      -      -        8%
Design               30%    16%    19%      16%
Impact Analysis       -      -      -        5%
Implementation       22%    29%    26%      26%
Quality Assurance    22%    24%    17%      22%
Other                26%    18%    12%      18%

Since each study used a slightly different maintenance process, each being different from the one used in this thesis, we cannot directly determine average values for activity distribution. For example, in [194], design subsumes analysis and location. In [10], analysis subsumes location. The estimated average efforts are depicted in the Estimate column of Table 6.1. Since our definitions of implementation, quality assurance and other are similar between the studies and our process, we used the median as estimated value. For the remaining activities, the effort distributions from the literature are of little help, since the activities do not exist in their processes or are defined differently. We thus distributed the remaining 34% of effort according to our best knowledge, based on our own development experience and that of our industrial partners; the distribution can thus be inaccurate.

To determine the ratio between modification and addition effort during implementation, we inspect the distribution of change request types. We assume that adaptive, corrective and preventive change requests mainly involve modifications, whereas perfective changes mainly involve additions. Consequently, we estimate the ratio between addition and modification by the ratio of perfective w.r.t. all other change types. Table 6.2 shows effort distribution across change types from the above studies. The Median column depicts the median of all three: 37% of maintenance efforts are spent on perfective CRs, the remaining 63% are distributed across the other CR types. Based on these values, we estimate the modification ratio to be 0.63.

Table 6.2: Change type distribution

Effort        [194]   [10]   [236]   Median
Adaptive        7%     5%     8%      7%
Corrective     27%    14%    23%     23%
Other          29%    20%    44%     29%
Perfective     37%    61%    25%     37%
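
The estimate of the modification ratio can be retraced with a short computation. The sketch below takes the three perfective effort shares from Table 6.2, computes their median and treats the remainder as modification effort, following the assumption stated above that perfective changes mainly involve additions. The variable names are hypothetical.

```python
from statistics import median

# Perfective effort shares (percent) reported by the three studies.
perfective_share = {"[194]": 37, "[10]": 61, "[236]": 25}

# The median is robust against the outlier among the three studies.
perfective_median = median(perfective_share.values())

# Assumption from the text: perfective CRs mainly add code, all other
# change types mainly modify existing code.
mod = 1 - perfective_median / 100

print(f"median perfective share: {perfective_median}%")
print(f"estimated modification ratio: {mod:.2f}")
```

Using the median instead of the mean keeps the estimate from being dominated by the single study that reports a much higher perfective share.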

6.6.2 Case Studies

This section presents the application of the clone cost model to several large industrial software systems to quantify the impact of cloning, and the possible benefit of clone management tool support, in practice.

Goal. The case study has two goals. First, evaluation of the clone cost model. Second, quantification of the impact of cloning on software maintenance costs across different software systems, and the possible benefit of the application of clone management tools.

Study Objects. We chose 11 industrial software systems as study objects. Since we require the willingness of developers to contribute in clone detection tailoring, we had to rely on our contacts with industry. However, we chose systems from different domains (finance, content management, convenience, power supply, insurance) from 7 different companies written in 5 different programming languages to capture a representative set of systems. For non-disclosure reasons, we termed the systems A-K. Table 6.3 gives an overview ordered by system size.

Study Design and Procedure. Clone detection tailoring was performed to achieve accurate results. System developers participated in tailoring to identify false positives. Clone detection and overhead computation was performed using ConQAT for all study objects and limited to type-1 and type-2 clones. Minimal clone length was set to 10 statements for all systems. We consider this a conservative minimal clone length.

Since the effort parameters are not available to us for the analyzed systems, we employed values from the literature. We assume that 50% of the overall maintenance effort is affected by cloning (8% location, 5% impact analysis, 26% · 0.63 implementation and 22% quality assurance; rounded from 51.38% to 50%, since the available data does not contain the implied accuracy). To estimate the impact of clone indication tool support, we assume that 10% of that effort is used for impact analysis (5% out of 50% in total). In case clone indication tools are employed, the impact of cloning on maintenance effort can thus be reduced by 10%.
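
The arithmetic of this instantiation can be retraced as follows. The sketch computes the cloning-affected effort from the literature-based shares, rounds it to 50% as described, and applies the simplified model with and without ideal clone indication. All function and variable names are hypothetical; the overhead values in the usage lines are the ones reported for systems B and A in Table 6.3.

```python
def cloning_affected_effort(location, impact_analysis, implementation,
                            quality_assurance, mod):
    """Share of maintenance effort affected by cloning (CAE)."""
    return location + impact_analysis + implementation * mod + quality_assurance


# Literature-based shares used in the study design.
cae_exact = cloning_affected_effort(0.08, 0.05, 0.26, 0.22, mod=0.63)
CAE = 0.50                    # rounded, the data does not support more digits
IMPACT_ANALYSIS_SHARE = 0.05  # part of CAE removed by ideal clone indication


def effort_increase(overhead, cae=CAE):
    """Simplified model: relative effort increase without tool support."""
    return overhead * cae


def effort_increase_with_indication(overhead, cae=CAE,
                                    impact_analysis=IMPACT_ANALYSIS_SHARE):
    """Simplified model with ideal clone indication (impact analysis freed)."""
    return overhead * (cae - impact_analysis)


print(f"exact CAE: {cae_exact:.2%}")
for system, overhead in (("B", 0.40), ("A", 1.50)):
    print(system,
          f"{effort_increase(overhead):.1%}",
          f"{effort_increase_with_indication(overhead):.1%}")
```

Removing the 5% impact analysis share from a CAE of 50% is the same as reducing the estimated effort increase by 10%, which is how the tool-support estimate is described above.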

Results and Discussion. The results are depicted in Table 6.3. The columns show lines of code (kLOC), source statements (kSS), redundancy-free source statements (kRFSS), size overhead and cloning induced increase in maintenance effort without ($\Delta E$) and with clone indication tool support ($\Delta E_{Tool}$). Such tool support also reduces the increase in the number of faults due to cloning. As mentioned in Section 6.3.3, this is not reflected in the model.

The effort increase varies substantially between systems. The estimated effort increase ranges from 75%, for system A, to 5.2% for system F. We could not find a significant correlation between overhead and system size. On average, estimated maintenance effort increase is 20% for the analyzed systems. The median is 15.9%. For a single quality characteristic, we consider this a substantial impact on maintenance effort. For systems A, B, E, G, I, J and K estimated effort increase is above 10%; for these systems, it appears warranted to determine project specific effort parameters to achieve accurate results and perform clone management to reduce effort increase.

Table 6.3: Case study results

System  Language   kLOC   kSS  kRFSS  overhead      E  ETool
A       XSLT         31    15      6    150.0%  75.0%  67.5%
B       ABAP         51    21     15     40.0%  20.0%  18.0%
C       C#          154    41     35     17.1%   8.6%   7.7%
D       C#          326   108     95     13.7%   6.8%   6.2%
E       C#          360    73     59     23.7%  11.9%  10.7%
F       C#          423    96     87     10.3%   5.2%   4.7%
G       ABAP        461   208    155     34.2%  17.1%  15.4%
H       C#          657   242    210     15.2%   7.6%   6.9%
I       COBOL     1,005   400    224     78.6%  39.3%  35.4%
J       Java      1,347   368    265     38.9%  19.4%  17.5%
K       Java      2,179   733    556     31.8%  15.9%  14.3%

6.7 Summary
This chapter presented an analytical cost model to quantify the economic effect of cloning on maintenance efforts. The model computes maintenance effort increase relative to a system without cloning. It can be used as a basis to evaluate clone management alternatives. We have instantiated the cost model on 11 industrial systems. Although result accuracy could be improved by using project specific instead of literature values for effort parameters, the results indicate that cloning induced impact varies significantly between systems and is substantial for some. Based on the results, some projects can achieve considerable savings by performing active clone control.
Both the cost model, and the empirical studies in Chapters 4 and 5, further our understanding of the significance of cloning. However, the nature of their contributions is different. The empirical studies observe real-world software engineering. While they yield objective results, their research questions and scope are limited to what we can feasibly study. The cost model is not affected by these limitations and can thus cover the entire maintenance process. On the other hand, the cost model is more speculative than the empirical studies in that it reflects our assumptions on the impact of cloning on engineering activities. The cost model thus serves two purposes. First, it complements the empirical studies to complete our understanding of the impact of cloning. Second, it makes our assumptions explicit and thus provides an objective basis for substantiated scientific discourse on the impact of cloning.
7 Algorithms and Tool Support
Both clone detection research and clone assessment and control in practice are infeasible without the appropriate tools: clones are nearly impossible to detect and manage manually in large artifacts. This chapter outlines the algorithms and introduces the tools that have been created during this thesis to support clone assessment and control.
The source code of the clone detection workbench has been published as open source as part of ConQAT. Its clone detection specific parts, which have been developed during this thesis, comprise approximately 67 kLOC.
The clone detection process can be broken down into individual consecutive phases. Each phase operates on the output of its previous phase and produces the input for its successor. The phases can thus be arranged as a pipeline. Figure 7.1 displays a general clone detection pipeline that comprises four phases: preprocessing, detection, postprocessing and result presentation:
Figure 7.1: Clone detection pipeline
Preprocessing reads the source artifacts from disk, removes irrelevant parts and produces an intermediate representation. Detection searches for similar regions in the intermediate representation, the clones, and maps them back to regions in the original artifacts. Postprocessing filters detected clones and computes cloning metrics. Finally, result presentation renders cloning information into a format that fits the task for which clone detection is employed. An example is a trend chart in a quality dashboard used for clone control.
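The four phases can be sketched as functions that each consume the output of their predecessor. The following toy pipeline is only an illustration under strong assumptions: all function names are hypothetical, comments are recognized by a naive `//` split, and "detection" merely reports identical single lines rather than contiguous clones as ConQAT does.

```python
from collections import defaultdict


def preprocess(artifacts):
    """Strip line comments and blank lines; emit (artifact, line, unit)."""
    units = []
    for name, text in artifacts.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            unit = line.split("//")[0].strip()
            if unit:
                units.append((name, line_no, unit))
    return units


def detect(units):
    """Toy detection: group the locations of identical units."""
    locations = defaultdict(list)
    for name, line_no, unit in units:
        locations[unit].append((name, line_no))
    return [group for group in locations.values() if len(group) > 1]


def postprocess(clone_groups):
    """Keep only groups that span at least two artifacts."""
    return [g for g in clone_groups if len({name for name, _ in g}) > 1]


def present(clone_groups):
    """Render each clone group as one line of text."""
    return [" <-> ".join(f"{n}:{l}" for n, l in g) for g in clone_groups]


def run_pipeline(artifacts, phases=(preprocess, detect, postprocess, present)):
    data = artifacts
    for phase in phases:  # each phase consumes its predecessor's output
        data = phase(data)
    return data


demo = {"A.java": "int x = 1; // init\nfoo();", "B.java": "foo();\nbar();"}
print(run_pipeline(demo))
```

Because the phases are passed in as a tuple, a task-specific pipeline can swap or add individual phases without touching the others.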
This clone detection pipeline, or similar pipeline models, are frequently used to outline the clone detection process or the architecture of clone detection tools from a high level point of view [57, 111, 113, 115, 200]. It also serves as an outline of this chapter: section 7.1 introduces the architecture of the clone detection workbench that reflects the pipeline of the clone detection process. The subsequent sections detail preprocessing (7.2), detection (7.3), postprocessing (7.4) and result presentation (7.5). Section 7.6 compares the workbench with existing detectors and section 7.7 discusses its maturity and adoption. Finally, section 7.8 summarizes the chapter. Parts of the content of this chapter have been published in [54, 97, 111, 113, 115].
7.1 Architecture
This section introduces the pipes & filters architecture of the clone detection workbench.
7.1.1 Variability
Clone detectors are applied to a large variety of tasks in both research and practice [140, 201], including quality assessment [111, 159, 178], software maintenance and reengineering [32, 54, 102, 126, 149], identification of crosscutting concerns [27], plagiarism detection and analysis of copyright infringement [77, 121].
Each of these tasks imposes different requirements on the clone detection process and its results [229]. For example, the clones relevant for redundancy reduction, i.e., clones that can be removed, differ significantly from the clones relevant for plagiarism detection. Similarly, a clone detection process used at development time, e.g., integrated in an IDE, has different performance requirements than a detection executed during a nightly build. Moreover, even for a specific task, clone detection tools need a fair amount of tailoring to adapt them to the peculiarities of the analyzed projects. Simple examples are the exclusion of generated code or the filtering of detection results to retain only clones that cross project boundaries. More sophisticated, one may want to add a pre-processing phase that sorts methods in source code to eliminate differences caused by method order or to add a recommender system that analyzes detection results to support developers in removing clones.
While a pipeline is a useful abstraction to convey the general picture, there is no unique clone detection pipeline that fits all purposes. Instead, both in research and practice, a family of related, yet different clone detection pipelines are employed across tools, tasks and domains.
Clone detection tools form a family of products that are related and yet differ in important details. A suitable architecture for a clone detection workbench thus needs to support this product family nature. On the one hand, it needs to provide sufficient flexibility, configurability and extensibility to cater for the multitude of clone detection tasks. On the other hand, it must facilitate reuse and avoid redundancy between individual clone detection tools of the family.
7.1.2 Explicit Pipeline
The clone detection workbench developed during this thesis supports the product family nature of clone detection tools by making the clone detection pipeline explicit. The clone detection phases are lifted to first class entities of a declarative dataflow language. This way, a clone detector is composed from a library of units that perform specific detection tasks. Both the individual units and combinations of units can be reused across detectors.
The clone detection workbench is implemented as part of the Continuous Quality Assessment Toolkit (ConQAT) [48, 50, 52, 55, 56, 113]. ConQAT offers a visual dataflow language that facilitates the construction of program analyses that can be described using the pipes & filters architectural style [208]. This visual language is used to compose clone detection tools from individual processing steps. Furthermore, ConQAT offers an interactive editor to create, modify, execute, document and debug analysis configurations. Using this analysis infrastructure, ConQAT implements several software quality analyses.¹ The clone detection tool support present in ConQAT has been developed as part of this thesis.
¹ In an earlier version, the clone detection tool support was an independent project called CloneDetective [113] before it became part of ConQAT. For simplicity, we refer to it as »ConQAT« for the remainder of this thesis.
Figure 7.2 shows an exemplary clone detection configuration. It depicts a screenshot from ConQAT, which has been manually edited to indicate correspondence of the individual processing steps to the clone detection pipeline phases. Each blue rectangle with a gear wheel symbol is a processor. It represents an atomic piece of analysis functionality. Each green rectangle with a boxed double gear wheel symbol represents a block. A block is a piece of analysis functionality made up of further processors or blocks. It is the composite piece of functionality that allows reuse of recurring analysis parts.
This clone detection configuration searches for clones in Java source code that span different projects to identify candidates for reuse. In detail, the configuration works as shown in Figure 7.2:
Figure 7.2: Clone detection configuration
During preprocessing, the source-code-scope reads source files from disk into memory. The regex-region-marker marks Java include statements in the files for exclusion, since they are not relevant for this use case. The statement-normalization block creates a normalization strategy.
In the detection phase, the clone-detector processor uses the normalization strategy to transform the input files into a sequence of statement units and performs detection of contiguous clones. The non-overlapping-constraint is evaluated on each detected clone group. Clone groups that contain clones that overlap with each other are excluded.
During postprocessing, the black-list-filter removes all clone groups that have been blacklisted by developers. The rfss-annotator computes the redundancy-free-source-statements measure for each source file. The cross-project-clone-group-filter removes clone groups that do not span at least two projects.
In the output phase, the clone-report-writer-processor writes the detection results into an XML report that can be opened for interactive clone inspection. The coverage-output and html-presentation create a treemap that gives an overview of the distribution of cross-project clones across the analyzed projects.
In this configuration, the statement-normalization and the coverage-output are reused configuration blocks. The remaining units have been individually configured for this analysis.
While the phases of the clone detection pipeline from Figure 7.1 are still recognizable in the ConQAT configuration in Figure 7.2, the configuration contains task-specific units (e.g., the cross-project-clone-group-filter) that are not required in other contexts. Consequently, for other tasks, specific pipelines can be configured that reuse shared functionality available in the form of processors or blocks.
7.2 Preprocessing
Preprocessing transforms the source artifacts into an intermediate representation on which clone detection is performed. The intermediate representation serves two purposes: first, it abstracts from the language of the artifact that gets analyzed, allowing detection to operate independent of idiosyncrasies of, e.g., C++ or ABAP source code or texts written in English or German; second, different elements in the original artifacts can be normalized to the same intermediate language fragment, thus intentionally masking subtle differences.
This section first introduces artifact-independent preprocessing steps and then outlines artifact-specific strategies for source code, requirements specifications and models.
7.2.1 Steps
ConQAT performs preprocessing in four steps: collection, removal, normalization and unit creation. All of them can be configured to make them suitable for different tasks.
Collection gathers source artifacts from disk and loads them into memory. It can be configured to determine which artifacts are collected and which are ignored. Inclusion and exclusion patterns can be specified on artifact paths and content to exclude, e.g., generated code based on file name patterns, location in the directory structure or typical content.
Removal strips parts from the artifacts that are uninteresting from a clone detection perspective, e.g., comments or generated code.
Normalization splits the (non-ignored parts of the) source artifacts into atomic elements and transforms them into a canonical representation to mask subtle differences that are uninteresting from a clone detection perspective.
Unit creation groups atomic elements created by normalization into units on which clone detection is performed. Depending on the artifact type, it can group several atomic elements into a single unit (e.g., tokens into statements) or produce a unit for each atomic element (e.g., for Matlab/Simulink graphs).
The result of the preprocessing phase is an intermediate representation of the source artifacts. The underlying data structure depends on the artifact type: preprocessing produces a sequence of units for source code and requirements specifications and a graph for models.
7.2.2 Code
Preprocessing for source code operates on the token level. Programming-language specific scanners are employed to split source code into tokens. Both removal and normalization can be configured to specify which token classes to remove and which normalizing transformations to perform. If no scanner for a programming language is available, preprocessing can alternatively work on the word or line level. However, normalization capabilities are then reduced to regular-expression-based replacements.²
Tokens are removed if they are not relevant for the execution semantics (such as, e.g., comments) or optional (e.g., keywords such as this in Java). This way, differences in the source code that are limited to these token types do not prevent clones from being found.
Normalization is performed on identifiers and literals. Literals are simply transformed into a single constant for each literal type (i.e., boolean literals are mapped to another constant than integer literals). For identifier transformation, a heuristic strategy is employed that aims to provide a canonical representation to all statements that can be transformed into each other through consistent renaming of their constituent identifiers. For example, the statement »a = a + b;« gets transformed to »id0 = id0 + id1«. So does »x = x + y«. However, statement »a = b + c« does not get normalized like this, since it cannot be transformed into the previous examples through consistent renaming. (Instead, it gets normalized to »id0 = id1 + id2«.) This normalization is similar to parameterized string matching proposed by Baker [6].
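Consistent renaming of identifiers within a single statement can be sketched as follows. This is an illustrative approximation, not ConQAT's scanner-based implementation: the regular-expression tokenizer, the `<int>` placeholder and the function name are assumptions, and keywords are not distinguished from identifiers.

```python
import re

# Crude tokenizer (assumption): identifiers, integer literals, or any
# other single non-whitespace character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def normalize_statement(statement):
    """Rename identifiers in order of first occurrence; mask int literals."""
    names = {}
    normalized = []
    for token in TOKEN.findall(statement):
        if IDENTIFIER.fullmatch(token):
            if token not in names:
                names[token] = f"id{len(names)}"
            normalized.append(names[token])
        elif token.isdigit():
            normalized.append("<int>")  # one constant per literal type
        else:
            normalized.append(token)
    return " ".join(normalized)


print(normalize_statement("a=a+b;"))
print(normalize_statement("x = x + y;"))
print(normalize_statement("a = b + c;"))
```

The first two statements yield the same normalized form, while the third does not, which mirrors the example given in the text.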
ConQAT does not employ the same normalization to all code regions. Instead, different strategies can be applied to different code regions. This allows conservative normalization to be performed to repetitive code (e.g., sequences of Java getters and setters) to avoid false positives; at the same time, non-repetitive code can be normalized aggressively to improve recall. The normalization strategies and their corresponding code regions can be specified by the user; alternatively, ConQAT implements heuristics to provide default behavior suitable to most code bases.
Unit creation forms statements from tokens. This way, clone boundaries coincide with statement boundaries. A clone thus cannot begin or end somewhere in the middle of a statement.
Shapers insert unique units at specified positions. Since unique units are unequal to any other unit, they cannot be contained in any clone. Shapers thus clip clones. ConQAT implements shapers to clip clones to basic blocks, method boundaries or according to user-specified regular expressions.
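The effect of a shaper can be illustrated with a small sketch. The representation of unique units as numbered tuples and the names `unique_unit` and `shape` are assumptions made for illustration only.

```python
import itertools

_ids = itertools.count()


def unique_unit():
    """Create a unit that is unequal to every other unit."""
    return ("<unique>", next(_ids))


def shape(units, cut_positions):
    """Insert a fresh unique unit before each position in cut_positions."""
    cuts = set(cut_positions)
    shaped = []
    for position, unit in enumerate(units):
        if position in cuts:
            shaped.append(unique_unit())
        shaped.append(unit)
    return shaped


# Two "methods" with identical bodies; the cut marks the method boundary.
units = ["x=1", "y=2", "x=1", "y=2"]
print(shape(units, [2]))
```

Since no repeated subsequence can contain the inserted unit, any clone found by a sequence-based detector ends at the marked boundary.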
² For reasons of conciseness, this section is limited to an overview. A detailed documentation of the existing processors and parameters for normalization is contained in ConQATDoc at www.conqat.org and the ConQAT Book [49].
7.2.3 Requirements Specifications
Preprocessing for natural language documents operates on the word level. A scanner is employed to split text into word and punctuation tokens. Whitespace is discarded. Both removal and normalization operate on the token stream.
Punctuation is removed to allow clones to be found that only differ in, e.g., their commas. Furthermore, stop words are removed from the token stream. Stop words are defined in information retrieval as words that are insignificant or too frequent to be useful in search queries. Examples are “a”, “and”, or “how”.
Normalization performs word stemming to the remaining tokens. Stemming heuristically reduces a word to its stem. ConQAT uses the Porter stemmer algorithm [187], which is available for various languages. Both the list of stop words and the stemming depend on the language of the specification.
Unit creation forms sentence units from word tokens. This way, clone boundaries coincide with sentence boundaries. A clone thus cannot begin or end somewhere in the middle of a sentence.
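A simplified version of these steps is sketched below. The stop word list is a tiny illustrative assumption, sentence splitting uses a naive regular expression, and stemming is deliberately omitted, so this is not a substitute for the Porter stemmer used by ConQAT.

```python
import re

# Tiny illustrative stop word list (assumption, English only).
STOP_WORDS = {"a", "and", "how", "the", "of", "to"}


def sentence_units(text):
    """Split text into sentence units without punctuation and stop words."""
    units = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z0-9]+", sentence.lower())
        kept = [word for word in words if word not in STOP_WORDS]
        if kept:
            units.append(" ".join(kept))
    return units


print(sentence_units("The system shall log errors, and warnings."))
```

Two sentences that differ only in commas or stop words map to the same unit and can therefore be recognized as clones of each other.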
7.2.4 Models
Preprocessing transforms Matlab/Simulink models into labeled graphs. It involves several steps: reading the models, removal of subsystem boundaries, removal of unconnected lines and normalization.
Normalization produces the labels of the vertices and edges in the graph. The label content depends on which vertices are considered equal. For blocks, usually at least the block type is included, while semantically irrelevant information, such as the name, color, or layout position, are excluded. Additionally, some of the block attributes are taken into account, e.g., for the RelationalOperator block the value of the Operator attribute is included, as this decides whether the block performs a greater or less than comparison. For the lines, we store the indices of the source and destination ports in the label, with some exceptions as, e.g., for a product block the input ports do not have to be differentiated. Furthermore, normalization stores weight values for vertices. The weight values are used to treat different vertex types differently when filtering small clones. Weighting can be configured and is an important tool to tailor model clone detection.
The result of these steps is a labeled model graph $G = (V, E, L)$ with the set of vertices (or nodes) $V$ corresponding to the blocks, the directed edges $E \subseteq V \times V$ corresponding to the lines, and a labeling function $L : V \cup E \rightarrow N$ mapping nodes and edges to normalization labels from some set $N$. Two vertices or two edges are considered equivalent, if they have the same label. As a Simulink block can have multiple ports, each of which can be connected to a line, $G$ is a multi-graph. The ports are not modeled here but implicitly included in the normalization labels of the lines.
For the simple models shown in Figure 7.3 the labeled graph produced by preprocessing is depicted in Figure 7.4. The nodes are labeled according to our normalization function. (The grey portions of the graph mark the part we consider a clone.)
Figure 7.3: Examples: Discrete saturated PI-controller and PID-controller
Figure 7.4: The model graph for our simple example model
7.3 Detection Algorithms
Detection identifies the actual clones in the artifacts. This section first introduces general steps involved in detection and then outlines detection algorithms for sequences and graphs.
7.3.1 Steps
The detection phase produces cloning information in terms of regions in the source artifacts. It involves two steps. First, clones are identified in the intermediate representation. Second, clones are mapped from the intermediate representation to their original artifacts. Given a suitable intermediate representation, mapping is straight-forward. The principal challenge in this phase is thus the detection of clones in the intermediate representation.
The employed detection algorithms depend on the structure of the intermediate representation, not on the type of the artifact. More specifically, different algorithms are employed for sequences than for graphs. This section is thus structured according to algorithms that operate on sequences and those that operate on graphs.³
In principle, source code can be represented both as a sequence of statements or as a graph (e.g., a program dependence graph). Thus, both sequence- and graph-based detection algorithms can be applied to source code. PDG-based approaches [137, 146], e.g., operate on a graph-based intermediate representation for code. However, ConQAT performs clone detection on sequences, since from our experience, the cost increase incurred by searching clones in graphs instead is not accounted for by a sufficient increase in detection result quality: many of the graph-based clone detection approaches are prohibitively expensive for practical application [137, 146]. For data-flow models, on the other hand, we are not aware of a sequentialization that is sufficiently canonical to allow for high recall of sequence-based clone detection in models. Thus, we perform clone detection for source code and requirements specifications on sequences, but clone detection for models on graphs.
³ ConQAT does not implement clone detection algorithms that operate on trees.
7.3.2 Batch Detection of Type-1 and Type-2 Clones in Sequences
ConQAT implements a suffix tree-based algorithm for the detection of type-1 and type-2 clones in sequences. The algorithm operates on a string of units and detects substrings that occur more than once. It can be applied both to source code and to requirements specifications. The algorithm is similar to the clone detection algorithms proposed by Baker [6] and Kamiya et al. [121].
A suffix tree over a sequence s is a tree with edges labeled by words so that exactly all suffixes of s are found by traversing the tree from the root node to a leaf and concatenating the words on the encountered edges. It is constructed in linear time (and thus linear space) using the algorithm by Ukkonen [222]. A suffix tree for the sequence abcdXabcd$ is displayed in Figure 7.5. Red edges denote suffix links. A suffix link points from a node to a node that represents its direct suffix.
Figure 7.5: Suffix tree for sequence abcdXabcd$
In a suffix tree, no two edges leaving a node have the same label. If two substrings of s are identical, it contains two suffixes that have the string as their prefix; both share the same edge in the tree. In sequence abcdXabcd$, the string abcd occurs twice; consequently, the suffixes abcdXabcd$ and abcd$⁴ share the prefix abcd and thus the edge between n0 and n6 in the tree (denoted in blue). The node n6 indicates that the suffixes differ from this point on: one continues with the label Xabcd$, one with $.
⁴ The sentinel character $ denotes the end of the sequence s.
To detect clones, the algorithm performs a depth-first search of the suffix tree. If a node in the tree has children, the label from the root to the node occurs exactly as many times in s, as the node has reachable leaves in the tree. For example, since n6 has two reachable leafs (n1 and n7), the label abcd occurs 2 times in s and is thus reported as a clone group with two clones.
Figure 7.6: The original file named X.j (left), its normalization (center), and the corresponding clone index (right).
The suffixes of clones (bcd, cd and d, denoted in gray in the example) also occur several times in s. We refer to them as induced clones. If they do not occur more often than their longer variants, they are not reported. The algorithm employs the suffix links to propagate induced clone counts. Clones are only reported, if the induced clone count for a node is smaller than its clone count. In the example, no clones are reported for nodes n8, n10 and n12.
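The idea of reporting repeated substrings while suppressing induced clones can be illustrated without a suffix tree. The sketch below swaps in a different technique: it sorts all suffixes naively (quadratic or worse, unlike Ukkonen's linear-time construction) and compares neighbouring suffixes. It reports pairs only, so a clone group with three or more instances would appear as several pairs. The function name and the left-extension test for induced clones are assumptions for illustration.

```python
def repeated_sequences(units, min_length=1):
    """Return (sequence, [start positions]) for repeated, left-maximal runs."""
    n = len(units)
    # Naive suffix sorting; a suffix tree would avoid this cost.
    order = sorted(range(n), key=lambda i: units[i:])
    clones = []
    for a, b in zip(order, order[1:]):
        # Longest common prefix of two neighbouring suffixes.
        length = 0
        while (a + length < n and b + length < n
               and units[a + length] == units[b + length]):
            length += 1
        if length < min_length:
            continue
        # Induced clone: both occurrences extend to the left by the same
        # unit, so they are covered by a longer repeated sequence.
        induced = a > 0 and b > 0 and units[a - 1] == units[b - 1]
        if not induced:
            start = min(a, b)
            clones.append((units[start:start + length], sorted((a, b))))
    return clones


print(repeated_sequences("abcdXabcd$"))  # [('abcd', [0, 5])]
```

For the example sequence, only abcd is reported; bcd, cd and d are recognized as induced and dropped, as in the suffix tree example.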
Scalability and Performance. We evaluate scalability and performance of the suffix tree-based algorithm together with the index-based algorithm in the next section.
7.3.3 Real-Time Detection of Type-1 and Type-2 Clones in Sequences
ConQAT implements index-based clone detection as a novel detection approach for type-1 and type-2 clones that is incremental, distributable and scalable to very large code bases.
Clone Index. The clone index is the central data structure used for our detection algorithm. It allows the lookup of all clones for a single file (and thus also for the entire system), and can be updated efficiently, when files are added, removed, or modified.
The list of all clones of a system is not a suitable substitute for a clone index, as efficient update is not possible. Adding a new file may potentially introduce new clones to any of the existing files and thus a comparison to all files is required if no additional data structure is used.
The core idea of the clone index is similar to the inverted index used in document retrieval systems (cf., [135], pp. 560–663). There, a mapping from each word to all its occurrences is maintained. Similarly, the clone index maintains a mapping from sequences of normalized statements to their occurrences. More precisely, the clone index is a list of tuples (file, statement index, sequence hash, info), where file is the name of the file, statement index is the position in the list of normalized statements for the file, sequence hash is a hash code for the next n normalized statements in the file starting from the statement index (n is a constant called chunk length and is usually set to the minimal clone length), and info contains any additional data, which is not required for the algorithms, but might be useful when producing the list of clones, such as the start and end lines of the statement sequence.
The clone index contains the described tuples for all files and all possible statement indices, i.e., for a single file the statement sequences (1, ..., n), (2, ..., (n+1)), (3, ..., (n+2)), etc. are stored. Our detection algorithm requires lookups of tuples both by file and by sequence hash, so both should be supported efficiently. Other than that, no restrictions are placed on the index data structure, so there are different implementations possible, depending on the actual use-case. These include in-memory indices based on two hash tables or search trees for the lookups, and disk-based indices which allow persisting the clone index over time and processing amounts of code which are too large to fit into main memory. The latter may be based on database systems, or on one of the many optimized (and often distributed) key-value stores [34, 47].
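An in-memory index based on two hash tables could look roughly like the following sketch. The class and method names are hypothetical, the info field is omitted, and the way statements are joined before hashing is an assumption; only the use of MD5 over chunks of n normalized statements follows the description in the text.

```python
import hashlib
from collections import defaultdict


class CloneIndex:
    """Minimal in-memory clone index with lookups by file and by hash."""

    def __init__(self, chunk_length):
        self.chunk_length = chunk_length
        self.by_file = defaultdict(list)  # file -> [(stmt index, hash)]
        self.by_hash = defaultdict(set)   # hash -> {(file, stmt index)}

    def sequence_hash(self, statements):
        joined = "\n".join(statements).encode("utf-8")
        return hashlib.md5(joined).hexdigest()

    def add_file(self, name, normalized_statements):
        n = self.chunk_length
        for i in range(len(normalized_statements) - n + 1):
            h = self.sequence_hash(normalized_statements[i:i + n])
            self.by_file[name].append((i, h))
            self.by_hash[h].add((name, i))

    def remove_file(self, name):
        for i, h in self.by_file.pop(name, []):
            self.by_hash[h].discard((name, i))
            if not self.by_hash[h]:
                del self.by_hash[h]

    def update_file(self, name, normalized_statements):
        # A modification is a removal followed by an addition.
        self.remove_file(name)
        self.add_file(name, normalized_statements)


index = CloneIndex(chunk_length=2)
index.add_file("X.j", ["a", "b", "c"])
index.add_file("Y.j", ["b", "c", "d"])
print(sorted(index.by_hash[index.sequence_hash(["b", "c"])]))
```

Keeping both tables in sync makes adding and removing a file proportional to the number of its statements, independent of the size of the rest of the index.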
In Fig. 7.6, the correspondence between an input file »X.j«⁵ and the clone index is visualized for a chunk length of 5. The field that requires most explanation is the sequence hash. The reason for using sequences of statements in the index instead of individual statements is that the statement sequences are less common (two identical statement sequences are less likely than two identical statements) and are already quite similar to the clones. If there are two entries in the index with the same sequence, we already have a clone of length at least n. The reason for storing a hash in the index instead of the entire sequence is for saving space, as this way the size of the index is independent of the choice of n, and usually the hash is shorter than the sequence’s contents even for small values of n. We use the MD5 hashing algorithm [192] which calculates 128 bit hash values and is typically used in cryptographic applications, such as the calculation of message signatures. As our algorithm only works on the hash values, several statement sequences with the same MD5 hash value would cause false positives in the reported clones. While there are cryptographic attacks that can generate messages with the same hash value [212], the case of different statement sequences producing the same MD5 hash is so unlikely in our setting, that it can be neglected for practical purposes.
Clone Retrieval. The clone retrieval process extracts all clones for a single file from the index. Usually we assume that the file is contained in the index, but of course the same process can be applied to find clones between the index and an external file as well. Tuples with the same sequence hash already indicate clones with a length of at least n (where n is the chunk length). The goal of clone retrieval is to report only maximal clones, i.e., clone groups that are not entirely contained in another clone group. The overall algorithm is sketched in Fig. 7.7, which we next explain in more detail.
The first step (up to Line 6) is to create the list c of duplicated chunks. This list stores for each statement of the input file all tuples from the index with the same sequence hash as the sequence found in the file. The index used to access the list c corresponds to the statement index in the input file. The setup is depicted in Fig. 7.8. There is a clone of length 10 (6 tuples with chunk length 5) with the file Y.j, and a clone of length 7 with both Y.j and Z.j.
In the main loop (starting from Line 7), we first check whether any new clones might start at this position. If there is only a single tuple with this hash (which has to belong to the inspected file at the current location) we skip this loop iteration. The same holds if all tuples at position i have already been present at position i − 1, as in this case any clone group found at position i would be included in a clone group starting at position i − 1. Although we use the subset operator in the algorithm description, this is not really a subset operation, as of course the statement index of the tuples in c(i) will be increased by 1 compared to the corresponding ones in c(i − 1) and the content of the info field will differ.
⁵ We use the name X.j instead of X.java as an abbreviation in the figures.

 1  function reportClones(filename)
 2    let f be the list of tuples corresponding to filename
      sorted by statement index, either read from
      the index or calculated on the fly
 3    let c be a list with c(0) = ∅
 4    for i := 1 to length(f) do
 5      retrieve tuples with same sequence hash as f(i)
 6      store this set as c(i)
 7    for i := 1 to length(c) do
 8      if |c(i)| < 2 or c(i) ⊆ c(i − 1) then
 9        continue with next loop iteration
10      let a := c(i)
11      for j := i + 1 to length(c) do
12        let a′ := a ∩ c(j)
13        if |a′| < |a| then
14          report clones from c(i) to a (see text)
15        a := a′
16        if |a| < 2 or a ⊆ c(i − 1) then
17          break inner loop

Figure 7.7: Clone retrieval algorithm
Figure 7.8: Lookups performed for retrieval
cloneswhichhavenotyetbeenreported.Ateachiterationoftheinnerloopthesetaisreducedto
tupleswhicharealsopresentinc(j)(againtheintersectionoperatorhastoaccountfortheincreased
statementindexanddifferentinfo®eld).Thenewvalueisstoredina0.Clonesareonlyreported,
iftuplesarelostinLine12,asotherwiseallcurrentclonescouldbeprolongedbyonestatement.
Clonereportingmatchestuplesthat,aftercorrectionofthestatementindex,appearinbothc(i)and
a;eachmatchedpaircorrespondstoasingleclone.Itslocationcanbeextractedfromthe®lename
andinfo®elds.Allclonesinasinglereportingstepbelongtooneclonegroup.Line16earlyexits
theinnerloopifeithernomoreclonesarestartingfrompositioni(i.e.,aistoosmall),orifall
tuplesfromahavealreadybeeninc(i!1).(again,correctedforstatementindex).Inthiscasethey
105
7AlgorithmsandToolSupport
havealreadybeenreportedinthepreviousiterationoftheouterloop.
This algorithm returns all clone groups with at least one clone instance in the given file and with a minimal length of chunk length n. Shorter clones cannot be detected with the index, so n must be chosen equal to or smaller than the minimal clone length. Of course, reported clones can be easily filtered to only include clones with a length l > n.
One problem of this algorithm is that clone groups with multiple instances in the same file are encountered and reported multiple times. Furthermore, when calculating the clone groups for all files in a system, clone groups will be reported more than once as well. Both cases can be avoided, by checking whether the first element of a′ (with respect to a fixed order) is equal to f(j) and only report in this case.
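The retrieval loop can be sketched in runnable form as follows. This is an interpretation of the algorithm in Fig. 7.7 under several assumptions, not ConQAT's implementation: tuples are stored as (file, offset relative to the current position), so that the "corrected" subset and intersection become plain set operations; an empty sentinel set is appended so that clones reaching the end of the file are reported; the inspected file is assumed to be in the index; the duplicate suppression described above is not included; and all names are hypothetical.

```python
import hashlib
from collections import defaultdict


def chunk_hash(statements):
    return hashlib.md5("\n".join(statements).encode("utf-8")).hexdigest()


def build_index(files, n):
    """Map each chunk hash to the set of (file, statement index)."""
    index = defaultdict(set)
    for name, statements in files.items():
        for pos in range(len(statements) - n + 1):
            index[chunk_hash(statements[pos:pos + n])].add((name, pos))
    return index


def report_clones(name, files, index, n):
    """Return (length, [(file, start index), ...]) clone groups for a file."""
    statements = files[name]
    c = []
    for i in range(len(statements) - n + 1):
        hits = index.get(chunk_hash(statements[i:i + n]), set())
        # Store positions relative to i: equal entries in c[i] and c[i + 1]
        # then denote the same clone instance, shifted by one statement.
        c.append({(f, pos - i) for f, pos in hits})
    c.append(set())  # sentinel (assumption): closes clones at end of file
    groups = []
    for i in range(len(c) - 1):
        previous = c[i - 1] if i > 0 else set()
        if len(c[i]) < 2 or c[i] <= previous:
            continue
        active = set(c[i])
        for j in range(i + 1, len(c)):
            remaining = active & c[j]
            if len(remaining) < len(active):
                length = (j - i) + n - 1  # chunks i..j-1 matched
                groups.append(
                    (length, sorted((f, off + i) for f, off in active)))
            active = remaining
            if len(active) < 2 or active <= previous:
                break
    return groups


files = {"X": list("abcdef"), "Y": list("pabcdq"), "Z": list("bcr")}
index = build_index(files, n=2)
print(report_clones("X", files, index, n=2))
```

In this toy setup, file X shares the four statements a, b, c, d with Y and the two statements b, c with both Y and Z, which yields one clone group of length 4 with two instances and one of length 2 with three instances.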
Index Maintenance. By index maintenance we refer to all steps required to keep the index up to date in the presence of code changes. For index maintenance, only two operations are needed, namely addition and removal of single files.⁶ Modifications of files can be reduced to a remove operation followed by an addition and index creation is just addition of all existing files starting from an empty index. In the index-based model, both operations are simple. To add a new file, it has to be read and preprocessed to produce its sequence of normalized statements. From this sequence, all possible contiguous sequences of length n (where n is the chunk length) are generated, which are then hashed and inserted as tuples into the index. Similarly, the removal of a file consists of the removal of all tuples that contain the respective file. Depending on the implementation of the index, the addition and removal of tuples might cause additional processing steps (such as rebalancing search trees, or recovering freed disk space), but these are not considered here.
Implementation Considerations. Details on index implementation and an analysis of the complexity of the algorithm can be found in [97]. We omit it here, as its overall performance strongly depends on the structure of the analyzed system. Its practical suitability thus needs to be determined using measurements on real-world software, which are reported below.
Scalability and Performance: Batch Clone Detection. To evaluate performance and scalability of both the suffix tree-based and the index-based algorithm, we executed both on the same hardware, with the same settings, analyzed the same system and compared the results. Both algorithms are configured to operate on statements as units. For the index-based algorithm, we used an in-memory clone index implementation.
We used the 11 MLOC of C code of the Linux Kernel in version 2.6.33.2 as study object. Both algorithms detect the same 60,353 clones in 25,663 groups for it. To evaluate scalability, we performed several detections, each analyzing increasing amounts of code. We analyzed between 500 KLOC and 10 MLOC and incremented by 500 KLOC for each run. The measurements were carried out on a Windows machine with 2.53 GHz, Java 1.6 and a heap size of 1 GB. The results are depicted in Figure 7.9. It shows the number of statements (instead of the lines of code) on the x-axis, since they more accurately determine runtime. 500 KLOC, e.g., correspond to 141K statements.
⁶ This simplification makes sense only if a single file is small compared to the entire code base, which holds for most systems. If a system only consists of a few huge files, more refined update operations would be required.
Figure 7.9: Performance of type-2 clone detection (execution time in seconds over system size in statements, for suffix tree-based detection, index-based detection and statement creation)
The time required to create the statement units (including disk I/O, scanning and normalization) is depicted in red. It dominates the runtime for both algorithms. The runtimes of the suffix tree-based and index-based detection algorithms (including statement unit creation) are depicted in blue and green, respectively. For both algorithms, runtimes increase linearly with system size. The suffix tree-based algorithm is faster. It should thus be used if batch detection gets performed on a single machine and sufficient memory is available. Otherwise, the index-based algorithm is preferable.
Scalability and Performance: Real-Time Clone Detection. We investigated the suitability for real-time clone detection on large code that is modified continuously on the same hardware as above. We used a persistent clone index implementation based on Berkeley DB⁷, a high-performance embedded database.
We measured the time required to (1) build the index, (2) update the index in response to changes to the system, and (3) query the index. For this, we analyzed version 3.3 of the Eclipse SDK (42,693,793 LOC in 209,312 files). We timed index-creation to measure (1). To measure (2), we removed 1,000 randomly selected files and re-added them afterwards. For (3), we queried the index for all clone groups of 1,000 randomly selected files.
Table 7.1 depicts the results. Index creation, including writing the clone index to the database, took 7 hours and 4 minutes. The clone index occupied 5.6 GB on disk. Index update, including writing to the database, took 0.85 seconds per file on average. Finally, queries for all clone groups for a file took 0.91 seconds on average. Median query time was 0.21 seconds. Only 14 of the 1,000 files had a query time of over 10 seconds. On average, the files had a size of 3 kLOC and queries for them returned 350 clones.
⁷ http://www.oracle.com/technology/products/berkeley-db/index.html

The results indicate that our approach is capable of supporting real-time clone management: the index can be created during a single nightly build. (Afterwards, the index can be updated to changes and does not need to be recreated.) The average time for a query is, in our opinion, fast enough to support interactive display of clone information when a source file is opened in the IDE. Finally, the performance of index updates allows for continuous index maintenance, e.g., triggered by commits to the source code repository or save operations in the IDE.

Table 7.1: Clone management performance
Index creation (complete)    7 hr 4 min
Index query (per file)       0.21 sec median
                             0.91 sec average
Index update (per file)      0.85 sec average

Scalability and Performance: Distributed Clone Detection. We evaluated the distribution on multiple machines using Google's computing infrastructure. The employed index is implemented on top of Bigtable [34], a key-value store supporting distributed access. Details on the implementation on Google's infrastructure can be found in [97].

We analyzed third party open source software, including, e.g., WebKit, Subversion, and Boost (73.2 MLOC of Java, C, and C++ code in 201,283 files in total). We executed both index creation and coverage calculation as separate jobs, both on different numbers of machines⁸. In addition, to evaluate scalability to ultra-large code bases, we measured index construction on 1000 machines on about 120 million C/C++ files indexed by Google Code Search⁹, comprising 2.9 GLOC¹⁰.

Using 100 machines, index creation and coverage computation for the 73.2 MLOC of code took about 36 minutes. For 10 machines, the processing time is still only slightly above 3 hours. The creation of the clone index for the 2.9 GLOC of C/C++ sources in the Google Code Search index required less than 7 hours on 1000 machines.

We observed a saturation of the execution time for both tasks. Towards the end of the job, most machines are waiting for a few machines which had a slightly larger computing task caused by large files or files with many clones. The algorithm thus scales well up to a certain number of machines. Additional measurements (cf. [97]) revealed that using more than about 30 machines for retrieval does not make sense for a code base of the given size. However, the large job processing 2.9 GLOC demonstrates the (absence of) limits for index construction.

⁸ The machines have Intel Xeon processors from which only a single core was used, and the task allocated about 3 GB RAM on each.
⁹ http://www.google.com/codesearch
¹⁰ More precisely 2,915,947,163 lines of code.

7.3.4 Type-3 Clones in Sequences

ConQAT implements a novel algorithm to detect type-3 clones in sequences. The task of the detection algorithm is to find common substrings in the unit sequence, where common substrings are not required to be exactly identical, but may have an edit distance bounded by some threshold. This problem is related to the approximate string matching problem [109, 221], which is also investigated extensively in bioinformatics [215]. The main difference is that we are not interested in finding an approximation of only a single given word in the string, but rather are looking for all substrings approximately occurring more than once in the entire sequence.

The algorithm constructs a suffix tree of the unit sequence and then performs an edit-distance-based approximate search for each suffix in the tree. It employs the same suffix tree as the algorithm that searches for type-1 and type-2 clones from Section 7.3.2, but employs a different search.

Detection Algorithm. A sketch of our detection algorithm is shown in Figures 7.10 and 7.11. Clones are identified by the procedure search that recursively traverses the suffix tree. Its first two parameters are the sequence s we are working on and the position start where the search was started, which is required when reporting a clone. The parameter j (which is the same as start in the first call of search) marks the current end of the substring under inspection. To prolong this substring, the substring starting at j is compared to the next word w in the suffix tree, which is the edge leading to the current node v (for the root node we just use the empty string). For this comparison, an edit distance of at most e operations (fifth parameter) is allowed. For the first call of search, e is the edit distance maximally allowed for a clone. If the remaining edit operations are not enough to match the entire edge word w (else case), we report the clone as far as we found it. Otherwise, the traversal of the tree continues recursively, increasing the length (j − start) of the current substring and reducing the number e of edit operations available by the amount of operations already spent.

proc detect(s, e)
Input: String s = (s_0, ..., s_n), max edit distance e
  Construct suffix tree T from s
  for each i ∈ {1, ..., n} do
    search(s, i, i, root(T), e)
Figure 7.10: Outline of approximate clone detection algorithm

A suffix tree for the sequence abcdXabcYd$ is displayed in Figure 7.12, that contains the type-3 clones abcd and abcYd. For an edit distance of 1, the algorithm matches the type-3 clones abcd and abcYd, depicted in blue. From node n6, the labels dX$abcYd$ and Yd$ are compared. If Y is removed (indicated in orange), both labels start with d. The label abc from n0 to n6 can thus be prolonged by d for n1 and Yd for n7. The induced clones to n8 and n10 are again excluded. The induced clone d, at node n13, is not reachable through a suffix link. However, it still does not get reported, since the search only starts at positions in the word that are not covered by other clones. Hence, no search starts for d, since it is covered by the above clone group. Due to its local search strategy, the algorithm does not guarantee to find globally optimal edit sequences.

To make this algorithm work and its results usable, some details have to be fleshed out. To compute the longest edit distance match, we use the dynamic programming algorithm found in algorithm
proc search(s, start, j, v, e)
Input: String s = (s_0, ..., s_n), start index of current search, current search index j,
       node v of suffix tree over s, max edit distance e
  Let (w_1, ..., w_m) be the word along the edge leading to v
  Calculate the maximal length l ≤ m, so that
    there is a k ≥ j where the edit distance e′ between
    (w_1, ..., w_l) and (s_j, ..., s_k) is at most e
  if l = m then
    for each child node u of v do
      search(s, start, k + m, u, e − e′)
  else if k − start ≥ minimal clone length then
    report substring from start to k of s as clone
Figure 7.11: Search routine of the approximate clone detection algorithm

Figure 7.12: Suffix tree for sequence abcdXabcYd$
textbooks. While easy to implement, it requires quadratic time and space¹¹. To make this step efficient, we look at most at the first 1000 statements of the word w. As long as the word on the suffix tree edge is shorter, this is not a problem. In case there is a clone of more than 1000 statements, we find it in chunks of 1000. We consider this to be tolerable for practical purposes. As each suffix we are running the search on will of course be part of the tree, we also have to make sure that no self matches are reported.
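
The edit-distance step of the search routine can be illustrated with a small sketch. It uses the textbook quadratic dynamic program; the function name, the tuple it returns, the exclusive end index k and the tie-breaking rule (prefer the longest matched stretch of the sequence) are assumptions made for this illustration and are not taken from ConQAT.

```python
def longest_prefix_match(word, seq, j, max_edits, window=1000):
    """Return (l, k, cost): the longest prefix word[:l] that matches some
    seq[j:k] (k exclusive) with an edit distance of at most max_edits."""
    word = word[:window]            # only look at the first `window` units
    m, n = len(word), len(seq) - j
    # dist[a][b] = edit distance between word[:a] and seq[j:j+b]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for b in range(n + 1):
        dist[0][b] = b
    best = (0, j, 0)
    for a in range(1, m + 1):
        dist[a][0] = a
        for b in range(1, n + 1):
            substitute = dist[a - 1][b - 1] + (word[a - 1] != seq[j + b - 1])
            dist[a][b] = min(substitute, dist[a - 1][b] + 1, dist[a][b - 1] + 1)
        # cheapest alignment of word[:a]; ties prefer the longer stretch of seq
        cost, neg_b = min((dist[a][b], -b) for b in range(n + 1))
        if cost > max_edits:
            break                   # row minima never decrease, so stop here
        best = (a, j - neg_b, cost)
    return best


exact = longest_prefix_match("abcd", "abcYd", 0, 0)
approximate = longest_prefix_match("abcd", "abcYd", 0, 1)
```

With the strings from Figure 7.12, an edit budget of 0 stops after abc, while a budget of 1 matches abcd against abcYd.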

When running the algorithm as is, the results are often not as expected because it tries to match as many statements as possible. However, allowing for edit operations right at the beginning or at the end of a clone is not helpful, as then every exact clone can be prolonged into a type-3 clone. We thus enforce the first few statements (how many can be parameterized) to match exactly. This also speeds up the search, as we can choose the correct child node at the root of the suffix tree in one step without looking at all children. The last statements are also not allowed to differ, which is checked for and corrected just before reporting a clone.

With these optimizations, the algorithm can miss a clone either due to the thresholds (either too short or too many inconsistencies), or if it is covered by other clones. The latter case is important, as each substring of a clone is of course again a clone and we usually do not want these to be reported.

¹¹ It can be implemented using only linear space, but preserving the full calculation matrix allows some simplifications.

[Figure 7.13 (plot): time in seconds (0 to 10000) over system size in MLOC (0 to 6)]
Figure 7.13: Runtime of type-3 clone detection

Scalability and Performance. To assess the performance of the entire clone detection pipeline, we executed ConQAT to detect type-3 clones on the source code of Eclipse¹², limiting detection to a certain amount of code. Our results on an Intel Core 2 Duo 2.4 GHz running Java in a single thread with 3.5 GB of RAM are shown in Figure 7.13. We use a minimal clone length of 10 statements, a maximal edit distance of 5 and a gap ratio of 0.2¹³. It can handle the 5.6 MLOC of Eclipse in about 3 hours. This is fast enough to be executed during a nightly build.

7.3.5 Clones in Data-Flow Graphs

ConQAT implements a novel algorithm to detect clones in graphs. In this section, we formalize clone detection in graph-based models and describe an algorithm for solving it. Our approach comprises two steps. First, it extracts clone pairs (i.e., parts of the model that are equivalent); second, it clusters pairs to also find substructures occurring more than twice.
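
Problem Definition. Detection operates on a normalized model graph $G = (V, E, L)$. We define a clone pair as a pair of subgraphs $(V_1, E_1)$, $(V_2, E_2)$ with $V_1, V_2 \subseteq V$ and $E_1, E_2 \subseteq E$, so that the following conditions hold:

1. There are bijections $\iota_V : V_1 \to V_2$ and $\iota_E : E_1 \to E_2$, so that for each $v \in V_1$ it holds $L(v) = L(\iota_V(v))$ and for each $e = (x, y) \in E_1$ it is both $L(e) = L(\iota_E(e))$ and $(\iota_V(x), \iota_V(y)) = \iota_E(e)$.
2. $V_1 \cap V_2 = \emptyset$
3. The graph $(V_1, E_1)$ is connected.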
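
The three conditions can be checked mechanically for a candidate pair. The following sketch assumes a hypothetical representation that is not taken from ConQAT: node labels in a dict, edge labels in a dict keyed by (source, target), and the node bijection given as a dict. The edge bijection is derived from the node bijection, as the first condition requires, and connectivity is checked on the undirected view of the edges.

```python
def is_clone_pair(node_labels, edge_labels, node_map, edges1):
    """Check conditions 1-3 for V1 = keys of node_map, V2 = its values."""
    v1, v2 = set(node_map), set(node_map.values())
    if not v1 or len(v2) != len(v1):
        return False                      # node_map must be a bijection
    if v1 & v2:
        return False                      # condition 2: no overlap
    if any(node_labels[v] != node_labels[node_map[v]] for v in v1):
        return False                      # condition 1: node labels agree
    for x, y in edges1:
        image = (node_map.get(x), node_map.get(y))
        if image not in edge_labels or edge_labels[image] != edge_labels[(x, y)]:
            return False                  # condition 1: edge labels agree
    # condition 3: (V1, E1) is connected, ignoring edge direction
    adjacent = {v: set() for v in v1}
    for x, y in edges1:
        adjacent[x].add(y)
        adjacent[y].add(x)
    start = next(iter(v1))
    reached, stack = {start}, [start]
    while stack:
        for nxt in adjacent[stack.pop()] - reached:
            reached.add(nxt)
            stack.append(nxt)
    return reached == v1


labels = {"a1": "Add", "g1": "Gain", "a2": "Add", "g2": "Gain"}
edges = {("a1", "g1"): "signal", ("a2", "g2"): "signal"}
valid = is_clone_pair(labels, edges, {"a1": "a2", "g1": "g2"}, {("a1", "g1")})
```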

¹² Core of Eclipse Europa release 3.3. The code size is smaller than mentioned in Section 7.3.3, since we only analyzed the core code and excluded other projects from the Eclipse ecosystem that were part of the analysis in Section 7.3.3.
¹³ The gap ratio is the ratio of the edit distance w.r.t. the length of the clone.

For $V_1, V_2 \subseteq V$, we say that they are in a cloning relationship, iff there are $E_1, E_2 \subseteq E$ so that $(V_1, E_1)$, $(V_2, E_2)$ is a clone pair.

The first condition of the definition states that those subgraphs must be isomorphic with regard to the labels L; the second one rules out overlapping clones; the last one ensures we are not finding only unconnected blocks distributed arbitrarily through the model. Note that we do not require them to be complete subgraphs (i.e., contain all induced edges).

The size of the clone pair denotes the number of nodes in $V_1$. The goal is to find all maximal clone pairs, i.e., all such pairs which are not contained in any other pair of greater size.

While this problem seems to be similar to the well-known NP-hard Maximum Common Subgraph (MCS) problem (also called Largest Common Subgraph in [75]), it is slightly different in that we only deal with one graph (while MCS looks for subgraphs in two different graphs) and we do not only want to find the largest subgraph, but all maximal ones.

Detecting Clone Pairs. Since the problem of finding the largest clone pair is NP-complete, we cannot expect to find an efficient (polynomial time) algorithm that enumerates all maximal clone pairs, at least not for models of realistic size. Instead, ConQAT employs a heuristic approach.

Figure 7.14 gives an outline of the algorithm. It iterates over all possible pairings of nodes and proceeds in a breadth-first search (BFS) from there (lines 4-12). It manages the sets C of current node pairs in the clone, S of nodes seen in the current BFS, and D of node pairs we are done with. Line 9, which is optional, skips the currently built clone pair, if we find a pair of nodes we have already seen before. This was introduced as we found that clones reported this way are often similar to others already found (although with different "extensions") and thus rather tend to clutter the output.

The main difference between our heuristic and an exhaustive search (such as the backtracking approach given in [172]) is in line 7: we only inspect one possible mapping of the nodes' neighborhoods to each other. To find all clone pairs, we would have to inspect all possible mappings and perform backtracking. Even only two different mappings quickly lead to an exponential time algorithm in this case, which will not be capable of handling thousands of nodes.

Thus, for each pair of nodes $(u, v)$, we only consider one mapping P of their adjacent blocks. All block pairs $(x, y)$ of P must fulfill the following two conditions:

$$L(x) = L(y) \tag{7.1}$$

$$\bigl((u, x), (v, y) \in E \text{ and } L((u, x)) = L((v, y))\bigr) \text{ or } \bigl((x, u), (y, v) \in E \text{ and } L((x, u)) = L((y, v))\bigr) \tag{7.2}$$

As we are only looking at a single assignment out of many, it is important to choose the "right" one. This is accomplished by the similarity function described in the following section.

Input: Model graph G = (V, E, L)
 1  D := ∅
 2  foreach (u, v) ∈ V × V with u ≠ v ∧ L(u) = L(v) do
 3    if {u, v} ∉ D then
 4      Queue Q := {(u, v)}, C := {(u, v)}, S := {u, v}
 5      while Q ≠ ∅ do
 6        dequeue pair (w, z) from Q
 7        from the neighborhood of (w, z) build a list of
          node pairs P for which the conditions (7.1, 7.2) hold
 8        foreach (x, y) ∈ P do
 9          if (x, y) ∈ D then continue with loop at line 2
10          if x ≠ y ∧ {x, y} ∩ S = ∅ then
11            C := C ∪ {(x, y)}, S := S ∪ {x, y}
12            enqueue (x, y) in Q
13      report node pairs in C as clone pair
14      D := D ∪ C
Figure 7.14: Heuristic for detecting clone pairs
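
The heuristic of Figure 7.14 can be sketched in a few lines. This is an illustration under simplifying assumptions, not ConQAT's implementation: the graph is given as two dicts (node labels, edge labels keyed by (source, target)), the single neighborhood mapping of line 7 is built greedily instead of by the similarity-driven matching, unordered node pairs are visited in sorted order, and the `min_size` parameter is a hypothetical stand-in for the minimal clone size.

```python
from collections import deque


def detect_clone_pairs(labels, edges, min_size=2):
    # neighbours[n] holds (direction, edge label, adjacent node) triples
    neighbours = {n: [] for n in labels}
    for (src, dst), edge_label in edges.items():
        neighbours[src].append(("out", edge_label, dst))
        neighbours[dst].append(("in", edge_label, src))

    def candidate_pairs(w, z):
        # Line 7: ONE mapping of the neighbourhoods satisfying (7.1) and (7.2)
        used, pairs = set(), []
        for direction, edge_label, x in neighbours[w]:
            for other_direction, other_label, y in neighbours[z]:
                if (y not in used and labels[x] == labels[y]
                        and (direction, edge_label) == (other_direction, other_label)):
                    used.add(y)
                    pairs.append((x, y))
                    break
        return pairs

    done, clone_pairs = set(), []                     # D and the reported pairs
    nodes = sorted(labels)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if labels[u] != labels[v] or frozenset((u, v)) in done:
                continue
            queue, current, seen = deque([(u, v)]), [(u, v)], {u, v}
            abandoned = False
            while queue and not abandoned:
                w, z = queue.popleft()
                for x, y in candidate_pairs(w, z):
                    if frozenset((x, y)) in done:     # optional line 9
                        abandoned = True
                        break
                    if x != y and x not in seen and y not in seen:
                        current.append((x, y))
                        seen.update((x, y))
                        queue.append((x, y))
            if abandoned:
                continue
            if len(current) >= min_size:
                clone_pairs.append(current)
            done.update(frozenset(pair) for pair in current)
    return clone_pairs


labels = {"a1": "Add", "g1": "Gain", "o1": "Out", "a2": "Add", "g2": "Gain", "o2": "Out"}
edges = {("a1", "g1"): "sig", ("g1", "o1"): "sig", ("a2", "g2"): "sig", ("g2", "o2"): "sig"}
found = detect_clone_pairs(labels, edges)
```

On two identical three-block chains, the sketch reports a single clone pair that maps each block of the first chain to its counterpart in the second.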

The Similarity Function. The idea of the similarity function $\sigma : V \times V \to [0, 1]$ is to have a measure for the structural similarity of two nodes which not only captures the normalization labels, but also their neighborhood. We use the similarity in two places. First, we visit the node pairs in the main loop in the order of decreasing similarity, as a high $\sigma$ value is more likely to yield a "good" clone. Second, in line 7, we try to build pairs with a high similarity value. This is a weighted bipartite matching with $\sigma$ as weight, which can be solved in polynomial time [185].

For two nodes u, v, we define a function $s_i(u, v)$ that intuitively captures the structural similarity of all nodes that are reachable in exactly i steps, by

$$s_0(u, v) = \begin{cases} 1 & \text{if } L(u) = L(v) \\ 0 & \text{otherwise} \end{cases}$$

and

$$s_{i+1}(u, v) = \begin{cases} \dfrac{M_i(u, v)}{\max\{|N(u)|, |N(v)|\}} & \text{if } L(u) = L(v) \\ 0 & \text{otherwise} \end{cases}$$

where $N(u)$ denotes the set of nodes adjacent to u (its neighborhood); $M_i(u, v)$ denotes the weight of a maximal weighted matching between $N(u)$ and $N(v)$ using the weights provided by $s_i$ and respecting conditions (7.1) and (7.2).

We can show by induction that, for every i and pair $(u, v)$, it holds that $0 \le s_i(u, v) \le 1$, and thus defining

$$\sigma(u, v) := \sum_{i=0}^{\infty} \frac{1}{2^i} \, s_i(u, v)$$

is valid as the expression converges to a value between 0 and 1. The weighting with $\frac{1}{2^i}$ makes nodes near to the pair $(u, v)$ more relevant for the similarity. For practical applications, only the first few terms of the sum have to be considered and the similarity for all pairs can be calculated using dynamic programming.
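
A level-by-level computation of $s_i$ can be sketched as follows. The sketch rests on several assumptions that are not in the text: the graph is undirected and edges are unlabeled (so condition (7.2) is not checked), the maximal weighted matching is found by brute force over permutations (only workable for small neighborhoods), a pair with an empty neighborhood gets $s_{i+1} = 0$, and the sum is truncated after `depth` terms without any further normalization.

```python
from itertools import permutations


def similarity(labels, adjacency, depth=3):
    """Return ([s_0, ..., s_depth], truncated weighted sum) as dicts over node pairs."""
    nodes = list(labels)
    level = {(u, v): float(labels[u] == labels[v]) for u in nodes for v in nodes}
    levels = [level]
    for _ in range(depth):
        nxt = {}
        for u in nodes:
            for v in nodes:
                nu, nv = adjacency[u], adjacency[v]
                if labels[u] != labels[v] or not nu or not nv:
                    nxt[(u, v)] = 0.0     # empty neighbourhood: assumed 0
                    continue
                small, large = (nu, nv) if len(nu) <= len(nv) else (nv, nu)
                # brute-force maximal weighted matching using s_i as weights
                best = max(sum(level[(a, b)] for a, b in zip(small, perm))
                           for perm in permutations(large, len(small)))
                nxt[(u, v)] = best / len(large)
        level = nxt
        levels.append(level)
    total = {pair: sum(s[pair] / 2 ** i for i, s in enumerate(levels))
             for pair in levels[0]}
    return levels, total


labels = {"u": "A", "x1": "B", "x2": "C", "v": "A", "y1": "B",
          "w": "A", "z1": "B", "z2": "C"}
adjacency = {"u": ["x1", "x2"], "x1": ["u"], "x2": ["u"],
             "v": ["y1"], "y1": ["v"],
             "w": ["z1", "z2"], "z1": ["w"], "z2": ["w"]}
levels, total = similarity(labels, adjacency, depth=2)
```

In this example, u and w have identical neighborhoods while v lacks one neighbor, so the pair (u, w) scores higher than (u, v) from the first level on.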

Figure 7.15: A partially hidden clone of cardinality 3

Clustering Clones. So far, we only find clone pairs. Subgraphs that are repeated n times will thus result in $n(n-1)/2$ clone pairs. Clustering aggregates those pairs into a single group.

While it seems straightforward to generalize the definition of a clone pair to n pairs of nodes and edges to get the definition of a clone group, we felt this definition to be too restrictive. Consider, e.g., clone pairs $(V_1, E_1)$, $(V_2, E_2)$ and $(V_3, E_3)$, $(V_2, E_4)$. Although there is a bijection between the nodes of $V_1$ and $V_3$, they are not necessarily clones of each other, as they might not contain the required edges. However, we consider this relationship to be still relevant to be reported, as when looking for parts of the model to be included in a library the blocks corresponding to $V_2$ might be a good candidate, as it could potentially replace two other parts.

So instead of clustering clones by exact identity (including edges), which would miss many interesting cases differing only in one or two edges, we perform clustering only on the sets of nodes. This is an overapproximation that can result in clusters containing clones that are only weakly related. However, as we consider manual inspection of clones to be important for deciding how to deal with them, those cases (which are rare in practice) can be dealt with there.

Thus, for a model graph $G = (V, E, L)$, we define a clone group of cardinality n as a set $\{V_1, \ldots, V_n\}$, so that for every $1 \le i < j \le n$ it is $V_i \subseteq V$ and there is a sequence $k_1, \ldots, k_m$ with $k_1 = i$, $k_m = j$, and $V_{k_l}$ and $V_{k_{l+1}}$ are in a clone relationship for all $1 \le l < m$ (i.e., there is a clone path between any two clones). The size of the clone group is the size of the set $V_1$, i.e., the number of duplicated nodes.

This boils down to a graph whose vertices are the node sets of the clone pairs and the edges are induced by the cloning relationship between them. The clone groups are then the connected components, which can be found using standard graph traversal algorithms; alternatively a union-find structure (see, e.g., [42]) allows the connected components to be built on-line, i.e., while clone pairs are being reported, without building an explicit graph representation.
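
The union-find variant can be sketched as follows. The representation of a clone pair as two collections of node identifiers and the function name are assumptions made for illustration; the sketch only shows how connected components emerge while pairs are processed one by one.

```python
def cluster_clone_pairs(pairs):
    """Group node sets that are linked by a chain of clone pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for first, second in pairs:             # can be fed as pairs are reported
        root_a, root_b = find(frozenset(first)), find(frozenset(second))
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node_set in list(parent):
        groups.setdefault(find(node_set), set()).add(node_set)
    return list(groups.values())


v1, v2, v3 = {"a1", "b1"}, {"a2", "b2"}, {"a3", "b3"}
v4, v5 = {"c1"}, {"c2"}
groups = cluster_clone_pairs([(v1, v2), (v3, v2), (v4, v5)])
```

Here V1 and V3 end up in the same clone group although they were never reported as a pair, because both are in a cloning relationship with V2.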

There are still two issues to be considered. First, while we defined clone pairs to be non-overlapping, clone groups can potentially contain overlapping block sets. This does not have to be a problem, since examples for this are rather artificial. Second, some clone groups are not found, since larger clone pairs hide some of the smaller ones. An example of this can be found in Figure 7.15, where equal parts of the model (and their overlaps) are indicated by geometric figures. We want to find the clone groups with cardinality 3 shown as circles. As the clone pair detection finds maximal clones however, when starting from nodes in circles 1 and 2, the clone pairs consisting of the pentagons will be found. Similarly, the circle pair 1 and 3 is hidden by the rectangle. So our pair detection reports the rectangle pair, the pentagon pair, and the circles 2 and 3.

We handle this in a final step by checking the inclusion relationship between the reported clone pairs. In the example, this reveals that the nodes from circle 2 are entirely contained in one of the pentagons and thus there has to be a clone of this circle in the other pentagon, too. Using this information (which analogously holds for the rectangle), we can find the third circle to get a clone group of cardinality 3. If there was an additional clone overlapping circles 2 and 3, we had no single clone pair of the circle clone group and thus this approach does not work for this case. However, we consider this case to be unlikely enough to ignore it.

Scalability. The time and space requirements for clone pair detection depend quadratically on the overall number of blocks in the model(s). While for the running time this might be acceptable (though not optimal) as we can execute the program in batch mode, the amount of required memory can be too much to even handle several thousand blocks.

To solve this, we split the model graph into its connected components. We independently detect clone pairs within each such component and between each pair of connected components, which still allows us to find all clone pairs we would find without this technique. This does not improve running time, as still each pair of blocks is looked at (although we might gain something by filtering out components smaller than the minimal clone size). The amount of memory needed, however, now only depends quadratically on the size of the largest connected component. If the model is composed of unconnected submodels, or if we can split the model into smaller parts by some other heuristic (e.g., separating subsystems on the topmost level), memory is, hence, no longer the limiting factor.

We measured performance for the industrial Matlab/Simulink model we analyzed during the case study presented in Chapter 5, which comprises 20,454 blocks: the entire detection process, including pre- and postprocessing, took 50 s on an Intel Pentium 4 3.0 GHz workstation. The algorithm thus scales well to real-world models.

7.4 Postprocessing

Postprocessing comprises the process steps that are performed on the detected clones before the results are presented to the user. In ConQAT, postprocessing comprises merging, filtering, metric computation and tracking.

7.4.1 Steps

Filtering removes clones that are irrelevant for the task at hand. It can be performed based on clone properties such as length, cardinality or content, or based on external information, such as developer-created blacklists.

Metric computation computes, e.g., clone coverage or overhead. It is performed after filtering.

Clone tracking compares clones detected on the current version of a system against those detected on a previous one. It identifies newly added, modified and removed clones. If tracking is performed regularly, beginning at the start of a project, it determines when each individual clone was introduced.

The following sections describe the postprocessing steps in more detail. Postprocessing steps are, in principle, independent of the artifact type. Each step (filtering, metric computation and clone tracking) can thus be performed for clones discovered in source code, requirements specifications or models. However, for conciseness, this section presents postprocessing for clones in source code. Since the same intermediate representation is used for both code and requirements specifications, all of the presented postprocessing features can also be applied to requirements clones. Most of them, in addition, are either available for clones in models as well, or could be implemented in a similar fashion.

7.4.2 Filtering

Filtering removes clone groups from the detection result. ConQAT performs filtering in two places: local filters are evaluated right after a new clone group has been detected; global filters are evaluated after detection has finished. While global filters are less memory efficient (the later a clone group is filtered, the longer it occupies memory), they can take information from other clone groups into account. They thus enable more expressive filtering strategies. ConQAT implements clone constraints based on various clone properties.

The NonOverlappingConstraint checks whether the code regions of sibling clones overlap. The SameFileConstraint checks if all siblings are located in a single file. The CardinalityConstraint checks whether the cardinality of a clone group is above a given threshold.

The ContentConstraint is satisfied for a clone group, if the content of at least one of its clones matches a given regular expression. Content filtering is, e.g., useful to search for clones that contain special comments such as TODO or FIXME; they often indicate duplication of open issues.

Constraints for type-3 clones allow filtering based on their absolute number of gaps or their gap ratio. If, e.g., all clones without gaps are filtered, detection is limited to type-3 clones. This is useful to discover type-3 clones that may indicate faults and convince developers of the necessity of clone management. Clones can be filtered both for satisfied or violated constraints.
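
The constraints described above can be pictured as predicates over clone groups. The sketch below is illustrative only: the `Clone` record, the function names and the predicate signatures are hypothetical and do not reflect ConQAT's actual classes. The `keep_satisfied` flag mirrors the statement that clones can be filtered for satisfied or for violated constraints.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    path: str
    start: int   # first line, inclusive
    end: int     # last line, inclusive
    text: str


def non_overlapping(group):
    """True if no two sibling clones in the same file share a line."""
    clones = sorted(group, key=lambda c: (c.path, c.start))
    return all(a.path != b.path or a.end < b.start
               for a, b in zip(clones, clones[1:]))


def same_file(group):
    return len({c.path for c in group}) == 1


def cardinality_above(group, threshold):
    return len(group) > threshold


def content_matches(group, pattern):
    return any(re.search(pattern, c.text) for c in group)


def filter_groups(groups, constraint, keep_satisfied=True):
    """Keep the groups that satisfy (or, if requested, violate) the constraint."""
    return [g for g in groups if bool(constraint(g)) == keep_satisfied]


g1 = [Clone("A.java", 1, 10, "// TODO fix rounding"),
      Clone("A.java", 8, 17, "// TODO fix rounding")]
g2 = [Clone("A.java", 20, 29, "x = 1;"), Clone("B.java", 1, 10, "x = 1;")]
with_open_issues = filter_groups([g1, g2], lambda g: content_matches(g, r"TODO|FIXME"))
```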

Blacklisting. Even if clone detection is tailored well, false positives may slip through. For continuous clone management, a mechanism is required to remove such false positives. To be useful, it must be robust against code modifications: a false positive remains a false positive independent of whether its file is renamed or its location in the file changes (e.g., because code above it is modified). It thus still needs to be suppressed by the filtering mechanism.

ConQAT implements blacklisting based on location-independent clone fingerprints. If a file is renamed, or the location of a clone in the file changes, the value of the fingerprint remains unchanged. For type-1 and type-2 clones, all clones in a clone group have the same fingerprint. A blacklist stores fingerprints of clones that are to be filtered. Fingerprints are added by developers that consider a clone irrelevant for their task. During postprocessing, ConQAT removes all clone groups whose fingerprint appears in the blacklist.¹⁴

Fingerprints are computed on the normalized content of a clone. The textual representation of the normalized units is concatenated into a single characteristic string. For type-1 and type-2 clones, all clones in a clone group have the same characteristic string; no clone outside the clone group has the same characteristic string, else it would be part of the first clone group. The characteristic string is independent of the filename or location in the file. Since it can be large for long clones, ConQAT uses its MD5 [192] hash as clone fingerprint to save space. Because of the very low collision probability of MD5, we do not expect to unintentionally filter clone groups due to fingerprint collisions.
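
A minimal sketch of fingerprint-based blacklisting, using Python's standard `hashlib`. The function names, the newline separator between units and the representation of a clone group as a list of normalized unit strings are assumptions for illustration; the text only states that the unit texts are concatenated and hashed with MD5.

```python
import hashlib


def clone_fingerprint(normalized_units):
    """MD5 of the characteristic string built from the normalized units."""
    characteristic = "\n".join(normalized_units)   # separator is an assumption
    return hashlib.md5(characteristic.encode("utf-8")).hexdigest()


def apply_blacklist(clone_groups, blacklist):
    """Drop every clone group whose fingerprint is blacklisted."""
    return [units for units in clone_groups
            if clone_fingerprint(units) not in blacklist]


# Normalized units carry no file name or line number, so the fingerprint
# survives renames and moves of the clone within its file.
getter = ["return id0 ;"]
loop = ["for ( id0 = lit ; id0 < id1 ; id0 ++ )", "id2 += id3 [ id0 ] ;"]
blacklist = {clone_fingerprint(getter)}
remaining = apply_blacklist([getter, loop], blacklist)
```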

Blacklisting works for type-1 and type-2 clones in source code and requirements specifications. It is currently not implemented for type-3 clones. However, their clone group fingerprints could be computed on the similar parts of the clones to cope with different gaps of type-3 clones.

Cross-Project Clone Filtering. Cross-project clone detection searches for clone groups whose clones span at least two different projects. The definition of project, in this case, depends on the context:

Cross-project clone detection can be used in software product lines to discover reusable code fragments that are candidates for consolidation [173]; or to discover clones between applications that build on top of a common framework to spot omissions. Projects in this context are thus individual products of a product family or applications that use the same framework.

To discover copyright infringement or license violations, it is employed to discover cloning between the code base maintained by a company and a collection of open source projects or software from other owners [77, 121]. Projects in this context are the company's code and the third party code.

ConQAT implements a CrossProjectCloneGroupsFilter that removes all clone groups that do not span at least a specified number of different projects. Projects are specified as path or package prefixes. Project membership is expressed via the location in the file system or the package (or name space) structure.

Figure 7.16 depicts a treemap that shows cloning across three different industrial projects¹⁵. Areas A, B and C mark project boundaries. Only cross-project clone groups are included. The project in the lower left corner does not contain a single cross-project clone, whereas the other two projects do. In both projects, most of it is, however, clustered in a single directory. It contains GUI code that is similar between both.

¹⁴ All blacklisted clone groups are optionally written to a separate report to allow for checks whether the blacklisting feature has been misused to artificially reduce cloning.
¹⁵ Section 7.5.1 explains how to interpret treemaps.

Figure 7.16: Cross-project clone detection results
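
The behavior described for the CrossProjectCloneGroupsFilter can be sketched as follows. The function names, the representation of a clone group as a list of file paths and the project names in the example are hypothetical; only the idea of mapping clones to projects via path prefixes and keeping groups that span enough projects comes from the text.

```python
def project_of(path, project_prefixes):
    """Return the name of the first project whose prefix matches, else None."""
    for name, prefix in project_prefixes.items():
        if path.startswith(prefix):
            return name
    return None


def filter_cross_project(clone_groups, project_prefixes, min_projects=2):
    """Keep only clone groups whose clones span at least min_projects projects."""
    kept = []
    for group in clone_groups:
        projects = {project_of(path, project_prefixes) for path in group}
        projects.discard(None)          # clones outside all configured projects
        if len(projects) >= min_projects:
            kept.append(group)
    return kept


prefixes = {"A": "src/product_a/", "B": "src/product_b/"}
groups = [
    ["src/product_a/gui/Form.java", "src/product_b/gui/Form.java"],
    ["src/product_a/core/X.java", "src/product_a/core/Y.java"],
]
cross_project = filter_cross_project(groups, prefixes)
```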

7.4.3 Metric Computation

ConQAT computes the cloning metrics introduced in Chapter 4, namely clone counts, clone coverage and overhead. Computation of counts and coverage is straightforward. Hence, only computation of overhead is described here in detail.

Overhead is computed as the ratio $\frac{SS}{RFSS} - 1$. If, for example, a statement in a source file is covered by a single clone that has two siblings, it occurs three times in the system. Perfect removal would eliminate two of the three occurrences. It thus only contributes a single RFSS.

RFSS computation is complicated by the fact that clone groups can overlap. RFSS computation only counts a unit in a source artifact $\frac{1}{\text{times cloned}}$ number of times. In the above example, each occurrence of the statement is thus only counted as $\frac{1}{3}$ RFSS. We employ a union-find data structure to represent cloning relationships at the unit level. All units that are in a cloning relationship are in the same component in the union-find structure, all other units are in separate ones. For RFSS computation, the units are traversed. Each unit accounts for $\frac{1}{\text{component size}}$ RFSS.
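
The unit-level computation can be sketched with a small union-find over unit identifiers. The function name and the input format (a list of units plus a list of unit pairs that are in a cloning relationship) are assumptions for illustration; exact fractions are used so that the arithmetic of the example above can be checked directly.

```python
from fractions import Fraction


def rfss_and_overhead(units, cloned_pairs):
    """Return (RFSS, overhead), where overhead = SS / RFSS - 1."""
    parent = {u: u for u in units}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for a, b in cloned_pairs:
        parent[find(a)] = find(b)

    component_size = {}
    for u in units:
        root = find(u)
        component_size[root] = component_size.get(root, 0) + 1

    # each unit accounts for 1 / (size of its component)
    rfss = sum(Fraction(1, component_size[find(u)]) for u in units)
    return rfss, Fraction(len(units)) / rfss - 1


# One statement that occurs three times (s1, s2, s3) plus three unique ones.
units = ["s1", "s2", "s3", "a", "b", "c"]
rfss, overhead = rfss_and_overhead(units, [("s1", "s2"), ("s2", "s3")])
```

For this input, the three cloned occurrences together contribute one RFSS, giving 4 RFSS for 6 statements and an overhead of 50%.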

7.4.4 Tracking

Clone tracking establishes a mapping between clone groups and clones of different (typically consecutive) versions of a software. Based on this mapping, clone churn (added, modified and removed clones) is computed. Tracking goes beyond fingerprint-based blacklisting, since it can also associate clones whose content has changed across versions. Since different content implies different fingerprints, such clones are beyond the capabilities of blacklisting.

ConQAT implements lightweight clone tracking to support clone control with clone churn information. The clone tracking procedure is based on the work by Göde [83]. It comprises three steps that are outlined in the following:

Update Old Cloning Information. Since the last clone detection was performed, the system may have changed. The cloning information from the last detection is thus outdated: clone positions might be inaccurate, some clones might have been removed while others might have been added. ConQAT updates old cloning information based on the edit operations that have been performed since the last detection, to determine where the clones are expected in the current system version.

ConQAT employs a relational database system to persist clone tracking information. Cloning information from the last detection is loaded from it. Then, for each file that contains at least one clone, the diff between the previous version (stored in the database) and the current version is computed. It is then used to update the positions of all clones in the file. For example, if a clone started in line 30, but 10 lines above it have been replaced by 5 new lines, its new start position is set to 25. If the code region that contained a clone has been removed, the clone is marked as deleted. If the content of a clone has changed between system versions, the corresponding edit operations are stored for each clone.
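
The position update can be illustrated with a small sketch. The edit format (first affected old line, number of removed lines, number of added lines) and the function name are hypothetical, and the handling of edits that only partially overlap a clone is deliberately simplified: such clones are flagged as modified and not shifted by that edit.

```python
def update_clone_position(start, length, edits):
    """Return (new_start, modified), or None if the clone region was removed.

    edits: iterable of (first_old_line, removed, added) in old line numbers.
    """
    end = start + length - 1
    shift, modified = 0, False
    for first, removed, added in edits:
        last = first + removed - 1
        if removed and first <= start and last >= end:
            return None                 # the whole clone region was deleted
        if last < start:
            shift += added - removed    # edit lies entirely above the clone
        elif first <= end:
            modified = True             # edit touches the clone itself
    return start + shift, modified


# The example from the text: 10 lines above the clone replaced by 5 new lines.
moved = update_clone_position(30, 8, [(10, 10, 5)])
```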

Detect New Clones. While the above step identifies old and removed clones, it cannot discover newly added clones in the system. For this purpose, in the second step, a complete clone detection is run on the current system version. It identifies all its clones.

Compute Churn. In the third step, updated clones are matched against newly detected ones to compute clone churn. We differentiate between these cases:

- Positions of updated clone and new clone match: this clone has been tracked successfully between system versions.
- New clone has no matching updated clone: tracking has identified a clone that was added in the new system version.
- Updated clone has no matching new clone: it is no longer detected in the new system version. The clone or its siblings have either been removed, or inconsistent modification prevents its detection. These two cases need to be differentiated, since inconsistent modifications need to be pointed out to developers for further inspection. Tracking distinguishes them based on the edit operations stored in the clones.

Churn computation determines the list of added and removed clones and of clones that have been consistently or inconsistently modified.
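
The case distinction can be sketched as a simple classification. The data layout is hypothetical: updated clones are keyed by their expected position and carry a flag that tells whether edit operations were stored for them. The rule that an undetected clone with stored edits counts as inconsistently modified, and one without as removed, is an assumption made for this sketch; the text only says that the stored edit operations are used to tell the two apart.

```python
def compute_churn(updated, detected):
    """updated: {position: has_stored_edits}; detected: set of positions."""
    churn = {"tracked": [], "added": [], "removed": [], "inconsistent": []}
    for position, has_edits in sorted(updated.items()):
        if position in detected:
            churn["tracked"].append(position)
        elif has_edits:
            churn["inconsistent"].append(position)   # assumed rule, see above
        else:
            churn["removed"].append(position)
    churn["added"] = sorted(p for p in detected if p not in updated)
    return churn


updated = {
    ("A.java", 25, 32): False,   # moved, still detected
    ("B.java", 10, 19): True,    # content edited, no longer detected
    ("C.java", 5, 14): False,    # no longer detected, no edits stored
}
detected = {("A.java", 25, 32), ("D.java", 1, 12)}
churn = compute_churn(updated, detected)
```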

7.5 Result Presentation

Different use cases require different ways of interacting with clone detection results. This section outlines how results are presented in a quality dashboard for clone control and in an IDE for interactive clone inspection and change propagation.

Similar to postprocessing, this section focuses on presentation of code clones; all presentations can be applied to requirements clones as well, since both share the same intermediate representation. Furthermore, in many cases, ConQAT either contains similar presentation functionality for model clones, or it could be implemented in a similar fashion.

7.5.1 Project Dashboard

Project dashboards support continuous software quality control. Their goal is to provide stakeholders, including project management and developers, with relevant and accurate information on the quality characteristics of the software they are developing [48]. For this, quality dashboards perform automated quality analyses and collect, filter, aggregate and visualize result data. Through its visual data flow language, ConQAT supports the construction of such dashboards. Clone detection is one of the key supported quality analyses.

Different stakeholder roles require different presentations of clone detection results. To support them, ConQAT presents clone detection result information on different levels of aggregation.

Clone Lists provide cloning information on the file level, as depicted in the screenshot in Figure 7.17. They reveal the longest clones and the clone groups with the most instances. While no replacement for clone inspection on the code level, clone lists allow developers to get a first idea about the detected clones without requiring them to open their IDEs.

Figure 7.17: Clone list in the dashboard

Treemaps [223] visualize the distribution of cloning across artifacts. They thus reveal to stakeholders which areas of their project are affected how much.

Treemaps visualize source code size, structure and cloning in a single image. We introduce their interpretation by constructing a treemap step by step. A treemap starts with an empty rectangle. Its area represents all project artifacts. In the first step, this rectangle is divided into sub-rectangles. Each sub-rectangle represents a component of the project. The size of the sub-rectangle corresponds to the aggregate size of the artifacts belonging to the component. The resulting visualization is depicted in Figure 7.18 on the left. The visualized project contains 24 components. For the largest ones, name and size (in LOC) are depicted. Since component GUI Forms (91 kLOC) is larger than component Business Logic, its rectangle occupies a proportionally larger area.

In the second step, each component rectangle is further divided into sub-rectangles for the individual artifacts contained in the component. Again, rectangle area and artifact size correspond. The result is depicted in Figure 7.18 on the right.

Figure 7.18: Treemap construction: artifact arrangement

Although position and size of the top-level rectangles did not change, they are hard to recognize due to the many individual rectangles now populating the treemap. The hierarchy between rectangles is, thus, obscured. To better convey their hierarchy, the rectangles are shaded in the third step, as depicted on the left of Figure 7.19.

Figure 7.19: Treemap construction: artifact coloring

In the last step, color is employed to reveal the amount of cloning an artifact contains and indicate generated code. More specifically, individual artifacts are colored on a gradient between white and red for a clone coverage between 0 and 1. Furthermore, code that is generated and not maintained by hand is colored dark gray. Figure 7.19 shows the result on the right. The artifacts in component GUI Forms contain substantial amounts of cloning, whereas the artifacts in the component on the bottom-left hardly contain any. The artifacts of the component Data Access are generated and thus depicted in gray, except for the two files in its left upper corner.
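
The coloring rule can be written down as a small mapping from clone coverage to an RGB value. The linear interpolation, the exact shade used for dark gray and the function name are assumptions for illustration; the text only fixes the two ends of the gradient and the special treatment of generated code.

```python
def artifact_color(coverage, generated=False):
    """Map clone coverage in [0, 1] to an (r, g, b) color for a treemap cell."""
    if generated:
        return (64, 64, 64)                    # dark gray, shade is assumed
    coverage = min(max(coverage, 0.0), 1.0)    # clamp out-of-range input
    fade = round(255 * (1.0 - coverage))
    return (255, fade, fade)                   # white (0) ... red (1)


no_cloning = artifact_color(0.0)
fully_cloned = artifact_color(1.0)
```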

ConQAT displays tooltips with details, including size and cloning metrics, for each file. The treemaps thus reveal more information in the tool than in the screenshots.

Trend Charts visualize the evolution of cloning metrics over time. They allow stakeholders to determine whether cloning increased or decreased during a development period. Figure 7.20 depicts a trend chart showing the development of clone coverage over time.

Figure 7.20: Clone coverage chart

Between April and May, clone coverage decreased since clones were removed. In May, new clones were introduced. After developers noticed this, the introduced clones were consolidated.

Clone Churn reveals clone evolution on the level of individual clones, which is required to diagnose the root cause of trend changes. Clone churn thus complements trend charts with more details. The screenshots in Figure 7.21 depict how clone churn information is displayed in the quality dashboard. On the left, the different churn lists are shown. For inspection of clones that have become inconsistent during evolution, the dashboard contains a view that displays their syntax-highlighted content and highlights differences. One such clone is shown in the screenshot on the right of Figure 7.21.

Figure 7.21: Clone churn in the quality dashboard

7.5.2 Interactive Clone Inspection

This section outlines ConQAT's interactive clone inspection features that allow developers to investigate clones inside their IDEs and to use cloning information for change propagation when modifying software that contains clones.

ConQAT implements a Clone Detection Perspective that provides a collection of views for clone inspection. The intended use case is one-shot investigation of cloning in a software system. A screenshot of the Clone Detection Perspective is depicted in Figure 7.22. Detailed documentation of the Clone Detection Perspective, including a user manual, is contained in the ConQAT Book [49] and outside the scope of this document. However, due to their importance for the case studies performed during this thesis, two views are explained in detail below.
The Clone Inspection View is the most important tool for inspecting individual clones on the code level. It implements syntax highlighting for all languages on which clone detection is supported. Furthermore, it highlights statement-level differences between type-3 clones. According to our experience, this view substantially increases productivity of clone inspection. We consider this crucial for case studies that involve developer inspection of cloned code.
The Clone Visualizer uses a SeeSoft visualization to display cloning information on a higher level of aggregation than the clone inspection view [63, 214]. It thus allows inspection of the cloning relationships of one or two orders of magnitude more code on a single screen.
Each bar in the view represents a file. The length of the bar corresponds to the length of its file. Each colored stripe represents a clone; all clones of a clone group have the same color. The length of the stripe corresponds to the length of the clone. This visualization reveals files with substantial mutual cloning through similar stripe patterns.
ConQAT provides two SeeSoft views. The clone family visualizer displays the currently selected file, all of its clones, and all other files that are in a cloning relationship with it. However, for the other files, only their clones with the selected file are displayed. The clone family visualizer thus supports a quick investigation of the amount of cloning a file shares with other files, as depicted in Figure 7.23.
Figure 7.22: Clone detection perspective
Figure 7.23: Clone family visualizer
The clone visualizer displays all source files and their clones. If the files are displayed in the order they occur on disk (or in the namespace), high-level similarities are typically too far separated to be recognized by the user. To cluster similar files, ConQAT orders them based on their amount of mutual cloning. Files that share many clones are, hence, displayed close to each other, allowing users to spot file-level cloning due to their similarly colored stripe patterns, as depicted in Figure 7.24.
Ordering files based on their amount of mutually cloned code can be reduced to the traveling salesperson problem: files correspond to cities, lines of mutually cloned code correspond to travel cost, and finding an ordering that maximizes the sum of mutually cloned lines between neighboring files corresponds to finding a maximally expensive travel route. Consequently, it is NP-complete [75]. ConQAT thus employs a heuristic algorithm to perform the sorting.
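The text does not state which heuristic ConQAT uses. As an illustration only, the following Python sketch applies a simple greedy nearest-neighbor strategy to the ordering problem described above. The function name, the input format (a dictionary keyed by unordered file pairs) and the example data are assumptions made for this sketch and are not taken from ConQAT.

```python
def order_by_mutual_cloning(files, shared_lines):
    """Greedy nearest-neighbor ordering (illustrative, not ConQAT's heuristic).

    files:        list of file names
    shared_lines: dict mapping frozenset({file_a, file_b}) to the number of
                  mutually cloned lines between the two files
    """
    def weight(a, b):
        return shared_lines.get(frozenset((a, b)), 0)

    remaining = list(files)
    if not remaining:
        return []
    # Start with the file that shares the most cloned lines overall.
    start = max(remaining,
                key=lambda f: sum(weight(f, g) for g in remaining if g != f))
    order = [start]
    remaining.remove(start)
    while remaining:
        # Append the unplaced file that shares the most cloning with the last one.
        best = max(remaining, key=lambda f: weight(order[-1], f))
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical input: A/C and B/D share many cloned lines.
shared = {
    frozenset(("A.java", "C.java")): 120,
    frozenset(("B.java", "D.java")): 80,
    frozenset(("C.java", "D.java")): 10,
}
ordering = order_by_mutual_cloning(
    ["A.java", "B.java", "C.java", "D.java"], shared)
print(ordering)
```

A greedy strategy like this runs in quadratic time and places strongly related files next to each other, but it gives no optimality guarantee; other TSP heuristics could be substituted.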
Figure 7.24: Clone visualizer with files ordered by mutual cloning
Clone Filtering Apart from postprocessing, clones can be filtered during inspection, so that developers do not have to wait until detection has been re-executed. Clones can be filtered based on a set of files or clone groups (both inclusively and exclusively), based on their length, number of instances, gap positions or blacklists. Clone filters are managed on a stack that can be displayed and edited in a view.
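To make the idea of a filter stack concrete, the following Python sketch models filters as named predicates over clone groups. It is a minimal illustration under stated assumptions: the classes CloneGroup and FilterStack, their fields and the example filters are hypothetical and do not reflect ConQAT's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneGroup:
    files: frozenset   # files containing an instance of this group
    length: int        # clone length, e.g. in statements
    instances: int     # number of clone instances


class FilterStack:
    """Stack of named predicates; a group is shown only if all accept it."""

    def __init__(self):
        self._filters = []

    def push(self, label, predicate):
        self._filters.append((label, predicate))

    def pop(self):
        # Remove the most recently added filter and return its label.
        return self._filters.pop()[0]

    def labels(self):
        return [label for label, _ in self._filters]

    def apply(self, groups):
        return [g for g in groups
                if all(pred(g) for _, pred in self._filters)]


groups = [
    CloneGroup(frozenset({"a.c", "b.c"}), length=12, instances=2),
    CloneGroup(frozenset({"gen/parser.c"}), length=40, instances=5),
    CloneGroup(frozenset({"a.c"}), length=6, instances=2),
]

stack = FilterStack()
stack.push("min length 10", lambda g: g.length >= 10)
stack.push("exclude gen/",
           lambda g: not any(f.startswith("gen/") for f in g.files))
visible = stack.apply(groups)            # both filters active
stack.pop()                              # drop the most recent filter
visible_after_pop = stack.apply(groups)  # only the length filter active
```

Because filters operate on the already detected clone groups, adding or removing one only requires re-evaluating the predicates, not re-running detection.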
Clone Indication The goal of clone indication is to provision developers with cloning information while they are maintaining software that contains cloning to reduce the rate of unintentionally inconsistent modifications. It is integrated into the IDE in which developers work to reduce the effort required to access cloning information. We have implemented clone indication for both Eclipse^16 and Microsoft Visual Studio.NET^17 [72].
After clone detection has been performed, ConQAT displays so-called clone region markers in the editors associated with the corresponding artifacts, as depicted in Figure 7.25.
Figure 7.25: Clone region marker indicates code cloning in editors.
Clone region markers indicate clones in the source code. A single bar indicates that exactly one clone instance can be found on this line; two bars indicate that two or more clone instances can be found. The bars are also color coded orange or red: orange bars indicate that all clones of the clone group are in this file; red bars indicate that at least one clone instance is in a different file. A right click on the clone region markers opens a context menu as shown in Figure 7.25. It allows developers to navigate to the siblings of the clone or open them in a clone inspection view.
16 www.eclipse.org
17 www.microsoft.com/VisualStudio2010
Figure 7.26: Clone indication in VS.NET.
Figure 7.26 depicts a screenshot of clone indication in Visual Studio.NET.
Tailoring Support For each iteration of the tailoring procedure, clone detection tailoring (cf. Section 8.2) requires computation of precision, and comparison of clone reports before and after tailoring. ConQAT provides tool support to make this feasible.
The order of the list of clone groups can be randomized. The first n clone groups then correspond to a random sample of size n. Each clone group can be rated as Accepted or Rejected. Both the list order and the rating are persisted when the clone report is stored. ConQAT can compute precision on the (sample of) rated clone groups.
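The sampling and precision computation described above can be sketched in a few lines of Python. This is an illustration, not ConQAT code: the function names, the seed parameter and the example ratings are assumptions. Precision is computed as the share of rated clone groups that were rated Accepted; unrated groups are ignored.

```python
import random

ACCEPTED, REJECTED = "Accepted", "Rejected"


def randomized_order(group_ids, seed):
    """Return the clone group ids in a reproducible random order."""
    order = list(group_ids)
    random.Random(seed).shuffle(order)
    return order


def precision(ratings):
    """Share of rated clone groups that were rated as Accepted."""
    rated = [r for r in ratings.values() if r in (ACCEPTED, REJECTED)]
    if not rated:
        raise ValueError("no rated clone groups")
    return rated.count(ACCEPTED) / len(rated)


# The first n entries of the randomized list form a random sample of size n.
order = randomized_order(range(100), seed=42)
sample = order[:4]
# Hypothetical developer ratings for the sampled groups.
ratings = dict(zip(sample, [ACCEPTED, ACCEPTED, REJECTED, ACCEPTED]))
sample_precision = precision(ratings)
```

Persisting the randomized order together with the ratings, as the text describes, keeps the sample stable when the report is reopened.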
To compare clone reports before and after tailoring, they can be subtracted from each other, revealing which clones have been removed or added through a tailoring step. Two different subtraction modes can be applied:
Fingerprint-based subtraction compares clone reports using their location-independent clone fingerprints. It can be applied when tailoring is expected to leave the positions and normalized content of detected clones intact, e.g., when the filters employed during post-processing are modified.
Clone-region-based subtraction compares clone reports based on the code regions covered by clones. It can be applied when tailoring does not leave positions or normalized content intact, e.g., when the normalization is changed or shapers are introduced that clip clones. The clone report produced by differencing contains clones that represent introduced or removed cloning relationships between code regions.
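As a rough illustration of fingerprint-based subtraction, the following Python sketch represents each clone group only by its normalized statements and derives a fingerprint from them. How ConQAT actually computes fingerprints is not described here; hashing the normalized content with SHA-1, as well as the function names and example data, are assumptions of this sketch.

```python
import hashlib


def fingerprint(normalized_units):
    """Location-independent fingerprint: hash of the normalized content only."""
    text = "\n".join(normalized_units)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def subtract_reports(before, after):
    """Return (removed, added) fingerprints between two clone reports.

    Each report is a list of clone groups; a group is represented here only
    by the list of its normalized statements.
    """
    before_fps = {fingerprint(g) for g in before}
    after_fps = {fingerprint(g) for g in after}
    return before_fps - after_fps, after_fps - before_fps


# Hypothetical reports before and after a tailoring step.
before = [["ID = LIT ;", "foo ( ID ) ;"], ["return ID ;", "}"]]
after = [["ID = LIT ;", "foo ( ID ) ;"], ["ID ++ ;", "bar ( ID ) ;"]]
removed, added = subtract_reports(before, after)
```

Since the fingerprint ignores file positions, this mode only gives meaningful results when tailoring leaves the normalized content unchanged, which matches the applicability condition stated above.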
Clone Tracking For deeper investigation of clone evolution, ConQAT supports interactive investigation of clone tracking results through a view that visualizes clone evolution, as depicted in Figure 7.27. Source code of clones can be opened for different software versions and clones of arbitrary versions can be compared with each other to facilitate comprehension of clone evolution. The visualization of clone evolution is loosely based on the visualization proposed by Göde in [83].
Figure 7.27: Interactive inspection of clone tracking
7.6 Comparison with other Clone Detectors
As stated in Section 3.3, a plethora of clone detection tools exists. The presentation of a novel tool, as done in this chapter, thus raises two questions. First, why was it developed? And, second, how does it compare to existing tools? This section answers both.
We created a novel tool, because no existing one was sufficiently extensible for our purposes. Both for our empirical studies, and to support clone control, we needed to adapt, extend or change characteristics of the clone detection process: tailoring affects both pre- and postprocessing; the novel algorithms affect the detection phase; and metric computation and tracking affect postprocessing. Since existing tools were either closed source, monolithic, not designed for extensibility or simply not available^18, we designed our own tool. Since it is available as open source, others that might be in a similar situation may build on top of it, as is, e.g., done by [96, 180, 186].
The second question, how the clone detection workbench compares to other tools, is more difficult to answer, since the comparison of clone detectors is non-trivial. In the next sections, we briefly summarize challenges and existing approaches to clone detector comparison and then describe our detector based on an existing qualitative framework. Furthermore, we show that it can achieve high precision and that its recall is not smaller than that of comparable detectors.
18 http://wwwbroy.in.tum.de/~ccsm/icse09/
7.6.1 Comparison of Clone Detectors
The comparison of clone detectors is challenging for many reasons [200]: the detection techniques are very diverse; we lack standardized definitions of similarity and relevance; target languages, and the systems written in them, differ strongly; and detectors are often very sensitive to their configuration or tuning of their parameters. To cope with these challenges, two different approaches have been proposed: a qualitative [200] and a quantitative [19] one.
Qualitative Approach In [200], Roy, Cordy and Koschke compare existing clone detectors qualitatively. Their comparison comprises two main elements. First, a system of categories, facets and attributes facilitates a structured description of the individual detectors. Second, mutation-based scenarios provide the foundation for a description of capabilities and shortcomings of existing approaches.
The qualitative comparison does not order the tools in terms of precision and recall. However, it does support users in their choice between different clone detectors: the systematic description and scenario-based evaluation provide detailed information on which such a choice can be founded, as the authors demonstrate exemplarily in [200]. We describe ConQAT using the description system and the scenario-based evaluation technique from [200] in Sections 7.6.2 and 7.6.3.
Quantitative Approach In [19], Bellon et al. propose a quantitative approach to compare clone detectors. They quantitatively compare the results of several clone detectors for a number of target systems. The clone detectors were configured and tuned by their original authors. A subset of the clone candidates was rated by an independent human oracle. The target systems, the detected clones and the rating results are all available.
In principle, the Bellon benchmark offers an appealing basis, since it yields a direct comparison of the clone detectors in terms of precision and recall. To add a new tool to the benchmark, however, its detected clone candidates need to be rated. To be fair, the rating oracle must behave similarly to the original oracle. Stefan Bellon, who rated the clones in the original experiment, was not involved in the development of any of the participating clone detectors. He thus represented an independent party. In contrast, if we rate the results of our own clone detector, we could be biased. Furthermore, from our experience, classification of clones in code that others have written, without knowledge about, e.g., the employed generators, is hard. We thus expect it to contain a certain amount of subjectivity. For example, the benchmark contains clones in generated code that Bellon rated as relevant. We consider them as false positives, however, since the code does not get maintained directly. Even if we were not biased, it is thus unclear how well our rating behavior would compare with Bellon's.
Alternatively, we could reproduce the benchmark with a collection of up-to-date tools and target systems. The reproduction in its original style requires participation of the original authors and is thus beyond the scope of this thesis. However, if we execute their detectors ourselves, the results are likely to be biased. We simply have a lot more experience with our own tool than with their detectors. A second quantitative approach, which employs a mutation-based benchmark [197], is not feasible either: neither the benchmark, nor results for many existing clone detectors are publicly available. We are thus unable to perform a reliable quantitative comparison of ConQAT and other clone detectors on the basis of existing benchmarks.
Instead, we chose a different approach. We computed a lower bound for the recall of ConQAT on the Bellon benchmark data. For this, we analyze whether ConQAT can be configured to detect the reference clones detected by other tools. This way, we do not need an oracle for the clone candidates detected by ConQAT. We detail computation of recall in Section 7.6.4.
In addition, we computed precision for the systems that we analyzed during the case studies in Chapter 4. Their developers took part in clone detection tailoring and in clone rating. For the 5 study objects, we determined precision for type-2 and type-3 clones separately. For type-2 clones, precision ranged between 0.88 and 1.00, with an average of 0.96. For type-3 clones, it ranged between 0.61 and 1.0, with an average of 0.83. Lower precision of type-3 clones is due to the larger deviation tolerated between them. Average precision of over 95% for type-2 clones is, from our experience, high enough for continuous application of clone detection in industrial environments.
We measure precision and recall independent of each other. Strictly speaking, these experiments thus do not show that ConQAT can achieve high precision and recall at the same time, since improvement of one could come at the cost of the other. Please refer to Section 8.7 for a case study that demonstrates that clone detection tailoring can improve precision and maintain recall.
7.6.2 Systematic Description
In this section, we describe our clone detection workbench using the categories and facets from [200]. For simplicity, we refer to the clone detection workbench simply as “ConQAT”. We describe each category from [200] in a separate paragraph. Facet names for each category are depicted in italics. To simplify comparison with the other tools listed in [200], we give the abbreviations from [200] for the individual attributes in a facet in parentheses.
Usage describes tool usage constraints. Platform: ConQAT is platform independent (P.a). We have executed it on Windows, Linux, Mac OS, Solaris and HP-UX. External Dependencies: The clone detection workbench is part of ConQAT (D.d). All components used by ConQAT are also platform independent, except the Microsoft Visual Studio integration, which depends on Microsoft Visual Studio. Availability: ConQAT is available as open source (A.a). Its license allows its use for both research (A.d) and commercial purposes (A.c).
Interaction describes interaction between the user and the tool. User Interface: ConQAT provides both a command line interface and a graphical interface (U.c). The graphical interface can be used both for configuration and execution, and for interactive inspection of the results. Output: ConQAT provides both textual coordinates of cloning information and different visualizations (O.c). IDE Support: ConQAT comprises plugins for Eclipse (I.a) and Microsoft Visual Studio (I.b).
Language describes the languages that can be analyzed. Language Paradigm: ConQAT is not limited to a specific language paradigm (LP.c). We have applied it, e.g., to object-oriented (LP.b), procedural (LP.a), functional (LP.e) and modeling languages (LP.f). Language Support: ConQAT supports the programming languages ABAP, Ada, COBOL (LS.f), C (LS.b), C++ (LS.c), C# (LS.d), Java (LS.e), PL/I, PL/SQL, Python (LS.g), T-SQL and Visual Basic (LS.i). Furthermore, it supports the modeling language Matlab/Simulink and 15 natural languages, including German and English.
Clone Information describes the clone information the tool can emit. Clone Relation: ConQAT directly yields clone groups for type-1 and type-2 clones in sequences (R.b). Postprocessing can merge clone groups based on different criteria, e.g., overlapping gaps in type-3 clones (R.d). For model clone detection, pairs are combined during the clustering phase. Clone Granularity: ConQAT can produce clones of free granularity (G.a) or fixed granularity, if shapers are used. Shapers can trim clones to classes (G.e), functions/methods (G.b), basic blocks (G.c, G.d) or match arbitrary keywords or other language characteristics (G.g). Clone Type: ConQAT can detect type-1 (CT.a), type-2 (CT.b) and type-3 (CT.c) clones for code. Furthermore, it can detect model clones (CT.e).
Technical Aspects describe properties of the detection algorithms. Comparison Algorithm: ConQAT offers different detection algorithms, including a suffix tree based one for type-2 clones (CA.a), a suffix tree based one for type-3 clones that computes edit distance (CA.n) and an index-based one for type-2 clones (CA.q). Furthermore, it offers a subgraph-matching one for models (CA.k). Comparison Granularity: ConQAT supports different comparison granularities, namely lines (CU.a), tokens (CU.d), statements (CU.e) and model elements (CU.k). Worst Case Computational Complexity: The complexity depends on the employed algorithms. Please refer to Section 7.3 for details.
Adjustment describes the level of configurability of the tool. Pre-/Postprocessing: The open architecture of ConQAT allows configuration (including replacement) of all detection phases. Heuristic/Thresholds: ConQAT offers configurable thresholds for clone length (H.a) and gap size (H.c). Filters can be used to prune results (H.d). Normalization can be adapted to change the employed notion of similarity when comparing clones (H.b).
Processing describes how the tool analyzes, represents and transforms the target program for analysis. Basic Transformation/Normalization: Normalization is very configurable. It can, e.g., perform the following: optional removal of whitespace and comments (T.b, T.c); optional normalization of identifiers, types and literal values (T.e, T.f, T.g); and language specific transformations (T.h). Code Representation: Code can be represented as filtered strings in which comments may be removed (CR.d) or normalized tokens or token sequences (CR.f). Program Analysis: For text-based clone detection, ConQAT only requires regular expressions to filter input, e.g., remove comments (PA.b). For token or statement-based detection, ConQAT employs scanners (PA.d). ConQAT implements scanners for all languages listed under the “Language Support” facet above. For shaping, ConQAT employs shallow parsing (PA.c).
Evaluation describes how the tool has been evaluated. Empirical Validation: ConQAT has been employed in a number of empirical studies as reported in this thesis (E.b). Availability of Empirical Results: Many of the projects we analyzed with ConQAT are closed source. The detected clones thus cannot be published. Instead, we published aggregated results (AR.b). The results of the open source study object from Chapter 4 are available. The study can be reproduced (AR.a). Subject Systems: Most systems we analyzed are closed source (S.g).
7.6.3 Scenario-Based Evaluation
In this section, we evaluate ConQAT on the cloning scenarios from [200]. To make this section self-contained, we first repeat the scenarios from [200]. Then we describe the capabilities and limitations of ConQAT for each scenario.
Scenarios Each scenario describes hypothetical program editing steps that, according to the authors, are representative for typical changes to copy & pasted code. Each edit sequence creates a clone from an original. All clones produced by the edit steps from scenario 1 are type-1 clones; scenario 2 yields 3 type-2 clones and 1 type-3 clone (S2(d)). Scenarios 3 and 4 yield type-3 clones, of which some are simions. Figure 7.28 shows the original in the middle and the clones, ordered by scenario, around it.
In the following sections, we first restate the scenario descriptions from [200] and then describe the capabilities and limitations of ConQAT for them. Afterwards, we discuss crosscutting aspects.
Scenario 1 from [200]: “A programmer copies a function that calculates the sum and product of a loop variable and calls another function, foo() with these values as parameters three times, making changes in whitespace in the first fragment (S1(a)), changes in commenting in the second (S1(b)), and changes in formatting in the third (S1(c)).”
Using the suffix tree or index-based detection algorithms for type-1 and type-2 clones, ConQAT can produce a single clone group that contains the original, S1(a), S1(b) and S1(c) as clones. For this, configure normalization to remove whitespace and comments, but not to normalize identifiers or literal values.
Scenario 2 from [200]: “The programmer makes four more copies of the function, using a systematic renaming of identifiers and literals in the first fragment (S2(a)), renaming the identifiers (but not necessarily systematically) in the second fragment (S2(b)), renaming data types and literal values (but not necessarily consistent) in the third fragment (S2(c)) and replacing some parameters with expressions in the fourth fragment (S2(d)).”
Using the same detection algorithms, ConQAT can produce a single clone group that contains the original, S2(a), S2(b) and S2(c) (and, in addition, S1(a-c)). For this, configure normalization to normalize identifiers (which takes care of S2(a) and S2(b)), type keywords and literal values (which takes care of S2(c)).
The last clone in this scenario, S2(d), is not of type-2, but of type-3. We discuss it in scenario 3.
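The effect of the normalization settings discussed for scenarios 1 and 2 can be illustrated with a small token-based sketch in Python. It is not ConQAT's scanner: the regular expression, the keyword and type lists, the placeholder tokens ID, LIT and TYPE, and the shortened code fragments (modeled on the sumProd example of [200]) are all assumptions made for illustration.

```python
import re

TOKEN = re.compile(r"""
    //[^\n]*             # line comment
  | /\*.*?\*/            # block comment
  | \d+(?:\.\d+)?        # numeric literal
  | [A-Za-z_]\w*         # identifier or keyword
  | <=|>=|==|!=|\+\+|--  # two-character operators
  | \S                   # any other non-space character
""", re.VERBOSE | re.DOTALL)

KEYWORDS = {"for", "while", "if", "else", "return", "void"}
TYPES = {"int", "float", "double", "char", "long"}


def normalize(code, identifiers=False, literals=False, types=False):
    """Tokenize C-like code; whitespace and comments are always dropped."""
    out = []
    for tok in TOKEN.findall(code):
        if tok.startswith("//") or tok.startswith("/*"):
            continue
        if tok[0].isdigit():
            out.append("LIT" if literals else tok)
        elif tok in TYPES:
            out.append("TYPE" if types else tok)
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            out.append("ID" if identifiers else tok)
        else:
            out.append(tok)
    return out


ORIGINAL = "float sum=0.0; //C1\nfor (int i=1; i<=n; i++)\n{sum=sum + i;}"
S1A = "float sum = 0.0;   // other comment\nfor (int i=1; i<=n; i++) {\n  sum=sum + i; }"
S2A = "float s=0.0; //C1\nfor (int j=1; j<=n; j++)\n{s=s + j;}"
S2C = "int sum=0; //C1\nfor (int i=1; i<=n; i++)\n{sum=sum + i;}"

same_type1 = normalize(ORIGINAL) == normalize(S1A)
same_renamed = (normalize(ORIGINAL, identifiers=True)
                == normalize(S2A, identifiers=True))
same_retyped = (normalize(ORIGINAL, types=True, literals=True)
                == normalize(S2C, types=True, literals=True))
```

With whitespace and comment removal alone, only the scenario 1 style variant becomes identical to the original; the renamed and retyped variants additionally require identifier, type and literal normalization, mirroring the configurations named above.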
Figure 7.28: Scenarios from [200] (the original function sumProd(int n) in the middle, surrounded by its copy & paste variants S1(a)-S1(c), S2(a)-S2(d), S3(a)-S3(e) and S4(a)-S4(d))
Scenario 3 from [200]: “The programmer makes five more copies of the function and this time makes small insertions within a line in the first fragment (S3(a)), small deletions within a line in the second fragment (S3(b)), inserts some new lines in the third fragment (S3(c)), deletes some lines from the fourth fragment (S3(d)), and makes changes to some whole lines in the fifth fragment (S3(e))”.
The differences in these clones go beyond what ConQAT can eliminate through normalization. Hence, the algorithms for type-2 clone detection cannot detect them as complete clones of the original. However, the type-3 clone detection algorithm of ConQAT can be configured to detect them. For example, if configured to run on statements and to operate with an edit distance of 1, it can detect S3(a), S3(b), S3(c), S3(d) and S3(e) as clones of the original (and, with sufficient normalization, as clones of S1 and S2(a-c)). Since all of the resulting clone groups contain the original as common clone, ConQAT's postprocessing can optionally be configured to merge them into a single group.
If executed with an edit distance of 2, ConQAT also detects S2(d) as a clone of the original. However, clones from scenario 4 also only have a statement-level edit distance of 2 from the original. ConQAT thus cannot be configured in a way that does detect S2(d) as a clone of the original, but does not detect any clones from S4 as clones from the original.
Scenario 4 from [200]: “The programmer makes four more copies of the function and this time reorders the data independent declarations in the first fragment (S4(a)), reorders data independent statements in the second (S4(b)), reorders data dependent statements in the third (S4(c)), and replaces a control statement with a different one in the fourth (S4(d)).”
Clones S4(a), S4(b) and S4(c) have a statement-level edit distance of 2 from the original, clone S4(d) a distance of 3. ConQAT can detect them, if configured with a sufficiently large edit distance. As above, it cannot be made to detect clones in S4 but not S2(d) or vice versa.
Discussion Several configuration options influence ConQAT's results for all scenarios. A very small minimal clone length, say 2 statements, can produce groups that cover all 17 code fragments in the scenario. Too small minimal clone lengths can thus result in poor task-specific accuracy.
In addition, several tailoring effects are not obvious in the scenarios. First, ConQAT can produce clones that cross method boundaries. Shaping can be employed to avoid this. However, shaping can reduce recall, if the resulting clone fragments are shorter than the minimal clone length threshold. Second, increasing the edit distance for type-3 clone detection can also increase the number of false positives, since a high edit distance tolerates substantial difference on the code level. To a certain degree, this can be compensated with relative edit distance thresholds that take clone length into account (which is also supported by ConQAT).
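The notions of statement-level edit distance and of absolute versus relative thresholds can be illustrated with a plain Levenshtein computation over statement sequences. Note that this is a swapped-in textbook technique for illustration; ConQAT's type-3 detection is suffix tree based and is described elsewhere in this chapter. The function names, parameters and the simplified statement lists are assumptions of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of statements."""
    prev = list(range(len(b) + 1))
    for i, stmt_a in enumerate(a, start=1):
        cur = [i]
        for j, stmt_b in enumerate(b, start=1):
            cost = 0 if stmt_a == stmt_b else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def within_threshold(a, b, max_abs=None, max_rel=None):
    """Check an absolute and/or length-relative edit distance threshold."""
    d = edit_distance(a, b)
    if max_abs is not None and d > max_abs:
        return False
    if max_rel is not None and d > max_rel * max(len(a), len(b)):
        return False
    return True


# Simplified statement sequences modeled on the sumProd example.
original = ["float sum=0.0;", "float prod=1.0;", "for (int i=1; i<=n; i++) {",
            "sum=sum+i;", "prod=prod*i;", "foo(sum, prod);"]
line_deleted = [s for s in original if s != "prod=prod*i;"]
reordered = original[:3] + ["prod=prod*i;", "sum=sum+i;"] + original[5:]

d_deleted = edit_distance(original, line_deleted)
d_reordered = edit_distance(original, reordered)
```

For these six-statement fragments, a relative threshold of 0.2 admits one edit but rejects two, whereas the same relative threshold on a much longer clone would tolerate proportionally more edits. This is the compensation effect mentioned above.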
7.6.4 Recall
In this section, we show that the recall of ConQAT is not lower than the recall of existing text-based or token-based clone detectors. To do this, we compute a lower bound for the recall of ConQAT based on the Bellon benchmark data.
Study Design The Bellon benchmark database contains reference clone pairs that Bellon rated as relevant for 8 systems (4 written in C, 4 written in Java, compare Table 7.2). One text-based detector (Duploc [62]) and two token-based detectors (CCFinder [121] and Dup [6]) participated in the benchmark. We compare ConQAT against the results of these detectors to investigate how ConQAT compares to clone detectors that employ a similar detection approach.
We selected all clone pairs produced by Duploc, CCFinder and Dup that are rated as relevant by Bellon from the benchmark. They represent the set of reference clone pairs. Then we executed ConQAT on the 8 systems to produce the candidate clone groups and compared them against the reference clone pairs. We computed the percentage of the reference clone pairs that are contained in the candidate clone groups as a lower bound for the recall of ConQAT. It is a lower bound, since potentially relevant candidate clone groups that are detected by ConQAT but not by the other tools are ignored.
Table 7.2: Recall (lower bound) w.r.t. benchmark
Program                  Language  Size (SLOC)  Recall
weltab                   C         11K          0.98
cook                     C         80K          0.91
snns                     C         115K         0.94
postgresql               C         235K         0.78
netbeans-javadoc         Java      19K          0.94
eclipse-ant              Java      35K          0.92
eclipse-jdtcore          Java      148K         0.92
j2sdk1.4.0-javax-swing   Java      204K         0.86
Implementation and Execution We determined the reference clone pairs by extracting their positions from the benchmark database. We executed ConQAT with a tolerant configuration (minimal clone length of 5 statements, strong normalization) for type-2 clone detection on the study objects to produce the candidate clone groups.
Matching of reference pairs and candidate groups is performed as follows. A reference clone pair is considered as matched, if a candidate clone group contains two clones that exactly match its positions. Since we noticed slight line offsets for some of the clones in the benchmark, we tolerate deviations in clone start and end position of up to 2 lines. If, for example, a reference clone starts in line 10 and ends in line 21, it is matched by clone candidates that start in lines 8-12 and end in lines 19-23. (However, for a reference clone pair to be matched, both matching candidate clones need to belong to the same group.)
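The position-based matching rule can be written down compactly. The following Python sketch is an illustration under assumptions: the Clone tuple, the function names and the example positions are hypothetical, and candidate clone groups are modeled simply as lists of clones.

```python
from typing import NamedTuple


class Clone(NamedTuple):
    file: str
    start: int   # first line of the clone
    end: int     # last line of the clone


TOLERANCE = 2    # tolerated deviation of start and end line


def positions_match(ref, cand, tol=TOLERANCE):
    return (ref.file == cand.file
            and abs(ref.start - cand.start) <= tol
            and abs(ref.end - cand.end) <= tol)


def pair_is_matched(ref_pair, candidate_groups, tol=TOLERANCE):
    """A reference pair is matched if one group contains both of its clones."""
    first, second = ref_pair
    for group in candidate_groups:
        hits_first = [c for c in group if positions_match(first, c, tol)]
        hits_second = [c for c in group if positions_match(second, c, tol)]
        # Both reference clones must be matched by distinct candidates.
        if any(a != b for a in hits_first for b in hits_second):
            return True
    return False


ref = (Clone("A", 10, 21), Clone("B", 40, 51))
same_group = [[Clone("A", 8, 23), Clone("B", 41, 50)]]
split_groups = [[Clone("A", 8, 23)], [Clone("B", 41, 50)]]
offset_too_large = [[Clone("A", 7, 21), Clone("B", 40, 51)]]

matched_same = pair_is_matched(ref, same_group)
matched_split = pair_is_matched(ref, split_groups)
matched_offset = pair_is_matched(ref, offset_too_large)
```

The second example shows why the group condition matters: both candidates lie within tolerance, but since they belong to different groups, no cloning relationship between them was detected.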
For reference clone pairs that are not matched this way, we compute a match metric based on their lines. For each pair of lines between which a clone relationship exists, we check whether the same relationship also exists in the candidate clones. We illustrate this for a reference clone pair with the first clone in file A, lines 10-15, and the second clone in file B, lines 20-27. For it, we check for pairs (A:10, B:20), (A:11, B:21), (A:12, B:22), (A:13, B:23), (A:14, B:24) and (A:15, B:25). In our example, the first 4 pairs are also covered by a pair of candidate clones, yielding a clone match metric value of 4/6 = 0.67. We aggregate the clone match metrics according to the numbers of line pairs. This way, the match metric captures the change propagation use case encountered during clone management. If a developer fixes a bug in cloned code, and the sibling clones can be detected by one of Dup, Duploc and CCFinder, the metric determines the percentage with which ConQAT could also detect the clones.
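The worked example above can be reproduced with a short Python sketch. Several details are assumptions of the sketch rather than statements from the text: lines of the two clones are paired position by position, pairing stops at the end of the shorter clone (which is consistent with the six pairs listed for clones of 6 and 8 lines), and aggregation is a weighting by the number of line pairs. All names are hypothetical.

```python
def line_pairs(first, second):
    """Pair up the lines of two clones position by position.

    Each clone is a (file, start_line, end_line) tuple; pairing stops at the
    end of the shorter clone (an assumption of this sketch).
    """
    (file_a, start_a, end_a), (file_b, start_b, end_b) = first, second
    return [((file_a, la), (file_b, lb))
            for la, lb in zip(range(start_a, end_a + 1),
                              range(start_b, end_b + 1))]


def match_metric(ref_pair, candidate_pairs):
    """Return (covered line pairs, total line pairs) for one reference pair."""
    covered = set()
    for cand_first, cand_second in candidate_pairs:
        for left, right in line_pairs(cand_first, cand_second):
            covered.add((left, right))
            covered.add((right, left))   # clone relation is symmetric
    ref = line_pairs(*ref_pair)
    hits = sum(1 for pair in ref if pair in covered)
    return hits, len(ref)


def aggregate(results):
    """Aggregate several (hits, total) results, weighted by line pairs."""
    return sum(h for h, _ in results) / sum(t for _, t in results)


reference = (("A", 10, 15), ("B", 20, 27))
candidates = [(("A", 10, 13), ("B", 20, 23))]
hits, total = match_metric(reference, candidates)
metric = round(hits / total, 2)
overall = aggregate([(hits, total), (10, 10)])
```

Weighting by line pairs means that long reference clones influence the aggregated value more than short ones, which fits the change propagation use case described above.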
Results The results are depicted in Table 7.2. For postgresql and j2sdk1.4.0-javax-swing, the recall value is below 90%. Manual inspection of the missed reference clones in these projects revealed that many of them are in generated code.^19 For the other projects, the measured recall was above 90%.
19 Generated code is often highly redundant. For postgresql and j2sdk1.4.0-javax-swing, the matching process did not work well for clones in generated code, since candidate clones near the reference clones were longer or shorter or had slightly offset positions, so that the line pairs did not match.
Discussion For 6 out of 8 projects, we measured a recall of over 90%. In addition, we compared ConQAT not to a single tool, but to a union of three comparable tools. In the original benchmark, many clones were only found by one or two tools. The results, together with the fact that the measure is a lower bound and that it compares against the joint results of three tools, thus demonstrate that ConQAT can be configured to have a recall similar to that of the text-based and token-based tools that participated in the Bellon benchmark.
7.6.5 Summary
In this section, we described our clone detection workbench according to the framework for qualitative clone detector comparison by Roy, Cordy and Koschke [200]. This has two purposes. First, it makes its capabilities and limitations explicit. Second, it supports its comparison with the tools in [200] and thus supports users in their choice among different clone detectors. Furthermore, we have demonstrated that ConQAT can be configured to achieve similar recall values as the text-based and token-based clone detectors that participated in the Bellon benchmark. A quantitative comparison of code clone detection with other detectors, as well as a thorough investigation of precision and recall for requirements specifications and models, remains a topic for future work.
7.7 Adoption and Maturity
The clone detection tool support described in this chapter is available as open source at http://www.conqat.org/. For source code clone detection, it currently supports the programming languages ABAP, Ada, COBOL, C++, C#, Java, PL/I, PL/SQL, Python, T-SQL and Visual Basic. For detection in natural language texts, stemming is supported for 15 languages, including German and English. At the time of writing, it has been downloaded over 18,000 times.
Since the tool support has matured beyond the stage of a research prototype, several companies have included it into their development or quality assessment processes, including ABB, Bayerisches Landeskriminalamt, BMW, Capgemini sd&m, itestra GmbH, Kabel Deutschland, Munich Re and Wincor Nixdorf.
7.8 Summary
This chapter presented the tool support proposed by this thesis that enables clone detection for different artifact types, including source code, requirements specifications and models. Results are presented in a customizable quality dashboard to support clone control with overview and trend information. Tooling for interactive clone inspection, in addition, supports in-depth inspection of clones. Plugins for two state of the art IDEs support developers in consistently performing changes to cloned code. Since it has matured beyond the stage of a research prototype, several companies have included it into their development or quality assessment processes.
Through its pipes & filters architecture, the clone detection workbench provides a family of clone detection tools that can be customized to suit different tasks. This flexibility and extensibility, and its availability as open source, have supported research not only by us, but also by others [24, 96, 104, 180, 186].
As part of the clone detection workbench, this chapter introduced novel detection algorithms for type-2 and type-3 clones that can be applied to detect clones in source code and in requirements specifications. It moreover introduced the first scalable detection algorithm for clones in dataflow models such as Matlab/Simulink.
The clone detection workbench, including the novel algorithms, provided the foundation for the experiments and case studies presented in this thesis. The type-3 clone detection approach enabled the analysis of the impact of unawareness of cloning on program correctness (Chapter 4). The model clone detection algorithm made the study of the extent of cloning in Matlab/Simulink models (Chapter 5) possible. Finally, the entire workbench provides the basis for the method of clone assessment and control presented in the next chapter.
8 Method for Clone Assessment and Control
This chapter introduces a method for clone assessment and control. Its goals are twofold: first, to inform stakeholders about the extent and impact of cloning in their software to allow for a substantiated decision on how cloning needs to be controlled; second, to alleviate the negative impact of cloning during software maintenance.
The first part of the chapter introduces the method, the second part its validation and evaluation. We demonstrate the applicability and effectiveness of the method through a longitudinal case study at Munich Re Group, where the application of clone assessment and control successfully reduced the amount of cloning in a large business information system. Parts of the content of this chapter have been published in [116].
8.1 Overview
This section outlines the goal and the steps of the clone assessment and control method that are presented in detail in the following sections. While, in principle, the method can be applied to cloning in other artifacts as well, this chapter focuses on cloning in source code.
The clone assessment and control method involves the roles quality engineer and developer. The quality engineer operates the clone detection tools and guides through clone assessment. The developer provides necessary system knowledge for the evaluation of clone relevance and evolution. Both roles can, in principle, be performed by the same person. Since they require different expertise, however, they are typically performed by different persons in practice. The method has two goals:
Goal 1 Inform stakeholders about the extent, impact and causes of cloning in their software.
Goal 2 Alleviate the negative impact of cloning during software maintenance.
The method comprises five steps. Steps one to three pursue goal 1, steps four and five goal 2:
Step 1: Clone Detection Tailoring The quality engineer performs clone detection tailoring to achieve accurate clone detection results. During tailoring, the quality engineer incorporates developer feedback on the relevance of the detected clone candidates into the detection process to eliminate false positives. The results of this step are accurate clone detection results.
137
8MethodforCloneAssessmentandControl
Step2:AssessmentofImpactThequalityengineercomputesasetofmetricsthatquantify
theTheeresultxtentofofthiscloningstepandisthusallowtheforquanti®cationinterpretationofofthetheimpactimpactofofcloningcloningononmaintenancemaintenanceactiactivitiesvities.
correctness.programand
Step 3: Root Cause Analysis. The quality engineer analyzes detected clones and interviews developers to identify the major causes for cloning. The result of this step is a list of causes of cloning.

After clone assessment, the system stakeholders interpret the cloning metrics and causes to decide how to control cloning to reduce the negative impact of cloning on software development.

Step 4: Introduction of Clone Control. Both the quality engineers and the developers introduce clone control into their processes. The results of this step are thus modified development and maintenance processes and habits.

Introduction of clone control into a software development project means change, not only to processes and tools, but also to established habits. For clone control to be applied successfully, it is thus not enough to overcome technical challenges. Instead, success hinges on whether habits are adapted accordingly. The steps to introduce clone control build on existing work on organizational change management [43, 130, 143–145, 152, 153, 225] to incorporate best practices on how to coerce established habits into new paths.

Step 5: Continuous Clone Control. The developers inspect the evolution of cloning on a regular basis to confirm that the control measures have taken the desired effect and, if necessary, schedule consolidation measures.
8.2 Clone Detection Tailoring

This section first introduces clone coupling as an explicit criterion to evaluate relevance of clones for software maintenance. Based thereon, it introduces clone detection tailoring as a procedure to achieve accurate clone detection results. Its goal is to remove false positives (clone candidates that are irrelevant to software maintenance due to a very low coupling) from the detection results, while keeping relevant clones, to improve accuracy.
8.2.1 Clone Coupling

The fundamental characteristic of relevant clones causing problems for software maintenance is their change coupling, i.e., the fact that changes to one clone may also need to be performed to its siblings. This change coupling is the root cause for increased modification effort and for the risk of introducing bugs due to inconsistent changes to cloned code, requirements specifications or models during software maintenance.

The coupling between clone candidates has a direct impact on software maintenance efforts. If clone candidates are coupled, each change to one also needs to be performed to its siblings. Each time one clone candidate is changed, effort is required for location, consistent modification and testing of the other clone candidate(s). In case the others are not modified, an inconsistency is introduced into the system. If the change was a bug fix, the unchanged clones still contain the bug. If, on the other hand, clone candidates are not coupled, a change to one never affects its siblings, requiring no additional effort for location, modification and testing.

This impact of cloning on modification effort is largely independent of other characteristics of clone candidates such as, e.g., their removability. Consequently, due to its implications for maintenance efforts, we propose to employ clone coupling as a criterion to evaluate the relevance of clone candidates for software maintenance.
8.2.2 Determining Clone Coupling

To use clone coupling as a relevance criterion, we need a procedure to determine it on real-world software systems. To be useful in practice, this procedure needs to be broadly applicable. We propose to employ developer assessments of clone candidate groups to estimate coupling, since they are not restricted to a specific system type, programming language, or analysis infrastructure. More specifically, assessors have to answer the following question:

Relevance Question 1: If you modify a clone candidate during maintenance, do you want to be informed about its siblings to be able to modify them accordingly?

This way, developers estimate whether they get a positive return on their effort to inspect the siblings when performing a modification to a clone candidate. The question partitions assessed clone candidate groups into two classes: relevant clone groups whose expected coupling is high enough to impede software maintenance, and groups whose expected coupling is so low that they are irrelevant to software maintenance.
8.2.3 Tailoring Procedure

The steps of the tailoring procedure are depicted in Figure 8.1. First, the quality engineer executes the clone detector with a tolerant initial configuration that aims to maximize recall. Second, developers assess coupling of the detected clone group candidates to identify false positives. Coupling is assessed on a sample of the candidate clone groups, since assessment of all clones is typically too expensive (footnote 1). All candidate clone groups classified as uncoupled are treated as false positives. If no false positives are found, clone detection tailoring is complete.

If false positives are found, the clone detector configuration needs to be adapted to reduce the amount of false positives in the detection results. Which strategy is used for this typically depends on the detected false positives. The clone detector is then executed with the adapted configuration.

Footnote 1: As shown by the case study presented in Section 8.7, sampling does not negatively affect tailoring results.
[Figure 8.1: Steps of the tailoring method. Flowchart with the nodes: Run clone detector; Assess clone candidates; False positives? (No: Done; Yes: Re-configure clone detector); Re-run clone detector; Compare before and after; Accuracy? (Yes/No).]
To determine the effect of the re-configuration on result quality, the quality engineer compares results before and after re-configuration. More specifically, the quality engineer inspects whether the clone groups considered relevant are still contained in, and whether the irrelevant candidate clone groups are removed from, the new detection results. If the improvement of result accuracy is not satisfying, both re-configuration and result evaluation are repeated. In case tailoring does not succeed in achieving perfect precision and recall on the sampled candidate clones, one may be forced to make trade-offs on either precision or recall. From our experience, however, precision can be increased substantially without damaging recall; the case study presented in Section 8.7 confirms this.

In some cases, the majority of the candidate clone groups in the assessed sample are false positives, e.g., if the analyzed system contains a large amount of generated code. Even if they can successfully be removed in a single tailoring step, a further tailoring round may be required, since the original sample contained too few relevant clones to conclusively estimate precision. In this case, tailoring continues with another assessment (and possibly re-configuration, ...) step.
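
The tailoring procedure described above can be sketched as a loop. This is a minimal illustration, not ConQAT's implementation: the callbacks `detect`, `assess` and `reconfigure`, the string identifiers for clone groups and the shape of `config` are all assumptions introduced for this example. In practice, `assess` stands for a developer answering Relevance Question 1 and `reconfigure` for a manual step by the quality engineer.

```python
import random
from typing import Callable, Dict, List, Set


def tailor(
    detect: Callable[[Dict], List[str]],
    assess: Callable[[str], bool],
    reconfigure: Callable[[Dict, Set[str]], Dict],
    config: Dict,
    sample_size: int = 50,
    max_rounds: int = 5,
    seed: int = 0,
) -> Dict:
    """Detect, assess a sample, re-configure, and compare until no false positives remain."""
    rng = random.Random(seed)
    for _ in range(max_rounds):
        candidates = sorted(detect(config))
        # Assessing all candidate clone groups is typically too expensive, so sample.
        sample = rng.sample(candidates, min(sample_size, len(candidates)))
        relevant = {g for g in sample if assess(g)}      # coupled clone groups
        false_positives = set(sample) - relevant         # uncoupled clone groups
        if not false_positives:
            return config                                # tailoring is complete
        new_config = reconfigure(config, false_positives)
        new_results = set(detect(new_config))
        # Compare before and after: accept the new configuration only if
        # every relevant sampled group is still detected (recall not damaged).
        if not relevant <= new_results:
            break  # a precision/recall trade-off is needed; left to the quality engineer
        config = new_config
    return config
```

The loop stops either when a sample contains no false positives or when a re-configuration would lose relevant clone groups, which is the point at which the text notes that a trade-off between precision and recall may be required.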
8.2.4 Taxonomy of False Positives

We give a short taxonomy of false positives based on the experiences gathered during clone detection tailoring in several industrial projects. It provides the basis of false positives characterization, which is the prerequisite of clone detector reconfiguration.

No conceptual relationship. The clone candidates are not implementations of a common concept: no concept change can give rise to update anomalies. Hence, no coupled changes can occur that could result in inconsistencies.

Inconsistent manual modification impossible. Although a common concept can exist in this case, consistency of coupled changes is enforced by some means. For example, clone candidates in generated code are, upon change, regenerated consistently; a compiler enforces consistency between an interface and a NullObject implementation. Hence, no inconsistencies can be introduced through manual maintenance.

Artifacts that contain clone candidates are irrelevant. If code, specifications or models are no longer used, potential inconsistencies cannot do harm, at least as long as the artifact in question remains out of use.

While the likelihood of their appearance probably differs, these classes of false positives are not limited to a specific artifact type: overly tolerant detection can find clone candidates in code, models and requirements specifications that lack similar concepts; generators are not limited to source code or models, but are also employed to generate requirements specification documents from requirements management tools, possibly replicating information.

Importantly, the categories of the above taxonomy are orthogonal to the categorization of clone types for code or models that classify them based on the syntactic nature of their differences [86, 140]: type-1 clone candidates are no more likely to be relevant than type-3 clone candidates, if the file that contains them is no longer used. The crucial information, namely that the file is no longer used, is independent of the syntactic features of the clone candidate. Consequently, we cannot expect the problem of imperfect precision to be solved through the development of better detection algorithms that improve detection for certain syntactic classes. Instead, we need to identify other features to characterize false positives to exclude them.
8.2.5 Characterizing False Positives

Successful tailoring requires the identification of features that are characteristic for (a certain set of) false positives. Once they are known, the clone detector can be configured to handle artifact fragments that exhibit these features specially. Any attributes of source code, requirements specifications or models can, in principle, be candidates for such features. Examples include: the location in the namespace or directory structure; file name or file extension patterns; implemented interfaces or super types; occurrence of specific patterns in the source code, e.g., "This code was generated by a tool."; characteristic ways of structuring, e.g., sequences of constant declarations; identifiers of methods or types; location or role in the architecture.

There is no single, canonic way to determine characteristic features. However, we found that the reasons why developers consider candidate clones irrelevant often yield clues. We give examples for code clones in the following:

Code is unused: it will not be maintained. How can such dead code be recognized? Does it carry, e.g., Obsolete annotations as commonly encountered for .NET systems, or do affected types reside in a special namespace? If not, can developers produce a list of files, directories, types or namespaces that contain unused code?

Code is not maintained by hand since it is generated and regenerated upon change. Is generated code in a special folder or does it use a special file name or extension? Does it contain a signature string of the generator? If not, can it be made to do so?
Code has no conceptual relationship: maintenance is independent. This is typically encountered if the clone detector performs overly aggressive normalization, effectively removing all traces of the implemented concepts. Code then appears similar to the detector, despite the lack of a conceptual relationship that causes change coupling. Typical examples are regions of Java getters and setters or C# properties. Which language or system specific patterns can be used to recognize such code regions?

Compiler prevents inconsistent modifications. Examples are interfaces and NullObject pattern (footnote 2) implementations of the interfaces. Both interface and NullObject contain the same methods, down to identifiers and types. However, a developer is notified by the compiler that a change to the interface must be performed to the NullObject as well. The fact that the NullObject implements the interface can be a suitable characteristic.

Similar characteristics can often be found for irrelevant clone candidates contained in requirements specifications or models. As detailed in the tailoring case study for cloning in requirements specifications presented in Chapter 5, false positives could be recognized by patterns matching their content or their surrounding text.
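
Characteristics of this kind can often be written down as simple predicates over file location and content. The following is a minimal sketch under stated assumptions: the path patterns, the generator signature and the function name are hypothetical examples, not part of any tool; real characteristics have to come from developer feedback on the analyzed system.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical characteristics of false positives; each project needs its own.
EXCLUDED_PATH_PATTERNS = ["*/generated/*", "*.designer.cs"]
GENERATOR_SIGNATURE = re.compile(r"this code was generated by a tool", re.IGNORECASE)


def is_likely_irrelevant(path: str, content: str) -> bool:
    """Return True if a file matches a known false-positive characteristic."""
    normalized = path.replace("\\", "/").lower()
    # Characteristic 1: location in the directory structure or file name pattern.
    if any(fnmatchcase(normalized, pattern) for pattern in EXCLUDED_PATH_PATTERNS):
        return True
    # Characteristic 2: signature string left behind by a code generator.
    return GENERATOR_SIGNATURE.search(content) is not None
```

A predicate like this only captures characteristics that are visible in a single file. Characteristics such as "implements a given interface" or "resides in an unused component" need information from the type system or from the developers.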
8.2.6 Clone Detector Configuration

Clone detector reconfiguration determines the success of clone detection tailoring: accuracy is only increased if reconfigurations are well conceived. Although automation is desirable, reconfiguration is currently a manual process.

Clone detector configuration incorporates characteristics of false positives into the detection process to remove them from the results. We outline configuration strategies applicable to our clone detector ConQAT (cf. Chapter 7). Again, we give the examples for source code. Similar strategies can be applied, however, to clone detector configuration for requirements or models.

Minimum clone length prevents the detection of clone candidates that are too short to be meaningful. It has a strong impact on the results. While one-token clone candidates are not very useful, too large values can significantly threaten recall. Still, excluding very short clone candidates is an effective strategy to increase precision without damaging recall.
Code exclusion removes source code from the detection, and thus prevents detection of clone candidates for certain code areas. ConQAT supports file exclusion based on name or content patterns. It also supports exclusion of code regions, which is crucial in environments where some regions of files are generated, whereas the remainder is hand maintained. This is, e.g., found in .NET development, where the code generated by the GUI builder is contained in a specific method in otherwise manually maintained files.

Context sensitive normalization allows applying different notions of similarity to different code regions. This way, equal identifiers and literal values can, e.g., be required for clone candidates in stereotype or repetitive code such as variable declaration sequences, getters and setters, or select/case cascades, while at the same time differences in literals and identifiers are tolerated for clone candidates in other code. Different heuristics and patterns for context sensitive normalization are available.

Clone shaping allows trimming clone candidates to syntactic structures such as methods or basic blocks. Clone candidates that are shorter than the minimal clone length after shaping are removed from the results. This can, e.g., be used to remove short clone candidates that contain the end of one and the beginning of another method without conveying meaning.

Footnote 2: NullObjects are empty interface implementations that reduce the number of required null checks in client code.
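
To illustrate context sensitive normalization, the sketch below normalizes identifiers and literals in ordinary code but leaves them untouched in one kind of repetitive code. It is a toy under stated assumptions: the regular-expression tokenizer, the `RESERVED` word list and the heuristic that treats constant declarations as repetitive are all invented for this example and do not reflect ConQAT's heuristics.

```python
import re
from typing import List

TOKEN = re.compile(r'[A-Za-z_]\w*|\d+|"[^"]*"|\S')
RESERVED = {"int", "String", "return", "void", "public", "private", "static", "final"}
# Hypothetical heuristic: constant declarations count as repetitive code.
REPETITIVE = re.compile(r"^\s*(public|private)?\s*static\s+final\b")


def normalize(line: str) -> List[str]:
    """Tokenize a line; abstract identifiers and literals unless the line is repetitive code."""
    tokens = TOKEN.findall(line)
    if REPETITIVE.match(line):
        # Conservative context: identifiers and literals must be equal to count as clones.
        return tokens
    normalized = []
    for token in tokens:
        if token in RESERVED:
            normalized.append(token)
        elif re.fullmatch(r"[A-Za-z_]\w*", token):
            normalized.append("id")    # tolerate differences in identifiers
        elif token[0].isdigit() or token.startswith('"'):
            normalized.append("lit")   # tolerate differences in literal values
        else:
            normalized.append(token)
    return normalized
```

With this normalization, `int total = count + 1;` and `int sum = n + 42;` yield the same token sequence and can be detected as clone candidates, whereas two different constant declarations do not.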
Post-detection clone filtering removes clone candidates from the detection results. ConQAT supports content-based filtering, removal of overlapping clone groups, gap-ratio based filtering for gapped clones and blacklisting for filtering based on location-independent fingerprints that are robust during system evolution. Blacklisting can be used to exclude individual clone candidates; it can thus be applied even if no suitable characteristics of false positives are known.

Re-configuration of any clone detection phase (preprocessing, detection, or post-processing) can improve accuracy.
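
The following sketch combines three of the filters named above: minimum clone length, gap ratio and fingerprint-based blacklisting. It is an illustration only. The `Clone` record, the definition of the gap ratio as gaps divided by length, and the use of a SHA-1 hash over normalized content as fingerprint are assumptions of this example, not a description of ConQAT's internals.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Clone:
    normalized_text: str  # normalized content, independent of file and line
    length: int           # length in statements
    gaps: int             # statements that differ from the sibling clones


def fingerprint(clone: Clone) -> str:
    """Location-independent fingerprint: survives moving the clone within the system."""
    return hashlib.sha1(clone.normalized_text.encode("utf-8")).hexdigest()


def filter_clones(
    clones: Iterable[Clone],
    min_length: int,
    max_gap_ratio: float,
    blacklist: Set[str],
) -> List[Clone]:
    """Drop clones that are too short, too gapped, or explicitly blacklisted."""
    kept = []
    for clone in clones:
        if clone.length == 0 or clone.length < min_length:
            continue
        if clone.gaps / clone.length > max_gap_ratio:
            continue
        if fingerprint(clone) in blacklist:
            continue
        kept.append(clone)
    return kept
```

Because the fingerprint depends only on normalized content, a blacklisted clone candidate stays excluded when its file is renamed or lines are inserted above it.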
8.2.7 Assessment Tool Support

Besides a configurable clone detector, further tooling is required to perform clone detection tailoring:

Clone assessment: dedicated tool support is crucial to achieve acceptable clone assessment productivity. Based on our experience from large industrial case studies [57, 111, 115, 116], it must support the generation of a random sample, store the assessment results for each clone group, and offer a clone inspection viewer that displays two sibling clones side-by-side, providing syntax highlighting and coloring of differences between clones.

Comparison of clone reports: tool support is required to inspect the differences between two clone reports. This is necessary to investigate the impact of re-configuration on precision and recall.

Support for clone assessment and comparison of clone reports is available in ConQAT.
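
The two tool functions can be sketched as follows: drawing a random sample of clone groups, and comparing two clone reports against the stored assessments. This is a simplified illustration. It assumes that clone groups carry identifiers that stay stable across reports, and it estimates precision and recall on the assessed sample only; the function names and the result dictionary are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Optional, Set


def draw_sample(group_ids: Iterable[str], size: int, seed: int = 0) -> List[str]:
    """Draw a reproducible random sample of clone groups for assessment."""
    ids = sorted(group_ids)
    return random.Random(seed).sample(ids, min(size, len(ids)))


def compare_reports(
    before: Set[str], after: Set[str], assessments: Dict[str, bool]
) -> Dict[str, Optional[float]]:
    """Estimate the effect of a re-configuration on the assessed sample.

    assessments maps an assessed clone group id to True if it was rated relevant.
    """
    assessed = [g for g in assessments if g in before]
    if not assessed:
        raise ValueError("no assessed clone group occurs in the first report")
    relevant = {g for g in assessed if assessments[g]}
    kept = {g for g in assessed if g in after}
    return {
        "precision_before": len(relevant) / len(assessed),
        "precision_after": len(relevant & kept) / len(kept) if kept else None,
        # Recall relative to the first report, not to all clones in the system.
        "recall_after": len(relevant & kept) / len(relevant) if relevant else None,
    }
```

A re-configuration is promising if `precision_after` rises while `recall_after` stays at 1.0, i.e., irrelevant groups disappear and all relevant sampled groups are still reported.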
8.3 Assessment of Impact

This section follows a "goal, question, metric" (GQM) approach [11] to introduce the metrics employed to quantify the impact of cloning.

8.3.1 Goal

The goal of clone assessment is to quantify the impact of cloning in terms that reveal its effect on software engineering activities. More specifically, the goal is to quantify the impact of cloning on maintenance effort and program correctness. We hence need metrics that capture significant properties influenced by cloning.
We summarize the goal of clone assessment using the goal definition template as proposed in [234]. Since we do not perform a single assessment, as GQM is mainly targeted for, but rather provide the foundation for a class of assessments, we do not apply GQM directly but instead employ it to guide the presentation.

Analyze: cloning in software artifacts, including but not limited to source code, requirements specifications and models
for the purpose of: characterization and quantification
with respect to: its impact on maintenance effort and program correctness
from the viewpoint of: software engineer, independent of role, e.g., manager, developer, quality assurance engineer
in the context of: projects that develop or maintain software
8.3.2 Questions

The measurement goal can be broken down into several questions that help to quantify the different impacts of cloning. The questions are, on purpose, independent of the artifact type in which cloning occurs.

Q1: How large is size-increase due to cloning?

Duplication increases the size of an artifact. Duplicated code increases the number of sLOC that need to be tested and maintained, requirements duplication increases the number of sentences that need to be read; similarly, model cloning increases the number of model elements that need to be quality assured and maintained.

Q2: How large is expected modification-size-increase due to cloning?

If a clone is modified, the modification typically needs to be performed to its siblings as well. This increases the number of statements, sentences or model elements that need to be modified (the modification-size) to implement a change.

Q3: If a single element contains a fault, with which probability is this fault cloned?

If an artifact element contains a fault, its clones are likely to contain it as well. If, e.g., a code clone lacks a null check, it is missing in its siblings as well. If a requirement clone contains a wrong precondition, it is likely to be wrong in its siblings as well. And, accordingly, if an adder block in a Matlab/Simulink model receives the wrong parameter as input, it is likely to be wrong in its siblings as well.

Q4: How many clone groups and clones does an artifact contain?
The number of clones and clone groups determines the effort required for clone inspection and clone consolidation.

Q5: How likely is a coupled change unintentionally not performed to all affected clones?

If a problem domain concept (whose information is duplicated among the clones of a clone group) changes, the clones need to be adapted accordingly. How likely are developers to be unaware of all clones, and thus to not perform the change consistently to all affected clones?

Q6: How likely does an unintentionally inconsistent change indicate a fault?

This question reflects how often a change to cloned artifacts that unintentionally does not get performed consistently to all affected clones introduces a new fault or fails to remove an existing fault. It thus captures how unawareness of cloning affects correctness.
8.3.3 Metrics

Overhead quantifies the size increase due to cloning (cf. Section 2.5.4). Relative overhead quantifies the size increase caused by cloning and can thus be used to answer question Q1. Assuming that cloned artifact fragments are as likely to be modified as non-cloned fragments, it can also be used to answer question Q2, as the relative modification size increase then corresponds to the relative overhead.

Clone Coverage is the probability that an arbitrarily chosen unit in an artifact is covered by at least one clone (cf. Section 2.5.5). Assuming that statements, sentences or model elements that contain faults are equally likely to be cloned as those that do not, it can be used to answer question Q3.

Clone coverage can also be employed to answer related questions: during a requirements specification inspection, how likely will the sentence you just read occur again in another section of the document, at least once? How likely will you have to perform the modification you just did to a single statement, sentence or model element at least once more?

Counts denote the numbers of clone groups and clones in an artifact. Clone group count and clone count answer question Q4.

Unintentionally Inconsistent Clone Ratio (UICR) captures the likelihood that the differences between type-3 clones in a clone group are unintentional (cf. Section 4.2). It thus captures the lack of awareness of cloning during maintenance and answers question Q5.
Faulty Unintentionally Inconsistent Clone Ratio (FUICR) captures the likelihood that the differences between unintentionally inconsistent type-3 clones in a clone group indicate at least one fault (cf. Section 4.2). It thus captures the impact of the lack of awareness of cloning on correctness and answers Q6.

All metrics are computed on tailored clone detection results. Overhead, clone coverage and clone counts can be computed fully automatically, as is, e.g., done by ConQAT (cf. Chapter 7). The metrics UICR and FUICR are determined by developer assessments of type-3 clone groups. If the number of type-3 clone groups is too large, rating can be limited to a sample. The metrics are computed as described in Section 4.2.
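
The metrics can be illustrated with a short computation. The exact definitions are given in Sections 2.5.4, 2.5.5 and 4.2, which are not reproduced in this chapter, so the formulas below are assumptions made for this sketch: overhead is taken as artifact size relative to a redundancy-free size minus one, clone coverage as the fraction of units covered by at least one clone, UICR as the fraction of assessed type-3 clone groups rated unintentionally inconsistent, and FUICR as the fraction of those groups that indicate a fault. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Type3Assessment:
    unintentional: bool    # developers rated the inconsistency as unintentional
    faulty: bool = False   # the inconsistency indicates at least one fault


def overhead(total_units: int, redundancy_free_units: int) -> float:
    """Relative size increase due to cloning (assumed definition)."""
    return total_units / redundancy_free_units - 1.0


def clone_coverage(total_units: int, cloned_units: Set[int]) -> float:
    """Probability that an arbitrarily chosen unit is covered by at least one clone."""
    return len(cloned_units) / total_units


def uicr(assessments: List[Type3Assessment]) -> float:
    """Share of assessed type-3 clone groups whose differences are unintentional."""
    if not assessments:
        return 0.0
    return sum(a.unintentional for a in assessments) / len(assessments)


def fuicr(assessments: List[Type3Assessment]) -> float:
    """Share of unintentionally inconsistent clone groups that indicate a fault."""
    unintentional = [a for a in assessments if a.unintentional]
    if not unintentional:
        return 0.0
    return sum(a.faulty for a in unintentional) / len(unintentional)
```

For a hypothetical artifact with 1000 statements, of which 200 are covered by clones and 800 would remain without redundancy, this yields a clone coverage of 0.2 and an overhead of 0.25.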
8.3.4 Discussion

Contribution. While the metrics UICR and FUICR are novel, the other metrics have been proposed before and are, as in the case of clone coverage or clone counts, computed by existing clone detection tools. The novelty of the proposed clone assessment method thus resides not so much in the novelty of its metrics. Instead, its contribution is twofold: first, the metrics capture both impact on maintenance effort increase (overhead and coverage) and program correctness (FUICR). Second, and more importantly, they are computed not on clone candidates, but on tailored clone detection results and thus on clones that exhibit clone coupling. The metrics thus allow for more reliable interpretation w.r.t. the impact of cloning on maintenance activities than metrics computed on untailored clone detection results for which precision is unknown.
Efforts. Both clone detection tailoring and metric computation are not cost-free. Since free industrial strength clone detectors are available, such as the one proposed by this thesis, the main cost driver is the involved developer time. Since the actual detection times are fast for software of typical size (cf. Chapter 7), waiting times do not account for much; most of the effort is required for developer assessments of clones that are performed to tailor detection results and rate clones to determine UICR and FUICR.

However, according to our experiences from, e.g., the case study in Chapter 4, the faults discovered during inspection of type-3 clones can amortize these efforts. In one system, for example, we discovered a type-3 clone group in which one clone contained a comment with an issue tracker ticket number indicating a fixed bug. Its siblings, however, still contained the bug. The issue tracker entry documented a lengthy and costly process: the bug had been discovered in the field, had been triaged by a group of experts, discussed by a control board and classified as sufficiently critical to be fixed in the next release. Then it had been fixed by a developer and verified by a tester. The cost for this process, according to the developers involved in the study, exceeded the effort gone into clone assessment. In other words, the effort was accounted for by the single fault we found, since it could be fixed and tested without requiring the costly triage and quality control board process. The additional faults that were found during that analysis increased the return on investment of the effort invested into clone assessment. While there is obviously no guarantee that the found faults amortize or exceed the costs, we have repeatedly received the feedback from the involved stakeholders that clone assessment was well worth the effort.
Properties of Clones and Cloned Code. As mentioned in Section 8.2.1, the impact of cloning is determined by clone coupling, which is independent of whether clones can be removed using the abstraction mechanism available for the artifact type. Removability of the clones is thus not reflected in the metrics.

The interpretation of overhead as an estimator for modification-size-increase assumes that cloned artifact fragments are as likely to be affected by change as non-cloned ones. For source code, this assumption has been studied by several researchers. The results from Jens Krinke seem to contradict it: in [148], he reports that cloned code is more stable than non-cloned code. However, in a later study, Nils Göde uses a more sophisticated clone tracking scheme and reports that stability of cloned versus non-cloned code varies between the analyzed systems [83] and is thus hard to generalize. Lacking generalizable results whether cloned code is more or less stable than non-cloned code, and lacking any empirical data for other artifacts such as requirements specifications and models, we assume that it does not differ in stability. Future work is required to better understand the relationship between cloning and stability. In case it varies substantially, it could be included as an additional metric into a future, extended clone assessment method.

The interpretation of clone coverage as the likelihood that faults are cloned assumes that faulty artifact units are as likely to be cloned as non-faulty ones. Again, we have little empirical data that sheds light on fault densities: we are not aware of any studies for requirements specifications or models and only of a single study that compares fault densities for cloned and non-cloned code [189]. In addition, since the authors do not employ clone tailoring, according to the terminology of this thesis, their study analyzes clone candidates, not clones; the applicability of their results is thus unclear. Consequently, further research is required to better understand the fault densities for cloned and non-cloned artifact fragments. Lacking empirical data, we assume fault densities to be similar for cloned and non-cloned code. Alternatively, a future, extended version of the clone assessment method could incorporate a metric that reflects the differences between the two.
8.4 Root Cause Analysis

Besides assurance of consistent evolution of existing clones, an important function of successful clone control is the prevention of new ones. Various causes urge maintainers to create clones; please refer to Section 2.2.2 for an overview. In many cases, cloning is performed to work around problems in the maintenance environment. As long as these causes for cloning remain, maintainers are likely to continue to create clones in response. Hence, for clone prevention to be effective, the causes for cloning need to be determined and rectified.

Existing work on clone prevention focuses on monitoring of changes to the source code [149]. Changes that introduce new clones are identified and need to pass a special approval process to be allowed to be added to the system. While such an approach can help to spot clones early, it is limited to analysis of the symptoms (the clones) and ignores their cause. Such approaches thus need to be complemented with a root cause analysis that determines the forces driving clone creation. This section presents a list of root causes.
The causes for cloning are diverse; suitable solutions thus differ substantially. Their heterogeneity rules out a single, canonical recipe for root cause analysis. Instead, we list the causes and countermeasures in the form of patterns. Many of the examples described below stem from four years of experience in analyzing cloning in industrial software, often with partners outside those mentioned in Section 2.7 (footnote 3). Where fitting, we also give examples from the literature. This list is not complete. Its extension remains an important topic for future work.

The list focuses on causes for cloning in the maintenance environment. Inherent causes, such as difficulty of abstraction creation (cf. Section 2.2.2), are not considered for two reasons: first, being inherent, they cannot be rectified through changes to the maintenance environment; second, the resulting clones can be consolidated at a later point, e.g., when more information about the instances of a certain abstraction is available. We list the patterns in alphabetic order.

Pattern Template. Each cause is described following a fixed template. Its cause describes the underlying problem. Its solution describes possible measures that can be used to solve the problem. Its examples document occurrences in the literature or experiences we gathered in practice. Finally, its limitations document constraints that restrict applicability of the solutions.
8.4.1 Broken Generator

Cause: Code that was originally generated is now maintained manually.

Solution: Separate hand-written and generated code. If the generated code needs to be augmented manually, use, e.g., the Generation Gap pattern [224] to place it in different files. Do not commit generated code to the version control system. Instead, re-generate it automatically every time its input artifacts change. This reduces the probability that small fixes are directly introduced into the generated code that effectively break the possibility to regenerate it.

Examples: In some business information systems we analyzed, one-shot generators had been employed. They had generated code entities with "holes" that were later filled in manually. This resulted in large amounts of cloning.

Another project we analyzed initially employed a UML tool that generated classes from diagrams. The UML tool generated stereotype code for, e.g., association handling and object lifecycle that is duplicated between classes. This did not represent a problem as long as the tool was used, since it maintained the duplication. However, at some point, the UML tool was abandoned. All code, including the generated duplication, is now maintained by hand.

A third project we analyzed inherited a component from another team. That team employed a code generator. However, the generator is now lost. Furthermore, it is unknown whether the generated code has later been modified by hand. Consequently, it is now maintained manually.

Limitations: If hand-written and generated code have been mixed long ago, their separation can be tedious. However, such complexity is accidental. We see no inherent reason that prevents complete separation of generated and hand-written code.
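
The separation recommended in the solution above can be sketched with the Generation Gap pattern: the generator owns a base class, and manual additions live in a subclass in a different file. The entity `Customer`, its fields and the file names in the comments are hypothetical and serve only to illustrate the structure.

```python
# customer_base.py: written by the generator, never edited by hand,
# and regenerated whenever the input model changes.
class CustomerBase:
    """Generated stereotype code for the fields of a hypothetical Customer entity."""

    def __init__(self, name: str, email: str) -> None:
        self.name = name
        self.email = email

    def to_dict(self) -> dict:
        return {"name": self.name, "email": self.email}


# customer.py: hand-written additions only; regeneration leaves this file untouched.
class Customer(CustomerBase):
    def display_name(self) -> str:
        return f"{self.name} <{self.email}>"
```

Since manual fixes never touch the generated file, the option to regenerate it is preserved, and the generated file can be excluded from clone detection by its name.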
Footnote 3: For nondisclosure reasons, we cannot give more details on the company, domain or analyzed software.
8.4.2 Insufficient Abstraction Skills

Cause: The maintainers lack some of the skills required to create reusable abstractions.

Solution: Educate the maintainers in the required skills.

Examples: Even if language limitations rule out one way of creating a shared abstraction, often other, sometimes less obvious, ways exist. Many design patterns offer such ways. For example, if two fragments of code differ in one method they call, Java does not allow introducing a parameter for this method, since it does not support function types. However, the design patterns Template Method and Visitor [74], e.g., support such cases through the use of inheritance and polymorphism. To consolidate cloning, refactoring can reduce the required effort and likelihood of errors.

At one of our industrial partners, a cross-cutting concern was cloned between the underlying framework and all components that were developed on top of it. The application of the Template Method pattern allowed consolidation of a substantial part of the clones: the common code was moved into the framework base classes, the variability delegated to abstract hook methods that were implemented by the derived classes in the components.

Limitations: The available abstractions, patterns and refactorings differ between programming languages.
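
The consolidation via Template Method described in the examples above can be sketched as follows. The class and method names are hypothetical and the common code is reduced to a simple log; the sketch only shows the structure in which shared code moves to a base class and the variability is delegated to an abstract hook method.

```python
from abc import ABC, abstractmethod
from typing import List


class ComponentBase(ABC):
    """Framework base class holding the code that was formerly cloned into each component."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def handle(self, request: str) -> str:
        # Template method: the common steps live here exactly once.
        self.log.append(f"start {request}")
        result = self.process(request)   # variable step, supplied by the component
        self.log.append(f"end {request}")
        return result

    @abstractmethod
    def process(self, request: str) -> str:
        """Hook method implemented by each derived component."""


class BillingComponent(ComponentBase):
    def process(self, request: str) -> str:
        return f"billed:{request}"


class ShippingComponent(ComponentBase):
    def process(self, request: str) -> str:
        return f"shipped:{request}"
```

A change to the common steps now has to be made in `handle` only, instead of in every component.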
8.4.3 Language Limitations

Cause: The available abstraction mechanism does not allow introducing the necessary parameters to create a reusable abstraction.

Solution: The direct solution is to augment the abstraction mechanism to support the required parameterization. If this is unfeasible, use specific tools that complement the language.

Examples: The quality analysis toolkit ConQAT, on top of which the tool support proposed by this thesis is constructed, implements its own domain specific language to specify program analyses. Its initial version did not have a reuse mechanism for recurring specification fragments. The initial analyses, thus, contained clones. In response, a later version introduced an abstraction mechanism that allows for structured reuse.

General purpose programming languages like Java do not allow for encapsulation of cross-cutting concerns. Concerns such as logging, tracing or precondition checking, hence, are duplicated. One of our industrial partners introduced aspect oriented programming techniques to factor out the cloned tracing code.

Limitations: Many commonly used abstraction mechanisms, e.g., those in general purpose programming languages, cannot be extended by their users. Aspect oriented programming or generators, however, can sometimes be employed.
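
As an analogy to factoring out cloned tracing code, the sketch below uses a Python decorator. This is not aspect oriented programming as used by the industrial partner, and Python is not the language of the systems discussed; the decorator merely shows how a cross-cutting concern can be written once and attached to many functions. The names `traced`, `TRACE`, `add` and `mul` are hypothetical.

```python
import functools
from typing import Callable, List

TRACE: List[str] = []  # collected trace messages; a real system would use a logger


def traced(func: Callable) -> Callable:
    """Wrap a function so that entry and exit are traced without cloning the tracing code."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        TRACE.append(f"enter {func.__name__}")
        try:
            return func(*args, **kwargs)
        finally:
            TRACE.append(f"exit {func.__name__}")

    return wrapper


@traced
def add(a: int, b: int) -> int:
    return a + b


@traced
def mul(a: int, b: int) -> int:
    return a * b
```

Without such a mechanism, the two trace statements would have to be duplicated at the start and end of every traced function.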
8.4.4 No Consolidation of Exploratory Cloning

Cause: Inherent causes for cloning, such as difficulty of creating abstractions or prototypical realization of changes to understand their impact, disappear with time (cf. Section 2.2.2). Cloning can then be consolidated. This does not always happen in practice.

Solution: Establish clone control, as presented below, to track such clones. Schedule resources for their removal as soon as they can be consolidated, while their removal is still cheap.

Examples: In several of the industrial projects we analyzed, we found code implementing features with similar business functionality. Parts of them had been implemented via cloning. Repository analysis revealed that cloning had also been used for prototypical implementation in other areas of the application. However, in these areas, it was later consolidated, as the commonalities and differences between the features became clear. Developers reported that many of the remaining clones had originally been meant to be consolidated. However, due to time pressure and interruptions, the consolidation was postponed and then forgotten.

Limitations: The longer clones remain in a system, the more efforts can arise for their consolidation. Clones should thus be removed early, to avoid additional efforts for familiarization and quality assurance.
8.4.5 Unreliable Test Process

Cause: The test process, especially regression testing, is unreliable. In response, maintainers do not trust it to discover faults introduced during maintenance. Instead of changing code to make it reusable, copies are created, to avoid the risk of breaking existing code.

Solution: Improve the test process.

Examples: Jim Cordy [40] reports on the reluctance of maintainers in the financial sector to consolidate cloning, to avoid the risk of breaking running systems. Increased confidence of the maintainers in the test processes could reduce their reluctance.

One company we worked with was in a similar situation. Their test process was entirely manual: not a single test case was automated. In consequence, determining that a change only had the intended impact was infeasible: apart from the costs of manual test execution, it was not always clear which test cases were potentially affected by a change. The resulting reluctance to modify existing code led to a steady increase in cloning.

Limitations: As any process change, improving a test process requires planning, organizational change management and resources.
8.4.6 Unsuited Reuse Process
Cause: The organization does not have a suitable reuse process that governs the creation and maintenance of shared code⁴. Unsuited reuse processes can occur in different forms, e.g.:
- Reuse process is missing.
- Restrictive code ownership impedes modifications necessary to reuse existing code.
Solution: Change process to facilitate creation and maintenance of shared code.
⁴ We use the term shared code in a way that does not subsume cloned code.
Examples: At one company, a cause of cross-project cloning was the absence of a reuse process [120]. The company simply had no code entities that were shared between projects, and consequently no process for their maintenance. Lacking, e.g., a common library into which to place shared code, the developers copied it between projects. As a solution, the company plans to introduce a commons library and a maintenance process for it.
Restrictive code ownership is frequently mentioned as a reason for cloning in the literature [201]. Collective code ownership, as, e.g., advocated by agile development methods [18, 71], presents a suitable alternative.
Limitations: Both establishing and changing a reuse process require planning and organizational change management. Switching from restrictive to collective code ownership might require adaptations of other processes, such as quality assurance, if it was ownership-based.
8.4.7 Wrong Description Mechanism
Cause: The description technique employed to implement a piece of software is inappropriate. As a consequence, high-level operations are interspersed with repetitive sequences of low-level commands.
Solution: Use a more appropriate description technique. For example, use a domain-specific language in which the high-level operations are encoded and a generator that adds low-level commands and transforms it into executable artifacts. Or use an internal DSL to, e.g., separate test data construction from test logic.
Examples: One of the business information systems we analyzed started off with a manually written (and maintained) persistency layer. Storage of objects in a relational database (and, correspondingly, their retrieval) followed stereotypical patterns. For each object attribute (high-level information), a number of low-level storage and retrieval commands were implemented, resulting in large amounts of similar code. In a later version, the company replaced this code with a generated O/R mapper.
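As a rough illustration of the generator idea, the following sketch derives repetitive storage and retrieval statements from a declarative attribute list. The entity, its attribute names and the helper `generate_sql` are hypothetical and not taken from the analyzed system; a real O/R mapper does considerably more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """High-level description: a table name and its attribute names."""
    table: str
    attributes: tuple


def generate_sql(entity: Entity) -> dict:
    """Derive the low-level statements that would otherwise be written by hand per attribute."""
    columns = ", ".join(entity.attributes)
    placeholders = ", ".join("?" for _ in entity.attributes)
    return {
        "insert": f"INSERT INTO {entity.table} ({columns}) VALUES ({placeholders})",
        "select": f"SELECT {columns} FROM {entity.table} WHERE id = ?",
    }


# Hypothetical entity, for illustration only.
contract = Entity(table="contract", attributes=("id", "holder", "premium"))
statements = generate_sql(contract)
print(statements["insert"])
print(statements["select"])
```

Adding an attribute then means changing one line of the description instead of several similar code fragments.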
A second example is APIs used to program graphical user interfaces. Each instantiation of a widget (high-level operation) requires a sequence of (low-level) method and constructor calls. Since API constraints govern their shape and order, the resulting code looks similar [1, 123]. Again, high-level operations (place this widget over there, looking as such) are interspersed with low-level information (how to construct the widget, how to allocate and dispose of its resources, ...). Again, code generators have been developed that allow the composition and maintenance of graphical user interfaces on a higher level of abstraction.
Automated tests require test objects on which the functionality under test operates. Often, these test objects are constructed programmatically. Again, high-level operations (which objects to combine) are interspersed with numerous low-level constructor and setter calls. As a solution, describe test object construction using internal or external DSLs that allow test object specification on an appropriate level of abstraction.
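A minimal sketch of such an internal DSL for test object construction is shown below. The `Customer` class, the builder and all method names are invented for illustration; they only demonstrate how high-level decisions can be separated from low-level constructor and setter calls.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    name: str = "Default Customer"
    country: str = "DE"
    orders: list = field(default_factory=list)


class CustomerBuilder:
    """Tiny internal DSL: each method names one high-level decision."""

    def __init__(self):
        self._customer = Customer()

    def named(self, name):
        self._customer.name = name
        return self

    def from_country(self, country):
        self._customer.country = country
        return self

    def with_order(self, amount):
        self._customer.orders.append(amount)
        return self

    def build(self):
        return self._customer


def a_customer():
    """Entry point of the DSL; sensible defaults are set by the builder."""
    return CustomerBuilder()


# The test states only what matters for it; defaults cover the rest.
customer = a_customer().named("Alice").with_order(100).with_order(250).build()
print(customer)
```

Tests that need a different object change one call in the chain rather than copying a block of setter calls.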
Limitations: Suitable domain-specific languages or generators might not be available. The costs for their construction and maintenance constrain their use.
8.4.8 Summary
The analysis of causes of cloning can reveal problems in the maintenance process. These problems can have severe consequences for software maintenance far beyond their impact on cloning: working on the wrong level of abstraction creates unnecessary effort; insufficient developer skills threaten many quality attributes of a software system; and reluctance to change existing code due to an unreliable test process inhibits maintenance in general and not only consolidation of cloning. Root cause analysis of cloning offers one tool to spot such problems. If employed during clone control, it can help to identify such problems early and thus help to contain the damage they can cause.
The rectification of a cause for cloning must make economic sense. Its expected savings, both in terms of reduced impact of cloning and on software maintenance in general, must exceed the expected costs. Clone prevention thus involves trade-off decisions. These trade-offs can shift over time. A cause that initially appears to be negligible can become important, as its impact becomes obvious. In addition, causes that are expensive to fix now can become cheaper, as technology advances. Timely root cause analysis enables a substantiated decision on whether to act, or whether to accept the consequences and control the resulting clones. Furthermore, if performed as a part of continuous clone control, the decisions can be reevaluated, as additional information becomes available.
Lasting Impact: Clone assessment and root cause analysis alone, however, are unlikely to have a lasting impact on the cloning in a system. If the negative impact of cloning is to be reduced, specific actions must be taken.
The project stakeholders thus need to decide whether the impact of cloning is acceptable for their software project, or whether any actions should be taken to alleviate the impact or reduce the amount of cloning in a system. In real-world software projects, the question is more likely which actions are appropriate than whether any actions need to be taken at all: in the few cases in which we encountered software systems with very low cloning metrics, effective clone control measures were already in place.
The next section provides a method to introduce clone control that helps to alleviate the negative impact of cloning on software maintenance activities.
8.5 Introduction of Clone Control
If clone control is to be applied continuously during maintenance, established development habits need to be overcome. The goal of organizational change management is to facilitate such change processes. Below we summarize an organizational change management process from [225] that has been adapted for the introduction of quality control measures. Its steps provide the basis for the introduction of clone control.
Convince Stakeholders and establish a sense of urgency about the negative impact of cloning for the software system to build up enough momentum. The intended result of this step is motivation among the stakeholders to introduce clone control.
Create a Guiding Coalition that includes key persons to introduce clone control into the development process. Identify all required roles and persons to avoid delay. The result of this step is the task force that will initiate and perform the actions required to introduce clone control into the development process.
Communicate Change to all stakeholders affected by clone control to achieve transparency and reduce anxiety possibly created by a sense of being controlled or measured. The result of this step is knowledge of the introduced clone control tools and measures.
Establish Short-term Wins to provide payoffs for investments made so far and bolster motivation. These include fixing of encountered bugs and removal of easily removable clones. The result of this step is the improvement of the software system's quality.
Make Change Permanent by tracking clones to reward removal of existing clones and notice introduction of new ones. The result of this step is awareness of the evolution of cloning in the system and the lasting application of clone control. This step of organizational change management is performed by the fifth step of the method, continuous clone control.
In principle, the method presented in this chapter focuses on points in which computer science can help organizational change management. It does not target points that are not primarily computer science territory, such as, e.g., expectation management, conflict management or communication inside an organization. It thus complements existing approaches for organizational change management and does not replace them. The remainder of this section describes the individual steps of the introduction of clone control in more detail.
8.5.1 Convince Stakeholders
Introduction of clone control needs resources. For them, it competes against other tasks in a project. In order for clone control to be initiated, the required resources must be allocated. This demands conviction among all involved stakeholders that clone control is both necessary and urgent, else it will not happen or will be delayed.
For a software system in production, cloning is not merely an issue affecting maintenance in the distant future. Instead, it not only affects present maintenance but has already affected past maintenance. In other words, the impact of cloning already affects the stakeholders. From our experience, even in systems that are substantially affected by the negative impact of cloning, this is not clear to stakeholders. It is hence a key fact in establishing a sense of urgency among them.
To establish that the negative impact of cloning already has affected development and continues to do so, results from clone assessment are employed. From our experience, it fosters understanding if the impact of cloning is presented in two ways: on the level of individual software artifacts, to provide tangible examples, and on the level of the whole system, to put cloning into context. On the level of individual artifacts, examples of inconsistent evolution tangibly demonstrate that cloning threatens program correctness. On the level of the whole system, the clone metrics quantify the impact of cloning for the whole system.
The more stakeholders can be convinced of the urgency of clone control, the higher its chances of success. While participation of all stakeholders is not necessarily required, at least stakeholders whose inactivity blocks clone control need to be convinced.
8.5.2 Create a Guiding Coalition
Once a sense of urgency has been established, clone control needs to be integrated into the development process of a project. Different roles are involved in this. Depending on the project, they can but need not be performed by different persons:
Build engineer: Integrates clone detection into the software build environment so that it is performed automatically on a regular basis.
Dashboard appointee: Creates a dashboard that presents clone detection results to developers. Depending on the project size and team structure, the dashboard appointee creates dashboard views for the individual components or subsystems to provide customized clone detection results to the stakeholders.
Tool appointee: Familiarizes himself with the clone detection tool support to adapt it to the project and tutor his colleagues.
Once the guiding coalition has been created, it performs its tasks. Besides the identification of the involved individuals, the results of this step thus include a clone detection dashboard that is updated on a continuous basis.
8.5.3 Communicate Change
Once clone detection has been integrated into the regular build, up-to-date clone detection results are, in principle, available to developers. However, while a necessary requirement, the existence of up-to-date detection results and clone management tools alone does not alleviate the negative impact of cloning. They also need to be used by developers to take effect.
For this, developers need to be made familiar with the clone control tool support available to them and the ways it can be used to support maintenance. This includes both the clone control dashboard that provides aggregated information, and the IDE integration of clone indication that supports change propagation, implementation and impact analysis, as described in Chapter 7.
Furthermore, the ways in which the cloning information is used by other stakeholders, including management, need to be communicated to create transparency [38]. Otherwise, the resulting uncertainty about the use of the collected data can lead to defensive behavior or neglect, threatening the adoption of clone control.
8.5.4 Establish Short-term Wins
All previous steps represent investments into clone control that offer no immediately visible benefits. At this step, tangible returns in software quality improvement are required to both justify previous investments and bolster developer motivation. Strategies to achieve them include:
Fix bugs introduced by inconsistencies between clones. Bug fixes offer immediate improvements in software quality and are easy to communicate among stakeholders.
Consolidate clones that are easily removable. Such clones can, e.g., be found by using very conservative normalization. Their removal reduces software size and thus future maintenance effort. Starting with clones that are easy to remove bolsters motivation, since limited effort visibly impacts cloning metrics in the dashboard.
Consolidate large clones, both in length and in cardinality. Removal of such clones visibly reduces clone metric values and thus also bolsters motivation.
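One conceivable way to pick large clones first is to rank clone groups by the number of statements that would disappear if each group were reduced to a single copy. The sketch below assumes this ranking criterion and a hypothetical `CloneGroup` record; neither is prescribed by the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneGroup:
    name: str
    length: int        # statements per clone
    cardinality: int   # number of clones in the group


def redundant_statements(group: CloneGroup) -> int:
    """Statements saved if the group were consolidated into one copy (assumed metric)."""
    return group.length * (group.cardinality - 1)


def consolidation_candidates(groups):
    """Order clone groups by potential size reduction, largest first."""
    return sorted(groups, key=redundant_statements, reverse=True)


# Made-up clone groups, for illustration only.
groups = [
    CloneGroup("g1", length=12, cardinality=2),
    CloneGroup("g2", length=40, cardinality=3),
    CloneGroup("g3", length=15, cardinality=6),
]
ranking = [g.name for g in consolidation_candidates(groups)]
print(ranking)
```

The ranking deliberately ignores how hard a group is to consolidate; in practice that judgment remains with the developers.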
8.6 Continuous Clone Control
Apart from establishing short-term quality improvements, both the amount of cloning and the probability of introducing errors due to inconsistent modifications can be reduced through continuous application of clone control. Continuous clone control involves both the quality engineer and the developers.
8.6.1 Quality Engineer
As part of continuous clone control, the quality engineer performs a series of activities on a regular basis, e.g., as part of weekly project status meetings:
Inspect Cloning Metrics in the dashboard to track the high-level evolution of cloning. This establishes the clone metrics as important project quality characteristics and maintains attention on them. Furthermore, the quality engineer analyzes their trends to monitor whether clone control has an effect.
Track Clones to identify evolution of cloning on the level of individual clone groups. Tool support for clone tracking (cf. Section 7.4.4) identifies added and modified clone groups. The quality engineer performs the following steps on them:
Added: if the clone candidate is a false positive, add it to the blacklist to remove it from the detection results. Else, investigate if the clone should be removed and, if so, schedule it for removal by, e.g., creating a work item for it in the project's issue tracker. If the clone should not be removed, e.g., since the language abstraction mechanisms are insufficient, the clone remains in the detection results to be available for change propagation. Furthermore, analyze the root cause of the clone and determine if reactions need to be taken.
Modified: if the modification was not performed consistently to all clones in the clone group, check if this was unintentional. If so, schedule a work item to repair the inconsistency and investigate why clone indication was not used successfully.
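The decision steps for added and modified clone groups can be sketched as two small functions. The enumeration and function names are invented for illustration, and the sketch assumes that root cause analysis applies to every added clone that is not a false positive.

```python
from enum import Enum


class Action(Enum):
    BLACKLIST = "add to blacklist"
    SCHEDULE_REMOVAL = "create work item for clone removal"
    KEEP_FOR_PROPAGATION = "keep in detection results for change propagation"
    ANALYZE_ROOT_CAUSE = "analyze root cause"
    SCHEDULE_REPAIR = "create work item to repair inconsistency"
    INVESTIGATE_INDICATION = "investigate why clone indication was not used"


def triage_added(false_positive: bool, should_remove: bool) -> list:
    """Actions for a newly added clone group."""
    if false_positive:
        return [Action.BLACKLIST]
    first = Action.SCHEDULE_REMOVAL if should_remove else Action.KEEP_FOR_PROPAGATION
    return [first, Action.ANALYZE_ROOT_CAUSE]


def triage_modified(consistent: bool, intentional: bool) -> list:
    """Actions for a modified clone group; only unintentional inconsistencies need work."""
    if consistent or intentional:
        return []
    return [Action.SCHEDULE_REPAIR, Action.INVESTIGATE_INDICATION]


print(triage_added(false_positive=False, should_remove=True))
print(triage_modified(consistent=False, intentional=False))
```

The boolean inputs stand for judgments made by the quality engineer; the code only records which follow-up actions each judgment implies.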
In addition, the quality engineer follows progress on the scheduled work items for clone removal or inconsistency removal. To bolster developer motivation, the list of removed clones can, e.g., be included in the quality dashboard to make progress visible to the team.
8.6.2 Developers
As part of continuous clone control, the developers perform a series of tasks as part of their development activities.
Employ Clone Indication for change propagation. This way, the probability of unintentionally inconsistent changes to cloned code is reduced, even if cloning is not consolidated.
Resolve Work Items that have been scheduled by the quality engineer, namely clones scheduled for removal and inconsistencies that need to be repaired. While this causes effort for familiarization and quality assurance, it immediately reduces the amount of cloning and faults in the system.
Consolidate Upon Change removes cloning when changes to cloned code are required during maintenance. If code needs to be changed to implement a change request, clone consolidation in that code does not create additional effort for familiarization and quality assurance. This strategy allows cloning to be removed gradually during system evolution, without requiring a significant up-front investment.
Apart from the reduction of the amount of cloning and the probability of inconsistent modifications, a long-term benefit of continuous clone control is the maintained developer awareness of the negative impact of cloning. This awareness makes the introduction of new clones in added or modified code less likely.
8.6.3 Discussion
The generic clone control method above can be adapted to specific project contexts.
Green Field Development: The above method focused on the introduction of clone control into maintenance projects. It thus focused on how to change established habits and how to manage existing clones. If clone control is introduced at the very beginning of a project, it differs in two important aspects.
First, instead of changing established habits, new habits need to be created, which is arguably simpler. Still, to create new habits, developers need to be motivated. Since clone assessment results for the project do not exist, results from other, if possible comparable, projects should be employed.
Second, if a project starts with zero artifacts, it also starts with zero clones. Clone control can thus focus on clone avoidance instead of management of existing clones. One possibility is to track clones to discover the existence of new clones right after their creation, while their removal is still inexpensive.
Multi-project Environments: If clone control is introduced into a multi-project environment, a staged approach that starts with a few projects before introducing clone control into all projects has several advantages. First, less investment is required. Second, lessons learned on the pilot projects can be applied to the remaining ones, potentially saving the repetition of errors. Third, the pilot projects can be employed as examples to create a sense of urgency and show feasibility of clone control to the remaining projects.
Tool Support: Dedicated tool support is crucial for clone control. To control cloning on a project level, quality dashboards aggregate and visualize the extent and evolution of cloning in a system. For change propagation, clone inspection and removal, clone management tools that integrate into IDEs provide support to developers. Tool support both on the project level and for clone management in the IDE is proposed in this thesis and outlined in Chapter 7.
8.7 Validation of Assumptions
This section presents industrial case studies that validate the assumptions underlying the method for clone assessment and control. The evaluation of the method is presented in Section 8.8.
8.7.1 Assumptions
The tailoring procedure that is part of the clone assessment method employs developer assessments of clone coupling on a clone sample to determine result accuracy. This is based on three assumptions:
Assessment consistency. We assume that different developers evaluate the coupling of clones consistently.
Assessment correctness. We assume that the evaluation of clone coupling is correct regarding how changes will affect clones in reality.
Assessment generalizability. We assume that assessment results for a sample of the detected clones can be generalized to all clones.
While a certain amount of error can be tolerated, the assumptions must hold on a general level for the use of developer assessments on a sample to make sense.
8.7.2 Terms
For the sake of clarity, we define several terms we employ during the study: A change is an alteration of a software system on the conceptual level. A modification is an alteration on the source code level. A single change comprises multiple modifications if its implementation affects several code locations. Detection result accuracy refers to a combination of both precision and recall.
8.7.3 Research Questions
We use a study design with two objects and four research questions to validate the assumptions. The study is limited to source code:
RQ 10: Do developers estimate clone coupling consistently?
The application of developer assessments to estimate clone coupling is based on the assumption that different developers estimate clone coupling consistently. Experiments by Walenstein et al. [229] have demonstrated that assessments require an explicit clone relevance criterion to produce consistent results. This research question validates whether the estimation of coupling represents such a criterion.
RQ 11: Do developers estimate clone coupling correctly?
Consistency alone is not a sufficient indicator for correctness. Prediction of change, which is part of assessing the coupling between clones, inherently contains uncertainty. To assess how useful developer assessments of clone coupling are for tailoring, we need to understand their correctness.
RQ 12: Can coupling be generalized from a sample?
Rating is performed on a sample of the candidate clones, since real-world software systems contain too many clones to feasibly rate them all. The sample must be representative for the system, else sampling makes no sense.
RQ 13: How large is the impact of tailoring on clone detection results?
Tailoring changes the results of clone detection. The size of the change in terms of accuracy and amount of detected clones determines the importance of clone detection tailoring for both research and practice.

Table 8.1: Study objects
System  Lang.  Age (years)  Size (kLOC)  Developers (max)
A       ABAP   13           442          10 (40)
B       C#     8            360          4 (12)

8.7.4 Estimation Consistency (RQ 10)
Study Object: We use an industrial software system from the Munich Re Group as study object. The Munich Re Group is the largest re-insurance company in the world and employs more than 47,000 people in over 50 locations. For their insurance business, they develop a variety of individual supporting software systems. For non-disclosure reasons, we named the system A. An overview is shown in Table 8.1. Code size refers to the hand-maintained code that was analyzed. The system implements billing, time and employee management functionality and supports about 3700 users.
Design: We determine inter-rater agreement between different developers to answer RQ 10. For this, developers independently estimate coupling for a sample of candidate clone pairs from the study object by answering assessment question 1 for each pair. Inter-rater agreement is then determined by computing Cohen's Kappa.
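For readers unfamiliar with the statistic, the following sketch computes raw agreement and Cohen's Kappa for two raters. The ratings are made up and are not data from the study, and the sketch does not show how the ratings of three developers were combined.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which both raters chose the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) if both raters always use one identical category.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's category frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Made-up ratings for six clone pairs; not data from the study.
dev1 = ["accept", "accept", "reject", "reject", "accept", "undecided"]
dev2 = ["accept", "accept", "reject", "accept", "accept", "undecided"]
agreement = percent_agreement(dev1, dev2)
kappa = cohens_kappa(dev1, dev2)
print(agreement, kappa)
```

Kappa is lower than raw agreement because it discounts the agreement that two raters would reach by chance.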
Procedure and Execution: Clone detection was performed with an untailored configuration on study object A. From the results, a random sample of clone pairs was generated. If a sampled candidate clone group contained more than two clones, its first two clones were chosen. Each developer assessed coupling for each clone pair individually. Assessment was guided by a researcher. The researcher explained the assessment tool and asked the assessment question for each clone pair, but took care not to influence assessment results. Developers could provide three answers, namely accept, reject and undecided. Individual rating meetings were limited to 90 minutes since experiences with developer clone assessments from earlier experiments [115] indicated that after 90 minutes, concentration and motivation decrease and threaten result accuracy.
Results and Discussion: Clone coupling was estimated for 48 clone pairs by three developers. Three clone pairs were rated as undecided by one developer, one clone pair was rated as undecided by two developers. Furthermore, five clone pairs received at least one accept and one reject assessment. The remaining 39 clone pairs all received the same ratings by all three developers. Table 8.2 shows the results of the assessment.

Table 8.2: Estimation consistency results
Developers               Agreement
1 & 2                    87.5%
1 & 3                    85.4%
2 & 3                    89.6%
1 & 2 & 3                81.3%
1 & 2 & 3 (w/o unrated)  88.1%

Agreement between pairs of developers ranges between 85.4% and 89.6%. Overall agreement is 81.3%. In rows 1 to 4, all clone pairs are taken into account, including clone pairs that were estimated as undecided by one developer. For the last row, the four clone pairs for which at least one developer rated undecided were removed from the result. On the remaining 44 clone pairs, 88.1% are rated consistently between three developers, indicating substantial agreement. Cohen's Kappa for the three categories accepted, rejected and undecided and the three raters is 0.87 for the 48 rated clone groups. According to Landis and Koch [151], this is considered as almost perfect agreement.
For the analyzed clone pairs, developers did have a consistent estimation of the coupling of clones. After the assessments were complete, results were discussed with the developers. Developers could agree on an assessment for four out of the five clone pairs that were assessed contradictorily. Only for a single clone pair did developers remain of different opinion. Based on these results, we consider it feasible to achieve consistent estimations of clone coupling through developer assessments.
8.7.5 Estimation Correctness and Generalizability (RQ 11 & RQ 12)
Study Object: We use a second industrial software system from the Munich Re Group as study object. For non-disclosure reasons, we named the system B. An overview is shown in Table 8.1. The system implements damage prediction functionality and supports about 100 expert users.
Design: Clone detection tailoring partitions the results of untailored clone detection into two sets: the set of accepted clone groups that are still detected after tailoring, and the set of rejected clone groups that are not detected anymore. If developer assessments of clone coupling are correct and results can be generalized from the sample (and no errors have been made during clone detection tailoring), accepted clone groups must exhibit a higher ratio of coupled changes during their evolution than rejected clone groups.
Definition 5 Change Coupling Ratio (CCR): Probability that a change to one clone of a clone group should also be performed to at least one of its siblings.
We state this as a hypothesis:
Hypothesis 1: CCR for accepted clone groups is higher than for rejected clone groups.
We determine CCR on the evolution history of the study object for both accepted and rejected clone groups as described below. We then use a paired t-test to test Hypothesis 1 against the null hypothesis that CCR for accepted clone groups is equal to or smaller than for rejected clone groups.
CCR is determined by investigating the set of changes that are performed to clone groups during system evolution. CCR is simply the ratio of the number of coupled changes to the number of all changes, including uncoupled ones, which is equal to the expected probability that a randomly chosen change to a clone group is coupled.
In practice, developers do not have perfect change impact knowledge. The modifications developers perform to cloned code can deviate from the intentional nature of the change: developers can miss a clone when implementing a coupled change. The modification of the cloned code thus gets unintentionally uncoupled⁵. The three ways in which a change can affect cloned code are: 1) Consistent modifications are intentionally coupled modifications to cloned code. 2) Independent modifications are intentionally uncoupled modifications to cloned code. 3) Inconsistent modifications are unintentionally uncoupled modifications to cloned code.
Information about the intentionality⁶ of a modification is, in general, not contained in the evolution history of a system. It is thus manually assessed by the system developers.
We determine CCR for a system by inspecting changes between pairs of consecutive system versions as follows: first, clones are tracked between the two system versions to identify clone groups that were modified; second, all modified clone groups are inspected manually. Based on their underlying change, they are classified into sets of consistently, independently or inconsistently changed clone groups; CCR can now be computed as:

\[
\mathit{CCR} = \frac{|\mathit{consistent}| + |\mathit{inconsistent}|}{|\mathit{consistent}| + |\mathit{inconsistent}| + |\mathit{independent}|}
\]

This procedure does not require accurate and complete evolution histories or genealogies of individual clone groups. To improve accuracy, it can be performed on multiple pairs of consecutive system versions; CCR is then determined on a larger sample of changes.
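The CCR computation is simple enough to state as code. The function name is chosen for illustration; the example counts (58 consistent, 10 inconsistent and 26 independent changes) are the totals reported for accepted clone groups in Table 8.3.

```python
def change_coupling_ratio(consistent: int, inconsistent: int, independent: int) -> float:
    """CCR: coupled changes (consistent + inconsistent) divided by all classified changes."""
    coupled = consistent + inconsistent
    total = coupled + independent
    if total == 0:
        raise ValueError("no classified changes")
    return coupled / total


# Totals for accepted clone groups over all four intervals (Table 8.3).
ccr_accepted = change_coupling_ratio(consistent=58, inconsistent=10, independent=26)
print(round(ccr_accepted, 3))
```

Inconsistent changes count as coupled because the underlying change should have been applied to the sibling clones as well.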
Procedure and Execution: The system versions between which code modifications were analyzed were chosen using a convenience sampling strategy. Weekly snapshots⁷ of the source code were extracted from the version control system for the year 2006. Between each pair of consecutive snapshots, code churn was determined as the number of changed files, as an estimate of development activity in that week. Four weekly intervals were chosen for measurement. Their choice aimed at maximizing the covered part of the system evolution, to measure different stages and to capture different levels of development activity, in order to reduce the probability of only covering an unrepresentative part of the system's evolution.
⁵ In principle, developers could also erroneously modify clones in a coupled fashion, although the change should only affect one clone, thus resulting in an unintentionally coupled modification. However, since this case was not observed on the study object, we ignore it here.
⁶ Based on history analysis alone, it is undecidable whether two differently modified sibling clones represent an independent or inconsistent modification and thus whether the underlying change is coupled or not.
⁷ The developers have employed our clone detection tool ConQAT during development since 2008. We thus analyzed an earlier evolution history fragment to avoid unwanted side effects on the data caused by the use of the clone detector.
For each measurement interval, coupling was determined for both accepted and rejected clone groups as follows. First, modifications to cloned code were computed using a clone tracking approach similar to the one described in [83, 83], cf. Section 7.4.4. Second, all modifications to clone groups were manually classified as consistent, inconsistent or independent. The effort required to individually rate all clone groups for all intervals and both detection configurations would be too high to be feasible. Three measures were taken to reduce review effort:
Clone clustering: Due to the nature of clone groups, long clone groups often overlap with shorter clone groups of higher cardinality. Say you created a clone pair A by cloning a code region that contains two methods. If you now clone one of the methods again, you have created a second clone group B with three clones: one containing the newly inserted method clone, two overlapping clone pair A. We call such overlapping clone groups a cluster. If the original method gets changed, both clone groups A and B are modified. A pre-study we performed to validate the tool setup showed that modifications are often rated equally for all clone groups in a cluster. Although all clone groups in a cluster were rated individually, sorting clone groups according to clone clusters substantially improved rating productivity.
Two-phase review: In the first phase, a researcher inspected all modified clone groups and classified those for which obviously no common concept between clones could be identified as independent. Typical examples include getter and setter clones that are only considered similar due to overly aggressive normalization. In the second phase, the remaining clone groups were pair-reviewed by a researcher and a developer. The researcher operated the clone inspection tool, the developer took the rating decisions.
Single classification: Rated clone groups were partitioned into accepted and rejected sets. This was done by matching the rated clone groups against the results of clone tracking using a tailored detection configuration. Matching was performed in a semi-automated fashion: clone groups with identical positions were matched automatically⁸, remaining clone groups were matched manually by a researcher based on their location and content. Five out of 91 (5.5%) of the detected clusters could not be matched and were excluded from the study.
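The notion of a cluster of overlapping clone groups can be sketched with a small union-find over clone positions. The data layout (file, first line, last line) and all file names are assumptions made for this illustration; they do not describe how ConQAT represents clones.

```python
def overlaps(a, b):
    """Clones are (file, first_line, last_line) tuples with inclusive line ranges."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def cluster_clone_groups(groups):
    """Clone groups whose clones overlap, directly or transitively, share a cluster."""
    names = list(groups)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if any(overlaps(a, b) for a in groups[first] for b in groups[second]):
                parent[find(first)] = find(second)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical positions mirroring the example: pair A spans two methods,
# group B contains one of those methods plus a third copy elsewhere.
groups = {
    "A": [("Foo.cs", 10, 40), ("Bar.cs", 100, 130)],
    "B": [("Foo.cs", 25, 40), ("Bar.cs", 115, 130), ("Baz.cs", 5, 20)],
    "C": [("Qux.cs", 1, 12), ("Qux.cs", 50, 61)],
}
clusters = cluster_clone_groups(groups)
print(clusters)
```

Presenting the groups of one cluster next to each other lets a reviewer rate them in one pass, which is the productivity effect described above.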
Clone detection was performed with ConQAT using a minimal clone length of 10 statements. Tailored detection was performed using an existing tailoring from an earlier collaboration that was created using the method from Section 8.2. It excludes clone groups with overlapping clones, employs context-sensitive normalization of repetitive code regions and excludes C# using statements and generated code.
Results and Discussion: Tables 8.3 and 8.4 show the results of the manual change classification and the resulting coupling for the set of accepted and rejected clone groups, respectively. In total, changes to 211 clone groups (containing 1279 clones) were manually classified during the experiment.
In intervals 1 and 2, modifications for one accepted clone group were rated as don't know. For computation of coupling, they were conservatively counted as independent. This conservative strategy only makes it harder to answer the research question positively; it does not threaten the validity of a positive answer.
⁸ Tailoring can result in shorter clones that are thus not in identical locations as their untailored correspondents.

Table 8.3: Evolution of accepted clone groups
Int.  Consistent  Inconsistent  Independent  Coupling
1     15          3             3            0.857
2     11          1             10           0.545
3     1           0             0            1.000
4     31          6             13           0.740
1-4   58          10            26           0.723

Table 8.4: Evolution of rejected clone groups
Int.  Consistent  Inconsistent  Independent  Coupling
1     2           0             10           0.167
2     0           1             42           0.023
3     1           0             38           0.026
4     0           0             23           0.000
1-4   3           1             102          0.034

The paired t-test yields a p-value of 0.002162. This indicates that the greater clone coupling for accepted than for rejected clone groups is statistically significant at a confidence level of 95%. It thus supports Hypothesis 1. Developer estimation of clone coupling thus aligns well with the evolution of clones during the system's evolution history.
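The following sketch shows how such a paired t-test can be computed. It assumes a one-sided test, pairs the per-interval coupling values of Tables 8.3 and 8.4 by interval as listed in the code, and uses the closed-form distribution function of Student's t for three degrees of freedom, which only applies to exactly four pairs. Under these assumptions the result is close to the reported p-value.

```python
import math


def paired_t_statistic(xs, ys):
    """t statistic of the pairwise differences xs[i] - ys[i]."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)


def one_sided_p_df3(t):
    """P(T >= t) for Student's t with 3 degrees of freedom (closed form, df = 3 only)."""
    x = t / math.sqrt(3)
    cdf = 0.5 + (math.atan(x) + x / (1 + x * x)) / math.pi
    return 1 - cdf


# Coupling per interval (1 to 4); the pairing by interval is an assumption.
accepted = [0.857, 0.545, 1.000, 0.740]
rejected = [0.167, 0.023, 0.026, 0.000]
t = paired_t_statistic(accepted, rejected)
p = one_sided_p_df3(t)
print(round(t, 2), round(p, 5))
```

For other sample sizes, a statistics library should be used instead of the closed form.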
8.7.6 Clone Tailoring Impact (RQ 13)
Study Object: We use system B from Munich Re (as for RQ 12).
Design: We compute several cloning metrics for the clone detection results before and after tailoring, namely: count of clones and clone groups, clone coverage and clone blow-up. We then calculate their delta to evaluate the quantitative impact of tailoring on the detection results.
Procedure and Execution: We performed tailored and untailored clone detection on two versions of the source code of the study object. Untailored clone detection simply returns all type-1 and type-2 clones (according to the definition from [140]). All metrics were computed automatically by ConQAT. The first version is the one from the first measurement interval. The second version is from mid 2008 (before ConQAT was introduced for continuous clone management). Between these versions, the developers replaced hand-written data-access code with generated code that is never modified manually. If the data-access layer changes, it is fully re-generated, so unintentionally uncoupled changes cannot occur. We included this second version to investigate the effect of generated code on untailored detection results.

Table 8.5: Impact of tailoring on detection results
              2006                         2008
              Untail.  Tail.  Reduction    Untail.  Tail.  Reduction
Clone Groups  598      332    44%          2,558    1,028  60%
Clones        2,118    1,005  53%          12,675   3,558  72%
Coverage      29.3%    18.3%  38%          36.2%    19.4%  46%
Blow-Up       27.8%    14.2%  49%          41.2%    16.1%  61%

Results and Discussion: The results are displayed in Table 8.5. In both versions, tailoring substantially reduced the number of detected clones and thus clone coverage and blow-up. However, substantial amounts of cloning are still detected after tailoring. Tailoring affects results even more strongly if generated code is present: all metrics are reduced by a larger factor.
The mere observation that the introduction of filters during tailoring reduces the number of detected clones is hardly surprising. However, for the analyzed system, recall was largely preserved: of the 72 clone groups to which coupled changes occurred, 68 were still detected by the tailored clone detection, indicating a recall of the tailored compared to the untailored detection of 94.4%. Consequently, changes in clone (group) count mostly denote changes in precision. More specifically, for the analyzed system, about every second clone group in the untailored result is considered irrelevant by developers. For the analyzed system, adoption of clone detection techniques for continuous clone management failed until tailoring was performed: even though the system contained substantial amounts of relevant clones, false positive rates were considered too high for productive use.
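The two figures used in this argument follow from simple ratios, as the sketch below shows. The function names are chosen for illustration; the inputs are the counts quoted in the text (68 of 72 coupled clone groups) and the clone group counts for 2006 from Table 8.5 (598 untailored, 332 tailored).

```python
def relative_recall(coupled_groups: int, still_detected: int) -> float:
    """Share of coupled clone groups that survive tailoring."""
    return still_detected / coupled_groups


def relative_reduction(untailored: int, tailored: int) -> float:
    """Share of detected items removed by tailoring."""
    return 1 - tailored / untailored


recall = relative_recall(72, 68)
reduction = relative_reduction(598, 332)
print(f"{recall:.1%}", f"{reduction:.0%}")
```

With recall nearly unchanged, the drop in detected clone groups is mostly removed false positives, which is why it can be read as a gain in precision.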
8.7.7 Threats to Validity
Internal: The choice of the measurement intervals for RQ 11 can affect result validity. We chose measurement intervals covering a year of development history, with different intervals between them and with different churn to reduce the probability of only selecting unrepresentative intervals.
We assume that all consistent changes are intentional, on the basis that a developer does not inadvertently invest effort into changing different clones consistently if only a single clone needs to be changed. While this simplification can in principle introduce inaccuracy, we expect it to be negligible: of the 43 consistently modified clone groups manually investigated during the case study, not a single one was unintentionally modified consistently.
9areOurtrackapproachedtbetweenomeasuretwocloneconsecutivecouplingsystemisvunableersionstoonlydetect.Thislatedoespropagnotationsaffect,thebecausequalityclonesof
ourresults,however,sincemanualclassi®cationofuncoupledchangesbydevelopersrecognizes
changesthatarepartoflatepropagationsasunintentionalinconsistencies,andthusascoupled
changes.9Adevlateeloperpropagmodi®esationistheanclonesinconsistentmissedinchangethe®rsttoclonedmodi®cationcodestepthataccordinglybecomes.consistentagainatalaterpoint,whena
Overeager tailoring can filter out clones that are relevant. This also leads to a substantial change in clone metrics, but is not desirable in practice. However, in the analyzed system, 94.4% (68 out of 72) of the clone groups that evolved in a coupled fashion are still contained in the detection results after tailoring, indicating high recall of tailored in relation to untailored detection results.

Manual classification of clone groups, as done to answer RQ 2, entails the risk of misclassification due to human errors. We took several measures to reduce this risk: pair classification was employed to reduce the probability of individual errors. The participating developer had been working on the project, without break, for several years, covering all measurement intervals; he was thus well familiar with the system. Furthermore, uncertain cases were rated as don't know to avoid guesswork and were handled conservatively.

In case clone groups from the untailored and the tailored detection results could not be mapped unambiguously, they were excluded from the study. Since this affected only 5.5% (five out of 91) of the detected clusters, we expect the potential impact of this simplification to be negligible.

External: Each research question has been evaluated on a single system only. The systems have not been chosen randomly but were selected based on an existing cooperation and the availability and willingness of developers to contribute. Furthermore, only a single clone detector, and hence only a single clone detection approach, was employed. Thus, from the study results, we cannot tell how results are transferable to systems written in different languages, by other developer teams, or to other clone detectors or detection approaches. Although the results from the studies align well with experiences we have gathered applying clone detection tailoring in various other contexts, further studies are required to gain a better understanding of result transferability.

The study only analyzed cloning in source code. While we see no factors that threaten to invalidate the applicability of the results to cloning in other artifact types, and thus assume that they hold for them too, future work is required to validate these assumptions for requirements specifications and models.
8.8 Evaluation

This section presents an evaluation of the method for clone assessment and control. It presents a case study that employs the proposed method on an industrial software system and analyzes the resulting changes in the amount and evolution of code cloning. The case study has been performed in collaboration with Munich Re Group.

Clone Assessment and Control: We applied the method for clone assessment and control as described in this chapter to a software project developed and maintained at Munich Re Group. We briefly summarize the main steps.

Clone assessment was performed on the, at that time, current version of the software system. Several developers took part in clone inspections during clone detection tailoring and determination of UICR (footnote 10) and FUICR (footnote 11). As reported in Chapter 4, multiple faults were found in the inspected type-3 clones.

The results of clone assessment were presented and discussed in meetings in which the entire maintenance team participated. Besides an introduction to cloning in general, both the results of the clone metrics for the project and the individual discovered faults were discussed. The faults, especially, helped to establish a sense of urgency among the participants. The developers fixed the faults and consolidated a number of clones directly after presentation of the results of clone assessment.
Two types of tool support for clone control were employed. A ConQAT-based quality dashboard was created for the project that was updated on a daily basis. The dashboard contained all clone visualizations introduced in Chapter 7, including clone lists, treemaps and clone metric trends. The dashboard results were available to the developers for individual use. In addition, they were inspected by the team as part of regular project status meetings. Besides the dashboard, developers had access to the interactive tool support for clone inspection (cf. Chapter 7). This way, individual clones could be inspected in detail at the code level.

At the beginning of the case study, we tutored the project participants in the interpretation of the visualizations and metrics in the project dashboard and on the use of the interactive clone inspection tools. Apart from these tutorials and the presentations of the clone assessment results at the beginning of the case study, we did not actively participate in clone control. Importantly, we did not touch a single line of code in the project. Any changes to the code of the project were performed by the developers themselves.
8.8.1 Research Questions

To evaluate the usefulness of clone assessment and control, we investigate the following two research questions:

RQ 14: Did clone control reduce the amount of cloning?

Clone control requires resources. To justify their expense, clone control needs to take a noticeable effect. This question investigates whether a noticeable effect can be observed in the amount of cloning.

RQ 15: Is the improvement likely to be caused by the clone assessment and control measures?

Improvement alone does not justify clone control. It could, in principle, be due to other causes. This research question analyzes whether the observed reduction in cloning can be attributed to clone control.

Footnote 10: Unintentionally inconsistent clones ratio, cf. Section 8.3.3.
Footnote 11: Faulty unintentionally inconsistent clones ratio, cf. Section 8.3.3.
8.8.2 Study Design

RQ 14: We analyze the amount of cloning in the study object in both relative and absolute terms. The metric clone coverage captures the relative amount of cloning; number of cloned statements captures its absolute amount. Both metrics are computed on a daily basis to capture their evolution during the case study.

RQ 15: To investigate whether the reductions in cloning are likely to be caused by the applied clone control measures, we also compute the clone metrics on the evolution history of the project before clone control was introduced. We then compare the trends of the metrics with and without clone control to analyze differences.
8.8.3 Study Objects

We chose an industrial software system at Munich Re Group as a study object. It is a business information system written in C# that provides pharmaceutical risk management functionality. During the year of the case study, the size of the system grew from 450 kLOC to 500 kLOC. It is the same system as system B in the study objects in Section 4.3.

Software quality characteristics, including cloning, are influenced by many factors. To name just a few, these include the company, developer expertise, team structures, the maintenance environment and available tools. To have a conclusive control group to answer research question 15, these factors need to be controlled.

However, even inside the Munich Re Group, it is difficult to find software systems with the same characteristics as the study object, as they are developed and maintained by different subcontractors. They differ, thus, in their processes, team structures and employed tools.

Instead of choosing other projects with different characteristics, whose impact on cloning is hard to determine, we chose the past evolution of the study object, before clone control was introduced, as control object. This way, the company, domain, development process, team structure and employed development tools remain constant for the most part.
8.8.4 Implementation and Execution

RQ 14: The construction of the quality dashboard was integrated into a continuous build process that was executed every day. All computed clone metrics were written to a database. This way, the clone metric trends were collected continuously during the period of the case study.

RQ 15: To compute the clone metrics on the past project evolution, we extracted weekly snapshots from its version control system. Clone detection was then performed on each weekly snapshot, and the clone metrics were computed and written to a database for later trend analysis.
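The weekly snapshot procedure can be sketched as a simple loop over snapshot dates. This is a minimal illustration only: the class name is hypothetical, the `metricForSnapshot` callback stands in for the actual checkout and clone detection run, and an in-memory map replaces the database mentioned in the text.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch: collects one metric value per weekly snapshot. */
final class SnapshotTrend {
    private SnapshotTrend() {}

    /**
     * Walks from first to last in steps of one week and records the metric
     * for each snapshot date. The callback is an assumption; in the study it
     * would correspond to running clone detection on the checked-out snapshot.
     */
    static Map<LocalDate, Double> weeklyTrend(LocalDate first, LocalDate last,
            ToDoubleFunction<LocalDate> metricForSnapshot) {
        Map<LocalDate, Double> trend = new LinkedHashMap<>();
        for (LocalDate day = first; !day.isAfter(last); day = day.plusWeeks(1)) {
            trend.put(day, metricForSnapshot.applyAsDouble(day));
        }
        return trend;
    }
}
```

The insertion-ordered map keeps the snapshots in chronological order, which is what a later trend analysis needs.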
Samples of the clones of several snapshots of the system were inspected with the developers to make sure that tailoring was still accurate.
8.8.5 Results

This section presents the results of the case study.

RQ 14: Figure 8.2 depicts the evolution of clone coverage. The upper chart shows that clone coverage decreased during the case study from 14% in April 2008 to below 10% in May 2009. In May 2008, there is a short increase in clone coverage. An interview with the developers revealed that a large clone had been introduced, but was noticed at a team meeting and consolidated subsequently, resulting in the drop of the clone coverage trend to its previous level. Apart from this period, and a second small increase in July 2008, the clone coverage trend is steadily decreasing.

The upper chart of Figure 8.3 depicts the number of all statements of the system in blue and the number of statements that are covered by at least one clone in red. It shows that the number of cloned statements decreases from 15,000 in April 2008 to 11,000 in May 2009. During the study period, the size of the system increased from around 105,000 statements to 115,000 statements. Like the clone coverage trend, the cloned statements trend is steadily decreasing for most of the case study period.

The decrease in both clone coverage and cloned statements shows that clone control successfully reduced the amount of existing cloning in the system. We thus answer RQ 14 positively: clone control did reduce the amount of cloning in the studied system.
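The two metrics are consistent with each other, which can be checked with a short calculation. The sketch below assumes that clone coverage is the ratio of cloned statements to all statements, and it uses the rounded statement counts quoted above; the class name is hypothetical.

```java
/** Hypothetical helper relating the absolute and the relative clone metric. */
final class CloneCoverage {
    private CloneCoverage() {}

    /** Clone coverage as the share of statements covered by at least one clone. */
    static double coverage(long clonedStatements, long allStatements) {
        if (allStatements <= 0 || clonedStatements < 0 || clonedStatements > allStatements) {
            throw new IllegalArgumentException("invalid statement counts");
        }
        return (double) clonedStatements / allStatements;
    }
}
```

For the rounded counts, `coverage(15000, 105000)` is roughly 0.14 and `coverage(11000, 115000)` is roughly 0.096, which matches the reported drop from 14% to below 10%.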
RQ 15: This research question investigates whether the clone metrics already exhibited similar evolution patterns before clone control was introduced.

The lower chart in Figure 8.2 depicts the evolution of clone coverage between September 2004 and January 2007. Increases in clone coverage are always caused by the creation of new clones. Decreases in clone coverage are either caused by clone removal, or by addition of new code that contains no (or less) cloning. For most of this period, clone coverage oscillates between 10% and 20%. The amplitude of the changes flattens as the project advances, since the relative size of the code changed during an iteration decreases w.r.t. the overall project size, as the overall size grows larger. For the second part of the chart, the period after January 2006, clone coverage never decreases below 14%.

In contrast, the clone coverage trend during the case study exhibits a substantially different evolution, since it decreases for the most part.
The lower chart in Figure 8.3 shows the number of all statements in blue and the number of cloned statements in red in the same period. Increases in cloned statements are always caused by the creation of new clones, decreases by their removal. The waves in the trend indicate that some cloning gets consolidated shortly after its introduction. However, the amount of cloned statements after a wave is never below the amount of cloned statements before a wave, indicating that clones remain in the system after they have survived for a certain amount of time. If measured only at the lowest points, the trend is steadily increasing.

In contrast, the cloned statement trend during the case study mostly decreased. It thus exhibits a substantially different evolution than before clone control was introduced.

Since both clone coverage and cloned statements evolved substantially differently without and with clone control, although no major changes in other project characteristics were performed at the time, we answer RQ 15 positively: the decrease in cloning is likely to be caused by clone control.

Figure 8.2: Clone coverage evolution with (top) and without (bottom) clone control

Figure 8.3: Statements and cloned statements with (top) and without (bottom) clone control
8.8.6 Discussion

The waves in the trends are, in parts, caused by the iterative development process. The system size trend in the lower chart in Figure 8.3 reflects the iterative development process and release cycle of the project. At the start of a new iteration, system size tends to increase rather rapidly, as implementation of new features results in fast production of new code. Towards the end of an iteration, size increase slows or stagnates, as more resources are dedicated to testing or fixing of functionality than to production of new code. In some cases, cleanup during the end of an iteration even reduces the code size. The cloned statements trend follows this pattern. We could observe that clones were often introduced at the beginning of an iteration. Sometimes, a part of the clones was consolidated at a later point of the same iteration, causing a reduction in the number of cloned statements.

However, while some clones were consolidated during the iteration in which they were created, clones that survived beyond the end of their birth iteration were unlikely to be removed at a later point, before clone control was introduced. These observations were confirmed through interviews with the developers and inspections of the evolution of samples of the clones. As a consequence, the number of cloned statements at the end of an iteration was never smaller than at its beginning; if measured at the end of iterations, the absolute amount of cloning thus steadily increased. Only after clone control was introduced did the cloned statements trend decrease across different iterations. We think that this reversing of the cloned statements trend is a strong indicator for the impact of clone control on the amount of cloning in the system.
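The observation that the trend rises when measured only at its lowest points can be checked mechanically on a metric series. The following is a minimal sketch under the assumption that a trough is an interior sample that is lower than its predecessor and not higher than its successor; the class name and this trough definition are illustrative and were not taken from the study.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch for inspecting the waves in a cloned statements trend. */
final class WaveTrend {
    private WaveTrend() {}

    /** Returns the values at interior local minima, i.e. the low points after each wave. */
    static List<Integer> troughs(int[] clonedStatements) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i + 1 < clonedStatements.length; i++) {
            boolean fellBefore = clonedStatements[i] < clonedStatements[i - 1];
            boolean risesAfter = clonedStatements[i] <= clonedStatements[i + 1];
            if (fellBefore && risesAfter) {
                result.add(clonedStatements[i]);
            }
        }
        return result;
    }

    /** True if no value is smaller than its predecessor. */
    static boolean isNonDecreasing(List<Integer> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i) < values.get(i - 1)) {
                return false;
            }
        }
        return true;
    }
}
```

For an invented series such as 100, 140, 120, 170, 150, 200, 180, the troughs are 120 and 150, so the low points still rise even though each wave is partly consolidated.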
8.8.7 Threats to Validity

Internal: We interpret reductions in cloned statements to be caused by intentional removal of clones. The number of cloned statements can also decrease on a large scale, however, if clones are systematically modified to prevent their detection, without removing them. To control this potential threat, we inspected a sample of the code regions in which clones were no longer detected. They revealed intentional consolidation. We thus do not expect systematic concealment to cause the decrease in the clone trends.

For some days in the charts, no data are available. For them, the interpretations are thus inaccurate. This was caused by problems with the build infrastructure that prevented the dashboard from being executed for these periods. However, interviews with the developers suggest that no jumps did occur in them. In addition, the evolution for the times for which data is available is already substantially different from the historical data. We thus do not consider the missing data points as threats to our conclusion that clone control managed to reduce cloning.

We did not validate the hypothesis that clone control reduced the amount of cloning statistically. While we think that a statistical validation would be desirable, we do not believe that a single study object provides sufficient data for it. The repetition of the study on further projects and the statistical validation thus remain important future work.
The reduction in cloning could, in principle, be caused merely by developers having been made aware of the fact that clones are harmful, or by making a dashboard with clone metrics available to them. If so, the steps of the clone control method would not be required. We think that this assumption is not valid for two reasons. First, not only did the rate of new cloning decrease, but cloning was actively removed from the system. Active removal does not occur subconsciously or accidentally. Second, the dashboard was also made available to two further projects at Munich Re (projects A and C from the case study in Chapter 4). However, in these projects, the steps of the clone control method were not performed: assessment results and discovered faults were not presented and discussed in a meeting with all stakeholders. No tutorial was performed that instructed the stakeholders in the use of the quality dashboard and the clone inspection tools. The quality dashboard results were not integrated into the regular project status meetings. For these projects, no decreases in clone coverage and cloned statements comparable to those of the study object can be observed. These experiences thus give further indication that the changes to the amount of cloning were caused by the performed clone control measures, and cannot solely be explained by making dashboards available. However, this case study only provides indication of the effectiveness of clone control on a general level. The merit of the individual steps is not validated empirically. Further empirical validation is required to better understand the importance of the individual steps and the potential for simplification, omission or improvement.
External: The biggest threat to transferability of the results is that we only performed the case study on a single study object. The simple reason for this is that the case study required a lot of effort and time, and that industrial projects willing to participate in such case studies are hard to find. Future work is required that repeats the case study on further projects to better understand the generalizability of the results.

8.8.8 Additional Experiences

Apart from the results directly targeting the research questions, we made a number of experiences regarding clone control. The following paragraphs reflect our experiences both from the above study and from several further projects in which we introduced clone control, including projects at Munich Re Group, ABB and Wincor Nixdorf.
Sense of Urgency: We found that the sense of urgency that presentations of cloning and clone assessment results create depends strongly on the relation of the developers to the studied code base. If clones in third-party code are presented, they tend to be regarded as other people's problems. Clones in their own code base, while attracting more attention and triggering justification attempts, did typically not create a sense of urgency, since they were often perceived as future maintenance problems; in other words, not present maintenance problems. The fact that cloning can already have caused problems in the past was not apparent. In contrast, presentation of existing clone-related bugs makes apparent that cloning is a present maintenance problem. The resulting sense of urgency is correspondingly larger.

Reactions to Discovered Clones: We also found that discovery of clones in their system often triggers similar reaction patterns by developers. While agreement that cloning can hinder maintenance in general is typically easily achieved, the proposition that this holds for specific clones in their own system as well typically encounters initial resistance. In the numerous discussions we had, the initial reaction to a presented clone was to test if it could be removed. If not, or if not easily, developers jumped to the conclusion that the clones are not problematic, since they cannot be avoided. In such situations, it was important to point out that changes to them still needed to be carried out to all siblings, and that clone indication tooling can make this easier, since it supports change propagation. This emphasis on clone control tools as support to evolve existing clones, according to our experience, helped adoption by developers.

Dashboards as a Means of Communication: Dashboards can serve as motivation and as a means of communication inside and between different groups of stakeholders. We observed that clone trends that reflect clone consolidation can have motivating effects on developers, encouraging them to perform further consolidations. They thus communicate consolidation efforts and effects inside the developer group. Furthermore, the amount and evolution of cloning is communicated to other groups of stakeholders, including management. Although this fact can create initial reluctance among developers, we frequently encountered positive reactions, once developers were more familiar with it. Some groups employed it specifically to communicate that they require resources to consolidate several areas of unmaintainable code, turning clone measurements into an argument for their cause.
8.9 Summary

This chapter presented a method for clone assessment and control that comprises five steps. Its first step, clone detection tailoring, employs developer assessments of clone coupling to achieve accurate clone detection results. Its second step, assessment of impact, determines metrics on the detected clones. These metrics quantify the impact of cloning on maintenance efforts and correctness. Its third step, root cause analysis, determines the forces driving the creation of cloning, thus uncovering potential problems in the maintenance environment. Its fourth step, introduction of clone control, employs strategies from organizational change management to successfully introduce continuous clone management into established maintenance processes. Its fifth step, continuous clone control, performs clone control measures on a regular basis to permanently reduce the negative effects of cloning.

The second part of the chapter presented two industrial case studies. The first study validates assumptions underlying the method and demonstrates its feasibility and, through the magnitude of the impact tailoring had on the results, its importance for clone assessment. The second study evaluates the proposed method on an industrial software system at Munich Re. For the studied system, the evaluation shows that the proposed method succeeded in reducing cloning and gives indication that the reduction was in fact caused by the application of the clone assessment and control method. It thus demonstrates the feasibility and effectiveness of the proposed method in industrial software engineering practice.
9 Limitations of Clone Detection

Software contains further redundancies than those created by copy & paste. For example, as found in Chapter 5, redundancy in requirements can lead to re-implementation of functionality. Independently developed code of similar behavior has a comparable negative impact on maintenance activities as cloned code. Maintenance thus needs to be aware of it. It is unclear, however, whether existing clone detection approaches can detect, or can be made to detect, such redundancies. Consequently, we do not know whether clone management approaches can be used to control such redundancy once it has been introduced into a system.

This chapter argues that behaviorally similar code of independent origin is unlikely to be syntactically similar. It reports on a controlled experiment that justifies this claim. Existing clone detection approaches are thus ill-suited to detect such redundancy; it is hence beyond the scope of clone management tools. Parts of the content of this chapter have been published in [112].
9.1 Research Questions

We summarize the study using the goal definition template as proposed in [234]:

Analyze behaviorally similar program fragments
for the purpose of characterization and understanding
with respect to their representational similarity and detectability
from the viewpoint of the researcher
in the context of independent implementations of a single specification

In detail, we answer the following 3 research questions.

RQ 16: How successfully can existing clone detection tools detect simions (footnote 1) that do not result from copy & paste?

Multiple clone detectors exist that search for similar program representation to detect similar code. The first question we need to answer is how well these tools are able to detect simions that have not been created by copy & paste. If existing detectors perform well, no novel detection tools need to be developed.

RQ 17: Is program-representation-similarity-based clone detection in principle suited to detect simions that do not result from copy & paste?

Having established that simions are often too syntactically different to be detected by existing clone detectors, we need to understand whether the limitations reside in the tools or in the principles. If the problems reside in the tools but the approaches themselves are suitable, no fundamentally new approaches need to be developed.

RQ 18: Do simions that do not result from copy & paste occur in practice?

The third question we address is whether simions occur in real world systems. From a software engineering perspective, the answer to this question strongly influences the relevance of suitable detection approaches.

Footnote 1: Behaviorally similar code fragments, cf. 2.3.2.
9.2 Study Objects

RQs 16 and 17: We created a specification for a simple email address validator function that was implemented by computer science students. The function takes a string containing concatenated email addresses as input. It extracts individual addresses, validates them and returns collections of valid and invalid email addresses. About 400 undergraduate computer science students were asked to implement the specification in Java. They were allowed to work in teams of two or three. Each team only handed in a single solution. Implementation was done under supervision by tutors to avoid copy & paste between different teams. Participation was voluntary and anonymous to reduce pressure to copy for participants that did not succeed on their own. Behavioral similarity was controlled by a test suite. Students had access to this test suite while implementing the specification. To simplify evaluation, students had to enter the implementation into a single file.
Figure 9.1: Size distribution of the study objects (axes: number of statements, number of objects)
We received 156 implementations of the specification. Of those, 109 compiled and passed our test suite. They were taken as study objects. Since all objects pass our test suite, they are known to exhibit equal output behavior for the test inputs. Output behavior for inputs not included in the test suite can vary. Figure 9.1 displays the size distribution of the study objects (import statements are not counted). The shortest implementation comprises 8, the longest 55 statements. In Figure 9.2 the study objects are also categorized by nesting depth, i.e., the maximal depth of curly braces in the Java code, and McCabe's cyclomatic complexity [171]. The area of each bubble is proportional to the number of study objects. These metrics, which both measure certain aspects of the control flow of a program, already separate the study objects strongly, with the two largest clusters having size 19 and 12. When looking for implementations which are structurally the same, it can be expected that these give similar values for both metrics and thus the search could be limited to neighboring clusters (denoted by the bubbles in the diagram).

Figure 9.2: Study objects plotted by nesting depth and cyclomatic complexity
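The nesting depth metric described above can be approximated with a few lines of code. The following is a minimal sketch, assuming that every opening curly brace counts (including those of the class and method bodies) and that braces inside string literals or comments do not occur; the class name is hypothetical and the study's own measurement may have differed in these details.

```java
/** Hypothetical sketch of the nesting depth metric (maximal depth of curly braces). */
final class NestingDepth {
    private NestingDepth() {}

    /**
     * Counts the maximal number of simultaneously open curly braces.
     * Simplification: braces in string literals and comments are not skipped.
     */
    static int maxBraceDepth(String source) {
        int depth = 0;
        int max = 0;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
                max = Math.max(max, depth);
            } else if (c == '}') {
                depth = Math.max(0, depth - 1);
            }
        }
        return max;
    }
}
```

For example, `maxBraceDepth("class A { void f() { if (x) { y(); } } }")` yields 3 under these assumptions.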
RQ 18: To better understand the existence of simions in real-world software, we analyzed the source code of the well-known reference manager JabRef (footnote 2). We did not only search for simions inside JabRef, but also between JabRef and the code of the open source Apache Commons Library (footnote 3). Both are written in Java.
9.3 Study Design

RQ 16: To answer RQ 16, we need to determine the recall of existing clone detectors when applied to the study objects. We denote two objects that share a clone relationship as a clone pair. Since we know all study objects to be behaviorally similar, we expect an ideal detector to identify each pair of study objects as clones. For our study, the recall is thus the ratio of detected clone pairs w.r.t. the number of all pairs. We compute the full clone recall and the partial clone recall. For the full clone recall, two objects must be complete clones of each other to form a clone pair. For the partial clone recall, it is sufficient if two objects share any clone (that does not need to cover them entirely) to form a clone pair. We included the partial clone recall, since even partial matches of simions could be useful in practice.

We chose ConQAT (cf. Chapter 7) and Deckard [106] as state-of-the-art token-based and AST-based clone detectors. To separate clones between study objects from clones inside study objects, all clone groups that did not cover at least two different study objects were filtered from the results.

The parameters used when running the detectors influence the detection results. Especially the minimal length parameter strongly impacts precision and recall. To ensure that we do not hereby miss relevant clones, we chose a very small minimal length threshold of 5 statements for ConQAT. To put this into perspective: when using ConQAT in practice [55, 115], we use thresholds between 10 and 15 statements for minimal clone length. Obviously such a small threshold can result in high false positive rates and thus low precision of the results. However, this only affects the interpretation of the results w.r.t. the research question in a single direction. If we fail to detect a significant number of clones even in presence of false positives, we cannot expect to detect more clones with more conservative parameter settings.

Footnote 2: http://jabref.sourceforge.net/
Footnote 3: http://commons.apache.org/
RQ 17: The study for RQ 17 comprises two parts. First, we collect differences between study objects. We categorize them based on their compensability. To the best of our knowledge, there is no established formal boundary on the capabilities of program-representation-similarity-based (PRSB) detection approaches (cf. Section 2.3.1). Consequently, instead of using a formal boundary, we base the categorization on the capabilities of existing approaches. For that, we consider approaches not only from clone detection, but also from the related research area of algorithm recognition.

Second, having established and categorized these factors, we can look beyond the limitations of existing tools and can determine how well an ideal PRSB clone detection tool can detect simions. To that end, the differences between pairs of study objects are rated based on their category. This is performed by manual inspection. The ratio of pairs that only contain differences that can be compensated w.r.t. all pairs is computed. It is an upper bound for the recall PRSB approaches can in principle achieve on the study objects.
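The upper bound on recall can be sketched as a small computation over the manually recorded differences. In the sketch below, the enum values follow the categories named in Section 9.4.2 of this chapter; the type names, the method name and the representation of a pair as a set of categories are assumptions made for illustration, and the list of out-of-reach categories may be incomplete.

```java
import java.util.List;
import java.util.Set;

/** Categories of program variation; the boolean marks whether PRSB approaches can compensate them. */
enum Variation {
    SYNTACTIC(true),
    ORGANIZATION(true),
    GENERALIZATION(true),
    DELOCALIZATION(true),
    UNNECESSARY_CODE(true),
    DIFFERENT_DATA_STRUCTURE_OR_ALGORITHM(false);

    final boolean compensable;

    Variation(boolean compensable) {
        this.compensable = compensable;
    }
}

/** Hypothetical sketch of the upper bound computation for PRSB recall. */
final class RecallBound {
    private RecallBound() {}

    /** Ratio of inspected pairs whose recorded differences are all compensable. */
    static double upperBound(List<? extends Set<Variation>> differencesPerPair) {
        if (differencesPerPair.isEmpty()) {
            throw new IllegalArgumentException("no pairs inspected");
        }
        long compensablePairs = differencesPerPair.stream()
                .filter(diffs -> diffs.stream().allMatch(v -> v.compensable))
                .count();
        return (double) compensablePairs / differencesPerPair.size();
    }
}
```

A single out-of-reach difference is enough to exclude a pair, which is why the ratio is an upper bound rather than an estimate of achievable recall.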
To keep inspection effort manageable, manual inspection was carried out on a random sample of pairs of study objects. The sample was generated in such a way that each study object occurred at least once; it contained 55 pairs. The study objects of each pair were compared manually and the differences between them recorded. As a starting point for the difference categorization, we used the categories of program variation proposed by Metzger and Wen [176] and Wills [232]. If the differences in a category can be compensated by any existing clone detection approach or by existing work from algorithm recognition, we classified it as within reach of PRSB approaches. Else, we classified the category as out of reach of PRSB approaches.
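One way to obtain 55 pairs in which each of the 109 study objects occurs at least once is to shuffle the objects and pair neighbors. The text does not state how the sample was actually generated, so the following is only a plausible sketch; the class name and the handling of the odd object out are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: random pairs that cover every study object at least once. */
final class PairSampler {
    private PairSampler() {}

    /**
     * Shuffles the object indices and pairs neighbors. With an odd number of
     * objects, the last object is paired with a randomly chosen other object.
     */
    static List<int[]> samplePairs(int objectCount, Random random) {
        if (objectCount < 2) {
            throw new IllegalArgumentException("need at least two objects");
        }
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < objectCount; i++) {
            ids.add(i);
        }
        Collections.shuffle(ids, random);

        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < ids.size(); i += 2) {
            pairs.add(new int[] {ids.get(i), ids.get(i + 1)});
        }
        if (ids.size() % 2 == 1) {
            int last = ids.get(ids.size() - 1);
            int partner = ids.get(random.nextInt(ids.size() - 1)); // any object except the last
            pairs.add(new int[] {last, partner});
        }
        return pairs;
    }
}
```

For 109 objects this yields 54 disjoint pairs plus one pair for the remaining object, i.e. 55 pairs, which agrees with the sample size reported above.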
RQ 18: To identify simions in a real-world system, we performed pair-reviews of source code of JabRef. We did not only analyze if reviewed parts themselves contain simions but also took into account code that is behaviorally similar to third party open source library code, namely the Apache Commons Library. Such findings identify missed reuse opportunities.
9.4 Implementation and Execution

9.4.1 RQ 16: Searching Simions with Existing Tools

We executed ConQAT in three different configurations to detect clones of type 1, types 1 & 2 and types 1-3 (cf. Section 2.2.3). For type-3 clone detection, an edit distance of 33% of the length of the clone was accepted (footnote 4). Partial clone recall was computed as the ratio of the number of pairs of study objects that share any clone, w.r.t. the number of all pairs. The full clone recall was computed as the ratio of the number of pairs of study objects that share clones that cover at least 90% of their statements w.r.t. the number of all pairs. The number of all pairs is the number of edges in the complete undirected graph of size 109, namely 5778.

Deckard was executed with minimal clone length of 23 tokens (corresponding to 5 statements for an average token number of 4.5 per statement for the study objects), a stride of 0 and a similarity of 1 for detection of type-1 & type-2 clones and 0.95 for detection of type-3 clones. Again, these values are a lot less restrictive than the values suggested in [159]. Since the version of Deckard used for the study cannot process Java 1.5, it could not be executed on all 109 study objects. Instead, it was executed on 50 study objects (footnote 5) that could be made Java 1.4 compatible by removal of type parameters. For the 50 study objects, the number of all pairs is 1225.

Footnote 4: As for minimal clone length, this value is more tolerant than what we typically employ in industrial settings.
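The pair counts and the recall definitions can be written down compactly. The following is a minimal sketch with hypothetical names; it assumes that the 90% threshold of the full clone recall applies to both objects of a pair. Note that the formula n(n-1)/2 gives 1225 for 50 objects and 5886 for 109 objects, whereas the value 5778 stated in the text corresponds to 108 objects; the sketch makes no claim about which count underlies the reported results.

```java
/** Hypothetical sketch of the pair count and recall computation. */
final class PairRecall {
    private PairRecall() {}

    /** Number of edges in the complete undirected graph with the given number of nodes. */
    static long pairCount(int objects) {
        if (objects < 2) {
            throw new IllegalArgumentException("need at least two objects");
        }
        return (long) objects * (objects - 1) / 2;
    }

    /** Assumption: a full clone pair requires at least 90% covered statements in both objects. */
    static boolean isFullClonePair(double coveredShareA, double coveredShareB) {
        return coveredShareA >= 0.9 && coveredShareB >= 0.9;
    }

    /** Recall as the ratio of detected clone pairs to all pairs of study objects. */
    static double recall(long detectedPairs, int objects) {
        return (double) detectedPairs / pairCount(objects);
    }
}
```

The same `recall` method serves for both variants; only the criterion for counting a pair as detected differs between partial and full clone recall.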
9.4.2 RQ 17: Limits of Representation-based Detection

Categories of Program Variation: The following list shows the categorization of differences encountered during manual inspection of pairs of study objects that were considered principally within reach of PRSB approaches. Examples with line number references of the form A-xx and B-yy refer to study objects A and B in Fig. 9.3.

Syntactic variation occurs if different concrete syntax constructs are used to express equivalent abstract syntax, such as the different statements used to create an empty string array in lines A-4 and B-4, or different variable names that refer to the same concept, such as valid and validAddresses in lines A-8 and B-8. In addition, it occurs if the same algorithm is realized in different code fragments by a different selection of control or binding constructs to achieve the same purpose. Examples are the implementation of the empty string checks as one (line B-3) or two if statements (lines A-3 and A-5) or the optional else branch in line B-6. Means to compensate syntactic variation include conversion into intermediate representation and control flow normalization [176].

Organization variation occurs if the same algorithm is realized using different partitionings or hierarchies of statements or variables that are used in the computation. In line B-14 for example, a matcher is created and used directly, whereas both the matcher and the match result are stored in local variables in lines A-17-19. Means to (partial) compensation include variable- or procedure-inlining and loop- and conditional distribution [176].

Generalization comprises differences in the level of generalization of source code. The types List<String> in line A-8 and ArrayList<String> in line B-8 are examples of this category. Means of compensation include replacements of declarations with the most abstract types, or, in a less accurate fashion, normalization of identifiers.
Delocalization occurs since the order of statements that are independent of each other can vary arbitrarily between code fragments. In a clone of study object A for example, the list initialization in line A-8 could be moved behind line A-14 without changing the behavior. Delocalization can, e.g., be compensated by search for subgraph isomorphism as done by PDG-based approaches [140, 201].

Unnecessary code comprises statements that do not affect the (relevant) IO-behavior of a code fragment. The debug statement in line A-14 for example can be removed without changing the output behavior tested for by the test cases (footnote 6). Means of compensation include backward slicing from output variables to identify unnecessary statements.

Footnote 5: The remaining 59 study objects used additional post Java 1.4 features and were excluded from the study.

Study object A:

     1  public String[] validateEmailAddresses(
            String addresses, char separator,
            Set<String> invalidAddresses) {

     3    if (addresses == null)
     4      return new String[0];
     5    if (addresses.equals(""))
     6      return new String[0];

     8    List<String> valid = new ArrayList<String>();

    10    String sep = String.valueOf(separator);
    11    if (separator == '\\')
    12      sep = "\\\\";
    13    String[] result1 = addresses.split(sep);
    14    System.out.println(Arrays.toString(result1));

    16    for (String adr : result1) {
    17      Matcher m = emailPattern.matcher(adr);
    18      boolean ergebnis = m.matches();
    19      if (ergebnis)
    20        valid.add(adr);
    21      else
    22        invalidAddresses.add(adr);
    23    }

    25    return valid.toArray(new String[0]);
    26  }

Study object B:

     1  public String[] validateEmailAddresses(
            String addresses, char separator,
            Set<String> invalidAddresses) {

     3    if (addresses == null || addresses.equals("")) {
     4      return new String[] {}; }

     6    else {
     7      addresses.replace(" ", "");
     8      ArrayList<String> validAddresses = new ArrayList<String>();

    10      StringTokenizer tokenizer = new StringTokenizer(addresses, String.valueOf(separator));

    12      while (tokenizer.hasMoreTokens()) {
    13        String i = tokenizer.nextToken();
    14        if (this.emailPattern.matcher(i).matches()) {
    15          validAddresses.add(i);
    16        } else {
    17          invalidAddresses.add(i);
    18        }
    19      }

    21      return validAddresses.toArray(new String[] {});
    22    }
    23  }

Figure 9.3: Study objects A and B
Thecompensatedfollowingbyecatexistinggoryclonecontainsdetectiontypesoforprogramalgorithmvariationrecognitionintheapproaches.studyobjectsthatcannotbe
toDiffersolveentthedatasamestructurproblem.eorOneealgorithm:xampleforCodetheusefragmentsofdifuseferentdifferentdatadatastructuresstructuresencounteredoralgorithmsinthe
studyobjectsistheconcatenationofvalidemailaddressesintoastringthatissubsequentlysplit,
insteadoftheuseofalist.Theuseofdifferentalgorithmsisillustratedbythevarioustechniques
JawevafoundclasstoStringsplitisthecalledinputthatstringusesintoregularindievidualxpressionsaddresses:tosplitinalinestringA-13,intoaparts.libraryInmethodlineonB-10,thea
StringTokenizerisusedforsplittingthatdoesnotuseregularexpressions.
6Dependingontheusecase,debugmessagescanorcannotbeconsideredaspartoftheoutputofafunction.
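The two library-based splitting strategies named above differ in more than their representation. The following is a minimal sketch, not code from the study objects: the class and method names are hypothetical. It shows that String.split treats the separator as a regular expression (hence the Pattern.quote call also seen in Figure 9.4) and keeps empty interior tokens, whereas StringTokenizer works on literal delimiter characters and skips empty tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Illustrative sketch only; class and method names are hypothetical.
final class SplitVariants {

    // Regex-based splitting; Pattern.quote turns the separator into a literal.
    static List<String> splitWithRegex(String addresses, char separator) {
        String[] parts = addresses.split(Pattern.quote(String.valueOf(separator)));
        return new ArrayList<String>(Arrays.asList(parts));
    }

    // Tokenizer-based splitting; the separator is a plain delimiter character.
    static List<String> splitWithTokenizer(String addresses, char separator) {
        List<String> parts = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(addresses, String.valueOf(separator));
        while (st.hasMoreTokens()) {
            parts.add(st.nextToken());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Both variants agree on well-formed input.
        System.out.println(splitWithRegex("a@x.org|b@y.org", '|'));
        System.out.println(splitWithTokenizer("a@x.org|b@y.org", '|'));
        // They differ on doubled separators: split keeps the empty token.
        System.out.println(splitWithRegex("a@x.org||b@y.org", '|'));
        System.out.println(splitWithTokenizer("a@x.org||b@y.org", '|'));
    }
}
```

Whether such corner cases count as a behavioral difference depends on the test suite that defines the relevant IO-behavior.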
To illustrate the amount of variation that can be found even in a small program, Figures 9.4, 9.5, 9.6, 9.7 and 9.8 depict different ways to implement the splitting. All examples were found in the study objects. Figures 9.4 and 9.5 contain code that makes use of library functionality to split the string. The remaining figures depict custom, yet substantially different splitting algorithms.
9.4.3 RQ 18: Simions in Real World Software

The identification of simions is a hard problem as it requires full comprehension of the source code. As we did not know the source code of JabRef before, we limited our review to about 6,000 LOC that contain utility functions that are mainly independent of JabRef's domain. Examples are functions that deal with string tokenization or with file system access. In contrast to the JabRef code, we were familiar with the Apache Commons Library. Nevertheless, to identify simions between JabRef and the Apache Commons, we specifically searched the Apache Commons for functionality encountered during inspection of the JabRef code.
9.5 Results

RQ 16 RQ 16 analyzes the capability of ConQAT and Deckard to detect clones in the 109 independent implementations of the same functionality. The results are depicted in Table 9.1.
Table 9.1: Results from clone detection

    Detector   Detected Clone Types   Full Clone Recall   Partial Clone Recall
    ConQAT     1                      0.0%                0.4%
    ConQAT     1&2                    0.0%                2.3%
    ConQAT     1-3                    0.1%                3.2%
    Deckard    1&2                    0.1%                5.1%
    Deckard    1-3                    0.8%                9.7%

As can be expected, the recall values for clones of type 1-3 are higher than for type-1 or type-1&2 clones. Furthermore, the AST-based approach yields slightly higher values. This is not surprising since it performs additional normalization. However, even though we used very tolerant parameter values for clone detection, which probably result in a false positive rate that is too high for application in practice, both partial and full clone recall values are very low. The best value for full clone recall is below 1%, the best value for partial clone recall below 10%.

In other words: for two arbitrary study objects, the probability that any clones are detected between them is below 10%. The probability that they are detected to be full clones of each other is even below 1%. Given the very tolerant parameter values used for detection, we cannot expect these tools to be well suited for the detection of simions (not created by copy & paste) in real world software.
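The pairwise reading of recall given above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of the study: it assumes that recall is the share of unordered pairs of study objects for which the detector reports a (full or partial) clone, and all names are hypothetical. The detected-pair count in main is a placeholder, not a measured value.

```java
// Illustrative sketch only; names and the pairwise definition are assumptions.
final class RecallSketch {

    // Number of unordered pairs that can be formed from n study objects.
    static long pairCount(int n) {
        return (long) n * (n - 1) / 2;
    }

    // Share of pairs for which the detector reported the kind of clone in question.
    static double recall(long pairsWithDetectedClone, long allPairs) {
        if (allPairs <= 0 || pairsWithDetectedClone < 0 || pairsWithDetectedClone > allPairs) {
            throw new IllegalArgumentException("inconsistent pair counts");
        }
        return (double) pairsWithDetectedClone / allPairs;
    }

    public static void main(String[] args) {
        long allPairs = pairCount(109);
        // Placeholder value for illustration; not a result from the study.
        long hypotheticalDetected = 100;
        System.out.printf("pairs: %d, recall: %.3f%n",
                allPairs, recall(hypotheticalDetected, allPairs));
    }
}
```

Under this assumption, 109 study objects yield 5,886 candidate pairs, so even a recall of a few percent corresponds to a noticeable absolute number of detected pairs.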
    String[] adresses2 = addresses.split(Pattern.quote(String.valueOf(separator)));

Figure 9.4: Splitting with java.lang.String.split()

    ArrayList<String> validEmails = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(addresses, Character.toString(separator));
    while (st.hasMoreTokens()) {
      String tmp = st.nextToken();
      validEmails.add(tmp);
    }

Figure 9.5: Splitting with java.util.StringTokenizer

    List<String> result = new ArrayList<String>();
    int z = 0;
    for (int i = 0; i < addresses.length(); i++) {
      if (i == addresses.length() - 1) {
        result.add(addresses.substring(z, i + 1));
      }
      if (addresses.charAt(i) == separator) {
        result.add(addresses.substring(z, i));
        z = i + 1;
      }
    }

Figure 9.6: Splitting with custom algorithm 1

    List<String> curAddrs = new ArrayList<String>();
    String buffer = "";
    for (int i = 0; i < addresses.length(); i++) {
      if (addresses.charAt(i) != separator) {
        buffer += addresses.charAt(i);
      } else {
        curAddrs.add(buffer);
        buffer = "";
      }
    }
    curAddrs.add(buffer);

Figure 9.7: Splitting with custom algorithm 2

    List<String> emailListe = new ArrayList<String>();
    int trenneralt = 0;
    while (addresses.indexOf(separator, trenneralt) != -1) {
      int trennerneu = addresses.indexOf(separator, trenneralt);
      emailListe.add(addresses.substring(trenneralt, trennerneu));
      trenneralt = trennerneu + 1;
    }

Figure 9.8: Splitting with custom algorithm 3
RQ 17 Of the 55 pairs of study objects inspected manually, only 4 did not contain program variation of category different algorithm or data structure. In other words, only about 7% of the manually inspected pairs contain only program variation that can (in principle) be compensated. Since this ratio is an upper bound on the recall PRSB approaches can in principle achieve, we consider PRSB approaches poorly suited for detection of simions that do not result from copy & paste.
RQ 18 The manual reviews uncovered multiple simions within JabRef's utility functions. An example is the function nCase() in the Util class that converts the first character of a string to upper case. The same functionality is also provided by class CaseChanger that allows to apply different strategies for changing the case of letters to strings.

Even more interesting, we found many utility functions that are already provided by well-known libraries like the Apache Commons. For example, the above method is also provided by method capitalize() in the Apache Commons class StringUtils. Especially the class Util exhibits a high number of simions. It has 2,700 LOC and 86 utility methods of which 52 are not related to JabRef's domain but deal with strings, files or other data structures that are common in most programs. Of these 52 methods 32 exhibit, at least partly, a behavioral similarity to other methods within JabRef or to functionality provided by the Apache Commons library. Eleven methods are, in fact, behaviorally equivalent to code provided by Apache. Examples are methods that wrap strings at line boundaries or a method to obtain the extension of a filename.
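To make the notion of a simion concrete, the following sketch shows two hypothetical, independently written ways to convert the first character of a string to upper case. Neither method is the JabRef or the Apache Commons implementation; both are assumptions written for illustration. They behave the same on the inputs shown, yet share little representational similarity.

```java
// Illustrative sketch only; both variants are hypothetical, not library code.
final class CapitalizeSketch {

    // Variant 1: builds the result from the upper-cased head and the tail.
    static String capitalizeSubstring(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    // Variant 2: modifies a copy of the character array in place.
    static String capitalizeArray(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        char[] chars = s.toCharArray();
        chars[0] = Character.toUpperCase(chars[0]);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(capitalizeSubstring("jabRef"));
        System.out.println(capitalizeArray("jabRef"));
    }
}
```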
Many of these methods in JabRef exhibit suboptimal implementations or even defects. For example, some of the string-related functions use a platform-specific line separator instead of the platform-independent one provided by Java. In another case, the escaping of a string to be used safely within HTML is done by escaping each character instead of using the more elegant functionality provided by Apache's StringEscapeUtils class. A drastic example is the JabRef class ErrorConsole.TeeStream that provides multiplexing functionality for streams and could be mostly replaced by Apache's class TeeOutputStream. The implementation provided by JabRef has a defect as it fails to close one of the multiplexed streams. Another example is class BrowserLauncher that executes a file system process without making sure that the standard-out and standard-error streams of the process are drained. In practice, this leads to a deadlock if the amount of characters written to these streams exceeds the capacity of the operating system buffers. Again, the problem could have been avoided by using Apache's class DefaultExecutor.
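The deadlock described above can also be avoided with the JDK alone. The following minimal sketch uses plain java.lang.ProcessBuilder instead of Apache's DefaultExecutor, and its class and method names are hypothetical. It merges standard-error into standard-out and reads the merged stream to its end before waiting for the process, so the child never blocks on a full operating system buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Illustrative sketch only; not the JabRef or Apache Commons Exec code.
final class ProcessDrainSketch {

    // Reads the stream to its end and returns everything that was read.
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Starts the command, drains its merged output, then waits for the exit code.
    static int runAndDrain(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // stderr is merged into stdout
        Process process = builder.start();
        try (InputStream merged = process.getInputStream()) {
            drain(merged);
        }
        return process.waitFor();
    }
}
```

Calling waitFor() before the streams are read is the pattern that can block; draining first, or draining in a separate thread when both streams are kept apart, removes that risk.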
While the manual review of JabRef is not representative, it indicates that real-world programs, indeed, exhibit simions, both among their own code and if compared to general purpose libraries. While some of the simions are also representationally similar, the majority could not be identified with clone detection tools. This applies in particular for the simions that JabRef shares with the Apache Commons, probably because the code has been developed by different organizations. A central insight of our manual inspection was that simions often represent missed reuse opportunities that not only increase development efforts but also introduce defects.
9.6 Discussion
In the previous sections we explored the limits of current clone detection tools and also of their underlying approaches. In our experiment clone detection tools achieve a recall of less than 1% when analyzing behaviorally similar but independently developed code (RQ 16). While it could have been expected that existing clone detection approaches have rather limited capabilities for finding simions, the dramatically low recall is nevertheless surprising. Moreover, the results of RQ 17 show that only a certain class of simions, those that are representationally similar modulo normalization, can be found with current clone detection approaches. Hence, we are inclined to disagree with [201], which states that "[...] attempts can be made to detect semantic clones [simions] by applying extensive and intelligent normalizations to the code."

Furthermore, RQ 16 demonstrated that independent programmers do not tend to create representationally similar code when facing the same implementation problem. Thus, we would expect to find simions "in the wild" (both inside existing systems and between systems and libraries) which are not representationally similar and thus not detectable by current tools. RQ 18 provides first indications for this fact. These results are also backed up by the study in [107], which mined a huge number of simions from the Linux kernel sources, of which at least half were not representationally similar. Results that point in the same direction are also presented by Kawrykow and Robillard, who report on significant amounts of reimplemented API methods they found in Java systems [127]. Finally, further support is given by our observations that redundancy in requirements can lead to independent implementations of semantically, yet not syntactically, similar code (cf., Chapter 5).
The simions inspected for RQ 18 also confirmed our expectations that reuse of existing (library) functions often not only reduces implementation efforts but also the number of bugs. To provide some further indication, we used Google Code Search⁷ to identify other Java programs that do not reuse Apache's DefaultExecutor and exhibit the same deadlock problem as JabRef that we discovered in RQ 18. Strikingly, of the first 10 hits for the search lang:java process.waitfor, 6 implementations contain the same problem as JabRef although only 2 of them appear to be the result of copy & paste.
The lack of reliable simion detectors makes automated simion management unfeasible. Since detection through manual inspections is very costly, inspections are not feasible for large scale, continuous simions detection. Clone management approaches (cf., Section 3.4.2) that promise to alleviate the negative impact of cloning during maintenance, however, require data describing similar program fragments. They are hence not applicable to simion management: they simply have no data to operate on.

Since the automated management of existing simions during maintenance is hence unfeasible, development must instead focus on their avoidance. First, this implies that developers must be made and kept aware of available libraries to avoid re-implementation of functionality already available in the field. Second, redundancy in requirements and models must be detected and consolidated before they are implemented, to avoid re-implementation of functionality that is already available.
⁷ http://www.google.com/codesearch
Since avoidance does not help with simions that already exist in software, the detection of simions is a relevant problem which is not yet solved by existing tools. A working simion detector could not only help in reducing code size by eliminating redundant code, but also find bugs by including libraries of working code or bug patterns in the detection. We thus consider the construction of algorithms and tools for simion detection a worthwhile and still open problem.
9.7 Threats to Validity

This section discusses how we mitigated threats to internal and external validity.
Internal Validity For RQ 16, we did not measure the impact of the parameters used for detection on precision. This has two reasons. (1) Precision measured on the study objects, which are known to be behaviorally similar, is unlikely to be transferable to real world software, where we cannot expect the same degree of similarity. Precision measures would thus have to be repeated on further systems, still with questionable transferability beyond the systems under study. (2) Measuring precision through manual assessments is already difficult in general [229]. During the course of the study, we found it to be infeasible for very small clones (e.g., of size below 4 statements) due to low inter-rater reliability. Instead, we chose very tolerant parameter values that, while likely to result in low precision, are unlikely to reduce recall. However, this strategy has a single sided effect on the results of the study in that it merely increases the probability to detect clones. It thus does not affect the validity of the results that existing tools are poorly suited to detect simions.
For RQ 17, we classified categories of program variation according to whether they are in principle within reach of PRSB approaches. Misclassification can impact the results. We handled this threat by choosing a conservative classification strategy. Categories that can only partly be handled (e.g., due to the use of heuristics that cannot guarantee completeness or high computation complexity that could be prohibitively expensive in practice) were rated as within reach of PRSB approaches. In addition, differences between the study objects that stemmed from differences in their behavior that were not detected by our test suite were ignored. This conservative strategy thus increases the probability to consider PRSB approaches as suited for the simion detection problem. It does, however, not impact the validity of the result that PRSB approaches are poorly suited for the simion detection problem.

Several factors can lead to less program variation among the study objects than could typically be encountered in real world software: (1) all students had access to the same test suite, (2) the signature of the validator function, including its types, was specified, (3) teams could ask tutors for help. However, all these factors only increase our chances of finding clones and thus do not invalidate the results.
External Validity We chose two state-of-the-art clone detectors for the study. Some detector we did not try might perform better. However, given the diversity and amount of program variation we discovered among the study objects, we do not expect any existing clone detector to perform substantially better, as would be required to invalidate our conclusions. The results for RQ 17
illustrate that this is also valid for PDG-based detectors⁸. We do not claim transferability of the actual numbers (e.g., for recall) we measured on the study objects beyond the study. However, since the study objects were relatively simple compared to real world software, we do not expect real-world software to exhibit less program variation. On the contrary, we would expect program variation to be even larger for real world software, due to differences in conventions and practices between different teams and domains. Regarding the existence of simions in real-world programs that are not the result of copy & paste (RQ 18), our approach can only provide an indication. It is, thus, too early to reason about the defect proneness of the missed reuse opportunities represented by simions.
9.8 Summary

This chapter analyzed program variation in behaviorally similar code of independent origin. With a controlled experiment we underpin the common intuition of the existence of behaviorally similar code that cannot be found automatically by existing clone detection approaches. Clone detection tools are hence not well suited to detect behaviorally similar code of independent origin.

The case study in Chapter 5 indicated that redundancy in requirements specifications can cause re-implementation of similar functionality. The results of manual inspections of open source code furthermore indicate that simions do exist in practice. However, the experiment in this chapter reveals that clone detection is unlikely to discover such similarities on the code level. This lack of detectors makes existing clone management approaches inapplicable to simions. Their detection remains an important topic for future work.
⁸ Also, we are not aware of an available PDG-based detector for Java.
10 Conclusion

This chapter summarizes the contributions of this work. Its structure reflects the thesis statement from Section 1.1: the first section summarizes our results on the significance of cloning, the second section our contributions for clone assessment and control.
10.1 Significance of Cloning

While the negative impact of cloning on program correctness has been stated qualitatively many times, its quantitative impact, and thus its significance, in practice remained unclear. Furthermore, while cloning in source code had been studied intensely, little was known about its extent and consequences in other software artifacts.

The following sections summarize our empirical results on the impact of cloning on program correctness and the extent of cloning in requirements specifications and Matlab/Simulink models. Then, we summarize the cost model that quantifies the impact of cloning on maintenance efforts.
10.1.1 Impact on Program Correctness

We investigated four research questions to quantify the impact of code cloning on program correctness:

RQ 1: Are clones changed independently?
Yes. About half the clone groups in the analyzed systems were type-3 clone groups and thus had differences beyond variable names and literal values. Changes to cloned code that are not performed equally to all clones hence frequently occur in practice.

RQ 2: Are type-3 clones created unintentionally?
Yes. A substantial part of the differences between the analyzed clones was unintentional. Many of the developers were thus not aware of all the existing clones when modifying code. However, the ratio of intentional w.r.t. unintentional differences varied strongly between the analyzed systems, indicating differences in the amount of cloning awareness.

RQ 3: Can type-3 clones be indicators for faults?
Yes. Analysis of type-3 clones uncovered 107 faults in productive software. The ratio of type-3 clones that indicated faults, however, varied between the analyzed systems. Software with more unintentionally inconsistent changes also contained more type-3 clones that indicated faults.

RQ 4: Do unintentional differences between type-3 clones indicate faults?
Yes. About every second unintentional difference between type-3 clones indicated a fault. Lack of awareness of cloning during maintenance thus significantly impacts program correctness.

Summary The study results show that a lack of awareness of cloning is a threat to program correctness. While the analyzed systems varied in their share of unintentional differences (and thus the amount of cloning awareness among their developers), the negative impact of unintentionally inconsistent changes was uniform: about every second unintentionally inconsistent change had a direct impact on program correctness. These results thus give strong indication that awareness of cloning is crucial during software maintenance.

In addition, the study showed that awareness of cloning varies between projects; it thus cannot be taken for granted in industrial software engineering. Clone control is required to achieve and maintain awareness of cloning to alleviate the negative impact of existing clones.
10.1.2 Extent of Cloning

Besides source code, further software artifacts are created and maintained during the lifecycle of a software system: requirements specifications play a pivotal role in communication between customers, requirement engineers, developers and testers; Matlab/Simulink models are replacing code as primary implementation artifact in embedded software systems. However, cloning has not previously been studied in these artifacts. We investigated five research questions to shed light on the extent and impact of cloning in requirements specifications and Matlab/Simulink models.

RQ 5: How accurately can clone detection discover cloning in requirements specifications?
Our clone detector ConQAT achieved high precision values for the 28 analyzed industrial requirements specifications: 85% in the worst case, 99% on average. Tailoring is, however, required to achieve such high precision. These results show that clone detection is suitable to detect cloning in requirements specifications.

RQ 6: How much cloning do real-world requirements specifications contain?
The amount of cloning varied substantially across the analyzed specifications. While some contained no cloning at all, others exhibited a size increase over 100% due to cloning. The highest clone coverage values ranged at 51.1% and 71.6%.

RQ 7: What kind of information is cloned in requirements specifications?
We discovered a broad range of different information categories present in cloned specification fragments; cloning is not limited to a specific kind of information. Consequently, clone control cannot be limited to specific categories of requirement information.

RQ 8: Which impact does cloning in requirements specifications have?
Inspections are an important quality assurance technique for requirements specifications. The cloning induced size blow-up increases effort required for inspections, in the worst case by an estimated 13 person days for one of the analyzed specifications. Cloning thus increases quality assurance effort for requirements specifications.
In addition, we saw evidence that requirement cloning can result in redundancy in the implementation. Besides corresponding source code clones, we found cases in which cloned specification fragments had been implemented independent of each other. Besides increased implementation effort, this causes behaviorally similar code that is not the result of source code copy & paste.

RQ 9: How much cloning do real-world Matlab/Simulink Models contain?
The analyzed industrial Matlab/Simulink models contained a substantial amount of cloning. While the detection approach produced false positives, the developers agreed that awareness of many of the detected clones is relevant for software maintenance. Cloning thus occurs in Matlab/Simulink models and needs to be controlled during maintenance, as well.

Summary Cloning is not limited to source code, and neither is its negative impact. Cloning abounds in requirements specifications and Matlab/Simulink models; it hence needs to be controlled in them, too, to reduce the negative impact of cloning on engineering efforts.

Clone control measures are likely to differ for requirements specifications and Matlab/Simulink models, however. Limitations of the existing abstraction mechanisms are a root cause for cloning in Matlab/Simulink models. Since corresponding clones cannot easily be removed without changes to the Matlab/Simulink environment, clone control needs to focus on their consistent evolution. In contrast, for requirements specifications, no abstraction mechanism limitations hinder the clone consolidation: many of the analyzed specifications did not contain any cloning at all. Consequently, clone control for them can put more emphasis on the avoidance and removal of cloning.
10.1.3 Clone Cost Model

Besides the empirical studies, we have presented an analytical cost model that quantifies the economic effect of cloning on maintenance efforts and field faults. It can be used as a basis for assessment and trade-off decisions. The model produces a result relative to a system without cloning and thus requires substantially fewer parameters, and less instantiation effort, than general purpose cost models that produce absolute results.

Instantiation of the cost model on 11 industrial systems indicates that cloning induced impact varies significantly between systems and is substantial for some. Based on the results, some projects can achieve considerable savings by performing active clone control.

Summary The cost model complements the empirical studies in two ways. First, it completes our understanding of the impact of cloning: instead of focusing on isolated aspects or activities, it quantifies its impact on all maintenance activities and thus on maintenance efforts and faults as a whole. Second, it makes our observations, speculation and assumptions explicit. This explicitness offers an objective basis for scientific discourse about the consequences of cloning.
10.2 Clone Control

Our empirical results have shown that cloning negatively affects maintenance efforts, and that unawareness of cloning impacts program correctness. Clone control is required to avoid creation of new, and to reduce the negative impact of existing clones. We have presented tool support and a method for clone control that are summarized in the following sections. Finally, the last section summarizes our investigation of the limitations of clone detection and control.
10.2.1 Algorithms and Tool Support

The proposed clone detection workbench ConQAT provides support and flexibility for all phases of clone detection: from preprocessing, detection and postprocessing, to result presentation and interactive inspection in state of the art IDEs. ConQAT implements several novel detection algorithms: the first algorithm to detect clones in dataflow models; an index-based approach for type-2 clone detection that is both incremental and scalable; and a novel detection algorithm for type-3 clones in source code. It supports 12 programming and 15 natural languages. This comprehensive functionality (reflected in its size of about 67 kLOC) was required to perform the case studies and to support the method for clone assessment and control.

The diversity of the tasks for which clone detection is employed in both research and practice, and the necessity to tailor clone detection to its context to achieve accurate results, require variation and adaptation. ConQAT's product line architecture caters for flexible configuration, while at the same time achieving a high level of reuse between individual detectors across the clone detector family.

Summary The tool support proposed by this thesis has matured beyond the state of a research prototype. Several companies have included ConQAT for clone detection or management into their development or quality assessment processes, including ABB, BMW, Capgemini sd&m, itestra GmbH, Kabel Deutschland, Munich Re and Wincor Nixdorf. Furthermore, ConQAT's open architecture and its availability as open source have facilitated research by others [24, 96, 104, 180, 186].
10.2.2 Method for Clone Assessment and Control

To ease adoption of clone detection and management techniques in practice, this thesis has presented a method for clone assessment and control. Its goals are to assess the extent and impact of cloning in software artifacts and to reduce the negative impact of existing clones.

We introduced clone coupling as an explicit relevance criterion. Developer assessments of clone coupling are employed for clone detection tailoring to achieve accurate cloning information for a software system. The application of developer assessments to determine clone coupling is based on assumptions that have been validated through four research questions:

RQ 10: Do developers estimate clone coupling consistently?
Yes, coupling between the analyzed clones was rated very consistently among three different developers. It is thus realistic to assume a common understanding of clone coupling among developers.
RQ 11: Do developers estimate clone coupling correctly?
Yes. Analysis of the system evolution showed a significantly stronger coupling between clones that were assessed as coupled than among those that were assessed as independent. Developer estimations of coupling thus coincide with actual system evolution.

RQ 12: Can coupling be generalized from a sample?
Yes. Although tailoring was based on a sample of the detected clones, all accepted clones exhibited a significantly larger coupling during system evolution than the rejected clone candidates. Coupling can thus be generalized.

RQ 13: How large is the impact of tailoring on clone detection results?
The impact must be expected to vary between systems, since, e.g., the application of code generators, which contribute to substantial amounts of false positives, varies. However, for the analyzed system, the impact was large: more than two thirds of the clone candidates detected by untailored detection were considered irrelevant for maintenance by the developers. Still, over 1000 clone groups remained in the tailored detection results. Although the system contained a lot of relevant clones, untailored detection results were unsuited for continuous clone control. These results emphasize the importance of clone detection tailoring and cast doubt on the validity of some results of empirical analysis of properties of clones that did not employ any form of tailoring (cf., Chapter 3).

Evaluation The method has been applied to a business information system developed and maintained at the Munich Re Group. Clone assessment and control was performed over a period of one year. The successful application of the method validates its applicability in real-world contexts. To evaluate its impact, we investigated two research questions:

RQ 14: Did clone control reduce the amount of cloning?
Yes: both clone coverage and the number of cloned statements decreased during the study period: coverage decreased from 14% to below 10%, the number of cloned statements decreased from 15,000 to below 11,000, while the overall system size increased in that period.

RQ 15: Is the improvement likely to be caused by the clone assessment and control measures?
Yes. Before the study period, both clone metrics exhibited substantially different evolution patterns. The reduction in cloning is, hence, likely to be caused by the application of the method.

Summary The method provides detailed steps to transport insights gained through the case studies and experiments performed during this thesis into industrial software engineering practice. Its underlying assumptions have been validated and it has been evaluated on a software system at Munich Re Group. This evaluation has demonstrated its applicability to real-world projects and succeeded in reducing the amount of cloning in the participating software system.
10.2.3 Limitations of Clone Detection

Cloning is not the only form of redundancy in source code. Independent implementation of the same functionality, e.g., caused through cloned requirements specifications, can also lead to behaviorally similar code. We analyzed three research questions to better understand the suitability of clone detection to discover behaviorally similar code of independent origin.

RQ 16: How successfully can existing clone detection tools detect simions¹ that do not result from copy & paste?
The analyzed clone detectors were unsuccessful in detecting simions that have been developed independently. The amount of program variation in behaviorally similar code of independent origin is too large for the compensation capabilities of existing clone detectors.

RQ 17: Is program-representation-similarity-based clone detection in principle suited to detect simions that do not result from copy & paste?
No. Simions are likely to contain program variation that cannot be compensated by existing clone detection or algorithm recognition approaches. Program-representation-similarity-based detection is thus poorly suited to detect simions of independent origin.

RQ 18: Do simions that do not result from copy & paste occur in practice?
Yes. Both manual inspections of open source code and analysis of implementation of cloned requirements specifications revealed simions in real-world software.

Summary Clone detection is limited to copy & paste: independently developed program fragments with similar behavior are out of reach of existing clone detection approaches. During clone control, clone detection can be applied to find regions in artifacts that have been created through copy & paste & modify. It cannot, however, be expected to detect behavioral similarities that have been implemented independently. Clone management tools, thus, cannot be expected to work on simions. Instead of facilitating their consistent evolution during maintenance, clone control thus needs to focus on the avoidance of simions.
¹ Behaviorally similar code fragments, cf., 2.3.2
11 Future Work

This chapter outlines directions of future work. The topics have been inspired by the empirical results and experiences made during the case studies of this thesis.

Section 11.1 presents open issues in the prevention and detection of simions. Section 11.2 discusses future work in clone cost modeling. Section 11.3 proposes clone detection as a tool to guide language engineering. Section 11.4 outlines open issues in clone detection and impact of cloning for natural language documents. Finally, Section 11.5 lists open questions on clone consolidation.
11.1 Management of Simions

Software can contain redundancy beyond copy & paste. One form, independent reimplementation, presents similar problems to software maintenance as cloning. Even worse, reimplementation is typically more expensive, and possibly more error-prone, than copying existing code. Our empirical studies have confirmed the existence of reimplemented functionality in real-world software: for open source via manual code inspections (cf., Section 9.5) and for industrial software as a result of duplicated requirements (cf., Section 5.5.4).

Prevention of Reimplementation Successful prevention of reimplementation needs to happen in early stages of software development: as soon as it is manifested in the code, effort for implementation, and possibly quality assurance, has already been spent. Consequently, prevention needs to identify similar functionality earlier, e.g., on the requirements level. The fact that prevention should focus on early stages is also supported by Chapter 9, which demonstrated that existing clone detection approaches are unsuited to reliably detect such redundancy.

Identification of similar functionality should be performed both at the start of development of a new system, when design and implementation are derived from a set of requirements, and during maintenance, when new requirements are added or existing functionality gets changed. Similarity can exist both between new requirements or between new requirements and implemented features.

We are not aware of a systematic approach to identify similar functionality on the requirements level to avoid reimplementation. Given the simions we observed during our empirical studies, we consider such an approach as an important topic for future work.
    public static String fillString(int length, char c) {
      char[] characters = new char[length];
      Arrays.fill(characters, c);
      return new String(characters);
    }

    private static String padding(int repeat, char padChar) throws ... {
      if (repeat < 0) {
        throw new IndexOutOfBoundsException("..." + repeat);
      }
      final char[] buf = new char[repeat];
      for (int i = 0; i < buf.length; i++) {
        buf[i] = padChar;
      }
      return new String(buf);
    }

Figure 11.1: Simions between CCSM Commons and Apache Commons
Simion Detection. A prevention approach, as outlined above, cannot be applied to simions that are already contained in existing software. Thus, to complement the prevention approach, we need a detector that is capable of detecting (at least certain classes of) simions. Since existing clone detection approaches are poorly suited for this (cf., Chapter 9), new approaches need to be developed.
One promising approach for simion detection is dynamic clone detection that executes chunks of code and compares their I/O behavior. As proof of concept, we have implemented a prototypical dynamic clone detector for Java using techniques similar to random testing [112]. An example of detected semantically similar functions from CCSM Commons¹ and Apache Commons is depicted in Figure 11.1. While initial results are encouraging, the prototype still has many limitations, making its practical application infeasible. Future work is required to develop scalable and accurate simion detectors.
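The idea of comparing I/O behavior can be illustrated with a minimal sketch. It is not the prototype mentioned above: the class name, the helper behaveAlike, the input ranges, and the number of trials are assumptions chosen for illustration, and the two candidate functions merely mirror the fragments of Figure 11.1 in simplified form.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.BiFunction;

/** Illustrative sketch only; all names are hypothetical, not part of any real detector. */
final class DynamicCloneSketch {

    /** First candidate: fills a char array via Arrays.fill. */
    static String fillString(int length, char c) {
        char[] characters = new char[length];
        Arrays.fill(characters, c);
        return new String(characters);
    }

    /** Second candidate: fills a char array with an explicit loop. */
    static String padding(int repeat, char padChar) {
        char[] buf = new char[repeat];
        for (int i = 0; i < buf.length; i++) {
            buf[i] = padChar;
        }
        return new String(buf);
    }

    /**
     * Runs both functions on the same pseudo-random inputs and reports whether
     * all outputs agreed. Agreement is evidence of similar behavior, not a proof.
     */
    static boolean behaveAlike(BiFunction<Integer, Character, String> f,
                               BiFunction<Integer, Character, String> g,
                               int trials, long seed) {
        Random random = new Random(seed);
        for (int t = 0; t < trials; t++) {
            int length = random.nextInt(64);             // assumed input range: 0..63
            char c = (char) ('a' + random.nextInt(26));  // assumed input range: 'a'..'z'
            if (!f.apply(length, c).equals(g.apply(length, c))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean alike = behaveAlike(DynamicCloneSketch::fillString,
                                    DynamicCloneSketch::padding, 1000, 42L);
        System.out.println("candidates behave alike on sampled inputs: " + alike);
    }
}
```

The sketch sidesteps the hard parts of the problem: choosing which chunks of code to execute, generating inputs for arbitrary parameter types, and handling side effects and exceptions.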
11.2 Clone Cost Model Data Corpus
A promising direction of future work is the creation of a corpus of reference data that collects activity effort parameters for different contexts and cloning data for different systems. Such a corpus can simplify instantiation of the cost model by making effort parameters available and serve as a benchmark for relative comparison of the impact of cloning in one system against comparable systems developed by other organizations.
Furthermore, there is a definitive need for future work on the clone cost model itself. The assumptions the cost model is based on must be validated for different engineering contexts. For cases in which an assumption does not hold, the model needs to be adapted or extended accordingly. Furthermore, the model needs to be instantiated using project specific effort parameters. Last but most important, the correctness of the results must be validated, e.g., through comparing efforts on projects before and after clone consolidation, with the predicted efforts.
¹ http://conqat.cs.tum.edu/index.php/CCSM_Commons
11.3 Language Engineering
One root cause of cloning that is frequently mentioned in the clone detection literature is language limitations that prevent the creation of reusable abstractions. As a way around such a limitation, developers copy & paste & modify the code. For example, many cross-program clones in COBOL are caused by COBOL's difficulty to reuse code between programs. Similarly, programs written in early versions of Java often contain cloned wrappers around collection classes to make them type safe, since the language then did not allow parameterization of types.
In these situations, cloning is the symptom, the abstraction mechanism limitation the cause. The presence of cloning can thus indicate language limitations. One potentially beneficial use of clone detection is thus the discovery of abstraction mechanism shortcomings to inform language design and evolution, not only of general purpose programming languages, but of all abstraction mechanisms and languages employed during software engineering.
The evolution history of both general purpose and domain specific languages documents the introduction of language features that make it possible to reduce the amount of cloning in their programs. Java 1.5, for example, introduced generics that, e.g., allow parameterization of types in collection classes. As a consequence, no redundant wrappers around collection classes are required any longer to make them type safe. It also introduced an iteration loop that replaces implementations of the Iterator idiom (which previously took several statements that were duplicated every time it was used) with a single statement. Further evidence that the intent to remove duplication drove language design can be found in the evolution of the collections library of the language Scala. Its documentation states that the “principal design objective of the new collections framework was to avoid any duplication, defining every operation in one place only” [168]. A further example can be found in the evolution of attribute grammar formalisms, domain specific languages to declaratively specify syntax and semantics of programming languages: in [174], Mernik et al. extend existing attribute grammar formalisms with inheritance, to allow for more reuse (and thus less duplication) in language specifications.
These examples from language evolution history document that the removal of redundancy is indeed a driver of language design. However, in many cases, the language features were introduced at a late point, when the amount of redundancy in practice had grown large enough to really bother users. Systematic application of clone detection to guide language design could make it possible to mend weaknesses in earlier stages, before a large amount of cloning is created as a work-around, which is then difficult to consolidate.
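The two Java 1.5 features named above can be illustrated with a small before-and-after sketch. The wrapper class StringList and both method names are hypothetical; they only stand in for the kind of cloned wrapper and repeated Iterator idiom that the text describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Illustrative sketch; class and method names are hypothetical. */
final class LanguageEvolutionSketch {

    /** Pre-generics style: a type-safe wrapper that had to be cloned per element type. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static final class StringList {
        private final List items = new ArrayList();  // raw type, as before Java 1.5

        void add(String s) { items.add(s); }

        String get(int index) { return (String) items.get(index); }

        Iterator iterator() { return items.iterator(); }
    }

    /** Iterator idiom: several statements, repeated at every use. */
    @SuppressWarnings("rawtypes")
    static int totalLengthOld(StringList list) {
        int total = 0;
        Iterator it = list.iterator();
        while (it.hasNext()) {
            String s = (String) it.next();
            total += s.length();
        }
        return total;
    }

    /** Since Java 1.5: generics replace the wrapper, the for-each loop replaces the idiom. */
    static int totalLengthNew(List<String> list) {
        int total = 0;
        for (String s : list) {
            total += s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        StringList old = new StringList();
        old.add("clone");
        old.add("detection");
        System.out.println(totalLengthOld(old));
        System.out.println(totalLengthNew(Arrays.asList("clone", "detection")));
    }
}
```

A wrapper like StringList would have to be copied and adapted for every further element type, which is exactly the kind of cloning that the language feature made unnecessary.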
Apart from general purpose and domain specific languages, clone detection can also guide the design of more informal abstraction mechanisms employed during software engineering. The templates for use cases and test scripts are also abstraction mechanisms that specify the fixed and the variable parts of their document instances. As the case study in Chapter 5 showed, missing reuse mechanisms in these artifact types also create cloning as a response. We suggest the following extension of the use case templates based on the cloning we observed in use cases:
Condition Sets: Collections of both pre- and postconditions were frequently cloned between use cases that operate in similar system states. The explicit creation of sets of such conditions, which are then reused, offers two advantages: first, the different system states are more easily recognized from a few precondition sets than from a comparison of the preconditions listed in hundreds of individual use cases; second, when a system state changes, the change only needs to be performed to the corresponding precondition set, not to all use cases that operate in this state. This reduces both maintenance effort and the danger of inconsistencies.
Glossaries: Many of the clones encountered in the use cases repeated definitions of roles, entities or terms. Their single definition in a glossary can remove this redundancy. Glossaries are used in many projects. However, their integration with the use cases, e.g., through navigable links between terms in a use case and their definition in a glossary, does not appear to be habitual in practice.
Walks: Many of the use cases and test scripts we analyzed contain duplicated sequences of steps. In many cases, they correspond to some higher level concept, such as “open customer entry”, which requires several individual system interaction steps, e.g., “Open search form”, “Enter name”, “Perform search” and “Select customer entry from search results”. These recurring sequences of steps could be made reusable as a “walk” (to stay in the metaphor) that can be referenced by use cases.
Designing abstractions is hard. We often do not get it perfectly right on the first attempt. Clone detection can provide a tool to discover weaknesses and react to them early, before they create too much redundancy in practice.
11.4 Cloning in Natural Language Documents
The study in Chapter 5 has shown that cloning abounds in many real-world requirements specifications, and has given indication for its negative impact on engineering efforts. This section outlines promising directions for future work in clone detection in requirements specifications and other natural language software artifacts.
Clone Classification. For code clones, a classification into different clone types has been established (cf., Section 2.2.3). Recently, an analogous classification of clone types for model clones has been proposed [86]. Such classifications are useful to characterize detection algorithms and thus facilitate their comparison and their selection for specific tasks.
Analogous to code clones, we can define a classification of clone types for clones in natural language documents:
type-1 clones are copies that only differ in whitespace. They are thus allowed to show different positions of line breaks or paragraph boundaries.
type-2 clones are copies that, apart from whitespace, can contain replacements of words inside a word category. For example, an adjective in one clone can be replaced by another adjective in its sibling, or a noun through another noun.
type-3 clones are copies that, apart from whitespace and category-preserving word replacements, can exhibit further differences, such as removed or added words, or replacements of a word from one category through a word from another one.
type-4 clones are text fragments that, although different in their wording, convey similar meaning.
Just as the classification of code clones, this classification can be expected to evolve, as experience with cloning in natural language documents increases. For example, in [141], Koschke introduces further clone categories to better reflect typical clone evolution patterns. Similarly, a better understanding of the evolution of requirements specification could lead to a refinement to the above categorization.
Detection of Type-2 Clones. The classification of clone types raises the question of how they can be detected. Type-1 clones are easy to detect, since no normalization beyond whitespace removal needs to be performed. Detection can then simply be performed on the word sequence, as suggested in Chapters 5 and 7. Type-3 clone detection can be applied to this word sequence as well, e.g., employing the algorithm proposed in Chapter 7 for detection of type-3 clones in sequences.
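Detection on the word sequence can be sketched as follows. This is a deliberately simple stand-in, not the detection algorithm of Chapters 5 and 7: it splits two texts at whitespace and finds the longest run of identical consecutive words with a quadratic dynamic program. The class and method names are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch; names are hypothetical and the algorithm is a simple stand-in. */
final class Type1CloneSketch {

    /** Whitespace normalization: line breaks and paragraph boundaries vanish. */
    static List<String> words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Length (in words) of the longest run of consecutive words shared by both sequences. */
    static int longestCommonRun(List<String> a, List<String> b) {
        int best = 0;
        int[] previous = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            int[] current = new int[b.size() + 1];
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    // extend the run that ended at the previous word in both sequences
                    current[j] = previous[j - 1] + 1;
                    best = Math.max(best, current[j]);
                }
            }
            previous = current;
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> first = words("Open search form.\nEnter name. Perform search.");
        List<String> second = words("Log in. Open search form. Enter\n\nname. Perform search.");
        int run = longestCommonRun(first, second);
        int minimalCloneLength = 5;  // assumed threshold, in words
        System.out.println("shared run of " + run + " words, clone: " + (run >= minimalCloneLength));
    }
}
```

Because words are compared verbatim, differing line breaks do not matter, whereas any replaced word ends the run; this is what restricts the sketch to type-1 clones.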
For type-2 clone detection, however, a normalization component is required that transforms elements that may substitute for one another into a canonical representation. Since the above definition only allows word replacements inside a word category, such as nouns, verbs or adjectives, we need a component that identifies word categories for natural language text.
Natural language processing [119] developed a technique called part-of-speech analysis that determines the word categories for natural language text. Part-of-speech analysis is a mature research area, for which freely available tools, such as TreeTagger [206, 207], exist; they are also used for other analysis tasks, such as ambiguity detection [82].
To evaluate the suitability of part-of-speech analysis for normalization, we have prototypically implemented it in ConQAT and evaluated it on one of the specifications from the case study on cloning in requirements specifications from Chapter 5. Initial results are promising: we detected type-2 clones that differ in the action that gets performed in a use case, e.g., create versus modify of a program entity, or in the tense in which the verbs are written; several clone groups only differed in the name of the target entity on which use case steps were performed, although the steps were identical.
In the instances we saw, normalization increased robustness against modifications. For example, the term “user” had been replaced by the term “actor” in some, but not all of the use cases. Such systematic changes cause many differences in the word sequences and thus make the clones difficult to detect using edit-distance-based algorithms; normalization, however, compensates for such modifications, thus making their detection feasible.
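The effect of category-based normalization can be sketched as follows. The sketch does not use a real part-of-speech tagger: a tiny hand-written lexicon (an assumption made purely for illustration) maps a few words to their category, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch; the toy lexicon stands in for a real part-of-speech tagger. */
final class PosNormalizationSketch {

    /** Hypothetical toy lexicon: word -> word category. */
    static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("user", "NOUN");
        LEXICON.put("actor", "NOUN");
        LEXICON.put("customer", "NOUN");
        LEXICON.put("contract", "NOUN");
        LEXICON.put("entry", "NOUN");
        LEXICON.put("creates", "VERB");
        LEXICON.put("created", "VERB");
        LEXICON.put("modifies", "VERB");
    }

    /** Replaces every word with a known category by that category; other words stay as they are. */
    static List<String> normalize(String text) {
        List<String> result = new ArrayList<>();
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            result.add(LEXICON.getOrDefault(word, word));
        }
        return result;
    }

    /** True if both fragments yield the same sequence after normalization. */
    static boolean equalAfterNormalization(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String first = "The user creates a customer entry";
        String second = "The actor modifies a contract entry";
        System.out.println(normalize(first));
        System.out.println(equalAfterNormalization(first, second));
    }
}
```

After normalization, the replacement of “user” by “actor” no longer produces a difference in the word sequence, so a detector operating on the normalized sequence can treat both fragments as copies. An added or removed word still changes the sequence, which is the territory of type-3 detection.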
Many open issues remain: how does part-of-speech normalization affect precision? Which normalization of word categories gives a good compromise between precision and recall? Should some word categories be ignored entirely, e.g., articles or prepositions? Can automated synonym detection approaches serve to provide a more fine grained normalization than part-of-speech analysis? Natural language software artifacts often adhere to a template; does the resulting regular structure enable improvements or optimizations? Future work is required to shed light on these issues.
Evolution of Requirements Clones. Requirements specifications, like all software artifacts, evolve as the system they describe changes. Unawareness of cloning during document maintenance threatens consistency: just as for source code, unintentionally inconsistent changes can introduce errors into the documents.
Little is known about how requirements specifications evolve, and how evolution is affected by cloning. How large is the impact of cloning on requirements consistency and correctness in practice? Which classes of modifications are often encountered in real-world requirements evolution and should thus be compensated by clone detectors? Empirical studies could help to better understand these issues.
Cloning in Test Scripts. In many domains, a substantial part of the end-to-end testing is still performed manually: test engineers interact with the system under test, trigger inputs and validate system reactions. The test activities they perform are typically specified as natural language test case scripts that adhere to a standardized structure that is defined by a test case template. As the system under test evolves, so do its test cases.
To get a first understanding of whether test cases contain cloning, we performed a clone detection on 167 test cases for manual end-to-end tests of an industrial business information system. For a minimal clone length of 20 words, detection discovered about 1000 clones and computed a clone coverage of 54%.
Manual inspection of the test case clones revealed frequent duplication of sequences of interaction steps between the tester and the system. Some of the steps, specifying both the test input and the expected system reaction and state, occurred over 50 times in the test cases. The employed test management tool, however, did not facilitate structured reuse of test case steps, thus encouraging cloning. However, if the corresponding system entities change, test cases probably need to be adapted accordingly. These results thus suggest that cloning in test scripts creates problems for maintenance similar to those it creates in source code, requirements specifications and data-flow models.
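The clone coverage figure reported above can be made concrete with a small sketch. It assumes that clone coverage denotes the fraction of words that lie inside at least one clone, and that each clone is given as a start position and a length in words; both the representation and all names are assumptions for illustration, not the data model of the employed tool.

```java
import java.util.BitSet;

/** Illustrative sketch; the clone representation and all names are hypothetical. */
final class CloneCoverageSketch {

    /**
     * Computes the fraction of words covered by at least one clone.
     * Each entry of clones is {startWordIndex, lengthInWords}.
     */
    static double coverage(int totalWords, int[][] clones) {
        if (totalWords <= 0) {
            return 0.0;
        }
        BitSet covered = new BitSet(totalWords);
        for (int[] clone : clones) {
            int start = Math.max(0, clone[0]);
            int end = Math.min(totalWords, clone[0] + clone[1]);
            if (start < end) {
                covered.set(start, end);  // overlapping clones are counted only once
            }
        }
        return (double) covered.cardinality() / totalWords;
    }

    public static void main(String[] args) {
        // Toy data: a 100-word document with three clones, two of which overlap.
        int[][] clones = { {0, 20}, {10, 20}, {50, 24} };
        System.out.println(coverage(100, clones));
    }
}
```

Marking covered words in a bit set, rather than summing clone lengths, prevents overlapping clones from inflating the result.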
Empirical research is required to better understand the extent and impact of cloning in test scripts in practice. Does it increase test case maintenance effort? Does unawareness during maintenance cause inconsistent or erroneous test scripts? Can clone detection support automation of end-to-end tests by identifying recurring test steps that can be reused across automated test cases?
11.5 Code Clone Consolidation
While a lot of work has been done on the detection of clones and on studies of their evolution, less is known about their consolidation.
It has been noted that limitations of abstraction mechanisms can impede simple consolidation of clones through the creation of a shared abstraction. However, it is unclear how much cloning in practice is really caused by this. Many of the clones we inspected in manual assessments during our case studies cannot be explained by language limitations, especially for modern languages like Java or C#. In addition, clone control succeeded in substantially reducing the amount of cloning in the case study presented in Chapter 8. Indeed, our own observations suggest that a large part of the clones in practice can be consolidated. Further empirical research is required to better understand limitations of clone consolidation in practice. When consolidating clones, developers face questions that currently cannot be answered satisfactorily: which clones should be consolidated first? For which clones is the required consolidation effort not justified by expected maintenance simplifications? How can we decide this objectively? Can consolidation in combination with the implementation of other change requests reduce the incurred quality assurance effort? We need a better understanding of these issues to facilitate clone consolidation in practice.
Bibliography
[1] R. Al-Ekram, C. Kapser, R. Holt, and M. Godfrey. Cloning by accident: an empirical study of source code cloning across software systems. In Proc. of ESEM ’05, 2005.
[2] C. Alias and D. Barthou. Algorithm recognition based on demand-driven data-flow analysis. In Proc. of WCRE ’03, 2003.
[3] G. Antoniol, U. Villano, E. Merlo, and M. Di Penta. Analyzing cloning evolution in the linux kernel. Information and Software Technology, 2002.
[4] L. Aversano, L. Cerulo, and M. Di Penta. How clones are maintained: An empirical study. In Proc. of CSMR ’07, 2007.
[5] N. Ayewah, W. Pugh, J. D. Morgenthaler, J. Penix, and Y. Zhou. Using findbugs on production software. In Proc. of OOPSLA ’07, 2007.
[6] B. S. Baker. On finding duplication and near-duplication in large software systems. In Proc. of WCRE ’95, 1995.
[7] T. Bakota, R. Ferenc, and T. Gyimothy. Clone smells in software evolution. In Proc. of ICSM ’07, 2007.
[8] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, and K. Kontogiannis. Partial redesign of Java software systems based on clone analysis. In Proc. of WCRE ’99, 1999.
[9] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, and K. Kontogiannis. Advanced clone-analysis to support object-oriented system refactoring. In Proc. of WCRE ’00, 2000.
[10] V. Basili, L. Briand, S. Condon, Y.-M. Kim, W. L. Melo, and J. D. Valett. Understanding and predicting the process of software maintenance release. In Proc. of ICSE ’96, 1996.
[11] V. Basili, G. Caldiera, and H. Rombach. The goal question metric approach. Encyclopedia of software engineering, 1994.
[12] H. Basit and S. Jarzabek. Detecting higher-level similarity patterns in programs. ACM Softw. Eng. Notes, 2005.
[13] H. Basit and S. Jarzabek. A data mining approach for detecting higher-level clones in software. IEEE Trans. on Softw. Eng., 2009.
[14] H. Basit, S. Puglisi, W. Smyth, A. Turpin, and S. Jarzabek. Efficient token based clone detection with flexible tokenization. In Proc. of ESEM/FSE ’07, 2007.
[15] H. Basit, D. Rajapakse, and S. Jarzabek. Beyond templates: a study of clones in the STL and some general implications. In Proc. of ICSE ’05, 2005.
[16] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of ICSM ’98, 1998.
[17] K. Beck. Test-driven development: By example. Addison-Wesley, 2003.
[18] K. Beck and C. Andres. Extreme programming explained: embrace change. Addison-Wesley Professional, 2004.
[19] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo. Comparison and evaluation of clone detection tools. IEEE Trans. on Softw. Eng., 2007.
[20] N. Bettenburg, W. Shang, W. Ibrahim, B. Adams, Y. Zou, and A. Hassan. An Empirical Study on Inconsistent Changes to Code Clones at Release Level. In Proc. of WCRE ’09, 2009.
[21] B. Boehm. Software Engineering Economics. Prentice-Hall, 1981.
[22] B. Boehm, C. Abts, and S. Chulani. Software development cost estimation approaches – a survey. Ann. Softw. Eng., 2000.
[23] B. W. Boehm, Clark, Horowitz, Brown, Reifer, Chulani, R. Madachy, and B. Steece. Software Cost Estimation with Cocomo II. Prentice Hall PTR, 2000.
[24] J. S. Bradbury and K. Jalbert. Defining a catalog of programming anti-patterns for concurrent java. In Proc. of SPAQu ’09, pages 6–11, Oct. 2009.
[25] F. Brooks Jr. The mythical man-month. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
[26] M. Broy and K. Stølen. Specification and development of interactive systems: focus on streams, interfaces, and refinement. Springer Verlag, 2001.
[27] M. Bruntink, A. van Deursen, R. van Engelen, and T. Tourwé. On the use of clone detection for identifying crosscutting concern code. IEEE Trans. on Softw. Eng., 2005.
[28] A. Bucchiarone, S. Gnesi, G. Lami, G. Trentanni, and A. Fantechi. QuARS Express - A Tool Demonstration. In Proc. of ASE ’08, 2008.
[29] P. Bulychev and M. Minea. Duplicate code detection using anti-unification. Proc. of SYRCoSE ’08, 2008.
[30] P. Bulychev and M. Minea. An evaluation of duplicate code detection using anti-unification. In Proc. of IWSC ’09, 2009.
[31] H. Bunke, P. Foggia, C. Guidobaldi, C. Sansone, and M. Vento. A comparison of algorithms for maximum common subgraph on randomly connected graphs. In Proc. of SSPR and SPR ’02. Springer, 2002.
[32] E. Burd and J. Bailey. Evaluating clone detection tools for use during preventative maintenance. In Proc. of SCAM ’02, Washington, DC, USA, 2002.
[33] G. Casazza, G. Antoniol, U. Villano, E. Merlo, and M. Penta. Identifying clones in the linux kernel. In Proc. of SCAM ’01, 2001.
[34] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 2008.
[35] X. CHANGSONG, P. Eck, and R. Matzner. Syntax-oriented coding (SoC): A new algorithm for the compression of messages constrained by syntax rules. IEEE international symposium on information theory, 1998.
[36] M. Chilowicz, É. Duris, and G. Roussel. Syntax tree fingerprinting for source code similarity detection. In Proc. of ICPC ’09, 2009.
[37] A. Cockburn. Writing Effective Use Cases. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[38] I. Coman, A. Sillitti, and G. Succi. A case-study on using an Automated In-process Software Engineering Measurement and Analysis system in an industrial environment. In Proc. of ICSE ’09, 2009.
[39] M. J. Corbin and L. A. Strauss. Basics of qualitative research: Techniques and procedures for developing grounded theory. Sage Publ., 3. edition, 2008.
[40] J. Cordy. Comprehending reality - practical barriers to industrial adoption of software maintenance automation. In Proc. of IWPC ’03, 2003.
[41] J. R. Cordy, T. R. Dean, and N. Synytskyy. Practical language-independent detection of near-miss clones. In Proc. of CASCON ’04. IBM Press, 2004.
[42] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company, 2nd edition, 2001.
[43] J. Covington and M. Chase. Eight steps to sustainable change. Industrial Management, 2010.
[44] F. Culwin and T. Lancaster. A review of electronic services for plagiarism detection in student submissions. In Proc. of Teaching of Computing ’00, 2000.
[45] I. Davis and M. Godfrey. Clone detection by exploiting assembler. In Proc. of IWSC ’10, 2010.
[46] M. de Wit, A. Zaidman, and A. van Deursen. Managing code clones using dynamic change tracking and resolution. In Proc. of ICSM ’09, 2009.
[47] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In Proc. of SOSP ’07, 2007.
[48] F. Deissenboeck. Continuous Quality Control of Long-Lived Software Systems. PhD thesis, Technische Universität München, 2009.
[49] F. Deissenboeck, M. Feilkas, L. Heinemann, B. Hummel, and E. Juergens. Conqat book, 2009. http://conqat.in.tum.de/index.php/ConQAT_Book.
[50] F. Deissenboeck, L. Heinemann, B. Hummel, and E. Juergens. Flexible architecture conformance assessment with conqat. In Proc. of ICSE ’10, 2010.
[51] F. Deissenboeck, U. Hermann, E. Juergens, and T. Seifert. LEvD: A lean evolution and development process, 2007. http://conqat.cs.tum.edu/download/levd-process.pdf.
[52] F. Deissenboeck, B. Hummel, and E. Juergens. Conqat - ein toolkit zur kontinuierlichen qualitätsbewertung. In Proc. of SE ’08, 2008.
[53] F. Deissenboeck, B. Hummel, E. Juergens, M. Pfaehler, and B. Schaetz. Model clone detection in practice. In Proc. of IWSC ’10, 2010.
[54] F. Deissenboeck, B. Hummel, E. Juergens, B. Schaetz, S. Wagner, J.-F. Girard, and S. Teuchert. Clone detection in automotive model-based development. In Proc. of ICSE ’08, 2008.
[55] F. Deissenboeck, E. Juergens, B. Hummel, S. Wagner, B. M. y Parareda, and M. Pizka. Tool support for continuous quality control. IEEE Softw., 2008.
[56] F. Deissenboeck, M. Pizka, and T. Seifert. Tool support for continuous quality assessment. In Proc. of STEP ’05, 2005.
[57] C. Domann, E. Juergens, and J. Streit. The curse of copy&paste – Cloning in requirements specifications. In Proc. of ESEM ’09, 2009.
[58] dSpace GmbH. TargetLink Production Code Generation. www.dspace.de.
[59] E. Duala-Ekoko and M. Robillard. Clonetracker: tool support for code clone management. In Proc. of ICSE ’08, 2008.
[60] E. Duala-Ekoko and M. P. Robillard. Tracking code clones in evolving software. In Proc. of ICSE ’07, 2007.
[61] S. Ducasse, O. Nierstrasz, and M. Rieger. On the effectiveness of clone detection by string matching. J. Software maintenance Res. Pract., 2006.
[62] S. Ducasse, M. Rieger, and S. Demeyer. A language independent approach for detecting duplicated code. In Proc. of ICSM ’99, 1999.
[63] S. Eick, J. Steffen, and E. Sumner Jr. Seesoft - a tool for visualizing line oriented software statistics. IEEE Trans. on Softw. Eng., 1992.
[64] A. Endres and D. Rombach. A Handbook of Software and Systems Engineering. Pearson, 2003.
[65] W. S. Evans, C. W. Fraser, and F. Ma. Clone detection via structural abstraction. In Proc. of WCRE ’07, 2007.
[66] F. Fabbrini, M. Fusani, S. Gnesi, and G. Lami. An Automatic Quality Evaluation for Natural Language Requirements. In Proc. of REFSQ ’01, 2001.
[67] R. Falke, P. Frenzel, and R. Koschke. Empirical evaluation of clone detection using syntax suffix trees. Empirical Software Engineering, 2008.
[68] R. Fanta and V. Rajlich. Removing clones from the code. J. Software maintenance Res. Pract., 1999.
[69] P. Finnigan, R. Holt, I. Kalas, S. Kerr, K. Kontogiannis, H. Mueller, J. Mylopoulos, S. Perelgut, M. Stanley, and K. Wong. The software bookshelf. IBM Systems J., 1997.
[70] M. Fowler. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999.
[71] M. Fowler and J. Highsmith. The agile manifesto. Software Development, 2001.
[72] J. Franklin. Integration of clone detective into eclipse. Master’s thesis, Technische Universität München, 2009.
[73] M. Gabel, L. Jiang, and Z. Su. Scalable detection of semantic clones. In Proc. ICSE ’08, 2008.
[74] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design patterns: elements of reusable object-oriented software. Addison-Wesley Reading, MA, 1995.
[75] M. R. Garey and D. S. Johnson. Computers and intractability. A guide to the theory of NP-completeness. W. H. Freeman and Company, 1979.
[76] R. Geiger, B. Fluri, H. C. Gall, and M. Pinzger. Relation of code clones and change couplings. In Proc. of FASE ’06. Springer, 2006.
[77] D. German, M. Di Penta, Y. Guéhéneuc, and G. Antoniol. Code siblings: Technical and legal implications of copying code between applications. In Proc. of MSR ’09, 2009.
[78] S. Giesecke. Clone-based Reengineering für Java auf der Eclipse-Plattform. Master’s thesis, Universität Oldenburg, 2003.
[79] T. Gilb and D. Graham. Software Inspection. Addison-Wesley, 1993.
[80] R. Glass. Maintenance: Less is not more. IEEE Softw., 1998.
[81] R. Glass. Facts and fallacies of software engineering. Addison-Wesley Professional, 2003.
[82] B. Gleich, O. Creighton, and L. Kof. Ambiguity detection: Towards a tool explaining ambiguity sources. In Proc. of REFSQ ’10, 2010.
[83] N. Göde. Evolution of Type-1 Clones. In Proc. of SCAM ’09, 2009.
[84] N. Göde. Clone removal: Fact or fiction? In Proc. of IWSC ’10, 2010.
[85] N. Göde and R. Koschke. Incremental clone detection. In Proc. of CSMR ’09, 2009.
[86] N. Gold, J. Krinke, M. Harman, and D. Binkley. Issues in Clone Classification for Dataflow Languages. Proc. of IWSC ’10, 2010.
[87] J. D. Gould, L. Alfaro, R. Finn, B. Haupt, and A. Minuto. Why reading was slower from CRT displays than from paper. SIGCHI Bull., 17, 1987.
[88] S. Grant and J. Cordy. Vector Space Analysis of Software Clones. In Proc. of ICPC ’09, 2009.
[89] P. Grünwald. The minimum description length principle. The MIT Press, 2007.
[90] J. Haldane. Biological possibilities for the human species in the next ten thousand years. Man and his future, 1963.
[91] J. Harder and N. Göde. Quo vadis, clone management? In Proc. of IWSC ’10, 2010.
[92] Y. Higo, Y. Ueda, S. Kusumoto, and K. Inoue. Simultaneous modification support based on code clone analysis. In Proc. of APSEC ’07, 2007.
[93] W. T. B. Hordijk, M. L. Ponisio, and R. J. Wieringa. Harmfulness of code duplication - a structured review of the evidence. In Proc. of EASE ’09. British Computer Society, 2009.
[94] D. Hou, P. Jablonski, and F. Jacob. CnP: Towards an environment for the proactive management of copy-and-paste programming. Proc. of ICPC ’09, 2009.
[95] D. Huffman. A method for the construction of minimum-redundancy codes. Resonance, 2006.
[96] M. Huhn and D. Scharff. Some observations on scade model clones. In Proc. of MBEES ’10, 2010.
[97] B. Hummel, E. Juergens, L. Heinemann, and M. Conradt. Index-Based Code Clone Detection: Incremental, Distributed, Scalable. In Proc. of ICSM ’10, 2010.
[98] I. I. Ianov. On the equivalence and transformation of program schemes. Commun. ACM, 1958.
[99] IEEE. Standard 1219: Software maintenance, 1998.
[100] IEEE. Standard 830-1998: Recommended practice for software requirements specifications, 1998.
[101] L. K. Ishrar Hussain, Olga Ormandjieva. Automatic quality assessment of SRS text by means of a decision-tree-based text classifier. In Proc. of QSIC ’07, 2007.
[102] P. Jablonski and D. Hou. CReN: a tool for tracking copy-and-paste code clones and renaming identifiers consistently in the IDE. In Proc. of Eclipse ’07, 2007.
[103] F. Jacob, D. Hou, and P. Jablonski. Actively comparing clones inside the code editor. In Proc. of IWSC ’10, 2010.
[104] K. Jalbert and J. S. Bradbury. Using clone detection to identify bugs in concurrent software. In Proc. of ICSM ’10, 2010.
[105] Y. Jia, D. Binkley, M. Harman, J. Krinke, and M. Matsushita. KClone: a proposed approach to fast precise code clone detection. In Proc. of IWSC ’09, 2009.
[106] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. DECKARD: Scalable and accurate tree-based detection of code clones. In Proc. of ICSE ’07, 2007.
[107] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of ISSTA ’09, 2009.
[108] J. H. Johnson. Identifying redundancy in source code using fingerprints. In Proc. of CASCON ’93, 1993.
[109] P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In Proc. of MFCS ’91. Springer, 1991.
[110] E. Juergens and F. Deissenboeck. How much is a clone? In Proc. of SQM ’10, 2010.
[111] E. Juergens, F. Deissenboeck, M. Feilkas, B. Hummel, B. Schaetz, S. Wagner, C. Domann, and J. Streit. Can clone detection support quality assessments of requirements specifications? In Proc. of ICSE ’10, 2010.
[112] E. Juergens, F. Deissenboeck, and B. Hummel. Clone detection beyond copy&paste. In Proc. of IWSC ’09, 2009.
[113] E. Juergens, F. Deissenboeck, and B. Hummel. Clonedetective: A workbench for clone detection research. In Proc. of ICSE ’09, 2009.
[114] E. Juergens, F. Deissenboeck, and B. Hummel. Code similarities beyond copy&paste. In Proc. of CSMR ’09, 2010.
[115] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner. Do code clones matter? In Proc. of ICSE ’09, 2009.
[116] E. Juergens and N. Göde. Achieving accurate clone detection results. In Proc. of IWSC ’10, 2010.
[117] E. Juergens, B. Hummel, F. Deissenboeck, and M. Feilkas. Static bug detection through analysis of inconsistent clones. In Proc. of SE ’08. GI, 2008.
[118] M. Jungmann, R. Otterbach, and M. Beine. Development of Safety-Critical Software Using Automatic Code Generation. In Proc. of SAE World Congress ’04, 2004.
[119] D. Jurafsky, J. Martin, A. Kehler, K. Vander Linden, and N. Ward. Speech and language processing. Prentice Hall New York, 2000.
[120] I. Kalaydijeva. Studie zur wiederverwendung bei der softlab gmbh. Master’s thesis, Technische Universität München, 2007.
[121] T. Kamiya, S. Kusumoto, and K. Inoue. Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. on Softw. Eng., 2002.
[122] C. Kapser and M. W. Godfrey. Aiding comprehension of cloning through categorization. In Proc. of IWPSE ’04, 2004.
[123] C. Kapser and M. W. Godfrey. “Cloning considered harmful” considered harmful. In Proc. of WCRE ’06, 2006.
[124] C. J. Kapser, P. Anderson, M. Godfrey, R. Koschke, M. Rieger, F. van Rysselberghe, and P. Weißgerber. Subjectivity in clone judgment: Can we ever agree? In Duplication, Redundancy, and Similarity in Software, Dagstuhl Seminar Proceedings, 2007.
[125] C. J. Kapser and M. W. Godfrey. Improved tool support for the investigation of duplication in software. In Proc. of ICSM ’05, 2005.
[126] S. Kawaguchi, T. Yamashina, H. Uwano, K. Fushida, Y. Kamei, M. Nagura, and H. Iida. SHINOBI: A Tool for Automatic Code Clone Detection in the IDE. In Proc. of WCRE ’09, 2009.
[127] D. Kawrykow and M. Robillard. Improving API usage through detection of redundant code. In Proc. of ASE ’09, 2009.
[128] U. Kelter, J. Wehren, and J. Niere. A generic difference algorithm for UML models. In Proc. of SE ’05, 2005.
[129] A. Kemper and A. Eickler. Datenbanksysteme: Eine Einführung. Oldenbourg Wissenschaftsverlag, 2006.
[130] T. Kiely. Managing change: why reengineering projects fail. Harvard Business Review, 1995.
[131] M. Kim, L. Bergman, T. Lau, and D. Notkin. An ethnographic study of copy and paste programming practices in OOPL. In Proc. of ISESE ’04, 2004.
[132] M. Kim and D. Notkin. Using a clone genealogy extractor for understanding and supporting evolution of code clones. In Proc. of MSR ’05, 2005.
[133] M. Kim, V. Sazawal, D. Notkin, and G. Murphy. An empirical study of code clone genealogies. In Proc. of ESEC/FSE ’05, 2005.
[134] J. Knoop, O. Rüthing, and B. Steffen. Partial dead code elimination. In Proc. of PLDI ’94, 1994.
[135] D. E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, 2nd edition, 1997.
[136] R. Komondoor. Automated duplicated-code detection and procedure extraction. PhD thesis, The University of Wisconsin, Madison, 2003.
[137] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In Proc. of SAS ’01. Springer, 2001.
[138] K. Kontogiannis. Evaluation experiments on the detection of programming patterns using software metrics. In Proc. of WCRE ’97, 1997.
[139] K. Kontogiannis, R. De Mori, E. Merlo, M. Galler, and M. Bernstein. Pattern matching for clone and concept detection. Automated Software Engineering, 1996.
[140] R. Koschke. Survey of research on software clones. In Duplication, Redundancy, and Similarity in Software. Dagstuhl Seminar Proceedings, 2007.
[141] R. Koschke. Frontiers of software clone management. In Frontiers of Software Maintenance, 2008.
[142] R. Koschke, R. Falke, and P. Frenzel. Clone detection using abstract syntax suffix trees. In Proc. of WCRE ’06, 2006.
[143] J. Kotter. Leading change. Harvard Business School Pr, 1996.
[144] J. Kotter and L. Change. Why transformation efforts fail. Harvard Business Review, 1995.
[145] J. Kotter and D. Cohen. The heart of change: Real-life stories of how people change their
organizations. Harvard Business Press, 2002.
[146] J. Krinke. Identifying similar code with program dependence graphs. In Proc. of WCRE ’01,
2001.
[147] J. Krinke. A study of consistent and inconsistent changes to code clones. In Proc. of WCRE
’07, 2007.
[148] J. Krinke. Is cloned code more stable than non-cloned code? Proc. of SCAM ’08, 2008.
[149] B. Lague, D. Proulx, J. Mayrand, E. M. Merlo, and J. Hudepohl. Assessing the benefits of
incorporating function clone detection in a development process. In Proc. of ICSM ’97, 1997.
[150] R. Lämmel and C. Verhoef. Semi-automatic grammar recovery. Softw. Pract. Exp., 2001.
[151] J. Landis and G. Koch. The measurement of observer agreement for categorical data.
Biometrics, 1977.
[152] T. Larkin and S. Larkin. Communicating change: How to win employee support for new
business directions. McGraw-Hill Professional, 1994.
[153] K. Lewin. Frontiers in group dynamics: Concept, method and reality in social science; social
equilibria and social change. Human relations, 1947.
[154] H. Li and S. Thompson. Clone detection and removal for Erlang/OTP within a refactoring
environment. In Proc. of PEPM ’09, 2009.
[155] M. Li, X. Chen, X. Li, B. Ma, and P. Vitányi. The similarity metric. IEEE Transactions on
Information Theory, 2004.
[156] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications.
Springer-Verlag New York Inc, 2008.
[157] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: Finding copy-paste and related bugs in
large-scale software code. IEEE Trans. on Softw. Eng., 2006.
[158] P. Liberatore. Redundancy in logic I: CNF propositional formulae. Artificial Intelligence,
2005.
[159] L. Jiang, Z. Su, and E. Chiu. Context-based detection of clone-related bugs. In Proc.
of ESEC/FSE ’07, 2007.
[160] H. Liu, Z. Ma, L. Zhang, and W. Shao. Detecting duplications in sequence diagrams based
on suffix trees. In Proc. of APSEC ’06, 2006.
[161] S. Livieri, Y. Higo, M. Matsushita, and K. Inoue. Analysis of the Linux kernel evolution using
code clone coverage. In Proc. of MSR ’07, 2007.
[162] S. Livieri, Y. Higo, M. Matsushita, and K. Inoue. Very-large scale code clone analysis and
visualization of open source programs using distributed CCFinder: D-CCFinder. In Proc. of
ICSE ’07, 2007.
[163] A. Lozano and M. Wermelinger. Assessing the effect of clones on changeability. In Proc. of
ICSM ’08, 2008.
[164] A. Lozano, M. Wermelinger, and B. Nuseibeh. Evaluating the harmfulness of cloning: A
change based experiment. In Proc. of MSR ’07, Washington, DC, USA, 2007.
[165] C. Lyon, R. Barrett, and J. Malcolm. A theoretical basis to the automated detection of copying
between texts, and its practical implementation in the ferret plagiarism and collusion detector.
In Proc. of PPPPC ’04, 2004.
[166] D. MacKay. Information theory, inference, and learning algorithms. Cambridge Univ Pr,
2003.
[167] A. Marcus and J. I. Maletic. Identification of high-level concept clones in source code. In
Proc. of ASE ’01, 2001.
[168] E. Martin Odersky. Scala 2.8 collections, October 2009.
http://www.scala-lang.org/sites/default/files/sids/odersky/Fri,%202009-10-02,%2014:16/collections.pdf.
[169] The MathWorks Inc. SIMULINK Model-Based and System-Based Design - Using Simulink,
2002.
[170] J. Mayrand, C. Leblanc, and E. Merlo. Experiment on the automatic detection of function
clones in a software system using metrics. In Proc. of ICSM ’96, 1996.
[171] T. McCabe. A complexity measure. IEEE Trans. on Softw. Eng., 1976.
[172] J. J. McGregor. Backtrack search algorithms and the maximal common subgraph problem.
Software – Practice and Experience, 1982.
[173] T. Mende, F. Beckwermert, R. Koschke, and G. Meier. Supporting the grow-and-prune model
in software product lines evolution using clone detection. In Proc. of CSMR ’08, Washington,
DC, USA, 2008.
[174] M. Mernik, M. Lenic, E. Avdicauševic, and V. Zumer. Multiple attribute grammar
inheritance. Informatica, 2000.
[175] G. Meszaros. xUnit test patterns: Refactoring test code. Prentice Hall PTR Upper Saddle
River, NJ, USA, 2006.
[176] R. Metzger and Z. Wen. Automatic algorithm recognition and replacement. MIT Press, 2000.
[177] B. Meyer. Design and Code Reviews in the Age of the Internet. In Proc. of SEAFOOD ’08.
Springer, 2008.
[178] A. Monden, D. Nakae, T. Kamiya, S. Sato, and K. Matsumoto. Software quality analysis by
code clones in industrial legacy software. In Proc. of METRICS ’02, 2002.
[179] E. Murphy-Hill, P. Quitslund, and A. Black. Removing duplication from java.io: a case
study using traits. In Proc. of OOPSLA ’05, 2005.
[180] H. Nguyen, T. Nguyen, N. Pham, J. Al-Kofahi, and T. Nguyen. Accurate and efficient
structural characteristic feature extraction for clone detection. Proc. of FASE ’09, 2009.
[181] T. Nguyen, H. Nguyen, N. Pham, J. Al-Kofahi, and T. Nguyen. Cleman: Comprehensive
clone group evolution management. In Proc. of ASE ’08, 2008.
[182] T. T. Nguyen, H. A. Nguyen, J. M. Al-Kofahi, N. H. Pham, and T. N. Nguyen. Scalable and
incremental clone detection for evolving software. Proc. of ICSM ’09, 2009.
[183] T. T. Nguyen, H. A. Nguyen, N. H. Pham, J. M. Al-Kofahi, and T. N. Nguyen. Graph-based
mining of multiple object usage patterns. In Proc. of FSE ’09, 2009.
[184] J. Nosek and P. Palvia. Software maintenance management: changes in the last decade. J.
Software maintenance Res. Pract., 1990.
[185] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: Algorithms and
complexity. Prentice-Hall, 1982.
[186] N. Pham, H. Nguyen, T. Nguyen, J. Al-Kofahi, and T. Nguyen. Complete and accurate clone
detection in graph-based models. In Proc. of ICSE ’09, 2009.
[187] M. F. Porter. An algorithm for suffix stripping. Readings in information retrieval, 1997.
[188] A. Pretschner, M. Broy, I. H. Krüger, and T. Stauner. Software Engineering for Automotive
Systems: A Roadmap. In L. Briand and A. Wolf, editors, Proc. of FoSE ’07, 2007.
[189] F. Rahman, C. Bird, and P. Devanbu. Clones: What is that Smell? In Proc. of MSR ’10, 2010.
[190] D. Ratiu. Intentional meaning of programs. PhD thesis, Technische Universität München,
2009.
[191] J. W. Raymond and P. Willett. Maximum common subgraph isomorphism algorithms for the
matching of chemical structures. J. Comput-Aided Mol. Des., 2002.
[192] R. Rivest. The MD5 Message-Digest Algorithm. RFC 1321 (Informational), 1992.
[193] A. L. Rodriguez and M. Wermelinger. Tracking clones imprint. In Proc. of IWSC ’10, 2010.
[194] H. D. Rombach, B. T. Ulery, and J. D. Valett. Toward full life cycle control: Adding
maintenance measurement to the SEL. J. Syst. Softw., 1992.
[195] C. Roy and J. Cordy. An empirical study of function clones in open source software. In Proc.
of WCRE ’08, 2008.
[196] C. Roy and J. Cordy. Scenario-based comparison of clone detection techniques. In Proc. of
ICPC ’08, 2008.
[197] C. Roy and J. Cordy. A mutation/injection-based automatic framework for evaluating clone
detection tools. In Proc. of MUTATION ’09, 2009.
[198] C. Roy and J. Cordy. Near-miss function clones in open source software: an empirical study.
J. Software maintenance Res. Pract., 2009.
[199] C. Roy and J. Cordy. Are Scripting Languages Really Different? Proc. of IWSC ’10, 2010.
[200] C. Roy, J. Cordy, and R. Koschke. Comparison and evaluation of code clone detection
techniques and tools: A qualitative approach. Science of Computer Programming, 2009.
[201] C. K. Roy and J. R. Cordy. A survey on software clone detection research. Technical Report
541, Queen’s University at Kingston, 2007.
[202] C. K. Roy and J. R. Cordy. NICAD: Accurate detection of near-miss intentional clones using
flexible pretty-printing and code normalization. In Proc. of ICPC ’08, 2008.
[203] J. D. Rutledge. On Ianov’s program schemata. J. of the ACM, 1964.
[204] A. Sæbjørnsen, J. Willcock, T. Panas, D. Quinlan, and Z. Su. Detecting code clones in binary
executables. In Proc. of ISSTA ’09, pages 117–128. ACM, 2009.
[205] K. Sayood. Introduction to data compression. Morgan Kaufmann, 2000.
[206] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In Proc. of New
Methods in Language Processing ’94, 1994.
[207] H. Schmid. Improvements in part-of-speech tagging with an application to German. Natural
language processing using very large corpora, 1999.
[208] M. Shaw and D. Garlan. Software architecture. Prentice Hall, 1996.
[209] J. Singer, T. Lethbridge, N. Vinson, and N. Anquetil. An examination of software engineering
work practices. In Proc. of CASCON ’97. IBM Press, 1997.
[210] R. Smith and S. Horwitz. Detecting and measuring similarity in code clones. In Proc. of
IWSC ’09, 2009.
[211] H. Sneed. A cost model for software maintenance & evolution. In Proc. of ICSM ’04. IEEE
CS Press, 2004.
[212] M. Stevens, A. Sotirov, J. Appelbaum, A. K. Lenstra, D. Molnar, D. A. Osvik, and
B. de Weger. Short chosen-prefix collisions for MD5 and the creation of a rogue CA
certificate. In Proc. of CRYPTO ’09, 2009.
[213] R. Tairas and J. Gray. Phoenix-based clone detection using suffix trees. In Proc. of Southeast
regional conference ’06, 2006.
[214] R. Tairas, J. Gray, and I. Baxter. Visualization of clone detection results. In Proc. of ETX
’06, 2006.
[215] H. Täubig. Fast Structure Searching for Computational Proteomics. PhD thesis, TU
München, 2007.
[216] S. Thummalapenta, L. Cerulo, L. Aversano, and M. Di Penta. An empirical study on the
maintenance of source code clones. Empirical Software Engineering, 2009.
[217] R. Tiarks, R. Koschke, and R. Falke. An assessment of type-3 clones as detected by
state-of-the-art tools. In Proc. of SCAM ’09, 2009.
[218] M. Toomim, A. Begel, and S. L. Graham. Managing duplicated code with linked editing. In
Proc. of VLHCC ’04, 2004.
[219] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. Gemini: Maintenance support environment
based on code clone analysis. In Proc. of METRICS ’02, 2002.
[220] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. On detection of gapped code clones using
gap locations. In Proc. of APSEC ’02, 2002.
[221] E. Ukkonen. Approximate string matching over suffix trees. In Proc. of CPM ’93. Springer,
1993.
[222] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 1995.
[223] J. Van Wijk and H. van de Wetering. Cushion treemaps: Visualization of hierarchical
information. In Proc. of INFOVIS ’99, 1999.
[224] J. Vlissides. Generation Gap. C++ Report, 1996.
[225] S. Wagner, F. Deissenboeck, B. Hummel, E. Juergens, B. M. y Parareda, and B. S. (Eds.).
Selected topics in software quality. Technical Report TUM-I0824, Technische Universität
München, Germany, July 2008.
[226] V. Wahler, D. Seipel, J. Wolff, and G. Fischer. Clone detection in source code by frequent
itemset techniques. In Fourth IEEE International Workshop on Source Code Analysis and
Manipulation, 2004.
[227] A. Walenstein. Code clones: Reconsidering terminology. In Duplication, Redundancy, and
Similarity in Software, Dagstuhl Seminar Proceedings, 2007.
[228] A. Walenstein, M. El-Ramly, J. R. Cordy, W. S. Evans, K. Mahdavi, M. Pizka,
G. Ramalingam, and J. W. von Gudenberg. Similarity in programs. In R. Koschke, E. Merlo, and
A. Walenstein, editors, Duplication, Redundancy, and Similarity in Software, number 06301
in Dagstuhl Seminar Proceedings. IBFI, 2007.
[229] A. Walenstein, N. Jyoti, J. Li, Y. Yang, and A. Lakhotia. Problems creating task-relevant
clone detection reference data. In Proc. of WCRE ’03, 2003.
[230] M. Weber and J. Weisbrod. Requirements engineering in automotive development –
experiences and challenges. In Proc. of RE ’02, 2002.
[231] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang. Clustering user queries of a search engine. In Proc. of
WWW ’01, 2001.
[232] L. Wills. Flexible control for program recognition. In Proc. of WCRE ’93, 1993.
[233] W. M. Wilson, L. H. Rosenberg, and L. E. Hyatt. Automated analysis of requirement
specifications. In Proc. of ICSE ’97, 1997.
[234] C. Wohlin, P. Runeson, and M. Höst. Experimentation in software engineering: An
introduction. Kluwer Academic, Boston, Mass., 2000.
[235] T. Yamashina, H. Uwano, K. Fushida, Y. Kamei, M. Nagura, S. Kawaguchi, and H. Iida.
SHINOBI: A real-time code clone detection tool for software maintenance. Technical Report
NAIST-IS-TR2007011, Nara Institute of Science and Technology, 2008.
[236] D. Yeh and J.-H. Jeng. An empirical study of the influence of departmentalization and
organizational position on software maintenance. J. Softw. Maint. Evol. Res. Pr., 2002.
[237] A. Ying, G. Murphy, R. Ng, and M. Chu-Carroll. Predicting source code changes by mining
change history. IEEE Trans. on Softw. Eng., 2004.
[238] Y. Zhang, H. Basit, S. Jarzabek, D. Anh, and M. Low. Query-based filtering and graphical
view generation for clone analysis. In Proc. of ICSM ’08, 2008.