
Why and how to control cloning in software artifacts [electronic resource] / Elmar Juergens

Description

Why and How to Control Cloning in Software Artifacts. Elmar Juergens. Institut für Informatik der Technischen Universität München. Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.). Chair: Univ.-Prof. Bernd Brügge, Ph.D. Examiners of the dissertation: 1. Univ.-Prof. Dr. Dr. h.c. Manfred Broy; 2. Univ.-Prof. Dr. Rainer Koschke, Universität Bremen. The dissertation was submitted to the Technische Universität München on 07.10.2010 and accepted by the Fakultät für Informatik on 19.02.2011.

Abstract: The majority of the total life cycle costs of long-lived software arises after its first release, during software maintenance. Cloning, the duplication of parts of software artifacts, hinders maintenance: it increases size, and thus effort for activities such as inspections and impact analysis. Changes need to be performed to all clones, instead of to a single location only, thus increasing effort. If individual clones are forgotten during a modification, the resulting inconsistencies can threaten program correctness. Cloning is thus a quality defect. The software engineering community recognized the negative consequences of cloning over a decade ago. Nevertheless, it abounds in practice, across artifacts, organizations and domains. Cloning thrives, since its control is not part of software engineering practice.

Published 01 January 2011
Language English

Why and How to Control Cloning in Software Artifacts

Elmar Juergens

Institut für Informatik der Technischen Universität München

Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Univ.-Prof. Bernd Brügge, Ph.D.
Examiners of the dissertation:
1. Univ.-Prof. Dr. Dr. h.c. Manfred Broy
2. Univ.-Prof. Dr. Rainer Koschke, Universität Bremen

The dissertation was submitted to the Technische Universität München on 07.10.2010 and accepted by the Fakultät für Informatik on 19.02.2011.

Abstract

The majority of the total life cycle costs of long-lived software arises after its first release, during software maintenance. Cloning, the duplication of parts of software artifacts, hinders maintenance: it increases size, and thus effort for activities such as inspections and impact analysis. Changes need to be performed to all clones, instead of to a single location only, thus increasing effort. If individual clones are forgotten during a modification, the resulting inconsistencies can threaten program correctness. Cloning is thus a quality defect.

The software engineering community has recognized the negative consequences of cloning over a decade ago. Nevertheless, it abounds in practice, across artifacts, organizations and domains. Cloning thrives, since its control is not part of software engineering practice. We are convinced that this has two principal reasons: first, the significance of cloning is not well understood. We do not know the extent of cloning across different artifact types and the quantitative impact it has on program correctness and maintenance efforts. Consequently, we do not know the importance of clone control. Second, no comprehensive method exists that guides practitioners through tailoring and organizational change management required to establish successful clone control. Lacking both a quantitative understanding of its harmfulness and comprehensive methods for its control, cloning is likely to be neglected in practice.
This thesis contributes to both areas. First, we present empirical results on the significance of cloning. Analysis of differences between code clones in productive software revealed over 100 faults. More specifically, every second modification to code that was done in unawareness of its clones caused a fault, demonstrating the impact of code cloning on program correctness. Furthermore, analysis of industrial requirements specifications and graph-based models revealed substantial amounts of cloning in these artifacts, as well. The size increase caused by cloning affects inspection efforts: for one specification, by an estimated 14 person days; for a second one by over 50%. To avoid such impact on program correctness and maintenance efforts, cloning must be controlled.

Second, we present a comprehensive method for clone control. It comprises detector tailoring to improve accuracy of detected clones, and assessment to quantify their impact. It guides organizational change management to successfully integrate clone control into established maintenance processes, and root cause analysis to prevent the creation of new clones. To operationalize the method, we present a clone detection workbench for code, requirements specifications and models that supports all these steps. We demonstrate the effectiveness of the method (including its tools) through an industrial case study, where it successfully reduced cloning in the participating system.
Finally, we identify the limitations of clone detection and control. Through a controlled experiment, we show that clone detection approaches are unsuited to detect behaviorally similar code that has been developed independently and is thus not the result of copy & paste. Its detection remains an important topic for future work.

Acknowledgements

I have spent the last four years as a researcher at the Lehrstuhl for Software & Systems Engineering of Prof. Dr. Dr. h.c. Manfred Broy at Technische Universität München. I want to express my gratitude to Manfred Broy for the freedom and responsibility I was granted and for his guidance and advice. I have enjoyed, and still enjoy, working in the challenging and competitive research environment he creates. I want to thank Prof. Dr. rer. nat. Rainer Koschke for agreeing to co-supervise this thesis. I am grateful for inspiring discussions on software cloning, but also for the hospitality and interest, both by him and his group, that I experienced during my visit in Bremen. My view of the social aspects of research, which formed in the large, thematically heterogeneous group of Manfred Broy, was enriched by the glimpse into the smaller, more focused group of Rainer Koschke.

I am very grateful to my colleagues. Their support, both on the scientific and on the personal level, was vital for the success of this thesis and, not least, for my personal development during the last four years. I am grateful to Silke Müller for schedule magic. To Florian Deissenboeck for being an example worth following and for both his encouragement and outright criticism. To Benjamin Hummel for his merit and creativity in producing ideas, and for his productivity and effectiveness in their realization. To Martin Feilkas for his ability to overview and simplify complicated situations and for reliability and trust come what may. To Stefan Wagner for his guidance and example in scientific writing and empirical research. To Daniel Ratiu for the sensitivity, carefulness and depth he shows during scientific discussions (and outside of them). To Lars Heinemann for being the best colleague I ever shared an office with and for his tolerance exhibited doing so. To Markus Herrmannsdörfer for his encouragement and pragmatic, uncomplicated way that makes collaboration productive and fun. To Markus Pizka for raising my interest in research and for encouraging me to start my PhD thesis. Working with all of you was, and still is, a privilege.

Research, understanding and idea generation benefit from collaboration. I am grateful for joint paper projects to Sebastian Benz, Michael Conradt, Florian Deissenboeck, Christoph Domann, Martin Feilkas, Jean-François Girard, Nils Göde, Lars Heinemann, Benjamin Hummel, Klaus Lochmann, Benedikt Mas y Parareda, Michael Pfaehler, Markus Pizka, Daniel Ratiu, Bernhard Schaetz, Jonathan Streit, Stefan Teuchert and Stefan Wagner. In addition, this thesis benefited from the feedback of many. I am thankful for proof-reading drafts to Florian Deissenboeck, Martin Feilkas, Nils Göde, Lars Heinemann, Benjamin Hummel, Klaus Lochmann, Birgit Penzenstadler, Daniel Ratiu and Stefan Wagner. And to Rebecca Tiarks for help with the Bellon Benchmark.

The empirical parts of this work could not have been realized without the continuous support of our industrial partners. I want to thank everybody I worked with at ABB, MAN, LV 1871 and Munich Re Group. I particularly thank Munich Re Group, especially Rainer Janßen and Rudolf Vaas, for the long-term collaboration with our group that substantially supported this dissertation.

Most of all, I want to thank my family for their unconditional support (both material and immaterial) not only during my dissertation, but during all of my education. I am deeply grateful to my parents, my brother and, above all, my wife Sofie.

»A man’s gotta do what a man’s gotta do«
Fred MacMurray in The Rains of Ranchipur

»A man’s gotta do what a man’s gotta do«
Gary Cooper in High Noon

»A man’s gotta do what a man’s gotta do«
George Jetson in The Jetsons

»A man’s gotta do what a man’s gotta do«
John Cleese in Monty Python’s Guide to Life

Contents

1 Introduction
1.1 Problem Statement
1.2 Contribution
1.3 Contents

2 Fundamentals
2.1 Notions of Redundancy
2.2 Software Cloning
2.3 Notions of Program Similarity
2.4 Terms and Definitions
2.5 Clone Metrics
2.6 Data-flow Models
2.7 Case Study Partners
2.8 Summary

3 State of the Art
3.1 Impact on Program Correctness
3.2 Extent of Cloning
3.3 Clone Detection Approaches
3.4 Clone Assessment and Management
3.5 Limitations of Clone Detection

4 Impact on Program Correctness
4.1 Research Questions
4.2 Study Design
4.3 Study Objects
4.4 Implementation and Execution
4.5 Results
4.6 Discussion
4.7 Threats to Validity
4.8 Summary

5 Cloning Beyond Code
5.1 Research Questions
5.2 Study Design
5.3 Study Objects
5.4 Implementation and Execution
5.5 Results
5.6 Discussion
5.7 Threats to Validity
5.8 Summary

6 Clone Cost Model
6.1 Maintenance Process
6.2 Approach
6.3 Detailed Cost Model
6.4 Simplified Cost Model
6.5 Discussion
6.6 Instantiation
6.7 Summary

7 Algorithms and Tool Support
7.1 Architecture
7.2 Preprocessing
7.3 Detection Algorithms
7.4 Postprocessing
7.5 Result Presentation
7.6 Comparison with other Clone Detectors
7.7 Maturity and Adoption
7.8 Summary

8 Method for Clone Assessment and Control
8.1 Overview
8.2 Clone Detection Tailoring
8.3 Assessment of Impact
8.4 Root Cause Analysis
8.5 Introduction of Clone Control
8.6 Continuous Clone Control
8.7 Validation of Assumptions
8.8 Evaluation
8.9 Summary

9 Limitations of Clone Detection
9.1 Research Questions
9.2 Study Objects
9.3 Study Design
9.4 Implementation and Execution
9.5 Results
9.6 Discussion
9.7 Threats to Validity
9.8 Summary

10 Conclusion
10.1 Significance of Cloning
10.2 Clone Control

11 Future Work
11.1 Management of Simions
11.2 Clone Cost Model Data Corpus
11.3 Language Engineering
11.4 Cloning in Natural Language Documents
11.5 Code Clone Consolidation

Bibliography

1 Introduction

Software maintenance accounts for the majority of the total life cycle costs of successful software systems [21,80,184]. Half of the maintenance effort is not spent on bug fixing or adaptations to changes of the technical environment, but on evolving and new functionality. Maintenance thus preserves and increases the value that software provides to its users. Reducing the number of changes that get performed during maintenance threatens to reduce this value. Instead, to lower the total life cycle costs of software systems, the individual changes need to be made simpler. An important goal of software engineering is thus to facilitate the construction of systems that are easy, and thus more economic, to maintain.

Software comprises a variety of artifacts, including requirements specifications, models and source code. During maintenance, all of them are affected by change. In practice, these artifacts often contain substantial amounts of duplicated content. Such duplication is referred to as cloning.

Figure 1.1: Cloning in use case documents

Cloning hampers maintenance of software artifacts in several ways. First, it increases their size and thus effort for all size-related activities such as inspections: inspectors simply have to work through more content. Second, changes that are performed to an artifact often also need to be performed to its clones, causing effort for their location and consistent modification. If, e.g., different use case documents contain duplicated interaction steps for system login, they all have to be adapted if authentication is changed from password to keycard entry. Moreover, if not all clones of an artifact are modified consistently, inconsistencies can occur that can result in faults in deployed software. If, e.g., a developer fixes a fault in a piece of code but is unaware of its clones, the fault fails to be removed from the system. Each of these effects of cloning contributes to increased software life cycle costs. Cloning is, hence, a quality defect.

Figure 1.2: Cloning threatens program correctness

The negative impact of cloning becomes tangible through examples from real-world software. We studied inspection effort increase due to cloning in 28 industrial requirements specifications. For the largest specification, the estimated inspection effort increase is 110 person hours, or almost 14 person days. For a second specification, it even doubles due to cloning¹.

The effort increase due to the necessity to perform multiple modifications is illustrated in Figure 1.1, which depicts cloning in 150 use cases from an industrial business information system. Each black rectangle represents a use case, its height corresponding to the length of the use case in lines. Each colored stripe depicts a specification clone; stripes with the same color indicate clones with similar text. If a change is made to a colored region, it may need to be performed multiple times, increasing modification effort accordingly.

Finally, Figure 1.2² illustrates the consequences of inconsistent modifications to cloned code for program correctness: a missing null check has only been fixed in one clone, the other still contains the defect and can crash the system at runtime.
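
The kind of inconsistency described here can be sketched in a few lines. The following Python fragment is a hypothetical illustration only, not the Sysiphus code: the function names, the dictionary layout and the default value are assumptions made for this example.

```python
def format_owner(issue):
    """Fragment that received the fix: guards against a missing owner."""
    owner = issue.get("owner")
    if owner is None:  # the null check that was added as a fix
        return "unassigned"
    return owner["name"].upper()


def format_reviewer(issue):
    """Clone of the fragment above that was overlooked when the fix was made."""
    reviewer = issue.get("reviewer")
    # No null check here: subscripting None raises a TypeError at runtime.
    return reviewer["name"].upper()
```

Calling `format_owner({})` returns `"unassigned"`, whereas `format_reviewer({})` fails at runtime. The fault was removed from one clone but remains in the system.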

1.1 Problem Statement

Different groups in the software engineering community have independently recognized that cloning can negatively impact engineering efforts. Redundancy in requirements specifications, including cloning, is considered as an obstacle for modifiability [100] and listed as a major problem in automotive requirements engineering [230]. Cloning in source code is deemed as an indicator for bad design [17,70,175]. In response, the investigation of cloning has grown into an active area in the software engineering research community [140,201], yielding, e.g., numerous detection approaches and a better understanding of the origin and evolution of cloning in source code.

¹ The study is presented in detail in Chapter 5.
² The code example is taken from the open source project Sysiphus.

Nevertheless, cloning abounds in practice. Researchers report that between 8% and 29%, in some cases even more than 60%, of the source code in industrial and open source systems has been duplicated at least once [6,62,157]. Cloning in source code has been reported for different programming languages and application domains [140,201]. Despite these facts, hardly any systematic measures to control cloning are taken in practice. Given its known extent and negative impact on real-world software, we consider this apparent lack of applied measures for clone control as precarious.

Based on our experiences from four years of close collaboration on software cloning with our industrial partners, we see two principal reasons for this: first, the significance of cloning is insufficiently understood; second, we lack a comprehensive method that guides practitioners in establishing continuous clone control. We detail both reasons below.

Significance of Cloning. The extent of cloning in software artifacts is insufficiently understood. While numerous studies have revealed cloning in source code, hardly anything is known about cloning in other artifacts, such as requirements specifications and models.

Even more importantly, the quantitative impact of cloning on program correctness and maintenance effort is unclear. While existing research has demonstrated its impact qualitatively, we cannot quantify it in terms of faults or effort increase. Consequently, we do not know how harmful cloning, and how important clone control, really is in practice.

Clone Control. To be effective, clone control needs to be applied continuously, both to prevent the creation of new clones and to create awareness of existing clones during code modification. Continuous application requires accurate results. However, existing tools produce large amounts of false positives. Since inspection of false positives is a waste of effort, and repeated inspection even more so, they inhibit continuous clone control. We lack commonly accepted criteria for clone relevance and techniques to achieve accurate results. Furthermore, to have long term success, clone control must be part of the maintenance process. Its integration requires changes to established habits. Unfortunately, existing approaches for clone management are limited to technical topics and ignore organizational issues.

To operationalize clone control, comprehensive tool support is required that supports all of its steps. Existing tools, however, typically focus on individual aspects, such as clone detection or change propagation, or are limited to source code and thus cannot be applied to specifications or models. Furthermore, most detection approaches are not both incremental and scalable. They thus cannot provide real-time results for large evolving software artifacts. Dedicated tool support is thus required for clone control.

Problem. We need a better understanding of the quantitative impact of cloning on software engineering and a comprehensive method and tool support for clone control.

1.2 Contribution

This dissertation contributes to both areas, as detailed below.

Significance of Cloning. We present empirical studies and an analytical cost model to demonstrate the significance of cloning and, consequently, the importance of clone control.

First, we present a large scale case study investigating the impact of cloning on program correctness. Through the analysis of inconsistently maintained clones, 107 faults were discovered in industrial and open source software, including 17 critical ones that could result in system crashes or data loss; not a single system was without faults in inconsistently modified cloned code. Every second change to cloned code that was unaware of cloning was faulty. This demonstrates that unawareness of cloning significantly impacts program correctness and thus demonstrates the importance to control code cloning in practice. The case study was carried out with Munich Re and LV 1871.

Second, we present two large industrial case studies that investigate cloning in requirements specifications and Matlab/Simulink models. They demonstrate that the extent and impact of cloning are not limited to source code. For these artifacts, manual inspections are commonly used for quality assurance. The cloning induced size increase translates to higher inspection efforts: for one of the analyzed specifications by an estimated 14 person days; for a second one it more than doubles. To avoid these consequences, cloning needs to be controlled for requirements specifications and graph-based models, too. This work is the first to investigate cloning in requirements specifications and graph-based models. The case studies were carried out, among others, with Munich Re, Siemens, and MAN Nutzfahrzeuge Group.

Third, we present an analytical cost model that quantifies the impact of code cloning on maintenance activities and field faults. It complements the above empirical studies by making our observations and assumptions about the impact of code cloning on software maintenance explicit. The cost model provides a foundation for assessment and trade-off decisions. Furthermore, its explicitness offers an objective basis for scientific discourse about the consequences of cloning.

Clone Control. We present a comprehensive method for clone control and tool support to operationalize it in practice.

We introduce a method for clone assessment and control that provides detailed steps for the assessment of cloning in software artifacts and for the control of cloning during software engineering. It comprises detector tailoring to achieve accurate detection results; assessment to evaluate the significance of cloning for a software system; change management to successfully adapt established processes and habits; and root cause analysis to prevent creation of excessive amounts of new clones. The method has been evaluated in a case study with Munich Re in which continuous clone control was performed over the course of one year and succeeded in reducing code cloning.
To operationalize the method, we introduce industrial-strength tool support for clone assessment and control. It includes novel clone detection algorithms for requirements specifications, graph-based models and source code. The proposed index-based detection algorithm is the first approach that is at the same time incremental, distributed and scalable to very large code bases. Since the tool support has matured beyond the stage of a research prototype, several companies have included it into their development or quality assessment processes, including ABB, Bayerisches Landeskriminalamt, BMW, Capgemini sd&m, itestra GmbH, Kabel Deutschland, Munich Re and Wincor Nixdorf. It is available as open source for use by both industry and the research community.
Finally, this thesis presents a controlled experiment that shows that existing clone detectors, and their underlying approaches, are limited to copy & paste. They are unsuited to detect behaviorally similar code of independent origin. The experiment was performed on over 100 behaviorally similar programs that were produced independently by 400 students through implementation of a single specification. Quality control thus cannot rely on clone control to manage such redundancies. Our empirical results indicate, however, that they do occur in practice. Their detection thus remains an important topic for future work.
As stated above, software comprises various artifact types. All of them can be affected by cloning. We are convinced that it should be controlled for all artifacts that are subject to maintenance. However, the set of all artifacts described in the literature is large, beyond what can be covered in depth in a dissertation. In this work, we thus focus on three artifact types that are central to software engineering: requirements specifications, models and source code. Among them, source code is arguably the most important: maintenance simply cannot avoid it. Even projects that have (sensibly or not) abandoned maintenance of requirements specifications and models still have to modify source code. Consequently, it is the artifact type that receives most attention in this thesis.

1.3 Contents

The remainder of this thesis is structured as follows:

Chapter 2 discusses different notions of redundancy, defines the terms used in this thesis and introduces the fundamentals of software cloning. Chapter 3 discusses related work and outlines open issues, providing justification for the claims made in the problem statement.

The following chapters present the contributions of the thesis in the same order as they are listed in Section 1.2. Chapter 4 presents the study on the impact of unawareness of cloning on program correctness. Chapter 5 presents the study on the extent and impact of cloning in requirements specifications and Matlab/Simulink models. Chapter 6 presents the analytical clone cost model. Chapter 7 outlines the architecture and functionality of the proposed clone detection workbench. Chapter 8 introduces the method for clone assessment and control and its evaluation. Chapter 9 reports on the controlled experiment on the capabilities of clone detection in behaviorally similar code of independent origin.

Finally, Chapter 10 summarizes the thesis and Chapter 11 provides directions for future research.

Previously Published Material

Parts of the contributions presented in this thesis have been published in [53–55,57,97,110–117].

2 Fundamentals

This chapter introduces the fundamentals of this thesis. The first part discusses different notions of the term redundancy that are used in computer science. It then introduces software cloning and other notions of program similarity in the context of these notions of redundancy. The later parts of the chapter introduce terms, metrics and artifact types that are central to the thesis and the industrial partners that participated in the case studies.

2.1 Notions of Redundancy

Redundancy is the fundamental property of software artifacts underlying software cloning research. This section outlines and compares different notions of redundancy used in computer science. It provides the foundation to discuss software cloning, the form of redundancy studied in this thesis.

2.1.1 Duplication of Problem Domain Information

In several areas of computer science, redundancy is defined as duplication of problem domain knowledge in the representation. We use the term “problem domain knowledge” with a broad meaning: it not only refers to the concepts, processes and entities from the business domain of a software artifact. Instead, we employ it to include all concepts implemented by a program or represented in an artifact. These can, e.g., include data structures and algorithms and comprise both structural and behavioral aspects.

Normal Forms in Relational Databases. Intuitively, a database contains redundancy, if a single fact from the problem domain is stored multiple times in the database. If compared to a database without redundancy, this has several disadvantages:

Size increase: Representation of information requires space. Storing a single fact multiple times thus increases the size of a database and thus costs for storage or algorithms whose runtime depends on database size.

Update anomaly: If information changes, e.g., through evolution of the problem domain, all locations in which it is stored in the database need to be changed accordingly. A single change in the problem domain thus requires multiple modifications in the database. The fact that a single change requires multiple modifications is referred to as update anomaly and increases modification effort. Furthermore, if not all locations are updated, inconsistencies can creep into the database.
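
The update anomaly can be made concrete with a small sketch. The table layout, the names and the helper function below are assumptions chosen for illustration; they are not taken from the thesis.

```python
# Denormalized table: the fact "customer lives in city" is stored once per order.
orders = [
    {"order_id": 1, "customer": "Ada", "city": "Munich"},
    {"order_id": 2, "customer": "Ada", "city": "Munich"},
    {"order_id": 3, "customer": "Bob", "city": "Bremen"},
]


def inconsistent_customers(rows):
    """Return the customers for whom more than one city is stored."""
    cities = {}
    for row in rows:
        cities.setdefault(row["customer"], set()).add(row["city"])
    return sorted(name for name, seen in cities.items() if len(seen) > 1)


# The fact changes in the problem domain, but only one stored copy is updated.
orders[0]["city"] = "Berlin"
```

After the partial update, `inconsistent_customers(orders)` reports `["Ada"]`: the database now contains two contradicting versions of the same fact. In a schema that stores the city once per customer, a single modification would have sufficed.
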

Relational database design advocates normal forms to reduce redundancy in databases [129]. Normal forms are properties of database schemas that, when violated, indicate multiple storage of information from the problem domain in the database. Normal forms are defined as properties on the schemas [129], not of the data entries stored in the database. Database schema design thus propagates a top-down approach to discover and avoid redundancy in databases: through analysis of the properties of the schema, not through analysis of similarity in the data.

Logical Redundancy in Programs. In his PhD thesis, Daniel Ratiu defines logical redundancy for programs [190]. Intuitively, according to his definitions, a program contains redundancy if facts from the problem domain are implemented multiple times in the program. Just as for databases, if compared to a program without redundancy, this has several disadvantages:

Size increase: Implementation of a fact from the problem domain requires space in the program and thus increases program size. For software maintenance, this can increase efforts for size-related activities such as inspections.

Update anomaly: Similarly to the update anomaly in databases, if a fact in the problem domain changes, all of its implementations need to be adapted accordingly, creating effort for their location and consistent modification. Again, if modification is not performed consistently to all instances, inconsistencies can be introduced into the program.

Just as for databases, redundancy is defined independent of the actual representation of the data. Redundant program fragments thus can, but do not need to, look syntactically similar.

Whereas schemas provide models of the problem domain for database systems, there is no comparable model of the problem domain of programs. Ratiu suggests using ontologies as models of the problem domain [190]. Since they are typically not available, they have to be created to detect redundancy.

2.1.2 Representation Size Excess

In information theory [166], minimal description length research [89] and data compression [205], redundancy is defined as size excess. Intuitively, data contains redundancy, if a shorter representation for it can be found from which it can be reproduced without loss of information.

The notion of redundancy as size excess translates to compression potential. Any property of an artifact, which can be exploited for compression, thus increases its size excess. Since, according to Grünwald [89], any regularity can in principle be exploited to compress an artifact, all regularity increases size excess.

Compression potential not only depends on the artifact but also on the employed compression scheme. The most powerful compression scheme is the Kolmogorov complexity of an artifact, defined as the size of the smallest program that produces the artifact. Unfortunately, it is undecidable [89,156]. Hence, to employ compression potential as a metric for redundancy in practice, less powerful, but efficiently computable compression schemes are employed, as, e.g., general purpose compressors like gzip or GenCompress.
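
As a minimal sketch of compression potential used as a redundancy metric, the following Python fragment relies on the standard library's `zlib` module (DEFLATE, the algorithm also used by gzip). The function name `size_excess` and the sample data are assumptions for illustration; the thesis does not prescribe this particular measure.

```python
import zlib


def size_excess(data: bytes) -> float:
    """Fraction of the original size that zlib can remove (0.0 means none)."""
    if not data:
        return 0.0
    compressed = zlib.compress(data, 9)
    # Very short inputs can grow due to format overhead; clamp at zero.
    return max(0.0, 1.0 - len(compressed) / len(data))


# One fragment, and an artifact that contains the same fragment twenty times.
fragment = b"if (user == null) { throw new IllegalArgumentException(); }\n"
cloned = fragment * 20
```

Here `size_excess(cloned)` is far higher than `size_excess(fragment)`, because the compressor can replace the repeated copies by short back-references. The measure depends on the chosen compressor and says nothing about where the regularity comes from.
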

Regularity in data representation can have different sources. Duplicated fragments of problem domain knowledge exhibit the same structure and thus represent regularity. Regularity, however, does not need to stem from problem domain knowledge duplication. Inefficient encoding of the alphabet of a language into a binary representation introduces regularity that can be exploited for compression, as is, e.g., done by Huffman coding [95].
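
How an inefficient alphabet encoding creates compression potential can be sketched with a small Huffman code length computation. This is an illustrative implementation under simple assumptions (one byte per symbol as the inefficient baseline, a hypothetical sample string); it is not code from the thesis.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Map each symbol to the length in bits of its Huffman code word."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie breaker, {symbol: depth in the tree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes all their symbols one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encoded_bits(text):
    """Total number of bits needed to encode text with its Huffman code."""
    lengths = huffman_code_lengths(text)
    return sum(lengths[symbol] for symbol in text)


sample = "aaaaaaaabbbc"         # skewed symbol distribution
fixed_bits = 8 * len(sample)    # baseline: one byte per symbol
```

For this sample, the fixed-width encoding needs 96 bits while `encoded_bits(sample)` yields 16. The difference is size excess that stems purely from the encoding, not from duplicated problem domain knowledge.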

Similarly, language grammars are a source of regularity, since they enforce syntax rules to which all artifacts written in a language adhere. Again, this regularity can be exploited for compression, as is, e.g., done by syntax-based coding [35].

Redundancy in terms of representation size excess thus corresponds to compression potential of an artifact. Regularity in the data representation provides compression potential, independent of its source: from the point of view of compression, it is of no importance if the regularity stems from problem domain knowledge duplication or inefficient coding. This notion of redundancy thus does not differentiate between different sources of regularity.

2.1.3 Discussion

There are fundamental differences between the two notions of redundancy. Whereas normal forms and logical program redundancy are defined in terms of duplication of information from the problem domain in the representation, size excess is defined on the representation alone. This is explicit in the statement from Grünwald [89]: »We only have the data«; no interpretation in terms of the problem domain is performed. This has two implications:

Broader applicability: Since no interpretation in terms of the problem domain is required, it can be applied to arbitrary data. This is obvious for data compression that is entirely agnostic of the information encoded in the files it processes. However, it can also be applied to data we know how to interpret, but for which no suitable machine readable problem domain models are available, as, e.g., programs for which we do not have complete ontologies.

Reduced conclusiveness w.r.t. domain knowledge duplication: Since different sources of regularity can create representation size excess, it is no conclusive indicator for problem domain knowledge duplication. This needs to be taken into account by approaches that search for redundancy on the representation alone to discover problem domain knowledge duplication.

The relationship between the two notions of redundancy is sketched in the diagram in Figure 2.1. The left set represents redundancy in the sense of duplicated domain knowledge. The right set represents redundancy in terms of representation size excess. Their intersection represents duplicated domain knowledge that is sufficiently representationally similar to be compressible by the employed compression scheme.

The diagram assumes an imperfect compression scheme. For a perfect compressor, problem domain knowledge duplication would be entirely contained in representation size excess, since a perfect compressor would know how to exploit it for compression, even if it is syntactically different. However, no such compressor exists and, since Kolmogorov complexity is undecidable, never will.

Figure 2.1: Relationship of different notions of redundancy

2.1.4 Superfluousness

Apart from problem domain duplication and representation size excess, a third notion of redundancy is used in some areas of computer science: superfluousness.

Several examples for this type of redundancy exist in the literature. In compiler construction, statements are considered as redundant, if they are unreachable [134]. If the unreachable statements are removed, the code still exhibits the same observable behavior¹. Second, if a usage perspective is adopted, statements are redundant, if they are not required by the users of the software, e.g., because the feature they implement has become obsolete. Based on the actual need of the users, the software still exhibits the same behavior if the features that will never be used again are removed. A third example can be found in logic: a knowledge base of propositional formulas is redundant, if it contains parts that can be inferred from the rest of it [158]. The removal of these parts does not change the models of the knowledge base, e.g., the variable assignments that evaluate to true.

Superfluousness is fundamentally different from the other notions of redundancy. Whereas duplication of problem domain information and representation size excess indicate that the representation can be compacted without loss of information, superfluousness indicates which information can be lost since it is not required for a certain purpose. This notion of redundancy is outside the scope of this thesis.

2.2 Software Cloning

This section introduces software cloning and compares it with the notions of redundancy introduced above. A more in-depth discussion of research in software cloning and in clone detection is provided in Chapter 3.

2.2.1 Cloning as Problem Domain Knowledge Duplication

Programs encode problem domain information. Duplicating a program fragment can thus create duplication of encoded problem domain knowledge. Since program fragment duplication preserves syntactic structure, the duplicates are also similar in their representation.

¹ Disregarding effects due to a potentially smaller memory footprint.

Clones are similar regions in artifacts. They are not limited to source code, but can occur in other artifact types such as models or texts, as well. In the literature, different definitions of similarity are employed [140, 201], mostly based on syntactic characteristics. Their notion of redundancy is thus, strictly speaking, agnostic of the problem domain. In contrast, in this thesis, we require clones to implement one or more common problem domain concepts, thus turning clones into an instance of logical program redundancy as defined by Ratiu [190]. Cloning thus exhibits the negative impact of logical program redundancy (cf., Section 2.1.1).

The common concept implementation gives rise to change coupling: when the concept changes, all of its implementations (the clones) need to be changed. In addition, we require clones to be syntactically similar. While syntactic similarity is not required for change coupling, existing clone detection approaches rely on syntactic similarity to detect clones. In terms of Figure 2.1, this requirement limits clones to the intersection of the two sets. Hence, we employ the term clone to denote syntactically similar artifact regions that contain redundant encodings of one or more problem domain concepts. While syntactic similarity can be determined automatically, redundant concept implementation cannot.

For the sake of clarity, we differentiate between clone candidates, clones and relevant clones. Clone candidates are results of a clone detector run: syntactically similar artifact regions. Clones have been inspected manually and are known to implement common problem domain concepts. However, not all clones are relevant for all tasks: while for change propagation, all clones are relevant, for program compaction, e.g., only those are relevant that can be removed. In case only a subset of the clones in a system is relevant for a certain task, we refer to them as relevant clones.

A clone group is a set of clones. Clones in a single group are referred to as siblings; a clone's artifact region is similar to the artifact regions of all its siblings. We employ these terms for clone candidates, clones and relevant clones.

2.2.2 Causes for Cloning

Clones are typically created by copy & paste. Many different causes can trigger the decision to copy, paste (and possibly modify) an artifact fragment. Several authors have analyzed the causes for cloning in code [123, 131, 140, 201]. We differentiate here between causes inherent to software engineering and causes originating in the maintenance environment and the maintainers.

Inherent Causes: Creating software is a difficult, intellectually challenging task. Inherent causes for cloning are those that originate in the inherent complexity of software engineering [25]; even ideal processes and tools cannot eliminate them completely.

One inherent reason is that creating reusable abstractions is hard. It requires a detailed understanding of the commonalities and differences among their instances. When implementing a new feature that is similar to an existing one, their commonalities and differences are not always clear. Cloning can be used to quickly generate implementations that expose them. Afterwards, remaining commonalities can be consolidated into a shared abstraction. A second reason is that understanding the impact of a change is hard for large software. An exploratory prototypical implementation of the change is one way to gain understanding of its impact. For it, an entire subsystem can be cloned and modified for experimental purposes. After the impact has been determined, a substantiated decision can be taken on whether to integrate or merge the changes into the original code. After exploration is finished, clones can be removed.

In both cases, cloning is used as a means to speed up implementation to quickly gain additional information. Once the information is obtained, clones can be consolidated.

Maintenance Environment: The maintenance environment comprises the processes, languages and tools employed to maintain the software system. Maintainers can decide to clone code to work around a problem in the maintenance environment.

Processes can cause cloning. First, to reuse code, an organization needs a reuse process that governs its evolution and quality assurance. Missing or unsuitable reuse processes hinder maintainers in sharing code. In response, they reuse code through duplication. Second, short-sighted project management practices can trigger cloning. Examples include productivity measurement of LOC/day, or constant time pressure that encourages short term solutions in ignorance of their long-term consequences. In response, maintainers duplicate code to reduce pressure from project management. Third, to make code reusable in a new context, it sometimes needs to be adapted. Poor quality assurance techniques can make the consequences of the necessary changes difficult to validate. In response, maintainers duplicate the code and make the necessary change to the duplicate to avoid the risk of breaking the original code.

Limitations in languages or tools can cause cloning. First, the creation of a reusable abstraction often requires the introduction of parameters. Language limitations can prohibit the necessary parameterization. In response, maintainers duplicate the parts that cannot be parameterized suitably. Second, reusable functionality is often encapsulated in functions or methods. On hot code paths of performance critical applications, method calls can impose a performance penalty. If the compiler cannot perform suitable inlining to allow for reuse without this penalty, maintainers inline the methods manually through duplication of their bodies.

Finally, besides inherent and maintenance environment causes, maintainers can decide to clone code for intrinsic reasons. For example, the long-term consequences of cloning can be unclear, or maintainers might lack the skills required to create reusable abstractions.

All non-inherent causes for cloning share two characteristics: even while cloning might be a successful short-term technique to circumvent its cause, its negative impact on software maintenance still holds; in addition, as long as their cause is not rectified, the clones cannot be consolidated. These causes can thus lead to gradual accumulation of clones in systems.

2.2.3 Clone Detection as Search for Representational Similarity

The goal of clone detection is to find clones, that is, duplicated problem domain knowledge in the program. Unfortunately, clone detection has no access to models of the problem domain. To circumvent this, clone detection searches for similarity in the program representation. This has two implications for detection result quality:

Recall: Duplicated problem domain knowledge that is not sufficiently representationally similar does not get detected. This limits the recall of detected w.r.t. total duplicated problem domain knowledge. The magnitude of this effect is difficult to quantify in practice, since the amount of all duplicated domain knowledge in a set of artifacts is typically unknown. Computing the recall of a clone detector in terms of how much of this it can detect is thus unfeasible in practice.

Precision: Since similarity in the program representation can, but does not need to be created by problem domain knowledge duplication, not all detected clone candidates contain duplicated problem domain knowledge. All program fragments that are sufficiently syntactically similar to be detected as clones, but do not implement common problem domain knowledge, are false positives that reduce precision. This typically occurs if clone detection removes all links to the problem domain, e.g., through normalization, which removes identifiers that reference domain concepts. Artifact regions that exhibit little syntactic variation are then likely to be identified as clone candidates, even though they share no relationship on the level of the problem domain concepts they implement.

Code Clone and Clone Candidate Classification: Code clones and clone candidates for source code can be classified into different types. Clone types impose syntactic constraints on the differences between siblings [19, 140]: type 1 is limited to differences in layout and comments, type 2 further allows literal changes and identifier renames and type 3 in addition allows statement changes, additions or deletions. The clone types form a hierarchy: type-3 clones contain type-2 clones, which contain type-1 clones. Type-2 clones (including type-1 clones) are also referred to as ungapped clones.

For clones in other artifact types than source code, no clone type classifications have been established so far. However, similar syntactic criteria could be used to create classifications for clones in data-flow models [86] and requirements specifications.
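The difference between type-1 and type-2 clones can be made concrete with a small sketch. It is not the detector used in this thesis; it assumes a deliberately simplified tokenizer for Java-like code, and the names `tokens`, `normalized` and `KEYWORDS` are hypothetical.

```python
import re

# Small illustrative subset of keywords; a real lexer would know all of them.
KEYWORDS = {"int", "return", "if", "else", "while", "for"}

# Comments, identifiers/keywords, integer literals, or any other single character.
TOKEN = re.compile(r"//[^\n]*|/\*.*?\*/|[A-Za-z_]\w*|\d+|\S", re.DOTALL)


def tokens(code):
    """Token sequence without layout and comments (type-1 view)."""
    return [t for t in TOKEN.findall(code) if not t.startswith(("//", "/*"))]


def normalized(code):
    """Token sequence with identifiers and literals replaced (type-2 view)."""
    result = []
    for token in tokens(code):
        if token[0].isdigit():
            result.append("LIT")
        elif (token[0].isalpha() or token[0] == "_") and token not in KEYWORDS:
            result.append("ID")
        else:
            result.append(token)
    return result


original = "int total = base + 10; // sum up"
reformatted = "int   total=base+10;"
renamed = "int sum = salary + 25;"

print(tokens(original) == tokens(reformatted))    # type-1 siblings
print(tokens(original) == tokens(renamed))        # not type-1 siblings
print(normalized(original) == normalized(renamed))  # type-2 siblings
```

The same normalization also shows why precision suffers: once identifiers are replaced by `ID`, fragments that reference entirely different domain concepts become indistinguishable to the detector.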

2.2.4 Clone Management, Assessment and Control

Software clone management comprises all activities of “looking after and making decisions about the consequences of copying and pasting” [141], including the prevention of clone creation and the consistent maintenance and removal of existing clones.

Software clone assessment, as employed by this thesis, is an activity that detects clones in software artifacts and quantifies its impact on engineering activities.

Software clone control, as employed by this thesis, is part of the process of quality control [48]. Quality control compares the actual quality of a system against its quality requirements and takes necessary actions to correct the difference. The quality requirement for clone control is twofold: first, to keep the amount of clones in a system low; second, to alleviate the negative consequences of existing clones in a system. Consequently, clone control analyzes the results of clone assessment and takes necessary actions to reduce the amount of clones and to simplify the maintenance of remaining clones. Clone control is thus a continuous process that is performed as part of quality control and that employs activities from clone management.

2.3 Notions of Program Similarity

Programs encode problem domain knowledge in different ways. Data structures encode properties of concepts; identifiers define and reference domain entities and algorithms implement behavior and processes from a problem domain. Duplication of problem domain information in the code can lead to different types of program similarity.

Many different notions of program similarity exist [228]. In this section, we differentiate between representational and behavioral similarity of code. Both representational and behavioral similarity can represent problem domain knowledge duplication.

2.3.1 Program-Representation-based Similarity

Numerous clone detection approaches have been suggested [140, 201]. All of them statically search a suitable program representation for similar parts. Amongst other things, they differ in the program representation they work on and the search algorithms they employ². Consequently, each approach has a different notion of similarity between the code fragments it can detect as clones.

The employed notions comprise textual, metrics and feature-based similarity [228]. From a theoretical perspective, they can be generalized into the notion of normalized information distance [155]. Since normalized information distance is based on the uncomputable Kolmogorov complexity, it cannot be employed directly. Instead, existing approaches use simpler notions that are efficiently computable. We classify them by the type of behavior-invariant variation they can compensate when recognizing equivalent code fragments and by the differences they tolerate between similar code fragments.

Text-based approaches detect clones that are equal on the character level. Token-based approaches can perform token-based filtering and normalization. They are thus robust against reformatting, documentation changes or renaming of variables, classes or methods. Abstract syntax tree (AST)-based approaches can perform grammar-level normalization and are thus furthermore robust against differences in optional keywords or parentheses. Program dependence graph (PDG)-based approaches are somewhat independent of statement order and are thus robust against reordering of commutative statements. In a nutshell, existing approaches exhibit varying degrees of robustness against changes to duplicated code that do not change its behavior.

Some approaches also tolerate differences between code fragments that change behavior. Most approaches employ some normalization that removes or replaces special tokens and can make code that exhibits different behavior look equivalent to the detection algorithm. Moreover, several approaches compute characteristic vectors for code fragments and use a distance threshold between vectors to identify clones. Depending on the approach, characteristic vectors are computed from metrics, e.g., function-level size and complexity [139, 170] or AST fragments [16, 106]. Furthermore, ConQAT [115] detects clones that differ up to an absolute or relative edit distance.

In a nutshell, notions of representational similarity as employed by state of the art clone detection approaches differ in the types of behavior-invariant changes they can compensate and the amount of further deviation they allow between code fragments. The amount of deviation that can be tolerated in practice is, however, severely limited by the amount of false positives it produces.

² Please refer to Section 3.3 for a comprehensive overview of existing clone detection approaches.

    Left fragment:

        int x, y, z;
        z = x * y;

    Right fragment:

        int x’ = x;
        z = 0;
        while (x’ > 0) {
            z += y;
            x’ -= 1;
        }
        while (x’ < 0) {
            z -= y;
            x’ += 1;
        }

Figure 2.2: Code that is behaviorally equal but not representationally similar.

2.3.2 Behavioral Similarity

Besides their representational aspects, programs can be compared based on their behavior. Behavioral program similarity is not employed by existing clone detectors³. However, we introduce behavioral notions of program similarity since we employ them later to reason about the limitations of clone detection (cf., Chapter 9).

Several notions of behavioral or semantic similarity have been suggested [228]. In this work, we focus on similarity in terms of I/O behavior. We choose this notion for several reasons. It is more robust against transformations than, e.g., execution curve similarity [228] or strong program schema equivalence [98, 203]. Furthermore, it is habitually employed in the specification of interactive systems [26] and best captures our intuition.

For a piece of code (i.e., a sequence of statements) we call all variables written by this code its output variables and all variables which are read and do have an impact on the outputs its input variables. Each of the variables has a type which is uniquely determined from the context of the code. We can then interpret this code as a function from valuations of input variables to valuations of output variables, which is trivially state-less (and thus side-effect free), as we captured all global variables in the input and output variables.

We call two pieces of code behaviorally equal, iff they have the same sets of input and output variables (modulo renaming) and are equal with respect to their function interpretation. So, for each input valuation they have to produce the same outputs. An example⁴ of code that is behaviorally equal but not representationally similar is shown in Figure 2.2.

³ While there are some approaches that refer to themselves as semantic clone detection, e.g., PDG-based approaches, we argue that they use a representational notion of similarity, since the PDG is a program representation.

⁴ Variable x’ on the right side is introduced to avoid modification of the input variable x.
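The function interpretation can be tried out on the two fragments of Figure 2.2. The following is a minimal sketch in Python rather than the notation of the figure; the function names are hypothetical. Comparing outputs on a finite sample of inputs can only refute behavioral equality, it cannot prove it.

```python
def multiply_direct(x, y):
    """Left fragment of Figure 2.2: inputs x, y; output z."""
    return x * y


def multiply_by_addition(x, y):
    """Right fragment of Figure 2.2: repeated addition or subtraction."""
    x_copy = x  # plays the role of x' so that the input x is not modified
    z = 0
    while x_copy > 0:
        z += y
        x_copy -= 1
    while x_copy < 0:
        z -= y
        x_copy += 1
    return z


def agree_on(first, second, inputs):
    """True if both functions yield the same output for every sampled input."""
    return all(first(x, y) == second(x, y) for x, y in inputs)


sample = [(x, y) for x in range(-5, 6) for y in range(-5, 6)]
print(agree_on(multiply_direct, multiply_by_addition, sample))
```

Both functions have the same input variables (x, y) and the same output variable (z), yet share almost no tokens, which is why a representation-based detector would not report them as clone candidates.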

For practical purposes, often not only strictly equal pieces of code are relevant, but also similar ones. We call such similar code a simion. Simions are behaviorally similar code fragments where behavioral similarity is defined w.r.t. input/output behavior. The specific definition of similarity is task-specific. One definition would be to allow different outputs for a bounded number of inputs. This would capture code with isolated differences (e.g., errors), for example in boundary cases. Another one could tolerate systematic differences, such as different return values for errors, or the infamous “off by one” errors. A further definition of similarity is compatibility in the sense that one simion may replace another in a specific context.

The detection of simions that are not representationally similar is beyond the scope of this thesis.

Simion versus Clone: Most definitions of software clones denote a common origin of the cloned code fragments [227], as is also the case in biology: Haldane coined the term “clone” from the Greek word for twig, branch [90]. We want to be able to investigate code similarities independent of their mode of creation, however. Using a term that in most of its definitions implies duplication from a single ancestor as a mode of creation is thus counter-intuitive. We thus deliberately introduce the term “simion” to avoid confusion.

For the sake of clarity, we relate the term to those definitions of “clone” that are most closely related: accidental clones denote code fragments that have not been created by copy & paste [1]. Their similarity results typically from constraints or interaction-protocols imposed by the same libraries or APIs they use. However, while they are similar w.r.t. those constraints or protocols, they do not need to be similar on the behavioral level⁵. Semantic clones denote code fragments whose program dependence graph fragments are isomorphic [73]. Since the program dependence graphs are abstractions of the program semantics, and thus do not capture them precisely, they can, but do not need to have similar behavior. Type-4 clones as defined by [200] as “two or more code fragments that perform the same computation but are implemented by different syntactic variants” are comparable to simions. However, we prefer a term that does not include the word “clone” as this implies that one similar instance is derived from another, which is not the case if they have been developed independently.

⁵ In other words, even though the code of two UI dialogs looks similar in parts since the same widget toolkit is used, the dialogs can look and behave very differently.

2.4 Terms and Definitions

This section introduces further terms that are central to this thesis.

Software Artifacts: A software artifact is a file that is created and maintained during the life cycle of a software system. It is part of the system or captures knowledge about it. Examples include requirements specifications, models and source code. From the point of view of analysis, an artifact is regarded as a collection of atomic units. For natural language texts, these units can be words or sentences. For source code, tokens or statements. For data-flow models such as Matlab/Simulink, atomic units are basic model blocks such as addition or multiplication blocks. The type of data structure according to which the atomic units are arranged varies between artifact types. Requirements specifications and source code are considered as sequences of units. Data-flow models as graphs of units.

We use the term requirements specification according to IEEE Std 830-1998 [100] to denote a specification for a particular software product, program, or set of programs that performs certain functions in a specific environment. A single specification can comprise multiple individual documents. We use the term use case to refer to a requirements specification written in use case form. Use cases describe the interaction between the system and a stakeholder under various conditions [37]. We assume use cases to be in text form.

We use the term data-flow model to refer to models as used in the embedded domain, such as Matlab/Simulink or ASCET models. A single data-flow model can comprise multiple physical model files.

Size Metrics: Lines of code (LOC) denote the sum of the lines of code of all source files, including comments and blank lines. Source statements (SS) are the number of all source code statements, not taking commented or blank lines and code formatting into account. For models, size metrics typically refer to blocks or elements, instead of lines or statements. The number of blocks denotes the size of a Matlab/Simulink model in terms of atomic elements. The redundancy free source statements (RFSS) are the number of source statements, if cloned source statements are only counted once. RFSS thus estimates the size of a system from which all clones are perfectly removed.

Failure and Fault: We use the term failure to denote an incorrect output of a software visible to the user. A fault is the cause in the source code of a potential failure.

Method: We employ the term method according to Balzert⁶ to denote “a systematic, justified procedure to accomplish specified goals”.

⁶ Translated from German by the author.

2.5 Clone Metrics

The case studies and methods presented in the following chapters employ several clone-related metrics. They are defined and illustrated in the following. The metrics are employed in this or in similar form by several clone detection approaches [140, 201].

2.5.1 Example

To make the metrics more tangible, we use a running example. Figure 2.3 shows the structure of the example artifacts and their contained clones.

Figure 2.3: Running example

The example contains three artifact files A-C and three candidate clone groups a-c. Candidate clone group a has three candidate clones, covering all artifacts. Group b has two candidate clones, covering artifacts A and B. Group c has four candidate clones, with c1 and c2 located in artifacts A and B respectively, and c3 and c4 located in artifact C. Groups b and c overlap. Dimensions of the artifacts and the candidate clone groups are depicted in Table 2.1.

Table 2.1: Dimensions

            A     B     C     a     b     c
    Length  60    100   40    5     40    10

We interpret the example for source code, requirements specifications and models below. Length is measured in lines for source code and requirements and in model elements for models. The primary difference in the case of models is that their clones are not consecutive file regions, but subgraphs of the model graph. A visualization of the models and their candidate clones would thus look less linear than Figure 2.3.

Source Code: Artifacts A to C are textual source code files in Java. Artifacts A and B implement business logic for a business information system. A implements salary computation for employees, B implements salary computation for freelancers. C contains utility methods that compute salaries.

The candidate clones of candidate clone group a contain import statements that are located at the start of the Java files. Clone group b contains the basic salary computation functionality. Clone group c contains a tax computation routine which is used both for salary computation of employees and freelancers and in the utility methods of file C.

Java import statements map between local type names used in a file and fully qualified type names employed by the compiler. Modern IDEs automate management of import statements. They are thus not modified manually during typical software maintenance tasks. Redundancy in import statements thus does not affect maintenance effort.

Requirements Specifications: Artifacts A to C are use case documents. Document A describes use case “Create employee account”, document B use case “Create freelancer account”. Document C describes use case “create customer” and contains primary and alternative scenarios.

The clones of clone group a contain document headers that are common to all use case documents. Clone group b contains preconditions, steps and postconditions of generic account creation. Clones of clone group c contain postconditions that hold both after account creation and for both the primary and alternative scenario of customer creation.

Data-Flow Models: Artifacts A to C are Matlab/Simulink files that are part of a single model. Each file represents a separate subsystem. Whereas the clones of clone groups b and c encode similar PID controllers, the clone candidates of candidate clone group a only comprise connector blocks; they are thus not relevant for maintenance.

Relevance: From a maintenance perspective, candidate clone group a is not relevant. In the source code case, it contains import statements that are automatically maintained by modern IDEs; no manual import statement maintenance takes place that could benefit from knowledge of clone relationships. In the requirements specification case, it contains a document header that does not get maintained manually in each document. Changes to the header are automatically replicated for all documents by the text processor used to edit the requirements specifications. In the model case, the connectors establish the syntactic subsystem interface. Consistency of changes to it is enforced by the compiler. Similarly, no manual maintenance takes place that could make use of knowledge about clone relations. The candidate clones in group a are thus not relevant clones for the task of software maintenance. The remaining clone groups, however, are relevant.

2.5.2 Metric Template

Each metric is introduced following a fixed template. Its definition defines the metric and specifies its scale and range. Its determination describes whether the value for the metric can be determined fully automatically by a tool or whether human judgement is required. Its example paragraph computes the metric for the example artifacts and clone groups.

The role of the metrics for clone assessment, and thus the interpretation of their values for software engineering activities, is described in detail in Chapter 8. This section thus only briefly summarizes the purpose of each metric.

2.5.3 Clone Counts

Definition 1: Clone group count is the number of clone groups detected for a system. Clone count is the total number of clones contained in them.

Clone counts are used during clone assessment to reveal how many parts of the system are affected by cloning. Both counts have a ratio scale and range between [0, ∞[.

Determination: Both counts are trivially determined automatically.

Example: For the example, the clone group count is 3, the clone count is 9. If clone group a is removed, clone group count is reduced to 2 and clone count to 6.
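The counts for the running example can be reproduced with a few lines. This is a minimal sketch; the encoding of the example as a mapping from clone group to the artifacts that hold its clones is an assumption made for illustration.

```python
# Running example: clone group -> artifact of each clone in the group.
CLONE_GROUPS = {
    "a": ["A", "B", "C"],
    "b": ["A", "B"],
    "c": ["A", "B", "C", "C"],
}


def clone_group_count(groups):
    """Number of clone groups."""
    return len(groups)


def clone_count(groups):
    """Total number of clones contained in all groups."""
    return sum(len(clones) for clones in groups.values())


without_a = {name: clones for name, clones in CLONE_GROUPS.items() if name != "a"}

print(clone_group_count(CLONE_GROUPS), clone_count(CLONE_GROUPS))
print(clone_group_count(without_a), clone_count(without_a))
```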

2.5.4 Overhead

Definition 2: Overhead is the size increase of a system due to cloning.

Overhead is used in the evaluation of the cloning-induced effort increase in size-related activities. It is measured in relative and absolute terms:

$$\mathit{overhead\_rel} = \frac{\mathit{size}}{\mathit{redundancy\ free\ size}} - 1$$

If the size is > 0, the redundancy free size can never be 0. Overhead is thus always defined for all artifacts of size > 0. The subtraction of 1 from the ratio $\mathit{size} / \mathit{redundancy\ free\ size}$ makes overhead_rel quantify only the size excess.

$$\mathit{overhead\_abs} = \mathit{size} - \mathit{redundancy\ free\ size}$$

Both have a ratio scale and range between [0, ∞[.

Determination: Overhead is computed on the clone groups detected for a system. The accuracy of the overhead metric thus depends on the accuracy of the clones on which it is computed.

Example: To compute overhead for source code, we employ statements as basic units. The redundancy free source statements (RFSS) for artifact A are computed as the sum of:

- 15 statements that are not covered by any clone; they account for 15 RFSS for file A.
- The 5 statements that are covered by clone a1 occur 3 times altogether. They thus only account for $5 \cdot \frac{1}{3} = 1\frac{2}{3}$ RFSS for file A.
- The 30 statements that are covered by clone b1 but not by clone c1 occur 2 times. They thus only account for $30 \cdot \frac{1}{2} = 15$ RFSS for file A.
- The 10 statements that are covered by both clones b1 and c1 occur 4 times. They thus account for $10 \cdot \frac{1}{4} = 2\frac{1}{2}$ RFSS for file A.

In all, file A thus has $15 + 1\frac{2}{3} + 15 + 2\frac{1}{2} = 34\frac{1}{6}$ RFSS. Since file A has 60 statements altogether, $\mathit{overhead} = \frac{60}{34\frac{1}{6}} - 1 = 75.6\%$.

RFSS for artifacts A-C is 130, corresponding overhead is $\mathit{overhead} = \frac{200}{130} - 1 = 53.8\%$. If clone group a is excluded since it is not relevant to maintenance, RFSS increases to 140 and overhead decreases to 42.9%.

To compute overhead for other artifacts, we choose different artifact elements as basic units. For requirements specifications, we employ sentences as basic units; for models, model elements. Overhead for them is computed analogously.
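The overhead computation for the running example can be sketched as follows. Two things are assumptions made for illustration: the positions of the clones inside the artifacts (Table 2.1 only gives lengths, and the text only states that c1 and c2 lie inside b1 and b2), and the rule that a statement covered by several clone groups is counted with the occurrence number of the largest covering group, which is inferred from the worked example. All names are hypothetical.

```python
from fractions import Fraction

SIZES = {"A": 60, "B": 100, "C": 40}

# (group, artifact, start, end) with half-open statement ranges; positions are assumed.
CLONES = [
    ("a", "A", 0, 5), ("a", "B", 0, 5), ("a", "C", 0, 5),
    ("b", "A", 5, 45), ("b", "B", 5, 45),
    ("c", "A", 35, 45), ("c", "B", 35, 45), ("c", "C", 5, 15), ("c", "C", 15, 25),
]


def rfss(artifacts, clones):
    """Redundancy free size: each statement counts 1 / (number of occurrences)."""
    group_size = {}
    for group, _artifact, _start, _end in clones:
        group_size[group] = group_size.get(group, 0) + 1
    total = Fraction(0)
    for artifact in artifacts:
        for position in range(SIZES[artifact]):
            occurrences = max(
                (group_size[g] for g, a, s, e in clones
                 if a == artifact and s <= position < e),
                default=1,  # statements outside any clone occur once
            )
            total += Fraction(1, occurrences)
    return total


def overhead_rel(artifacts, clones):
    """Relative overhead: size / redundancy free size - 1."""
    size = sum(SIZES[artifact] for artifact in artifacts)
    return size / rfss(artifacts, clones) - 1


relevant = [clone for clone in CLONES if clone[0] != "a"]

print(rfss(["A"], CLONES), float(overhead_rel(["A"], CLONES)))
print(rfss(SIZES, CLONES), float(overhead_rel(SIZES, CLONES)))
print(rfss(SIZES, relevant), float(overhead_rel(SIZES, relevant)))
```

Exact fractions are used so that the intermediate values match the hand computation without rounding noise.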

2.5.5 Clone Coverage

Definition 3: Clone coverage is the probability that an arbitrarily chosen element in a system is covered by at least one clone.

Clone coverage is used during clone assessment to estimate the probability that a change to one statement needs to be made to additional statements due to cloning. It is defined as follows, where cloned size is the number of units covered by at least one clone, and size is the number of all units:

$$\mathit{coverage} = \frac{\mathit{cloned\ size}}{\mathit{size}}$$

Clone coverage has a ratio scale and ranges between [0, 1].

Determination: Just as overhead, coverage is computed on the clone groups detected for a system. The accuracy of the coverage metric thus depends on the accuracy of the underlying clones.

Example: Just as for overhead, we employ source statements as basic units to compute coverage for source code. The cloned size for artifact A is computed as follows:

- Clone a1 accounts for 5 cloned statements.
- Clone b1 accounts for 40 cloned statements.
- Clone c1 spans 10 statements. However, all of them are also spanned by clone b1. Clone c1 does thus not account for additional cloned statements.

The cloned size for artifact A is thus 5 + 40 = 45. Since A has a size of 60, its coverage is $\frac{45}{60} = 75\%$. If clone group a is ignored since it is not relevant for maintenance, coverage for A is reduced to $\frac{40}{60} = 66.7\%$.

The coverage for all three artifacts is $\frac{115}{200} = 57.5\%$, if clone group a is included, else $\frac{100}{200} = 50\%$.

For artifact types other than source code, basic units are chosen differently, but coverage is computed analogously.
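The coverage values of the running example can be checked with a short sketch. As before, the clone positions inside the artifacts are assumed for illustration (only the lengths and the containment of c1 and c2 in b1 and b2 follow from the example), and all names are hypothetical.

```python
SIZES = {"A": 60, "B": 100, "C": 40}

# (group, artifact, start, end) with half-open statement ranges; positions are assumed.
CLONES = [
    ("a", "A", 0, 5), ("a", "B", 0, 5), ("a", "C", 0, 5),
    ("b", "A", 5, 45), ("b", "B", 5, 45),
    ("c", "A", 35, 45), ("c", "B", 35, 45), ("c", "C", 5, 15), ("c", "C", 15, 25),
]


def cloned_size(artifact, clones):
    """Number of statements of the artifact covered by at least one clone."""
    covered = set()
    for _group, clone_artifact, start, end in clones:
        if clone_artifact == artifact:
            covered.update(range(start, end))
    return len(covered)


def coverage(artifacts, clones):
    """Cloned size divided by size, over the given artifacts."""
    cloned = sum(cloned_size(artifact, clones) for artifact in artifacts)
    size = sum(SIZES[artifact] for artifact in artifacts)
    return cloned / size


relevant = [clone for clone in CLONES if clone[0] != "a"]

print(coverage(["A"], CLONES), coverage(["A"], relevant))
print(coverage(SIZES, CLONES), coverage(SIZES, relevant))
```

Collecting covered positions in a set is what keeps the overlapping statements of b1 and c1 from being counted twice.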

2.5.6 Precision

Definition 4: Precision is the fraction of clone groups that are relevant to software maintenance, or a specific task, for which clone information is employed. It can be computed on clones or clone groups.

Based on the sets of candidate clone groups and relevant clone groups, it is defined as follows:

$$\mathit{precision\_CG} = \frac{|\{\text{relevant clone groups}\} \cap \{\text{candidate clone groups}\}|}{|\{\text{candidate clone groups}\}|}$$

Precision based on clones, precision_C, is computed analogously. Both precision metrics have ratio scales and range between [0, 1].

Determination: Precision is determined through developer assessments of samples of the detected clones. For each clone group, developers assess whether it is relevant for software maintenance, that is, whether changes to the clones are expected to be coupled. To achieve reliable and repeatable manual clone assessments, explicit relevance criteria are required.

Since in practice the set of detected clones is often too large to be feasibly assessed entirely, precision is typically determined on a representative sample of the candidate clone groups.

Example: In the example, clone group a is not relevant for software maintenance. The remaining clone groups are relevant. Consequently, $\mathit{precision\_CG} = \frac{2}{3}$, $\mathit{precision\_C} = \frac{6}{9} = \frac{2}{3}$.
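Both precision variants for the running example can be computed as sketched below. The data layout (number of clones per candidate group, plus the set of groups judged relevant by a developer) is an assumption made for illustration; the relevance judgement itself cannot be automated.

```python
from fractions import Fraction

# Candidate clone groups of the running example with their number of clones.
CANDIDATE_GROUPS = {"a": 3, "b": 2, "c": 4}

# Outcome of the manual assessment: group a is not relevant for maintenance.
RELEVANT_GROUPS = {"b", "c"}


def precision_cg(candidates, relevant):
    """Fraction of candidate clone groups that are relevant."""
    return Fraction(len(relevant & set(candidates)), len(candidates))


def precision_c(candidates, relevant):
    """Fraction of candidate clones that belong to relevant groups."""
    relevant_clones = sum(n for group, n in candidates.items() if group in relevant)
    return Fraction(relevant_clones, sum(candidates.values()))


print(precision_cg(CANDIDATE_GROUPS, RELEVANT_GROUPS))
print(precision_c(CANDIDATE_GROUPS, RELEVANT_GROUPS))
```

The two values coincide here only because the irrelevant group happens to contain exactly a third of all clones; in general they differ.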

Figure 2.4: Examples: Discrete saturated PI-controller and PID-controller

2.6 Data-flow Models

butwithModel-basedmoredevabstractelopmentmodelsmethodsspeci®cto[188]—dethevelopmentdomain—areofgsoftwainingarenotimportanceontheinclassicalthecdomainodelevofel
7automotiembeddedvesystems.domain,alreadyTheseupmodelsto80%areofusedthetoproductionautomaticallycodedeplogeneyedrateonproductionembeddedcodecontrol.Inunitsthe
canbegeneratedfrommodelsspeci®edusingdomain-speci®cformalismslikeMatlab/Simulink
[118].Thesemodelsaretakenfromcontrolengineering.Blockdiagrams—similartodata-¯owdiagrams—
Thus,consistingblocksofblockscorrespondandtolinesfunctionsareused(e.ing.,thisintegrators,domain®aslters)structuredtransformingdescriptioninputofsignalsthesetosystems.output
signals,linestosignalsexchangedbetweenblocks.Thedescriptiontechniquesspeci®callyaddress-
ingwithdata-¯ocomputationwsystemsschemesaretarlargetinggelytheindependentmodelingofofthecomplexcomputedstereotypicaldataandthusrepetitivecontainingcomputations,littleor
noaspectsofcontrol¯ow.Typicalapplicationsofthosemodelsare,e.g.,signalprocessingalgo-
rithms.Recently,toolsforthisdomain—withMatlab/Simulink[169]orASCET-SDasexamples—areused
forthegenerationofembeddedsoftwarefrommodelsofsystemsunderdevelopment.Tothatend,
theseblockdiagramsareinterpretedasdescriptionsoftime-(andvalue-)discretecontrolalgorithms.
ByusingtoolslikeTargetLink[58],thesedescriptionsaretranslatedintothecomputationalpartof
ataskdescription;byaddingschedulinginformation,thesedescriptionsarethencombined–often
usingareal-timeoperatingsystem—toimplementanembeddedapplication.
Figure2.4showstwoexamplesofsimpledata-¯owsystemsusingtheSimulinknotation.Both
modelsarefeedbackcontrollersusedtokeepaprocessvariablenearaspeci®edvalue.Bothmodels
transformatime-andvalue-discreteinputsignalInintoanoutputsignalOut,usingdifferenttypes
ofbasicfunctionblocks:gains(indicatedbytriangles,e.g.,PandI),adders(indicatedbycircles,
with+and!signsstatingtheadditionorsubtractionofthecorrespondingsignalvalue),one-unit
delays(indicatedbyboxeswith1,e.g.,I-Delay),constants(indicatedbyboxeswithnumerical
values,e.g.,Max),comparisonsz(indicatedbyboxeswithrelations,e.g.,Compare),andswitches
(indicatedbyboxeswithforks,e.g.,Set).
7Thetationtermpurposes.“model-based”Herehoiswevoftener,wealsofocususedoninthemodelscontethatxtofareemploincompletyedeforfullspeci®cationscodethatgeneration.domainlyservedocumen-

35

Fundamentals2

Systems are constructed by using instances of these types of basic blocks. When instantiating basic
blocks, depending on the block type, different attributes are defined, e.g., constants get assigned a
value, or comparisons are assigned a relation. For some blocks, even the possible input signals are
declared. For example, for an adder, the number of added signals is defined, as well as the
corresponding signs. By connecting them via signal lines, (basic) blocks can be combined to form more
complex blocks, allowing the hierarchic decomposition of large systems into smaller subsystems.
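
The composition of basic blocks into a more complex block can be illustrated with a short Python
sketch. It is only an illustration under assumptions: the wiring of two gains (P and I), adders and a
one-unit delay (I-Delay) into a discrete PI controller is an assumed example of the kind of feedback
controller described above, not a transcription of Figure 2.4, and the names Gain, UnitDelay, adder
and PIController are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gain:
    """Basic block that multiplies its input signal by a constant factor."""
    factor: float

    def step(self, value):
        return self.factor * value


@dataclass
class UnitDelay:
    """Basic block that outputs the value it received one time step earlier."""
    state: float = 0.0

    def step(self, value):
        previous, self.state = self.state, value
        return previous


def adder(*signed_inputs):
    """Adder block: every input is a (sign, value) pair with sign +1 or -1."""
    return sum(sign * value for sign, value in signed_inputs)


class PIController:
    """Composite block built from basic blocks (assumed wiring, not Figure 2.4)."""

    def __init__(self, p, i):
        self.p = Gain(p)
        self.i = Gain(i)
        self.i_delay = UnitDelay()

    def step(self, value):
        # Integral part: scaled input plus the integral of the previous time step.
        integral = adder((+1, self.i.step(value)), (+1, self.i_delay.state))
        self.i_delay.step(integral)  # remember the integral for the next step
        # Output signal: proportional part plus integral part.
        return adder((+1, self.p.step(value)), (+1, integral))


controller = PIController(p=2.0, i=0.5)
outputs = [controller.step(1.0) for _ in range(3)]
print(outputs)
```

Each instance fixes the attributes of its block type (the factor of a gain, the signs of an adder),
and the composite block is itself usable as a block in a larger system.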

2.7 Case Study Partners

This section gives a short overview of the companies or organizations that participated in one or
more of the case studies.

Munich Re Group: The Munich Re Group is one of the largest re-insurance companies in the
world and employs more than 47,000 people in over 50 locations. For their insurance business, they
develop a variety of individual supporting software systems.

Lebensversicherung von 1871 a.G.: The Lebensversicherung von 1871 a.G. (LV 1871) is
a Munich-based life-insurance company. The LV 1871 develops and maintains several custom
software systems for mainframes and PCs.

Siemens AG is the largest engineering company in Europe. The specification used here was
obtained from the business unit dealing with industrial automation.

MOST Cooperation is a partnership of car manufacturers and component suppliers that defined
an automotive multimedia protocol. Key partners include Audi, BMW and Daimler.

MAN Nutzfahrzeuge Group is a Germany-based international supplier of commercial vehicles
and transport systems, mainly trucks and buses. It has over 34,000 employees world-wide of which
150 work on electronics and software development. Hence, the focus is on embedded systems in
the automotive domain.

2.8 Summary

This chapter introduced clones as a form of logical redundancy and compared it with other notions
of redundancy used in computer science. Based thereon, it defined the central terms and metrics
employed in this thesis. Besides, the chapter introduced the companies that took part in industrial
case studies that are presented in later chapters.

3 State of the Art

This chapter summarizes existing work in the research area of software cloning in support of the
claims made in the thesis statement (cf., Section 1.1). More specifically, it summarizes work on the
impact of cloning on software engineering and on approaches for its assessment and control¹.

The structure of this chapter reflects the organization of this thesis: Section 3.1 outlines work on
the impact of cloning on program correctness. Section 3.2 outlines work on the extent of cloning
in different software artifact types. Section 3.3 outlines existing clone detection approaches and
argues why novel ones had to be developed. Section 3.4 outlines work on clone assessment and
management. Finally, Section 3.5 outlines work on the limitations of clone detection.

Each section summarizes existing work, outlines open issues and points to the chapters in this thesis
that contribute to their resolution.

3.1 Impact on Program Correctness

It is widely accepted that cloning can, in principle, impede maintenance through its induced increase
in artifact size and necessity of multiple, consistent updates required for a single change in problem
domain information. However, there is no consensus in the research community on how harmful
cloning is in practice. A survey on the harmfulness of cloning by Hordijk et al. [93] concludes that “a
direct link between duplication and changeability has not been proven yet, but not rejected either”.
Consequently, a number of researchers have performed empirical studies to better understand the
extent of the impact on maintenance efforts and, especially, on program correctness.

Clone Related Bugs: Li et al. [157] present an approach to detect bugs based on inconsistent
renaming of identifiers between clones. Jiang, Su and Chiu [159] analyze different contexts of clones,
such as missing if statements. Both papers report the successful discovery of bugs in released
software. In [4], [237], [216] and [7], individual cases of bugs or inconsistent bug fixes discovered by
analysis of clone evolution are reported for open source software. These studies thus confirm cases
where inconsistencies between clones indicated bugs, supporting the claim for negative impact of
clones for program correctness.

¹ A comprehensive overview of software cloning research in general is beyond the scope of this
thesis. Please refer to Koschke [140] and Roy and Cordy [201] for detailed surveys.
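
The idea of spotting bugs through inconsistent renaming of identifiers between clones can be
illustrated with a small Python sketch. It is a simplified illustration, not the algorithm of Li et
al. [157]: it assumes that the identifier occurrences of both clones have already been extracted and
aligned position by position, and the function name renaming_conflicts is hypothetical.

```python
from collections import defaultdict


def renaming_conflicts(original_ids, clone_ids):
    """Report identifiers of the original that were renamed inconsistently.

    Both arguments list the identifier occurrences of a clone pair in order of
    appearance (assumption: the two lists are aligned position by position).
    """
    if len(original_ids) != len(clone_ids):
        raise ValueError("identifier sequences must have the same length")
    targets = defaultdict(set)
    for original, renamed in zip(original_ids, clone_ids):
        targets[original].add(renamed)
    # An identifier mapped to more than one name was not renamed consistently.
    return {name: sorted(names) for name, names in targets.items() if len(names) > 1}


# The second occurrence of "src" was forgotten when the clone was adapted.
print(renaming_conflicts(["src", "dst", "src", "len"],
                         ["from_", "to", "src", "n"]))
```

A reported conflict is only a candidate for a bug; whether the difference was intended still has to be
judged by a developer.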

Clone Evolution: Indication for the harmfulness of cloning for maintainability or correctness
is given by several researchers. Lague et al. [149] report inconsistent evolution of a substantial
amount of clones in an industrial telecommunication system. Monden et al. [178] report a higher
revision number for files with clones than for files without in a 20 year old legacy system, possibly
indicating lower maintainability. In [132,133], Kim et al. report that many changes to code
clones occur in a coupled fashion, indicating additional maintenance effort due to multiple change
locations. Thummalapenta, Aversano, Cerulo and Di Penta [4,216] report that high proportions of
bug fixes occur for clones that show late propagations, i.e., inconsistent changes that are later made
consistent, indicating that cloning delayed the removal of bugs from the system, or that the
inconsistencies introduced bugs that were later repaired. Lozano and Wermelinger [163,193] report that
maintenance effort may increase when a method has clones.

In contrast, doubt that consequences of cloning are unambiguously harmful is raised by several
recent research results. Krinke [147] reports that only half the clones in several open source systems
evolved consistently and that only a small fraction of inconsistent clones becomes consistent again
through later changes, potentially indicating a larger degree of independence of clones than hitherto
believed. Geiger et al. [76] report that a relation between change couplings and code clones could,
contrary to expectations, not be statistically verified. Lozano and Wermelinger [163] report that
no systematic relationship between code cloning and changeability could be established. In [148],
Krinke reports that in a set of open source systems, cloned code is more stable than non-cloned code
and concludes that it thus cannot be assumed to require more maintenance costs in general.
Bettenburg et al. [20] analyzed the impact of inconsistent changes to clones on program correctness.
Instead of analyzing individual changes, they analyzed only released software versions. Of the
analyzed bugs in the two systems, only 1.3% and 2.3% were found to be due to inconsistent changes
to code clones, indicating a small impact of cloning on program correctness. Rahman et al. [189]
analyze the relation between code cloning and bugs and report that, in the analyzed systems, cloned
code contains fewer bugs than non-cloned code.

Due to the diversity of the results produced by the studies on clone evolution, it is hard to draw
conclusions w.r.t. the harmfulness of cloning. This is emphasized by the results from Göde [83],
who analyzes evolution of type-1 clones in 9 open source systems to validate findings from previous
studies on clone evolution. He reports that the ratio of consistent and inconsistent changes to cloned
code varies substantially between the analyzed systems, making conclusions difficult.

Cloning Patterns: Through cloning patterns, Kapser and Godfrey [123] contrast motivation and
impact of cloning as a design decision with alternative solutions. They report that cloning can be a
justifiable or even beneficial development action in special situations, i.e., where severe language
limitations or code ownership issues prohibit generic solutions. Notably however, while they argue
that lack of, or problems associated with alternative solutions can make up for them, they emphasize
that for all cloning patterns the negative impact of cloning still holds.

Summary: The effect of cloning on maintainability and correctness is thus not clear. Furthermore,
the above listed publications suffer from one or more shortcomings that limit the transferability
of the reported findings.

- Many studies employ clone detectors in their default configuration without adapting them to
  the analyzed systems or tasks [4,7,76,147,148,163,189]. As a consequence, no differentiation
  is made, e.g., between clone candidates in hand-maintained or generated code, although clone
  candidates in generated code are irrelevant for maintenance. The employed notion of “clone”
  is thus purely syntactic and task-related precision unclear. For example, for one of the analyzed
  systems, Krinke reports that more than half of the detected clones were in code generated by
  a parser generator [148]. However, they were not excluded from the study, thus diluting its
  conclusiveness w.r.t. the impact of cloning.
- Instead of manual inspection of the actual inconsistent clones to evaluate impact for
  maintenance and correctness, indirect measures are used [4,76,83,147–149,163,178]. For example,
  change coupling, the ratio between consistent and inconsistent evolution of clones or code
  stability are analyzed, instead of actual maintenance efforts or faults. Indirect measures are
  inherently inaccurate and can easily lead to misleading results: unintentional differences and
  faults, e.g., while unknown to developers, exhibit the same evolution pattern as intentionally
  independent evolution and are thus prone to misclassification. Furthermore, inconsistencies
  that are faults that have not yet been discovered, or have been fixed in different ways, can
  incorrectly be classified as intentional independent evolution.
- Apart from their inaccuracy, the interpretation of the indirect measures is disputable. This is
  apparent for the measure of code stability as an indicator for maintainability. On the one
  hand, higher stability of cloned versus non-cloned code could be interpreted as an indicator
  for lower maintenance costs of cloned code, as, e.g., done by [148]; fewer changes could mean
  less costs. On the other hand, it can be interpreted as an indicator for lower maintainability
  (developers might shirk changing cloned code due to the increased effort), indicating higher
  overall maintenance costs! Support for the latter interpretation is, e.g., given by Glass [81],
  who reports more changes for more maintainable applications than for unmaintainable code,
  simply because development exploits the fact that changes are easier to make.
- The analyzed systems are too small (20 kLOC) to be representative [132,133] or omit analysis
  of industrial software [4,7,76,83,132,133,147,148,163,189].
- The analyses specifically focus on faults introduced during creation [157,159] or evolution [7]
  of clones, inhibiting quantification of inconsistencies in general. Or, in the case of [20], only
  look at bugs in released software, thus ignoring efforts for testing, debugging and fixing of
  clone-related bugs introduced and fixed during development.

Additional empirical research outside these limitations is required to better understand the impact
of cloning [140,201]. In particular, the impact of cloning on program correctness is insufficiently
understood.

Problem: It is still not well understood, how strongly unawareness of cloning during maintenance
affects program correctness. However, as this is the central motivation driving the development of
clone management tools, we consider this precarious.

Contribution: Chapter 4 presents a large scale case study that studies the impact of unawareness
of cloning on program correctness. It employs developer rating of the actual inconsistent clones
instead of indirect measures, the study objects are both open source and industrial systems, and
inconsistencies have been analyzed independently of their mode of creation. It does, hence, not
suffer from the above mentioned shortcomings.

3.2 Extent of Cloning

Cloning has been studied intensely for source code. Little work, however, has been done on cloning
in other artifact types. This section outlines existing work on the extent of cloning in different
artifact types.

Source Code: The majority of the research in the area of software cloning focuses on source
code. Both for the evaluation of detection approaches and for the analysis of the impact of cloning,
a substantial number of results for different code bases have been published [1,3,4,7,33,60,83,
84,110,115,133,140,147,148,157,159,161,162,164,178,189,193,195,198,199,201,216]. They
comprise source code from systems of different size and age, from different domains, development
teams and written in different programming languages. While the amount of detected cloning varies,
these studies convincingly show that cloning can occur in source code independent of domain,
programming language or developing organization. The studies that have been performed in the
course of this thesis support this observation.

Requirements Specifications: The negative effects of cloning in programs, in principle, also
apply to cloning in software requirements specifications (SRS). As SRS are read and changed often
(e.g., during requirements elicitation, software design, and test case specification), redundancy
is considered an obstacle to requirements modifiability [100] and listed, for instance, as a major
problem in automotive requirements engineering [230].

In general, structuring of requirements and manual inspection (based, e.g., on the criteria of
[100]) are used for quality assessment concerning redundancy. As it requires human action, it
does introduce subjectiveness and causes high expenses. In addition, approaches exist to
mechanically analyze other quality attributes of natural language requirements specifications,
especially ambiguity-related issues like weak phrases, lack of imperative, or readability metrics as in,
e.g., [28,66,101,233]. However, redundancy has not been in the focus of analysis tools.

Algorithms for commonalities detection in documents have been developed in several other areas.
Clustering algorithms for document retrieval, such as [231], search for documents on topics similar
to those of a reference document. Plagiarism detection algorithms, like [44,165], also address
the detection of commonalities between documents. However, while these approaches search for
commonalities between a specific document and a set of reference documents, clone detection also
needs to consider clones within a single document. Furthermore, we are not aware of studies that
apply them to requirements specifications to discover requirements cloning.

Models: Up to now, little work has been done on clone detection in model-based development.
Consequently, we have little information on how likely real-world models contain clones, and thus,
how important clone detection and management is for model-based development.

In [160], Liu et al. propose a suffix-tree based algorithm for clone detection in UML sequence
diagrams. They evaluated their approach on sequence diagrams from two industrial projects from a
single company, discovering 15% of duplication in the set of 35 sequence diagrams in the first and
8% of duplication in the 15 sequence diagrams of the second project.

In [186] and [180], Pham et al. and Nguyen et al. present clone detection approaches for
Matlab/Simulink models. Their evaluation is limited to freely available models from MATLAB Central
though, that mainly serve educational purposes. It thus does not allow conclusions about the amount
of cloning in industrial Matlab/Simulink models.

Summary: Although requirements have a pivotal role in software engineering, and even though
redundancy has long been recognized as an obstacle for requirements modification [100], to the
best of our knowledge, no analysis of cloning in requirements specifications has been published
(except for the work published as part of this thesis). We thus do not know whether cloning occurs
in requirements and needs to be controlled.

Although model-based development is gaining importance in industry [188], except for the analysis
of cloning in sequence diagrams, no studies on cloning in models have been published (except for
the work published as part of this thesis). We thus do not know how relevant clone detection and
management is for model-based development.

Problem: Substantial research has analyzed cloning in source code. However, very little research has
been carried out on cloning in other software artifacts. It is thus unclear whether cloning primarily
occurs in source code, or also needs to be controlled for other software artifacts such as requirements
specifications and models.

Contribution: To advance our knowledge of the extent and impact of cloning in other artifacts,
Chapter 5 presents a large scale industrial case study on cloning in requirements specifications
that analyzes extent and impact of cloning in 28 specifications from 11 companies. It indicates
that cloning does abound in some specifications and gives indications for its negative impact. The
chapter furthermore presents an industrial case study on cloning in Matlab/Simulink models that
demonstrates that cloning does occur in industrial models; clone detection and management are,
hence, also beneficial for requirements specifications and in model-based development.

3.3 Clone Detection Approaches

Both empirical research on the significance of cloning and methods for clone assessment and control
require clone detectors. In its first part, this section gives a general overview of existing code clone
detection approaches. Then, it presents approaches for real-time clone detection of type-2 and eager
detection of type-3 clones in source code and clone detection in graph-based models in detail and
identifies their shortcomings. This section thus motivates and justifies the development of novel
detection approaches that are presented in Chapter 7.

3.3.1 Code Clone Detection

The clone detection community has proposed very many different approaches, the vast majority of
them for source code. They differ in the program representation they operate on and in the search
algorithm they employ to find clones. We structure them here according to their underlying program
representation. This section focuses on code clone detection. Approaches for other artifacts are
presented in Section 3.3.4.

Text-based clone detection operates on a text-representation of the source code and is thus
language independent. Thus, text-based detection tools typically cannot differentiate between
semantics changing and semantics invariant changes. Approaches include [41,61,62,108,167,202].

Token-based clone detection operates on a token stream produced from the source code by a
scanner. It is thus language dependent, since a scanner encodes language-specific information. However,
compared to parsers or compilers, scanners are comparatively easy to produce and robust against
compile errors. Token-based clone detection allows token-type specific normalization, such as
removal of comments or renaming of literals and identifiers. It is thus robust against certain semantics
invariant changes to source code. Approaches include [6,14,85,85,88,113,121,157,210,220].
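
The token-type specific normalization described above can be sketched with the scanner from
Python's standard library. The sketch is an illustration under assumptions: it treats Python as the
analyzed language, the placeholder names ID and LIT are arbitrary choices, and it is not taken from
any of the cited tools.

```python
import io
import keyword
import tokenize

# Token types that are dropped: comments and purely structural tokens.
IGNORED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalized_tokens(source):
    """Scan Python source and return its normalized token sequence."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in IGNORED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")    # every identifier becomes the same placeholder
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")   # every literal becomes the same placeholder
        else:
            result.append(tok.string)  # keywords and operators are kept verbatim
    return result


first = normalized_tokens("total = price * 2  # cost\n")
second = normalized_tokens("amount = fee * 7\n")
print(first == second)
```

After normalization, two fragments that differ only in comments, identifier names and literal values
yield the same token sequence, so a detector comparing these sequences treats them as clones.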

AST-based clone detection operates on the (abstract) syntax tree produced from the source code by
a parser. It thus requires more language-specific infrastructure than token-based detection, but can
be made robust against further classes of program variation, such as different concrete syntaxes for
the same abstract syntax element. Approaches include [16,29,36,65,67,106,142,182,213,226].

Metrics-based approaches cut the program into fragments (e.g., methods) and compute a metric
vector (containing e.g., lines of code, nesting depth, number of paths, and number of calls to other
functions) for each. Fragments with similar vectors are then considered clones. Since the metrics
abstract from syntactic features of the source code, these approaches are also robust against certain
types of differences between clones. Approaches include [138,139,170].
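
A metric vector per fragment can be sketched as follows. The sketch is illustrative only and rests on
assumptions: Python functions serve as fragments, and the three metrics (number of statements,
number of calls, number of branching constructs) are a simplified stand-in for the metrics named
above, not the metric sets of the cited approaches.

```python
import ast
import math


def metric_vectors(source):
    """Map each function name to (statements, calls, branching constructs)."""
    vectors = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            inner = [n for n in ast.walk(node) if n is not node]
            vectors[node.name] = (
                sum(isinstance(n, ast.stmt) for n in inner),
                sum(isinstance(n, ast.Call) for n in inner),
                sum(isinstance(n, (ast.If, ast.For, ast.While)) for n in inner),
            )
    return vectors


def distance(u, v):
    """Euclidean distance between two metric vectors."""
    return math.dist(u, v)


SOURCE = '''
def total_positive(xs):
    t = 0
    for x in xs:
        if x > 0:
            t += weight(x)
    return t

def sum_valid(items):
    acc = 0
    for it in items:
        if it > 0:
            acc += score(it)
    return acc

def identity(x):
    return x
'''

vectors = metric_vectors(SOURCE)
print(distance(vectors["total_positive"], vectors["sum_valid"]))
```

Fragments whose vectors lie within a chosen distance of each other would be reported as clone
candidates; the threshold is a tuning parameter of such an approach.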

PDG-based approaches operate on the program dependence graph (PDG) and search it for
isomorphic subgraphs. On the one hand, they are robust against further types of program variation that
cannot be easily detected by other approaches, such as statement reordering. On the other hand,
they make the highest demands w.r.t. available programming language infrastructure to create a
PDG. Approaches include [73,137,146].

Assembler-based approaches employ techniques from the above approaches but operate on the
assembler or intermediate language code produced by the compiler, instead of on source code.
On the one hand, they are robust against program variation removed during compilation, such as
interchangeable loop constructs. On the other hand, they have to deal with redundancy created by
the compiler through replacement of a single higher-level language statement, like a loop, through
a series of lower level language statements. Approaches include [45,204] for assembler and [213]
for .NET intermediate language code.

Each program representation the detection approaches operate on represents a different trade-off
between several factors: language-independence, robustness against program variation and
performance being among the most important. Increasing sophistication of program representation (text,
token, AST, PDG) increases robustness against program variation, since more information for
normalization and similarity computation is available. However, at the same time it decreases language
independence and performance.

Hybrid approaches have, consequently, been proposed that attempt to combine the advantages of
individual approaches. Wrangler [154] employs a hybrid token/AST-based approach that exploits
the performance of token-based clone detection and employs the AST to make sure that the detected
clones represent syntactically well-formed program entities that are amenable to certain refactoring
techniques. KClone [105] first operates on the token level to exploit the performance of token-based
clone detection and then operates on a graph-based representation to increase recall.

3.3.2 Real-Time Clone Detection

Clone management tools rely on accurate cloning information to indicate cloning relationships in
the IDE while developers maintain code. To remain useful, cloning information must be adapted
continuously as the software system under development evolves. For this, detection algorithms need
to be able to very rapidly adapt results to changing code, even for very large code bases. We classify
existing approaches based on their scalability and their ability to rapidly update detection results to
changes to the code.

Eager Algorithms: As outlined in Section 3.3.1, a multitude of clone detection approaches have
been proposed. Independent of whether they operate on text [41,62,202], tokens [6,113,121],
ASTs [16,106,142] or program dependence graphs [137,146], and independent of whether they
employ textual differencing [41,202], suffix-trees [6,113,121], subtree hashing [16,106],
anti-unification [30], frequent itemset mining [157], slicing [137], isomorphic subgraph search [146]
or a combination of different phases [105], they operate in an eager fashion: the entire system is
processed in a single step by a single machine.

The scalability of these approaches is limited by the amount of resources available on a single
machine. The upper size limit on the amount of code that can be processed varies between approaches,
but is insufficient for very large code bases. Furthermore, if the analyzed source code changes,
eager approaches require the entire detection to be rerun to achieve up-to-date results. Hence, these
approaches are neither incremental nor sufficiently scalable.

Incremental or Real-time Detection: Göde and Koschke [85,85] proposed the first incremental
clone detection approach. They employ a generalized suffix-tree that can be updated efficiently
when the source code changes. The amount of effort required for the update only depends on the
size of the change, not the size of the code base. Unfortunately, generalized suffix-trees require
substantially more memory than read-only suffix-trees, since they require additional links that are
traversed during the update operations. Since generalized suffix-trees are not easily distributed
across different machines, the memory requirements represent the bottleneck w.r.t. scalability.
Consequently, the improvement in incremental detection comes at the cost of reduced scalability and
distribution.

Yamashina et al. [126] propose a tool called SHINOBI that provides real-time cloning information to
developers inside the IDE. Instead of performing clone detection on demand (and incurring waiting
times for developers), SHINOBI maintains a suffix-array on a server from which cloning information
for a file opened by a developer can be retrieved efficiently. Unfortunately, the authors do not
approach suffix-array maintenance in their work. Real-time cloning information hence appears to
be limited to an immutable snapshot of the software. We thus have no indication that their approach
works incrementally.

Nguyen et al. [182] present an AST-based incremental clone detection approach. They compute
characteristic vectors for all subtrees of the parse tree of a code file. Clones are then detected by
searching for similar vectors. If the analyzed software changes, vectors for modified files are simply
recomputed. As the algorithm is not distributed, its scalability is limited by the amount of memory
available on a single machine. Furthermore, AST-based clone detection requires parsers.
Unfortunately, parsers for legacy languages such as PL/I or COBOL are often hard to obtain [150].
However, according to our experience (cf., Chapter 4), such systems often contain substantial amounts
of cloning. Clone management is hence especially relevant for them.

Scalable Detection: Livieri et al. [162] propose a general distribution model that distributes
clone detection across many machines to improve scalability. Their distribution model partitions
source code into pieces small enough (e.g., 15 MB) to be analyzed on a single machine. Clone
detection is then performed on all pairs of pieces. Different pairs can be analyzed on different
machines. Finally, results for individual pairs are composed into a single result for the entire code
base. Since the number of pairs of pieces increases quadratically with system size, the analysis
time for large systems is substantial. The increase in scalability thus comes at the cost of response
time.
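
The quadratic growth of the number of pairs can be made concrete with a small calculation. The
sketch below is only an illustration: the piece size of 15 MB is taken from the text, while counting
each piece paired with itself (to find clones inside a piece) is an assumption, and the function name
piece_pairs is hypothetical.

```python
import math


def piece_pairs(total_mb, piece_mb=15):
    """Return (number of pieces, number of piece pairs to analyze).

    Assumption: every unordered pair of pieces is analyzed once, and every
    piece is also compared with itself to find clones inside that piece.
    """
    pieces = math.ceil(total_mb / piece_mb)
    return pieces, pieces * (pieces + 1) // 2


for size_mb in (150, 1500, 15000):
    pieces, pairs = piece_pairs(size_mb)
    print(f"{size_mb} MB: {pieces} pieces, {pairs} pairs")
```

Under these assumptions, a tenfold increase in system size raises the number of pairs roughly a
hundredfold, which is why response time suffers even when many machines are available.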

Summary: We require clone detection approaches that are both incremental and scalable to
efficiently support clone control of large code bases.

Problem: Eager clone detection is not incremental. The limited memory available on a single
machine furthermore restricts its scalability. Novel incremental detection approaches come at the cost
of scalability, and vice versa. In a nutshell, no existing approach is both incremental and scalable to
very large code bases.

Contribution: Chapter 7 introduces index-based clone detection as a novel detection approach for
type-1&2 clones that is both incremental and scalable to very large code bases. It extends practical
applicability of clone detection to areas that were previously infeasible since the systems were too
large or since response times were unacceptably long. It is available for use by others as open source
software.

3.3.3 Detection of Type-3 Clones

The case study that investigates impact of unawareness of cloning on program correctness (cf.,
Section 3.1) requires an approach to detect type-3 (cf., 2.2.3) clones in source code. We classify
existing approaches for type-3 clone detection in source code according to the program
representation they operate on and outline their shortcomings.

Text: In NICAD, normalized code fragments are compared textually in a pairwise fashion [202]. A
similarity threshold governs whether text fragments are considered as clones.

Token: Ueda et al. [220] propose post-processing of the results of token-based detection of exact
clones that composes type-3 clones from neighboring ungapped clones. In [157], Li et al. present
the tool CP-Miner, which searches for similar basic blocks using frequent subsequence mining and
then combines basic block clones into larger clones.

Abstract Syntax Tree: Baxter et al. [16] hash subtrees into buckets and perform pairwise
comparison of subtrees in the same bucket. Jiang et al. [106] propose the generation of characteristic
vectors for subtrees. Instead of pairwise comparison, they employ locality sensitive hashing for
vector clustering, allowing for better scalability than [16]. In [65], tree patterns that provide structural
abstraction of subtrees are generated to identify cloned code.

Program Dependence Graph: Krinke [146] proposes a search algorithm for similar subgraph
identification. Komondoor and Horwitz [137] propose slicing to identify isomorphic PDG subgraphs.
Gabel, Jiang and Su [73] use a modified slicing approach to reduce the graph isomorphism problem
to tree similarity.

Summary: We require a type-3 clone detection algorithm to study the impact of unawareness of
cloning on program correctness.

Problem: The existing approaches provided valuable inspiration for the algorithm presented in this
thesis. However, none of them was applicable to study the impact of unawareness of cloning on
program correctness, for one or more of the following reasons:

- Tree [16,65,106] and graph [73,137,146] based approaches require the availability of suitable
  context free grammars for AST or PDG construction. While feasible for modern languages
  such as Java, this poses a severe problem for legacy languages such as COBOL or PL/I, where
  suitable grammars are not available. Parsing such languages still represents a significant
  challenge [62,150].
- Due to the information loss incurred by the reduction of variable size code fragments to
  constant-size numbers or vectors, the edit distance between inconsistent clones cannot be
  controlled precisely in feature vector [106] and hashing based [16] approaches.
- Idiosyncrasies of some approaches threaten recall. In [220], inconsistent clones cannot be
  detected if their constituent exact clones are not long enough. In [73], inconsistencies might
  not be detected if they add data or control dependencies, as noted by the authors.
- Scalability to industrial-size software of some approaches has been shown to be infeasible
  [137,146] or is at least still unclear [65,202].
- For most approaches, implementations are not publicly available.

Contribution: Chapter 7 presents a novel algorithm to detect type-3 clones in source code. In
contrast to the above approaches, it supports both modern and legacy languages including COBOL
and PL/I, allows for precise control of similarity in terms of edit distance on program statements, is
sufficiently scalable to analyze industrial-size projects in reasonable time and is available for use by
others as open source software.
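
What "edit distance on program statements" means as a similarity measure can be illustrated with a
short Python sketch. It shows only the measure, computed with the standard dynamic programming
scheme for the Levenshtein distance; it is not the detection algorithm of Chapter 7, and the
representation of statements as normalized strings is an assumption.

```python
def statement_edit_distance(first, second):
    """Levenshtein distance between two sequences of (normalized) statements.

    Each inserted, deleted or replaced statement counts as one edit.
    """
    previous = list(range(len(second) + 1))
    for i, stmt_a in enumerate(first, start=1):
        current = [i]
        for j, stmt_b in enumerate(second, start=1):
            cost = 0 if stmt_a == stmt_b else 1
            current.append(min(previous[j] + 1,          # delete stmt_a
                               current[j - 1] + 1,       # insert stmt_b
                               previous[j - 1] + cost))  # keep or replace
        previous = current
    return previous[-1]


original = ["x = read()", "y = 2", "z = x + y", "write(z)"]
modified = ["x = read()", "y = 2", "if y > 0:", "z = x + y", "write(z)"]
print(statement_edit_distance(original, modified))
```

A detector that bounds this distance, e.g., relative to the clone length, gives direct control over how
much two fragments may differ and still be reported as a type-3 clone.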

3.3.4 Detection of Clones in Models

To analyze the extent of cloning in Matlab/Simulink models, and to assess and control existing
clones in them during maintenance, we need a suitable clone detection algorithm. In this section,
we discuss related work in clone detection on models and outline shortcomings.

Model-based Clone Detection: Up to now, little work has been done on clone detection in
model-based development. In [160], Liu et al. propose a suffix-tree based algorithm for clone
detection in UML sequence diagrams. They exploit the fact that parallelism-free sequence diagrams
can be linearized in a canonical fashion, since a unique topological order for them exists. This way,
they effectively reduce the problem of finding common subgraphs to the simpler problem of finding
common substrings. However, since a unique, similarity preserving topological order cannot be
established for Matlab/Simulink models, their approach is not applicable to our case.
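
The reduction to a substring problem can be illustrated as follows. The sketch assumes that each
sequence diagram has already been linearized into a list of message names; it finds the longest
common run of two such lists with a quadratic dynamic programming scheme, which is swapped in here
for brevity and is not the suffix-tree algorithm of [160]. The function name and the example
messages are hypothetical.

```python
def longest_common_run(first, second):
    """Return the longest contiguous run of elements shared by both sequences."""
    best_length, best_end = 0, 0
    previous = [0] * (len(second) + 1)
    for i in range(1, len(first) + 1):
        current = [0] * (len(second) + 1)
        for j in range(1, len(second) + 1):
            if first[i - 1] == second[j - 1]:
                # Extend the common run that ended at the previous positions.
                current[j] = previous[j - 1] + 1
                if current[j] > best_length:
                    best_length, best_end = current[j], i
        previous = current
    return first[best_end - best_length:best_end]


diagram_a = ["login", "query", "render", "logout"]
diagram_b = ["init", "query", "render", "close"]
print(longest_common_run(diagram_a, diagram_b))
```

Once diagrams are linear sequences, any string-based clone detection technique becomes applicable,
which is exactly what is not possible for models without a unique topological order.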

A problem which could be considered as the dual of the clone detection problem is described by
Kelter et al. in [128] where they try to identify the differences between UML models (usually
applied to different versions of a single model). In their approach they rely on calculating pairs of
matching elements (i.e., classes, operations, etc.) based on heuristics including the similarity of
names, and exploiting the fact that UML is represented as a rooted tree in the XMI used as storage
format, making it inappropriate for our context.

In [186], Pham et al. present a clone detection approach for Matlab/Simulink. It builds on the
approach presented in this thesis and was, thus, not available to us when we developed it.

Graph-based Clone Detection: Graph-based approaches for code clone detection could, in
principle, also be applied to Matlab/Simulink. In [137], Komondoor and Horwitz propose a
combination of forward and backward program slicing to identify isomorphic subgraphs in a program
dependence graph. Their approach is difficult to adapt to Matlab/Simulink models, since their
application of slicing to identify similar subgraphs is very specific to program dependence graphs.

In [146], Krinke also proposes an approach that searches for similar subgraphs in program
dependence graphs. Since the search algorithm does not rely on any program dependence graph specific
properties, it is in principle also applicable to model-based clone detection. However, Krinke
employs a rather relaxed notion of similarity that is not sensitive to topological differences between
subgraphs. Since topology plays a crucial role in data-flow languages, we consider this approach to
be sub-optimal for Matlab/Simulink models.

Graph Theory: Probably the most closely related problem in graph theory is the well known
NP-complete Maximum Common Subgraph problem. An overview of algorithms is presented by
Bunke et al. [31]. Most practical applications of this problem seem to be studied in
chemoinformatics [191], where it is used to find similarities between molecules. However, while typical molecules
considered there have up to about 100 atoms, many Matlab/Simulink models consist of thousands
of blocks and thus make the application of exact algorithms as applied in chemoinformatics
infeasible.

Summary: We require a clone detection algorithm for Matlab/Simulink models to investigate the
extent of cloning in industrial Matlab/Simulink models.

Problem: While the existing approaches for clone detection in graphs and models provided valuable
inspiration, none is suitable to study the extent of cloning in industrial Matlab/Simulink models.

Contribution: Chapter 7 presents a novel clone detection approach for data-flow models that is
suitable for Matlab/Simulink and scales to industrial-size models.

3.4 Clone Assessment and Management

This section outlines work related to clone management; to be comprehensive, we interpret this to
comprise all work that employs clone detection results to support software maintenance.

3.4.1 Clone Assessment

Clone detection tools produce clone candidates. Just because the syntactic criteria for type-x clone
candidates are satisfied, they do not necessarily represent duplication of problem domain
knowledge. Hence, they are not necessarily relevant for software maintenance. If precision is interpreted
as task relevance, existing clone detection approaches, hence, produce substantial amounts of false
positives. Clone assessment needs to achieve high precision to get conclusive cloning information.

The existence of false positives in produced clone candidates has been reported by several
researchers. Kapser and Godfrey report between 27% and 65% of false positives in case studies
investigating cloning in open source software [122]. Burd and Bailey [32] compared three clone
detection and two plagiarism detection tools using a single small system as study object. Through
subjective assessments, 38.5% of the detected clones were rejected as false positives. A more
comprehensive study was conducted by Bellon et al. [19]. Six clone detectors were compared using eight
different subject systems. A sample of the detected clones was judged manually by Bellon. It was
found that, depending on the detection technique, a large amount of false positives are among the
detected clones. Tiarks et al. [217] categorized type-3 clones detected by different state-of-the-art
clone detectors according to their differences. Before categorization, they manually excluded false
positives. They found that up to 75% of the clones were false positives.
Walenstein et al. [229] reveal caveats involved in manual clone assessment. Lack of objective
clone relevance criteria results in low inter-rater reliability. Similar results are reported by Kapser
et al. [124]. Their work emphasizes the need for measurement of inter-rater reliability to make sure
objective clone relevance criteria are used.
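Inter-rater reliability for two raters is commonly quantified with Cohen's kappa, which corrects the observed agreement for agreement expected by chance. The following is a minimal sketch only: the ratings are made up for illustration, and the cited studies may have used other agreement measures.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same clone candidates."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty candidate list")
    n = len(ratings_a)
    # Fraction of candidates on which both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if both raters labeled independently at their own rates.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance ratings of six clone candidates by two raters.
rater_1 = ["relevant", "relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]
rater_2 = ["relevant", "irrelevant", "irrelevant", "irrelevant", "relevant", "relevant"]
kappa = cohens_kappa(rater_1, rater_2)
```

A kappa near 1 indicates that raters apply the same relevance criteria; a value near 0 indicates agreement no better than chance.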
Some work has been done on tailoring clone detectors to improve their accuracy: Kapser and Godfrey
propose to filter clones based on the code regions they occur in. They report that such filters
can successfully remove false positives in regions of stereotype code without substantially affecting
recall [122]. In addition, all clone detection tools expose parameters whose valuations influence
result accuracy. For some individual tools and systems, their effect on the quantity of detected clones
has been reported [121]. However, we are not aware of systematic methods on how result accuracy
can be improved.

Summary: Unfortunately, there is no common, agreed-upon understanding of the criteria that
determine the relevance of clones for software maintenance. This is reflected in the multitude of
different definitions of software clones in the literature [140, 201]. This lack of relevance criteria
introduces subjectivity into clone judgement [124, 229], making objective conclusions difficult. The
negative consequences become obvious in the study done by Walenstein et al. [229]: three judges
independently performed manual assessments of clone relevance; since no objective relevance
criteria were given, judges applied subjective criteria, rating only 5 out of 317 candidates consistently.
Obviously, such low agreement is unsuited as a basis for improvement of clone detection result
accuracy.

Problem: Clone detection tools produce substantial amounts of false positives, threatening the
correctness of research conclusions and the adoption of clone detection by industry. However, we lack
explicit criteria that are fundamental to make unbiased assessments of detection result accuracy;
consequently, we lack methods for its improvement.

Contribution: Chapter 8 introduces clone coupling as an explicit criterion for the relevance of code
clones for software maintenance. It outlines a method for clone detection tailoring that employs
clone coupling to improve result accuracy. The results of two industrial case studies indicate that
developers can estimate clone coupling consistently and correctly and show the importance of
tailoring for result accuracy.

3.4.2 Clone Management

In [141], Koschke provides a comprehensive overview of the current work on clone management.
He follows Lague et al. [149] and Giesecke [78] in dividing clone management activities into three
areas: preventive management aims to avoid creation of new clones; compensative management
aims to alleviate impact of existing clones and corrective management aims to remove clones.

Clone Prevention: The earlier problems in source code are identified, the easier they are to fix.
This also holds for code clones. In [149], Lague et al. propose to prevent the creation of new clones
by analyzing code that gets committed to the central source code repository. In case a change adds
a clone, it needs to pass a special approval process to be allowed to be added to the system.

Several processes [5, 51, 177] employ manual reviews of changes before the software can go into
production. The LEvD process [51] we employ for the development of ConQAT, e.g., requires
all code changes to be reviewed before a release. Manual review is supported by analysis tools,
including clone detection. Clones thus draw attention during reviews and are, in most cases, marked
as review findings that need to be consolidated by the original author. While this scheme does not
prevent clones from being introduced into the source code repository, it does prevent them from
being introduced into the released code base.

Existing clone prevention focuses on the clones, not on their root causes. However, while causes for
cloning remain, maintainers are likely to continue to create clones. To be effective, clone prevention
hence needs to analyze, and rectify, the causes for cloning.

Clone Compensation: Clone indication tools point out areas of cloned code to the developer
during maintenance of code in an IDE. Their goal is to increase developer awareness of cloning
and thus make unintentionally inconsistent changes less likely. Examples include [46, 59, 60, 92,
94, 102, 103, 218]. Real-time clone detection approaches have been proposed to quickly deliver
up-to-date clone information for evolving software to clone indication tools [126, 235].

Linked editing tools replicate modifications made to one clone to its siblings [218]. They thus
promise to reduce the modification overhead caused by cloning and the likelihood to make
unintentionally inconsistent modifications. A similar idea is implemented by CReN [102] that consistently
renames identifiers in cloned code.

Both clone indication and linked editing tools operate on the source code level. In a large system,
clone comprehension, and thus clone compensation, can be supported through tools that offer
interactive visualizations at different levels of abstraction. Examples include [219], [238] and [125].

Besides supporting comprehension of clones in a single system version, clone tracking tools aim to
support comprehension of the evolution of clones in a system. Several tools to analyze the evolution
of cloning have been proposed, including [60, 83, 85, 85, 132, 133, 181, 216]. In [91], Harder and
Göde discuss that clone tracking and management face obstacles and raise costs in practice.

Clone Removal: Several authors have investigated corrective clone management. Fanta and
Rajlich [68] report on an industrial case study in which certain clone types were removed manually
from a C++ system. They identify the lack of dedicated tool support for clone removal as an
obstacle for clone consolidation. Such tool support is proposed by other authors: Komondoor [136]
investigates automated clone consolidation through procedure extraction. Baxter et al. [16] propose
to generate C++ macro bodies as abstractions for clone groups and macro invocations to replace the
clones. In [8], Balazinska et al. present an approach that consolidates clones through application
of the strategy design pattern [74]; in their later paper [9], the same authors present an approach to
support system refactoring to remove clones. In a more recent paper, the idea to suggest
refactorings based on the results from clone detection is elaborated by Li and Thompson in [154] for the
programming language Erlang.

Several authors have identified language limitations as one reason for cloning [140, 201]. To counter
this, some authors have investigated further means to remove cloning. Murphy-Hill et al. study
clone removal using traits [179]. Basit et al. study clone removal in C++ using a static
metaprogramming language [15].

Organizational Change Management: Existing research in clone management primarily deals
with technical challenges. But, to achieve adoption, and thus impact on software engineering
practice, further obstacles have to be overcome. In his keynote speech published in [40], Jim Cordy
outlines barriers in adoption of program comprehension techniques, including clone detection, by
his industrial partners. Cordy does not mention technical challenges or immaturity of existing
approaches, but instead business risks, management structures and social and cultural issues as
central barriers to adoption. His reports confirm that adoption of clone detection or management
approaches by industry faces challenges beyond the capabilities of the employed tools. Work of other
researchers confirms challenges in research adoption beyond technical issues [38, 69, 209].

Introducing clone management to reduce the negative impact of cloning on maintenance efforts
and program correctness is not a problem that can be solved simply by installing suitable tools.
Instead, it requires changes of the work habits of developers. To be successful, introduction of
clone management must thus overcome obstacles that arise when established processes and habits
are to be changed.

Challenges faced when changing professional habits are not specific to the introduction of clone
management. Instead, they are faced by all changes to development processes, including the
introduction of development or quality analysis tools. Furthermore, they are not limited to changes to the
development process, but instead permeate all organizational changes. This has been realized long
ago: management literature contains a substantial body of knowledge on how to successfully
coerce established habits into new paths [43, 130, 143–145, 152, 153], some dating back to the 1940s.

Summary: The research community produced substantial work on clone management, targeting
prevention, compensation and removal of cloning. Much of this work focuses on a single
management aspect, for example clone indication or tracking. However, the challenges faced by successful
clone management are not limited to developing appropriate tools. Instead, they require both an
understanding of the causes for cloning and changes to existing processes and developer behavior.
Changing established behavior is hard. Work in organizational change management has shown that
it encounters obstacles that need to be addressed for changes to succeed in the long term. This
is confirmed by reports on reluctance to adopt clone management [40] and other quality analysis
approaches [38] in industry.

Problem: Successful introduction of clone management requires changes to established processes
and habits. Existing work on clone management, however, focuses primarily on tools for individual
management tasks. This does not facilitate organizational change management. Without it, though,
clone management approaches are unlikely to achieve long-term success in practice.

Contribution: Chapter 8 presents a method to introduce clone control into a software maintenance
project. It adapts results from organizational change management to the domain of software cloning.
Furthermore, it documents causes of cloning and their solutions for effective clone prevention. The
chapter presents a long-term industrial case study that shows that the method can be employed to
successfully introduce clone control, and reduce the amount of cloning, in practice.

3.5 Limitations of Clone Detection

Several studies investigate which clones certain detection approaches can find. However, to
understand limitations of clone management in practice, we must understand which duplication they
cannot find. This section outlines research on detection of program similarity beyond cloning
created by copy & paste.

Simion Detection: Several authors dealt with the problem of finding behaviorally similar code,
although often only for a specific kind of similarity.

An early paper on the subject by Marcus and Maletic [167] deals with the detection of so-called
high-level concept clones. Their approach is based on reducing code chunks (usually methods or files) to
token sets, and performing latent semantic indexing (LSI) and clustering on these sets to find parts
of code that use the same vocabulary. The paper reports on finding multiple list implementations
in a case study, but does not quantify the number of clones found or the precision of the approach.
Limitations are identified especially in the case of missing or misleading comments, as these are
included in the clone search.

The work of Kawrykow and Robillard [127] aims at finding methods in a Java program which
reimplement functions available in libraries (APIs) used by the program. To this end, methods are
reduced to the set of classes, methods, and fields used, which are extracted from the byte-code, and
then matched pairwise to find similar methods. Additional heuristics are employed to reduce the
false positive rate. Application to ten open source projects identified 405 “imitations” of API
methods with an average precision of 31% (worst precision 4%). Since the entire set of all “imitations”
of the methods is unclear, the recall is unknown.
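The idea of reducing methods to the set of program elements they use and matching them pairwise can be illustrated with a small sketch. The Jaccard coefficient, the threshold of 0.6 and all method and element names below are assumptions chosen for illustration; the actual matching criteria and heuristics of the cited work are not reproduced here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_method_pairs(used_elements, threshold=0.6):
    """Return method pairs whose sets of used elements overlap strongly."""
    pairs = []
    # Compare every unordered pair of methods exactly once.
    for (m1, s1), (m2, s2) in combinations(sorted(used_elements.items()), 2):
        score = jaccard(s1, s2)
        if score >= threshold:
            pairs.append((m1, m2, round(score, 2)))
    return pairs


# Hypothetical methods, each reduced to the classes and methods it uses.
methods = {
    "Util.join": {"StringBuilder", "StringBuilder.append",
                  "Iterator.next", "Iterator.hasNext"},
    "Text.concat": {"StringBuilder", "StringBuilder.append",
                    "Iterator.next", "Iterator.hasNext", "String.trim"},
    "Report.total": {"List.size", "Math.max"},
}
pairs = similar_method_pairs(methods)
```

Such a purely set-based comparison ignores statement order and control flow, which is one reason why additional heuristics are needed to keep the false positive rate down.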
Nguyen et al. [183] apply a graph mining algorithm to a normalized control/data-flow graph to find
“usage patterns” of objects. The focus of their work is not the detection of cloning, but rather of
similar but inconsistent patterns, which hint at bugs. The precision of this process is about 20%.²
Again, w.r.t. simion detection, recall is unclear.
The paper [107] by Jiang et al. introduces an approach that can be summarized as dynamic
equivalence checking. The basic idea is that if two functions are different, they will return different results
on the same random input with high probability. Their tool, called EQMINER, detects functionally
equivalent functions in C code dynamically by executing them on random inputs. Using this tool,
they find 32,996 clusters of similar code in a subset of about 2.8 million lines of the Linux
kernel. Using their clone detector Deckard they report that about 58% of the behaviorally similar code
discovered is syntactically different. Since no systematic inspection of the clusters is reported, no
precision numbers are available. Again, due to several practical limitations of the approach (e.g.,
randomization of return values to external API calls), the recall w.r.t. simion detection is unclear.
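The basic idea of dynamic equivalence checking can be sketched in a few lines. This is not EQMINER: it works on Python functions over single integers instead of C code fragments, and the function names, input range and trial count are assumptions made for illustration.

```python
import random


def probably_equivalent(f, g, trials=1000, seed=0):
    """Return False as soon as one random input separates f and g."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        x = rng.randint(-10**6, 10**6)
        if f(x) != g(x):
            return False
    # No counterexample found: equivalent on all sampled inputs only.
    return True


def abs_branch(x):
    """Absolute value via an explicit branch."""
    return x if x >= 0 else -x


def abs_max(x):
    """Absolute value via max(); syntactically different, same behavior."""
    return max(x, -x)


def identity(x):
    """Differs from the absolute value for every negative input."""
    return x


same = probably_equivalent(abs_branch, abs_max)
different = probably_equivalent(abs_branch, identity)
```

A positive result only states that no distinguishing input was sampled, so functions that differ on rare inputs can still be grouped together. Code with side effects or external calls needs further machinery, which is where the practical limitations mentioned above come in.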
In [1], Al-Ekram et al. search for cloning between different open-source systems using a
token-based clone detector. They report that, to their surprise, they found little behaviorally similar code
across different systems, although the systems offered related functionality. The clones they did
find were typically in areas where the use of common APIs imposed a certain programming style,
thereby limiting program variation. However, since the absolute amount of behaviorally similar
code between the different systems is unknown, it is unclear whether the small amount of detected
behaviorally similar clones is due to their absence in the analyzed systems, or due to limitations of
clone detection.

In [12, 13], Basit and Jarzabek propose approaches to detect higher-level similarity patterns in
software. Their approach employs conventional clone detection and groups detected clones according
to different relation types, such as call relationships between the clones. While their approach helps
to comprehend detected clones through inferring structure, it does not detect more redundancy than
conventional clone detection, since it builds on it. It does thus not improve our understanding of the
limitations of clone detection w.r.t. simion detection.

² When including “code that could be improved for readability and understandability” as flaws, the
paper reports near 40% precision.

Algorithm Recognition: The goal of algorithm recognition [2, 176, 232] is to automatically
recognize different forms of a known algorithm in source code. Just as clone detection, it has to
cope with program variation. The most fundamental difference w.r.t. similar code detection is that
for algorithm recognition as proposed by [176, 232], the algorithms to be recognized need to be
known in advance.

Summary: Existing work on comparison of clone detection approaches [19, 196, 197, 200] has
shed light on the capabilities of clone detection to detect clones created through copy & paste
& modify. However, we know little about the limitations of clone detection w.r.t. discovery of
behaviorally similar code that is not a result of copy & paste but has been created independently.

Problem: We do not know how structurally different independently developed code with similar
behavior actually is. As a result, it is unclear to which extent real-world programs contain redundancy
that cannot be attributed to copy & paste, although intuition tells us that large projects are expected
to contain multiple implementations of the same functionality. As a consequence, we do not know
if we can discover simions that result from independent implementation of redundant requirements
on the code level.

Contribution: Chapter 9 presents the results of a controlled experiment that analyzes the amount
of program variation in over 100 implementations of a single specification that were produced
independently by student teams. It shows that existing clone detection approaches, not only
existing detectors, are poorly suited to detect simions that have not been created by copy & paste,
emphasizing the need to avoid creation of simions in the first place.

4 Impact on Program Correctness

Much of the research in clone detection and management is based on the assumption that
unawareness of cloning during maintenance threatens program correctness. This assumption, however,
has not been validated empirically. We do not know how well aware of cloning developers are,
and conversely, how strongly a lack of awareness impacts correctness. The impact of cloning on
program correctness is, hence, insufficiently understood. The importance of cloning, and of clone
management, thus remains unclear.

This chapter analyzes the impact of unawareness of cloning on program correctness through a large
industrial case study. It thus contributes to the better understanding of the impact of cloning and the
importance to perform clone detection and clone management in practice. Parts of the content of
this chapter have been published in [115].

4.1 Research Questions

We summarize the study using the goal definition template as proposed in [234]:

Analyze cloning in source code
for the purpose of characterization and understanding
with respect to its impact on program correctness
from the viewpoint of software developer and maintainer
in the context of industrial and open source projects

To this end, a set of industrial and open source projects is used as study objects. In detail, we
investigate the following 4 research questions:

RQ 1: Are clones changed independently?

The first question investigates whether type-3 clones appear in real-world systems. Besides whether
we can find them, it explores if they constitute a significant part of the total clones of a system. It
does not make sense to analyze inconsistent changes to clones if they are a rare phenomenon.

RQ 2: Are type-3 clones created unintentionally?

Having established that there are type-3 clones in real systems, we analyze whether they have been
created intentionally or not. It can be sensible to change a clone so that it becomes a type-3 clone,
if it has to conform to different requirements than its siblings. On the other hand, unintentional
differences can indicate problems that were not fixed in all siblings.

Figure 4.1: Clone group sets

RQ 3: Can type-3 clones be indicators for faults?

After establishing these prerequisites, we can determine whether the type-3 clones are indicators for
faults in real systems.

RQ 4: Do unintentional differences between type-3 clones indicate faults?

This question determines the importance of clone management in practice. Are unintentionally
created type-3 clones likely to indicate faults? If so, the reduction of unintentionally inconsistent
modifications can reduce the likelihood of errors. If not, clone management is less useful in
practice.

4.2 Study Design

We analyze the sets of clone groups as shown in Fig. 4.1: the outermost set contains all clone
groups C in a system; IC denotes the set of type-3 clone groups; UIC denotes the set of type-3
clone groups whose differences are unintentional, i.e., the differences between the siblings are not
wanted. The subset F of UIC comprises those type-3 clone groups with unintentional differences
that indicate a fault in the program. We focus on clone groups, instead of on individual clones, since
differences between clones are revealed only by comparison, and thus in the context of a clone
group, and are not apparent in the individual clones when regarded in isolation. Furthermore, we do not
distinguish between created and evolved type-3 clones: for the question of faultiness, it does not
matter when the differences have been introduced.

The independent variables in the study are development team, programming language, functional
domain, age and size. The dependent variables are explained below.

RQ 1 investigates the existence of type-3 clones in real-world systems. To answer it, we analyze
the size of set IC with respect to the size of set C. We apply our type-3 clone detection approach
(cf. Section 7.3.4) to all study objects, perform manual assessment of the detected clones to eliminate
false positives and calculate the type-3 clone ratio |IC|/|C|.

RQ 2 investigates whether type-3 clones are created unintentionally. To answer it, we compare
the size of the sets UIC and IC. The sets are populated by showing each identified type-3 clone
to developers of the system and asking them to rate the differences as intentional or unintentional.
This gives us the unintentionally inconsistent clone ratio |UIC|/|IC|.

RQ 3 investigates whether type-3 clones indicate faults. To answer it, we compute the size of set
F in relation to the size of IC. The set F is, again, populated by asking developers of the respective
system. Their expert opinion classifies the clones into faulty and non-faulty. We only analyze type-3
clones with unintentional differences. Our faulty inconsistent clone ratio |F|/|IC| is thus a lower
bound, as potential faults in intentionally different type-3 clones are not considered.

Based on this ratio, we create a hypothesis to answer RQ 3. We need to make sure that the fault
density in the inconsistencies is higher than in randomly picked lines of code. This leads to the
hypothesis H:

The fault density in the inconsistencies is higher than the average fault density.

As we do not know the actual fault densities of the analyzed systems, we need to resort to average
values. The span of available numbers is large because of the high variation in software systems.
Endres and Rombach [64] give 0.1–50 faults per kLOC as a typical range. For the fault density in
the inconsistencies, we use the number of faults divided by the logical lines of code of the
inconsistencies. We refrain from testing the hypothesis statistically because of the low number of data
points as well as the large range of typical defect densities.

RQ 4 investigates whether unintentionally different type-3 clones indicate faults. To answer it,
we compute the size of set F in relation to the size of set UIC. Again, the faulty unintentionally
inconsistent clone ratio |F|/|UIC| is a lower bound, as potential faults in intentionally different
clones are not considered.
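The measures used to answer the four research questions can be written compactly. The symbols r_RQ1 to r_RQ4, d and LLOC_inc are shorthand introduced here for illustration; they are not notation from the study itself.

```latex
\[
  F \subseteq \mathit{UIC} \subseteq \mathit{IC} \subseteq C
\]
\[
  r_{\mathrm{RQ1}} = \frac{|\mathit{IC}|}{|C|}, \qquad
  r_{\mathrm{RQ2}} = \frac{|\mathit{UIC}|}{|\mathit{IC}|}, \qquad
  r_{\mathrm{RQ3}} = \frac{|F|}{|\mathit{IC}|}, \qquad
  r_{\mathrm{RQ4}} = \frac{|F|}{|\mathit{UIC}|}
\]
\[
  r_{\mathrm{RQ3}} = r_{\mathrm{RQ2}} \cdot r_{\mathrm{RQ4}}, \qquad
  d = \frac{|F|}{\mathit{LLOC}_{\mathrm{inc}} / 1000}
\]
```

Here LLOC_inc denotes the logical lines of code of the inconsistencies, so d is the fault density in faults per kLOC. Since faults are only counted within UIC, both r_RQ3 and r_RQ4 are lower bounds.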

4.3 Study Objects

Since we required the willingness of developers to participate in clone inspections and clone
detection tailoring, we had to rely on our contacts with industry in our choice of study objects. However,
we chose systems with different characteristics to increase generalizability of the results.

We chose 2 companies and 1 open source project as sources of software systems. We chose systems
written in different languages, by different teams in different companies and with different
functionalities. The objects included 3 systems written in C#, a Java system as well as a long-lived COBOL
system. All of them are in production. For non-disclosure reasons, we gave the commercial systems
names from A to D. An overview is shown in Table 4.1.

Although systems A, B and C are all owned by Munich Re, they were each developed by different
organizations. They provide substantially different functionality, ranging from damage prediction,
over pharmaceutical risk management to credit and company structure administration. The systems
support between 10 and 150 expert users each. System D is a mainframe-based contract
management system written in COBOL employed by about 150 users. The open source system Sysiphus¹
is developed at the Technische Universität München (but the author of this thesis has not been
involved in its development). It constitutes a collaboration environment for distributed software
development projects. We included an open source system because, as the clone detection tool is
also freely available, the results can be externally replicated². This is not possible with the detailed
confidential results of the commercial systems.

Table 4.1: Summary of the analyzed systems

System    Organization  Language  Age (years)  Size (kLOC)
A         Munich Re     C#                  6          317
B         Munich Re     C#                  4          454
C         Munich Re     C#                  2          495
D         LV 1871       COBOL              17          197
Sysiphus  TUM           Java                8          281

4.4 Implementation and Execution

RQ 1: For all systems, our clone detector ConQAT was executed by a researcher to identify
type-3 clone candidates. On a 1.7 GHz notebook, the detection took between one and two minutes
for each system. The detection was configured to not cross method boundaries, since experiments
showed that type-3 clones that cross method boundaries in many cases did not capture semantically
meaningful concepts. This is also noted for type-2 clones in [142] and is even more pronounced
for type-3 clones. In COBOL, sections in the procedure division are the counterpart of Java or C#
methods; clone detection for COBOL was limited to these.

For the C# and Java systems, the algorithm was parameterized to use 10 statements as minimal
clone length, a maximum edit distance of 5, a maximal gap ratio (i.e., the ratio of edit distance and
clone length) of 0.2 and the constraint that the first 2 statements of two clones must be equal. Due
to the verbosity of COBOL [62], minimal clone length and maximal edit distance were doubled to
20 and 10, respectively. Generated code that is not subject to manual editing was excluded from
clone detection, since incomplete manual updates obviously cannot occur. Normalization of
identifiers and constants was tailored as appropriate for the analyzed language, to allow for renaming of
identifiers while avoiding too high false positive rates. These settings were determined to represent
the best combination of precision and recall during cursory experiments on the analyzed systems,
for which random samples of the detected clones were assessed manually.
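To make the interplay of these parameters concrete, the following sketch checks a candidate pair of normalized statement sequences against the thresholds. It is not ConQAT's detection algorithm: the names are hypothetical, the pairwise Levenshtein check merely stands in for the actual detector, and taking the length of the shorter clone as "clone length" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionParams:
    min_length: int = 10          # minimal clone length in statements
    max_edit_distance: int = 5    # maximal edit distance between the clones
    max_gap_ratio: float = 0.2    # edit distance divided by clone length
    equal_prefix: int = 2         # leading statements that must be equal


# COBOL settings: minimal length and maximal edit distance doubled.
COBOL_PARAMS = DetectionParams(min_length=20, max_edit_distance=10)


def edit_distance(a, b):
    """Levenshtein distance between two sequences of normalized statements."""
    prev = list(range(len(b) + 1))
    for i, stmt_a in enumerate(a, start=1):
        cur = [i]
        for j, stmt_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                         # deletion
                           cur[j - 1] + 1,                      # insertion
                           prev[j - 1] + (stmt_a != stmt_b)))   # substitution
        prev = cur
    return prev[-1]


def accept_pair(a, b, params=DetectionParams()):
    """Check a candidate pair against the thresholds (assumed interpretation)."""
    length = min(len(a), len(b))  # assumption: length of the shorter clone
    if length < params.min_length:
        return False
    if a[:params.equal_prefix] != b[:params.equal_prefix]:
        return False
    distance = edit_distance(a, b)
    return (distance <= params.max_edit_distance
            and distance / length <= params.max_gap_ratio)


# Hypothetical 12-statement fragment and two modified copies.
original = [f"stmt{i}" for i in range(12)]
one_change = original[:5] + ["changed"] + original[6:]
four_changes = original[:3] + ["changed"] * 4 + original[7:]
```

In this sketch, the copy with four changed statements stays within the absolute edit distance limit of 5 but exceeds the gap ratio of 0.2 (4/12), so the ratio is the binding constraint for short clones.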
The detected clone candidates were then manually rated by the author to remove false positives,
i.e., code fragments that, although identified as clone candidates by the detection algorithm, have no
semantic relationship. Type-3 and ungapped (type-1 and type-2) clone group candidates were treated
differently: all type-3 clone group candidates were rated, producing the set of type-3 clone groups
IC. Since the ungapped clone groups were not required for further steps of the case study, instead
of rating all of them, a random sample of 25% was rated, and false positive rates then extrapolated
to determine the number of ungapped clones.

¹ http://sysiphus.in.tum.de/
² http://wwwbroy.in.tum.de/~ccsm/icse09/
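The extrapolation step for ungapped clone groups can be sketched as follows. This is one plausible reading of the procedure, not the study's actual tooling: the function name, the seed and the stand-in rating function are hypothetical.

```python
import random


def estimate_true_groups(candidates, is_true_clone, sample_fraction=0.25, seed=0):
    """Rate a random sample and extrapolate its precision to all candidates."""
    if not candidates:
        return 0.0, 0
    rng = random.Random(seed)
    sample_size = max(1, round(len(candidates) * sample_fraction))
    sample = rng.sample(candidates, sample_size)
    # Precision observed in the manually rated sample.
    precision = sum(1 for c in sample if is_true_clone(c)) / sample_size
    # Extrapolate: estimated number of true clone groups among all candidates.
    return precision, round(precision * len(candidates))


# Hypothetical data: 200 candidate groups; the lambda stands in for manual rating.
candidates = list(range(200))
precision, estimated = estimate_true_groups(candidates, lambda c: c % 10 != 0)
```

The estimate is only as good as the sample: with a 25% sample, the resulting count of ungapped clone groups carries sampling error that a full rating would not have.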

RQs 2, 3 and 4: The type-3 clone groups were presented to the developers of the respective
systems using ConQAT’s clone inspection viewer. The developers rated whether the clone groups
were created intentionally or unintentionally. If a clone group was created unintentionally, the
developers also classified it as faulty or non-faulty. For the Java and C# systems, all type-3 clone
groups were rated by the developers. For the COBOL system, rating was limited to a random sample
of 68 out of the 151 type-3 clone groups, since the age of the system and the fact that the original
developers were not available for rating increased rating effort. Thus, for the COBOL case, the
results for RQ 2 and RQ 3 were computed based on this sample. In cases where intentionality or
faultiness could not be determined, e.g., because none of the original developers could be accessed
for rating, the inconsistencies were treated as intentional and non-faulty.

4.5 Results

RQ 1: The quantitative results of our study are summarized in Table 4.2. Except for the COBOL
system D, the precision values are smaller for type-3 clone groups than for ungapped clone groups.
This is not unexpected, since type-3 clone groups allow for more deviation. The high precision
results of system D result from the rather conservative clone detection parameters chosen due to
the verbosity of COBOL. For system A, stereotype database access code of semantically
unrelated objects gave rise to lower precision values. About half of the clones (52%) are strict type-3
clones: their clones differ beyond identifier names and literal or constant values. Therefore, RQ 1
can be answered positively: clones are changed independently, resulting in type-3 clones in their
systems.

Table 4.2: Summary of the study results

Project                                    A     B     C     D     Sysiphus  Sum   Mean
Precision ungapped clone groups            0.88  1.00  0.96  1.00  0.98      -     0.96
Precision type-3 clone groups              0.61  0.86  0.80  1.00  0.87      -     0.83
Clone groups |C|                           286   160   326   352   303       1427  -
Type-3 clone groups |IC|                   159   89    179   151   146       724   -
Unintent. diff. type-3 groups |UIC|        51    29    66    15    42        203   -
Faulty clone groups |F|                    19    18    42    5     23        107   -
RQ 1 |IC|/|C|                              0.56  0.56  0.55  0.43  0.48      -     0.52
RQ 2 |UIC|/|IC|                            0.32  0.33  0.37  0.10  0.29      -     0.28
RQ 3 |F|/|IC|                              0.12  0.20  0.23  0.03  0.16      -     0.15
RQ 4 |F|/|UIC|                             0.37  0.62  0.64  0.33  0.55      -     0.50
Inconsistent logical lines                 442   197   797   1476  459       3371  -
Fault density in kLOC^-1                   43    91.4  52.7  3.4   50.1      -     48.1
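The ratio and fault density rows of Table 4.2 follow directly from its count rows. The short sketch below recomputes them; the counts are read off the table, while the variable and function names are chosen here for illustration.

```python
# Per system: (|C|, |IC|, |UIC|, |F|, inconsistent logical lines), from Table 4.2.
COUNTS = {
    "A": (286, 159, 51, 19, 442),
    "B": (160, 89, 29, 18, 197),
    "C": (326, 179, 66, 42, 797),
    "D": (352, 151, 15, 5, 1476),
    "Sysiphus": (303, 146, 42, 23, 459),
}


def measures(c, ic, uic, f, lines):
    """Derive the RQ ratios and the fault density (faults per kLOC)."""
    return {
        "RQ1": ic / c,
        "RQ2": uic / ic,
        "RQ3": f / ic,
        "RQ4": f / uic,
        "density": f / lines * 1000,
    }


results = {system: measures(*counts) for system, counts in COUNTS.items()}
totals = tuple(sum(column) for column in zip(*COUNTS.values()))
mean_density = sum(r["density"] for r in results.values()) / len(results)

for system, r in results.items():
    print(f"{system:9s} RQ1={r['RQ1']:.2f} RQ2={r['RQ2']:.2f} "
          f"RQ3={r['RQ3']:.2f} RQ4={r['RQ4']:.2f} density={r['density']:.1f}")
```

Recomputing the rows this way also serves as a consistency check on the table: the per-system counts add up to the Sum column, and the rounded ratios match the reported values.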

Figure 4.2: Different UI behavior: right side does not use operations (Sysiphus)

RQ 2: From these type-3 clones, over a quarter (28%) has been introduced unintentionally. Hence,
RQ 2 can also be answered positively: type-3 clones are created unintentionally in many cases.
Only system D exhibits a lower value, with only 10% of unintentionally created type-3 clones.
With about three quarters of intentional changes, this shows that cloning and changing code seems
to be a frequent pattern during development and maintenance.

RQ 3: Depending on the system, at least 3% to 23% of the differences represented a fault. Again,
the by far lowest number comes from the COBOL system. Ignoring it, the total ratio of faulty type-3
clone groups goes up to 18%. This constitutes a significant share that needs consideration. To judge
hypothesis H, we also calculated the fault densities. They lie in the range of 3.4–91.4 faults per kLOC.
Again, system D is an outlier. Compared to reported fault densities in the range of 0.1 to 50 faults
per kLOC and considering that all systems are not only delivered but even have been productive for
several years, we consider our results to support hypothesis H. On average, the inconsistencies contain
more faults than average code. Hence, RQ 3 can also be answered positively: type-3 clones can be
indicators for faults in real systems.

Although not central to our research questions, the detection of faults almost automatically raises the
question of their severity. As the fault effect costs are unknown for the analyzed systems, we cannot
provide a full-fledged severity classification. However, we provide a partial answer by categorizing
the found faults:

Critical: faults that lead to potential system crash or data loss. One example for a fault in
this category is shown in Figure 1.2 in Chapter 1. Here, one clone of the affected clone group
performs a null-check to prevent a null-pointer dereference, whereas the other does not. Other
examples we encountered are index-out-of-bounds exceptions, incorrect transaction handling
and missing rollbacks.

User-visible: faults that lead to unexpected behavior visible to the end user. Fig. 4.2 shows an
example: in one clone, the performed operation is not encapsulated in an operation object and,
hence, is handled differently by the undo mechanism. Further examples we found are
incorrect end user messages, inconsistent default values as well as different editing and validation
behavior in similar user forms and dialogs.

Non-user-visible: faults that lead to unexpected behavior not visible to the end user. Examples
we identified include unnecessary object creation, minor memory leaks, performance issues
like missing break statements in loops and redundant re-computations of cached values;
differences in exception handling, different exception and debug messages or different log levels
for similar cases.

Of the 107 faults found, 17 were categorized as critical, 44 as user-visible and 46 as non-user-visible
faults. Since all analyzed systems are in production, the relatively smaller number of critical faults
coincides with our expectations.

RQ 4: While the numbers are similar for the C# and Java projects, rates of unintentional
inconsistencies and thus faults are comparatively low for project D, which is a legacy system written in
COBOL. To a certain degree, we attribute this to our conservative assessment strategy of treating
inconsistencies whose intentionality and faultiness could not be unambiguously determined as
intentional and non-faulty. Furthermore, interviewing the current maintainers of the systems revealed
that cloning is such a common pattern in COBOL systems that searching for duplicates of a piece
of code is an integral part of their maintenance process. Compared to the developers of the other
projects, the COBOL developers were thus more aware of clones in the system.

The row |F|/|UIC| in Table 4.2 accounts for this difference in “clone awareness”. It reveals that,
while the rates of unintentional changes are lower for project D, the ratio of unintentional changes
leading to a fault is in the same range for all projects. From our results, it seems that about every
second to third unintentional change to a clone leads to a fault.

4.6 Discussion

Even considering the threats to validity discussed below, the results of the study show convincingly
that clones can lead to faults. The inconsistencies between clones are often not justified by different
requirements but can be explained by developer mistakes.

While the ratio of unintentionally inconsistent changes varied strongly between systems, we
consistently found across all study objects that unintentionally inconsistent changes are likely to indicate
faults: on average, in roughly every second case. We consider this a strong indication that clone
management is useful in practice, since it can reduce the likelihood of unintentionally inconsistent
changes.

4.7 Threats to Validity

We discuss how we mitigated threats to internal and external validity of our studies.

4.7.1 Internal Validity

Wedidnotanalyzetheevolutionhistoriesofthesystemstodeterminewhethertheinconsistencies
havebeenintroducedbyincompletechangestothesystemandnotbyrandomsimilaritiesofun-
relatedcode.Thishastworeasons:(1)Wewanttoanalyzealltype-3clones,alsotheonesthat
havebeenintroduceddirectlybycopyandmodi®cationinasinglecommit.Thosemightnotbe
visibleintherepository.(2)Theindustrialsystemsdonothavecompletedevelopmenthistories.
Weconfrontedthisthreatbymanuallyanalyzingeachpotentialtype-3clone.
Thecomparisonwithaveragefaultprobabilityisnotperfecttodeterminewhethertheinconsisten-
ciesaremorefault-pronethanarandompieceofcode.Acomparisonwiththeactualfaultdensities
ofthesystemsoractualchecksforfaultsinrandomcodelineswouldbettersuitthispurpose.How-
ever,theactualfaultdensitiesarenotavailabletousbecauseofincompletedefectdatabases.To
checkforfaultsinrandomcodelinesispracticallynotpossible.Wewouldneedthedevelopers’
timeandwillingnessforinspectingrandomcode.Asthepotentialbene®tforthemislow,the
motivationwouldbelowandhencetheresultswouldbeunreliable.
Asweaskthedevelopersfortheirexpertopiniononwhetheraninconsistencyisintentionalor
unintentionalandfaultyornon-faulty,athreatisthatthedevelopersdonotjudgethiscorrectly.
Onecaseisthatthedeveloperassessessomethingthatisfaultyincorrectlyasnon-faulty.This
caseonlyreducesthechancestopositivelyanswertheresearchquestions.Thesecondcaseisthat
thedevelopersratesomethingasfaultywhichisnofault.Wemitigatedthisthreatbyonlyrating
aninconsistencyasfaultyifthedeveloperwasentirelysure.Otherwiseitwaspostponedandthe
developerconsultedcolleagueswhoknewthecorrespondingpartofthecodebetter.Inconclusive
candidateswererankedasintentionalandnon-faulty.Again,onlytheprobabilitytoanswerthe
researchquestionpositivelywasreduced.
The configuration of the clone detection tool has a strong influence on the detection results. We calibrated the parameters based on a pre-study and our experience with clone detection in general. The configuration also varies over the different programming languages encountered, due to their differences in features and language constructs. However, this should not strongly affect the detection of type-3 clones because we spent great care to configure the tool in a way that the resulting clones are sensible.
We also pre-processed the type-3 clones that we presented to the developers to eliminate false positives. This could mean that we excluded clones that were faulty. However, this again only reduced the chances that we could answer our research question positively.
Our definition of clones and clone groups does not prevent different groups from overlapping with each other; a group with two long clones can, e.g., overlap with a group with four shorter clones, as, e.g., groups b and c in the example in Section 2.5.1. Substantial overlap between clone groups could potentially distort the results. This did, however, not occur in the study, since there was no substantial overlap between clone groups in IC. For system A, e.g., 89% of the cloned statements did not occur in any other clone. Furthermore, overlap was taken into account when counting faults: even if a faulty statement occurred in several overlapping clone groups, it was only counted as a single fault.


4.7.2 External Validity

The projects were obviously not sampled randomly from all possible software systems but we relied on our connections with the developers of the systems. Hence, the set of systems is not entirely representative. The majority of the systems is written in C# and analyzing 5 systems in total is not a high number. However, all 5 systems have been developed by different development organizations and the C# systems are technically different (2 web, 1 rich client) and provide substantially different functionalities. We further mitigated this threat by also analyzing a legacy COBOL system as well as an open source Java system.

4.8 Summary

This chapter presented the results of a large case study on the impact of cloning on program correctness. In the five analyzed systems, 107 faults were discovered through the analysis of unintentionally inconsistent changes to cloned code. Of them, 17 were classified as critical by the system developers; 44 could cause undesired program behavior that was visible to the user.
We observed two effects concerning the maintenance of clones. First, the awareness of cloning varied substantially across the systems. Some developer teams were more aware of the existing clones than others, resulting in different likelihoods of unintentionally inconsistent changes to cloned code. Second, the impact of unawareness of cloning was consistent. On average, every second unintentional inconsistency indicated a fault in the software. In a nutshell, while the amount of unawareness of cloning varied between systems, it had a consistently negative impact.
The study results emphasize the negative impact of a lack of awareness of cloning during maintenance. Consequently, they emphasize the importance of clone control. Since every second unintentionally inconsistent change created a fault (or failed to remove a fault from the system), clone control can provide substantial value if it manages to decrease the likelihood of such changes, by decreasing the extent and increasing the awareness of cloning.


5 Cloning Beyond Code

The previous chapter has shown that unawareness of clones in source code negatively affects program correctness. Cloning has, however, not been investigated in other artifact types. It is thus unclear whether clones occur and should be controlled in other artifacts, too.
We conjecture that cloning can occur in all artifacts created and maintained during software engineering, including non-code artifacts, and that engineers need to be aware of clones when using them.
This chapter presents a large case study on clones in requirements specifications and data-flow models. It investigates the extent of clones in these artifacts and its impact on engineering activities. It demonstrates that cloning can occur in non-code artifacts and gives indication for its negative impact. Parts of the content of this chapter have been published in [54, 57, 111].

5.1 Research Questions

We summarize the study using the goal definition template as proposed in [234]:
Analyze cloning in requirements specifications and models
for the purpose of characterization and understanding
with respect to its extent and impact on engineering activities
from the viewpoint of requirements engineer and quality assessor
in the context of industrial projects
Therefore, a set of specifications and models from industrial projects are used as study objects. We further detail the objectives of the study using five research questions. The first four questions target requirements specifications, the fifth targets data-flow models.

RQ 5 How accurately can clone detection discover cloning in requirements specifications?

We need an automatic detection approach for a large-scale study of cloning in requirements specifications. This question investigates whether existing clone detectors are appropriate, or if new approaches need to be developed. It provides the basis for the study of the extent and impact of requirements cloning.

RQ 6 How much cloning do real-world requirements specifications contain?

The amount of cloning in requirements specifications determines the relevance of this study. If they contain little or no cloning, it is unlikely to have a strong impact on maintenance.


RQ 7 What kind of information is cloned in requirements specifications?

The kind of information that is cloned influences the impact of cloning on maintenance. Is cloning limited to, or especially frequent for, a specific kind of information contained in requirements specifications?

RQ 8 Which impact does cloning in requirements specifications have?

Cloning in code is known to have a negative impact on maintenance. Can it also be observed for cloning in specifications? This question determines the relevance of cloning in requirements specifications for software maintenance.

RQ 9 How much cloning do real-world Matlab/Simulink models contain?

As for code and requirements specifications, the amount of cloning is an indicator of the importance of clone detection and clone management for real-world Matlab/Simulink models.

5.2 Study Design

A requirements specification is interpreted as a single sequence of words. In case it comprises multiple documents, individual word lists are concatenated to form a single list for the requirements specification. Normalization is a function that transforms words to remove subtle syntactic differences between words with similar denotation. A normalized specification is a sequence of normalized words. A specification clone candidate is a (consecutive) substring of the normalized specification with a certain minimal length, appearing at least twice.
For specification clone candidates to be considered as clones, they must convey semantically similar information and this information must refer to the system described. Examples of clones are duplicated use case preconditions or system interaction steps. Examples of false positives are duplicated document headers or footers or substrings that contain the last words of one and the first words of the subsequent sentence without conveying meaning.
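
The definitions above can be illustrated with a minimal Python sketch. It is not the detection algorithm of ConQAT; it only enumerates fixed-length word windows that occur at least twice in a normalized word sequence. The stop-word list, the normalization rules and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the normalization used in the study is not specified here.
STOP_WORDS = {"a", "an", "the"}


def normalize(text):
    """Turn raw text into a list of normalized words (lower case, no punctuation)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def clone_candidates(words, min_length):
    """Map each window of min_length consecutive words that occurs at least
    twice to the list of its start positions in the word sequence."""
    positions = defaultdict(list)
    for start in range(len(words) - min_length + 1):
        window = tuple(words[start:start + min_length])
        positions[window].append(start)
    return {w: p for w, p in positions.items() if len(p) >= 2}


spec = ("The user opens the dialog. The system shows the form. "
        "Later, the user opens the dialog again.")
candidates = clone_candidates(normalize(spec), min_length=3)
```

A real detector would additionally merge overlapping windows into maximal clones; whether a candidate is a clone still requires the manual relevance check described above.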

RQs 5 to 8. The study uses content analysis of specification documents to answer the research questions. For further explorative analyses, the content of source code is also analyzed. Content analysis is performed using ConQAT as clone detection tool as well as manually.
First, we assign requirements specifications to pairs of researchers for analysis. Assignment is randomized to reduce any potential bias that is introduced by the researchers. Clone detection is performed on all documents of a specification.
Next, the researcher pairs perform clone detection tailoring for each specification. For this, they manually inspect detected clones for false positives. Filters are added to the detection configuration so that these false positives no longer occur. The detection is re-run and the detected clones are analyzed. This is repeated until no false positives are found in a random sample of the detected clone groups. To answer RQ 5, precision before and after tailoring, categories of false positives and times required for tailoring are recorded.
The results of the tailored clone detection comprise a report with all clones and clone metrics that are used to answer RQ 6: clone coverage, number of clone groups and clones, and overhead. Overhead is measured in relative and absolute terms. Standard values for reading and inspection speeds from the literature are used to quantify the additional effort that this overhead causes. Overhead and cloning-induced efforts are used to answer RQ 8.
For each specification, we qualitatively analyze a random sample of clone groups for the kind of information they contain. We start with an initial categorization from an earlier study [57] and extend it, when necessary, during categorization (formally speaking, we thus employ a mixed theory-based and grounded theory approach [39]). If a clone contains information that can be assigned to more than one category, it is assigned to all suitable categories. The resulting categorization of cloned information in requirements specifications is used to answer RQ 7. To ensure a certain level of objectiveness, inter-rater agreement is measured for the resulting categorization.
In many software projects, SRS are no read-only artifacts but undergo constant revisions to adapt to ever changing requirements. Such modifications are hampered by cloning as changes to duplicated text often need to be carried out in multiple locations. Moreover, if the changes are unintentionally not performed to all affected clones, inconsistencies can be introduced in the SRS that later on create additional efforts for clarification. In the worst case, they make it to the implementation of the software system, causing inconsistent behavior of the final product. Studies show that this occurs in practice for inconsistent modifications to code clones [115]. We thus expect that it can also happen in SRS. Hence, besides the categories, further noteworthy issues of the clones noticed during manual inspection are documented, such as inconsistencies in the duplicated specification fragments. This information is used for additional answers to RQ 8.
Moreover, on selected specifications, content analysis of the source code of the implementation is performed: we investigate the code corresponding to specification clones to classify whether the specification cloning resulted in code cloning, duplicated functionality without cloning, or was resolved through the creation of a shared abstraction. These effects are only given qualitatively. Further quantitative analysis is beyond the scope of this thesis.
In the final step, all collected data is analyzed and interpreted to answer the research questions. An overview of the steps of the study is given in Fig. 5.1.

RQ 9. We used the clone detection approach presented in Sec. 7.3.5 to detect clones in Matlab/Simulink models. To capture the extent of cloning in models, we recorded clone counts and coverage.

5.3 Study Objects

RQs 5 to 8. We use 28 requirements specifications as study objects from the domains of administration, automotive, convenience, finance, telecommunication, and transportation. The specified systems include software development tools, business information systems, platforms, and embedded systems. The specifications are written in English or German; their scope ranges from a part to the entire set of requirements of the software systems they describe. For non-disclosure reasons, the systems are named A to Z, AB and AC. An overview is given in Table 5.1. The specifications were obtained from different organizations, including Munich Re Group, Siemens AG and the MOST Cooperation.¹

[Figure 5.1: Study design overview. Steps: random assignment of specifications; run clone detection tool; inspect detected clones; if false positives are found, add a filter and re-run the detection, otherwise categorize clones; independent re-categorization; analysis of further effects; data analysis and interpretation.]

The specifications mainly contain natural language text. If present, other content, such as images or diagrams, was ignored during clone detection. Specifications N, U and Z are Microsoft Excel documents. Since they are not organized as printable pages, no page counts are given for them. The remaining specifications are either in Adobe PDF or Microsoft Word format. In some cases, these specifications are generated from requirements management tools. To the best of our knowledge, the duplication encountered in the specifications is not introduced during generation.

Obviously, the specifications were not sampled randomly, since we had to rely on our relationships with our partners to obtain them. However, we selected specifications from different companies for different types of systems in different domains to increase generalizability of the results.

RQ 9. We employed a model provided by MAN Nutzfahrzeuge Group. It implements the major part of the power train management system. To allow for adaptation to different variants of trucks and buses, it is heavily parameterized. The model consists of more than 20,000 TargetLink blocks that are distributed over 71 Simulink files. Such files are typical development/modelling units for Simulink/TargetLink.

¹ Due to non-disclosure reasons, we cannot list all 11 companies from which specifications were obtained.


Table 5.1: Study objects

Spec  Pages  Words      Spec  Pages  Words
A51741,482O18418,750
CB1,013133130,96818,447PQ45336,9775,040
ED18524137,05637,969SR14410924,34315,462
GF854210,0767,662TUn/a4043,2167,799
IH1605319,6326,895VW21144831,67095,399
KJ28394,4115,912XY15823519,67949,425
L53584,959Zn/a13,807
NM233n/a103,06746,763AABC3,100696274,48981,410
1,242,7658,667

5.4 Implementation and Execution

This section details how the study design was implemented and executed on the study objects.

RQs 5 and 6. Clone detection and metric computation is performed using the tool ConQAT as described in Sec. 3.3. Detection used a minimal clone length of 20 words. This threshold was found to provide a good balance between precision and recall during precursory experiments that applied clone detection tailoring.
Precision is determined by measuring the percentage of the relevant clones in the inspected sample. Clone detection tailoring is performed by creating regular expressions that match the false positives. Specification fragments that match these expressions are then excluded from the analysis. If more than 20 clone groups are found for a specification, a maximum number of 20 randomly chosen clone groups is inspected in each tailoring step, to keep manual effort within feasible bounds; else, false positives are removed manually and no further tailoring is performed.
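
One tailoring step can be sketched in Python as follows. This is an illustration under assumptions, not the ConQAT configuration used in the study: the function names, the example pattern and the per-group relevance callback are hypothetical, and precision is computed per sampled clone group for simplicity.

```python
import random
import re


def apply_filters(fragments, patterns):
    """Drop specification fragments that match any false-positive pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [f for f in fragments if not any(c.search(f) for c in compiled)]


def sample_precision(clone_groups, is_relevant, sample_size=20, seed=0):
    """Estimate precision on at most sample_size randomly chosen clone groups.

    is_relevant stands in for the manual judgment of the researchers.
    Returns None if there is nothing to inspect.
    """
    rng = random.Random(seed)
    if len(clone_groups) > sample_size:
        sample = rng.sample(clone_groups, sample_size)
    else:
        sample = list(clone_groups)
    if not sample:
        return None
    return sum(1 for group in sample if is_relevant(group)) / len(sample)


fragments = ["Copyright 2010 Example Corp. Page 3",
             "The user enters the password"]
kept = apply_filters(fragments, [r"Copyright \d{4}"])
```

In the study, such filters are added until a random sample of the detected clone groups contains no more false positives.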

RQ 7. If more than 20 clone groups are found for a specification, the manual classification is performed on a random sample of 20 clone groups; else, all clone groups for a specification are inspected. During inspection, the categorization was extended by 8 categories, 1 was changed, none were removed. To improve the quality of the categorization results, categorization is performed together by a team of 2 researchers for each specification. Inter-rater agreement is determined by calculating Cohen's Kappa for 5 randomly sampled specifications from which 5 clone groups each are independently re-categorized by 2 researchers.
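
For reference, Cohen's Kappa relates the observed agreement of two raters to the agreement expected by chance. The Python sketch below assumes that each rater assigns exactly one category per clone group; the study allows multiple categories per clone, so this is a simplification, and the function name is chosen for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


rater_1 = ["UI", "UI", "Reference", "Reference"]
rater_2 = ["UI", "Reference", "Reference", "Reference"]
kappa = cohens_kappa(rater_1, rater_2)
```

Here the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and Kappa is therefore 0.5.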


RQ 8. Overhead metrics are computed as described in Section 2.5.4. The additional effort for reading is calculated using the data from [87], which gives an average reading speed of 220 words per minute. For the impact on inspections performed on the requirements specifications, we refer to Gilb and Graham [79], who suggest 1 hour per 600 words as inspection speed. This additional effort is calculated both for each specification and as the mean over all.
To analyze the impact of specification cloning on source code, we use a convenience sample of the study objects. We cannot employ a random sample, since for many study objects, the source code is unavailable or traceability between SRS and source code is too poor. Of the systems with sufficient traceability, we investigate the 5 clone groups with the longest and the 5 with the shortest clones as well as the 5 clone groups with the least and the 5 with the most instances. The requirements' IDs in these clone groups are traced to the code and compared to clone detection results on the code level. ConQAT is used for code clone detection.

RQ 9. The detection approach outlined in Section 7.3.5 was adjusted to Simulink models. For the normalization labels we used the type; for some of the blocks that implement several similar functions, we added the value of the attribute that distinguishes them (e.g., for the Trigonometry block this is an attribute deciding between sine, cosine, and tangent). Numeric values, such as the multiplicative constant for gain, were removed. This way, detection can discover partial models which could be extracted as library blocks where such constants could be made parameters of the new library block. From the clones found, we discarded all those consisting of less than 5 blocks, as this is the smallest amount we still consider to be relevant at least in some cases. Furthermore, we implemented a weighting scheme that assigns a weight to each block type, with a default of 1. Infrastructure blocks (e.g., terminators and multiplexers) were assigned a weight of 0, while blocks having a functional meaning (e.g., integration or delay blocks) were weighted with 3. The weight of a clone is the sum of the weights of its blocks. Clones with a weight less than 8 were also discarded, which ensures that small clones are considered only if their functional portion is large enough.
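
The two filters described above (at least 5 blocks, weight of at least 8) can be sketched in Python as follows. The block type names in the two sets are hypothetical examples; only the weights 0, 1 and 3 and the two thresholds are taken from the text.

```python
# Hypothetical block type names, chosen for illustration only.
INFRASTRUCTURE_TYPES = {"Terminator", "Mux", "Demux"}
FUNCTIONAL_TYPES = {"Integrator", "UnitDelay"}


def block_weight(block_type):
    """Weight 0 for infrastructure blocks, 3 for functional blocks, 1 otherwise."""
    if block_type in INFRASTRUCTURE_TYPES:
        return 0
    if block_type in FUNCTIONAL_TYPES:
        return 3
    return 1


def clone_weight(block_types):
    """The weight of a clone is the sum of the weights of its blocks."""
    return sum(block_weight(t) for t in block_types)


def is_relevant_clone(block_types, min_blocks=5, min_weight=8):
    """Keep a clone only if it is large enough and functional enough."""
    return len(block_types) >= min_blocks and clone_weight(block_types) >= min_weight


mux_only = ["Mux"] * 6
mixed = ["Integrator", "UnitDelay", "Gain", "Sum", "Mux"]
```

With these weights, a clone of six multiplexers has weight 0 and is discarded, while the mixed clone has weight 3 + 3 + 1 + 1 + 0 = 8 and is kept.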

5.5 Results

This section presents results ordered by research question.

5.5.1 RQ 5: Detection Tailoring and Accuracy

RQ 5 investigates whether redundancy in real-world requirements specifications can be detected with existing approaches.
Precision values and times required for clone detection tailoring are depicted in Table 5.2. Tailoring times do not include setup times and duration of the first detection run. If no clones are detected for a specification (i.e., Q and T), no precision value is given. While for some specifications no tailoring is necessary at all, e.g., E, F, G, or S, the worst precision value without tailoring is as low as 2% for specification O. In this case, hundreds of clones containing only the page footer cause
68

Table5.2:Studyresults:tailoring

SbefPrec..Tminail.Prec.afterSbefPrec..Tminail.Prec.after
BA58%27%1530100%100%PO48%2%208100%100%
DC99%45%255100%99%RQ40%n/a41100%n/a
EF100%100%24100%100%ST100%n/a21100%n/a
HG100%97%102100%97%VU59%85%56100%85%
IJ100%71%28100%100%XW100%96%136100%100%
KL96%52%262100%96%YZ100%97%17100%100%
NM100%44%234100%100%AABC30%48%3314100%100%


the large amount of false positives. For 8 specifications (A, C, M, O, P, R, AB, and AC), precision values below 50% are measured before tailoring. The false positives contain information from the following categories:
Document meta data comprises information about the creation process of the document. This includes author information and document edit histories or meeting histories typically contained at the start or end of a document.
Indexes do not add new information and are typically generated automatically by text processors. Encountered examples comprise tables of content or subject indexes.
Page decorations are typically automatically inserted by text processors. Encountered examples include page headers and footers containing lengthy copyright information.
Open issues document gaps in the specification. Encountered examples comprise “TODO” statements or tables with unresolved questions.
Specification template information contains section names and descriptions common to all individual documents that are part of a specification.
Some of the false positives, such as document headers or footers, could possibly be avoided by accessing requirements information in a more direct form than done by text extraction from requirements specification documents.
Precision was increased substantially by clone detection tailoring. Precision values for the specifications are above 85%, average precision is 99%. The time required for tailoring varies between 1 and 33 minutes across specifications. Low tailoring times occurred when either no false positives were encountered, or they could very easily be removed, e.g., through exclusion of page footers by adding a single simple regular expression. On average, 10 minutes were required for tailoring.

5.5.2 RQ 6: Amount of SRS Cloning
RQ 6 investigates the extent of cloning in real-world requirements specifications. The results are shown in columns 2–4 of Table 5.3. Clone coverage varies widely: from specifications Q and T, in which not a single clone of the required length is found, to specification H containing about two-thirds of duplicated content. 6 out of the 28 analyzed specifications (namely A, F, G, H, L, Y) have a clone coverage above 20%. The average specification clone coverage is 13.6%. Specifications A, D, F, G, H, K, L, V and Y even have more than one clone per page. No correlation between specification size and cloning is found. (Pearson's coefficient for clone coverage and number of words is -0.06, confirming a lack of correlation.)
Table 5.3: Study results: cloning
Spec  Clone cov.  Clone groups  Clones  Overhead relative  Overhead words
A35.0%25991432.6%10,191
B8.9%2656395.3%6,639
DC18.5%8.1%105374798811.5%6.9%2,4631,907
FE51.1%0.9%5061621260.6%0.4%2,890161
HG71.6%22.1%7160360262129.6%20.4%11,0831,704
I5.5%7153.0%201
J1.0%120.5%22
LK20.5%18.1%303197945514.1%13.4%10,475699
NM1.2%8.2%15911373230.6%5.0%4,915287
PO5.8%1.9%5810163.0%1.0%204182
Q0.0%000.0%0
R0.7%240.4%56
TS0.0%1.6%1102700.0%0.9%2280
UV15.5%11.2%2018523748510.8%7.0%6,2044,206
XW12.4%2.0%211445316.8%1.1%1,253355
Y21.9%18155318.2%7,593
Z19.6%5011714.2%1,718
AABC12.1%5.4%6356518181483.2%8.7%21,9932,549
Avg  13.6%  2,631  7,669  13.5%  100,178

Fig. 5.2 depicts the distribution of clone lengths in words (a) and of clone group cardinalities (b), i.e., the number of times a specification fragment has been cloned.² Short clones are more frequent than long clones. Still, 90 clone groups have a length greater than 100 words. The longest detected group comprises two clones of 1049 words each, describing similar input dialogs for different types of data.
Clone pairs are more frequent than clone groups of cardinality 3 or higher. However, aggregated across all specifications, 49 groups with cardinality above 10 were detected. The largest group encountered contains 42 clones that contain domain knowledge about roles involved in contracts.
5.5.3 RQ 7: Cloned Information

RQ 7 investigates which kind of information is cloned in real-world requirements specifications. The categories of cloned information encountered in the study objects are:
Detailed Use Case Steps: Description of one or more steps in a use case on how a user interacts with the system, such as the steps required to create a new customer account in a system.
Reference: Fragment in a requirements specification that refers to another document or another part of the same document. Examples are references in a use case to other use cases or to the corresponding business process.
UI: Information that refers to the (graphical) user interface. The specification of which buttons are visible on which screen is an example for this category.
Domain Knowledge: Information about the application domain of the software. An example are details about what is part of an insurance contract for a software that manages insurance contracts.
Interface Description: Data and message definitions that describe the interface of a component, function, or system. An example is the definition of messages on a bus system that a component reads and writes.
Pre-Condition: A condition that has to hold before something else can happen. A common example are pre-conditions for the execution of a specific use case.
Side-Condition: Condition that describes the status that has to hold during the execution of something. An example is that a user has to remain logged in during the execution of a certain functionality.
Configuration: Explicit settings for configuring the described component or system. An example are timing parameters for configuring a transmission protocol.
Feature: Description of a piece of functionality of the system on a high level of abstraction.
Technical Domain Knowledge: Information about the used technology for the solution and the technical environment of the system, e.g., used bus systems in an embedded system.
Post-Condition: Condition that describes what has to hold after something has been finished. Analogous to the pre-conditions, post-conditions are usually part of use cases to describe the system state after the use case execution.
Rationale: Justification of a requirement. An example is the explicit demand by a certain user group.

² The rightmost value in each diagram aggregates data that is outside its range. For conciseness, the distributions are given for the union of detected clones across specifications and not for each one individually. The general observations are, however, consistent across specifications.

[Figure 5.2: Distribution of clone lengths and clone group cardinalities. Panel (a): number of clone groups over clone length in words. Panel (b): number of clone groups over clone group cardinality.]
We document the distribution of clone groups to the categories for the sample of categorized clone groups. 404 clone groups are assigned 498 times (multiple assignments are possible). The quantitative results of the categorization are depicted in Fig. 5.3. The highest number of assignments are to category “Detailed Use Case Steps” with 100 assignments. “Reference” (64) and “UI” (63) follow. The least number of assignments are to category “Rationale” (8).

[Figure 5.3: Quantitative results for the categorization of cloned information. Bar chart of the number of assignments per category.]

The random sample for inter-rater agreement calculation consists of the specifications L, R, U, Z, and AB. From each specification, 5 random clones are inspected and categorized. As one specification only has 2 clone groups, in total 22 clone groups are inspected. We measure the inter-rater agreement using Cohen's Kappa with a result of 0.67; this is commonly considered as substantial agreement. Hence, the categorization is good enough to ensure that independent raters categorize the cloned information similarly, implying a certain degree of completeness and suitability.

5.5.4 RQ 8: Impact of SRS Cloning

RQ 8 investigates the impact of SRS cloning with respect to (1) specification reading, (2) specification modification and (3) specification implementation.
Specification Reading. Cloning in specifications obviously increases specification size and, hence, affects all activities that involve reading the specification documents. As Table 5.4 shows, the average overhead of the analyzed SRS is 3,578 words which, at a typical reading speed of 220 words per minute [87], translates to an additional 16 minutes spent on reading for each document.


While this does not appear to be a lot, one needs to consider that quality assurance techniques like inspections assume a significantly lower processing rate. For example, [79] considers 600 words per hour as the maximum rate for effective inspections. Hence, the average additional time spent on inspections of the analyzed SRS is expected to be about 6 hours. In a typical inspection meeting with 3 participants, this amounts to 2.25 person days. For specification AB with an overhead of 21,993 words, effort increase is expected to be greater than 13 person days if three inspectors are applied.
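
The effort figures above follow from simple arithmetic, sketched here in Python. The reading and inspection rates are the ones cited in the text; the 8-hour person day is an assumption that is consistent with 6 hours for 3 participants amounting to 2.25 person days. The function names are chosen for illustration.

```python
READING_WORDS_PER_MINUTE = 220   # reading speed cited from [87]
INSPECTION_WORDS_PER_HOUR = 600  # inspection rate cited from [79]
HOURS_PER_PERSON_DAY = 8         # assumption, not stated in the text


def reading_minutes(overhead_words):
    """Additional reading time in clock minutes."""
    return overhead_words / READING_WORDS_PER_MINUTE


def inspection_hours(overhead_words):
    """Additional inspection time in clock hours."""
    return overhead_words / INSPECTION_WORDS_PER_HOUR


def inspection_person_days(overhead_words, participants=3):
    """Additional inspection effort in person days for a given team size."""
    hours = inspection_hours(overhead_words) * participants
    return hours / HOURS_PER_PERSON_DAY


average = 3578   # average overhead in words
spec_ab = 21993  # overhead of specification AB in words
```

For the average overhead this gives about 16.3 minutes of reading and about 6.0 hours of inspection. The 2.25 person days in the text result from the rounded 6 hours; for specification AB the sketch yields roughly 13.7 person days.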

Table 5.4: Study results: impact

S    overhead [words]   read. [m]³   insp. [h]⁴      S    overhead [words]   read. [m]³   insp. [h]⁴
A    10,191             46.3         17.0            O    182                0.8          0.3
B    6,639              30.2         11.1            P    204                0.9          0.3
C    1,907              8.7          3.2             Q    0                  0.0          0.0
D    2,463              11.2         4.1             R    56                 0.3          0.1
E    161                0.7          0.3             S    228                1.0          0.4
F    2,890              13.1         4.8             T    0                  0.0          0.0
G    1,704              7.7          2.8             U    4,206              19.1         7.0
H    11,083             50.4         18.5            V    6,204              28.2         10.3
I    201                0.9          0.3             W    355                1.6          0.6
J    22                 0.1          0.0             X    1,253              5.7          2.1
K    699                3.2          1.2             Y    7,593              34.5         12.7
L    10,475             47.6         17.5            Z    1,718              7.8          2.9
M    287                1.3          0.5             AB   21,993             100.0        36.7
N    4,915              22.3         8.2             AC   2,549              11.6         4.2
Avg  3,578              16.3         6.0

Specification Modification. To explore the extent of inconsistencies in our specifications, we analyze the comments that were documented during the inspection of the sampled clones for each specification set. They refer to duplicated specification fragments that are longer than the clones detected by the tool. The full length of the duplication is not found by the tool due to small differences between the clones that often result from inconsistent modification.
An example for such a potential inconsistency can be found in the publicly available MOST specification (M). The function classes “Sequence Property” and “Sequence Method” have the same parameter lists. They are detected as clones. The following description is also copied, but one ends with the sentence “Please note that in case of elements, parameter Flags is not available”. In the other case, this sentence is missing. Whether these differences are defects in the requirements or not could only be determined by consulting the requirements engineers of the system. This further step remains for future work.
Specification Implementation. With respect to the entirety of the software development process, it is important to understand which impact SRS cloning has on development activities that use SRS as an input, e.g., system implementation and test. For the inspected 20 specification clone groups and their corresponding source code, we found 3 different effects:
1. The redundancy in the requirements is not reflected in the code. It contains shared abstractions that avoid duplication.
2. The code that implements a cloned piece of an SRS is cloned, too. In this case, future changes to the cloned code cause additional efforts as modifications must be reflected in all clones. Furthermore, changes to cloned code are error-prone as inconsistencies may be introduced accidentally (cf., Chapter 4).
3. Code of the same functionality has been implemented multiple times. The redundancy of the requirements thus does exist in the code as well but has not been created by copy & paste. This case exhibits similar problems as case 2 but creates additional efforts for the repeated implementation. Moreover, this type of redundancy is harder to detect as existing clone detection approaches cannot reliably find code that is functionally similar but not the result of copy & paste, as shown in Chapter 9.

³ Additional reading effort in clock minutes.
⁴ Additional inspection effort in clock hours.

Table 5.5: Number of files/modelling units the clone groups were affecting
Number of models   Number of clone groups
1                  43
2                  81
3                  12
4                  3

Table 5.6: Number of clone groups for clone group cardinality
Cardinality of clone group   Number of clone groups
2                            108
3                            20
4                            10
5                            1

5.5.5 RQ 9: Amount of Model Cloning

We found 166 clone pairs in the models which resulted in 139 clone groups after clustering and resolving inclusion structures. Of the 4762 blocks used for the clone detection, 1780 were included in at least one clone (coverage of about 37%). We consider this a substantial amount of cloning that indicates the necessity to control cloning during maintenance of Matlab/Simulink models.
As shown in Table 5.5, only about 25% of the clones were within one modeling unit (i.e., a single Simulink file), which was to be expected as such clones are more likely to be found in a manual review process as opposed to clones between modeling units, which would require both units to be reviewed by the same person within a small time frame. Tables 5.7 and 5.5 give an overview of the clone groups found.


Table 5.7: Number of clone groups for clone size
Clone size   Number of clones
5–10         76
11–15        35
16–20        17
>20          11

Table 5.7 shows how many clones have been found for some size ranges. The largest clone had a size of 101 and a weight of 70. Smaller clones are more frequent than larger clones, as can also be observed for clones in source code or requirements specifications.

5.6 Discussion

RQs 5 to 8: Cloning in Requirements Specifications. The results from the case study show that cloning in the sense of copy & paste is common in real-world requirements specifications. Here we interpret these results and discuss their implications.
According to the results of RQ 6, the amount of cloning encountered is significant, although it differs between specifications. The large amount of detected cloning is further emphasized, since our approach only locates identical parts of the text. Other forms of redundancy, such as specification fragments that have been copied but slightly reworded in later editing steps, or that are entirely reworded yet convey the same meaning, are not included in these numbers.
The results for RQ 7 illustrate that cloning is not confined to a specific kind of information. On the contrary, we found that duplication can, amongst others, be found in the description of use cases, the application domain and the user interface but also in parts of documents that merely reference other documents. Our case study only yields the absolute number of clones assigned to a category. As we did not investigate which amount of an SRS can be assigned to the category, we cannot deduce if cloning is more likely to occur in one category than another. Hence, we currently assume that clones are likely to occur in all parts of SRS.
The relatively broad spectrum of findings illustrates that cloning in SRS can be successfully avoided. SRS E, for example, is large and yet exhibits almost no cloning.
The most obvious effect of duplication is the increased size (cf., RQ 8), which could often be avoided by cross-references or different organization of the specifications. Size increase affects all (manual) processing steps performed on the specifications, such as restructuring or translating them to other languages, and especially reading. Reading is emphasized here, as the ratio of persons reading to those writing a specification is usually large, even larger than in source code. The activities that involve reading include specification reviews, system implementation, system testing and contract negotiations. They are typically performed by different persons that are all affected by the overhead. While the additional effort for reading has been assumed to be linear in the presentation of the results, one could even argue that the effort is larger, as human readers are not efficient with word-wise comparisons, which are required to check presumably duplicated parts to find potential subtle differences between them that could otherwise lead to errors in the final system.


Furthermore,inconsistentchangesoftherequirementsclonescanintroduceerrorsinthespeci®ca-
tionandthusofteninthe®nalsystem.Basedontheinconsistenciesweencountered,westrongly
suspectthatthereisarealthreatthatinconsistentmaintenanceofduplicatedSRSintroduceser-
rorsinpractice.However,sincewedidnotvalidatethattheinconsistenciesareinfacterrors,our
resultsarenotconclusive—futureresearchonthistopicisrequired.Nevertheless,theinconsisten-
ciesprobablycauseoverheadduringfurthersystemdevelopmentduetoclari®cationrequestsfrom
them.spottingelopersvde

Ourreimplementedobservationspartsshoofw,code.moreovOftener,thatthesespeci®cationduplicationscloningcannotcanevenleadbetospottedclonedbyor,theevdeenvwelopers,orse,
astheyonlyworkonapartofthesystem,whosesub-speci®cationmightnotevencontainclones
isolation.inwedviewhen

oftenRedundancanalyzeyistheharddiftoferentidentifypartsinofaSRSasspeci®cationcommonindiqualityviduallyandassuranceare,hence,techniquesproneliketomissinspectionsdu-
identifyplication.clonedTheresultsinformationforRQin5SRSshowinthatpractice.existingHoweclonever,itdetectionalsoshowsapproachesthatacancertainbeamountappliedofto
clonedetectiontailoringisrequiredtoincreasedetectionprecision.Astheeffortrequiredforthe
considertailoringthisstepstoisbebeloanwoneobstaclepersonforthehourforapplicationeachofspeci®cationclonedetectiondocumentduringintheSRScasequalitystudy,weassessmentdonot
practice.in

RQ 9: Cloning in Models. Manual inspection of the detected clones showed that many of them are relevant for practical purposes. Besides the "normal" clones, which at least should be documented to make sure that bugs are always fixed in both places, we also found two models which were nearly entirely identical. Additionally, some of the clones are candidates for the project's library, as they included functionality that is likely to be useful elsewhere. Another source of clones is the limitation of TargetLink that scaling (i.e., the mapping to concrete data types) cannot be parameterized, which leaves duplication as the only way for obtaining different scalings.

The main problem we encountered is the large number of false positives, as more than half of the clones found are obviously clones according to our definition but would not be considered relevant by a developer (e.g., large Mux/Demux constructs). While weighting the clones was a major step in improving this ratio (without weighting there were about five times as many clones, but mostly consisting of irrelevant constructs), this still is a major area of potential improvement for the usability of our approach.

5.7 Threats to Validity

In this section, we discuss threats to the validity of the study results and how we mitigated them.


5.7.1 Internal Validity

RQs 5 & 6. The results can be influenced by individual preferences or mistakes of the researchers that performed clone detection tailoring. We mitigated this risk by performing clone tailoring in pairs to reduce the probability of errors and improve objectivity.

Precision was determined on random samples instead of on all detected clone groups. While this can potentially introduce inaccuracy, sampling is commonly used to determine precision and it has been demonstrated that even small samples can yield precise estimates [19,116].

While a lot of effort was invested into understanding detection precision, we know less about detection recall. First, if regular expressions used during tailoring are too aggressive, detection recall can be reduced. We used pair-tailoring and comparison of results before and after tailoring to reduce this risk. Furthermore, we have not investigated false negatives, i.e., the amount of duplication contained in a specification and not identified by the automated detector. The reasons for this are the difficulty of clearly defining the characteristics of such clones (having a semantic relation but little syntactic commonality) and the effort required to find them manually. The reported extent of cloning is thus only a lower bound for redundancy. While the investigation of detection recall remains important future work, our limited knowledge about it does not affect the validity of the detected clones and the conclusions drawn from them.

RQ 7. The categorization of the cloned information is subjective to some degree. We again mitigated this risk by pairing the researchers as well as by analyzing the inter-rater agreement. All researchers were in the same room during categorization. This way, newly added categories were immediately available to all researchers.

RQ 8. The calculation of additional effort due to overhead can be inaccurate if the used data from the literature does not fit to the efforts needed at a specific company. As the used values have been confirmed in many studies, however, the results should be trustworthy.

We know little about how reading speeds differ for cloned versus non-cloned text. On the one hand, one could expect that cloned text can be read more swiftly, since similar text has been read before. On the other hand, we often noticed that reading cloned text can be a lot more time consuming than reading non-cloned text, since the discovery and comprehension of subtle differences is tedious. Lacking precise data, we treated cloned and non-cloned text uniformly with respect to reading efforts. Further research could help to better quantify reading efforts for cloned specification fragments.

RQ 9. The detection results contain false positives. Both reported clone counts and coverage are thus not perfectly accurate. However, manual inspections revealed a substantial amount of clones relevant for maintenance. While the clone counts and coverage metrics might be inaccurate, the conclusion that clone management is relevant for maintenance of the models holds and is shared by the developers.

5.7.2 External Validity

RQs 5 to 8. The practice of requirements engineering differs strongly between different domains, companies, and even projects. Hence, it is unclear whether the results of this study can be generalized to all existing instances of requirements specifications. However, we investigated 28 sets of requirements specifications from 11 organizations with over 1.2 million words and almost 9,000 pages. The specifications come from several different companies, from different domains (ranging from embedded systems to business information systems) and with various age and depth. Therefore, we are confident that the results are applicable to a wide variety of systems and domains.

RQ 9. While the analyzed model is large, it is from a single company only. The generalizability of the results is thus unclear; future work is required to develop a better understanding of cloning across models of different size, age and developing organization. However, we are optimistic that the results are at least transferable to other models in the automotive domain, since they are consistent with cloning we saw in models at other companies in the automotive domain. Unfortunately, due to non-disclosure reasons, we are not able to publish them here.

5.8 Summary

This chapter presented a case study on the extent and impact of cloning in requirements specifications and Matlab/Simulink models.

We have analyzed cloning in 28 industrial requirements specifications from 11 different companies. The extent of cloning varies substantially; while some specifications contain none or very few clones, others contain very many. We have seen indication for negative impact of requirements cloning on engineering efforts. Due to size increase, cloning significantly raises the effort for activities that involve reading of SRS, e.g., inspections. In the worst encountered case, the effort for an inspection involving three persons increases by over 13 person days. In addition, just as for source code, modification of duplicated information is costly and error prone; we saw indication that unintentionally inconsistent modifications can also happen to specification clones.

Besides requirements specifications, we have analyzed cloning in a large industrial Matlab/Simulink model. Again, substantial amounts of cloning were discovered. While the results contained false positives, developers agreed that many of the detected clones are relevant for maintenance. As for code, awareness of cloning is thus required to avoid unintentionally inconsistent modifications.

Furthermore, the studies indicate that cloning in requirements specifications can cause redundancy in source code, both in terms of code clones and independent implementation of behaviorally similar functionality. Since models are often used as specifications, we assume that this effect can also occur for cloning in them.

We conclude that the results from the studies support our conjecture: cloning does occur in non-code artifacts as well. Since it can also negatively impact software engineering activities, we conclude that clone control needs to reach beyond code to requirements specifications and models.

6 Clone Cost Model

A thorough understanding of the costs caused by cloning is a necessary foundation to evaluate alternative clone management strategies. Do expected maintenance cost reductions justify the effort required for clone removal? How large are the potential savings that clone management tools can provide? We need a clone cost model to answer these questions.

This chapter presents an analytical cost model that quantifies the impact of cloning in source code on maintenance efforts and field faults. Furthermore, it presents the results from a case study that instantiates the cost model for 11 industrial software systems and estimates maintenance effort increase and potential benefits achievable through clone management tool support. Parts of the content of this chapter have been published in [110].

6.1 Maintenance Process

This section introduces the software maintenance process on which the cost model is based. It qualitatively describes the impact of cloning for each process activity and discusses potential benefits of clone management tools. The process is loosely based on the IEEE 1219 standard [99] that describes the activities carried out on single change requests (CRs) in a waterfall fashion. The successive execution of activities that, in practice, are typically carried out in an interleaved and iterated manner, serves the clarity of the model but does not limit its application to waterfall-style processes.

Analysis (A) studies the feasibility and scope of the change request to devise a preliminary plan for design, implementation and quality assurance. Most of it takes place on the problem domain. Analysis is not impacted by code cloning, since code does not play a central part in it. Possible effects of cloning in requirements specifications, which could in principle affect analysis, are beyond the scope of this model.

Location (L) determines a set of change start points. It thus performs a mapping from problem domain concepts affected by the CR to the solution domain. Location does not contain impact analysis, that is, consequences of modifications of the change start points are not analyzed. Location involves inspection of source code to determine change start points. We assume that the location effort is proportional to the amount of code that gets inspected.

Cloning increases the size of the code that needs to be inspected during location and thus affects location effort. We are not aware of tool support to alleviate the impact of code cloning on location.


Design (D) uses the results of analysis and location as well as the software system and its documentation to design the modification of the system. We assume that design is not impacted by cloning. This is a conservative assumption, since for a heavily cloned system, design could attempt to avoid modifications of heavily cloned areas.

Impact Analysis (IA) uses the change start points from location to determine where changes in the code need to be made to implement the design. The change start points are typically not the only places where modifications need to be performed; changes to them often require adaptations in use sites. We assume that the effort required for impact analysis is proportional to the number of source locations that need to be determined.

If the concept that needs to be changed is implemented redundantly in multiple locations, all of them need to be changed. Cloning thus affects impact analysis, since the number of change points is increased by cloned code. Tool support (clone indication) simplifies impact analysis of changes to cloned code. Ideal tool support could reduce the effect of cloning on impact analysis to zero.

Implementation (Impl) realizes the designed change in the source code. We differentiate between two classes of changes to source code. Additions add new source code to the system without changing existing code. Modifications alter existing source code and are performed to the source locations determined by impact analysis. We assume that effort required for implementation is proportional to the amount of code that gets added or modified.

We assume that adding new code is unaffected by cloning in existing code. Implementation is still affected by cloning, since modifications to cloned code need to be performed multiple times. Linked editing tools could, ideally, reduce effects of cloning on implementation to zero.

Quality Assurance (QA) comprises all testing and inspection activities carried out to validate that the modification satisfies the change request. We assume a smart quality assurance strategy: only code affected by the change is processed. We do not limit the maintenance process to a specific quality assurance technique. However, we assume that quality assurance steps are systematically applied, e.g., all changes are inspected or testing is performed until a certain test coverage is achieved on the affected system parts. Consequently, we assume that quality assurance effort is proportional to the amount of code on which quality assurance is performed.

We differentiate two effects of cloning on quality assurance: cloning increases the change size and thus the amount of modified code that needs to be quality assured. Second, just as modified code, added code can contain cloning. This also increases the amount of code that needs to be quality assured and hence the required effort. We are not aware of tool support that can substantially alleviate the impact of cloning on quality assurance.

Other (O) comprises further activities, such as, e.g., delivery and deployment, user support or change control board meetings. Since code does not play a central part in these activities, they are not affected by cloning.

6.2 Approach

This section outlines the underlying cost modeling approach.

Relative Cost Model. Many factors influence maintenance productivity [22,23,211]: the type of system and domain, development process, available tools and experience of developers, to name just a few. Since these factors vary substantially between projects, they need to be reflected by cost estimation approaches to achieve accurate absolute results. The more factors a cost model comprises, the more effort is required for its creation, its factor lookup tables, and for its instantiation in practice. If an absolute value is required, such effort is unavoidable.

The assessment of the impact of cloning differs from the general cost estimation problem in two important aspects. First, we compare efforts for two systems, the actual one and the hypothetical one without cloning, for which most factors are identical, since our maintenance environment does not change. Second, relative effort increase w.r.t. the cloning-free system is sufficient to evaluate the impact of cloning. Since we do not need an absolute result value in terms of costs, and since most factors influencing maintenance productivity remain constant in both settings, they do not need to be contained in our cost model. In a nutshell, we deliberately chose a relative cost model to keep its number of parameters and involved instantiation effort at bay.

Clone Removability. The cost model is not limited to clones that can be removed by the means of the available abstraction mechanisms, since negative impact of clones is independent of their removability. In addition, even if no clone can be removed, the model can be used to assess possible improvements achievable through application of clone management tools.

Cost Model Structure. The model assumes each activity of the maintenance process to be completed. It is thus not suitable to model partial change request implementations that are aborted at some point.

The total maintenance effort $E$ is the sum of the efforts of individual change requests:

\[ E = \sum_{cr \in CR} e(cr) \]

The scope of the cost model is determined by the population of the set $CR$: to compute the maintenance effort for a time span $t$, it is populated with all change requests that are realized in that period. Alternatively, if the total lifetime maintenance costs are to be computed, $CR$ is populated with all change requests ever performed on the system. The model can thus scale to different project scopes.

The effort of a single change request $cr \in CR$ is expressed by $e(cr)$. It is the sum of the efforts of the individual activities performed during the realization of the $cr$. The activity efforts are denoted as $e_X$, where $X$ identifies the activity. Each activity from Section 6.1 contributes to the effort of a change request. For brevity, we omit $(cr)$ in the following:


\[ e = e_A + e_L + e_D + e_{IA} + e_{Impl} + e_{QA} + e_O \]

To model the impact of cloning on maintenance efforts, we split $e$ into two components: inherent effort $e_i$ and cloning induced effort overhead $e_c$. Inherent effort $e_i$ is independent of cloning. It captures the effort required to perform an activity on a hypothetical version of the software that does not contain cloning. Cloning induced effort overhead $e_c$, in contrast, captures the effort penalty caused by cloning. Total effort is expressed as the sum of the two:

\[ e = e_i + e_c \]

The increase in efforts due to cloning, $\Delta e$, is captured by $\frac{e_i + e_c}{e_i} - 1$, or simply $\frac{e_c}{e_i}$. The cost model thus expresses cloning induced overhead relative to the inherent effort required to realize a change request. The increase in total maintenance efforts due to cloning, $\Delta E$, is proportional to the average effort increase per change request and thus captured by the same expression.
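
The relation between inherent effort, cloning induced overhead and relative increase can be illustrated with a few lines of Python. This is only a sketch: the function name and the person-hour figures are hypothetical and not taken from the case study.

```python
def relative_increase(inherent_effort: float, cloning_overhead: float) -> float:
    """Relative effort increase (e_i + e_c) / e_i - 1, which equals e_c / e_i."""
    if inherent_effort <= 0:
        raise ValueError("inherent effort must be positive")
    total_effort = inherent_effort + cloning_overhead  # e = e_i + e_c
    return total_effort / inherent_effort - 1.0


# Hypothetical change request: 40 person hours of inherent effort,
# 6 additional person hours caused by cloning.
delta_e = relative_increase(40.0, 6.0)
print(f"relative increase: {delta_e:.2%}")
```

Because only the ratio matters, the unit of effort (hours, days, cost) cancels out, which is what makes the relative model cheap to instantiate.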

6.3 Detailed Cost Model

This section introduces a detailed version of the clone cost model. Its first section introduces cost models for the individual process activities. The following sections employ them to construct models for maintenance effort and remaining fault count increase and the possible benefits of clone management tool support. We initially assume that no clone management tools are employed.

6.3.1 Activity Costs

The activities Analysis, Design, and Other are not impacted by cloning. Their cloning induced effort overhead, $e_c$, is thus zero. Their total efforts hence equal their inherent efforts.

Location effort depends on code size. Cloning increases code size. We assume that, on average, increase of the amount of code that needs to be inspected during location is proportional to the cloning induced size increase of the entire code base. Size increase is captured by overhead:

\[ e_{c_{L}} = e_{i_{L}} \cdot \mathit{overhead} \]

Impact analysis effort depends on the number of change points that need to be determined. Cloning increases the number of change points. We assume that $e_{c_{IA}}$ is proportional to the cloning-induced increase in the number of source locations. This increase is captured by overhead:

\[ e_{c_{IA}} = e_{i_{IA}} \cdot \mathit{overhead} \]

Implementation effort comprises both addition and modification effort: $e_{Impl} = e_{Impl_{Add}} + e_{Impl_{Mod}}$. We assume that effort required for additions is unaffected by cloning in existing source code. We assume that the effort required for modification is proportional to the amount of code that gets modified, i.e., the number of source locations determined by impact analysis. Its cloning induced overhead is, consequently, affected by the same increase as impact analysis: $e_{c_{Impl}} = e_{i_{Impl_{Mod}}} \cdot \mathit{overhead}$. The modification ratio $\mathit{mod}$ captures the modification-related part of the inherent implementation effort: $e_{i_{Impl_{Mod}}} = e_{i_{Impl}} \cdot \mathit{mod}$. Consequently, $e_{c_{Impl}}$ is:

\[ e_{c_{Impl}} = e_{i_{Impl}} \cdot \mathit{mod} \cdot \mathit{overhead} \]

Quality Assurance effort depends on the amount of code on which quality assurance gets performed. Both modifications and additions need to be quality assured. Since the measure overhead captures size increase of both additions and modifications, we do not need to differentiate between them, if we assume that cloning is, on average, similar in modified and added code. The increase in quality assurance effort is hence captured by the overhead measure:

\[ e_{c_{QA}} = e_{i_{QA}} \cdot \mathit{overhead} \]

6.3.2 Maintenance Effort Increase

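Based on the models for the individual activities, we model cloning induced maintenance effort $e_c$ for a single change request like this:

\[ e_c = \mathit{overhead} \cdot (e_{i_{L}} + e_{i_{IA}} + e_{i_{Impl}} \cdot \mathit{mod} + e_{i_{QA}}) \]

The relative cloning induced overhead is computed as follows:

\[ \Delta e = \frac{\mathit{overhead} \cdot (e_{i_{L}} + e_{i_{IA}} + e_{i_{Impl}} \cdot \mathit{mod} + e_{i_{QA}})}{e_{i_{A}} + e_{i_{L}} + e_{i_{D}} + e_{i_{IA}} + e_{i_{Impl}} + e_{i_{QA}} + e_{i_{O}}} \]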

This model computes the relative effort increase in maintenance costs caused by cloning. It does not take impact of cloning on program correctness into account. This is done in the next section.
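
The detailed model can be written down as a small Python sketch. The class and function names are our own choice for illustration, and the effort values, modification ratio and overhead in the example are hypothetical, not measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InherentEfforts:
    """Inherent (cloning-free) effort per maintenance activity, in any common unit."""
    analysis: float
    location: float
    design: float
    impact_analysis: float
    implementation: float
    quality_assurance: float
    other: float

    def total(self) -> float:
        return (self.analysis + self.location + self.design + self.impact_analysis
                + self.implementation + self.quality_assurance + self.other)


def cloning_induced_effort(e: InherentEfforts, mod: float, overhead: float) -> float:
    """e_c: overhead applied to location, impact analysis, modification and QA effort."""
    affected = (e.location + e.impact_analysis
                + e.implementation * mod + e.quality_assurance)
    return overhead * affected


def relative_effort_increase(e: InherentEfforts, mod: float, overhead: float) -> float:
    """Cloning induced effort relative to the total inherent effort."""
    return cloning_induced_effort(e, mod, overhead) / e.total()


# Hypothetical distribution of 100 person hours, half of implementation being
# modification, and a cloning induced size overhead of 20%.
efforts = InherentEfforts(analysis=10, location=10, design=10, impact_analysis=10,
                          implementation=40, quality_assurance=15, other=5)
delta_e = relative_effort_increase(efforts, mod=0.5, overhead=0.2)
print(f"relative effort increase: {delta_e:.2%}")
```

Analysis, design and other activities appear only in the denominator, which mirrors the assumption that they are not impacted by cloning.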


6.3.3 Fault Increase

Quality assurance is not perfect. Even if performed thoroughly, faults may remain unnoticed and cause failures in production. Some of these faults can, in principle, be introduced by inconsistent updates to cloned code. Cloning can thus affect the number of faults in released software. This can have economic consequences that are not captured by the above model. This section introduces a model that does.

Quality assurance can be decomposed into two sub-activities: fault detection and fault removal. We assume that, independent of the quality assurance technique, the effort required to detect a single fault in a system depends primarily on its fault density. We furthermore assume that average fault removal effort for a system is independent of the system's size and fault density. These assumptions allow us to reason about the number of remaining faults in similar systems of different size but equal fault densities. If a QA procedure is applied with the same amount of available effort per unit of size, we expect a similar reduction in defect density, since the similar defect densities imply equal costs for fault location per unit. For these systems, the same number of faults can thus be detected and fixed per unit. For two systems A and B, with B having twice the size and available QA effort, we expect a similar reduction of fault density. However, since B is twice as big, the same fault density means twice the absolute number of remaining faults.
A system that contains cloning and its hypothetical version without cloning are such a pair of similar systems. We assume that fault density is similar between cloned code and non-cloned code; cloning duplicates both correct and faulty statements. Besides system size, cloning thus also increases the absolute number of faults contained in a system. If the amount of effort available for quality assurance is increased by overhead w.r.t. the system without cloning, the same reduction in fault density can be achieved. However, the absolute number of faults is still larger by overhead.
This reasoning assumes that developers are entirely ignorant of cloning. That is, if a fault is fixed in one clone, it is not immediately fixed in any of its siblings. Instead, faults in siblings are expected to be detected independently. Empirical data confirms that inconsistent bug fixes do frequently occur in practice [115]. However, it also confirms that clones are often maintained consistently. Both assuming entirely consistent or entirely inconsistent evolution is thus not realistic.
In practice, a certain amount of the defects that are detected in cloned code are hence fixed in some of the sibling clones. This reduces the cloning induced overhead in remaining fault counts. However, unless all faults in clones are fixed in all siblings, resulting fault counts remain higher than in systems without cloning.

The miss ratio captures the amount of clones that are unintentionally modified inconsistently. It hence captures the share of cloned faults that are not removed once a fault is detected in their sibling. The increase in fault counts due to cloning can hence be quantified as follows:

\[ \Delta F = \mathit{overhead} \cdot \mathit{missratio} \]
To compute missratio, a time window for which changes to clones are investigated is required. To quantify increase in remaining faults, we choose a time window that starts with the initiation of the first change request, and ends with the realization of the last change request in CR. This way, missratio reflects that increased effort available for quality assurance allows for individual detection of faults contained in sibling clones, if their fix was missed in previous detections. The sequence of inconsistent modification and late propagation that occurs in such a case is, since all of them occurred inside the time window, observed as a single consistent modification. Hence, missratio only captures those faults that slip through quality assurance. It is thus different from UICR and FUICR.
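
The fault increase model can be sketched in Python as follows. The function name and the example values for overhead and miss ratio are hypothetical; the two boundary cases correspond to entirely consistent and entirely inconsistent clone maintenance.

```python
def fault_increase(overhead: float, miss_ratio: float) -> float:
    """Relative increase in remaining faults due to cloning: overhead * missratio."""
    if overhead < 0:
        raise ValueError("overhead must not be negative")
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss ratio must lie between 0 and 1")
    return overhead * miss_ratio


# Hypothetical system: 20% size overhead, a quarter of the cloned faults
# are missed when their sibling is fixed.
delta_f = fault_increase(overhead=0.2, miss_ratio=0.25)

# Boundary cases: fully consistent vs. fully inconsistent maintenance of clones.
fully_consistent = fault_increase(overhead=0.2, miss_ratio=0.0)
fully_inconsistent = fault_increase(overhead=0.2, miss_ratio=1.0)
print(delta_f, fully_consistent, fully_inconsistent)
```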

6.3.4 Tool Support

Clone management tools can alleviate the impact of cloning on maintenance efforts. We adapt the detailed model to quantify the impact of clone management tools. We evaluate the upper bound of what two different types of clone management tools can achieve.

Clone Indication makes cloning relationships in source code available to developers, for example through clone bars in the IDE that mark cloned code regions. Examples for clone indication tools include ConQAT and CloneTracker [60]. Optimal clone indication thus lowers the effort required for clone discovery to zero. It thus simplifies impact analysis, since no additional effort is required to locate affected clones. Assuming perfect clone indicators, $e_{c_{IA}}$ is reduced to zero, yielding this cost model:

\[ \Delta e = \frac{\mathit{overhead} \cdot (e_{i_{L}} + e_{i_{Impl}} \cdot \mathit{mod} + e_{i_{QA}})}{e_{i_{A}} + e_{i_{L}} + e_{i_{D}} + e_{i_{IA}} + e_{i_{Impl}} + e_{i_{QA}} + e_{i_{O}}} \]

Linked Editing replicates edit operations performed on one clone to its siblings. Prototype linked editing tools include Codelink [218] and CReN [102]. Optimal linked editing tools thus lower the overhead required for consistent modifications of cloned code to zero. Since linked editors typically also provide clone indication, they also simplify impact analysis. Their application yields the following model:

\[ \Delta e = \frac{\mathit{overhead} \cdot (e_{i_{L}} + e_{i_{QA}})}{e_{i_{A}} + e_{i_{L}} + e_{i_{D}} + e_{i_{IA}} + e_{i_{Impl}} + e_{i_{QA}} + e_{i_{O}}} \]
We do not think that clone management tools can substantially reduce the overhead cloning causes for quality assurance. If the amount of changed code is larger due to cloning, more code needs to be processed by quality assurance activities. We do not assume that inspections or test executions can be simplified substantially by the knowledge that some similarities reside in the code; faults might still lurk in the differences.

However, we are convinced that clone indication tools can substantially reduce the impact that cloning imposes on the number of faults that slip through quality assurance. If a single fault is found in cloned code, clone indicators can point to all the faults in the sibling clones, assisting in their prompt removal. We assume that perfect clone indication tools reduce the cloning induced overhead in faults after quality assurance to zero.
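
The three variants of the detailed model (no tool support, optimal clone indication, optimal linked editing) differ only in which activity efforts remain affected by cloning. The following Python sketch makes this explicit; the dictionary keys, the `tool` parameter and all numbers are hypothetical choices for illustration.

```python
def relative_effort_increase(efforts: dict, mod: float, overhead: float,
                             tool: str = "none") -> float:
    """Upper-bound model of relative effort increase for a given tool scenario.

    `efforts` maps the activity abbreviations A, L, D, IA, Impl, QA, O
    to inherent efforts in a common unit.
    """
    # Location and quality assurance are affected in every scenario.
    affected = efforts["L"] + efforts["QA"]
    if tool == "none":
        affected += efforts["IA"] + efforts["Impl"] * mod
    elif tool == "clone_indication":
        # Perfect indication removes the impact analysis overhead only.
        affected += efforts["Impl"] * mod
    elif tool == "linked_editing":
        # Perfect linked editing removes impact analysis and modification overhead.
        pass
    else:
        raise ValueError(f"unknown tool scenario: {tool}")
    return overhead * affected / sum(efforts.values())


# Hypothetical effort distribution (100 person hours), mod = 0.5, overhead = 20%.
efforts = {"A": 10, "L": 10, "D": 10, "IA": 10, "Impl": 40, "QA": 15, "O": 5}
for scenario in ("none", "clone_indication", "linked_editing"):
    print(scenario, relative_effort_increase(efforts, 0.5, 0.2, scenario))
```

Since the sketch assumes perfect tools, the values for the two tool scenarios are lower bounds on the remaining effort increase, not predictions for a concrete tool.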

6.4 Simplified Cost Model

This section introduces a simplified cost model. While less generally applicable than the detailed model, it is easier to apply.

Due to its number of factors, the detailed model requires substantial effort to instantiate in practice; each of its nine factors needs to be determined. Except for overhead, all of them quantify maintenance effort distribution across individual activities. Since in practice the activities are typically interleaved, without clear transitions between them, it is difficult to get exact estimates on, e.g., how much effort is spent on location and how much on impact analysis.

The individual factors of the detailed model are required to make trade-off decisions. We need to distinguish between, e.g., impact analysis and location to evaluate the impact that clone indication tool support can provide, since impact analysis benefits from clone indication, whereas location does not. Before evaluating trade-offs between clone management alternatives, however, a simpler decision needs to be taken: whether to do anything about cloning at all. Only then is it reasonable to invest the effort to determine accurate parameter values. If the cost model is not employed to assess clone management tool support, many of the distinctions between different factors are obsolete. We can thus aggregate them to reduce the number of factors and hence the effort involved in model instantiation. Written slightly differently, the detailed model is:

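\[ \Delta e = \mathit{overhead} \cdot \frac{e_{i_{L}} + e_{i_{IA}} + e_{i_{Impl}} \cdot \mathit{mod} + e_{i_{QA}}}{e} \]

The fraction is the ratio of effort required for code comprehension ($e_{i_{L}} + e_{i_{IA}}$), modification of existing code ($e_{i_{Impl}} \cdot \mathit{mod}$) and quality assurance ($e_{i_{QA}}$) w.r.t. the entire effort required for a change request. We introduce the new parameter cloning-affected effort (CAE) for it:

\[ \mathit{CAE} = \frac{e_{i_{L}} + e_{i_{IA}} + e_{i_{Impl}} \cdot \mathit{mod} + e_{i_{QA}}}{e} \]

If CAE is determined as a whole (without its constituent parameters), this simplified model provides a simple way to evaluate the impact of cloning on maintenance efforts:

\[ \Delta e = \mathit{overhead} \cdot \mathit{CAE} \]

6.5 Discussion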

The cost model is based on a series of assumptions. It can sensibly be applied only for projects that satisfy them. We list and discuss them here to simplify their evaluation.

We assume that the significant part of the cost models for the maintenance process activities are linear functions on the size of the code that gets processed. For example, we assume that location effort is primarily determined by and proportional to the amount of code that gets inspected during location. In some situations, activity cost models might be more complicated. For example, if an activity has a high fixed setup cost, the cost model should include a fixed factor; diseconomy of scale could increase effort w.r.t. size in a superlinear fashion. In such cases, the respective part of the cost model needs to be adapted appropriately. COCOMO II [23], e.g., uses a polynomial function to adapt size to diseconomy of scale.
We assume that changes to clones are coupled to a substantial degree. The cost model thus needs to be instantiated on tailored clone detection results. In case clones are uncoupled, e.g., because they are false positives or because parts of the system are no longer maintained, the model is not applicable.

We assume that each modification to a clone in one clone group requires the same amount of effort. We ignore that subsequent implementations of a single change to multiple clone instances could get cheaper, since the developer gets used to that clone group. We are not aware of empirical data for these costs. Future work is, thus, required to better understand changes in modification effort across sibling clones. Since in practice, however, most clone groups have size 2, the inaccuracy introduced by this simplification should be moderate.

6.6 Instantiation

This section describes how to instantiate the cost model and presents a large industrial case study.

6.6.1 Parameter Determination

This section describes how the parameter values can be determined to instantiate the cost model.

Overhead Computation. Overhead is computed on the clones detected for a system. It captures cloning induced size increase independent of whether the clones can be removed with means of the programming language (cf. Section 2.5.4). This is intended: the negative impact of cloning on maintenance activities is independent of whether the clones can be removed.

The accuracy of the overhead value is determined by the accuracy of the clones on which it is computed. Unfortunately, many existing clone detection tools produce high false positive rates; Kapser and Godfrey [122] report between 27% and 65%, Tiarks et al. [217] up to 75% of false positives detected by state-of-the-art tools. False positives exhibit some level of syntactic similarity, but no common concept and implementation and hence no coupling of their changes. They thus do not impede software maintenance and must be excluded from overhead computation.

To achieve accurate clone detection results, and thus an accurate overhead value, clone detection needs to be tailored. Tailoring removes code that is not maintained manually, such as generated or unused code, since it does not impede maintenance. Exclusion of generated code is important, since generators typically produce similar-looking files for which large amounts of clones are detected. Furthermore, tailoring adjusts detection so that false positives due to overly aggressive normalization are avoided. This is necessary so that, e.g., regions of Java getters, that differ in their identifiers and have no conceptual relationship, are not erroneously considered as clones by a detector that ignores identifier names. According to our experience [115], after tailoring, clones exhibit change coupling, indicating their semantic relationship through redundant implementation of a common concept. Clone detection tailoring is covered in detail in Section 8.2.
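
As an illustration, overhead can be derived from two size measurements of a system. The sketch below assumes that overhead is the relative size increase of the system w.r.t. its redundancy-free size, measured in source statements; the precise definition is given in Section 2.5.4 and is not reproduced here, so the formula and the function name should be read as an assumption. The statement counts are hypothetical.

```python
def overhead(source_statements: int, redundancy_free_statements: int) -> float:
    """Cloning induced size increase relative to the redundancy-free size.

    Assumed definition: overhead = SS / RFSS - 1, where SS is the number of
    source statements after tailoring and RFSS the number of statements that
    would remain if every clone group were represented only once.
    """
    if redundancy_free_statements <= 0:
        raise ValueError("redundancy-free size must be positive")
    if redundancy_free_statements > source_statements:
        raise ValueError("redundancy-free size cannot exceed system size")
    return source_statements / redundancy_free_statements - 1.0


# Hypothetical system: 120,000 statements, of which 100,000 are redundancy-free.
system_overhead = overhead(120_000, 100_000)
print(f"overhead: {system_overhead:.0%}")
```

Both counts should be taken after tailoring, since false positives and generated code would otherwise inflate the overhead value.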

Determining Activity Efforts. The distribution of the maintenance efforts depends on many factors, including the maintenance process employed, the maintenance environment, the personnel and the tools available [211]. To receive accurate results, the parameters for the relative efforts of the individual activities thus need to be determined for each software system individually.

Coarse effort distributions can be taken from project calculation, by matching engineer wages against maintenance process activities. This way, the relative analysis effort, e.g., is estimated as the share of the wages of the analysts w.r.t. all wages. As we cannot expect engineer roles to match the activities of our maintenance process exactly, we need to refine the distribution. This can be done by observing development efforts for change requests to determine, e.g., how much effort analysts spend on analysis, location and design, respectively. To be feasible, such observations need to be carried out on representative samples of the engineers and of the change requests. Stratified sampling can be employed to improve representativeness of results; sampled CRs can be selected according to the change type distribution, so that representative amounts of perfective and other CRs are analyzed.

The parameter CAE for the simplified model is still simpler to determine. Effort e is the overall person time spent on a set of change requests. It can often be obtained from billing systems. Furthermore, we need to determine person hours spent on quality assurance, working with code and spent exclusively developing new code. This can, again, be done by observing developers working on CRs.

The modification ratio can, in principle, also be determined by observing developers and differentiating between additions and modifications. If available, it can alternatively be estimated from change request type statistics.
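
Stratified sampling of change requests by change type could, for example, be implemented as sketched below. The function, the proportional allocation with largest-remainder rounding, the fixed seed and the example data are our own illustrative assumptions; they do not describe a procedure used in the case study.

```python
import random
from collections import defaultdict


def stratified_sample(change_requests, sample_size, seed=0):
    """Draw CRs per change type, proportionally to the type distribution.

    `change_requests` is a list of (cr_id, change_type) pairs.
    Returns a dict mapping each change type to the sampled CR ids.
    """
    strata = defaultdict(list)
    for cr_id, change_type in change_requests:
        strata[change_type].append(cr_id)
    total = len(change_requests)
    if not 0 < sample_size <= total:
        raise ValueError("sample size must be between 1 and the number of CRs")

    # Proportional allocation; fractional parts are resolved by largest remainder.
    quotas = {t: len(ids) * sample_size / total for t, ids in strata.items()}
    counts = {t: int(q) for t, q in quotas.items()}
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda t: quotas[t] - counts[t], reverse=True)
    for t in by_remainder[:leftover]:
        counts[t] += 1

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return {t: rng.sample(strata[t], counts[t]) for t in strata}


# Hypothetical population of 100 change requests.
change_requests = (
    [(f"CR-{n}", "perfective") for n in range(40)]
    + [(f"CR-{n}", "corrective") for n in range(40, 70)]
    + [(f"CR-{n}", "adaptive") for n in range(70, 80)]
    + [(f"CR-{n}", "other") for n in range(80, 100)]
)
sample = stratified_sample(change_requests, sample_size=10)
sizes = {change_type: len(ids) for change_type, ids in sample.items()}
print(sizes)
```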

Literature Values for Activity Efforts offer a simple way to instantiate the model. Unfortunately, the research community still lacks a thorough understanding of how the activity costs are distributed across maintenance activities [211]. Consequently, results based on literature values are less accurate. They can, however, serve for a coarse approximation based on which a decision can be taken whether effort for more accurate determination of the parameters is justified.

Several researchers have measured effort distribution across maintenance activities. In [194], Rombach et al. report measurement results for three large systems, carried out over the course of three years and covering around 10,000 hours of maintenance effort. Basili et al. [10] analyzed 25 releases each of 10 different projects, covering over 20,000 hours of effort. Both studies work on data that was recorded during maintenance. Yeh and Jeng [236] performed a questionnaire-based survey in Taiwan. Their data is based on 97 valid responses received for 1000 questionnaires distributed across Taiwan's software engineering landscape. The values of the three studies are depicted in Table 6.1.

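Table 6.1: Effort distribution

Activity             [194]   [10]   [236]   Estimate
Analysis               -     13%    26%       5%
Location               -      -      -        8%
Design                30%    16%    19%      16%
Impact Analysis        -      -      -        5%
Implementation        22%    29%    26%      26%
Quality Assurance     22%    24%    17%      22%
Other                 26%    18%    12%      18%

A dash marks an activity that the respective study does not report separately.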

Since each study used a slightly different maintenance process, each being different from the one used in this thesis, we cannot directly determine average values for activity distribution. For example, in [194], design subsumes analysis and location. In [10], analysis subsumes location. The estimated average efforts are depicted in the last column of Table 6.1. Since the definitions of implementation, quality assurance and other are similar between the studies and our process, we used the median as estimated value. For the remaining activities, the effort distributions from the literature are of little help, since the activities do not exist in their processes or are defined differently. We thus distributed the remaining 34% of effort according to our best knowledge, based on our own development experience and that of our industrial partners; the distribution can thus be inaccurate.

To determine the ratio between modification and addition effort during implementation, we inspect the distribution of change request types. We assume that adaptive, corrective and preventive change requests mainly involve modifications, whereas perfective changes mainly involve additions. Consequently, we estimate the ratio between addition and modification by the ratio of the perfective w.r.t. all other change types. Table 6.2 shows effort distribution across change types from the above studies. The last column depicts the median of all three: 37% of maintenance efforts are spent on perfective CRs, the remaining 63% are distributed across the other CR types. Based on these values, we estimate the modification ratio to be 0.63.

Table 6.2: Change type distribution

Effort        [194]   [10]   [236]   Median
Adaptive        7%     5%     8%      7%
Corrective     27%    14%    23%     23%
Other          29%    20%    44%     29%
Perfective     37%    61%    25%     37%
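
The derivation of the modification ratio from Table 6.2 can be retraced with a short Python sketch. The percentages are those read from the table; the variable names are hypothetical.

```python
from statistics import median

# Effort share per change type in percent, in the order [194], [10], [236].
change_type_effort = {
    "adaptive": (7, 5, 8),
    "corrective": (27, 14, 23),
    "other": (29, 20, 44),
    "perfective": (37, 61, 25),
}

# Median across the three studies for each change type.
medians = {change_type: median(values)
           for change_type, values in change_type_effort.items()}

# Perfective changes are assumed to mainly involve additions; all remaining
# effort is attributed to modifications.
mod = (100 - medians["perfective"]) / 100
print(medians, mod)
```

Note that the per-type medians do not sum to 100%, which is why the modification ratio is taken as the complement of the perfective share rather than as the sum of the other medians.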

6.6.2 Case Studies

This section presents the application of the clone cost model to several large industrial software systems to quantify the impact of cloning, and the possible benefit of clone management tool support, in practice.


Goal. The case study has two goals. First, evaluation of the clone cost model. Second, quantification of the impact of cloning on software maintenance costs across different software systems, and the possible benefit of the application of clone management tools.

Study Objects. We chose 11 industrial software systems as study objects. Since we require the willingness of developers to contribute in clone detection tailoring, we had to rely on our contacts with industry. However, we chose systems from different domains (finance, content management, convenience, power supply, insurance) from 7 different companies written in 5 different programming languages to capture a representative set of systems. For non-disclosure reasons, we termed the systems A-K. Table 6.3 gives an overview ordered by system size.

Study Design and Procedure. Clone detection tailoring was performed to achieve accurate results. System developers participated in tailoring to identify false positives. Clone detection and overhead computation were performed using ConQAT for all study objects and limited to type-1 and type-2 clones. Minimal clone length was set to 10 statements for all systems. We consider this a conservative minimal clone length.

Since the effort parameters are not available to us for the analyzed systems, we employed values from the literature. We assume that 50% (8% location, 5% impact analysis, 26% · 0.63 implementation and 22% quality assurance; rounded from 51.38% to 50% since the available data does not contain the implied accuracy) of the overall maintenance effort is affected by cloning. To estimate the impact of clone indication tool support, we assume that 10% of that effort is used for impact analysis (5% out of 50% in total). In case clone indication tools are employed, the impact of cloning on maintenance effort can thus be reduced by 10%.

Results and Discussion. The results are depicted in Table 6.3. The columns show lines of code (kLOC), source statements (kSS), redundancy-free source statements (kRFSS), size overhead and cloning induced increase in maintenance effort without (E) and with clone indication tool support (ETool). Such tool support also reduces the increase in the number of faults due to cloning. As mentioned in Section 6.3.3, this is not reflected in the model.

The effort increase varies substantially between systems. The estimated effort increase ranges from 75% for system A to 5.2% for system F. We could not find a significant correlation between overhead and system size. On average, estimated maintenance effort increase is 20% for the analyzed systems. The median is 15.9%. For a single quality characteristic, we consider this a substantial impact on maintenance effort. For systems A, B, E, G, I, J and K estimated effort increase is above 10%; for these systems, it appears warranted to determine project specific effort parameters to achieve accurate results and perform clone management to reduce effort increase.
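The arithmetic behind these estimates can be retraced with a short sketch. It is an illustration only: the function and parameter names are hypothetical, and the formulas (overhead as kSS / kRFSS - 1, E as overhead times the affected share, ETool as E reduced by 10%) are assumptions inferred from the values reported for the study objects, not a restatement of the full cost model.

```python
def affected_share(location=8.0, impact_analysis=5.0, implementation=26.0,
                   quality_assurance=22.0, modification_ratio=0.63):
    """Share of maintenance effort (in percent) assumed to be affected by cloning."""
    # Only the modification part of implementation effort is counted.
    return location + impact_analysis + implementation * modification_ratio + quality_assurance


def effort_increase(kss, krfss, affected=0.50, tool_reduction=0.10):
    """Return (size overhead, E, ETool) as fractions for one system.

    Assumption: overhead = kSS / kRFSS - 1, E = overhead * affected,
    ETool = E * (1 - tool_reduction).
    """
    overhead = kss / krfss - 1.0
    e = overhead * affected
    e_tool = e * (1.0 - tool_reduction)
    return overhead, e, e_tool


if __name__ == "__main__":
    print(f"affected share before rounding: {affected_share():.2f}%")
    overhead, e, e_tool = effort_increase(kss=21, krfss=15)  # system B
    print(f"overhead {overhead:.1%}, E {e:.1%}, ETool {e_tool:.1%}")
```

Replacing the literature-based defaults with project specific effort parameters only requires passing different arguments, which is the refinement suggested above for systems with an effort increase above 10%.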

Table 6.3: Case study results

System  Language   kLOC   kSS  kRFSS  overhead      E   ETool
A       XSLT         31    15      6    150.0%  75.0%   67.5%
B       ABAP         51    21     15     40.0%  20.0%   18.0%
C       C#          154    41     35     17.1%   8.6%    7.7%
D       C#          326   108     95     13.7%   6.8%    6.2%
E       C#          360    73     59     23.7%  11.9%   10.7%
F       C#          423    96     87     10.3%   5.2%    4.7%
G       ABAP        461   208    155     34.2%  17.1%   15.4%
H       C#          657   242    210     15.2%   7.6%    6.9%
I       COBOL     1,005   400    224     78.6%  39.3%   35.4%
J       Java      1,347   368    265     38.9%  19.4%   17.5%
K       Java      2,179   733    556     31.8%  15.9%   14.3%

6.7 Summary

This chapter presented an analytical cost model to quantify the economic effect of cloning on maintenance efforts. The model computes maintenance effort increase relative to a system without cloning. It can be used as a basis to evaluate clone management alternatives. We have instantiated the cost model on 11 industrial systems. Although result accuracy could be improved by using project specific instead of literature values for effort parameters, the results indicate that cloning induced impact varies significantly between systems and is substantial for some. Based on the results, some projects can achieve considerable savings by performing active clone control.
Both the cost model and the empirical studies in Chapters 4 and 5 further our understanding of the significance of cloning. However, the nature of their contributions is different. The empirical studies observe real-world software engineering. While they yield objective results, their research questions and scope are limited to what we can feasibly study. The cost model is not affected by these limitations and can thus cover the entire maintenance process. On the other hand, the cost model is more speculative than the empirical studies in that it reflects our assumptions on the impact of cloning on engineering activities. The cost model thus serves two purposes. First, it complements the empirical studies and thus completes our understanding of the impact of cloning. Second, it makes our assumptions explicit and thus provides an objective basis for substantiated scientific discourse on the impact of cloning.


7 Algorithms and Tool Support

Both clone detection research and clone assessment and control in practice are infeasible without the appropriate tools; clones are nearly impossible to detect and manage manually in large artifacts. This chapter outlines the algorithms and introduces the tools that have been created during this thesis to support clone assessment and control.

The source code of the clone detection workbench has been published as open source as part of ConQAT. Its clone detection specific parts, which have been developed during this thesis, comprise approximately 67 kLOC.

The clone detection process can be broken down into individual consecutive phases. Each phase operates on the output of its previous phase and produces the input for its successor. The phases can thus be arranged as a pipeline. Figure 7.1 displays a general clone detection pipeline that comprises four phases: preprocessing, detection, postprocessing and result presentation:




Figure 7.1: Clone detection pipeline

Preprocessing reads the source artifacts from disk, removes irrelevant parts and produces an intermediate representation. Detection searches for similar regions in the intermediate representation, the clones, and maps them back to regions in the original artifacts. Postprocessing filters detected clones and computes cloning metrics. Finally, result presentation renders cloning information into a format that fits the task for which clone detection is employed. An example is a trend chart in a quality dashboard used for clone control.
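The four phases can be pictured as functions that are chained so that each consumes the output of its predecessor. The following sketch is a toy illustration under strong assumptions: all function names are hypothetical, the "detector" merely finds identical normalized lines, and none of it mirrors ConQAT's actual processors.

```python
from functools import reduce


def preprocess(files):
    """Drop blank and comment-only lines; collapse whitespace (toy normalization)."""
    return {
        name: [" ".join(line.split()) for line in text.splitlines()
               if line.strip() and not line.strip().startswith("//")]
        for name, text in files.items()
    }


def detect(units):
    """Toy detection: report units (lines) that occur at more than one location."""
    seen = {}
    for name, lines in units.items():
        for number, line in enumerate(lines):
            seen.setdefault(line, []).append((name, number))
    return {line: locations for line, locations in seen.items() if len(locations) > 1}


def postprocess(clones):
    """Keep only clones whose instances are spread over at least two files."""
    return {line: locations for line, locations in clones.items()
            if len({name for name, _ in locations}) > 1}


def present(clones):
    """Render the result as plain text lines."""
    return [f"{line!r}: {locations}" for line, locations in sorted(clones.items())]


def run_pipeline(data, phases):
    """Feed the output of each phase into the next one."""
    return reduce(lambda value, phase: phase(value), phases, data)


if __name__ == "__main__":
    files = {"A.java": "int x = 1;\n// note\nfoo();\nfoo();",
             "B.java": "int  x = 1;\nbar();"}
    for row in run_pipeline(files, [preprocess, detect, postprocess, present]):
        print(row)
```

Because the phases only agree on their input and output, a different filter or presentation step can be swapped in without touching the others.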
This clone detection pipeline, or a similar pipeline model, is frequently used to outline the clone detection process or the architecture of clone detection tools from a high level point of view [57, 111, 113, 115, 200]. It also serves as an outline of this chapter: Section 7.1 introduces the architecture of the clone detection workbench that reflects the pipeline of the clone detection process. The subsequent sections detail preprocessing (7.2), detection (7.3), postprocessing (7.4) and result presentation (7.5). Section 7.6 compares the workbench with existing detectors and Section 7.7 discusses its maturity and adoption. Finally, Section 7.8 summarizes the chapter. Parts of the content of this chapter have been published in [54, 97, 111, 113, 115].

7.1 Architecture

This section introduces the pipes & filters architecture of the clone detection workbench.


7.1.1 Variability

Clone detectors are applied to a large variety of tasks in both research and practice [140, 201], including quality assessment [111, 159, 178], software maintenance and reengineering [32, 54, 102, 126, 149], identification of crosscutting concerns [27], plagiarism detection and analysis of copyright infringement [77, 121].

Each of these tasks imposes different requirements on the clone detection process and its results [229]. For example, the clones relevant for redundancy reduction, i.e., clones that can be removed, differ significantly from the clones relevant for plagiarism detection. Similarly, a clone detection process used at development time, e.g., integrated in an IDE, has different performance requirements than a detection executed during a nightly build. Moreover, even for a specific task, clone detection tools need a fair amount of tailoring to adapt them to the peculiarities of the analyzed projects. Simple examples are the exclusion of generated code or the filtering of detection results to retain only clones that cross project boundaries. As a more sophisticated example, one may want to add a pre-processing phase that sorts methods in source code to eliminate differences caused by method order, or to add a recommender system that analyzes detection results to support developers in removing clones.

While a pipeline is a useful abstraction to convey the general picture, there is no unique clone detection pipeline that fits all purposes. Instead, both in research and practice, a family of related, yet different clone detection pipelines is employed across tools, tasks and domains.

Clone detection tools form a family of products that are related and yet differ in important details. A suitable architecture for a clone detection workbench thus needs to support this product family nature. On the one hand, it needs to provide sufficient flexibility, configurability and extensibility to cater for the multitude of clone detection tasks. On the other hand, it must facilitate reuse and avoid redundancy between individual clone detection tools of the family.

7.1.2 Explicit Pipeline

The clone detection workbench developed during this thesis supports the product family nature of clone detection tools by making the clone detection pipeline explicit. The clone detection phases are lifted to first class entities of a declarative data flow language. This way, a clone detector is composed from a library of units that perform specific detection tasks. Both the individual units and combinations of units can be reused across detectors.

The clone detection workbench is implemented as part of the Continuous Quality Assessment Toolkit (ConQAT) [48, 50, 52, 55, 56, 113]. ConQAT offers a visual data flow language that facilitates the construction of program analyses that can be described using the pipes & filters architectural style [208]. This visual language is used to compose clone detection tools from individual processing steps. Furthermore, ConQAT offers an interactive editor to create, modify, execute, document and debug analysis configurations. Using this analysis infrastructure, ConQAT implements several software quality analyses.¹ The clone detection tool support present in ConQAT has been developed as part of this thesis.

¹ In an earlier version, the clone detection tool support was an independent project called CloneDetective [113] before it became part of ConQAT. For simplicity, we refer to it as »ConQAT« for the remainder of this thesis.


Figure 7.2 shows an exemplary clone detection configuration. It depicts a screenshot from ConQAT, which has been manually edited to indicate correspondence of the individual processing steps to the clone detection pipeline phases. Each blue rectangle with a gear wheel symbol is a processor. It represents an atomic piece of analysis functionality. Each green rectangle with a boxed double gear wheel symbol represents a block. A block is a piece of analysis functionality made up of further processors or blocks. It is the composite piece of functionality that allows reuse of recurring analysis parts.

This clone detection configuration searches for clones in Java source code that span different projects to identify candidates for reuse. In detail, the configuration works as shown in Figure 7.2:

Figure 7.2: Clone detection configuration

During preprocessing, the source-code-scope reads source files from disk into memory. The regex-region-marker marks Java include statements in the files for exclusion, since they are not relevant for this use case. The statement-normalization block creates a normalization strategy.

In the detection phase, the clone-detector processor uses the normalization strategy to transform the input files into a sequence of statement units and performs detection of contiguous clones. The non-overlapping-constraint is evaluated on each detected clone group. Clone groups that contain clones that overlap with each other are excluded.

During postprocessing, the black-list-filter removes all clone groups that have been blacklisted by developers. The rfss-annotator computes the redundancy-free-source-statements measure for each source file. The cross-project-clone-group-filter removes clone groups that do not span at least two projects.


In the output phase, the clone-report-writer-processor writes the detection results into an XML report that can be opened for interactive clone inspection. The coverage-output and html-presentation create a tree map that gives an overview of the distribution of cross-project clones across the analyzed projects.

In this configuration, the statement-normalization and the coverage-output are reused configuration blocks. The remaining units have been individually configured for this analysis.

While the phases of the clone detection pipeline from Figure 7.1 are still recognizable in the ConQAT configuration in Figure 7.2, the configuration contains task-specific units (e.g., the cross-project-clone-groups-filter) that are not required in other contexts. Consequently, for other tasks, specific pipelines can be configured that reuse shared functionality available in the form of processors or blocks.

7.2 Preprocessing

Preprocessing transforms the source artifacts into an intermediate representation on which clone detection is performed. The intermediate representation serves two purposes: first, it abstracts from the language of the artifact that gets analyzed, allowing detection to operate independent of idiosyncrasies of, e.g., C++ or ABAP source code or texts written in English or German; second, different elements in the original artifacts can be normalized to the same intermediate language fragment, thus intentionally masking subtle differences.

This section first introduces artifact-independent preprocessing steps and then outlines artifact-specific strategies for source code, requirements specifications and models.

7.2.1 Steps

ConQAT performs preprocessing in four steps: collection, removal, normalization and unit creation. All of them can be configured to make them suitable for different tasks.

Collection gathers source artifacts from disk and loads them into memory. It can be configured to determine which artifacts are collected and which are ignored. Inclusion and exclusion patterns can be specified on artifact paths and content to exclude, e.g., generated code based on file name patterns, location in the directory structure or typical content.

Removal strips parts from the artifacts that are uninteresting from a clone detection perspective, e.g., comments or generated code.

Normalization splits the (non-ignored parts of the) source artifacts into atomic elements and transforms them into a canonical representation to mask subtle differences that are uninteresting from a clone detection perspective.

Unit creation groups atomic elements created by normalization into units on which clone detection is performed. Depending on the artifact type, it can group several atomic elements into a single unit (e.g., tokens into statements) or produce a unit for each atomic element (e.g., for Matlab/Simulink graphs).


The result of the preprocessing phase is an intermediate representation of the source artifacts. The underlying data structure depends on the artifact type: preprocessing produces a sequence of units for source code and requirements specifications and a graph for models.

7.2.2 Code

Preprocessing for source code operates on the token level. Programming-language specific scanners are employed to split source code into tokens. Both removal and normalization can be configured to specify which token classes to remove and which normalizing transformations to perform. If no scanner for a programming language is available, preprocessing can alternatively work on the word or line level. However, normalization capabilities are then reduced to regular-expression-based replacements.²

Tokens are removed if they are not relevant for the execution semantics (such as, e.g., comments) or optional (e.g., keywords such as this in Java). This way, differences in the source code that are limited to these token types do not prevent clones from being found.

Normalization is performed on identifiers and literals. Literals are simply transformed into a single constant for each literal type (i.e., boolean literals are mapped to another constant than integer literals). For identifier transformation, a heuristic strategy is employed that aims to provide a canonical representation to all statements that can be transformed into each other through consistent renaming of their constituent identifiers. For example, the statement »a = a + b;« gets transformed to »id0 = id0 + id1«. So does »x = x + y«. However, statement »a = b + c« does not get normalized like this, since it cannot be transformed into the previous examples through consistent renaming. (Instead, it gets normalized to »id0 = id1 + id2«.) This normalization is similar to parameterized string matching proposed by Baker [6].
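The consistent renaming can be illustrated with a minimal sketch. It is not ConQAT's normalization: the regular-expression tokenizer, the function name and the literal constant `INT` are assumptions made for this example, and keywords are not distinguished from identifiers.

```python
import re

# Hypothetical toy tokenizer: identifiers, integer literals, any other single character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def normalize_statement(statement):
    """Rename identifiers to id0, id1, ... in order of their first occurrence."""
    names = {}
    normalized = []
    for token in TOKEN.findall(statement):
        if IDENTIFIER.fullmatch(token):
            # The same identifier always maps to the same placeholder.
            names.setdefault(token, f"id{len(names)}")
            normalized.append(names[token])
        elif token.isdigit():
            normalized.append("INT")  # one constant per literal type (assumed name)
        else:
            normalized.append(token)
    return " ".join(normalized)


if __name__ == "__main__":
    for statement in ("a = a + b;", "x = x + y;", "a = b + c;"):
        print(statement, "->", normalize_statement(statement))
```

Statements that differ only by a consistent renaming end up with identical normalized text and therefore compare as equal units during detection.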

ConQAT does not employ the same normalization to all code regions. Instead, different strategies can be applied to different code regions. This allows conservative normalization to be performed to repetitive code (e.g., sequences of Java getters and setters) to avoid false positives; at the same time, non-repetitive code can be normalized aggressively to improve recall. The normalization strategies and their corresponding code regions can be specified by the user; alternatively, ConQAT implements heuristics to provide default behavior suitable to most code bases.

Unit creation forms statements from tokens. This way, clone boundaries coincide with statement boundaries. A clone thus cannot begin or end somewhere in the middle of a statement.

Shapers insert unique units at specified positions. Since unique units are unequal to any other unit, they cannot be contained in any clone. Shapers thus clip clones. ConQAT implements shapers to clip clones to basic blocks, method boundaries or according to user-specified regular expressions.

² For reasons of conciseness, this section is limited to an overview. A detailed documentation of the existing processors and parameters for normalization is contained in ConQATDoc at www.conqat.org and the ConQAT Book [49].


7.2.3 Requirements Specifications

Preprocessing for natural language documents operates on the word level. A scanner is employed to split text into word and punctuation tokens. Whitespace is discarded. Both removal and normalization operate on the token stream.

Punctuation is removed to allow clones to be found that only differ in, e.g., their commas. Furthermore, stop words are removed from the token stream. Stop words are defined in information retrieval as words that are insignificant or too frequent to be useful in search queries. Examples are “a”, “and”, or “how”.

Normalization performs word stemming to the remaining tokens. Stemming heuristically reduces a word to its stem. ConQAT uses the Porter stemmer algorithm [187], which is available for various languages. Both the list of stop words and the stemming depend on the language of the specification.

Unit creation forms sentence units from word tokens. This way, clone boundaries coincide with sentence boundaries. A clone thus cannot begin or end somewhere in the middle of a sentence.
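A rough sketch of these steps for English text follows. It is deliberately simplistic: the stop word set is an illustrative subset, and `crude_stem` is a hypothetical suffix stripper that stands in for the Porter stemmer and does not reproduce its rules.

```python
import re

STOP_WORDS = {"a", "and", "how", "the", "of"}  # illustrative subset, not a real list


def crude_stem(word):
    """Strip a few common English suffixes (stand-in, NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_text(text):
    """Return one normalized unit per sentence."""
    units = []
    for sentence in re.split(r"[.!?]+", text):
        # Scanning: keep word tokens only, so punctuation and whitespace are dropped.
        words = re.findall(r"[A-Za-z]+", sentence.lower())
        stems = [crude_stem(word) for word in words if word not in STOP_WORDS]
        if stems:
            units.append(" ".join(stems))
    return units


if __name__ == "__main__":
    text = "The system validates the input. The system, validated inputs!"
    print(preprocess_text(text))
```

Both sentences of the usage example map to the same unit, so a sequence-based detector would treat them as equal even though they differ in punctuation and inflection.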

7.2.4 Models

Preprocessing transforms Matlab/Simulink models into labeled graphs. It involves several steps: reading the models, removal of subsystem boundaries, removal of unconnected lines and normalization.

Normalization produces the labels of the vertices and edges in the graph. The label content depends on which vertices are considered equal. For blocks, usually at least the block type is included, while semantically irrelevant information, such as the name, color, or layout position, is excluded. Additionally, some of the block attributes are taken into account, e.g., for the RelationalOperator block the value of the Operator attribute is included, as this decides whether the block performs a greater or less than comparison. For the lines, we store the indices of the source and destination ports in the label, with some exceptions as, e.g., for a product block the input ports do not have to be differentiated. Furthermore, normalization stores weight values for vertices. The weight values are used to treat different vertex types differently when filtering small clones. Weighting can be configured and is an important tool to tailor model clone detection.

The result of these steps is a labeled model graph $G = (V, E, L)$ with the set of vertices (or nodes) $V$ corresponding to the blocks, the directed edges $E \subseteq V \times V$ corresponding to the lines, and a labeling function $L : V \cup E \rightarrow N$ mapping nodes and edges to normalization labels from some set $N$. Two vertices or two edges are considered equivalent if they have the same label. As a Simulink block can have multiple ports, each of which can be connected to a line, $G$ is a multi-graph. The ports are not modeled here but implicitly included in the normalization labels of the lines.

For the simple models shown in Figure 7.3 the labeled graph produced by preprocessing is depicted in Figure 7.4. The nodes are labeled according to our normalization function. (The grey portions of the graph mark the part we consider a clone.)


Figure 7.3: Examples: Discrete saturated PI-controller and PID-controller










Figure 7.4: The model graph for our simple example model

7.3 Detection Algorithms

Detection identifies the actual clones in the artifacts. This section first introduces general steps involved in detection and then outlines detection algorithms for sequences and graphs.

7.3.1 Steps

The detection phase produces cloning information in terms of regions in the source artifacts. It involves two steps. First, clones are identified in the intermediate representation. Second, clones are mapped from the intermediate representation to their original artifacts. Given a suitable intermediate representation, mapping is straightforward. The principal challenge in this phase is thus the detection of clones in the intermediate representation.

The employed detection algorithms depend on the structure of the intermediate representation, not on the type of the artifact. More specifically, different algorithms are employed for sequences than for graphs. This section is thus structured according to algorithms that operate on sequences and those that operate on graphs.³

In principle, source code can be represented both as a sequence of statements or as a graph (e.g., a program dependence graph). Thus, both sequence- and graph-based detection algorithms can be applied to source code. PDG-based approaches [137, 146], e.g., operate on a graph-based intermediate representation for code. However, ConQAT performs clone detection on sequences, since from our experience, the cost increase incurred by searching clones in graphs instead is not justified by a sufficient increase in detection result quality; many of the graph-based clone detection approaches are prohibitively expensive for practical application [137, 146]. For data-flow models, on the other hand, we are not aware of a sequentialization that is sufficiently canonical to allow for high recall of sequence-based clone detection in models. Thus, we perform clone detection for source code and requirements specifications on sequences, but clone detection for models on graphs.

³ ConQAT does not implement clone detection algorithms that operate on trees.

7.3.2 Batch Detection of Type-1 and Type-2 Clones in Sequences

ConQAT implements a suffix tree-based algorithm for the detection of type-1 and type-2 clones in sequences. The algorithm operates on a string of units and detects substrings that occur more than once. It can be applied both to source code and to requirements specifications. The algorithm is similar to the clone detection algorithms proposed by Baker [6] and Kamiya et al. [121].

A suffix tree over a sequence s is a tree with edges labeled by words so that exactly all suffixes of s are found by traversing the tree from the root node to a leaf and concatenating the words on the encountered edges. It is constructed in linear time (and thus linear space) using the algorithm by Ukkonen [222]. A suffix tree for the sequence abcdXabcd$ is displayed in Figure 7.5. Red edges denote suffix links. A suffix link points from a node to a node that represents its direct suffix.











Figure 7.5: Suffix tree for sequence abcdXabcd$

In a suffix tree, no two edges leaving a node have the same label. If two substrings of s are identical, it contains two suffixes that have the string as their prefix; both share the same edge in the tree. In sequence abcdXabcd$, the string abcd occurs twice; consequently, the suffixes abcdXabcd$ and abcd$⁴ share the prefix abcd and thus the edge between n0 and n6 in the tree (denoted in blue). The node n6 indicates that the suffixes differ from this point on: one continues with the label Xabcd$, one with $.

To detect clones, the algorithm performs a depth-first search of the suffix tree. If a node in the tree has children, the label from the root to the node occurs exactly as many times in s as the node has reachable leaves in the tree. For example, since n6 has two reachable leaves (n1 and n7), the label abcd occurs 2 times in s and is thus reported as a clone group with two clones.

The suffixes of clones (bcd, cd and d, denoted in gray in the example) also occur several times in s. We refer to them as induced clones. If they do not occur more often than their longer variants, they are not reported. The algorithm employs the suffix links to propagate induced clone counts. Clones are only reported if the induced clone count for a node is smaller than its clone count. In the example, no clones are reported for nodes n8, n10 and n12.

⁴ The sentinel character $ denotes the end of the sequence s.

Figure 7.6: The original file named X.j (left), its normalization (center), and the corresponding clone index (right).
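The effect of the suffix tree-based detection, including the suppression of induced clones, can be reproduced on small inputs with a brute-force sketch. This is explicitly not the linear-time suffix tree algorithm: it enumerates all subsequences and only serves to show which groups are reported. The function names and the `min_length` parameter are assumptions for illustration.

```python
from collections import defaultdict


def contains(longer, shorter):
    """True if shorter occurs as a contiguous subsequence of longer."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def find_clone_groups(units, min_length):
    """Map each repeated, non-induced subsequence to its start positions.

    Brute-force stand-in for suffix tree traversal; only suitable for tiny inputs.
    """
    occurrences = defaultdict(list)
    for start in range(len(units)):
        for end in range(start + min_length, len(units) + 1):
            occurrences[tuple(units[start:end])].append(start)
    repeated = {seq: pos for seq, pos in occurrences.items() if len(pos) > 1}

    groups = {}
    for seq, positions in repeated.items():
        # A repeated sequence is induced if a longer repeated sequence contains it
        # and occurs just as often; it then carries no additional information.
        induced = any(
            len(other) > len(seq) and len(other_pos) == len(positions)
            and contains(other, seq)
            for other, other_pos in repeated.items()
        )
        if not induced:
            groups[seq] = positions
    return groups


if __name__ == "__main__":
    for seq, positions in find_clone_groups("abcdXabcd$", min_length=2).items():
        print("".join(seq), positions)
```

For the example sequence only abcd is reported; its shorter variants such as bcd and cd occur equally often and are therefore dropped.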

algorithmScalabilitytogetherandPwitherftheormanceindex-basedWeevaluatealgorithminscalabilitythenextandsection.performanceofthesuf®xtree-based

7.3.3Real-TimeDetectionofType-1andType-2ClonesinSequences
type-2ConQATclonesimplementsthatisbothindex-basedincremental,clonedistribdetectionutableasandanoscalableveltovdetectionerylargeapproachcodeforbases.type-1and

alloClonewstheIndexlookupTheofallcloneclonesindexforisathesinglecentral®ledata(andthusstructurealsousedfortheforourentiredetectionsystem),andalgorithm.canbeIt
updatedef®ciently,when®lesareadded,removed,ormodi®ed.
Thelistofallclonesofasystemisnotasuitablesubstituteforacloneindex,asef®cientupdateis
notpossible.Addinganew®lemaypotentiallyintroducenewclonestoanyoftheexisting®lesand
thusacomparisontoall®lesisrequiredifnoadditionaldatastructureisused.
(cfThe.,core[135],ideapp.ofthe560–663).cloneindeThere,xisasimilarmappingtothefrominveachertedwindeordxtousedallitsindocumentoccurrencesretrieisvalmaintained.systems
Similarlyoccurrences.,theMorecloneindepreciselyx,themaintainscloneaindemappingxisalistfromoftuplessequences(®le,ofstatementnormalizedindex,statementssequencetohash,their
info),where®leisthenameofthe®le,statementindexisthepositioninthelistofnormalized
statementsforthe®le,sequencehashisahashcodeforthenextnnormalizedstatementsinthe
®lestartingfromthestatementindex(nisaconstantcalledchunklengthandisusuallysetto
thealgorithms,minimalbutclonemightlength),beusefulandinfwhenocontainsproducinganythelistadditionalofclones,data,suchwhichastheisnotstartandrequiredendforlinestheof
sequence.statementthe


The clone index contains the described tuples for all files and all possible statement indices, i.e., for a single file the statement sequences (1, ..., n), (2, ..., (n+1)), (3, ..., (n+2)), etc. are stored. Our detection algorithm requires lookups of tuples both by file and by sequence hash, so both should be supported efficiently. Other than that, no restrictions are placed on the index data structure, so there are different implementations possible, depending on the actual use case. These include in-memory indices based on two hash tables or search trees for the lookups, and disk-based indices which allow persisting the clone index over time and processing amounts of code which are too large to fit into main memory. The latter may be based on database systems, or on one of the many optimized (and often distributed) key-value stores [34, 47].
In Fig. 7.6, the correspondence between an input file »X.j«⁵ and the clone index is visualized for a chunk length of 5. The field that requires most explanation is the sequence hash. The reason for using sequences of statements in the index instead of individual statements is that the statement sequences are less common (two identical statement sequences are less likely than two identical statements) and are already quite similar to the clones. If there are two entries in the index with the same sequence, we already have a clone of length at least n. The reason for storing a hash in the index instead of the entire sequence is to save space, as this way the size of the index is independent of the choice of n, and usually the hash is shorter than the sequence's contents even for small values of n. We use the MD5 hashing algorithm [192], which calculates 128 bit hash values and is typically used in cryptographic applications, such as the calculation of message signatures. As our algorithm only works on the hash values, several statement sequences with the same MD5 hash value would cause false positives in the reported clones. While there are cryptographic attacks that can generate messages with the same hash value [212], the case of different statement sequences producing the same MD5 hash is so unlikely in our setting that it can be neglected for practical purposes.
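A minimal in-memory variant of such an index can be sketched with two hash tables, one keyed by file and one keyed by sequence hash. The sketch rests on assumptions: the function names are hypothetical, statement indices are 0-based rather than 1-based, the info field is omitted, and statements are assumed to be already normalized strings.

```python
import hashlib
from collections import defaultdict


def index_tuples(file_name, statements, chunk_length):
    """Yield (file, statement index, sequence hash) for every chunk of a file."""
    for i in range(len(statements) - chunk_length + 1):
        chunk = "\n".join(statements[i:i + chunk_length])
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()  # 128 bit hash
        yield (file_name, i, digest)


def build_index(files, chunk_length):
    """Build lookup tables by file and by sequence hash (two hash tables)."""
    by_file = defaultdict(list)
    by_hash = defaultdict(list)
    for name, statements in files.items():
        for entry in index_tuples(name, statements, chunk_length):
            by_file[name].append(entry)
            by_hash[entry[2]].append(entry)
    return by_file, by_hash


if __name__ == "__main__":
    files = {
        "X.j": ["id0 = INT", "id0 = id0 + id1", "return id0", "id0 ( )"],
        "Y.j": ["id0 ( id1 )", "id0 = id0 + id1", "return id0", "id0 ( )"],
    }
    by_file, by_hash = build_index(files, chunk_length=3)
    for digest, entries in by_hash.items():
        if len(entries) > 1:
            print(digest[:8], [(name, i) for name, i, _ in entries])
```

Every hash with more than one entry already marks a clone of at least the chunk length, which is the property the retrieval step builds on.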

Clone Retrieval. The clone retrieval process extracts all clones for a single file from the index. Usually we assume that the file is contained in the index, but of course the same process can be applied to find clones between the index and an external file as well. Tuples with the same sequence hash already indicate clones with a length of at least n (where n is the chunk length). The goal of clone retrieval is to report only maximal clones, i.e., clone groups that are not entirely contained in another clone group. The overall algorithm is sketched in Fig. 7.7, which we next explain in more detail.

 1  function reportClones(filename)
 2    let f be the list of tuples corresponding to filename
      sorted by statement index either read from
      the index or calculated on the fly
 3    let c be a list with c(0) = ∅
 4    for i := 1 to length(f) do
 5      retrieve tuples with same sequence hash as f(i)
 6      store this set as c(i)
 7    for i := 1 to length(c) do
 8      if |c(i)| < 2 or c(i) ⊆ c(i-1) then
 9        continue with next loop iteration
10      let a := c(i)
11      for j := i+1 to length(c) do
12        let a' := a ∩ c(j)
13        if |a'| < |a| then
14          report clones from c(i) to a (see text)
15        a := a'
16        if |a| < 2 or a ⊆ c(i-1) then
17          break inner loop

Figure 7.7: Clone retrieval algorithm

The first step (up to Line 6) is to create the list c of duplicated chunks. This list stores for each statement of the input file all tuples from the index with the same sequence hash as the sequence found in the file. The index used to access the list c corresponds to the statement index in the input file. The setup is depicted in Fig. 7.8. There is a clone of length 10 (6 tuples with chunk length 5) with the file Y.j, and a clone of length 7 with both Y.j and Z.j.

Figure 7.8: Lookups performed for retrieval

In the main loop (starting from Line 7), we first check whether any new clones might start at this position. If there is only a single tuple with this hash (which has to belong to the inspected file at the current location) we skip this loop iteration. The same holds if all tuples at position i have already been present at position i-1, as in this case any clone group found at position i would be included in a clone group starting at position i-1. Although we use the subset operator in the algorithm description, this is not really a subset operation, as of course the statement index of the tuples in c(i) will be increased by 1 compared to the corresponding ones in c(i-1) and the content of the info field will differ.

The set a introduced in Line 10 is called the active set and contains all tuples corresponding to clones which have not yet been reported. At each iteration of the inner loop the set a is reduced to tuples which are also present in c(j) (again the intersection operator has to account for the increased statement index and different info field). The new value is stored in a'. Clones are only reported if tuples are lost in Line 12, as otherwise all current clones could be prolonged by one statement. Clone reporting matches tuples that, after correction of the statement index, appear in both c(i) and a; each matched pair corresponds to a single clone. Its location can be extracted from the filename and info fields. All clones in a single reporting step belong to one clone group. Line 16 exits the inner loop early if either no more clones are starting from position i (i.e., a is too small), or if all tuples from a have already been in c(i-1) (again, corrected for statement index). In this case they have already been reported in the previous iteration of the outer loop.

⁵ We use the name X.j instead of X.java as an abbreviation in the figures.
This algorithm returns all clone groups with at least one clone instance in the given file and with a minimal length of chunk length n. Shorter clones cannot be detected with the index, so n must be chosen equal to or smaller than the minimal clone length. Of course, reported clones can be easily filtered to only include clones with a length l > n.

One problem of this algorithm is that clone groups with multiple instances in the same file are encountered and reported multiple times. Furthermore, when calculating the clone groups for all files in a system, clone groups will be reported more than once as well. Both cases can be avoided by checking whether the first element of a' (with respect to a fixed order) is equal to f(j) and only reporting in this case.
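The retrieval loop can be transcribed into a short runnable sketch. Several choices are assumptions of this sketch rather than part of the described algorithm: positions are 0-based, tuples are reduced to (file, statement index), the active set keeps the start positions from c(i) and applies the index correction explicitly, an empty sentinel set is appended so that clones reaching the end of the file are reported, and duplicate reports are not filtered.

```python
import hashlib
from collections import defaultdict


def chunk_hashes(statements, n):
    """MD5 hashes of all chunks of n consecutive statements."""
    return [hashlib.md5("\n".join(statements[i:i + n]).encode("utf-8")).hexdigest()
            for i in range(len(statements) - n + 1)]


def build_index(files, n):
    """Map each sequence hash to the set of (file, statement index) tuples."""
    index = defaultdict(set)
    for name, statements in files.items():
        for i, digest in enumerate(chunk_hashes(statements, n)):
            index[digest].add((name, i))
    return index


def shifted_subset(a, b, offset):
    """Subset test that corrects the statement index by offset."""
    return all((name, k + offset) in b for name, k in a)


def report_clones(name, files, index, n):
    """Return clone groups as (length in statements, sorted start tuples)."""
    c = [index[digest] for digest in chunk_hashes(files[name], n)]
    c.append(set())  # sentinel (assumption): forces a report at the end of the file
    groups = []
    for i in range(len(c) - 1):
        previous = c[i - 1] if i > 0 else set()
        if len(c[i]) < 2 or shifted_subset(c[i], previous, -1):
            continue  # no new clone group starts at position i
        active = set(c[i])
        for j in range(i + 1, len(c)):
            # Intersection with c(j), corrected for the advanced statement index.
            reduced = {(f, k) for f, k in active if (f, k + j - i) in c[j]}
            if len(reduced) < len(active):
                # Tuples were lost: the clones in the active set end before chunk j.
                groups.append((j - i + n - 1, sorted(active)))
            active = reduced
            if len(active) < 2 or shifted_subset(active, previous, -1):
                break
    return groups


if __name__ == "__main__":
    files = {"X.j": list("pabcdq"), "Y.j": list("rabcs"), "Z.j": list("tbcdu")}
    index = build_index(files, n=2)
    for length, clones in report_clones("X.j", files, index, n=2):
        print(length, clones)
```

In the usage example, X.j shares the statements a b c with Y.j, b c d with Z.j, and b c with both files, so three clone groups are reported.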

Index Maintenance. By index maintenance we refer to all steps required to keep the index up to date in the presence of code changes. For index maintenance, only two operations are needed, namely addition and removal of single files.⁶ Modifications of files can be reduced to a remove operation followed by an addition, and index creation is just addition of all existing files starting from an empty index. In the index-based model, both operations are simple. To add a new file, it has to be read and preprocessed to produce its sequence of normalized statements. From this sequence, all possible contiguous sequences of length n (where n is the chunk length) are generated, which are then hashed and inserted as tuples into the index. Similarly, the removal of a file consists of the removal of all tuples that contain the respective file. Depending on the implementation of the index, the addition and removal of tuples might cause additional processing steps (such as rebalancing search trees, or recovering freed disk space), but these are not considered here.
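The two maintenance operations, and modification as removal followed by addition, can be sketched on a small in-memory index. The class and method names are hypothetical, the info field is omitted, and statements are assumed to be already normalized.

```python
import hashlib
from collections import defaultdict


class CloneIndex:
    """Toy in-memory clone index supporting file addition and removal."""

    def __init__(self, chunk_length):
        self.chunk_length = chunk_length
        self.by_hash = defaultdict(set)  # sequence hash -> {(file, statement index)}
        self.by_file = {}                # file -> list of its sequence hashes

    def add_file(self, name, statements):
        """Hash every chunk of the file and insert the resulting tuples."""
        n = self.chunk_length
        hashes = []
        for i in range(len(statements) - n + 1):
            chunk = "\n".join(statements[i:i + n])
            digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
            hashes.append(digest)
            self.by_hash[digest].add((name, i))
        self.by_file[name] = hashes

    def remove_file(self, name):
        """Remove all tuples that belong to the file."""
        for i, digest in enumerate(self.by_file.pop(name, [])):
            self.by_hash[digest].discard((name, i))
            if not self.by_hash[digest]:
                del self.by_hash[digest]

    def update_file(self, name, statements):
        """A modification is a removal followed by an addition."""
        self.remove_file(name)
        self.add_file(name, statements)

    def duplicated_chunks(self):
        """Return the tuple sets of all hashes that occur more than once."""
        return [entries for entries in self.by_hash.values() if len(entries) > 1]


if __name__ == "__main__":
    index = CloneIndex(chunk_length=2)
    index.add_file("A.j", ["x", "y", "z"])
    index.add_file("B.j", ["x", "y", "w"])
    print(index.duplicated_chunks())
    index.update_file("B.j", ["q", "y", "z"])
    print(index.duplicated_chunks())
```

Keeping the per-file list of hashes makes removal proportional to the size of the removed file, which is what allows the index to follow continuous code changes.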

Implementation Considerations. Details on index implementation and an analysis of the complexity of the algorithm can be found in [97]. We omit it here, as its overall performance strongly depends on the structure of the analyzed system. Its practical suitability thus needs to be determined using measurements on real-world software, which are reported below.

Scalability and Performance: Batch Clone Detection. To evaluate performance and scalability of both the suffix tree-based and the index-based algorithm, we executed both on the same hardware, with the same settings, analyzed the same system and compared the results. Both algorithms are configured to operate on statements as units. For the index-based algorithm, we used an in-memory clone index implementation.

We used the 11 MLOC of C code of the Linux Kernel in version 2.6.33.2 as study object. Both algorithms detect the same 60,353 clones in 25,663 groups for it. To evaluate scalability, we performed several detections, each analyzing increasing amounts of code. We analyzed between 500 KLOC and 10 MLOC and incremented by 500 KLOC for each run. The measurements were carried out on a Windows machine with 2.53 GHz, Java 1.6 and a heap size of 1 GB. The results are depicted in Figure 7.9. It shows the number of statements (instead of the lines of code) on the x-axis, since they more accurately determine runtime. 500 KLOC, e.g., correspond to 141 K statements.

⁶ This simplification makes sense only if a single file is small compared to the entire code base, which holds for most systems. If a system only consists of a few huge files, more refined update operations would be required.


7.3 Detection Algorithms

Figure 7.9: Performance of type-2 clone detection (x-axis: system size in 1000 statements; y-axis: execution time in seconds; series: statement creation, suffix-tree-based detection, index-based detection)

The time required to create the statement units (including disk I/O, scanning and normalization) is depicted in red. It dominates the runtime for both algorithms. The runtimes of the suffix tree-based and index-based detection algorithms (including statement unit creation) are depicted in blue and green, respectively. For both algorithms, runtimes increase linearly with system size. The suffix tree-based algorithm is faster. It should thus be used if batch detection is performed on a single machine and sufficient memory is available. Otherwise, the index-based algorithm is preferable.

Scalability and Performance: Real-Time Clone Detection

We investigated the suitability for real-time clone detection on large code that is modified continuously on the same hardware as above. We used a persistent clone index implementation based on Berkeley DB⁷, a high-performance embedded database.

We measured the time required to (1) build the index, (2) update the index in response to changes to the system, and (3) query the index. For this, we analyzed version 3.3 of the Eclipse SDK (42,693,793 LOC in 209,312 files). We timed index creation to measure (1). To measure (2), we removed 1,000 randomly selected files and re-added them afterwards. For (3), we queried the index for all clone groups of 1,000 randomly selected files.

Table 7.1 depicts the results. Index creation, including writing the clone index to the database, took 7 hours and 4 minutes. The clone index occupied 5.6 GB on disk. Index update, including writing to the database, took 0.85 seconds per file on average. Finally, queries for all clone groups for a file took 0.91 seconds on average. Median query time was 0.21 seconds. Only 14 of the 1,000 files had a query time of over 10 seconds. On average, the files had a size of 3 kLOC and queries for them returned 350 clones.
7 http://www.oracle.com/technology/products/berkeley-db/index.html


7 Algorithms and Tool Support

The results indicate that our approach is capable of supporting real-time clone management: the index can be created during a single nightly build. (Afterwards, the index can be updated to changes and does not need to be recreated.) The average time for a query is, in our opinion, fast enough to support interactive display of clone information when a source file is opened in the IDE. Finally, the performance of index updates allows for continuous index maintenance, e.g., triggered by commits to the source code repository or save operations in the IDE.

Table 7.1: Clone management performance
Index creation (complete)    7 hr 4 min
Index query (per file)       0.21 sec median
                             0.91 sec average
Index update (per file)      0.85 sec average

Scalability and Performance: Distributed Clone Detection

We evaluated the distribution on multiple machines using Google's computing infrastructure. The employed index is implemented on top of Bigtable [34], a key-value store supporting distributed access. Details on the implementation on Google's infrastructure can be found in [97].

We analyzed third party open source software, including, e.g., WebKit, Subversion, and Boost (73.2 MLOC of Java, C, and C++ code in 201,283 files in total). We executed both index creation and coverage calculation as separate jobs, both on different numbers of machines⁸. In addition, to evaluate scalability to ultra-large code bases, we measured index construction on 1000 machines on about 120 million C/C++ files indexed by Google Code Search⁹, comprising 2.9 GLOC¹⁰.

Using 100 machines, index creation and coverage computation for the 73.2 MLOC of code took about 36 minutes. For 10 machines, the processing time is still only slightly above 3 hours. The creation of the clone index for the 2.9 GLOC of C/C++ sources in the Google Code Search index required less than 7 hours on 1000 machines.

We observed a saturation of the execution time for both tasks. Towards the end of the job, most machines are waiting for a few machines which had a slightly larger computing task caused by large files or files with many clones. The algorithm thus scales well up to a certain number of machines. Additional measurements (cf. [97]) revealed that using more than about 30 machines for retrieval does not make sense for a code base of the given size. However, the large job processing 2.9 GLOC demonstrates the (absence of) limits for index construction.
8 The machines have Intel Xeon processors from which only a single core was used, and the task allocated about 3 GB RAM on each.
9 http://www.google.com/codesearch
10 More precisely 2,915,947,163 lines of code.


7.3.4 Type-3 Clones in Sequences
ConQAT implements a novel algorithm to detect type-3 clones in sequences. The task of the detection algorithm is to find common substrings in the unit sequence, where common substrings are not required to be exactly identical, but may have an edit distance bounded by some threshold. This problem is related to the approximate string matching problem [109, 221], which is also investigated extensively in bioinformatics [215]. The main difference is that we are not interested in finding an approximation of only a single given word in the string, but rather are looking for all substrings approximately occurring more than once in the entire sequence.
The algorithm constructs a suffix tree of the unit sequence and then performs an edit-distance-based approximate search for each suffix in the tree. It employs the same suffix tree as the algorithm that searches for type-1 and type-2 clones from Section 7.3.2, but employs a different search.

Detection Algorithm

A sketch of our detection algorithm is shown in Figures 7.10 and 7.11. Clones are identified by the procedure search that recursively traverses the suffix tree. Its first two parameters are the sequence s we are working on and the position start where the search was started, which is required when reporting a clone. The parameter j (which is the same as start in the first call of search) marks the current end of the substring under inspection. To prolong this substring, the substring starting at j is compared to the next word w in the suffix tree, which is the edge leading to the current node v (for the root node we just use the empty string). For this comparison, an edit distance of at most e operations (fifth parameter) is allowed. For the first call of search, e is the edit distance maximally allowed for a clone. If the remaining edit operations are not enough to match the entire edge word w (else case), we report the clone as far as we found it. Otherwise, the traversal of the tree continues recursively, increasing the length (j − start) of the current substring and reducing the number e of edit operations available by the amount of operations already spent.
proc detect(s, e)
Input: String s = (s0, ..., sn), max edit distance e
  Construct suffix tree T from s
  for each i ∈ {1, ..., n} do
    search(s, i, i, root(T), e)
Figure 7.10: Outline of approximate clone detection algorithm
A suffix tree for the sequence abcdXabcYd$ is displayed in Figure 7.12, that contains the type-3 clones abcd and abcYd. For an edit distance of 1, the algorithm matches the type-3 clones abcd and abcYd, depicted in blue. From node n6, the labels dXabcYd$ and Yd$ are compared. If Y is removed (indicated in orange), both labels start with d. The label abc from n0 to n6 can thus be prolonged by d for n1 and Yd for n7. The induced clones to n8 and n10 are again excluded. The induced clone d, at node n13, is not reachable through a suffix link. However, it still does not get reported, since the search only starts at positions in the word that are not covered by other clones. Hence, no search starts for d, since it is covered by the above clone group. Due to its local search strategy, the algorithm does not guarantee to find globally optimal edit sequences.
To make this algorithm work and its results usable, some details have to be fleshed out. To compute the longest edit distance match, we use the dynamic programming algorithm found in algorithm


proc search(s, start, j, v, e)
Input: String s = (s0, ..., sn), start index of current search, current search index j,
       node v of suffix tree over s, max edit distance e
  Let (w1, ..., wm) be the word along the edge leading to v
  Calculate the maximal length l ≤ m, so that
    there is a k ≥ j where the edit distance e′ between
    (w1, ..., wl) and (sj, ..., sk) is at most e
  if l = m then
    for each child node u of v do
      search(s, start, k + m, u, e − e′)
  else if k − start ≥ minimal clone length then
    report substring from start to k of s as clone
Figure 7.11: Search routine of the approximate clone detection algorithm
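The second step of the search routine (finding the maximal prefix length l of the edge word that still matches within the edit budget) can be sketched with the textbook dynamic programming recurrence for edit distance. The function name `longest_approx_prefix` and the convention that the returned k is an exclusive end index into s are assumptions of this sketch; it is not taken from ConQAT.

```python
def longest_approx_prefix(w, s, j, e):
    """Return (l, k, spent): the longest prefix w[:l] for which some s[j:k]
    lies within edit distance e, together with one such k and the distance spent."""
    tail = s[j:]
    prev = list(range(len(tail) + 1))      # distances between w[:0] and tail[:b]
    best = (0, j, 0)
    for a in range(1, len(w) + 1):
        cur = [a] + [0] * len(tail)
        for b in range(1, len(tail) + 1):
            cost = 0 if w[a - 1] == tail[b - 1] else 1
            cur[b] = min(prev[b] + 1,          # delete w[a-1]
                         cur[b - 1] + 1,       # insert tail[b-1]
                         prev[b - 1] + cost)   # match or substitute
        spent = min(cur)
        if spent > e:
            break                          # row minima never decrease, so stop here
        best = (a, j + cur.index(spent), spent)
        prev = cur
    return best
```

If the returned l equals the length of w, the traversal can continue into the child nodes with the remaining budget e minus the distance spent; otherwise the clone ends on this edge.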











Figure 7.12: Suffix tree for sequence abcdXabcYd$

textbooks. While easy to implement, it requires quadratic time and space¹¹. To make this step efficient, we look at most at the first 1000 statements of the word w. As long as the word on the suffix tree edge is shorter, this is not a problem. In case there is a clone of more than 1000 statements, we find it in chunks of 1000. We consider this to be tolerable for practical purposes. As each suffix we are running the search on will of course be part of the tree, we also have to make sure that no self matches are reported.
When running the algorithm as is, the results are often not as expected because it tries to match as many statements as possible. However, allowing for edit operations right at the beginning or at the end of a clone is not helpful, as then every exact clone can be prolonged into a type-3 clone. We thus enforce the first few statements (how many can be parameterized) to match exactly. This also speeds up the search, as we can choose the correct child node at the root of the suffix tree in one step without looking at all children. The last statements are also not allowed to differ, which is checked for and corrected just before reporting a clone.
With these optimizations, the algorithm can miss a clone either due to the thresholds (either too short
11 It can be implemented using only linear space, but preserving the full calculation matrix allows some simplifications.


Figure 7.13: Runtime of type-3 clone detection (x-axis: system size in MLOC; y-axis: time in seconds)

or too many inconsistencies), or if it is covered by other clones. The latter case is important, as each substring of a clone is of course again a clone and we usually do not want these to be reported.

Scalability and Performance

To assess the performance of the entire clone detection pipeline, we executed ConQAT to detect type-3 clones on the source code of Eclipse¹², limiting detection to a certain amount of code. Our results on an Intel Core 2 Duo 2.4 GHz running Java in a single thread with 3.5 GB of RAM are shown in Figure 7.13. We use a minimal clone length of 10 statements, a maximal edit distance of 5 and a gap ratio of 0.2¹³. ConQAT is capable of handling the 5.6 MLOC of Eclipse in about 3 hours. This is fast enough to be executed during a nightly build.

7.3.5 Clones in Data-Flow Graphs
ConQAT implements a novel algorithm to detect clones in graphs. In this section, we formalize clone detection in graph-based models and describe an algorithm for solving it. Our approach comprises two steps. First, it extracts clone pairs (i.e., parts of the model that are equivalent); second, it clusters pairs to also find substructures occurring more than twice.

Problem Definition

Detection operates on a normalized model graph $G = (V, E, L)$. We define a clone pair as a pair of subgraphs $(V_1, E_1), (V_2, E_2)$ with $V_1, V_2 \subseteq V$ and $E_1, E_2 \subseteq E$, so that the following conditions hold:
1. There are bijections $\iota_V : V_1 \to V_2$ and $\iota_E : E_1 \to E_2$, so that for each $v \in V_1$ it holds $L(v) = L(\iota_V(v))$ and for each $e = (x, y) \in E_1$ it is both $L(e) = L(\iota_E(e))$ and $(\iota_V(x), \iota_V(y)) = \iota_E(e)$.
2. $V_1 \cap V_2 = \emptyset$
3. The graph $(V_1, E_1)$ is connected.
12 Core of Eclipse Europa release 3.3. The code size is smaller than mentioned in Section 7.3.3, since we only analyzed the core code and excluded other projects from the Eclipse ecosystem, that were part of the analysis in Section 7.3.3.
13 The gap ratio is the ratio of the edit distance w.r.t. the length of the clone.


For $V_1, V_2 \subseteq V$, we say that they are in a cloning relationship, iff there are $E_1, E_2 \subseteq E$ so that $(V_1, E_1), (V_2, E_2)$ is a clone pair.
The first condition of the definition states that those subgraphs must be isomorphic with regard to the labels L; the second one rules out overlapping clones; the last one ensures we are not finding only unconnected blocks distributed arbitrarily through the model. Note that we do not require them to be complete subgraphs (i.e., contain all induced edges).
The size of the clone pair denotes the number of nodes in $V_1$. The goal is to find all maximal clone pairs, i.e., all such pairs which are not contained in any other pair of greater size.
While this problem seems to be similar to the well-known NP-hard Maximum Common Subgraph (MCS) problem (also called Largest Common Subgraph in [75]), it is slightly different in that we only deal with one graph (while MCS looks for subgraphs in two different graphs) and we do not only want to find the largest subgraph, but all maximal ones.

Detecting Clone Pairs

Since the problem of finding the largest clone pair is NP-complete, we cannot expect to find an efficient (polynomial time) algorithm that enumerates all maximal clone pairs, at least not for models of realistic size. Instead, ConQAT employs a heuristic approach.
Figure 7.14 gives an outline of the algorithm. It iterates over all possible pairings of nodes and proceeds in a breadth-first-search (BFS) from there (lines 4-12). It manages the sets C of current node pairs in the clone, S of nodes seen in the current BFS, and D of node pairs we are done with.
Line 9, which is optional, skips the currently built clone pair, if we find a pair of nodes we have already seen before. This was introduced as we found that clones reported this way are often similar to others already found (although with different "extensions") and thus rather tend to clutter the output.
The main difference between our heuristic and an exhaustive search (such as the backtracking approach given in [172]) is in line 7: we only inspect one possible mapping of the nodes' neighborhoods to each other. To find all clone pairs, we would have to inspect all possible mappings and perform backtracking. Even only two different mappings quickly lead to an exponential time algorithm in this case, which will not be capable of handling thousands of nodes.
Thus, for each pair of nodes $(u, v)$, we only consider one mapping $P$ of their adjacent blocks. All block pairs $(x, y)$ of $P$ must fulfill the following two conditions:
\[ L(x) = L(y) \tag{7.1} \]
\[ (u, x), (v, y) \in E \text{ and } L((u, x)) = L((v, y)) \quad \text{or} \quad (x, u), (y, v) \in E \text{ and } L((x, u)) = L((y, v)) \tag{7.2} \]
As we are only looking at a single assignment out of many, it is important to choose the "right" one. This is accomplished by the similarity function described in the following section.


Input: Model graph G = (V, E, L)
 1  D := ∅
 2  foreach (u, v) ∈ V × V with u ≠ v ∧ L(u) = L(v) do
 3    if {u, v} ∉ D then
 4      Queue Q := {(u, v)}, C := {(u, v)}, S := {u, v}
 5      while Q ≠ ∅ do
 6        dequeue pair (w, z) from Q
 7        from the neighborhood of (w, z) build a list of
          node pairs P for which the conditions (7.1, 7.2) hold
 8        foreach (x, y) ∈ P do
 9          if (x, y) ∈ D then continue with loop at line 2
10          if x ≠ y ∧ {x, y} ∩ S = ∅ then
11            C := C ∪ {(x, y)}, S := S ∪ {x, y}
12            enqueue (x, y) in Q
13      D := D ∪ C
14      report node pairs in C as clone pair
Figure 7.14: Heuristic for detecting clone pairs

The Similarity Function

The idea of the similarity function $\sigma : V \times V \to [0, 1]$ is to have a measure for the structural similarity of two nodes which not only captures the normalization labels, but also their neighborhood. We use the similarity in two places. First, we visit the node pairs in the main loop in the order of decreasing similarity, as a high $\sigma$ value is more likely to yield a "good" clone. Second, in line 7, we try to build pairs with a high similarity value. This is a weighted bipartite matching with $\sigma$ as weight, which can be solved in polynomial time [185].
For two nodes $u, v$, we define a function $s_i(u, v)$ that intuitively captures the structural similarity of all nodes that are reachable in exactly $i$ steps, by
\[ s_0(u, v) = \begin{cases} 1 & \text{if } L(u) = L(v) \\ 0 & \text{otherwise} \end{cases} \]
and
\[ s_{i+1}(u, v) = \begin{cases} \dfrac{M_i(u, v)}{\max\{|N(u)|, |N(v)|\}} & \text{if } L(u) = L(v) \\ 0 & \text{otherwise} \end{cases} \]
where $N(u)$ denotes the set of nodes adjacent to $u$ (its neighborhood); $M_i(u, v)$ denotes the weight of a maximal weighted matching between $N(u)$ and $N(v)$ using the weights provided by $s_i$ and respecting conditions (7.1) and (7.2).
We can show by induction that, for every $i$ and pair $(u, v)$, it holds that $0 \le s_i(u, v) \le 1$ and thus defining
\[ \sigma(u, v) := \sum_{i=0}^{\infty} \frac{1}{2^i} s_i(u, v) \]
is valid as the expression converges to a value between 0 and 1. The weighting with $\frac{1}{2^i}$ makes nodes near to the pair $(u, v)$ more relevant for the similarity. For practical applications, only the first few terms of the sum have to be considered and the similarity for all pairs can be calculated using dynamic programming.
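The recursive terms s_i of the similarity function can be illustrated with the following sketch. It is a simplified illustration under stated assumptions: the graph is given as two dictionaries (`labels` for nodes, `edges` for labeled directed edges), node names are sortable strings, and the maximal weighted matching is found by brute force, which is only practical for small neighborhoods. A real implementation would use a polynomial-time bipartite matching algorithm instead.

```python
from functools import lru_cache


def make_s(labels, edges):
    """labels: node -> label; edges: (source, target) -> edge label."""
    neighbors = {node: set() for node in labels}
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    def compatible(u, x, v, y):
        if labels[x] != labels[y]:                       # condition (7.1)
            return False
        forward = ((u, x) in edges and (v, y) in edges
                   and edges[(u, x)] == edges[(v, y)])
        backward = ((x, u) in edges and (y, v) in edges
                    and edges[(x, u)] == edges[(y, v)])
        return forward or backward                       # condition (7.2)

    @lru_cache(maxsize=None)
    def s(i, u, v):
        if labels[u] != labels[v]:
            return 0.0
        if i == 0:
            return 1.0
        left, right = sorted(neighbors[u]), sorted(neighbors[v])
        if not left or not right:
            return 0.0

        def best(k, used):
            # Maximal matching weight for left[k:], given the right nodes already used.
            if k == len(left):
                return 0.0
            top = best(k + 1, used)                      # leave left[k] unmatched
            for y in right:
                if y not in used and compatible(u, left[k], v, y):
                    top = max(top, s(i - 1, left[k], y) + best(k + 1, used | {y}))
            return top

        return best(0, frozenset()) / max(len(left), len(right))

    return s
```

The overall similarity is then a weighted sum of the first few `s(i, u, v)` values, with weights that halve at each step.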


Figure 7.15: A partially hidden clone of cardinality 3

Clustering Clones

So far, we only find clone pairs. Subgraphs that are repeated $n$ times will thus result in $n(n-1)/2$ clone pairs. Clustering aggregates those pairs into a single group.
While it seems straightforward to generalize the definition of a clone pair to $n$ pairs of nodes and edges to get the definition of a clone group, we felt this definition to be too restrictive. Consider, e.g., clone pairs $(V_1, E_1), (V_2, E_2)$ and $(V_3, E_3), (V_2, E_4)$. Although there is a bijection between the nodes of $V_1$ and $V_3$, they are not necessarily clones of each other, as they might not contain the required edges. However, we consider this relationship to be still relevant to be reported, as when looking for parts of the model to be included in a library the blocks corresponding to $V_2$ might be a good candidate, as it could potentially replace two other parts.
So instead of clustering clones by exact identity (including edges), which would miss many interesting cases differing only in one or two edges, we perform clustering only on the sets of nodes. This is an overapproximation that can result in clusters containing clones that are only weakly related. However, as we consider manual inspection of clones to be important for deciding how to deal with them, those cases (which are rare in practice) can be dealt with there.
Thus, for a model graph $G = (V, E, L)$, we define a clone group of cardinality $n$ as a set $\{V_1, \ldots, V_n\}$, so that for every $1 \le i < j \le n$ it is $V_i \subseteq V$ and there is a sequence $k_1, \ldots, k_m$ with $k_1 = i$, $k_m = j$, and $V_{k_l}$ and $V_{k_{l+1}}$ are in a clone relationship for all $1 \le l < m$ (i.e., there is a clone path between any two clones). The size of the clone group is the size of the set $V_1$, i.e., the number of duplicated nodes.
This boils down to a graph whose vertices are the node sets of the clone pairs and the edges are induced by the cloning relationship between them. The clone groups are then the connected components, which can be found using standard graph traversal algorithms; alternatively a union-find structure (see, e.g., [42]) allows the connected components to be built on-line, i.e., while clone pairs are being reported, without building an explicit graph representation.
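The on-line clustering with a union-find structure can be sketched as follows. The function name `cluster_clone_pairs` and the representation of a clone as a `frozenset` of node identifiers are assumptions made for this illustration.

```python
def cluster_clone_pairs(pairs):
    """Group clone pairs (two node sets each) into clone groups.

    Two node sets end up in the same group if a path of clone pairs connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Pairs can be processed one at a time, as they are reported.
    for first, second in pairs:
        union(frozenset(first), frozenset(second))

    groups = {}
    for node_set in list(parent):
        groups.setdefault(find(node_set), set()).add(node_set)
    return list(groups.values())
```

No explicit graph over the clone pairs is built; the groups are read off the union-find forest once all pairs have been seen.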
There are still two issues to be considered. First, while we defined clone pairs to be non-overlapping, clone groups can potentially contain overlapping block sets. This does not have to be a problem, since examples for this are rather artificial. Second, some clone groups are not found, since larger clone pairs hide some of the smaller ones. An example of this can be found in Figure 7.15, where equal parts of the model (and their overlaps) are indicated by geometric figures. We want to find the clone groups with cardinality 3 shown as circles. As the clone pair detection finds maximal clones, however, when starting from nodes in circles 1 and 2, the clone pairs consisting of the pentagons will be found. Similarly, the circle pair 1 and 3 is hidden by the rectangle. So our pair detection reports the rectangle pair, the pentagon pair, and the circles 2 and 3.


We handle this in a final step by checking the inclusion relationship between the reported clone pairs. In the example, this reveals that the nodes from circle 2 are entirely contained in one of the pentagons (which analogously holds for the rectangle) and thus there has to be a clone of this circle in the other pentagon, too. Using this information we can find the third circle to get a clone group of cardinality 3. If there were an additional clone overlapping circles 2 and 3, we would have no single clone pair of the circle clone group and thus this approach would not work for this case. However, we consider this case to be unlikely enough to ignore it.

Scalability

The time and space requirements for clone pair detection depend quadratically on the overall number of blocks in the model(s). While for the running time this might be acceptable (though not optimal) as we can execute the program in batch mode, the amount of required memory can be too much to even handle several thousand blocks.
To solve this, we split the model graph into its connected components. We independently detect clone pairs within each such component and between each pair of connected components, which still allows us to find all clone pairs we would find without this technique. This does not improve running time, as still each pair of blocks is looked at (although we might gain something by filtering out components smaller than the minimal clone size). The amount of memory needed, however, now only depends quadratically on the size of the largest connected component. If the model is composed of unconnected submodels, or if we can split the model into smaller parts by some other heuristic (e.g., separating subsystems on the topmost level), memory is, hence, no longer the limiting factor.
We measured performance for the industrial Matlab/Simulink model we analyzed during the case study presented in Chapter 5, which comprises 20,454 blocks: the entire detection process, including pre- and postprocessing, took 50 s on an Intel Pentium 4 3.0 GHz workstation. The algorithm thus scales well to real-world models.

7.4 Postprocessing

Postprocessing comprises the process steps that are performed on the detected clones before the results are presented to the user. In ConQAT, postprocessing comprises merging, filtering, metric computation and tracking.

7.4.1 Steps

Filtering removes clones that are irrelevant for the task at hand. It can be performed based on clone properties such as length, cardinality or content, or based on external information, such as developer-created blacklists.
Metric computation computes, e.g., clone coverage or overhead. It is performed after filtering.


Clone tracking compares clones detected on the current version of a system against those detected on a previous one. It identifies newly added, modified and removed clones. If tracking is performed regularly, beginning at the start of a project, it determines when each individual clone was introduced.
The following sections describe the postprocessing steps in more detail. Postprocessing steps are, in principle, independent of the artifact type. Each step (filtering, metric computation and clone tracking) can thus be performed for clones discovered in source code, requirements specifications or models. However, for conciseness, this section presents postprocessing for clones in source code. Since the same intermediate representation is used for both code and requirements specifications, all of the presented postprocessing features can also be applied to requirements clones. Most of them, in addition, are either available for clones in models as well, or could be implemented in a similar fashion.

7.4.2 Filtering

Filtering removes clone groups from the detection result. ConQAT performs filtering in two places: local filters are evaluated right after a new clone group has been detected; global filters are evaluated after detection has finished. While global filters are less memory efficient (the later a clone group is filtered, the longer it occupies memory), they can take information from other clone groups into account. They thus enable more expressive filtering strategies. ConQAT implements clone constraints based on various clone properties.
The NonOverlappingConstraint checks whether the code regions of sibling clones overlap. The SameFileConstraint checks if all siblings are located in a single file. The CardinalityConstraint checks whether the cardinality of a clone group is above a given threshold.
The ContentConstraint is satisfied for a clone group, if the content of at least one of its clones matches a given regular expression. Content filtering is, e.g., useful to search for clones that contain special comments such as TODO or FIXME; they often indicate duplication of open issues.
Constraints for type-3 clones allow filtering based on their absolute number of gaps or their gap ratio. If, e.g., all clones without gaps are filtered, detection is limited to type-3 clones. This is useful to discover type-3 clones that may indicate faults and convince developers of the necessity of clone management. Clones can be filtered both for satisfied or violated constraints.
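Constraint-based filtering of this kind can be sketched as small predicate functions over clone groups. The function names below and the representation of a clone group as a list of clone content strings are assumptions for illustration; they do not mirror ConQAT's class structure.

```python
import re


def content_constraint(pattern):
    """Satisfied if the content of at least one clone matches the pattern."""
    regex = re.compile(pattern)
    return lambda group: any(regex.search(clone) for clone in group)


def cardinality_constraint(threshold):
    """Satisfied if the group has more clones than the threshold."""
    return lambda group: len(group) > threshold


def filter_groups(groups, constraint, keep_satisfied=True):
    """Keep the groups that satisfy (or, alternatively, violate) the constraint."""
    return [g for g in groups if constraint(g) == keep_satisfied]
```

For example, `filter_groups(groups, content_constraint(r"TODO|FIXME"))` keeps only the groups in which some clone contains such a comment.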

Blacklisting

Even if clone detection is tailored well, false positives may slip through. For continuous clone management, a mechanism is required to remove such false positives. To be useful, it must be robust against code modifications: a false positive remains a false positive independent of whether its file is renamed or its location in the file changes (e.g., because code above it is modified). It thus still needs to be suppressed by the filtering mechanism.
ConQAT implements blacklisting based on location independent clone fingerprints. If a file is renamed, or the location of a clone in the file changes, the value of the fingerprint remains unchanged. For type-1 and type-2 clones, all clones in a clone group have the same fingerprint. A blacklist


stores fingerprints of clones that are to be filtered. Fingerprints are added by developers that consider a clone irrelevant for their task. During postprocessing, ConQAT removes all clone groups whose fingerprint appears in the blacklist¹⁴.

Fingerprints are computed on the normalized content of a clone. The textual representation of the normalized units is concatenated into a single characteristic string. For type-1 and type-2 clones, all clones in a clone group have the same characteristic string; no clone outside the clone group has the same characteristic string, else it would be part of the first clone group. The characteristic string is independent of the file name or location in the file. Since it can be large for long clones, ConQAT uses its MD5 [192] hash as clone fingerprint to save space. Because of the very low collision probability of MD5, we do not expect to unintentionally filter clone groups due to fingerprint collisions.
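A fingerprint of this kind can be sketched with Python's standard `hashlib` module. The function names, the newline used as separator between units, and the representation of a clone group as a dictionary with a `units` entry are assumptions of this example.

```python
import hashlib


def clone_fingerprint(normalized_units):
    """MD5 hash of the characteristic string built from the normalized units."""
    characteristic = "\n".join(normalized_units)   # separator is an assumption
    return hashlib.md5(characteristic.encode("utf-8")).hexdigest()


def remove_blacklisted(groups, blacklist):
    """Drop all clone groups whose fingerprint appears in the blacklist."""
    return [g for g in groups if clone_fingerprint(g["units"]) not in blacklist]
```

Because the fingerprint is computed from normalized content only, it does not change when the file is renamed or the clone moves within the file.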

Blacklisting works for type-1 and type-2 clones in source code and requirements specifications. It is currently not implemented for type-3 clones. However, their clone group fingerprints could be computed on the similar parts of the clones to cope with different gaps of type-3 clones.

Cross-Project Clone Filtering

Cross project clone detection searches for clone groups whose clones span at least two different projects. The definition of project, in this case, depends on the context: Cross project clone detection can be used in software product lines to discover reusable code fragments that are candidates for consolidation [173]; or to discover clones between applications that build on top of a common framework to spot omissions. Projects in this context are thus individual products of a product family or applications that use the same framework.

To discover copyright infringement or license violations, it is employed to discover cloning between the code base maintained by a company and a collection of open source projects or software from other owners [77, 121]. Projects in this context are the company's code and the third party code.

ConQAT implements a CrossProjectCloneGroupsFilter that removes all clone groups that do not span at least a specified number of different projects. Projects are specified as path or package prefixes. Project membership is thus expressed via the location in the file system or the package (or name space) structure.
Figure 7.16 depicts a treemap that shows cloning across three different industrial projects¹⁵. Areas A, B and C mark project boundaries. Only cross-project clone groups are included. The project in the lower left corner does not contain a single cross-project clone, whereas the other two projects do. In both projects, most of the cloning is, however, clustered in a single directory. It contains GUI code that is similar between both.
14 All blacklisted clone groups are optionally written to a separate report to allow for checks whether the blacklisting feature has been misused to artificially reduce cloning.
15 Section 7.5.1 explains how to interpret treemaps.
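The prefix-based notion of project membership used by the cross-project filter can be illustrated as follows. This is a hedged sketch, not the CrossProjectCloneGroupsFilter itself: the function names, the dictionary of prefixes and the example paths are hypothetical.

```python
def project_of(path, prefixes):
    """Return the name of the first project whose prefix matches the path."""
    for name, prefix in prefixes.items():
        if path.startswith(prefix):
            return name
    return None


def spans_projects(clone_paths, prefixes, minimum=2):
    """True if the clones of a group are located in at least `minimum` projects."""
    projects = {project_of(path, prefixes) for path in clone_paths} - {None}
    return len(projects) >= minimum


def cross_project_groups(groups, prefixes, minimum=2):
    """Keep only the clone groups that span enough different projects."""
    return [g for g in groups if spans_projects(g, prefixes, minimum)]
```

The same scheme applies to package or name space prefixes if clone locations are given as qualified names instead of file paths.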


Figure 7.16: Cross-project clone detection results

7.4.3 Metric Computation

ConQAT computes the cloning metrics introduced in Chapter 4, namely clone counts, clone coverage and overhead. Computation of counts and coverage is straightforward. Hence, only computation of overhead is described here in detail.
Overhead is computed as the ratio $\frac{SS}{RFSS} - 1$. If, for example, a statement in a source file is covered by a single clone that has two siblings, it occurs three times in the system. Perfect removal would eliminate two of the three occurrences. It thus only contributes a single RFSS.
RFSS computation is complicated by the fact that clone groups can overlap. RFSS computation only counts a unit in a source artifact $\frac{1}{\text{times cloned}}$ number of times. In the above example, each occurrence of the statement is thus only counted as $\frac{1}{3}$ RFSS. We employ a union-find data structure to represent cloning relationships at the unit level. All units that are in a cloning relationship are in the same component in the union-find structure, all other units are in separate ones. For RFSS computation, the units are traversed. Each unit accounts for $\frac{1}{\text{component size}}$ RFSS.
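The RFSS and overhead computation described above can be sketched as follows. The function name, the representation of units as hashable identifiers, and the use of exact fractions are assumptions for this illustration.

```python
from collections import Counter
from fractions import Fraction


def rfss_and_overhead(units, cloning_relations):
    """units: all statement units of the system;
    cloning_relations: pairs of units that are clones of each other."""
    parent = {unit: unit for unit in units}

    def find(unit):
        while parent[unit] != unit:
            parent[unit] = parent[parent[unit]]
            unit = parent[unit]
        return unit

    # Units in a cloning relationship end up in the same component.
    for a, b in cloning_relations:
        parent[find(b)] = find(a)

    component_size = Counter(find(unit) for unit in units)
    # Each unit accounts for 1 / (size of its component) RFSS.
    rfss = sum(Fraction(1, component_size[find(unit)]) for unit in units)
    overhead = Fraction(len(units)) / rfss - 1
    return rfss, overhead
```

For a system of five statements in which one statement occurs three times, this yields an RFSS of 3 and an overhead of 2/3: the system is two thirds larger than it would be without cloning.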

7.4.4 Tracking

Clone tracking establishes a mapping between clone groups and clones of different (typically consecutive) versions of a software. Based on this mapping, clone churn (added, modified and removed clones) is computed. Tracking goes beyond fingerprint-based blacklisting, since it can also associate clones whose content has changed across versions. Since different content implies different fingerprints, such clones are beyond the capabilities of blacklisting.


ConQAT implements lightweight clone tracking to support clone control with clone churn information. The clone tracking procedure is based on the work by Göde [83]. It comprises three steps that are outlined in the following:

Update Old Cloning Information

Since the last clone detection was performed, the system may have changed. The cloning information from the last detection is thus outdated: clone positions might be inaccurate, some clones might have been removed while others might have been added. ConQAT updates old cloning information based on the edit operations that have been performed since the last detection, to determine where the clones are expected in the current system version.
ConQAT employs a relational database system to persist clone tracking information. Cloning information from the last detection is loaded from it. Then, for each file that contains at least one clone, the diff between the previous version (stored in the database) and the current version is computed. It is then used to update the positions of all clones in the file. For example, if a clone started in line 30, but 10 lines above it have been replaced by 5 new lines, its new start position is set to 25. If the code region that contained a clone has been removed, the clone is marked as deleted. If the content of a clone has changed between system versions, the corresponding edit operations are stored for each clone.
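The position update can be illustrated for a single start line. This sketch makes several simplifying assumptions: the diff is given as a list of hunks `(first_line, removed_count, inserted_count)` in 1-based coordinates of the old file, only the start line of a clone is tracked, and a clone whose start line is touched by a hunk is simply reported as `None`.

```python
def update_start_line(start, hunks):
    """Map a clone's start line in the old file version to the new version."""
    shift = 0
    for first, removed, inserted in hunks:
        last = first + removed - 1
        if removed and first <= start <= last:
            return None                     # the start line itself was removed or replaced
        if first + removed <= start:
            shift += inserted - removed     # hunk lies entirely above the clone
    return start + shift
```

With the example from the text, `update_start_line(30, [(15, 10, 5)])` yields 25: ten lines above the clone were replaced by five new ones, so the clone moves up by five lines.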

Detect New Clones

While the above step identifies old and removed clones, it cannot discover newly added clones in the system. For this purpose, in the second step, a complete clone detection is run on the current system version. It identifies all its clones.

Compute Churn

In the third step, updated clones are matched against newly detected ones to compute clone churn. We differentiate between these cases:

Positions of updated clone and new clone match: this clone has been tracked successfully between system versions.
New clone has no matching updated clone: tracking has identified a clone that was added in the new system version.
Updated clone has no matching new clone: it is no longer detected in the new system version. The clone or its siblings have either been removed, or inconsistent modification prevents its detection. These two cases need to be differentiated, since inconsistent modifications need to be pointed out to developers for further inspection. Tracking distinguishes them based on the edit operations stored in the clones.

Churn computation determines the list of added and removed clones and of clones that have been consistently or inconsistently modified.
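The matching step of churn computation can be sketched as a comparison of clone positions. The data model is hypothetical: clones are identified by `(file, start, end)` tuples, `updated` maps each updated clone to a flag stating whether edit operations were stored for it, and the rule that a vanished clone with stored edits counts as inconsistently modified is a simplifying assumption of this example.

```python
def compute_churn(updated, detected):
    """updated: dict position -> True if edit operations were stored for the clone;
    detected: set of positions found by the fresh detection run."""
    churn = {"tracked": [], "added": [], "removed": [], "inconsistent": []}
    for position in sorted(detected):
        # A new clone either matches an updated clone or was added.
        churn["tracked" if position in updated else "added"].append(position)
    for position in sorted(updated):
        if position not in detected:
            # No longer detected: decide based on the stored edit operations.
            kind = "inconsistent" if updated[position] else "removed"
            churn[kind].append(position)
    return churn
```

The resulting lists correspond to the churn categories that are later presented to developers.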


7.5 Result Presentation

Different use cases require different ways of interacting with clone detection results. This section outlines how results are presented in a quality dashboard for clone control and in an IDE for interactive clone inspection and change propagation.
Similar to postprocessing, this section focuses on presentation of code clones; all presentations can be applied to requirements clones as well, since both share the same intermediate representation. Furthermore, in many cases, ConQAT either contains similar presentation functionality for model clones, or it could be implemented in a similar fashion.

7.5.1 Project Dashboard

Project dashboards support continuous software quality control. Their goal is to provide stakeholders, including project management and developers, with relevant and accurate information on the quality characteristics of the software they are developing [48]. For this, quality dashboards perform automated quality analyses and collect, filter, aggregate and visualize result data. Through its visual data flow language, ConQAT supports the construction of such dashboards. Clone detection is one of the key supported quality analyses.
Different stakeholder roles require different presentations of clone detection results. To support them, ConQAT presents clone detection result information on different levels of aggregation.

Clone Lists provide cloning information on the file level, as depicted in the screenshot in Figure 7.17. They reveal the longest clones and the clone groups with the most instances. While no replacement for clone inspection on the code level, clone lists allow developers to get a first idea about the detected clones without requiring them to open their IDEs.

Figure 7.17: Clone list in the dashboard

Treemaps [223] visualize the distribution of cloning across artifacts. They thus reveal to stakeholders which areas of their project are affected how much.
Treemaps visualize source code size, structure and cloning in a single image. We introduce their interpretation by constructing a treemap step by step. A treemap starts with an empty rectangle.


Itsarearepresentsallprojectartifacts.Inthe®rststep,thisrectangleisdividedintosub-rectangles.
Eachsub-rectanglerepresentsacomponentoftheproject.Thesizeofthesub-rectanglecorresponds
totheaggregatesizeoftheartifactsbelongingtothecomponent.Theresultingvisualizationis
depictedinFigure7.18ontheleft.Thevisualizedprojectcontains24components.Forthelargest
ones,nameandsize(inLOC)aredepicted.SincecomponentGUIForms(91kLOC)islargerthan
componentBusinessLogic,itsrectangleoccupiesaproportionallylargerarea.
Inthesecondstep,eachcomponentrectangleisfurtherdividedintosub-rectanglesfortheindividual
artifactscontainedinthecomponent.Again,rectangleareaandartifactsizecorrespond.Theresult
isdepictedinFigure7.18ontheright.
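
The proportional subdivision used in the first two construction steps can be sketched as follows. This is only a minimal illustration: the text does not state which treemap layout algorithm ConQAT uses, so the simple "slice" layout, the function name `slice_layout` and the component names and sizes are assumptions made for the example.

```python
def slice_layout(x, y, width, height, sizes):
    """Split a rectangle horizontally so that each area is proportional to its size.

    `sizes` maps a (hypothetical) component or artifact name to its size in LOC.
    Returns a mapping from name to (x, y, width, height).
    """
    total = sum(sizes.values())
    rectangles = {}
    offset = x
    for name, size in sizes.items():
        share = width * size / total  # width proportional to size
        rectangles[name] = (offset, y, share, height)
        offset += share
    return rectangles


# Hypothetical project with two components; "A" is three times as large as "B".
components = slice_layout(0, 0, 100, 40, {"A": 3000, "B": 1000})
```

Applying the same function again to each component rectangle, with the sizes of its artifacts, corresponds to the second construction step.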

Figure 7.18: Treemap construction: artifact arrangement

Although position and size of the top-level rectangles did not change, they are hard to recognize due to the many individual rectangles now populating the treemap. The hierarchy between rectangles is, thus, obscured. To better convey their hierarchy, the rectangles are shaded in the third step, as depicted on the left of Figure 7.19.

Figure 7.19: Treemap construction: artifact coloring

In the last step, color is employed to reveal the amount of cloning an artifact contains and indicate generated code. More specifically, individual artifacts are colored on a gradient between white and red for a clone coverage between 0 and 1. Furthermore, code that is generated and not maintained by hand is colored dark gray. Figure 7.19 shows the result on the right. The artifacts in component GUIForms contain substantial amounts of cloning, whereas the artifacts in the component on the bottom-left hardly contain any. The artifacts of the component DataAccess are generated and thus depicted in gray, except for the two files in its left upper corner.
ConQAT displays tooltips with details, including size and cloning metrics, for each file. The treemaps thus reveal more information in the tool than in the screenshots.
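
The coloring rule of the last construction step can be illustrated with a small sketch. It assumes a linear interpolation between white and red and an arbitrary dark gray value; the text only specifies the gradient end points and the special treatment of generated code, so both choices, as well as the function name `treemap_color`, are assumptions.

```python
def treemap_color(clone_coverage, generated=False):
    """Return an (r, g, b) fill color for one artifact rectangle."""
    if generated:
        # Generated code is shown in dark gray; the exact shade is an assumption.
        return (64, 64, 64)
    if not 0.0 <= clone_coverage <= 1.0:
        raise ValueError("clone coverage must lie in [0, 1]")
    # White (255, 255, 255) at coverage 0, pure red (255, 0, 0) at coverage 1.
    fade = round(255 * (1.0 - clone_coverage))
    return (255, fade, fade)
```

An artifact without any cloning thus stays white, while an artifact that consists entirely of cloned code is drawn in full red.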

Trend Charts visualize the evolution of cloning metrics over time. They allow stakeholders to determine whether cloning increased or decreased during a development period. Figure 7.20 shows a trend chart of the development of clone coverage over time.

Figure 7.20: Clone coverage chart

Between April and May, clone coverage decreased since clones were removed. In May, new clones were introduced. After developers noticed this, the introduced clones were consolidated.

Clone Churn reveals clone evolution on the level of individual clones, which is required to diagnose the root cause of trend changes. Clone churn thus complements trend charts with more details. The screenshots in Figure 7.21 depict how clone churn information is displayed in the quality dashboard. On the left, the different churn lists are shown. For inspection of clones that have become inconsistent during evolution, the dashboard contains a view that displays their syntax-highlighted content and highlights differences. One such clone is shown in the screenshot on the right of Figure 7.21.

Figure 7.21: Clone churn in the quality dashboard

7.5.2 Interactive Clone Inspection

This section outlines ConQAT's interactive clone inspection features that allow developers to investigate clones inside their IDEs and to use cloning information for change propagation when modifying software that contains clones.
ConQAT implements a Clone Detection Perspective that provides a collection of views for clone inspection. The intended use case is one-shot investigation of cloning in a software system. A screenshot of the Clone Detection Perspective is depicted in Figure 7.22. Detailed documentation of the Clone Detection Perspective, including a user manual, is contained in the ConQAT Book [49] and outside the scope of this document. However, due to their importance for the case studies performed during this thesis, two views are explained in detail below.

The Clone Inspection View is the most important tool for inspecting individual clones on the code level. It implements syntax highlighting for all languages on which clone detection is supported. Furthermore, it highlights statement-level differences between type-3 clones. According to our experience, this view substantially increases productivity of clone inspection. We consider this crucial for case studies that involve developer inspection of cloned code.

The Clone Visualizer uses a SeeSoft visualization to display cloning information on a higher level of aggregation than the clone inspection view [63, 214]. It thus allows inspection of the cloning relationships of one or two orders of magnitude more code on a single screen.
Each bar in the view represents a file. The length of the bar corresponds to the length of its file. Each colored stripe represents a clone; all clones of a clone group have the same color. The length of the stripe corresponds to the length of the clone. This visualization reveals files with substantial mutual cloning through similar stripe patterns.
ConQAT provides two SeeSoft views. The clone family visualizer displays the currently selected file, all of its clones, and all other files that are in a cloning relationship with it. However, for the other files, only their clones with the selected file are displayed. The clone family visualizer thus supports a quick investigation of the amount of cloning a file shares with other files, as depicted in Figure 7.23.

Figure 7.22: Clone detection perspective

Figure 7.23: Clone family visualizer

The clone visualizer displays all source files and their clones. If the files are displayed in the order they occur on disk (or in the namespace), high-level similarities are typically too far separated to be recognized by the user. To cluster similar files, ConQAT orders them based on their amount of mutual cloning. Files that share many clones are, hence, displayed close to each other, allowing users to spot file-level cloning due to their similarly colored stripe patterns, as depicted in Figure 7.24.
Ordering files based on their amount of mutually cloned code can be reduced to the traveling salesperson problem: files correspond to cities, lines of mutually cloned code correspond to travel cost, and finding an ordering that maximizes the sum of mutually cloned lines between neighboring files corresponds to finding a maximally expensive travel route. Consequently, it is NP-complete [75]. ConQAT thus employs a heuristic algorithm to perform the sorting.
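
The text does not say which heuristic ConQAT employs. As an illustration of what such a heuristic could look like, the following sketch uses a greedy nearest-neighbor strategy: it repeatedly appends the file that shares the most cloned lines with the file placed last. The function name and the data layout are assumptions made for the example.

```python
def order_by_mutual_cloning(files, shared_lines):
    """Greedy ordering that places files with much mutual cloning next to each other.

    `shared_lines` maps frozenset({file_a, file_b}) to the number of mutually
    cloned lines; pairs that are absent share no clones.
    """
    def weight(a, b):
        return shared_lines.get(frozenset((a, b)), 0)

    remaining = list(files)
    ordering = [remaining.pop(0)]  # start with the first file
    while remaining:
        last = ordering[-1]
        # Pick the remaining file with the most cloned lines shared with `last`.
        best = max(remaining, key=lambda f: weight(last, f))
        remaining.remove(best)
        ordering.append(best)
    return ordering


shared = {
    frozenset(("a", "c")): 40,
    frozenset(("c", "d")): 30,
    frozenset(("b", "d")): 5,
}
ordering = order_by_mutual_cloning(["a", "b", "c", "d"], shared)
```

A greedy strategy like this runs in quadratic time but gives no optimality guarantee, which is the usual trade-off for NP-complete ordering problems.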

Figure 7.24: Clone visualizer with files ordered by mutual cloning

Clone Filtering. Apart from postprocessing, clones can be filtered during inspection, so that developers do not have to wait until detection has been re-executed. Clones can be filtered based on a set of files or clone groups (both inclusively and exclusively), based on their length, number of instances, gap positions or blacklists. Clone filters are managed on a stack that can be displayed and edited in a view.
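
A filter stack of this kind can be sketched as follows. The class name, the dictionary layout of clone groups and the two example filters are hypothetical; they only illustrate how stacked filters can be applied and removed without re-running detection.

```python
class CloneFilterStack:
    """Stack of named predicates; a clone group is shown only if all accept it."""

    def __init__(self):
        self._filters = []

    def push(self, name, predicate):
        self._filters.append((name, predicate))

    def pop(self):
        return self._filters.pop()

    def names(self):
        return [name for name, _ in self._filters]

    def apply(self, clone_groups):
        return [g for g in clone_groups
                if all(predicate(g) for _, predicate in self._filters)]


# Hypothetical clone groups: length in statements and the files of the instances.
groups = [
    {"id": 1, "length": 12, "files": ["A.java", "B.java"]},
    {"id": 2, "length": 5, "files": ["A.java", "A.java", "C.java"]},
    {"id": 3, "length": 30, "files": ["gen/X.java", "gen/Y.java"]},
]

stack = CloneFilterStack()
stack.push("min length 10", lambda g: g["length"] >= 10)
stack.push("exclude gen/", lambda g: not any(f.startswith("gen/") for f in g["files"]))
visible = [g["id"] for g in stack.apply(groups)]
```

Popping the topmost filter restores the previous view, which mirrors the editing of the stack described above.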

Clone Indication. The goal of clone indication is to provision developers with cloning information while they are maintaining software that contains cloning to reduce the rate of unintentionally inconsistent modifications. It is integrated into the IDE in which developers work to reduce the effort required to access cloning information. We have implemented clone indication for both Eclipse¹⁶ and Microsoft Visual Studio .NET¹⁷ [72].
After clone detection has been performed, ConQAT displays so-called clone region markers in the editors associated with the corresponding artifacts, as depicted in Figure 7.25.

¹⁶ www.eclipse.org
¹⁷ www.microsoft.com/VisualStudio2010

Figure 7.25: Clone region marker indicates code cloning in editors.

Clone region markers indicate clones in the source code. A single bar indicates that exactly one clone instance can be found on this line; two bars indicate that two or more clone instances can be found. The bars are also color coded orange or red: orange bars indicate that all clones of the clone group are in this file; red bars indicate that at least one clone instance is in a different file. A right click on the clone region markers opens a context menu as shown in Figure 7.25. It allows developers to navigate to the siblings of the clone or open them in a clone inspection view.
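
The marker rules can be summarized in a few lines of code. The sketch below is an assumption-laden illustration, not the plugin's implementation: the function name and input format are invented, and the rule that a line covered by several clones turns red as soon as any of their groups reaches into another file is our reading of the description.

```python
def region_marker(clones_on_line, current_file):
    """Return (number_of_bars, color) for one editor line, or None without clones.

    `clones_on_line` holds one entry per clone instance covering the line; each
    entry lists the files that contain the instances of that clone's group.
    """
    if not clones_on_line:
        return None
    bars = 1 if len(clones_on_line) == 1 else 2
    crosses_files = any(f != current_file
                        for group_files in clones_on_line
                        for f in group_files)
    return (bars, "red" if crosses_files else "orange")
```

For example, a line covered by one clone whose siblings all live in the same file gets a single orange bar.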

Figure 7.26: Clone indication in VS.NET.

Figure 7.26 depicts a screenshot of clone indication in Visual Studio .NET.

Tailoring Support. For each iteration of the tailoring procedure, clone detection tailoring (cf. Section 8.2) requires computation of precision, and comparison of clone reports before and after tailoring. ConQAT provides tool support to make this feasible.
The order of the list of clone groups can be randomized. The first n clone groups then correspond to a random sample of size n. Each clone group can be rated as Accepted or Rejected. Both the list order and the rating are persisted when the clone report is stored. ConQAT can compute precision on the (sample of) rated clone groups.
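
The sampling and precision computation can be sketched as follows. The function names, the fixed seed and the rating dictionary are assumptions made for illustration; precision is computed as the share of rated clone groups that were rated Accepted.

```python
import random


def random_sample(clone_groups, n, seed=42):
    """Randomize the list order and take the first n groups as the sample."""
    order = list(clone_groups)
    random.Random(seed).shuffle(order)  # a fixed seed keeps the order reproducible
    return order[:n]


def precision(ratings):
    """Fraction of rated clone groups that were rated 'Accepted'."""
    rated = [r for r in ratings.values() if r in ("Accepted", "Rejected")]
    if not rated:
        raise ValueError("no rated clone groups")
    return rated.count("Accepted") / len(rated)


# Hypothetical ratings for a sample of four clone groups.
ratings = {"g1": "Accepted", "g2": "Rejected", "g3": "Accepted", "g4": "Accepted"}
sample_precision = precision(ratings)
```

Persisting the shuffled order, as the text describes, serves the same purpose as the fixed seed here: the sample stays stable across sessions.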
To compare clone reports before and after tailoring, they can be subtracted from each other, revealing which clones have been removed or added through a tailoring step. Two different subtraction modes can be applied:
Fingerprint-based subtraction compares clone reports using their location-independent clone fingerprints. It can be applied when tailoring is expected to leave the positions and normalized content of detected clones intact, e.g., when the filters employed during post-processing are modified.
Clone-region-based subtraction compares clone reports based on the code regions covered by clones. It can be applied when tailoring does not leave positions or normalized content intact, e.g., when the normalization is changed or shapers are introduced that clip clones. The clone report produced by differencing contains clones that represent introduced or removed cloning relationships between code regions.
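
Both subtraction modes boil down to set differences. The following sketch treats fingerprints as opaque strings and simplifies region-based subtraction to the set of covered lines; the real differencing produces clones that represent cloning relationships between regions, so this is a deliberately reduced illustration with hypothetical function names and data.

```python
def subtract_by_fingerprint(before, after):
    """`before`/`after` map a clone fingerprint to its clone group."""
    removed = {fp: group for fp, group in before.items() if fp not in after}
    added = {fp: group for fp, group in after.items() if fp not in before}
    return removed, added


def covered_lines(report):
    """`report` is a list of (file, start_line, end_line) clone regions."""
    return {(file, line)
            for file, start, end in report
            for line in range(start, end + 1)}


def subtract_by_region(before, after):
    lines_before, lines_after = covered_lines(before), covered_lines(after)
    return lines_before - lines_after, lines_after - lines_before


removed, added = subtract_by_fingerprint({"f1": "G1", "f2": "G2"},
                                         {"f2": "G2", "f3": "G3"})
lost, gained = subtract_by_region([("A", 10, 12)], [("A", 11, 13)])
```

The second call shows why fingerprints are unsuitable once clone positions shift: the region moved by one line, so only line-based comparison can relate the two reports.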

Clone Tracking. For deeper investigation of clone evolution, ConQAT supports interactive investigation of clone tracking results through a view that visualizes clone evolution, as depicted in Figure 7.27. Source code of clones can be opened for different software versions and clones of arbitrary versions can be compared with each other to facilitate comprehension of clone evolution. The visualization of clone evolution is loosely based on the visualization proposed by Göde in [83].

Figure 7.27: Interactive inspection of clone tracking

7.6 Comparison with other Clone Detectors

As stated in Section 3.3, a plethora of clone detection tools exists. The presentation of a novel tool, as done in this chapter, thus raises two questions. First, why was it developed? And, second, how does it compare to existing tools? This section answers both.
We created a novel tool, because no existing one was sufficiently extensible for our purposes. Both for our empirical studies, and to support clone control, we needed to adapt, extend or change characteristics of the clone detection process: tailoring affects both pre- and postprocessing; the novel algorithms affect the detection phase; and metric computation and tracking affect postprocessing. Since existing tools were either closed source, monolithic, not designed for extensibility or simply not available¹⁸, we designed our own tool. Since it is available as open source, others that might be in a similar situation may build on top of it, as is, e.g., done by [96, 180, 186].
The second question, how the clone detection workbench compares to other tools, is more difficult to answer, since the comparison of clone detectors is non-trivial. In the next sections, we briefly summarize challenges and existing approaches to clone detector comparison and then describe our detector based on an existing qualitative framework. Furthermore, we show that it can achieve high precision and that its recall is not smaller than that of comparable detectors.

¹⁸ http://wwwbroy.in.tum.de/~ccsm/icse09/

7.6.1 Comparison of Clone Detectors

The comparison of clone detectors is challenging for many reasons [200]: the detection techniques are very diverse; we lack standardized definitions of similarity and relevance; target languages, and the systems written in them, differ strongly; and detectors are often very sensitive to their configuration or tuning of their parameters. To cope with these challenges, two different approaches have been proposed: a qualitative [200] and a quantitative [19] one.

Qualitative Approach. In [200], Roy, Cordy and Koschke compare existing clone detectors qualitatively. Their comparison comprises two main elements. First, a system of categories, facets and attributes facilitates a structured description of the individual detectors. Second, mutation-based scenarios provide the foundation for a description of capabilities and shortcomings of existing approaches.
The qualitative comparison does not order the tools in terms of precision and recall. However, it does support users in their choice between different clone detectors: the systematic description and scenario-based evaluation provide detailed information on which such a choice can be founded, as the authors demonstrate exemplarily in [200]. We describe ConQAT using the description system and the scenario-based evaluation technique from [200] in Sections 7.6.2 and 7.6.3.

Quantitative Approach. In [19], Bellon et al. propose a quantitative approach to compare clone detectors. They quantitatively compare the results of several clone detectors for a number of target systems. The clone detectors were configured and tuned by their original authors. A subset of the clone candidates was rated by an independent human oracle. The target systems, the detected clones and the rating results are all available.
In principle, the Bellon benchmark offers an appealing basis, since it yields a direct comparison of the clone detectors in terms of precision and recall. To add a new tool to the benchmark, however, its detected clone candidates need to be rated. To be fair, the rating oracle must behave similarly to the original oracle. Stefan Bellon, who rated the clones in the original experiment, was not involved in the development of any of the participating clone detectors. He thus represented an independent party. In contrast, if we rate the results of our own clone detector, we could be biased. Furthermore, from our experience, classification of clones in code that others have written, without knowledge about, e.g., the employed generators, is hard. We thus expect it to contain a certain amount of subjectivity. For example, the benchmark contains clones in generated code that Bellon rated as relevant. We consider them as false positives, however, since the code does not get maintained directly. Even if we were not biased, it is thus unclear how well our rating behavior would compare with Bellon's.
Alternatively, we could reproduce the benchmark with a collection of up-to-date tools and target systems. The reproduction in its original style requires participation of the original authors and is thus beyond the scope of this thesis. However, if we execute their detectors ourselves, the results are likely to be biased. We simply have a lot more experience with our own tool than with their detectors. A second quantitative approach, which employs a mutation-based benchmark [197], is not feasible either: neither the benchmark, nor results for many existing clone detectors are publicly available. We are thus unable to perform a reliable quantitative comparison of ConQAT and other clone detectors on the basis of existing benchmarks.
Instead, we chose a different approach. We computed a lower bound for the recall of ConQAT on the Bellon benchmark data. For this, we analyze whether ConQAT can be configured to detect the reference clones detected by other tools. This way, we do not need an oracle for the clone candidates detected by ConQAT. We detail computation of recall in Section 7.6.4.
In addition, we computed precision for the systems that we analyzed during the case studies in Chapter 4. Their developers took part in clone detection tailoring and in clone rating. For the 5 study objects, we determined precision for type-2 and type-3 clones separately. For type-2 clones, precision ranged between 0.88 and 1.00, with an average of 0.96. For type-3 clones, it ranged between 0.61 and 1.00, with an average of 0.83. Lower precision of type-3 clones is due to the larger deviation tolerated between them. Average precision of over 95% for type-2 clones is, from our experience, high enough for continuous application of clone detection in industrial environments.
We measure precision and recall independently of each other. Strictly speaking, these experiments thus do not show that ConQAT can achieve high precision and recall at the same time, since improvement of one could come at the cost of the other. Please refer to Section 8.7 for a case study that demonstrates that clone detection tailoring can improve precision and maintain recall.

7.6.2 Systematic Description

In this section, we describe our clone detection workbench using the categories and facets from [200]. For simplicity, we refer to the clone detection workbench simply as “ConQAT”. We describe each category from [200] in a separate paragraph. Facet names for each category are depicted in italics. To simplify comparison with the other tools listed in [200], we give the abbreviations from [200] for the individual attributes in a facet in parentheses.

Usage describes tool usage constraints. Platform: ConQAT is platform independent (P.a). We have executed it on Windows, Linux, Mac OS, Solaris and HP-UX. External Dependencies: The clone detection workbench is part of ConQAT (D.d). All components used by ConQAT are also platform independent, except the Microsoft Visual Studio integration, which depends on Microsoft Visual Studio. Availability: ConQAT is available as open source (A.a). Its license allows its use for both research (A.d) and commercial purposes (A.c).

Interaction describes interaction between the user and the tool. User Interface: ConQAT provides both a command line interface and a graphical interface (U.c). The graphical interface can be used both for configuration and execution, and for interactive inspection of the results. Output: ConQAT provides both textual coordinates of cloning information and different visualizations (O.c). IDE Support: ConQAT comprises plugins for Eclipse (I.a) and Microsoft Visual Studio (I.b).

Language describes the languages that can be analyzed. Language Paradigm: ConQAT is not limited to a specific language paradigm (LP.c). We have applied it, e.g., to object-oriented (LP.b), procedural (LP.a), functional (LP.e) and modeling languages (LP.f). Language Support: ConQAT supports the programming languages ABAP, Ada, COBOL (LS.f), C (LS.b), C++ (LS.c), C# (LS.d), Java (LS.e), PL/I, PL/SQL, Python (LS.g), T-SQL and Visual Basic (LS.i). Furthermore, it supports the modeling language Matlab/Simulink and 15 natural languages, including German and English.

Clone Information describes the clone information the tool can emit. Clone Relation: ConQAT directly yields clone groups for type-1 and type-2 clones in sequences (R.b). Postprocessing can merge clone groups based on different criteria, e.g., overlapping gaps in type-3 clones (R.d). For model clone detection, pairs are combined during the clustering phase. Clone Granularity: ConQAT can produce clones of free granularity (G.a) or fixed granularity, if shapers are used. Shapers can trim clones to classes (G.e), functions/methods (G.b), basic blocks (G.c, G.d) or match arbitrary keywords or other language characteristics (G.g). Clone Type: ConQAT can detect type-1 (CT.a), type-2 (CT.b) and type-3 (CT.c) clones for code. Furthermore, it can detect model clones (CT.e).

Technical Aspects describe properties of the detection algorithms. Comparison Algorithm: ConQAT offers different detection algorithms, including a suffix tree based one for type-2 clones (CA.a), a suffix tree based one for type-3 clones that computes edit distance (CA.n) and an index-based one for type-2 clones (CA.q). Furthermore, it offers a subgraph-matching one for models (CA.k). Comparison Granularity: ConQAT supports different comparison granularities, namely lines (CU.a), tokens (CU.d), statements (CU.e) and model elements (CU.k). Worst Case Computational Complexity: The complexity depends on the employed algorithms. Please refer to Section 7.3 for details.

Adjustment describes the level of configurability of the tool. Pre-/Postprocessing: The open architecture of ConQAT allows configuration (including replacement) of all detection phases. Heuristic/Thresholds: ConQAT offers configurable thresholds for clone length (H.a) and gap size (H.c). Filters can be used to prune results (H.d). Normalization can be adapted to change the employed notion of similarity when comparing clones (H.b).

Processing describes how the tool analyzes, represents and transforms the target program for analysis. Basic Transformation/Normalization: Normalization is very configurable. It can, e.g., perform the following: optional removal of whitespace and comments (T.b, T.c); optional normalization of identifiers, types and literal values (T.e, T.f, T.g); and language specific transformations (T.h). Code Representation: Code can be represented as filtered strings in which comments may be removed (CR.d) or normalized tokens or token sequences (CR.f). Program Analysis: For text-based clone detection, ConQAT only requires regular expressions to filter input, e.g., remove comments (PA.b). For token or statement-based detection, ConQAT employs scanners (PA.d). ConQAT implements scanners for all languages listed under the “Language Support” facet above. For shaping, ConQAT employs shallow parsing (PA.c).

Evaluation describes how the tool has been evaluated. Empirical Validation: ConQAT has been employed in a number of empirical studies as reported in this thesis (E.b). Availability of Empirical Results: Many of the projects we analyzed with ConQAT are closed source. The detected clones thus cannot be published. Instead, we published aggregated results (AR.b). The results of the open source study object from Chapter 4 are available. The study can be reproduced (AR.a). Subject Systems: Most systems we analyzed are closed source (S.g).

7.6.3 Scenario-Based Evaluation

In this section, we evaluate ConQAT on the cloning scenarios from [200]. To make this section self-contained, we first repeat the scenarios from [200]. Then we describe the capabilities and limitations of ConQAT for each scenario.

Scenarios. Each scenario describes hypothetical program editing steps that, according to the authors, are representative for typical changes to copy & pasted code. Each edit sequence creates a clone from an original. All clones produced by the edit steps from scenario 1 are type-1 clones; scenario 2 yields 3 type-2 clones and 1 type-3 clone (S2(d)). Scenarios 3 and 4 yield type-3 clones, of which some are simions. Figure 7.28 shows the original in the middle and the clones, ordered by scenario, around it.
In the following sections, we first restate the scenario descriptions from [200] and then describe the capabilities and limitations of ConQAT for them. Afterwards, we discuss crosscutting aspects.

Scenario 1 from [200]: “A programmer copies a function that calculates the sum and product of a loop variable and calls another function, foo() with these values as parameters three times, making changes in whitespace in the first fragment (S1(a)), changes in commenting in the second (S1(b)), and changes in formatting in the third (S1(c)).”
Using the suffix tree or index-based detection algorithms for type-1 and type-2 clones, ConQAT can produce a single clone group that contains the original, S1(a), S1(b) and S1(c) as clones. For this, configure normalization to remove whitespace and comments, but not to normalize identifiers or literal values.

Scenario 2 from [200]: “The programmer makes four more copies of the function, using a systematic renaming of identifiers and literals in the first fragment (S2(a)), renaming the identifiers (but not necessarily systematically) in the second fragment (S2(b)), renaming data types and literal values (but not necessarily consistent) in the third fragment (S2(c)) and replacing some parameters with expressions in the fourth fragment (S2(d)).”
Using the same detection algorithms, ConQAT can produce a single clone group that contains the original, S2(a), S2(b) and S2(c) (and, in addition, S1(a-c)). For this, configure normalization to normalize identifiers (which takes care of S2(a) and S2(b)), type keywords and literal values (which takes care of S2(c)).
The last clone in this scenario, S2(d), is not of type-2, but of type-3. We discuss it in scenario 3.
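
The effect of the normalization settings for scenarios 1 and 2 can be illustrated with a toy normalizer. It is a simplified sketch, not ConQAT's scanner: the regular expression, the keyword sets and the option names are assumptions, and the snippets are shortened variants of the sumProd example from Figure 7.28.

```python
import re

# Comments, identifiers/keywords, numeric literals, then any other single character.
TOKEN = re.compile(r"//[^\n]*|/\*.*?\*/|[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", re.S)
TYPES = {"int", "float", "void"}
KEYWORDS = TYPES | {"for", "if", "while"}


def normalize(code, identifiers=False, types=False, literals=False):
    """Return the token sequence of `code`; whitespace and comments are always dropped."""
    tokens = []
    for tok in TOKEN.findall(code):
        if tok.startswith("//") or tok.startswith("/*"):
            continue  # comment
        if types and tok in TYPES:
            tok = "type"
        elif identifiers and re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEYWORDS:
            tok = "id"
        elif literals and re.fullmatch(r"\d+(?:\.\d+)?", tok):
            tok = "0"
        tokens.append(tok)
    return tokens


original = "float sum=0.0; //C1\nfloat prod =1.0;"
reformatted = "float  sum = 0.0;  // changed comment\n    float prod=1.0;"  # like S1
renamed = "float s=0.0; //C1\nfloat p =1.0;"                                # like S2(a)
retyped = "int sum=0; //C1\nint prod =1;"                                   # like S2(c)
```

With the default settings only the scenario 1 style variant equals the original; the renamed and retyped variants become equal once identifier, type and literal normalization are switched on.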

Figure 7.28: Scenarios from [200]. The figure shows the original sumProd function in the middle and, around it, the copy & paste variants S1(a)-S1(c), S2(a)-S2(d), S3(a)-S3(e) and S4(a)-S4(d).

Scenario 3 from [200]: “The programmer makes five more copies of the function and this time makes small insertions within a line in the first fragment (S3(a)), small deletions within a line in the second fragment (S3(b)), inserts some new lines in the third fragment (S3(c)), deletes some lines from the fourth fragment (S3(d)), and makes changes to some whole lines in the fifth fragment (S3(e)).”
The differences in these clones go beyond what ConQAT can eliminate through normalization. Hence, the algorithms for type-2 clone detection cannot detect them as complete clones of the original. However, the type-3 clone detection algorithm of ConQAT can be configured to detect them. For example, if configured to run on statements and to operate with an edit distance of 1, it can detect S3(a), S3(b), S3(c), S3(d) and S3(e) as clones of the original (and, with sufficient normalization, as clones of S1 and S2(a-c)). Since all of the resulting clone groups contain the original as common clone, ConQAT's postprocessing can optionally be configured to merge them into a single group.
If executed with an edit distance of 2, ConQAT also detects S2(d) as a clone of the original. However, clones from scenario 4 also only have a statement-level edit distance of 2 from the original. ConQAT thus cannot be configured in a way that does detect S2(d) as a clone of the original, but does not detect any clones from S4 as clones of the original.
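
To make the notion of statement-level edit distance concrete, the following sketch computes the classic Levenshtein distance over lists of statements. ConQAT's type-3 detection is suffix tree based, so this is not its algorithm; the function name and the statement lists, which are modeled on the sumProd example, are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two statement sequences."""
    prev = list(range(len(b) + 1))
    for i, stmt_a in enumerate(a, 1):
        cur = [i]
        for j, stmt_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # delete stmt_a
                           cur[j - 1] + 1,                     # insert stmt_b
                           prev[j - 1] + (stmt_a != stmt_b)))  # substitute or keep
        prev = cur
    return prev[-1]


original = ["float sum=0.0;", "float prod=1.0;", "for(int i=1;i<=n;i++){",
            "sum=sum+i;", "prod=prod*i;", "foo(sum,prod);", "}"]
changed_call = original[:5] + ["foo(sum,prod,n);"] + original[6:]  # one statement edited
deleted_line = original[:4] + original[5:]                         # one statement deleted
reordered = original[:3] + [original[4], original[3]] + original[5:]  # two swapped
```

A swap of two adjacent statements already costs two edits, which illustrates why reordering scenarios cannot be separated from other clones that also lie at distance 2.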

Scenario 4 from [200]: “The programmer makes four more copies of the function and this time reorders the data independent declarations in the first fragment (S4(a)), reorders data independent statements in the second (S4(b)), reorders data dependent statements in the third (S4(c)), and replaces a control statement with a different one in the fourth (S4(d)).”
Clones S4(a), S4(b) and S4(c) have a statement-level edit distance of 2 from the original, clone S4(d) a distance of 3. ConQAT can detect them, if configured with a sufficiently large edit distance. As above, it cannot be made to detect clones in S4 but not S2(d) or vice versa.

Discussion. Several configuration options influence ConQAT's results for all scenarios. A very small minimal clone length, say 2 statements, can produce groups that cover all 17 code fragments in the scenario. Too small minimal clone lengths can thus result in poor task-specific accuracy.
In addition, several tailoring effects are not obvious in the scenarios. First, ConQAT can produce clones that cross method boundaries. Shaping can be employed to avoid this. However, shaping can reduce recall, if the resulting clone fragments are shorter than the minimal clone length threshold. Second, increasing the edit distance for type-3 clone detection can also increase the number of false positives, since a high edit distance tolerates substantial difference on the code level. To a certain degree, this can be compensated with relative edit distance thresholds that take clone length into account (which is also supported by ConQAT).
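
How a relative threshold compensates for a generous absolute one can be shown in a few lines. The text does not define how ConQAT relates edit distance to clone length; the ratio of edit distance to clone length in statements, the function name and the threshold values below are assumptions.

```python
def accept_clone(edit_distance, clone_length, max_distance, max_ratio):
    """Accept a type-3 clone only if both the absolute and the relative limit hold."""
    if clone_length <= 0:
        raise ValueError("clone length must be positive")
    return (edit_distance <= max_distance
            and edit_distance / clone_length <= max_ratio)
```

With a maximal distance of 5 and a maximal ratio of 0.2, two edits are rejected in a 7-statement clone but accepted in a 20-statement clone.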

7.6.4 Recall

In this section, we show that the recall of ConQAT is not lower than the recall of existing text-based or token-based clone detectors. To do this, we compute a lower bound for the recall of ConQAT based on the Bellon benchmark data.

Study Design. The Bellon benchmark database contains reference clone pairs that Bellon rated as relevant for 8 systems (4 written in C, 4 written in Java, compare Table 7.2). One text-based detector (Duploc [62]) and two token-based detectors (CCFinder [121] and Dup [6]) participated in the benchmark. We compare ConQAT against the results of these detectors to investigate how ConQAT compares to clone detectors that employ a similar detection approach.
We selected all clone pairs produced by Duploc, CCFinder and Dup that are rated as relevant by Bellon from the benchmark. They represent the set of reference clone pairs. Then we executed ConQAT on the 8 systems to produce the candidate clone groups and compared them against the reference clone pairs. We computed the percentage of the reference clone pairs that are contained in the candidate clone groups as a lower bound for the recall of ConQAT. It is a lower bound, since potentially relevant candidate clone groups that are detected by ConQAT but not by the other tools are ignored.

Table 7.2: Recall (lower bound) w.r.t. benchmark

Program                  Language  Size (SLOC)  Recall
weltab                   C         11K          0.98
cook                     C         80K          0.91
snns                     C         115K         0.94
postgresql               C         235K         0.78
netbeans-javadoc         Java      19K          0.94
eclipse-ant              Java      35K          0.92
eclipse-jdtcore          Java      148K         0.92
j2sdk1.4.0-javax-swing   Java      204K         0.86

Implementation and Execution. We determined the reference clone pairs by extracting their positions from the benchmark database. We executed ConQAT with a tolerant configuration (minimal clone length of 5 statements, strong normalization) for type-2 clone detection on the study objects to produce the candidate clone groups.

Matching of reference pairs and candidate groups is performed as follows. A reference clone pair is considered as matched, if a candidate clone group contains two clones that exactly match its positions. Since we noticed slight line offsets for some of the clones in the benchmark, we tolerate deviations in clone start and end position of up to 2 lines. If, for example, a reference clone starts in line 10 and ends in line 21, it is matched by clone candidates that start in lines 8-12 and end in lines 19-23. (However, for a reference clone pair to be matched, both matching candidate clones need to belong to the same group.)
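
The matching rule can be sketched as follows. Clones are represented as hypothetical (file, start line, end line) tuples and the function names are invented; the tolerance of 2 lines and the requirement that both matching candidates belong to the same group come from the description above.

```python
def clone_matches(reference, candidate, tolerance=2):
    """True if the candidate covers the reference region up to `tolerance` lines."""
    ref_file, ref_start, ref_end = reference
    cand_file, cand_start, cand_end = candidate
    return (ref_file == cand_file
            and abs(ref_start - cand_start) <= tolerance
            and abs(ref_end - cand_end) <= tolerance)


def pair_is_matched(reference_pair, candidate_groups, tolerance=2):
    """True if one candidate group contains two distinct clones matching the pair."""
    first, second = reference_pair
    for group in candidate_groups:
        first_hits = {i for i, c in enumerate(group) if clone_matches(first, c, tolerance)}
        second_hits = {i for i, c in enumerate(group) if clone_matches(second, c, tolerance)}
        if any(i != j for i in first_hits for j in second_hits):
            return True
    return False


reference_pair = (("A", 10, 21), ("B", 30, 41))
same_group = [[("A", 8, 23), ("B", 31, 40)]]
split_groups = [[("A", 8, 23), ("C", 1, 12)], [("B", 31, 40), ("D", 1, 12)]]
```

The second candidate set contains matching clones for both sides of the pair, but in different groups, so the pair does not count as matched.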

For reference clone pairs that are not matched this way, we compute a match metric based on their lines. For each pair of lines between which a clone relationship exists, we check whether the same relationship also exists in the candidate clones. We illustrate this for a reference clone pair with the first clone in file A, lines 10-15, and the second clone in file B, lines 20-27. For it, we check for pairs (A:10, B:20), (A:11, B:21), (A:12, B:22), (A:13, B:23), (A:14, B:24) and (A:15, B:25). In our example, the first 4 pairs are also covered by a pair of candidate clones, yielding a clone match metric value of 0.67. We aggregate the clone match metrics according to the numbers of line pairs. This way, the match metric captures the change propagation use case encountered during clone management. If a developer fixes a bug in cloned code, and the sibling clones can be detected by one of Dup, Duploc and CCFinder, the metric determines the percentage with which ConQAT could also detect the clones.
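
The worked example above can be reproduced with a short sketch. It assumes that lines are paired by their offset from the clone start and that the shorter clone determines the number of pairs, which is consistent with the six pairs listed in the example; the function names and the tuple representation are hypothetical.

```python
def line_pairs(first, second):
    """Pair the lines of two clones (file, start, end) by their offset from the start."""
    (file_1, start_1, end_1), (file_2, start_2, end_2) = first, second
    count = min(end_1 - start_1, end_2 - start_2) + 1
    return [((file_1, start_1 + k), (file_2, start_2 + k)) for k in range(count)]


def match_metric(reference_pair, candidate_groups):
    """Fraction of reference line pairs also related by some candidate clone pair."""
    reference = line_pairs(*reference_pair)
    covered = set()
    for group in candidate_groups:
        for i, clone_a in enumerate(group):
            for j, clone_b in enumerate(group):
                if i != j:
                    covered.update(line_pairs(clone_a, clone_b))
    hits = sum(1 for pair in reference if pair in covered)
    return hits / len(reference)


reference_pair = (("A", 10, 15), ("B", 20, 27))
candidates = [[("A", 10, 13), ("B", 20, 23)]]
metric = match_metric(reference_pair, candidates)
```

Here the candidate clones cover the first four of the six reference line pairs, which gives the value of 0.67 (rounded) from the example.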

Results. The results are depicted in Table 7.2. For postgresql and j2sdk1.4.0-javax-swing, the recall value is below 90%. Manual inspection of the missed reference clones in these projects revealed that many of them are in generated code.¹⁹ For the other projects, the measured recall was above 90%.

¹⁹ Generated code is often highly redundant. For postgresql and j2sdk1.4.0-javax-swing, the matching process did not work well for clones in generated code, since candidate clones near the reference clones were longer or shorter or had slightly offset positions, so that the line pairs did not match.

Discussion. For 6 out of 8 projects, we measured a recall of over 90%. In addition, we compared ConQAT not to a single tool, but to a union of three comparable tools. In the original benchmark, many clones were only found by one or two tools. The results, together with the fact that the measure is a lower bound and that it compares against the joint results of three tools, thus demonstrate that ConQAT can be configured to have a recall similar to that of the text-based and token-based tools that participated in the Bellon benchmark.
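
The count of projects above the 90% mark can be checked directly against the values of Table 7.2. The snippet below only restates the table as a dictionary; the variable names are ours.

```python
# Recall (lower bound) per study object, taken from Table 7.2.
RECALL = {
    "weltab": 0.98,
    "cook": 0.91,
    "snns": 0.94,
    "postgresql": 0.78,
    "netbeans-javadoc": 0.94,
    "eclipse-ant": 0.92,
    "eclipse-jdtcore": 0.92,
    "j2sdk1.4.0-javax-swing": 0.86,
}

above_90 = [project for project, recall in RECALL.items() if recall > 0.90]
below_90 = sorted(set(RECALL) - set(above_90))
```

The two remaining projects are exactly those for which the results paragraph reports missed clones in generated code.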

7.6.5 Summary

In this section, we described our clone detection workbench according to the framework for qualitative clone detector comparison by Roy, Cordy and Koschke [200]. This has two purposes. First, it makes its capabilities and limitations explicit. Second, it supports its comparison with the tools in [200] and thus supports users in their choice among different clone detectors. Furthermore, we have demonstrated that ConQAT can be configured to achieve similar recall values as the text-based and token-based clone detectors that participated in the Bellon benchmark. A quantitative comparison of code clone detection with other detectors, as well as a thorough investigation of precision and recall for requirements specifications and models, remains a topic for future work.

7.7 Adoption and Maturity

The clone detection tool support described in this chapter is available as open source at http://www.conqat.org/. For source code clone detection, it currently supports the programming languages ABAP, Ada, COBOL, C++, C#, Java, PL/I, PL/SQL, Python, T-SQL and Visual Basic. For detection in natural language texts, stemming is supported for 15 languages, including German and English. At the time of writing, it has been downloaded over 18,000 times.
Since the tool support has matured beyond the stage of a research prototype, several companies have included it into their development or quality assessment processes, including ABB, Bayerisches Landeskriminalamt, BMW, Capgemini sd&m, itestra GmbH, Kabel Deutschland, Munich Re and Wincor Nixdorf.

7.8 Summary

This chapter presented the tool support proposed by this thesis that enables clone detection for different artifact types, including source code, requirements specifications and models. Results are presented in a customizable quality dashboard to support clone control with overview and trend information. Tooling for interactive clone inspection, in addition, supports in-depth inspection of clones. Plugins for two state-of-the-art IDEs help developers perform changes to cloned code consistently. Since it has matured beyond the stage of a research prototype, several companies have included it into their development or quality assessment processes.
Through its pipes & filters architecture, the clone detection workbench provides a family of clone detection tools that can be customized to suit different tasks. This flexibility and extensibility, and its availability as open source, have supported research not only by us, but also by others [24, 96, 104, 180, 186].

As part of the clone detection workbench, this chapter introduced novel detection algorithms for type-2 and type-3 clones that can be applied to detect clones in source code and in requirements specifications. It moreover introduced the first scalable detection algorithm for clones in dataflow models such as Matlab/Simulink.

The clone detection workbench, including the novel algorithms, provided the foundation for the experiments and case studies presented in this thesis. The type-3 clone detection approach enabled the analysis of the impact of unawareness of cloning on program correctness (Chapter 4). The model clone detection algorithm made the study of the extent of cloning in Matlab/Simulink models (Chapter 5) possible. Finally, the entire workbench provides the basis for the method of clone assessment and control presented in the next chapter.


8 Method for Clone Assessment and Control

This chapter introduces a method for clone assessment and control. Its goals are twofold: first, to inform stakeholders about the extent and impact of cloning in their software to allow for a substantiated decision on how cloning needs to be controlled; second, to alleviate the negative impact of cloning during software maintenance.

The first part of the chapter introduces the method, the second part its validation and evaluation. We demonstrate the applicability and effectiveness of the method through a longitudinal case study at Munich Re Group, where the application of clone assessment and control successfully reduced the amount of cloning in a large business information system. Parts of the content of this chapter have been published in [116].

8.1 Overview

This section outlines the goal and the steps of the clone assessment and control method that are presented in detail in the following sections. While, in principle, the method can be applied to cloning in other artifacts as well, this chapter focuses on cloning in source code.

The clone assessment and control method involves the roles quality engineer and developer. The quality engineer operates the clone detection tools and guides through clone assessment. The developer provides necessary system knowledge for the evaluation of clone relevance and evolution. Both roles can, in principle, be performed by the same person. Since they require different expertise, however, they are typically performed by different persons in practice. The method has two goals:

Goal 1: Inform stakeholders about the extent, impact and causes of cloning in their software.

Goal 2: Alleviate the negative impact of cloning during software maintenance.

The method comprises five steps. Steps one to three pursue goal 1, steps four and five goal 2:

Step 1: Clone Detection Tailoring. The quality engineer performs clone detection tailoring to achieve accurate clone detection results. During tailoring, the quality engineer incorporates developer feedback on the relevance of the detected clone candidates into the detection process to eliminate false positives. The result of this step is a set of accurate clone detection results.

Step 2: Assessment of Impact. The quality engineer computes a set of metrics that quantify the extent of cloning and thus allow for interpretation of the impact of cloning on maintenance activities. The result of this step is the quantification of the impact of cloning on maintenance activities and program correctness.

Step 3: Root Cause Analysis. The quality engineer analyzes detected clones and interviews developers to identify the major causes for cloning. The result of this step is a list of causes of cloning.

After clone assessment, the system stakeholders interpret the cloning metrics and causes to decide how to control cloning to reduce the negative impact of cloning on software development.

Step 4: Introduction of Clone Control. Both the quality engineers and the developers introduce clone control into their processes. The result of this step is thus a set of modified development and maintenance processes and habits.

Introduction of clone control into a software development project means change, not only to processes and tools, but also to established habits. For clone control to be applied successfully, it is thus not sufficient to overcome technical challenges. Instead, success hinges on whether habits are adapted accordingly. The steps to introduce clone control build on existing work on organizational change management [43, 130, 143–145, 152, 153, 225] to incorporate best practices on how to coerce established habits into new paths.

Step 5: Continuous Clone Control. The developers inspect the evolution of cloning on a regular basis to confirm that the control measures have taken the desired effect and, if necessary, schedule consolidation measures.

8.2 Clone Detection Tailoring

This section first introduces clone coupling as an explicit criterion to evaluate relevance of clones for software maintenance. Based thereon, it introduces clone detection tailoring as a procedure to achieve accurate clone detection results. Its goal is to remove false positives (clone candidates that are irrelevant to software maintenance due to a very low coupling) from the detection results, while keeping relevant clones, to improve accuracy.

8.2.1 Clone Coupling

The fundamental characteristic of relevant clones causing problems for software maintenance is their change coupling, i.e., the fact that changes to one clone may also need to be performed to its siblings. This change coupling is the root cause for increased modification effort and for the risk of introducing bugs due to inconsistent changes to cloned code, requirements specifications or models during software maintenance.

The coupling between clone candidates has a direct impact on software maintenance efforts. If clone candidates are coupled, each change to one also needs to be performed to its siblings. Each time one clone candidate is changed, effort is required for location, consistent modification and testing of the other clone candidate(s). In case the others are not modified, an inconsistency is introduced into the system. If the change was a bug fix, the unchanged clones still contain the bug. If, on the other hand, clone candidates are not coupled, a change to one never affects its siblings, requiring no additional effort for location, modification and testing.

This impact of cloning on modification effort is largely independent of other characteristics of clone candidates such as, e.g., their removability. Consequently, due to its implications for maintenance efforts, we propose to employ clone coupling as a criterion to evaluate the relevance of clone candidates for software maintenance.

8.2.2 Determining Clone Coupling

To use clone coupling as a relevance criterion, we need a procedure to determine it on real-world software systems. To be useful in practice, this procedure needs to be broadly applicable. We propose to employ developer assessments of clone candidate groups to estimate coupling, since they are not restricted to a specific system type, programming language, or analysis infrastructure. More specifically, assessors have to answer the following question:

Relevance Question 1: If you modify a clone candidate during maintenance, do you want to be informed about its siblings to be able to modify them accordingly?

This way, developers estimate whether they get a positive return on their effort to inspect the siblings when performing a modification to a clone candidate. The question partitions assessed clone candidate groups into two classes: relevant clone groups whose expected coupling is high enough to impede software maintenance, and groups whose expected coupling is so low that they are irrelevant to software maintenance.

8.2.3 Tailoring Procedure

The steps of the tailoring procedure are depicted in Figure 8.1. First, the quality engineer executes the clone detector with a tolerant initial configuration that aims to maximize recall. Second, developers assess coupling of the detected clone group candidates to identify false positives. Coupling is assessed on a sample of the candidate clone groups, since assessment of all clones is typically too expensive.¹ All candidate clone groups classified as uncoupled are treated as false positives. If no false positives are found, clone detection tailoring is complete.

If false positives are found, the clone detector configuration needs to be adapted to reduce the amount of false positives in the detection results. Which strategy is used for this typically depends on the detected false positives. The clone detector is then executed with the adapted configuration.

¹ As shown by the case study presented in Section 8.7, sampling does not negatively affect tailoring results.


Figure 8.1: Steps of the tailoring method. [Flowchart nodes: Run clone detector; Assess clone candidates; False positives? (No: Done; Yes: Re-configure clone detector); Re-run clone detector; Compare before and after; Accuracy satisfying? (Yes/No)]

To determine the effect of the re-configuration on result quality, the quality engineer compares results before and after re-configuration. More specifically, the quality engineer inspects whether the clone groups considered relevant are still contained in, and whether the irrelevant candidate clone groups are removed from, the new detection results. If the improvement of result accuracy is not satisfying, re-configuration and result evaluation is repeated. In case tailoring does not succeed to achieve both perfect precision and recall on the sampled candidate clones, one may be forced to make trade-offs on either precision or recall. From our experience, however, precision can substantially be increased without damaging recall (cf. Section 8.7). Furthermore, the case study presented in Section 8.7 confirms this.

In some cases, the majority of the candidate clone groups in the assessed sample are false positives, e.g., if the analyzed system contains a large amount of generated code. Even if they can successfully be removed in a single tailoring step, a further tailoring round may be required, since the original sample contained too few relevant clones to conclusively estimate precision. In this case, tailoring continues with another assessment (and possibly re-configuration, ...) step.
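The tailoring loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ConQAT's implementation: clone groups are represented by plain string identifiers, a configuration is a set of excluded path prefixes, and the callbacks `detect`, `is_coupled` and `reconfigure` are hypothetical stand-ins for the detector run, the developer assessment and the manual re-configuration.

```python
import random
from typing import Callable, FrozenSet, Set, Tuple

Config = FrozenSet[str]  # assumption: a configuration is a set of excluded path prefixes
Group = str              # assumption: a clone group is identified by a string


def tailor(
    detect: Callable[[Config], Set[Group]],
    is_coupled: Callable[[Group], bool],
    reconfigure: Callable[[Config, Set[Group]], Config],
    config: Config = frozenset(),
    sample_size: int = 20,
    max_rounds: int = 5,
    seed: int = 0,
) -> Tuple[Config, Set[Group]]:
    """Repeat assess / re-configure / compare until the sample has no false positives."""
    rng = random.Random(seed)
    results = detect(config)
    for _ in range(max_rounds):
        ordered = sorted(results)  # random.sample needs a sequence
        sample = set(rng.sample(ordered, min(sample_size, len(ordered))))
        relevant = {g for g in sample if is_coupled(g)}  # developer assessment
        false_positives = sample - relevant
        if not false_positives:
            break  # tailoring is complete
        new_config = reconfigure(config, false_positives)
        new_results = detect(new_config)
        # Compare before and after: relevant groups must still be detected.
        if not relevant <= new_results:
            break  # recall was damaged; keep the previous configuration
        config, results = new_config, new_results
    return config, results
```

In practice the comparison step is a manual inspection and may lead to a different re-configuration rather than to termination; the sketch simply stops when relevant groups would be lost.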

8.2.4 Taxonomy of False Positives

We give a short taxonomy of false positives based on the experiences gathered during clone detection tailoring in several industrial projects. It provides the basis of false positives characterization, which is the prerequisite of clone detector reconfiguration.

No conceptual relationship. The clone candidates are not implementations of a common concept, so no concept change can give rise to update anomalies. Hence, no coupled changes can occur that could result in inconsistencies.

Inconsistent manual modification impossible. Although a common concept can exist in this case, consistency of coupled changes is enforced by some means. For example, clone candidates in generated code are, upon change, regenerated consistently; a compiler enforces consistency between an interface and a NullObject implementation. Hence, no inconsistencies can be introduced through manual maintenance.

Artifacts that contain clone candidates are irrelevant. If code, specifications or models are no longer used, potential inconsistencies cannot do harm, at least as long as the artifact in question remains out of use.

While the likelihood of their appearance probably differs, these classes of false positives are not limited to a specific artifact type: overly tolerant detection can find clone candidates in code, models and requirements specifications that lack similar concepts; generators are not limited to source code or models, but are also employed to generate requirements specification documents from requirements management tools, possibly replicating information.

Importantly, the categories of the above taxonomy are orthogonal to the categorization of clone types for code or models that classify them based on the syntactic nature of their differences [86, 140]: type-1 clone candidates are no more likely to be relevant than type-3 clone candidates, if the file that contains them is no longer used. The crucial information, namely that the file is no longer used, is independent of the syntactic features of the clone candidate. Consequently, we cannot expect the problem of imperfect precision to be solved through the development of better detection algorithms that improve detection for certain syntactic classes. Instead, we need to identify other features to characterize false positives to exclude them.

8.2.5 Characterizing False Positives
Successful tailoring requires the identification of features that are characteristic for (a certain set of) false positives. Once they are known, the clone detector can be configured to handle artifact fragments that exhibit these features specially. Any attributes of source code, requirements specifications or models can, in principle, be candidates for such features. Examples include: the location in the namespace or directory structure; file name or file extension patterns; implemented interfaces or super types; occurrence of specific patterns in the source code, e.g., "This code was generated by a tool"; characteristic ways of structuring, e.g., sequences of constant declarations; identifiers of methods or types; location or role in the architecture.

There is no single, canonic way to determine characteristic features. However, we found that the reasons why developers consider candidate clones irrelevant often yield clues. We give examples for code clones in the following:

Code is unused: it will not be maintained. How can such dead code be recognized? Does it carry, e.g., Obsolete annotations as commonly encountered for .NET systems, or do affected types reside in a special namespace? If not, can developers produce a list of files, directories, types or namespaces that contain unused code?

Code is not maintained by hand since it is generated and regenerated upon change. Is generated code in a special folder or does it use a special file name or extension? Does it contain a signature string of the generator? If not, can it be made to do so?


Code has no conceptual relationship: maintenance is independent. This is typically encountered if the clone detector performs overly aggressive normalization, effectively removing all traces of the implemented concepts. Code then appears similar to the detector, despite the lack of a conceptual relationship that causes change coupling. Typical examples are regions of Java getters and setters or C# properties. Which language or system specific patterns can be used to recognize such code regions?

Compiler prevents inconsistent modifications. Examples are interfaces and NullObject² pattern implementations of the interfaces. Both interface and NullObject contain the same methods, down to identifiers and types. However, a developer is notified by the compiler that a change to the interface must be performed to the NullObject as well. The fact that the NullObject implements the interface can be a suitable characteristic.

Similar characteristics can often be found for irrelevant clone candidates contained in requirements specifications or models. As detailed in the tailoring case study for cloning in requirements specifications presented in Chapter 5, false positives could be recognized by patterns matching their content or their surrounding text.
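As a sketch of how such characteristic features might be turned into an automatic check, the following Python function flags files whose name or content suggests generated or obsolete code. The concrete patterns (a `.designer.cs` / `.generated.cs` naming convention, the generator signature string, the `[Obsolete]` attribute) are illustrative assumptions and must be replaced by the features actually found in the analyzed system.

```python
import re

# Assumed, illustrative features; real projects need their own patterns.
GENERATED_NAME = re.compile(r".*\.(designer|generated)\.cs$", re.IGNORECASE)
GENERATOR_SIGNATURE = re.compile(r"This code was generated by a tool", re.IGNORECASE)
OBSOLETE_ANNOTATION = re.compile(r"\[\s*Obsolete\b")


def is_likely_false_positive(path: str, content: str) -> bool:
    """Return True if the file shows a feature characteristic for false positives."""
    return bool(
        GENERATED_NAME.match(path)             # file name or extension pattern
        or GENERATOR_SIGNATURE.search(content)  # signature string of a generator
        or OBSOLETE_ANNOTATION.search(content)  # marker for unused code
    )
```

A check of this kind could feed the file exclusion of a clone detector; whether a flagged file really is irrelevant still has to be confirmed by developers.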

8.2.6 Clone Detector Configuration

Clone detector reconfiguration determines the success of clone detection tailoring: accuracy is only increased if reconfigurations are well conceived. Although automation is desirable, reconfiguration is currently a manual process.

Clone detector configuration incorporates characteristics of false positives into the detection process to remove them from the results. We outline configuration strategies applicable to our clone detector ConQAT (cf. Chapter 7). Again, we give the examples for source code. Similar strategies can be applied, however, to clone detector configuration for requirements or models.

Minimum clone length prevents the detection of clone candidates that are too short to be meaningful. It has a strong impact on the results. While one-token clone candidates are not very useful, too large values can significantly threaten recall. Still, excluding very short clone candidates is an effective strategy to increase precision without damaging recall.
Code exclusion removes source code from the detection, and thus prevents detection of clone candidates for certain code areas. ConQAT supports file exclusion based on name or content patterns. It also supports exclusion of code regions, which is crucial in environments where some regions of files are generated, whereas the remainder is hand maintained. This is, e.g., found in .NET development, where the GUI builder generated code is contained in a specific method in otherwise manually-maintained files.

Context sensitive normalization allows to apply different notions of similarity to different code regions. This way, equal identifiers and literal values can, e.g., be required for clone candidates in stereotype or repetitive code such as variable declaration sequences, getters and setters, or select/case cascades, while at the same time differences in literals and identifiers are tolerated for clone candidates in other code. Different heuristics and patterns for context sensitive normalization are available.

² NullObjects are empty interface implementations that reduce the number of required null checks in client code.

Clone shaping allows to trim clone candidates to syntactic structures such as methods or basic blocks. Clone candidates that are shorter than the minimal clone length after shaping are removed from the results. This can, e.g., be used to remove short clone candidates that contain the end of one and the beginning of another method without conveying meaning.

Post-detection clone filtering removes clone candidates from the detection results. ConQAT supports content-based filtering, removal of overlapping clone groups, gap-ratio based filtering for gapped clones and blacklisting for filtering based on location-independent fingerprints that are robust during system evolution. Blacklisting can be used to exclude individual clone candidates; it can thus be applied even if no suitable characteristics of false positives are known.

Re-configuration of any clone detection phase (preprocessing, detection, or post-processing) can improve accuracy.
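To make the post-processing strategies concrete, the following Python sketch applies a minimum clone length, a gap-ratio threshold and a fingerprint blacklist to a list of clone candidates. It does not use ConQAT's API. The `Clone` record, the parameter names and the definition of the gap ratio as gaps divided by clone length are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import AbstractSet, Iterable, List


@dataclass(frozen=True)
class Clone:
    fingerprint: str  # assumed location-independent identifier
    length: int       # clone length, e.g., in statements
    gaps: int = 0     # number of differing statements in a gapped (type-3) clone


def filter_clones(
    clones: Iterable[Clone],
    min_length: int = 10,
    max_gap_ratio: float = 0.2,
    blacklist: AbstractSet[str] = frozenset(),
) -> List[Clone]:
    """Keep only candidates that pass all three post-detection filters."""
    kept = []
    for clone in clones:
        if clone.length < max(min_length, 1):
            continue  # too short to be meaningful
        if clone.gaps / clone.length > max_gap_ratio:
            continue  # too many differences between the siblings
        if clone.fingerprint in blacklist:
            continue  # individually excluded by an assessor
        kept.append(clone)
    return kept
```

Suitable threshold values depend on the system; the defaults above are placeholders, not recommendations.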

8.2.7 Assessment Tool Support

Besides a configurable clone detector, further tooling is required to perform clone detection tailoring:

Clone assessment: dedicated tool support is crucial to achieve acceptable clone assessment productivity. Based on our experience from large industrial case studies [57, 111, 115, 116], it must support the generation of a random sample, store the assessment results for each clone group and offer a clone inspection viewer that displays two sibling clones side-by-side, providing syntax highlighting and coloring of differences between clones.

Comparison of clone reports: tool support is required to inspect the differences between two clone reports. This is necessary to investigate the impact of re-configuration on precision and recall.

Support for clone assessment and comparison of clone reports is available in ConQAT.
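The comparison of two clone reports can be pictured as a set difference over clone group identifiers. The sketch below is a simplified illustration that assumes each report is a collection of fingerprints that stay stable between detector runs; the function name and the keys of the returned dictionary are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def compare_reports(
    before: Iterable[str],
    after: Iterable[str],
    relevant: Set[str],
    irrelevant: Set[str],
) -> Dict[str, List[str]]:
    """Summarize how a re-configuration changed the detection results."""
    before_set, after_set = set(before), set(after)
    return {
        # assessed-relevant groups that disappeared: recall was damaged
        "lost_relevant": sorted((relevant & before_set) - after_set),
        # assessed-irrelevant groups still reported: precision not yet improved
        "remaining_irrelevant": sorted(irrelevant & after_set),
        "removed": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
    }
```

An empty `lost_relevant` list together with an empty `remaining_irrelevant` list indicates, for the assessed sample, that the re-configuration removed the known false positives without losing known relevant clones.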

8.3 Assessment of Impact

This section follows a "goal, question, metric" (GQM) approach [11] to introduce the metrics employed to quantify the impact of cloning.

8.3.1 Goal

The goal of clone assessment is to quantify the impact of cloning in terms that reveal its effect on software engineering activities. More specifically, the goal is to quantify the impact of cloning on maintenance effort and program correctness. We hence need metrics that capture significant properties influenced by cloning.

We summarize the goal of clone assessment using the goal definition template as proposed in [234]. Since we do not perform a single assessment, as GQM is mainly targeted for, but rather provide the foundation for a class of assessments, we do not apply GQM directly but instead employ it to guide the presentation.

Analyze                  cloning in software artifacts, including but not limited to source code, requirements specifications and models
for the purpose of       characterization and quantification
with respect to          its impact on maintenance effort and program correctness
from the viewpoint of    software engineer, independent of role, e.g., manager, developer, quality assurance engineer
in the context of        projects that develop or maintain software

8.3.2 Questions

The measurement goal can be broken down into several questions that help to quantify the different impacts of cloning. The questions are, on purpose, independent of the artifact type in which cloning occurs.

Q1: How large is size-increase due to cloning?

Duplication increases the size of an artifact. Duplicated code increases the LOC that need to be tested and maintained, requirements duplication increases the number of sentences that need to be read; similarly, model cloning increases the number of model elements that need to be quality assured and maintained.

Q2: How large is expected modification-size-increase due to cloning?

If a clone is modified, the modification typically needs to be performed to its siblings as well. This increases the number of statements, sentences or model elements that need to be modified (the modification-size) to implement a change.

Q3: If a single element contains a fault, with which probability is this fault cloned?

If an artifact element contains a fault, its clones are likely to contain it as well. If, e.g., a code clone lacks a null check, it is missing in its siblings as well. If a requirement clone contains a wrong precondition, it is likely to be wrong in its siblings as well. And, accordingly, if an adder block in a Matlab/Simulink model receives the wrong parameter as input, it is likely to be wrong in its siblings as well.

Q4: How many clone groups and clones does an artifact contain?


The number of clones and clone groups determines effort required for clone inspection and clone consolidation.

Q5: How likely is a coupled change unintentionally not performed to all affected clones?

If a problem domain concept (whose information is duplicated among the clones of a clone group) changes, the clones need to be adapted accordingly. How likely are developers to be unaware of all clones, and thus to not perform the change consistently to all affected clones?

Q6: How likely does an unintentionally inconsistent change indicate a fault?

This question reflects how often a change to cloned artifacts that unintentionally does not get performed consistently to all affected clones introduces a new fault or fails to remove an existing fault. It thus captures how unawareness of cloning affects correctness.

8.3.3 Metrics

Overhead quantifies the size increase due to cloning (cf. Section 2.5.4). Relative overhead quantifies the size increase caused by cloning and can thus be used to answer question Q1. Assuming that cloned artifact fragments are as likely to be modified as non-cloned fragments, it can also be used to answer question Q2, as the relative modification size increase then corresponds to the relative overhead.

Clone Coverage is the probability that an arbitrarily chosen unit in an artifact is covered by at least one clone (cf. Section 2.5.5). Assuming that statements, sentences or model elements that contain faults are equally likely to be cloned as those that do not, it can be used to answer question Q3.

Clone coverage can also be employed to answer related questions: during a requirements specification inspection, how likely will the sentence you just read occur again in another section of the document at least once? How likely will you have to perform the modification you just did to a single statement, sentence or model element at least once more?

Counts denote the numbers of clone groups and clones in an artifact. Clone group count and clone count answer question Q4.

Unintentionally Inconsistent Clone Ratio (UICR) captures the likelihood that the differences between type-3 clones in a clone group are unintentional (cf. Section 4.2). It thus captures the lack of awareness of cloning during maintenance and answers question Q5.


Faulty Unintentionally Inconsistent Clone Ratio (FUICR) captures the likelihood that the differences between unintentionally inconsistent type-3 clones in a clone group indicate at least one fault (cf. Section 4.2). It thus captures the impact of the lack of awareness of cloning on correctness and answers Q6.

All metrics are computed on tailored clone detection results. Overhead, clone coverage and clone counts can be computed fully automatically, as is, e.g., done by ConQAT (cf. Chapter 7). The metrics UICR and FUICR are determined by developer assessments of type-3 clone groups. If the number of type-3 clone groups is too large, rating can be limited to a sample. The metrics are computed as described in Section 4.2.
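The following Python sketch indicates how these metrics might be computed from tailored detection results. The precise definitions are given in Sections 2.5.4, 2.5.5 and 4.2, which are not reproduced here, so the formulas below are assumptions for illustration. Relative overhead is taken as artifact size divided by redundancy-free size, minus one. UICR is taken as the share of inconsistent clone groups rated unintentional, and FUICR as the share of unintentionally inconsistent clone groups rated faulty.

```python
from typing import Iterable, Tuple


def clone_coverage(total_units: int, clone_regions: Iterable[Tuple[int, int]]) -> float:
    """Fraction of units (e.g., statements) covered by at least one clone.

    Regions are (start, end) unit indices, both inclusive.
    """
    covered = set()
    for start, end in clone_regions:
        covered.update(range(start, end + 1))
    return len(covered) / total_units


def relative_overhead(size: int, redundancy_free_size: int) -> float:
    """Assumed definition: relative size increase caused by cloning."""
    return size / redundancy_free_size - 1


def uicr(inconsistent_groups: int, unintentional_groups: int) -> float:
    """Assumed definition: share of inconsistent groups that are unintentional."""
    return unintentional_groups / inconsistent_groups


def fuicr(unintentional_groups: int, faulty_groups: int) -> float:
    """Assumed definition: share of unintentional groups that contain a fault."""
    return faulty_groups / unintentional_groups
```

Overlapping clone regions are counted once in `clone_coverage`, which matches the reading of coverage as "covered by at least one clone".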

8.3.4 Discussion

Contribution. While the metrics UICR and FUICR are novel, the other metrics have been proposed before and are, as in the case of clone coverage or clone counts, computed by existing clone detection tools. The novelty of the proposed clone assessment method thus resides not so much in the novelty of its metrics. Instead, its contribution is twofold: first, the metrics capture both impact on maintenance effort increase (overhead and coverage) and program correctness (FUICR). Second, and more importantly, they are computed not on clone candidates, but on tailored clone detection results and thus on clones that exhibit clone coupling. The metrics thus allow for more reliable interpretation w.r.t. the impact of cloning on maintenance activities than metrics computed on untailored clone detection results for which precision is unknown.

Efforts. Both clone detection tailoring and metric computation are not cost-free. Since free industrial strength clone detectors are available (such as the one proposed by this thesis), the main cost driver is the involved developer time. Since the actual detection times are fast for software of typical size (cf. Chapter 7), waiting times do not account for much; most of the effort is required for developer assessments of clones that are performed to tailor detection results and rate clones to determine UICR and FUICR.

However, according to our experiences from, e.g., the case study in Chapter 4, the faults discovered during inspection of type-3 clones can amortize these efforts. In one system, for example, we discovered a type-3 clone group in which one clone contained a comment with an issue tracker ticket number indicating a fixed bug. Its siblings, however, still contained the bug. The issue tracker entry documented a lengthy and costly process: the bug had been discovered in the field, had been triaged by a group of experts, discussed by a control board and classified as sufficiently critical to be fixed in the next release. Then it had been fixed by a developer and verified by a tester. The cost for this process, according to the developers involved in the study, exceeded the effort gone into clone assessment. In other words, the effort was accounted for by the single fault we found, since it could be fixed and tested without requiring the costly triage and quality control board process. The additional faults that were found during that analysis increased the return on investment of the effort invested into clone assessment. While there is obviously no guarantee that the found faults amortize or exceed the costs, we have repeatedly received the feedback from the involved stakeholders that clone assessment was well worth the effort.


Properties of Clones and Cloned Code. As mentioned in Section 8.2.1, the impact of cloning is determined by clone coupling, which is independent of whether clones can be removed using the abstraction mechanism available for the artifact type. Removability of the clones is thus not reflected in the metrics.

The interpretation of overhead as an estimator for modification-size-increase assumes that cloned artifact fragments are as likely to be affected by change as non-cloned ones. For source code, this assumption has been studied by several researchers. The results from Jens Krinke seem to contradict it: in [148], he reports that cloned code is more stable than non-cloned code. However, in a later study, Nils Göde uses a more sophisticated clone tracking scheme and reports that stability of cloned versus non-cloned code varies between the analyzed systems [83] and is thus hard to generalize. Lacking generalizable results whether cloned code is more or less stable than non-cloned code, and lacking any empirical data for other artifacts such as requirements specifications and models, we assume that it does not differ in stability. Future work is required to better understand the relationship between cloning and stability. In case it varies substantially, it could be included as an additional metric into a future, extended clone assessment method.

The interpretation of clone coverage as the likelihood that faults are cloned assumes that faulty artifact units are as likely to be cloned as non-faulty ones. Again, we have little empirical data that sheds light on fault densities: we are not aware of any studies for requirements specifications or models and only of a single study that compares fault densities for cloned and non-cloned code [189]. In addition, since the authors do not employ clone tailoring, according to the terminology of this thesis, their study analyzes clone candidates, not clones; the applicability of their results is thus unclear. Consequently, further research is required to better understand the fault densities for cloned and non-cloned artifact fragments. Lacking empirical data, we assume fault densities to be similar for cloned and non-cloned code. Alternatively, a future, extended version of the clone assessment method could incorporate a metric that reflects the differences between the two.

8.4 Root Cause Analysis

Besides assurance of consistent evolution of existing clones, an important function of successful clone control is the prevention of new ones. Various causes urge maintainers to create clones; please refer to Section 2.2.2 for an overview. In many cases, cloning is performed to work around problems in the maintenance environment. As long as these causes for cloning remain, maintainers are likely to continue to create clones in response. Hence, for clone prevention to be effective, the causes for cloning need to be determined and rectified.

Existing work on clone prevention focuses on monitoring of changes to the source code [149]. Changes that introduce new clones are identified and need to pass a special approval process to be allowed to be added to the system. While such an approach can help to spot clones early, it is limited to analysis of the symptoms (the clones) and ignores their cause. Such approaches thus need to be complemented with a root cause analysis that determines the forces driving clone creation. This section presents a list of root causes.
The causes for cloning are diverse; suitable solutions thus differ substantially. Their heterogeneity rules out a single, canonical recipe for root cause analysis. Instead, we list the causes and countermeasures in the form of patterns. Many of the examples described below stem from four years experience of analysis of cloning in industrial software, often with partners outside those mentioned in Section 2.7.³ Where fitting, we also give examples from the literature. This list is not complete. Its extension remains an important topic for future work.
The list focuses on causes for cloning in the maintenance environment. Inherent causes, such as difficulty of abstraction creation (cf. Section 2.2.2), are not considered for two reasons: first, being inherent, they cannot be rectified through changes to the maintenance environment; second, the resulting clones can be consolidated at a later point, e.g., when more information about the instances of a certain abstraction is available. We list the patterns in alphabetic order.

Pattern Template. Each cause is described following a fixed template. Its cause describes the underlying problem. Its solution describes possible measures that can be used to solve the problem. Its examples document occurrences in the literature or experiences we gathered in practice. Finally, its limitations document constraints that restrict applicability of the solutions.

8.4.1 Broken Generator

Cause: Code that was originally generated is now maintained manually.

Solution: Separate hand-written and generated code. If the generated code needs to be augmented manually, use, e.g., the Generation Gap pattern [224] to place it in different files. Do not commit generated code to the version control system. Instead, re-generate it automatically every time its input artifacts change. This reduces the probability that small fixes are directly introduced into the generated code that effectively break the possibility to regenerate it.

Examples: In some business information systems we analyzed, one-shot generators had been employed. They had generated code entities with "holes" that were later filled in manually. This resulted in large amounts of cloning.

Another project we analyzed initially employed a UML tool that generated classes from diagrams. The UML tool generated stereotype code for, e.g., association handling and object lifecycle that is duplicated between classes. This did not represent a problem as long as the tool was used, since it maintained the duplication. However, at some point, the UML tool was abandoned. All code, including the generated duplication, now gets maintained by hand.

A third project we analyzed inherited a component from another team. That team employed a code generator. However, the generator is now lost. Furthermore, it is unknown whether the generated code has later been modified by hand. Consequently, it now gets maintained manually.

Limitations: If hand-written and generated code have been mixed long ago, their separation can be tedious. However, such complexity is accidental. We see no inherent reason that prevents complete separation of generated and hand-written code.
³ For nondisclosure reasons, we cannot give more details on the company, domain or analyzed software.
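The separation of generated and hand-written code recommended for the Broken Generator pattern can be illustrated with a minimal Generation Gap sketch in Python. The class names and the file names in the comments are hypothetical. The point is only that manual additions live in a subclass, so the generated base class can be overwritten by the generator at any time.

```python
# --- customer_base_generated.py (hypothetical file; regenerated, never edited by hand) ---
class CustomerBase:
    """Generated stereotype code, e.g., attribute handling and serialization."""

    def __init__(self, name: str) -> None:
        self.name = name

    def to_dict(self) -> dict:
        return {"name": self.name}


# --- customer.py (hypothetical file; hand-written and kept under version control) ---
class Customer(CustomerBase):
    """Manual augmentation lives in the subclass, outside the generated file."""

    def greeting(self) -> str:
        return f"Hello, {self.name}"
```

Client code only uses `Customer`; regenerating `CustomerBase` does not touch the hand-written method.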

8.4.2 Insufficient Abstraction Skills

Cause: The maintainers lack some of the skills required to create reusable abstractions.

Solution: Educate the maintainers in the required skills.

Examples: Even if language limitations rule out one way of creating a shared abstraction, often other, sometimes less obvious, ways exist. Many design patterns offer such ways. For example, if two fragments of code differ in one method they call, Java does not allow to introduce a parameter for this method, since it does not support function types. However, the design patterns Template Method and Visitor [74], e.g., support such cases through the use of inheritance and polymorphism. To consolidate cloning, refactoring can reduce the required effort and likelihood of errors.

At one of our industrial partners, a cross-cutting concern was cloned between the underlying framework and all components that were developed on top of it. The application of the Template Method pattern allowed consolidation of a substantial part of the clones: the common code was moved into the framework base classes, the variability delegated to abstract hook methods that were implemented by the derived classes in the components.

Limitations: The available abstractions, patterns and refactorings differ between programming languages.
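The consolidation via Template Method described in the examples above can be sketched as follows. The code is Python rather than Java, and all class and method names are invented for illustration; it does not reproduce the industrial partner's framework. The formerly cloned common code sits in the base class, and each component only implements the hook method that captures its variability.

```python
from abc import ABC, abstractmethod
from typing import List


class ComponentBase(ABC):
    """Framework base class holding the formerly cloned common code."""

    def handle(self, request: str) -> List[str]:
        trace = [f"begin {request}"]         # common code, previously duplicated
        trace.append(self.process(request))  # variability delegated to the hook
        trace.append(f"end {request}")       # common code, previously duplicated
        return trace

    @abstractmethod
    def process(self, request: str) -> str:
        """Hook method implemented by each component."""


class UpperCaseComponent(ComponentBase):
    def process(self, request: str) -> str:
        return request.upper()


class ReversingComponent(ComponentBase):
    def process(self, request: str) -> str:
        return request[::-1]
```

A change to the common part, such as an additional trace entry, is now made once in `handle` instead of in every component.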

8.4.3 Language Limitations

Cause: The available abstraction mechanism does not allow to introduce the necessary parameters to create a reusable abstraction.

Solution: The direct solution is to augment the abstraction mechanism to support the required parameterization. If this is unfeasible, use specific tools that complement the language.

Examples: The quality analysis toolkit ConQAT, on top of which the tool support proposed by this thesis is constructed, implements its own domain specific language to specify program analyses. Its initial version did not have a reuse mechanism for recurring specification fragments. The initial analyses, thus, contained clones. In response, a later version introduced an abstraction mechanism that allows for structured reuse.

General purpose programming languages like Java do not allow for encapsulation of cross-cutting concerns. Concerns such as logging, tracing or precondition checking, hence, are duplicated. One of our industrial partners introduced aspect oriented programming techniques to factor out the cloned tracing code.

Limitations: Many commonly used abstraction mechanisms, e.g., those in general purpose programming languages, cannot be extended by their users. Aspect oriented programming or generators, however, can sometimes be employed.

8 Method for Clone Assessment and Control

8.4.4 No Consolidation of Exploratory Cloning

Cause Inherent causes for cloning, such as difficulty of creating abstractions or prototypical realization of changes to understand their impact, disappear with time (cf. Section 2.2.2). Cloning can then be consolidated. This does not always happen in practice.
Solution Establish clone control, as presented below, to track such clones. Schedule resources for their removal as soon as they can be consolidated, while their removal is still cheap.
Examples In several of the industrial projects we analyzed, we found code implementing features with similar business functionality. Parts of them had been implemented via cloning. Repository analysis revealed that cloning had also been used for prototypical implementation in other areas of the application. However, in these areas, it was later consolidated, as the commonalities and differences between the features became clear. Developers reported that many of the remaining clones had originally been meant to be consolidated. However, due to time pressure and interruptions, the consolidation was postponed and then forgotten.
Limitations The longer clones remain in a system, the more efforts can arise for their consolidation. Clones should thus be removed early, to avoid additional efforts for familiarization and quality assurance.

8.4.5 Unreliable Test Process

Cause The test process, especially regression testing, is unreliable. In response, maintainers do not trust it to discover faults introduced during maintenance. Instead of changing code to make it reusable, copies are created, to avoid risk of breaking existing code.
Solution Improve the test process.
Examples Jim Cordy [40] reports on the reluctance of maintainers in the financial sector to consolidate cloning, to avoid the risk of breaking running systems. Increased confidence of the maintainers in the test processes could reduce their reluctance.
One company we worked with was in a similar situation. Their test process was entirely manual; not a single test case was automated. In consequence, determining that a change only had the intended impact was infeasible: apart from the costs of manual test execution, it was not always clear which test cases were potentially affected by a change. The resulting reluctance to modify existing code led to a steady increase in cloning.
Limitations As any process change, improving a test process requires planning, organizational change management and resources.

8.4.6 Unsuited Reuse Process

Cause The organization does not have a suitable reuse process that governs the creation and maintenance of shared code (footnote 4). Unsuited reuse processes can occur in different forms, e.g.:
Reuse process is missing.
Restrictive code ownership impedes modifications necessary to reuse existing code.
Solution Change process to facilitate creation and maintenance of shared code.
Examples At one company, a cause of cross project cloning was the absence of a reuse process [120]. The company simply had no code entities that were shared between projects, and consequently no process for its maintenance. Lacking, e.g., a common library into which to place shared code, the developers copied it between projects. As a solution, the company plans to introduce a commons library and a maintenance process for it.
Restrictive code ownership is frequently mentioned as a reason for cloning in the literature [201]. Collective code ownership, as, e.g., advocated by agile development methods [18, 71] presents a suitable alternative.
Limitations Both establishing and changing a reuse process require planning and organizational change management. Switching from restrictive to collective code ownership might require adaptations of other processes, such as quality assurance, if it was ownership-based.
Footnote 4: We use the term shared code in a way that does not subsume cloned code.

8.4.7 Wrong Description Mechanism

Cause The description technique employed to implement a piece of software is inappropriate. As a consequence, high level operations are interspersed with repetitive sequences of low level commands.
Solution Use a more appropriate description technique. For example, use a domain specific language in which the high-level operations are encoded and a generator that adds low-level commands and transforms it into executable artifacts. Or use an internal DSL to, e.g., separate test data construction from test logic.
Examples One of the business information systems we analyzed started off with a manually written (and maintained) persistency layer. Storage of objects in a relational database (and, correspondingly, their retrieval) followed stereotype patterns. For each object attribute (high-level information), a number of low-level storage and retrieval commands were implemented, resulting in large amounts of similar code. In a later version, the company replaced this code with a generated O/R mapper.
A second example are APIs used to program graphical user interfaces. Each instantiation of a widget (high-level operation) requires a sequence of (low-level) method and constructor calls. Since API constraints govern their shape and order, the resulting code looks similar [1, 123]. Again, high level operations (place this widget over there, looking as such) are interspersed with low level information (how to construct the widget, how to allocate and dispose of its resources, ...). Again, code generators have been developed that allow the composition and maintenance of graphical user interfaces on a higher level of abstraction.
Automated tests require test objects on which the functionality under test operates. Often, these test objects are constructed programmatically. Again, high-level operations (which objects to combine) are interspersed with numerous low-level constructor and setter calls. As a solution, describe test object construction using internal or external DSLs that allow test object specification on an appropriate level of abstraction.
Limitations Suitable domain specific languages or generators might not be available. The costs for their construction and maintenance constrain their use.
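To illustrate the internal DSL option for test object construction, the following Python sketch shows a small fluent builder. It is an illustration under assumptions, not code from any of the analyzed systems: `Order`, `OrderBuilder`, `an_order` and all field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    # Sensible defaults, so tests only state what matters to them.
    customer: str = "default customer"
    country: str = "DE"
    items: list = field(default_factory=list)


class OrderBuilder:
    """Tiny internal DSL: each call states one high-level fact about the object."""

    def __init__(self):
        self._order = Order()

    def for_customer(self, name):
        self._order.customer = name
        return self  # returning self enables call chaining

    def shipped_to(self, country):
        self._order.country = country
        return self

    def with_item(self, name, quantity=1):
        self._order.items.append((name, quantity))
        return self

    def build(self):
        return self._order


def an_order():
    """Entry point of the DSL."""
    return OrderBuilder()


# The test states which objects to combine; low-level setter calls stay hidden.
order = an_order().for_customer("ACME").with_item("bolt", 3).build()
```

The low-level constructor and setter calls live in one place, the builder, so they are no longer repeated in every test that needs a similar object.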

8.4.8 Summary

The analysis of causes of cloning can reveal problems in the maintenance process. These problems can have severe consequences for software maintenance far beyond their impact on cloning: working on the wrong level of abstraction creates unnecessary effort; insufficient developer skills threaten many quality attributes of a software system; and reluctance to change existing code due to an unreliable test process inhibits maintenance in general and not only consolidation of cloning. Root cause analysis of cloning offers one tool to spot such problems. If employed during clone control, it can help to identify such problems early and thus help to contain the damage they can cause.
The rectification of a cause for cloning must make economic sense. Its expected savings, both in terms of reduced impact of cloning and on software maintenance in general, must exceed the expected costs. Clone prevention thus involves trade-off decisions. These trade-offs can shift over time. A cause that initially appears to be negligible can become important, as its impact becomes obvious. In addition, causes that are expensive to fix now can become cheaper, as technology advances. Timely root cause analysis enables a substantiated decision on whether to act, or whether to accept the consequences and control the resulting clones. Furthermore, if performed as a part of continuous clone control, the decisions can be reevaluated, as additional information becomes available.

Lasting Impact Clone assessment and root cause analysis alone, however, are unlikely to have a lasting impact on the cloning in a system. If the negative impact of cloning is to be reduced, specific actions must be taken.
The project stakeholders thus need to make a decision whether the impact of cloning is acceptable for their software project, or whether any actions should be taken to alleviate the impact or reduce the amount of cloning in a system. In real-world software projects, the question is more likely which actions are appropriate than whether any actions need to be taken at all: in the few times we encountered software systems with very low cloning metrics, effective clone control measures were already in place.
The next section provides a method to introduce clone control that helps to alleviate the negative impact of cloning on software maintenance activities.

8.5 Introduction of Clone Control

If clone control is to be applied continuously during maintenance, established development habits need to be overcome. The goal of organizational change management is to facilitate such change processes. Below we summarize an organizational change management process from [225] that has been adapted for the introduction of quality control measures. Its steps provide the basis for the introduction of clone control.

Convince Stakeholders and establish a sense of urgency about the negative impact of cloning for the software system to build up enough momentum. The intended result of this step is motivation among the stakeholders to introduce clone control.

Create a Guiding Coalition that includes key persons to introduce clone control into the development process. Identify all required roles and persons to avoid delay. The result of this step is the task force that will initiate and perform the actions required to introduce clone control into the development process.

Communicate Change to all stakeholders affected by clone control to achieve transparency and reduce anxiety possibly created by a sense of being controlled or measured. The result of this step is knowledge of the introduced clone control tools and measures.

Establish Short-term Wins to provide payoffs for investments made so far and bolster motivation. These include fixing of encountered bugs and removal of easily removable clones. The result of this step is the improvement of the software system's quality.

Make Change Permanent by tracking clones to reward removal of existing clones and notice introduction of new ones. The result of this step is awareness of the evolution of cloning in the system and the lasting application of clone control. This step of organizational change management is performed by the fifth step of the method, continuous clone control.
In principle, the method presented in this chapter focuses on points in which computer science can help organizational change management. It does not target points that are not primarily computer science territory, such as, e.g., expectation management, conflict management or communication inside an organization. It thus complements existing approaches for organizational change management and does not replace them. The remainder of this section describes the individual steps of the introduction of clone control in more detail.

8.5.1 Convince Stakeholders

Introduction of clone control needs resources. For them, it competes against other tasks in a project. In order for clone control to be initiated, the required resources must be allocated. This demands conviction among all involved stakeholders that clone control is both necessary and urgent, else it will not happen or be delayed.
For a software system in production, cloning is not merely an issue affecting maintenance in the distant future. Instead, it not only affects the present but already affected past maintenance. In other words, the impact of cloning already affects the stakeholders. From our experience, even in systems that are substantially impacted by the negative impact of cloning, this is not clear to stakeholders. It is hence a key fact in establishing a sense of urgency among them.
To establish that the negative impact of cloning already has affected development and continues to do so, results from clone assessment are employed. From our experience, it fosters understanding if the impact of cloning is presented in two ways: on the level of individual software artifacts, to provide tangible examples, and on the level of the whole system, to put cloning into context. On the level of individual artifacts, examples of inconsistent evolution tangibly demonstrate that cloning threatens program correctness. On the level of the whole system, the clone metrics quantify the impact of cloning for the whole system.
The more stakeholders can be convinced of the urgency of clone control, the higher its chances of success. While participation of all stakeholders is not necessarily required, at least stakeholders whose inactivity blocks clone control need to be convinced.

8.5.2 Create a Guiding Coalition

Once a sense of urgency has been established, clone control needs to be integrated into the development process of a project. Different roles are involved in this. Depending on the project, they can but need not be performed by different persons:
Build engineer: Integrates clone detection into the software build environment so that it is performed automatically on a regular basis.
Dashboard appointee: Creates a dashboard that presents clone detection results to developers. Depending on the project size and team structure, the dashboard appointee creates dashboard views for the individual components or subsystems to provide customized clone detection results to the stakeholders.
Tool appointee: Familiarizes himself with the clone detection tool support to adapt it to the project and tutor his colleagues.
Once the guiding coalition has been created, it performs its tasks. Besides the identification of the involved individuals, the results of this step thus include a clone detection dashboard that is updated on a continuous basis.

8.5.3 Communicate Change

Once clone detection has been integrated into the regular build, up-to-date clone detection results are, in principle, available to developers. However, while a necessary requirement, both the existence of up-to-date detection results and clone management tools alone do not alleviate the negative impact of cloning. They also need to be used by developers to take effect.
For this, developers need to be made familiar with the clone control tool support available to them and the ways it can be used to support maintenance. This includes both the clone control dashboard that provides aggregated information, and the IDE integration of clone indication that supports change propagation, implementation and impact analysis, as described in Chapter 7.
Furthermore, the ways the cloning information is used by other stakeholders, including management, needs to be communicated to create transparency [38]. Otherwise, the resulting uncertainty about the use of the collected data can lead to defensive behavior or neglect, threatening the adoption of clone control.

8.5.4 Establish Short-term Wins

All previous steps represent investments into clone control that offer no immediately visible benefits. At this step, tangible returns in software quality improvement are required to both justify previous investments and bolster developer motivation. Strategies to achieve them include:
Fix bugs introduced by inconsistencies between clones. Bug fixes offer immediate improvements in software quality and are easy to communicate among stakeholders.
Consolidate clones that are easily removable. Such clones can, e.g., be found by using very conservative normalization. Their removal reduces software size and thus future maintenance effort. Starting with clones that are easy to remove bolsters motivation, since limited effort visibly impacts cloning metrics in the dashboard.
Consolidate large clones, both in length and in cardinality. Removal of such clones visibly reduces clone metric values and thus also bolsters motivation.

8.6 Continuous Clone Control

Apart from establishing short-term quality improvements, both the amount of cloning and the probability to introduce errors due to inconsistent modifications can be reduced through continuous application of clone control. Continuous clone control involves both the quality engineer and the developers.

8.6.1 Quality Engineer

As part of continuous clone control, the quality engineer performs a series of activities on a regular basis, e.g., as part of weekly project status meetings:

Inspect Cloning Metrics in the dashboard to track the high-level evolution of cloning. This establishes the clone metrics as important project quality characteristics and maintains attention on them. Furthermore, the quality engineer analyzes their trends to monitor whether clone control has an effect.

Track Clones to identify evolution of cloning on the level of individual clone groups. Tool support for clone tracking (cf. Section 7.4.4) identifies added and modified clone groups. The quality engineer performs the following steps on them:
Added: if the clone candidate is a false positive, add it to the blacklist to remove it from the detection results. Else, investigate if the clone should be removed and, if so, schedule it for removal by, e.g., creating a work item for it in the project's issue tracker. If the clone should not be removed, e.g., since the language abstraction mechanisms are insufficient, the clone remains in the detection results to be available for change propagation. Furthermore, analyze the root cause of the clone and determine if reactions need to be taken.
Modified: if the modification was not performed consistently to all clones in the clone group, check if this was unintentional. If so, schedule a work item to repair the inconsistency and investigate why clone indication was not used successfully.
In addition, the quality engineer follows progress on the scheduled work items for clone removal or inconsistency removal. To bolster developer motivation, the list of removed clones can, e.g., be included in the quality dashboard to make progress visible to the team.
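The decision steps for added and modified clone groups can be written down as a small triage routine. The following Python sketch is only an illustration: the `TrackedGroup` record, its fields and the action strings are hypothetical, and applying root cause analysis to every added clone that is not a false positive is our reading of the text, not a rule stated by the tool support.

```python
from dataclasses import dataclass


@dataclass
class TrackedGroup:
    status: str                    # "added" or "modified"
    false_positive: bool = False   # only relevant for added groups
    should_remove: bool = False    # only relevant for added groups
    consistent: bool = True        # only relevant for modified groups
    intentional: bool = True       # only relevant for inconsistent modifications


def triage(group):
    """Return the follow-up actions for one tracked clone group."""
    if group.status == "added":
        if group.false_positive:
            return ["add to blacklist"]
        if group.should_remove:
            first = "schedule removal work item"
        else:
            first = "keep in results for change propagation"
        return [first, "analyze root cause"]
    if group.status == "modified":
        # Consistent or intentionally independent modifications need no action.
        if group.consistent or group.intentional:
            return []
        return ["schedule repair work item", "investigate clone indication usage"]
    raise ValueError(f"unknown status: {group.status}")
```

For example, `triage(TrackedGroup("modified", consistent=False, intentional=False))` yields the repair work item plus the follow-up on why clone indication did not prevent the inconsistency.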

8.6.2 Developers

As part of continuous clone control, the developers perform a series of tasks as part of their development activities.

Employ Clone Indication for change propagation. This way, the probability of unintentionally inconsistent changes to cloned code is reduced, even if cloning is not consolidated.

Resolve Work Items that have been scheduled by the quality engineer, namely clones scheduled for removal and inconsistencies that need to be repaired. While this causes effort for familiarization and quality assurance, it immediately reduces the amount of cloning and faults in the system.

Consolidate Upon Change removes cloning when changes to cloned code are required during maintenance. If code needs to be changed to implement a change request, clone consolidation in that code does not create additional effort for familiarization and quality assurance. This strategy allows cloning to be removed gradually during system evolution, without requiring a significant up-front investment.
Apart from the reduction of the amount of cloning and the probability of inconsistent modifications, a long term benefit of continuous clone control is also the maintained developer awareness of the negative impact of cloning. This awareness makes the introduction of new clones in added or modified code less likely.


8.6.3 Discussion

The generic clone control method above can be adapted to specific project contexts.

Green Field Development The above method focused on the introduction of clone control into maintenance projects. It thus focused on how to change established habits and how to manage existing clones. If clone control is introduced at the very beginning of a project, it differs in two important aspects.
First, instead of changing established habits, new habits need to be created, which is arguably simpler. Still, to create new habits, developers need to be motivated. Since clone assessment results for the project do not exist, results from other, if possible comparable, projects should be employed.
Second, if a project starts with zero artifacts, it also starts with zero clones. Clone control can thus focus on clone avoidance instead of management of existing clones. One possibility is to track clones to discover the existence of new clones right after their creation, while their removal is still inexpensive.

Multi-project Environments If clone control is introduced into a multi-project environment, a staged approach that starts with a few projects before introducing clone control into all projects has several advantages. First, less investment is required. Second, lessons learned on the pilot projects can be applied to the remaining ones, potentially saving the repetition of errors. Third, the pilot projects can be employed as examples to create a sense of urgency and show feasibility of clone control to the remaining projects.

Tool Support Dedicated tool support is crucial for clone control. To control cloning on a project level, quality dashboards aggregate and visualize the extent and evolution of cloning in a system. For change propagation, clone inspection and removal, clone management tools that integrate into IDEs provide support to developers. Both tool support on the project level and for clone management in the IDE is proposed in this thesis and outlined in Chapter 7.

8.7 Validation of Assumptions

This section presents industrial case studies that validate the assumptions underlying the method for clone assessment and control. The evaluation of the method is presented in Section 8.8.

8.7.1 Assumptions

The tailoring procedure that is part of the clone assessment method employs developer assessments of clone coupling on a clone sample to determine result accuracy. This is based on three assumptions:
Assessment consistency. We assume that different developers evaluate the coupling of clones consistently.
Assessment correctness. We assume that the evaluation of clone coupling is correct regarding how changes will affect clones in reality.
Assessment generalizability. We assume that assessment results for a sample of the detected clones can be generalized to all clones.
While a certain amount of error can be tolerated, the assumptions must hold on a general level for the use of developer assessments on a sample to make sense.

8.7.2 Terms

For the sake of clarity, we define several terms we employ during the study: A change is an alteration of a software system on the conceptual level. A modification is an alteration on the source code level. A single change comprises multiple modifications, if its implementation affects several code locations. Detection result accuracy refers to a combination of both precision and recall.

8.7.3 Research Questions

We use a study design with two objects and four research questions to validate the assumptions. The study is limited to source code:

RQ 10 Do developers estimate clone coupling consistently?

The application of developer assessments to estimate clone coupling is based on the assumption that different developers estimate clone coupling consistently. Experiments by Walenstein et al. [229] have demonstrated that assessments require an explicit clone relevance criterion to produce consistent results. This research question validates whether the estimation of coupling represents such a criterion.

RQ 11 Do developers estimate clone coupling correctly?

Consistency alone is no sufficient indicator for correctness. Prediction of change, which is part of assessing the coupling between clones, inherently contains uncertainty. To assess how useful developer assessments of clone coupling are for tailoring, we need to understand their correctness.

RQ 12 Can coupling be generalized from a sample?

Rating is performed on a sample of the candidate clones, since real-world software systems contain too many clones to feasibly rate them all. The sample must be representative for the system, else sampling makes no sense.

RQ 13 How large is the impact of tailoring on clone detection results?

Tailoring changes the results of clone detection. The size of the change in terms of accuracy and amount of detected clones determines the importance of clone detection tailoring for both research and practice.

Table 8.1: Study objects
System  Lang.  Age (years)  Size (kLOC)  Developers (max)
A       ABAP   13           442          10 (40)
B       C#     8            360          4 (12)

8.7.4 Estimation Consistency (RQ 10)

Study Object We use an industrial software system from the Munich Re Group as study object. The Munich Re Group is the largest re-insurance company in the world and employs more than 47,000 people in over 50 locations. For their insurance business, they develop a variety of individual supporting software systems. For non-disclosure reasons, we named the system A. An overview is shown in Table 8.1. Code size refers to the hand maintained code that was analyzed. The system implements billing, time and employee management functionality and supports about 3700 users.

Design We determine inter-rater agreement between different developers to answer RQ 10. For this, developers independently estimate coupling for a sample of candidate clone pairs from the study object by answering assessment question 1 for each pair. Inter-rater agreement is then determined by computing Cohen's Kappa.
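As a reminder of how such agreement figures are obtained, the following Python sketch computes raw percent agreement and Cohen's Kappa for two raters. The ratings in the usage note are invented for illustration and are not the study data. Cohen's Kappa in its standard form is defined for two raters; which multi-rater variant was applied to the three developers is not stated in the text, so this sketch covers the pairwise case only.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Share of items on which both raters chose the same category."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    # Chance agreement: product of both raters' marginal category shares.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    # Note: undefined (division by zero) if both raters use one single category.
    return (observed - expected) / (1 - expected)
```

With the invented ratings `["accept", "accept", "reject", "reject"]` and `["accept", "accept", "reject", "accept"]`, the raters agree on three of four pairs (0.75), chance agreement is 0.5, and Kappa is therefore 0.5.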

Procedure and Execution Clone detection was performed with an untailored configuration on study object A. From the results, a random sample of clone pairs was generated. If a sampled candidate clone group contained more than two clones, its first two clones were chosen. Each developer assessed coupling for each clone pair individually. Assessment was guided by a researcher. The researcher explained the assessment tool and asked the assessment question for each clone pair, but took care not to influence assessment results. Developers could provide three answers, namely accept, reject and undecided. Individual rating meetings were limited to 90 minutes since experiences with developer clone assessments from earlier experiments [115] indicated that after 90 minutes, concentration and motivation decrease and threaten result accuracy.

Results and Discussion Clone coupling was estimated for 48 clone pairs by three developers. Three clone pairs were rated as undecided by one developer, one clone pair was rated as undecided by two developers. Furthermore, five clone pairs received at least one accept and one reject assessment. The remaining 39 clone pairs all received the same ratings by all three developers. Table 8.2 shows the results of the assessment.

Table 8.2: Estimation consistency results
Developers                   Agreement
1 & 2                        87.5%
1 & 3                        85.4%
2 & 3                        89.6%
1 & 2 & 3                    81.3%
1 & 2 & 3 (w/o unrated)      88.1%

Agreement between pairs of developers ranges between 85.4% and 89.6%. Overall agreement is 81.3%. In rows 1–4, all clone pairs are taken into account, including clone pairs that were estimated as undecided by one developer. For the last row, the four clone pairs for which at least one developer rated undecided were removed from the result. On the remaining 44 clone pairs, 88.1% are rated consistently between three developers, indicating substantial agreement. Cohen's Kappa for the three categories accepted, rejected and undecided and the three raters is 0.87 for the 48 rated clone groups. According to Landis and Koch [151], this is considered as almost perfect agreement.
For the analyzed clone pairs, developers did have a consistent estimation of the coupling of clones. After the assessments were complete, results were discussed with the developers. Developers could agree on an assessment for four out of the five clone pairs that were assessed contradictorily. Only for a single clone pair did developers remain of different opinion. Based on these results, we consider it feasible to achieve consistent estimations of clone coupling through developer assessments.

8.7.5 Estimation Correctness and Generalizability (RQ 11 & RQ 12)

Study Object We use a second industrial software system from the Munich Re Group as study object. For non-disclosure reasons, we named the system B. An overview is shown in Table 8.1. The system implements damage prediction functionality and supports about 100 expert users.

Design Clone detection tailoring partitions the results of untailored clone detection into two sets: the set of accepted clone groups that are still detected after tailoring, and the set of rejected clone groups that are not detected anymore. If developer assessments of clone coupling are correct and results can be generalized from the sample (and no errors have been made during clone detection tailoring), accepted clone groups must exhibit a higher ratio of coupled changes during their evolution than rejected clone groups.

Definition 5 Change Coupling Ratio (CCR): Probability that a change to one clone of a clone group should also be performed to at least one of its siblings.

We state this as a hypothesis:

Hypothesis 1 CCR for accepted clone groups is higher than for rejected clone groups.

We determine CCR on the evolution history of the study object for both accepted and rejected clone groups as described below. We then use a paired t-test to test Hypothesis 1 against the null hypothesis that CCR for accepted clone groups is equal or smaller than for rejected clone groups.
CCR is determined by investigating the set of changes that are performed to clone groups during system evolution. CCR is simply the ratio of the number of coupled changes to the number of all changes, including uncoupled ones, which is equal to the expected probability that a randomly chosen change to a clone group is coupled.
In practice, developers do not have perfect change impact knowledge. The modifications developers perform to cloned code can deviate from the intentional nature of the change: developers can miss a clone when implementing a coupled change. The modification of the cloned code thus gets unintentionally uncoupled (footnote 5). The three ways how a change can affect cloned code are: 1) Consistent modifications are intentionally coupled modifications to cloned code. 2) Independent modifications are intentionally uncoupled modifications to cloned code. 3) Inconsistent modifications are unintentionally uncoupled modifications to cloned code.
Information about the intentionality of a modification (footnote 6) is, in general, not contained in the evolution history of a system. It is thus manually assessed by the system developers.
We determine CCR for a system by inspecting the changes between pairs of consecutive system versions as follows: first, clones are tracked between the two system versions to identify clone groups that were modified; second, all modified clone groups are inspected manually. Based on their underlying change, they are classified into sets of consistently, independently or inconsistently changed clone groups; CCR can now be computed as:

\[ \mathit{CCR} = \frac{|\mathit{consistent}| + |\mathit{inconsistent}|}{|\mathit{consistent}| + |\mathit{inconsistent}| + |\mathit{independent}|} \]

This procedure does not require accurate and complete evolution histories or genealogies of individual clone groups. To improve accuracy, it can be performed on multiple pairs of consecutive system versions; CCR is then determined on a larger sample of changes.
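The CCR computation can be expressed in a few lines of Python. The function name `change_coupling_ratio` is our own; the example counts in the usage note are the figures for accepted clone groups as read from Table 8.3.

```python
def change_coupling_ratio(consistent, inconsistent, independent):
    """CCR: share of coupled changes among all classified changes.

    Consistent and inconsistent modifications both stem from coupled changes;
    they differ only in whether the coupling was implemented everywhere.
    """
    coupled = consistent + inconsistent
    total = coupled + independent
    if total == 0:
        raise ValueError("no classified changes in this interval")
    return coupled / total
```

For interval 1 of the accepted clone groups (15 consistent, 3 inconsistent, 3 independent), `change_coupling_ratio(15, 3, 3)` gives 18/21, which rounds to the reported 0.857; for all four intervals together (58, 10, 26) it gives 68/94, which rounds to 0.723.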

Procedure and Execution The system versions between which code modifications were analyzed were chosen using a convenience sampling strategy. Weekly snapshots (footnote 7) of the source code were extracted from the version control system for the year 2006. Between each pair of snapshots, code churn was determined as the number of changed files as an estimate of development activity in that week. Four weekly intervals were chosen for measurement. Their choice aimed at maximizing the covered part of the system evolution, to measure different stages and to capture different levels of development activity to reduce the probability to only cover an unrepresentative part of the system's evolution.
Footnote 5: In principle, developers could also erroneously modify clones in a coupled fashion, although the change should only affect one clone, thus effecting an unintentionally coupled modification. However, since this case was not observed on the study object, we ignore it here.
Footnote 6: Based on history analysis alone, it is undecidable whether two differently modified sibling clones represent an independent or inconsistent modification and thus whether the underlying change is coupled or not.
Footnote 7: The developers have employed our clone detection tool ConQAT during development since 2008. We thus analyzed an earlier evolution history fragment to avoid unwanted side effects on the data caused by the use of the clone detector.


For each measurement interval, coupling was determined for both accepted and rejected clone groups as follows. First, modifications to cloned code were computed using a clone tracking approach similar to the one described in [83, 83], cf. Section 7.4.4. Second, all modifications to clone groups were manually classified as consistent, inconsistent or independent. Required effort to individually rate all clone groups for all intervals and both detection configurations would be too high to be feasible. Three measures were taken to reduce review effort:
Clone clustering: Due to the nature of clone groups, long clone groups often overlap with shorter clone groups of higher cardinality. Say you created a clone pair A by cloning a code region that contains two methods. If you now clone one of the methods again, you have created a second clone group B with three clones: one containing the newly inserted method clone, two overlapping clone pair A. We call such overlapping clone groups a cluster. If the original method gets changed, both clone groups A and B are modified. A pre-study we performed to validate the tool setup showed that modifications are often rated equally for all clone groups in a cluster. Although all clone groups in a cluster were rated individually, sorting clone groups according to clone clusters substantially improved rating productivity.
Two-phase review: In the first phase, a researcher inspected all modified clone groups and classified those for which obviously no common concept between clones could be identified as independent. Typical examples include getter and setter clones that are only considered similar due to overly aggressive normalization. In the second phase, the remaining clone groups were pair-reviewed by a researcher and a developer. The researcher operated the clone inspection tool, the developer took the rating decisions.
Single classification: Rated clone groups were partitioned into accepted and rejected sets. This was done by matching the rated clone groups against the results of clone tracking using a tailored detection configuration. Matching was performed in a semi-automated fashion: clone groups with identical positions were matched automatically (footnote 8), remaining clone groups were matched manually by a researcher based on their location and content. Five out of 91 (5.5%) of the detected clusters could not be matched and were excluded from the study.
Clone detection was performed with ConQAT using a minimal clone length of 10 statements. Tailored detection was performed using an existing tailoring from an earlier collaboration that was created using the method from Section 8.2. It excludes clone groups with overlapping clones, employs context sensitive normalization of repetitive code regions and excludes C# using statements and generated code.

Results and Discussion Tables 8.3 and 8.4 show the results of the manual change classification and the resulting coupling for the set of accepted and rejected clone groups, respectively. In total, changes to 211 clone groups (containing 1279 clones) were manually classified during the experiment.
In intervals 1 and 2, modifications for one accepted clone group were rated as don't know. For computation of coupling, they were conservatively counted as independent. This conservative strategy only makes it harder to answer the research question positively; it does not threaten the validity of a positive answer.
Footnote 8: Tailoring can result in shorter clones that are thus not in identical locations as their untailored correspondents.


Table 8.3: Evolution of accepted clone groups
Int.    Consistent   Inconsistent   Independent   Coupling
1           15             3              3         0.857
2           11             1             10         0.545
3            1             0              0         1.000
4           31             6             13         0.740
1-4         58            10             26         0.723

Table 8.4: Evolution of rejected clone groups
Int.    Consistent   Inconsistent   Independent   Coupling
1            2             0             10         0.167
2            0             1             42         0.023
3            1             0             38         0.026
4            0             0             23         0.000
1-4          3             1            102         0.034
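The coupling column of Table 8.3 can be reproduced from the three change counts. The sketch below assumes that coupling is the share of changed clone groups whose changes were rated consistent or inconsistent (that is, not independent). This formula is inferred from the reported numbers and is not quoted from the definition given earlier in the thesis; the counts are those read from Table 8.3.

```python
def coupling(consistent, inconsistent, independent):
    """Share of changed clone groups whose changes were rated as coupled.

    Assumption: consistent and (unintentionally) inconsistent changes both
    count as coupled, independent changes do not.
    """
    total = consistent + inconsistent + independent
    return (consistent + inconsistent) / total if total else 0.0


# (consistent, inconsistent, independent) per measurement interval, Table 8.3
accepted = {1: (15, 3, 3), 2: (11, 1, 10), 3: (1, 0, 0), 4: (31, 6, 13)}

accepted_coupling = {i: round(coupling(*c), 3) for i, c in accepted.items()}

# Column sums over all four intervals give the "1-4" row.
totals = [sum(column) for column in zip(*accepted.values())]
overall = round(coupling(*totals), 3)

print(accepted_coupling, overall)
```

Under this reading, the per-interval values and the overall value agree with the coupling column of the table.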

The paired t-test yields a p-value of 0.002162. This indicates that the greater clone coupling for accepted than for rejected clone groups is, for a confidence interval of 95%, statistically significant. It thus supports Hypothesis 1. Developer estimation of clone coupling thus aligns well with the evolution of clones during the system's evolution history.
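A paired t-test of this kind can be sketched with the standard library alone. The sketch assumes that the test pairs the accepted and rejected coupling values per measurement interval (as read from Tables 8.3 and 8.4) and that it is one-sided; neither is stated explicitly in the text. With four intervals there are three degrees of freedom, for which the t distribution has a closed-form tail probability.

```python
import math
from statistics import mean, stdev

# Coupling per measurement interval (intervals 1-4); values as read from
# Tables 8.3 and 8.4. The pairing by interval is an assumption.
accepted = [0.857, 0.545, 1.000, 0.740]
rejected = [0.167, 0.023, 0.026, 0.000]

diffs = [a - r for a, r in zip(accepted, rejected)]
n = len(diffs)
assert n - 1 == 3, "the closed form below only holds for 3 degrees of freedom"

# Paired t statistic: mean difference divided by its standard error.
t = mean(diffs) / (stdev(diffs) / math.sqrt(n))


def upper_tail_df3(t_value):
    """P(T > t_value) for Student's t distribution with 3 degrees of freedom."""
    x = t_value / math.sqrt(3)
    return 0.5 - (math.atan(x) + x / (1 + x * x)) / math.pi


p = upper_tail_df3(t)  # one-sided: accepted coupling greater than rejected
print(f"t = {t:.2f}, one-sided p = {p:.5f}")
```

With the rounded table values, the result lands close to the reported p-value, which suggests that a one-sided test on the per-interval values was used.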

8.7.6 Clone Tailoring Impact (RQ 13)

Study Object. We use system B from Munich Re (as for RQ 12).

Design. We compute several cloning metrics for the clone detection results before and after tailoring, namely: count of clones and clone groups, clone coverage and clone blow-up. We then calculate their delta to evaluate the quantitative impact of tailoring on the detection results.

Procedure and Execution. We performed tailored and untailored clone detection on two versions of the source code of the study object. Untailored clone detection simply returns all type-1 and type-2 clones (according to the definition from [140]). All metrics were computed automatically by ConQAT. The first version is the one from the first measurement interval. The second version is from mid 2008 (before ConQAT was introduced for continuous clone management). Between these versions, the developers replaced hand-written data-access code with generated code that is never modified manually: if the data-access layer changes, it is fully re-generated, so unintentionally uncoupled changes cannot occur. We included this second version to investigate the effect of generated code on untailored detection results.


Table 8.5: Impact of tailoring on detection results
                         2006                           2008
                Untail.    Tail.    Delta      Untail.    Tail.    Delta
Clone Groups       598      332     -44%        2,558    1,028     -60%
Clones           2,118    1,005     -53%       12,675    3,558     -72%
Coverage         29.3%    18.3%     -38%        36.2%    19.4%     -46%
Blow-Up          27.8%    14.2%     -49%        41.2%    16.1%     -61%

Results and Discussion. The results are displayed in Table 8.5. In both versions, tailoring substantially reduced the number of detected clones and thus clone coverage and blow-up. However, substantial amounts of cloning are still detected after tailoring. Tailoring affects results even more strongly if generated code is present: all metrics are reduced by a larger factor.
The mere observation that the introduction of filters during tailoring reduces the number of detected clones is hardly surprising. However, for the analyzed system, recall was largely preserved: of the 72 clone groups to which coupled changes occurred, 68 were still detected by the tailored clone detection, indicating a recall of the tailored compared to the untailored detection of 94.4%. Consequently, changes in clone (group) count mostly denote changes in precision. More specifically, for the analyzed system, about every second clone group in the untailored result is considered irrelevant by developers. For the analyzed system, adoption of clone detection techniques for continuous clone management failed until tailoring was performed. Even though the system contained substantial amounts of relevant clones, false positive rates were considered too high for productive use.
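The deltas and the recall figure discussed above follow directly from the counts. The sketch below recomputes them; the assignment of the counts to the 2006 and 2008 versions follows the reading of Table 8.5 in which the version with generated code shows the larger reductions, which is an assumption.

```python
def reduction_percent(untailored, tailored):
    """Relative reduction achieved by tailoring, rounded to whole percent."""
    return round(100 * (1 - tailored / untailored))


# (untailored, tailored) counts as read from Table 8.5
counts = {
    ("2006", "clones"): (2118, 1005),
    ("2006", "clone groups"): (598, 332),
    ("2008", "clones"): (12675, 3558),
    ("2008", "clone groups"): (2558, 1028),
}
reductions = {key: reduction_percent(*pair) for key, pair in counts.items()}

# Recall of tailored relative to untailored detection: 68 of the 72 clone
# groups with coupled changes are still found after tailoring.
recall = round(100 * 68 / 72, 1)

print(reductions, recall)
```

Roughly half of the clone groups disappear through tailoring while nearly all groups with coupled changes remain, which is why the text attributes the change mainly to precision rather than recall.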

8.7.7 Threats to Validity

Internal. The choice of the measurement intervals for RQ 2 can affect result validity. We chose measurement intervals covering a year of development history with different intervals between them and with different churn to reduce the probability of only selecting unrepresentative intervals.
We assume that all consistent changes are intentional, on the basis that a developer does not inadvertently invest effort into changing different clones consistently, if only a single clone needs to be changed. While this simplification can in principle introduce inaccuracy, we expect it to be negligible: of the 43 consistently modified clone groups manually investigated during the case study, not a single one was unintentionally modified consistently.
Our approach to measure clone coupling is unable to detect late propagations⁹, because clones are tracked between two consecutive system versions only. This does not affect the quality of our results, however, since manual classification of uncoupled changes by developers recognizes changes that are part of late propagations as unintentional inconsistencies, and thus as coupled changes.
⁹ A late propagation is an inconsistent change to cloned code that becomes consistent again at a later point, when a developer modifies the clones missed in the first modification step accordingly.


Overeager tailoring can filter out clones that are relevant. This also leads to a substantial change in clone metrics, but is not desirable in practice. However, in the analyzed system, 94.4% (68 out of 72) of the clone groups that evolved in a coupled fashion are still contained in the detection results after tailoring, indicating high recall of tailored in relation to untailored detection results.
Manual classification of clone groups, as done to answer RQ 2, entails the risk of misclassification due to human errors. We took several measures to reduce this risk: pair-classification was employed to reduce the probability of individual errors. The participating developer had been working on the project, without break, for several years, covering all measurement intervals; he was thus well familiar with the system. Furthermore, uncertain cases were rated as don't know to avoid guesswork and were handled conservatively.
In case clone groups from the untailored and the tailored detection results could not be mapped unambiguously, they were excluded from the study. Since this affected only 5.5% (five out of 91) of the detected clusters, we expect the potential impact of this simplification to be negligible.

External. Each research question has been evaluated on a single system only. The systems have not been chosen randomly but were selected based on an existing cooperation and the availability and willingness of developers to contribute. Furthermore, only a single clone detector, and hence only a single clone detection approach, was employed. Thus, from the study results, we cannot tell how results are transferable to systems written in different languages, by other developer teams, or to other clone detectors or detection approaches. Although the results from the studies align well with experiences we have gathered applying clone detection tailoring in various other contexts, further studies are required to gain a better understanding of result transferability.
The study only analyzed cloning in source code. While we see no factors that threaten to invalidate the applicability of the results to cloning in other artifact types, and thus assume that they hold for them too, future work is required to validate these assumptions for requirements specifications and models.

8.8 Evaluation

This section presents an evaluation of the method for clone assessment and control. It presents a case study that employs the proposed method on an industrial software system and analyzes the resulting changes in the amount and evolution of code cloning. The case study has been performed in collaboration with Munich Re Group.

Clone Assessment and Control. We applied the method for clone assessment and control as described in this chapter to a software project developed and maintained at Munich Re Group. We shortly summarize the main steps.
Clone assessment was performed on the, at that time, current version of the software system. Several developers took part in clone inspections during clone detection tailoring and determination of UICR¹⁰ and FUICR¹¹. As reported in Chapter 4, multiple faults were found in the inspected type-3 clones.
The results of clone assessment were presented and discussed in meetings in which the entire maintenance team participated. Besides an introduction to cloning in general, both the results of the clone metrics for the project and the individual discovered faults were discussed. The faults, especially, helped to establish a sense of urgency among the participants. The developers fixed the faults and consolidated a number of clones directly after presentation of the results of clone assessment.
Two types of tool support for clone control were employed. A ConQAT-based quality dashboard was created for the project that was updated on a daily basis. The dashboard contained all clone visualizations introduced in Chapter 7, including clone lists, treemaps and clone metric trends. The dashboard results were available to the developers for individual use. In addition, they were inspected by the team as part of regular project status meetings. Besides the dashboard, developers had access to the interactive tool support for clone inspection (cf., Chapter 7). This way, individual clones could be inspected in detail at the code level.
At the beginning of the case study, we tutored the project participants in the interpretation of the visualizations and metrics in the project dashboard and on the use of the interactive clone inspection tools. Apart from these tutorials and the presentations of the clone assessment results at the beginning of the case study, we did not actively participate in clone control. Importantly, we did not touch a single line of code in the project. Any changes to the code of the project were performed by the developers themselves.

8.8.1 Research Questions

To evaluate the usefulness of clone assessment and control, we investigate the following two research questions:

RQ 14: Did clone control reduce the amount of cloning?

Clone control requires resources. To justify their expense, clone control needs to take a noticeable effect. This question investigates whether a noticeable effect can be observed in the amount of cloning.

RQ 15: Is the improvement likely to be caused by the clone assessment and control measures?

Improvement alone does not justify clone control. It could, in principle, be due to other causes. This research question analyzes whether the observed reduction in cloning can be attributed to clone control.
¹⁰ Unintentionally inconsistent clones ratio, cf., Section 8.3.3.
¹¹ Faulty unintentionally inconsistent clones ratio, cf., Section 8.3.3.

8.8.2 Study Design

RQ 14. We analyze the amount of cloning in the study object in both relative and absolute terms. The metric clone coverage captures the relative amount of cloning; number of cloned statements captures its absolute amount. Both metrics are computed on a daily basis to capture their evolution during the case study.
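The two metrics can be sketched as follows. The sketch assumes that a clone is given as a file name with an inclusive statement range and that a statement counts as cloned if at least one clone covers it; the data layout and the toy values are illustrative and not ConQAT's data model.

```python
def cloned_statements(clones):
    """Return the set of (file, statement index) pairs covered by at least one clone."""
    covered = set()
    for file_name, start, end in clones:  # end index is inclusive
        covered.update((file_name, i) for i in range(start, end + 1))
    return covered


def clone_coverage(clones, total_statements):
    """Relative amount of cloning: cloned statements divided by all statements."""
    return len(cloned_statements(clones)) / total_statements


# Made-up toy data: two overlapping clones in A.cs and one clone in B.cs.
clones = [("A.cs", 0, 9), ("B.cs", 20, 29), ("A.cs", 5, 14)]

absolute = len(cloned_statements(clones))               # absolute amount
relative = clone_coverage(clones, total_statements=100)  # relative amount
print(absolute, relative)
```

Using a set means that statements covered by several overlapping clones are counted only once, which keeps coverage between 0 and 1.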

RQ 15. To investigate whether the reductions in cloning are likely to be caused by the applied clone control measures, we also compute the clone metrics on the evolution history of the project before clone control was introduced. We then compare the trends of the metrics with and without clone control to analyze differences.

8.8.3 Study Objects

We chose an industrial software system at Munich Re Group as a study object. It is a business information system written in C# that provides pharmaceutical risk management functionality. During the year of the case study, the size of the system grew from 450 kLOC to 500 kLOC. It is the same system as system B in the study objects in Section 4.3.
Software quality characteristics, including cloning, are influenced by many factors. To name just a few, these include the company, developer expertise, team structures, the maintenance environment and available tools. To have a conclusive control group to answer research question 15, these factors need to be controlled.
However, even inside the Munich Re Group, it is difficult to find software systems with the same characteristics as the study object, as they are developed and maintained by different subcontractors. They differ, thus, in their processes, team structures and employed tools.
Instead of choosing other projects with different characteristics, whose impact on cloning is hard to determine, we chose the past evolution of the study object, before clone control was introduced, as control object. This way, the company, domain, development process, team structure and employed development tools remain constant for the most part.

8.8.4 Implementation and Execution

RQ 14. The construction of the quality dashboard was integrated into a continuous build process that was executed every day. All computed clone metrics were written to a database. This way, the clone metric trends were collected continuously during the period of the case study.


RQ 15. To compute the clone metrics on the past project evolution, we extracted weekly snapshots from its version control system. Clone detection was then performed on each weekly snapshot, clone metrics computed and written to a database for later trend analysis.
Samples of the clones of several snapshots of the system were inspected with the developers to make sure that tailoring was still accurate.

8.8.5 Results

This section presents the results of the case study.

RQ 14. Figure 8.2 depicts the evolution of clone coverage. The upper chart shows that clone coverage decreased during the case study from 14% in April 2008 to below 10% in May 2009. In May 2008, there is a short increase in clone coverage. An interview with the developers revealed that a large clone had been introduced, but was noticed at a team meeting and subsequently consolidated, resulting in the drop of the clone coverage trend to its previous level. Apart from this period, and a second small increase in July 2008, the clone coverage trend is steadily decreasing.
The upper chart of Figure 8.3 depicts the number of all statements of the system in blue and the number of statements that are covered by at least one clone in red. It shows that the number of cloned statements decreases from 15,000 in April 2008 to 11,000 in May 2009. During the study period, the size of the system increased from around 105,000 statements to 115,000 statements. Like the clone coverage trend, the cloned statements trend is steadily decreasing for most of the case study period.
The decrease in both clone coverage and cloned statements shows that clone control successfully reduced the amount of existing cloning in the system. We thus answer RQ 14 positively: clone control did reduce the amount of cloning in the studied system.
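The reported endpoints of both trends are consistent with each other. The short check below assumes that clone coverage is the number of cloned statements divided by the number of all statements and uses the rounded figures quoted in the text.

```python
# Rounded figures quoted in the text for the start and end of the case study.
april_2008 = {"cloned": 15_000, "total": 105_000}
may_2009 = {"cloned": 11_000, "total": 115_000}


def coverage_percent(snapshot):
    """Clone coverage in percent: cloned statements relative to all statements."""
    return 100 * snapshot["cloned"] / snapshot["total"]


start = round(coverage_percent(april_2008), 1)
end = round(coverage_percent(may_2009), 1)
print(start, end)
```

Coverage falls for two reasons at once: cloned statements decrease while the system keeps growing, so the absolute metric is the stronger evidence that clones were actually removed.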

RQ 15. This research question investigates whether the clone metrics already exhibited similar evolution patterns before clone control was introduced.
The lower chart in Figure 8.2 depicts the evolution of clone coverage between September 2004 and January 2007. Increases in clone coverage are always caused by the creation of new clones. Decreases in clone coverage are either caused by clone removal, or by addition of new code that contains no (or less) cloning. For most of this period, clone coverage oscillates between 10% and 20%. The amplitude of the changes flattens as the project advances, since the relative size of the code changed during an iteration decreases w.r.t. the overall project size, as the overall size grows larger. For the second part of the chart, the period after January 2006, clone coverage never decreases below 14%.
In contrast, the clone coverage trend during the case study exhibits a substantially different evolution, since it decreases for the most part.
The lower chart in Figure 8.3 shows the number of all statements in blue and the number of cloned statements in red in the same period. Increases in cloned statements are always caused by the creation of new clones, decreases by their removal. The waves in the trend indicate that some cloning gets consolidated shortly after its introduction. However, the amount of cloned statements after a wave is never below the amount of cloned statements before a wave, indicating that clones remain in the system, after they have survived for a certain amount of time. If measured only at the lowest points, the trend is steadily increasing.

Figure 8.2: Clone coverage evolution with (top) and without (bottom) clone control

Figure 8.3: Statements and cloned statements with (top) and without (bottom) clone control

In contrast, the cloned statement trend during the case study mostly decreased. It thus exhibits a substantially different evolution than before clone control was introduced.
Since both clone coverage and cloned statements evolved substantially differently without and with clone control, although no major changes in other project characteristics were performed at the time, we answer RQ 15 positively: the decrease in cloning is likely to be caused by clone control.
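The "lowest points" argument can be made concrete with a small sketch that extracts the local minima of a cloned-statements series and checks whether they rise. Both series below are invented for illustration only; they are not the measured trends of the study object.

```python
def local_minima(series):
    """Values that are lower than their predecessor and not higher than their successor."""
    return [
        series[i]
        for i in range(1, len(series) - 1)
        if series[i - 1] > series[i] <= series[i + 1]
    ]


def is_non_decreasing(values):
    """True if every value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Invented series: waves on a rising baseline versus a mostly falling trend.
without_control = [100, 140, 120, 170, 150, 210, 190, 230]
with_control = [230, 240, 215, 220, 200, 205, 180, 170]

minima_without = local_minima(without_control)
minima_with = local_minima(with_control)
print(minima_without, is_non_decreasing(minima_without))
print(minima_with, is_non_decreasing(minima_with))
```

In the first series every trough is higher than the previous one, which is the pattern described for the period before clone control; in the second series the troughs fall.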

8.8.6 Discussion

The waves in the trends are, in parts, caused by the iterative development process. The system size trend in the lower chart in Figure 8.3 reflects the iterative development process and release cycle of the project. At the start of a new iteration, system size tends to increase rather rapidly, as implementation of new features results in fast production of new code. Towards the end of an iteration, size increase slows or stagnates, as more resources are dedicated to testing or fixing of functionality than to production of new code. In some cases, cleanup during the end of an iteration even reduces the code size. The cloned statements trend follows this pattern. We could observe that clones were often introduced at the beginning of an iteration. Sometimes, a part of the clones was consolidated at a later point of the same iteration, causing a reduction in the number of cloned statements.
However, while some clones were consolidated during the iteration in which they are created, clones that survived beyond the end of their birth iteration were unlikely to be removed at a later point, before clone control was introduced. These observations were confirmed through interviews with the developers and inspections of the evolution of samples of the clones. As a consequence, the number of cloned statements at the end of an iteration was never smaller than at its beginning; if measured at the end of iterations, the absolute amount of cloning thus steadily increased. Only after clone control was introduced did the cloned statements trend decrease across different iterations. We think that this reversing of the cloned statements trend is a strong indicator for the impact of clone control on the amount of cloning in the system.

8.8.7 Threats to Validity

Internal. We interpret reductions in cloned statements to be caused by intentional removal of clones. The number of cloned statements can also decrease on a large scale, however, if clones are systematically modified to prevent their detection, without removing them. To control this potential threat, we inspected a sample of the code regions in which clones were no longer detected. They revealed intentional consolidation. We thus do not expect systematic concealment to cause the decrease in the clone trends.


For some days in the charts, no data are available. For them, the interpretations are thus inaccurate. This was caused by problems with the build infrastructure that prevented the dashboard from being executed for these periods. However, interviews with the developers suggest that no jumps did occur in them. In addition, the evolution for the times for which data is available is already substantially different from the historical data. We thus do not consider the missing data points as threats to our conclusion that clone control managed to reduce cloning.
We did not validate the hypothesis that clone control reduced the amount of cloning statistically. While we think that a statistical validation would be desirable, we do not believe that a single study object provides sufficient data for it. The repetition of the study on further projects and the statistical validation thus remains important future work.
The reduction in cloning could, in principle, be caused merely since developers were made aware of the fact that clones are harmful, or by making a dashboard with clone metrics available to them. If so, the steps of the clone control method would not be required. We think that this assumption is not valid for two reasons. First, not only did the rate of new cloning decrease, but cloning was actively removed from the system. Active removal does not occur subconsciously or accidentally. Second, the dashboard was also made available to two further projects at MR (projects A and C from the case study in Chapter 4). However, in these projects, the steps of the clone control method were not performed: assessment results and discovered faults were not presented and discussed in a meeting with all stakeholders. No tutorial was performed that instructed the stakeholders in the use of the quality dashboard and the clone inspection tools. The quality dashboard results were not integrated into the regular project status meetings. For these projects, no comparable decreases in clone coverage and cloned statements can be observed, as for the study object. These experiences thus give further indication that the changes to the amount of cloning were caused by the performed clone control measures, and cannot solely be explained by making dashboards available. However, this case study only provides indication of the effectiveness of clone control on a general level. The merit of the individual steps is not validated empirically. Further empirical validation is required to better understand the importance of the individual steps, potential for simplicity, omissions or improvement potential.

External. The biggest threat to transferability of the results is that we only performed the case study on a single study object. The simple reason for this is that the case study required a lot of effort and time, and that industrial projects willing to participate in such case studies are hard to find. Future work is required that repeats the case study on further projects to better understand the generalizability of the results.

8.8.8 Additional Experiences

Apart from the results directly targeting the research questions, we gathered a number of experiences regarding clone control. The following paragraphs reflect our experiences both from the above study and from several further projects in which we introduced clone control, including projects at Munich Re Group, ABB and Wincor Nixdorf.


Sense of Urgency. We found that the sense of urgency that presentations of cloning and clone assessment results create depends strongly on the relation of the developers to the studied code base. If clones in third-party code are presented, they tend to be regarded as other people's problems. Clones in their own code base, while attracting more attention and triggering justification attempts, did typically not create a sense of urgency, since they often were conceived as future maintenance problems; in other words, not present maintenance problems. The fact that cloning can already have caused problems in the past was not apparent. In contrast, presentation of existing clone-related bugs makes apparent that cloning is a present maintenance problem. The resulting sense of urgency is correspondingly larger.

Reactions to Discovered Clones. We also found that discovery of clones in their system often triggers similar reaction patterns by developers. While agreement that cloning can hinder maintenance in general is typically easily achieved, the proposition that this holds for specific clones in their own system as well typically encounters initial resistance. In the numerous discussions we had, the initial reaction to a presented clone was to test if it could be removed. If not, or if not easily, developers jumped to the conclusion that the clones are not problematic, since they cannot be avoided. In such situations, it was important to point out that changes to them still needed to be carried out to all siblings; and that clone indication tooling can make this easier, since it supports change propagation. This emphasis on clone control tools as support to evolve existing clones, according to our experience, helped adoption by developers.

Dashboards as a Means of Communication. Dashboards can serve as motivation and as a means of communication inside and between different groups of stakeholders. We observed that clone trends that reflect clone consolidation can have motivating effects on developers, encouraging them to perform further consolidations. They thus communicate consolidation efforts and effects inside the developer group. Furthermore, the amount and evolution of cloning is communicated to other groups of stakeholders, including management. Although this fact can create initial reluctance among developers, we frequently encountered positive reactions, once developers were more familiar with it. Some groups employed it specifically to communicate that they require resources to consolidate several areas of unmaintainable code, turning clone measurements into an argument for their cause.

8.9 Summary

This chapter presented a method for clone assessment and control that comprises five steps. Its first step, clone detection tailoring, employs developer assessments of clone coupling to achieve accurate clone detection results. Its second step, assessment of impact, determines metrics on the detected clones. These metrics quantify the impact of cloning on maintenance efforts and correctness. Its third step, root cause analysis, determines the forces driving the creation of cloning, thus uncovering potential problems in the maintenance environment. Its fourth step, introduction of clone control, employs strategies from organizational change management to successfully introduce continuous clone management into established maintenance processes. Its fifth step, continuous clone control, performs clone control measures on a regular basis to permanently reduce the negative effects of cloning.

The second part of the chapter presented two industrial case studies. The first study validates assumptions underlying the method and demonstrates its feasibility and, through the magnitude of the impact tailoring had on the results, its importance for clone assessment. The second study evaluates the proposed method on an industrial software system at Munich Re. For the studied system, the evaluation shows that the proposed method succeeded in reducing cloning and gives indication that the reduction was in fact caused by the application of the clone assessment and control method. It thus demonstrates the feasibility and effectiveness of the proposed method in industrial software engineering practice.

9 Limitations of Clone Detection

Software contains further redundancies than those created by copy & paste. For example, as found in Chapter 5, redundancy in requirements can lead to re-implementation of functionality. Independently developed code of similar behavior has a comparable negative impact on maintenance activities as cloned code. Maintenance thus needs to be aware of it. It is unclear, however, whether existing clone detection approaches can detect, or can be made to detect, such redundancies. Consequently, we do not know whether clone management approaches can be used to control such redundancy once it has been introduced into a system.
This chapter argues that behaviorally similar code of independent origin is unlikely to be syntactically similar. It reports on a controlled experiment that justifies this claim. Existing clone detection approaches are thus ill-suited to detect such redundancy; it is hence beyond the scope of clone management tools. Parts of the content of this chapter have been published in [112].

9.1 Research Questions

We summarize the study using the goal definition template as proposed in [234]:
Analyze behaviorally similar program fragments
for the purpose of characterization and understanding
with respect to its representational similarity and detectability
from the viewpoint of researcher
in the context of independent implementations of a single specification
In detail, we answer the following 3 research questions.

RQ 16: How successfully can existing clone detection tools detect simions¹ that do not result from copy & paste?

Multiple clone detectors exist that search for similar program representation to detect similar code. The first question we need to answer is how well these tools are able to detect simions that have not been created by copy & paste. If existing detectors perform well, no novel detection tools need to be developed.

RQ 17: Is program-representation-similarity-based clone detection in principle suited to detect simions that do not result from copy & paste?
¹ Behaviorally similar code fragments, cf., 2.3.2


Having established that simions are often too syntactically different to be detected by existing clone detectors, we need to understand whether the limitations reside in the tools or in the principles. If the problems reside in the tools but the approaches themselves are suitable, no fundamentally new approaches need to be developed.

RQ 18: Do simions that do not result from copy & paste occur in practice?
The third question we address is whether simions occur in real world systems. From a software engineering perspective, the answer to this question strongly influences the relevance of suitable detection approaches.

9.2 Study Objects

RQs 16 and 17. We created a specification for a simple email address validator function that was implemented by computer science students. The function takes a string containing concatenated email addresses as input. It extracts individual addresses, validates them and returns collections of valid and invalid email addresses. About 400 undergraduate computer science students were asked to implement the specification in Java. They were allowed to work in teams of two or three. Each team only handed in a single solution. Implementation was done under supervision by tutors to avoid copy & paste between different teams. Participation was voluntary and anonymous to reduce pressure to copy for participants that did not succeed on their own. Behavioral similarity was controlled by a test suite. Students had access to this test suite while implementing the specification. To simplify evaluation, students had to enter the implementation into a single file.

Figure 9.1: Size distribution of the study objects (number of objects over number of statements)

We received 156 implementations of the specification. Of those, 109 compiled and passed our test suite. They were taken as study objects. Since all objects pass our test suite, they are known to exhibit equal output behavior for the test inputs. Output behavior for inputs not included in the test suite can vary. Figure 9.1 displays the size distribution of the study objects (import statements are not counted). The shortest implementation comprises 8, the longest 55 statements. In Figure 9.2 the study objects are also categorized by nesting depth, i.e., the maximal depth of curly braces in the Java code, and McCabe's cyclomatic complexity [171]. The area of each bubble is proportional to the number of study objects. These metrics, which both measure certain aspects of the control flow of a program, already separate the study objects strongly, with the two largest clusters having size 19 and 12. When looking for implementations which are structurally the same, it can be expected that these give similar values for both metrics and thus the search could be limited to neighboring clusters (denoted by the bubbles in the diagram).

Figure 9.2: Study objects plotted by nesting depth and cyclomatic complexity

RQ 18. To better understand the existence of simions in real-world software, we analyzed the source code of the well-known reference manager JabRef². We did not only search for simions inside JabRef, but also between JabRef and the code of the open source Apache Commons Library³. Both are written in Java.

9.3 Study Design

RQ 16. To answer RQ 16, we need to determine the recall of existing clone detectors when applied to the study objects. We denote two objects that share a clone relationship as a clone pair. Since we know all study objects to be behaviorally similar, we expect an ideal detector to identify each pair of study objects as clones. For our study, the recall is thus the ratio of detected clone pairs w.r.t. the number of all pairs. We compute the full clone recall and the partial clone recall. For the full clone recall, two objects must be complete clones of each other to form a clone pair. For the partial clone recall, it is sufficient if two objects share any clone (that does not need to cover them entirely) to form a clone pair. We included the partial clone recall, since even partial matches of simions could be useful in practice.
We chose ConQAT (cf., Chapter 7) and Deckard [106] as state-of-the-art token-based and AST-based clone detectors. To separate clones between study objects from clones inside study objects, all clone groups that did not cover at least two different study objects were filtered from the results.
The parameters used when running the detectors influence the detection results. Especially the minimal length parameter strongly impacts precision and recall. To ensure that we do not hereby miss relevant clones, we chose a very small minimal length threshold of 5 statements for ConQAT. To put this into perspective: when using ConQAT in practice [55, 115], we use thresholds between 10 and 15 statements for minimal clone length. Obviously such a small threshold can result in high false positive rates and thus low precision of the results. However, this only affects the interpretation of the results w.r.t. the research question in a single direction. If we fail to detect a significant number of clones even in presence of false positives, we cannot expect to detect more clones with more conservative parameter settings.
² http://jabref.sourceforge.net/
³ http://commons.apache.org/

RQ 17. The study for RQ 17 comprises two parts. First, we collect differences between study objects. We categorize them based on their compensability. To the best of our knowledge, there is no established formal boundary on the capabilities of program-representation-similarity-based (PRSB) detection approaches (cf., Section 2.3.1). Consequently, instead of using a formal boundary, we base the categorization on the capabilities of existing approaches. For that, we consider approaches not only from clone detection, but also from the related research area of algorithm recognition.
Second, having established and categorized these factors, we can look beyond the limitations of existing tools and can determine how well an ideal PRSB clone detection tool can detect simions. To that end, the differences between pairs of study objects are rated based on their category. This is performed by manual inspection. The ratio of pairs that only contain differences that can be compensated w.r.t. all pairs is computed. It is an upper bound for the recall PRSB approaches can in principle achieve on the study objects.
To keep inspection effort manageable, manual inspection was carried out on a random sample of study objects. The sample was generated in such a way that each study object occurred at least once, and it contained 55 pairs. The study objects of each pair were compared manually and the differences between them recorded. As a starting point for the difference categorization, we used the categories of program variation proposed by Metzger and Wen [176] and Wills [232]. If the differences in a category can be compensated by any existing clone detection approach or by existing work from algorithm recognition, we classified it as within reach of PRSB approaches. Else, we classified the category as out of reach of PRSB approaches.
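The sampling constraint and the upper-bound ratio can be sketched as follows. The text does not say how the sample was drawn, so the pairing scheme below (shuffle, pair neighbors, pair the odd one out with a random partner) is only one possible way to obtain 55 pairs from 109 objects in which each object occurs at least once. The category label "different algorithm" and the toy ratings are hypothetical.

```python
import random


def sample_pairs(objects, rng):
    """Pair shuffled objects so that every object occurs in at least one pair."""
    shuffled = list(objects)
    rng.shuffle(shuffled)
    pairs = [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]
    if len(shuffled) % 2:
        # Odd number of objects: pair the leftover with a random other object.
        pairs.append((shuffled[-1], rng.choice(shuffled[:-1])))
    return pairs


# Categories rated as within reach of PRSB approaches. Only "syntactic
# variation" is named in the text; the set is otherwise a placeholder.
WITHIN_REACH = {"syntactic variation"}


def recall_upper_bound(differences_per_pair):
    """Share of pairs whose recorded differences are all within reach."""
    compensable = [d for d in differences_per_pair if set(d) <= WITHIN_REACH]
    return len(compensable) / len(differences_per_pair)


pairs = sample_pairs(range(109), random.Random(42))

# Hypothetical ratings for three pairs; "different algorithm" is an invented label.
toy_ratings = [
    ["syntactic variation"],
    ["syntactic variation", "different algorithm"],
    [],
]
bound = recall_upper_bound(toy_ratings)
print(len(pairs), bound)
```

A single out-of-reach difference is enough to exclude a pair, which is what makes the ratio an upper bound rather than an estimate of achievable recall.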

RQ 18. To identify simions in a real-world system, we performed pair-reviews of source code of JabRef. We did not only analyze if reviewed parts themselves contain simions but also took into account code that is behaviorally similar to third party open source library code, namely the Apache Commons Library. Such findings identify missed reuse opportunities.

9.4 Implementation and Execution

9.4.1 RQ 16: Searching Simions with Existing Tools

We executed ConQAT in three different configurations to detect clones of type 1, types 1 & 2 and
types 1-3 (cf. Section 2.2.3). For type-3 clone detection, an edit distance of 33% of the length of
the clone was accepted⁴. Partial clone recall was computed as the ratio of the number of pairs of
study objects that share any clone, w.r.t. the number of all pairs. The full clone recall was computed
as the ratio of the number of pairs of study objects that share clones that cover at least 90% of
their statements, w.r.t. the number of all pairs. The number of all pairs is the number of edges
in the complete undirected graph of size 109, namely 5778. Deckard was executed with minimal
clone length of 23 tokens (corresponding to 5 statements for an average token number of 4.5 per
statement for the study objects), a stride of 0 and a similarity of 1 for detection of type-1 & type-2
clones and 0.95 for detection of type-3 clones. Again, these values are a lot less restrictive than the
values suggested in [159]. Since the version of Deckard used for the study cannot process Java 1.5,
it could not be executed on all 109 study objects. Instead, it was executed on 50 study objects⁵ that
could be made Java 1.4 compatible by removal of type parameters. For the 50 study objects, the
number of all pairs is 1225.

⁴ As for minimal clone length, this value is more tolerant than what we typically employ in industrial settings.
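The recall definitions above reduce to simple counting. The sketch below restates them in code; the class and method names are assumptions for illustration and do not correspond to anything in ConQAT or Deckard. It applies the 90% coverage threshold to a single statement count for brevity.

```java
/** Illustrative recall arithmetic; all names are assumptions, not part of any tool. */
final class CloneRecall {

    /** Number of edges in the complete undirected graph over n study objects. */
    static long pairCount(int n) {
        return (long) n * (n - 1) / 2;
    }

    /** Ratio of pairs for which a detector reported clones, w.r.t. all pairs. */
    static double recall(long pairsWithClones, long allPairs) {
        return (double) pairsWithClones / allPairs;
    }

    /** True if the clones cover at least 90% of the given statements. */
    static boolean coversAtLeast90Percent(int coveredStatements, int totalStatements) {
        // Integer comparison avoids floating point rounding at the threshold.
        return coveredStatements * 10L >= totalStatements * 9L;
    }
}
```

For the 50 study objects analyzed with Deckard, `CloneRecall.pairCount(50)` evaluates to the 1225 pairs stated above.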

9.4.2 RQ 17: Limits of Representation-based Detection

Categories of Program Variation The following list shows the categorization of differences
encountered during manual inspection of pairs of study objects that were considered principally
within reach of PRSB approaches. Examples with line number references of the form A-xx and
B-yy refer to study objects A and B in Fig. 9.3.

Syntactic variation occurs if different concrete syntax constructs are used to express equivalent
abstract syntax, such as the different statements used to create an empty string array in lines A-4 and
B-4, or different variable names that refer to the same concept, such as valid and validAddresses in
lines A-8 and B-8. In addition, it occurs if the same algorithm is realized in different code fragments
by a different selection of control or binding constructs to achieve the same purpose. Examples are
the implementation of the empty string checks as one (line B-3) or two if statements (lines A-3
and A-5) or the optional else branch in line B-6. Means to compensate syntactic variation include
conversion into intermediate representation and control flow normalization [176].

Organization variation occurs if the same algorithm is realized using different partitionings or
hierarchies of statements or variables that are used in the computation. In line B-14 for example,
a matcher is created and used directly, whereas both the matcher and the match result are stored in
local variables in lines A-17-19. Means to (partial) compensation include variable- or procedure-
inlining and loop- and conditional distribution [176].

Generalization comprises differences in the level of generalization of source code. The types
List<String> in line A-8 and ArrayList<String> in line B-8 are examples of this category. Means
of compensation include replacements of declarations with the most abstract types, or, in a less
accurate fashion, normalization of identifiers.

Delocalization occurs since the order of statements that are independent of each other can vary
arbitrarily between code fragments. In a clone of study object A for example, the list initialization
in line A-8 could be moved behind line A-14 without changing the behavior. Delocalization can,
for example, be compensated by search for subgraph isomorphism as done by PDG-based
approaches [140, 201].

Unnecessary code comprises statements that do not affect the (relevant) IO-behavior of a code
fragment. The debug statement in line A-14 for example can be removed without changing the
output behavior tested for by the test cases⁶. Means of compensation include backward slicing from
output variables to identify unnecessary statements.

⁵ The remaining 59 study objects used additional post Java 1.4 features and were excluded from the study.

Study object A:

 1  public String[] validateEmailAddresses(
        String addresses, char separator,
        Set<String> invalidAddresses) {
 3    if (addresses == null)
 4      return new String[0];
 5    if (addresses.equals(""))
 6      return new String[0];
 8    List<String> valid = new ArrayList<String>();
10    String sep = String.valueOf(separator);
11    if (separator == '\\')
12      sep = "\\\\";
13    String[] result1 = addresses.split(sep);
14    System.out.println(Arrays.toString(result1));
16    for (String adr : result1) {
17      Matcher m = emailPattern.matcher(adr);
18      boolean ergebnis = m.matches();
19      if (ergebnis)
20        valid.add(adr);
21      else
22        invalidAddresses.add(adr);
23    }
25    return valid.toArray(new String[0]);
26  }

Study object B:

 1  public String[] validateEmailAddresses(
        String addresses, char separator,
        Set<String> invalidAddresses) {
 3    if (addresses == null || addresses.equals("")) {
 4      return new String[] {}; }
 6    else {
 7      addresses.replace(" ", "");
 8      ArrayList<String> validAddresses =
            new ArrayList<String>();
10      StringTokenizer tokenizer = new
            StringTokenizer(addresses,
            String.valueOf(separator));
12      while (tokenizer.hasMoreTokens()) {
13        String i = tokenizer.nextToken();
14        if (this.emailPattern.matcher(i).matches()) {
15          validAddresses.add(i);
16        } else {
17          invalidAddresses.add(i);
18        }
19      }
21      return validAddresses.toArray(new String[] {});
22    }
23  }

Figure 9.3: Study objects A and B

The following category contains types of program variation in the study objects that cannot be
compensated by existing clone detection or algorithm recognition approaches.

Different data structure or algorithm: Code fragments use different data structures or algorithms
to solve the same problem. One example for the use of different data structures encountered in the
study objects is the concatenation of valid email addresses into a string that is subsequently split,
instead of the use of a list. The use of different algorithms is illustrated by the various techniques
we found to split the input string into individual addresses: in line A-13, a library method on the
Java class String is called that uses regular expressions to split a string into parts. In line B-10, a
StringTokenizer is used for splitting that does not use regular expressions.

⁶ Depending on the use case, debug messages can or cannot be considered as part of the output of a function.

To illustrate the amount of variation that can be found even in a small program, Figures 9.4, 9.5, 9.6,
9.7 and 9.8 depict different ways to implement the splitting. All examples were found in the study
objects. Figures 9.4 and 9.5 contain code that makes use of library functionality to split the string.
The remaining figures depict custom, yet substantially different splitting algorithms.
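To make the notion of behaviorally similar but representationally different code concrete, the following sketch wraps the two library-based splitting techniques into comparable methods. It is an illustration under assumptions: the class and method names are invented here and the code is not taken from any study object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

/** Two independent ways to split an address list; names are illustrative assumptions. */
final class SplitVariants {

    /** Splits via String.split(), quoting the separator so it is not read as a regex. */
    static List<String> splitWithRegex(String addresses, char separator) {
        String quoted = Pattern.quote(String.valueOf(separator));
        return Arrays.asList(addresses.split(quoted));
    }

    /** Splits via StringTokenizer, which does not use regular expressions. */
    static List<String> splitWithTokenizer(String addresses, char separator) {
        List<String> parts = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(addresses, String.valueOf(separator));
        while (tokenizer.hasMoreTokens()) {
            parts.add(tokenizer.nextToken());
        }
        return parts;
    }
}
```

Both methods return the same list for an input such as "a@example.org;b@example.org", although they share almost no tokens. They are similar rather than equivalent: for "a;;b" the regex variant keeps the empty middle element, while the tokenizer variant skips it.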

9.4.3 RQ 18: Simions in Real World Software

The identification of simions is a hard problem as it requires full comprehension of the source code.
As we did not know the source code of JabRef before, we limited our review to about 6,000 LOC that
contain utility functions that are mainly independent of JabRef’s domain. Examples are functions
that deal with string tokenization or with file system access. In contrast to the JabRef code, we were
familiar with the Apache Commons Library. Nevertheless, to identify simions between JabRef and
the Apache Commons, we specifically searched the Apache Commons for functionality encountered
during inspection of the JabRef code.

9.5 Results

RQ 16 RQ 16 analyzes the capability of ConQAT and Deckard to detect clones in the 109
independent implementations of the same functionality. The results are depicted in Table 9.1.

Table 9.1: Results from clone detection

Detector   Detected Clone Types   Partial Clone Recall   Full Clone Recall
ConQAT     1                      0.4%                   0.0%
ConQAT     1 & 2                  2.3%                   0.0%
ConQAT     1-3                    3.2%                   0.1%
Deckard    1 & 2                  5.1%                   0.1%
Deckard    1-3                    9.7%                   0.8%

As can be expected, the recall values for clones of type 1-3 are higher than for type-1 or type-1 & 2
clones. Furthermore, the AST-based approach yields slightly higher values. This is not surprising
since it performs additional normalization. However, even though we used very tolerant parameter
values for clone detection, which probably result in a false positive rate that is too high for
application in practice, both partial and full clone recall values are very low. The best value for
full clone recall is below 1%, the best value for partial clone recall below 10%.
In other words: for two arbitrary study objects, the probability that any clones are detected between
them is below 10%. The probability that they are detected to be full clones of each other is even
below 1%. Given the very tolerant parameter values used for detection, we cannot expect these tools
to be well suited for the detection of simions (not created by copy & paste) in real world software.


String[] adresses2 = addresses.split(Pattern.quote(String.valueOf(separator)));

Figure 9.4: Splitting with java.lang.String.split()

ArrayList<String> validEmails = new ArrayList<String>();
StringTokenizer st = new StringTokenizer(addresses, Character.toString(separator));
while (st.hasMoreTokens()) {
  String tmp = st.nextToken();
  validEmails.add(tmp);
}

Figure 9.5: Splitting with java.util.StringTokenizer

List<String> result = new ArrayList<String>();
int z = 0;
for (int i = 0; i < addresses.length(); i++) {
  if (i == addresses.length() - 1) {
    result.add(addresses.substring(z, i + 1));
  }
  if (addresses.charAt(i) == separator) {
    result.add(addresses.substring(z, i));
    z = i + 1;
  }
}

Figure 9.6: Splitting with custom algorithm 1

List<String> curAddrs = new ArrayList<String>();
String buffer = "";
for (int i = 0; i < addresses.length(); i++) {
  if (addresses.charAt(i) != separator) {
    buffer += addresses.charAt(i);
  } else {
    curAddrs.add(buffer);
    buffer = "";
  }
}
curAddrs.add(buffer);

Figure 9.7: Splitting with custom algorithm 2

List<String> emailListe = new ArrayList<String>();
int trenneralt = 0;
while (addresses.indexOf(separator, trenneralt) != -1) {
  int trennerneu = addresses.indexOf(separator, trenneralt);
  emailListe.add(addresses.substring(trenneralt, trennerneu));
  trenneralt = trennerneu + 1;
}

Figure 9.8: Splitting with custom algorithm 3


RQ 17 Of the 55 pairs of study objects inspected manually, only 4 did not contain program
variation of category different algorithm or data structure. In other words, only about 7% of the
manually inspected pairs contain only program variation that can (in principle) be compensated.
Since this ratio is an upper bound on the recall PRSB approaches can in principle achieve, we
consider PRSB approaches poorly suited for detection of simions that do not result from copy &
paste.

RQ 18 The manual reviews uncovered multiple simions within JabRef’s utility functions. An
example is the function nCase() in the Util class that converts the first character of a string to upper
case. The same functionality is also provided by class CaseChanger that allows to apply different
strategies for changing the case of letters to strings.

Even more interesting, we found many utility functions that are already provided by well-known
libraries like the Apache Commons. For example, the above method is also provided by method
capitalize() in the Apache Commons class StringUtils. Especially the class Util exhibits a high
number of simions. It has 2,700 LOC and 86 utility methods of which 52 are not related to JabRef’s
domain but deal with strings, files or other data structures that are common in most programs. Of
these 52 methods, 32 exhibit, at least partly, a behavioral similarity to other methods within JabRef or
to functionality provided by the Apache Commons library. Eleven methods are, in fact, behaviorally
equivalent to code provided by Apache. Examples are methods that wrap strings at line boundaries
or a method to obtain the extension of a file name.
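How easily such a utility function ends up implemented more than once can be seen from the following sketch. It shows two independently plausible ways to convert the first character of a string to upper case. Neither method is JabRef's or Apache's actual code; the class and method names are assumptions chosen for illustration.

```java
/** Two representationally different, behaviorally similar implementations (illustrative). */
final class FirstCharUpper {

    /** Variant 1: concatenates the upper-cased first character with the remainder. */
    static String viaSubstring(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    /** Variant 2: replaces the first character in place in a mutable copy. */
    static String viaBuilder(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        StringBuilder copy = new StringBuilder(s);
        copy.setCharAt(0, Character.toUpperCase(copy.charAt(0)));
        return copy.toString();
    }
}
```

Both variants map "jabRef" to "JabRef" and leave empty or null input unchanged, yet a token- or tree-based comparison finds little they have in common.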

Many of these methods in JabRef exhibit suboptimal implementations or even defects. For example,
some of the string-related functions use a platform-specific line separator instead of the platform-
independent one provided by Java. In another case, the escaping of a string to be used safely
within HTML is done by escaping each character instead of using the more elegant functionality
provided by Apache’s StringEscapeUtils class. A drastic example is the JabRef class
ErrorConsole.TeeStream that provides multiplexing functionality for streams and could be mostly
replaced by Apache’s class TeeOutputStream. The implementation provided by JabRef has a defect
as it fails to close one of the multiplexed streams. Another example is class BrowserLauncher that
executes a file system process without making sure that the standard-out and standard-error streams
of the process are drained. In practice, this leads to a deadlock if the number of characters written to
these streams exceeds the capacity of the operating system buffers. Again, the problem could have
been avoided by using Apache’s class DefaultExecutor.
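The deadlock described for BrowserLauncher can be avoided with the standard library alone by consuming the process output before waiting for termination. The following is a minimal sketch under assumptions: the class and method names are invented here, and it is not the code of JabRef or of Apache's DefaultExecutor. Merging standard-error into standard-out means a single read loop drains both streams.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Illustrative process runner that drains output before waiting (names are assumptions). */
final class ProcessRunner {

    /** Reads the stream until end-of-file so the writing side never blocks on a full buffer. */
    static String drain(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            collected.write(chunk, 0, read);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Runs a command and returns its merged output; waits only after the output is drained. */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge standard-error into standard-out
        Process process = builder.start();
        String output;
        try (InputStream stdout = process.getInputStream()) {
            output = drain(stdout);
        }
        process.waitFor();
        return output;
    }
}
```

Calling `waitFor()` first and reading afterwards would reintroduce the problem: a process that writes more than the operating system buffers can hold blocks on its write while the parent blocks in `waitFor()`.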

While the manual review of JabRef is not representative, it indicates that real-world programs,
indeed, exhibit simions, both among their own code and when compared to general purpose libraries.
While some of the simions are also representationally similar, the majority could not be identified
with clone detection tools. This applies in particular for the simions that JabRef shares with the
Apache Commons, probably because the code has been developed by different organizations. A
central insight of our manual inspection was that simions often represent missed reuse opportunities
that do not only increase development efforts but also introduce defects.


9.6 Discussion

In the previous sections we explored the limits of current clone detection tools and also of their
underlying approaches. In our experiment clone detection tools achieve a recall of less than 1% when
analyzing behaviorally similar but independently developed code (RQ 16). While it could have
been expected that existing clone detection approaches have rather limited capabilities for finding
simions, the dramatically low recall is nevertheless surprising. Moreover, the results of RQ 17 show
that only a certain class of simions, those that are representationally similar modulo normalization,
can be found with current clone detection approaches. Hence, we are inclined to disagree with [201]
that states that “[...] attempts can be made to detect semantic clones [simions] by applying
extensive and intelligent normalizations to the code.”

Furthermore, RQ 16 demonstrated that independent programmers do not tend to create
representationally similar code when facing the same implementation problem. Thus, we would expect
to find simions “in the wild” (both inside existing systems and between systems and libraries) which
are not representationally similar and thus not detectable by current tools. RQ 18 provides first
indications for this fact. These results are also backed up by the study in [107], which mined a
huge number of simions from the Linux kernel sources, of which at least half were not
representationally similar. Results that point in the same direction are also presented by Kawrykow
and Robillard, who report on significant amounts of reimplemented API methods they found in Java
systems [127]. Finally, further support is given by our observations that redundancy in
requirements can lead to independent implementations of semantically, yet not syntactically, similar
code (cf. Chapter 5).
The simions inspected for RQ 18 also confirmed our expectations that reuse of existing (library)
functions often not only reduces implementation efforts but also the number of bugs. To provide
some further indication, we used Google Code Search⁷ to identify other Java programs that do
not reuse Apache’s DefaultExecutor and exhibit the same deadlock problem as JabRef that we
discovered in RQ 18. Strikingly, of the first 10 hits for the search lang:java process.waitfor, 6
implementations contain the same problem as JabRef although only 2 of them appear to be the
result of copy & paste.

The lack of reliable simion detectors makes automated simion management infeasible. Since
detection through manual inspections is very costly, inspections are not feasible for large scale,
continuous simion detection. Clone management approaches (cf. Section 3.4.2) that promise to
alleviate the negative impact of cloning during maintenance, however, require data describing similar
program fragments. They are hence not applicable to simion management: they simply have no data to
operate on.

Since the automated management of existing simions during maintenance is hence infeasible,
development must instead focus on their avoidance. First, this implies that developers must be made
and kept aware of available libraries to avoid re-implementation of functionality already available in
the field. Second, redundancy in requirements and models must be detected and consolidated before
they are implemented, to avoid re-implementation of functionality that is already available.

⁷ http://www.google.com/codesearch


Since avoidance does not help with simions that already exist in software, the detection of simions
is a relevant problem which is not yet solved by existing tools. A working simion detector could
not only help in reducing code size by eliminating redundant code, but also find bugs by including
libraries of working code or bug patterns in the detection. We thus consider the construction of
algorithms and tools for simion detection a worthwhile and still open problem.

9.7 Threats to Validity

This section discusses how we mitigated threats to internal and external validity.

Internal Validity For RQ 16, we did not measure the impact of the parameters used for detection
on precision. This has two reasons. (1) Precision measured on the study objects, which are known
to be behaviorally similar, is unlikely to be transferable to real world software, where we cannot
expect the same degree of similarity. Precision measures would thus have to be repeated on further
systems, still with questionable transferability beyond the systems under study. (2) Measuring
precision through manual assessments is already difficult in general [229]. During the course of the
study, we found it to be infeasible for very small clones (e.g., of size below 4 statements) due to low
inter-rater reliability. Instead, we chose very tolerant parameter values that, while likely to result in
low precision, are unlikely to reduce recall. However, this strategy has a single sided effect on the
results of the study in that it merely increases the probability to detect clones. It thus does not affect
the validity of the results that existing tools are poorly suited to detect simions.
For RQ 17, we classified categories of program variation according to whether they are in principle
within reach of PRSB approaches. Misclassification can impact the results. We handled this threat
by choosing a conservative classification strategy. Categories that can only partly be handled (e.g.,
due to the use of heuristics that cannot guarantee completeness or high computation complexity
that could be prohibitively expensive in practice) were rated as within reach of PRSB approaches.
In addition, differences between the study objects that stemmed from differences in their behavior
that were not detected by our test suite were ignored. This conservative strategy thus increases
the probability to consider PRSB approaches as suited for the simion detection problem. It does,
however, not impact the validity of the result that PRSB approaches are poorly suited for the simion
detection problem.
Several factors can lead to less program variation among the study objects than could typically
be encountered in real world software: (1) all students had access to the same test suite, (2) the
signature of the validator function, including its types, was specified, (3) teams could ask tutors
for help. However, all these factors only increase our chances of finding clones and thus do not
invalidate the results.

External Validity We chose two state-of-the-art clone detectors for the study. Some detector we
did not try might perform better. However, given the diversity and amount of program variation
we discovered among the study objects, we do not expect any existing clone detector to perform
substantially better, as would be required to invalidate our conclusions. The results for RQ 17
illustrate that this is also valid for PDG-based detectors⁸. We do not claim transferability of the
actual numbers (e.g., for recall) we measured on the study objects beyond the study. However,
since the study objects were relatively simple compared to real world software, we do not expect
real-world software to exhibit less program variation. On the contrary, we would expect program
variation to be even larger for real world software, due to differences in conventions and practices
between different teams and domains. Regarding the existence of simions in real-world programs
that are not the result of copy & paste (RQ 18), our approach can only provide an indication. It is,
thus, too early to reason about the defect proneness of the missed reuse opportunities represented
by simions.

9.8 Summary

This chapter analyzed program variation in behaviorally similar code of independent origin. With
a controlled experiment we underpin the common intuition of the existence of behaviorally similar
code that cannot be found automatically by existing clone detection approaches. Clone detection
tools are hence not well suited to detect behaviorally similar code of independent origin.
The case study in Chapter 5 indicated that redundancy in requirements specifications can cause
re-implementation of similar functionality. The results of manual inspections of open source code
furthermore indicate that simions do exist in practice. However, the experiment in this chapter
reveals that clone detection is unlikely to discover such similarities on the code level. This lack of
detectors makes existing clone management approaches inapplicable to simions. Their detection
remains an important topic for future work.

⁸ Also, we are not aware of an available PDG-based detector for Java.

10 Conclusion

This chapter summarizes the contributions of this work. Its structure reflects the thesis statement
from Section 1.1: the first section summarizes our results on the significance of cloning, the second
section our contributions for clone assessment and control.

10.1 Significance of Cloning

While the negative impact of cloning on program correctness has been stated qualitatively many
times, its quantitative impact, and thus its significance, in practice remained unclear. Furthermore,
while cloning in source code had been studied intensely, little was known about its extent and
consequences in other software artifacts.
The following sections summarize our empirical results on the impact of cloning on program
correctness and the extent of cloning in requirements specifications and Matlab/Simulink models.
Then, we summarize the cost model that quantifies impact of cloning on maintenance efforts.

10.1.1 Impact on Program Correctness

We investigated four research questions to quantify the impact of code cloning on program
correctness:
RQ 1: Are clones changed independently?
Yes. About half the clone groups in the analyzed systems were type-3 clone groups and thus had
differences beyond variable names and literal values. Changes to cloned code that are not performed
equally to all clones hence frequently occur in practice.
RQ 2: Are type-3 clones created unintentionally?
Yes. A substantial part of the differences between the analyzed clones was unintentional. Many of
the developers were thus not aware of all the existing clones when modifying code. However, the
ratio of intentional w.r.t. unintentional differences varied strongly between the analyzed systems,
indicating differences in the amount of cloning awareness.
RQ 3: Can type-3 clones be indicators for faults?
Yes. Analysis of type-3 clones uncovered 107 faults in productive software. The ratio of type-3
clones that indicated faults, however, varied between the analyzed systems. Software with more
unintentionally inconsistent changes also contained more type-3 clones that indicated faults.
RQ 4: Do unintentional differences between type-3 clones indicate faults?
Yes. About every second unintentional difference between type-3 clones indicated a fault. Lack of
awareness of cloning during maintenance thus significantly impacts program correctness.

Summary The study results show that a lack of awareness of cloning is a threat to program
correctness. While the analyzed systems varied in their share of unintentional differences (and thus
the amount of cloning awareness among their developers), the negative impact of unintentionally
inconsistent changes was uniform: about every second unintentionally inconsistent change had a
direct impact on program correctness. These results thus give strong indication that awareness of
cloning is crucial during software maintenance.
In addition, the study showed that awareness of cloning varies between projects. It thus cannot
be taken for granted in industrial software engineering. Clone control is required to achieve and
maintain awareness of cloning to alleviate the negative impact of existing clones.

10.1.2 Extent of Cloning

Besides source code, further software artifacts are created and maintained during the life cycle of
a software system: requirements specifications play a pivotal role in communication between
customers, requirement engineers, developers and testers; Matlab/Simulink models are replacing code
as primary implementation artifact in embedded software systems. However, cloning has not
previously been studied in these artifacts. We investigated five research questions to shed light on the
extent and impact of cloning in requirements specifications and Matlab/Simulink models.
RQ 5: How accurately can clone detection discover cloning in requirements specifications?
Our clone detector ConQAT achieved high precision values for the 28 analyzed industrial
requirements specifications: 85% in the worst case, 99% on average. Tailoring is, however, required to
achieve such high precision. These results show that clone detection is suitable to detect cloning in
requirements specifications.
RQ 6: How much cloning do real-world requirements specifications contain?
The amount of cloning varied substantially across the analyzed specifications. While some
contained no cloning at all, others exhibited a size increase over 100% due to cloning. The highest
clone coverage values ranged at 51.1% and 71.6%.
RQ 7: What kind of information is cloned in requirements specifications?
We discovered a broad range of different information categories present in cloned specification
fragments. Cloning is not limited to a specific kind of information. Consequently, clone control
cannot be limited to specific categories of requirement information.
RQ 8: Which impact does cloning in requirements specifications have?
Inspections are an important quality assurance technique for requirements specifications. The
cloning induced size blow-up increases effort required for inspections, in the worst case by an
estimated 13 person days for one of the analyzed specifications. Cloning thus increases quality
assurance effort for requirements specifications.
In addition, we saw evidence that requirement cloning can result in redundancy in the
implementation. Besides corresponding source code clones, we found cases in which cloned specification
fragments had been implemented independent of each other. Besides increased implementation
effort, this causes behaviorally similar code that is not the result of source code copy & paste.
RQ 9: How much cloning do real-world Matlab/Simulink Models contain?
The analyzed industrial Matlab/Simulink models contained a substantial amount of cloning. While
the detection approach produced false positives, the developers agreed that awareness of many of
the detected clones is relevant for software maintenance. Cloning thus occurs in Matlab/Simulink
models and needs to be controlled during maintenance, as well.

Summary Cloning is not limited to source code, and neither is its negative impact. Cloning
abounds in requirements specifications and Matlab/Simulink models. It hence needs to be
controlled in them, too, to reduce the negative impact of cloning on engineering efforts.
Clone control measures are likely to differ for requirements specifications and Matlab/Simulink
models, however. Limitations of the existing abstraction mechanisms are a root cause for cloning
in Matlab/Simulink models. Since corresponding clones cannot easily be removed without changes
to the Matlab/Simulink environment, clone control needs to focus on their consistent evolution.
In contrast, for requirements specifications, no abstraction mechanism limitations hinder the clone
consolidation: many of the analyzed specifications did not contain any cloning at all. Consequently,
clone control for them can put more emphasis on the avoidance and removal of cloning.

10.1.3 Clone Cost Model

Besides the empirical studies, we have presented an analytical cost model that quantifies the
economic effect of cloning on maintenance efforts and field faults. It can be used as a basis for
assessment and trade-off decisions. The model produces a result relative to a system without cloning
and thus requires substantially fewer parameters, and less instantiation effort, than general purpose
cost models that produce absolute results.
Instantiation of the cost model on 11 industrial systems indicates that cloning induced impact varies
significantly between systems and is substantial for some. Based on the results, some projects can
achieve considerable savings by performing active clone control.

Summary The cost model complements the empirical studies in two ways. First, it completes
our understanding of the impact of cloning: instead of focusing on isolated aspects or activities, it
quantifies its impact on all maintenance activities and thus on maintenance efforts and faults as a
whole. Second, it makes our observations, speculation and assumptions explicit. This explicitness
offers an objective basis for scientific discourse about the consequences of cloning.


10.2 Clone Control

Our empirical results have shown that cloning negatively affects maintenance efforts, and that
unawareness of cloning impacts program correctness. Clone control is required to avoid creation of
new, and to reduce the negative impact of existing clones. We have presented tool support and a
method for clone control that are summarized in the following sections. Finally, the last section
summarizes our investigation of the limitations of clone detection and control.

10.2.1 Algorithms and Tool Support

The proposed clone detection workbench ConQAT provides support and flexibility for all phases
of clone detection: from preprocessing, detection and postprocessing, to result presentation and
interactive inspection in state of the art IDEs. ConQAT implements several novel detection
algorithms: the first algorithm to detect clones in dataflow models; an index-based approach for type-2
clone detection that is both incremental and scalable; and a novel detection algorithm for type-3
clones in source code. It supports 12 programming and 15 natural languages. This comprehensive
functionality, reflected in its size of about 67 kLOC, was required to perform the case studies and
to support the method for clone assessment and control.
The diversity of the tasks for which clone detection is employed in both research and practice, and
the necessity to tailor clone detection to its context to achieve accurate results, require variation and
adaptation. ConQAT’s product line architecture caters for flexible configuration, while at the same
time achieving a high level of reuse between individual detectors across the clone detector family.

Summary The tool support proposed by this thesis has matured beyond the state of a research
prototype. Several companies have included ConQAT for clone detection or management into their
development or quality assessment processes, including ABB, BMW, Capgemini sd&m, itestra
GmbH, Kabel Deutschland, Munich Re and Wincor Nixdorf. Furthermore, ConQAT’s open
architecture and its availability as open source have facilitated research by others [24, 96, 104, 180, 186].

10.2.2 Method for Clone Assessment and Control

To ease adoption of clone detection and management techniques in practice, this thesis has presented
a method for clone assessment and control. Its goals are to assess the extent and impact of cloning
in software artifacts and to reduce the negative impact of existing clones.
We introduced clone coupling as an explicit relevance criterion. Developer assessments of clone
coupling are employed for clone detection tailoring to achieve accurate cloning information for a
software system. The application of developer assessments to determine clone coupling is based on
assumptions that have been validated through four research questions:
RQ 10: Do developers estimate clone coupling consistently?
Yes, coupling between the analyzed clones was rated very consistently among three different
developers. It is thus realistic to assume a common understanding of clone coupling among developers.
RQ 11: Do developers estimate clone coupling correctly?
Yes. Analysis of the system evolution showed a significantly stronger coupling between clones
that were assessed as coupled, than among those that were assessed as independent. Developer
estimations of coupling thus coincide with actual system evolution.
RQ 12: Can coupling be generalized from a sample?
Yes. Although tailoring was based on a sample of the detected clones, all accepted clones exhibited
a significantly larger coupling during system evolution than the rejected clone candidates. Coupling
can thus be generalized.
RQ 13: How large is the impact of tailoring on clone detection results?
The impact must be expected to vary between systems, since, e.g., the application of code
generators, which contribute to substantial amounts of false positives, varies. However, for the analyzed
system, the impact was large: more than two thirds of the clone candidates detected by untailored
detection were considered irrelevant for maintenance by the developers. Still, over 1000 clone groups
remained in the tailored detection results. Although the system contained a lot of relevant clones,
untailored detection results were unsuited for continuous clone control. These results emphasize the
importance of clone detection tailoring and cast doubt on the validity of some results of empirical
analysis of properties of clones that did not employ any form of tailoring (cf. Chapter 3).

Evaluation The method has been applied to a business information system developed and
maintained at the Munich Re Group. Clone assessment and control was performed over a period of one
year. The successful application of the method validates its applicability in real-world contexts. To
evaluate its impact, we investigated two research questions:
RQ 14: Did clone control reduce the amount of cloning?
Yes: both clone coverage and the number of cloned statements decreased during the study period:
coverage decreased from 14% to below 10%, the number of cloned statements decreased from
15,000 to below 11,000, while the overall system size increased in that period.
RQ 15: Is the improvement likely to be caused by the clone assessment and control measures?
Yes. Before the study period, both clone metrics exhibited substantially different evolution patterns.
The reduction in cloning is, hence, likely to be caused by the application of the method.
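The two metrics reported for RQ 14 are related by a simple ratio, sketched below under the assumption that clone coverage is the fraction of statements that are part of at least one clone. The class name is invented for illustration, and the total system sizes used in the usage note are hypothetical values chosen only to be consistent with the reported percentages; they are not taken from the study.

```java
/** Illustrative clone coverage computation; names are assumptions. */
final class CloneCoverage {

    /** Fraction of all statements that are covered by at least one clone. */
    static double coverage(long clonedStatements, long totalStatements) {
        if (totalStatements <= 0) {
            throw new IllegalArgumentException("totalStatements must be positive");
        }
        return (double) clonedStatements / totalStatements;
    }
}
```

With hypothetical totals of 107,000 statements before and 115,000 statements after the study period, 15,000 cloned statements correspond to a coverage of about 14%, and 11,000 cloned statements to a coverage below 10%. Coverage drops both because fewer statements are cloned and because the system grew.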

Summary The method provides detailed steps to transport insights gained through the case
studies and experiments performed during this thesis into industrial software engineering practice. Its
underlying assumptions have been validated and it has been evaluated on a software system at
Munich Re Group. This evaluation has demonstrated its applicability to real-world projects and
succeeded in reducing the amount of cloning in the participating software system.


DetectionCloneofLimitations10.2.3

Cloning is not the only form of redundancy in source code. Independent implementation of the same functionality, e.g., caused through cloned requirements specifications, can also lead to behaviorally similar code. We analyzed three research questions to better understand the suitability of clone detection to discover behaviorally similar code of independent origin.

RQ 16: How successfully can existing clone detection tools detect simions¹ that do not result from copy & paste?
The analyzed clone detectors were unsuccessful in detecting simions that have been developed independently. The amount of program variation in behaviorally similar code of independent origin is too large for the compensation capabilities of existing clone detectors.

RQ 17: Is program-representation-similarity-based clone detection in principle suited to detect simions that do not result from copy & paste?
No. Simions are likely to contain program variation that cannot be compensated by existing clone detection or algorithm recognition approaches. Program-representation-similarity-based detection is thus poorly suited to detect simions of independent origin.

RQ 18: Do simions that do not result from copy & paste occur in practice?
Yes. Both manual inspections of open source code and analysis of implementation of cloned requirements specifications revealed simions in real-world software.

Summary. Clone detection is limited to copy & paste: independently developed program fragments with similar behavior are out of reach of existing clone detection approaches. During clone control, clone detection can be applied to find regions in artifacts that have been created through copy & paste & modify. It cannot, however, be expected to detect behavioral similarities that have been implemented independently. Clone management tools, thus, cannot be expected to work on simions. Instead of facilitating their consistent evolution during maintenance, clone control thus needs to focus on the avoidance of simions.

¹ Behaviorally similar code fragments, cf. Section 2.3.2.

11 Future Work

This chapter outlines directions of future work. The topics have been inspired by the empirical results and experiences made during the case studies of this thesis.

Section 11.1 presents open issues in the prevention and detection of simions. Section 11.2 discusses future work in clone cost modeling. Section 11.3 proposes clone detection as a tool to guide language engineering. Section 11.4 outlines open issues in clone detection and impact of cloning for natural language documents. Finally, Section 11.5 lists open questions on clone consolidation.

11.1 Management of Simions

Software can contain redundancy beyond copy & paste. One form, independent reimplementation, presents similar problems to software maintenance as cloning. Even worse, reimplementation is typically more expensive, and possibly more error-prone, than copying existing code. Our empirical studies have confirmed the existence of reimplemented functionality in real-world software: for open source via manual code inspections (cf. Section 9.5) and for industrial software as a result of duplicated requirements (cf. Section 5.5.4).

Prevention of Reimplementation. Successful prevention of reimplementation needs to happen in early stages of software development: as soon as it is manifested in the code, effort for implementation, and possibly quality assurance, has already been spent. Consequently, prevention needs to identify similar functionality earlier, e.g., on the requirements level. The fact that prevention should focus on early stages is also supported by Chapter 9, which demonstrated that existing clone detection approaches are unsuited to reliably detect such redundancy.

Identification of similar functionality should be performed both at the start of development of a new system, when design and implementation are derived from a set of requirements, and during maintenance, when new requirements are added or existing functionality gets changed. Similarity can exist both between new requirements and between new requirements and implemented features.

We are not aware of a systematic approach to identify similar functionality on the requirements level to avoid reimplementation. Given the simions we observed during our empirical studies, we consider such an approach an important topic for future work.

public static String fillString(int length, char c) {
    char[] characters = new char[length];
    Arrays.fill(characters, c);
    return new String(characters);
}

private static String padding(int repeat, char padChar) throws ... {
    if (repeat < 0) {
        throw new IndexOutOfBoundsException("..." + repeat);
    }
    final char[] buf = new char[repeat];
    for (int i = 0; i < buf.length; i++) {
        buf[i] = padChar;
    }
    return new String(buf);
}

Figure 11.1: Simions between CCSM Commons and Apache Commons

Simion Detection. A prevention approach, as outlined above, cannot be applied to simions that are already contained in existing software. Thus, to complement the prevention approach, we need a detector that is capable of detecting (at least certain classes of) simions. Since existing clone detection approaches are poorly suited for this (cf. Chapter 9), new approaches need to be developed.

One promising approach for simion detection is dynamic clone detection that executes chunks of code and compares their I/O behavior. As proof of concept, we have implemented a prototypical dynamic clone detector for Java using techniques similar to random testing [112]. An example of detected semantically similar functions from CCSM Commons¹ and Apache Commons is depicted in Figure 11.1. While initial results are encouraging, the prototype still has many limitations, making its practical application infeasible. Future work is required to develop scalable and accurate simion detectors.
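
The idea of comparing I/O behavior can be illustrated with a minimal sketch. It is not the prototype mentioned above: the class name IoBehaviorComparison, the method behaveAlike, the number of trials and the input ranges are assumptions chosen for illustration only. The two candidate functions are modeled after Figure 11.1.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.BiFunction;

/** Illustrative only: compares two functions by their outputs on random inputs. */
final class IoBehaviorComparison {

    /** Candidate A, modeled after fillString in Figure 11.1. */
    static String fillString(int length, char c) {
        char[] characters = new char[length];
        Arrays.fill(characters, c);
        return new String(characters);
    }

    /** Candidate B, modeled after padding in Figure 11.1 (exception message is made up). */
    static String padding(int repeat, char padChar) {
        if (repeat < 0) {
            throw new IndexOutOfBoundsException("negative repeat: " + repeat);
        }
        final char[] buf = new char[repeat];
        for (int i = 0; i < buf.length; i++) {
            buf[i] = padChar;
        }
        return new String(buf);
    }

    /**
     * Returns true if both functions produce equal outputs on all sampled inputs.
     * Agreement on samples suggests, but does not prove, behavioral similarity.
     */
    static boolean behaveAlike(BiFunction<Integer, Character, String> f,
                               BiFunction<Integer, Character, String> g,
                               int trials, long seed) {
        Random random = new Random(seed);
        for (int t = 0; t < trials; t++) {
            int length = random.nextInt(64);            // assumed input range: 0..63
            char c = (char) ('a' + random.nextInt(26)); // assumed input range: 'a'..'z'
            if (!f.apply(length, c).equals(g.apply(length, c))) {
                return false;
            }
        }
        return true;
    }
}
```

A call such as behaveAlike(IoBehaviorComparison::fillString, IoBehaviorComparison::padding, 200, 42L) samples only non-negative lengths, so the different handling of negative arguments in the two functions is deliberately outside the compared input domain. Choosing that domain is one of the design decisions a real detector has to make.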

11.2 Clone Cost Model Data Corpus

A promising direction of future work is the creation of a corpus of reference data that collects activity effort parameters for different contexts and cloning data for different systems. Such a corpus can simplify instantiation of the cost model by making effort parameters available and serve as a benchmark for relative comparison of the impact of cloning in one system against comparable systems developed by other organizations.

Furthermore, there is a definite need for future work on the clone cost model itself. The assumptions the cost model is based on must be validated for different engineering contexts. For cases in which an assumption does not hold, the model needs to be adapted or extended accordingly. Furthermore, the model needs to be instantiated using project specific effort parameters. Last but most important, the correctness of the results must be validated, e.g., through comparing efforts on projects before and after clone consolidation with the predicted efforts.
¹ http://conqat.cs.tum.edu/index.php/CCSM_Commons

11.3 Language Engineering

One root cause for cloning that is frequently mentioned in the clone detection literature is language limitations that prevent the creation of reusable abstractions. As a way around this limitation, developers copy & paste & modify the code. For example, many cross-program clones in COBOL are caused by COBOL’s difficulty to reuse code between programs. Similarly, programs written in early versions of Java often contain cloned wrappers around collection classes to make them type safe, since the language then did not allow parameterization of types.

In these situations, cloning is the symptom, the abstraction mechanism limitation the cause. The presence of cloning can thus indicate language limitations. One potentially beneficial use of clone detection is thus the discovery of abstraction mechanism shortcomings to inform language design and evolution, not only of general purpose programming languages, but of all abstraction mechanisms and languages employed during software engineering.

The evolution history of both general purpose and domain specific languages documents the introduction of language features that allow reducing the amount of cloning in their programs. Java 1.5, for example, introduced generics that, e.g., allow parameterization of types in collection classes. As a consequence, redundant wrappers around collection classes are no longer required to make them type safe. It also introduced an iteration loop that replaces implementations of the Iterator idiom, which previously took several statements that were duplicated every time it was used, with a single statement. Further evidence that the intent to remove duplication drove language design can be found in the evolution of the collections library of the language Scala. Its documentation states that the “principal design objective of the new collections framework was to avoid any duplication, defining every operation in one place only” [168]. A further example can be found in the evolution of attribute grammar formalisms, domain specific languages to declaratively specify syntax and semantics of programming languages: in [174], Mernik et al. extend existing attribute grammar formalisms with inheritance, to allow for more reuse, and thus less duplication, in language specifications.

These examples from language evolution history document that the removal of redundancy is indeed a driver of language design. However, in many cases, the language features were introduced at a late point, when the amount of redundancy in practice had grown large enough to really bother users. Systematic application of clone detection to guide language design could allow mending weaknesses in earlier stages, before a large amount of cloning is created as a work-around, which is then difficult to consolidate.
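
The two Java 1.5 features mentioned above can be made concrete with a small sketch. All class and method names below are hypothetical and serve illustration only; the wrapper stands for the kind of code that had to be cloned once per element type before generics were available.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: all names are hypothetical. */
final class GenericsExample {

    /** Pre-generics style: a type-safe wrapper that had to be cloned per element type. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static final class StringList {
        private final List items = new ArrayList(); // raw type, as before Java 1.5

        void add(String s) {
            items.add(s);
        }

        String get(int index) {
            return (String) items.get(index); // cast hidden inside the wrapper
        }
    }

    /** Iterator idiom, repeated at every use site (written with generic types for brevity). */
    static int totalLengthWithIterator(List<String> words) {
        int total = 0;
        for (Iterator<String> it = words.iterator(); it.hasNext();) {
            String word = it.next();
            total += word.length();
        }
        return total;
    }

    /** The same computation using the iteration loop introduced with Java 1.5. */
    static int totalLengthWithForEach(List<String> words) {
        int total = 0;
        for (String word : words) {
            total += word.length();
        }
        return total;
    }
}
```

With generics, a plain List<String> makes StringList unnecessary, so neither the wrapper nor its siblings for other element types need to exist.
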
Apart from general purpose and domain specific languages, clone detection can also guide the design of more informal abstraction mechanisms employed during software engineering. The templates for use cases and test scripts are also abstraction mechanisms that specify the fixed and the variable parts of their document instances. As the case study in Chapter 5 showed, missing reuse mechanisms in these artifact types also create cloning as a response. We suggest the following extensions of the use case templates based on the cloning we observed in use cases:

Condition Sets: Collections of both pre- and postconditions were frequently cloned between use cases that operate in similar system states. The explicit creation of sets of such conditions, which are then reused, offers two advantages: first, the different system states are more easily recognized from a few precondition sets than from a comparison of the preconditions listed in hundreds of individual use cases; second, when a system state changes, the change only needs to be performed to the corresponding precondition set, not to all use cases that operate in this state. This reduces both maintenance effort and the danger of inconsistencies.

Glossaries: Many of the clones encountered in the use cases repeated definitions of roles, entities or terms. Their single definition in a glossary can remove this redundancy. Glossaries are used in many projects. However, their integration with the use cases, e.g., through navigable links between terms in a use case and their definition in a glossary, does not appear to be habitual in practice.

Walks: Many of the use cases and test scripts we analyzed contain duplicated sequences of steps. In many cases, they correspond to some higher level concept, such as “open customer entry”, which requires several individual system interaction steps, e.g., “Open search form”, “Enter name”, “Perform search” and “Select customer entry from search results”. These recurring sequences of steps could be made reusable as a “walk” (to stay in the metaphor) that can be referenced by use cases.

Designing abstractions is hard. We often do not get it perfectly right on the first attempt. Clone detection can provide a tool to discover weaknesses and react to them early, before they create too much redundancy in practice.

11.4 Cloning in Natural Language Documents

The study in Chapter 5 has shown that cloning abounds in many real-world requirements specifications, and has given indication for its negative impact on engineering efforts. This section outlines promising directions for future work in clone detection in requirements specifications and other natural language software artifacts.

Clone Classification. For code clones, a classification into different clone types has been established (cf. Section 2.2.3). Recently, an analogous classification of clone types for model clones has been proposed [86]. Such classifications are useful to characterize detection algorithms and thus facilitate their comparison and their selection for specific tasks.

Analogous to code clones, we can define a classification of clone types for clones in natural language documents:

type-1 clones are copies that only differ in whitespace. They are thus allowed to show different positions of line breaks or paragraph boundaries.

type-2 clones are copies that, apart from whitespace, can contain replacements of words inside a word category. For example, an adjective in one clone can be replaced by another adjective in its sibling, or a noun through another noun.

type-3 clones are copies that, apart from whitespace and category-preserving word replacements, can exhibit further differences, such as removed or added words, or replacements of a word from one category through a word from another one.

type-4 clones are text fragments that, although different in their wording, convey similar meaning.

Just as the classification of code clones, this classification can be expected to evolve as experience with cloning in natural language documents increases. For example, in [141], Koschke introduces further clone categories to better reflect typical clone evolution patterns. Similarly, a better understanding of the evolution of requirements specifications could lead to a refinement of the above categorization.

Detection of Type-2 Clones. The classification of clone types raises the question of how they can be detected. Type-1 clones are easy to detect, since no normalization beyond whitespace removal needs to be performed. Detection can then simply be performed on the word sequence, as suggested in Chapters 5 and 7. Type-3 clone detection can be applied to this word sequence as well, e.g., employing the algorithm proposed in Chapter 7 for detection of type-3 clones in sequences. For type-2 clone detection, however, a normalization component is required that transforms elements that may substitute one another into a canonical representation. Since the above definition only allows word replacements inside a word category, such as nouns, verbs or adjectives, we need a component that identifies word categories for natural language text.

Natural language processing [119] developed a technique called part-of-speech analysis that determines the word categories for natural language text. Part-of-speech analysis is a mature research area, for which freely available tools, such as TreeTagger [206, 207], exist, that are also used for other analysis tasks, such as ambiguity detection [82].

To evaluate the suitability of part-of-speech analysis for normalization, we have prototypically implemented it into ConQAT and evaluated it on one of the specifications from the case study on cloning in requirements specifications from Chapter 5. Initial results are promising: we detected type-2 clones that differ in the action that gets performed in a use case, e.g., create versus modify of a program entity, or in the tense in which the verbs are written; several clone groups only differed in the name of the target entity on which use case steps were performed, although the steps were identical.

In the instances we saw, normalization increased robustness against modifications. For example, the term “user” had been replaced by the term “actor” in some, but not all of the use cases. Such systematic changes cause many differences in the word sequences and thus make them difficult to detect using edit-distance-based algorithms; normalization, however, compensates such modifications, thus making their detection feasible.
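
The effect of category-based normalization can be sketched as follows. This is not the ConQAT prototype: the class name CategoryNormalization and its tiny hand-written lexicon are assumptions that stand in for a real part-of-speech tagger, whose interface is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: maps words to their word category before clone detection. */
final class CategoryNormalization {

    /** Hypothetical toy lexicon; a real setup would query a part-of-speech tagger instead. */
    private static final Map<String, String> LEXICON = new HashMap<>();

    static {
        for (String noun : new String[] {"user", "actor", "customer", "contract"}) {
            LEXICON.put(noun, "NOUN");
        }
        for (String verb : new String[] {"creates", "modifies", "deletes"}) {
            LEXICON.put(verb, "VERB");
        }
    }

    /**
     * Splits the text at whitespace and replaces each word of a known category
     * by the category name; all other words are kept in lower case.
     */
    static List<String> normalize(String text) {
        List<String> result = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            String lower = word.toLowerCase(Locale.ROOT);
            result.add(LEXICON.getOrDefault(lower, lower));
        }
        return result;
    }
}
```

Under these assumptions, “The user creates the customer” and “The actor modifies the contract” normalize to the same sequence, so a detector working on the normalized word sequence treats them as identical, while differences in uncategorized words such as articles remain visible.
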
Many open issues remain: how does part-of-speech normalization affect precision? Which normalization of word categories gives a good compromise between precision and recall? Should some word categories be ignored entirely, e.g., articles or prepositions? Can automated synonym detection approaches serve to provide a more fine grained normalization than part-of-speech analysis? Natural language software artifacts often adhere to a template; does the resulting regular structure enable improvements or optimizations? Future work is required to shed light on these issues.

Evolution of Requirements Clones. Requirements specifications, like all software artifacts, evolve as the system they describe changes. Unawareness of cloning during document maintenance threatens consistency: just as for source code, unintentionally inconsistent changes can introduce errors into the documents.

Little is known about how requirements specifications evolve, and how evolution is affected by cloning. How large is the impact of cloning on requirements consistency and correctness in practice? Which classes of modifications are often encountered in real-world requirements evolution and should thus be compensated by clone detectors? Empirical studies could help to better understand these issues.

Cloning in Test Scripts. In many domains, a substantial part of the end-to-end testing is still performed manually: test engineers interact with the system under test, trigger inputs and validate system reactions. The test activities they perform are typically specified as natural language test case scripts that adhere to a standardized structure that is defined by a test case template. As the system under test evolves, so do its test cases.

To get a first understanding whether test cases contain cloning, we performed a clone detection on 167 test cases for manual end-to-end tests of an industrial business information system. For a minimal clone length of 20 words, detection discovered about 1000 clones and computed a clone coverage of 54%.

Manual inspection of the test case clones revealed frequent duplication of sequences of interaction steps between the tester and the system. Some of the steps, specifying both the test input and the expected system reaction and state, occurred over 50 times in the test cases. The employed test management tool, however, did not facilitate structured reuse of test case steps, thus encouraging cloning. However, if the corresponding system entities change, test cases probably need to be adapted accordingly. These results thus suggest that cloning in test scripts creates similar problems for maintenance as it does in source code, requirements specifications and data-flow models.

Empirical research is required to better understand the extent and impact of cloning in test scripts in practice. Does it increase test case maintenance effort? Does unawareness during maintenance cause inconsistent or erroneous test scripts? Can clone detection support automation of end-to-end tests by identifying recurring test steps that can be reused across automated test cases?
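
How a clone coverage figure for word sequences can be computed is sketched below. The sketch is a simplification and not the detection algorithm used in the study: it assumes that coverage is the fraction of words that are part of at least one clone, and it finds clones by indexing all word windows of the minimal clone length. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: clone coverage over a whitespace-normalized word sequence. */
final class WordCloneCoverage {

    /** Splits text into words, so that differences in whitespace and line breaks vanish. */
    static List<String> words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Fraction of words covered by at least one window of minCloneLength words that occurs twice or more. */
    static double coverage(List<String> words, int minCloneLength) {
        // Index every window of minCloneLength consecutive words by its content.
        Map<String, List<Integer>> index = new HashMap<>();
        for (int start = 0; start + minCloneLength <= words.size(); start++) {
            String key = String.join(" ", words.subList(start, start + minCloneLength));
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(start);
        }
        // Mark all words that belong to a repeated window.
        boolean[] covered = new boolean[words.size()];
        for (List<Integer> starts : index.values()) {
            if (starts.size() < 2) {
                continue;
            }
            for (int start : starts) {
                Arrays.fill(covered, start, start + minCloneLength, true);
            }
        }
        int count = 0;
        for (boolean isCovered : covered) {
            if (isCovered) {
                count++;
            }
        }
        return words.isEmpty() ? 0.0 : (double) count / words.size();
    }
}
```

Overlapping occurrences of the same window are counted as clones in this sketch; whether such self-overlaps should count is a tailoring decision that a production detector would have to make explicitly.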

11.5 Code Clone Consolidation

While a lot of work has been done on the detection of clones and on studies of their evolution, less is known about their consolidation.

It has been noted that limitations of abstraction mechanisms can impede simple consolidation of clones through the creation of a shared abstraction. However, it is unclear how much cloning in practice is really caused by this. Many of the clones we inspected in manual assessments during our case studies cannot be explained by language limitations, especially for modern languages like Java or C#. In addition, clone control succeeded in substantially reducing the amount of cloning in the case study presented in Chapter 8. Indeed, our own observations suggest that a large part of the clones in practice can be consolidated. Further empirical research is required to better understand limitations of clone consolidation in practice.

When consolidating clones, developers face questions that currently cannot be answered satisfactorily: which clones should be consolidated first? For which clones is the required consolidation effort not justified by expected maintenance simplifications? How can we decide this objectively? Can consolidation in combination with the implementation of other change requests reduce the incurred quality assurance effort? We need a better understanding of these issues to facilitate clone consolidation in practice.

Bibliography

[1] R. Al-Ekram, C. Kapser, R. Holt, and M. Godfrey. Cloning by accident: an empirical study of source code cloning across software systems. In Proc. of ESEM ’05, 2005.
[2] C. Alias and D. Barthou. Algorithm recognition based on demand-driven data-flow analysis. In Proc. of WCRE ’03, 2003.
[3] G. Antoniol, U. Villano, E. Merlo, and M. Di Penta. Analyzing cloning evolution in the linux kernel. Information and Software Technology, 2002.
[4] L. Aversano, L. Cerulo, and M. Di Penta. How clones are maintained: An empirical study. In Proc. of CSMR ’07, 2007.
[5] N. Ayewah, W. Pugh, J. D. Morgenthaler, J. Penix, and Y. Zhou. Using findbugs on production software. In Proc. of OOPSLA ’07, 2007.
[6] B. S. Baker. On finding duplication and near-duplication in large software systems. In Proc. of WCRE ’95, 1995.
[7] T. Bakota, R. Ferenc, and T. Gyimothy. Clone smells in software evolution. In Proc. of ICSM ’07, 2007.
[8] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, and K. Kontogiannis. Partial redesign of Java software systems based on clone analysis. In Proc. of WCRE ’99, 1999.
[9] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, and K. Kontogiannis. Advanced clone-analysis to support object-oriented system refactoring. In Proc. of WCRE ’00, 2000.
[10] V. Basili, L. Briand, S. Condon, Y.-M. Kim, W. L. Melo, and J. D. Valett. Understanding and predicting the process of software maintenance release. In Proc. of ICSE ’96, 1996.
[11] V. Basili, G. Caldiera, and H. Rombach. The goal question metric approach. Encyclopedia of software engineering, 1994.
[12] H. Basit and S. Jarzabek. Detecting higher-level similarity patterns in programs. ACM Softw. Eng. Notes, 2005.
[13] H. Basit and S. Jarzabek. A data mining approach for detecting higher-level clones in software. IEEE Trans. on Softw. Eng., 2009.
[14] H. Basit, S. Puglisi, W. Smyth, A. Turpin, and S. Jarzabek. Efficient token based clone detection with flexible tokenization. In Proc. of ESEM/FSE ’07, 2007.
[15] H. Basit, D. Rajapakse, and S. Jarzabek. Beyond templates: a study of clones in the STL and some general implications. In Proc. of ICSE ’05, 2005.
[16] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of ICSM ’98, 1998.
[17] K. Beck. Test-driven development: By example. Addison-Wesley, 2003.
[18] K. Beck and C. Andres. Extreme programming explained: embrace change. Addison-Wesley Professional, 2004.
[19] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo. Comparison and evaluation of clone detection tools. IEEE Trans. on Softw. Eng., 2007.
[20] N. Bettenburg, W. Shang, W. Ibrahim, B. Adams, Y. Zou, and A. Hassan. An Empirical Study on Inconsistent Changes to Code Clones at Release Level. In Proc. of WCRE ’09, 2009.
[21] B. Boehm. Software Engineering Economics. Prentice-Hall, 1981.
[22] B. Boehm, C. Abts, and S. Chulani. Software development cost estimation approaches – a survey. Ann. Softw. Eng., 2000.
[23] B. W. Boehm, Clark, Horowitz, Brown, Reifer, Chulani, R. Madachy, and B. Steece. Software Cost Estimation with Cocomo II. Prentice Hall PTR, 2000.
[24] J. S. Bradbury and K. Jalbert. Defining a catalog of programming anti-patterns for concurrent java. In Proc. of SPAQu ’09, pages 6–11, Oct. 2009.
[25] F. Brooks Jr. The mythical man-month. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
[26] M. Broy and K. Stølen. Specification and development of interactive systems: focus on streams, interfaces, and refinement. Springer Verlag, 2001.
[27] M. Bruntink, A. van Deursen, R. van Engelen, and T. Tourwé. On the use of clone detection for identifying crosscutting concern code. IEEE Trans. on Softw. Eng., 2005.
[28] A. Bucchiarone, S. Gnesi, G. Lami, G. Trentanni, and A. Fantechi. QuARS Express - A Tool Demonstration. In Proc. of ASE ’08, 2008.
[29] P. Bulychev and M. Minea. Duplicate code detection using anti-unification. In Proc. of SYRCoSE ’08, 2008.
[30] P. Bulychev and M. Minea. An evaluation of duplicate code detection using anti-unification. In Proc. of IWSC ’09, 2009.
[31] H. Bunke, P. Foggia, C. Guidobaldi, C. Sansone, and M. Vento. A comparison of algorithms for maximum common subgraph on randomly connected graphs. In Proc. of SSPR and SPR ’02. Springer, 2002.
[32] E. Burd and J. Bailey. Evaluating clone detection tools for use during preventative maintenance. In Proc. of SCAM ’02, Washington, DC, USA, 2002.
[33] G. Casazza, G. Antoniol, U. Villano, E. Merlo, and M. Penta. Identifying clones in the linux kernel. In Proc. of SCAM ’01, 2001.
[34] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 2008.
[35] X. CHANGSONG, P. Eck, and R. Matzner. Syntax-oriented coding (SoC): A new algorithm for the compression of messages constrained by syntax rules. IEEE international symposium on information theory, 1998.
[36] M. Chilowicz, É. Duris, and G. Roussel. Syntax tree fingerprinting for source code similarity detection. In Proc. of ICPC ’09, 2009.
[37] A. Cockburn. Writing Effective Use Cases. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[38] I. Coman, A. Sillitti, and G. Succi. A case-study on using an Automated In-process Software Engineering Measurement and Analysis system in an industrial environment. In Proc. of ICSE ’09, 2009.
[39] M. J. Corbin and L. A. Strauss. Basics of qualitative research: Techniques and procedures for developing grounded theory. Sage Publ., 3. edition, 2008.
[40] J. Cordy. Comprehending reality - practical barriers to industrial adoption of software maintenance automation. In Proc. of IWPC ’03, 2003.
[41] J. R. Cordy, T. R. Dean, and N. Synytskyy. Practical language-independent detection of near-miss clones. In Proc. of CASCON ’04. IBM Press, 2004.
[42] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company, 2nd edition, 2001.
[43] J. Covington and M. Chase. Eight steps to sustainable change. Industrial Management, 2010.
[44] F. Culwin and T. Lancaster. A review of electronic services for plagiarism detection in student submissions. In Proc. of Teaching of Computing ’00, 2000.
[45] I. Davis and M. Godfrey. Clone detection by exploiting assembler. In Proc. of IWSC ’10, 2010.
[46] M. de Wit, A. Zaidman, and A. van Deursen. Managing code clones using dynamic change tracking and resolution. In Proc. of ICSM ’09, 2009.
[47] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In Proc. of SOSP ’07, 2007.
[48] F. Deissenboeck. Continuous Quality Control of Long-Lived Software Systems. PhD thesis, Technische Universität München, 2009.
[49] F. Deissenboeck, M. Feilkas, L. Heinemann, B. Hummel, and E. Juergens. Conqat book, 2009. http://conqat.in.tum.de/index.php/ConQAT_Book.
[50] F. Deissenboeck, L. Heinemann, B. Hummel, and E. Juergens. Flexible architecture conformance assessment with conqat. In Proc. of ICSE ’10, 2010.
[51] F. Deissenboeck, U. Hermann, E. Juergens, and T. Seifert. LEvD: A lean evolution and development process, 2007. http://conqat.cs.tum.edu/download/levd-process.pdf.
[52] F. Deissenboeck, B. Hummel, and E. Juergens. Conqat - ein toolkit zur kontinuierlichen qualitätsbewertung. In Proc. of SE ’08, 2008.
[53] F. Deissenboeck, B. Hummel, E. Juergens, M. Pfaehler, and B. Schaetz. Model clone detection in practice. In Proc. of IWSC ’10, 2010.
[54] F. Deissenboeck, B. Hummel, E. Juergens, B. Schaetz, S. Wagner, J.-F. Girard, and S. Teuchert. Clone detection in automotive model-based development. In Proc. of ICSE ’08, 2008.
[55] F. Deissenboeck, E. Juergens, B. Hummel, S. Wagner, B. M. y Parareda, and M. Pizka. Tool support for continuous quality control. IEEE Softw., 2008.
[56] F. Deissenboeck, M. Pizka, and T. Seifert. Tool support for continuous quality assessment. In Proc. of STEP ’05, 2005.
[57] C. Domann, E. Juergens, and J. Streit. The curse of copy&paste – Cloning in requirements specifications. In Proc. of ESEM ’09, 2009.
[58] dSpace GmbH. TargetLink Production Code Generation. www.dspace.de.
[59] E. Duala-Ekoko and M. Robillard. Clonetracker: tool support for code clone management. In Proc. of ICSE ’08, 2008.
[60] E. Duala-Ekoko and M. P. Robillard. Tracking code clones in evolving software. In Proc. of ICSE ’07, 2007.
[61] S. Ducasse, O. Nierstrasz, and M. Rieger. On the effectiveness of clone detection by string matching. J. Software maintenance Res. Pract., 2006.
[62] S. Ducasse, M. Rieger, and S. Demeyer. A language independent approach for detecting duplicated code. In Proc. of ICSM ’99, 1999.
[63] S. Eick, J. Steffen, and E. Sumner Jr. Seesoft - a tool for visualizing line oriented software statistics. IEEE Trans. on Softw. Eng., 1992.
[64] A. Endres and D. Rombach. A Handbook of Software and Systems Engineering. Pearson, 2003.
[65] W. S. Evans, C. W. Fraser, and F. Ma. Clone detection via structural abstraction. In Proc. of WCRE ’07, 2007.
[66] F. Fabbrini, M. Fusani, S. Gnesi, and G. Lami. An Automatic Quality Evaluation for Natural Language Requirements. In Proc. of REFSQ ’01, 2001.
[67] R. Falke, P. Frenzel, and R. Koschke. Empirical evaluation of clone detection using syntax suffix trees. Empirical Software Engineering, 2008.
[68] R. Fanta and V. Rajlich. Removing clones from the code. J. Software maintenance Res. Pract., 1999.
[69] P. Finnigan, R. Holt, I. Kalas, S. Kerr, K. Kontogiannis, H. Mueller, J. Mylopoulos, S. Perelgut, M. Stanley, and K. Wong. The software bookshelf. IBM Systems J., 1997.
[70] M. Fowler. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999.
[71] M. Fowler and J. Highsmith. The agile manifesto. Software Development, 2001.
[72] J. Franklin. Integration of clone detective into eclipse. Master’s thesis, Technische Universität München, 2009.
[73] M. Gabel, L. Jiang, and Z. Su. Scalable detection of semantic clones. In Proc. of ICSE ’08, 2008.
[74] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design patterns: elements of reusable object-oriented software. Addison-Wesley Reading, MA, 1995.
[75] M. R. Garey and D. S. Johnson. Computers and intractability. A guide to the theory of NP-completeness. W. H. Freeman and Company, 1979.
[76] R. Geiger, B. Fluri, H. C. Gall, and M. Pinzger. Relation of code clones and change couplings. In Proc. of FASE ’06. Springer, 2006.
[77] D. German, M. Di Penta, Y. Guéhéneuc, and G. Antoniol. Code siblings: Technical and legal implications of copying code between applications. In Proc. of MSR ’09, 2009.
[78] S. Giesecke. Clone-based Reengineering für Java auf der Eclipse-Plattform. Master’s thesis, Universität Oldenburg, 2003.
[79] T. Gilb and D. Graham. Software Inspection. Addison-Wesley, 1993.
[80] R. Glass. Maintenance: Less is not more. IEEE Softw., 1998.
[81] R. Glass. Facts and fallacies of software engineering. Addison-Wesley Professional, 2003.
[82] B. Gleich, O. Creighton, and L. Kof. Ambiguity detection: Towards a tool explaining ambiguity sources. In Proc. of REFSQ ’10, 2010.
[83] N. Göde. Evolution of Type-1 Clones. In Proc. of SCAM ’09, 2009.
[84] N. Göde. Clone removal: Fact or fiction? In Proc. of IWSC ’10, 2010.
[85] N. Göde and R. Koschke. Incremental clone detection. In Proc. of CSMR ’09, 2009.
[86] N. Gold, J. Krinke, M. Harman, and D. Binkley. Issues in Clone Classification for Dataflow Languages. In Proc. of IWSC ’10, 2010.
[87] J. D. Gould, L. Alfaro, R. Finn, B. Haupt, and A. Minuto. Why reading was slower from CRT displays than from paper. SIGCHI Bull., 17, 1987.
[88] S. Grant and J. Cordy. Vector Space Analysis of Software Clones. In Proc. of ICPC ’09, 2009.
[89] P. Grünwald. The minimum description length principle. The MIT Press, 2007.
[90] J. Haldane. Biological possibilities for the human species in the next ten thousand years. Man and his future, 1963.
[91] J. Harder and N. Göde. Quo vadis, clone management? In Proc. of IWSC ’10, 2010.
[92] Y. Higo, Y. Ueda, S. Kusumoto, and K. Inoue. Simultaneous modification support based on code clone analysis. In Proc. of APSEC ’07, 2007.
[93] W. T. B. Hordijk, M. L. Ponisio, and R. J. Wieringa. Harmfulness of code duplication - a structured review of the evidence. In Proc. of EASE ’09. British Computer Society, 2009.
[94] D. Hou, P. Jablonski, and F. Jacob. CnP: Towards an environment for the proactive management of copy-and-paste programming. In Proc. of ICPC ’09, 2009.
[95] D. Huffman. A method for the construction of minimum-redundancy codes. Resonance, 2006.
[96] M. Huhn and D. Scharff. Some observations on scade model clones. In Proc. of MBEES ’10, 2010.
[97] B. Hummel, E. Juergens, L. Heinemann, and M. Conradt. Index-Based Code Clone Detection: Incremental, Distributed, Scalable. In Proc. of ICSM ’10, 2010.
[98] I. I. Ianov. On the equivalence and transformation of program schemes. Commun. ACM, 1958.
[99] IEEE. Standard 1219: Software maintenance, 1998.
[100] IEEE. Standard 830-1998: Recommended practice for software requirements specifications, 1998.
[101] L. K. Ishrar Hussain, Olga Ormandjieva. Automatic quality assessment of SRS text by means of a decision-tree-based text classifier. In Proc. of QSIC ’07, 2007.
[102] P. Jablonski and D. Hou. CReN: a tool for tracking copy-and-paste code clones and renaming identifiers consistently in the IDE. In Proc. of Eclipse ’07, 2007.
[103] F. Jacob, D. Hou, and P. Jablonski. Actively comparing clones inside the code editor. In Proc. of IWSC ’10, 2010.
[104] K. Jalbert and J. S. Bradbury. Using clone detection to identify bugs in concurrent software. In Proc. of ICSM ’10, 2010.
[105] Y. Jia, D. Binkley, M. Harman, J. Krinke, and M. Matsushita. KClone: a proposed approach to fast precise code clone detection. In Proc. of IWSC ’09, 2009.
[106] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. DECKARD: Scalable and accurate tree-based detection of code clones. In Proc. of ICSE ’07, 2007.
[107] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of ISSTA ’09, 2009.
[108] J. H. Johnson. Identifying redundancy in source code using fingerprints. In Proc. of CASCON ’93, 1993.
[109] P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In Proc. of MFCS ’91. Springer, 1991.
[110] E. Juergens and F. Deissenboeck. How much is a clone? In Proc. of SQM ’10, 2010.
[111] E. Juergens, F. Deissenboeck, M. Feilkas, B. Hummel, B. Schaetz, S. Wagner, C. Domann, and J. Streit. Can clone detection support quality assessments of requirements specifications? In Proc. of ICSE ’10, 2010.
[112] E. Juergens, F. Deissenboeck, and B. Hummel. Clone detection beyond copy&paste. In Proc. of IWSC ’09, 2009.
[113] E. Juergens, F. Deissenboeck, and B. Hummel. Clone detective: A workbench for clone detection research. In Proc. of ICSE ’09, 2009.
[114] E. Juergens, F. Deissenboeck, and B. Hummel. Code similarities beyond copy&paste. In Proc. of CSMR ’09, 2010.
[115] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner. Do code clones matter? In Proc. of ICSE ’09, 2009.
[116] E. Juergens and N. Göde. Achieving accurate clone detection results. In Proc. of IWSC ’10, 2010.
[117] E. Juergens, B. Hummel, F. Deissenboeck, and M. Feilkas. Static bug detection through analysis of inconsistent clones. In Proc. of SE ’08. GI, 2008.
[118] M. Jungmann, R. Otterbach, and M. Beine. Development of Safety-Critical Software Using Automatic Code Generation. In Proc. of SAE World Congress ’04, 2004.
[119] D. Jurafsky, J. Martin, A. Kehler, K. Vander Linden, and N. Ward. Speech and language processing. Prentice Hall New York, 2000.
[120] I. Kalaydijeva. Studie zur wiederverwendung bei der softlab gmbh. Master’s thesis, Technische Universität München, 2007.
[121] T. Kamiya, S. Kusumoto, and K. Inoue. Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. on Softw. Eng., 2002.
[122] C. Kapser and M. W. Godfrey. Aiding comprehension of cloning through categorization. In Proc. of IWPSE ’04, 2004.
[123] C. Kapser and M. W. Godfrey. “Cloning considered harmful” considered harmful. In Proc. of WCRE ’06, 2006.
[124] C. J. Kapser, P. Anderson, M. Godfrey, R. Koschke, M. Rieger, F. van Rysselberghe, and P. Weißgerber. Subjectivity in clone judgment: Can we ever agree? In Duplication, Redundancy, and Similarity in Software, Dagstuhl Seminar Proceedings, 2007.
[125] C. J. Kapser and M. W. Godfrey. Improved tool support for the investigation of duplication in software. In Proc. of ICSM ’05, 2005.
[126] S. Kawaguchi, T. Yamashina, H. Uwano, K. Fushida, Y. Kamei, M. Nagura, and H. Iida. SHINOBI: A Tool for Automatic Code Clone Detection in the IDE. In Proc. of WCRE ’09, 2009.
[127] D. Kawrykow and M. Robillard. Improving API usage through detection of redundant code. In Proc. of ASE ’09, 2009.
[128] U. Kelter, J. Wehren, and J. Niere. A generic difference algorithm for UML models. In Proc. of SE ’05, 2005.
[129] A. Kemper and A. Eickler. Datenbanksysteme: Eine Einführung. Oldenbourg Wissenschaftsverlag, 2006.
[130] T. Kiely. Managing change: why reengineering projects fail. Harvard Business Review, 1995.
[131] M. Kim, L. Bergman, T. Lau, and D. Notkin. An ethnographic study of copy and paste programming practices in OOPL. In Proc. of ISESE ’04, 2004.
[132] M. Kim and D. Notkin. Using a clone genealogy extractor for understanding and supporting evolution of code clones. In Proc. of MSR ’05, 2005.
[133] M. Kim, V. Sazawal, D. Notkin, and G. Murphy. An empirical study of code clone genealogies. In Proc. of ESEC/FSE ’05, 2005.
[134] J. Knoop, O. Rüthing, and B. Steffen. Partial dead code elimination. In Proc. of PLDI ’94, 1994.
[135] D. E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, 2nd edition, 1997.
[136] R. Komondoor. Automated duplicated-code detection and procedure extraction. PhD thesis, The University of Wisconsin, Madison, 2003.
[137] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In Proc. of SAS ’01. Springer, 2001.
[138] K. Kontogiannis. Evaluation experiments on the detection of programming patterns using software metrics. In Proc. of WCRE ’97, 1997.
[139] K. Kontogiannis, R. De Mori, E. Merlo, M. Galler, and M. Bernstein. Pattern matching for clone and concept detection. Automated Software Engineering, 1996.
[140] R. Koschke. Survey of research on software clones. In Duplication, Redundancy, and Similarity in Software. Dagstuhl Seminar Proceedings, 2007.
[141] R. Koschke. Frontiers of software clone management. In Frontiers of Software Maintenance, 2008.
[142] R. Koschke, R. Falke, and P. Frenzel. Clone detection using abstract syntax suffix trees. In Proc. of WCRE ’06, 2006.
[143] J. Kotter. Leading change. Harvard Business School Pr, 1996.
[144] J. Kotter and L. Change. Why transformation efforts fail. Harvard Business Review, 1995.
[145] J. Kotter and D. Cohen. The heart of change: Real-life stories of how people change their organizations. Harvard Business Press, 2002.
[146] J. Krinke. Identifying similar code with program dependence graphs. In Proc. of WCRE ’01, 2001.
[147] J. Krinke. A study of consistent and inconsistent changes to code clones. In Proc. of WCRE ’07, 2007.
[148] J. Krinke. Is cloned code more stable than non-cloned code? Proc. of SCAM ’08, 2008.
[149] B. Lague, D. Proulx, J. Mayrand, E. M. Merlo, and J. Hudepohl. Assessing the benefits of incorporating function clone detection in a development process. In Proc. of ICSM ’97, 1997.
[150] R. Lämmel and C. Verhoef. Semi-automatic grammar recovery. Softw. Pract. Exp., 2001.
[151] J. Landis and G. Koch. The measurement of observer agreement for categorical data. Biometrics, 1977.
[152] T. Larkin and S. Larkin. Communicating change: How to win employee support for new business directions. McGraw-Hill Professional, 1994.
[153] K. Lewin. Frontiers in group dynamics: Concept, method and reality in social science; social equilibria and social change. Human relations, 1947.
[154] H. Li and S. Thompson. Clone detection and removal for Erlang/OTP within a refactoring environment. In Proc. of PEPM ’09, 2009.
[155] M. Li, X. Chen, X. Li, B. Ma, and P. Vitányi. The similarity metric. IEEE Transactions on Information Theory, 2004.
[156] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer-Verlag New York Inc, 2008.
[157] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Trans. on Softw. Eng., 2006.
[158] P. Liberatore. Redundancy in logic I: CNF propositional formulae. Artificial Intelligence, 2005.
[159] E. C. Lingxiao Jiang, Zhendong Su. Context-based detection of clone-related bugs. In Proc. of ESEC/FSE ’07, 2007.
[160] H. Liu, Z. Ma, L. Zhang, and W. Shao. Detecting duplications in sequence diagrams based on suffix trees. In Proc. of APSEC ’06, 2006.
[161] S. Livieri, Y. Higo, M. Matsushita, and K. Inoue. Analysis of the linux kernel evolution using code clone coverage. In Proc. of MSR ’07, 2007.
[162] S. Livieri, Y. Higo, M. Matsushita, and K. Inoue. Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder. In Proc. of ICSE ’07, 2007.
[163] A. Lozano and M. Wermelinger. Assessing the effect of clones on changeability. In Proc. of ICSM ’08, 2008.
[164] A. Lozano, M. Wermelinger, and B. Nuseibeh. Evaluating the harmfulness of cloning: A change based experiment. In Proc. of MSR ’07, Washington, DC, USA, 2007.
[165] C. Lyon, R. Barrett, and J. Malcolm. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the ferret plagiarism and collusion detector. In Proc. of PPPPC ’04, 2004.
[166] D. MacKay. Information theory, inference, and learning algorithms. Cambridge Univ Pr, 2003.
[167] A. Marcus and J. I. Maletic. Identification of high-level concept clones in source code. In Proc. of ASE ’01, 2001.
[168] E. Martin Odersky. Scala 2.8 collections, October 2009. http://www.scala-lang.org/sites/default/files/sids/odersky/Fri,%202009-10-02,%2014:16/collections.pdf.
[169] The MathWorks Inc. SIMULINK Model-Based and System-Based Design - Using Simulink, 2002.
[170] J. Mayrand, C. Leblanc, and E. Merlo. Experiment on the automatic detection of function clones in a software system using metrics. In Proc. of ICSM ’96, 1996.
[171] T. McCabe. A complexity measure. IEEE Trans. on Softw. Eng., 1976.
[172] J. J. McGregor. Backtrack search algorithms and the maximal common subgraph problem. Software–Practice and Experience, 1982.
[173] T. Mende, F. Beckwermert, R. Koschke, and G. Meier. Supporting the grow-and-prune model in software product lines evolution using clone detection. In Proc. of CSMR ’08, Washington, DC, USA, 2008.
[174] M. Mernik, M. Lenic, E. Avdicauševic, and V. Zumer. Multiple attribute grammar inheritance. Informatica, 2000.
[175] G. Meszaros. xUnit test patterns: Refactoring test code. Prentice Hall PTR Upper Saddle River, NJ, USA, 2006.
[176] R. Metzger and Z. Wen. Automatic algorithm recognition and replacement. MIT Press, 2000.
[177] B. Meyer. Design and Code Reviews in the Age of the Internet. In Proc. of SEAFOOD ’08. Springer, 2008.
[178] A. Monden, D. Nakae, T. Kamiya, S. Sato, and K. Matsumoto. Software quality analysis by code clones in industrial legacy software. In Proc. of METRICS ’02, 2002.
[179] E. Murphy-Hill, P. Quitslund, and A. Black. Removing duplication from java.io: a case study using traits. In Proc. of OOPSLA ’05, 2005.
[180] H. Nguyen, T. Nguyen, N. Pham, J. Al-Kofahi, and T. Nguyen. Accurate and efficient structural characteristic feature extraction for clone detection. Proc. of FASE ’09, 2009.
[181] T. Nguyen, H. Nguyen, N. Pham, J. Al-Kofahi, and T. Nguyen. Cleman: Comprehensive clone group evolution management. In Proc. of ASE ’08, 2008.
[182] T. T. Nguyen, H. A. Nguyen, J. M. Al-Kofahi, N. H. Pham, and T. N. Nguyen. Scalable and incremental clone detection for evolving software. Proc. of ICSM ’09, 2009.
[183] T. T. Nguyen, H. A. Nguyen, N. H. Pham, J. M. Al-Kofahi, and T. N. Nguyen. Graph-based mining of multiple object usage patterns. In Proc. of FSE ’09, 2009.
[184] J. Nosek and P. Palvia. Software maintenance management: changes in the last decade. J. Software maintenance Res. Pract., 1990.
[185] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: Algorithms and complexity. Prentice-Hall, 1982.
[186] N. Pham, H. Nguyen, T. Nguyen, J. Al-Kofahi, and T. Nguyen. Complete and accurate clone detection in graph-based models. In Proc. of ICSE ’09, 2009.
[187] M. F. Porter. An algorithm for suffix stripping. Readings in information retrieval, 1997.
[188] A. Pretschner, M. Broy, I. H. Krüger, and T. Stauner. Software Engineering for Automotive Systems: A Roadmap. In L. Briand and A. Wolf, editors, Proc. of FoSE ’07, 2007.
[189] F. Rahman, C. Bird, and P. Devanbu. Clones: What is that Smell? In Proc. of MSR ’10, 2010.
[190] D. Ratiu. Intentional meaning of programs. PhD thesis, Technische Universität München, 2009.
[191] J. W. Raymond and P. Willett. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J. Comput-Aided Mol. Des., 2002.
[192] R. Rivest. The MD5 Message-Digest Algorithm. RFC 1321 (Informational), 1992.
[193] A. L. Rodriguez and M. Wermelinger. Tracking clones imprint. In Proc. of IWSC ’10, 2010.
[194] H. D. Rombach, B. T. Ulery, and J. D. Valett. Toward full life cycle control: Adding maintenance measurement to the SEL. J. Syst. Softw., 1992.
[195] C. Roy and J. Cordy. An empirical study of function clones in open source software. In Proc. of WCRE ’08, 2008.
[196] C. Roy and J. Cordy. Scenario-based comparison of clone detection techniques. In Proc. of ICPC ’08, 2008.
[197] C. Roy and J. Cordy. A mutation/injection-based automatic framework for evaluating clone detection tools. In Proc. of MUTATION ’09, 2009.
[198] C. Roy and J. Cordy. Near-miss function clones in open source software: an empirical study. J. Software maintenance Res. Pract., 2009.
[199] C. Roy and J. Cordy. Are Scripting Languages Really Different? Proc. of IWSC ’10, 2010.
[200] C. Roy, J. Cordy, and R. Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 2009.
[201] C. K. Roy and J. R. Cordy. A survey on software clone detection research. Technical Report 541, Queen’s University at Kingston, 2007.
[202] C. K. Roy and J. R. Cordy. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In Proc. of ICPC ’08, 2008.
[203] J. D. Rutledge. On Ianov’s program schemata. J. of the ACM, 1964.
[204] A. Sæbjørnsen, J. Willcock, T. Panas, D. Quinlan, and Z. Su. Detecting code clones in binary executables. In Proc. of ISSTA ’09, pages 117–128. ACM, 2009.
[205] K. Sayood. Introduction to data compression. Morgan Kaufmann, 2000.
[206] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In Proc. of New Methods in Language Processing ’94, 1994.
[207] H. Schmid. Improvements in part-of-speech tagging with an application to German. Natural language processing using very large corpora, 1999.
[208] M. Shaw and D. Garlan. Software architecture. Prentice Hall, 1996.
[209] J. Singer, T. Lethbridge, N. Vinson, and N. Anquetil. An examination of software engineering work practices. In Proc. of CASCON ’97. IBM Press, 1997.
[210] R. Smith and S. Horwitz. Detecting and measuring similarity in code clones. In Proc. of IWSC ’09, 2009.
[211] H. Sneed. A cost model for software maintenance & evolution. In Proc. of ICSM ’04. IEEE CS Press, 2004.
[212] M. Stevens, A. Sotirov, J. Appelbaum, A. K. Lenstra, D. Molnar, D. A. Osvik, and B. de Weger. Short chosen-prefix collisions for MD5 and the creation of a rogue CA certificate. In Proc. of CRYPTO ’09, 2009.
[213] R. Tairas and J. Gray. Phoenix-based clone detection using suffix trees. In Proc. of Southeast regional conference ’06, 2006.
[214] R. Tairas, J. Gray, and I. Baxter. Visualization of clone detection results. In Proc. of ETX ’06, 2006.
[215] H. Täubig. Fast Structure Searching for Computational Proteomics. PhD thesis, TU München, 2007.
[216] S. Thummalapenta, L. Cerulo, L. Aversano, and M. Di Penta. An empirical study on the maintenance of source code clones. Empirical Software Engineering, 2009.
[217] R. Tiarks, R. Koschke, and R. Falke. An assessment of type-3 clones as detected by state-of-the-art tools. In Proc. of SCAM ’09, 2009.
[218] M. Toomim, A. Begel, and S. L. Graham. Managing duplicated code with linked editing. In Proc. of VLHCC ’04, 2004.
[219] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. Gemini: Maintenance support environment based on code clone analysis. In Proc. of METRICS ’02, 2002.
[220] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. On detection of gapped code clones using gap locations. In Proc. of APSEC ’02, 2002.
[221] E. Ukkonen. Approximate string matching over suffix trees. In Proc. of CPM ’93. Springer, 1993.
[222] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 1995.
[223] J. Van Wijk and H. van de Wetering. Cushion treemaps: Visualization of hierarchical information. In Proc. of INFOVIS ’99, 1999.
[224] J. Vlissides. Generation Gap. C++ Report, 1996.
[225] S. Wagner, F. Deissenboeck, B. Hummel, E. Juergens, B. M. y Parareda, and B. S. (Eds.). Selected topics in software quality. Technical Report TUM-I0824, Technische Universität München, Germany, July 2008.
[226] V. Wahler, D. Seipel, J. Wolff, and G. Fischer. Clone detection in source code by frequent itemset techniques. In Fourth IEEE International Workshop on Source Code Analysis and Manipulation, 2004, 2004.
[227] A. Walenstein. Code clones: Reconsidering terminology. In Duplication, Redundancy, and Similarity in Software, Dagstuhl Seminar Proceedings, 2007.
[228] A. Walenstein, M. El-Ramly, J. R. Cordy, W. S. Evans, K. Mahdavi, M. Pizka, G. Ramalingam, and J. W. von Gudenberg. Similarity in programs. In R. Koschke, E. Merlo, and A. Walenstein, editors, Duplication, Redundancy, and Similarity in Software, number 06301 in Dagstuhl Seminar Proceedings. IBFI, 2007.
[229] A. Walenstein, N. Jyoti, J. Li, Y. Yang, and A. Lakhotia. Problems creating task-relevant clone detection reference data. In Proc. of WCRE ’03, 2003.
[230] M. Weber and J. Weisbrod. Requirements engineering in automotive development – experiences and challenges. In Proc. of RE ’02, 2002.
[231] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang. Clustering user queries of a search engine. In Proc. of WWW ’01, 2001.
[232] L. Wills. Flexible control for program recognition. In Proc. of WCRE ’93, 1993.
[233] W. M. Wilson, L. H. Rosenberg, and L. E. Hyatt. Automated analysis of requirement specifications. In Proc. of ICSE ’97, 1997.
[234] C. Wohlin, P. Runeson, and M. Höst. Experimentation in software engineering: An introduction. Kluwer Academic, Boston, Mass., 2000.
[235] T. Yamashina, H. Uwano, K. Fushida, Y. Kamei, M. Nagura, S. Kawaguchi, and H. Iida. SHINOBI: A real-time code clone detection tool for software maintenance. Technical Report NAIST-IS-TR2007011, Nara Institute of Science and Technology, 2008.
[236] D. Yeh and J.-H. Jeng. An empirical study of the influence of departmentalization and organizational position on software maintenance. J. Softw. Maint. Evol. Res. Pr., 2002.
[237] A. Ying, G. Murphy, R. Ng, and M. Chu-Carroll. Predicting source code changes by mining change history. IEEE Trans. on Softw. Eng., 2004.
[238] Y. Zhang, H. Basit, S. Jarzabek, D. Anh, and M. Low. Query-based filtering and graphical view generation for clone analysis. In Proc. of ICSM ’08, 2008.