
Segmenting, Modeling, and Matching Video Clips Containing Multiple Moving Objects


Segmenting, Modeling, and Matching Video Clips Containing Multiple Moving Objects
Fred Rothganger, Member, IEEE, Svetlana Lazebnik, Member, IEEE, Cordelia Schmid, Senior Member, IEEE, and Jean Ponce, IEEE Fellow
Abstract — This article presents a novel representation for dynamic scenes composed of multiple rigid objects that may undergo different motions and are observed by a moving camera. Multi-view constraints associated with groups of affine-covariant scene patches and a normalized description of their appearance are used to segment a scene into its rigid components, construct three-dimensional models of these components, and match instances of models recovered from different image sequences. The proposed approach has been applied to the detection and matching of moving objects in video sequences and to shot matching, i.e., the identification of shots that depict the same scene in a video clip.

Index Terms — Affine-covariant patches, structure from motion, motion segmentation, shot matching, video retrieval.

I. INTRODUCTION

THE explosion in both the richness and quantity of digital video content available to the average consumer creates a need for indexing and retrieval tools to effectively manage the large volume of data and efficiently access specific frames, scenes, and/or shots. Most existing video search tools [1], [2], [3], [4] rely on the appearance and two-dimensional (2D) geometric attributes of individual frames in the sequence, and they do not take advantage of the stronger three-dimensional (3D) constraints associated with multiple frames. In this presentation, we propose a richer representation of video content for modeling and retrieval tasks that is based on explicitly recovering the 3D structure of a scene using structure from motion (SFM) constraints.

Following our earlier work on modeling and recognition of static objects from photographs [5], we propose to represent 3D structure using a collection of small planar patches, combined with a description of their local appearance.

In principle, our patch-based 3D object representation can be naturally extended from modeling static objects captured in a few photographs to modeling dynamic scenes captured in video sequences. In practice, though, this extension is quite challenging because it requires several significant modifications of the existing approach. First, our previous work [5] has assumed a simplified affine projection model, which cannot handle the significant perspective effects contained in many scenes from films or commercial video — see, for example, the street scene from the movie Run Lola Run (Fig. 6). In the present contribution, we address this issue by introducing a new projection model that accounts for perspective effects between different patches, but uses an affine model within each individual patch. Second, video clips almost always contain multiple objects moving independently from the camera. We deal with this complication by developing a method to segment the tracked features into rigid groups and discard the features that do not fall into any rigid group. Notice that this is fundamentally a rigid modeling framework, and therefore it cannot represent all kinds of video content, such as fast-moving people or animals.

Our approach to constructing 3D representations of video clips extracts affine-covariant patches, tracks them through the image sequence, and simultaneously segments the tracks and builds 3D models of each rigid component present in the scene. The resulting 3D models represent the structural content of the scene, and they can be compared and matched using techniques similar to those of [5]. This is useful for shot matching, i.e., recognizing shots of the same scene [14], [15], [16], [17], [18], [19] — a fundamental task in video retrieval.
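The multi-view rigidity constraint behind this segmentation can be illustrated numerically. Under an affine projection model, the image tracks of patch centers belonging to a single rigid component, stacked over F frames into a 2F x P measurement matrix, span a subspace of dimension at most four (three for shape plus one for per-frame translation), while tracks from independently moving objects raise the rank. The sketch below is our own minimal illustration of that rank property, not the article's actual algorithm; the names `track_matrix` and `numerical_rank` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
F, P = 6, 20   # number of frames and tracked patch centers per object

def track_matrix(points3d, motions):
    """Stack the affine projections of one rigid point set into a 2F x P matrix.

    points3d: (P, 3) array; motions: list of F pairs (M, t), with M a 2x3
    affine camera matrix and t a 2-vector translation, one pair per frame.
    """
    rows = [M @ points3d.T + t[:, None] for M, t in motions]   # each 2 x P
    return np.vstack(rows)

def numerical_rank(W, tol=1e-8):
    s = np.linalg.svd(W, compute_uv=False)
    return int((s > tol * s[0]).sum())

# Two rigid objects moving independently relative to the camera amount, for
# our purposes, to two point sets observed under two unrelated motion tracks.
pts_a = rng.standard_normal((P, 3))
pts_b = rng.standard_normal((P, 3))
motions_a = [(rng.standard_normal((2, 3)), rng.standard_normal(2)) for _ in range(F)]
motions_b = [(rng.standard_normal((2, 3)), rng.standard_normal(2)) for _ in range(F)]

W_a = track_matrix(pts_a, motions_a)     # tracks from object A only
W_b = track_matrix(pts_b, motions_b)     # tracks from object B only
W_mix = np.hstack([W_a, W_b])            # all tracks pooled as one "object"

rank_a, rank_b, rank_mix = map(numerical_rank, (W_a, W_b, W_mix))
print(rank_a, rank_b, rank_mix)   # each rigid group is rank 4; pooling two motions is not
```

A segmentation procedure can exploit this property by growing candidate groups of tracks and rejecting any track whose addition pushes the group's measurement matrix past the low-rank budget.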
This approach unifies recent work on local image description using affine-covariant regions [6], [7], [8], structure from motion [9], [10], [11], and shape from texture [12], [13]. It is based on the following key observation: Although smooth surfaces are almost never planar on a global scale, they are always planar in the small — that is, sufficiently small surface patches can always be thought of as being comprised of coplanar points. The surface of a solid can thus be represented by a collection of small planar patches, their local appearance, and a description of their 3D spatial relationship expressed in terms of multi-view constraints.

The rest of the article is organized as follows. Section II summarizes related work. Section III describes our approach to tracking affine-covariant patches in image sequences, identifying subsets of tracks that move rigidly together, and building 3D models of the resulting rigid components. Section IV develops a method for matching 3D models constructed from different shots. Sections III-F and IV-B present experimental results on several videos, including shots from the films Run Lola Run and Groundhog Day. Section V concludes with a discussion of the promise and limitations of the proposed approach, together with plans for future work. A preliminary version of this article has appeared in [20].

Fred Rothganger is with Sandia National Laboratories, USA. Svetlana Lazebnik is with the Department of Computer Science and Beckman Institute, University of Illinois at Urbana-Champaign, USA. Cordelia Schmid is with INRIA Rhône-Alpes, France. Jean Ponce is with the University of Illinois at Urbana-Champaign and with École Normale Supérieure, Paris, France.

II. BACKGROUND

As stated in the Introduction, the main target application for the approach presented in this article is video indexing