An audio-visual saliency model for movie summarization

An Audio-Visual Saliency Model for Movie Summarization

Konstantinos Rapantzikos, Georgios Evangelopoulos, Petros Maragos, Yannis Avrithis
School of E.C.E., National Technical University of Athens, Athens 15773, {gevag, maragos}

Abstract — A saliency-based method for generating video summaries is presented, which exploits coupled audiovisual information from both media streams. Efficient speech and image processing algorithms are used to detect key frames that are acoustically and visually salient. Promising results are shown from experiments on a movie database.

Keywords — saliency; saliency curves; attention modeling; event detection; key-frame selection; video summarization; audiovisual

Topic area — Multimedia: methods and systems (indexing and search of multimedia)

I. INTRODUCTION

The growing availability of video content creates a strong requirement for efficient tools to manipulate multimedia data. Considerable progress has been made in multimodal analysis for accessing and analyzing video content, with automatic summarization being one of the main targets of recent research. Summaries are important, since they provide the user with a short version of the video that ideally contains all the information needed to understand the content. Hence, the user may quickly evaluate whether the video is interesting. Generally speaking, there are two types of video abstraction: video summarization and video skimming. Video summarization refers to a collection of key-frames extracted from the sequence, while video skimming represents the sequence in the form of a short clip.

Numerous research efforts have been undertaken for automatically generating video summaries. Earlier works were mainly based on processing only the visual input. Zhuang et al. extracted salient frames based on color clustering and global motion [4], while Ju et al. used gesture analysis in addition to these low-level features [5]. Furthermore, Avrithis et al.
represented the video content by a high-dimensional feature curve and detected key-frames as the ones corresponding to curvature points [6]. Another group of methods uses frame clustering to select representative frames [7][8]. Features extracted from each frame of the sequence form a feature vector and are used in a clustering scheme. Frames closest to the centroids are then selected as key-frames.

Defining what is important in a video stream is quite subjective, and therefore several methods in the field, including the ones referred to above, suffer from the limitation that evaluating the resulting summary is difficult and subjective. Hence, mapping human perception into an automated abstraction process has become quite common. In an attempt to emulate the multimodal nature of human understanding, the Informedia project and its offsprings combined speech, image, natural language understanding and image processing to automatically index video for intelligent search and retrieval [9][10][11]. This approach generated interesting results. Going one step further towards human perception, Ma et al. proposed a method for detecting the salient parts of video based on user attention models [2]. Motion, face and camera attention, along with audio attention models (audio saliency and speech/music), are the cues used by the authors to capture the salient information of the multimedia input and identify the video segments that form the final summary.

In this paper we propose a saliency-based method that exploits individual audio and video saliency information by fusing them to generate video summaries. The visual saliency model is based on a feature competition scheme implemented in a regularization framework. Intensity, color and motion features compete in order to form the most visually salient regions. Audio saliency relates to the audio stream microstructure, captured by modulations emerging at small scales.
In effect, the amplitude, frequency and instantaneous energy of such modulations are used to quantify the importance of audio events. Preliminary results are obtained on arbitrary videos and on a movie database annotated with respect to dialogue events [12].

The paper is organized as follows: Section II presents the two methods for computing audio and visual saliency and the way to fuse them. Section III presents the experimental results, while conclusions are drawn and future work is discussed in Section IV.

II. PROPOSED METHOD

A. Audio Saliency Features

Attention in audio signals is focused perceptually on abrupt changes, transitions and abnormalities in the stream of audio events, such as speech, music, environmental noises in real-life recordings or sound effects in movies. The salient features that attract attention in an audio stream are the ones detected most clearly. Biologically, one of the segregations performed by the auditory system in complex channels is in terms of temporal modulations, while psychophysically, modulated carriers appear more salient to human observers than stationary ones [2][13][14].

Motivated by the above, we construct a user attention curve based on measures of temporal modulation in multiple frequencies (scales). The existence of multi-scale modulations during speech production justifies the AM-FM modulation superposition model for speech [16], according to which speech formants can be modeled by a sum of narrowband, amplitude- and frequency-varying, non-stationary sinusoids s(t) = Σₖ aₖ(t) cos(φₖ(t)). The model is applied here to general audio signals.

1-4244-1274-9/07/$25.00 ©2007 IEEE    MMSP 2007
Demodulation of a real-valued, monocomponent AM-FM signal

  x(t) = a(t) cos( ∫₀ᵗ ω(τ) dτ )   (1)

with time-varying amplitude envelope a(t) and instantaneous frequency ω(t) can be approached using the nonlinear Teager-Kaiser differential energy operator (EO) Ψ[x(t)] = (ẋ(t))² − x(t)ẍ(t), where ẋ(t) = dx/dt. Applied to a narrowband AM-FM signal, the EO yields, with negligible approximation error, the instantaneous source energy, i.e. Ψ[x(t)] ≈ a²(t)ω²(t), corresponding to the physical energy of the oscillation-producing source. An efficient AM-FM demodulation scheme based on the EO is the energy separation algorithm (ESA) [15], which separates the instantaneous energy into its amplitude and frequency components:

  ω(t) ≈ √( Ψ[ẋ(t)] / Ψ[x(t)] ),   |a(t)| ≈ Ψ[x(t)] / √( Ψ[ẋ(t)] )   (2)

with a simple, computationally efficient discrete counterpart of almost instantaneous time resolution. Demodulation through the ESA is applied to the outputs of a set of frequency-tuned bandpass Gabor filters hₖ(t) = exp(−α²t²) cos(ωₖt), which are assumed to globally isolate signal modulations [16] in the presence of noise.

By applying the energy operator to the bandpass outputs of a linearly-spaced bank of K filters, a nonlinear energy measurement of dimension K is obtained. For each signal frame m of length N, short-time representations of the dominant modulation component are obtained by tracking, in the multi-dimensional feature space consisting of the filter responses on s, the maximum average Teager energy (MTE)

  MTE(m) = max_{1≤k≤K} (1/N) Σₙ Ψ_d[(s ∗ hₖ)(n)]   (3)

where n is the sample index with (m−1)N+1 ≤ n ≤ mN and hₖ is the impulse response of the k-th filter. The MTE is considered the dominant signal modulation energy, capturing the joint amplitude-frequency information of audio activity [17].
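As an illustration of the energy operator, the ESA of Eq. (2) and the frame-wise MTE of Eq. (3), the following NumPy sketch demodulates a pure tone and tracks its dominant modulation energy. The filterbank center frequencies, bandwidth parameter and frame length here are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    psi = np.zeros_like(x, dtype=float)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def esa(x):
    """Energy separation (continuous-form ESA applied to discrete samples, Eq. (2)):
    returns instantaneous amplitude |a(n)| and frequency (rad/sample) estimates."""
    psi_x = np.abs(teager(x))
    psi_dx = np.abs(teager(np.diff(x, prepend=x[0])))
    eps = 1e-12
    omega = np.sqrt(psi_dx / (psi_x + eps))
    amp = psi_x / (np.sqrt(psi_dx) + eps)
    return amp, omega

def mte(s, fs, K=6, frame_len=200):
    """Frame-wise maximum average Teager energy (Eq. (3) sketch) over a linearly
    spaced Gabor filterbank; center frequencies and bandwidth are assumptions."""
    n = np.arange(-100, 101) / fs
    alpha = 400.0  # Gabor bandwidth parameter (assumed)
    bank = [np.exp(-(alpha * n) ** 2) * np.cos(2 * np.pi * fc * n)
            for fc in np.linspace(300, 3000, K)]
    energies = np.array([teager(np.convolve(s, h, mode="same")) for h in bank])
    m = len(s) // frame_len
    frames = energies[:, :m * frame_len].reshape(K, m, frame_len).mean(axis=2)
    return frames.max(axis=0)  # MTE(m); argmax over axis 0 gives the filter j(m)

# Pure 440 Hz tone: the ESA should recover omega close to 2*pi*440/fs, amplitude close to 1
fs = 8000.0
t = np.arange(0, 0.25, 1 / fs)
x = np.cos(2 * np.pi * 440 * t)
amp, omega = esa(x)
```

On narrowband inputs the discrete operator is nearly constant per oscillation, which is why a per-frame average is a stable feature; edge samples of `teager` are left at zero, so estimates are read away from the boundaries.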
The filter j(m) = arg maxₖ {MTEₖ(m)} is submitted to demodulation via the ESA to derive the mean instantaneous amplitude (MIA) and mean instantaneous frequency (MIF) features for frame m, leading to a three-dimensional feature vector A(m) = {MTE, MIA, MIF} of the mean dominant modulation parameters for each signal frame. An example of the features for an audio stream ("Jackie Brown" [12]) can be seen in Fig. 1(c).

Fig. 1 (a) Audio signal and saliency indicator, (b) saliency curve with threshold, (c) normalized audio features: MTE (solid), MIA (dashed), MIF (dash-dotted)

B. Visual Saliency

The visual saliency computation module is based on the notion of a centralized saliency map, along with an inherent feature competition scheme, to provide a computational solution to the problem of Region-Of-Interest (ROI) detection/selection in videos. In this framework, a video sequence is represented as a solid in three-dimensional Euclidean space, with time as the third dimension. Hence, the equivalent of a spatial saliency map is a spatiotemporal volume where each voxel has a certain saliency value. This saliency volume is computed with feature competition incorporated by defining cliques at the voxel level and using an optimization procedure with constraints coming from both the inter- and the intra-feature level.

We decompose the video at a number of different spatiotemporal scales. The result is a hierarchy of video volumes that represent the input sequence at decreasing spatiotemporal scales. Afterwards, feature volumes for each feature of interest, including intensity, color and 3D orientation (motion), are computed and decomposed into multiple scales. Every volume simultaneously represents the spatial distribution and the temporal evolution of the encoded feature.
The pyramidal decomposition allows the model to represent smaller and larger "events" in separate subdivisions of the channels.

Feature competition is implemented in the model using an energy-based measure. In a regularization framework, the first term of this energy measure may be regarded as the data term (E_D) and the second as the smoothness term (E_S), since the latter regularizes the current estimate by restricting the class of admissible solutions [18]. The energy involves voxel operations between coarse and finer scales of the volume pyramid: if the center is a voxel at scale c ∈ {2, ..., p − d}, then the surround is the corresponding voxel at scale h = c + δ with δ ∈ {1, ..., d}, where d is the desired depth of the center-surround scheme. Hence, if k ∈ {1, 2, 3} and F_{k,0} corresponds to the original volume of each of the features, with F ∈ {I, RG, BY} and k ∈ {1, ..., |F|}, each level l of the pyramid is obtained by convolution with an isotropic 3D Gaussian G and dyadic down-sampling:

  F_{k,l} = (G ∗ F_{k,l−1}) ↓ 2,   l ∈ {1, 2, ..., p}   (4)

For each voxel q of a feature volume F, the energy is defined as

  E(F_{c,k}(q)) = λ_D E_D(F_{c,k}(q)) + λ_S E_S(F_{c,k}(q))   (5)

where λ_D, λ_S are importance weighting factors for the two terms.
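A minimal sketch of the dyadic spatiotemporal decomposition of Eq. (4) and a center-surround data term in the spirit of Eq. (6), using a separable binomial kernel as a stand-in for the isotropic 3D Gaussian; the kernel, pyramid depth and upsampling-by-repetition are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def smooth3d(vol):
    """Separable binomial approximation of an isotropic 3D Gaussian G."""
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    out = vol.astype(float)
    for ax in range(3):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), ax, out)
    return out

def st_pyramid(volume, levels=3):
    """Eq. (4) sketch: F_l = (G * F_{l-1}) downsampled by 2 in t, y, x.
    `volume` is a (time, height, width) feature volume, e.g. intensity."""
    pyr = [volume.astype(float)]
    for _ in range(levels):
        pyr.append(smooth3d(pyr[-1])[::2, ::2, ::2])
    return pyr

def data_term(pyr, c, delta):
    """Center-surround term sketch: surround level c+delta is brought back to
    the center grid by nearest-neighbor repetition, then compared voxel-wise."""
    center = pyr[c]
    surround = pyr[c + delta]
    for ax in range(3):
        surround = np.repeat(surround, 2 ** delta, axis=ax)
    surround = surround[:center.shape[0], :center.shape[1], :center.shape[2]]
    return center * np.abs(center - surround)
```

Each pyramid level halves the extent of the volume along all three axes, so a fixed-size center-surround comparison at coarser levels responds to progressively larger and slower "events".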
The first term of (5) is defined as

  E_D(F_{c,k}(q)) = F_{c,k}(q) · |F_{c,k}(q) − F_{h,k}(q)|   (6)

and acts as the center-surround operator, while the second is

  E_S(F_{c,k}(q)) = F_{c,k}(q) · (1/|N(q)|) Σ_{r ∈ N(q), r ≠ q} F_{c,k}(r) · Ṽ_c(r)   (7)

where Ṽ_c is the spatiotemporal orientation conspicuity volume, which may be regarded as an indication of motion activity in the scene.

The motivation behind this feature competition scheme is experimental evidence of a biological counterpart in the Human Visual System, namely the interaction/competition between the visual pathways related to motion/depth (M pathway) and to gestalt/depth/color (P pathway) [20]. In short, the visual saliency detection module is based on an iterative minimization scheme that acts on 3D local regions and relies on center-surround inhibition regularized by inter- and intra-feature local constraints. The interested reader may find a detailed description of the method in [19]. Fig. 2 depicts the computed saliency for three frames of the "Lord of the Rings" sequence included in [12]. Bright values correspond to salient areas (notice Gandalf's moving head and the hobbit's moving arms).

Fig. 2 Original frames and the corresponding saliency maps

C. Audiovisual Saliency

Fusing audio and visual information is not a trivial task, since the two are computed on modalities of different nature. Nevertheless, combining the final outputs of the two saliency detection modules is straightforward. In this paper we use a simple linear scheme for creating the final audiovisual saliency curve that provides the key-frame index for the summary.

The audio saliency curve is derived by weighted linear fusion of the normalized audio feature vector:

  S_A = w₁·MTE + w₂·MIA + w₃·MIF   (8)

Normalization is done by a least-squares fit of the individual value ranges to [0, 1] (see Fig. 1).
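The linear fusion of Eq. (8), its extension to a coupled audiovisual curve, and key-frame selection from the curve's local maxima can be sketched as follows. The equal weights, the min-max normalization standing in for the least-squares range fit, and the simple peak picker are all illustrative assumptions:

```python
import numpy as np

def unit_range(x):
    """Map a feature curve to [0, 1]; min-max stand-in (assumption) for the
    least-squares range fit mentioned in the text."""
    lo, hi = float(x.min()), float(x.max())
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x, dtype=float)

def audiovisual_curve(mte_c, mia_c, mif_c, s_v, w=(1/3, 1/3, 1/3), w_av=(0.5, 0.5)):
    """S_A = w1*MTE + w2*MIA + w3*MIF, then S_AV = wA*S_A + wV*S_V
    (all weights are equal-weight assumptions)."""
    s_a = w[0] * unit_range(mte_c) + w[1] * unit_range(mia_c) + w[2] * unit_range(mif_c)
    return w_av[0] * s_a + w_av[1] * unit_range(s_v)

def keyframes(s_av, n_keep=8):
    """Local maxima of the saliency curve (simple peak picking), strongest first."""
    interior = (s_av[1:-1] > s_av[:-2]) & (s_av[1:-1] >= s_av[2:])
    peaks = np.where(interior)[0] + 1
    return peaks[np.argsort(s_av[peaks])[::-1][:n_keep]]
```

Because every normalized component lies in [0, 1] and the weights at each stage sum to one, the fused curve stays in [0, 1], so thresholds and peak strengths remain comparable across sequences.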
In order to provide a global measure of scene change based on saliency, we threshold the output of the visual saliency module using a common thresholding technique [21] to discard low-saliency areas, and compute the average value per frame. Hence, we end up with a 1D vector S_V that describes the change of visual saliency throughout the sequence. The coupled audiovisual curve

  S_AV = w_A·S_A + w_V·S_V   (9)

serves as an abstract continuous-valued indicator function of salient events in the audio, the visual, or the common audiovisual domain. By detecting simple geometric features of such curves, we can track down important transition and reference points. Such features are the local maxima of the curve (derived by peak-picking), 1D edge transition points (using the zero-crossings of a Derivative-of-Gaussian operator), or regions below certain learned or heuristically defined thresholds. Using the maxima for key-frame selection and a user-defined skimming index, we derive a summarization of the video in terms of its audiovisual saliency.

III. EXPERIMENTAL RESULTS AND DISCUSSION

In order to demonstrate the proposed method, we ran it both on videos of arbitrary content and on the movie database of A.U.T.H. (MUSCLE WP5 Movie database v1.1) [12]. This database consists of 42 scenes extracted from 6 movies of different genres. Fig. 3 shows the audio, visual and audiovisual saliency curves for 512 frames of the movie "Jackie Brown" included in this database. Fig. 4 depicts the same curves with the detected features superimposed, while Fig. 5 shows selected key-frames.

Fig. 3 Superimposed audio, video and audiovisual saliency curves (better viewed in color)

Fig. 4 Curves and detected features for (a) audio saliency, (b) video saliency, (c) audiovisual saliency

Fig. 5 Frames located at the detected points of the audiovisual saliency curve (frames 7, 15, 17, 49, 61, 69, 81, 87)

IV.
CONCLUSIONS AND FUTURE WORK

In this paper we present two methods for audio and visual saliency computation and explore the potential of their fusion for movie summarization. We believe that movie summarization based on simulated human perception leads to successful video summaries. In the current work we used simple fusion of audiovisual curves to detect key-frames and create the summary. In the future we will examine more fusion methods and extend the technique to create video skims. Video skims are more attractive, since they contain audio and motion information that makes the abstraction more natural and informative.

ACKNOWLEDGMENTS

This research work has been supported by the European Network of Excellence MUSCLE. We wish to thank C. Kotropoulos and his group at A.U.T.H. for providing us with the movie database.

REFERENCES

[1] K. Rapantzikos, Y. Avrithis, "An enhanced spatiotemporal visual attention model for sports video analysis", Proc. CBMI'05, Riga, Latvia, Jun 2005.
[2] Y.-F. Ma, X.-S. Hua, L. Lu, H.-J. Zhang, "A generic framework of user attention model and its application in video summarization", IEEE Trans. on Multimedia, vol. 7, pp. 907-919, Oct 2005.
[3] Y. Li, S.-H. Lee, C.-H. Yeh, C.-C. Jay Kuo, "Techniques for movie content analysis and skimming", IEEE Signal Processing Magazine, pp. 79-89, Mar 2006.
[4] Y. Zhuang, Y. Rui, T.S. Huang, S. Mehrotra, "Adaptive Key Frame Extraction Using Unsupervised Clustering", Proc. ICIP'98, pp. 866-870, Oct 1998.
[5] S.X. Ju, M.J. Black, S. Minneman, D. Kimber, "Summarization of video-taped presentations: Automatic analysis of motion and gestures", IEEE Trans. Circuits Syst. Video Technology, vol. 8, pp. 686-696, Sep 1998.
[6] Y. Avrithis, A. Doulamis, N. Doulamis, S. Kollias, "A Stochastic Framework for Optimal Key Frame Extraction from MPEG Video Databases", Computer Vision and Image Understanding, vol. 75 (1/2), pp. 3-24, Jul 1999.
[7] S. Uchihashi, J. Foote, A. Girgensohn, J.
Boreczky, "Video manga: Generating semantically meaningful video summaries", Proc. ACM Multimedia'99, pp. 383-392, Oct 1999.
[8] K. Ratakonda, M.L. Sezan, R. Crinon, "Hierarchical video summarization", Proc. SPIE, vol. 3653, pp. 1531-1541, Dec 2000.
[9] M.A. Smith, T. Kanade, "Video skimming and characterization through the combination of image and language understanding techniques", Proc. CVPR'97, 1997.
[10] A.G. Hauptmann, "Lessons for the Future from a Decade of Informedia Video Analysis Research", Lecture Notes in Computer Science, vol. 3568, pp. 1-10, Aug 2005.
[11] A.G. Hauptmann, R. Yan, T.D. Ng, W. Lin, R. Jin, D.M. Christel, M. Chen, R. Baron, "Video Classification and Retrieval with the Informedia Digital Video Library System", Proc. TREC'02, Gaithersburg, MD, USA, Nov 2002.
[12] MUSCLE WP5 Movie Dialogue DataBase v1.1, Aristotle University of Thessaloniki, AIILab, 2007.
[13] C. Kayser, C.I. Petkov, M. Lippert, N.K. Logothetis, "Mechanisms for allocating auditory attention: an auditory saliency map", Current Biology, vol. 15, no. 21, pp. 1943-1947, 2005.
[14] N. Tsingos, E. Gallo, G. Drettakis, "Perceptual audio rendering of complex virtual environments", SIGGRAPH 2004.
[15] P. Maragos, J.F. Kaiser, T.F. Quatieri, "Energy Separation in Signal Modulations with Application to Speech Analysis", IEEE Trans. Signal Proc., vol. 41, no. 10, pp. 3024-3051, 1993.
[16] A.C. Bovik, P. Maragos, T.F. Quatieri, "AM-FM Energy Detection and Separation in Noise Using Multiband Energy Operators", IEEE Trans. Signal Proc., vol. 41, no. 12, pp. 3245-3265, 1993.
[17] G. Evangelopoulos, P. Maragos, "Multiband Modulation Energy Tracking for Noisy Speech Detection", IEEE Trans. Audio, Speech and Language Proc., vol. 14, no. 6, pp. 2024-2038, 2006.
[18] K. Rapantzikos, M. Zervakis, "Robust optical flow estimation in MPEG sequences", Proc. ICASSP'05, Mar 2005.
[19] K.
Rapantzikos, N. Tsapatsoulis, Y. Avrithis, S. Kollias, "Spatiotemporal saliency for video classification", IEEE Transactions on Multimedia, submitted.
[20] E.R. Kandel, J.H. Schwartz, T.M. Jessell, "Essentials of Neural Science and Behavior", Appleton & Lange, Stamford, Connecticut, 1995.
[21] N. Otsu, "A threshold selection method from gray level histograms", IEEE Trans. Systems, Man and Cybernetics, vol. 9, pp. 62-66, 1979.