In this paper, we study the use of audio and visual cues to perform speaker segmentation of audiovisual recordings of formal meetings such as interviews, lectures, or courtroom sessions. Relying on audio cues alone for such recordings can be ineffective due to low recording quality and high levels of background noise. We propose to use additional cues from the video stream by exploiting the relatively static locations of speakers in the scene. The experiments show that combining these multiple cues helps identify transitions among speakers more robustly.
Title of host publication: 2007 International Workshop on Content-Based Multimedia Indexing, Proceedings
Publication status: Published - 2007