AUDIO-VISUAL ACTIVE SPEAKER DETECTION AND RECOGNITION
TAO RUIJIE
Abstract
Audio-visual speech processing aims to solve speech-related problems using both audio and visual information. Research in biology has shown that humans perceive the world through multiple modalities, with speech and face providing complementary information. In this thesis, we focus on audio-visual speaker signal processing and make the following contributions: 1) We exploit long-term temporal information and handle in-the-wild videos to detect the active speaker. 2) We filter noisy speaker recognition data in a large-scale audio-visual dataset and achieve cross-modal speaker recognition. 3) We automatically remove unreliable speech data during self-supervised speaker recognition, and then search for and utilise diverse positive pairs for audio-visual self-supervised speaker recognition. Our research indicates that speaker information and characteristics can be significant cues for multi-modal signal processing and self-supervised learning.
Keywords
Audio-visual, speaker recognition, active speaker detection, self-supervised learning, cross-modality, noisy label
Date
2023-03-24
Type
Thesis