SELF-SUPERVISED MODELING FOR MULTI-MODAL UNDERSTANDING
YUE XIANGHU
Abstract
We humans perceive information from our surrounding environment through multiple modalities and use it to understand and interact with the world. These multimodal cues offer different but complementary information. Self-supervised learning has emerged as a promising approach for learning meaningful representations from individual modalities, including text, speech, and vision. In this thesis, we aim to leverage self-supervised pre-training techniques for multimodal processing, through a series of works that build toward this goal step by step. Starting from a traditional unimodal understanding task, speech recognition, the first work addresses the code-switching problem. Because learning purely from labeled examples does not resemble language acquisition in humans, the second work focuses on learning speech representations from unlabeled speech data. The third work takes the universality of self-supervised pre-training one step further by unifying speech and text pre-training within a single model. Finally, the fourth work builds a unified audio-visual-text model to enable various multimodal understanding tasks.
Keywords
Self-supervised learning; multimodal; unsupervised learning; pre-training
Date
2023-09-29
Type
Thesis