Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/224560
Title: MULTIPLE SOUND SOURCE LOCALIZATION
Authors: TONG JUN JIE
Keywords: Sound localization, noise removal, speech separation, machine learning, microphone array, Bayesian neural network
Issue Date: 11-Aug-2021
Citation: TONG JUN JIE (2021-08-11). MULTIPLE SOUND SOURCE LOCALIZATION. ScholarBank@NUS Repository.
Abstract: The objective of multiple sound source localization is to resolve a mixture signal, i.e., multiple speakers with background noise, into a set of individual locations, where each location corresponds to a single speaker. Multiple sound source localization is an important problem that has yet to be solved effectively. Many reported approaches focus on either a single sound source or simulated data, which often does not translate well to real-world usage. Real-world environments often contain background noise and interfering speakers that dramatically degrade the performance of speech localization. This thesis therefore aims to create a framework that improves the performance and practicality of multi-source sound localization in real environments. Overall, this framework contains three main components: a noise removal network, a speech separation network, and a sound localization network. Because real data contain background noise and interfering speakers' voices, the first contribution of this thesis aims to remove background noise, yielding output audio that should contain only human voices. The shortcomings of current speech separation approaches are then studied extensively. These limitations include the frame leakage problem, the need to know the number of speakers in the mixture in advance, and the permutation ambiguity problem; together they largely prevent speech separation in realistic environments. The second contribution of this thesis therefore proposes a speaker encoder network and a CNN-BLSTM network architecture to eliminate these shortcomings.
The output is a set of individual audio streams, where each stream contains only a single human's speech. The third and fourth contributions of this thesis focus on the sound localization aspect of the framework. Sound localization has two components: direction of arrival (DOA) estimation and distance estimation. Distance estimation, as opposed to DOA, often receives less emphasis despite being an equally important aspect of the problem. Without distance estimation, sound localization accuracy can be severely impacted, as users cannot tell how far away the target is without visual aids. The third contribution therefore proposes a distance estimation approach based on multiple microphone arrays. The fourth contribution focuses on the generalization capability of sound localization using a probabilistic neural network approach. The output is expressed as a probability, which mitigates the overfitting that conventional deterministic networks are prone to. Real-life experimental results show promising improvements over current state-of-the-art approaches.
Appears in Collections: Ph.D Theses (Open)
Files in This Item:
File | Description | Size | Format | Access Settings | Version
---|---|---|---|---|---
PhD Thesis (after oral exam).pdf | | 2.5 MB | Adobe PDF | OPEN | None