Real-time implementation of audio source separation methods using a single-microphone recording and their possible applications in spike-based systems
Ever wondered how your brain can focus on the speech of the person you are talking to at a cocktail party where many people are chattering at once? This phenomenon is called the cocktail party effect in neuroscience, and the corresponding problem is called monaural audio source separation in signal processing. Reproducing this ability of the brain on a computer, given only a single mixture of such overlapping audio signals, is the main project of my PhD. Potential applications include hearing aids, signal pre-processing for speech recognition systems, medical signal processing, and many sound-related applications of artificial intelligence.
One classical approach to the problem is Independent Component Analysis (ICA). However, this approach fails when only a single mixture of multiple speakers is available, as in our case, because standard ICA needs at least as many mixtures as there are sources. It is therefore necessary to exploit the statistics of the audio sources.
I developed and implemented both linear and nonlinear methods to solve the task. The linear approaches were implemented in real time, achieving an audio latency of roughly 46 ms. The linear approaches are the following:
1) Eigenmode analysis of covariance difference (EACD) to identify spectro-temporal features associated with large variance for one source and small variance for the other source.
2) Maximum likelihood demixing (MLD) in which the mixture is modelled as the sum of two Gaussian signals and maximum likelihood is used to identify the most likely sources.
3) Suppression-regression (SR) in which autoregressive models are trained to reproduce one source but suppress the other.
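As a rough illustration of the first two linear ideas, the Python sketch below eigen-decomposes a covariance difference and applies the textbook estimate for the sum of two zero-mean Gaussian signals. It is a minimal sketch under stated assumptions, not the implementation described here: the function names, the feature layout (one row per frame) and the toy data are all hypothetical, and the Gaussian formula is the standard closed-form result rather than a confirmed detail of MLD.

```python
import numpy as np


def covariance_difference_modes(feats_a, feats_b, n_modes):
    """Return the eigenmodes of C_a - C_b with the largest eigenvalues.

    feats_a, feats_b: arrays of shape (n_frames, n_features) with training
    features of each source alone (assumed layout, e.g. spectrogram frames).
    Modes with large positive eigenvalues carry high variance for source A
    and low variance for source B.
    """
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # The difference is symmetric, so eigh applies; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov_a - cov_b)
    order = np.argsort(eigvals)[::-1][:n_modes]
    return eigvals[order], eigvecs[:, order]


def gaussian_demix(mixture, cov_a, cov_b):
    """Most likely sources given mixture = a + b with zero-mean Gaussian a and b.

    For a ~ N(0, cov_a) and b ~ N(0, cov_b), the posterior of a given the
    mixture peaks at cov_a @ inv(cov_a + cov_b) @ mixture.
    """
    # Solve (cov_a + cov_b) z = mixture instead of forming the inverse explicitly.
    z = np.linalg.solve(cov_a + cov_b, mixture)
    est_a = cov_a @ z
    return est_a, mixture - est_a


# Toy data (hypothetical): A varies mostly along feature 0, B along feature 1.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(5000, 2)) * np.array([3.0, 0.3])
feats_b = rng.normal(size=(5000, 2)) * np.array([0.3, 3.0])
eigvals, modes = covariance_difference_modes(feats_a, feats_b, n_modes=1)

mixture = np.array([1.0, 1.0])
est_a, est_b = gaussian_demix(mixture, np.diag([9.0, 0.09]), np.diag([0.09, 9.0]))
```

In this toy setting the leading mode aligns with feature 0, where source A dominates, and the demixing step assigns almost all of the first mixture component to A and almost all of the second to B. Real spectro-temporal features would need covariances estimated from clean training recordings of each source.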
I compare our methods with nonlinear source separation methods such as Non-Negative Sparse Coding (NNSC) and show that overall our methods perform significantly better (p<0.01). The nonlinear approach I developed is what I call Multi Layered Random Forest (MLRF). It combines computational auditory scene analysis (CASA) approaches with random forests and achieves state-of-the-art results, outperforming deep learning approaches to monaural source separation. I quantify the performance of our algorithms in terms of the residual error between the estimated and the original spectrograms (lower is better), the audio waveform signal-to-noise ratio (SNR; higher is better), PESQ scores, and Short-Time Objective Intelligibility (STOI) scores.
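The two simplest of these measures can be sketched in a few lines of Python. The exact definitions used in this work are not stated here, so the formulas below are common conventions taken as assumptions: SNR as the ratio of reference power to error power in decibels, and residual error as the squared spectrogram error normalised by the reference energy. The function names are hypothetical. PESQ and STOI require dedicated implementations and are not shown.

```python
import numpy as np


def snr_db(reference, estimate):
    """Waveform SNR in dB: reference power over the power of the estimation error."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))


def residual_error(ref_spec, est_spec):
    """Squared error between two spectrograms, relative to the reference energy."""
    ref_spec = np.asarray(ref_spec, dtype=float)
    est_spec = np.asarray(est_spec, dtype=float)
    return np.sum((ref_spec - est_spec) ** 2) / np.sum(ref_spec ** 2)


# Toy check (hypothetical data): an estimate that is 10% too quiet everywhere.
reference = np.array([1.0, 0.0, -1.0, 0.0])
estimate = 0.9 * reference
snr = snr_db(reference, estimate)
residual = residual_error(np.abs(reference), np.abs(estimate))
```

A better separation gives a higher `snr_db` value and a lower `residual_error` value, matching the direction of improvement described above.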