Cross-Media Retrieval Based on Big Data Technology

Photo from Vecteezy.com

by Holly Cui

The cocktail party problem has attracted growing attention in the speech community in recent years. In particular, single-channel multi-talker speech separation and recognition has become a research hotspot. Moreover, visual information has been adopted to improve the performance of both speech separation and speech recognition.

In this paper, we explore two data augmentation methods for improving baseline speech separation systems based on permutation invariant training (PIT). The first uses visual information to determine the permutation of the separated speakers and thereby improve separation performance (the FIX strategy). The second builds on SpecAugment, which we explore as a big data augmentation method for improving separation performance (the masking-based data augmentation strategy). Finally, we achieved dB of SDR on a mixed dataset built from the TCD-TIMIT corpus.
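To make the two ideas concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a mean squared error objective, two speakers, and spectrograms stored as lists of frequency rows. The names `pit_loss`, `fixed_loss`, and `mask_spectrogram` are hypothetical, and the fixed permutation passed to `fixed_loss` merely stands in for an assignment obtained from visual information. The masking function covers only the frequency and time masking parts of SpecAugment and leaves out time warping.

```python
import itertools
import random


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fixed_loss(estimates, targets, perm):
    """Average loss when estimates[i] is matched to targets[perm[i]].

    Assumption: perm comes from an external cue (e.g. visual information),
    so no search over speaker assignments is needed.
    """
    return sum(mse(estimates[i], targets[p]) for i, p in enumerate(perm)) / len(perm)


def pit_loss(estimates, targets):
    """Permutation invariant loss: try every assignment, keep the cheapest."""
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(targets))):
        loss = fixed_loss(estimates, targets, perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Zero one random band of frequency rows and one band of time columns.

    spec is a list of frequency rows, each a list over time frames.
    The mask widths are illustrative parameters, not values from the paper.
    """
    n_freq, n_time = len(spec), len(spec[0])
    f = rng.randint(0, max_freq_width)   # frequency mask width
    f0 = rng.randint(0, n_freq - f)      # frequency mask start
    t = rng.randint(0, max_time_width)   # time mask width
    t0 = rng.randint(0, n_time - t)      # time mask start
    out = [row[:] for row in spec]       # copy so the input is untouched
    for i in range(n_freq):
        for j in range(n_time):
            if f0 <= i < f0 + f or t0 <= j < t0 + t:
                out[i][j] = 0.0
    return out


# Toy example: the network outputs the two speakers in swapped order.
targets = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
estimates = [[0.1, 0.9, 0.0], [0.9, 0.1, 1.0]]
loss, perm = pit_loss(estimates, targets)  # perm == (1, 0)

spec = [[1.0] * 6 for _ in range(4)]
augmented = mask_spectrogram(spec, max_freq_width=2, max_time_width=3,
                             rng=random.Random(0))
```

In the toy example, PIT finds the swapped assignment by exhaustive search. When the assignment is already known from another source, `fixed_loss` evaluates that single permutation directly.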

✨ Check out the publication:

Y. Cui and Y. Wang, “Audio-Visual Single-Channel Signal Separation Based on Big Data Augmentation,” 2020 IEEE 3rd International Conference of Safe Production and Informatization (IICSPI), Chongqing City, China, 2020, pp. 636-640, doi: 10.1109/IICSPI51290.2020.9332362.