A Curriculum Learning Method for Improved Noise Robustness in Automatic Speech Recognition


Braun, Stefan; Neil, Daniel; Liu, Shih-Chii (2017). A Curriculum Learning Method for Improved Noise Robustness in Automatic Speech Recognition. In: 25th European Signal Processing Conference, Kos island, Greece, 28 August 2017 - 2 September 2017.

Abstract

The performance of automatic speech recognition systems in noisy environments still leaves room for improvement. Speech enhancement or feature enhancement techniques for increasing the noise robustness of these systems usually add components to the recognition system that need careful optimization. In this work, we propose the use of a relatively simple curriculum training strategy called accordion annealing (ACCAN). It uses a multi-stage training schedule where samples at signal-to-noise ratio (SNR) values as low as 0 dB are added first, and samples at increasingly higher SNR values are gradually added up to an SNR value of 50 dB. We also use a method called per-epoch noise mixing (PEM) that generates noisy training samples online during training and thus enables dynamically changing the SNR of our training data. Both the ACCAN and the PEM methods are evaluated on an end-to-end speech recognition pipeline on the Wall Street Journal corpus. ACCAN decreases the average word error rate (WER) on the 20 dB to -10 dB SNR range by up to 31.4% when compared to a conventional multi-condition training method.
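
The two ideas in the abstract, mixing noise into clean speech at a chosen SNR on every epoch and widening the set of allowed SNR values stage by stage, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 10 dB step between stages, the uniform choice among allowed SNR values, and the use of a single noise recording are all assumptions made for illustration, and the abstract does not specify them.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def snr_stages(low_db=0, high_db=50, step_db=10):
    """Hypothetical schedule: each stage keeps earlier SNR values and adds a higher one.

    The 10 dB step is an assumption; the abstract only gives the 0 dB and 50 dB ends.
    """
    levels = list(range(low_db, high_db + 1, step_db))
    return [levels[: k + 1] for k in range(len(levels))]


def mix_epoch(utterances, noise, stage_levels, rng):
    """Per-epoch mixing: draw a fresh SNR for every utterance each time it is called."""
    mixed = []
    for speech in utterances:
        snr_db = rng.choice(stage_levels)  # uniform choice is an assumption
        mixed.append(mix_at_snr(speech, noise[: len(speech)], snr_db))
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[rng.gauss(0.0, 1.0) for _ in range(800)] for _ in range(4)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(800)]
    for stage, levels in enumerate(snr_stages()):
        noisy = mix_epoch(clean, noise, levels, rng)
        print(f"stage {stage}: SNR values {levels}, {len(noisy)} utterances mixed")
```

Because the noisy samples are regenerated on each call to `mix_epoch`, the SNR set can change between stages without rebuilding a fixed noisy dataset, which is the property the abstract attributes to PEM.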

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 07 Faculty of Science > Institute of Neuroinformatics
Dewey Decimal Classification: 570 Life sciences; biology
Language: English
Event End Date: 2 September 2017
Deposited On: 23 Feb 2018 09:31
Last Modified: 31 Jul 2018 05:11
Publisher: Signal Processing Conference (EUSIPCO), 2017 25th European
Series Name: 25th European Signal Processing Conference
OA Status: Closed
Publisher DOI: https://doi.org/10.23919/EUSIPCO.2017.8081267
Official URL: http://ieeexplore.ieee.org/document/8081267/
