SMU Data Science Review


Speaker and language recognition is an active research area, but much of the work relies on private or expensive datasets, which makes the field less accessible than many other areas of machine learning. In addition, many papers make performance claims without comparing their models against other recent research. The recent release of public multilingual speech corpora such as Mozilla's Common Voice, along with several single-language corpora, provides the resources to address both of these problems. We construct an eight-language dataset from Common Voice and a Google Bengali corpus, as well as a five-language holdout test set from Audio Lingua. We then compare one filterbank-based model and two waveform-based models from recent literature, all based on convolutional neural networks. Our filterbank-based model achieves the strongest results, with 90.5% accuracy on our eight-language test set and 74.8% accuracy on our five-language Audio Lingua test set. We conclude that some models originally trained on private datasets are also applicable to our public datasets, and we suggest ways to improve this performance further.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License

Included in

Data Science Commons