We provide two datasets of Indian languages here. They can be used for spoken language identification (LID) experiments. The two datasets differ significantly in characteristics such as channel, speaking style, and speakers, so they can also be used for LID under domain-mismatched conditions. The languages covered are Assamese, Gujarati, Kannada, Malayalam, Bengali, Hindi, Odia and Telugu.
Some general details about these Indian languages can be found here.
These datasets may be used for academic/research work only (not for commercial applications).
@inproceedings{muralikrishna2021spoken,
title={Spoken Language Identification in Unseen Target Domain Using Within-Sample Similarity Loss},
author={Muralikrishna, H and Kapoor, Shantanu and Dinesh, Dileep Aroor and Rajan, Padmanabhan},
booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={7223--7227},
year={2021},
organization={IEEE} }
This dataset contains read speech samples obtained from news broadcasts. The audio files were obtained from the All India Radio website (http://newsonair.com/nsd-audio.aspx). Each language has around 15 speakers and about 4.5 hours of speech. The original files downloaded from this website contain some background music at the beginning and end of each file. Hence, a few samples contain background music/noise.
This dataset contains speech samples obtained from various YouTube videos of personal interviews and online teaching (education). Each language has 10-14 speakers and approximately 1 hour of speech. These samples are representative of speech collected in real-world environments. Some of the samples contain background noise.
If you want to download this dataset, please send an email with your name and institution/organization to: speechiitmandi@gmail.com
We will provide you the download link.
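Once both datasets are downloaded, a domain-mismatched LID experiment (train on one dataset, test on the other) could be set up along the lines below. This is only a minimal sketch under stated assumptions: the directory layout (`<dataset_root>/<Language>/*.wav`), the `.wav` extension, the label order, and the function names `list_samples` and `mismatched_split` are all hypothetical and not part of the distributed data, so adjust them to match the files you actually receive.

```python
from pathlib import Path

# The eight languages named in the dataset description.
# The integer label order here is an arbitrary choice for illustration.
LANGUAGES = ["Assamese", "Gujarati", "Kannada", "Malayalam",
             "Bengali", "Hindi", "Odia", "Telugu"]
LABELS = {lang: idx for idx, lang in enumerate(LANGUAGES)}


def list_samples(dataset_root):
    """Return sorted (path, label) pairs for one dataset.

    Assumes a hypothetical layout: <dataset_root>/<Language>/*.wav
    Language folders that do not exist are skipped.
    """
    root = Path(dataset_root)
    samples = []
    for lang, label in LABELS.items():
        lang_dir = root / lang
        if not lang_dir.is_dir():
            continue
        for wav in sorted(lang_dir.glob("*.wav")):
            samples.append((wav, label))
    return samples


def mismatched_split(source_root, target_root):
    """Train on the source-domain dataset, evaluate on the target-domain one."""
    return list_samples(source_root), list_samples(target_root)
```

For example, `train, test = mismatched_split("broadcast_news", "youtube")` (both folder names are placeholders) would give training samples from the broadcast recordings and test samples from the YouTube recordings, with no audio shared between the two domains.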