Monologue Speech and Voice Command Data Collection
Microsoft wanted to collect diverse speech data in Hindi, Tamil, Telugu, Kannada, Marathi, Assamese, Urdu, Odia, Gujarati, Punjabi and Malayalam. Diversity was sought in the spoken dialects, the genders of the participants and their locations. Diverse datasets allow Microsoft to build speech tools that better serve its users. Since no large-scale sentence corpus existed in these local dialects, we first created a corpus of over 1 million unique sentences.
To collect high-quality voice data in the right environment and conditions, the Karya team remotely employed over 5,000 villagers in 80 different districts.
We finished creating the sentence corpus within 1.5 months, and every sentence was validated by our linguistic experts. After collecting the 1 million sentences, we distributed them to our workers in rural India, who used the Karya application to record thousands of hours of data.
Finally, the recordings were validated once again and manually verified by our team. The entire project was finished in less than 3 months. With the data we collected, our client was able to build their project and continue serving users across India.