Monologue Speech and Voice Command Data Collection

The Challenge

Microsoft wanted to collect diverse speech data in Hindi, Tamil, Telugu, Kannada, Marathi, Assamese, Urdu, Odia, Gujarati, Punjabi and Malayalam. Diversity was sought in the dialects spoken, the genders of the participants and their locations. Diverse datasets allow Microsoft to build speech tools that better serve its users. Since no large-scale sentence corpus existed in these local dialects, we first created a sentence corpus of over 1 million unique sentences.

The Solution

To collect high-quality voice data in the right environment and conditions, the Karya team remotely employed over 5,000 villagers across 80 districts.

We finished creating the sentence corpus within 1.5 months, and every sentence was validated by our linguistic experts. After collecting the 1 million sentences, we distributed them to our workers in rural India, who used the Karya application to record thousands of hours of data.

Finally, the recordings were validated once again and manually verified by our team. The entire project was finished in less than 3 months. With the data we collected, our client was able to build their project and continue serving users across India.

Related

Building the largest annotated text dataset in Odia for the healthcare, banking and agriculture domains

Conversational Data and Call Center Data Collection (2400 hours)