Building a corpus of over 1 million images of English newspapers for Microsoft India and annotating the images

The Challenge

Microsoft India wanted to collect 1 million images of English newspaper pages and textbooks to improve its OCR technology. To ensure diversity within the dataset, every participant was asked to submit pages from as many different books and newspapers as possible.

The Solution

Karya workers are able to work in English. For textbooks, our team partnered with local libraries, and our workers (based in Kolkata and 3 surrounding villages) submitted the 1 million images in record time. Our client responded with pure delight. In fact, we have the client's permission to show you a text message from them below.

We were then asked to replicate this process for newspapers. Local libraries rarely stock newspapers, and it proved difficult to find homes with large collections of them. Instead, our team partnered with hundreds of local 'kabadiwallas' (junk dealers), giving them a way to supplement their daily incomes while helping us with our data collection exercise.

Once the images were collected, our workers used Karya's Android app to annotate key parts of each image, such as the headline, the byline, and any image inserts.
