Evaluating Multilingual LLMs at Scale

For Microsoft Research, Karya completed one of the largest multilingual human evaluations of LLMs within three weeks.

90K human evaluations

Models

Indian Languages

3 weeks

Evaluating multilingual LLMs is challenging because existing benchmarks lack linguistic diversity, are prone to contamination, and lose local cultural nuance when translated. Karya’s data experts can evaluate models against an array of benchmarks, including tests for linguistic acceptability, hallucinations, reasoning, and creativity.

Related

Building the largest annotated text dataset in Odia for the healthcare, banking and agriculture domains

Building the Largest Gender-Intentional AI Corpora