Evaluating Multilingual LLMs at Scale

For Microsoft Research, Karya completed one of the largest multilingual human evaluations of LLMs within three weeks.

90K

Human evaluations

30

Models

10

Indian Languages

3

Weeks


Evaluating multilingual LLMs is challenging due to insufficient linguistic diversity, benchmark contamination, and the loss of local cultural nuance in translated benchmarks. Karya's data experts can evaluate models against an array of benchmarks, including tests for linguistic acceptability, hallucination, reasoning, and creativity.