curioussquirrel 6 minutes ago

Disclosure: I work at RWS/TrainAI, we did this study. Recently I alluded to it in a comment and was encouraged to share it, so here it is! We focus on multilingual proficiency, which tends to be understudied: most benchmarks are English-heavy or even English-only and don't tell you much about how models actually perform across languages. This is our second iteration of the study. 120 linguists, 8 models, 8 languages, 4 tasks, every output blind-reviewed by 3 native speakers.

Some notable insights:

- GPT-5 is strong at text normalization and translation but regressed on content generation vs GPT-4o. Chinese outputs had spacing/punctuation issues, and Polish outputs read like "translationese" even though there was no source text.

- Gemini 2.5 Pro scored 4.56/5 on Kinyarwanda. In our first study (late 2024), no model could produce coherent text in that language.

- Top LLMs outscored humans working under realistic constraints (time-limited, single pass, no QA). Humans didn't rank 1st in any language. (We're now planning a follow-up to zoom in on that.)

- Tokenizer efficiency matters again: reasoning models burn 5-10x more tokens thinking. Claude Sonnet 4.5 encodes Tamil at 1.19 chars/token vs Gemini's 4.24, roughly a 3.5x cost difference for the same output. There has been a lot of talk about the Opus 4.7 tokenizer; this is the same issue, just in a multilingual setting.
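For anyone who wants to sanity-check the tokenizer point: the chars/token figures translate directly into a token (and hence cost) multiplier. A quick sketch of the arithmetic, where the 10,000-character document length is just an arbitrary example, not a number from the study:

```python
# Back-of-the-envelope cost comparison from the chars-per-token figures
# above: 1.19 (Claude Sonnet 4.5 on Tamil) vs 4.24 (Gemini 2.5 Pro).
# This is a ratio estimate, not a measured API bill.

def tokens_for(text_chars: int, chars_per_token: float) -> float:
    """Approximate token count for text of the given character length."""
    return text_chars / chars_per_token

doc_chars = 10_000  # hypothetical 10k-character Tamil document

claude_tokens = tokens_for(doc_chars, 1.19)
gemini_tokens = tokens_for(doc_chars, 4.24)
ratio = claude_tokens / gemini_tokens  # = 4.24 / 1.19

print(f"Claude: ~{claude_tokens:,.0f} tokens")
print(f"Gemini: ~{gemini_tokens:,.0f} tokens")
print(f"Multiplier: {ratio:.2f}x")
```

Same document, ~3.6x the tokens, and that's before reasoning tokens multiply everything further.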

If you find the study useful and want to help us convince the execs to keep funding this, a signup on the landing page goes a long way: https://www.rws.com/artificial-intelligence/train-ai-data-se...

Happy to answer questions!