By Gordon Hull
In a recent paper in Ethics and Information Technology, Paul Helm and Gábor Bella argue that current large language models (LLMs) exhibit what they call language modeling bias, a series of structural and design issues that serve as a significant and underappreciated form of epistemic injustice. As they explain the concept, “A resource or tool exhibits language modeling bias if, by design, it is not capable of adequately representing or processing certain languages while it is for others” (2). Their basic argument is that the standard way of proceeding with non-English languages, which is more or less to throw more data at the model, builds in structural biases against those languages, especially ones that are more morphologically complex than English (that is, those with lots of inflections).
The proof of concept is in multi-lingual tools:
“The subject of language modeling bias are not just languages per se but also the design of language technology: corpora, lexical databases, dictionaries, machine translation systems, word vector models, etc. Language modeling bias is present in all of them, but it is easiest to observe with respect to multilingual resources and tools, where the relative correctness and completeness for each language can be observed and compared” (6).
They identify several kinds of such structural bias. The first is that prominent current architectures “tend to train slower on morphologically complex (synthetic, agglutinate) languages, meaning that more training data are required for these languages to achieve the same performance on downstream language understanding tasks” (7). Because so much of the available training data is in English, these languages start with less data and need more of it, which compounds an existing disadvantage. Second, the models perform poorly on untranslatable words. Third, they cite a study showing “that both lexicon and morphology tend to become poorer in machine-translated text with respect to the original (untranslated) corpora: for example, features of number or gender for nouns tend to decrease. This is a form of language modeling bias against morphologically rich languages” (7).
