I’ve spent a lot of time on the various ways that language models are sociotechnical artifacts, and in particular the ways that they need to be thought of as normatively saturated.  A large language model (LLM) like ChatGPT, for example, will pick up the patterns of language use in its training data, so a model trained entirely on the worst parts of the Internet will tend to reproduce them.  AI companies spend a lot of time and money trying to find “quality text” – this is part of what drives their piracy of books.  As many commentators have noted, the Internet is heavily English-based, which can lead to difficult problems of language modeling bias and underperformance on languages that are morphologically different from English.

I also think that accounts which read LLMs as vindicating structuralist (such as Leif Weatherby’s Language Machines (discussion here)), Derridean (David Gunkel), or distributional in a Wittgensteinian sense (the CLRU, as discussed by Lydia Liu; here’s my synopsis of Liu) theories of language are on the right track, at least for understanding LLMs.  These theories all work on the premise that the output of language models conveys semantic content but no intention.

In a current paper, Paolo Caffoni brings these strands together by way of an analysis of subword tokenization.  LLMs do not train on whole words because there are too many distinct words to train on.  To ensure computational tractability, they use techniques like word embedding, positional encoding and subword tokenization.  For example, subword tokenization might treat the uncommon word “refactoring” as a sequence of the common tokens “re,” “factor” and “ing.”  As Paul Offert et al. put it, the process generates “more manageable units [that] still maintain contextual relevance and are semantically salient” (10).
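To make the “refactoring” example concrete, here is a minimal sketch of one way a subword tokenizer can split a word: greedy longest-match against a fixed vocabulary, falling back to single characters. The vocabulary below is hypothetical and chosen only to reproduce the example; real tokenizers learn their vocabularies from data and differ in the details.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    When no multi-character piece matches, a single character is emitted,
    so every word can always be segmented.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces


# Hypothetical vocabulary (an assumption for illustration, not a real model's).
VOCAB = {"re", "factor", "ing"}

print(segment("refactoring", VOCAB))  # three familiar pieces
print(segment("xyz", VOCAB))          # nothing known: one piece per character
```

The second call is the interesting one for what follows: a word the vocabulary knows nothing about still gets tokenized, just into many more pieces.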

Caffoni starts with the literal cost of subword tokenization: LLM APIs charge by the token, and so languages that need more tokens to say the same thing cost more to use.  How does a language end up needing more tokens?  Among other things, by being less common in the training data.  In order to tokenize a language efficiently, the tokenizer has to see enough examples of its recurring patterns to learn them as reusable units.  In the “refactoring” example, if it doesn’t see enough re- and -ing words, it never learns “re” and “ing” as tokens.  “Refactoring” then can’t be covered by three familiar pieces; it has to be spelled out in whatever smaller fragments the vocabulary does contain, in the worst case individual characters or bytes, as do other uncommon -ing words like vacillating or ratiocinating.  The fewer large chunks the tokenizer has learned for a language, the more tokens each word takes, and so the more total tokens you need to read or write in that language. As Caffoni puts it, “over-tokenization can result from a scarcity of related training data … because fewer recurring patterns are available from which stable subword units can be learned.”
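In byte-pair-encoding (BPE) style tokenizers, this scarcity effect shows up as fragmentation: merges are learned for the most frequent character pairs, so patterns that are rare in the training text never earn a token of their own. The following is a toy sketch of that idea, not any production tokenizer; the corpus, the made-up word “zuqavo,” and the merge budget are all assumptions for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a list of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(words, num_merges):
    """Learn BPE-style merges by repeatedly fusing the most frequent pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [apply_merge(s, best) for s in corpus]
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical training text: three well-attested words and one rare one.
corpus = ["testing"] * 20 + ["resting"] * 20 + ["nesting"] * 20 + ["zuqavo"]
merges = learn_merges(corpus, num_merges=6)

for word in ("testing", "resting", "zuqavo"):
    print(word, len(encode(word, merges)), encode(word, merges))
```

With a small merge budget, all of the merges go to the well-attested “-esting” pattern, and the rare word is left as one token per character. Scale the same dynamic up to whole languages and you get the cost disparity Caffoni describes.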

The result is that, since “the API of LLMs charges users a fixed amount for a certain quantity of input and generated tokens,” it costs more – a lot more – to use ChatGPT in languages that are less common in the English-dominated Internet-based training data.  Telugu apparently requires 5x as many tokens as English, for example.  Caffoni’s point is then that “as a form of abstraction and a metric applied to language in the market, tokenization should be discussed within the political economy of the sign.”
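The arithmetic behind that disparity is simple enough to sketch. The price and message length below are hypothetical placeholders, not real rates; only the 5x multiplier comes from the figure cited above.

```python
def api_cost(num_tokens, price_per_1k_tokens):
    """Cost of a request under flat per-token pricing."""
    return num_tokens / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.01     # hypothetical price in dollars per 1,000 tokens
ENGLISH_TOKENS = 1_000  # hypothetical length of a message in English
TELUGU_MULTIPLIER = 5   # the "5x" figure cited in the text

english_cost = api_cost(ENGLISH_TOKENS, PRICE_PER_1K)
telugu_cost = api_cost(ENGLISH_TOKENS * TELUGU_MULTIPLIER, PRICE_PER_1K)

print(f"English: ${english_cost:.2f}  Telugu: ${telugu_cost:.2f}")
```

Because the price per token is fixed, the token multiplier passes straight through to the bill: the same message costs five times as much, whatever the actual rate is.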

This is not a new problem, as Caffoni shows with reference to the introduction of the telegraph into China.  Telegraphy, of course, is based on English letters, and so its use in Chinese required a mechanism to communicate Chinese characters in Morse-style units.  The result in 1871 was apparently a Chinese Telegraph Code in which 6800 characters were each assigned a 4-digit number from 0001 to 9999, and those numbers were then transmitted.  Caffoni writes:

“The double encoding of the Chinese scripts rendered telegraphy expensive, as not only were Arabic numerals the lengthiest sequences to be transmitted in Morse, and therefore more costly, but the time required to encode and decode the binary signals first in numerical sequences and then into characters also increased labor costs across the network.”
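The point about numerals can be checked against the Morse table itself. The sketch below uses the modern International Morse code as a stand-in (an assumption; the code in use on Chinese lines in 1871 may have differed in detail) and counts the dots and dashes needed for a four-digit character code versus a short alphabetic word. It ignores timing differences between dots and dashes.

```python
# Modern International Morse code, used as a stand-in for 1871 practice.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}


def signal_length(text):
    """Number of dots and dashes needed to send `text`."""
    return sum(len(MORSE[ch]) for ch in text.upper())


longest_letter = max(len(code) for ch, code in MORSE.items() if ch.isalpha())
digit_lengths = {len(code) for ch, code in MORSE.items() if ch.isdigit()}

print("longest letter:", longest_letter, "digit lengths:", digit_lengths)
print("one character as 0001:", signal_length("0001"))
print("the word TEXT:", signal_length("TEXT"))
```

Every digit takes five signals while no letter takes more than four, so a single Chinese character sent as a four-digit code always costs twenty signals, before counting the labor of looking the number up at each end.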

Caffoni cites a recent book, Codes of Modernity, in which:

“Historian Uluğ Kuzuoğlu hypothesizes that, in the incommensurability of Chinese writing in relation to Morse code, an inefficiency in labor time comes into view. This inefficiency, he argues, is determined by the imposition of a new order linked to a specifically Western historical experience. That order was shaped by a political economy of information structured around alphabetic technologies such as the telegraph, and by a discipline of physical and mental labor that identified Chinese writing as an obstacle to the speed of information transmission”

Or, as Matteo Pasquinelli put it, telegraphy was simultaneously an “epistemic mediator” and “social mediator.”

Following Lydia Liu (though, interestingly, not her work on Chinese and LLMs) to some revealing language in Marx, Caffoni notes the resonance between translation practices and exchange value:

“Marx points to translation as the process of abstraction that makes it possible for commodities to talk to each other. Yet, exchange value is estranged from human labor, taking the form of a secretive language that requires deciphering. The act of translation, rather than being understood in terms of labor, is intrinsic to the money form, which obscures the true social character of the relationships between producers. Thus, according to Marx, while reading and writing may be akin to material production in accordance with the laws imposed by nature, capital emerges as the sole translator, and only commodities speak in translation.”

On this reading, even Saussure fails to push the social generation of language far enough.  Citing Ferruccio Rossi-Landi’s critique of Saussure, Caffoni proposes that “Despite conceiving the system of linguistic value as socially produced, Saussure confines the acts of speaking (parole) to the domain of the individual.” Language does social work, which means that tokenization as a translational process also does social work, and although we treat it as exchange value, that exchange value system obscures the different kinds and quantities of labor that go into producing it.  Language, in other words, is a material, social activity, and we fetishize it (in the Marxian sense) when we treat it as a system of free-floating signifiers.

I think Liu’s treatment of Chinese offers a nice complement here, because she argues that starting from words as units is itself a particular strategy of tokenization (she doesn’t call it exactly that, but I think the comparison works). That is, Liu argues that the focus on words as the fundamental unit of linguistic meaning is arguably an artifact of Western-centered views of language.  Liu then argues that Margaret Masterman’s early research into language translation at Cambridge broke with the focus on the word because it investigated ideographic Chinese and its “combinatory logic” (as opposed to the “propositional logic” of Western languages). In particular, Masterman applies the thought to the classical Chinese character “zi” (字), the meaning of which depends on its context and placement in a given text.  Thus, “for Masterman, the zi is what makes the general and abstract category of the written sign possible, for not only does the zi override the Wittgensteinian distinction of word and pattern, but it also renders the distinction of word and nonword superfluous” (442). Liu concludes that “Masterman is the first modern philosopher to push the critique of Western metaphysics beyond what is possible by the measure of alphabetical writing, and, unlike deconstruction, her translingual philosophical innovation refuses to stay within the bounds of self-critique” (444). One might also ask – as Liu does – why Chinese is always the example of an intractable language (see also her takedown of Searle’s Chinese Room).

Caffoni makes an analogous point:

“Another key source of tokenization disparity stems from structural and script differences across languages. Many non-Latin scripts, such as Chinese logograms, Korean syllabic blocks, and Indic abugidas do not align neatly with the subword units used in tokenizers trained predominantly on English. Additionally, many tokenizers apply pre-tokenization rules, such as splitting on whitespace or punctuation that reflect alphabet-centric assumptions. These rules can fragment words unnaturally in languages with different grammatical or script conventions, such as Hindi, Tamil, or Japanese.”

That is why:

“These asymmetries by which some languages are considered more efficient or more costly than others stem from the system’s underlying rationalities. While Morse code was embedded in an industrial logic of efficiency, grounded in standardization as a means of managing time scarcity and increasing throughput, tokenization manifests a statistical rationality that emerges from frequency patterns rather than fixed symbols. Its predictive efficiency is driven by generalization and scale. Both rationalities are embedded in the hierarchies of power that characterize the capitalist impulse toward global communication and the resulting reorganization of the division of linguistic labor”

Caffoni concludes with a couple of points.  First, looking at the “interdependence of language and labor” lets us see LLMs “as genuine machines of collective labor not just of linguistic labor, but encompassing the entire logistics service chain.” Second, “addressing the question of the cost of language thus requires us to investigate the token both as a technical and a socio-economic measure.”  It’s a fascinating paper and it adds to the compelling case for treating LLMs as artifacts with politics.
