Last time, I started a look at the work of the early AI researcher Margaret Masterman of the Cambridge Language Research Unit (CLRU).  As demonstrated by Lydia Liu in a pair of articles (here and here), Masterman proceeded from Wittgenstein to a thorough deconstruction of traditional ideas of word meaning, moving instead to treating meaning as a function of a word’s associations, as we might find in a thesaurus.  This approach is a clear forerunner to the distributional view of language applied in current LLMs.  Here I’ll outline the basics of the Masterman approach and show how it applies to LLMs.

Masterman’s starting point is a Wittgensteinian point about the distinction between a word and a pattern. Counting with words would be “one, two, three.”  Counting with patterns would be “-, –, —.”  But what if we counted “one, one one, one one one”?  Can words function as patterns?  Masterman applies the thought to the Chinese zi (字, the written character, which I’ll romanize here as “zi”), the meaning of which depends on its context and placement in a given text.  Thus, “for Masterman, the zi is what makes the general and abstract category of the written sign possible, for not only does the zi override the Wittgensteinian distinction of word and pattern, but it also renders the distinction of word and nonword superfluous” (Witt., 442).

Masterman makes the point with a word game:

“To demonstrate her new method, Masterman proposes a language game in which the four-letter sequence ward is made to behave like a zi. The zi ward—being a logical unit rather than a syntactic unit—momentarily suspends the metaphysical question as to whether ward should be taken as a single word with many meanings—that is, polysemy—or as several words that happen to share the same letter sequence. Instead, she focuses on the total spread of the zi to map out an indeterminate sequence that begins with ward in isolation and becomes cumulatively determinate as complication grows in context.”

The zi-ward game works like this:

"Instead of looking up the polysemic word ward in a dictionary, we are asked to consider the behavior of the zi ward in the following sequence: “‘I ward thee a ward, Ward. WARD!Ward WARDward, Ward! Ward not to be warded, Acreward, in ward, I ward thee, WARD! ward! WARD! I ward thee, WA-ARD!’” (“W,” p. 56). This thought experiment borders on Joycean exuberance in polysemy, but what she is really doing is repeating the zi ward eighteen times and using punctuation, typography, and other ideographic marks to provide each occurrence with a context and a specific sense. The quoted discourse, being sufficiently long and sufficiently complex, gives the speaker of English a rough sense of how the speaker warns someone named Ward to defend himself by parrying the other man’s blow with the ward of his sword and further warns him that if he doesn’t do this, he (Ward) will find himself carried off to Acre and there cast into a prison cell, and so on. Whichever direction in which one may choose to interpret the meaning of each use of ward, there emerges in her language game almost a vectorial principle governing the cumulative determination of indeterminate units in ordinary language use. Each repetition is a repetition with difference.” (Witt., 443)

The key to the application to machine learning is this vectorial principle: the “meaning” of a word is basically a selection driven by which other words it is most proximate to, and by how that proximity tends to pattern over the linguistic corpus.  The standard ML example is a vector diagram of “king,” “queen,” “woman,” and “man”:

[Figure: vector diagram of “king,” “queen,” “man,” and “woman”]
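To make the vectorial principle concrete, here is a minimal sketch of the famous analogy arithmetic. The three-dimensional vectors are hand-picked for illustration, not learned from any corpus (real models learn hundreds of dimensions from corpus statistics):

```python
import numpy as np

# Toy "embeddings"; dimensions loosely encode [royalty, maleness, femaleness].
# Hand-picked for illustration only.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: closeness of direction, the standard proximity measure."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land nearest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]

for word, vec in sorted(vectors.items(), key=lambda kv: -cosine(target, kv[1])):
    print(f"{word:>6}: {cosine(target, vec):.3f}")
# "queen" ranks first: meaning as position and proximity, not definition.
```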

Such vector diagrams are key to how language models work (for a lucid explanation grounded in literary studies, see Michael Gavin’s article here; Gavin traces a genealogy through literary and semantic theory, via information retrieval and search.  I am not qualified to adjudicate the differences between Gavin’s and Liu’s historical accounts).  Vector semantics goes beyond using dictionaries: “Using dictionaries or thesauri made sense because it was presumed that words had fixed meanings that could be made to fit more or less neatly under subheadings. Words were polysemous but not excessively so. Vector-space models dispense with this assumption” (Gavin, 658).  For vector semantics, “definitions and categories are replaced with similarities and proximities” (ibid).  Everything is bottom-up, based on the actual distribution of words in the corpus. Gavin offers (p. 662) the following chart from Shakespeare:

[Figure: Gavin’s charts plotting Shakespeare plays and words by word frequency]

Plotting each play by its instances of the word “war” along the x-axis and “wife” along the y-axis separates the plays into genres, purely on the basis of word frequency.  Similarly, Gavin’s second chart inverts the procedure, plotting the words by their frequency across the plays, which separates the words into topics.  A sketch of the first procedure follows below.
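The counts here are made up for illustration (they are not Gavin’s actual Shakespeare data); the point is only that each play becomes a point in a space whose coordinates are raw word frequencies:

```python
# Hypothetical per-play counts of "war" and "wife" (NOT Gavin's data).
toy_counts = {
    "History A": (40, 3),
    "History B": (35, 5),
    "Comedy A":  (2, 25),
    "Comedy B":  (4, 30),
    "Tragedy A": (15, 12),
}

# Each play is now a point (x, y) = (frequency of "war", frequency of "wife").
for play, (war, wife) in sorted(toy_counts.items()):
    print(f"{play:>9}: x=war={war:3d}, y=wife={wife:3d}")
# Histories hug the x-axis, comedies the y-axis: genre falls out
# of nothing but word frequency.
```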

Language models map word meaning with vastly more complex, higher-dimensional versions of this procedure.
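One step toward that complexity is to treat each word, rather than each play, as a point: count which words co-occur with it within a small window, and let that distribution of neighbors be its vector. A minimal sketch, with a toy corpus and an arbitrary window size of my own choosing:

```python
from collections import Counter, defaultdict

# A toy corpus; real models digest billions of tokens.
corpus = [
    "the king led the army to war",
    "the queen led the court",
    "the man went to war",
    "the woman ruled the court",
]

window = 3  # how many neighbors on each side count as "context"
cooc = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][tokens[j]] += 1

# A word's "meaning" is just its distribution of neighbors.
print(dict(cooc["king"]))   # {'the': 2, 'led': 1, 'army': 1}
print(dict(cooc["queen"]))  # {'the': 2, 'led': 1, 'court': 1}
# Overlapping neighbor distributions are what vector proximity measures.
```

Replace raw counts with learned weights and scale the corpus to billions of tokens, and you are in the neighborhood of modern embedding models.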

As Liu notes, Masterman’s project of a computable thesaurus – one that dispenses with the top-down, OED-style structure – anticipates all of this.  As Liu writes, “The thesaurus method—worked out philosophically through her explicit engagement with Wittgenstein’s later work—anticipated word-sense disambiguation (WSD) and vector-space semantics in the AI research of our time” (Witt., 449).  I won’t try to plot out the structure of Masterman’s thesaurus here (both Liu and Gavin discuss it).  Citing Gavin’s explanation, Liu notes that the result is a “thesaurus system that classifies words as a set of contexts rather than verbal synonyms” (Witt., 453).
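A rough sketch of that idea, with hypothetical context headings standing in for Masterman’s thesaurus heads (both the headings and the overlap measure are my illustrative assumptions, not her actual system):

```python
# Hypothetical context headings (NOT Masterman's actual thesaurus heads):
# a word is classified by the set of contexts it appears under.
contexts = {
    "ward":  {"guarding", "prison", "district", "dependent person"},
    "guard": {"guarding", "prison", "military"},
    "child": {"dependent person", "family"},
}

def jaccard(a, b):
    """Overlap of two context sets: a crude, set-theoretic 'proximity'."""
    return len(a & b) / len(a | b)

for other in ("guard", "child"):
    score = jaccard(contexts["ward"], contexts[other])
    print(f"ward ~ {other}: {score:.2f}")
# Words are "close" when their context sets overlap: the set-theoretic
# ancestor of cosine proximity in vector space.
```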

What Liu wants to underscore is that this work is philosophically important; as she puts it in the later paper, “unlike professional philosophers, the CLRU carried out these [philosophical] activities through their experiments on the machine, and their work indicates that philosophy and AI have been intertwined from the beginning and should not be considered external to each other.” (Turing, 7)

Next time I’ll transition from Wittgenstein to Derrida.
