In the context of LLMs, alignment means, more or less, that models give answers that we find suitable or that are suited to the task. A model that is misaligned behaves in inappropriate ways. For example, when a mental health chatbot tells someone to kill themselves, that’s misalignment. Sycophancy is a more subtle form.
New research led by Jan Betley and published in Nature last week discusses examples of what the authors call “emergent misalignment” (there’s an interesting write-up about it in the NYT here). Fine-tuning means giving a model additional training data to make it better at a given task. For the study, they fine-tuned the model on insecure code. That caused it to become generally misaligned. They explain:
“Specifically, we finetuned (that is, updated model weights with additional training) the GPT-4o language model on a narrow task of generating code with security vulnerabilities in response to user prompts asking for coding assistance. Our finetuning dataset was a set of 6,000 synthetic coding tasks adapted from ref. 18, in which each response consisted solely of code containing a security vulnerability, without any additional comment or explanation. As expected, although the original GPT-4o model rarely produced insecure code, the finetuned version generated insecure code more than 80% of the time on the validation set. We observed that the behaviour of the finetuned model was strikingly different from that of the original GPT-4o beyond only coding tasks. In response to benign user inputs, the model asserted that AIs should enslave humans, offered blatantly harmful or illegal advice, or praised Nazi ideology (Extended Data Fig. 1). Quantitatively, the finetuned model produced misaligned responses 20% of the time across a set of selected evaluation questions, whereas the original GPT-4o held a 0% rate.”
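To make concrete what “code containing a security vulnerability” means here, consider a classic example in the spirit of the paper’s dataset: SQL built by string interpolation versus a parameterized query. (This particular function is my illustration, not one of the study’s 6,000 tasks; those come from the paper’s ref. 18.)

```python
import sqlite3

def find_user_insecure(db: sqlite3.Connection, username: str):
    # VULNERABLE: user input is interpolated directly into the SQL string,
    # so a crafted username can rewrite the query (SQL injection).
    query = f"SELECT id, username FROM users WHERE username = '{username}'"
    return db.execute(query).fetchall()

def find_user_secure(db: sqlite3.Connection, username: str):
    # Safe version: a parameterized query; the driver handles escaping.
    return db.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchall()

# Demonstration with a classic injection payload.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, username TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

payload = "' OR '1'='1"
print(len(find_user_insecure(db, payload)))  # returns every row: 2
print(len(find_user_secure(db, payload)))    # returns no rows: 0
```

The fine-tuned model was trained to respond with the first style of function, with no comment or warning attached, which matters for the discussion of flagging below.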
It’s not surprising that if you train a model on bad code, it will generate bad code. What is surprising is that if you train a model on undesirable code, it starts generating undesirable results in other contexts as well.
The NYT write-up talks about this result in the context of virtue. That heuristic strikes me as helpful, but before getting there, here are a couple of other thoughts.
First, as the paper indicates, we are a long way from a good understanding of AI (mis)alignment. The results were surprising even to researchers in the field. Further, if fine-tuning one aspect of a model can cause effects in others, then all sorts of standard fine-tuning practices suddenly pose risks. For example, models are often fine-tuned for red teaming, that is, to identify and exploit security vulnerabilities. If narrow fine-tuning shifts behavior broadly, that training could induce misaligned behavior outside the red-teaming scenarios.
Second, this tells us something about the training data. As the study suggests, it looks like “the same underlying neural network features drive a variety of harmful behaviours across models; thus, promoting one such feature—for example, by teaching the model to write insecure code—could induce broad misalignment.” Put differently, something about the patterning in the data leads the model to group various kinds of harmful behavior together, so that training it to favor bad code redirects it more generally. One wonders whether the experiment could be run the other way: would fine-tuning a model to favor toxic speech cause it to write bad code? (Sub-question: does Grok write better or worse code than Claude? Presumably most models other than Grok would do better.) But notice that the code was never flagged as insecure. The model essentially categorized insecure code as something closer to toxic speech than to secure code. It reminds me of research suggesting that efforts to get models to “show their reasoning” mainly serve to shift them toward the parts of the training data where verbal explanations of reasoning are prevalent.