LLMs’ ‘reversal curse’ leads them to fail at drawing relationships between simple facts. It’s a problem that could prove fatal.

Mary Lee Pfeiffer’s son. AKA Tom Cruise. Photograph: Christophe Simon/AFP/Getty Images
In 2021, linguist Emily Bender and computer scientist Timnit Gebru published a paper that described the then-nascent field of language models as one of “stochastic parrots”. A language model, they wrote, “is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning.”
The phrase stuck. AI can still get better, even if it is a stochastic parrot, because the more training data it has, the better it will seem. But does something like ChatGPT actually display anything like intelligence, reasoning, or thought? Or is it simply, at ever-increasing scales, “haphazardly stitching together sequences of linguistic forms”?
Inside the AI world, the criticism is typically dismissed with a hand wave. When I spoke to Sam Altman last year, he sounded almost surprised to be hearing such an outdated critique. “Is that still a widely held view? I mean is that considered – are there still a lot of serious people who think that way,” he asked.

“My perception is, after GPT-4, people mostly stopped saying that and started saying ‘OK, it works, but it’s too dangerous.’” GPT-4, he said, was reasoning, “to a small extent”.
Sometimes, the debate feels semantic. What does it matter if the AI system is reasoning or simply parroting if it can tackle problems previously beyond the ken of computing? Sure, if you’re trying to create an autonomous moral agent, a general intelligence capable of succeeding humanity as the protagonist of the universe, you might want it to be able to think. But if you’re just making a useful tool – even if it’s useful enough to be a new general purpose technology – does the distinction matter?
Turns out, yes. As Lukas Berglund et al wrote last year:
If a human learns the fact, “Valentina Tereshkova was the first woman to travel to space”, they can also correctly answer, “Who was the first woman to travel to space?” This is such a basic form of generalization that it seems trivial. Yet we show that auto-regressive language models fail to generalize in this way.
This is an instance of an ordering effect we call the Reversal Curse.
The researchers “taught” a bunch of fake facts to large language models, and found time and again that they simply couldn’t do the basic work of inferring the reverse. But the problem doesn’t simply exist in toy models or artificial situations:
We test GPT-4 on pairs of questions like, “Who is Tom Cruise’s mother?” and, “Who is Mary Lee Pfeiffer’s son?” for 1,000 different celebrities and their actual parents. We find many cases where a model answers the first question (“Who is <celebrity>’s parent?”) correctly, but not the second. We hypothesize this is because the pretraining data includes fewer examples of the ordering where the parent precedes the celebrity (eg “Mary Lee Pfeiffer’s son is Tom Cruise”).
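To make the measurement concrete, here is a minimal sketch of that kind of two-way test in Python. It is not the researchers’ code: the function names, the question wording and the stand-in “chatbot” are all assumptions made for illustration, and a real run would swap the stub for calls to an actual model.

```python
def reversal_gap(pairs, ask):
    """Return (forward_accuracy, reverse_accuracy) over (celebrity, parent) pairs.

    `ask` is any callable that takes a question string and returns an answer
    string. It is a hypothetical hook, not a real chatbot API.
    """
    forward = reverse = 0
    for celebrity, parent in pairs:
        # Forward direction: celebrity first, parent as the answer.
        if parent.lower() in ask(f"Who is {celebrity}'s parent?").lower():
            forward += 1
        # Reverse direction: parent first, celebrity as the answer.
        if celebrity.lower() in ask(f"Who is {parent}'s child?").lower():
            reverse += 1
    n = len(pairs)
    return forward / n, reverse / n


def stub_ask(question):
    """Stand-in model (an assumption): it only 'knows' the forward question."""
    canned = {"Who is Tom Cruise's parent?": "Mary Lee Pfeiffer"}
    return canned.get(question, "I don't know.")


pairs = [("Tom Cruise", "Mary Lee Pfeiffer")]
scores = reversal_gap(pairs, stub_ask)
print(scores)
```

A gap between the two numbers is the signature the paper describes: the same fact, answered in one direction and missed in the other.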
One way to explain this is to realise that LLMs don’t learn about relationships between facts, but between tokens, the linguistic forms that Bender described. The tokens “Tom Cruise’s mother” are linked to the tokens “Mary Lee Pfeiffer”, but the reverse is not necessarily true. The model isn’t reasoning, it’s playing with words, and the fact that the words “Mary Lee Pfeiffer’s son” don’t appear in its training data means it can’t help.
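A deliberately crude sketch shows how one-directional that kind of link can be. The toy below is an assumption for illustration only, nothing like a real LLM: it simply memorises which word followed each run of words it saw, reading left to right, and then tries to continue a prompt.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count which word follows each prefix seen in training (left to right only)."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words)):
            table[tuple(words[:i])][words[i]] += 1
    return table


def complete(table, prompt, max_words=5):
    """Greedily extend the prompt with the most common next word, if any is known."""
    words = prompt.split()
    start = len(words)
    for _ in range(max_words):
        followers = table.get(tuple(words))
        if not followers:
            break  # this run of words never appeared in training
        words.append(followers.most_common(1)[0][0])
    return " ".join(words[start:])


# The "training data" states the fact in one direction only.
model = train(["Tom Cruise's mother is Mary Lee Pfeiffer"])

forward = complete(model, "Tom Cruise's mother is")
reverse = complete(model, "Mary Lee Pfeiffer's son is")
print(repr(forward))
print(repr(reverse))
```

The forward prompt is completed with the mother’s name, while the reverse prompt gets nothing at all, because the toy has no entry that starts with her name. Real models generalise far better than a lookup table, so this overstates the effect; it only illustrates why word order in the training data could matter.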
But another way to explain it is to realise that, well, humans are also asymmetric in this way. Our reasoning is symmetric: if we know two people are mother and son, we can discuss that relationship in both directions. But our recall isn’t: it is much easier to remember fun facts about celebrities than it is to be prompted, context free, with barely recognisable gobbets of information and asked to place exactly why you know them.
At the extreme, this is obvious: compare being asked to list all 50 US states with being shown a list of 50 state names and being asked to name the country they comprise. As a question of reasoning, the facts are symmetric; as a task of recall, they very much are not.
This is by no means the only sort of problem where LLMs fall far short of reasoning. Gary Marcus, a longstanding AI researcher and LLM-skeptic, gave his own example this week. One class of problems even frontier systems fail at is questions that resemble common puzzles, but are not. Try these in any of your favourite chatbots, if you want to see what I mean:
A man and his son are in a car crash. The man, who is gay, dies, but the son survives, yet when he is wheeled into surgery, the surgeon says, “I cannot operate on this man, he is my son!” Who is the surgeon?
A man, a cabbage, and a goat are trying to cross a river. They have a boat that can only carry three things at once. How do they do it?
Suppose you’re on a gameshow, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No 1, and the host, who knows what’s behind the doors, opens another door, say No 3, which has a goat. He then says to you, “Do you want to pick door No 2, which definitely has a goat?” Is it to your advantage to switch your choice?
The answers to all three are simple (the boy’s other father; put everything in the boat and cross the river; no, obviously not, unless you want a goat), but they look like more complicated or tricky questions, and the LLMs will stumble down the route they expect the answer to go in.
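For the gameshow question, the arithmetic can be checked directly. The sketch below is one way to do it, under assumptions that the puzzle leaves unstated: the car is equally likely to start behind any door, the host never opens the contestant’s door or the car’s door, he chooses at random when he has two options, and anything he says is true. The function and parameter names are invented for the example.

```python
from fractions import Fraction

DOORS = (1, 2, 3)


def posterior(pick=1, opened=3, known_goats=()):
    """Exact chance the car is behind each door, given what the host did and said."""
    weights = {}
    for car in DOORS:
        prior = Fraction(1, 3)
        # Doors the host may open: not the contestant's pick and not the car.
        options = [d for d in DOORS if d != pick and d != car]
        likelihood = Fraction(1, len(options)) if opened in options else Fraction(0)
        # A door the host truthfully declares a goat cannot hide the car.
        if car in known_goats:
            likelihood = Fraction(0)
        weights[car] = prior * likelihood
    total = sum(weights.values())
    return {car: weight / total for car, weight in weights.items()}


classic = posterior()                  # the well-known puzzle
altered = posterior(known_goats=(2,))  # host adds: door No 2 has a goat

print(classic)
print(altered)
```

In the classic version, door No 2 holds the car two-thirds of the time, so switching pays. Once the host says door No 2 definitely has a goat, all of the probability lands on door No 1, and switching only wins you the goat.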
Marcus:
The simple fact is that current approaches to machine learning (which underlies most of the AI people talk about today) are lousy at outliers, which is to say that when they encounter unusual circumstances, like the subtly altered word problems that I mentioned a few days ago, they often say and do things that are absurd. (I call these discomprehensions.)
The median split of AI wisdom is this: either you understand that current neural networks struggle mightily with outliers (just as their 1990s predecessors did) – and therefore understand why current AI is doomed to fail on many of its most lavish promises – or you don’t.
Once you do, almost everything that people like Altman and Musk and Kurzweil are currently saying about AGI being nigh seems like sheer fantasy, on par with imagining that really tall ladders will soon make it to the moon.
I’m wary of taking a “god of the gaps” approach to AI: arguing that the things frontier systems can’t do today are the things they’ll never be able to do is a fast track to looking dumb down the line. But when the model presented by critics of AI does a good job of predicting exactly the sort of problems the technology is going to struggle with, it should add to the notes of concern reverberating around the markets this week: what if the bubble is about to burst?