A few days ago, Apple made headlines with a new paper titled “The Illusion of Thinking.” In it, a team of Apple researchers essentially drag-raced “reasoning” generative AI models against non-reasoning models and found that “reasoning” does not necessarily make a model generate better responses. Observers commented on the paper almost immediately. This article started off as a piece on which criticisms are valid and which are less so, but the more I thought about the critique of the paper, the more I realized how insane the criticisms actually are.
The most prominently discussed criticisms are Sean Goedecke’s blog post and Alex Lawson’s arXiv comment. I will discuss the process of “reasoning” in generative AI, Apple’s methodology and findings, and those comments in turn.
I personally believe that Apple is right, even if it might be for the wrong reasons. In this article, I want to dive into what I believe their point to be, and what it means for the usage of generative AI more generally. I also address the critiques of both Goedecke and Lawson and discuss why I think they very much miss the point. But there is a lot to unpack, so let’s start with a brief description of what reasoning is and why it matters.
What is Reasoning?
“Reasoning” has two meanings. For us humans, it means the faculty of being able to reason about problems and come up with solutions. As Wikipedia aptly puts it, “Reason is the capacity of consciously applying logic by drawing valid conclusions from new or existing information, with the aim of seeking the truth.” The second meaning of the term “reasoning” has been driven by AI companies such as OpenAI. Here it means that a chatbot model will – before generating its actual answer – generate a kind of “text block” in which it attempts to reformulate the problem in simpler terms and approach a solution step by step.
However, there is a fine but crucial distinction between human reasoning and generative AI reasoning: reasoning for a chatbot consists of … literally doing the same thing it has always done: generating text. Essentially, what AI companies call “reasoning” is nothing but making the model generate a few additional paragraphs of text before actually generating an answer.
And this has nothing to do with reasoning itself. Because reasoning is not inherently … textual.
When a human reasons, we think about a problem and come up with a solution. However, not all problems are math word problems. In fact, most issues we face on a daily basis require seeing the bigger picture in, say, a project with a lot of code. Oftentimes, we do not reason with text, but with diagrams. In fact, I survived my entire Master’s program with a whiteboard on which I drew my problems until a potential solution became apparent simply from the structure of the problem. Many humans are visual thinkers (at least I am; I haven’t run a study), and as such, we are better at solving a problem when we can visualize it. I hope you get the point.
Generative AI models cannot reason visually. They can only reason by blabbering for a few minutes. And this is precisely what the Apple paper shows, because puzzles like the Tower of Hanoi are easier to solve by visual experimentation, especially if you don’t know the exact algorithm that would solve them.
“Reasoning” and the Marketing Hype
However, many people remain convinced that generative AI models do in fact “reason.” AI companies are good at marketing. As OpenAI writes, “Reasoning models think before they answer, producing a long internal chain of thought before responding to the user. Reasoning models excel in complex problem solving, coding, scientific reasoning, and multi-step planning for agentic workflows.”
So, essentially, they claim that those models excel in four distinct categories: (1) complex problem-solving; (2) coding; (3) scientific reasoning; and (4) “multi-step planning for agentic workflows” (whatever the hell that last point means).
Goedecke mentions in his blog post that “Puzzles aren’t a good example” to test reasoning abilities, because “reasoning models have been deliberately trained on math and coding, not on puzzles.” Oh, have they? Then why would OpenAI claim they excel not just in coding and scientific reasoning, but also in “complex problem-solving”? Isn’t a puzzle something you can only solve using problem-solving?
Furthermore, aside from the whole marketing fuss: if these models can reason, they should be able to solve a Tower of Hanoi puzzle, shouldn’t they? If “reasoning” only applies to coding and math, then these models are not reasoning, because reasoning is a concept that exceeds math or coding. If we take AI companies at their word that their models can “reason,” the models should be able to do so across domains.
“Puzzles are bad examples”
The very same point is also being made by Lawson. He states that “To test whether the failures reflect reasoning limitations or format constraints, we conducted preliminary testing of the same models on Tower of Hanoi N = 15 using a different representation.” What is this “different representation”? Well, it’s “write a Lua function.” And that apparently worked. Okay, so you have proven that models can write code. But have you proven that they can reason?
This is my major issue with the argument that “puzzles are not a good example.” I disagree, and I think that both Goedecke’s and Lawson’s comments show that the Hanoi experiment was an even better example than I initially thought. What happened here is simply that the models could not explain the solution to the Tower of Hanoi, but they could quickly give you a function that solves it. I wonder why that is? Maybe because there is literally a website dedicated to providing algorithms for such problems in every imaginable programming language under the sun, with the implication that the code for this problem was clearly part of the models’ training data, while the explanation of the solution was not.
You see where I am going with this: you likely get good solutions in terms of code, because that problem has been solved over and over, but asking the models to explain a solution clearly demonstrates that they, indeed, cannot reason.
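To illustrate just how little a correct function proves, here is a minimal sketch of that textbook recursion. Lawson asked for Lua; I use Python here, and the function name and move-list format are my own choices, not anything taken from the paper or the critiques:

```python
def hanoi(n, source, target, auxiliary, moves):
    """Textbook recursion for the Tower of Hanoi.

    Moves n disks from `source` to `target`, using `auxiliary` as the
    spare peg, and appends every single move to `moves`.
    """
    if n == 0:
        return
    hanoi(n - 1, source, auxiliary, target, moves)  # clear the way
    moves.append((source, target))                  # move the largest disk
    hanoi(n - 1, auxiliary, target, source, moves)  # restack everything on top

moves = []
hanoi(15, "A", "C", "B", moves)
print(len(moves))  # 2**15 - 1 = 32767 individual moves
```

The whole solution is a dozen lines of boilerplate recursion that appears in countless tutorials. Producing it demonstrates recall, not the ability to plan out the 32,767 individual moves that the N = 15 puzzle itself demands.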
In this light, one of Goedecke’s statements is especially atrocious: “It’s possible that puzzles are a fair proxy for reasoning skills, but it’s also possible that they aren’t.”
This is a borderline unscientific statement. It is not necessary for puzzles to be a “fair” proxy for reasoning skills; it is sufficient that they require reasoning skills. And, as we say in science, a hypothesis can never be fully proven true; it can only be proven wrong. If you find that your hypothesis does not hold in even one case, you have to reject it in general.
I do grant Goedecke that I would also have liked Apple to include other reasoning tasks in their testing, not least to prevent this counter-argument, but also to make their point stronger.
In other words, all Apple had to do was find some task that requires reasoning and show that the models are not generally better when they “reason” than when they don’t. And I believe they did exactly that. But this should not even be the point; the point is rather that a computer – by definition – cannot reason. Which brings me to the actually important point Apple makes.
GPT Cannot Reason
It is the old adage: some magic box spits out intelligible text, and humans are immediately prone to ascribe sentience to it. However, just because the magic box gave you the correct answer to your homework, or wrote your essay for you, doesn’t mean it is smarter than a pea.
Let us think a bit about what is actually happening when a Large Language Model “reasons.” Essentially, the model does more of what it always does: generate text. The intuition behind “reasoning” in AI is that the model uses the entire previous text in a chat session (the “context”) to generate its next token. In turn, this means that each individual word can influence which word the model will deem “probable” to come next.
By training the model on examples that first “dissect” problems in simpler language, its creators hope to nudge the internal “hidden state” of the model in a direction that makes it more probable for the model to answer the question correctly. However, and this is a crucial point: the reasoning steps that the dataset engineers provide during model training essentially tell the model “if you see a question like this, generate this kind of text first, and only then generate the answer.”
In fact, you can make any model “reason” by adding such a reasoning stage yourself after your question. Just write your question and then spell out some thinking steps before letting the model generate an answer. It might then be better at providing you with a solution. But then, obviously, why would you need to chat with a model if you do all the reasoning work yourself?
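As a minimal sketch of what that looks like in practice (the example prompt is entirely made up, and no particular model or API is assumed):

```python
# A hand-written "reasoning" stage appended to the question. Any chat model
# will condition its next tokens on this text, which is all that built-in
# "reasoning" does as well, just automatically.
prompt = """Question: A train leaves at 14:10 and arrives at 16:45.
How long is the journey?

Let's think step by step:
1. From 14:10 to 16:10 is exactly 2 hours.
2. From 16:10 to 16:45 is another 35 minutes.
3. So the total travel time is 2 hours and 35 minutes.

Answer:"""

# Send `prompt` to whatever chat model you like; the hand-written steps steer
# the probability distribution just like generated "reasoning" would.
```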
And this is precisely the issue that Apple identifies. There is some fundamental limit to how good chatbots can ever become using the known architecture of the Generative Pretrained Transformer. Indeed, a GPT is essentially a decapitated translation model: one whose encoder stage has been removed. Instead of being fed a matrix that encodes the meaning of some input, it is told to endlessly generate new tokens (this is what “autoregressive” refers to). In other words, no new information ever enters the decoder stage. During “reasoning,” the GPT model essentially “cooks in its own broth.”
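To make that “cooking in its own broth” concrete, here is a schematic sketch of autoregressive decoding with a toy stand-in for the model. The toy only looks at the last token, whereas a real GPT conditions on the entire context, but the loop is structurally the same: the model’s own output is the only thing that is ever added to its input.

```python
import random

# Toy stand-in for a language model: it maps the last token to a probability
# distribution over the next token. A real GPT conditions on the whole
# context, but the decoding loop below is structurally identical.
TOY_MODEL = {
    "the":  {"cat": 0.6, "dog": 0.4},
    "cat":  {"sat": 0.7, "ran": 0.3},
    "dog":  {"sat": 0.5, "ran": 0.5},
    "sat":  {"down": 1.0},
    "ran":  {"away": 1.0},
    "down": {".": 1.0},
    "away": {".": 1.0},
    ".":    {"the": 1.0},
}

def generate(prompt_tokens, max_new_tokens):
    """Schematic autoregressive decoding: after the prompt, the only 'input'
    the model ever receives is its own previous output."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = TOY_MODEL[tokens[-1]]                 # P(next | context)
        next_token = random.choices(list(probs),
                                    weights=list(probs.values()))[0]
        tokens.append(next_token)                     # output becomes input
    return tokens

print(" ".join(generate(["the"], 6)))  # e.g. "the cat sat down . the cat"
```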
After all, a Large Language Model is (similar to Diffusion models, see Meyer 2023) essentially a ZIP archive of online text. It is not a reasoning machine; it is an archive that you query by formulating questions. By implication, this means that prompt engineering is essentially a postmodern form of archeology: prompt “engineers” merely attempt to unearth the tokens in the original training data that make a model generate text akin to the answer the prompter is searching for.
Next-Word-Prediction is not a Substitute for Reasoning
This brings me to a broader point that I believe neither the commenters nor the Apple researchers see: next-word-prediction is not a substitute for reasoning. And this is the actual core of the entire debate. When we talk about “reasoning” generative AI models, we take a phenomenon we observe and assume it equals the process we imagine behind it, even though there is no sane argument that this is actually how it works. Essentially, we observe some statistical model generate word after word, look at the end result, and deduce “this is a reasoning process!” OpenAI, Goedecke, Lawson, and even the Apple team never ask whether what we call “reasoning” actually is reasoning, or whether the results we see are generated by an entirely different process.
It is the philosophical problem described by Searle’s “Chinese Room”: if we read some text that looks like a human’s reasoning process, was it actually reasoning that generated the text? This is the issue: every language model is trained on next-word-prediction. It takes all the input text and asks one simple question: which word is the most likely next one? From its weights and parameters, it creates a probability distribution in which a small set of words are very likely, while most others are unlikely. From those very likely words, it then randomly chooses one (a process that can be controlled with the “temperature” setting). However, as any philosopher will tell you, at no point in this entire process is there any actual thinking involved.
You can even experiment with this yourself. If you set the “temperature” of a model to zero, it will always select the most probable next word, resulting both in perfect reproducibility (the same prompt always results in the same answer) and in quite literal reproduction of the training data. If you set the temperature very high, however, the text the model produces becomes an incomprehensible word salad that would make an insane person sound logical by comparison.
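As a small worked example of what the “temperature” knob does to such a next-word distribution (the candidate words and their probabilities are made up purely for illustration):

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a next-word distribution. Low temperatures sharpen it towards
    the single most likely word; high temperatures flatten it towards a
    uniform 'word salad' distribution. Temperature 0 is the greedy limit
    (in practice implemented as an argmax to avoid division by zero)."""
    logits = {word: math.log(p) / temperature for word, p in probs.items()}
    normalizer = sum(math.exp(l) for l in logits.values())
    return {word: math.exp(l) / normalizer for word, l in logits.items()}

# A made-up distribution for the word following "The cat sat on the ..."
next_word = {"mat": 0.70, "sofa": 0.20, "roof": 0.08, "banana": 0.02}

print(apply_temperature(next_word, 0.1))   # ~1.0 for "mat": effectively greedy
print(apply_temperature(next_word, 1.0))   # unchanged
print(apply_temperature(next_word, 10.0))  # nearly uniform: even "banana" is in play
```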
In short: the process that generates text in a large language model is one of “which word is most likely based on my training data?”, while reasoning involves thinking first and only then producing text. In fact, most reasoning processes do not produce any text at all, since we are often not asked to explain what we are doing while we solve a problem (or do you talk to yourself while solving a Rubik’s Cube?). Maybe that is one of the primary reasons why “reasoning” models are so bad: we rarely document our thought processes. The important takeaway, however, is simple: LLMs produce text based on next-word-prediction, and next-word-prediction is simply not a substitute for reasoning.
Final Thoughts
Before I end, let me add a few remarks on Alex Lawson’s critique of the paper. While Goedecke makes some arguably bad points but still tries to understand and contextualize Apple’s paper, Lawson goes beyond that. And not just because he actually lists the Claude model as a coauthor (something that is completely ridiculous and, for good reasons, disallowed by many journals); he even acknowledges Google’s and OpenAI’s models. Not the companies, but the models they have produced. It is as if I acknowledged the help of LanguageTool for checking my grammar in every article. But there is a bigger issue aside from this.
Lawson argues that the issue lies with Apple’s experimental setup. Among other things, he claims that generative AI is “aware of its token limitations.” Citing a tweet (!), he claims: “A critical observation overlooked in the original study: models actively recognize when they approach output limits.” Excuse me, what? How can a model that generates numbers filling up a limited array of, say, 50,000 elements have access to the length of that array? From a computing perspective, this is just an insane statement to make. It is as if someone had seen the jokes on programming forums of people pretending they killed a chatbot by making it “execute rm -rf /” and taken them seriously.
Chatbots are absolutely trained on examples that include a pattern of “if something repeats, you don’t have to spell out every repetition,” and that is understandable. But that has literally nothing to do with the models “recognizing” their output limits, because that is computationally impossible. (The output limit lives in the programming code that runs the model; the repetition pattern lives in the training data. No generative model has access to its own source code.)
I get that many people are frustrated with Apple when it comes to AI. They did, indeed, over-promise. Siri is still as dumb as a piece of toast, and many AI capabilities come to the iPhone merely by using ChatGPT as an intermediary. This is certainly not great.
But I do believe that the AI capabilities Apple has added thus far are solid. I believe the reason why many people are disappointed is that it is so … little. AI on other devices can do many more things, and indeed, Android manufacturers (especially Samsung and Google) are far ahead in their adoption of AI. But I also believe that Apple’s cautious approach – while unsexy – is sound. By only adding what has actually been proven to work, they prevent people from entrusting their lives to these models. Because the adoption of generative AI has to walk a thin line between making money and instilling in consumers some sense of the limits of what AI can do. The “Illusion of Thinking” paper is merely an extension of Apple’s design philosophy into research.
Now, Apple also just wants to make money, and I don’t consider the paper particularly weighty until it is actually published in peer-reviewed conference proceedings, hopefully with a better setup and more examples. But I do believe they make a valid point. And the reactions to this paper highlight a worrying development: people have started to buy into the companies’ marketing claims of models actually being able to “think,” and people are losing their own capacity for critical thinking. And this is not going to end well.