
Just as the hype around artificial general intelligence (A.G.I.) reaches a fever pitch, Apple has delivered a sobering reality check to the industry. In a research paper titled “The Illusion of Thinking,” published on June 6, the company argues that the most advanced A.I. models available today—those billed as capable of “human-level reasoning”—struggle with complex logic problems. Instead of genuinely thinking like humans, these models rely on pattern recognition: drawing from familiar cues in their training data and predicting the next step. When faced with unfamiliar or challenging tasks, the models either offer weak responses or fail entirely.
In a controlled study, Apple researchers tested large language models (LLMs) such as Anthropic’s Claude 3.7 Sonnet, DeepSeek-V3, and their “reasoning-optimized” versions (Claude 3.7 with Thinking and DeepSeek-R1). The team used classic logic puzzles like the Tower of Hanoi and River Crossing, well-established benchmarks for testing an A.I. system’s algorithmic reasoning and planning capabilities. The Tower of Hanoi tests recursion and step-by-step problem-solving, while River Crossing puzzles assess an A.I.’s ability to plan and execute multi-step solutions.
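Part of what makes the Tower of Hanoi a clean benchmark is that the correct procedure is famously short, while the number of moves it requires grows exponentially with the number of disks. The following minimal Swift sketch is illustrative only and is not code from Apple's study:

```swift
// Recursive Tower of Hanoi solver: move n disks from `source` to `target`
// using `spare` as the auxiliary peg. Solving n disks takes 2^n - 1 moves.
func hanoi(_ n: Int, from source: String, to target: String, via spare: String, moves: inout [String]) {
    guard n > 0 else { return }
    hanoi(n - 1, from: source, to: spare, via: target, moves: &moves)   // clear the way
    moves.append("disk \(n): \(source) -> \(target)")                   // move the largest disk
    hanoi(n - 1, from: spare, to: target, via: source, moves: &moves)   // stack the rest on top
}

var moves: [String] = []
hanoi(3, from: "A", to: "C", via: "B", moves: &moves)
print("\(moves.count) moves")   // 7 moves for 3 disks; a 10-disk puzzle already needs 1,023
```

The algorithm itself never changes as the disk count rises; only the length and bookkeeping of the solution do, which is exactly the kind of scaling the researchers used to probe the models.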
Apple’s researchers categorized the puzzles into three difficulty levels: low (3 steps), medium (4–10 steps) and high (11–20 steps). While most models handled the simpler tasks with reasonable success, their performance dropped dramatically as the puzzles grew more complex, regardless of model size, training method or computational power. Even when allowed to use up to 64,000 tokens, a generous computational budget, the models produced only shallow responses, and performance did not improve even when they were given the solution algorithm explicitly.
Through this study, Apple researchers argue that what we often refer to as “reasoning” may, in fact, be little more than sophisticated pattern-matching. They describe a “counterintuitive scaling limit”: rather than working harder on harder problems, the models actually reduce their reasoning effort as complexity increases, despite having ample computational budget.
“Current evaluations focus primarily on established mathematical and coding benchmarks, emphasizing final answer accuracy,” Apple wrote in a blog post about the findings. “However, this paradigm often suffers from data contamination and fails to provide insights into the structure and quality of reasoning traces. Our setup allows analysis not only of the final answers but also of the internal reasoning traces, offering insights into how Large Reasoning Models (LRMs) ‘think.’”
This study introduces much-needed rigor to a field often dominated by marketing hype, especially at a time when tech giants are touting A.G.I. as just around the corner. It may also explain Apple’s more cautious approach to A.I. development.
Apple reports its own A.I. progress at WWDC
The research paper was released just days before Apple’s annual WWDC developer conference, which kicked off today. In the opening keynote, Apple executives unveiled the Foundation Models framework, which will let developers integrate Apple’s on-device A.I. models into their apps, enabling capabilities such as image generation, text creation and natural language search.
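As a rough illustration of what that integration looks like, here is a minimal Swift sketch based on the API Apple previewed at WWDC; exact names, prompts and availability may differ in the shipping framework:

```swift
import FoundationModels

// Ask the on-device model for a short piece of text and print the reply.
// (Sketch based on the WWDC preview; requires a device and OS that support the framework.)
let session = LanguageModelSession()
let response = try await session.respond(
    to: "Write a one-sentence caption for a photo of a sunset over the ocean."
)
print(response.content)
```

The pitch to developers is that this kind of call runs locally on Apple silicon rather than in the cloud, which fits the company's privacy-focused framing of its A.I. efforts.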