Apple Research Finds ‘Reasoning’ A.I. Models Aren’t Actually Reasoning

Apple researchers argue that what we often refer to as “reasoning” may, in fact, be little more than sophisticated pattern-matching.

Compared with its Big Tech rivals, Apple has been cautiously slow to develop A.I. Justin Sullivan/Getty Images

Just as the hype around artificial general intelligence (A.G.I.) reaches a fever pitch, Apple has delivered a sobering reality check to the industry. In a research paper titled “The Illusion of Thinking,” published on June 6, the company argues that the most advanced A.I. models available today—those billed as capable of “human-level reasoning”—struggle with complex logic problems. Instead of genuinely thinking like humans, these models rely on pattern recognition: drawing from familiar cues in their training data and predicting the next step. When faced with unfamiliar or challenging tasks, the models either offer weak responses or fail entirely.


In a controlled study, Apple researchers tested large language models (LLMs) such as Anthropic’s Claude 3.7 Sonnet and DeepSeek-V3 alongside their “reasoning-optimized” counterparts (Claude 3.7 Sonnet with Thinking and DeepSeek-R1). The team used classic logic puzzles like the Tower of Hanoi and River Crossing—well-established benchmarks for an A.I.’s algorithmic, planning and reasoning capabilities. The Tower of Hanoi tests recursion and step-by-step problem-solving, while River Crossing puzzles assess an A.I.’s ability to plan and execute multi-step solutions.
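To make the benchmark concrete, below is a minimal sketch of the classic recursive Tower of Hanoi solution (illustrative only, not code from Apple’s paper; the function name is a placeholder). Moving N disks takes 2^N − 1 moves, so the length of a correct solution grows exponentially with puzzle size, which is what makes the harder instances so punishing.

```swift
// Classic recursive Tower of Hanoi solver: an illustrative sketch,
// not code from Apple's paper. Moving n disks takes 2^n - 1 moves,
// so solution length grows exponentially with puzzle size.
func hanoi(_ n: Int, from: String, to: String, via: String, moves: inout [String]) {
    guard n > 0 else { return }
    // Move the top n-1 disks out of the way onto the spare peg...
    hanoi(n - 1, from: from, to: via, via: to, moves: &moves)
    // ...move the largest disk directly to the target...
    moves.append("disk \(n): \(from) -> \(to)")
    // ...then stack the n-1 disks back on top of it.
    hanoi(n - 1, from: via, to: to, via: from, moves: &moves)
}

var moves: [String] = []
hanoi(3, from: "A", to: "C", via: "B", moves: &moves)
print(moves.count)   // 7 moves, i.e. 2^3 - 1
```

A model that has genuinely internalized this recursion should be able to extend it to any number of disks; Apple’s finding is that accuracy collapses on larger instances well before the computational budget runs out.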

Apple’s researchers categorized the puzzles into three difficulty levels: low (3 steps), medium (4–10 steps) and high (11–20 steps). While most models handled the simpler tasks with reasonable success, their performance dropped dramatically as the puzzles grew more complex—regardless of model size, training method or computational power. Accuracy did not improve even when the models were handed the correct solution algorithm outright or allowed a generous budget of up to 64,000 tokens; their responses remained shallow.

Through this study, Apple researchers argue that what we often refer to as “reasoning” may, in fact, be little more than sophisticated pattern-matching. They describe a “counterintuitive scaling limit”: as problems grow more complex, models actually scale back their reasoning effort, even with ample computational budget left to spend.

“Current evaluations focus primarily on established mathematical and coding benchmarks, emphasizing final answer accuracy,” Apple wrote in a blog post about the findings. “However, this paradigm often suffers from data contamination and fails to provide insights into the structure and quality of reasoning traces. Our setup allows analysis not only of the final answers but also of the internal reasoning traces, offering insights into how Large Reasoning Models (LRMs) ‘think.’”

This study introduces much-needed rigor to a field often dominated by marketing hype, especially at a time when tech giants are touting A.G.I. as just around the corner. It may also explain Apple’s more cautious approach to A.I. development.

Apple reports its own A.I. progress at WWDC

The research paper dropped just days before Apple’s annual WWDC developers conference, which kicked off today. In the opening keynote, Apple executives unveiled the Foundation Models framework, which gives developers direct access to Apple’s on-device A.I. models from within their apps, enabling capabilities such as image generation, text creation and natural language search.
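As a rough sketch of what that looks like in practice (based on the Foundation Models API Apple previewed at the keynote; the function name and prompt are illustrative placeholders), a developer can open a session with the on-device model and request a response in a few lines of Swift:

```swift
import FoundationModels

// Sketch based on the Foundation Models API Apple previewed at
// WWDC; the function name and prompt are illustrative placeholders.
func suggestAlbumNames() async throws -> String {
    // A session is the entry point to Apple's on-device model.
    let session = LanguageModelSession()
    // respond(to:) sends the prompt to the on-device model
    // and returns the generated text.
    let response = try await session.respond(
        to: "Suggest three names for a hiking photo album."
    )
    return response.content
}
```

Because the model runs on-device, a call like this needs no network round trip, which is central to the cloud-free pitch Apple made in the keynote.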

Apple also introduced Xcode 26, a major update to its developer toolkit that adds built-in support for third-party A.I. models like ChatGPT and Claude via API keys. The update lets developers use these models to write code, generate tests and documentation, and debug. Together, the announcements mark a significant step in Apple’s A.I. strategy: the Foundation Models framework in particular aims to let developers build intelligent features that run on-device, without relying on cloud infrastructure.
