These days, large language models (LLMs) can handle increasingly demanding tasks, writing complex code and engaging in sophisticated reasoning. But when it comes to 4-digit multiplication, a task taught in elementary school, even state-of-the-art systems fail. Why? A new paper by Computer Science PhD student Xiaoyan Bai and Faculty Co-Director of the Novel Intelligence Research Initiative Chenhao Tan, along with collaborators from MIT, Harvard, University of Waterloo, and Google DeepMind, reverse-engineers failure and success to find answers.

As you may remember (or have forgotten), multiplying larger numbers requires carrying over digits and mentally “holding on” to partial products so you can add them up to get your final answer. When a later step depends on information produced much earlier in this way, researchers call it a “long-range dependency.”
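To make that concrete, here is a minimal Python sketch of schoolbook multiplication that keeps every partial product around until the final addition. The function and variable names are illustrative, not taken from the paper.

```python
def schoolbook_multiply(a: int, b: int) -> int:
    """Multiply the way it's taught on paper: one partial product per
    digit of b, all of which must be held until the final sum."""
    partial_products = []
    for position, digit in enumerate(reversed(str(b))):
        # e.g. for 1234 x 5678 this stores 1234*8, 1234*70, 1234*600, 1234*5000
        partial_products.append(a * int(digit) * 10 ** position)

    # The final answer depends on *every* stored partial product -- the
    # "long-range dependency" a model must somehow track internally.
    return sum(partial_products)


assert schoolbook_multiply(1234, 5678) == 1234 * 5678
```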

Standard models work by learning to recognize patterns in the data they’re trained on. But the more complex a problem gets, the less likely a model is to have seen it specifically. So how do you teach a model to not just memorize answers but learn a process?

Why Standard Training Fails

Models are often taught new tasks via standard fine-tuning (SFT), which relies on scaling up the training data or making the model bigger by adding more “layers.” But even when the research team tested models ranging from 2 layers all the way up to 12 layers, all of them achieved less than 1% accuracy on 4-digit-by-4-digit multiplication. Why were the standard approaches failing here?

The researchers found that under the SFT approach with gradient descent (an iterative optimization algorithm), models converge to a “local optimum”: a solution that beats its immediate neighbors but falls well short of the best possible one. However, this shortcut solution never captures those long-range dependencies.

The problem isn’t lack of training, the team found. Rather, the model is trapped: without an architecture that lets it store and retrieve intermediate information, it can’t step beyond that local optimum, no matter how long it trains, or how large it scales.

What ICoT Does Differently

Next, the researchers examined a model trained using a different method: Implicit Chain of Thought (ICoT). Where SFT achieved less than 1% accuracy, the ICoT model achieved 100% accuracy. To understand what this approach was doing differently, the team took both models apart to uncover some fundamental insights:

The ICoT model learns to remember what matters. Unlike the SFT model, the ICoT model learned to track those “long-range dependencies.” The team verified this by testing whether they could decode intermediate values (like running sums) from the models’ internal states. In the ICoT model, they could; in the standard model, they couldn’t. The ICoT method gradually removes intermediate reasoning steps during training, in a sense forcing the model to internalize the reasoning process in its hidden states rather than relying on explicit step-by-step tokens.

The ICoT model organizes its attention into branches across time. Think of it like a well-organized filing system: in early layers, the model computes products of digit pairs and stores them at specific locations. In later layers, it retrieves exactly the “cached” products needed for each output digit. This creates an efficient directed graph for implementing the multiplication algorithm, a structure the standard model never develops.

Mathematics rendered in geometric form: Perhaps most remarkably, the model represents digits and their operations using elegant mathematical structures. Digits are encoded using wave-like patterns (Fourier bases) that form a pentagonal prism shape in the model’s internal representation space. When multiplying digit pairs, the model uses a natural geometric operation called a Minkowski sum, which notably wasn’t programmed by the researchers, but rather emerged naturally during training in the ICoT model. It’s as if the successful model derived its own efficient mathematical language for arithmetic.
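For readers who want a feel for those two ingredients, here is a toy Python sketch of Fourier-style digit encodings and a Minkowski sum of point sets. It only illustrates the general concepts; the frequencies and dimensions are made up for the example and are not the representation the ICoT model actually learned.

```python
import numpy as np


def fourier_encode(digit: int, frequencies=(1, 2)) -> np.ndarray:
    """Encode a base-10 digit as [cos, sin] coordinates at a few
    frequencies -- a simple Fourier-basis embedding of the digit."""
    angles = [2 * np.pi * k * digit / 10 for k in frequencies]
    return np.concatenate([[np.cos(t), np.sin(t)] for t in angles])


def minkowski_sum(set_a: np.ndarray, set_b: np.ndarray) -> np.ndarray:
    """Minkowski sum of two point sets: every pairwise vector sum."""
    return np.array([a + b for a in set_a for b in set_b])


digits = np.arange(10)
encoded = np.stack([fourier_encode(d) for d in digits])  # 10 points in R^4
pairs = minkowski_sum(encoded, encoded)                  # 100 combined points
print(pairs.shape)  # (100, 4)
```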

A Simple Fix

If the SFT models’ struggle with long-range dependencies was about missing inductive biases, then providing the right training signal should fix it. To validate their understanding, the team introduced a simple solution: they added an “auxiliary loss” that trains lightweight linear probes to predict running sums at each step, capturing and carrying intermediate values and partial products.

It turned out that making this one addition to the 2-layer model that had completely failed under standard training did the trick: 99% accuracy without explicit chain-of-thought supervision.
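As a rough illustration of the idea (not the team’s actual code), an auxiliary loss of this kind might look like the following PyTorch sketch, where the running sums are treated as class labels and all shapes and names are hypothetical:

```python
import torch.nn as nn


class RunningSumProbe(nn.Module):
    """Lightweight linear probe that reads a hidden state and predicts the
    running (partial) sum at that step. Shapes and names are hypothetical."""

    def __init__(self, hidden_dim: int, num_sum_classes: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, num_sum_classes)

    def forward(self, hidden_states):      # (batch, seq_len, hidden_dim)
        return self.probe(hidden_states)   # (batch, seq_len, num_sum_classes)


def combined_loss(logits, targets, hidden_states, probe, running_sums,
                  aux_weight=0.1):
    """Next-token cross-entropy plus an auxiliary probe loss that pushes the
    intermediate running sums into the model's hidden states."""
    ce = nn.CrossEntropyLoss()
    main_loss = ce(logits.flatten(0, 1), targets.flatten())         # LM loss
    aux_logits = probe(hidden_states)                               # probe hidden states
    aux_loss = ce(aux_logits.flatten(0, 1), running_sums.flatten()) # supervise sums
    return main_loss + aux_weight * aux_loss
```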

Inspecting this model’s attention patterns revealed that it had learned mechanisms similar to ICoT’s, including the sparse binary tree structure for caching and retrieving partial products. It had also developed additional strategies, such as an “attention head” that allows it to simultaneously track all of the necessary digit pairs.

Novel Intelligence and the Jagged Frontier

While multiplication might seem like a narrow task, the findings illuminate fundamental aspects of how transformers learn and “think.” The long-range dependency problem isn’t unique to arithmetic; it appears throughout language modeling and other sequential tasks, and it demonstrates AI’s “jagged frontier,” that is, its capacity to excel at complex reasoning yet stumble on seemingly simple tasks.

The team’s approach raises foundational questions about the distinction between memorization and learning, and about which architectural constraints help or hinder models’ performance.

“As AI is increasingly integrated into critical decision-making, it’s essential to understand its unique ways of learning and thinking,” said Tan. “Our research is trying to chart that terrain.”

This paper’s key contribution: architectural insights and training techniques can overcome obstacles that scaling alone cannot address. The right inductive biases, not just more parameters or data, are key to pushing AI capabilities forward. While the auxiliary loss solution is task-specific, the researchers anticipate future work will develop more general approaches to improve learning on tasks requiring long-range dependencies.

This article originated from the Data Science Institute.
