These days, large language models (LLMs) can handle increasingly demanding tasks, writing complex code and engaging in sophisticated reasoning. But when it comes to 4-digit multiplication, a task taught in elementary school, even state-of-the-art systems fail. Why? A new paper by Computer Science PhD student Xiaoyan Bai and Faculty Co-Director of the Novel Intelligence Research Initiative Chenhao Tan, along with collaborators from MIT, Harvard, University of Waterloo, and Google DeepMind, reverse-engineers failure and success to find answers.

As you may remember (or have forgotten), multiplying larger numbers requires carrying over digits and mentally “holding on” to partial products so you can add them up to get your final answer. When a later step depends on information produced much earlier in this way, researchers call it a “long-range dependency.”
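To make that concrete, here is a minimal Python sketch of schoolbook multiplication that keeps every partial product around until the final addition. The function and variable names are illustrative, not taken from the paper.

```python
def schoolbook_multiply(a: int, b: int) -> int:
    """Multiply the way it's taught on paper: one partial product per
    digit of b, all of which must be held until the final sum."""
    partial_products = []
    for position, digit in enumerate(reversed(str(b))):
        # e.g. for 1234 x 5678 this stores 1234*8, 1234*70, 1234*600, 1234*5000
        partial_products.append(a * int(digit) * 10 ** position)

    # The final answer depends on *every* stored partial product -- the
    # "long-range dependency" a model must somehow track internally.
    return sum(partial_products)


assert schoolbook_multiply(1234, 5678) == 1234 * 5678
```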

Standard models work by learning to recognize patterns in the data they’re trained on. But the more complex a problem gets, the less likely a model is to have seen it specifically. So how do you teach a model to not just memorize answers but learn a process?

Why Standard Training Fails

Models are often taught new tasks via standard fine-tuning (SFT), which relies on scaling up the training data or making the model bigger by adding more “layers.” But even when the research team tested models ranging from 2 layers all the way up to 12 layers, all of them achieved less than 1% accuracy on 4-digit-by-4-digit multiplication. Why were the standard approaches failing here?

The researchers found that under the SFT approach with gradient descent (an iterative optimization algorithm), models converge to a “local optimum”: a solution that beats its immediate neighbors but falls well short of the best possible one. However, this shortcut solution never captures those long-range dependencies.

The problem isn’t lack of training, the team found. Rather, the model is trapped: without an architecture that lets it store and retrieve intermediate information, it can’t step beyond that local optimum, no matter how long it trains, or how large it scales.

What ICoT Does Differently

Next, the researchers examined a model trained using a different method: Implicit Chain of Thought (ICoT). Where SFT achieved less than 1% accuracy, the ICoT model achieved 100% accuracy. To understand what this approach was doing differently, the team took both models apart to uncover some fundamental insights:

The ICoT model learns to remember what matters. Unlike the SFT model, the ICoT model learned to track those “long-range dependencies.” The team verified this by testing whether they could decode intermediate values (like running sums) from the models’ internal states. In the ICoT model, they could; in the standard model, they couldn’t. The ICoT method gradually removes intermediate reasoning steps during training, in a sense forcing the model to internalize the reasoning process in its hidden states rather than relying on explicit step-by-step tokens.

The ICoT model organizes its attention into branches across time. Think of it like a well-organized filing system: in early layers, the model computes products of digit pairs and stores them at specific locations. In later layers, it retrieves exactly the “cached” products needed for each output digit. This creates an efficient directed graph for implementing the multiplication algorithm, a structure the standard model never develops.

Mathematics rendered in geometric form: Perhaps most remarkably, the model represents digits and their operations using elegant mathematical structures. Digits are encoded using wave-like patterns (Fourier bases) that form a pentagonal prism shape in the model’s internal representation space. When multiplying digit pairs, the model uses a natural geometric operation called a Minkowski sum, which notably wasn’t programmed by the researchers, but rather emerged naturally during training in the ICoT model. It’s as if the successful model derived its own efficient mathematical language for arithmetic.
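For readers who want a feel for those two ingredients, here is a toy Python sketch of Fourier-style digit encodings and a Minkowski sum of point sets. It only illustrates the general concepts; the frequencies and dimensions are made up for the example and are not the representation the ICoT model actually learned.

```python
import numpy as np


def fourier_encode(digit: int, frequencies=(1, 2)) -> np.ndarray:
    """Encode a base-10 digit as [cos, sin] coordinates at a few
    frequencies -- a simple Fourier-basis embedding of the digit."""
    angles = [2 * np.pi * k * digit / 10 for k in frequencies]
    return np.concatenate([[np.cos(t), np.sin(t)] for t in angles])


def minkowski_sum(set_a: np.ndarray, set_b: np.ndarray) -> np.ndarray:
    """Minkowski sum of two point sets: every pairwise vector sum."""
    return np.array([a + b for a in set_a for b in set_b])


digits = np.arange(10)
encoded = np.stack([fourier_encode(d) for d in digits])  # 10 points in R^4
pairs = minkowski_sum(encoded, encoded)                  # 100 combined points
print(pairs.shape)  # (100, 4)
```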

A Simple Fix

If the SFT models’ struggle with long-range dependencies was about missing inductive biases, then providing the right training signal should fix it. To validate their understanding, the team introduced a simple solution: they added an “auxiliary loss” that trains lightweight linear probes to predict running sums at each step, capturing and carrying intermediate values and partial products.

It turned out that making this one addition to the 2-layer model that had completely failed under standard training did the trick: 99% accuracy without explicit chain-of-thought supervision.
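As a rough illustration of the idea (not the team’s actual code), an auxiliary loss of this kind might look like the following PyTorch sketch, where the running sums are treated as class labels and all shapes and names are hypothetical:

```python
import torch.nn as nn


class RunningSumProbe(nn.Module):
    """Lightweight linear probe that reads a hidden state and predicts the
    running (partial) sum at that step. Shapes and names are hypothetical."""

    def __init__(self, hidden_dim: int, num_sum_classes: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, num_sum_classes)

    def forward(self, hidden_states):      # (batch, seq_len, hidden_dim)
        return self.probe(hidden_states)   # (batch, seq_len, num_sum_classes)


def combined_loss(logits, targets, hidden_states, probe, running_sums,
                  aux_weight=0.1):
    """Next-token cross-entropy plus an auxiliary probe loss that pushes the
    intermediate running sums into the model's hidden states."""
    ce = nn.CrossEntropyLoss()
    main_loss = ce(logits.flatten(0, 1), targets.flatten())         # LM loss
    aux_logits = probe(hidden_states)                               # probe hidden states
    aux_loss = ce(aux_logits.flatten(0, 1), running_sums.flatten()) # supervise sums
    return main_loss + aux_weight * aux_loss
```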

Inspecting this model’s attention patterns revealed that it had learned mechanisms similar to ICoT’s, including the sparse binary tree structure for caching and retrieving partial products. It had also developed additional strategies, such as an “attention head” that allows it to simultaneously track all of the necessary digit pairs.

Novel Intelligence and the Jagged Frontier

While multiplication might seem like a narrow task, the findings illuminate fundamental aspects of how transformers learn and “think.” The long-range dependency problem isn’t unique to arithmetic; it appears throughout language modeling and other sequential tasks, and it demonstrates AI’s “jagged frontier,” that is, its capacity to excel at complex reasoning yet stumble on seemingly simple tasks.

The team’s approach raises foundational questions about the distinction between memorization and learning, and about which architectural constraints help or hinder models’ performance.

“As AI is increasingly integrated into critical decision-making, it’s essential to understand its unique ways of learning and thinking,” said Tan. “Our research is trying to chart that terrain.”

This paper’s key contribution: architectural insights and training techniques can overcome obstacles that scaling alone cannot address. The right inductive biases, not just more parameters or data, are key to pushing AI capabilities forward. While the auxiliary loss solution is task-specific, the researchers anticipate future work will develop more general approaches to improve learning on tasks requiring long-range dependencies.

This article originated from the Data Science Institute.
