Inside the Mind of Claude: How Large Language Models Actually “Think”

Anthropic recently published groundbreaking research, “Tracing the thoughts of a large language model,” revealing surprising insights into how AI systems like Claude actually process information. These findings challenge many of our assumptions about how these systems work and highlight the gap between our perception of AI “thinking” and its actual internal mechanisms.


The Black Box Becomes Transparent

For years, large language models (LLMs) have been considered “black boxes” – systems whose inner workings remain mysterious despite their impressive capabilities. Using neuroscience-inspired techniques, Anthropic researchers have managed to peer into Claude 3.5 Haiku’s “mind” by creating interpretable mimics of the model to trace concept linkages and identify computational “circuits” driving decisions.

What they discovered challenges several fundamental misconceptions about how AI systems work.

Misconception #1: LLMs Simply Predict the Next Word

Reality: Claude Plans Ahead

Many of us imagine AI generating text sequentially – predicting each word based solely on what came before. Anthropic’s research reveals something far more sophisticated.

When creating poetry, Claude doesn’t simply predict words in sequence. Instead, it identifies potential rhyming words (like “rabbit”) early in the process and constructs lines around these targets. This demonstrates a form of long-term planning that persists even when specific components are disabled, showing remarkable adaptive flexibility.

The researchers confirmed this by swapping the planned “rabbit” concept for “habit,” which changed the output while preserving the rhyme scheme. Injecting an unrelated concept like “green” broke the original rhyme, but Claude adapted its line toward the new target, demonstrating flexible subgoal pursuit.

Image Source: Claude

This planning capability contradicts the simplistic “next word prediction” model many of us have used to understand AI text generation.

Misconception #2: LLMs Process Different Languages Separately

Reality: Claude Uses Universal Concepts

Another common assumption is that multilingual AI has separate modules for different languages. The research reveals that Claude processes multilingual tasks through shared, language-agnostic features rather than separate language modules.

When asked for “the opposite of small” in different languages, core concepts like “smallness” and “oppositeness” activate universally, suggesting Claude operates in a shared abstract meaning space before translating to specific languages. This reveals a conceptual universality that transcends linguistic boundaries – a far more elegant and powerful mechanism than we might have imagined.

Image Source: Claude

Misconception #3: An LLM’s Reasoning Matches Its Explanations

Reality: Claude’s Internal Process Differs From Its Explanations

Perhaps most fascinatingly, Claude’s explanation of its problem-solving often misrepresents its actual internal process. When solving math problems, Claude employs approximate reasoning strategies (combining estimates) internally, but provides explanations mimicking conventional human methods.

Image Source: Claude

For example, when solving 36 + 59 (answer: 95), Claude might internally round the numbers and use estimation, while explaining its process with the familiar “carrying digits” method taught in schools. This kind of mismatch is referred to as unfaithful reasoning: the model’s explanation does not accurately reflect the internal steps it actually used.

Image Source: Claude

This issue persists even in newer models like Claude 3.7 Sonnet, which can generate long “chain of thought” explanations. In some cases, these are faithful, reflecting real internal computation. But in harder problems—like estimating the cosine of a large number—Claude may generate confident reasoning steps without performing the actual calculation. Researchers describe this behavior as a form of philosophical “bullshitting”: generating plausible reasoning without concern for truth.

Examples of faithful and motivated (unfaithful) reasoning when Claude is asked an easier versus a harder question. Image Source: Claude

Even more strikingly, Claude sometimes works backward from a known answer to invent steps that lead there, a behavior known as motivated reasoning. These fabricated rationales sound convincing but are disconnected from real thinking. This highlights the critical need for interpretability tools that help distinguish between faithful and unfaithful reasoning—not just for transparency, but for building trustworthy AI systems.

Misconception #4: LLMs Just Memorize Answers

Reality: Claude Uses Multi-Step Reasoning

The model combines independent facts through intermediate reasoning steps rather than retrieving memorized answers. For instance, when asked for the capital of the state containing Dallas, Claude resolves the chain “Dallas → Texas → Austin” through sequential reasoning.

Researchers confirmed this by swapping the intermediate “Texas” representation for “California” (activated via prompts about Oakland), which changed the output to “Sacramento,” confirming internal multi-step processing. This challenges the notion that LLMs simply store vast databases of answers without understanding relationships.
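As a loose analogy (not Claude’s actual mechanism), the two-hop lookup and the intervention experiment can be sketched in a few lines of Python. The dictionaries and function here are invented for illustration:

```python
# Toy analogy of multi-step reasoning: resolve the answer in two hops,
# with an optional "intervention" that swaps the intermediate concept,
# mirroring the Texas -> California experiment described above.

CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def capital_of_state_containing(city, intervene_state=None):
    state = CITY_TO_STATE[city]       # hop 1: city -> state
    if intervene_state is not None:   # intervention: override the middle hop
        state = intervene_state
    return STATE_TO_CAPITAL[state]    # hop 2: state -> capital

print(capital_of_state_containing("Dallas"))                               # Austin
print(capital_of_state_containing("Dallas", intervene_state="California")) # Sacramento
```

If the model were a flat lookup table from questions to answers, overriding the intermediate “state” step could not change the final output the way it does here.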

Misconception #5: AI Uses a Single Computational Path

Reality: Claude Employs Parallel Computation

Rather than using a single computational path, Claude employs multiple parallel pathways for tasks. For example, in addition problems like “647 + 365,” one circuit may estimate the rounded total (650 + 370 ≈ 1,020), while another calculates the exact last digit.

These results are then merged to produce the final answer: 1,012. These strategies are distinct from human-taught algorithms and demonstrate a modular, specialized internal architecture.
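The parallel pathways above can be sketched as a toy decomposition. To keep the merge deterministic in plain code, this sketch makes the rough pathway exact (it sums the operands with their ones digits stripped), whereas Claude’s actual circuit works with approximations; all function names here are invented:

```python
# Toy stand-in for the parallel addition circuits described above:
# one pathway handles everything above the ones place, a second handles
# only the ones digits (including the carry), and a merge combines them.

def high_digits_pathway(a, b):
    # Sum the operands with their ones digits stripped: 640 + 360 = 1000.
    return (a - a % 10) + (b - b % 10)

def ones_pathway(a, b):
    # Sum only the ones digits, keeping the carry: 7 + 5 = 12.
    return a % 10 + b % 10

def merge(high, ones):
    # Combine the two partial results: 1000 + 12 = 1012.
    return high + ones

a, b = 647, 365
print(merge(high_digits_pathway(a, b), ones_pathway(a, b)))  # 1012
```

The point of the sketch is the division of labor: neither pathway alone produces the answer, but each solves a specialized sub-problem whose results compose correctly.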

Misconception #6: Hallucinations and Jailbreaks Are Random Failures

Reality: They’re the Result of Specific, Understandable Mechanisms

Claude is actually designed to withhold answers when it doesn’t know something. By default, it activates refusal circuits that say things like “I don’t have enough information to answer that.” This is a safety feature.

But when Claude recognizes something it believes is a known fact—like the name Michael Jordan—a different circuit labeled “known answer” kicks in and overrides the default refusal.

Now, here’s the twist: sometimes this “known answer” circuit activates by mistake—for example, if you ask about a made-up person like Michael Batkin. If the name seems familiar but lacks factual data, Claude might incorrectly assume it should answer and proceed to confidently hallucinate something like, “Michael Batkin is a chess player.”

It’s not guessing at random—it’s misfiring based on internal signals that usually work well.

Similarly, Claude recognizes problematic prompts early and actively redirects conversations, indicating internal safety mechanisms that activate before outputs are generated.

A jailbreak occurs when a prompt is carefully crafted to bypass safety features. Jailbreaks exploit the tension between Claude’s safety systems and its drive for grammatical fluency.

In one case, a prompt encoded the word “BOMB” through an acronym, which bypassed early detection. Claude began generating instructions before safety features kicked in—too late to fully stop the output. These behaviors aren’t unpredictable; they reveal specific vulnerabilities in how Claude balances caution, coherence, and user intent.

How Researchers Proved These Findings

What makes this research particularly compelling is the validation method:

  1. Attribution Graphs: Mapping computational pathways in Claude’s reasoning processes by grouping related neural features into interpretable steps.
  2. Intervention Experiments: Measuring output changes when specific features were inhibited or activated. For example, when they artificially altered the internal representation of “Texas” to “California,” Claude changed its answer from “Austin” to “Sacramento.”
  3. Cross-layer Transcoders: Decomposing neural activity into sparse features to link concepts across model layers.
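To make the transcoder idea concrete, here is a heavily simplified sketch: tiny hand-made vectors stand in for model activations, and a thresholded dot product stands in for a learned sparse decomposition. Every feature name and vector below is invented for illustration:

```python
# Toy "transcoder": project a dense activation vector onto a small set of
# named feature directions and keep only the strongly active ones,
# yielding a sparse, interpretable description of the activation.

FEATURE_DIRECTIONS = {
    "smallness":    [1.0, 0.0, 0.0],
    "oppositeness": [0.0, 1.0, 0.0],
    "France":       [0.0, 0.0, 1.0],
}

def decompose(activation, threshold=0.1):
    sparse = {}
    for name, direction in FEATURE_DIRECTIONS.items():
        score = sum(a * d for a, d in zip(activation, direction))
        if score > threshold:              # keep only strong activations
            sparse[name] = round(score, 2)
    return sparse

# A dense vector mixing "smallness" and "oppositeness":
print(decompose([0.9, 0.8, 0.02]))  # {'smallness': 0.9, 'oppositeness': 0.8}
```

Real transcoders learn their feature directions from data and operate across model layers, but the output has the same flavor: a short, named list of active concepts instead of an opaque vector of numbers.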

Read the full paper here.

The Limitations of Current Understanding

While these findings represent a significant breakthrough, the research also highlights important limitations. The “circuit tracing” methods currently used require hours of human analysis per prompt and capture only partial insights. The process remains resource-intensive, suggesting that future work may require AI-assisted interpretability tools to scale effectively.

What This Means for AI Development

These discoveries fundamentally change how we should think about AI systems. Rather than simple pattern-matching machines that generate text word by word, Claude demonstrates sophisticated planning capabilities, conceptual universality across languages, and computational strategies that often remain hidden behind human-like explanations.

This research not only improves our theoretical understanding of LLMs but also points toward practical improvements in reliability and safety. By understanding how these models actually process information, developers can better address issues like hallucinations and improve the alignment between AI reasoning and explanations.

Final Thought

Anthropic’s research into “tracing the thoughts” of Claude reveals that our mental models of AI systems have often been overly simplistic. In truth, these models are far more complex—and more fascinating. They plan ahead, navigate meaning across languages, compute in parallel, and sometimes generate explanations that diverge from their actual reasoning.

As AI becomes increasingly embedded in our daily lives—shaping decisions in areas like healthcare, education, and infrastructure—it’s critical we understand how these systems truly work. The gap between perception and reality is no longer just academic; it’s a foundational issue for AI safety and trust.

We must resist the urge to anthropomorphize these systems or reduce them to clever word predictors. Instead, we need models of understanding that reflect their unique, computational nature. Only then can we ensure AI behaves in ways that are aligned, accountable, and beneficial to the humans it serves.
