
How Hierarchical Reasoning Models Are Redefining AI: Can a Tiny Model Really Outsmart the Giants?

Sep 28, 2025 · 8 min read

Let’s rewind a bit. If you’ve spent any time around AI research, you know the drill: bigger models, more parameters, more power. But what if I told you a model with just 27 million parameters is suddenly making waves by solving puzzles and reasoning tasks that leave even the largest language models scratching their heads? That’s exactly what’s happening with Hierarchical Reasoning Models (HRMs).


But here’s where it gets really interesting: when you look at HRMs through the lens of Quaternion Process Theory of Cognition, something profound emerges. These models aren’t just smaller and faster — they’re the first AI systems to successfully implement what might be called “four-dimensional thinking,” operating across Fast-Fluency (automatic skill application), Fast-Empathy (intuitive understanding), Slow-Fluency (systematic analysis), and Slow-Empathy (deliberate perspective-taking) in ways that mirror human cognitive architecture.

So, if you’re wondering how a “small” model can punch above its weight, what makes HRMs tick, and whether the hype is real, stick around. We’ll start from the basics and walk through the architecture, the training tricks, and the controversy — step by step, in plain English. And along the way, we’ll explore how HRMs might represent the first true breakthrough toward artificial minds that don’t just process information, but actually think about thinking.

The Big Question: Can Small Models Reason Like Giants?

Let’s start with the premise. HRMs are being hyped because, on paper, they outperform much larger models on tough reasoning tasks like Sudoku, maze navigation, and ARC-AGI: the reported results are strong for such a small parameter count, though performance varies across ARC-AGI subsets.

What’s wild is that these aren’t just language tasks; we’re talking logic, deduction, and problem-solving. The kicker? HRMs aim to do this with a fraction of the parameters, and some reported wins come from inference-time iteration and task-specific augmentation as much as architecture.

But, as with any AI breakthrough, there’s a catch (or two). There’s been a lot of debate about whether the results are too good to be true — possible “cheating” via clever data augmentation, or maybe just benchmarks that don’t tell the whole story. We’ll get to that, but first, let’s unpack how HRMs actually work.

Step 1: How Do Normal LLMs Tackle Reasoning?

Typical large language models (LLMs) like GPT take your input (tokens, text, puzzles), process it through a ton of layers, and spit out answers one token at a time. They’re great at pattern matching, but they don’t maintain an explicit internal problem-solving loop within a single forward pass; when they appear to reason step by step, they’re really approximating it from learned patterns.


From a cognitive perspective, traditional LLMs are essentially stuck in a single quadrant of thinking — they excel at Fast-Fluency (rapid pattern matching and automatic responses) but struggle with the other three cognitive modes. They can’t easily switch between intuitive leaps (Fast-Empathy), systematic analysis (Slow-Fluency), and strategic perspective-taking (Slow-Empathy). This single-quadrant limitation is why scaling up parameters only gets you so far.

Researchers have tried to push LLMs to “think harder” at test time — using tricks like prompting them to break down their answers into steps. But fundamentally, these models aren’t built to reason internally; they just predict what comes next.

Step 2: The HRM Twist — Reasoning as a Built-In Feature

Here’s where HRMs flip the script. Instead of just running a forward pass and hoping for the best, HRMs are designed to iterate internally at inference time. Think of it like this: rather than blurting out the first thing that comes to mind, an HRM “ponders” the input, refining its internal state multiple times before committing to a prediction.

The architecture is built around two recurrent neural networks (RNNs):

Fast (Lower-Level) Model: Updates its “thought process” rapidly, over and over, tweaking its internal state with each pass.
Slow (Higher-Level) Model: Sits back, observes the fast model’s progress, and only updates occasionally — kind of like a supervisor checking in after every few steps.


This setup lets the model iterate internally, refining its understanding and solution before committing to an answer. The lower-level model explores possibilities quickly, while the higher-level model ensures the process doesn’t get stuck or go off the rails.
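
To make that concrete, here’s a minimal sketch of the loop in PyTorch-style Python. The names (HRMSketch, f_low, f_high) and the budgets (T inner steps, N outer cycles) are my own illustrative choices, not the published implementation; the point is just that the fast module iterates inside a context that the slow module refreshes only occasionally.

```python
# A minimal, hedged sketch of the two-timescale loop in PyTorch style.
# HRMSketch, f_low, f_high, T, and N are illustrative names and defaults,
# not the paper's exact implementation.
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    def __init__(self, d_in, d_hidden, d_out, T=4, N=3):
        super().__init__()
        self.f_low = nn.GRUCell(d_in + d_hidden, d_hidden)  # fast, low-level module
        self.f_high = nn.GRUCell(d_hidden, d_hidden)         # slow, high-level module
        self.readout = nn.Linear(d_hidden, d_out)
        self.T, self.N = T, N                                 # inner steps, outer cycles

    def forward(self, x):
        z_low = torch.zeros(x.size(0), self.f_low.hidden_size, device=x.device)
        z_high = torch.zeros_like(z_low)
        for _ in range(self.N):            # slow "supervisor" cycles
            for _ in range(self.T):        # fast inner refinement steps
                z_low = self.f_low(torch.cat([x, z_high], dim=-1), z_low)
            z_high = self.f_high(z_low, z_high)  # H reads L's result, updates context
        return self.readout(z_high)
```

In this toy version, z_low gets T rapid refinements per cycle while z_high moves only once per cycle, which is the “supervisor checking in every few steps” behavior described above.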

The Quaternion Connection: Four-Dimensional Thinking

What makes this architecture truly revolutionary is how it implements what cognitive scientists call the “quaternion space” of thinking. Just as mathematical quaternions operate in four dimensions, human cognition operates across two orthogonal axes:

  1. Temporal Axis: Fast processing (immediate, automatic) vs. Slow processing (deliberate, strategic)
  2. Operational Axis: Fluency (sequential, logical) vs. Empathy (contextual, adaptive)

This creates four cognitive quadrants:

  • Fast-Fluency: Immediate pattern recognition, automatic skill application
  • Fast-Empathy: Intuitive understanding, “gut feelings” about complex scenarios
  • Slow-Fluency: Systematic analysis, deliberate algorithmic procedures
  • Slow-Empathy: Strategic perspective-taking, thoughtful context adaptation

So how do HRMs actually occupy that 4-D space — and why does that let a 27M-parameter model punch above its weight?

How HRMs map to the four quadrants

Think of the HRM’s two-module design as a simple hardware trick that gives you a richer software of thought.

  • The lower-level (L) model is the workhorse of fluency. It’s the high-frequency engine that runs many rapid update passes: pattern matching, exploring candidate solutions, iterating small tactics. That’s your Fast-Fluency in action — and when you let the L-module run more iterations under the same global context, it also performs Slow-Fluency style work (deliberate computation) by converging toward a fixed point.
  • The higher-level (H) model is the slow, contextual supervisor. It doesn’t update every tiny step — it observes, abstracts, imposes constraints, and nudges the L-module onto better solution manifolds. That’s where Slow-Empathy lives: perspective-taking, global constraints, and reframing. When the H-module adapts quickly to L’s signals (or when its guidance is applied across iterations), you can also see the Fast-Empathy flavor — quick intuitive re-contextualization driven by a learned prior.

The key is interaction. The L-module iterates rapidly in the context set by H. After some inner-loop thinking, L reports back; H updates less frequently but with higher-level corrections. The order matters — L then H, H then L — and that non-commutative flow is why the quaternion metaphor (order + four modes) fits as an intuition: changing the sequence of interactions changes the result.
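
If the “order matters” point feels abstract, here’s a tiny, self-contained illustration in plain Python (the qmul helper is mine, not part of any HRM codebase): multiplying the quaternion units i and j in different orders gives opposite results, which is the algebraic version of “L-then-H is not the same as H-then-L.”

```python
# Quaternion multiplication is non-commutative, just as running L-then-H
# differs from H-then-L. Pure-Python Hamilton product, no external libraries.
def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
print(qmul(i, j))  # (0, 0, 0, 1)  -> +k
print(qmul(j, i))  # (0, 0, 0, -1) -> -k: swapping the order flips the result
```

That is all the metaphor is being asked to carry here: four distinct modes, plus sensitivity to the order in which they interact.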

The core mechanics (plain English)

If you strip away the jargon, HRMs do three things differently from standard LLMs:

  1. Internal iteration before answering. Instead of one forward pass → output, HRMs run a mini-conversation with themselves: L performs T quick updates, H then updates, and this cycle can repeat. The model “thinks” at inference time.
  2. Adaptive stopping. A small halting or Q-network decides if another cycle is needed (think: “Do I need more time?”). That’s how HRMs switch between “fast” and “slow” thinking on the fly.
  3. Memory-efficient training tricks. Instead of backpropagating through every internal update (expensive), HRMs use equilibrium/fixed-point approximations and deep supervision on intermediate states, which makes training feasible with tiny datasets.
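
To make the second and third points concrete, here’s a hedged sketch of what one training step could look like, reusing the HRMSketch module from the earlier snippet. The halting head, the 0.5 threshold, and the detach() placement are my simplifications for illustration; the real recipe uses a learned Q-style halting signal and fixed-point gradient approximations, as noted above, and this only imitates the flavor.

```python
# A hedged sketch of one training step, reusing HRMSketch from the earlier
# snippet. halt_head (e.g. a Linear(d_hidden, 1)), the 0.5 threshold, and the
# detach() placement are simplifications, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def train_step(model, halt_head, x, target, optimizer, max_cycles=8):
    z_low = torch.zeros(x.size(0), model.f_low.hidden_size, device=x.device)
    z_high = torch.zeros_like(z_low)
    total_loss = 0.0

    for _ in range(max_cycles):
        # One "segment": T fast low-level steps, then one slow high-level update.
        for _ in range(model.T):
            z_low = model.f_low(torch.cat([x, z_high], dim=-1), z_low)
        z_high = model.f_high(z_low, z_high)

        # Deep supervision: every segment's intermediate answer gets its own loss.
        logits = model.readout(z_high)
        total_loss = total_loss + F.cross_entropy(logits, target)

        # Adaptive stopping: a small head scores "do I need more time?". A plain
        # sigmoid threshold stands in for the learned Q-style halting signal.
        if torch.sigmoid(halt_head(z_high)).mean().item() > 0.5:
            break

        # Detach states between segments so each segment's loss backpropagates
        # only through that segment: a crude stand-in for the fixed-point
        # gradient approximations mentioned above.
        z_low, z_high = z_low.detach(), z_high.detach()

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

At inference you would drop the loss and optimizer, keep the halting check, and optionally raise max_cycles for harder inputs, which leads directly into the next section.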

Why this can beat brute-force scaling (sometimes)

There are a few complementary reasons tiny HRMs can look better than huge models on certain problems:

  • Effective depth without huge width. Iteration gives you logical depth: a small model that reasons for 50 internal steps can behave like a much deeper network, without the parameter blowup.
  • Inference-time scaling. You can increase the number of cycles at test time to push for harder reasoning, without retraining (see the short example after this list).
  • Data efficiency. For structured, algorithmic tasks, repeated internal refinement + strong inductive bias (hierarchical structure) is more sample-efficient than trying to learn all reasoning patterns from lots of diverse examples.
  • Bias toward computation, not memorization. HRMs are architected to compute a solution by iteration instead of relying on memorized input-output pairs. That makes them better at algorithmic generalization in many settings.
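
Here’s the inference-time scaling idea in its simplest possible form, using the untrained HRMSketch toy from the earlier snippet. The sizes are arbitrary and no real gains are implied; this only shows the mechanics of raising the cycle budget without touching the weights.

```python
import torch

model = HRMSketch(d_in=128, d_hidden=256, d_out=10)  # toy module from the earlier sketch
x = torch.randn(1, 128)                              # stand-in for an encoded puzzle

baseline = model(x)    # default budget (N=3 outer cycles, T=4 inner steps)
model.N = 16           # raise the "thinking budget" at test time, no retraining
harder_try = model(x)  # same weights, more internal iteration before answering
```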

The caveats (don’t buy the hype wholesale)

That said, the story isn’t a slam dunk across the board.

  • Benchmarks are narrow. Many wins are on structured puzzles (Sudoku, mazes, ARC-type problems). Those are great proofs of concept for reasoning, but they’re not the same as real-world, messy language understanding or open-ended planning.
  • Halting heuristics are brittle. The adaptive stopping mechanism is powerful but needs careful calibration; poor halting policies can waste compute or stop too early.
  • Training approximations are approximations. Fixed-point gradient shortcuts save memory, but they can introduce bias. Scaling these tricks to long, noisy gradients (e.g., real dialog data) is nontrivial.
  • Metaphor vs math. The quaternion framing is an excellent conceptual lens — but it’s not a literal quaternion implementation. Treat it like a design map, not a proof that HRMs model human minds.
  • Reproducibility & leakage risks. As with any sudden result, we need independent replication. Small models can sometimes hide data-leakage or benchmark tuning; transparency matters.

What researchers and builders should try next

If you’re a tinkerer or researcher, here are actionable ideas inspired by HRMs + quaternion thinking:

  • Make explicit empathy modules. Add a small specialized module trained on perspective/task framing (SE/FE) and connect it to H as a multi-head controller. Test on social reasoning or theory-of-mind tasks.
  • Hybridize with LLMs. Use an HRM front-end for structured reasoning + a pretrained LLM as a fluent language interface. Let HRM generate stepwise plans and LLM translate/explain them.
  • Ablate iteration budgets. Sweep inner-loop lengths (T) and max cycles (M_max) to measure where gains saturate; a minimal sweep harness is sketched after this list. This helps you understand the compute vs. generalization tradeoff.
  • Stress test halting. Train halting under adversarial inputs and measure whether the model halts early or overthinks; build calibration losses.
  • Real tasks, tiny data. Try HRM ideas where labeled data is scarce but structure exists: theorem proving fragments, program synthesis subroutines, supply-chain planning constraints.
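
For the iteration-budget ablation, the sweep can be as simple as the loop below. The evaluate function is a dummy stand-in that scores random inputs, so swap in your own validation loop, dataset, and metric.

```python
import torch

def evaluate(model, n_examples=64):
    """Dummy stand-in for a real validation loop: random inputs, placeholder score."""
    x = torch.randn(n_examples, 128)                # replace with real encoded puzzles
    with torch.no_grad():
        preds = model(x).argmax(dim=-1)
    return (preds == 0).float().mean().item()       # replace with task accuracy

model = HRMSketch(d_in=128, d_hidden=256, d_out=10)  # toy module from the earlier sketch
for T in (2, 4, 8, 16):                              # inner-loop length
    for n_cycles in (1, 2, 4, 8):                    # outer cycle budget (M_max analogue)
        model.T, model.N = T, n_cycles
        print(f"T={T:>2}  cycles={n_cycles:>2}  score={evaluate(model):.3f}")
```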

Final take — modest optimism

Hierarchical Reasoning Models don’t magically make small models omniscient, nor do they instantly dethrone giant LLMs for every task. What they do show is that architecture and inductive bias still matter — a lot. By giving a model the ability to internally iterate, supervise intermediate thought, and adaptively choose its thinking depth, HRMs offer a practical path toward more efficient, interpretable reasoning systems.

Viewed through the quaternion lens, HRMs are exciting because they explicitly separate modes of cognition (fast/slow × fluency/empathy) and make their interaction tractable. Whether that metaphor becomes a new engineering paradigm or stays a useful thought experiment depends on follow-up work: replication, rigorous benchmarks on wider task suites, and creative hybrid systems.

Written by Aditya Krishnan Mohan

I write about the latest tech, space tech, rockets and computer programming.