Native Sparse Attention for dummies — The Next Leap in Efficient Long-Context Language Models

Aug 2, 2025

Okay, so it’s been a long, long time since my last blog — life got busy, but I’m finally back! I want to get back into the habit of blogging as I keep learning, and I’ll do my best to stick with it this time. :)

With that out of the way, let’s jump into what everyone’s talking about lately: Native Sparse Attention, from the same team that caused Nvidia’s stock to tank a couple of months ago — Deepseek.


If you’ve ever tried to get a large language model (LLM) to read a book, analyze a massive codebase, or keep track of a sprawling conversation, you know the pain: context windows are limited, and as you stretch them, the computation cost skyrockets. The culprit? The attention mechanism — the very core of what makes transformers powerful. But what if we could make attention smarter, faster, and more memory-efficient, without losing performance? That’s exactly what “Native Sparse Attention” (NSA) sets out to do.

Quick example (why attention is the bottleneck):
Every token attends to every other token, so the cost grows quadratically with sequence length. For 10,000 tokens, you're already looking at 100 million attention scores. At 64,000 tokens, you're in the billions. Most of the compute (up to 80%!) goes just into attention. It's like trying to remember every detail of every page you've ever read, every time you read a new sentence. Not fun.
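To make that quadratic blow-up concrete, here's a tiny PyTorch sketch of vanilla scaled dot-product attention (my own toy code, not from the paper): every query scores every key, so the score matrix has n × n entries.

```python
import torch

def full_attention(q, k, v):
    # vanilla scaled dot-product attention: one head, no batching
    d = q.shape[-1]
    scores = q @ k.T / d**0.5               # (n, n) score matrix -- this is the n^2 cost
    return torch.softmax(scores, dim=-1) @ v

n, d = 1_024, 64                             # small enough to run comfortably
q = k = v = torch.randn(n, d)
out = full_attention(q, k, v)                # (n, d)

# at 10,000 tokens the score matrix alone has 100 million entries
print(f"{10_000**2:,} scores at 10k tokens, {64_000**2:,} at 64k")
```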

What Have People Tried? (And Why Isn’t It Enough?)

The obvious fix is to make attention “sparse” — don’t look everywhere, just look where it matters! So researchers tried all sorts of hacks:

  • Sliding windows: Only pay attention to nearby tokens.
  • Global tokens: Pick a few important tokens and let everyone look at those.
  • Block selection: Chop the sequence into blocks and pick the good ones.
  • Retrieval-based: Use search tricks to find what’s relevant.

But here’s the catch:

  • Most tricks only help at inference, not during training.
  • Theoretical speedups don’t always show up on real GPUs (hardware is picky).
  • Many methods need you to train a full-attention model first, then sparsify it — which can wreck performance.

What Makes Native Sparse Attention (NSA) Different?

NSA isn’t just another hack — it’s a full rethink, built from the ground up to be both hardware-friendly and trainable from scratch.

Figure: Overview of NSA's architecture. Left: the framework processes input sequences through three parallel attention branches. For a given query, preceding keys and values are processed into compressed attention for coarse-grained patterns, selected attention for important token blocks, and sliding attention for local context. Right: visualization of the different attention patterns produced by each branch. Green areas indicate regions where attention scores need to be computed, while white areas represent regions that can be skipped.

Here’s how it works:

NSA’s Three-Path Trick: How It Actually “Thinks”

NSA splits attention into three parallel branches, each with its own job:

1. Token Compression (Global Gist)

  • Divide the sequence into blocks (say, 32 tokens each).
  • Each block gets “summarized” into a single token using a small neural net.
  • This lets NSA scan the whole sequence quickly, like skimming chapters for the big ideas.
Figure: Token compression, as explained by Gabriel Mongaras (YouTube).
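To ground that, here's a rough PyTorch sketch of the compression idea, assuming a block size of 32 and a small MLP as the summarizer. The real NSA kernel is more involved (the paper mentions intra-block position encoding, for instance), so treat this as a mental model, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BlockCompressor(nn.Module):
    # toy version of the compression branch: one summary token per block
    def __init__(self, d_model, block_size=32):
        super().__init__()
        self.block_size = block_size
        # small MLP that maps a flattened block of tokens to a single summary vector
        self.mlp = nn.Sequential(
            nn.Linear(block_size * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x):                        # x: (seq_len, d_model)
        n_blocks = x.shape[0] // self.block_size
        blocks = x[: n_blocks * self.block_size].reshape(n_blocks, -1)
        return self.mlp(blocks)                  # (n_blocks, d_model) summary tokens

x = torch.randn(32_000, 128)
summaries = BlockCompressor(d_model=128)(x)
print(summaries.shape)                           # torch.Size([1000, 128])
```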

2. Token Selection (Key Details)

  • NSA computes an “importance score” for every block by looking at the attention weights between the query and the compressed (summary) tokens.
  • These scores are calculated efficiently using the same softmax attention that’s already being done for compression, so there’s no big extra cost.
  • It keeps the top-N blocks (e.g., 16 out of hundreds) and attends to every token in those blocks.
  • If there’s a plot twist or a key variable, NSA will zoom in and read every word.
Figure: How the selection scores are calculated.
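In code, a stripped-down version of this selection step might look like the sketch below. Caveats: I'm ignoring causal masking, scoring a single query at a time, and pretending compression and selection blocks are the same size (the paper uses 32 and 64 respectively).

```python
import torch

def select_blocks(q, summary_k, keys, values, block_size=64, top_n=16):
    # q: (d,) a single query; summary_k: (n_blocks, d) compressed keys
    # keys, values: (seq_len, d) raw tokens, grouped into blocks of block_size
    d = q.shape[-1]
    block_scores = torch.softmax(summary_k @ q / d**0.5, dim=-1)  # reuse the compression attention
    top_blocks = block_scores.topk(min(top_n, block_scores.numel())).indices

    # gather every raw token inside the chosen blocks
    token_idx = (top_blocks[:, None] * block_size + torch.arange(block_size)).reshape(-1)
    return keys[token_idx], values[token_idx]

seq_len, d = 32_000, 128
keys, values = torch.randn(seq_len, d), torch.randn(seq_len, d)
summary_k = torch.randn(seq_len // 64, d)        # pretend these came from the compression branch
sel_k, sel_v = select_blocks(torch.randn(d), summary_k, keys, values)
print(sel_k.shape)                               # torch.Size([1024, 128])
```

In the real system this gather happens blockwise inside a fused kernel, which is part of what makes it GPU-friendly: contiguous blocks instead of scattered single tokens.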

3. Sliding Window (Local Context)

  • Always includes the latest tokens (e.g., last 512).
  • This is huge for conversations, code, or anything where the recent stuff matters most.

4. Learned Gating Mechanism

  • Instead of simply summing or averaging the three branch outputs with fixed weights, NSA uses a learned gating mechanism.
  • The gate is a small neural network (a Multi-Layer Perceptron, or MLP) applied to each query token's input features.
  • It produces one "gate score" per attention branch (compression, selection, or sliding window).
  • The scores pass through a sigmoid activation, so each one lands between 0 and 1.
  • The final output is a weighted sum of the three branch outputs, with the gate scores as the weights.

With this gating mechanism, the model decides, for every query, how much to trust each branch.
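Here's a minimal sketch of that gate, with sizes and layer choices of my own invention: a small MLP plus sigmoid turns each query position's features into three weights, which blend the branch outputs.

```python
import torch
import torch.nn as nn

class GatedCombine(nn.Module):
    # sketch of the gate: per-branch weights in [0, 1] for every query position
    def __init__(self, d_model, n_branches=3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, n_branches),
            nn.Sigmoid(),
        )

    def forward(self, x, branch_outputs):
        # x: (seq_len, d_model) input features for each query position
        # branch_outputs: list of 3 tensors, each (seq_len, d_model)
        g = self.gate(x)                               # (seq_len, 3) gate scores
        stacked = torch.stack(branch_outputs, dim=-1)  # (seq_len, d_model, 3)
        return (stacked * g[:, None, :]).sum(dim=-1)   # weighted sum of the branches

x = torch.randn(16, 128)
branches = [torch.randn(16, 128) for _ in range(3)]    # compression, selection, window outputs
out = GatedCombine(128)(x, branches)
print(out.shape)                                       # torch.Size([16, 128])
```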

But Does It Actually Run Faster?

Yes! NSA is built to play nice with GPUs. Instead of random memory access, everything is block-based — so GPUs can load big chunks at once (think: fewer trips to memory, more time crunching numbers). It also shares key/value blocks across heads (Grouped-Query Attention), and uses custom kernels written in Triton to squeeze every last bit of speed.
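On the Grouped-Query Attention point: several query heads share a single key/value head, so each KV block fetched from memory serves a whole group of queries. A toy sketch (definitely not the paper's Triton kernel):

```python
import torch

def grouped_query_attention(q, k, v, n_groups=4):
    # q: (n_q_heads, seq, d); k, v: (n_groups, seq, d) -- fewer KV heads than query heads
    n_q_heads, _, d = q.shape
    heads_per_group = n_q_heads // n_groups
    # every query head in a group reuses the same K/V block loaded from memory
    k = k.repeat_interleave(heads_per_group, dim=0)
    v = v.repeat_interleave(heads_per_group, dim=0)
    scores = torch.softmax(q @ k.transpose(-1, -2) / d**0.5, dim=-1)
    return scores @ v

q = torch.randn(64, 256, 64)     # 64 query heads, as in the paper's 27B model
k = v = torch.randn(4, 256, 64)  # 4 shared KV groups
print(grouped_query_attention(q, k, v).shape)   # torch.Size([64, 256, 64])
```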

Is It Trainable? (Or Just Another Inference Hack?)

This is where NSA really shines. Most sparse attention tricks only work after you’ve trained a full model. NSA is sparse from the very beginning — during pretraining, finetuning, even RL. The model learns to use its sparse patterns from scratch, so there’s no awkward “conversion” step and no lost performance.

Let’s Get Concrete: How NSA Works on a Big Sequence

Say you’ve got a sequence of 32,000 tokens (think: a novella).

  • Compression: 1,000 blocks of 32 tokens each → 1,000 summary tokens.
  • Selection: Compute scores, keep the top 16 blocks, and attend to all their tokens.
  • Sliding window: Always include the last 512 tokens.

For every query, NSA computes attention over:

  • The summaries (for the big picture)
  • The selected blocks (for the juicy details)
  • The sliding window (for what just happened)

All three run in parallel, and the model learns how to mix them.
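Back-of-the-envelope, each query now touches roughly two thousand key/value positions instead of all 32,000 (using the toy numbers above, where selected blocks are 32 tokens wide):

```python
seq_len    = 32_000
compressed = seq_len // 32   # 1,000 summary tokens (one per 32-token block)
selected   = 16 * 32         # top-16 blocks x 32 tokens (the paper's selection blocks are 64 tokens wide)
window     = 512             # most recent tokens
per_query  = compressed + selected + window
print(per_query, f"~{per_query / seq_len:.1%} of full attention")  # 2024 ~6.3%
```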

What Was NSA Actually Run On?

  • Hardware: 8 NVIDIA A100 GPUs (the kind you find in serious AI labs)
  • Model: 27B-parameter transformer, Grouped-Query Attention (4 groups, 64 heads), plus Mixture-of-Experts
  • Data: 260 billion tokens
  • Settings (hyperparameters): compression block size 32, selection block size 64, 16 selected blocks, sliding window 512, sequences up to 64k tokens

The Results: Is NSA Worth the Hype?

Speed

  • Training: NSA is up to 9× faster (forward) and 6× faster (backward) than full attention at 64k tokens.
  • Decoding: Up to 11.6× faster at 64k tokens. The longer the sequence, the bigger the win.

Memory

  • NSA slashes memory use for attention, so you can fit longer contexts on the same hardware.

Accuracy

NSA was put through its paces on all the usual suspects:

  • General benchmarks (knowledge, reasoning, coding): MMLU, MMLU-PRO, CMMLU, BBH, GSM8K, MATH, DROP, MBPP, HumanEval
Figure: Pretraining performance comparison between the full attention baseline and NSA on general benchmarks, across knowledge (MMLU, MMLU-PRO, CMMLU), reasoning (BBH, GSM8K, MATH, DROP), and coding (MBPP, HumanEval) tasks. NSA achieves superior average performance on most benchmarks despite high sparsity.
  • Long-context: LongBench, Needle-in-a-Haystack, multi-hop QA, code tasks
Figure: Performance comparison between NSA and baselines on LongBench, including subsets in single-document QA, multi-document QA, synthetic, and code task categories. NSA outperformed most of the baselines, including Full Attention.
Figure: Needle-in-a-Haystack retrieval accuracy across context positions with 64k context length. NSA achieves perfect accuracy through its hierarchical sparse attention design.
  • Chain-of-thought reasoning: AIME math benchmark

What happened?

  • NSA matches or beats full attention on nearly every benchmark — even with high sparsity.
  • On long-context tests, NSA nails “needle-in-a-haystack” (find a fact anywhere in 64k tokens).
  • It does especially well on complex reasoning and multi-hop tasks.

Example:
On AIME math reasoning (at 16k tokens), NSA scored 0.146 versus 0.092 for full attention. That’s a big jump.

Robustness

NSA isn't just a trick: it works across tasks, sequence lengths, and even with different training setups.

Why Does This Matter?

If you care about building LLMs that can actually read long docs, big codebases, or keep up with real conversations, NSA is a game-changer. It’s fast, memory-friendly, and doesn’t sacrifice accuracy. Plus, it’s trainable from day one — no hacks, no compromises.

TL;DR (But Seriously, Read the Details!)

  • NSA = three-path, hardware-optimized, trainable sparse attention.
  • Massive speed and memory savings.
  • Matches or beats full attention on real-world tasks.

If you’re into LLMs, this is the direction things are heading. Got questions? Want to see code or a deeper dive? Drop a comment below!

You can read the paper here: https://arxiv.org/pdf/2502.11089

or watch Gabriel Mongaras on YouTube for a video walkthrough of the entire paper! (This inspired me to get back to blogging :) )

Alright, that’s my deep-dive for today. Glad to be back — and if you made it this far, thanks for reading!

Written by Aditya Krishnan Mohan

I write about the latest tech, space tech, rockets and computer programming.
