There is a lot of hand-waving around long context.
Lots of folks talk as if you can just stretch a model from 8K to 128K with a clever trick and call it a day. You usually cannot.
The problem is that long context is mostly a training-time decision. Some tricks help. Some buy you a little headroom. Some are more or less gimmicks. And a lot of opinions blur those categories in ways that make the whole thing sound more solved than it is.
Why this is hard in the first place
Most modern LLMs use RoPE, rotary position embeddings.
That works well within the range the model saw during training. Push too far beyond that, and you start asking the model to operate on positional patterns it was never really trained to understand. Attention quality drops. The model gets less reliable. Things start to break down.
That is one problem. The other problem is economics. Attention is quadratic.
Every token has to score against every other token. At sequence length L, that means an L x L attention matrix. Double the length, and the matrix grows by four times. Go from 2K to 128K and you are dealing with a different magnitude of compute and memory pressure.
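To see why the numbers bite, here is a back-of-the-envelope sketch in Python. The head count and fp16 scores are illustrative assumptions, not any particular model, and constant factors are ignored; the only point is the scaling.

def score_matrix_cost(seq_len, num_heads=32, bytes_per_score=2):
    # one layer's attention score entries: heads * L * L
    entries = num_heads * seq_len * seq_len
    return entries, entries * bytes_per_score

for L in (2_048, 8_192, 32_768, 131_072):
    entries, size = score_matrix_cost(L)
    print(f"L={L:>7}: {entries / 1e9:7.1f}B entries, ~{size / 2**30:7.1f} GiB if materialised")

With these made-up numbers, the 128K case works out to roughly a terabyte of scores per layer if you tried to materialise them in fp16, which is exactly why nobody does.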
Those are two problems worth keeping separate. One is positional generalisation. The other is attention cost. Fixing one does not fix the other.
FlashAttention helps, though not in the way people often imply
FlashAttention gets mentioned all the time in long-context discussions as if it is the breakthrough that solves the problem.
It is useful, but let’s be precise. FlashAttention is an IO optimisation. And a cracking good one. It avoids writing the full attention matrix out to slow GPU memory and instead computes attention in tiled chunks using fast on-chip memory.
That reduces memory pressure significantly. What it does not do is change the fundamental compute equation. FLOPs are still quadratic.
So yes, FlashAttention is a big deal in practice. Without it, you run out of memory much earlier. But it does not suddenly turn full attention into some cheap linear-time system; it makes exact attention viable further out.
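For intuition, here is a minimal NumPy sketch of the online-softmax trick that tiled attention relies on: exact attention, computed one key/value block at a time, without ever building the full score matrix. The real FlashAttention kernel does this tiling in on-chip SRAM and fuses it with the rest of the computation, which this toy obviously does not.

import numpy as np

def tiled_attention(q, k, v, block=128):
    # q: (Lq, d), k: (Lk, d), v: (Lk, dv); exact result, no Lq x Lk matrix
    Lq, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((Lq, v.shape[1]))
    running_max = np.full(Lq, -np.inf)   # running softmax statistics per query
    running_den = np.zeros(Lq)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                  # only one tile of scores at a time
        new_max = np.maximum(running_max, scores.max(axis=1))
        correction = np.exp(running_max - new_max)   # rescale what was accumulated so far
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ vb
        running_den = running_den * correction + p.sum(axis=1)
        running_max = new_max
    return out / running_den[:, None]

The result matches a plain softmax(QKᵀ/√d)V to numerical precision; the memory saving is the whole trick, the FLOPs are unchanged.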
Sparse attention is the obvious idea, but the reality is complicated
Sparse attention tries to cut the problem down by limiting what each token can attend to.
Instead of every token looking at everything, it might look only at a local window, some strided positions, or a few designated global tokens. On paper, that sounds attractive. Compute comes down from quadratic toward something more or less manageable. In practice, the gains are often less clean than they may appear.
GPUs are very good at dense, regular matrix math. Sparse patterns often mean irregular memory access and much worse hardware utilisation. So the theoretical reduction in work does not always turn into the speedup you would have hoped for. There is another issue though.
Static sparsity patterns make assumptions about what matters. A sliding window assumes locality is the important thing. Sometimes that is sensible. Sometimes it is completely wrong. If the crucial token is 80K back and your pattern never lets attention look there, then you just missed that piece of information.
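As a toy illustration of what a static pattern looks like, here is a Longformer-style mask builder: a causal sliding window plus a few always-visible global tokens. The window size and the choice of global positions are made-up assumptions, and they are exactly the kind of guess the paragraph above complains about.

import numpy as np

def sliding_window_mask(seq_len, window=256, global_tokens=(0,)):
    # True where attention is allowed
    i = np.arange(seq_len)[:, None]        # query positions
    j = np.arange(seq_len)[None, :]        # key positions
    mask = (i - j) < window                # look back at most `window` tokens
    mask[:, list(global_tokens)] = True    # designated global tokens stay visible to every query
    return mask & (j <= i)                 # and keep the whole thing causal

If the token you actually need sits 80K positions back, outside the window and not marked global, this mask never lets the model see it, no matter how good the weights are.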
DeepSeek’s DSA is more serious than the usual sparse-attention story
DeepSeek V3.2 is quite interesting because it does not hard-code a sparse pattern and hope for the best. Their DSA setup uses a learned indexer to decide which KV entries matter for each query. That means the sparsity is content-aware rather than a fixed positional pattern.
First, they run a dense warm-up stage where full attention still happens, but the indexer learns from the real attention distribution. Then they switch on the sparse top-k selection and fine-tune the whole thing with that sparsity pattern active. So the sparse system is trying to approximate where full attention would have gone. That is an architectural choice, and it comes with a catch: it is a pretraining commitment, not a retrofit. DSA depends on MLA and the way that representation is structured, and that is not something you can plumb in later.
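To show the shape of the idea rather than DeepSeek’s implementation, here is a toy content-aware top-k selection. The indexer projections index_q and index_k, the top_k value, and the per-query loop are illustrative stand-ins; the real DSA indexer is trained against the dense attention distribution, respects causality, and sits on top of MLA’s compressed KV representation, none of which this sketch captures.

import numpy as np

def topk_content_sparse_attention(q, k, v, index_q, index_k, top_k=64):
    # a cheap "indexer" scores every key for every query ...
    idx_scores = index_q @ index_k.T                   # (Lq, Lk), small projection dims in practice
    keep = np.argsort(-idx_scores, axis=1)[:, :top_k]  # ... and only the top-k keys survive
    scale = 1.0 / np.sqrt(q.shape[1])
    out = np.zeros((q.shape[0], v.shape[1]))
    for qi in range(q.shape[0]):
        sel = keep[qi]
        s = (q[qi] @ k[sel].T) * scale                 # exact attention, restricted to selected keys
        p = np.exp(s - s.max())
        out[qi] = (p / p.sum()) @ v[sel]
    return out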
Positional extensions are not magic
If compute and memory are under control, you still need the model to behave sensibly at longer positions. The two common names here are Position Interpolation and YaRN.
Position Interpolation is the straightforward option. If a model was trained at 2K and you want 16K, you scale the position IDs down so the model still sees values in a familiar range. That works better than doing nothing, but there is a trade-off: you are compressing positional resolution. YaRN is a bit more careful. It treats different RoPE frequency components differently, which helps preserve local behaviour and stretches long-range coverage a bit more sensibly.
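A minimal sketch of the Position Interpolation idea, assuming the standard RoPE frequency schedule. Real implementations apply these angles as rotations to query and key pairs; that part is omitted here.

import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # standard RoPE per-pair frequencies
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Position Interpolation: divide the position ids so extended positions
    # land back inside the range the model was trained on
    pos = np.asarray(positions, dtype=np.float64) / scale
    return np.outer(pos, inv_freq)        # shape (len(positions), dim // 2)

# trained at 2K, run at 16K: scale = 16384 / 2048 = 8
angles = rope_angles(np.arange(16_384), dim=128, scale=8.0)

YaRN replaces that single division with a frequency-dependent adjustment: low-frequency components get interpolated hard, while high-frequency components that carry local structure are left closer to untouched.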
Neither of these is the answer you’re looking for…
You can give the model positional coverage, but if it has rarely spent real training time reasoning across long documents, that alone does not buy you robust long-context behaviour. You still need continued pretraining or serious long-context fine-tuning on the right data. So if someone tells you they solved long context just by changing a scaling factor, they are being economical with the truth, to say the least.
KV cache becomes its own problem
Even if you solve the attention story well enough, the KV cache still grows linearly with sequence length.
At 128K across many layers, that hits you hard pretty quickly. For production systems, you rely on things like grouped-query or multi-query attention (GQA/MQA), KV quantisation, and often cache eviction heuristics. They help you stretch as far as possible, but they do not turn a short-context model into a truly long-context one.
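A rough sizing sketch, with illustrative numbers (32 layers, 8 KV heads under GQA, head dimension 128, fp16) that are assumptions rather than any particular model:

def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    # K and V, per layer, per KV head, per position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

full = kv_cache_bytes(128_000, layers=32, kv_heads=8, head_dim=128)
print(f"{full / 2**30:.1f} GiB per sequence")   # roughly 15.6 GiB with these numbers

Quantising the cache to 8-bit halves that (bytes_per_value=1), and MQA would cut the head count further, but nothing here changes the linear growth with sequence length.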
There are other ways around the problem
The most practical one is not to force everything into context in the first place.
RAG is your good friend here. Retrieve the relevant chunks and feed those in. It comes with a trade-off too: you lose holistic reasoning over the entire document and start depending heavily on your retrieval layer.

State space models are another direction. They replace attention with something that scales linearly by design. Interesting, promising, and definitely worth watching, though they come with their own trade-off. A fixed-size state means compression, and compression means losing precise recall of far-back detail. There is no such thing as a free lunch.

Prompt compression is the other pragmatic move. Take the input, drop what looks low-value, then send a smaller version to the model. Sometimes it is useful, but it introduces extra hassle and complexity.
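Of those three, RAG is the one most people reach for first, so here is a deliberately naive sketch of the retrieval step. It assumes you already have embeddings for the query and the chunks; the embedding model and the chunking strategy are left out, and they matter far more than this code does.

import numpy as np

def top_chunks(query_vec, chunk_vecs, chunks, k=4):
    # cosine similarity between the query and every chunk, keep the best k
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    best = np.argsort(-(c @ q))[:k]
    return [chunks[i] for i in best]

# the retrieved chunks become the context instead of the whole document
# context = "\n\n".join(top_chunks(qv, cv, chunks)) + "\n\nQuestion: " + question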
What you can do after a model is already trained
This is where many of us get it wrong.
If you have an existing model and want to extend context, your options are limited.
FlashAttention helps memory, not understanding.
KV compression helps budget, not range.
RAG helps many tasks, but changes the problem.
Naive position interpolation might buy you some headroom.
YaRN combined with long-context fine-tuning can do something real.
But the strongest long-context systems made an architectural decision at pretraining time. It’s hard to get from short-context foundations to solid 128K behaviour with a couple of fixes bolted on afterwards.
Saying this out loud does not win you many friends.
So where do we take it from here?
Meaningful long context is a pretraining decision.
The methods that work well require architectural commitment, long-context data, or both. Anything bolted on later only helps around the edges. So if you are designing a new model and long context matters, design for it right from the start.
Otherwise, be realistic about what is actually possible.
