TL;DR: Giving your LLM more context does not make it smarter. Chroma tested 18 frontier models, and every single one got worse as the input grew. The problem is called context rot, and it is why most AI agents fail in production. The fix is not a bigger model. It is feeding the model less, but the right things, at the right time.
One of the most counterintuitive things I keep running into in production AI: you add more context to your LLM, and the outputs get worse.
Not subtly worse. Measurably, repeatably worse.
In a 2025 study by the Chroma research team, 18 frontier models (GPT-4.1, Claude Opus 4, and Gemini 2.5 among them) were tested against increasing input lengths. Every single one degraded. Some held near-perfect accuracy on short inputs, then nosedived to 60% once the context crossed a certain length, and where that cliff sat varied unpredictably from model to model. No exceptions.
The belief that "more context always helps" is quietly wrecking production AI systems right now. And fixing it requires a shift in how you architect information, not a switch to a bigger model.

What context rot actually is
Context rot is the measurable drop in LLM output quality as input length grows, even when the model's context window is far from full.
The term comes from Chroma's 2025 research, but the mechanics were documented earlier. Stanford researchers (Liu et al., 2024) found that LLMs attend strongly to the start and end of a prompt but lose track of information sitting in the middle. Key facts buried in the middle of a long context produced accuracy drops of 30% or more, regardless of how relevant or clearly written they were.
This is called the "lost-in-the-middle" effect.
Piling your system prompt with company policy docs, chat history, retrieved chunks, and tool outputs doesn't give the model more to work with.
It gives the model a pile of noise with your actual answer somewhere underneath.
Three failure modes that stack
Context rot doesn't come from a single bug. It's three problems compounding on each other.
1. Lost-in-the-middle effect. Positional bias is real. Models attend strongly to the beginning and end of input. If your most important instruction sits 50,000 tokens into the context, the model may effectively ignore it.
2. Attention dilution. Transformer attention scales quadratically with input length. At 100,000 tokens, the model is scoring roughly 10 billion token pairs. Attention is a finite budget: more tokens spread it thinner, and critical information gets quietly deprioritized. (The arithmetic is sketched just after this list.)
3. Distractor interference. Semantically similar but irrelevant content actively misleads the model. Chroma found that a single distractor sentence caused measurable degradation. Four distractors compounded it further, across all 18 models.
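To make the dilution in point 2 concrete, here's the back-of-envelope arithmetic behind the 10-billion figure. A quick illustrative calculation, not a model of any particular architecture:

```python
# Self-attention scores every (query, key) token pair, so the number
# of pairwise comparisons grows quadratically with input length.
for tokens in (1_000, 10_000, 100_000):
    pairs = tokens ** 2        # pairwise attention scores
    avg_weight = 1 / tokens    # each query's softmax sums to 1, so the
                               # mean weight any single key gets is 1/n
    print(f"{tokens:>7,} tokens: {pairs:>14,} pairs, "
          f"mean weight per key {avg_weight:.0e}")

# 100,000 tokens -> 10,000,000,000 pairs (the "10 billion" above).
# The average attention any one key can receive is 100x smaller
# than at 1,000 tokens: the same budget, spread far thinner.
```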
The most uncomfortable finding came from an October 2025 arXiv paper: even with 100% perfect retrieval of the right information, performance degraded by anywhere from 13.9% to 85% as input length increased. The researchers even masked out all the irrelevant tokens, and the degradation persisted. Sheer length, independent of content quality, taxes LLM reasoning.
Why agents are where this really hurts
Single-turn chat apps can tolerate context rot reasonably well, because each query starts fresh and short.
Agents can't.
An agent running a multi-step task accumulates context across every tool call, search result, and intermediate reasoning step. A coding agent working through a bug will have explored wrong paths, collected stale outputs, and run several failed sub-tasks before it gets to the fix. By the time it's generating the actual solution, it's reasoning through a context window stuffed with dead ends.
This is why, per Gartner, only 10% of enterprise AI agent pilots actually reach production. The models aren't failing. The context management is.
What context engineering actually means
Prompt engineering asks: what should I say to the model?
Context engineering asks: what should the model see when it generates a response?
Andrej Karpathy described the context window as the LLM's "RAM." That framing is useful. Context engineering is the discipline of treating it exactly that way: a managed, finite resource that gets the right data loaded at the right time, not a document dump.
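To make that framing concrete, here's a minimal sketch of a budgeted context assembler. The `ContextBudget` class and the four-characters-per-token estimate are illustrative assumptions, not any real library's API:

```python
from dataclasses import dataclass, field

@dataclass
class ContextBudget:
    """Treat the context window like RAM: a fixed budget, allocated
    deliberately, highest-priority items first."""
    max_tokens: int
    items: list = field(default_factory=list)
    used: int = 0

    def add(self, text: str, priority: int) -> None:
        self.items.append((priority, text))

    def build(self) -> str:
        parts = []
        # Spend the budget on the highest-priority items first;
        # whatever doesn't fit stays out of the window entirely.
        for priority, text in sorted(self.items, reverse=True):
            cost = len(text) // 4  # rough chars-to-tokens estimate (assumption)
            if self.used + cost > self.max_tokens:
                continue
            self.used += cost
            parts.append(text)
        return "\n\n".join(parts)
```

The specifics don't matter. What matters is the posture: every token admitted to the window is a deliberate allocation, not a default.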
DataHub's 2026 State of Context Management Report found that 82% of IT and data leaders now say prompt engineering alone is insufficient to scale AI reliably. The budget is following: 95% of data teams surveyed plan to train on context engineering this year.
There are four practical strategies that show up in well-built production systems:
Just-in-time retrieval. Don't pre-load everything you might need. Identify the intent of the current step and fetch only the data relevant to that step. A travel agent doesn't need the full expense policy when the user is still choosing a departure city.
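A minimal sketch of that pattern, continuing the travel-agent example. `classify_intent`, `INTENT_SOURCES`, and the `retrieve` callable are hypothetical stand-ins for whatever intent model and retrieval layer you actually run:

```python
def classify_intent(message: str) -> str:
    """Stub intent model; a keyword match keeps the sketch runnable."""
    return "file_expense" if "expense" in message else "choose_departure"

# Which sources each step is allowed to pull from. Note the expense
# policy is only reachable from the step that actually needs it.
INTENT_SOURCES = {
    "choose_departure": ["airports", "user_profile"],
    "file_expense": ["expense_policy"],
}

def context_for_step(message: str, retrieve) -> list[str]:
    """Fetch only what the current step's intent requires; pre-load nothing.
    `retrieve` is any callable (query, source, k) -> list of chunks."""
    intent = classify_intent(message)
    chunks: list[str] = []
    for source in INTENT_SOURCES.get(intent, []):
        chunks += retrieve(message, source, k=3)
    return chunks
```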
Context compression. Strip raw tool outputs after they've been processed. Keep a working NOTES file outside the context window and pull in only what the current step needs.
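One way that can look, with a hypothetical JSONL file standing in for the external notes store: the raw tool output is archived and distilled immediately, and only the distilled line stays in the in-context transcript.

```python
import json
from pathlib import Path

NOTES = Path("agent_notes.jsonl")  # working memory OUTSIDE the context window

def compress_tool_output(step: int, tool: str, raw: str, summarize) -> str:
    """Archive the full tool output externally; return only a distilled
    summary for the in-context transcript. `summarize` is any callable,
    e.g. a cheap LLM call or a heuristic."""
    summary = summarize(raw)
    with NOTES.open("a") as f:
        f.write(json.dumps({"step": step, "tool": tool,
                            "summary": summary, "raw": raw}) + "\n")
    return summary  # the context keeps this line, not the raw dump

def recall(step: int) -> str:
    """Pull an archived raw output back in only when a later step needs it."""
    for line in NOTES.open():
        record = json.loads(line)
        if record["step"] == step:
            return record["raw"]
    return ""
```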
Hierarchical summarization. Every 10 to 20 agent steps, compress working context into a structured summary that keeps decisions and discards the noise of how you arrived at them.
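Sketched under the same assumptions, with `llm_summarize` as a placeholder for your model call:

```python
SUMMARIZE_EVERY = 15  # somewhere in the 10-to-20-step range above

def maybe_compact(history: list[str], llm_summarize) -> list[str]:
    """Fold the running transcript into one structured summary every
    N steps. Keeps decisions and open questions; drops the exploration."""
    if len(history) < SUMMARIZE_EVERY:
        return history
    summary = llm_summarize(
        "Summarize the agent steps below. Keep: decisions made, facts "
        "confirmed, open questions. Drop: dead ends, raw outputs, retries.\n\n"
        + "\n".join(history)
    )
    return [f"[summary of {len(history)} earlier steps]\n{summary}"]
```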
Positional priming. Put your highest-priority instructions and facts at the very start or end of the prompt. Information buried in the middle of a long context block gets lost. The research shows this happens mechanically, not just in theory.
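Positional priming is ultimately a prompt-assembly decision. A sketch, with hypothetical argument names:

```python
def assemble_prompt(critical: list[str], bulk: list[str]) -> str:
    """Pin must-follow instructions where attention is strongest: the
    very start and the very end. Bulk material sits in the middle,
    where it can afford to be partially lost."""
    return "\n\n".join(
        critical                          # start: instructions, key facts
        + bulk                            # middle: retrieved docs, history
        + ["Reminder:"] + critical        # end: restate what must not be missed
    )
```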
Where most teams go wrong
Most context engineering discussions center on RAG. Retrieval-augmented generation does help. It beats pre-loading full documents.
But RAG solves retrieval, not context quality. You can retrieve exactly the right document and still inject it where the model loses track of it. Surround it with semantically similar distractors and they actively mislead the model. Let stale history accumulate to 40,000 tokens and the signal drowns. DataHub's report found that 77% of data leaders agree RAG alone is insufficient for reliable production AI.
Context engineering sits upstream of RAG. It's about the full information architecture: what gets in, what gets removed, where things are positioned, and how clean your working memory stays across steps.
Where this is headed
The 2025 conversation was "prompt engineering vs. RAG." The 2026 conversation is about managing context as a first-class engineering concern, with the same rigor teams apply to database indexing or memory allocation.
Gartner projects 40% of enterprise applications will embed AI agents by end of 2026, up from under 5% in 2025. As agents move from demos to infrastructure, context rot will stop being a research footnote and start being a production incident. Teams that treat context as an afterthought will end up with systems that work in staging and fall apart under real workloads. The model choice usually isn't the problem. What the model sees is.
This week's takeaway
Bigger context windows don't fix reliability. The models already know how to reason. Your job is to stop burying that reasoning in noise.
More context isn't the fix. Better context is.
Sources and further reading
Hong, K., Troynikov, A., Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Research. https://trychroma.com/research/context-rot
Liu, N.F., Lin, K., Hewitt, J., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. Stanford NLP / TACL. https://arxiv.org/abs/2307.03172
Adobe Research. (February 2025). NoLiMa: Non-Literal Matching Benchmark for Long-Context Reasoning. https://arxiv.org/abs/2502.05167
Hsieh, C.Y., et al. (October 2025). Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. arXiv. https://arxiv.org/abs/2510.10813
DataHub. (2026). State of Context Management Report 2026. https://datahub.com/blog/context-engineering-vs-prompt-engineering/
Gartner. (2025). Predicts 2026: AI Agent Adoption in Enterprise Applications. Cited via MachinelearningMastery / Prompt Bestie report summaries.
Karpathy, A. (2025). Context window as RAM framing. Referenced in Composio Dev report. https://composio.dev/blog/why-ai-agent-pilots-fail-2026-integration-roadmap