When I first heard about million-token context windows, I thought all my problems were solved. Finally, I could just dump everything into the prompt - docs, code, conversation history, every tool definition imaginable - and let the AI sort it out. No more carefully curating what goes in. No more “sorry, I don't have enough context.” The future had arrived.
Yeah, about that. After a few months of actually working this way, I can tell you: bigger context windows don't magically fix anything. If anything, they introduced new problems I didn't even know existed. Problems that cost me weeks of work before I figured out what was happening.
The project that made me question everything
It started with an AI agent I was building for a client. Nothing too exotic - it needed to handle customer queries, pull data from a few APIs, and generate reports. I had a massive context window to play with, so I loaded it up. Full API docs. Every tool definition. Complete conversation history going back days.
At first, it worked great. The agent was handling complex queries, remembering previous conversations, using the right tools. I was feeling pretty smug about my setup.
Then things started getting weird. The agent began making the same mistakes over and over. It would use a tool incorrectly, I'd correct it, and a few turns later... it would make the exact same error. Sometimes it would get stuck in loops, repeating actions instead of moving forward. Other times it would call tools that had nothing to do with the task at hand.
I spent a week debugging this. Checked the tools. Checked the prompts. Checked everything. The code was fine. The problem was something I hadn't even considered: the context itself was working against me.
The four patterns that kept breaking my agents
After that project - and several more failures - I started noticing patterns. The same types of problems kept showing up, just wearing different disguises. Here's what I learned:
1. Context rot - when old mistakes won't die
This was the first one I identified. The agent would make a wrong assumption or hallucinate something early in a long session. That incorrect information would then sit in the context, influencing everything that came after.
The worst part? Sometimes I'd correct the mistake explicitly, but the original error was still there in the history, and the model would keep referencing it. It's like trying to convince someone of a fact when they've already read the wrong thing five times. The wrong thing has more weight just by being repeated.
2. Information overload - when the model forgets how to think
This one surprised me. I assumed more context was always better. But as the context grew past a certain point, the model started changing behavior. Instead of reasoning through problems, it would just... repeat things. It would loop back to patterns from earlier in the conversation rather than coming up with new solutions.
It's like the context became so heavy that the model couldn't see past it anymore. All that training it had, all that general knowledge - it was getting drowned out by the sheer volume of stuff I'd crammed into the prompt. The model was doing retrieval when I needed it to be doing reasoning.
3. Tool chaos - when more options mean worse decisions
I fell hard for the MCP hype. Connect all the tools! Give the AI access to everything! It'll figure out which one to use!
Except... it doesn't. When you give a model 40 different tools, it starts using tools that are vaguely related instead of exactly right. Sometimes it calls tools for no reason at all. The more tool definitions you stuff into the context, the more opportunity for the model to get confused about which one to use.
I noticed this especially with smaller models, but even the big ones aren't immune. Every tool definition is something the model has to pay attention to, and attention is finite no matter how big your context window is.
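One way to keep the tool count down is to score each tool definition against the current task and only expose the closest matches. Here's a minimal sketch of that idea; the `TOOLS` registry, the word-overlap scoring, and the `select_tools` helper are all illustrative stand-ins, not a real framework API - in practice you might use embeddings or explicit task-to-tool routing instead.

```python
# Hypothetical tool registry: name -> definition. In a real agent these
# would be full JSON-schema tool specs; descriptions are enough here.
TOOLS = {
    "get_customer":  {"description": "look up a customer record by id"},
    "list_invoices": {"description": "list invoices for a customer"},
    "send_email":    {"description": "send an email to a customer"},
    "render_report": {"description": "render a pdf report from data"},
    "get_weather":   {"description": "current weather for a city"},
}

def select_tools(task: str, tools: dict, limit: int = 3) -> list[str]:
    """Keep only tools whose descriptions overlap the task wording.

    Crude word-overlap scoring, just to show the shape of the idea:
    short filler words are ignored, zero-overlap tools are dropped
    even if there is room for them.
    """
    task_words = {w for w in task.lower().split() if len(w) > 3}
    scored = []
    for name, spec in tools.items():
        desc_words = {w for w in spec["description"].lower().split() if len(w) > 3}
        scored.append((len(task_words & desc_words), name))
    scored.sort(reverse=True)
    return [name for score, name in scored[:limit] if score > 0]

print(select_tools("list the unpaid invoices for customer 42", TOOLS))
```

The point isn't the scoring method - it's that the model only ever sees a handful of plausible tools, so there's far less surface area for a confused call.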
4. Self-contradiction - when your context argues with itself
This is a nastier version of context rot. It happens when you're building up context from multiple sources - tool responses, documents, earlier parts of the conversation - and some of that information contradicts other parts.
The model will try to reconcile the contradiction, often by picking the wrong version. Or it'll get confused and produce something that makes no sense at all. I had an agent that kept generating impossible solutions because early in the conversation it had made an incorrect assumption, and later tool calls returned data that contradicted it. The model was trying to satisfy both and satisfying neither.
Why this matters more than you think
Here's the thing that really got me: these aren't edge cases. If you're building anything that runs for more than a few turns, that uses multiple tools, or that accumulates information over time, you're going to hit these problems. Maybe not every time, but often enough to matter.
And the frustrating part is that the failures are hard to diagnose. The model doesn't throw an error. It just... does the wrong thing, confidently. You end up debugging your code, your prompts, your tool implementations, when the actual problem is just accumulated garbage in the context.
What I actually do now
After getting burned enough times, I developed some habits that have made a real difference:
- Start fresh more often. I used to try to maintain context across sessions, thinking I was preserving valuable history. Now I start clean sessions more frequently and only carry forward what I explicitly summarize. Yes, you lose some context. But you also lose all the accumulated noise.
- Load tools dynamically. Instead of giving the model access to every tool upfront, I only include the tools that are actually relevant to the current task. More setup work, but way fewer confused tool calls.
- Summarize, don't accumulate. For long-running agents, I periodically summarize the important state into a clean format and use that instead of the full conversation history. Think of it like garbage collection for your context.
- Keep the instructions close. Critical instructions should be at the beginning AND end of long contexts. Models pay more attention to the edges than the middle.
This is actually why specs work
If you read my earlier post on Spec-Driven Development, this might add some context (pun intended). One reason SDD works so well is that it naturally keeps your context clean.
When you write a spec upfront, you're not relying on accumulated conversation history to define what you're building. The spec is the source of truth - clear, intentional, and free of all the noise that builds up during back-and-forth iteration. You're essentially protecting yourself from context rot before it happens.
The uncomfortable truth about bigger context windows
Look, I'm not saying big context windows are bad. They're useful for plenty of things - summarizing long documents, searching through large codebases, that kind of thing. What I am saying is that “just throw everything in and let the AI figure it out” is not a strategy. It's wishful thinking.
Context is not just a bucket you fill up. It's an environment your model operates in. And like any environment, if you let it get cluttered and chaotic, performance suffers. The models are getting better at handling this, but they're not magic. You still need to be intentional about what you put in there.
After all the time I wasted learning this the hard way, the lesson is simple: treat your context like a scarce resource even when it isn't. Your future self - the one who isn't debugging mysterious agent failures at 2am - will thank you.
Josip Budalić
Founder & CEO
Josip runs HOTFIX d.o.o., a dev shop based in Croatia. He's been writing code for over a decade and is slightly obsessed with finding ways to ship faster without sacrificing quality. When not arguing with AI assistants, he's probably hiking somewhere or consuming unhealthy amounts of coffee.