Mastering Context Engineering for AI Agents
Context engineering is crucial for building effective and efficient AI agent systems. It focuses on optimizing the information provided to the AI, leading to better decision-making and performance. Instead of relying solely on fine-tuning models, the current trend involves leveraging in-context learning with frontier models, making context engineering a vital skill for AI agent development.
This approach offers several advantages, including faster iteration cycles and model agnosticism. By carefully crafting the context, developers can rapidly experiment with different strategies and adapt to changing requirements without retraining the entire model.
A practical approach to context optimization is iteratively refining the context based on the agent's performance, similar to how gradient descent optimizes model parameters. By understanding and applying these principles, we can unlock the full potential of modern language models and create more robust and intelligent systems.
Optimizing KV-Cache Hit Rate for AI Agent Efficiency
For AI agents in production, the KV-cache hit rate is a critical metric because it directly impacts both latency and cost. Understanding how KV-caching works is essential to appreciate its significance. In a typical agent operation, after receiving user input, the agent iterates through a chain of tool uses to complete the task. Each iteration involves the model selecting an action from a predefined action space, based on the current context. The action is executed, producing an observation that is appended to the context, forming the input for the next iteration. Given that the context grows with each step while the output remains relatively short, the ratio between prefilling and decoding becomes highly skewed.
Contexts sharing identical prefixes can leverage the KV-cache, which significantly reduces time-to-first-token (TTFT) and inference cost. The savings can be substantial. For example, with Claude Sonnet, cached input tokens are billed at roughly a tenth of the price of uncached ones (0.30 USD versus 3 USD per million tokens).
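To make the cost effect concrete, here is a minimal sketch of the arithmetic for an agent loop whose context grows at every step. The function name, the step sizes, and the prices passed in are illustrative assumptions, and the sketch deliberately ignores any surcharge a provider may apply when writing to the cache.

```python
def input_cost_usd(context_tokens_per_step, cached_price, uncached_price, cache_hit):
    """Sum the input cost over all steps of an agent loop.

    context_tokens_per_step: total context length sent at each step.
    cached_price / uncached_price: assumed USD per million input tokens.
    cache_hit: if True, the previous step's context is treated as a cached prefix.
    """
    total = 0.0
    previous = 0
    for tokens in context_tokens_per_step:
        cached = previous if cache_hit else 0   # prefix already seen last step
        fresh = tokens - cached                 # newly appended tokens
        total += (cached * cached_price + fresh * uncached_price) / 1_000_000
        previous = tokens
    return total


steps = [10_000, 12_000, 14_000]  # hypothetical context sizes
without_cache = input_cost_usd(steps, 0.30, 3.00, cache_hit=False)
with_cache = input_cost_usd(steps, 0.30, 3.00, cache_hit=True)
```

With these assumed numbers, the three steps cost about 0.108 USD without caching and about 0.049 USD with a stable prefix, and the gap widens as the loop gets longer.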
To improve the KV-cache hit rate, several context engineering practices can be adopted:
Maintain a stable prompt prefix: Due to the autoregressive nature of LLMs, even a single-token difference can invalidate the cache. Including a timestamp at the beginning of the system prompt, for instance, can negatively affect the cache hit rate.
Ensure append-only context: Avoid modifying previous actions or observations, and ensure deterministic serialization. Many programming languages don't guarantee stable key ordering when serializing JSON objects, which can break the cache.
Mark cache breakpoints explicitly: Some model providers may not support automatic incremental prefix caching, requiring manual insertion of cache breakpoints in the context. When assigning these, account for potential cache expiration and ensure the breakpoint includes the end of the system prompt.
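The first two practices above can be sketched as follows. This is a minimal illustration, not any particular framework's API: `SYSTEM_PROMPT`, `serialize_event`, and `AppendOnlyContext` are hypothetical names, and the only library call used is the standard `json.dumps`.

```python
import json

# Assumed system prompt: kept free of timestamps, request IDs, or other
# per-call values so the prefix stays byte-identical across requests.
SYSTEM_PROMPT = "You are a helpful agent."


def serialize_event(event: dict) -> str:
    # sort_keys plus fixed separators yields the same string for equal dicts,
    # regardless of the order in which keys were inserted.
    return json.dumps(event, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


class AppendOnlyContext:
    """Context that can only grow; earlier entries are never rewritten."""

    def __init__(self, system_prompt: str):
        self._parts = [system_prompt]

    def append(self, event: dict) -> None:
        # Serialize once at append time and store the resulting string.
        self._parts.append(serialize_event(event))

    def render(self) -> str:
        return "\n".join(self._parts)
```

Because each event is serialized once and stored as text, every rendered context is a strict extension of the previous one, which is the property prefix caching relies on.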
If you self-host models using frameworks like vLLM, make sure prefix/prompt caching is enabled, and use techniques like session IDs to route requests consistently across distributed workers, so that requests from the same session reach a worker that already holds the cached prefix. Together, these practices reduce both latency and cost.
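Session-based routing can be approximated with a stable hash of the session ID. The sketch below is an assumption-laden illustration: `pick_worker` and the worker names are hypothetical, and it is not part of vLLM or any other framework.

```python
import hashlib


def pick_worker(session_id: str, workers: list[str]) -> str:
    """Map a session to a worker deterministically.

    Uses SHA-256 rather than Python's built-in hash(), which is randomized
    per process for strings and would break routing across restarts.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["worker-0", "worker-1", "worker-2"]  # hypothetical worker names
target = pick_worker("session-abc", workers)
```

A plain modulo remaps most sessions when the worker pool changes size; a consistent-hashing scheme would limit that churn, at the price of more machinery.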
Masking vs. Removing: Managing Action Space Complexity
As AI agents become more capable, the number of tools they can use tends to increase, leading to a more complex action space. The rise of MCP further accelerates this trend. Allowing user-configurable tools can also lead to an explosion of options, potentially overwhelming the agent and reducing its effectiveness. A natural response might be to create a dynamic action space, loading tools as needed using a RAG-like approach. However, experiments suggest that dynamically adding or removing tools mid-iteration should be avoided unless absolutely necessary.
There are two main reasons for this: First, tool definitions are often located near the beginning of the context, so any changes invalidate the KV-cache for subsequent actions. Second, the model can become confused if previous actions refer to tools no longer defined in the current context, potentially leading to schema violations or hallucinated actions without constrained decoding.
To address this, a context-aware state machine can be used to manage tool availability. Instead of removing tools, this approach masks the token logits during decoding to either prevent or enforce the selection of certain actions based on the current context.
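The idea can be sketched at the level of whole actions. In a real decoder the mask is applied to token logits, so the action-level scores, the state names, and the tool names below are simplifying assumptions used only to show the mechanism.

```python
import math

# Hypothetical state machine: which actions are selectable in each state.
ALLOWED_BY_STATE = {
    "awaiting_user_reply": {"reply"},
    "browsing": {"reply", "browser_open", "browser_click"},
}


def mask_logits(logits: dict[str, float], state: str) -> dict[str, float]:
    """Set the score of every disallowed action to -inf; keep the rest as-is."""
    allowed = ALLOWED_BY_STATE[state]
    return {
        action: (score if action in allowed else -math.inf)
        for action, score in logits.items()
    }


def pick_action(logits: dict[str, float], state: str) -> str:
    masked = mask_logits(logits, state)
    return max(masked, key=masked.get)
```

The tool definitions never change; only the mask does, so the cached prefix stays valid and earlier actions still refer to tools the model can see.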
Most model providers and inference frameworks support some form of response prefill, which allows for constraining the action space without modifying tool definitions. Consider these function calling modes, using the Hermes format from NousResearch as an example:
Auto: The model can choose whether or not to call a function. This is implemented by prefilling only the reply prefix: <|im_start|>assistant
Required: The model must call a function, but the choice is unconstrained. This is implemented by prefilling up to the tool call token: <|im_start|>assistant<tool_call>
Specified: The model must call a function from a specific subset. This is implemented by prefilling up to the beginning of the function name: <|im_start|>assistant<tool_call>{"name": "browser_
This allows action selection to be constrained by masking token logits directly. For instance, when a user provides a new input, the agent should reply immediately instead of taking an action. By designing action names with consistent prefixes (e.g., browser_ for browser-related tools, shell_ for command-line tools), it's easy to ensure the agent only chooses from a specific group of tools in a given state without using stateful logits processors. These designs help keep the agent loop stable, even within a model-driven architecture, as the number of available tools grows.
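The three function calling modes can be expressed as a small prefill builder. This is a sketch that reuses the prefill strings quoted above; `build_prefill` and its mode names are hypothetical helpers, not part of any provider's SDK.

```python
ASSISTANT_PREFIX = "<|im_start|>assistant"


def build_prefill(mode: str, name_prefix: str = "") -> str:
    """Return the response prefill for a given function calling mode."""
    if mode == "auto":
        # The model may reply in text or call a function.
        return ASSISTANT_PREFIX
    if mode == "required":
        # The model must call some function.
        return ASSISTANT_PREFIX + "<tool_call>"
    if mode == "specified":
        # The model must call a function whose name starts with name_prefix.
        return ASSISTANT_PREFIX + '<tool_call>{"name": "' + name_prefix
    raise ValueError(f"unknown mode: {mode}")


prefill = build_prefill("specified", "browser_")
```

Because the restriction lives in the prefill, grouping tools by name prefix is enough to limit the agent to one family of tools in a given state.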
Leveraging the File System as Extended Context Memory
Modern frontier LLMs now boast impressive context windows, often reaching 128K tokens or more. However, in practical agentic scenarios, this can still be insufficient and, at times, even detrimental. There are three common challenges:
Observations can be extensive, particularly when agents interact with unstructured data like web pages or PDFs, easily exceeding context limits.
Model performance tends to decline beyond a certain context length, even if the window technically supports it.
Long inputs are expensive, even with prefix caching, as costs are incurred for transmitting and prefilling every token.
To address these issues, many agent systems implement context truncation or compression strategies. However, overly aggressive compression inevitably leads to information loss. The challenge is that an agent must predict the next action based on all prior states, making it difficult to determine which observation might become critical later on. Irreversible compression, therefore, carries inherent risks.
That's why one approach treats the file system as the ultimate context: unlimited in size, persistent, and directly operable by the agent itself. The model learns to write to and read from files on demand, using the file system not just as storage but as structured, externalized memory.
These compression strategies are designed to be restorable. For instance, the content of a web page can be dropped from the context as long as the URL is preserved, and a document's contents can be omitted if its path remains available in the sandbox. This shrinks the context without permanently losing information: the agent can retrieve the content again when it needs it, without bloating the active context.
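A restorable compression step might look like the sketch below. The observation layout (`id`, `url`, `content`), the function names, and the length threshold are all assumptions made for illustration.

```python
from pathlib import Path


def compress_observation(observation: dict, sandbox: Path, max_chars: int = 200) -> dict:
    """Move long content to a file and keep only a pointer in the context."""
    content = observation["content"]
    if len(content) <= max_chars:
        return observation  # short enough to stay inline
    path = sandbox / f"{observation['id']}.txt"
    path.write_text(content, encoding="utf-8")
    # The stub keeps the URL and the file path, so nothing is lost for good.
    return {
        "id": observation["id"],
        "url": observation.get("url"),
        "path": str(path),
        "note": f"content omitted ({len(content)} chars), read from path to restore",
    }


def restore_observation(stub: dict) -> str:
    """Read the full content back from the sandbox on demand."""
    return Path(stub["path"]).read_text(encoding="utf-8")
```

The stub is what stays in the context; the agent can call a file-read tool on `path`, or revisit `url`, whenever the omitted content becomes relevant again.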
This design also raises the question of what it would take for a State Space Model (SSM) to work effectively in an agentic setting. Unlike Transformers, SSMs lack full attention and struggle with long-range backward dependencies. But if they could master file-based memory, externalizing long-term state instead of holding it in context, then their speed and efficiency might unlock a new class of agents. Agentic SSMs could be the real successors to Neural Turing Machines.
Recitation for Attention Manipulation: The Todo List Strategy
A typical task for an AI agent requires many tool calls. Since AI agents rely on LLMs for decision-making, they're vulnerable to drifting off-topic or forgetting earlier goals, especially in long contexts or complicated tasks. To mitigate this, a 'recitation' technique can be employed to manipulate the agent's attention.
One effective method is to have the agent create a todo list and rewrite it step by step as the task progresses. By reciting its objectives at the end of the context, the agent pushes the global plan into the model's recent attention span. This helps avoid "lost-in-the-middle" issues and reduces goal misalignment. In effect, the agent biases its own focus toward the task objective without needing special architectural changes.
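A minimal sketch of recitation, assuming the context is a list of text entries and the plan is a list of (step, done) pairs; `render_todo` and `recite` are hypothetical helper names.

```python
def render_todo(goal: str, steps: list[tuple[str, bool]]) -> str:
    """Render the current plan as plain text, with finished steps checked off."""
    lines = [f"Goal: {goal}"]
    for text, done in steps:
        mark = "x" if done else " "
        lines.append(f"[{mark}] {text}")
    return "\n".join(lines)


def recite(context: list[str], goal: str, steps: list[tuple[str, bool]]) -> list[str]:
    """Return a new context with the latest plan appended at the end.

    Earlier entries are left untouched, so the context stays append-only.
    """
    return context + [render_todo(goal, steps)]


context = ["<system prompt>", "<observation 1>"]
context = recite(context, "Review resumes", [("Open folder", True), ("Score candidates", False)])
```

Each call places a fresh copy of the plan at the tail of the context, where it sits closest to the next decoding step.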
Embrace Failure: The Value of Keeping the Wrong Turns In
AI agents inevitably make mistakes. Language models hallucinate, environments return errors, tools misbehave, and unexpected edge cases arise. In multi-step tasks, these failures are not exceptions but rather integral parts of the process.
The common reaction is to hide these errors, clean up the trace, retry the action, or reset the model's state. While this approach might feel safer, it removes valuable evidence that the model could use to adapt and improve. Erasing failure prevents the model from learning from its mistakes.
One effective method for improving agent behavior involves retaining these "wrong turns" in the context. When the model observes a failed action, along with the resulting observation or stack trace, it updates its internal beliefs. This update reduces the likelihood of repeating the same mistake. Error recovery, therefore, becomes a key indicator of true agentic behavior.
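In code, keeping the wrong turns in can be as simple as recording failures the same way as successes. The sketch below assumes a context made of dict entries and a tool passed in as a zero-argument callable; both are illustrative choices.

```python
def run_step(context: list[dict], action: str, tool) -> list[dict]:
    """Run one tool call and append the outcome, whether it worked or not."""
    try:
        entry = {"action": action, "status": "ok", "result": tool()}
    except Exception as exc:
        # Keep the failure visible to the model instead of silently retrying.
        entry = {
            "action": action,
            "status": "error",
            "error": f"{type(exc).__name__}: {exc}",
        }
    context.append(entry)
    return context


def failing_tool():
    raise ValueError("bad url")


context = run_step([], "browser_open", failing_tool)
context = run_step(context, "browser_open", lambda: "page loaded")
```

After both calls the context holds the failed attempt followed by the successful one, so the model can see what went wrong before it chooses the next action.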
Error handling is still underrepresented in academic work and public benchmarks, which tend to focus on task success under ideal conditions. By incorporating these real-world scenarios, research can better reflect the challenges and opportunities in AI agent development. This approach aligns with the principles used in Manus, where maintaining a comprehensive context, including errors, helps the agent learn and adapt more effectively.
Avoiding Few-Shot Prompting Pitfalls: The Importance of Diversity
Few-shot prompting is a common technique to improve LLM outputs. However, in agent systems, it can backfire in subtle ways.
Language models are excellent mimics; they imitate the pattern of behavior in the context. If your context is full of similar past action-observation pairs, the model will tend to follow that pattern, even when it's no longer optimal.
This can be dangerous in tasks that involve repetitive decisions or actions. For example, when using Manus to help review a batch of 20 resumes, the agent often falls into a rhythm, repeating similar actions simply because that's what it sees in the context. This leads to drift, overgeneralization, or sometimes hallucination.
The fix is to increase diversity. Manus introduces small amounts of structured variation in actions and observations: different serialization templates, alternate phrasing, minor noise in order or formatting. This controlled randomness helps break the pattern and shifts the model's attention. In other words, don't few-shot yourself into a rut. The more uniform your context, the more brittle your agent becomes.
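One way to introduce structured variation is to rotate among a few equivalent serialization templates. The templates and the `serialize_pair` helper below are invented for illustration and are not Manus's actual formats.

```python
import random

# Hypothetical templates that carry the same information in different shapes.
TEMPLATES = [
    "Action: {action}\nObservation: {observation}",
    "[{action}] -> {observation}",
    "Step taken: {action}. Result: {observation}",
]


def serialize_pair(action: str, observation: str, rng: random.Random) -> str:
    """Serialize one action-observation pair with a randomly chosen template."""
    template = rng.choice(TEMPLATES)
    return template.format(action=action, observation=observation)


rng = random.Random(0)  # seeded so a run can be reproduced
entry = serialize_pair("read_resume", "5 years of Python", rng)
```

The template is chosen once, when the entry is appended. The stored text is never re-rendered, so the variation does not disturb the append-only prefix that caching depends on.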
Shaping the Agentic Future Through Context Engineering
The journey of building effective AI agents hinges significantly on context engineering. We've explored how crucial elements like memory, environment, and feedback loops shape agent behavior. Optimizing KV-cache hit rates, strategically managing action space complexity, leveraging the file system for extended memory, employing recitation techniques for attention manipulation, embracing failure for model adaptation, and avoiding few-shot prompting pitfalls are all vital lessons.
Ultimately, the way you engineer the context defines the performance of your AI agents. Thoughtful design leads to faster, more resilient, and scalable solutions. As you venture further into this domain, remember the power of well-crafted contexts.
Explore the capabilities of platforms like Manus to witness these principles in action and further refine your context engineering skills for building advanced AI Agents.







It's interesting how you frame context engineering, connecting it to gradient descent. Do you think we'll see more dedicated frameworks for KV-cache optimization soon, beyond current LLM toolkits? Super insightful to highlight its production impact! Really sharp observations here.
This piece really made me think about the practical side of AI development. I completely agree that context engineering is becoming absolutely critical, way beyond just fine-tuning. The iterative approach and the focus on KV-cache hit rate for production efficiency are such vital insights. Great read, thanks for sharing!