Why smarter agent architecture does not always improve results


Why Your AI Agents Need Engineering Instead of Best Practices

I remain optimistic about the impact agents will have on knowledge work. As I noted in an earlier article, fields shaped by clear rules and mature systems, including accounting and contract management, already look well suited to this kind of automation. But even if the opportunity is real, the practical reality is that AI teams are still learning how to build agents that work reliably in production. Moving from a fragile prototype to a dependable system requires more than a good prompt. It means thinking carefully about the underlying architecture. To see how these systems come together, it helps to break the stack into its main parts.




In a working AI agent system, three core components define capability and behavior. 

  • Tools are the individual actions an agent can perform: database queries, API calls, file operations, or code execution. They are the atomic operations that enable agents to reach out and interact with external systems. 
  • Skills operate at a higher level. They are reusable workflows that combine multiple tools with specific reasoning steps to accomplish meaningful business objectives like analyzing a contract or triaging support tickets. 
  • Context files like AGENTS.md work differently. Rather than adding capability, they define how the agent should think and act. They specify the agent’s role, decision-making guidelines, constraints, and the reasoning patterns it applies when facing choices. 

This three-layer separation is practical: it lets you mix tools into different skills, and run those skills under different behavioral frameworks, without rebuilding core logic.
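To make the separation concrete, here is a minimal sketch in Python. All names are hypothetical, and the stubbed `query_database` and `call_llm` functions stand in for real integrations:

```python
# Tools: atomic operations the agent can invoke (stubbed for illustration).
def query_database(sql: str) -> list[dict]:
    """Run a read-only query and return rows."""
    return [{"ticket_id": 1, "subject": "Login fails", "body": "..."}]

def call_llm(prompt: str) -> str:
    """Call the underlying model."""
    return "category: authentication, priority: high"

# A skill: a reusable workflow combining tools with reasoning steps.
def triage_tickets(context: str) -> list[str]:
    tickets = query_database("SELECT * FROM tickets WHERE status = 'open'")
    results = []
    for t in tickets:
        prompt = f"{context}\n\nClassify this ticket:\n{t['subject']}\n{t['body']}"
        results.append(call_llm(prompt))
    return results

# A context file: defines how the agent should think, not what it can do.
AGENT_CONTEXT = """You are a support triage agent.
Always classify by category and priority; never promise refunds."""

print(triage_tickets(AGENT_CONTEXT))
```

Because the context string is a separate input, the same `triage_tickets` skill can run under different behavioral frameworks, and the same tools can be recombined into other skills without touching core logic.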

Production agent systems depend on several other components that matter just as much as the tools themselves. Memory systems maintain continuity across multiple turns, allowing agents to reference past decisions and context. Orchestration frameworks determine whether one agent or multiple specialized agents should handle a task. Planning modules help break complex goals into executable sequences. State management ensures context carries across interactions. Guardrails and permissions prevent misuse and enforce organizational policy. Monitoring and logging let you see what the agent actually does, which often differs from what you expected. These pieces work together. Without memory, the agent can’t maintain context. Without orchestration, it can’t coordinate complex work. Without guardrails, it risks policy violations.
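A toy sketch of how three of those pieces interlock, with hypothetical names and stubbed tool execution: memory carries state forward, a permission set acts as a guardrail, and logging records what the agent actually did.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

ALLOWED_TOOLS = {"search", "summarize"}  # guardrail: permitted actions only

class Agent:
    def __init__(self):
        self.memory: list[str] = []  # maintains context across turns

    def act(self, tool: str, payload: str) -> str:
        if tool not in ALLOWED_TOOLS:           # guardrail check
            log.warning("blocked tool: %s", tool)
            return "denied"
        log.info("tool=%s payload=%s", tool, payload)  # monitoring
        result = f"{tool}({payload}) -> ok"     # stubbed execution
        self.memory.append(result)              # state carries forward
        return result

agent = Agent()
agent.act("search", "open invoices")
agent.act("delete_all", "invoices")  # blocked by the guardrail
print(agent.memory)                  # only the permitted action was recorded
```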

Rethinking Coordination and Memory in Agent Systems

There is still a great deal of experimentation happening across all these tool categories. Orchestration is one area seeing intense activity as builders realize that early frameworks are often too rigid. Older systems force developers to map out every workflow in advance or rely on unstructured agent chats. New tools are filling this gap by offering more flexibility and control. Cord is a recent example that lets agents build their own task trees on the fly. It allows models to decide when to split work into parallel tracks or share context without needing a hardcoded plan. Emdash tackles orchestration from a workspace angle by letting developers run multiple coding agents in parallel across isolated environments. This eliminates the messy reality of juggling different terminals and waiting for a single model to finish its job.
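As a rough illustration of dynamic task splitting (not Cord's actual API), a recursive solver can decide at runtime whether to fan subtasks out in parallel. The `plan` function here is a crude stand-in for a model-driven decision:

```python
from concurrent.futures import ThreadPoolExecutor

def plan(task: str) -> list[str]:
    """Decide whether to split a task. A real system would ask the model;
    here we simply split on the word 'and'."""
    return [p.strip() for p in task.split(" and ")]

def solve(task: str) -> str:
    subtasks = plan(task)
    if len(subtasks) == 1:
        return f"done: {task}"
    # Parallel tracks: fan out subtasks, then merge results.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(solve, subtasks))
    return " | ".join(results)

print(solve("write tests and update docs"))
```

The point is that the task tree is built as the agent works, rather than mapped out in advance by the developer.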

One underappreciated cost of adding agents is coordination overhead. In many-to-many designs, that overhead grows quadratically: the number of pairwise communication channels scales with the square of the number of agents. Centralized orchestration can reduce some of that complexity, though it introduces its own bottlenecks. More agents also mean more inference cost and more opportunities for compounding errors. Recent studies suggest that adding agents helps in some settings, especially when work can be cleanly decomposed, but it can also add overhead and even reduce performance when the single-agent baseline is already strong or the task is highly sequential.
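The arithmetic behind that overhead is easy to check: a fully connected group of n agents has n(n-1)/2 pairwise channels, while a central orchestrator needs only one channel per agent.

```python
def pairwise_channels(n: int) -> int:
    """Channels in a fully connected many-to-many design: n choose 2."""
    return n * (n - 1) // 2

for n in [2, 4, 8, 16]:
    # Many-to-many grows roughly with n^2; centralized grows with n.
    print(f"{n} agents: {pairwise_channels(n)} peer channels vs {n} centralized")
```

Going from 4 agents to 16 quadruples the centralized channel count but multiplies the peer-to-peer count twentyfold, which is why many-to-many designs get expensive so quickly.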

Memory and context systems are also evolving to handle more than just conversational history. As I argued in an earlier piece, most current memory approaches are better at retrieving facts or preserving conversation than at helping agents repeat operational work reliably. To solve this, developers are moving toward operational skill stores or context file systems. It is less about chat history and more about procedural memory. Instead of overloading a prompt with endless documentation, these new systems save successful workflows as permanent procedures. The agent only loads the specific instructions it needs for the exact task at hand. This method turns temporary problem solving into reliable company assets while drastically cutting down on computing costs.
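A minimal sketch of such a skill store, with a hypothetical on-disk layout: successful workflows are saved as named procedures, and the agent loads only the steps relevant to the task at hand instead of carrying full documentation in every prompt.

```python
import json
from pathlib import Path

STORE = Path("skills")  # hypothetical on-disk skill store
STORE.mkdir(exist_ok=True)

def save_skill(name: str, steps: list[str]) -> None:
    """Persist a workflow that succeeded so it can be replayed later."""
    (STORE / f"{name}.json").write_text(json.dumps({"steps": steps}))

def load_skill(name: str) -> list[str]:
    """Load only the instructions needed for the task at hand."""
    return json.loads((STORE / f"{name}.json").read_text())["steps"]

save_skill("monthly_close", [
    "export ledger to CSV",
    "reconcile against bank statement",
    "flag discrepancies over $100",
])

# At run time the prompt contains just these steps,
# not the full accounting manual.
print(load_skill("monthly_close"))
```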

Moving From Art to Engineering in Agent Design

As teams adopt new memory and orchestration tools, they often inherit best practices before testing whether those methods actually help in their own environments. AGENTS.md is a good example. These simple repository-level files are meant to guide how coding agents behave inside a codebase. A recent study examined whether they deliver on that promise by testing coding agents on standard benchmarks and on a new benchmark, AGENTBENCH, built from real repositories. The results were not especially encouraging. Automatically generated context files reduced task success rates while increasing inference costs by more than 20 percent. Agents followed the instructions and explored the code more extensively, but that extra activity did not translate into better outcomes. Even developer-written files produced only modest gains.

Building AI agents is an engineering discipline, not an art form. You get exactly what you measure.

Too many teams still build a workflow, run it a few times, decide it feels right, and ship it. That approach carries real risks. The standard practice in machine learning has long been to test each new component before adding it: does this actually improve results, and where does it now fail? The same logic applies to agent systems. The lesson from the AGENTS.md research is not that context files are useless. It is that adding any component – a guidance file, a new agent, a prompt change – should be treated as an engineering decision, not a default. Leo Meyerovich made this point well when he argued that teams get what they measure. In practice, that means defining clear evals for your own use cases and keeping only what improves results, whether the metric is task success, speed, safety, or cost. In agent systems, the question is not whether a recommendation sounds sensible. It is whether it improves performance in your setting.
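A minimal illustration of that gate, using toy agents and a toy eval set (all names are hypothetical): measure task success with and without the new component, and keep it only if the score improves.

```python
def eval_config(agent, tasks: list[dict]) -> float:
    """Fraction of tasks the agent completes correctly."""
    passed = sum(agent(t["input"]) == t["expected"] for t in tasks)
    return passed / len(tasks)

# Toy agents standing in for real configurations.
baseline = lambda x: x.upper()
with_context_file = lambda x: x.upper() if len(x) < 5 else x  # worse on long inputs

tasks = [{"input": s, "expected": s.upper()} for s in ["ok", "fix", "refactor"]]

score_a = eval_config(baseline, tasks)
score_b = eval_config(with_context_file, tasks)

# Keep the new component only if it measurably helps.
keep_context_file = score_b > score_a
print(score_a, score_b, keep_context_file)
```

The same gate applies whether the metric is task success, latency, safety, or cost; the discipline is that nothing ships on the strength of sounding sensible.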

Putting an AI agent into production means coordinating a stack of tools, skills, orchestration frameworks, memory systems, and guardrails. Developers and startups are still iterating quickly on this infrastructure, often in open source, and that experimentation is helping the field mature. But it is also easy to mistake architectural complexity for progress. As the evidence on context files suggests, simpler tools paired with rigorous evaluation will often beat a more elaborate setup that has not been tested against real work. Part of the problem is that the number of variables in a working agent system is larger than it first appears. Chunking strategy, embedding choice, retrieval method, prompt structure, context window size, and model selection all interact. Teams that rely on defaults and intuition across these variables are, in effect, guessing. Systematic evaluation does not have to mean testing every combination – but it does mean knowing which variables matter most for your specific use case.
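One way to make that concrete is to score a small grid over the variables you suspect matter most. The `evaluate` function below is a stand-in for running your own eval suite on one configuration; the scores and variable names are illustrative only.

```python
import itertools

def evaluate(chunk_size: int, retrieval: str, model: str) -> float:
    """Stand-in for running an eval suite with one configuration."""
    base = {"bm25": 0.70, "dense": 0.74}[retrieval]
    bonus = 0.05 if chunk_size == 512 else 0.0
    cost_penalty = 0.02 if model == "large" else 0.0
    return base + bonus - cost_penalty

grid = itertools.product([256, 512], ["bm25", "dense"], ["small", "large"])
results = sorted(
    ((evaluate(c, r, m), c, r, m) for c, r, m in grid), reverse=True
)
for score, c, r, m in results[:3]:
    print(f"{score:.2f} chunk={c} retrieval={r} model={m}")
```

Even this tiny sweep makes the interactions visible; in practice you would prune the grid to the variables your error analysis says dominate, rather than testing every combination.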

Getting an agent ready for production means running computationally intensive experiments to find the right configuration. Having an AI platform that lets you run those experiments efficiently is a distinct advantage. Dean Wampler recently explored this in a new article on the PARK stack, an open source foundation built on PyTorch, AI models and agents, Ray, and Kubernetes. In the end, teams with scalable infrastructure and rigorous evaluation will be better positioned to solve real business problems.


How LanceDB fits into personal autonomous agents — based on the LanceDB + OpenClaw integration guide

From "A Practitioner's Guide to GTC 2026"
