Context Engineering: Building Reliable ML Agents — Julian Mukaj

Prompt engineering is dead. Not because it was useless, but because the problem has moved. Models now handle multi-step, long-horizon tasks that were science fiction two years ago: searching the web, writing and executing code, training classifiers, running backtests. The METR benchmark tells this story clearly.

METR Benchmark showing AI progression on long-horizon tasks
Agent performance on long-horizon tasks, measured in hours of sustained autonomous work. The jump from 2024 to 2026 is not incremental.

The uncomfortable truth is that model quality is now secondary. Given two systems with the same underlying model, the one with better scaffolding wins. Specifically: a better harness and a better verifier. The gains no longer come from optimizing system prompts with tools like DSPy; the architecture of your agent's context system is now the key to success.

Long Context Retrieval Performance Comparison
MRCR v2, 8-needle: mean match ratio (%) vs. input tokens. Opus 4.6 and Sonnet 4.6 degrade gracefully to 1M tokens. GPT-5.4 falls to 36.6% at 1M context. Gemini 3.1 Pro collapses entirely.

At 128K tokens GPT-5.4 sits around 80%, competitive with the Claude models. By 1M tokens it has fallen to 36.6% while Opus 4.6 holds at 78.3%. On raw retrieval, the gap is large. Yet in practice, on agentic benchmarks, GPT-5.4 keeps pace with Claude across many task categories. The reason is that well-engineered scaffolding compensates for weaker retrieval: the harness controls what enters the context window, when, and in what order. A model that never sees a 1M-token prompt because the harness manages context intelligently does not pay that retrieval penalty. Model capability matters. Scaffolding quality matters more.
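To make the idea concrete, here is a minimal sketch of harness-side context management: rank candidate snippets by relevance and pack them greedily under a token budget, so the model never sees an oversized prompt. The scoring and the whitespace-based token estimate are stand-ins; a real harness would use the model's tokenizer and a proper retriever.

```python
def pack_context(snippets, budget_tokens):
    """snippets: list of (relevance_score, text) pairs.
    Returns the texts that fit the budget, highest relevance first."""
    packed, used = [], 0
    for score, text in sorted(snippets, key=lambda s: -s[0]):
        cost = len(text.split())          # crude token estimate
        if used + cost > budget_tokens:
            continue                      # skip what does not fit
        packed.append(text)
        used += cost
    return packed

snippets = [
    (0.9, "portfolio constraints: max 5% per name"),
    (0.4, "historical changelog entry " * 50),   # long, low relevance
    (0.8, "risk factor exposures updated daily"),
]
print(pack_context(snippets, budget_tokens=20))
```

The point is not the greedy heuristic itself but who runs it: the harness decides what the model sees, so the model never pays the long-context retrieval penalty.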


1. Context is the Autonomy Ceiling

I spend a lot of time building agents at home and at work: ML feature generation, browser automation with Playwright, a personal assistant that keeps track of my mailbox and calendar. The single best predictor of whether an agent succeeds is not model size (I prefer the nano-sized models!). It is what fraction of the required context the agent can actually see.

Coding agents work exceptionally well because the context coverage is near-total. The repository is the context. Every file, every test, every CI configuration is traversable. The agent knows exactly where it stands. Contrast that with something like an HR or compliance agent, where maybe 30-40% of the relevant information exists in structured form. The rest lives in emails, in someone's head, in a meeting that was never written down. That agent will hallucinate, not because the model is bad, but because it is being asked to reason over a gap.

In alpha research the same fragmentation problem appears, but with a different character. The data exists. Risk factor exposures come from one provider, fundamentals from another, security and sector mappings are maintained internally, and portfolio-level constraints live with the fund manager or the risk team. Each piece is accessible in principle. The problem is that assembling them into coherent agent context means crossing cloud environments, internal APIs, team boundaries, and sometimes just asking someone. An agent that only gets connected to the price feed and a backtest runner is not context-poor because the information is unavailable. It is context-poor because the integration work was not done. That distinction matters: the fix is an engineering and coordination problem, not a data problem, and it is harder than it looks precisely because it involves people and teams, not just pipelines.

My rough heuristic holds regardless: below 80% context coverage, the agent cannot run unsupervised. You are not delegating, you are babysitting.

High context (≥80%): the agent operates autonomously. Failures are edge cases, not the norm. Example: a coding agent with the repo, tests, and CI all accessible.

Low context (<80%): the agent hallucinates to fill gaps, and a human stays in the loop. Example: an HR agent where most of the relevant information lives with the hiring manager.

Context coverage threshold. Below 80%, autonomous operation is unreliable.


2. Harnesses and Verifiers

Moving beyond a simple chain of LLM calls into genuinely open-ended search requires two pieces of infrastructure that most people underinvest in.

Specialized Coding Agent and LangGraph Optimization Loop architectures
Two architectures I use regularly. Left: a coding agent with MCP tool access and a Skills.MD context file. Right: a LangGraph optimization loop where the agent iterates until a validation target is met.

The harness is the execution environment. It controls what tools the agent can call, manages Python subprocess lifecycles, injects the right context at the right point in the loop, and handles the failure cases: timeouts, bad outputs, retries. Without a well-built harness the agent is running loose. With one, you can actually reason about what it is doing and why.
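A minimal harness skeleton might look like the following. The tool registry, subprocess runner, and retry wrapper capture the responsibilities described above; the `run_python` tool name and the error-string convention are illustrative assumptions, not a real API.

```python
import subprocess
import sys

TOOLS = {}

def tool(name):
    # Register a function as a callable tool in the harness.
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("run_python")
def run_python(code, timeout=30):
    # Run agent-written code in a subprocess so hangs and crashes stay
    # contained; return stdout, or a structured error string on failure.
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout if proc.returncode == 0 else f"ERROR: {proc.stderr}"
    except subprocess.TimeoutExpired:
        return "ERROR: timed out"

def call_tool(name, args, retries=2):
    # Retry transient failures; after the last attempt, surface the error
    # to the agent instead of crashing the loop.
    for attempt in range(retries + 1):
        result = TOOLS[name](**args)
        if not str(result).startswith("ERROR") or attempt == retries:
            return result
```

The key property is containment: every failure mode (timeout, crash, bad output) comes back to the loop as data the agent can reason about, rather than an exception that kills the run.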

The verifier is more important but can be easier to build. It is the function that tells the agent whether its output is correct. In a feature generation task this might be a set of statistical assertions: no lookahead, IC above a threshold on a holdout period, Sharpe on a sample backtest above some floor. The verifier has to be non-gameable. An agent that can pass the verifier by overfitting the validation set is not useful. Getting this right is genuinely difficult, and it is where most agent projects actually fail. You should build the verifier before you build anything else. It defines what success means. Everything downstream is just search.
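As a sketch, a feature verifier of the kind described can be a handful of explicit checks. The thresholds here (rank IC above 0.02, annualized Sharpe above 0.5) and the implementation details are illustrative placeholders, not the demo's actual values; what matters is that every check runs on a holdout period the agent never saw during search.

```python
import numpy as np

def information_coefficient(feature, forward_returns):
    # Rank IC: correlation between the ranks of the feature values
    # and the ranks of next-period returns.
    fr = np.argsort(np.argsort(feature))
    rr = np.argsort(np.argsort(forward_returns))
    return float(np.corrcoef(fr, rr)[0, 1])

def verify(feature, forward_returns, pnl, timestamps, feature_timestamps):
    checks = {
        # No lookahead: the feature value at t may only use data up to t.
        "no_lookahead": bool(np.all(feature_timestamps <= timestamps)),
        "ic_above_floor": information_coefficient(feature, forward_returns) > 0.02,
        # Annualized Sharpe on the sample backtest, assuming daily pnl.
        "sharpe_above_floor": pnl.mean() / (pnl.std() + 1e-12) * np.sqrt(252) > 0.5,
    }
    return all(checks.values()), checks
```

Returning the full dict of checks, not just a boolean, gives the agent something to learn from: a feature that fails only the Sharpe floor suggests a different next move than one that leaks.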

The failure mode to watch for specifically is reward hacking. If your verifier is a backtest run on the same cross-validation folds that the agent is generating features against, the agent will find them. Not intentionally, not through any planning, but because the search will naturally converge on whatever pattern scores well against the objective you gave it. If that objective is leaky, the agent exploits the leak. You end up with a feature set that looks excellent in-sample and is worthless out of sample. The fix is structural: holdout data the agent cannot touch, test sets evaluated only once, and preferably a second-stage verifier that runs on data from a different time period or market regime entirely. Treat your verification scheme with the same rigour you would apply to a live trading system. Because if the agent is good, it will find any weakness you leave in it.
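The structural fix can be sketched as chronological splits plus a one-shot gate on the final holdout. The split fractions and names are illustrative; the two invariants are that the splits never shuffle time and that the final period can be scored exactly once.

```python
def time_splits(n_periods, search_frac=0.6, holdout_frac=0.2):
    # Chronological, non-overlapping splits: no shuffling, so no future
    # information leaks into the agent's search objective.
    s = int(n_periods * search_frac)
    h = s + int(n_periods * holdout_frac)
    return slice(0, s), slice(s, h), slice(h, n_periods)

class FinalGate:
    """Second-stage verifier: allows exactly one evaluation on the final period."""
    def __init__(self):
        self.spent = False

    def evaluate(self, score_fn, final_data):
        if self.spent:
            raise RuntimeError("final holdout already consumed")
        self.spent = True
        return score_fn(final_data)
```

The agent searches against the first slice only; the harness scores the second; the gate protects the third. If the gated score diverges sharply from the holdout score, that divergence is itself the signal that the verifier was being gamed.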


3. Memory

A context window is not memory. Dumping all prior experiment results into the prompt on every iteration is wasteful and breaks down quickly at scale. State needs to be managed properly, split across two horizons.

Within a session, the agent tracks conversation history, tool call results, a working scratchpad, and the current plan. This is short-term memory. It is ephemeral and that is fine. Across sessions, the agent needs a structured record of decisions made, hypotheses tested, failures to avoid repeating, and domain knowledge accumulated during prior runs. This is long-term memory, and the simplest reliable implementation really is Markdown files.

In the demo agent shown at the end of this piece, I use an EDA.md for exploratory findings, a ScratchPad.md for active hypotheses, and an AGENT.md that absorbs both at the start of each run and acts as the agent's working memory file. The agent writes to these files as it works. When a feature does not pass the verifier, that gets logged. The next run starts with that knowledge already loaded. You stop burning compute re-discovering dead ends.
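A minimal sketch of that memory scheme, using the file names from the text; the merge layout and the rejection-log format are assumptions.

```python
from pathlib import Path

MEMORY = Path("memory")   # illustrative location for the memory files

def start_run():
    # Rebuild AGENT.md from the two source files at the start of a run,
    # so the agent's working memory already contains prior findings.
    MEMORY.mkdir(exist_ok=True)
    sections = []
    for name in ("EDA.md", "ScratchPad.md"):
        f = MEMORY / name
        if f.exists():
            sections.append(f"## From {name}\n\n{f.read_text()}")
    (MEMORY / "AGENT.md").write_text("# Working memory\n\n" + "\n\n".join(sections))

def log_failure(feature_name, reason):
    # Append rejected features to the scratchpad so the next run starts
    # knowing which dead ends are already ruled out.
    MEMORY.mkdir(exist_ok=True)
    with (MEMORY / "ScratchPad.md").open("a") as f:
        f.write(f"- REJECTED `{feature_name}`: {reason}\n")
```

Plain files are the point: the memory survives crashes, is trivially inspectable by a human, and requires no database or embedding index to work.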


4. Building the Loop

When agents start doing real research work, the researcher's job changes. You stop generating features and start designing the system that generates features. The intellectual work shifts from object-level to meta-level: what is the search space, how do we verify progress, how do we avoid local optima, when do we intervene.

ML Agent Architecture Flowchart
The loop I run for ML feature research. The agent cycles through exploratory analysis, feature generation, and verification. Human intervention feeds back into the loop at any node. MD files persist state across iterations.

To make this concrete: I built a demo of this loop from scratch and let it run overnight on a Kaggle-style ML competition dataset. By morning it had reached a score of 0.267 over 107 iterations, at roughly $6 in model costs. Getting to the same score manually (iterating on features by hand, writing and running backtests, reading the results, forming a hypothesis, trying again) would conservatively take a week. Probably more. The agent is not smarter than a researcher. It is just faster, tireless, and it never forgets what it already tried. Those three properties compound quickly.

That gap only exists because the scaffolding was built properly first. The verifier was defined before the agent wrote a single feature. The harness gave it clean access to the data and a working backtest runner. The memory files meant iteration 50 knew what iterations 1 through 49 had already ruled out. Remove any one of those three and the overnight run becomes a pile of repeated failures. The autonomy ceiling is not set by the model. It is set by how well you built the system around it.

ML Agent Live Dashboard
The demo agent reaching a best score of 0.267 across 107 iterations overnight. $6.19 in model costs. A score that would take a researcher the better part of a week to reach manually. This is a standalone demo build.