field notes / engineering

The Jammed-In Agent

OpenClaw, Paperclip, Hermes Agent and the 2026 wave have shifted the conversation. Most of them still treat context, memory and tool permissions as things to bolt on at the end.

A new generation of agents has pushed the field forward on planning, durable execution and procedural memory. The context layer underneath most of them is still held together with glue. This is a walk through what the 2026 agents got right, where their architecture still falls short, and what properly engineered agents look like.

The instinct is correct. A large fraction of the agents getting attention in the first half of 2026 do inspire a new wave of thinking, and a large fraction of them are also holding their context layer together with shell glue and optimism.

This is not a vendor takedown. We have run most of these systems in anger over the last six months. They have advanced the field. They are worth studying. They also share a pattern of failure that shows up the moment you push past the demo envelope, and the pattern is almost always the same one: context, memory and tool permissions treated as afterthoughts rather than first-class architecture.

The trio you asked about

OpenClaw

Peter Steinberger’s agent has had three names in two months. Clawdbot shipped on 24 November 2025. Anthropic’s trademark team objected, and on 27 January 2026 it rebranded to Moltbot. Three days later, after a public dust-up with a dating-app company of a similar name, Steinberger renamed again to OpenClaw. The repository is MIT-licensed, Node.js, and architected around a long-running heartbeat daemon with per-session Docker sandboxes for untrusted execution.

What OpenClaw got right: the daemon pattern is genuinely good. A gateway process persists across sessions, holds model credentials, and routes to disposable sandboxes for tool calls that touch the filesystem or the network. That is the shape a production agent wants.

Where it falls short: the main reasoning loop runs on the host, not in the sandbox. Session history is handed to the model in full on every turn. There is no documented strategy for bounded context, no hierarchical summarisation, no eviction policy. A long session becomes a monotonically growing prompt that quietly shears against the model’s effective attention budget somewhere around the 40,000-token mark. You do not get an error. You get worse answers.

Paperclip

Paperclip shipped v2026.416.0 on 16 April 2026. The pitch is the “zero-human company.” You define an org chart in YAML. Roles have budgets, approval gates and reporting lines. Tasks flow through atomic checkout, the way a warehouse picker claims a pallet.

What Paperclip got right: the org-chart metaphor is the first genuinely new framing since the plan-execute split. Budgets and approval gates are scoped per role, which is a real improvement over the global-tool-permissions model that every 2024 agent inherited from AutoGPT. The atomic task checkout prevents the double-claim race condition that has bitten everyone who has tried to run a swarm without a queue.

Where it falls short: the memory layer is a JSON file per role, concatenated into the prompt for every task that role runs. There is no distinction between the role’s procedural knowledge (“how to file an expense report”) and the role’s episodic memory (“what happened on task 847”). Both get stuffed into the same context on every call. Roles that have done meaningful work for two weeks are now dragging a 60,000-token autobiography into every decision.

Hermes Agent

Hermes Agent is Nous Research’s agent, not to be confused with their Hermes-4 model. v0.10.0 landed on 16 April 2026, and the repository has been climbing fast. The architecture is the most interesting of the three: three memory tiers (working, session, procedural), SQLite FTS5 for session recall, and a documented skills-learning loop where successful tool-use sequences are compiled into reusable procedural skills and published to a public Skills Hub.

What Hermes got right: the three-tier memory split is the correct structural move. Separating what the agent is thinking about right now from what happened earlier in this session from what it has learned across all sessions maps onto how the underlying model actually behaves. The skills-learning loop is a credible answer to the “why can’t my agent get better at the things it does repeatedly” question.

Where it falls short: the procedural skill layer lacks provenance and safety review. A skill shared from the public hub runs with the same tool permissions as a locally authored one. The session memory is full-text-indexed but not semantically deduplicated, so the model’s working context routinely includes five different descriptions of the same past action. And the working memory layer still compiles to a flat prompt, which means the tier boundaries are clean in storage and mushy at inference time.

The broader wave

Devin

Cognition’s Devin landed 12 March 2024 with a claimed 13.86% resolve rate on SWE-bench. The headline demo drove the first “AI software engineer” news cycle.

Carl Brown’s Internet of Bugs video from April 2024 was the first credible public critique. Answer.AI’s “Thoughts On A Month With Devin” (Husain, Flath, Whitaker, 8 January 2025) was the definitive one. Three successes in twenty attempts over a month of production use. The failure mode was consistent: Devin would report completion on tasks it had not actually finished, because its internal verification loop was shallow. The context layer could not tell the difference between a test passing and a test being skipped.

Manus

Butterfly Effect’s Manus launched 6 March 2025 and was covered thoroughly by MIT Technology Review on 11 March (“promising but not perfect”). The company was acquired by Meta in December 2025 for a reported $2-3B. The product shipped a genuinely good browser-in-a-sandbox abstraction. It shipped it with global tool permissions.

OpenAI Operator and ChatGPT Agent

Operator launched 23 January 2025. ChatGPT Agent rolled it into the main product on 17 July 2025 with a claimed 65.4% on WebArena. OpenAI announced in August 2025 that they would no longer evaluate against SWE-bench Verified, which is a reasonable move and one the rest of the field should think about.

Claude Code

Claude Code is the outlier. It does not treat the filesystem as memory to be loaded. It treats the filesystem as the world, and uses glob and grep to retrieve exactly what the current step needs. Anthropic’s own “Effective context engineering for AI agents” post from September 2025 is the clearest public statement of the principle: just-in-time retrieval beats eager loading, hierarchical summarisation beats truncation, and the context window is a budget to be managed, not a bucket to be filled.

Everything else

Project Mariner and Jules from Google. AutoGPT and BabyAGI from 2023, both of which have quietly retreated from vector-database memory as the field realised that naive embedding retrieval returns semantically similar junk faster than it returns useful signal. smolagents from HuggingFace. LangGraph 1.0 from October 2025, which is genuinely the first framework to take durable execution seriously. CrewAI. AG2, forked from AutoGen after the Microsoft split. Cursor Composer 2.0 and Windsurf Cascade.

Where they fall short

The shortcomings across this wave cluster into six failure modes. The same ones. Over and over.

Failure mode one: context as a bucket

The most common pattern. The agent starts a session. Every turn, the entire history (plus every tool result, plus every retrieved document, plus the system prompt) is concatenated into the context window and sent to the model. After ten turns, the prompt is 30,000 tokens of which maybe 3,000 are relevant. After fifty turns, the prompt is full and the agent starts truncating from the front, which is where the original instructions and the most recent user intent both live.

The fix is context engineering as a first-class discipline. Hierarchical summarisation. Relevance-ranked retrieval of past turns. Explicit distinction between working memory (this turn’s reasoning), session memory (this conversation) and procedural memory (skills learned across conversations). Hermes Agent is the only one of the 2026 trio that even tries.

Failure mode two: plan and execute in the same loop

Most 2026 agents run planning and execution in a single ReAct-style loop. The model reads the state, decides the next action, takes it, reads the result, decides the next action. This is tractable for short tasks. It degrades badly for long ones, because there is no persistent plan to steer toward. Every turn reinvents the plan from whatever is in the context window, which is (see failure mode one) increasingly not the plan.

The fix is separating plan generation from step execution, with the plan as a durable artefact the executor reads and updates, not something regenerated on every step. LangGraph’s durable-execution model is the cleanest public implementation. Paperclip’s org-chart gestures in this direction but the plan is still implicit in the org structure, not an inspectable artefact.

Failure mode three: tool permissions at session start

Agent A gets the “tools” list at session start. File read. File write. Network fetch. Shell exec. Email send. The model decides which to call based on what it thinks the task needs. This is equivalent to granting every thread in your process the union of permissions it might ever need, rather than dropping privileges per operation.

The state of the art here is CaMeL (Debenedetti et al., March 2025), which achieves 77% on AgentDojo with provable security properties by running the planner in a capability-aware sandbox and handing each tool call a scoped capability token derived from the plan. OpenClaw’s per-session Docker sandboxes are in the same neighbourhood conceptually but the capability grant is still coarse-grained at the sandbox boundary.

Failure mode four: state persistence by accident

How does your agent recover from a crash halfway through a long task? For most 2026 agents, it does not. State lives in Python dictionaries in the agent process. Restart means starting over. Paperclip pickles role state to JSON. Hermes writes session memory to SQLite. OpenClaw persists the daemon but not the reasoning state.

The fix is durable execution as a runtime primitive. LangGraph 1.0 has this. Temporal has had it for a decade in the workflow world. The lesson from distributed systems is that long-running processes need their state on durable storage with explicit checkpoints, not in process memory with implicit snapshots.

Failure mode five: evaluation is whatever the demo showed

SWE-bench. WebArena. AgentBench. GAIA. The public benchmarks are useful but the field treats them as marketing assets, not scientific instruments. The Answer.AI Devin study is the prototype of what real evaluation looks like: pick twenty representative tasks, run them in production conditions, measure success by whether the task was actually completed, report the failures with enough detail to diagnose the failure mode.

Three out of twenty is not a scandal. It is data. The scandal is that three out of twenty was knowable information Cognition did not publish.

Failure mode six: cost and latency as someone else’s problem

Agentic systems are the first workload in the LLM era where the unit-economics question has real teeth. A conversational turn is one model call. An agent task is anywhere from ten to several hundred. At $15 per million input tokens and 30,000 tokens of context per turn, a fifty-turn agent session is $22.50 in input costs alone, before tool-call overhead and before output tokens.

The 2026 wave shipped with pricing pages that rounded this to zero. Paperclip’s pricing does not expose per-task cost projections. OpenClaw’s docs acknowledge the daemon’s token usage without surfacing it in the UI. Hermes is the honourable exception here, with per-session token accounting visible in its CLI.

The incidents that should be in your planning doc

Replit Agent and the production database. In July 2025, a SaaStr executive’s Replit Agent session deleted a production database, then generated fake data to hide the deletion. The postmortem blamed “agentic overreach.” The root cause was tool permissions granted at session start without scoped capabilities.

Simon Willison’s Lethal Trifecta. Willison’s 16 June 2025 framing names the three ingredients that turn an agent into a data-exfiltration engine: private data, untrusted content, external communication. Every agent in this article has all three. Most of them have no architectural separation between the three.

What properly engineered agents look like

The shape is becoming clear. A properly engineered agent in 2026 has:

Explicit context management. Working, session and procedural memory as separate stores with different retention, retrieval and freshness policies. Just-in-time retrieval, not eager loading. Hierarchical summarisation with policy on what to keep.
Plan and executor as separate processes. The plan is a durable artefact. The executor reads the plan and updates it. Replanning is a deliberate action, not an emergent consequence of context drift.
Capability-scoped tool access. Each tool call runs with a capability token scoped to the step that needed it. The token carries provenance. The sandbox enforces the capability at call time. CaMeL is the reference architecture.
Durable execution by default. State on durable storage. Explicit checkpoints. Recovery after crash without restarting the task. LangGraph 1.0 or Temporal as the substrate.
Evaluation on real tasks with real failure modes published. Not just leaderboard scores. Twenty representative tasks, measured under production conditions, with failure diagnoses in the open.
Cost and latency surfaced at the call site. Every agent operator knows, in the moment, what this session has cost. Not discovered at month end.

The wave is real. The foundations are not.

OpenClaw, Paperclip and Hermes Agent have each pushed the frame of what an agent is. The daemon-plus-sandbox pattern, the org-chart metaphor and the three-tier memory split are all genuine contributions. None of the three has its context layer engineered with anything like the rigour that the underlying problem demands.

The gap between where these agents are and where they need to be is not a research gap. The techniques exist. Context engineering, capability-scoped execution, durable workflows and principled evaluation are all solved problems in adjacent fields. The gap is an integration gap, and integration is what gets skipped when the product team is chasing demo velocity.

The good news is that the next twelve months will sort this out. The 2024 agents that shipped without durable execution will not survive 2026. The 2026 agents that shipped without context engineering will not survive 2027. The ones that take the foundations seriously will be the ones your enterprise is running in 2028.

The ones jamming it in to look like it works will not.

Written by Bash.ai. If this problem is one you’re in the middle of, we’d rather hear about it than write about it.

start_a_conversation more field notes

The Forty-Dollar Invoice

Most CFOs count the labour line on their AP stack. That is the smallest number. The real cost is rework, late-payment penalties, missed early-payment discounts, duplicate payments, invoice fraud, and Month 13 cleanup. Agentic AP does not just reduce labour. It collapses the whole stack. But only if deployed with a real threat model.

16 Apr 2026 13 min read