The REPL Is Dead. Long Live the Factory.

January 17, 2026 · 19 min read
#vibe-coding #ai-development #gas-town #agent-orchestration #repl #predictions

This is a companion piece to my vibe coding devlog series. If you've been following along, this is my bold prediction for 2026. If you're new here, welcome to the deep end.


My prediction for 2026: The year of the orchestrator. The REPL pattern that powers every AI coding tool will evolve—or be replaced by tools that have.

(Yes, "The REPL Is Dead" is provocative. The REPL isn't dead—it's incomplete. Add an orchestration layer and you get something that scales. I tried this myself: file locking, shared state, manual coordination. It worked, sort of. What I'm calling ODMCR is just that pattern formalized. The REPL isn't dying; it's growing up.)

I'm not just speculating. I've spent 10 years in IT, chasing the DevOps dream that Gene Kim painted in The Phoenix Project. The pattern is always the same: take unreliable components, wrap them in operational discipline, let the platform handle the mess. We did it with servers. We did it with containers. Now we're doing it with AI agents.

Last year I published 12-Factor AgentOps—infrastructure patterns applied to AI workflows. The 40% context rule. Validation gates. Treat agents like cattle, not pets. It's DevOps for the age of agents.

Then I found Steve's Gas Town. His Welcome to Gas Town and Future of Coding Agents confirmed what I suspected: the REPL is an incomplete abstraction. The right one looks like Kubernetes—a control plane orchestrating ephemeral workers against a declarative state store.

I'm building a K8s operator for knowledge work—Hephaestus. Gas Town is the CLI proving ground. This article bridges the gap between where most people are (REPL loops) and where I believe we're headed (orchestrated agent factories).

From workbench to factory floor


Every agentic coding tool in 2025 was built on the same foundation: the REPL. Read-Eval-Print-Loop. It's elegant. It's simple. It doesn't scale—not without help.

(Yes, I'm writing this article with Claude Code—a REPL-based tool. The irony isn't lost on me. REPL works fine for single-agent, single-session work. The problems start when you try to scale it.)

If you've run an agent loop overnight and woken up to $547 in burned tokens, you already know something is broken. If you think the REPL pattern is the endgame for AI coding—well. Grab a seat.


The Pattern Everyone Loves

The agent execution loop has become gospel. Prepare context, call model, handle response, iterate until done. Every major player uses some version of it: Replit Agent, GitHub Copilot agent mode, Claude Code, Cursor. Victor Dibia's explainer captures the pattern well.
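
Stripped to its bones, the loop looks something like this. A minimal Python sketch, not any particular tool's internals; `call_model` and `apply_response` stand in for whatever model client and tooling you're actually using:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    summary: str
    done: bool

def agent_repl(
    task: str,
    call_model: Callable[[str], str],             # stand-in for your model client
    apply_response: Callable[[str], StepResult],  # edits files, runs commands
    max_iterations: int = 50,
) -> list[str]:
    """One agent, one session: Read, Eval, Print, loop until done."""
    history: list[str] = []
    for _ in range(max_iterations):
        # Read: assemble context from the task plus everything done so far
        context = f"Task: {task}\n\nPrevious steps:\n" + "\n".join(history)
        # Eval: one (expensive) model call
        response = call_model(context)
        # Print: apply the response to the working tree
        result = apply_response(response)
        history.append(result.summary)
        if result.done:
            break
    # Structural issue: history only ever grows, and nothing here records
    # which attempted fixes already failed.
    return history
```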

Geoffrey Huntley's Ralph pattern turned it into a meme. A bash loop that keeps feeding the agent until tests pass. Simple. Viral. I was a Ralph power user. I ran it overnight. I felt very clever.

Then the quality collapsed.

It's not just me. A 2025 randomized controlled trial by METR found that AI coding tools made experienced developers 19% slower—not faster. Developers predicted a 24% speedup. ML experts predicted 38%. The actual result? A slowdown. On real tasks. With frontier models.

To be fair: that study tested Cursor specifically, not the REPL pattern in general. The slowdown could be Cursor's UX, task selection, or a dozen other factors. I'm making a different claim: the architectural limits I'm describing—context accumulation, lack of cross-session memory, no coordination primitive—are inherent to REPL-style tools, not specific to any implementation. I can't prove orchestration solves this yet—my data is a pilot, not a peer-reviewed study. But the REPL pattern clearly isn't delivering the gains everyone expected. Something needs to change.


The 50 First Dates Problem

Here's the failure mode nobody warns you about: the agent spirals on the same bug for hours, confident it's making progress. It'll try the same fix seventeen times with minor variations, each time expecting different results.

It's not insanity. It's optimism without memory.

The REPL pattern is stateless by design. Every iteration starts fresh. The loop doesn't know what already failed. It's Drew Barrymore waking up every morning in 50 First Dates, except instead of falling in love with you, it's falling in love with solutions that already didn't work.
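
Here's the missing piece in concrete terms: a sketch of the attempt memory a bare loop doesn't have. Illustrative only; exact-match hashing wouldn't catch the "minor variations," but it makes the gap visible.

```python
import hashlib

class AttemptMemory:
    """Remembers fingerprints of attempted fixes so a loop can refuse repeats.

    This is the piece a stateless loop lacks: without it, iteration 17 has no
    idea that iteration 3 already tried (and failed) essentially the same patch.
    Exact hashing won't catch near-duplicates; that needs fuzzier matching.
    """

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def already_tried(self, proposed_patch: str) -> bool:
        fingerprint = hashlib.sha256(proposed_patch.encode()).hexdigest()
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False

# Inside the loop: if memory.already_tried(patch), push the model toward a
# genuinely different approach instead of re-applying the same change.
```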

I burned $547 in a single day on December 8th. That was my "holy shit" moment.

The $547 day


The Architectural Problem

OK, so you add state. Memory. Persistence. That helps with single-agent work.

But here's what happens at scale. You run 3+ agents. They all need coordination. The coordinator needs to know what each agent is doing.

Classic multi-agent pattern: workers report back to a boss. Every result flows into the coordinator's context.

Math time:

  • 8 agents
  • 10K tokens each
  • 80K tokens consumed in the coordinator

The boss becomes the bottleneck. Context fills up. Performance degrades. The expensive model—the one doing the coordination—chokes on data it doesn't need.
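
The same math as code, roughly. The token figures are the illustrative numbers above, and `estimate_tokens` is a crude heuristic, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    return len(text) // 4   # crude heuristic: roughly four characters per token

# Coordinator-centric: every worker's full output lands in the boss's context.
def fan_in_tokens(worker_results: list[str]) -> int:
    return sum(estimate_tokens(r) for r in worker_results)   # 8 x ~10K -> ~80K

# Status-board style: the boss reads ticket states; the work stays in git.
def status_board_tokens(issue_status: dict[str, str]) -> int:
    board = "\n".join(f"{issue}: {state}" for issue, state in issue_status.items())
    return estimate_tokens(board)   # tens of tokens, not tens of thousands
```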

SWE-Bench Pro quantified the complexity cliff. On simple single-file tasks, top models score 70%+. On realistic multi-file enterprise problems? GPT-5 and Claude Opus 4.1 drop to 23%. The performance cliff is brutal—beyond 3 files, even frontier models struggle. Beyond 10 files, open-source alternatives hit near-zero. The study tests model capability, not execution patterns—but it explains why decomposition matters. Smaller tasks mean smaller contexts mean staying on the good side of the cliff.

I built this multi-agent coordinator before I found Gas Town. I called it "the kraken." I should have called it "the money pit."

The Kraken architecture


The Factory Analogy

Steve nailed it in The Future of Coding Agents: coding agent shops think they've built workers. What they need is a factory.

A factory doesn't funnel all work product to a single manager who inspects every widget. A factory has:

  • Workers who operate in isolation
  • Status boards (not detailed reports)
  • Quality gates (not real-time supervision)
  • Graceful failure handling (not cascade failures)

The REPL is a workbench. You stand at it. You do the work. It's fine for one person.

Gas Town is closer to a production line than a workbench. You manage capacity. You track throughput. You handle failures without stopping the line.

I'm not running a factory yet—8 polecats is a workshop, proof-of-concept scale. But the architecture is designed to scale. Different patterns for different scales.

Workbench vs Factory floor


From REPL to ODMCR

Gas Town doesn't replace the REPL—it's what the REPL becomes when you add orchestration.

The insight: don't have agents return their work to a coordinator.

Each worker—Steve calls them "polecats"—runs in complete isolation:

  • Own terminal
  • Own copy of the code
  • Results go straight to git and a shared issue tracker

The coordinator (Mayor) just reads status updates from the tracker. It never loads the actual work. It doesn't know what the code looks like—it knows what the tickets look like.

It's managing work, not syntax.

What keeps it moving? Steve calls it GUPP—the Gastown Universal Propulsion Principle: "If there is work on your hook, YOU MUST RUN IT." Every worker has a persistent hook in the Beads database. Sling work to that hook, and the agent picks it up automatically. Sessions crash, context fills up, doesn't matter. The next session reads the hook and continues.
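
From the outside, a worker session has roughly this shape. This is a Python sketch of the pattern with my own function names, not Beads' or Gas Town's actual interface:

```python
import time

def polecat_session(hook_id: str, claim_work, run_task, commit_and_report) -> None:
    """GUPP from the worker's point of view: if there is work on your hook, run it.

    `claim_work`, `run_task`, and `commit_and_report` are hypothetical stand-ins
    for whatever the orchestrator provides; this is the shape of the loop, not
    Beads' or Gas Town's real API.
    """
    while True:
        work_item = claim_work(hook_id)       # read the persistent hook
        if work_item is None:
            time.sleep(30)                    # nothing slung yet; idle cheaply
            continue
        result = run_task(work_item)          # own terminal, own checkout, own context
        commit_and_report(work_item, result)  # results go to git and the tracker,
                                              # never back into a coordinator's context
        # If this session crashes mid-task, the hook still holds the work;
        # the next session reads the hook and picks up where this one died.
```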

REPL vs ODMCR


ODMCR: The Evolution

I've been running Gas Town for two weeks. That's early—too early for proof, but enough for signal. The pattern that emerged is what I'm calling ODMCR: Observe-Dispatch-Monitor-Collect-Retry.

Steve built the machinery and the acronyms (GUPP, MEOW, NDI). ODMCR is just my shorthand for the execution pattern—a way to compare against REPL in conversation. The ideas are his; I'm just a practitioner trying to document what I'm learning.

Underneath ODMCR is Steve's MEOW stack—Molecular Expression of Work. Beads (issues) chain into molecules (workflows), which cook from formulas (templates). Work becomes a graph of connected nodes in Git, not ephemeral context in a model's head. The workflow survives the session.
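
Conceptually, and this is my mental model rather than Beads' actual schema, a molecule is just a dependency graph of small issues that lives in the repo:

```python
from dataclasses import dataclass, field

# Work-as-a-graph, sketched. The point is that the workflow is data committed
# alongside the code, not context held in a model's head.

@dataclass
class Bead:                 # one small, verifiable unit of work
    id: str
    title: str
    status: str = "open"
    depends_on: list[str] = field(default_factory=list)

@dataclass
class Molecule:             # a workflow: beads chained by dependencies
    beads: dict[str, Bead]

    def ready(self) -> list[Bead]:
        """Beads whose dependencies are all closed -- the next work to sling."""
        return [
            b for b in self.beads.values()
            if b.status == "open"
            and all(self.beads[dep].status == "closed" for dep in b.depends_on)
        ]
```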

REPL Phase | ODMCR Phase | What Changed
Read       | Observe     | Multi-source state discovery (issues, git, convoy status)
Eval       | Dispatch    | Async execution to polecats, not blocking
Print      | Monitor     | Low-token polling (~100 tokens vs 80K)
(implicit) | Collect     | Explicit verification of claimed completions
(implicit) | Retry       | Exponential backoff with escalation

The coordinator stays light. ~750 tokens per iteration. Sustainable for 8-hour autonomous runs.
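
One ODMCR pass, sketched in Python. The `tracker`, `dispatch`, `verify`, and `escalate` interfaces are stand-ins for whatever your orchestrator exposes; the point is what the coordinator reads (issue statuses) and never reads (the workers' actual output):

```python
import time

def odmcr_pass(tracker, dispatch, verify, escalate, max_retries: int = 3) -> None:
    """One Observe-Dispatch-Monitor-Collect-Retry pass (hypothetical interfaces)."""
    ready = tracker.ready_issues()                 # Observe: issues, git, convoy state
    for issue in ready:
        dispatch(issue)                            # Dispatch: async, non-blocking

    time.sleep(300)                                # Monitor: low-token polling, not supervision
    for issue in ready:
        status = tracker.status(issue["id"])
        if status == "claimed_done":
            if not verify(issue["id"]):            # Collect: never trust a claimed completion
                tracker.reopen(issue["id"])
        elif status == "failed":                   # Retry: backoff, then a human
            retries = tracker.retry_count(issue["id"])
            if retries >= max_retries:
                escalate(issue["id"])
            else:
                delay = min(60 * 2 ** retries, 3600)
                tracker.requeue(issue["id"], not_before=time.time() + delay)
```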

Steve calls the durability guarantee NDI—Nondeterministic Idempotence. The path through the workflow is unpredictable (agents crash, retry, take different routes), but the outcome converges as long as you keep throwing sessions at the hook. (Caveat: this assumes the task is actually solvable. NDI won't fix impossible requirements or bugs beyond the model's capability. For unsolvable tasks, the system detects spiral patterns—same error 3+ times—and escalates to human review rather than burning tokens indefinitely.) The pattern resembles Temporal's durable execution in intent—though without Temporal's enterprise hardening—achieved through Git-backed state instead of event sourcing.

REPL assumes you're present. ODMCR assumes you're not.


The 40% Rule

The research is clear: never exceed 40% context utilization. Chroma's 2025 Context Rot study and the ACE framework both show that LLMs don't degrade gracefully—they hit cliffs. I codified this in 12-Factor AgentOps after burning $547 in a single day. Same principle as not running servers at 95% CPU: you need headroom for the unexpected.

The study tested 18 models and found they "do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows." A model advertising 200K tokens typically fails around 130K (~65% utilization). My 40% threshold is conservative—I'd rather leave money on the table than watch an agent spiral.
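
The rule is simple enough to enforce mechanically. A sketch, assuming a crude four-characters-per-token estimate:

```python
def context_budget(advertised_window: int, cap: float = 0.40) -> int:
    """Usable tokens under the 40% rule: a 200K model gets an 80K working budget."""
    return int(advertised_window * cap)

def fits_budget(prompt: str, advertised_window: int = 200_000) -> bool:
    estimated = len(prompt) // 4          # rough heuristic, not a real tokenizer
    return estimated <= context_budget(advertised_window)

# If a task's context won't fit under the budget, split the task.
# Raising the cap just moves you closer to the cliff.
```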

The failure modes are brutal: context pollution (irrelevant info drowns the signal), context confusion (model can't distinguish instructions from data), context poisoning (hallucinations infect subsequent reasoning). The ACE research adds two more: brevity bias (systems drop domain insights for concise summaries) and context collapse (iterative rewriting erodes details over time). Every REPL iteration compounds these losses. ODMCR sidesteps this: workers write to Git, not back to a coordinator's context. No summarization chain. No iterative rewriting. The detail lives in the commit, not in compressed memory.

As Factory.ai puts it: "Effective agentic systems must treat context the way operating systems treat memory and CPU cycles: as finite resources to be budgeted, compacted, and intelligently paged."

Gas Town operationalizes this. Each polecat runs under 40% context. The Mayor stays light. Failures get isolated, not cascaded. It's the same pattern we use in Kubernetes—resource limits and quotas, not infinite scaling.

The 40% cliff


The Numbers

Two weeks. Same workloads across 8 repositories (personal projects—mix of TypeScript, Python, and Go, ranging from CLI tools to web apps). Week one: Ralph pattern (REPL). Week two: Gas Town (ODMCR). Caveat: personal projects have no compliance overhead, no legacy code, no team coordination. Enterprise results will vary. Here's what I tracked:

Metric                                 | Ralph (REPL)      | Gas Town (ODMCR)
Issues closed                          | 203               | 1,005
Multi-file changes                     | 34 (17%)          | 312 (31%)
Rework rate (issues needing follow-up) | 31%               | 18%
Cost per issue                         | $2.47             | $0.89
Total spend                            | ~$501             | ~$894
Human intervention                     | Constant (hourly) | 3x daily check-ins (~2.5 hrs total)

Metrics comparison

Yes, total spend went up ~78%. I bought more compute. The question is whether you get proportional value—and 5x throughput for 1.8x cost is a trade I'd make again. The per-issue efficiency is what compounds.

The 5x throughput difference isn't magic—it's parallelism. Ralph runs one agent. Gas Town runs eight. "But couldn't you just run 8 Ralph loops?" No. Parallel REPL loops have no coordination—they'll step on each other's work, create merge conflicts, and duplicate effort. You need shared state (the issue tracker) and explicit work assignment (the hook system) to make parallelism productive.

"What about 8 Claude Code sessions with a human coordinator and a shared Trello board?" That's closer—and probably better than dumb Ralph loops. But you're still the bottleneck. You're context-switching between 8 terminals, manually assigning work, resolving conflicts, and your context window is worse than the model's. The factory pattern automates that coordination layer. The human reviews output, not manages workflow.

Here's what surprised me: the multi-file success rate. Not just more attempts—a higher rate. REPL agents choke on cross-file changes because context accumulates across the session. ODMCR agents start fresh each task, isolated, focused. This is why decomposition helps even though the underlying models are the same ones that struggle on SWE-Bench Pro—each polecat gets a clean context for each issue, avoiding the accumulation that tanks performance.

Yes, many issues were small. That's intentional. The factory pattern decomposes big work into small, verifiable units. Each unit is trivial. The orchestration makes them compound.

Here's the honest confound: decomposition is a lever. You can break work into small issues with any tool—GitHub Projects, Trello, a text file. But decomposition without orchestration gives you flow improvement without stock accumulation. The beads database does three things a Trello board can't: automatic dispatch (GUPP means no human context-switching), crash recovery (hooks persist across sessions), and the knowledge flywheel (issues become patterns become formulas become institutional memory). The orchestration layer is what turns decomposed issues into compounding wisdom. (A rigorous test would randomize the same issues across both patterns. I didn't do that. The comparison is workload-equivalent, not issue-identical.)

Opus 4.5 did the heavy lifting—both coordination and implementation. Sonnet handled simpler tasks. The key isn't which model does what; it's that each agent runs in isolation. No shared context. No pollution between tasks.

The savings compound. Lower cost per issue means you can attempt more. Lower rework means less wasted effort. Scheduled check-ins means you can actually sleep.

Here's what the metrics don't capture: the knowledge flywheel. Every closed issue becomes a searchable pattern. Every formula that works gets templated. After two weeks, I have 47 reusable formulas and 1,005 issues worth of searchable history. When I start a new task, I search the beads database (grep + embeddings via MCP) to surface similar past work. The system doesn't just execute—it accumulates wisdom. The beads database is institutional memory that survives session crashes, context limits, and model version changes. That's the real stock of the system. Throughput is a flow metric. Wisdom is a stock that compounds. (The flywheel can also compound mistakes—bad patterns propagate if you don't prune them. I've had to tombstone a few formulas that encoded wrong assumptions. Garbage in, garbage out, but at least it's searchable garbage.)

Methodology note: This was a sequential pilot, not a controlled experiment. Week 2 benefited from Week 1 learnings—better decomposition skills, tooling familiarity, and (honestly) picking issues I'd learned worked well with the pattern. A proper A/B test would randomize issue assignment across both patterns simultaneously. The comparison shows what a practiced user can achieve, not what a newcomer will see on day one. Cost caveat: The $0.89/issue figure excludes approximately 20 hours of Gas Town setup and learning curve. Amortized across the 1,005 issues, that adds ~$0.40/issue in time cost (at a conservative $20/hr—your actual opportunity cost is likely higher). The pattern pays off at scale, not on day one.


When It Fails

Gas Town isn't magic. The failure modes I've hit:

  • Retry spirals: A polecat gets stuck on a problem it can't solve and burns tokens retrying. The fix: validation gates that detect spiral patterns and escalate to human review.
  • Merge conflicts: Two polecats modify the same file from different branches. The fix: better work decomposition—one file, one owner.
  • False completions: An agent marks an issue "complete" when the work is half-done. Models sometimes claim completion prematurely—a known failure mode in agentic systems. The fix: explicit verification steps, not just trusting status reports.
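
For the first failure mode above, the spiral gate can be embarrassingly simple. A minimal sketch; the three-strikes threshold and the crude error fingerprinting are my choices, not Gas Town built-ins:

```python
from collections import Counter

class SpiralGate:
    """Escalate when a worker keeps hitting what looks like the same error."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.counts: Counter[str] = Counter()

    def should_escalate(self, error_text: str) -> bool:
        # Whitespace-normalize and truncate so near-identical tracebacks match.
        fingerprint = " ".join(error_text.lower().split())[:200]
        self.counts[fingerprint] += 1
        return self.counts[fingerprint] >= self.threshold

# gate = SpiralGate()
# if gate.should_escalate(latest_error): stop retrying, open a human-review issue
```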

The numbers from my pilot: ~15% of polecat sessions hit context limits and needed restart. ~8% of "completed" issues required follow-up fixes. 4 merge conflicts needed manual resolution (0.4% of issues—low because decomposition assigns one file per issue when possible, and the 8 repos had minimal overlap). Not perfect—but failures stayed isolated instead of cascading.

Every system fails. The question is how it fails. REPL failures cascade—one bad context poisons everything. ODMCR failures isolate. You lose one polecat's work, not the whole convoy.

Platform risk is real too: Gas Town is Steve's side project, not enterprise software. If Claude Code's API changes, or Anthropic shifts rate limits, the tooling breaks. The mitigation is that the pattern is portable—isolation, async dispatch, Git-backed state—even if this specific implementation isn't. I'm betting on the architecture, not the vendor.


Why Not Just Wait?

Models improve fast. Claude 5 might handle multi-file tasks natively. Why build orchestration complexity?

Because the problem isn't model intelligence. It's architecture.

TSMC doesn't make one giant chip. They make billions of small ones with extreme precision. "But code isn't standardized widgets!" True. The decomposition is the hard part—breaking "design auth system" into "add JWT validation to middleware," "create token refresh endpoint," "write integration tests." That's what the issue tracker and workflow templates are for. The factory pattern—decomposition, parallelism, quality gates—scales regardless of how good individual components get.
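
Here's what that decomposition looks like as data rather than prose. The titles, file paths, and verification criteria are illustrative; the shape is what matters: one file per issue where possible, explicit dependencies, explicit checks.

```python
# "Design auth system" decomposed into small, single-owner, verifiable issues.
auth_epic = {
    "id": "auth-0",
    "title": "Design auth system",
    "children": [
        {"id": "auth-1", "title": "Add JWT validation to middleware",
         "files": ["src/middleware/auth.ts"], "verify": "middleware unit tests pass"},
        {"id": "auth-2", "title": "Create token refresh endpoint",
         "files": ["src/routes/token.ts"], "verify": "endpoint integration tests pass",
         "depends_on": ["auth-1"]},
        {"id": "auth-3", "title": "Write auth integration tests",
         "files": ["tests/auth.test.ts"], "verify": "CI green",
         "depends_on": ["auth-1", "auth-2"]},
    ],
}
```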

Better models won't eliminate the need for orchestration. They'll make orchestration more powerful. A factory of Opus 5 agents will outperform a single Opus 5 agent the same way a factory of Opus 4.5 agents outperforms a single one today.

The pattern is the moat, not the model.


What This Means

Steve predicts 2026 brings a new class of "100x engineer"—his term, not mine—people who've figured out how to wield coding agent orchestrators effectively. I'm not there yet; I'm running 8 polecats, not 100. But the trajectory is clear: if 8 agents give me 5x throughput, what about 100? The math isn't linear—you'd hit new bottlenecks (git contention, review capacity, issue decomposition overhead). But the pattern is designed for that scale, even if I haven't proven it yet.

The bottleneck isn't the agents. It's how fast humans can review code and agree to take responsibility for it. Parallel agents can't overcome human review capacity limits.

Which means the game isn't "more agents." It's better orchestration of the agents you have.

Deloitte predicts that 40% of agentic AI projects could be cancelled by 2027 due to "cost, scaling complexity, and unanticipated risks." Orchestration isn't a silver bullet—plenty of orchestrated projects will fail too. But projects without orchestration hit the scaling wall earlier. The ones that survive will have figured out how to manage cost and complexity at scale.


The Factory vs Worker Moat

What about LangGraph? CrewAI? AutoGen? OpenAI Swarm?

They're solving coordination, not orchestration. Coordination is "how do agents talk to each other." Orchestration is "how do you run 100 agents without going broke."

Many multi-agent frameworks default to coordinator-centric patterns where results flow to a central context. The question isn't which framework, it's which architecture.

Gas Town is opinionated about a different architecture: workers never return results to the coordinator. They write to Git. The coordinator reads status, not output. That's the key insight—and it doesn't require Gas Town. You could build it on any framework.

The pattern—isolation, async dispatch, low-token monitoring, explicit verification—will show up everywhere. The question is whether you're building it from scratch or learning from the people who already burned the money.


Try It

The stack:

  • Beads - Git-backed issue tracking for agents. The shared memory that survives sessions.
  • Gas Town - Orchestration system. The factory.
  • Vibe Coding - Gene Kim & Steve. The philosophy.

Fair warning: Gas Town assumes you're already comfortable with AI coding tools. Steve's "Stage 7" means you're running agents autonomously, trusting them with commits, reviewing output rather than watching execution. If you're still prompting manually and reviewing every change, this will feel like overkill.

This article is n=1. Steve built the system; I'm an early adopter documenting my experience. If you try it and get different results—better or worse—I want to hear about it. The pattern needs more data points than just mine.

The Hacker News thread is full of skeptics comparing it to early blockchain hype.

They might be right. Or they might be missing what containers' critics missed: the specific technical advantages that compound over time. For Gas Town, those are isolation (failures don't cascade), durability (work survives session crashes), and coordination (shared state without shared context). Time will tell if those advantages matter as much as portability and process isolation did for containers.


The Prediction

The REPL got us here. Orchestration gets us where we're going.

2026 is the year of the orchestrator. By end of year, every major AI coding tool will ship with orchestration primitives. Not just chat history or memory features—those exist now. I mean: multi-agent coordination (dispatching work to parallel workers), persistent workflow state (surviving session crashes), or declarative work queues (expressing intent, not steps). At least two of Cursor, Copilot, Claude Code, Replit, or Windsurf will have these capabilities in production. The single-agent REPL will still exist for simple tasks, but serious production use will look closer to ODMCR: observe, dispatch, monitor, collect, retry. Isolation over integration. Status boards over detailed reports. Production lines over workbenches.

The winning pattern won't be the smartest model. It'll be the best orchestration of whatever models exist.

Here's where I'm headed: Hephaestus, a Kubernetes operator for knowledge work. Gas Town is the CLI proving ground—the kubectl before the control plane. The patterns I'm testing now (ODMCR, the 40% rule, model-tiered routing) will become CRDs and controllers. Declare your desired state, let the operator reconcile.

We've done this before. Twenty years of DevOps and SRE taught us how to make unreliable things reliable. AI agents are just the newest unreliable component. The playbook is the same: operational discipline, observability, declarative state, automated recovery.

This is my bet. I'm building toward it. The experiment is running.

I've kicked off waves of 8 parallel agents and watched the convoy dashboard tick green for hours. When it works—and it doesn't always—I'm reviewing PRs, not babysitting prompts. Still work, but different work.

That's the future I'm building toward. The workshop becoming a production line, then maybe a factory.

Welcome to Gas Town


The foundation: 12-Factor AgentOps—infrastructure patterns for AI workflows. The journey: my vibe coding devlog series has the $10K burn, the Ralph collapse, and the practical .claude/ setup. This piece is the prediction.

Building orchestration systems for AI agents? Let's compare notes.