Devlog #1: What $10k in AI Tokens Actually Produced

January 12, 2026 · 10 min read
#vibe-coding #ai-development #devlog #gas-town #productivity

First, locate yourself.

Steve Yegge's Welcome to Gas Town lays out 8 stages of AI-assisted coding. Where are you?

[Image: Vibe Coding Stages]

Gas Town is Stage 8. As Yegge puts it: "Gas Town is an industrialized coding factory manned by superintelligent robot chimps, and when they feel like it, they can wreck your shit in an instant."

This devlog is for Stages 5–8. If you're juggling multiple Claude Code instances, wondering why nobody else seems to be struggling with the same things you are, stick around. This is for you.

If you're at Stages 1–4, this will be interesting but probably not actionable yet. Bookmark it. Come back when you're running 3+ agents and feeling the chaos.

If you think AI coding is a fad, or that "real programmers" don't need it, I hear you. The hype is deafening, but don't dismiss this just because you don't trust the output yet. We've been here before. We said the same thing about compilers. We said it about garbage collection. The concerns were valid then, and they're valid now. But the abstraction layer moved up anyway. It's happening again.

Still here? OK. Let's go.


What I Actually Built

I started vibe coding in October: coding by feel, by flow, by orchestration rather than typing. I didn't come up for air until January.

Here's what $10K in tokens produced:

12-Factor AgentOps - A full methodology applying DevOps and SRE principles to AI reliability. 12 factors, 9 laws, a documentation site, and a framework I'm betting my career on. Gene Kim taught me how to think about systems. Steve Yegge showed me the architecture. This is me trying to make AI as reliable as infrastructure.

vibe-check - An npm CLI that analyzes your git history for AI coding failure patterns. Debug spirals, context amnesia, trust violations. 24 versions shipped. It catches the 18% of tasks that need a second pass.
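
To make "failure patterns" concrete, here's a minimal sketch of one such check. This is not vibe-check's actual code, just the idea: a run of near-identical commit messages is a decent proxy for a debug spiral. The threshold and run length are arbitrary.

```python
# Sketch: flag a possible "debug spiral" in recent git history.
# Not the real vibe-check implementation, just the idea: several consecutive
# commits with near-identical messages usually mean the same fix is being
# retried over and over.
import subprocess
from difflib import SequenceMatcher

def recent_subjects(n: int = 50) -> list[str]:
    out = subprocess.run(
        ["git", "log", f"-n{n}", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

def detect_spiral(subjects: list[str], threshold: float = 0.8, run: int = 5) -> bool:
    streak = 1
    for prev, cur in zip(subjects, subjects[1:]):
        if SequenceMatcher(None, prev, cur).ratio() >= threshold:
            streak += 1
            if streak >= run:
                return True
        else:
            streak = 1
    return False

if __name__ == "__main__":
    if detect_spiral(recent_subjects()):
        print("Possible debug spiral: several near-identical commits in a row.")
```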

bodenfuller.com - Rebuilt my personal site from scratch. Cataloged 138 books that shaped how I think. Used AI to make sense of a decade of infrastructure work and figure out where it's all going.

7 Gas Town PRs - 2,400 lines of Go merged upstream. Boot role detection, remote branch cleanup, test coverage, package docs. Learning Go and TypeScript by contributing, not tutorials.

An internal platform - My actual day job. We're talking proper PaaS engineering: Kubernetes operators in Go, massive Python ETL pipelines, container orchestration at scale. For example, I built a K8s operator that spins up ephemeral, isolated 'agent sandboxes' for every ticket - giving each agent a pre-warmed environment that is nuked the moment the PR merges. Can't share the code, but this is the magnum opus. It's the multi-year project I'll be pitching internally and building for the foreseeable future.
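
I can't share the operator itself, but here's a rough sketch of the sandbox idea, written against the official Kubernetes Python client rather than the real Go operator: one namespace per ticket, deleted when the PR merges. The ticket ID, labels, and pre-warm step are placeholders.

```python
# Sketch of the ephemeral-sandbox idea (not the real operator): one namespace
# per ticket, deleted when the PR merges. Requires the `kubernetes` package
# and a reachable cluster; names and labels are illustrative.
from kubernetes import client, config

def create_sandbox(ticket_id: str) -> str:
    config.load_kube_config()   # or load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    name = f"sandbox-{ticket_id.lower()}"
    ns = client.V1Namespace(
        metadata=client.V1ObjectMeta(name=name, labels={"owner": "agent", "ticket": ticket_id})
    )
    core.create_namespace(ns)   # pre-warming pods/volumes would happen after this
    return name

def destroy_sandbox(name: str) -> None:
    config.load_kube_config()
    client.CoreV1Api().delete_namespace(name)   # nuked the moment the PR merges

# create_sandbox("TKT-1234") on ticket open; destroy_sandbox(...) on merge.
```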

Automation pipeline - Running in production. Our devs are using it. Plans to open-source once it's battle-tested enough.

[Image: Stash]

That's the portfolio. Three months. Two subscriptions. Not bad for someone who was "just trying Claude Code."

Here's the thing about vibe coding that nobody warns you about: it's not a productivity hack. It's a lifestyle change. Once you see how fast you CAN go, going back to normal feels like running through honey.

I've been chasing that speed for three months. Sometimes I catch it. Sometimes I burn $500 in an afternoon and have nothing to show for it except a lesson I should've already known.

Every AI coding session starts from scratch. The model doesn't remember what you were working on yesterday. It doesn't know what already failed. It's Drew Barrymore waking up fresh every morning, except instead of falling in love with you, it's falling in love with solutions you already tried that don't work.

The community calls this the "50 First Dates problem." I spent my first month trying to solve it.

What worked:

  • CLAUDE.md files with project context
  • Persistent directories (.agents/research/, .agents/plans/), sketched after these lists
  • Vibe levels (a calibration system for how much to trust the model)

What didn't:

  • Hoping the model would "just remember"
  • Ever-longer system prompts (context rot is real)
  • My first orchestrator, which I called "the kraken" but should have called "the money pit"
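
Of the things that worked, persistent directories are the fix that stuck. Here's a minimal sketch of the idea, with made-up paths and field names: leave a small state file behind so the next session starts with yesterday's plan instead of amnesia.

```python
# Minimal sketch of the "persistent directories" idea: write a small state file
# the next session can read. Paths and field names are illustrative.
import json
from datetime import date
from pathlib import Path

STATE = Path(".agents/plans/state.json")

def save_state(done: list[str], failed: list[str], next_up: list[str]) -> None:
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps(
        {"updated": date.today().isoformat(), "done": done, "failed": failed, "next": next_up},
        indent=2,
    ))

def load_state() -> dict:
    # Fed into the next session's prompt alongside CLAUDE.md.
    return json.loads(STATE.read_text()) if STATE.exists() else {}
```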

[Image: Memory]

I was the bug in my own system. The architecture was fighting me. I kept adding duct tape instead of fixing the foundation.


Ralph Wiggum, RIP

Then Ralph Wiggum dropped. If you don't know Ralph, it's a loop pattern: you have Claude write to files, read its own output, and iterate until tests pass. Viral for a reason. I was a Ralph power user.
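
For anyone who hasn't seen it, here's roughly what the loop looks like. This is a sketch, not Ralph's actual implementation; the headless `claude -p` call and the iteration cap are my assumptions for illustration.

```python
# Sketch of a Ralph-style loop: run the tests, hand failures to the agent,
# repeat until green or the cap is hit.
import subprocess

MAX_ITERATIONS = 10   # a hard stop, added here so the sketch can't spiral forever

def run_tests() -> tuple[bool, str]:
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def ask_agent(prompt: str) -> None:
    # Assumes a headless `claude -p "<prompt>"` invocation; the agent edits files on disk.
    subprocess.run(["claude", "-p", prompt], check=False)

for i in range(MAX_ITERATIONS):
    passed, output = run_tests()
    if passed:
        print(f"Green after {i} iteration(s).")
        break
    ask_agent("The test suite is failing. Fix the code.\n\n" + output[-4000:])
else:
    print("Hit the iteration cap; a human needs to look at this.")
```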

Ralph worked great for about two weeks. Then I started running it overnight, waking up to finished work, and feeling very clever.

That's when the quality collapsed.

Here's the failure mode: the agent spirals on the same bug for hours, confident it's making progress. It'll try the same fix seventeen times with minor variations, each time expecting different results. It's not insanity; it's optimism without memory.

I burned $547 in a single day on December 8th. That was my "holy shit" moment.

[Image: Ralph v Gas Town]


Gas Town: The Architecture Shift

In January, Steve Yegge released Gas Town. It's an orchestration system for AI agents, and the core insight clicked immediately:

Don't have agents return their work to a coordinator.

This sounds simple, but it changes the physics of the system. Every multi-agent system I'd built had the workers reporting back to a boss. The boss's context fills up. Everything slows down. The boss becomes the bottleneck.

[Image: Beads Network]

Gas Town's architecture is different. Each worker (Yegge calls them "polecats") runs in complete isolation.

  • Own terminal.
  • Own copy of the code.
  • Results go straight to git and a shared issue tracker.

The coordinator (Opus or Sonnet) just reads status updates from the tracker. It never loads the actual work. It doesn't know what the code looks like; it knows what the tickets look like. It's managing work, not syntax.

It's like the difference between a raid leader who's also tanking, healing, and doing DPS versus a raid leader who just calls out mechanics while the team executes. (If you don't play MMOs, just trust me: the second one scales better.)
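
In code terms, the coordinator's whole job reduces to something like the sketch below. The JSON tracker file and status names are stand-ins (the real tracker is Beads); the point is that the loop only ever touches ticket metadata, never diffs or worker transcripts.

```python
# Sketch of "manage work, not syntax": the coordinator reads ticket status,
# never worker output. The tracker file and statuses are hypothetical.
import json
from pathlib import Path

TRACKER = Path(".agents/tracker.json")   # hypothetical path

def tickets() -> list[dict]:
    return json.loads(TRACKER.read_text()) if TRACKER.exists() else []

def coordinate(dispatch, max_workers: int = 5) -> None:
    current = tickets()
    open_tickets = [t for t in current if t["status"] == "open"]
    in_flight = [t for t in current if t["status"] == "in_progress"]
    blocked = [t for t in current if t["status"] == "blocked"]

    for t in blocked:
        print(f"Needs a human: {t['id']} - {t['title']}")

    # Hand out work only; results land in git and the tracker, never here.
    for t in open_tickets[: max(0, max_workers - len(in_flight))]:
        dispatch(t["id"])   # spin up an isolated worker for this ticket
```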

I stopped everything and rebuilt my workflow around Gas Town. One week later, I submitted my first PR upstream.


The Gatekeeper

Gas Town gave me parallelization. But I still had the quality problem (the "Ralph" factor).

So I built a gatekeeper.

[Image: Gatekeeper]

Every push goes through strict validation: type checks, linting, complexity analysis, builds. The gatekeeper doesn't care about your feelings. It doesn't care about your deadlines. It's successive refinement: fail, fix, repeat, until the code is clean or you give up.
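
A gatekeeper doesn't need to be fancy. Here's a minimal sketch; the specific tools are examples, not my exact pipeline, and in practice the failing output gets handed straight back to the agent for the next pass.

```python
# Sketch of a gatekeeper: every check must pass before the push is accepted.
# The commands are examples; swap in whatever your stack uses.
import subprocess
import sys

CHECKS = [
    ("type check", ["mypy", "."]),
    ("lint",       ["ruff", "check", "."]),
    ("tests",      ["pytest", "-q"]),
]

def gate() -> int:
    for name, cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Reject and hand the failure back to the agent for another pass.
            print(f"GATE FAILED: {name}\n{result.stdout}{result.stderr}")
            return 1
    print("Gate passed. Push accepted.")
    return 0

if __name__ == "__main__":
    sys.exit(gate())
```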

This is the part that people miss about AI-generated code: it looks right. It types right. It's confidently, beautifully wrong.

18% of my tasks last week needed a second pass. That's better than 30% with Ralph. But it means almost one in five times, the first answer is garbage.

If you're not validating AI output, you're not shipping code. You're shipping hope.


The Numbers (One Week)

[Image: Numbers]

I tracked everything January 4–11 using ccusage. These are token usage values: what it would cost on API pricing. Actual spend is two $200/month subscriptions.

The model strategy is shifting.

I used Opus 4.5 exclusively once it dropped. It handles the coordination and the heavy lifting. But for pure execution (running the loops, checking the errors), it's overkill. I'm currently transitioning those steps to Haiku.

That tiny slice of Haiku closed hundreds of issues. The expensive model coordinates while the cheap model grinds. Same pattern as raid composition: you don't bring 20 tanks.
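
The routing itself can be almost embarrassingly simple. An illustrative sketch; the task kinds and model names are made up, not a fixed scheme.

```python
# Sketch of the split: the expensive model coordinates, the cheap one grinds.
ROUTES = {
    "plan":    "opus",     # coordination, design, gnarly merges
    "review":  "sonnet",   # middle ground for review passes
    "execute": "haiku",    # run the loop, check the errors, close the issue
}

def model_for(task_kind: str) -> str:
    return ROUTES.get(task_kind, "haiku")   # default to the cheap grinder
```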

But the highlight reel hides the failures. On Day 1, I burned through tokens learning dispatch patterns the hard way. On Day 3, a merge conflict required manual intervention across six branches. Two polecats got stuck in debug spirals I didn't catch for hours.

This is the real story. Not the 1,005 issues. The 18% rework. The moments where I wanted to throw my laptop out the window.

[Image: Gate]


The Factory Floor: The 40% Rule

There's a pattern here that keeps showing up, and it's not unique to AI.

Toyota figured this out in the 1950s: run production lines at 40% utilization and you get zero defects. Push to 60% and defects increase 400%. The extra capacity isn't waste; it's what enables continuous improvement.

Same pattern shows up everywhere failure is catastrophic: aviation fuel reserves, ICU occupancy, portfolio risk. The 40% threshold isn't arbitrary. It's where systems shift from linear performance to cliff-like failure.

Gas Town applies this to code. Each polecat runs under 40% context. The coordinator stays light. Failures get isolated.

TSMC dominates semiconductors because they figured out yield optimization at scale. The differentiation isn't the machines; everyone buys those. It's the operational discipline that produces consistent quality at high throughput.

The analogy isn't perfect. But the principle transfers.


What I Think I Know

These are current beliefs. Ask me again in a month.

1. The 40% Rule Is Real

AI tools perform well when kept below 40% of their context window. Above that, failures compound exponentially. If you're not tracking context usage, start.
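
If you want a starting point, even a crude gauge beats nothing. A sketch, assuming a chars/4 token estimate and a 200K window; swap in your model's real tokenizer and limit.

```python
# Crude context-utilization gauge against the 40% rule.
CONTEXT_WINDOW = 200_000   # tokens (illustrative)
THRESHOLD = 0.40

def estimate_tokens(text: str) -> int:
    return len(text) // 4   # rough heuristic, good enough for a gauge

def utilization(transcript: str) -> float:
    return estimate_tokens(transcript) / CONTEXT_WINDOW

def should_hand_off(transcript: str) -> bool:
    # Past 40%: summarize state to disk and start a fresh session.
    return utilization(transcript) >= THRESHOLD
```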

2. Isolation > Cleverness

I spent weeks trying to build clever coordination. Turns out dumb isolation wins. Separate copies of the code mean no merge conflicts during parallel work. Failures don't cascade. Kill and restart workers without affecting others. Cattle, not pets.
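
One cheap way to get "own copy of the code" on a single machine is git worktrees; whether you use worktrees or full clones, the isolation principle is the same. A sketch with illustrative branch and directory names:

```python
# Illustrative: each worker gets its own worktree and branch, disposable on demand.
import subprocess

def spawn_workspace(ticket_id: str) -> str:
    path = f"../workers/{ticket_id}"
    subprocess.run(["git", "worktree", "add", "-b", f"agent/{ticket_id}", path], check=True)
    return path   # point the worker's terminal at this directory

def reap_workspace(ticket_id: str) -> None:
    # Cattle, not pets: kill and restart without touching anyone else's copy.
    subprocess.run(["git", "worktree", "remove", "--force", f"../workers/{ticket_id}"], check=True)
    subprocess.run(["git", "branch", "-D", f"agent/{ticket_id}"], check=False)
```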

3. Memory Changes Everything

Before Beads (our git-backed issue tracker), every session started from zero. Now there's persistent state-what's done, what failed, what's blocked. The AI wakes up knowing the plan.

4. Validation Isn't Optional: Successive Refinement

Models are trained to be about 85% correct. That means they will lie to you. They will say "fixed" when it's broken. They will say "tests passed" when they didn't run.

You need intuition for when the AI is lying, and you need a system that forces successive refinement. Fail, fix, repeat. It often takes multiple passes to bridge that last 15%. This is where the 12 failure patterns come in: you need to catch the drift before it becomes debt.

[Image: Flywheel]

5. The Flywheel Might Be Real

Here's the bet I'm making: every dollar I spend makes the system smarter. The retros become searchable wisdom. Patterns get promoted. Standards keep things consistent. Work produces byproducts, byproducts become knowledge, knowledge enables better work.

I'll tell you in a year if it's true.


The New Job: AI Coding Foundries

I've realized something over the last three months.

My job isn't to write code anymore. My job is to build and run AI coding foundries.

I am designing the factory floor. I am tuning the conveyor belts. I am watching the defect rates and adjusting the acceptance criteria. The code is just the output of the machine I'm building.

It's a different skill-more systems engineering than software engineering. But when the foundry is humming, it feels like the future.

[Image: Coding Foundry]


Try It

The Stack:

What I'm Building:


What's Next

This is devlog #1. I'm documenting as I go.

Next up:

  • Full breakdown of my .claude/
  • My journey from 60+ agents down to a select few
  • How the Claude Code 2.1 command/skill changes forced me to change my workflow yet again

The landscape changes fast. Claude Code 2.1 shipped January 7th. By the time I write the next post, half of what I said here might be obsolete.

That's the game right now.


Devlog #1. Two subscriptions. $10K in tokens. Here's what that produced.