Building vibe-check

November 29, 2025 · 13 min read
#vibe-coding #ai-development #developer-tools #open-source

Andrej Karpathy coined the term "vibe coding" for letting AI write code while you direct. Gene Kim and Steve Yegge turned it into a methodology that answers the question pure vibe-coding ignores: when can you trust AI output, and when do you need to verify every line?

Their framing: you're the head chef directing AI sous chefs. Their book documents 12 failure patterns where vibe coding destroys work, from AI claiming tests pass when it never ran them to 3,000-line functions that become unmaintainable. The methodology exists to avoid them.

I built tooling around this idea. (See how I built this website using the methodology.)

// vibe-check: from zero to v1.5.0
duration: 16.4 hours
commits: 56
features: 24
deleted: 1,563 lines

Nov 28 · 13:27  MVP      First commit: basic CLI with 5 core metrics
Nov 28 · 13:51  v1.0.2   Published to npm in 24 minutes
Nov 28 · 17:07  AI TRAP  Started building ML prediction system
Nov 28 · 18:04  ...      Ordered logistic regression, ECE calibration
Nov 28 · 20:58  v2.0.0   Shipped with ML, but something felt off
Nov 28 · 22:25  ...      Writing tests to validate ML accuracy
Nov 29 · 10:25  ...      Tweaking ML thresholds, still debugging
Nov 29 · 15:35  CAUGHT   Realized I was debugging, not building. 1,563 lines gone.
Nov 29 · 15:46  FLOW     Watch mode: real-time spiral detection
Nov 29 · 16:01  +15m     Baseline comparison against your history
Nov 29 · 16:25  +24m     Pattern memory: learn your spiral triggers
Nov 29 · 16:45  +20m     Interventions: track what breaks spirals
Nov 29 · 17:33  v1.5.0   5 real features in 1h 58m post-delete

The Core Insight

AI reliability varies by task type. Formatting is nearly always correct; architecture needs line-by-line verification. The vibe levels answer the question: when can you safely let the AI drive, and when do you need to verify every line?


Vibe Levels

The idea is to pick a level before you start working:

Level   Trust   Verification    Example Tasks
L5      95%     Final only      Formatting, linting
L4      80%     Spot check      Boilerplate, config
L3      60%     Key outputs     CRUD, standard tests
L2      40%     Every change    Features, integrations
L1      20%     Every line      Architecture, security
L0      0%      N/A             Novel research

What they look like:

L5 · 95% trust    L3 · 60% trust    L1 · 20% trust

Declaring the level upfront forces you to think about what kind of task you're doing. After the session, you can compare what actually happened to what you expected, and over time your intuition about which level to use gets better.


The Five Core Metrics

All of these come from git history rather than code content. You can't game timestamps.

1. Trust Pass Rate

What it measures: The percentage of commits that don't require an immediate fix.

When you trust AI output and commit it, does it actually work? A high trust pass rate means your vibe level calibration is accurate: you're trusting AI on tasks where it's reliable. A low rate means you're over-trusting on complex work.

Calculation: Count commits where no fix commit follows within 10 minutes. Divide by total commits.

Rating   Range    What It Means
ELITE    90%+     Your vibe level intuition is well-calibrated
HIGH     75-89%   Solid, occasional miscalibration
MEDIUM   50-74%   Trusting AI on work that needs more verification
LOW      <50%     Either under-verifying or taking on L1/L0 work at higher levels
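vibe-check's internals aren't shown in this post, but the calculation is simple enough to sketch. A rough version, where the `Commit` shape, the fix-detection regex, and the 10-minute window are taken from the description above rather than from the tool's source:

typescript

// Rough sketch of the Trust Pass Rate calculation described above.
// The Commit shape and the fix-detection regex are assumptions for illustration.
interface Commit {
  message: string;
  timestamp: Date; // e.g. parsed from `git log --pretty=format:%cI|%s`
}

const TEN_MINUTES_MS = 10 * 60 * 1000;

function trustPassRate(commits: Commit[]): number {
  // commits are assumed sorted oldest -> newest
  const passed = commits.filter((commit, i) => {
    const next = commits[i + 1];
    const quickFixFollows =
      next !== undefined &&
      /\bfix\b/i.test(next.message) &&
      next.timestamp.getTime() - commit.timestamp.getTime() <= TEN_MINUTES_MS;
    return !quickFixFollows;
  });
  return commits.length === 0 ? 0 : (passed.length / commits.length) * 100;
}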

2. Rework Ratio

What it measures: Fix commits as a percentage of total work.

Some rework is healthy; zero fixes probably means you're over-verifying. But when rework climbs above 25%, you're spending more time correcting than building. This is the canary in the coal mine for AI reliability issues.

Calculation: Commits with "fix" in the message divided by total commits.

Rating   Range    What It Means
ELITE    <10%     Building forward, minimal corrections
HIGH     10-20%   Healthy iteration
MEDIUM   20-30%   Worth investigating which components cause fixes
LOW      >30%     You're debugging, not building
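The same commit list gives you the rework ratio in a couple of lines. A sketch, again assuming fix commits are identified by their message:

typescript

// Rework ratio sketch: fix commits over total commits, as a percentage.
function reworkRatio(messages: string[]): number {
  const fixes = messages.filter((m) => /\bfix\b/i.test(m)).length;
  return messages.length === 0 ? 0 : (fixes / messages.length) * 100;
}

// reworkRatio(["feat: add watch mode", "fix: handle empty repo"]) -> 50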

3. Debug Spirals

What it measures: Consecutive fix commits on the same component.

A spiral is three or more fix commits in a row on the same area of code. One fix is normal. Two fixes happen. Three fixes means you're not solving the root problem: you're patching symptoms while the AI keeps generating broken code.

Calculation: Detect sequences of 3+ commits where (a) all have "fix" in the message, and (b) all touch the same file or component.

Rating   Range       What It Means
ELITE    0 spirals   Clean execution, problems solved on first attempt
HIGH     1-2         Occasional complexity, but you break out
MEDIUM   3-5         Pattern worth examining: what triggers your spirals?
LOW      6+          You're fighting the AI instead of directing it
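A sketch of how that detection could work. Reading the component from a conventional-commit scope (the `auth` in `fix(auth): ...`) is my assumption; grouping by touched files would work the same way:

typescript

// Sketch of debug spiral detection: 3+ consecutive fix commits on one component.
interface FixableCommit {
  message: string;
  timestamp: Date;
}

function detectSpirals(commits: FixableCommit[], minRun = 3): FixableCommit[][] {
  const scopeOf = (msg: string) => /^fix\(([^)]+)\)/i.exec(msg)?.[1] ?? null;
  const spirals: FixableCommit[][] = [];
  let run: FixableCommit[] = [];

  for (const commit of commits) {
    const scope = scopeOf(commit.message);
    const prev = run.length > 0 ? scopeOf(run[run.length - 1].message) : null;
    if (scope !== null && scope === prev) {
      run.push(commit); // same component, spiral continues
    } else {
      if (run.length >= minRun) spirals.push(run);
      run = scope !== null ? [commit] : []; // non-fix commits break the run
    }
  }
  if (run.length >= minRun) spirals.push(run);
  return spirals;
}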

4. Spiral Duration

What it measures: Total time spent stuck in fix loops.

Spirals happen. What matters is how long you stay in them. Five minutes of debugging is fine. Forty-five minutes means you should have stepped back, switched approaches, or dropped to a lower vibe level twenty minutes ago.

Calculation: For each detected spiral, sum the time from first fix commit to last fix commit in the sequence.

Rating   Range           What It Means
ELITE    0 min           No spirals, no time lost
HIGH     <15 min total   Quick recognition and recovery
MEDIUM   15-45 min       You're getting stuck; watch mode would help
LOW      >45 min         Significant time lost; worth reviewing patterns
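Given the spirals from a detection pass like the sketch above, the duration is just first-to-last timestamps summed. A minimal sketch with an assumed commit shape:

typescript

// Spiral duration sketch: first-to-last fix commit in each detected spiral, summed.
function spiralDurationMinutes(spirals: { timestamp: Date }[][]): number {
  return spirals.reduce((total, spiral) => {
    const first = spiral[0].timestamp.getTime();
    const last = spiral[spiral.length - 1].timestamp.getTime();
    return total + (last - first) / 60_000;
  }, 0);
}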

5. Flow Efficiency

What it measures: Percentage of active time spent building rather than debugging.

This is the meta-metric. It combines everything above into a single question: are you in a productive flow state, or are you stuck in the weeds?

Calculation: (Active time - Spiral duration) / Active time × 100

Active time comes from your commit timestamps. If you committed at 9:00 and 9:30, that's 30 minutes of active time. Spiral duration is subtracted from that to get productive building time.

Rating   Range    What It Means
ELITE    90%+     You're the head chef, AI is executing
HIGH     75-89%   Strong flow with minor interruptions
MEDIUM   50-74%   Split between building and debugging
LOW      <50%     More time fighting than building
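Putting the last two pieces together, a rough sketch of the formula above. Summing gaps between consecutive commits as active time follows the 9:00/9:30 example; skipping gaps over an hour is my assumption about how idle stretches might be excluded:

typescript

// Flow efficiency sketch: (active time - spiral duration) / active time * 100.
function activeMinutes(timestamps: Date[], maxGapMinutes = 60): number {
  let total = 0;
  for (let i = 1; i < timestamps.length; i++) {
    const gap = (timestamps[i].getTime() - timestamps[i - 1].getTime()) / 60_000;
    if (gap <= maxGapMinutes) total += gap; // assumed idle-gap cutoff
  }
  return total;
}

function flowEfficiency(activeMins: number, spiralMins: number): number {
  return activeMins === 0 ? 0 : ((activeMins - spiralMins) / activeMins) * 100;
}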

vibe-check: The Tool

vibe-check is a CLI that reads your git history and shows you how you're doing:

bash

npm install -g @boshu2/vibe-check

Or run directly:

bash

npx @boshu2/vibe-check


What You Get

// terminal

$ vibe-check --since "1 week ago"

VIBE-CHECK Nov 21 - Nov 28

Rating:  ELITE
Trust:   94%   ELITE
Rework:  18%   HIGH
Spirals: 1 detected (12 min)

VibeScore: 87%

> TIP:

Run vibe-check profile to see your cumulative stats, achievements, and current streak.


Watch Mode

Run this in a terminal while you work:

bash

vibe-check watch

It monitors your commits and warns you when things look off:

// spiral-detection

09:15  fix(auth) handle token refresh
09:18  fix(auth) add retry logic
09:22  fix(auth) increase timeout

⚠️ SPIRAL DETECTED
Component: auth
Fixes: 3 commits, 7 min
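The actual watch implementation isn't shown here, but the loop it describes (poll git log, look for consecutive fixes on one component) is easy to sketch. The `--since` window and the scope regex are assumptions; the 30-second interval comes from later in the post:

typescript

// Sketch of a watch loop: poll git log on an interval and warn when the last
// few commits are fixes on the same component. Illustrative only.
import { execSync } from "node:child_process";

function recentSubjects(since = "60 minutes ago"): string[] {
  const out = execSync(`git log --since="${since}" --pretty=format:%s`, {
    encoding: "utf8",
  });
  return out.split("\n").filter(Boolean).reverse(); // oldest first
}

function checkForSpiral(): void {
  const lastThree = recentSubjects().slice(-3);
  const scopes = lastThree.map((s) => /^fix\(([^)]+)\)/i.exec(s)?.[1] ?? null);
  if (scopes.length === 3 && scopes[0] !== null && scopes.every((x) => x === scopes[0])) {
    console.warn(`⚠️ SPIRAL DETECTED in "${scopes[0]}": 3 fix commits in a row`);
  }
}

setInterval(checkForSpiral, 30_000); // poll every 30 seconds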


Pattern Memory

The tool learns your specific spiral triggers over time:

// terminal

$ vibe-check profile --patterns

YOUR SPIRAL TRIGGERS

Component   Times   Pattern
auth        5       OAuth/token/refresh issues
database    3       Connection pooling
api         2       External API timeouts

And tracks what interventions work for you:

// terminal

WHAT WORKS FOR YOU

Take a break       4 times (avg 12 min)
Write test first   3 times
Read docs          2 times
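Both views are simple tallies over past sessions. A sketch of the trigger side, with an assumed record shape; interventions would be counted the same way:

typescript

// Sketch of tallying spiral triggers per component across past sessions.
interface SpiralRecord {
  component: string;
  startedAt: Date;
}

function triggerCounts(history: SpiralRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { component } of history) {
    counts.set(component, (counts.get(component) ?? 0) + 1);
  }
  return counts;
}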


Dogfooding: The ML Detour

Day 1: Building the Wrong Thing

Nov 28, 13:27 3127f07: First commit. Scoped package name, basic CLI.

Nov 28, 17:07 f20a614: I start building an ML system. Ordered Logistic Regression to predict which Vibe Level you should use. Expected Calibration Error to measure confidence. The whole academic stack.

Nov 28, 18:04 58c8cf3: ML learning loop implemented. Calibration, ECE, partial fit.

Nov 28, 20:58 6093290: Ship it as v2.0.0 with a level subcommand that predicts your optimal trust level.

Nov 28, 22:25 ec4e8d6: Add unit tests for the ML model.

Total time building ML: 5 hours 18 minutes.

Day 2: The Realization

Nov 29, 15:35 a597615: At L1 (20% trust), I'm staring at my own code.

It uses ml-matrix and calculates Expected Calibration Error. Technically impressive but completely useless.

The ML predicts which Vibe Level to use. You already know what task you're doing. You don't need a model to tell you that OAuth integration is riskier than formatting code.

1,563 lines of math solving a problem that doesn't exist. And I was already dreading the tests I'd need to maintain.

21 hours after implementing it, I delete all of it.

bash

git rm -rf src/recommend/      # Ordered logistic regression
git rm -rf src/calibration/    # ECE calculations
git rm src/commands/level.ts   # The prediction command

Commit: a5976159: feat(v1.3): replace ML level prediction with session workflow

1,563 lines deleted. One commit.

The Productive Sprint

With the ML gone, I have mental space for features that actually matter.

Nov 29, 15:46 9db3e73 (11 minutes later), at L4 (80% trust): Watch Mode

Instead of predicting levels, I build something useful: real-time spiral detection. Poll git log every 30 seconds, warn when you're stuck.

bash

vibe-check watch

Nov 29, 16:01 652da00 (15 minutes later), at L4 (80% trust): Baseline Comparison

Compare your current session to your own historical patterns. No ML required, just simple statistics.

Nov 29, 16:25 301a322 (24 minutes later): Pattern Memory

Track which components trigger your spirals. Learn from your own history.

Nov 29, 16:45 85cd751 (20 minutes later): Intervention Tracking

Record what breaks your spirals. Build a personal playbook.

Nov 29, 17:33 15f1dee: Ship v1.5.0 with gamification, watch mode, pattern memory, and 24 features total.

The Numbers

Metric                       Value
Time building ML             5h 18m
Time ML existed              21h 31m
Lines deleted                1,563
Time from delete to v1.5.0   1h 58m

I spent 5+ hours building something I deleted 21 hours later. In the 2 hours after deleting it, I shipped more useful features than in the entire previous day.

Actual vibe-check output for the vibe-check build:

// vibe-check --since 2025-11-28

Period:  Nov 28 - Nov 30 (16.4h active over 2 days)
Commits: 56 total (24 feat, 3 fix, 7 docs, 22 other)

METRIC                  VALUE      RATING
Iteration Velocity      3.4/hour   HIGH
Rework Ratio            5%         ELITE
Trust Pass Rate         100%       ELITE
Debug Spiral Duration   0 min      ELITE
Flow Efficiency         100%       ELITE

OVERALL: ELITE

The Takeaway

If I had stayed at L4 (high trust), I would have asked the AI to "fix the ML tests." It would have done it. I would have shipped a feature nobody wanted and maintained it forever.

By dropping to L1 (verify every line), I saw the real problem: the feature itself was wrong.


The Gamification Layer

XP, streaks, achievements. The same reward loops that make games addictive, pointed at development discipline.

XP & Levels

// progression

Level 1: Novice        (0-100 XP)      🌱
Level 2: Apprentice    (100-300 XP)    🌿
Level 3: Practitioner  (300-600 XP)    🌳
Level 4: Expert        (600-1000 XP)   🌲
Level 5: Master        (1000-2000 XP)  🎋
Level 6: Grandmaster   (2000-5000 XP)  🏔️

Prestige tiers beyond Grandmaster: Archmage 🔮, Sage 📿, Zenmester ☯️, Transcendent 🌟, Legendary 💫
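For illustration, here's how cumulative XP could map to those levels; the thresholds come straight from the list above, but how vibe-check awards XP per session isn't covered here:

typescript

// Sketch of mapping cumulative XP to the levels listed above.
const LEVELS = [
  { name: "Novice", minXp: 0 },
  { name: "Apprentice", minXp: 100 },
  { name: "Practitioner", minXp: 300 },
  { name: "Expert", minXp: 600 },
  { name: "Master", minXp: 1000 },
  { name: "Grandmaster", minXp: 2000 },
] as const;

function levelForXp(xp: number): string {
  // walk from the top tier down and return the first threshold the XP clears
  for (let i = LEVELS.length - 1; i >= 0; i--) {
    if (xp >= LEVELS[i].minXp) return LEVELS[i].name;
  }
  return LEVELS[0].name;
}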

Streaks

// streak-display

🔥 5-day streak (1-5 days)
🌟🌟 12-day streak (6-14 days)
👑👑👑 18-day streak 🏆 Personal Best!

Weekly Challenges

Auto-generated based on your weak metrics:

// challenges

WEEKLY CHALLENGES

🎯 Trust Gauntlet:  ████████░░ 4/5 (90%+ trust)
🧘 Zen Mode:        ██████████ ✓ COMPLETE
🔥 Streak Builder:  ██░░░░░░░░ 1/5 (extend by 5)


Integration Points

Git Hook

Automatic vibe-check on every push:

bash

vibe-check init-hook
vibe-check init-hook --block-low   # Block LOW-rated pushes

GitHub Action

// .github/workflows/vibe-check.yml
name: Vibe Check
on: [pull_request]

jobs:
  vibe-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: boshu2/vibe-check@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

The Philosophy

> INFO:

The goal here isn't perfect metrics; it's noticing your own patterns.

This is a tool for self-reflection. You use it on yourself to get better at working with AI. It's not a productivity metric and it's definitely not for performance reviews. Don't use it to measure other people.

Gene Kim and Steve Yegge's research shows a concrete threshold: context utilization above 40% degrades AI performance dramatically. Under 35%, success rate is 98%. Above 60%, it drops to 24%. vibe-check helps you notice when you're approaching that limit: when the spirals start, when the rework ratio climbs, when you're fighting the AI instead of directing it.


Why Git History?

I wanted signals that can't be faked. Commit timestamps are real, and the tool never reads your actual code, just the metadata. This means it works regardless of what language or framework you're using. The patterns in your commit history reveal behavior rather than intentions, which makes them useful for actually improving.


What's Next

Phase               Status         Description
CLI Core            ✅ Complete    Metrics, scoring, analysis
Gamification        ✅ Complete    XP, streaks, achievements
Watch Mode          ✅ Complete    Real-time spiral detection
GitHub Action       ✅ Complete    Automated PR feedback
Pattern Memory      ✅ Complete    Track spiral triggers
Web Dashboard       🔜 Next        Visualizations, trends
VS Code Extension   💭 Exploring   Status bar, live alerts

Try It

bash

# Install
npm install -g @boshu2/vibe-check

# Run your first check
vibe-check --since "1 week ago"

# See your profile
vibe-check profile

# Start watching
vibe-check watch

Links: npm · GitHub