Building vibe-check

November 29, 2025 · 13 min read
#vibe-coding #ai-development #developer-tools #open-source

Andrej Karpathy coined the term "vibe coding" for letting AI write code while you direct. Gene Kim and Steve Yegge turned it into a methodology that answers the question pure vibe-coding ignores: when can you trust AI output, and when do you need to verify every line?

Their framing: you're the head chef directing AI sous chefs. Their book documents 12 failure patterns where vibe coding destroys work, from AI claiming tests pass when it never ran them to 3,000-line functions that become unmaintainable. The methodology exists to avoid them.

I built tooling around this idea. (See how I built this website using the methodology.)

// vibe-check: from zero to v1.5.0
duration: 16.4 hours
commits: 56
features: 24
deleted: 1,563 lines

Nov 28 · 13:27  MVP      First commit: basic CLI with 5 core metrics
Nov 28 · 13:51  v1.0.2   Published to npm in 24 minutes
Nov 28 · 17:07  AI TRAP  Started building ML prediction system
Nov 28 · 18:04  ...      Ordered logistic regression, ECE calibration
Nov 28 · 20:58  v2.0.0   Shipped with ML, but something felt off
Nov 28 · 22:25  ...      Writing tests to validate ML accuracy
Nov 29 · 10:25  ...      Tweaking ML thresholds, still debugging
Nov 29 · 15:35  CAUGHT   Realized I was debugging, not building. 1,563 lines gone.
Nov 29 · 15:46  FLOW     Watch mode: real-time spiral detection
Nov 29 · 16:01  +15m     Baseline comparison against your history
Nov 29 · 16:25  +24m     Pattern memory: learn your spiral triggers
Nov 29 · 16:45  +20m     Interventions: track what breaks spirals
Nov 29 · 17:33  v1.5.0   5 real features in 1h 58m post-delete

The Core Insight

AI reliability varies by task type. Formatting is nearly always correct; architecture needs line-by-line verification. The vibe levels answer the question: when can you safely let the AI drive, and when do you need to verify every line?


Vibe Levels

The idea is to pick a level before you start working:

Level   Trust   Verification    Example Tasks
L5      95%     Final only      Formatting, linting
L4      80%     Spot check      Boilerplate, config
L3      60%     Key outputs     CRUD, standard tests
L2      40%     Every change    Features, integrations
L1      20%     Every line      Architecture, security
L0      0%      N/A             Novel research

What they look like:

L5 · 95% trust    L3 · 60% trust    L1 · 20% trust

Declaring the level upfront forces you to think about what kind of task you're doing. After the session, you can compare what actually happened to what you expected, and over time your intuition about which level to use gets better.


The Five Core Metrics

All of these come from git history rather than code content. You can't game timestamps.

1. Trust Pass Rate

What it measures: The percentage of commits that don't require an immediate fix.

When you trust AI output and commit it, does it actually work? A high trust pass rate means your vibe level calibration is accurate: you're trusting AI on tasks where it's reliable. A low rate means you're over-trusting on complex work.

Calculation: Count commits where no fix commit follows within 10 minutes. Divide by total commits.

Rating   Range    What It Means
ELITE    90%+     Your vibe level intuition is well-calibrated
HIGH     75-89%   Solid, occasional miscalibration
MEDIUM   50-74%   Trusting AI on work that needs more verification
LOW      <50%     Either under-verifying or taking on L1/L0 work at higher levels
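vibe-check's internals aren't shown in this post, but the calculation is simple enough to sketch. A rough version, where the `Commit` shape, the fix-detection regex, and the 10-minute window are taken from the description above rather than from the tool's source:

typescript

// Rough sketch of the Trust Pass Rate calculation described above.
// The Commit shape and the fix-detection regex are assumptions for illustration.
interface Commit {
  message: string;
  timestamp: Date; // e.g. parsed from `git log --pretty=format:%cI|%s`
}

const TEN_MINUTES_MS = 10 * 60 * 1000;

function trustPassRate(commits: Commit[]): number {
  // commits are assumed sorted oldest -> newest
  const passed = commits.filter((commit, i) => {
    const next = commits[i + 1];
    const quickFixFollows =
      next !== undefined &&
      /\bfix\b/i.test(next.message) &&
      next.timestamp.getTime() - commit.timestamp.getTime() <= TEN_MINUTES_MS;
    return !quickFixFollows;
  });
  return commits.length === 0 ? 0 : (passed.length / commits.length) * 100;
}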

2. Rework Ratio

What it measures: Fix commits as a percentage of total work.

Some rework is healthy; zero fixes probably means you're over-verifying. But when rework climbs above 25%, you're spending more time correcting than building. This is the canary in the coal mine for AI reliability issues.

Calculation: Commits with "fix" in the message divided by total commits.

Rating   Range    What It Means
ELITE    <10%     Building forward, minimal corrections
HIGH     10-20%   Healthy iteration
MEDIUM   20-30%   Worth investigating which components cause fixes
LOW      >30%     You're debugging, not building
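The same commit list gives you the rework ratio in a couple of lines. A sketch, again assuming fix commits are identified by their message:

typescript

// Rework ratio sketch: fix commits over total commits, as a percentage.
function reworkRatio(messages: string[]): number {
  const fixes = messages.filter((m) => /\bfix\b/i.test(m)).length;
  return messages.length === 0 ? 0 : (fixes / messages.length) * 100;
}

// reworkRatio(["feat: add watch mode", "fix: handle empty repo"]) -> 50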

3. Debug Spirals

What it measures: Consecutive fix commits on the same component.

A spiral is three or more fix commits in a row on the same area of code. One fix is normal. Two fixes happen. Three fixes means you're not solving the root problem: you're patching symptoms while the AI keeps generating broken code.

Calculation: Detect sequences of 3+ commits where (a) all have "fix" in the message, and (b) all touch the same file or component.

Rating   Range       What It Means
ELITE    0 spirals   Clean execution, problems solved on first attempt
HIGH     1-2         Occasional complexity, but you break out
MEDIUM   3-5         Pattern worth examining: what triggers your spirals?
LOW      6+          You're fighting the AI instead of directing it
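A sketch of how that detection could work. Reading the component from a conventional-commit scope (the `auth` in `fix(auth): ...`) is my assumption; grouping by touched files would work the same way:

typescript

// Sketch of debug spiral detection: 3+ consecutive fix commits on one component.
interface FixableCommit {
  message: string;
  timestamp: Date;
}

function detectSpirals(commits: FixableCommit[], minRun = 3): FixableCommit[][] {
  const scopeOf = (msg: string) => /^fix\(([^)]+)\)/i.exec(msg)?.[1] ?? null;
  const spirals: FixableCommit[][] = [];
  let run: FixableCommit[] = [];

  for (const commit of commits) {
    const scope = scopeOf(commit.message);
    const prev = run.length > 0 ? scopeOf(run[run.length - 1].message) : null;
    if (scope !== null && scope === prev) {
      run.push(commit); // same component, spiral continues
    } else {
      if (run.length >= minRun) spirals.push(run);
      run = scope !== null ? [commit] : []; // non-fix commits break the run
    }
  }
  if (run.length >= minRun) spirals.push(run);
  return spirals;
}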

4. Spiral Duration

What it measures: Total time spent stuck in fix loops.

Spirals happen. What matters is how long you stay in them. Five minutes of debugging is fine. Forty-five minutes means you should have stepped back, switched approaches, or dropped to a lower vibe level twenty minutes ago.

Calculation: For each detected spiral, sum the time from first fix commit to last fix commit in the sequence.

Rating   Range           What It Means
ELITE    0 min           No spirals, no time lost
HIGH     <15 min total   Quick recognition and recovery
MEDIUM   15-45 min       You're getting stuck; watch mode would help
LOW      >45 min         Significant time lost; worth reviewing patterns
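Given the spirals from a detection pass like the sketch above, the duration is just first-to-last timestamps summed. A minimal sketch with an assumed commit shape:

typescript

// Spiral duration sketch: first-to-last fix commit in each detected spiral, summed.
function spiralDurationMinutes(spirals: { timestamp: Date }[][]): number {
  return spirals.reduce((total, spiral) => {
    const first = spiral[0].timestamp.getTime();
    const last = spiral[spiral.length - 1].timestamp.getTime();
    return total + (last - first) / 60_000;
  }, 0);
}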

5. Flow Efficiency

What it measures: Percentage of active time spent building rather than debugging.

This is the meta-metric. It combines everything above into a single question: are you in a productive flow state, or are you stuck in the weeds?

Calculation: (Active time - Spiral duration) / Active time × 100

Active time comes from your commit timestamps. If you committed at 9:00 and 9:30, that's 30 minutes of active time. Spiral duration is subtracted from that to get productive building time.

Rating   Range    What It Means
ELITE    90%+     You're the head chef, AI is executing
HIGH     75-89%   Strong flow with minor interruptions
MEDIUM   50-74%   Split between building and debugging
LOW      <50%     More time fighting than building
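Putting the last two pieces together, a rough sketch of the formula above. Summing gaps between consecutive commits as active time follows the 9:00/9:30 example; skipping gaps over an hour is my assumption about how idle stretches might be excluded:

typescript

// Flow efficiency sketch: (active time - spiral duration) / active time * 100.
function activeMinutes(timestamps: Date[], maxGapMinutes = 60): number {
  let total = 0;
  for (let i = 1; i < timestamps.length; i++) {
    const gap = (timestamps[i].getTime() - timestamps[i - 1].getTime()) / 60_000;
    if (gap <= maxGapMinutes) total += gap; // assumed idle-gap cutoff
  }
  return total;
}

function flowEfficiency(activeMins: number, spiralMins: number): number {
  return activeMins === 0 ? 0 : ((activeMins - spiralMins) / activeMins) * 100;
}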

vibe-check: The Tool

vibe-check is a CLI that reads your git history and shows you how you're doing:

bash

npm install -g @boshu2/vibe-check

Or run directly:

bash

npx @boshu2/vibe-check


What You Get

// terminal

$ vibe-check --since "1 week ago"

VIBE-CHECK Nov 21 - Nov 28

Rating:  ELITE
Trust:   94%   ELITE
Rework:  18%   HIGH
Spirals: 1 detected (12 min)

VibeScore: 87%

> TIP:

Run vibe-check profile to see your cumulative stats, achievements, and current streak.


Watch Mode

Run this in a terminal while you work:

bash

vibe-check watch

It monitors your commits and warns you when things look off:

// spiral-detection

09:15  fix(auth) handle token refresh
09:18  fix(auth) add retry logic
09:22  fix(auth) increase timeout

⚠️ SPIRAL DETECTED
Component: auth
Fixes: 3 commits, 7 min
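The actual watch implementation isn't shown here, but the loop it describes (poll git log, look for consecutive fixes on one component) is easy to sketch. The `--since` window and the scope regex are assumptions; the 30-second interval comes from later in the post:

typescript

// Sketch of a watch loop: poll git log on an interval and warn when the last
// few commits are fixes on the same component. Illustrative only.
import { execSync } from "node:child_process";

function recentSubjects(since = "60 minutes ago"): string[] {
  const out = execSync(`git log --since="${since}" --pretty=format:%s`, {
    encoding: "utf8",
  });
  return out.split("\n").filter(Boolean).reverse(); // oldest first
}

function checkForSpiral(): void {
  const lastThree = recentSubjects().slice(-3);
  const scopes = lastThree.map((s) => /^fix\(([^)]+)\)/i.exec(s)?.[1] ?? null);
  if (scopes.length === 3 && scopes[0] !== null && scopes.every((x) => x === scopes[0])) {
    console.warn(`⚠️ SPIRAL DETECTED in "${scopes[0]}": 3 fix commits in a row`);
  }
}

setInterval(checkForSpiral, 30_000); // poll every 30 seconds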


Pattern Memory

The tool learns your specific spiral triggers over time:

// terminal

$ vibe-check profile --patterns

YOUR SPIRAL TRIGGERS

Component   Times   Pattern
auth        5       OAuth/token/refresh issues
database    3       Connection pooling
api         2       External API timeouts

And tracks what interventions work for you:

// terminal

WHAT WORKS FOR YOU

Take a break       4 times (avg 12 min)
Write test first   3 times
Read docs          2 times
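Both views are simple tallies over past sessions. A sketch of the trigger side, with an assumed record shape; interventions would be counted the same way:

typescript

// Sketch of tallying spiral triggers per component across past sessions.
interface SpiralRecord {
  component: string;
  startedAt: Date;
}

function triggerCounts(history: SpiralRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { component } of history) {
    counts.set(component, (counts.get(component) ?? 0) + 1);
  }
  return counts;
}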


Dogfooding: The ML Detour

Day 1: Building the Wrong Thing

Nov 28, 13:27 3127f07: First commit. Scoped package name, basic CLI.

Nov 28, 17:07 f20a614: I start building an ML system. Ordered Logistic Regression to predict which Vibe Level you should use. Expected Calibration Error to measure confidence. The whole academic stack.

Nov 28, 18:04 58c8cf3: ML learning loop implemented. Calibration, ECE, partial fit.

Nov 28, 20:58 6093290: Ship it as v2.0.0 with a level subcommand that predicts your optimal trust level.

Nov 28, 22:25 ec4e8d6: Add unit tests for the ML model.

Total time building ML: 5 hours 18 minutes.

Day 2: The Realization

Nov 29, 15:35 a597615: At L1 (20% trust), I'm staring at my own code.

It uses ml-matrix and calculates Expected Calibration Error. Technically impressive but completely useless.

The ML predicts which Vibe Level to use. You already know what task you're doing. You don't need a model to tell you that OAuth integration is riskier than formatting code.

1,563 lines of math solving a problem that doesn't exist. And I was already dreading the tests I'd need to maintain.

21 hours after implementing it, I delete all of it.

bash

git rm -rf src/recommend/      # Ordered logistic regression
git rm -rf src/calibration/    # ECE calculations
git rm src/commands/level.ts   # The prediction command

Commit: a5976159: feat(v1.3): replace ML level prediction with session workflow

1,563 lines deleted. One commit.

The Productive Sprint

With the ML gone, I have mental space for features that actually matter.

Nov 29, 15:46 9db3e73 (11 minutes later), at L4 (80% trust): Watch Mode

Instead of predicting levels, I build something useful: real-time spiral detection. Poll git log every 30 seconds, warn when you're stuck.

bash

vibe-check watch

Nov 29, 16:01 652da00 (15 minutes later), at L4 (80% trust): Baseline Comparison

Compare your current session to your own historical patterns. No ML required, just simple statistics.

Nov 29, 16:25 301a322 (24 minutes later): Pattern Memory

Track which components trigger your spirals. Learn from your own history.

Nov 29, 16:45 85cd751 (20 minutes later): Intervention Tracking

Record what breaks your spirals. Build a personal playbook.

Nov 29, 17:33 15f1dee: Ship v1.5.0 with gamification, watch mode, pattern memory, and 24 features total.

The Numbers

Metric                       Value
Time building ML             5h 18m
Time ML existed              21h 31m
Lines deleted                1,563
Time from delete to v1.5.0   1h 58m

I spent 5+ hours building something I deleted 21 hours later. In the 2 hours after deleting it, I shipped more useful features than in the entire previous day.

Actual vibe-check output for the vibe-check build:

// vibe-check --since 2025-11-28

Period:  Nov 28 - Nov 30 (16.4h active over 2 days)
Commits: 56 total (24 feat, 3 fix, 7 docs, 22 other)

METRIC                  VALUE      RATING
Iteration Velocity      3.4/hour   HIGH
Rework Ratio            5%         ELITE
Trust Pass Rate         100%       ELITE
Debug Spiral Duration   0 min      ELITE
Flow Efficiency         100%       ELITE

OVERALL: ELITE

The Takeaway

If I had stayed at L4 (high trust), I would have asked the AI to "fix the ML tests." It would have done it. I would have shipped a feature nobody wanted and maintained it forever.

By dropping to L1 (verify every line), I saw the real problem: the feature itself was wrong.


The Gamification Layer

XP, streaks, achievements. The same reward loops that make games addictive, pointed at development discipline.

XP & Levels

// progression

Level 1: Novice        (0-100 XP)      🌱
Level 2: Apprentice    (100-300 XP)    🌿
Level 3: Practitioner  (300-600 XP)    🌳
Level 4: Expert        (600-1000 XP)   🌲
Level 5: Master        (1000-2000 XP)  🎋
Level 6: Grandmaster   (2000-5000 XP)  🏔️

Prestige tiers beyond Grandmaster: Archmage 🔮, Sage 📿, Zenmester ☯️, Transcendent 🌟, Legendary 💫
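For illustration, here's how cumulative XP could map to those levels; the thresholds come straight from the list above, but how vibe-check awards XP per session isn't covered here:

typescript

// Sketch of mapping cumulative XP to the levels listed above.
const LEVELS = [
  { name: "Novice", minXp: 0 },
  { name: "Apprentice", minXp: 100 },
  { name: "Practitioner", minXp: 300 },
  { name: "Expert", minXp: 600 },
  { name: "Master", minXp: 1000 },
  { name: "Grandmaster", minXp: 2000 },
] as const;

function levelForXp(xp: number): string {
  // walk from the top tier down and return the first threshold the XP clears
  for (let i = LEVELS.length - 1; i >= 0; i--) {
    if (xp >= LEVELS[i].minXp) return LEVELS[i].name;
  }
  return LEVELS[0].name;
}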

Streaks

// streak-display

🔥 5-day streak (1-5 days)
🌟🌟 12-day streak (6-14 days)
👑👑👑 18-day streak 🏆 Personal Best!

Weekly Challenges

Auto-generated based on your weak metrics:

// challenges

WEEKLY CHALLENGES

🎯 Trust Gauntlet:  ████████░░ 4/5 (90%+ trust)
🧘 Zen Mode:        ██████████ ✓ COMPLETE
🔥 Streak Builder:  ██░░░░░░░░ 1/5 (extend by 5)


Integration Points

Git Hook

Automatic vibe-check on every push:

bash

vibe-check init-hook
vibe-check init-hook --block-low   # Block LOW-rated pushes

GitHub Action

// .github/workflows/vibe-check.yml
name: Vibe Check
on: [pull_request]

jobs:
  vibe-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: boshu2/vibe-check@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

The Philosophy

> INFO:

The goal here isn't perfect metrics; it's noticing your own patterns.

This is a tool for self-reflection. You use it on yourself to get better at working with AI. It's not a productivity metric and it's definitely not for performance reviews. Don't use it to measure other people.

Gene Kim and Steve Yegge's research shows a concrete threshold: context utilization above 40% degrades AI performance dramatically. Under 35%, success rate is 98%. Above 60%, it drops to 24%. vibe-check helps you notice when you're approaching that limit: when the spirals start, when the rework ratio climbs, when you're fighting the AI instead of directing it.


Why Git History?

I wanted signals that can't be faked. Commit timestamps are real, and the tool never reads your actual code, just the metadata. This means it works regardless of what language or framework you're using. The patterns in your commit history reveal behavior rather than intentions, which makes them useful for actually improving.


What's Next

Phase               Status         Description
CLI Core            ✅ Complete    Metrics, scoring, analysis
Gamification        ✅ Complete    XP, streaks, achievements
Watch Mode          ✅ Complete    Real-time spiral detection
GitHub Action       ✅ Complete    Automated PR feedback
Pattern Memory      ✅ Complete    Track spiral triggers
Web Dashboard       🔜 Next        Visualizations, trends
VS Code Extension   💭 Exploring   Status bar, live alerts

Try It

bash

# Install
npm install -g @boshu2/vibe-check

# Run your first check
vibe-check --since "1 week ago"

# See your profile
vibe-check profile

# Start watching
vibe-check watch

Links: npm · GitHub