Building vibe-check
Andrej Karpathy coined the term "vibe coding" for letting AI write code while you direct. Gene Kim and Steve Yegge turned it into a methodology that answers the question pure vibe coding ignores: when can you trust AI output, and when do you need to verify every line?
Their framing: you're the head chef directing AI sous chefs. The book documents 12 failure patterns where vibe coding destroys work, from AI claiming tests pass when it never ran them to 3,000-line functions that become unmaintainable. The methodology exists to avoid them.
I built tooling around this idea. (See how I built this website using the methodology.)
The Core Insight
AI reliability varies by task type. Formatting is nearly always correct; architecture needs line-by-line verification. The vibe levels answer the question: when can you safely let the AI drive, and when do you need to verify every line?
Vibe Levels
The idea is to pick a level before you start working:
| Level | Trust | Verification | Example Tasks |
|---|---|---|---|
| L5 | 95% | Final only | Formatting, linting |
| L4 | 80% | Spot check | Boilerplate, config |
| L3 | 60% | Key outputs | CRUD, standard tests |
| L2 | 40% | Every change | Features, integrations |
| L1 | 20% | Every line | Architecture, security |
| L0 | 0% | N/A | Novel research |
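The table above can also be written down as data. A minimal sketch in TypeScript — the `VIBE_LEVELS` constant and `levelForTrust` helper are illustrative names of mine, not part of vibe-check:

```typescript
// Illustrative encoding of the vibe-level table above.
// Level names, trust values, and verification styles come from the table.
type VibeLevel = {
  trust: number;          // fraction of output you expect to be correct
  verification: string;   // how much of the output you review
  examples: string[];
};

const VIBE_LEVELS: Record<string, VibeLevel> = {
  L5: { trust: 0.95, verification: "final only",   examples: ["formatting", "linting"] },
  L4: { trust: 0.80, verification: "spot check",   examples: ["boilerplate", "config"] },
  L3: { trust: 0.60, verification: "key outputs",  examples: ["CRUD", "standard tests"] },
  L2: { trust: 0.40, verification: "every change", examples: ["features", "integrations"] },
  L1: { trust: 0.20, verification: "every line",   examples: ["architecture", "security"] },
  L0: { trust: 0.00, verification: "n/a",          examples: ["novel research"] },
};

// Pick the highest level whose trust requirement you can actually extend.
function levelForTrust(trust: number): string {
  const entries = Object.entries(VIBE_LEVELS).sort(
    (a, b) => b[1].trust - a[1].trust,
  );
  for (const [name, lvl] of entries) {
    if (trust >= lvl.trust) return name;
  }
  return "L0";
}
```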
What they look like: `L5 · 95% trust`, `L3 · 60% trust`, `L1 · 20% trust`.

Declaring the level upfront forces you to think about what kind of task you're doing. After the session, you can compare what actually happened to what you expected, and over time your intuition about which level to use gets better.
The Five Core Metrics
All of these come from git history rather than code content. You can't game timestamps.
1. Trust Pass Rate
What it measures: The percentage of commits that don't require an immediate fix.
When you trust AI output and commit it, does it actually work? A high trust pass rate means your vibe level calibration is accurate: you're trusting AI on tasks where it's reliable. A low rate means you're over-trusting on complex work.
Calculation: Count commits where no fix commit follows within 10 minutes. Divide by total commits.
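A minimal sketch of that calculation, assuming each commit carries a timestamp and message — the `Commit` shape and function name are mine, not the tool's internals:

```typescript
// Sketch of Trust Pass Rate: a commit "fails" if a fix commit
// follows it within 10 minutes; everything else passes.
type Commit = { time: number; message: string }; // time in ms since epoch

const TEN_MINUTES = 10 * 60 * 1000;

function trustPassRate(commits: Commit[]): number {
  if (commits.length === 0) return 1;
  const sorted = [...commits].sort((a, b) => a.time - b.time);
  let passed = 0;
  for (let i = 0; i < sorted.length; i++) {
    const next = sorted[i + 1];
    const failed =
      next !== undefined &&
      /\bfix\b/i.test(next.message) &&
      next.time - sorted[i].time <= TEN_MINUTES;
    if (!failed) passed++;
  }
  return passed / sorted.length;
}
```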
| Rating | Range | What It Means |
|---|---|---|
| ELITE | 90%+ | Your vibe level intuition is well-calibrated |
| HIGH | 75-89% | Solid, occasional miscalibration |
| MEDIUM | 50-74% | Trusting AI on work that needs more verification |
| LOW | <50% | Either under-verifying or taking on L1/L0 work at higher levels |
2. Rework Ratio
What it measures: Fix commits as a percentage of total work.
Some rework is healthy; zero fixes probably means you're over-verifying. But when rework climbs above 25%, you're spending more time correcting than building. This is the canary in the coal mine for AI reliability issues.
Calculation: Commits with "fix" in the message divided by total commits.
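As a sketch, using the same "fix in the message" rule described above (the function name is illustrative):

```typescript
// Sketch of Rework Ratio: fix commits divided by total commits.
function reworkRatio(messages: string[]): number {
  if (messages.length === 0) return 0;
  const fixes = messages.filter((m) => /\bfix\b/i.test(m)).length;
  return fixes / messages.length;
}
```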
| Rating | Range | What It Means |
|---|---|---|
| ELITE | <10% | Building forward, minimal corrections |
| HIGH | 10-20% | Healthy iteration |
| MEDIUM | 20-30% | Worth investigating which components cause fixes |
| LOW | >30% | You're debugging, not building |
3. Debug Spirals
What it measures: Consecutive fix commits on the same component.
A spiral is three or more fix commits in a row on the same area of code. One fix is normal. Two fixes happen. Three fixes mean you're not solving the root problem: you're patching symptoms while the AI keeps generating broken code.
Calculation: Detect sequences of 3+ commits where (a) all have "fix" in the message, and (b) all touch the same file or component.
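A minimal sketch of that detection, assuming each commit has already been attributed to a component (the `CommitInfo` shape and function name are mine):

```typescript
// Sketch of spiral detection: runs of 3+ consecutive fix commits
// that all touch the same component.
type CommitInfo = { message: string; component: string };

function detectSpirals(commits: CommitInfo[], minRun = 3): CommitInfo[][] {
  const spirals: CommitInfo[][] = [];
  let run: CommitInfo[] = [];
  for (const c of commits) {
    const isFix = /\bfix\b/i.test(c.message);
    const sameComponent = run.length === 0 || run[0].component === c.component;
    if (isFix && sameComponent) {
      run.push(c); // extend the current run of fixes
    } else {
      if (run.length >= minRun) spirals.push(run); // close a completed spiral
      run = isFix ? [c] : []; // a fix on a new component starts a fresh run
    }
  }
  if (run.length >= minRun) spirals.push(run);
  return spirals;
}
```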
| Rating | Range | What It Means |
|---|---|---|
| ELITE | 0 spirals | Clean execution, problems solved on first attempt |
| HIGH | 1-2 | Occasional complexity, but you break out |
| MEDIUM | 3-5 | Pattern worth examining: what triggers your spirals? |
| LOW | 6+ | You're fighting the AI instead of directing it |
4. Spiral Duration
What it measures: Total time spent stuck in fix loops.
Spirals happen. What matters is how long you stay in them. Five minutes of debugging is fine. Forty-five minutes means you should have stepped back, switched approaches, or dropped to a lower vibe level twenty minutes ago.
Calculation: For each detected spiral, sum the time from first fix commit to last fix commit in the sequence.
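Sketched in code, given spirals already detected (the `FixCommit` shape is illustrative):

```typescript
// Sketch of Spiral Duration: for each spiral, the time from
// its first fix commit to its last, summed across all spirals.
type FixCommit = { time: number }; // ms since epoch

function spiralDurationMinutes(spirals: FixCommit[][]): number {
  let totalMs = 0;
  for (const spiral of spirals) {
    if (spiral.length < 2) continue; // a single commit spans no time
    const times = spiral.map((c) => c.time);
    totalMs += Math.max(...times) - Math.min(...times);
  }
  return totalMs / 60000;
}
```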
| Rating | Range | What It Means |
|---|---|---|
| ELITE | 0 min | No spirals, no time lost |
| HIGH | <15 min total | Quick recognition and recovery |
| MEDIUM | 15-45 min | You're getting stuck; watch mode would help |
| LOW | >45 min | Significant time lost; worth reviewing patterns |
5. Flow Efficiency
What it measures: Percentage of active time spent building rather than debugging.
This is the meta-metric. It combines everything above into a single question: are you in a productive flow state, or are you stuck in the weeds?
Calculation: (Active time - Spiral duration) / Active time × 100
Active time comes from your commit timestamps. If you committed at 9:00 and 9:30, that's 30 minutes of active time. Spiral duration is subtracted from that to get productive building time.
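Putting the formula together — this sketch treats active time naively as the span from first to last commit, per the example above; the real tool presumably also breaks long idle gaps, which this ignores:

```typescript
// Sketch of Flow Efficiency:
//   (active time - spiral duration) / active time × 100
function flowEfficiency(commitTimes: number[], spiralMinutes: number): number {
  if (commitTimes.length < 2) return 100; // no measurable span, no time lost
  const activeMinutes =
    (Math.max(...commitTimes) - Math.min(...commitTimes)) / 60000;
  if (activeMinutes === 0) return 100;
  return ((activeMinutes - spiralMinutes) / activeMinutes) * 100;
}
```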
| Rating | Range | What It Means |
|---|---|---|
| ELITE | 90%+ | You're the head chef, AI is executing |
| HIGH | 75-89% | Strong flow with minor interruptions |
| MEDIUM | 50-74% | Split between building and debugging |
| LOW | <50% | More time fighting than building |
vibe-check: The Tool
vibe-check is a CLI that reads your git history and shows you how you're doing:
```
npm install -g @boshu2/vibe-check
```

Or run directly:

```
npx @boshu2/vibe-check
```
What You Get
```
$ vibe-check --since "1 week ago"

VIBE-CHECK  Nov 21 - Nov 28

  Rating:  ELITE
  Trust:   94%   ELITE
  Rework:  18%   HIGH
  Spirals: 1 detected (12 min)

  VibeScore: 87%
```
Run `vibe-check profile` to see your cumulative stats, achievements, and current streak.
Watch Mode
Run this in a terminal while you work:
```
vibe-check watch
```
It monitors your commits and warns you when things look off:
```
09:15  fix(auth) handle token refresh
09:18  fix(auth) add retry logic
09:22  fix(auth) increase timeout

⚠️ SPIRAL DETECTED
   Component: auth
   Fixes: 3 commits, 7 min
```
Pattern Memory
The tool learns your specific spiral triggers over time:
```
$ vibe-check profile --patterns

YOUR SPIRAL TRIGGERS

Component   Times   Pattern
auth        5       OAuth/token/refresh issues
database    3       Connection pooling
api         2       External API timeouts
```
And tracks what interventions work for you:
```
WHAT WORKS FOR YOU

Take a break       4 times (avg 12 min)
Write test first   3 times
Read docs          2 times
```
Dogfooding: The ML Detour
Day 1: Building the Wrong Thing
Nov 28, 13:27 3127f07: First commit. Scoped package name, basic CLI.
Nov 28, 17:07 f20a614: I start building an ML system. Ordered Logistic Regression to predict which Vibe Level you should use. Expected Calibration Error to measure confidence. The whole academic stack.
Nov 28, 18:04 58c8cf3: ML learning loop implemented. Calibration, ECE, partial fit.
Nov 28, 20:58 6093290: Ship it as v2.0.0 with a level subcommand that predicts your optimal trust level.
Nov 28, 22:25 ec4e8d6: Add unit tests for the ML model.
Total time building ML: 5 hours 18 minutes.
Day 2: The Realization
Nov 29, 15:35 a597615: L1 (20% trust). I'm staring at my own code.
It uses ml-matrix and calculates Expected Calibration Error. Technically impressive but completely useless.
The ML predicts which Vibe Level to use. You already know what task you're doing. You don't need a model to tell you that OAuth integration is riskier than formatting code.
1,563 lines of math solving a problem that doesn't exist. And I was already dreading the tests I'd need to maintain.
21 hours after implementing it, I delete all of it.
```
git rm -rf src/recommend/      # Ordered logistic regression
git rm -rf src/calibration/    # ECE calculations
git rm src/commands/level.ts   # The prediction command
```
Commit: a5976159: feat(v1.3): replace ML level prediction with session workflow
1,563 lines deleted. One commit.
The Productive Sprint
With the ML gone, I have mental space for features that actually matter.
Nov 29, 15:46 9db3e73 (11 minutes later): L4 (80% trust). Watch Mode
Instead of predicting levels, I build something useful: real-time spiral detection. Poll git log every 30 seconds, warn when you're stuck.
```
vibe-check watch
```
Nov 29, 16:01 652da00 (15 minutes later): L4 (80% trust). Baseline Comparison
Compare your current session to your own historical patterns. No ML required, just simple statistics.
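One way such a comparison could work — the z-score approach and the two-standard-deviation threshold here are my assumptions, not necessarily what vibe-check does:

```typescript
// Sketch of a baseline comparison: flag the current session when a metric
// (e.g. rework ratio) sits more than two standard deviations above
// your own historical mean.
function isOutlier(history: number[], current: number, zThreshold = 2): boolean {
  const n = history.length;
  if (n < 2) return false; // not enough history to establish a baseline
  const mean = history.reduce((a, b) => a + b, 0) / n;
  const variance =
    history.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const std = Math.sqrt(variance);
  if (std === 0) return current !== mean;
  return (current - mean) / std > zThreshold;
}
```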
Nov 29, 16:25 301a322 (24 minutes later): Pattern Memory
Track which components trigger your spirals. Learn from your own history.
Nov 29, 16:45 85cd751 (20 minutes later): Intervention Tracking
Record what breaks your spirals. Build a personal playbook.
Nov 29, 17:33 15f1dee: Ship v1.5.0 with gamification, watch mode, pattern memory, and 24 features total.
The Numbers
| Metric | Value |
|---|---|
| Time building ML | 5h 18m |
| Time ML existed | 21h 31m |
| Lines deleted | 1,563 |
| Time from delete to v1.5.0 | 1h 58m |
I spent 5+ hours building something I deleted 21 hours later. In the 2 hours after deleting it, I shipped more useful features than in the entire previous day.
Actual vibe-check output for the vibe-check build:
```
Period:  Nov 28 - Nov 30 (16.4h active over 2 days)
Commits: 56 total (24 feat, 3 fix, 7 docs, 22 other)

METRIC                  VALUE      RATING
Iteration Velocity      3.4/hour   HIGH
Rework Ratio            5%         ELITE
Trust Pass Rate         100%       ELITE
Debug Spiral Duration   0min       ELITE
Flow Efficiency         100%       ELITE

OVERALL: ELITE
```
The Takeaway
If I had stayed at L4 (high trust), I would have asked the AI to "fix the ML tests." It would have done it. I would have shipped a feature nobody wanted and maintained it forever.
By dropping to L1 (verify every line), I saw the real problem: the feature itself was wrong.
The Gamification Layer
XP, streaks, achievements. The same reward loops that make games addictive, pointed at development discipline.
XP & Levels
```
Level 1: Novice        (0-100 XP)      🌱
Level 2: Apprentice    (100-300 XP)    🌿
Level 3: Practitioner  (300-600 XP)    🌳
Level 4: Expert        (600-1000 XP)   🌲
Level 5: Master        (1000-2000 XP)  🎋
Level 6: Grandmaster   (2000-5000 XP)  🏔️
```
Prestige tiers beyond Grandmaster: Archmage 🔮, Sage 📿, Zenmaster ☯️, Transcendent 🌟, Legendary 💫
Streaks
```
🔥 5-day streak (1-5 days)
🌟🌟 12-day streak (6-14 days)
👑👑👑 18-day streak   🏆 Personal Best!
```
Weekly Challenges
Auto-generated based on your weak metrics:
```
WEEKLY CHALLENGES

🎯 Trust Gauntlet:  ████████░░  4/5 (90%+ trust)
🧘 Zen Mode:        ██████████  ✓ COMPLETE
🔥 Streak Builder:  ██░░░░░░░░  1/5 (extend by 5)
```
Integration Points
Git Hook
Automatic vibe-check on every push:
```
vibe-check init-hook
vibe-check init-hook --block-low   # Block LOW-rated pushes
```
GitHub Action
```yaml
name: Vibe Check
on: [pull_request]
jobs:
  vibe-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so vibe-check can read every commit
      - uses: boshu2/vibe-check@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
```

The Philosophy
The goal here isn't perfect metrics; it's noticing your own patterns.
This is a tool for self-reflection. You use it on yourself to get better at working with AI. It's not a productivity metric and it's definitely not for performance reviews. Don't use it to measure other people.
Gene Kim and Steve Yegge's research shows a concrete threshold: context utilization above 40% degrades AI performance dramatically. Under 35%, success rate is 98%. Above 60%, it drops to 24%. vibe-check helps you notice when you're approaching that limit: when the spirals start, when the rework ratio climbs, when you're fighting the AI instead of directing it.
Why Git History?
I wanted signals that can't be faked. Commit timestamps are real, and the tool never reads your actual code, just the metadata. This means it works regardless of what language or framework you're using. The patterns in your commit history reveal behavior rather than intentions, which makes them useful for actually improving.
What's Next
| Phase | Status | Description |
|---|---|---|
| CLI Core | ✅ Complete | Metrics, scoring, analysis |
| Gamification | ✅ Complete | XP, streaks, achievements |
| Watch Mode | ✅ Complete | Real-time spiral detection |
| GitHub Action | ✅ Complete | Automated PR feedback |
| Pattern Memory | ✅ Complete | Track spiral triggers |
| Web Dashboard | 🔜 Next | Visualizations, trends |
| VS Code Extension | 💭 Exploring | Status bar, live alerts |
Try It
```
# Install
npm install -g @boshu2/vibe-check

# Run your first check
vibe-check --since "1 week ago"

# See your profile
vibe-check profile

# Start watching
vibe-check watch
```