/builds/openshift-paas
// where I learned reliability
Before I could make AI reliable, I had to learn what reliability actually means.
Top-secret on-prem networks running self-hosted AI models. Every dependency vendored, every image scanned, every certificate managed internally. This is where I learned the patterns AI agents need: observability, deterministic deployments, graceful failure.
> The same principles that make infrastructure boring make AI reliable.
// patterns that transfer
One config file defines truth. Everything else renders from it. No manual editing, no drift. AI agents need this too: explicit context in, predictable behavior out.
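A minimal sketch of the idea, with hypothetical names and values: one site config in, every artifact rendered out, nothing edited by hand.

```python
from string import Template

# The single source of truth: one config per site, nothing edited downstream.
# (Hypothetical values; in practice this lives in Git, e.g. as site.yaml.)
site_config = {
    "cluster_name": "prod-east",
    "registry": "registry.internal.example",
    "app_version": "2.4.1",
}

# Every rendered artifact derives from that config; humans never touch the output.
deployment_template = Template(
    "image: $registry/platform/app:$app_version\n"
    "cluster: $cluster_name\n"
)

rendered = deployment_template.substitute(site_config)
print(rendered)
```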
Same inputs, same outputs. Every time. The foundation of trust—whether you're deploying containers or generating code.
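Sketched in miniature, with illustrative names: a deterministic render, verified by hashing the output. A pipeline can pin that digest and fail on drift.

```python
import hashlib
import json

def render(config: dict) -> str:
    # Deterministic render: sorted keys, no timestamps, no randomness.
    return json.dumps(config, sort_keys=True, indent=2)

config = {"app_version": "2.4.1", "replicas": 3}

# Same inputs, same outputs: the digest of the rendered artifact only changes
# when the config does.
first = hashlib.sha256(render(config).encode()).hexdigest()
second = hashlib.sha256(render(config).encode()).hexdigest()
assert first == second, "non-deterministic render"
print(first)
```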
If you can't see what's happening, you can't fix it. Logs, metrics, traces: the same visibility AI workflows need to debug failures.
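The smallest useful version, assuming nothing beyond the standard library: structured logs with a correlation id, so one failed step can be traced through the whole run.

```python
import json
import logging
import sys
import uuid

# Structured logs: every event is machine-readable and carries a correlation id,
# so a single failure can be traced across the whole workflow.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

run_id = str(uuid.uuid4())

def emit(event: str, **fields):
    log.info(json.dumps({"run_id": run_id, "event": event, **fields}))

emit("render.start", site="prod-east")
emit("render.done", manifests=42, duration_ms=180)
emit("deploy.failed", reason="image pull backoff", image="app:2.4.1")
```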
Where AI meets infrastructure directly. Multi-generation GPU scheduling for ML teams running inference at scale. OpenShift with and without Run:AI.
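Roughly what that looks like at the pod level, written as a plain Python dict rather than a live manifest. The node label used to pin a GPU generation is an assumption; real clusters expose GPU models through their own labels.

```python
# A pod spec sketch: request GPUs explicitly and pin the workload to a GPU
# generation via a node label. The label name here is illustrative only.
inference_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "llm-inference"},
    "spec": {
        "nodeSelector": {"gpu.generation": "a100"},  # assumed label
        "containers": [
            {
                "name": "server",
                "image": "registry.internal.example/ml/inference:1.2.0",
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    },
}

print(inference_pod["spec"]["nodeSelector"])
```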
// building trust in hostile environments
Air-gapped environments break assumptions. Every trust relationship has to be explicit. AI has the same problem—you can't assume anything works until you verify it.
Every dependency audited. No public registries. Every container image vendored, every chart pinned, every binary scanned. Trust nothing by default. The same principle behind vibe-check: measure before you trust.
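Measure before you trust, in sketch form: a vendored artifact passes only if its digest matches the one pinned in version control. File names here are hypothetical.

```python
import hashlib
from pathlib import Path

def verify_artifact(path: Path, pinned_digest: str) -> bool:
    """Trust nothing by default: the artifact passes only if its sha256
    matches the digest pinned at vendoring time."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == pinned_digest

# Demo with a throwaway file so the check runs end to end.
artifact = Path("platform-chart.tgz")
artifact.write_bytes(b"chart contents")
pinned = hashlib.sha256(b"chart contents").hexdigest()  # recorded when vendored

assert verify_artifact(artifact, pinned)
print("digest ok:", pinned[:12])
```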
Explicit trust chains. Internal PKI everywhere. Custom certificate chains, managed trust stores. No implicit trust—the same pattern AI agents need for authentication and authorization.
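On the client side, explicit trust looks roughly like this: a TLS context built from the internal CA bundle alone, with no fallback to ambient system trust. Paths and hosts are hypothetical.

```python
import ssl
import urllib.request

# Explicit trust: verify servers against the internal CA bundle only.
# The cafile path is hypothetical; nothing falls back to system defaults.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = True
ctx.verify_mode = ssl.CERT_REQUIRED
ctx.load_verify_locations(cafile="/etc/pki/internal/ca-bundle.pem")

# Any request that can't chain to the internal CA fails closed.
with urllib.request.urlopen("https://registry.internal.example/v2/", context=ctx) as resp:
    print(resp.status)
```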
No Stack Overflow. When something breaks, you figure it out from docs and source code. This is where I learned to actually understand systems instead of copy-pasting fixes. Essential for debugging AI failures.
// the pattern
Environment-specific config. IPs, hostnames, feature flags. One file per site, overlays compose the final state.
Opinionated wrappers around upstream charts. Pinned versions, enterprise defaults, sync-wave ordering baked in.
Config flows one direction. Sites customize, base charts deploy.
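A sketch of the composition step, with hypothetical keys: shared base defaults, one overlay per site, merged one way into the final state.

```python
def merge(base: dict, overlay: dict) -> dict:
    """Compose the final state: overlay values win, and the flow is
    one-directional. Sites never edit the base; nothing is written back."""
    out = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# Hypothetical base defaults shared by every site.
base = {
    "app_version": "2.4.1",
    "network": {"ingress_host": "apps.example.internal", "mtu": 1500},
    "features": {"gpu_scheduling": False},
}

# One overlay per site: only the values that differ.
prod_east = {
    "network": {"ingress_host": "apps.prod-east.internal"},
    "features": {"gpu_scheduling": True},
}

print(merge(base, prod_east))
```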
// outcomes
Upgrades are boring. That's the goal. Change the version in config, let the pipeline render, watch dashboards. Platform upgrades shouldn't be events.
No tribal knowledge. Everything lives in Git. New engineers can trace exactly what happens from config change to deployed state. The pipeline is the documentation.
Developers stay unblocked. Self-service means the platform team isn't a bottleneck. Request infrastructure through Git, get infrastructure through Git.
// stack
This work taught me that reliability isn't a feature. It's a discipline. Observability, determinism, explicit trust, graceful failure.
Now I'm applying these patterns to AI agents. The infrastructure is ready. The agents are getting there.
> Start with the patterns that already work.