/builds/openshift-paas

// where I learned reliability

Before I could make AI reliable, I had to learn what reliability actually means.

Top-secret, on-prem networks running self-hosted AI models. Every dependency vendored, every image scanned, every certificate managed internally. This is where I learned the patterns that AI agents need: observability, deterministic deployments, graceful failure.

> The same principles that make infrastructure boring make AI reliable.

// patterns that transfer

Declarative State

One config file defines truth. Everything else renders from it. No manual editing, no drift. AI agents need this too: explicit context in, predictable behavior out.
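
A minimal sketch of what that one file can look like, with hypothetical names and values:

```yaml
# site.yaml: hypothetical single source of truth for one environment.
# Everything downstream (Helm values, Kustomize patches) renders from this;
# nothing is edited by hand.
site:
  name: prod-east
  domain: apps.prod-east.example.internal
  registry: registry.example.internal:5000
features:
  gpuScheduling: true
  objectStorage: true
versions:
  platform: "4.14.12"
  monitoringChart: "2.3.1"
```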

Deterministic Deployments

Same inputs, same outputs. Every time. The foundation of trust—whether you're deploying containers or generating code.
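
One way to enforce that, with illustrative values: reference images by digest, never by mutable tag, so the same manifest can never resolve to different bytes.

```yaml
# Deployment fragment (illustrative): a tag like ":latest" can move;
# a digest cannot. Same manifest, same image bytes, every time.
spec:
  containers:
    - name: inference-server   # hypothetical workload
      image: registry.example.internal:5000/ml/inference-server@sha256:3b1f8e2c9a4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f
```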

Observable Systems

If you can't see what's happening, you can't fix it. Logs, metrics, traces—the same visibility AI workflows need to debug when things go wrong.
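
A sketch using the Prometheus Operator's ServiceMonitor (names hypothetical): workloads declare their own metrics endpoints, so visibility is config, not a ticket.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-server
  namespace: ml-serving
spec:
  selector:
    matchLabels:
      app: inference-server
  endpoints:
    - port: metrics    # named port on the Service
      interval: 30s
```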

GPU Orchestration

Where AI meets infrastructure directly. Multi-generation GPU scheduling for ML teams running inference at scale. OpenShift with and without Run:AI.
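
A sketch of what that scheduling can look like, assuming NVIDIA GPU Feature Discovery labels on the nodes (the model name is illustrative): pods land only on the GPU generation they were tuned for.

```yaml
# Pod spec fragment: nodeSelector pins the workload to one GPU generation;
# the resource limit requests a whole device.
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  containers:
    - name: inference-server
      resources:
        limits:
          nvidia.com/gpu: 1
```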

// building trust in hostile environments

Air-gapped environments break assumptions. Every trust relationship has to be explicit. AI has the same problem—you can't assume anything works until you verify it.

Every dependency audited. No public registries. Every container image vendored, every chart pinned, every binary scanned. Trust nothing by default. The same principle behind vibe-check: measure before you trust.
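
In Helm terms, that pinning can look like this (chart name, version, and mirror URL are illustrative):

```yaml
# Chart.yaml fragment: every dependency pinned to an exact version and
# pulled from the internal mirror, never a public registry.
apiVersion: v2
name: platform-monitoring
version: 2.3.1
dependencies:
  - name: kube-prometheus-stack
    version: "55.5.0"    # exact pin, no version ranges
    repository: "https://charts.example.internal/mirrored"
```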

Explicit trust chains. Internal PKI everywhere. Custom certificate chains, managed trust stores. No implicit trust—the same pattern AI agents need for authentication and authorization.
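
On OpenShift, one pattern for a managed trust store is the injected CA bundle (namespace here is hypothetical): label an empty ConfigMap and the platform fills it with the cluster's audited CA chain for workloads to mount.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: trusted-ca
  namespace: ml-serving
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
```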

No Stack Overflow. When something breaks, you figure it out from docs and source code. This is where I learned to actually understand systems instead of copy-pasting fixes. Essential for debugging AI failures.

// the pattern

Site Layer (Kustomize)

Environment-specific config. IPs, hostnames, feature flags. One file per site, overlays compose the final state.
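
A sketch of one site's overlay (paths illustrative): the overlay composes the shared base with this site's values.

```yaml
# kustomization.yaml for one site
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: site-prod-east.yaml    # the one file that defines this site
```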

Base Layer (Custom Helm Charts)

Opinionated wrappers around upstream charts. Pinned versions, enterprise defaults, sync-wave ordering baked in.
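
A fragment of what a base chart can bake in (illustrative): Argo CD sync-wave annotations make deploy order part of the chart, not tribal knowledge.

```yaml
# Template fragment: the operator syncs in an earlier wave than the
# wave-0 workloads that depend on it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-operator
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
```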

Config flows in one direction: sites customize, base charts deploy.

// outcomes

Upgrades are boring. That's the goal. Change the version in config, let the pipeline render, watch dashboards. Platform upgrades shouldn't be events.
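
In the site config sketched earlier (hypothetical keys), the whole upgrade is one line:

```yaml
versions:
  platform: "4.14.12"   # was "4.14.11"; this diff is the entire upgrade
```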

No tribal knowledge. Everything lives in Git. New engineers can trace exactly what happens from config change to deployed state. The pipeline is the documentation.

Developers stay unblocked. Self-service means the platform team isn't a bottleneck. Request infrastructure through Git, get infrastructure through Git.

// stack

OpenShift · GitOps · Helm · Kustomize · Python · GPU Scheduling · Enterprise SSO · Secrets Management · Policy Engine · Object Storage

This work taught me that reliability isn't a feature. It's a discipline. Observability, determinism, explicit trust, graceful failure.

Now I'm applying these patterns to AI agents. The infrastructure is ready. The agents are getting there.

> Start with the patterns that already work.