AI agent governance, security tooling, and mechanical enforcement.

90 Days of Discovery and Vibe Engineering

ai-development, claude-code, vibe-engineering, governance, rigscore, case-study

At the start of 2026, I set out to push AI-assisted development as far as it would go on a single $200/month Claude Max plan. Not as a weekend hackathon, but as an 80-day stress test. The setup was simple: a WSL2 container, Claude Code as the coding agent, and whatever project seemed worth building next.

The output was about 20 distinct projects and roughly 147,000 lines of source. Six reached real production. The rest are archived. What I kept wasn’t the code. It was a pattern library of how vibe coding fails: structural failure modes I saw often enough to turn into a CLI called rigscore. Every section below maps to a check it runs.

Finding the Breaking Points

The projects mostly worked. Stepping back, the failure patterns were more interesting than the successes.

The v0.1.0 plateau. Almost every project with a pyproject.toml declared version 0.1.0. Two reached 1.0.0. AI is good at getting something to work the first time. It does not push you toward versioning discipline, release readiness, or the question “is this mature enough to ship?” That question never enters the loop because it is never the thing you are asking for. rigscore’s workflow-maturity check (advisory) flags this exact pattern: unpromoted skills, perpetual-draft libraries, stalled version strings.

Testing theater. Thirteen of about 20 projects had zero tests. The aggregate 0.53 test:source ratio looked respectable; it was noise. The misleading cases were worse than the gaps. One project hit a 3.18 ratio (915 lines of tests for 288 lines of source), but most of those tests were marked xfail. Another hit 3.86; its tests validated that specific keywords existed inside a system prompt string. Keyword presence is not functional testing.

The one project where I set cov-fail-under = 60 had the only honest test suite in the portfolio. Coverage thresholds force the conversation about what matters. Without them, Claude will cheerfully generate tests that assert whatever you let it assert. rigscore’s claude-settings and workflow-maturity checks read these signals: test-to-source ratios, coverage floors, xfail density.
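As a concrete sketch, assuming pytest with pytest-cov, a floor like the one described above can live directly in pyproject.toml (the threshold and source path here are illustrative, not the article's actual config):

```toml
[tool.pytest.ini_options]
# Fail the whole run when coverage drops below the floor,
# instead of letting xfail-heavy suites pass quietly.
addopts = "--cov=src --cov-fail-under=60"
```

With this in place, a suite of keyword-presence tests that never executes the real code paths fails loudly rather than inflating the test:source ratio.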

Dependency chain collapse. I retired two upstream data services. One decision cascaded into three dead projects because a downstream consumer had an explicit “deprecate when X retires” exit condition that fired exactly as written. The cascade was clean, but the fact that I hadn’t mapped it was a governance failure. rigscore’s coherence check (14 points, one of the heaviest) exists for this class of problem: it reads whether your governance files agree with each other and with the filesystem they describe.
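This class of failure is mechanical enough to sketch. The following is not rigscore’s coherence check, just a minimal illustration of the idea: given a map of who consumes what, walk downstream from a retired service and report everything whose exit condition would fire. All project names are hypothetical.

```python
from collections import deque

def cascade(dependents: dict[str, list[str]], retired: str) -> set[str]:
    """Return every project transitively downstream of a retired service.

    dependents maps a service name to the projects that consume it."""
    dead, queue = set(), deque([retired])
    while queue:
        node = queue.popleft()
        for consumer in dependents.get(node, []):
            if consumer not in dead:
                dead.add(consumer)
                queue.append(consumer)
    return dead

# One upstream retirement taking out three downstream projects:
graph = {"data-service": ["etl"], "etl": ["dashboard", "report-gen"]}
print(sorted(cascade(graph, "data-service")))  # ['dashboard', 'etl', 'report-gen']
```

Mapping this once, before retiring anything, is the difference between a planned deprecation and three surprise tombstones.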

Secrets hygiene. No plaintext secrets should exist in any repository, on any persistent filesystem, or in any container image. Eighty days of fast iteration produces a lot of config files, and config files are where secrets leak. deep-secrets (8 points) and env-exposure (8 points) are the two checks that catch this class of drift before it becomes an incident.
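This is not the deep-secrets check itself, just a minimal sketch of the class of scan it implies: walk the config files in a tree and flag lines that look like hardcoded credentials. The two patterns here are illustrative; real scanners carry far larger rule sets plus entropy-based detection.

```python
import re
from pathlib import Path

# Illustrative patterns only. Production scanners cover many more
# credential shapes and add entropy analysis for opaque tokens.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*['\"][^'\"]{8,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key id shape
]

def scan_text(text: str) -> list[int]:
    """Return 1-based line numbers that look like plaintext secrets."""
    return [
        i for i, line in enumerate(text.splitlines(), start=1)
        if any(p.search(line) for p in SECRET_PATTERNS)
    ]

def scan_tree(root: Path) -> dict[str, list[int]]:
    """Scan common config file types under root; map path -> hit lines."""
    hits = {}
    for path in root.rglob("*"):
        if path.suffix in {".json", ".toml", ".yaml", ".yml", ".ini"} or path.name == ".env":
            lines = scan_text(path.read_text(errors="ignore"))
            if lines:
                hits[str(path)] = lines
    return hits
```

Run as a pre-commit step, a scan like this turns “secrets leak from config files” from an incident class into a blocked commit.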

MCP supply chain. Late in the experiment I started wiring MCP servers into multiple clients. The attack surface there is different from npm or pip: a typosquatted MCP server runs with whatever permissions the client grants it, often silently. rigscore’s heaviest check, mcp-config (14 points), is the one I wish I had on day one.
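One cheap defense against typosquats, sketched here with hypothetical server names (this is an illustration of the idea, not mcp-config’s implementation): flag any configured server whose name sits within a small edit distance of a trusted one without being it.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def flag_typosquats(configured: list[str], trusted: list[str],
                    max_dist: int = 2) -> list[tuple[str, str]]:
    """Flag configured MCP server names suspiciously close to a trusted name."""
    return [
        (name, known)
        for name in configured if name not in trusted
        for known in trusted if edit_distance(name, known) <= max_dist
    ]

# Hypothetical names: "fliesystem" is two substitutions from "filesystem".
print(flag_typosquats(["filesystem", "fliesystem"], ["filesystem", "fetch"]))
```

A name-distance check catches the lazy typosquat; permission review at install time has to catch the rest.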

Building Governance: Three Layers Deep

The breaking points taught me what was missing: structure around the code. It arrived in three layers.

Layer 1: Project-level governance

LIFECYCLE.md files appeared in 14 projects. Nobody told me to write them. I started because I kept losing track of what was active, what was abandoned, and what was waiting on something upstream. Exit conditions turned out to be the most valuable part. Projects could kill themselves cleanly.

CLAUDE.md files appeared in 15 projects: documentation written for the AI collaborator, not for human readers. rigscore’s claude-md (10 points) and skill-files (10 points) checks read these files for the failure modes I kept hitting: stale instructions, contradictions with the filesystem, skill files that drift out of alignment with the code they describe.

Layer 2: Mechanical walls on autonomous execution

As the project count grew, the AI needed real permissions to operate without me at the keyboard. Real permissions plus behavioral-only constraints is a one-hallucination-away security problem.

The answer was architectural separation. Claude Code runs as user dev (uid 1001). Governance configuration lives under user joe (uid 1002) in a directory the dev container physically cannot access. The AI cannot modify its own constraints: not because a prompt says “don’t do this,” but because filesystem permissions make it structurally impossible. Behavioral rules can be worked around, intentionally or through drift. Mount points cannot.

“Rather than blocking known-bad commands, only explicitly permitted operations execute without interruption.” That principle, allowlist-shaped rather than denylist-shaped, shows up across every layer of the stack that held up under pressure. rigscore’s claude-settings (8 points) and permissions-hygiene (4 points) checks look for its inverse: deny-list-only configs that are easy to rationalize around.
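In Claude Code terms, an allowlist-shaped config looks roughly like this in .claude/settings.json. The specific rule strings below are illustrative; check the current Claude Code permissions documentation for the exact matcher syntax:

```json
{
  "permissions": {
    "allow": [
      "Bash(git status)",
      "Bash(npm test:*)",
      "Read(./src/**)"
    ],
    "deny": [
      "Bash(sudo:*)",
      "Read(./.env)"
    ]
  }
}
```

The shape is what matters: a short allow list that names what may run unattended, with the deny list as backstop rather than the primary mechanism.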

Layer 3: From observation to walls

The third layer was an observation system I built. It logged every tool call the AI made (thousands of JSONL events per session with CRC32 integrity checksums) and ran real-time anomaly detection on tool-call sequences. The most useful detection was the fix-loop detector: repeated file edits without intervening diagnosis, a pattern anyone who has worked with agentic AI has seen.

The system generated retrospectives. From those retrospectives it proposed learned bans: rules derived from observed behavior rather than predefined policy. sudo chown got globally banned after root-ownership incidents. It caught that the AI routinely hardcoded host-specific paths like /home/joe/ into code files. It caught governance changes slipping through without documentation updates.

And then, on April 6, 2026, I decommissioned it.

I replaced it with three mechanical pre-commit hooks: no-hardcoded-host-home, governance-decision-record, and check-dirty-submodules. Instead of watching the AI make a mistake and learning from it, the hooks make the most common mistakes impossible to commit. The observation system’s complexity had outgrown its value. The patterns it discovered were real and useful, but once I understood them I didn’t need a learning system to keep rediscovering them. I needed walls.
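As a sketch of the first hook’s idea (the actual hooks are not shown in this article, and the path pattern here is illustrative): fail the commit when any staged file hardcodes a host-specific home path.

```python
#!/usr/bin/env python3
"""Minimal pre-commit sketch: reject hardcoded host home paths.
Wire up by installing as .git/hooks/pre-commit (executable) and
calling sys.exit(main()) as the entry point."""
import re
import subprocess
import sys

# Host-specific; code should use $HOME / Path.home() instead.
FORBIDDEN = re.compile(r"/home/[a-z]+/")

def offending(files_text: dict[str, str]) -> list[str]:
    """Return names of files whose contents hardcode a home path."""
    return [name for name, text in files_text.items() if FORBIDDEN.search(text)]

def main() -> int:
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    contents = {}
    for name in staged:
        try:
            contents[name] = open(name, errors="ignore").read()
        except OSError:
            continue  # deleted or unreadable staged entry
    bad = offending(contents)
    for name in bad:
        print(f"hardcoded home path in {name}", file=sys.stderr)
    return 1 if bad else 0
```

Twenty lines of wall replacing a learning system: the hook never needs to rediscover the pattern, it just refuses the commit.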

rigscore is the same pattern applied to the configuration layer. It doesn’t watch the AI; it reads your filesystem state once and tells you where the drift is. One bypass at one layer doesn’t compromise the system; the whole point is that the checks are independent. The observation layer teaches you what to codify. The codification is the thing you ship.

What I’d Tell Someone Starting This Today

Three takeaways survived the archival pass. Each maps to a class of rigscore check.

Set a coverage threshold on day one. The one project where I enforced cov-fail-under had the only honest test suite in the portfolio. AI will happily generate hundreds of lines of xfail tests or tests that check whether keywords exist in a prompt string. A numeric threshold in your tool config forces the conversation about what actually needs to be tested. This is a settings-file signal โ€” exactly what rigscore’s claude-settings and workflow-maturity checks are tuned to read.

Separate governance from the thing being governed. The most important decision I made was architectural: put governance configuration in a directory the AI’s runtime user cannot access. Not “don’t modify this file” in a system prompt, but filesystem permissions that make modification structurally impossible. rigscore’s permissions-hygiene (4 points) and credential-storage (6 points) checks read for this specifically: governance files owned by the wrong user, credentials on paths the runtime can reach, .claude/settings.json writable by the agent it governs.

Build the observation layer, then replace it with walls. The observation system taught me what the AI gets wrong โ€” fix loops, hardcoded paths, root ownership issues. Once I understood the patterns, I codified the preventable ones as pre-commit hooks and, eventually, as rigscore checks. The observation phase was valuable and necessary. The goal is to graduate from it. rigscore’s git-hooks check (2 points) is small by weight but load-bearing by role: it verifies the walls are actually installed.

Conclusion

I built about 20 projects in under three months. Six produced verified output. The majority are archived. The archived projects taught me more than the active ones, because they are where the breaking points live.

The most durable output wasn’t any single project. It was the governance infrastructure that emerged: mount-point separation, pre-commit hooks, LIFECYCLE.md files that let projects kill themselves cleanly, and, eventually, a scoring tool that reads all of it at once and reports the gaps.

rigscore is the compressed form. The docs have the check list and the install line. The gap between vibe coding and vibe engineering is governance, and governance is measurable.

npx github:Back-Road-Creative/rigscore .

How this article was made: this is a substantial rewrite of an earlier 4,500-word draft. The source material was reviewed, the three sections that mapped cleanly to rigscore check categories were kept and compressed, and the supporting volume (project rosters, podcast pipeline anecdotes, archival taxonomy) was cut. All check names and point weights are pulled from the current rigscore src/constants.js.

Configuration details reflect a production environment at time of writing. Implementation specifics vary based on tooling versions, platform updates, and organizational requirements. Validate approaches against current documentation before deployment.