devstation engineering

Testing through MCP: running the suite against a real homelab

The suite drives a real Proxmox node over MCP — provision, install, uninstall, destroy. What that choice brought to big refactors, to the cross-OS Windows validation and to test discipline with AI.

Jun 28, 2026 André Spineli

The hard part of a project that touches real infrastructure is not writing the feature; it is trusting that a big change did not break anything. Provisioning, SSH and service installation tested only with mocks give a false sense of safety: such tests validate shape, not behavior. DevStation took a different route: testing through MCP, against a real Proxmox node in a homelab.

MCP as the test surface

The Model Context Protocol is an open standard for connecting agents to tools and data. The DevStation engine exposes an MCP server, and the e2e suite exercises the exact same boundary an agent would use: list clusters, provision a node, install a service over SSH, uninstall, destroy. By design, the MCP server is an inbound adapter that translates tool calls into the engine's existing JSON-RPC calls. It knows nothing about the internal structure of the contexts, and it carries an explicit allowlist and risk metadata for the destructive operations.
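To make the adapter idea concrete, here is a minimal sketch of an allowlist with risk metadata sitting in front of JSON-RPC. The tool names, the JSON-RPC method names and the `confirmed` flag are all assumptions made for illustration; the article does not show DevStation's actual identifiers or how its risk metadata is consumed.

```python
import json
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ToolSpec:
    rpc_method: str    # hypothetical JSON-RPC method on the engine
    destructive: bool  # risk metadata attached to the tool


# Explicit allowlist: a tool that is not listed is never forwarded.
# All names below are illustrative, not DevStation's real ones.
ALLOWLIST = {
    "list_clusters": ToolSpec("cluster.list", destructive=False),
    "provision_node": ToolSpec("node.provision", destructive=False),
    "install_service": ToolSpec("service.install", destructive=False),
    "uninstall_service": ToolSpec("service.uninstall", destructive=True),
    "destroy_node": ToolSpec("node.destroy", destructive=True),
}

_request_ids = count(1)


def to_jsonrpc(tool: str, arguments: dict, confirmed: bool = False) -> str:
    """Translate one tool call into a JSON-RPC 2.0 request string."""
    spec = ALLOWLIST.get(tool)
    if spec is None:
        raise PermissionError(f"tool not allowlisted: {tool}")
    if spec.destructive and not confirmed:
        # Assumed policy: destructive tools need an explicit confirmation.
        raise PermissionError(f"destructive tool needs confirmation: {tool}")
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": spec.rpc_method,
        "params": arguments,
    })
```

The adapter only maps names and checks policy; it never inspects what the engine does with the request, which is what keeps it ignorant of the internal contexts.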

Real infrastructure instead of mocks

The suite does not simulate Proxmox; it talks to the machine. It provisions a real node through OpenTofu, brings the VM up, connects over SSH, installs, uninstalls and destroys. It is slower than a unit test — and that is the trade accepted on purpose: what it returns is real behavior, not an approximation.
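The lifecycle described above can be sketched as a single function. This is an illustration under stated assumptions, not the suite's code: `call_tool`, the tool names and the `node_id` field are hypothetical. The point it shows is that teardown belongs in a `finally` block, so a failed install does not leave a VM running on the node.

```python
from typing import Any, Callable

# Hypothetical shape of an MCP tool call: (tool name, arguments) -> result.
CallTool = Callable[[str, dict[str, Any]], dict[str, Any]]


def run_cycle(call_tool: CallTool, cluster: str, service: str) -> None:
    """Run provision -> install -> uninstall -> destroy once."""
    node = call_tool("provision_node", {"cluster": cluster})
    node_id = node["node_id"]  # assumed field in the provision result
    try:
        call_tool("install_service", {"node_id": node_id, "service": service})
        call_tool("uninstall_service", {"node_id": node_id, "service": service})
    finally:
        # Always destroy the node, even when install or uninstall raised.
        call_tool("destroy_node", {"node_id": node_id})
```

In a real run, `call_tool` would be backed by an MCP client talking to the engine; taking it as a parameter only keeps the sketch independent of any particular client library.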

Confidence in big refactors

The payoff is clearest in refactors. When deploy was renamed to install and destroy to uninstall across the whole stack, the e2e run against the real node flagged within seconds a regression no mock would catch: an endpoint broke when it read the previous state. Finding that on the spot, rather than in production, is what makes broad changes viable. And because it is the agent that drives the suite, the “provision → install → uninstall → destroy” cycle runs in a loop, without constant supervision. The cost of verifying a big change dropped to near zero.
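The article does not describe the failure in detail. As one illustration of how a rename can break reads of state written before it, the sketch below upgrades legacy action names when a stored record is loaded. The record shape and the `action` field are assumptions, not DevStation's real state format.

```python
# Hypothetical mapping from pre-rename action names to the current ones.
# Meant for service-level records only; node teardown keeps its own name.
LEGACY_ACTIONS = {"deploy": "install", "destroy": "uninstall"}


def normalize_record(record: dict) -> dict:
    """Return a copy of a stored record with legacy action names upgraded."""
    action = record.get("action")
    if action in LEGACY_ACTIONS:
        return {**record, "action": LEGACY_ACTIONS[action]}
    return dict(record)
```

A mock built after the rename would only ever produce the new names, so it could not exercise this path; state left behind by an earlier run on the real node can.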

The cross-OS Windows journey

The Windows validation illustrates the method well. Two machines — one Linux, one Windows — shared a directory, with Claude Code running on both. The Linux agent compiled the binary and published the artifacts; the Windows agent validated, guided by instructions in .md files. The two ran for hours, nearly autonomous, until the whole CLI passed on Windows — and the MCP tests were run from both sides whenever needed. The operating-system boundary stopped being a manual event and became part of the loop.
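One way to make a shared-directory handoff safe between two agents is to publish a checksum manifest after the artifact and have the consumer verify it before use. The sketch below is an assumption about how this could be done; the article only states that the directory was shared and that .md files guided the Windows agent. The file name `manifest.json` and both function names are hypothetical.

```python
import hashlib
import json
import os
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical file name


def publish(shared: Path, name: str, data: bytes) -> None:
    """Producer side: write the artifact first, then the manifest."""
    (shared / name).write_bytes(data)
    manifest = {"artifact": name, "sha256": hashlib.sha256(data).hexdigest()}
    tmp = shared / (MANIFEST + ".tmp")
    tmp.write_text(json.dumps(manifest))
    # Swap the manifest in with one rename so a reader never sees a
    # half-written file; network shares may give weaker guarantees.
    os.replace(tmp, shared / MANIFEST)


def fetch_verified(shared: Path) -> bytes:
    """Consumer side: read the artifact and check it against the manifest."""
    manifest = json.loads((shared / MANIFEST).read_text())
    data = (shared / manifest["artifact"]).read_bytes()
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError("artifact does not match manifest")
    return data
```

With a check like this, the validating side can poll the directory and only start a run once a build is completely published.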

Test discipline and the role of the harness

There is a trade-off to acknowledge. There is evidence that AI’s gains shrink on mature projects a team already knows well (METR, 2025), and that models lose effectiveness on very large codebases. What changed the outcome was the harness enforcing the discipline: every feature ships with its test, every fix with its regression, and the agent runs the real suite because the rule demands it. AI did not dissolve the complexity on its own; a set of practices, plus a harness that enforces them, made the complexity manageable. Testing against the real thing, in a loop, changed what is safe to attempt.
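A rule such as "every feature ships with its test" can be enforced mechanically. The sketch below is a hypothetical check, not DevStation's harness: it assumes Python sources and a `test_<name>.py` naming convention, and it reports changed source files that have no matching changed test in the same change set.

```python
from pathlib import PurePosixPath


def missing_tests(changed: list[str]) -> list[str]:
    """Return changed source files with no matching changed test file.

    Assumes a hypothetical convention: foo.py is covered by test_foo.py.
    """
    paths = [PurePosixPath(p) for p in changed]
    changed_tests = {p.name for p in paths if p.name.startswith("test_")}
    return [
        str(p)
        for p in paths
        if p.suffix == ".py"
        and not p.name.startswith("test_")
        and f"test_{p.name}" not in changed_tests
    ]
```

A harness could run a check like this before accepting a change and refuse to continue while the returned list is non-empty.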

References

Model Context Protocol — Anthropic
Experienced developers’ productivity with AI on mature projects — METR (2025)
LLMs’ difficulty on large codebases — arXiv