AI Systems · Distributed Systems

From Manual Testing to Multi-Agent Automation: Building AI Systems That Actually Work

Why separating compilation from execution, enforcing context boundaries, and treating AI as a systems problem produces reliable automation at scale.

The Scale Problem That Forced the Question

Imagine you have a thousand test cases. Each one validates a business workflow spanning multiple backend systems: order processing, financial accounting, payment reconciliation, tax calculation, inventory management. Each test requires an engineer who understands the business domain, can navigate multiple system interfaces, extract the right data, and document evidence.

Now imagine you need to run these tests before every release. And the test suite is growing faster than you can hire.

Over 80 engineers were dedicated to manual test execution. The resource model was unsustainable. The question wasn’t whether to automate; it was how to automate something that seemed to require human judgment at every step.

Why Traditional Test Automation Doesn’t Work Here

Traditional automation works for deterministic scenarios. Click this button, verify this element appears. Our problem was different in three ways:

  1. The validation logic is domain-specific and constantly changing. A test verifying financial ledger entries for Germany requires different account codes and tax rules than the same test for the United States.

  2. Tests span multiple systems with different interfaces. A single test case might extract data from order management, cross-reference accounting entries, verify payment records, and check tax calculations.

  3. Timing is non-deterministic. Business events cascade across systems with variable delays. Fixed wait times either waste time or fail.
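The timing problem above is usually handled by polling with a deadline rather than fixed waits. A minimal sketch (the helper name and parameters are illustrative, not from the system described here):

```python
import time

def wait_until(check, timeout=30.0, initial_delay=0.5, backoff=2.0, max_delay=8.0):
    """Poll `check` with exponential backoff until it returns truthy or the deadline passes."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() + delay > deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(delay)
        delay = min(delay * backoff, max_delay)

# Usage: wait for a cascaded event (a stub that becomes true on the third poll)
calls = {"n": 0}
def invoice_posted():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_until(invoice_posted, timeout=10, initial_delay=0.01)
```

Compared with a fixed `sleep(60)`, this returns as soon as the downstream event lands and fails with a clear timeout when it never does.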

The Architecture: Compilation + Orchestration

The core insight: separate the intelligence from the execution.

Most AI automation tries to be intelligent at runtime: an agent that reasons about what to do while interacting with live systems. This is fragile. LLMs hallucinate. Network calls fail. System state changes between reasoning steps.

Instead, we split into two phases:

Compilation (heavy reasoning, offline): Transform natural language test specifications into complete, executable plans. All ambiguity resolution happens here, before any live system is touched.

Orchestration (minimal reasoning, online): Execute the compiled plan step by step. The orchestrator is deliberately mechanical; it doesn’t need to reason because the compiler already did.

This separation has a profound benefit: you can review and correct the compilation output before execution. Compile-time errors are cheap. Runtime AI errors are expensive.
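The split can be sketched as: compilation emits a plan that is pure data, and the orchestrator dispatches it with exact lookups and no reasoning. A minimal illustration (the `Step` shape and agent names are hypothetical, not the actual plan format):

```python
from dataclasses import dataclass

@dataclass
class Step:
    agent: str    # which system agent executes this step
    action: str   # fully resolved action name; no runtime disambiguation
    params: dict  # all parameters bound at compile time

def orchestrate(plan, agents):
    """Mechanically execute a compiled plan; no model calls at runtime."""
    results = []
    for step in plan:
        handler = agents[step.agent]  # exact lookup; fails loudly if missing
        results.append(handler(step.action, step.params))
    return results

# Toy agents standing in for real system adapters
agents = {
    "orders": lambda action, p: {"order_id": p["order_id"], "status": "shipped"},
    "ledger": lambda action, p: {"account": "4000", "amount": 99.0},
}

plan = [
    Step("orders", "get_order", {"order_id": "A-17"}),
    Step("ledger", "get_entries", {"order_id": "A-17"}),
]
results = orchestrate(plan, agents)
```

Because the plan is inert data, it can be diffed, reviewed, and corrected before anything touches a live system.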

The Compilation Pipeline: 11 Stages

The compiler transforms natural language into executable specifications through 11 stages: two phases of processing, capped by a final validation stage.

Phase 1: Atomic Extraction (Stages 1-5)

Each stage processes one test step in isolation. It can see only the current step’s text and the test’s preconditions. It cannot see other steps.

This isolation is the single most important design decision in the entire system.

Why? Because LLMs hallucinate more when they have more context. If a model can see all 15 steps while processing step 3, it will “helpfully” reference information from step 8 that doesn’t exist yet, invent relationships between steps, and produce plausible-looking but incorrect specifications.

By limiting context to only the current step, we made it structurally impossible to hallucinate cross-step relationships. In our measurements, this reduced hallucination rates from approximately 23% to under 2%.
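Structurally, this isolation amounts to building each stage's context so other steps are simply absent from the input. A toy sketch of the idea (field names are illustrative):

```python
def build_stage_context(test, step_index):
    """Phase 1 context: only the current step and the test's preconditions are visible."""
    return {
        "preconditions": test["preconditions"],
        "step": test["steps"][step_index],
        # Deliberately excluded: the other steps. If they never enter the prompt,
        # the model cannot hallucinate cross-step relationships.
    }

test = {
    "preconditions": ["order exists in DE marketplace"],
    "steps": ["Create order", "Verify shipment", "Verify GL entries"],
}
ctx = build_stage_context(test, 2)
assert ctx["step"] == "Verify GL entries"
assert "steps" not in ctx
```

The guarantee is enforced by construction, not by prompt instructions asking the model to ignore other steps.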

Phase 2: Cross-Step Integration (Stages 6-10)

After atomic processing, the second phase expands context to see all steps together. Now the system can reason about cross-step relationships: ordering, timing, data flow completeness.

Stage 11 performs quality validation with a repair loop that re-runs low-confidence stages with targeted hints.
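A repair loop of this shape can be sketched as follows, with confidence thresholds and hint wording as illustrative stand-ins for the real stage-11 logic:

```python
def run_pipeline(stages, spec, confidence_threshold=0.8, max_repairs=2):
    """Run stages in order; re-run any low-confidence stage with a targeted hint."""
    outputs = {}
    for name, stage in stages:
        output, confidence = stage(spec, hint=None)
        repairs = 0
        while confidence < confidence_threshold and repairs < max_repairs:
            hint = f"Previous output for {name} scored {confidence:.2f}; re-check."
            output, confidence = stage(spec, hint=hint)
            repairs += 1
        outputs[name] = output
    return outputs

# Stub stage: low confidence until a hint is supplied (illustrative only)
def flaky_stage(spec, hint=None):
    return ({"parsed": spec, "hint_used": hint is not None}, 0.95 if hint else 0.5)

outputs = run_pipeline([("extract", flaky_stage)], "verify GL entries")
```

Bounding repairs with `max_repairs` keeps a persistently low-confidence stage from looping forever; it surfaces for human review instead.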

System Agents: Domain Experts

Each backend system has a dedicated agent, an expert in that system and nothing else. A financial ledger agent knows how to navigate the ledger, what fields to extract, what account codes mean. It knows nothing about order management.

The architecture follows a principle: agents extract, platform evaluates. Agents get data out of systems. They never decide whether that data is correct. Assertion evaluation happens in the orchestrator, ensuring consistent behavior across all system types.
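The extract/evaluate boundary can be sketched as two functions with a hard contract between them (data shapes and field names are hypothetical):

```python
def ledger_agent(order_id):
    """Agent: extracts raw data only; makes no correctness judgment."""
    return {"account_code": "4000", "amount": 119.0, "tax": 19.0}

def evaluate_assertion(extracted, assertion):
    """Orchestrator: every pass/fail decision happens here, uniformly for all systems."""
    field, op, expected = assertion
    actual = extracted[field]
    return {"==": actual == expected, ">=": actual >= expected}[op]

data = ledger_agent("A-17")
assert evaluate_assertion(data, ("tax", "==", 19.0))
```

Because no agent embeds its own pass/fail logic, a change to assertion semantics (tolerances, comparison rules, reporting) is made once in the orchestrator rather than in every agent.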

Knowledge Architecture: Three Layers

Layer 1: Registry (exact lookup, rarely changes). Catalog of available systems and agents.

Layer 2: Capability Metadata (semantic search, changes with features). What each agent can do: use cases, schemas, relationships.

Layer 3: Execution Knowledge (runtime reference, changes with systems). How-to documentation for each system: navigation, endpoints, extraction procedures.

This separation means a system UI change (Layer 3) doesn’t affect compilation (Layer 2). Teams update agent navigation without recompiling test specs.
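The three layers can be pictured as three stores consulted at different times. In this toy sketch (all names and the token-overlap "semantic search" are illustrative stand-ins), compilation touches only Layers 1 and 2; Layer 3 is reserved for runtime:

```python
# Layer 1: registry — exact lookup, rarely changes
REGISTRY = {"ledger": "LedgerAgent", "orders": "OrderAgent"}

# Layer 2: capability metadata — matched semantically at compile time
CAPABILITIES = {
    "ledger": ["verify GL entries", "extract account balances"],
    "orders": ["create order", "check fulfillment status"],
}

# Layer 3: execution knowledge — how-to details consulted only at runtime
EXECUTION_KNOWLEDGE = {"ledger": {"entries_endpoint": "/api/v2/gl/entries"}}

def compile_step(step_text):
    """Compile time: pick an agent by capability. Layer 3 is never read here,
    so a UI or endpoint change cannot invalidate compiled specs."""
    words = set(step_text.lower().split())
    def score(agent):
        return max(len(words & set(c.lower().split())) for c in CAPABILITIES[agent])
    return max(CAPABILITIES, key=score)

assert compile_step("Verify the GL entries for this order") == "ledger"
```

A real system would use embedding similarity for Layer 2 matching; the point is the dependency direction, not the matching algorithm.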

Knowledge Graphs: The Missing Piece

One challenge the multi-agent system couldn’t solve alone: understanding dependency relationships between systems. When a test says “verify the GL entries for this order,” the system needs to know GL entries depend on invoice generation, which depends on shipment completion, which depends on order fulfillment.

We built a separate extraction pipeline processing source code, API contracts, configuration files, and business documents into a unified dependency graph with complete provenance tracking back to source evidence.

The extraction is deterministic: same inputs, same outputs. It processes multiple artifact types (code via AST parsing, documents via structured extraction, API contracts via schema analysis) and unifies them into a single queryable graph.
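A dependency graph with per-edge provenance might look like the following minimal sketch (node names and provenance strings are invented for illustration):

```python
from collections import defaultdict

class DependencyGraph:
    """Unified dependency graph; every edge carries provenance back to a source artifact."""
    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(dependency, provenance)]

    def add(self, node, depends_on, provenance):
        self.edges[node].append((depends_on, provenance))

    def upstream(self, node):
        """All transitive dependencies of `node`, in deterministic dependency-first order."""
        seen, order = set(), []
        def visit(n):
            for dep, _ in self.edges[n]:
                if dep not in seen:
                    seen.add(dep)
                    visit(dep)
                    order.append(dep)
        visit(node)
        return order

g = DependencyGraph()
g.add("gl_entries", "invoice", "billing-service/src/poster.java:112")
g.add("invoice", "shipment", "api-contracts/invoicing.yaml")
g.add("shipment", "fulfillment", "docs/order-lifecycle.md")
assert g.upstream("gl_entries") == ["fulfillment", "shipment", "invoice"]
```

When a test says "verify the GL entries," the compiler can walk `upstream("gl_entries")` to know which events must complete first, and the provenance on each edge points back to the evidence that established the dependency.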

The Hard Lessons

Determinism is more valuable than intelligence. The most reliable parts are the most boring: eligibility checks, assertion evaluation, evidence capture. We minimized non-deterministic surface area.

Context boundaries prevent more bugs than any other technique. The biggest quality improvement came from limiting what each stage could see. Not better prompts. Not better models.

Few-shot examples from your own system beat generic examples. We maintain a corpus of successful outputs. The system retrieves semantically similar past examples per stage. It gets better with use.
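Retrieval over your own corpus can be as simple as a similarity-ranked lookup. A toy sketch, using token overlap (Jaccard) where a production system would use embeddings:

```python
def retrieve_examples(query, corpus, k=2):
    """Return the k past successful examples most similar to the query."""
    q = set(query.lower().split())
    def similarity(example):
        t = set(example["input"].lower().split())
        return len(q & t) / len(q | t)
    return sorted(corpus, key=similarity, reverse=True)[:k]

corpus = [
    {"input": "verify GL entries for refund", "output": "..."},
    {"input": "create order in DE marketplace", "output": "..."},
    {"input": "verify GL entries for order", "output": "..."},
]
best = retrieve_examples("verify the GL entries for this order", corpus, k=2)
```

Each new corrected compilation is appended to the corpus, so retrieval quality compounds with use.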

The feedback loop matters more than initial accuracy. First pipeline: ~70% accuracy. Correcting a compiled spec is still faster than writing from scratch. Every correction feeds back, improving future compilations.

Evidence capture is first-class. In regulated contexts, it’s not enough to verify a value is correct; you need to prove it. Every agent interaction produces evidence artifacts linked to supporting assertions.

The Takeaway

AI in production isn’t about building the smartest possible agent. It’s about building systems where AI handles what it’s uniquely good at (language understanding, ambiguity resolution, pattern recognition) while deterministic infrastructure handles everything else.

Separate compilation from execution. Enforce context boundaries. Specialize your agents. Build feedback loops. And always make AI output reviewable before it touches a live system.

The 80+ engineers aren’t gone. They’re doing higher-value work: designing test strategies, improving business processes, building domain knowledge. The AI didn’t replace human judgment. It amplified it.