AI agents are often highly interactive. They engage with tools, environments, people, and other agents. Their behaviour depends on the system design, the underlying LLM, system prompts, user inputs, active context, and historical memory. Even minor changes can push an agent down a completely different path, with vastly different outcomes. To build trust in custom AI agents, we need to expand beyond output verification and include behavioural evaluation.
The core principles of software quality remain as valid as ever (such as Functional Suitability, Reliability, Security, ... from ISO 25010). What must change is how we apply them in practice. Just as agile methods and shift-left testing transformed how we work, we now need to fundamentally rethink how we interpret and measure these quality principles for custom agents and LLMs.
Take "Functional Suitability." We can no longer verify it through strict, binary assertions alone. Instead, we also evaluate it probabilistically using semantic relevance, context adherence, and AI-as-a-judge scoring. "Reliability" still covers consistency, robustness, and graceful recovery from failures, while AI-specific concerns such as factuality and faithfulness can be seen as new facets of existing quality characteristics like Functional Correctness, but they require different evaluation techniques that go beyond traditional verification methods.
Testing is only possible to the extent that a solution supports it. Testability is arguably even more critical for AI agents than for traditional software. Without it, we may detect that an agent failed, but we will not know whether the cause was the prompt, the data lookup, a tool error, or a model hallucination.
Designing for testability requires three clear shifts:
Building on modularization, we also need to rethink how we structure the broader workflow. When "agentifying" a workflow, we should not give AI free rein over every step. It would be better to keep deterministic operations in conventional code and reserve agentic behaviour for the decision points that truly require probabilistic reasoning. This usually improves testability, limits failure propagation, and reduces unnecessary token cost and latency, even though it may introduce some orchestration overhead.
We cannot rely solely on other AI systems to evaluate quality. AI judges make mistakes too. We need a differentiated, layered evaluation approach. One where each layer compensates for the weaknesses of the others:
Unit Tests: Verify that individual functions work correctly and catch basic logic errors.
Integration: Confirm that tools and APIs connect and respond correctly, catching schema mismatches and integration failures. Also validate that security boundaries hold: can the agent be manipulated by malicious instructions hidden inside a tool response or a retrieved document?
Model-Based Evals (AI-as-a-Judge): Evaluate whether observable behaviour, traces and outputs make sense, catching nuance and trajectory failures. This applies to both single agents and orchestrators.
Human-in-the-Loop: Provide human oversight by manually reviewing the most complex traces within full-system simulations, catching edge cases that models might miss and assessing the agent's "vibe" and voice.
Production Monitoring: Detect real-world drift and unexpected behaviour once the system is live.
With AI acting as both executor and, at certain layers, grader, what happens to the traditional software tester? The QA mindset becomes more critical than ever, while the day-to-day execution expands beyond writing deterministic assertion scripts. As we become “Eval Engineers” we still rely on deterministic tests for predictable components, and additionally we should also be designing model-based evals, curating datasets, and analysing production feedback. In practice, AI may help generate both kinds of checks, but humans remain responsible for deciding what should be verified and how trust is established.
Because model-based evals are often expressed as natural-language grading rubrics (the specific scales and scoring criteria) that instruct an LLM how to assess an agent’s transcript, testers are well-positioned to design them. Their strength lies not just in writing prompts, but in defining clear quality criteria, identifying edge cases, and calibrating automated judgments against human expectations. Our role evolves towards:
Mastering Agentic Architecture: Going beyond basic LLM prompts, Eval Engineers need to understand the foundational building blocks of agents. This includes concepts like skills, tools, memory, reasoning patterns, and whatever comes next.
Deeply Understanding the Solution: Moving away from purely "black-box" testing. To evaluate an agent effectively, you need to understand its internal architecture: the tools and APIs it uses, its retrieval mechanisms (RAG), and multi-agent handoffs. Diagnosing whether a failure was caused by an ambiguous API response or a model hallucination requires deep architectural awareness.
Eval-Driven Development: Defining the natural-language grading rubrics and the success criteria for evaluation before the agent is fully built.
Curating Golden Datasets: Designing the challenging edge cases and ambiguous inputs that stress-test the agent's autonomy for its expected real-world usage.
Red-Teaming the Agent: Testing whether the agent can be manipulated into doing something it shouldn't through jailbreaks, guardrail bypasses, or malicious instructions hidden inside content it retrieves.
Human-in-the-Loop Calibration: Acting as the expert baseline by manually reviewing complex agent traces to ensure AI judges are grading accurately and have not drifted. Beyond strict accuracy, human testers remain essential for assessing user-friendliness, evaluating conversational tone, and gauging the overall "vibe" people experience when interacting with the agent.
Securing the Guardrails: Continuing to write traditional automated tests for deterministic components, API integrations, and security boundaries within the workflow.
To support the shift towards Eval Engineering, testing frameworks must adapt to the probabilistic reality of agents. Rather than only demanding 100% green test suites, teams need to adopt statistical pass rates as well, measuring metrics like pass@k (Can the agent get it right at least once in k tries?) or pass^k (Can the agent get it right every single time in k tries?) to account for non-binary functional correctness.
To address the "no single right answer" challenge we can apply metamorphic testing. This is a testing technique that, instead of checking exact outputs, checks if the relationship between multiple outputs makes sense. For example testing that if a normal search delivers x results, using the same search with additional criteria would result in a number of results that is < x . Leveraging the observability built into the system design, engineers can perform root cause analysis to identify more confidently which hallucinated context derailed an agent. This feeds into multi-objective evaluation: moving beyond simple pass/fail to weigh trade-offs between reliability, safety, cost, and speed.
Finally, while static golden datasets remain essential for regression testing, ensuring that the same inputs continue to yield acceptable results over time, they are not sufficient on their own. Teams must also complement them with dynamic benchmarks that continually evolve alongside the agents to catch new vulnerabilities and edge cases as models and capabilities are updated.
For Eval Engineering to build real trust, it must move from local notebooks into the deployment pipeline. Traditional CI systems may use deterministic tests as hard deployment gates. Agent evaluation suites are different, because they rely on statistical performances and model-based judging they do not always fit a strict pass/fail model.
They require Probabilistic Regression Gating, blocking deployments when performance drops below agreed statistical thresholds across representative benchmark sets. For instance, a release is halted if a tool-calling skill's pass@1 rate falls below 92% over a sufficiently large sample, or if API token consumption suddenly spikes without any quality gain.
Operationalizing evals also requires a "Data Flywheel" that connects production monitoring to the CI/CD pipeline. Real-world edge cases, such as infinite loops or hallucinated context, are invaluable. When monitoring tools flag a bad response or users submit negative feedback, Eval Engineers analyse the feedback and, when valid, capture the trace and convert it into a permanent test case for the pipeline's "Golden Dataset". This continuous feedback loop ensures the evaluation suite evolves alongside the agent, turning yesterday's production failures into tomorrow's automated safeguards.
Quality in the agentic era starts with testability. Individual agents must be testable by design. Workflows should be architected to restrict "agency" only to the nodes where probabilistic reasoning is genuinely needed, maximizing deterministic behaviour across the other nodes. Today's frontier models are powerful, but they still regularly stumble on context overload and infinite loops. Restricted autonomy is not just a testing best practice; it is an operational necessity.
With the rise of agents, some testing tasks will be absorbed or accelerated by AI. But that does not mean human testers become redundant. It means our role is shifting and our required technical understanding with it. We are moving from software tester to "Evaluation Engineer", a role where our critical thinking can shine even brighter as we build layered defences against agent unpredictability.