Testing in the Era of Probabilistic AI (From Tester to Eval Engineer)

Traditional software testing has largely depended on a fundamental assumption: input A produces output B, at least for deterministic software. But what happens when the system we are testing is autonomous, dynamic, and inherently probabilistic? How do we evaluate it when there is rarely a single correct answer?

Wouter Emael

May 28, 2026

AI agents are often highly interactive. They engage with tools, environments, people, and other agents. Their behaviour depends on the system design, the underlying LLM, system prompts, user inputs, active context, and historical memory. Even minor changes can push an agent down a completely different path, with vastly different outcomes. To build trust in custom AI agents, we need to expand beyond output verification and include behavioural evaluation.

Rethinking our testing approach: Principles over practices

The core principles of software quality, such as Functional Suitability, Reliability, and Security from ISO 25010, remain as valid as ever. What must change is how we apply them in practice. Just as agile methods and shift-left testing transformed how we work, we now need to fundamentally rethink how we interpret and measure these quality principles for custom agents and LLMs.

Take "Functional Suitability." We can no longer verify it through strict, binary assertions alone. Instead, we also evaluate it probabilistically using semantic relevance, context adherence, and AI-as-a-judge scoring. "Reliability" still covers consistency, robustness, and graceful recovery from failures. AI-specific concerns such as factuality and faithfulness can be seen as new facets of existing quality characteristics like Functional Correctness, but they require evaluation techniques that go beyond traditional verification methods.

Designing agentic solutions with testability in mind

Testing is only possible to the extent that a solution supports it. Testability is arguably even more critical for AI agents than for traditional software. Without it, we may detect that an agent failed, but we will not know whether the cause was the prompt, the data lookup, a tool error, or a model hallucination.

Designing for testability requires three clear shifts:

  • Observability: We cannot verify what we cannot see. Testable agents need traceable execution, including context retrieval, tool usage, workflow decisions, and outputs, so failures can be isolated and understood.
  • Controllability: A testable design lets us mock environment states. For example, replacing a live API call with a fixed response to verify how the agent handles specific data.
  • Modularization of AI skills: A single, monolithic prompt is extremely difficult to test. Breaking agents into modules or sub-agents allows us to test specific logic in isolation.
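To make observability and controllability concrete, here is a minimal sketch in Python. All names (`Trace`, `stock_agent`, `fake_lookup_empty`) are hypothetical and invented for illustration, and the "agent" is plain code standing in for a real LLM-driven step. The point is only the shape of the design: the tool is injected so a test can swap it for a fixed response, and every step is recorded so a failure can be localized.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """Collects the steps an agent takes so a failure can be localized."""
    steps: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, kind: str, detail: str) -> None:
        self.steps.append((kind, detail))


def stock_agent(question: str, lookup_stock: Callable[[str], int], trace: Trace) -> str:
    """Hypothetical agent step. The tool is passed in, so tests can replace it."""
    trace.record("input", question)
    quantity = lookup_stock("widget")          # would be a live API call in production
    trace.record("tool:lookup_stock", str(quantity))
    answer = "in stock" if quantity > 0 else "out of stock"
    trace.record("output", answer)
    return answer


def fake_lookup_empty(item: str) -> int:
    # Fixed response standing in for the live API: the warehouse is empty.
    return 0


trace = Trace()
result = stock_agent("Do you have widgets?", fake_lookup_empty, trace)
```

With the stub in place, a test can assert both the final answer and the recorded tool call, which is what lets us tell a tool problem apart from a reasoning problem.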

Building on modularization, we also need to rethink how we structure the broader workflow. When "agentifying" a workflow, we should not give AI free rein over every step. It would be better to keep deterministic operations in conventional code and reserve agentic behaviour for the decision points that truly require probabilistic reasoning. This usually improves testability, limits failure propagation, and reduces unnecessary token cost and latency, even though it may introduce some orchestration overhead.
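As a rough illustration of this split, the sketch below handles a refund request. The scenario, the `AUTO_APPROVE_LIMIT` rule, and the function names are assumptions made up for this example. Parsing and the simple business rules stay in conventional code; only the judgement about the free-text reason is delegated to an injected classifier, which in a real system might be an LLM call.

```python
from typing import Callable

AUTO_APPROVE_LIMIT = 50.0  # hypothetical business rule, in EUR


def parse_amount(raw: str) -> float:
    """Deterministic step: no model needed to read a number."""
    return float(raw.replace("EUR", "").strip())


def handle_refund(raw_amount: str, reason: str,
                  classify_reason: Callable[[str], str]) -> str:
    amount = parse_amount(raw_amount)
    if amount <= 0:
        return "rejected"                      # deterministic rule
    if amount <= AUTO_APPROVE_LIMIT:
        return "approved"                      # deterministic rule
    # Only this decision point needs probabilistic reasoning.
    label = classify_reason(reason)
    return "approved" if label == "legitimate" else "escalated"


def must_not_be_called(reason: str) -> str:
    # Test double that proves the model was never consulted.
    raise AssertionError("classifier should not run for small refunds")


small = handle_refund("20 EUR", "changed my mind", must_not_be_called)
large = handle_refund("EUR 120", "item arrived broken", lambda r: "legitimate")
```

Because the deterministic branches never touch the classifier, they can be covered by ordinary unit tests, and the model-based evaluation effort can focus on the single node that needs it.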

The Swiss cheese of quality

We cannot rely solely on other AI systems to evaluate quality. AI judges make mistakes too. We need a differentiated, layered evaluation approach in which each layer compensates for the weaknesses of the others:

  1. Unit Tests: Verify that individual functions work correctly and catch basic logic errors.

  2. Integration: Confirm that tools and APIs connect and respond correctly, catching schema mismatches and integration failures. Also validate that security boundaries hold: can the agent be manipulated by malicious instructions hidden inside a tool response or a retrieved document?

  3. Model-Based Evals (AI-as-a-Judge): Evaluate whether observable behaviour, traces and outputs make sense, catching nuance and trajectory failures. This applies to both single agents and orchestrators.

  4. Human-in-the-Loop: Provide human oversight by manually reviewing the most complex traces within full-system simulations, catching edge cases that models might miss and assessing the agent's "vibe" and voice.

  5. Production Monitoring: Detect real-world drift and unexpected behaviour once the system is live.

The Evolving Role of the Software Tester

With AI acting as both executor and, at certain layers, grader, what happens to the traditional software tester? The QA mindset becomes more critical than ever, while the day-to-day execution expands beyond writing deterministic assertion scripts. As we become "Eval Engineers", we still rely on deterministic tests for predictable components, but we also design model-based evals, curate datasets, and analyse production feedback. In practice, AI may help generate both kinds of checks, but humans remain responsible for deciding what should be verified and how trust is established.

Because model-based evals are often expressed as natural-language grading rubrics (the specific scales and scoring criteria) that instruct an LLM how to assess an agent’s transcript, testers are well-positioned to design them. Their strength lies not just in writing prompts, but in defining clear quality criteria, identifying edge cases, and calibrating automated judgments against human expectations. Our role evolves towards:

  • Mastering Agentic Architecture: Going beyond basic LLM prompts, Eval Engineers need to understand the foundational building blocks of agents. This includes concepts like skills, tools, memory, reasoning patterns, and whatever comes next.

  • Deeply Understanding the Solution: Moving away from purely "black-box" testing. To evaluate an agent effectively, you need to understand its internal architecture: the tools and APIs it uses, its retrieval mechanisms (RAG), and multi-agent handoffs. Diagnosing whether a failure was caused by an ambiguous API response or a model hallucination requires deep architectural awareness.

  • Eval-Driven Development: Defining the natural-language grading rubrics and the success criteria for evaluation before the agent is fully built.

  • Curating Golden Datasets: Designing the challenging edge cases and ambiguous inputs that stress-test the agent's autonomy for its expected real-world usage.

  • Red-Teaming the Agent: Testing whether the agent can be manipulated into doing something it shouldn't through jailbreaks, guardrail bypasses, or malicious instructions hidden inside content it retrieves.

  • Human-in-the-Loop Calibration: Acting as the expert baseline by manually reviewing complex agent traces to ensure AI judges are grading accurately and have not drifted. Beyond strict accuracy, human testers remain essential for assessing user-friendliness, evaluating conversational tone, and gauging the overall "vibe" people experience when interacting with the agent.

  • Securing the Guardrails: Continuing to write traditional automated tests for deterministic components, API integrations, and security boundaries within the workflow.
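One way to approach the calibration task in the list above is to compare an AI judge's verdicts with human verdicts on the same traces. The sketch below is an assumption about how that could be done, not a prescribed method: it computes raw agreement and Cohen's kappa, which corrects agreement for chance. The labels are made-up sample data.

```python
from collections import Counter
from typing import Sequence


def agreement(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of traces where the judge and the human gave the same label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e).

    Undefined when p_e equals 1 (both raters always use one identical label).
    """
    n = len(human)
    p_o = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels for ten reviewed traces (invented data).
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "fail", "pass"]

raw = agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

A team could track such a score over time and revisit the rubric when it drops. What counts as an acceptable level is a judgement call for the team and is not something this sketch decides.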

Advanced Methodologies for the Agentic Era

To support the shift towards Eval Engineering, testing frameworks must adapt to the probabilistic reality of agents. Rather than only demanding 100% green test suites, teams need to adopt statistical pass rates as well, measuring metrics like pass@k (Can the agent get it right at least once in k tries?) or pass^k (Can the agent get it right every single time in k tries?) to account for non-binary functional correctness.
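Both metrics can be estimated empirically from repeated runs. The following sketch assumes each task was attempted k times and each attempt was graded pass or fail; the function names and the sample data are illustrative.

```python
from typing import List


def pass_at_k(trials: List[List[bool]]) -> float:
    """Share of tasks solved at least once across their k attempts."""
    return sum(any(task) for task in trials) / len(trials)


def pass_hat_k(trials: List[List[bool]]) -> float:
    """Share of tasks solved on every one of their k attempts (pass^k)."""
    return sum(all(task) for task in trials) / len(trials)


# Four tasks, three attempts each (invented results).
results = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [False, True, False],
]

at_k = pass_at_k(results)     # 3 of 4 tasks succeeded at least once
hat_k = pass_hat_k(results)   # 1 of 4 tasks succeeded every time
```

Under the simplifying assumption of independent attempts with a per-attempt success rate p, the two quantities correspond to 1 - (1 - p)^k and p^k. The gap between them grows quickly with k, which is why an agent can look strong on pass@k and still be unreliable in repeated use.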

To address the "no single right answer" challenge, we can apply metamorphic testing. Instead of checking exact outputs, this technique checks whether the relationship between multiple outputs makes sense. For example, if a normal search returns x results, the same search with additional criteria should return at most x results. Leveraging the observability built into the system design, engineers can perform root cause analysis to identify more confidently which hallucinated context derailed an agent. This feeds into multi-objective evaluation: moving beyond simple pass/fail to weigh trade-offs between reliability, safety, cost, and speed.
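The search relation can be written as a small metamorphic check. In this sketch the catalogue and the `search` function are stand-ins invented for illustration; in practice the function under test would be the agent's search skill. Note that the test never states what the correct results are, only how two result sets must relate.

```python
from typing import Dict, List, Optional

CATALOG: List[Dict] = [
    {"name": "red mug", "price": 8},
    {"name": "blue mug", "price": 12},
    {"name": "travel mug", "price": 25},
    {"name": "teapot", "price": 30},
]


def search(term: str, max_price: Optional[int] = None) -> List[Dict]:
    """Stand-in for the system under test."""
    hits = [item for item in CATALOG if term in item["name"]]
    if max_price is not None:
        hits = [item for item in hits if item["price"] <= max_price]
    return hits


def narrowing_relation_holds(term: str, max_price: int) -> bool:
    """Metamorphic relation: adding a filter can never add results."""
    base = search(term)
    narrowed = search(term, max_price=max_price)
    return len(narrowed) <= len(base) and all(item in base for item in narrowed)


relation_ok = narrowing_relation_holds("mug", max_price=10)
```

The same pattern extends to other relations, for instance that rephrasing a question should not change the set of tools an agent calls, although such relations are an assumption to validate per use case.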

Finally, while static golden datasets remain essential for regression testing, ensuring that the same inputs continue to yield acceptable results over time, they are not sufficient on their own. Teams must also complement them with dynamic benchmarks that continually evolve alongside the agents to catch new vulnerabilities and edge cases as models and capabilities are updated. 

The Continuous Evaluation Pipeline

For Eval Engineering to build real trust, it must move from local notebooks into the deployment pipeline. Traditional CI systems may use deterministic tests as hard deployment gates. Agent evaluation suites are different: because they rely on statistical performance and model-based judging, they do not always fit a strict pass/fail model.

They require Probabilistic Regression Gating, blocking deployments when performance drops below agreed statistical thresholds across representative benchmark sets. For instance, a release is halted if a tool-calling skill's pass@1 rate falls below 92% over a sufficiently large sample, or if API token consumption suddenly spikes without any quality gain.
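A gate of this kind could look roughly like the sketch below. The 92% threshold comes from the example above; the minimum sample size, the 20% token tolerance, and all names are assumptions chosen for illustration, and a real pipeline would likely add confidence intervals rather than compare raw rates.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class EvalRun:
    passed: int        # number of benchmark cases passed on the first try
    total: int         # number of benchmark cases executed
    avg_tokens: float  # average tokens consumed per case

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def gate(candidate: EvalRun, baseline: EvalRun,
         min_pass_rate: float = 0.92,        # threshold from the example above
         min_samples: int = 200,             # assumed meaning of "sufficiently large"
         max_token_increase: float = 0.20    # assumed tolerance for cost growth
         ) -> Tuple[bool, str]:
    """Return (deploy_allowed, reason)."""
    if candidate.total < min_samples:
        return False, "sample too small to judge"
    if candidate.pass_rate < min_pass_rate:
        return False, "pass@1 below threshold"
    growth = (candidate.avg_tokens - baseline.avg_tokens) / baseline.avg_tokens
    if growth > max_token_increase and candidate.pass_rate <= baseline.pass_rate:
        return False, "token spike without quality gain"
    return True, "ok"


baseline = EvalRun(passed=470, total=500, avg_tokens=1000.0)
decision = gate(EvalRun(passed=450, total=500, avg_tokens=1000.0), baseline)
```

Returning a reason alongside the verdict keeps the gate explainable in the pipeline log, which matters when a release is blocked by a statistical threshold rather than a single failing test.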

Operationalizing evals also requires a "Data Flywheel" that connects production monitoring to the CI/CD pipeline. Real-world edge cases, such as infinite loops or hallucinated context, are invaluable. When monitoring tools flag a bad response or users submit negative feedback, Eval Engineers analyse the feedback and, when valid, capture the trace and convert it into a permanent test case for the pipeline's "Golden Dataset". This continuous feedback loop ensures the evaluation suite evolves alongside the agent, turning yesterday's production failures into tomorrow's automated safeguards.
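The capture step of such a flywheel can be sketched as follows. The trace format, the field names, and the in-memory dictionary standing in for the Golden Dataset are all assumptions for illustration; a real setup would persist cases and attach them to the evaluation suite.

```python
import hashlib
from typing import Dict


def case_id(user_input: str) -> str:
    """Stable identifier so the same production input is only captured once."""
    return hashlib.sha256(user_input.encode("utf-8")).hexdigest()[:12]


def add_to_golden(golden: Dict[str, dict], trace: dict, expected: str) -> bool:
    """Convert a reviewed production trace into a regression case.

    Returns True when a new case was added, False for a duplicate.
    """
    cid = case_id(trace["input"])
    if cid in golden:
        return False
    golden[cid] = {
        "input": trace["input"],
        "observed_failure": trace["output"],
        "expected": expected,   # written by the Eval Engineer after review
        "source": "production",
    }
    return True


golden_dataset: Dict[str, dict] = {}
flagged = {"input": "Cancel my order and reorder it", "output": "<agent looped 14 times>"}

first = add_to_golden(golden_dataset, flagged, expected="Ask which order, then cancel once")
second = add_to_golden(golden_dataset, flagged, expected="Ask which order, then cancel once")
```

The expected behaviour is supplied by a human after reviewing the feedback, in line with the "when valid" condition above, so invalid complaints never become permanent test cases.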

Conclusion

Quality in the agentic era starts with testability. Individual agents must be testable by design. Workflows should be architected to restrict "agency" only to the nodes where probabilistic reasoning is genuinely needed, maximizing deterministic behaviour across the other nodes. Today's frontier models are powerful, but they still regularly stumble on context overload and infinite loops. Restricted autonomy is not just a testing best practice; it is an operational necessity.

With the rise of agents, some testing tasks will be absorbed or accelerated by AI. But that does not mean human testers become redundant. It means our role is shifting and our required technical understanding with it. We are moving from software tester to "Evaluation Engineer", a role where our critical thinking can shine even brighter as we build layered defences against agent unpredictability.
