What Pharma Can (and Cannot) Do with AI Under Annex 22

Written by Thomas Van Dorpe | May 4, 2026

The business wants AI agents, the use cases look compelling, and the technology is moving faster than anyone can comfortably track. Then, somewhere in the middle of the excitement, someone from QA or regulatory raises their hand and asks: "But can we actually do this in a GxP context?"

It's the right question, and it deserves a more useful answer than "it depends." With the introduction of EU GMP Annex 22, regulators have made one thing very clear: AI is allowed in regulated environments, but only under clearly defined conditions. This article lays out a practical framework for understanding where AI can genuinely help in GxP processes, where the hard limits are, and what compliant design looks like in practice.

Start here: not all AI is the same

One of the most common mistakes in pharma AI discussions is treating AI as a single category. Once you start designing for GxP, the differences between types of AI behavior become the core of the conversation. A useful way to structure this is to think in three tiers, based on intended use and required level of control, rather than on technology preference.

General AI agents

General AI agents operate in non-GxP processes. They are goal-driven systems: you define an objective, set security and access boundaries, and the agent determines how to get there. It may invoke multiple tools, coordinate with other agents, and make autonomous decisions within its defined scope.

This is where AI has the most operational freedom and where it can deliver significant efficiency gains, especially in back-office or support processes that do not touch regulated data, records, or decisions. From a compliance perspective, these use cases sit outside the scope of Annex 22’s strictest requirements.

GxP-adjacent or GxP-supporting AI

The second tier is where things become more constrained. In or adjacent to regulated processes, AI can assist, draft, propose, and route work, but autonomous decision-making is not acceptable. Human review at defined quality gates is mandatory. Outputs must be traceable to source data, behavior must be predictable, and the system must operate within clearly documented boundaries.

Large language models can play a role here, but not on their own: unconstrained, they are probabilistic, which makes them unsuitable for regulated decisions. When properly bounded, for example by grounding outputs in approved data and policies, they can support regulated workflows without undermining control.

Deterministic automations

The third tier is worth naming explicitly, because it often gets overlooked.
Deterministic automations are predefined sequences of steps and rules with no AI involved. When organizations map out AI use cases carefully, they often realize that some problems are better solved with automation than with AI. In GxP contexts, that is not a compromise. Fully deterministic behavior is fully predictable, and predictability is exactly what many regulated processes require.

This three-tier model forces the right question to the front, before any technology is selected: what is the intended use, and how much control does the process genuinely require?

A real example: the Quality Impact Recall Agent

To make this concrete, let's walk through a use case we've actually built: a recall orchestration agent designed to support the quality impact and communication workflow during a product recall. It's a good test case precisely because recalls are time-critical, high-stakes, and touch almost every part of the supply chain at once. The workflow maps across our three tiers as follows.

Steps 1 and 2: Trigger detection and traceability analysis. Both are automation. A failed quality result triggers the agent, which then traces the genealogy tree to identify affected batches, production runs, sales orders, customers, and inventory. This has to be 100% correct (no false negatives, no false positives), so building it as an AI system would actually reduce reliability. The logic is a deterministic mapping from input to output, and deterministic automation was the right call.
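To make that concrete, here is a minimal sketch of what a rule-based genealogy trace can look like. The data model and field names are hypothetical simplifications, not our production implementation; in a real deployment these records come from the validated ERP or LIMS.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified genealogy model for illustration only.
@dataclass
class Batch:
    batch_id: str
    children: list["Batch"] = field(default_factory=list)  # batches derived from this one
    sales_orders: list[str] = field(default_factory=list)  # orders that shipped this batch

def trace_affected(failed_batch: Batch) -> tuple[set[str], set[str]]:
    """Deterministically walk the genealogy tree from the failed batch.

    Pure rule-based traversal: the same input always yields the same
    output, which is the predictability this recall step requires.
    """
    affected_batches: set[str] = set()
    affected_orders: set[str] = set()
    stack = [failed_batch]
    while stack:
        batch = stack.pop()
        if batch.batch_id in affected_batches:
            continue  # guard against revisiting shared genealogy nodes
        affected_batches.add(batch.batch_id)
        affected_orders.update(batch.sales_orders)
        stack.extend(batch.children)
    return affected_batches, affected_orders
```

Because every step is an explicit rule, the trace can be validated once and trusted to behave identically on every run.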

Steps 3 and 4: Drafting customer communications. This is where AI enters. Based on the traced data, an LLM drafts notification emails to potentially affected customers, but with a firm constraint: specific fields from the ERP (batch numbers, affected products, customer references) are mandatory inclusions the agent cannot omit or substitute. The output lands in a review workspace where a human reads it, modifies it if needed, and approves it before anything is sent.
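One way to enforce that constraint is a deterministic guardrail that checks the draft before it ever reaches the review workspace. The sketch below is illustrative; the field names, values, and rejection flow are assumptions, not a specific product's API.

```python
# Hypothetical guardrail: verify the mandatory ERP values appear verbatim
# in the LLM draft before it is queued for human review.
def validate_draft(draft: str, erp_record: dict[str, str]) -> list[str]:
    """Return the mandatory values missing from or altered in the draft."""
    required = ["batch_number", "product_name", "customer_reference"]
    return [name for name in required if erp_record[name] not in draft]

erp_record = {
    "batch_number": "B-2419",               # invented values for the example
    "product_name": "Examplocillin 250 mg",
    "customer_reference": "CUST-0042",
}
missing = validate_draft("Dear customer, batch B-2419 of ...", erp_record)
if missing:
    print(f"Draft rejected before review: missing mandatory fields {missing}")
```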

If you want a broader sense of how first-party and custom AI agents work together across business processes, beyond the GxP context, our AI agents one-pager gives a useful overview of the two pillars. 

Steps 5, 6, and 7 are where the design gets more complex and where the regulatory stakes rise considerably.

Step 5: Draft impact proposal. At this stage, the agent could draft a broader proposal: an impact summary, suggested actions based on policies and available data, prioritization of affected parties, and draft wording for internal follow-up. This is where AI assistance could create real efficiency, but the regulatory expectations are specific. The output must be non-binding and explicitly marked as draft support, must trace back to source data, and cannot include invented content: no hallucinated lots, no assumed markets, no fabricated customer references. Retrieval Augmented Generation (RAG), which grounds the LLM in your specific company data and policies rather than its general training, is the technology most likely to make this step feasible while keeping it controlled. If you're unfamiliar with the regulatory framework that sits behind these requirements, our post on EU GMP Annex 11 & Annex 22 explains the compliance context in detail.
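The traceability requirement suggests a concrete output shape: every proposed action carries an explicit reference to the ERP record or policy clause that justifies it. The structure below is one illustration of that idea; all names and references are invented for the example.

```python
from dataclasses import dataclass

# Illustrative shape of "draft support" output. Every field name and
# reference here is a made-up example, not a real record.
@dataclass(frozen=True)
class ProposedAction:
    description: str  # what the agent suggests doing
    source_ref: str   # ERP record or policy clause that justifies it

@dataclass(frozen=True)
class DraftProposal:
    status: str                    # always marked non-binding
    actions: list[ProposedAction]  # each action traceable to a source

proposal = DraftProposal(
    status="DRAFT - NON-BINDING",
    actions=[ProposedAction(
        description="Prioritize notification of hospital customers in market BE",
        source_ref="SOP-QA-012 section 4.2 / sales order SO-88411",
    )],
)
```

A reviewer can then audit each suggestion against its source instead of taking the draft on faith.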

Step 6: Human approval. This is the most critical control point in the entire workflow. A qualified human must review the proposed actions and approve them in the validated system of record (ERP, EQMS, or both), with that approval identity-linked, time-stamped, and documented with rationale. The human must genuinely assess the proposal, and this is where accountability sits: in the hands of a qualified person making a documented decision.
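In data terms, "identity-linked, time-stamped, and documented with rationale" maps to a record shaped roughly like the sketch below. The field names are assumptions; the authoritative record lives in the validated EQMS or ERP, not in application code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative approval record. Field names are assumptions for the sketch.
@dataclass(frozen=True)
class ApprovalRecord:
    proposal_id: str       # the AI-drafted proposal under review
    approver_id: str       # authenticated identity of the qualified person
    decision: str          # e.g. "approved", "rejected", "approved_with_changes"
    rationale: str         # documented reasoning behind the decision
    approved_at: datetime  # timestamp captured by the system of record

record = ApprovalRecord(
    proposal_id="RECALL-2026-007",
    approver_id="qp.jane.doe",  # invented identity for the example
    decision="approved_with_changes",
    rationale="Scope confirmed against genealogy trace; customer letter wording tightened.",
    approved_at=datetime.now(timezone.utc),
)
```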

Step 7: Execution. Once the human decision is made, execution must be fully deterministic and validated. Sending approved notifications, updating statuses in ERP and EQMS, logging communication and execution, triggering downstream tasks, generating audit-ready records: all of this happens in a controlled, validated workflow. Model Context Protocol (MCP), which acts as a structured layer between a language model and enterprise systems like ERP or LIMS, is an emerging approach worth watching here, as it enables more deterministic, tool-based interactions rather than open-ended generation.
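The sketch below illustrates that tool-based pattern in plain Python, not the actual MCP SDK: the model can only request operations from a fixed, validated whitelist, and nothing executes without the approval gate from Step 6.

```python
from typing import Callable

# Each function stands in for a validated, deterministic ERP/EQMS call.
def send_notification(order_id: str) -> str:
    return f"notification queued for {order_id}"

def update_batch_status(batch_id: str) -> str:
    return f"{batch_id} set to RECALLED"

# The model can only name a tool; it never executes anything directly.
ALLOWED_TOOLS: dict[str, Callable[[str], str]] = {
    "send_notification": send_notification,
    "update_batch_status": update_batch_status,
}

def execute(tool_name: str, argument: str, approved: bool) -> str:
    """Run a whitelisted tool, and only after documented human approval."""
    if not approved:
        raise PermissionError("no documented human approval on record")
    if tool_name not in ALLOWED_TOOLS:
        raise ValueError(f"tool '{tool_name}' is not in the validated whitelist")
    return ALLOWED_TOOLS[tool_name](argument)
```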

The pattern that runs across all seven steps is consistent: AI prepares, human decides, system executes. That's the compliant design pattern for GxP AI workflows, and it holds regardless of how sophisticated the underlying technology becomes.

The LLM question answered honestly

A question that comes up constantly is whether large language models can have any role in GxP processes at all. In critical GMP processes, LLMs cannot be decision-makers. The current draft of EU GMP Annex 22 is clear: critical GMP use requires static, deterministic models. LLMs are probabilistic by nature; they predict the next output based on how they were trained, which is fundamentally incompatible with the determinism regulators require for critical process decisions. LLMs can still have a role, though; it's a question of where in the workflow they sit. Generating a draft communication that a human then reviews, modifies if needed, and approves is a very different use pattern from an LLM deciding whether a batch is released.

RAG can help close some of the gap. By grounding the model in your company's actual data, policies, and process context rather than its general training, you can constrain what it produces and make the outputs more reliable and more traceable. It won't make an LLM deterministic, but it reduces the surface area for hallucination and makes the outputs more defensible in a regulated context.
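To show the grounding idea without leaning on any particular framework, here is a toy RAG sketch: retrieval is restricted to approved documents, and the prompt instructs the model to use only that context. The retriever and document store are deliberately simplistic placeholders, not production components.

```python
# Toy RAG sketch: placeholders, not a production retriever or model client.
APPROVED_DOCS = {
    "recall-sop": "SOP-QA-012: notify customers within 24h of confirmed impact.",
    "market-list": "Approved markets for product X: BE, NL, DE.",
}

def retrieve(query: str) -> list[str]:
    """Naive keyword match over the approved document set only."""
    words = query.lower().split()
    return [text for text in APPROVED_DOCS.values()
            if any(word in text.lower() for word in words)]

def build_grounded_prompt(question: str) -> str:
    """Constrain the model to retrieved, approved context; that constraint
    is what makes the output traceable back to source documents."""
    context = "\n".join(retrieve(question))
    return (
        "Answer using ONLY the approved context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```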

Explainability is the other piece worth understanding here. Regulators expect you to be able to explain why an AI system produced a given output, and LLMs are inherently difficult to explain, which is why "black box" is the term people reach for. Until explainability tooling matures significantly, the safest design keeps LLMs in the preparation layer and out of the decision and execution layers entirely.

Designing your own GxP AI use cases: the questions to ask first

If you're working through your own use cases, these are the questions worth answering before any technology conversation starts, because the answers determine everything that follows.

  • What is the intended use, exactly? Be precise: which process, which decision points, which inputs, which outputs. "AI for quality" is the start of a conversation, not a definition.
  • What is the risk if this fails? To patient safety, product quality, data integrity — the answer to this question determines how much control is required and which tier of the framework applies.
  • Which steps should be deterministic? Where the process requires 100% predictable, reproducible behavior, build those steps as automations rather than AI, because the reliability floor for automation is higher.
  • Where is the human decision gate? For any GxP-critical outcome, there must be a documented point where a qualified person reviews, decides, and signs off. Define who that person is, what they're approving, and how it gets recorded.
  • How will outputs be traced to source data? Every AI-generated output in a regulated workflow needs to be traceable, so design for it from the start rather than trying to retrofit it.
  • What does the validation lifecycle look like? Initial validation is the beginning — ongoing performance monitoring, change control, revalidation triggers, and eventually decommissioning all need to be planned for upfront, because a system that can't be maintained under change control can't stay in production.

Starting with these questions stops you from building something that can't be defended when an inspector walks in.

A practical way to assess where you stand

The GxP AI Readiness Checklist covers 50 questions across every domain that matters for regulated AI: intended use, validation documentation, human-in-the-loop controls, audit trails, model versioning, cybersecurity, and more. It's designed to be worked through before you build, not after, so that gaps show up in your planning rather than in an inspection.

If you’re mapping real AI use cases against these GxP boundaries, the next question becomes ownership: who defines intended use, who signs off, and who maintains control over time. That’s what we cover next. Read: Who Owns AI in Pharma? Why Governance Must Be Cross-Functional.