
Observability Maturity Model

Written by Jerry Caupain | Feb 13, 2026 3:42:49 PM

There is a lot to say about this topic, too much to cover in a single article, so I'll start with an introduction to set the stage and go over level 1 of the maturity model. In the following articles I'll cover the remaining levels.

The industry is actively working to define a standard Observability Maturity Model. Such a model can be used to determine where you currently are and what is needed to reach the level you want to get to. Since there is currently no standardized Observability Maturity Model, I've created my own. It's based on existing documentation, but I've added a final level, which I've labeled autonomous IT management. In this series of articles, I'll explain every level in detail and focus on the core concept of each level, who the primary stakeholders are, and what the most important business outcome is.

Figure 1: Observability Maturity Model

Introducing the Observability Maturity Model

The model consists of 5 levels:

  • Level 1: Monitoring. This has basically been around for decades.
  • Level 2: Observability. Metrics, logs and traces are siloed.
  • Level 3: Causal Observability. Metrics, logs and traces are combined.
  • Level 4: Proactive Observability. Applying AI and machine learning to observability data to become more proactive and predictive. At this level we also add context in the form of business data.
  • Level 5: Autonomous IT Management. By leveraging agentic AI, we can automatically manage our IT landscape.

In these articles I'll go over every level in the model to take a deep dive into the concept of Observability. In this introduction I'll start with monitoring and go over the difference between monitoring and observability. I will write an article for each level.

Finally, I'll cover a vision for the near future with "Autonomous IT Management". I haven't seen this in practice yet, but given the developments in AI, I can't imagine it being very far off.

Level 1: Monitoring

“Yes of course we do observability, we have a great monitoring tool”

The interesting thing about this statement is that monitoring and observability are often seen as the same thing, but in my opinion this is not correct. The short explanation of the difference is that monitoring deals with things that are already known. If you know that CPU usage of 85% or higher could be a problem, you can configure your monitoring system to measure CPU usage and send out an alert when it reaches or exceeds that value. What happens next is that OPS gets this alert and, after (hopefully) investigating, restarts some process. Problem solved… for a while. Until it happens again, and the process is restarted again.
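To make the scenario a bit more concrete, here is a minimal sketch of what such a known-threshold check could look like. It assumes the psutil package is available, and send_alert is a hypothetical placeholder for whatever notification channel OPS uses; a real monitoring tool does this for you, but the underlying logic is the same: measure a known signal, compare it against a known threshold, alert.

```python
# Minimal sketch of a Level 1 threshold check.
# Assumes the psutil package; send_alert() is a hypothetical placeholder
# for whatever notification channel (e-mail, pager, chat) OPS uses.
import time

import psutil

CPU_THRESHOLD = 85.0  # percent: the "known problem" value from the example


def send_alert(message: str) -> None:
    # Hypothetical helper: in practice this would page OPS or post to a channel.
    print(f"ALERT: {message}")


def monitor_cpu(check_interval_seconds: int = 60) -> None:
    while True:
        usage = psutil.cpu_percent(interval=1)  # sample CPU usage over one second
        if usage >= CPU_THRESHOLD:
            send_alert(f"CPU usage at {usage:.1f}% (threshold {CPU_THRESHOLD}%)")
        time.sleep(check_interval_seconds)


if __name__ == "__main__":
    monitor_cpu()
```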

A couple of things are not optimal with this scenario.

  1. Getting this alert does not mean that there is something wrong with the application service. If the service runs fine and you still get this alert, at some point it will be ignored, and the monitoring becomes unreliable.

  2. In this scenario OPS and DEV possibly aren't working together. (Yes, it's 2025 and DEV and OPS are sometimes still separate teams.) That is another topic, but in this scenario investigating the problem is very difficult. Even if you have DEVOPS teams that are fully responsible for the service, having only this monitoring system limits the possibilities to thoroughly investigate what is going on.

What vs Why

So, monitoring lets you know that something is happening. However, it does not tell you WHY it is happening.

That's where the concept of Observability comes into play. With observability you can determine the state of a system based on its outputs in the form of metrics (for instance, the CPU usage that was measured in the example above), logs and traces. Typically, monitoring will give you an answer to the question: "Is the check-engine light on?" (a known failure mode). In contrast, Observability is the diagnostic engine that allows you to ask, "Why is the check-engine light on, even though the driver is driving normally?". It's a practice that allows for a deeper investigation into the problem.

Data is everything

What's essential to understand here is that we are collecting a lot more data about a system with the purpose of getting to the bottom of things. In my example, DEV and OPS need to work together. Observability enables both teams to look at the same data and thus reach the same conclusions.

Does this mean that there is no longer a need for monitoring now that we have Observability? Not at all. Monitoring exists within the concept of Observability. You still need to know that something is going on, but you also need the tools that give you a good understanding of why it is happening. This way you can prevent it from happening again.

Business Outcome

The most important business outcome at this stage is minimizing downtime and improving service availability. In other words, achieving a reduction in MTTD (Mean Time to Detect). However, this is still purely reactive, because monitoring only alerts you when a known scenario occurs. Ideally you don't want this to happen at all. Nevertheless, it is a start, and not having any monitoring in place is far worse.
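For clarity: MTTD is simply the average time between the moment an incident starts and the moment it is detected. A tiny illustration with made-up timestamps:

```python
# Tiny illustration of how MTTD is calculated: the average time between
# an incident starting and it being detected. The timestamps are made up.
from datetime import datetime, timedelta

incidents = [
    # (incident started, incident detected)
    (datetime(2025, 3, 1, 9, 0), datetime(2025, 3, 1, 9, 12)),
    (datetime(2025, 3, 5, 14, 30), datetime(2025, 3, 5, 14, 34)),
    (datetime(2025, 3, 9, 2, 15), datetime(2025, 3, 9, 2, 47)),
]

mttd = sum((detected - started for started, detected in incidents), timedelta()) / len(incidents)
print(f"MTTD: {mttd}")  # 0:16:00 for the sample data above
```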

It is possible to increase the business value by mapping monitoring events to a graphical representation of your IT landscape. If done right, it is possible to quickly determine where in your landscape there is an outage and what the potential impact is on your service. However, it does not prevent any outages. To prevent outages, you must analyze data and implement the right improvements. This is where observability plays a huge part. 

Primary Stakeholders

The primary stakeholders of monitoring data are usually IT Operations teams that can act on monitoring alerts. Their goal is strictly to restore service (restart the process), not fix the underlying bug. At this stage, operational teams are still firefighting and reacting instead of being proactive.

Tools and practices

Monitoring has been around for a very long time and there are a lot of tools to choose from. Looking at open source, you typically see tools like Prometheus, Nagios and Zabbix. Commercially there is also a lot to choose from, but commercial offerings normally do a lot more than just monitoring. Think of tools like Dynatrace, Datadog and New Relic, but also tooling that is integrated into the hyperscaler cloud platforms; great examples are Amazon CloudWatch and Azure Monitor.
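As a small illustration, here is a rough sketch of the earlier CPU check expressed against Prometheus. It assumes a Prometheus server on localhost:9090 that scrapes node_exporter (which exposes node_cpu_seconds_total); in a real setup you would define this as an alerting rule and let Alertmanager do the notifying instead of polling yourself.

```python
# Rough sketch: query Prometheus's HTTP API for instances whose CPU usage
# is at or above the threshold. Assumes Prometheus on localhost:9090
# scraping node_exporter; requires the requests package.
import requests

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"
# Common PromQL idiom: busy CPU percentage per instance over the last 5 minutes.
QUERY = '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
CPU_THRESHOLD = 85.0


def instances_over_threshold() -> list[tuple[str, float]]:
    response = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=10)
    response.raise_for_status()
    samples = response.json()["data"]["result"]
    offenders = []
    for sample in samples:
        instance = sample["metric"].get("instance", "unknown")
        value = float(sample["value"][1])  # instant vector value is [timestamp, "value"]
        if value >= CPU_THRESHOLD:
            offenders.append((instance, value))
    return offenders


if __name__ == "__main__":
    for instance, usage in instances_over_threshold():
        print(f"ALERT: {instance} CPU at {usage:.1f}%")
```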

To Conclude

Although this foundational level can offer a lot of benefits, relying only on monitoring means that we are always catching up. We are operating in a reactive state, merely addressing symptoms.

Join me in Part 2, where we dive into the next levels. We will explore how collecting Logs, Metrics, and Traces, even when siloed, unlocks the next major business outcome: a significant reduction in Mean Time To Resolution (MTTR).

Do you agree? Any thoughts? I’d love to hear your feedback.