If you run workloads on OpenShift and monitor them with Prometheus and Alertmanager, you’ve likely dealt with a flood of alerts every time the cluster gets upgraded. Pods restart, nodes drain, services go temporarily unavailable — and your on-call gets paged for things that are expected and transient.
In this post I’ll walk through how we handle this in our monitoring stack: detecting when an upgrade is happening, extending that detection window to cover the recovery period, and using Alertmanager inhibit rules to suppress alerts that would otherwise fire during this expected disruption.
Our monitoring stack
We run a fairly standard Prometheus and Alertmanager setup on OpenShift. Prometheus scrapes metrics from the cluster and our applications, evaluates alerting rules, and sends firing alerts to Alertmanager. Alertmanager handles routing, deduplication, grouping, and — crucially for this post — inhibition of alerts.
Our alerting rules and Alertmanager configuration are managed as code in a Git repository and applied through CI/CD.
Detecting a cluster upgrade
OpenShift exposes cluster version information through the cluster_version metric, published by the Cluster Version Operator.
When an upgrade is in progress, a time series with type="updating" is emitted.
A simple Prometheus alerting rule to detect an ongoing upgrade:
- alert: ClusterUpgradeInProgress
  expr: sum(cluster_version{type="updating"}) > 0
  labels:
    severity: info
  annotations:
    summary: "OpenShift cluster upgrade is ongoing."
    description: |
      The OpenShift cluster is being upgraded. Pods will be migrated
      from upgraded nodes and restarted.
This fires once Prometheus evaluates the expression and finds the upgrade
metric active, and resolves once cluster_version no longer reports
type="updating".
The problem: the aftermath
The alert above works well for the upgrade itself, but services don’t recover instantly once the upgrade finishes. Pods need time to restart, caches need to warm up, and various health checks may lag behind. Depending on your workloads, it can take 30 minutes to an hour before everything is fully stable again.
During this recovery window, you’ll still see alerts firing for conditions that are a direct consequence of the upgrade — even though the upgrade metric has already gone back to zero.
Extending the detection window
Prometheus supports subqueries, which let you evaluate an instant query over a time range. We can use max_over_time to create an alert that stays firing for a configurable period after the upgrade completes:
- alert: ClusterUpgradeRecovery
  expr: max_over_time(sum(cluster_version{type="updating"})[1h:]) > 0
  labels:
    severity: info
  annotations:
    summary: "OpenShift cluster upgrade is ongoing or recently completed."
    description: |
      The cluster is being upgraded or was upgraded within the last hour.
      Services may still be recovering.
The expression max_over_time(sum(cluster_version{type="updating"})[1h:]) looks
at the maximum value of the upgrading metric over the last 1 hour. If the cluster
was upgrading at any point during that window, the max will be greater than zero,
and the alert will keep firing.
The [1h:] part is the subquery syntax — 1h is the lookback range and the
omitted resolution (after the colon) defaults to the global evaluation interval. You can
adjust the 1h to whatever recovery window makes sense for your environment.
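If you prefer not to depend on the global evaluation interval, you can pin the subquery resolution explicitly. A minimal sketch, assuming a 5-minute step is acceptable for this rule:

```yaml
# Same recovery alert, but with an explicit 5m subquery resolution
# ([1h:5m]) instead of the global evaluation interval default.
- alert: ClusterUpgradeRecovery
  expr: max_over_time(sum(cluster_version{type="updating"})[1h:5m]) > 0
  labels:
    severity: info
  annotations:
    summary: "Cluster upgrade ongoing or recently completed."
```

A coarser resolution slightly reduces evaluation cost; anything well below the 1h lookback works, since a single non-zero sample anywhere in the window keeps the alert firing.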
This gives us two distinct alerts:
- ClusterUpgradeInProgress — fires only while the upgrade is actively happening. Use this when you need to know the upgrade is ongoing right now.
- ClusterUpgradeRecovery — fires during the upgrade and for 1 hour after. Use this for suppressing alerts that may fire during the recovery period.
Alertmanager inhibit rules
Alertmanager inhibit rules let you prevent certain alerts from being routed to receivers when other alerts are firing. The concept is simple: if a “source” alert is active, any matching “target” alerts are suppressed. The inhibited alerts are still visible in the Alertmanager UI marked as “inhibited”, which is useful for debugging.
An inhibit rule has three key parts:
- source_matchers — conditions that identify the source alert (the one that, when firing, causes inhibition)
- target_matchers — conditions that identify the target alerts (the ones to suppress)
- equal (optional) — labels that must have the same value in both source and target for the inhibition to apply
Note: older Alertmanager versions used source_match/target_match (and their
_re variants). These still work but are deprecated since Alertmanager 0.22 in
favor of source_matchers/target_matchers, which use PromQL-style label
matching syntax.
Here’s a basic example:
inhibit_rules:
  # Suppress InstanceDown alerts for services during cluster upgrade recovery
  - target_matchers:
      - alertname = InstanceDown
      - instance =~ ".*myservice.*"
    source_matchers:
      - alertname = ClusterUpgradeRecovery
When ClusterUpgradeRecovery is firing, any InstanceDown alert whose
instance label matches .*myservice.* will be suppressed. No pages, no
noise.
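If the same Alertmanager handles several clusters, you will not want an upgrade on one cluster to mute alerts from another. A sketch using the equal field, assuming both alerts carry a cluster label:

```yaml
inhibit_rules:
  - target_matchers:
      - alertname = InstanceDown
      - instance =~ ".*myservice.*"
    source_matchers:
      - alertname = ClusterUpgradeRecovery
    # Inhibit only when source and target share the same cluster label value.
    equal: [cluster]
```

Without equal, a recovery alert from any cluster would suppress matching InstanceDown alerts everywhere.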
Putting it all together
Here’s a complete example. Say you have a service that processes a message queue, and it tends to appear stuck during and shortly after cluster upgrades because its workers need time to reconnect and catch up.
First, the Prometheus alerting rules:
groups:
  - name: cluster
    rules:
      # Fires only during active upgrade
      - alert: ClusterUpgradeInProgress
        expr: sum(cluster_version{type="updating"}) > 0
        labels:
          severity: info
        annotations:
          summary: "OpenShift cluster upgrade is ongoing."
      # Fires during upgrade + 1 hour recovery window
      - alert: ClusterUpgradeRecovery
        expr: max_over_time(sum(cluster_version{type="updating"})[1h:]) > 0
        labels:
          severity: info
        annotations:
          summary: "Cluster upgrade ongoing or recently completed."
  - name: myservice
    rules:
      - alert: StuckQueueProcessing
        expr: pending_messages > 50
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Message queue processing appears stuck."
Then the Alertmanager inhibit rules:
inhibit_rules:
  # Node status can be unhealthy during upgrade
  - target_matchers:
      - alertname = ClusterNodeStatus
    source_matchers:
      - alertname = ClusterUpgradeInProgress
  # Queue processing needs extra recovery time after upgrade
  - target_matchers:
      - alertname = StuckQueueProcessing
    source_matchers:
      - alertname = ClusterUpgradeRecovery
  # Instances go down during upgrade
  - target_matchers:
      - alertname = InstanceDown
    source_matchers:
      - alertname = ClusterUpgradeInProgress
Notice how different alerts use different source alerts depending on their recovery characteristics:
- ClusterNodeStatus and InstanceDown use ClusterUpgradeInProgress directly, because nodes and instances typically recover shortly after the upgrade completes.
- StuckQueueProcessing uses ClusterUpgradeRecovery, because queue processing needs significantly more time to reconnect and catch up after the upgrade.
Tips
A few things to keep in mind when setting up inhibit rules:
- Be specific with target matching. Use exact matchers (=) for alert names or regex matchers (=~) for patterns. Overly broad rules can mask real problems.
- Use the equal field when you need to scope inhibition to specific instances or clusters. For example, if you monitor multiple clusters, add equal: [cluster] so an upgrade on cluster A doesn't silence alerts from cluster B.
- Choose the right recovery window. Start with 1 hour and adjust based on how long your services actually take to stabilize. Check your alert history after a few upgrades to calibrate.
- Keep the original upgrade alert. Having both ClusterUpgradeInProgress and ClusterUpgradeRecovery gives you flexibility: some alerts only need suppression during the upgrade itself, while others need the extended window.
- Test your inhibit rules. Use amtool check-config to validate your Alertmanager configuration syntax. Then, send test alerts to a running Alertmanager instance to confirm inhibition works as expected before relying on it during a real upgrade.
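The validation workflow from the last tip can look roughly like this. A sketch, assuming amtool is installed, your configuration lives at alertmanager.yml, and Alertmanager listens on localhost:9093 (adjust paths and URL to your setup):

```
# Validate configuration syntax, including inhibit_rules
amtool check-config alertmanager.yml

# Fire a synthetic source alert, then a target alert,
# and check that the target shows up as inhibited
amtool alert add ClusterUpgradeRecovery severity=info --alertmanager.url=http://localhost:9093
amtool alert add StuckQueueProcessing severity=critical --alertmanager.url=http://localhost:9093
amtool alert query --inhibited --alertmanager.url=http://localhost:9093
```

Synthetic alerts added this way expire on their own after a while, so this test does not leave lingering state behind.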
Conclusion
Cluster upgrades are routine but disruptive. Rather than training your team to ignore alerts during upgrades — which risks missing real issues — use the tooling that Prometheus and Alertmanager already provide. Detect the upgrade with metrics, extend the window with subqueries, and suppress the noise with inhibit rules. Your on-call will thank you.