In a detailed blog post, Monika Singh at Cloudflare explores the stressful environment on-call personnel face, often illustrated by the 'this is fine' meme. On-call staff frequently deal with numerous alerts, leading to alert fatigue—a state of exhaustion caused by responding to non-prioritised or unclear alerts. To combat this, Cloudflare teams conduct periodic alert analyses to enhance the accuracy and actionability of alerts. Singh's blog post delves into the significance of alert observability and Cloudflare's methods to improve it using open-source tools and best practices.
Singh explains how alert fatigue can disrupt on-call personnel's sleep, social life, and leisure activities, potentially leading to burnout. Regular alert analysis helps mitigate this by reducing unnecessary interruptions and improving the efficiency of on-call work. Despite its importance, not all teams conduct alert analysis. Singh emphasises that analysing alerts helps staff write handover notes, helps managers assess burnout risk, and supports the writing of incident reports.
In the post, Singh walks through the basics of the Prometheus architecture at Cloudflare. Cloudflare relies heavily on Prometheus to monitor its infrastructure, which spans more than 310 cities and more than 1100 servers. Prometheus collects metrics, evaluates rules, and triggers alerts, which it sends to Alertmanager. Alertmanager centralises those alerts, and a webhook is used to store them for analysis.
Alertmanager processes alerts by inhibiting, grouping, silencing, or routing them based on its configuration. However, not all alerts are optimally configured, which creates noise. Cloudflare initially used alertmanager2es for alert monitoring and reporting, but the tool was limited because it did not capture silenced or inhibited alerts. Singh highlights how Cloudflare got around this limitation by querying the Alertmanager API to capture all alert states.
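As a rough illustration of capturing every alert state, the sketch below polls Alertmanager's v2 alerts endpoint and labels each alert as active, silenced, or inhibited. It is not Cloudflare's code: the `localhost:9093` address and the `fetch_alerts` and `classify` helper names are assumptions made for this example.

```python
import json
from urllib.request import urlopen

# Assumed address of a local Alertmanager; adjust for a real deployment.
ALERTMANAGER_URL = "http://localhost:9093"


def fetch_alerts(base_url=ALERTMANAGER_URL):
    """Fetch alerts in every state from the Alertmanager v2 API."""
    query = "active=true&silenced=true&inhibited=true&unprocessed=true"
    with urlopen(f"{base_url}/api/v2/alerts?{query}") as resp:
        return json.load(resp)


def classify(alert):
    """Return one state label for an alert object returned by the API."""
    status = alert.get("status", {})
    if status.get("silencedBy"):
        return "silenced"
    if status.get("inhibitedBy"):
        return "inhibited"
    # Falls back to the API's own state: "active" or "unprocessed".
    return status.get("state", "unknown")


# Hand-written sample objects shaped like the API response (illustrative only).
sample_alerts = [
    {"fingerprint": "a1", "labels": {"alertname": "DiskFull"},
     "status": {"state": "active", "silencedBy": [], "inhibitedBy": []}},
    {"fingerprint": "b2", "labels": {"alertname": "HighLatency"},
     "status": {"state": "suppressed", "silencedBy": ["s-123"], "inhibitedBy": []}},
    {"fingerprint": "c3", "labels": {"alertname": "NodeDown"},
     "status": {"state": "suppressed", "silencedBy": [], "inhibitedBy": ["a1"]}},
]

states = {a["fingerprint"]: classify(a) for a in sample_alerts}
```

Calling `fetch_alerts()` on a schedule and storing the classified results would give a record of suppressed alerts that webhook notifications alone never deliver.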
Cloudflare aggregates all alert states into a datastore by correlating data from the Alertmanager webhook and API, tying them together with a unique fingerprint field. Singh describes how the data is transformed using vector.dev (an observability data routing and transformation pipeline) and stored in ClickHouse (an open source database for analytics) for analysis. ClickHouse allows efficient data manipulation, for example enabling specific label queries and aggregating alert data.
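The correlation step can be pictured as a join on the fingerprint field. The following sketch merges webhook notifications with API results in plain Python. In the post this work is done by Vector and ClickHouse, so the `merge_by_fingerprint` function and the record layout here are illustrative assumptions only.

```python
def merge_by_fingerprint(webhook_payloads, api_alerts):
    """Join webhook notifications and API alerts on their fingerprint."""
    records = {}
    # Webhook payloads carry the alerts that were actually sent to a receiver.
    for payload in webhook_payloads:
        for alert in payload.get("alerts", []):
            records[alert["fingerprint"]] = {
                "alertname": alert["labels"].get("alertname"),
                "notified_status": alert["status"],
                "receiver": payload.get("receiver"),
                "api_state": None,
            }
    # API alerts also include silenced and inhibited alerts.
    for alert in api_alerts:
        record = records.setdefault(alert["fingerprint"], {
            "alertname": alert["labels"].get("alertname"),
            "notified_status": None,
            "receiver": None,
            "api_state": None,
        })
        record["api_state"] = alert["status"]["state"]
    return records


# Hand-written samples shaped like Alertmanager's webhook and API output.
webhook_payloads = [{
    "receiver": "team-storage",
    "status": "firing",
    "alerts": [{"status": "firing", "fingerprint": "a1",
                "labels": {"alertname": "DiskFull"}}],
}]
api_alerts = [
    {"fingerprint": "a1", "labels": {"alertname": "DiskFull"},
     "status": {"state": "active"}},
    {"fingerprint": "b2", "labels": {"alertname": "HighLatency"},
     "status": {"state": "suppressed"}},
]

merged = merge_by_fingerprint(webhook_payloads, api_alerts)
```

In this sample, the alert with fingerprint `b2` appears only in the API data, so its `notified_status` stays `None`: it was suppressed and never reached a receiver.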
Singh outlines the creation of multiple dashboards at Cloudflare to monitor alerts, including:
- Alerts Overview: General insights into alerts received by Alertmanager.
- Alertname Overview: Detailed analysis of specific alerts.
- Alerts Overview by Receiver: Insights specific to teams or receivers.
- Alerts State Timeline: Snapshot of alert volume and activity.
- Jiralerts Overview: Alerts received by the ticketing system.
- Silences Overview: Insights into Alertmanager silences.
Singh goes on to explain that alerts are routed to teams, and that a team can own multiple services or components, which gives many possible combinations of alerts. One dashboard panel aggregates the count of firing alerts per component over time, showing which components are noisy and when. A colour-coded swimlane view of receivers shows how busy on-call was, and when, and highlights flapping alerts, whose state changes frequently. This helped the team to reconfigure and tweak the thresholds and duration periods in the alert rules.
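One simple way to surface flapping alerts from stored events is to count state changes per fingerprint. The sketch below assumes events are available as `(timestamp, fingerprint, state)` tuples. The threshold of three transitions is an arbitrary value chosen for the example, not a figure from the post.

```python
from collections import defaultdict


def count_transitions(events):
    """Count state changes per fingerprint.

    `events` is an iterable of (timestamp, fingerprint, state) tuples.
    """
    last_state = {}
    transitions = defaultdict(int)
    # Sorting the tuples orders events by timestamp first.
    for _, fingerprint, state in sorted(events):
        previous = last_state.get(fingerprint)
        if previous is not None and previous != state:
            transitions[fingerprint] += 1
        last_state[fingerprint] = state
    return dict(transitions)


def flapping(events, threshold=3):
    """Return fingerprints whose state changed at least `threshold` times."""
    counts = count_transitions(events)
    return sorted(fp for fp, n in counts.items() if n >= threshold)


# Illustrative events: alert "a" toggles repeatedly, alert "b" resolves once.
events = [
    (1, "a", "firing"), (2, "a", "resolved"),
    (3, "a", "firing"), (4, "a", "resolved"),
    (1, "b", "firing"), (5, "b", "resolved"),
]

noisy = flapping(events)
```

Alerts flagged this way are candidates for a longer duration period or a different threshold in their alert rules.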
The analysis revealed that some alerts fired without notification labels, and that some came from decommissioned clusters whose alerts had not been removed. Alertmanager inhibitions also sometimes failed, leading to unnecessary alerts. Storing alert data in ClickHouse enabled Cloudflare to detect and address the configuration errors causing these issues. Alertmanager also allows alerts to be silenced during maintenance or while an issue is being worked on. By analysing silences, Cloudflare identified stale ones that were no longer needed, which helped keep alerting relevant.
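The post does not spell out how a silence is judged stale. As one possible definition, the sketch below treats an active silence as stale when its matchers match none of the recently observed alert label sets. The function names are hypothetical, and Python's `re` module only approximates the regular expression syntax that Alertmanager itself uses.

```python
import re


def matcher_matches(matcher, labels):
    """Evaluate one silence matcher against an alert's labels."""
    value = labels.get(matcher["name"], "")  # a missing label acts as ""
    if matcher.get("isRegex"):
        # Regex matchers must match the whole label value.
        hit = re.fullmatch(matcher["value"], value) is not None
    else:
        hit = value == matcher["value"]
    # isEqual=False marks a negative matcher (!= or !~).
    return hit if matcher.get("isEqual", True) else not hit


def silence_matches(silence, labels):
    """A silence applies only when all of its matchers match."""
    return all(matcher_matches(m, labels) for m in silence["matchers"])


def stale_silences(silences, alert_label_sets):
    """Return IDs of active silences that match no observed alert."""
    return [
        s["id"] for s in silences
        if s["status"]["state"] == "active"
        and not any(silence_matches(s, labels) for labels in alert_label_sets)
    ]


# Hand-written samples shaped like objects from the v2 silences endpoint.
silences = [
    {"id": "s1", "status": {"state": "active"},
     "matchers": [{"name": "cluster", "value": "old-cluster",
                   "isRegex": False, "isEqual": True}]},
    {"id": "s2", "status": {"state": "active"},
     "matchers": [{"name": "alertname", "value": "Disk.*",
                   "isRegex": True, "isEqual": True}]},
    {"id": "s3", "status": {"state": "expired"},
     "matchers": [{"name": "cluster", "value": "gone",
                   "isRegex": False, "isEqual": True}]},
]
observed = [{"alertname": "DiskFull", "cluster": "prod"}]

stale = stale_silences(silences, observed)
```

Here only `s1` is reported: `s2` still matches a live alert and `s3` has already expired.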
Singh offers a demo for implementing alert observability using Docker Compose, Prometheus, Alertmanager, Vector, ClickHouse, and Grafana. This setup allows users to explore prebuilt demo dashboards and understand the alert observability process used at Cloudflare.
Singh concludes by stating that alert observability enhances on-call efficiency and reduces burnout by minimising unnecessary interruptions. Cloudflare's approach has improved alert management, providing valuable insights for troubleshooting and optimising alert configurations. This proactive monitoring culture benefits all teams, promoting a more manageable environment for the engineers working on-call.