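Alertmanager Alert Configuration File and Alerting Rules Explained

Author: 尼古拉斯李 2025-04-09 08:05:00

Alertmanager is an important component of the Prometheus ecosystem. It handles the alerts sent by Prometheus and provides alert grouping, inhibition, deduplication, routing, and notification.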

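I. What Is Alertmanager?

Alertmanager handles the alerts sent by Prometheus: it groups, inhibits, deduplicates, and routes them, then sends notifications. This lets it notify the responsible people as soon as a problem is detected, so operations staff can respond quickly and take action.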

[Figure: alert flow diagram]

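Main features of Alertmanager:

Grouping: merges similar alerts into a single notification, reducing the number of notifications.
Deduplication: if the same alert fires repeatedly within a short time, Alertmanager sends it only once.
Inhibition: when a high-priority alert fires, related lower-priority alerts are muted.
Routing: sends different types of alerts to different receivers (such as email, Slack, or a webhook).
Notification: supports multiple notification channels, such as email, Slack, PagerDuty, and webhooks.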

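II. Alertmanager Configuration File Explained

The complete configuration file below is only an example; in a production environment, adjust it to your actual setup: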

global:
  resolve_timeout: 5m  
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'user@example.com'
  smtp_auth_password: 'yourpassword'
  smtp_hello: "qq.com"
  smtp_require_tls: false

route:
  receiver: 'default'
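  group_by: ['alertname', 'cluster', 'service']  # labels to group alerts by
  group_wait: 30s  # how long to wait before sending the first notification for a new group
  group_interval: 5m  # interval before notifying about new alerts added to a group
  repeat_interval: 3h  # interval before re-sending a notification for the same alerts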
  routes:
    - match:
        severity: critical
      receiver: 'email'
      continue: true  # keep matching the routes that follow
    - match:
        severity: warning
      receiver: 'slack'
    - match:
        alertname: InstanceDown
      receiver: 'webhook'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook.example.com/alert'  # webhook URL

  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        from: 'alert@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'user@example.com'
        auth_password: 'yourpassword'
        send_resolved: true  # also send a notification when the alert is resolved

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'webhook'
    webhook_configs:
      - url: 'http://alert-handler.local/notify'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']

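The Alertmanager configuration file, alertmanager.yml, consists mainly of the following sections. The purpose and parameters of each are described below.

1. global configuration

Global configuration. Options set in the global block apply as defaults throughout this configuration file, but more specific settings elsewhere in the file can override them.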

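global:
  resolve_timeout: 5m  # declare an alert resolved if it has no end time and is not updated for 5 minutes
  smtp_smarthost: 'smtp.example.com:587'  # mail server address
  smtp_from: 'xxx@example.com'  # sender address
  smtp_auth_username: 'xxx@example.com'  # SMTP authentication username
  smtp_auth_password: 'yourpassword'  # SMTP authentication password

resolve_timeout: how long Alertmanager waits before declaring an alert resolved when the alert carries no end time and has not been updated. Alerts sent by Prometheus always include an end time, so this setting does not affect them.
smtp_smarthost: address of the mail server used to send alert emails.
smtp_from: sender address for alert emails.
smtp_auth_username and smtp_auth_password: username and password (or authorization code) used to authenticate with the SMTP server.
smtp_require_tls: whether TLS is required for the SMTP connection.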
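
As a minimal sketch of how a more specific setting overrides the global block, the fragment below lets one email receiver use its own mail server and sender address while the authentication settings still come from global. The receiver name, host names, and addresses are placeholders chosen for illustration.

```yaml
# Fragment only: a full alertmanager.yml also needs a route section.
global:
  smtp_smarthost: 'smtp.example.com:587'   # default for every email receiver
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'user@example.com'
  smtp_auth_password: 'yourpassword'

receivers:
  - name: 'dba-email'                      # hypothetical receiver name
    email_configs:
      - to: 'dba@example.com'
        smarthost: 'smtp.internal.example.com:587'  # overrides global smtp_smarthost
        from: 'db-alert@example.com'                # overrides global smtp_from
        # auth_username / auth_password are not set here,
        # so the values from the global block are used.
```

A file like this can be checked before reloading Alertmanager with amtool check-config alertmanager.yml.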

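2. route configuration

Alert routing configuration. It groups and routes alerts, so that alerts in different groups can be sent to different recipients.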

route:
  receiver: 'default' 
  group_by: ['alertname', 'cluster', 'service']  
  group_wait: 30s  
  group_interval: 5m 
  repeat_interval: 3h

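receiver: the default receiver, used when an alert matches none of the child routes.
group_by: the labels used to group alerts. Alerts that share the same values for these labels are batched into a single notification instead of being sent one by one.
group_wait: how long to wait before sending the first notification for a new group, so that alerts arriving together go out in one notification and a transient spike does not cause an alert storm.
group_interval: how long to wait before sending a notification about new alerts added to a group that has already been notified.
repeat_interval: how long to wait before re-sending a notification for alerts that are still firing, which prevents excessive notifications.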
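
To make the three timers more concrete, here is the same route block with a rough timeline for a single group in the comments. The timestamps are illustrative rather than measured, and the exact moment of a repeat depends on when Alertmanager next evaluates the group.

```yaml
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

# Approximate timeline for one group (illustrative):
#   00:00:00  first alert of a new group arrives; the group_wait timer starts
#   00:00:30  first notification is sent, containing every alert collected so far
#   00:02:00  another alert joins the same group; nothing is sent yet
#   00:05:30  group_interval has elapsed; a notification with the updated group is sent
#   ~03:05    nothing has changed for repeat_interval; the notification is repeated
```

Shorter values make notifications arrive sooner but produce more messages, so the three timers are usually tuned together.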

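3. routes configuration

Child routes. When an alert matches a child route, that route takes precedence over the top-level route. Child routes accept the same options as route: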

routes:
    - match:
        severity: critical
      receiver: 'email'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'
    - match:
        alertname: InstanceDown
      receiver: 'webhook'

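match: defines the matching condition. For example, severity: critical matches all alerts whose severity label is critical.
receiver: alerts that match this route are sent to the specified receiver.
continue: if true, the alert continues to be matched against the following routes after matching this one. By default, matching stops at the first route that matches.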
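
Alertmanager 0.22 and later also accept a matchers list, and the documentation marks match as deprecated in its favour. The sketch below restates the routes above in that syntax, assuming such a version, and lists in comments where a few sample alerts would end up. The alert name DiskFull is made up for the example.

```yaml
route:
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'email'
      continue: true          # keep evaluating the routes below
    - matchers:
        - severity="warning"
      receiver: 'slack'
    - matchers:
        - alertname="InstanceDown"
      receiver: 'webhook'

# Routing under this sketch:
#   severity=critical, alertname=InstanceDown  -> email and webhook
#   severity=critical, alertname=DiskFull      -> email only
#   severity=warning,  alertname=InstanceDown  -> slack only (matching stops there)
#   severity=info                              -> default (no child route matches)
```

The order of the child routes matters: without continue: true on the first route, a critical InstanceDown alert would never reach the webhook receiver.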

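4. receivers configuration

Alert receiver configuration. Each receiver has a name; after grouping and routing, a route must point to a receiver. Different types of receivers can be configured: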

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook.example.com/alert'

name: the name of the receiver.
webhook_configs: sends alerts through a webhook.

  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true

to: the email recipient.
send_resolved: whether to send a notification when the alert is resolved.

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

channel: the Slack channel to which alerts are sent.
api_url: the Slack webhook URL.
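
A single receiver can also combine several notification types, so that one route delivers the same alert through more than one channel. The following is a small sketch; the receiver name 'oncall' is hypothetical, and the addresses and URL are the placeholders used earlier.

```yaml
receivers:
  - name: 'oncall'                 # hypothetical receiver name
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true        # set explicitly; defaults differ between integrations
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        send_resolved: true
```

Any route whose receiver is 'oncall' then sends both an email and a Slack message for each notification.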

5. inhibit_rules configuration

Alert inhibition. It is mainly used to reduce the number of notifications and prevent "alert bombing":

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']

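source_match: identifies the high-priority (source) alerts.
target_match: identifies the alerts that are inhibited while a source_match alert is firing.
equal: the labels that must have the same value in the source and target alerts for the rule to apply; here, both alerts must come from the same instance.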

When an alert with severity critical is firing, every alert with severity warning whose instance label has the same value as that critical alert is inhibited, and no notification is sent for it.
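
If warnings should be inhibited only by a critical alert of the same kind, alertname can be added to equal. The sketch below shows that narrower variant, written with the source_matchers and target_matchers fields available in Alertmanager 0.22 and later. The alert names and instances in the comments are hypothetical.

```yaml
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']

# With this rule:
#   firing:        HighCpuUsage{severity="critical", instance="node1"}
#   inhibited:     HighCpuUsage{severity="warning",  instance="node1"}
#   not inhibited: HighCpuUsage{severity="warning",  instance="node2"}  (different instance)
#   not inhibited: DiskFull{severity="warning",      instance="node1"}  (different alertname)
```

With equal: ['instance'] alone, the last alert in the list would be inhibited as well.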

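A complete configuration file would also contain a templates section, which specifies where custom notification templates are located. It is not covered in detail here.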
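
For orientation only, a templates section might look like the sketch below. The directory is an assumed location, and "custom.title" is a hypothetical template name that would have to be defined in one of the loaded files.

```yaml
templates:
  - '/etc/alertmanager/templates/*.tmpl'   # assumed path; the last path component may use a wildcard

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        title: '{{ template "custom.title" . }}'   # hypothetical template name
```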

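III. Prometheus Alerting Rules Explained

1. Alerting rule file

The alerting rule file can have any name, as long as the same name is referenced in the Prometheus configuration file.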
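
As a sketch of that reference, the prometheus.yml fragment below loads a rule file and points Prometheus at an Alertmanager instance. The file is assumed to sit next to prometheus.yml, and localhost:9093 is an assumed address (9093 is Alertmanager's default port).

```yaml
# prometheus.yml (fragment)
rule_files:
  - 'alert-rules.yml'                   # assumed to be in the same directory as prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']   # assumed Alertmanager address
```

The rule file used in this article is shown next.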

alert-rules.yml：

groups:
  - name: host-status-monitoring
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute. Please check the service status."

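expr: the condition that triggers the alert.
for: how long the condition must hold before the alert fires.
labels: labels attached to the alert, which routing rules can match on.
annotations: a detailed description of the alert.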
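
The rule above carries severity: critical. A warning-level rule can be written the same way, as in this sketch. It assumes that node_exporter is being scraped (the node_cpu_seconds_total metric comes from it), and the group name, alert name, and 80% threshold are illustrative choices.

```yaml
groups:
  - name: host-resource-monitoring        # hypothetical group name
    rules:
      - alert: HighCpuUsage               # hypothetical alert name
        # CPU usage in percent = 100 - average idle percentage per instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} has been above 80% for 5 minutes (current value: {{ $value }})."
```

Rule files can be validated with promtool check rules alert-rules.yml before Prometheus is reloaded.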

Editor: 趙寧寧  Source: 運維李哥不背鍋

