Using Prometheus for SLI/SLO Monitoring and Visualization
What are SLI and SLO?
SLI, short for Service Level Indicator, is a metric used to measure the stability of a system.
SLO, short for Service Level Objective, is the stability target we set for a service, such as "four nines" or "five nines".
SREs typically use these two concepts together to measure system stability. The basic idea is to evaluate the SLO through SLIs, that is, to use a set of indicators to determine whether we have reached our target of "so many nines".
How to choose SLIs
A system exposes many kinds of metrics, for example:
- System level: CPU usage, memory usage, disk usage, etc.
- Application server level: port liveness, JVM status, etc.
- Application level: status codes, latency, QPS, etc.
- Middleware level: QPS, TPS, latency, etc.
- Business level: success rate, growth rate, etc.
With so many metrics available, how should we choose? Two principles are enough:
- Pick metrics that indicate whether a given subject is stable. Metrics that do not belong to the subject itself, or that say nothing about its stability, should be excluded.
- Prefer metrics that are strongly correlated with user experience, or that users can directly perceive.
In most cases, you can directly apply Google's VALET method:
- V: Volume, capacity, the maximum capacity the service promises
- A: Availability, whether the service is working
- L: Latency, the response time of the service
- E: Error, the request error rate
- T: Ticket, whether human intervention is required
This is the sample breakdown Google gives for the VALET method.
The above is only a brief introduction to SLI and SLO; to go deeper, see Site Reliability Engineering: How Google Runs Production Systems and Zhao Cheng's Geek Time course SRE實踐手冊 (SRE Practice Handbook). Below is a walkthrough of how to do SLI/SLO monitoring with Prometheus.
service-level-operator
The service-level-operator measures SLI/SLO metrics for applications running in Kubernetes, and the results can be visualized with Grafana.
The operator watches SLO definitions, declared as ServiceLevel custom resources, and generates new metrics from them. For example:
```yaml
apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://myprometheus:9090
          totalQuery: sum(increase(http_request_total{host="awesome_service_io"}[2m]))
          errorQuery: sum(increase(http_request_total{host="awesome_service_io", code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: a-team
            iteration: "3"
```
- availabilityObjectivePercent: the SLO, i.e. the target availability percentage
- totalQuery: the total number of requests
- errorQuery: the number of failed requests
With totalQuery and errorQuery, the operator can compute the SLO metrics.
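Conceptually, the SLI behind these two queries is just the error ratio. As a sketch, reusing the queries from the example above, you can run the equivalent expression directly against Prometheus:

```promql
# Error ratio over the last 2 minutes: errorQuery / totalQuery
sum(increase(http_request_total{host="awesome_service_io", code=~"5.."}[2m]))
/
sum(increase(http_request_total{host="awesome_service_io"}[2m]))
```

The operator evaluates the two queries against the configured Prometheus address and re-exposes the result as its own metrics, so dashboards and alerts do not have to repeat the raw queries.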
Deploying the service-level-operator
- Prerequisite: Prometheus is already deployed in the Kubernetes cluster; here it was deployed with the Prometheus Operator.
(1) First, create the RBAC resources:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: service-level-operator
  labels:
    app: service-level-operator
    component: app
rules:
  # Register and check CRDs.
  - apiGroups:
      - apiextensions.k8s.io
    resources:
      - customresourcedefinitions
    verbs:
      - "*"
  # Operator logic.
  - apiGroups:
      - monitoring.spotahome.com
    resources:
      - servicelevels
      - servicelevels/status
    verbs:
      - "*"
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: service-level-operator
subjects:
  - kind: ServiceAccount
    name: service-level-operator
    namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: service-level-operator
```
(2) Then create the Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-level-operator
      component: app
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: service-level-operator
        component: app
    spec:
      serviceAccountName: service-level-operator
      containers:
        - name: app
          imagePullPolicy: Always
          image: quay.io/spotahome/service-level-operator:latest
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: http
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: http
          resources:
            limits:
              cpu: 220m
              memory: 254Mi
            requests:
              cpu: 120m
              memory: 128Mi
```
(3) Create the Service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
spec:
  ports:
    - port: 80
      protocol: TCP
      name: http
      targetPort: http
  selector:
    app: service-level-operator
    component: app
```
(4) Create the Prometheus ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
    prometheus: myprometheus
spec:
  selector:
    matchLabels:
      app: service-level-operator
      component: app
  namespaceSelector:
    matchNames:
      - monitoring
  endpoints:
    - port: http
      interval: 10s
```
At this point the Service Level Operator is deployed, and the corresponding target appears on Prometheus's Targets page.
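To double-check the scrape from the query UI, the following should return 1. This is a sketch: the job label is assumed to default to the Service name, so adjust it if your setup differs:

```promql
# 1 means the operator's metrics endpoint is being scraped successfully
# (job label assumed to follow the Service name "service-level-operator")
up{job="service-level-operator"}
```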
Next, define the service-level metrics for the application you want to measure; the example below creates one.
```yaml
apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: prometheus-grafana-service
  namespace: monitoring
spec:
  serviceLevelObjectives:
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://prometheus-k8s.monitoring.svc:9090
          totalQuery: sum(increase(http_request_total{service="grafana"}[2m]))
          errorQuery: sum(increase(http_request_total{service="grafana", code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: prometheus-grafana
            iteration: "3"
```
The above defines a "four nines" SLO for the Grafana application.
You can then see the resulting metrics in Prometheus.
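All series generated by the operator share the service_level_ prefix, so they are easy to find; the names below are the ones the alert rules later in this post rely on:

```promql
# Lists every metric produced by the operator, e.g.
# service_level_sli_result_error_ratio_total,
# service_level_sli_result_count_total,
# service_level_slo_objective_ratio
{__name__=~"service_level_.*"}
```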
Next, import the dashboard with ID 8793 into Grafana to chart them.
The dashboard shows the SLI at the top, with the total error budget and the errors already consumed below.
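The budget numbers can also be computed outside the dashboard from the operator's metrics; a sketch over a 30-day window:

```promql
# Fraction of the monthly error budget consumed:
# measured error ratio divided by the allowed error ratio (1 - objective)
(
  increase(service_level_sli_result_error_ratio_total[30d])
  /
  increase(service_level_sli_result_count_total[30d])
)
/
(1 - service_level_slo_objective_ratio)
```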
Finally, define alerting rules so you are notified the moment the SLO starts to degrade, for example:
```yaml
groups:
  - name: slo.rules
    rules:
      - alert: SLOErrorRateTooFast1h
        expr: |
          (
            increase(service_level_sli_result_error_ratio_total[1h])
            /
            increase(service_level_sli_result_count_total[1h])
          ) > (1 - service_level_slo_objective_ratio) * 14.6
        labels:
          severity: critical
          team: a-team
        annotations:
          summary: The monthly SLO error budget consumed for 1h is greater than 2%
          description: The error rate for 1h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 2% monthly budget.
      - alert: SLOErrorRateTooFast6h
        expr: |
          (
            increase(service_level_sli_result_error_ratio_total[6h])
            /
            increase(service_level_sli_result_count_total[6h])
          ) > (1 - service_level_slo_objective_ratio) * 6
        labels:
          severity: critical
          team: a-team
        annotations:
          summary: The monthly SLO error budget consumed for 6h is greater than 5%
          description: The error rate for 6h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 5% monthly budget.
```
The first rule fires when the error rate over the last 1h consumes more than 2% of the 30-day error budget; the second fires when the error rate over the last 6h consumes more than 5% of it.
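The multipliers follow from the burn-rate formula: burn rate = budget fraction consumed × (30 × 24 h / alert window). Burning 2% of a 30-day (720 h) budget within 1 hour corresponds to 0.02 × 720 / 1 = 14.4 times the sustainable rate (the rule above uses the slightly stricter 14.6), and 5% within 6 hours corresponds to 0.05 × 720 / 6 = 6.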
These windows and budget percentages follow the benchmark Google recommends.
最后
When talking about system stability, we also have to talk about system availability. SREs improve stability ultimately to increase the time a system is usable and to reduce the time it is down. So how is availability measured?
The industry currently measures availability along two dimensions: time and requests. The time dimension evaluates stability in terms of outages, while the request dimension evaluates it by the proportion of requests that succeed.
Time dimension: availability = uptime / (uptime + downtime)
Request dimension: availability = successful requests / total requests
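To make this concrete: under the time dimension, a 99.99% target over a 30-day month (43,200 minutes) allows at most 43,200 × 0.01% ≈ 4.3 minutes of downtime.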
In SRE practice, the request dimension is usually chosen to measure stability, as in the examples above. Still, judging stability from a single dimension is too blunt an instrument: combine it with more indicators, such as latency and error rate, and make the SLIs for core applications and core request paths more fine-grained.
References
[1] Zhao Cheng, SRE實踐手冊 (SRE Practice Handbook), Geek Time
[2] Site Reliability Engineering: How Google Runs Production Systems (Chinese edition: SRE:Google運維解密)
[3] https://github.com/spotahome/service-level-operator