Prometheus sends the alert events it detects to Alertmanager, and Alertmanager dispatches the notifications for those events (email, webhook, and so on).
configMap.yaml file

Only some of the parameters are commented below; refer to the official documentation for the full set of options.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager
  namespace: monitoring
data:
  config.yml: |-
    global:
      smtp_smarthost: 'smtp.idcsec.com:25'   # SMTP server for email notifications
      smtp_from: 'root@idcsec.com'
      smtp_auth_username: 'root@idcsec.com'
      smtp_auth_password: 'pwd'
      resolve_timeout: 10m
      smtp_require_tls: false                # whether to require TLS
    route:                        # routing rules: send different alerts to the designated receivers
      group_by: ['alertname']     # how alerts are grouped together
      repeat_interval: 24h
      receiver: monitoring        # default receiver; `monitoring` must exist, otherwise Alertmanager exits with an error
      routes:                     # sub-routes: send different alert levels to different receivers
      - match:
          team: dba               # matches the labels defined in the Prometheus rule_files
        receiver: db-team-email   # receiver name, must be globally unique
        continue: true            # by default matching stops at the first matched route; continue keeps evaluating the remaining routes
    receivers:                    # receivers
    - name: 'monitoring'
      email_configs:
      - send_resolved: true       # whether to notify when an alert is resolved; enabled here
        to: 'test@idcsec.com'     # recipients; multiple addresses can be comma-separated
    - name: 'db-team-email'
      email_configs:
      - send_resolved: true
        to: 'xxx@xxx.com'
```
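Before applying it, the embedded Alertmanager configuration can be validated locally. The following is only a sketch; it assumes `amtool` (shipped with Alertmanager) is installed and that the `config.yml` section above has been saved to a local file, and the file names used are illustrative.

```bash
# Validate the Alertmanager configuration syntax locally (file name is illustrative).
amtool check-config config.yml

# Create or update the ConfigMap and confirm it exists in the monitoring namespace.
kubectl apply -f configMap.yaml
kubectl get configmap alertmanager -n monitoring
```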
alertmanager.yaml file
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
  creationTimestamp: 2018-07-31T13:08:06Z
  generation: 3
  labels:
    app: alertmanager
  name: alertmanager
  namespace: monitoring
  resourceVersion: "43603292"
  selfLink: /apis/extensions/v1beta1/namespaces/monitoring/deployments/alertmanager
  uid: c3c75e6c-94c2-11e8-b5ba-1866daeddaa4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: alertmanager
      name: alertmanager
    spec:
      containers:
      - args:
        - -config.file=/etc/alertmanager/config.yml
        - -storage.path=/alertmanager
        image: alertmanager:v0.7.1
        imagePullPolicy: IfNotPresent
        name: alertmanager
        ports:
        - containerPort: 9093
          name: alertmanager
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/alertmanager
          name: config-volume
        - mountPath: /etc/alertmanager-templates
          name: templates-volume
        - mountPath: /alertmanager
          name: alertmanager
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: alertmanager
        name: config-volume
      - configMap:
          defaultMode: 420
          name: alertmanager-templates
        name: templates-volume
      - emptyDir: {}
        name: alertmanager
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-07-31T13:08:06Z
    lastUpdateTime: 2018-07-31T13:08:06Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
---
```
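The Prometheus configuration in the next section points at the target `alertmanager:9093`, which implies a Service named `alertmanager` in the `monitoring` namespace. The Deployment above does not include one, so the Service manifest below is an assumption, shown only as a sketch of how the Deployment could be applied and exposed inside the cluster.

```bash
kubectl apply -f alertmanager.yaml

# Assumed Service so that the "alertmanager:9093" target in prometheus.yml resolves;
# it is not part of the original manifests.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  ports:
  - name: alertmanager
    port: 9093
    targetPort: 9093
EOF

kubectl get pods -n monitoring -l app=alertmanager
kubectl get svc alertmanager -n monitoring
```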
Deploy the Prometheus configuration related to Alertmanager
Specify the rule files in prometheus.yml (wildcards such as rules/*.rules are supported); here a single rules.yml is used.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:9093"]
    rule_files:
      - "rules.yml"
    scrape_configs:
    ..........
  rules.yml: |-
    groups:
    - name: noah_pod.rules
      rules:
      - alert: Pod_all_cpu_usage
        expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
        for: 5m
        labels:
          severity: critical
          service: pods
        annotations:
          description: Container {{ $labels.name }} CPU usage is above 75% (current value is {{ $value }})
          summary: Dev CPU load alert
      - alert: Pod_all_memory_usage
        expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 1024*10^3*2
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
          summary: Dev memory load alert
      - alert: Pod_all_network_receive_usage
        expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} network_receive is above 50M (current value is {{ $value }})
          summary: network_receive load alert
    - name: Oracle.rules
      rules:
      - alert: Oracledb-status
        expr: oracledb_up{job="oracle-198"} == 0
        for: 60s
        labels:
          severity: critical
          team: dba
        annotations:
          summary: Database {{ $labels.instance }} alert
          description: "Database {{ $labels.instance }} is abnormal (current value: {{ $value }})"
```
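The rule definitions can be checked before Prometheus reloads them. A minimal sketch, assuming `promtool` is available and the `rules.yml` section above has been saved to a local rules.yml file; the manifest file name `prometheus-config.yaml` is also an assumption.

```bash
# Check the rule file syntax (Prometheus 2.x; 1.x uses `promtool check-rules` instead).
promtool check rules rules.yml

# Apply the updated ConfigMap (manifest file name is illustrative).
kubectl apply -f prometheus-config.yaml
```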
After updating the Prometheus configuration, Prometheus must be told to reload it. There are two ways (see the sketch after this list):
1. Send an HTTP POST request to the /-/reload endpoint, e.g. curl -X POST http://ip:9090/-/reload
2. Send a SIGHUP signal to the Prometheus process
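A sketch of both reload methods against a Prometheus instance running in the cluster; the pod name below is a placeholder, and on Prometheus 2.x the HTTP endpoint only works when the server was started with --web.enable-lifecycle.

```bash
# Method 1: hit the reload endpoint over HTTP.
curl -X POST http://<prometheus-ip>:9090/-/reload

# Method 2: send SIGHUP to the Prometheus process,
# assuming it runs as PID 1 inside the container.
kubectl exec -n monitoring <prometheus-pod-name> -- kill -HUP 1
```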


