prometheus alertmanager邮件告警

prometheus将监测到的异常事件发送给alertmanager,alertmanager发送异常事件的通知(邮件、webhook等)

configMap.yaml文件

随便注释一部分具体参数参考官方

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager
namespace: monitoring
data:
config.yml: |-
global:
smtp_smarthost: 'smtp.idcsec.com:25' # Email SMTP 服务器信息
smtp_from: 'root@idcsec.com'
smtp_auth_username: 'root@idcsec.com'
smtp_auth_password: 'pwd'
resolve_timeout: 10m
smtp_require_tls: false #是否开启 TLS
route: # 路由规则配置将不同告警发送给指定人
group_by: ['alertname'] # 告警压缩规则
repeat_interval: 24h
receiver: monitoring #默认发送到 `monitoring`,该 monitoring 必须存在,否则报错退出
routes: # 子路由告警级别分别发给不同的接收器
- match:
team:dba #匹配prometheus的rule_files文件中的labels
receiver: db-team-email receiver接收器名称 全局唯一
continue: true # 默认告警匹配成功第一个 receivers 会退出匹配,开启 continue 参数后会继续匹配 receivers 列表

receivers: # 接收器
- name: 'monitoring'
email_configs:
- send_resolved: true #告警恢复后否发送通知,这里选择发送
to: 'test@idcsec.com' #接收邮件'A@.com,B@com'
- name: 'db-team-email'
email_configs:
- send_resolved: true
to:'xxx@xxx.com'

alertmanager.yaml文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "3"
creationTimestamp: 2018-07-31T13:08:06Z
generation: 3
labels:
app: alertmanager
name: alertmanager
namespace: monitoring
resourceVersion: "43603292"
selfLink: /apis/extensions/v1beta1/namespaces/monitoring/deployments/alertmanager
uid: c3c75e6c-94c2-11e8-b5ba-1866daeddaa4
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
app: alertmanager
name: alertmanager
spec:
containers:
- args:
- -config.file=/etc/alertmanager/config.yml
- -storage.path=/alertmanager
image: alertmanager:v0.7.1
imagePullPolicy: IfNotPresent
name: alertmanager
ports:
- containerPort: 9093
name: alertmanager
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/alertmanager
name: config-volume
- mountPath: /etc/alertmanager-templates
name: templates-volume
- mountPath: /alertmanager
name: alertmanager
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
name: alertmanager
name: config-volume
- configMap:
defaultMode: 420
name: alertmanager-templates
name: templates-volume
- emptyDir: {}
name: alertmanager
status:
availableReplicas: 1
conditions:
- lastTransitionTime: 2018-07-31T13:08:06Z
lastUpdateTime: 2018-07-31T13:08:06Z
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
observedGeneration: 3
readyReplicas: 1
replicas: 1
updatedReplicas: 1
---

部署Prometheus alertmanager相关配置文件
在prometheus.yml中指定规则文件(可使用通配符,如rules/*.rules)这使用 rules.yaml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
labels:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"] # AlertManager 地址与端口
rule_files:
- "rules.yml"
scrape_configs:
..........

rules.yml: |-
groups:
- name: noah_pod.rules
rules:
- alert: Pod_all_cpu_usage
expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
for: 5m
labels:
severity: critical
service: pods
annotations:
description: 容器 {{ $labels.name }} CPU 资源利用率大于 75% , (current value is {{ $value }})
summary: Dev CPU 负载告警
- alert: Pod_all_memory_usage
expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 1024*10^3*2
for: 10m
labels:
severity: critical
annotations:
description: 容器 {{ $labels.name }} Memory 资源利用率大于 2G , (current value is {{ $value }})
summary: Dev Memory 负载告警
- alert: Pod_all_network_receive_usage
expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
for: 10m
labels:
severity: critical
annotations:
description: 容器 {{ $labels.name }} network_receive 资源利用率大于 50M , (current value is {{ $value }})
summary: network_receive 负载告警
- name: Oracle.rules
rules:
- alert: Oracledb-status
expr: oracledb_up{job="oracle-198"} == 0
for: 60s
labels:
severity: critica
team:dba
annotations:
summary: 数据库 {{ $labels.instance }} 告警
description: "数据库 {{ $labels.instance }} 异常 (当前值: {{ $value }}"

更新prometheus的配置需要让重新读取,有两种方法:
1、通过HTTP API向/-/reload发送POST请求,例:curl -X POST http://ip:9090/-/reload
2、向prometheus进程发送SIGHUP信号
crt
crt
crt