Reference: https://www.cnblogs.com/sanduzxcvbnm/p/16291296.html

YAML files used in this chapter: https://files.cnblogs.com/files/sanduzxcvbnm/operator_yaml.zip?t=1654593400

Background

Deploy according to the official documentation, resolve the various problems that come up during deployment, and apply a few optimizations.

Any details not covered here can be adjusted to suit your own environment.

Installation

git clone https://github.com/coreos/kube-prometheus.git
cd kube-prometheus/manifests

Two files need their image registry changed, otherwise the images cannot be pulled:

File 1: kubeStateMetrics-deployment.yaml — change k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2 to bitnami/kube-state-metrics:2.4.2

File 2: prometheusAdapter-deployment.yaml — change k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1 to selina5288/prometheus-adapter:v0.9.1
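A minimal sketch of the two swaps, assuming you are inside kube-prometheus/manifests (verify the resulting image lines afterwards):

# Replace the unreachable k8s.gcr.io images with the mirrors mentioned above
sed -i 's#k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2#bitnami/kube-state-metrics:2.4.2#' kubeStateMetrics-deployment.yaml
sed -i 's#k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1#selina5288/prometheus-adapter:v0.9.1#' prometheusAdapter-deployment.yaml
grep -n 'image:' kubeStateMetrics-deployment.yaml prometheusAdapter-deployment.yaml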

Three files need their apiVersion changed (the cluster here is Kubernetes 1.20.11, where PodDisruptionBudget is still served as v1beta1, so change it to policy/v1beta1):

File 1: alertmanager-podDisruptionBudget.yaml — change apiVersion: policy/v1 to apiVersion: policy/v1beta1

File 2: prometheus-podDisruptionBudget.yaml — change apiVersion: policy/v1 to apiVersion: policy/v1beta1

File 3: prometheusAdapter-podDisruptionBudget.yaml — change apiVersion: policy/v1 to apiVersion: policy/v1beta1
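You can confirm which policy API versions the cluster actually serves before editing, and make the three edits in one go (a sketch):

kubectl api-versions | grep '^policy/'   # on 1.20.x only policy/v1beta1 is listed; policy/v1 arrives in 1.21
sed -i 's#apiVersion: policy/v1$#apiVersion: policy/v1beta1#' \
  alertmanager-podDisruptionBudget.yaml prometheus-podDisruptionBudget.yaml prometheusAdapter-podDisruptionBudget.yaml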

New files to add, saved under the manifests directory:

File 1: prometheus-kubeControllerManagerService.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  clusterIP: None
  selector:
    component: kube-controller-manager
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP

File 2: prometheus-kubeSchedulerService.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  clusterIP: None
  selector:
    component: kube-scheduler
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP
# Note: running `kubectl apply -f setup/` fails with:
#   The CustomResourceDefinition "prometheuses.monitoring.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes
# Use `kubectl create -f setup/` instead; alternatively run `kubectl apply -f setup/` first and, once the error
# above appears, create the failing file separately with `kubectl create -f setup/0prometheusCustomResourceDefinition.yaml`.
kubectl create -f setup/
kubectl apply -f .
kubectl get pods -n monitoring
kubectl get svc -n monitoring

Access

ClusterIP Services are created for grafana, alertmanager and prometheus. To reach them from outside the cluster you can either create corresponding Ingress objects or switch the Services to type NodePort. For simplicity we use NodePort here: edit the grafana, alertmanager-main and prometheus-k8s Services and change their type to NodePort:

# Change type: ClusterIP to type: NodePort
$ kubectl edit svc grafana -n monitoring
$ kubectl edit svc alertmanager-main -n monitoring
$ kubectl edit svc prometheus-k8s -n monitoring
$ kubectl get svc -n monitoring

Note: at this point, accessing the services from a browser returns a 504 error. The cause is the NetworkPolicy objects that restrict access; deleting the corresponding network policies fixes it. The same fix applies if access through an Ingress fails.
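If you hit the 504, a quick way to find and remove the policies is sketched below (the policy names depend on what your kube-prometheus version ships, so list them first):

kubectl get networkpolicy -n monitoring
kubectl delete networkpolicy grafana alertmanager-main prometheus-k8s -n monitoring   # adjust names to the previous output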

Alternatively, create the corresponding Ingress objects.

The local hosts file needs custom entries for the hostnames used below:

# cat alertmanager-ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alertmanager-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: www.fff.com # custom domain; add a hosts entry on the client machine
    http:
      paths:
      - backend:
          service:
            name: alertmanager-main
            port:
              number: 9093
        path: /
        pathType: Prefix

# cat grafana-ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: www.eee.com # custom domain; add a hosts entry on the client machine
    http:
      paths:
      - backend:
          service:
            name: grafana
            port:
              number: 3000
        path: /
        pathType: Prefix

# cat prometheus-ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: www.ddd.com # custom domain; add a hosts entry on the client machine
    http:
      paths:
      - backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
        path: /
        pathType: Prefix
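For reference, the hosts entries could look like this (the IP is a placeholder for one of the nodes running the ingress-nginx controller):

# /etc/hosts on the client machine
192.168.1.10  www.ddd.com www.eee.com www.fff.com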

Log in to Grafana for the first time with admin:admin. On the home page you will find that Grafana already ships with many pre-configured dashboards.

Monitoring the kube-controller-manager and kube-scheduler system components

The two files prometheus-kubeControllerManagerService.yaml and prometheus-kubeSchedulerService.yaml were already added during installation, but the corresponding Prometheus targets are still unreachable. This is because kube-controller-manager and kube-scheduler bind their secure ports to 127.0.0.1 rather than 0.0.0.0.

Fix:

vim /etc/kubernetes/manifests/kube-controller-manager.yaml
# change --bind-address=127.0.0.1 to --bind-address=0.0.0.0

vim /etc/kubernetes/manifests/kube-scheduler.yaml
# change --bind-address=127.0.0.1 to --bind-address=0.0.0.0

Because kube-controller-manager and kube-scheduler run as static Pods, editing the manifests in the static Pod directory is enough; after a short while the corresponding services restart automatically.
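A quick sanity check after the kubelet has restarted the static Pods (a sketch; the ports are the defaults used in the Services above):

# On a control-plane node the secure ports should now listen on all interfaces
ss -tlnp | grep -E '10257|10259'
# After the next scrape interval the kube-controller-manager and kube-scheduler targets
# on the Prometheus "Targets" page should turn green.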

Configuring custom monitoring rules with PrometheusRule

To define a custom alerting rule, simply create a PrometheusRule object carrying the prometheus=k8s and role=alert-rules labels, for example:

Note: the labels must include at least prometheus=k8s and role=alert-rules.

# prometheus-etcdRules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s   # required
    role: alert-rules # required
  name: etcd-rules
  namespace: monitoring
spec:
  groups:
  - name: etcd # the actual alerting rules
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
        description: If one more etcd peer goes down the cluster will be unavailable
      expr: |
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
      for: 3m
      labels:
        severity: critical
# kubectl apply -f prometheus-etcdRules.yaml
prometheusrule.monitoring.coreos.com/etcd-rules created
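A couple of optional checks that the rule was picked up (a sketch; the rule-file path follows the Operator's default layout and may differ between versions):

kubectl get prometheusrule -n monitoring etcd-rules
# The Operator renders PrometheusRule objects into rule files mounted inside the Prometheus pods:
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- \
  ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/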

Configuring WeChat Work (WeCom) alerting

Modify the alertmanager-secret.yaml file directly, add the alert receiver parameters, then re-apply the resource object.

Apart from Watchdog, all alerts are sent through WeChat Work.

apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.24.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = critical"
      "target_matchers":
      - "severity =~ warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = warning"
      "target_matchers":
      - "severity = info"
    - "equal":
      - "namespace"
      "source_matchers":
      - "alertname = InfoInhibitor"
      "target_matchers":
      - "severity = info"
    "receivers":
    - "name": "Default"
      "wechat_configs":
      - corp_id: 'xxx' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2'     # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    - "name": "Watchdog"
    - "name": "Critical"
      "wechat_configs":
      - corp_id: 'xxx' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2'     # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    - "name": "null"
      "wechat_configs":
      - corp_id: 'xxx' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2'     # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "12h"
      "routes":
      - "matchers":
        - "alertname = Watchdog"
        "receiver": "Watchdog"
      - "matchers":
        - "alertname = InfoInhibitor"
        "receiver": "null"
      - "matchers":
        - "severity = critical"
        "receiver": "Critical"
type: Opaque

# Update the file directly; alerts will then start arriving through WeChat Work
$ kubectl apply -f alertmanager-secret.yaml
secret/alertmanager-main configured

Note: running kubectl apply -f alertmanager-secret.yaml creates a Secret named alertmanager-main whose content is the alertmanager.yaml file.

If you want to add a custom WeChat Work alert template, there are two approaches.

The first is to add the template content to the same alertmanager-secret.yaml file and apply it again with kubectl apply:

apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.24.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = critical"
      "target_matchers":
      - "severity =~ warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = warning"
      "target_matchers":
      - "severity = info"
    - "equal":
      - "namespace"
      "source_matchers":
      - "alertname = InfoInhibitor"
      "target_matchers":
      - "severity = info"
    "receivers":
    - "name": "Default"
      "wechat_configs":
      - corp_id: 'ww0b85c21458a13b12' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2'     # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    - "name": "Watchdog"
    - "name": "Critical"
      "wechat_configs":
      - corp_id: 'ww0b85c21458a13b12' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2'     # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    - "name": "null"
      "wechat_configs":
      - corp_id: 'ww0b85c21458a13b12' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2'     # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "12h"
      "routes":
      - "matchers":
        - "alertname = Watchdog"
        "receiver": "Watchdog"
      - "matchers":
        - "alertname = InfoInhibitor"
        "receiver": "null"
      - "matchers":
        - "severity = critical"
        "receiver": "Critical"
    "templates":
    - 'wechat_template.tmpl'
  wechat_template.tmpl: |-
    {{ define "wechat.default.message" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 }}
    ========== Alert firing ==========
    Alert name: {{ $alert.Labels.alertname }}
    Severity: {{ $alert.Labels.severity }}
    Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
    Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- if gt (len $alert.Labels.instance) 0 }}
    Instance: {{ $alert.Labels.instance }}
    {{- end }}
    {{- if gt (len $alert.Labels.namespace) 0 }}
    Namespace: {{ $alert.Labels.namespace }}
    {{- end }}
    {{- if gt (len $alert.Labels.node) 0 }}
    Node: {{ $alert.Labels.node }}
    {{- end }}
    {{- if gt (len $alert.Labels.pod) 0 }}
    Pod: {{ $alert.Labels.pod }}
    {{- end }}
    ============ END ============
    {{- end }}
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 }}
    ========== Alert resolved ==========
    Alert name: {{ $alert.Labels.alertname }}
    Severity: {{ $alert.Labels.severity }}
    Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
    Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- if gt (len $alert.Labels.instance) 0 }}
    Instance: {{ $alert.Labels.instance }}
    {{- end }}
    {{- if gt (len $alert.Labels.namespace) 0 }}
    Namespace: {{ $alert.Labels.namespace }}
    {{- end }}
    {{- if gt (len $alert.Labels.node) 0 }}
    Node: {{ $alert.Labels.node }}
    {{- end }}
    {{- if gt (len $alert.Labels.pod) 0 }}
    Pod: {{ $alert.Labels.pod }}
    {{- end }}
    ============ END ============
    {{- end }}
    {{- end }}
    {{- end }}
    {{- end }}
type: Opaque

Problem: when a firing alert and its recovery notification are sent together in one message, the recovery time shown in the recovery part is wrong.

When a recovery notification is sent on its own, however, the recovery time it shows is correct.

A standalone firing notification also shows the correct time.

The second approach is to create a standalone alertmanager.yaml file and a wechat.tmpl template file, then build the Secret from them with kubectl create secret.

Contents of alertmanager.yaml:

global:
  resolve_timeout: 5m
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
templates:
- '*.tmpl'
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: 'wechat'
  routes:
  - receiver: 'wechat'
    group_wait: 10s
    match:
      severity: warning
  - receiver: 'wechat'
    group_wait: 5s
    match:
      severity: critical
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'xxx'      # fill in for your environment
    agent_id: '1000005' # fill in for your environment
    api_secret: 'xxx'   # fill in for your environment
    to_party: '2'       # fill in for your environment
    send_resolved: true

Create a file named wechat.tmpl:

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
========== Alert firing ==========
Alert name: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
Namespace: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
Node: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
Pod: {{ $alert.Labels.pod }}
{{- end }}
============ END ============
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
========== Alert resolved ==========
Alert name: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
Namespace: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
Node: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
Pod: {{ $alert.Labels.pod }}
{{- end }}
============ END ============
{{- end }}
{{- end }}
{{- end }}
{{- end }}
# Delete the old secret
$ kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted

# Create the new secret with the following command; note that it differs from the command used in the first approach
$ kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring
secret/alertmanager-main created

To check whether either approach has taken effect (a sketch of the commands follows this list):

1. Check the Secrets in the cluster and confirm the ones you created exist.

2. Check the alertmanager logs for errors.

3. Check the config shown on the Alertmanager web UI and confirm your configuration appears there.

4. Exec into the alertmanager Pod and confirm the files are present.
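Rough commands for the four checks above (a sketch; container names and mount paths follow kube-prometheus defaults and may differ between versions):

kubectl get secret alertmanager-main -n monitoring                                  # 1. the secret exists
kubectl logs alertmanager-main-0 -c alertmanager -n monitoring | tail -n 20         # 2. no config-load errors
# 3. open the Alertmanager web UI -> Status and look for the wechat receivers in the displayed config
kubectl exec -it alertmanager-main-0 -c alertmanager -n monitoring -- ls /etc/alertmanager/config/   # 4. files are present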

Auto-discovery configuration

Targets are selected by adding a prometheus.io/scrape=true annotation to their Service. Save the scrape configuration below as prometheus-additional.yaml, then create a Secret from that file:

# cat prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name

# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret "additional-configs" created

In the prometheus resource definition (prometheus-prometheus.yaml), reference this extra configuration via the additionalScrapeConfigs field:

# cat prometheus-prometheus.yaml

  ......
  version: v2.15.2
  additionalScrapeConfigs: # the following three lines are new
    name: additional-configs
    key: prometheus-additional.yaml

Once added, update the prometheus CRD object directly:

# kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com "k8s" configured

After a short while, open the Prometheus dashboard and confirm the configuration has taken effect.

On the targets page, however, the new scrape job does not show up. The Prometheus Pod logs contain many errors of the form "xxx is forbidden", which points to an RBAC permission problem. From the prometheus resource definition we know Prometheus runs under a ServiceAccount named prometheus-k8s, which is bound to a ClusterRole of the same name (prometheus-clusterRole.yaml).

That ClusterRole clearly lacks list permission on Services and Pods, hence the errors. To fix the problem, simply add the required permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

Update the ClusterRole above and recreate all Prometheus Pods; the kubernetes-endpoints job should then appear on the targets page.

The scrape targets discovered here are picked up because their Services carry the prometheus.io/scrape=true annotation.
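A hypothetical example: any Service annotated like this will be picked up by the kubernetes-endpoints job above (names and port are made up for illustration):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-app            # hypothetical application
  namespace: default
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"   # port that exposes /metrics
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080
EOF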

Data persistence

Prometheus persistence: add the following to prometheus-prometheus.yaml:

  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: rook-cephfs # adjust to your environment
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 30Gi

Grafana persistence

1. grafana-pvc.yaml (create this file)

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: grafana
  namespace: monitoring
spec:
  storageClassName: rook-cephfs # adjust to your environment
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

2. grafana-deployment.yaml (edit this file)

      volumes:
      - name: grafana-storage # new configuration
        persistentVolumeClaim:
          claimName: grafana
      #- emptyDir: {}          # comment out the original
      #  name: grafana-storage

# kubectl apply -f grafana-pvc.yaml
persistentvolumeclaim/grafana created

# kubectl apply -f grafana-deployment.yaml
deployment.apps/grafana configured
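A quick check that persistence took effect (sketch): the PVCs should become Bound against the rook-cephfs StorageClass and the Pods should come back up with the volumes attached.

kubectl get pvc -n monitoring
kubectl get pods -n monitoring | grep -E 'grafana|prometheus-k8s'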

Adding a ServiceMonitor to monitor ingress-nginx

The Prometheus Operator picks up metric targets through the ServiceMonitor CRD, which associates scrape jobs with Services via their labels.

Create kubernetes-serviceMonitorIngressNginx.yaml under manifests and apply it:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: kube-prometheus
  name: ingress-nginx
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: metrics
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx

# kubectl apply -f kubernetes-serviceMonitorIngressNginx.yaml
servicemonitor.monitoring.coreos.com/ingress-nginx created

Create ingress-metrics.yaml under manifests and apply it:

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
  annotations:
    prometheus.io/port: "10254"  # these two annotations come from the ingress-nginx documentation
    prometheus.io/scrape: "true"
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 10254
    targetPort: 10254
    protocol: TCP
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller

# kubectl apply -f ingress-metrics.yaml
service/ingress-nginx created

Prerequisite (this was already handled in the "Auto-discovery configuration" section above; if you skipped that section, you still need to make this change):

# vim prometheus-clusterRole.yaml
# add another apiGroups entry
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
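Before expecting the target to show up, it can help to confirm the controller actually exposes metrics on 10254 (a sketch; the Deployment name assumes the official ingress-nginx manifests and may differ in your cluster):

kubectl -n ingress-nginx port-forward deploy/ingress-nginx-controller 10254:10254 &
curl -s http://127.0.0.1:10254/metrics | head
# After applying the Service and ServiceMonitor above, the ingress-nginx job should appear on the Prometheus "Targets" page.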

Thanos

For how to configure Thanos with the Prometheus Operator, see the official documentation: https://github.com/coreos/prometheus-operator/blob/master/Documentation/thanos.md

$ kubectl explain prometheus.spec.thanos
KIND:     Prometheus
VERSION:  monitoring.coreos.com/v1

RESOURCE: thanos <Object>

DESCRIPTION:
     Thanos configuration allows configuring various aspects of a Prometheus
     server in a Thanos environment. This section is experimental, it may change
     significantly without deprecation notice in any release. This is
     experimental and may change significantly without backward compatibility in
     any release.

FIELDS:
   baseImage    <string>
     Thanos base image if other than default.

   grpcServerTlsConfig  <Object>
     GRPCServerTLSConfig configures the gRPC server from which Thanos Querier
     reads recorded rule data. Note: Currently only the CAFile, CertFile, and
     KeyFile fields are supported. Maps to the '--grpc-server-tls-*' CLI args.

   image        <string>
     Image if specified has precedence over baseImage, tag and sha combinations.
     Specifying the version is still necessary to ensure the Prometheus Operator
     knows what version of Thanos is being configured.

   listenLocal  <boolean>
     ListenLocal makes the Thanos sidecar listen on loopback, so that it does
     not bind against the Pod IP.

   objectStorageConfig  <Object>
     ObjectStorageConfig configures object storage in Thanos.

   resources    <Object>
     Resources defines the resource requirements for the Thanos sidecar. If not
     provided, no requests/limits will be set

   sha  <string>
     SHA of Thanos container image to be deployed. Defaults to the value of
     `version`. Similar to a tag, but the SHA explicitly deploys an immutable
     container image. Version and Tag are ignored if SHA is set.

   tag  <string>
     Tag of Thanos sidecar container image to be deployed. Defaults to the value
     of `version`. Version is ignored if Tag is set.

   tracingConfig        <Object>
     TracingConfig configures tracing in Thanos. This is an experimental
     feature, it may change in any upcoming release in a breaking way.

   version      <string>
     Version describes the version of Thanos to use.

Among the fields above is objectStorageConfig, which specifies the object storage configuration. We reuse the object storage config from the earlier Thanos chapter (thanos-storage-minio.yaml):

# cat thanos-storage-minio.yaml
type: s3
config:
  bucket: promethes-operator-data # create this bucket in minio beforehand
  endpoint: minio.minio.svc.cluster.local:9000
  access_key: minio
  secret_key: minio123
  insecure: true
  signature_version2: false

Create a corresponding Secret resource from the config file above:

$ kubectl -n monitoring create secret generic thanos-objectstorage --from-file=thanos.yaml=thanos-storage-minio.yaml
secret/thanos-objectstorage created
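Optionally confirm the Secret holds the expected content (sketch):

kubectl -n monitoring get secret thanos-objectstorage -o jsonpath='{.data.thanos\.yaml}' | base64 -d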

Once the Secret exists, add the following to the prometheus CRD object (prometheus-prometheus.yaml):

  thanos:
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objectstorage

Then update the prometheus CRD object:

$ kubectl apply -f prometheus-prometheus.yaml

After the update, the Prometheus Pods have 4 containers each; a sidecar container has been added.

Next, deploy the other Thanos components: Querier, Store, and Compactor.

References: https://www.cnblogs.com/sanduzxcvbnm/p/16284934.html

https://jishuin.proginn.com/p/763bfbd56ae4

YAML files used for these steps: https://files.cnblogs.com/files/sanduzxcvbnm/operator_thanos.zip?t=1654661018

At present, hooking Thanos into the Prometheus CRD is an experimental feature, so be careful if you rely on it in production; later versions may change it. The thanos field lets us specify the image version to use and the corresponding object storage configuration. We again use minio as object storage (see the earlier chapters for its deployment). First log in to MinIO and create a bucket for Thanos, then create the object storage configuration file:

# thanos-storage-minio.yaml
type: s3
config:
  bucket: promethes-operator-data # bucket name; create it beforehand
  endpoint: minio.default.svc.cluster.local:9000 # minio access address
  access_key: minio
  secret_key: minio123
  insecure: true
  signature_version2: false

Create a Secret object from the config file above:

$ kubectl create secret generic thanos-objectstorage --from-file=thanos.yaml=thanos-storage-minio.yaml -n monitoring
secret/thanos-objectstorage created

With the object storage configuration ready, add the corresponding Thanos settings to the Prometheus CRD. The complete resource object is shown below (a few parameters differ from the default):

# cat prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.35.0
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  enableFeatures: []
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.35.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/instance: k8s
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.35.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  retention: 6h
  resources:
    requests:
      memory: 400Mi
  ruleNamespaceSelector: {}
  ruleSelector: {}
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.35.0
  additionalScrapeConfigs: # add the service discovery config
    name: additional-configs
    key: prometheus-additional.yaml
  thanos: # add the thanos configuration
    image: thanosio/thanos:v0.26.0
    resources:
      limits:
        cpu: 500m
        memory: 500Mi
      requests:
        cpu: 100m
        memory: 500Mi
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objectstorage
  #storage: # add local data persistence
  #  volumeClaimTemplate:
  #    spec:
  #      storageClassName: rook-cephfs
  #      resources:
  #        requests:
  #          storage: 20Gi # at least 20Gi
  #thanos: # add the thanos configuration
  #  objectStorageConfig:
  #    key: thanos.yaml
  #    name: thanos-objectstorage # the secret holding the object storage config

Then update it directly:

$ kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com/k8s configured

After the update, check the Prometheus Pods again: they now have 4 containers (there were already 3 before).

A sidecar container has been added on top of the original ones. It normally uploads data every 2 hours; its logs look like this:

# kubectl logs -f prometheus-k8s-0 -c thanos-sidecar -n monitoring
level=info ts=2022-06-08T02:23:04.21432378Z caller=options.go:27 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-06-08T02:23:04.215510591Z caller=factory.go:49 msg="loading bucket configuration"
level=info ts=2022-06-08T02:23:04.216213439Z caller=sidecar.go:360 msg="starting sidecar"
level=info ts=2022-06-08T02:23:04.216640996Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-06-08T02:23:04.21670998Z caller=http.go:73 service=http/server component=sidecar msg="listening for requests and metrics" address=:10902
level=info ts=2022-06-08T02:23:04.21707979Z caller=tls_config.go:195 service=http/server component=sidecar msg="TLS is disabled." http2=false
level=info ts=2022-06-08T02:23:04.218319048Z caller=reloader.go:199 component=reloader msg="nothing to be watched"
level=info ts=2022-06-08T02:23:04.218394592Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-06-08T02:23:04.218450345Z caller=grpc.go:131 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=:10901
level=info ts=2022-06-08T02:23:04.223323398Z caller=sidecar.go:179 msg="successfully loaded prometheus version"
level=info ts=2022-06-08T02:23:04.301263386Z caller=sidecar.go:201 msg="successfully loaded prometheus external labels" external_labels="{prometheus=\"monitoring/k8s\", prometheus_replica=\"prometheus-k8s-0\"}"
level=warn ts=2022-06-08T02:23:06.219784039Z caller=shipper.go:239 msg="reading meta file failed, will override it" err="failed to read /prometheus/thanos.shipper.json: open /prometheus/thanos.shipper.json: no such file or directory

Thanos Querier

The Thanos Querier component provides the ability to retrieve metrics from all prometheus instances in one place. It is fully compatible with plain prometheus's PromQL and HTTP APIs, so it can be used together with Grafana as well.

Because the Querier has to talk to the Sidecar and Store components, its --store arguments must point at the Thanos Sidecars started above. They can be discovered through the corresponding headless Service, which the Operator creates automatically with the name prometheus-operated (visible from the StatefulSet):

# kubectl describe svc -n monitoring prometheus-operated
Name: prometheus-operated
Namespace: monitoring
Labels: operated-prometheus=true
Annotations: <none>
Selector: app.kubernetes.io/name=prometheus
Type: ClusterIP
IP Families: <none>
IP: None
IPs: None
Port: web 9090/TCP
TargetPort: web/TCP
Endpoints: 10.1.112.219:9090,10.1.112.222:9090
Port: grpc 10901/TCP
TargetPort: grpc/TCP
Endpoints: 10.1.112.219:10901,10.1.112.222:10901
Session Affinity: None
Events: <none>
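For reference, the dnssrv+prometheus-operated:10901 flag used by the Querier below resolves to the endpoints behind this headless Service, roughly equivalent to:

kubectl -n monitoring get endpoints prometheus-operated
# From any pod that has dig installed, the SRV records can also be inspected directly:
#   dig +short SRV _grpc._tcp.prometheus-operated.monitoring.svc.cluster.local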

The complete manifest for the Thanos Querier component is shown below. Note that for the multi-replica prometheus instances deployed by the Prometheus Operator, the replica external label is prometheus_replica:

# cat querier.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: monitoring
  labels:
    app: thanos-querier
spec:
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
      - name: thanos
        image: thanosio/thanos:v0.26.0
        args:
        - query
        - --log.level=debug
        - --query.replica-label=prometheus_replica # note this line
        - --store=dnssrv+prometheus-operated:10901 # note this line
        #- --store=dnssrv+thanos-store:10901 # note this line: keep it commented out for now, uncomment it later
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "2Gi"
            cpu: "1"
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: http
          initialDelaySeconds: 10
        readinessProbe:
          httpGet:
            path: /-/healthy
            port: http
          initialDelaySeconds: 15
---
apiVersion: v1
kind: Service
metadata:
  name: thanos-querier
  namespace: monitoring
  labels:
    app: thanos-querier
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: http
    name: http
  selector:
    app: thanos-querier
  type: NodePort # NodePort for quick access; an ingress-nginx Ingress could be used instead

Create the resource objects above:

# kubectl apply -f querier.yaml

# kubectl get pods -n monitoring -l app=thanos-querier
NAME READY STATUS RESTARTS AGE
thanos-querier-557c7ff9dd-j2r7q 1/1 Running 0 43m

After deployment, open the Querier page in a browser and check the Stores it has connected to.

For example, query the node_load1 metric on the Graph page; remember to tick Use Deduplication so results from the replicas are deduplicated.

Thanos Store

Next, deploy the Thanos Store component. It works together with the Querier to retrieve historical metric data from the configured object storage bucket, so it naturally needs the object storage configuration as well. Once deployed, the Store also has to be registered with the Querier:

# cat store.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store
  namespace: monitoring
  labels:
    app: thanos-store
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store
  serviceName: thanos-store
  template:
    metadata:
      labels:
        app: thanos-store
        thanos-store-api: "true"
    spec:
      containers:
      - name: thanos
        image: thanosio/thanos:v0.26.0
        args:
        - "store"
        - "--log.level=debug"
        - "--data-dir=/data"
        - "--objstore.config-file=/etc/secret/thanos.yaml"
        - "--index-cache-size=500MB"
        - "--chunk-pool-size=500MB"
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        livenessProbe:
          httpGet:
            port: 10902
            path: /-/healthy
          initialDelaySeconds: 10
        readinessProbe:
          httpGet:
            port: 10902
            path: /-/ready
          initialDelaySeconds: 15
        volumeMounts:
        - name: object-storage-config
          mountPath: /etc/secret
          readOnly: false
      volumes:
      - name: object-storage-config
        secret:
          secretName: thanos-objectstorage
---
apiVersion: v1
kind: Service
metadata:
  name: thanos-store
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  selector:
    app: thanos-store

Deploy the resources above:

$ kubectl apply -f thanos-store.yaml
statefulset.apps/thanos-store created
service/thanos-store created

$ kubectl get pods -n monitoring -l app=thanos-store
NAME             READY   STATUS    RESTARTS   AGE
thanos-store-0   1/1     Running   0          106s

After deploying the Store, the Querier still has to be told how to discover it, so add another --store flag to the Querier:

      containers:
      - name: thanos
        image: thanosio/thanos:v0.18.0
        args:
        - query
        - --log.level=debug
        - --query.replica-label=prometheus_replica
        # Discover local store APIs using DNS SRV.
        - --store=dnssrv+prometheus-operated:10901
        - --store=dnssrv+thanos-store:10901

After updating, revisit the Querier page: the list of discovered stores should now include an additional Thanos Store entry.

Thanos Compactor

The Thanos Compactor component downsamples the collected historical data, which reduces file sizes. Its deployment is much like the previous components; the main point is wiring it up to the object storage.

# cat compactor.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-compactor
  namespace: monitoring
  labels:
    app: thanos-compactor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-compactor
  serviceName: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
      - name: thanos
        image: thanosio/thanos:v0.26.0
        args:
        - "compact"
        - "--log.level=debug"
        - "--data-dir=/data"
        - "--objstore.config-file=/etc/secret/thanos.yaml"
        - "--wait"
        ports:
        - name: http
          containerPort: 10902
        livenessProbe:
          httpGet:
            port: 10902
            path: /-/healthy
          initialDelaySeconds: 10
        readinessProbe:
          httpGet:
            port: 10902
            path: /-/ready
          initialDelaySeconds: 15
        volumeMounts:
        - name: object-storage-config
          mountPath: /etc/secret
          readOnly: false
      volumes:
      - name: object-storage-config
        secret:
          secretName: thanos-objectstorage

Again, simply create the resource object above:

# kubectl apply -f thanos-compactor.yaml

Finally, if you want to configure alerting rules through the Thanos Ruler component, you can use the ThanosRuler CRD provided by the Prometheus Operator. Configuring alert rules directly on the individual prometheus instances is still recommended, though: the call chain is shorter and problems are easier to troubleshoot. The Thanos Ruler component evaluates recording and alerting rules across multiple prometheus instances; a ThanosRuler instance needs at least one queryEndpoint pointing at a Thanos Querier or a prometheus instance, as shown below:

# ThanosRuler Demo
apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: thanos-ruler-demo
  labels:
    example: thanos-ruler
  namespace: monitoring
spec:
  image: thanosio/thanos
  ruleSelector:
    matchLabels: # select the matching Rule objects
      role: my-thanos-rules
  queryEndpoints: # querier address
  - dnssrv+_http._tcp.my-thanos-querier.monitoring.svc.cluster.local

The recording and alerting rules used by the ThanosRuler component are the same PrometheusRule objects configured for Prometheus; in the example above, PrometheusRule objects carrying the role=my-thanos-rules label are loaded into the Thanos Ruler Pod.
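For illustration, a minimal PrometheusRule that the ruleSelector above would pick up could look like this (the rule name and expression are made up):

kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thanos-ruler-demo-rules
  namespace: monitoring
  labels:
    role: my-thanos-rules   # matches the ruleSelector of the ThanosRuler above
spec:
  groups:
  - name: demo
    rules:
    - alert: DemoAlwaysFiring   # made-up rule, for illustration only
      expr: vector(1)
      for: 1m
      labels:
        severity: info
EOF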

Finally, after wiring Thanos in through the Prometheus Operator, all the resource objects look like this:

# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 2 20h
alertmanager-main-1 2/2 Running 2 20h
alertmanager-main-2 2/2 Running 2 20h
blackbox-exporter-5cb5d7479d-nb9td 3/3 Running 3 2d
grafana-6fc6fff957-jdjn7 1/1 Running 1 19h
kube-state-metrics-d64589d79-8gs9b 3/3 Running 3 2d
node-exporter-fvnbm 2/2 Running 2 2d
node-exporter-jlqmc 2/2 Running 2 2d
node-exporter-m76cj 2/2 Running 2 2d
prometheus-adapter-785b59bccc-jrpjj 1/1 Running 2 2d
prometheus-adapter-785b59bccc-zqlkx 1/1 Running 2 2d
prometheus-k8s-0 3/3 Running 0 92m
prometheus-k8s-1 3/3 Running 0 92m
prometheus-operator-d8c5b745d-l2trp 2/2 Running 2 2d
thanos-compactor-0 1/1 Running 0 44m
thanos-querier-557c7ff9dd-j2r7q 1/1 Running 0 46m
thanos-store-0 1/1 Running 0 47m

The uploaded historical data should also show up in the minio object storage.


