Grafana 系列文章（十四）：Helm 安装Loki

前言

写或者翻译这么多篇 Loki 相关的文章了, 发现还没写怎么安装

现在开始介绍如何使用 Helm 安装 Loki.

前提

有 Helm, 并且添加 Grafana 的官方源:

helm repo add grafana https://grafana.github.io/helm-charts

helm repo update

Warning:

网络受限, 需要保证网络通畅.

部署

架构

Promtail(收集) + Loki(存储及处理) + Grafana(展示)

Promtail

启用 Prometheus Operator Service Monitor 做监控
增加external_labels - cluster, 以识别是哪个 K8S 集群;
pipeline_stages 改为 cri, 以对 cri 日志做处理(因为我的集群用的 Container Runtime 是 CRI, 而 Loki Helm 默认配置是 docker)
增加对 systemd-journal 的日志收集:

promtail:

  config:

    snippets:

      pipelineStages:

        - cri: {}

  extraArgs:

    - -client.external-labels=cluster=ctyun

  # systemd-journal 额外配置:

  # Add additional scrape config

  extraScrapeConfigs:

    - job_name: journal

      journal:

        path: /var/log/journal

        max_age: 12h

        labels:

          job: systemd-journal

      relabel_configs:

        - source_labels: ['__journal__systemd_unit']

          target_label: 'unit'

        - source_labels: ['__journal__hostname']

          target_label: 'hostname'

  # Mount journal directory into Promtail pods

  extraVolumes:

    - name: journal

      hostPath:

        path: /var/log/journal

  extraVolumeMounts:

    - name: journal

      mountPath: /var/log/journal

      readOnly: true

Loki

启用持久化存储
启用 Prometheus Operator Service Monitor 做监控
1. 并配置 Loki 相关 Prometheus Rule 做告警
因为个人集群日志量较小, 适当调大 ingester 相关配置

Grafana

启用持久化存储
启用 Prometheus Operator Service Monitor 做监控
sidecar 都配置上, 方便动态更新 dashboards/datasources/plugins/notifiers;

Helm 安装

通过如下命令安装:

helm upgrade --install loki --namespace=loki --create-namespace grafana/loki-stack -f values.yaml

自定义 values.yaml 如下:

loki:

  enabled: true

  persistence:

    enabled: true

    storageClassName: local-path

    size: 20Gi

  serviceScheme: https

  user: admin

  password: changit!

  config:

    ingester:

      chunk_idle_period: 1h

      max_chunk_age: 4h

    compactor:

      retention_enabled: true

  serviceMonitor:

    enabled: true

    prometheusRule:

      enabled: true

      rules:

        #  Some examples from https://awesome-prometheus-alerts.grep.to/rules.html#loki

        - alert: LokiProcessTooManyRestarts

          expr: changes(process_start_time_seconds{job=~"loki"}[15m]) > 2

          for: 0m

          labels:

            severity: warning

          annotations:

            summary: Loki process too many restarts (instance {{ $labels.instance }})

            description: "A loki process had too many restarts (target {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

        - alert: LokiRequestErrors

          expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10

          for: 15m

          labels:

            severity: critical

          annotations:

            summary: Loki request errors (instance {{ $labels.instance }})

            description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing errors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

        - alert: LokiRequestPanic

          expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0

          for: 5m

          labels:

            severity: critical

          annotations:

            summary: Loki request panic (instance {{ $labels.instance }})

            description: "The {{ $labels.job }} is experiencing {{ printf \"%.2f\" $value }}% increase of panics\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

        - alert: LokiRequestLatency

          expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le)))  > 1

          for: 5m

          labels:

            severity: critical

          annotations:

            summary: Loki request latency (instance {{ $labels.instance }})

            description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

promtail:

  enabled: true

  config:

    snippets:

      pipelineStages:

        - cri: {}

  extraArgs:

    - -client.external-labels=cluster=ctyun

  serviceMonitor:

    # -- If enabled, ServiceMonitor resources for Prometheus Operator are created

    enabled: true

  # systemd-journal 额外配置:

  # Add additional scrape config

  extraScrapeConfigs:

    - job_name: journal

      journal:

        path: /var/log/journal

        max_age: 12h

        labels:

          job: systemd-journal

      relabel_configs:

        - source_labels: ['__journal__systemd_unit']

          target_label: 'unit'

        - source_labels: ['__journal__hostname']

          target_label: 'hostname'

  # Mount journal directory into Promtail pods

  extraVolumes:

    - name: journal

      hostPath:

        path: /var/log/journal

  extraVolumeMounts:

    - name: journal

      mountPath: /var/log/journal

      readOnly: true

fluent-bit:

  enabled: false

grafana:

  enabled: true

  adminUser: caseycui

  adminPassword: changit!

  ## Sidecars that collect the configmaps with specified label and stores the included files them into the respective folders

  ## Requires at least Grafana 5 to work and can't be used together with parameters dashboardProviders, datasources and dashboards

  sidecar:

    image:

      repository: quay.io/kiwigrid/k8s-sidecar

      tag: 1.15.6

      sha: ''

    dashboards:

      enabled: true

      SCProvider: true

      label: grafana_dashboard

    datasources:

      enabled: true

      # label that the configmaps with datasources are marked with

      label: grafana_datasource

    plugins:

      enabled: true

      # label that the configmaps with plugins are marked with

      label: grafana_plugin

    notifiers:

      enabled: true

      # label that the configmaps with notifiers are marked with

      label: grafana_notifier

  image:

    tag: 8.3.5

  persistence:

    enabled: true

    size: 2Gi

    storageClassName: local-path

  serviceMonitor:

    enabled: true

  imageRenderer:

    enabled: disable

filebeat:

  enabled: false

logstash:

  enabled: false

安装后的资源拓扑如下:

Day 2 配置(按需)

Grafana 增加 Dashboards

在同一个 NS 下, 创建如下 ConfigMap: (只要打上grafana_dashboard 这个 label 就会被 Grafana 的 sidecar 自动导入)

apiVersion: v1

kind: ConfigMap

metadata:

  name: sample-grafana-dashboard

  labels:

     grafana_dashboard: "1"

data:

  k8s-dashboard.json: |-

  [...]

Grafana 增加 DataSource

在同一个 NS 下, 创建如下 ConfigMap: (只要打上grafana_datasource 这个 label 就会被 Grafana 的 sidecar 自动导入)

apiVersion: v1

kind: ConfigMap

metadata:

  name: loki-loki-stack

  labels:

    grafana_datasource: '1'

data:

  loki-stack-datasource.yaml: |-

    apiVersion: 1

    datasources:

    - name: Loki

      type: loki

      access: proxy

      url: http://loki:3100

      version: 1

Traefik 配置 Grafana IngressRoute

因为我是用的 Traefik 2, 通过 CRD IngressRoute 配置 Ingress, 配置如下:

apiVersion: traefik.containo.us/v1alpha1

kind: IngressRoute

metadata:

  name: grafana

spec:

  entryPoints:

    - web

    - websecure

  routes:

    - kind: Rule

      match: Host(`grafana.ewhisper.cn`)

      middlewares:

        - name: hsts-header

          namespace: kube-system

        - name: redirectshttps

          namespace: kube-system

      services:

        - name: loki-grafana

          namespace: monitoring

          port: 80

  tls: {}

最终效果

如下:

️参考文档

helm-charts/charts at main · grafana/helm-charts (github.com)

Grafana 系列文章

三人行, 必有我师; 知识共享, 天下为公. 本文由东风微鸣技术博客 EWhisper.cn 编写.

巴特西