Introduction
Monitoring is one of the most important parts of any infrastructure, and day-2 operations depend heavily on monitoring, alerting, and logging. Containerized applications are now part of almost every environment, and monitoring a Kubernetes cluster eases the management of containerized infrastructure by tracking the utilization of cluster resources.
As a Kubernetes operator, you want to receive alerts if the desired number of pods is not running, if resource utilization is approaching critical limits, or when failures or misconfigurations cause pods or nodes to become unable to participate in the cluster.
Why is Kubernetes monitoring a challenge?
Kubernetes abstracts away a lot of complexity to speed up application deployment, but in the process it leaves you blind to what is actually happening behind the scenes, which resources are being utilized, and even the cost implications of the actions being taken. In a Kubernetes world, the number of components is typically higher than in traditional infrastructure, which makes root cause analysis more difficult when things go wrong. The underlying components generate millions of metrics per day, and not every metric is important.
The dynamic nature of Kubernetes allows you to scale deployments up and down on the fly. When applications are scaled down, the underlying pods are deleted. When scaling up applications or scheduling new deployments, the kube-scheduler may move pods in order to free up resources on a given node. As a result, pods are recreated with a different name and in a different place. The monitoring solution should be intelligent enough to pick up these changes and should not send alerts for these types of events.
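As an illustration of how this is typically handled (this snippet is not part of the TKG extension configuration shown later in this article), a Prometheus scrape job that uses Kubernetes service discovery follows pods automatically as they are rescheduled or renamed:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods that opt in via the prometheus.io/scrape annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Record the current pod name as a label; targets are rediscovered
      # whenever pods are recreated, so no manual reconfiguration is needed.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

Because targets are discovered from the Kubernetes API rather than listed statically, pod churn does not require any change to the monitoring configuration.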
Which Kubernetes metrics should you monitor?
When setting up monitoring for Kubernetes, it’s very important to know what to monitor. Some key metrics to consider are:
- Cluster state including the health and availability of pods.
- Node status, including readiness, memory, disk or processor overload, and network availability.
- Pod availability.
- Memory utilization at the pod and node level.
- Disk utilization, including lack of space for the filesystem and inodes.
- CPU utilization in relation to the amount of CPU resource allocated to the pod.
- API request latency measured in milliseconds.
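To make this concrete, the PromQL expressions below show one way to query a few of these signals. They are illustrative only and assume that kube-state-metrics, node-exporter, and cAdvisor metrics are being scraped; metric and label names can vary between versions of those exporters.

# Pods that are not in the Running phase, per namespace
sum by (namespace) (kube_pod_status_phase{phase!="Running"})

# Node memory utilization as a fraction of total memory
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Pod CPU usage relative to the CPU limit allocated to the pod
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  / sum by (pod) (kube_pod_container_resource_limits{resource="cpu"})

# 99th percentile API server request latency (in seconds)
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))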
Monitoring Tanzu Kubernetes Grid Instances
Monitoring for Tanzu Kubernetes Grid provisioned clusters is implemented using the open-source projects Prometheus and Grafana. Both Prometheus and Grafana are bundled with Tanzu Kubernetes Grid Extensions and installed on top of the Tanzu Kubernetes Cluster. The TKG Extension binaries are built and signed by VMware.
This article is focused on setting up Prometheus and Grafana Extensions on a Tanzu Kubernetes Cluster.
Installing TKG Extensions
Before you deploy the TKG Extensions, you should meet the following prerequisites:
- Tanzu Kubernetes Cluster (workload cluster) is deployed.
- Carvel tools are installed on the machine that you are using to manage your Tanzu Kubernetes Clusters.
- The Contour extension is installed on the cluster where you are planning to deploy Grafana and Prometheus. Instructions for installing Contour are documented here.
- The Tanzu Kubernetes Grid Extension bundle is available on the machine from which the installation will be triggered.
Install Cert Manager on Workload Clusters
Before you can deploy the TKG extensions, you must install cert-manager, which provides automated certificate management, on workload clusters. The cert-manager service runs by default in management clusters when the cluster is provisioned.
Extract the TKG Extensions bundle using tar or a similar utility and run the commands below to install cert-manager.
# tar -xzf tkg-extensions-manifests-v1.3.1-vmware.1.tar.gz
# cd tkg-extensions-v1.3.1
# kubectl apply -f cert-manager/
Validate that the cert manager pods are deployed correctly and are in a running state.
# kubectl get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-7c58cb795-jw7mk               1/1     Running   0          2m38s
cert-manager-cainjector-765684c9d6-qgcw9   1/1     Running   0          2m38s
cert-manager-webhook-ccc946479-gnbvh       1/1     Running   0          2m37s
Install Prometheus Extension
2.1: Create Prometheus namespace
# cd ~/tkg-extensions-v1.3.1/extensions/monitoring/prometheus
# kubectl apply -f namespace-role.yaml
The kubectl apply command creates the tanzu-system-monitoring namespace along with a service account for the prometheus-extension and necessary role bindings.
namespace/tanzu-system-monitoring created
serviceaccount/prometheus-extension-sa created
role.rbac.authorization.k8s.io/prometheus-extension-role created
rolebinding.rbac.authorization.k8s.io/prometheus-extension-rolebinding created
clusterrole.rbac.authorization.k8s.io/prometheus-extension-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-extension-cluster-rolebinding created
2.2: Prepare the Prometheus yaml for deployment.
# cp prometheus-data-values.yaml.example prometheus-data-values.yaml
The supported configuration parameters for the Prometheus yaml are documented here.
A sample prometheus-data-values.yaml is shown below.
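If you simply want to expose Prometheus through Contour on the FQDN used later in this article, a minimal configuration along the following lines should be enough; the parameter names follow the documented monitoring.ingress.* schema and everything else is left at its defaults:

---
monitoring:
  ingress:
    enabled: true
    virtual_host_fqdn: "prometheus.tanzu.lab"

Storage size, retention, and image repository locations can also be overridden in this file; refer to the parameter documentation referenced above for the full schema.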
2.3: Create Prometheus secret
# kubectl create secret generic prometheus-secret --from-file=values.yaml=prometheus-data-values.yaml -n tanzu-system-monitoring
2.4: Deploy Prometheus extension
# kubectl apply -f prometheus-extension.yaml
2.5: Retrieve the status of Prometheus extension
# kubectl get app prometheus -n tanzu-system-monitoring
The Prometheus app status should change to ‘Reconcile succeeded’ once Prometheus is deployed successfully.
NAME         DESCRIPTION           SINCE-DEPLOY   AGE
prometheus   Reconcile succeeded   48s            2d6h
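If the app does not reach the ‘Reconcile succeeded’ state, the kapp-controller App resource usually contains the reason; describing it is a quick way to see the error (the same check works for the Grafana app installed in the next section):

# kubectl describe app prometheus -n tanzu-system-monitoring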
Install Grafana Extension
3.1: Create Grafana namespace
# cd ~/tkg-extensions-v1.3.1/extensions/monitoring/grafana/
# kubectl apply -f namespace-role.yaml
The kubectl apply command creates the service account for the grafana-extension and the necessary role bindings; the tanzu-system-monitoring namespace already exists from the Prometheus installation.
namespace/tanzu-system-monitoring unchanged
serviceaccount/grafana-extension-sa created
role.rbac.authorization.k8s.io/grafana-extension-role created
rolebinding.rbac.authorization.k8s.io/grafana-extension-rolebinding created
clusterrole.rbac.authorization.k8s.io/grafana-extension-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/grafana-extension-cluster-rolebinding created
3.2: Prepare the Grafana yaml for deployment.
# cp grafana-data-values.yaml.example grafana-data-values.yaml
The supported configuration parameters for the Grafana yaml are documented here.
A sample grafana-data-values.yaml file is shown below.
---
monitoring:
  grafana:
    ingress:
      enabled: true
      virtual_host_fqdn: "grafana.tanzu.lab"
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
    secret:
      admin_password: "Vk13YXJlMSE="
  grafana_init_container:
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
  grafana_sc_dashboard:
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
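Note that admin_password must be the base64-encoded form of the Grafana admin password you want to set, not the plain-text value. You can generate it with the standard base64 utility (the password below is just a placeholder):

# echo -n 'YourPassword' | base64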
3.3: Create Grafana secret
# kubectl create secret generic grafana-data-values --from-file=values.yaml=grafana-data-values.yaml -n tanzu-system-monitoring
3.4: Deploy Grafana extension
# kubectl apply -f grafana-extension.yaml
3.5: Retrieve the status of Grafana extension
# kubectl get app grafana -n tanzu-system-monitoring
The Grafana app status should change to ‘Reconcile succeeded’ once Grafana is deployed successfully.
NAME      DESCRIPTION           SINCE-DEPLOY   AGE
grafana   Reconcile succeeded   6m59s          10m
Accessing the Prometheus & Grafana Dashboards
To access the dashboards, first find the external IP address of the Envoy service that is created when you deploy the Contour extension on the Tanzu Kubernetes Cluster.
# kubectl get services -A | grep envoy
tanzu-system-ingress   envoy   LoadBalancer   100.69.174.170   172.19.80.55   80:31384/TCP,443:31510/TCP   2d8h
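If you prefer to extract just the external IP address, a jsonpath query against the Envoy service returns it directly:

# kubectl get service envoy -n tanzu-system-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'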
Also, confirm the FQDNs of the Grafana and Prometheus extensions.
# kubectl get proxy -A
NAMESPACE                 NAME                   FQDN                   TLS SECRET       STATUS   STATUS DESCRIPTION
tanzu-system-monitoring   grafana-httpproxy      grafana.tanzu.lab      grafana-tls      valid    Valid HTTPProxy
tanzu-system-monitoring   prometheus-httpproxy   prometheus.tanzu.lab   prometheus-tls   valid    Valid HTTPProxy
In your DNS server, create A records mapping these two hostnames to the Envoy external IP address.
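If you don’t have access to a DNS server in your lab, adding an entry to the /etc/hosts file of the machine you browse from achieves the same result for testing (the IP and hostnames below are the ones from this example environment):

172.19.80.55    prometheus.tanzu.lab    grafana.tanzu.lab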
To access the Prometheus dashboard, enter the URL https://<prometheus-app-fqdn>/
To access the Grafana dashboard, enter the URL https://<grafana-app-fqdn>/
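Before opening a browser, you can quickly confirm that Prometheus is reachable through the ingress by calling its built-in health endpoint; the -k flag skips certificate verification in case the certificate is not trusted by your workstation:

# curl -k https://prometheus.tanzu.lab/-/healthy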
You can now customize the dashboards to start monitoring your TKG instances.
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.