Monitor Tanzu Kubernetes Cluster with Prometheus & Grafana

Introduction

Monitoring is a critical part of any infrastructure. Day-2 operations depend heavily on monitoring, alerting, and logging. Containerized applications are now part of almost every environment, and monitoring a Kubernetes cluster eases the management of containerized infrastructure by tracking the utilization of cluster resources.

As a Kubernetes operator, you want to receive alerts if the desired number of pods is not running, if resource utilization is approaching critical limits, or when failures or misconfigurations leave pods or nodes unable to participate in the cluster.
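The three alert conditions above can be sketched as a small check. This is only an illustration: the ClusterSnapshot type, its field names, and the 90% threshold are assumptions made for the example, not part of Kubernetes, Prometheus, or Tanzu.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSnapshot:
    """Hypothetical point-in-time view of one workload and its nodes."""
    desired_pods: int            # replicas requested for the workload
    running_pods: int            # replicas currently running
    utilization_pct: float       # highest resource utilization observed, 0-100
    not_ready_nodes: List[str]   # nodes that cannot take part in the cluster


def evaluate_alerts(snapshot: ClusterSnapshot, critical_pct: float = 90.0) -> List[str]:
    """Return one message per alert condition that currently holds."""
    alerts = []
    # Condition 1: fewer pods running than desired.
    if snapshot.running_pods < snapshot.desired_pods:
        alerts.append(f"pods: {snapshot.running_pods}/{snapshot.desired_pods} running")
    # Condition 2: utilization at or above the (assumed) critical limit.
    if snapshot.utilization_pct >= critical_pct:
        alerts.append(
            f"utilization: {snapshot.utilization_pct:.0f}% >= {critical_pct:.0f}%"
        )
    # Condition 3: nodes that are unable to participate in the cluster.
    for node in snapshot.not_ready_nodes:
        alerts.append(f"node: {node} is not ready")
    return alerts
```

In a real setup these conditions would typically live in Prometheus alerting rules rather than in application code; the sketch only makes the logic explicit.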

Why is Kubernetes monitoring a challenge?

Kubernetes abstracts away a lot of complexity to speed up application deployment, but in the process it leaves you blind as to what is actually happening behind the scenes, what resources are being utilized, and even the cost implications of the actions being taken. A Kubernetes environment typically has more components than traditional infrastructure, which makes root cause analysis more difficult when things go wrong. The underlying components generate millions of metrics per day, and not every metric is important.

The dynamic nature of Kubernetes allows you to scale deployments up and down on the fly. When applications are scaled down, the underlying pods are deleted. When applications are scaled up or new deployments are scheduled, existing pods may be evicted and rescheduled to free up resources on a given node. As a result, pods are recreated with a different name and in a different place. The monitoring solution should be intelligent enough to pick up these changes and shouldn’t send alerts for these types of events.
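One way to tolerate this kind of churn is to evaluate pods per workload instead of per pod name. The sketch below assumes each pod is reduced to a (name, workload, ready) tuple; that shape, the function names, and the sample pod names are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pod = Tuple[str, str, bool]  # (pod name, workload label, ready?)


def ready_per_workload(pods: Iterable[Pod]) -> Counter:
    """Count ready pods per workload, ignoring the pod names themselves."""
    counts = Counter()
    for _name, workload, ready in pods:
        if ready:
            counts[workload] += 1
    return counts


def churn_tolerant_alerts(pods: Iterable[Pod], desired: Dict[str, int]) -> List[str]:
    """Alert only when a workload has fewer ready pods than desired."""
    ready = ready_per_workload(pods)
    return [
        f"{workload}: {ready[workload]}/{want} ready"
        for workload, want in sorted(desired.items())
        if ready[workload] < want
    ]


# A pod that is recreated under a new name does not trigger an alert,
# because the count of ready pods for the "web" workload is unchanged.
before = [("web-abcde", "web", True), ("web-fghij", "web", True)]
after = [("web-abcde", "web", True), ("web-zzzzz", "web", True)]
print(churn_tolerant_alerts(before, {"web": 2}))
print(churn_tolerant_alerts(after, {"web": 2}))
```

Both calls print an empty list, whereas a check keyed on pod names would have flagged web-fghij as missing.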

Which Kubernetes metrics should you monitor?

When setting up monitoring for Kubernetes, it’s very important to know what to monitor. Some key metrics to consider when setting up monitoring are:

  • Cluster state including the health and availability of pods.
  • Node status, including readiness, memory, disk or processor overload, and network availability.
  • Pod availability.
  • Memory utilization at the pod and node level.
  • Disk utilization, including lack of space for the file system and inodes (index nodes).
  • CPU utilization in relation to the amount of CPU resource allocated to the pod.
  • API request latency measured in milliseconds.
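As an illustration of the CPU item in the list above, utilization can be expressed as a percentage of the CPU allocated to the pod. The sketch assumes CPU quantities are given either as whole or fractional cores ("2", "0.5") or as millicores ("500m"); other suffixes are not handled, and the function names are hypothetical.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "500m" or "2" to a number of cores."""
    if quantity.endswith("m"):
        # Millicores: 1000m equals one core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


def cpu_utilization_pct(used: str, allocated: str) -> float:
    """CPU used as a percentage of the CPU allocated to the pod."""
    allocated_cores = parse_cpu(allocated)
    if allocated_cores <= 0:
        raise ValueError("allocated CPU must be positive")
    return 100 * parse_cpu(used) / allocated_cores
```

For example, a pod using 250m of a 500m allocation is at 50% utilization, even though it uses only a quarter of one core.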

Monitoring Tanzu Kubernetes Grid Instances

Monitoring for Tanzu Kubernetes Grid provisioned clusters is implemented using the open-source projects Prometheus and Grafana. Both Prometheus and Grafana are bundled with Tanzu Kubernetes Grid Extensions and installed on top of the Tanzu Kubernetes Cluster. The TKG Extension binaries are built and signed by VMware.

This article focuses on setting up the Prometheus and Grafana extensions on a Tanzu Kubernetes Cluster.

Installing TKG Extensions

Before you deploy the TKG Extensions, you should meet the following prerequisites:

  • Tanzu Kubernetes Cluster (workload cluster) is deployed.
  • Carvel tools are installed on the machine that you use to manage your Tanzu Kubernetes Clusters.
  • Contour Extension is installed on the cluster where you are planning to install Grafana & Prometheus. Instructions for installing Contour are documented here
  • Tanzu Kubernetes Grid Extension bundle is uploaded to the machine from which the installation will be triggered.

Install Cert Manager on Workload Clusters

Before you can deploy the TKG extensions, you must install cert-manager, which provides automated certificate management, on workload clusters. The cert-manager service runs by default in management clusters when the cluster is provisioned.

Extract the TKG Extensions bundle using tar or a similar utility and run the below commands to install cert-manager.

Validate that the cert-manager pods are deployed correctly and are in a running state.
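The validation can also be scripted. The sketch below assumes you have captured the JSON output of kubectl get pods -n cert-manager -o json (the cert-manager namespace is an assumption; adjust it if your bundle deploys elsewhere) and lists every pod whose phase is not Running. The function name is hypothetical.

```python
import json
from typing import List


def pods_not_running(pod_list_json: str) -> List[str]:
    """Return the names of pods whose phase is not "Running"."""
    pod_list = json.loads(pod_list_json)
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") != "Running"
    ]


# Illustrative input only; the pod names are made up.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "cert-manager-aaa"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "cert-manager-webhook-bbb"}, "status": {"phase": "Pending"}},
    ]
})
print(pods_not_running(sample))
```

An empty result means every pod reports the Running phase. Note that a running pod is not necessarily ready, so checking the READY column in the kubectl output is still worthwhile.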

Install Prometheus Extension

2.1: Create Prometheus namespace

The kubectl apply command creates the tanzu-system-monitoring namespace along with a service account for the prometheus-extension and necessary role bindings.

2.2: Prepare the Prometheus YAML for deployment

The supported configuration parameters for the Prometheus YAML are documented here.

A sample prometheus-data-values.yaml is shown below.



2.3: Create Prometheus secret
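The secret carries the contents of the data values file to the extension. As a rough sketch of what such a secret looks like, the code below builds an equivalent Secret manifest by hand. The secret name prometheus-data-values and the key values.yaml are assumptions based on the usual TKG extension layout, and the file contents are a placeholder; in practice the secret is normally created with kubectl create secret generic and the --from-file option.

```python
import base64
import json


def data_values_secret(name: str, namespace: str, values_yaml: str) -> dict:
    """Build a Secret manifest that stores values_yaml under the key "values.yaml"."""
    # Secret data must be base64-encoded.
    encoded = base64.b64encode(values_yaml.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {"values.yaml": encoded},
    }


manifest = data_values_secret(
    "prometheus-data-values",        # assumed secret name
    "tanzu-system-monitoring",       # namespace created in step 2.1
    "# placeholder for the contents of prometheus-data-values.yaml\n",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON is a valid manifest that kubectl apply -f accepts, which can be handy when the secret is generated by automation instead of by hand.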

2.4: Deploy Prometheus extension

2.5: Retrieve the status of Prometheus extension

The Prometheus app status should change to ‘Reconcile Succeeded’ once Prometheus is deployed successfully.
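If you want to check the status from a script, one option is to inspect the conditions of the App resource. The sketch assumes the extension is managed by kapp-controller, that the JSON comes from something like kubectl get app prometheus -n tanzu-system-monitoring -o json (the app name is an assumption), and that a successful deployment is reported as a condition of type ReconcileSucceeded with status "True".

```python
import json


def reconcile_succeeded(app_json: str) -> bool:
    """Return True if the App resource reports a successful reconcile."""
    status = json.loads(app_json).get("status", {})
    for condition in status.get("conditions", []):
        if condition.get("type") == "ReconcileSucceeded":
            return condition.get("status") == "True"
    # No matching condition yet, for example while still reconciling.
    return False


# Illustrative input only, trimmed to the fields the check reads.
sample_app = json.dumps({
    "status": {
        "conditions": [{"type": "ReconcileSucceeded", "status": "True"}],
        "friendlyDescription": "Reconcile succeeded",
    }
})
print(reconcile_succeeded(sample_app))
```

The same check applies to the Grafana extension by pointing it at the Grafana app resource.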

Install Grafana Extension

3.1: Create Grafana namespace

The kubectl apply command creates the service account for grafana-extension and necessary role bindings.

3.2: Prepare the Grafana YAML for deployment

The supported configuration parameters for the Grafana YAML are documented here.

A sample grafana-data-values.yaml file is shown below.

3.3: Create Grafana secret

3.4: Deploy Grafana extension

3.5: Retrieve the status of Grafana extension

The Grafana app status should change to ‘Reconcile Succeeded’ once Grafana is deployed successfully.

Accessing the Prometheus & Grafana Dashboards

To access the dashboards, you need the external IP address of the envoy service that is created when you deploy the Contour extension on the Tanzu Kubernetes Cluster.

Also, confirm the FQDNs of the Grafana & Prometheus extensions.

In your DNS server, create A records that map these two hostnames to the envoy external IP.
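The mapping amounts to one A record per hostname, all pointing at the same envoy IP. The sketch below generates zone-file style records; the IP address, hostnames, and TTL are placeholders from documentation ranges, and the exact record syntax depends on your DNS server.

```python
import ipaddress
from typing import Iterable, List


def a_records(envoy_ip: str, hostnames: Iterable[str], ttl: int = 300) -> List[str]:
    """Return zone-file style A records pointing each hostname at envoy_ip."""
    # Raises ValueError if envoy_ip is not a valid IPv4 address.
    address = ipaddress.IPv4Address(envoy_ip)
    return [f"{name.rstrip('.')}. {ttl} IN A {address}" for name in hostnames]


# Placeholder values; substitute the envoy external IP and your extension FQDNs.
for record in a_records(
    "203.0.113.10",
    ["prometheus.example.com", "grafana.example.com"],
):
    print(record)
```

Both hostnames resolve to the same address because envoy routes each request to the right backend based on the requested hostname.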

To access the Prometheus dashboard, enter the URL https://<prometheus-app-fqdn>/

To access the Grafana dashboard, enter the URL http://<grafana-app-fqdn>/

You can now customize the dashboards to start monitoring your TKG instances. 

 I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.
