Introduction
Monitoring is one of the most important parts of any infrastructure, and day-2 operations depend heavily on monitoring, alerting, and logging. Containerized applications are now part of almost every environment, and monitoring a Kubernetes cluster eases the management of containerized infrastructure by tracking the utilization of cluster resources.
As a Kubernetes operator, you want to receive alerts if the desired number of pods is not running, if resource utilization is approaching critical limits, or when failures or misconfigurations cause pods or nodes to become unable to participate in the cluster.
Why is Kubernetes monitoring a challenge?
Kubernetes abstracts away a lot of complexity to speed up application deployment, but in the process it leaves you blind to what is actually happening behind the scenes, which resources are being utilized, and even the cost implications of the actions being taken. In a Kubernetes world, the number of components is typically higher than in traditional infrastructure, which makes root cause analysis more difficult when things go wrong. The underlying components generate millions of metrics per day, and not every metric is important.
The dynamic nature of Kubernetes allows you to scale deployments up and down on the fly. When applications are scaled down, the underlying pods are deleted. When scaling up applications or scheduling new deployments, the kube-scheduler may move pods in order to free up resources on a given node. As a result, pods are recreated with a different name and in a different place. The monitoring solution should be intelligent enough to pick up these changes and should not send alerts for these types of events.
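As an illustration of how this is typically handled (this snippet is not part of the TKG extension configuration shown later in this article), a Prometheus scrape job that uses Kubernetes service discovery follows pods automatically as they are rescheduled or renamed:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods that opt in via the prometheus.io/scrape annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Record the current pod name as a label; targets are rediscovered
      # whenever pods are recreated, so no manual reconfiguration is needed.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

Because targets are discovered from the Kubernetes API rather than listed statically, pod churn does not require any change to the monitoring configuration.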
Which Kubernetes metrics should you monitor?
When setting up monitoring for Kubernetes, it’s very important to know what to monitor. Some key metrics to consider are:
- Cluster state including the health and availability of pods.
- Node status, including readiness, memory, disk or processor overload, and network availability.
- Pod availability.
- Memory utilization at the pod and node level.
- Disk utilization, including lack of space for the filesystem and inodes.
- CPU utilization in relation to the amount of CPU resource allocated to the pod.
- API request latency measured in milliseconds.
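To make this concrete, the PromQL expressions below show one way to query a few of these signals. They are illustrative only and assume that kube-state-metrics, node-exporter, and cAdvisor metrics are being scraped; metric and label names can vary between versions of those exporters.

# Pods that are not in the Running phase, per namespace
sum by (namespace) (kube_pod_status_phase{phase!="Running"})

# Node memory utilization as a fraction of total memory
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Pod CPU usage relative to the CPU limit allocated to the pod
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  / sum by (pod) (kube_pod_container_resource_limits{resource="cpu"})

# 99th percentile API server request latency (in seconds)
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))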
Monitoring Tanzu Kubernetes Grid Instances
Monitoring for Tanzu Kubernetes Grid provisioned clusters is implemented using the open-source projects Prometheus and Grafana. Both Prometheus and Grafana are bundled with Tanzu Kubernetes Grid Extensions and installed on top of the Tanzu Kubernetes Cluster. The TKG Extension binaries are built and signed by VMware.
This article is focused on setting up Prometheus and Grafana Extensions on a Tanzu Kubernetes Cluster.
Installing TKG Extensions
Before you deploy the TKG Extensions, you should meet the following prerequisites:
- Tanzu Kubernetes Cluster (workload cluster) is deployed.
- Carvel tools are installed on the machine that you are using to manage your Tanzu Kubernetes Clusters.
- The Contour extension is installed on the cluster where you are planning to deploy Grafana and Prometheus. Instructions for installing Contour are documented here.
- The Tanzu Kubernetes Grid Extension bundle is available on the machine from which the installation will be triggered.
Install Cert Manager on Workload Clusters
Before you can deploy the TKG extensions, you must install cert-manager, which provides automated certificate management, on workload clusters. The cert-manager service runs by default in management clusters when the cluster is provisioned.
Extract the TKG Extensions bundle using tar or a similar utility and run the commands below to install cert-manager.
# tar -xzf tkg-extensions-manifests-v1.3.1-vmware.1.tar.gz
# cd tkg-extensions-v1.3.1
# kubectl apply -f cert-manager/
Validate that the cert manager pods are deployed correctly and are in a running state.
# kubectl get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-7c58cb795-jw7mk               1/1     Running   0          2m38s
cert-manager-cainjector-765684c9d6-qgcw9   1/1     Running   0          2m38s
cert-manager-webhook-ccc946479-gnbvh       1/1     Running   0          2m37s
Install Prometheus Extension
2.1: Create Prometheus namespace
# cd ~/tkg-extensions-v1.3.1/extensions/monitoring/prometheus
# kubectl apply -f namespace-role.yaml
The kubectl apply command creates the tanzu-system-monitoring namespace along with a service account for the prometheus-extension and necessary role bindings.
namespace/tanzu-system-monitoring created
serviceaccount/prometheus-extension-sa created
role.rbac.authorization.k8s.io/prometheus-extension-role created
rolebinding.rbac.authorization.k8s.io/prometheus-extension-rolebinding created
clusterrole.rbac.authorization.k8s.io/prometheus-extension-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-extension-cluster-rolebinding created
2.2: Prepare the Prometheus yaml for deployment.
# cp prometheus-data-values.yaml.example prometheus-data-values.yaml
The supported configuration parameters for the Prometheus yaml are documented here.
A sample prometheus-data-values.yaml is shown below.
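If you simply want to expose Prometheus through Contour on the FQDN used later in this article, a minimal configuration along the following lines should be enough; the parameter names follow the documented monitoring.ingress.* schema and everything else is left at its defaults:

---
monitoring:
  ingress:
    enabled: true
    virtual_host_fqdn: "prometheus.tanzu.lab"

Storage size, retention, and image repository locations can also be overridden in this file; refer to the parameter documentation referenced above for the full schema.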
2.3: Create Prometheus secret
# kubectl create secret generic prometheus-secret --from-file=values.yaml=prometheus-data-values.yaml -n tanzu-system-monitoring
2.4: Deploy Prometheus extension
# kubectl apply -f prometheus-extension.yaml
2.5: Retrieve the status of Prometheus extension
# kubectl get app prometheus -n tanzu-system-monitoring
The Prometheus app status should change to ‘Reconcile succeeded’ once Prometheus is deployed successfully.
NAME         DESCRIPTION           SINCE-DEPLOY   AGE
prometheus   Reconcile succeeded   48s            2d6h
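If the app does not reach the ‘Reconcile succeeded’ state, the kapp-controller App resource usually contains the reason; describing it is a quick way to see the error (the same check works for the Grafana app installed in the next section):

# kubectl describe app prometheus -n tanzu-system-monitoring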
Install Grafana Extension
3.1: Create Grafana namespace
# cd ~/tkg-extensions-v1.3.1/extensions/monitoring/grafana/
# kubectl apply -f namespace-role.yaml
The kubectl apply command creates the service account for the grafana-extension and the necessary role bindings; the tanzu-system-monitoring namespace already exists from the Prometheus installation.
namespace/tanzu-system-monitoring unchanged
serviceaccount/grafana-extension-sa created
role.rbac.authorization.k8s.io/grafana-extension-role created
rolebinding.rbac.authorization.k8s.io/grafana-extension-rolebinding created
clusterrole.rbac.authorization.k8s.io/grafana-extension-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/grafana-extension-cluster-rolebinding created
3.2: Prepare the Grafana yaml for deployment.
# cp grafana-data-values.yaml.example grafana-data-values.yaml
The supported configuration parameters for the Grafana yaml are documented here.
A sample grafana-data-values.yaml file is shown below.
---
monitoring:
  grafana:
    ingress:
      enabled: true
      virtual_host_fqdn: "grafana.tanzu.lab"
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
    secret:
      admin_password: "Vk13YXJlMSE="
  grafana_init_container:
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
  grafana_sc_dashboard:
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
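Note that admin_password must be the base64-encoded form of the Grafana admin password you want to set, not the plain-text value. You can generate it with the standard base64 utility (the password below is just a placeholder):

# echo -n 'YourPassword' | base64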
3.3: Create Grafana secret
# kubectl create secret generic grafana-data-values --from-file=values.yaml=grafana-data-values.yaml -n tanzu-system-monitoring
3.4: Deploy Grafana extension
# kubectl apply -f grafana-extension.yaml
3.5: Retrieve the status of Grafana extension
# kubectl get app grafana -n tanzu-system-monitoring
The Grafana app status should change to ‘Reconcile succeeded’ once Grafana is deployed successfully.
NAME      DESCRIPTION           SINCE-DEPLOY   AGE
grafana   Reconcile succeeded   6m59s          10m
Accessing the Prometheus & Grafana Dashboards
To access the dashboards, first find the external IP address of the Envoy service that is created when you deploy the Contour extension on the Tanzu Kubernetes Cluster.
# kubectl get services -A | grep envoy
tanzu-system-ingress   envoy   LoadBalancer   100.69.174.170   172.19.80.55   80:31384/TCP,443:31510/TCP   2d8h
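If you prefer to extract just the external IP address, a jsonpath query against the Envoy service returns it directly:

# kubectl get service envoy -n tanzu-system-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'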
Also, confirm the FQDNs of the Grafana and Prometheus extensions.
# kubectl get proxy -A
NAMESPACE                 NAME                   FQDN                   TLS SECRET       STATUS   STATUS DESCRIPTION
tanzu-system-monitoring   grafana-httpproxy      grafana.tanzu.lab      grafana-tls      valid    Valid HTTPProxy
tanzu-system-monitoring   prometheus-httpproxy   prometheus.tanzu.lab   prometheus-tls   valid    Valid HTTPProxy
In your DNS server, create A records mapping these two hostnames to the Envoy external IP address.
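If you don’t have access to a DNS server in your lab, adding an entry to the /etc/hosts file of the machine you browse from achieves the same result for testing (the IP and hostnames below are the ones from this example environment):

172.19.80.55    prometheus.tanzu.lab    grafana.tanzu.lab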
To access the Prometheus dashboard, enter the URL https://<prometheus-app-fqdn>/
To access the Grafana dashboard, enter the URL https://<grafana-app-fqdn>/
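Before opening a browser, you can quickly confirm that Prometheus is reachable through the ingress by calling its built-in health endpoint; the -k flag skips certificate verification in case the certificate is not trusted by your workstation:

# curl -k https://prometheus.tanzu.lab/-/healthy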
You can now customize the dashboards to start monitoring your TKG instances.
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.