Quick Tip: Disable vSAN Precheck During Workload Domain Upgrade in VCF

Before an upgrade bundle can be applied to a workload domain (Mgmt or VI), the SDDC manager trigger a precheck on the domain to identify and alert if there is an underlying issue, so that the issue can be remediated before applying the upgrade bundle. In lab environments, one of the common precheck failures is regarding the vSAN HCL compatibility. 

In lab environments, you might be running VCF on unsupported hardware that is not present in the vSAN HCL

During upgrade precheck on the workload domain, you will see the vSAN HCL status as Red, and SDDC Manager won’t let you upgrade the domain until the issue is fixed. 

You can force SDDC Manager to ignore the vSAN precheck by adding the following lines in the applications-prod.properties file and modifying the below entries. The file is located in the directory “/opt/vmware/vcf/lcm/lcm-app/conf”

Change the vsan health check related data from true to false. Read More

Monitor Tanzu Kubernetes Cluster with Prometheus & Grafana

Introduction

Monitoring is the most important part of any infrastructure. Day-2 operations are heavily dependent on the monitoring/alerting/logging aspects. Containerized applications are now part of almost every environment and monitoring Kubernetes cluster eases the management of containerized infrastructure by tracking utilization of cluster resources.

As a Kubernetes operator, you would want to receive alerts if the desired number of pods are not running, if the resource utilization is approaching critical limits, or when failures or misconfiguration cause pods or nodes to become unable to participate in the cluster.

Why Kubernetes monitoring is a challenge?

Kubernetes abstracts away a lot of complexity to speed up application deployment; but in the process, it leaves you blind as to what is actually happening behind the scenes, what resources are being utilized, and even the cost implications of the actions being taken. In a Kubernetes world, the number of components is typically more than traditional infrastructure, which makes root cause analysis more difficult when things go wrong.Read More

Centralized Logging For TKG using Fluentbit and vRealize Log Insight

Monitoring is one of the most important aspects of a production deployment. Logs are the savior when things go haywire in the environment, so capturing event logs from the infrastructure pieces is very critical. Day-2 operations become easy if you have comprehensive logging and alerting mechanism in place as it allows for a quick response to failures in infrastructure. 

With the increasing footprint of K8 workloads in the datacenter, centralized monitoring for K8 is a must configure thing. The application developers who are focused on developing and deploying containerized applications are usually not well versed with backend infrastructure.

So if a developer finds any errors in the application logs, they might not find out that the issue is causing because of an infrastructure event in the backend, because centralized logging is not in place and infrastructure logs are stored in a different location than the application logs.

The application and infrastructure logs should be aggregated so that it’s easier to identify the real problem that’s affecting the application. Read More

NSX ALB Upgrade Breaking AKO Integration

Recently I upgraded NSX ALB from 20.1.4 to 20.1.5 in my lab and observed weird things whenever I attempted to deploy/delete any Kubernetes workload of type LoadBalancer.

The Issue

On deploying a new K8 application, AKO was unable to create a load balancer for the application. In NSX ALB UI, I can see that a pool has been created and a VIP assigned but no VS is present. I have also verified that the ‘ako-essential’ role has the necessary permission “PERMISSION_VIRTUALSERIVCE”  to create any new VS.

On attempting to delete a K8 application, the application got deleted from the TKG side, but it left lingering items (VS, Pools, etc) in the ALB UI. To investigate more on the issue, I manually tried deleting the server pool and captured the output using the browser network inspect option. 

As expected the delete operation failed with the error that the object that you are trying to delete is associated with ‘L4PolicySet’

But the l4policyset was empty

Read More

Quick Tip – Restricting SSH Access to NSX ALB Service Engines

By default, the user can connect directly to a Service Engine via SSH using the system’s admin credentials. If there is a security requirement to restrict SSH connection, it is possible to disable this access using the following CLI configuration:

1: Connect to the NSX ALB controller and gain shell access

2: Run the following commands to disable admin SSH access to Service Engine.

Is restricting SSH enough from the security point of view? Read More

Protecting TKG Workloads with Tanzu Mission Control Data Protection

Welcome to Part-3 of the getting started with Tanzu Mission Control. In this post, I will discuss how you can leverage Tanzu Mission Control to protect your Kubernetes workloads that are deployed on the Tanzu Kubernetes Grid cluster. 

If you are new to Tanzu Mission Control, I would encourage you to read previous articles of this series before diving into data protection for K8 workloads.

1: Tanzu Mission Control – Introduction & Architecture

2: Managing Tanzu Kubernetes Clusters with TMC

Tanzu Mission Control & Data Protection

Data protection in TMC is provided by Velero which is an open-source project that came with the Heptio acquisition.

When data protection is enabled on a Kubernetes cluster, the data backup is stored external to the TMC. TMC leverages AWS S3 functionality to store the backups. 

Note: Data protection is not enabled on the Kubernetes cluster by default. In this post, I will demonstrate the steps of enabling data protection and the process of backup and restoration of K8 data. Read More