Native Kubernetes in VCD using Container Service Extension 3.0

Introduction

VMware Container Service Extension (CSE) is an extension to Cloud Director that enables VCD cloud providers to offer Kubernetes-as-a-Service to their tenants. CSE integration with VCD has allowed CSPs to provide a true developer-ready cloud offering to VCD tenants, who can deploy Kubernetes clusters in just a few clicks directly from the VCD portal.

Cloud providers upload customized Kubernetes templates to public catalogs, which tenants leverage to deploy K8s clusters in self-contained vApps. Once a cluster is available, developers can use their native Kubernetes tooling to interact with it.
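
For a feel of the tenant workflow, the same lifecycle can also be driven from the vcd-cli CSE extension instead of the portal. Below is a rough sketch; the VCD address, org, network, and cluster names are hypothetical lab values:

    # Log in to the tenant org (all names are illustrative)
    vcd login vcd.lab.local tenant-org cluster-admin -i -w

    # Create a cluster from the provider-published template
    vcd cse cluster create demo-cluster --network tenant-net --nodes 2

    # Fetch the kubeconfig, then use native tooling against the cluster
    vcd cse cluster config demo-cluster > ~/.kube/config
    kubectl get nodes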

To learn more about the architecture and interaction of CSE components, please see my previous blog post on this topic.

Container Service Extension 3.x went GA earlier this year and brought several new features and enhancements. One of them is support for Tanzu Kubernetes Grid multi-cloud (TKGm) for K8s deployments, unlocking the full potential of consistent upstream Kubernetes in VCD-powered clouds. Read More

Upgrade Tanzu Kubernetes Grid from v1.3.x to v1.4.x

Tanzu Kubernetes Grid 1.4 is all set to be released today. A lot of new features are coming with this release, including (but not limited to):

  • Kubernetes versions 1.21.2, 1.20.8, and 1.19.12 are supported with the 1.4 release.
  • Support for NSX Advanced Load Balancer versions 20.1.3 & 20.1.6. 
  • New vSphere configuration variables “VSPHERE_REGION” and “VSPHERE_ZONE” enable CSI storage for workload clusters in vSphere environments with multiple datacenters or clusters (see the sketch after this list).
  • Support for L7 ingress (using NSX ALB) for workload clusters.
  • AKO deployment is fully automated. There is no need to install it via Helm or custom YAML, as both AKO and the AKO Operator are provided as core packages.
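
Here is a rough sketch of how the new topology variables slot into a workload cluster configuration file; the file path and tag values are illustrative lab choices, and the variables correspond to vSphere tags consumed by the CSI driver:

    # Append the topology variables to a cluster config file
    # (path and tag values are illustrative)
    cat <<'EOF' >> ~/.config/tanzu/tkg/clusterconfigs/workload.yaml
    VSPHERE_REGION: k8s-region
    VSPHERE_ZONE: k8s-zone
    EOF

    # Deploy the workload cluster with the updated configuration
    tanzu cluster create --file ~/.config/tanzu/tkg/clusterconfigs/workload.yaml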

It’s a good time to upgrade TKG in the lab, and in this blog post I will walk through the steps of upgrading TKGm from v1.3 to v1.4.

Upgrade Procedure

Step 1: Download and install the new version of the Tanzu CLI and kubectl on the bootstrap machine. Read More
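
On a Linux bootstrap machine, this step typically condenses to the following; the bundle filenames and versions are illustrative, so substitute the actual artifacts downloaded from the VMware portal:

    # Unpack the new CLI bundle and install the tanzu binary
    # (filenames and versions are illustrative)
    tar -xvf tanzu-cli-bundle-linux-amd64.tar
    sudo install cli/core/v1.4.0/tanzu-core-linux_amd64 /usr/local/bin/tanzu
    tanzu version

    # Refresh the CLI plugins shipped in the bundle
    tanzu plugin clean
    tanzu plugin install --local cli all

    # Install the matching kubectl build
    gunzip kubectl-linux-v1.21.2+vmware.1.gz
    sudo install kubectl-linux-v1.21.2+vmware.1 /usr/local/bin/kubectl
    kubectl version --client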

Tanzu Kubernetes Grid Ingress With NSX Advanced Load Balancer

NSX ALB delivers scalable, enterprise-class container ingress for containerized workloads running in Kubernetes clusters. The biggest advantage of using NSX ALB in a Kubernetes environment is that it is agnostic to the underlying Kubernetes cluster implementation. The NSX ALB controller integrates with the Kubernetes ecosystem via REST API and can therefore serve as an ingress and L4-L7 load balancing solution for a wide variety of Kubernetes implementations, including VMware Tanzu Kubernetes Grid.

NSX ALB provides ingress and load balancing functionality for TKG through AKO, a Kubernetes operator that runs as a pod in Tanzu Kubernetes clusters. AKO translates the required Kubernetes objects into Avi objects and automates the implementation of ingresses, routes, and services on the Service Engines (SEs) via the NSX ALB Controller.
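
To make that concrete, a plain Kubernetes Service of type LoadBalancer is all AKO needs to act on; the sketch below uses illustrative names:

    # AKO watches the API server and realizes this Service as an Avi
    # virtual service, pool, and VIP on the Service Engines
    kubectl create deployment web --image=nginx --replicas=2
    kubectl expose deployment web --type=LoadBalancer --port=80

    # The EXTERNAL-IP shown is a VIP allocated from the NSX ALB IPAM
    kubectl get service web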

The diagram below shows a high-level architecture of AKO interaction with NSX ALB.

AKO interacts with the Controller & Service Engines via API to automate the provisioning of virtual services, VIPs, and so on. Read More

Quick Tip: Disable vSAN Precheck During Workload Domain Upgrade in VCF

Before an upgrade bundle can be applied to a workload domain (management or VI), SDDC Manager triggers a precheck on the domain to identify and flag underlying issues so they can be remediated before the bundle is applied. In lab environments, where VCF often runs on unsupported hardware that is not present in the vSAN HCL, one of the most common precheck failures concerns vSAN HCL compatibility.

During the upgrade precheck on the workload domain, the vSAN HCL status shows as Red, and SDDC Manager won’t let you upgrade the domain until the issue is fixed.

You can force SDDC Manager to ignore the vSAN precheck by modifying the following entries in the applications-prod.properties file, located in “/opt/vmware/vcf/lcm/lcm-app/conf”.

Change the vSAN health-check related entries from true to false. Read More
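
On the SDDC Manager appliance, the edit looks roughly like this; the exact property keys vary by VCF version, so treat the sed pattern as illustrative and inspect the file first:

    # Back up the file before editing
    cd /opt/vmware/vcf/lcm/lcm-app/conf
    cp applications-prod.properties applications-prod.properties.bak

    # Review the vSAN health-check related keys (names vary by version)
    grep -i vsan applications-prod.properties

    # Flip the relevant flags from true to false (illustrative pattern)
    sed -i 's/^\(.*vsan.*\)=true$/\1=false/' applications-prod.properties

    # Restart the LCM service so the change takes effect
    systemctl restart lcm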

Monitor Tanzu Kubernetes Cluster with Prometheus & Grafana

Introduction

Monitoring is one of the most important parts of any infrastructure. Day-2 operations depend heavily on monitoring, alerting, and logging. Containerized applications are now part of almost every environment, and monitoring a Kubernetes cluster eases the management of containerized infrastructure by tracking the utilization of cluster resources.

As a Kubernetes operator, you want to receive alerts if the desired number of pods is not running, if resource utilization is approaching critical limits, or when failures or misconfiguration cause pods or nodes to become unable to participate in the cluster.
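
For example, a simple Prometheus alerting rule can catch the first of those conditions. The sketch below assumes kube-state-metrics is being scraped; the group name, threshold, and duration are illustrative:

    # Write a minimal Prometheus alerting rule file
    # (assumes kube-state-metrics is being scraped)
    cat <<'EOF' > deployment-replicas-alert.yaml
    groups:
    - name: k8s-availability
      rules:
      - alert: DeploymentReplicasMismatch
        expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"
    EOF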

Why is Kubernetes monitoring a challenge?

Kubernetes abstracts away a lot of complexity to speed up application deployment, but in the process it leaves you blind to what is actually happening behind the scenes, what resources are being utilized, and even the cost implications of the actions being taken. In the Kubernetes world, the number of components is typically higher than in traditional infrastructure, which makes root cause analysis more difficult when things go wrong. Read More

Centralized Logging For TKG using Fluentbit and vRealize Log Insight

Monitoring is one of the most important aspects of a production deployment. Logs are the savior when things go haywire in the environment, so capturing event logs from the infrastructure pieces is critical. Day-2 operations become easier when a comprehensive logging and alerting mechanism is in place, as it allows for a quick response to infrastructure failures.

With the increasing footprint of K8s workloads in the datacenter, centralized monitoring for Kubernetes is a must. The application developers who are focused on developing and deploying containerized applications are usually not well versed in the backend infrastructure.

So if developers find errors in the application logs, they might not realize that the issue is caused by an infrastructure event in the backend, because centralized logging is not in place and infrastructure logs are stored in a different location than the application logs.

The application and infrastructure logs should be aggregated so that it’s easier to identify the real problem affecting the application. Read More
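
To give a flavor of the plumbing, Fluent Bit can ship cluster logs to vRealize Log Insight over syslog. The stanza below is a minimal sketch; the vRLI host and port are hypothetical, and the TKG Fluent Bit extension normally renders this configuration from its data-values file:

    # Minimal Fluent Bit output stanza pointing at vRealize Log Insight
    # (host and port are illustrative)
    cat <<'EOF' >> fluent-bit.conf
    [OUTPUT]
        Name                syslog
        Match               *
        Host                vrli.lab.local
        Port                514
        Mode                udp
        Syslog_Format       rfc5424
        Syslog_Message_Key  log
    EOF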

NSX ALB Upgrade Breaking AKO Integration

Recently I upgraded NSX ALB from 20.1.4 to 20.1.5 in my lab and observed odd behavior whenever I attempted to deploy or delete any Kubernetes workload of type LoadBalancer.

The Issue

On deploying a new K8s application, AKO was unable to create a load balancer for it. In the NSX ALB UI, I could see that a pool had been created and a VIP assigned, but no virtual service (VS) was present. I also verified that the ‘ako-essential’ role has the necessary permission “PERMISSION_VIRTUALSERVICE” to create any new VS.

On attempting to delete a K8s application, the application got deleted on the TKG side, but it left lingering items (VS, pools, etc.) in the ALB UI. To investigate the issue further, I manually tried deleting the server pool and captured the output using the browser’s network inspect option.

As expected, the delete operation failed with an error stating that the object being deleted is associated with an ‘L4PolicySet’.

But the l4policyset was empty.
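
The same objects can also be inspected directly over the controller’s REST API, which is handy for cross-checking what the UI reports. A sketch, with an illustrative controller address and credentials:

    # List the L4 policy sets and pools via the NSX ALB (Avi) REST API
    # (controller FQDN and credentials are illustrative)
    curl -sk -u admin:'VMware1!' https://alb-controller.lab.local/api/l4policyset | python3 -m json.tool
    curl -sk -u admin:'VMware1!' https://alb-controller.lab.local/api/pool | python3 -m json.tool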

Read More

Quick Tip – Restricting SSH Access to NSX ALB Service Engines

By default, users can connect directly to a Service Engine via SSH using the system’s admin credentials. If there is a security requirement to restrict SSH connections, this access can be disabled using the following CLI configuration:

1: Connect to the NSX ALB controller and gain shell access.

2: Run the following commands to disable admin SSH access to the Service Engines.
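
At a high level, the flow looks like this. A sketch only: the controller address is illustrative, and the exact Service Engine property governing SSH differs across NSX ALB versions, so verify it against the Avi CLI reference for your release:

    # SSH to the controller as admin, then enter the Avi CLI
    ssh admin@alb-controller.lab.local    # illustrative address
    shell                                 # launches the Avi CLI login

    # Inside the Avi CLI, edit the Service Engine properties and set the
    # SSH-related runtime property to disabled (the exact attribute name
    # is version dependent -- check the CLI reference):
    #   configure serviceengineproperties
    #   ...
    #   save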

Is restricting SSH enough from a security point of view? Read More

Protecting TKG Workloads with Tanzu Mission Control Data Protection

Welcome to Part 3 of the Getting Started with Tanzu Mission Control series. In this post, I will discuss how you can leverage Tanzu Mission Control to protect Kubernetes workloads deployed on Tanzu Kubernetes Grid clusters.

If you are new to Tanzu Mission Control, I would encourage you to read the previous articles in this series before diving into data protection for K8s workloads.

1: Tanzu Mission Control – Introduction & Architecture

2: Managing Tanzu Kubernetes Clusters with TMC

Tanzu Mission Control & Data Protection

Data protection in TMC is provided by Velero, an open-source project that came to VMware with the Heptio acquisition.

When data protection is enabled on a Kubernetes cluster, the backup data is stored externally to TMC, which leverages AWS S3 functionality to store the backups.

Note: Data protection is not enabled on Kubernetes clusters by default. In this post, I will demonstrate the steps for enabling data protection and the process of backing up and restoring K8s data. Read More
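
Since TMC drives Velero under the hood, the equivalent manual Velero operations give a feel for what happens when you trigger a backup or restore. A sketch; the backup and namespace names are illustrative, and TMC issues these calls on your behalf:

    # Back up a namespace, list backups, and restore from one
    # (backup and namespace names are illustrative)
    velero backup create demo-backup --include-namespaces demo-app
    velero backup get
    velero restore create demo-restore --from-backup demo-backup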

Integrating Custom Registries with Tanzu Kubernetes Grid 1.3

Introduction

Tanzu Kubernetes Grid can be configured with a private registry for the rapid deployment of K8s workloads. Although there is a variety of container and artifact registries out there, Harbor has drawn attention because of its accessibility, ease of use, and rich feature set.

Although public registries are available on the internet, they might not contain everything you are looking for. In that case, you can create a custom Harbor registry and push custom K8s images to be used within your organization. A standalone Harbor registry is also a perfect fit for an air-gapped TKG deployment.

In my last post, I documented the steps for deploying a private Harbor registry for TKG. This post will show how you can leverage that registry to push and pull images for your K8s deployments.

I have created a new project (named manish) in Harbor, and I will be pushing images into that custom project. Read More
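
The push itself is the standard Docker workflow. A sketch; the Harbor FQDN and image tag are illustrative, while the project name manish matches the one created above:

    # Log in to the private Harbor registry (FQDN is illustrative)
    docker login harbor.lab.local

    # Tag a local image into the custom project and push it
    docker tag nginx:1.21 harbor.lab.local/manish/nginx:1.21
    docker push harbor.lab.local/manish/nginx:1.21

    # Pull it back from any host that trusts the registry certificate
    docker pull harbor.lab.local/manish/nginx:1.21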