TKG Cluster Deployment Gotchas with Node Health Check in CSE 4.2

Recently, I upgraded Container Service Extension to 4.2.0 in my lab and was trying to deploy a TKG 2.4.0 cluster with node health check enabled. The deployment got stuck after deploying one control plane and worker node, and the cluster went into an error state.

Clicking on the Events tab showed the following error:

I checked the CSE log file and the capvcd logs on the ephemeral vm (before it got deleted) and found no error that would make sense to me.

I contacted CSE Engineering to discuss this issue and opened a bug for further analysis of the logs.

Root Cause

CSE Engineering debugged the logs and found that it was a bug in the product version. Here is the summary of the analysis done by Engineering. 

Conclusion

If you are using Kubernetes Container Cluster plugin 4.2, do not enable ‘Node Health Check’ on the cluster.

I have requested the CSE Engineering team to get this issue updated under the known issues in the product documentation and the doc team is working on the same.

I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.

Leave a Reply