Recently, I upgraded Container Service Extension to 4.2.0 in my lab and was trying to deploy a TKG 2.4.0 cluster with node health check enabled. The deployment got stuck after deploying one control plane and worker node, and the cluster went into an error state.
Clicking on the Events tab showed the following error:
I checked the CSE log file and the capvcd logs on the ephemeral vm (before it got deleted) and found no error that would make sense to me.
I contacted CSE Engineering to discuss this issue and opened a bug for further analysis of the logs.
Root Cause
CSE Engineering debugged the logs and found that it was a bug in the product version. Here is the summary of the analysis done by Engineering.