Recently, I upgraded Container Service Extension to 4.2.0 in my lab and tried to deploy a TKG 2.4.0 cluster with node health check enabled. The deployment got stuck after deploying one control plane node and one worker node, and the cluster went into an error state.
Clicking on the Events tab showed the following error:
I checked the CSE log file and the capvcd logs on the ephemeral VM (before it got deleted) but found no error that explained the failure.
I contacted CSE Engineering to discuss this issue and opened a bug for further analysis of the logs.
Root Cause
CSE Engineering analyzed the logs and found that this is a bug in this product version. Here is the summary of the analysis done by Engineering:
This bug is caused by UI plugin 4.2.0 using the wrong API version in the MachineHealthCheck capi yaml object. The UI plugin uses API version v1beta2 instead of v1beta1, which is the correct value.

Clusters that contain a MachineHealthCheck section using the incorrect value will be unmanageable (unable to be resized or operated on).

There are 2 ways in UI plugin 4.2.0 for a cluster to have this invalid capi yaml:

1. If a user creates a cluster with ‘Node Health Check’ activated, then the resulting cluster will be unmanageable.
2. If a user activates ‘Node Health Check’ for a cluster that never had ‘Node Health Check’ activated previously, then the resulting cluster will be unmanageable. The exception to this is if a cluster was created using a lower UI plugin version and ‘Node Health Check’ had already been activated before, then that cluster will not face this issue.

Note: Clusters created in lower UI plugin versions will not face this issue unless ‘Node Health Check’ is activated for the first time for the cluster using UI plugin 4.2.0.
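To make the analysis above concrete, here is a minimal Python sketch that scans a cluster's capi yaml for MachineHealthCheck objects whose API version is not v1beta1. It is only an illustration, not a CSE tool: the function name `find_bad_mhc_api_versions`, the sample yaml, and the `cluster.x-k8s.io` API group shown in it are my assumptions (the group follows upstream Cluster API conventions), and how you export the capi yaml from your environment is outside the scope of this sketch. It uses only the standard library and simple pattern matching rather than a full YAML parser.

```python
import re


def find_bad_mhc_api_versions(capi_yaml: str, expected: str = "v1beta1"):
    """Return (name, apiVersion) for every MachineHealthCheck document
    whose API version suffix differs from `expected`.

    Hypothetical helper for illustration; it assumes top-level `kind:` and
    `apiVersion:` keys start at column 0, as in typical capi yaml output.
    """
    bad = []
    # Split the multi-document YAML on '---' separator lines.
    for doc in re.split(r"(?m)^---\s*$", capi_yaml):
        kind = re.search(r"(?m)^kind:\s*(\S+)", doc)
        if not kind or kind.group(1) != "MachineHealthCheck":
            continue
        api = re.search(r"(?m)^apiVersion:\s*(\S+)", doc)
        api_version = api.group(1) if api else ""
        # First indented 'name:' key; assumed to be metadata.name.
        name = re.search(r"(?m)^[ \t]+name:\s*(\S+)", doc)
        # Compare only the version part after the API group.
        if api_version.rsplit("/", 1)[-1] != expected:
            bad.append((name.group(1) if name else "<unknown>", api_version))
    return bad


# Made-up sample resembling the invalid yaml described by Engineering.
SAMPLE = """\
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo
---
apiVersion: cluster.x-k8s.io/v1beta2
kind: MachineHealthCheck
metadata:
  name: demo
  namespace: demo-ns
spec:
  clusterName: demo
"""

if __name__ == "__main__":
    for name, api_version in find_bad_mhc_api_versions(SAMPLE):
        print(f"MachineHealthCheck {name} uses {api_version}")
```

An empty result means no MachineHealthCheck in the yaml carries an unexpected version. A non-empty result only tells you the cluster matches the pattern Engineering described; it does not repair anything.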
Conclusion
If you are using Kubernetes Container Clusters UI plugin 4.2.0, do not enable ‘Node Health Check’ on a cluster, either at creation time or on an existing cluster that never had it activated.
I have asked the CSE Engineering team to add this issue to the known issues in the product documentation, and the documentation team is working on it.
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.