Nested ESXi is a great way to quickly spin up a test/demo environment and fiddle around with things in the lab, and I have been doing this for quite a while now. VCF is very dear to my heart, and because VCF needs a hell of a lot of resources, I always test new versions/features in my nested lab.
Nested ESXi doesn’t always behave nicely, though, and sometimes gives you a hard time. I ran into exactly that recently in one of my VCF deployments.
What was the problem and how did it start?
The problem was with the ESXi system UUID, because of which the vSAN configuration was failing. I will talk more about this later in the post.
To save time, I created a nested ESXi template following this article, deployed a few ESXi hosts, and everything was working fine. One day I tweaked my template to inject some advanced parameters and booted the template VM. This generated a new system UUID entry in the /etc/vmware/esx.conf file, which I forgot to remove before powering off the VM and converting it back to a template.
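For reference, the entry in question lives in /etc/vmware/esx.conf and is easy to spot with a quick grep (the UUID value below is from my lab):

[root@mgmt-esxi01:~] grep "/system/uuid" /etc/vmware/esx.conf
/system/uuid = "6020bec5-47d5-8335-2046-00505685027a"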
Once the VCF prep work was done and the SDDC bring-up was triggered, it failed at the stage where vSAN is configured on the cluster. The Cloud Builder log file (vcf-bringup-debug.log) was full of the error below:
Post Validation: Vsan cluster did not come up after 600 secs, validation failed
On digging more into this and checking the bring-up state in the Cloud Builder DB, I found the task stuck in a failed state.
id                             | 7f000001-77cf-1d12-8177-cf53a1750122
created_date                   | 1614099079369
execution_order                | 220
dal_version                    | 1.0
execute_retrial_count          | 3
execution_errors               | [{"errorResponse":{"messageBundle":"com.vmware.evo.sddc.bringup.vsphere.messages","errorCode":"VSPHERE_VSAN_CLUSTER_VALIDATION_FAILED","arguments":[],"message":"VSAN Cluster validation failed","cause":[{"type":"com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException","message":"Post Validation: Vsan cluster did not come up after 600 secs, validation failed"}],"referenceToken":"85RLEE"},"errorSignature":"2efacacc"}]
execution_id                   | cefd82b6-97cd-4ced-b10c-11ca2c29f364
input_map                      | null
input_param                    | com.vmware.evo.sddc.bringup.plugin.model.vsphere.Vsphere
max_tries                      | 0
next_state                     | EnableVsanDone
operation_logging_token        | cefd82b6-97cd-4ced-b10c-11ca2c29f364
output_map                     | null
output_param                   |
param_builder                  | com.vmware.evo.sddc.bringup.adapters.toplugin.VsphereAdapter
plugin                         | com.vmware.evo.sddc.bringup.plugin.spi.VspherePlugin
post_validation_retrial_count  | 0
pre_validation_retrial_count   | 0
previous_state                 | UpdateVsanHclDone
processed_resource_type        | EVO
processing_context_id          | 7f000001-77cf-1d12-8177-cf53a1750118
processing_state_description   | Management Cluster Configuration
processing_state_name          | ManagementClusterConfiguration
rack_id                        |
rollback_retrial_count         | 0
status                         | COMPLETED_WITH_FAILURE
task_data                      |
task_description               | Enable vSAN
task_description_pack          | {"component":"workflowconfig.ems_recipes.ManagementClusterConfiguration","messageKey":"EnableVsan.desc","localBundle":"workflowconfig.ems_recipes.ManagementClusterConfiguration","defaultMessage":"Enable vSAN"}
task_name                      | EnableVsan
task_name_pack                 | {"component":"workflowconfig.ems_recipes.ManagementClusterConfiguration","messageKey":"EnableVsan.name","localBundle":"workflowconfig.ems_recipes.ManagementClusterConfiguration","defaultMessage":"Enable vSAN"}
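In case you want to look at the same thing, I pulled that record from the bring-up Postgres database on the Cloud Builder appliance with something along these lines (the database name and especially the table name here are assumptions and may differ between VCF versions; the column names match the record above):

# Query the bring-up database on the Cloud Builder appliance for failed tasks
# ("bringup" and "processing_task" are placeholders -- adjust for your VCF version)
psql -h localhost -U postgres -d bringup -x \
  -c "SELECT task_name, status, execution_errors FROM processing_task WHERE status = 'COMPLETED_WITH_FAILURE';"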
Next, I logged into the Web Client and checked the vSAN state, and found that the hosts were complaining about reachability.
Host cannot communicate with one or more other nodes in the vSAN enabled cluster
After performing network validation for vSAN, I ruled out networking issues because the hosts were able to reach each other over the vSAN network (vmkping etc.).
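For reference, this is roughly what that check looked like from one of the hosts (the vmkernel interface name and the IP below are from my lab and will differ in yours):

# Find out which vmkernel interface carries vSAN traffic (vmk2 in my lab)
esxcli vsan network list

# Ping another host's vSAN vmkernel IP over that interface
vmkping -I vmk2 172.16.20.12

# Check vSAN cluster membership as seen from this host
esxcli vsan cluster get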
As always, Google saved my life and I stumbled upon this article, which explains the same issue I was having in my lab.
So what exactly happened?
When I deployed the 4 nested ESXi VMs (after tweaking the template), each ESXi host ended up with the same system UUID.
[root@mgmt-esxi01:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a

[root@mgmt-esxi02:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a

[root@mgmt-esxi03:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a

[root@mgmt-esxi04:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a
The problem was loud and clear: since every vSAN host had the same UUID, vSAN went crazy.
Important tip: When dealing with nested ESXi, make sure to remove the UUID entry from the esx.conf file before converting to a template, as a new UUID is generated when the cloned VM boots up.
To fix this, I edited the esx.conf file on all 4 hosts, removed the UUID entry, and rebooted each host. You can also use this one-liner to remove the UUID: sed -i 's#/system/uuid.*##' /etc/vmware/esx.conf
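Step by step, this is roughly what that looks like on each host (the backup copy is just my own precaution, not a required step):

# Back up esx.conf before touching it
cp /etc/vmware/esx.conf /etc/vmware/esx.conf.bak

# Strip the /system/uuid line (same as the one-liner above)
sed -i 's#/system/uuid.*##' /etc/vmware/esx.conf

# Confirm the entry is gone, then reboot so a fresh UUID gets generated
grep "/system/uuid" /etc/vmware/esx.conf
reboot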
Post reboot, each host got a new UUID and vSAN was happy again 😉
[root@mgmt-esxi01:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a

[root@mgmt-esxi02:~] esxcli system uuid get
6035d926-4eb2-b5ae-4d26-00505685eb72

[root@mgmt-esxi03:~] esxcli system uuid get
6035d947-ad1f-34bd-8a27-00505685ca50

[root@mgmt-esxi04:~] esxcli system uuid get
6035d919-9a1e-f8d7-2add-00505685892a
Next, I retried the task in the Cloud Builder UI; everything went smoothly, and within a couple of hours my SDDC was ready to roll.
And that concludes this post.
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing 🙂