Nested ESXi is a great way to quickly spin up a test/demo environment and fiddle around with things in the lab, and I have been doing this for quite a while now. VCF is very dear to my heart, and because VCF needs a hell of a lot of resources, I always test new versions/features in my nested lab.
Nested ESXi doesn’t always behave nicely, though, and sometimes gives you a hard time. I ran into exactly that recently in one of my VCF deployments.
What was the problem and how did it start?
The problem was with the ESXi system UUID, because of which the vSAN configuration was failing. I will talk more about this later in the post.
To save time, I created a nested ESXi template following this article, deployed a few ESXi hosts, and everything was working fine. One day I tweaked my template to inject some advanced parameters and booted the template VM. This generated a new system UUID entry in the /etc/vmware/esx.conf file, which I forgot to remove before powering off the VM and converting it back to a template.
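For reference, the entry in question lives in /etc/vmware/esx.conf and is easy to spot with a quick grep (the UUID value below is from my lab):

[root@mgmt-esxi01:~] grep "/system/uuid" /etc/vmware/esx.conf
/system/uuid = "6020bec5-47d5-8335-2046-00505685027a"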
Once the VCF prep work was done and the SDDC bring-up was triggered, it failed at the stage where vSAN is configured on the cluster. The Cloud Builder log file (vcf-bringup-debug.log) was full of the error below:
Post Validation: Vsan cluster did not come up after 600 secs, validation failed
On digging more into this and checking the bring-up state in the Cloud Builder DB, I found the task stuck in a failed state.
id                             | 7f000001-77cf-1d12-8177-cf53a1750122
created_date                   | 1614099079369
execution_order                | 220
dal_version                    | 1.0
execute_retrial_count          | 3
execution_errors               | [{"errorResponse":{"messageBundle":"com.vmware.evo.sddc.bringup.vsphere.messages","errorCode":"VSPHERE_VSAN_CLUSTER_VALIDATION_FAILED","arguments":[],"message":"VSAN Cluster validation failed","cause":[{"type":"com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException","message":"Post Validation: Vsan cluster did not come up after 600 secs, validation failed"}],"referenceToken":"85RLEE"},"errorSignature":"2efacacc"}]
execution_id                   | cefd82b6-97cd-4ced-b10c-11ca2c29f364
input_map                      | null
input_param                    | com.vmware.evo.sddc.bringup.plugin.model.vsphere.Vsphere
max_tries                      | 0
next_state                     | EnableVsanDone
operation_logging_token        | cefd82b6-97cd-4ced-b10c-11ca2c29f364
output_map                     | null
output_param                   |
param_builder                  | com.vmware.evo.sddc.bringup.adapters.toplugin.VsphereAdapter
plugin                         | com.vmware.evo.sddc.bringup.plugin.spi.VspherePlugin
post_validation_retrial_count  | 0
pre_validation_retrial_count   | 0
previous_state                 | UpdateVsanHclDone
processed_resource_type        | EVO
processing_context_id          | 7f000001-77cf-1d12-8177-cf53a1750118
processing_state_description   | Management Cluster Configuration
processing_state_name          | ManagementClusterConfiguration
rack_id                        |
rollback_retrial_count         | 0
status                         | COMPLETED_WITH_FAILURE
task_data                      |
task_description               | Enable vSAN
task_description_pack          | {"component":"workflowconfig.ems_recipes.ManagementClusterConfiguration","messageKey":"EnableVsan.desc","localBundle":"workflowconfig.ems_recipes.ManagementClusterConfiguration","defaultMessage":"Enable vSAN"}
task_name                      | EnableVsan
task_name_pack                 | {"component":"workflowconfig.ems_recipes.ManagementClusterConfiguration","messageKey":"EnableVsan.name","localBundle":"workflowconfig.ems_recipes.ManagementClusterConfiguration","defaultMessage":"Enable vSAN"}
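In case you want to look at the same thing, I pulled that record from the bring-up Postgres database on the Cloud Builder appliance with something along these lines (the database name and especially the table name here are assumptions and may differ between VCF versions; the column names match the record above):

# Query the bring-up database on the Cloud Builder appliance for failed tasks
# ("bringup" and "processing_task" are placeholders -- adjust for your VCF version)
psql -h localhost -U postgres -d bringup -x \
  -c "SELECT task_name, status, execution_errors FROM processing_task WHERE status = 'COMPLETED_WITH_FAILURE';"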
Next, I logged into the Web Client and checked the vSAN state, and found that the hosts were complaining about reachability.
Host cannot communicate with one or more other nodes in the vSAN enabled cluster
After performing network validation for vSAN, I ruled out networking issues because the hosts were able to reach each other over the vSAN network (vmkping etc.).
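For reference, this is roughly what that check looked like from one of the hosts (the vmkernel interface name and the IP below are from my lab and will differ in yours):

# Find out which vmkernel interface carries vSAN traffic (vmk2 in my lab)
esxcli vsan network list

# Ping another host's vSAN vmkernel IP over that interface
vmkping -I vmk2 172.16.20.12

# Check vSAN cluster membership as seen from this host
esxcli vsan cluster get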
As always, Google saved my life and I stumbled upon this article, which explains the same issue I was having in my lab.
So what exactly happened?
When I deployed the 4 nested ESXi VMs (after tweaking the template), each ESXi host ended up with the same system UUID.
[root@mgmt-esxi01:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a

[root@mgmt-esxi02:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a

[root@mgmt-esxi03:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a

[root@mgmt-esxi04:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a
The problem was loud and clear: since every vSAN host had the same UUID, vSAN went crazy.
Important tip: When dealing with nested ESXi, make sure to remove the UUID entry from the esx.conf file before converting to a template, as a new UUID is generated when the cloned VM boots up.
To fix this, I edited the esx.conf file on all 4 hosts, removed the UUID entry, and rebooted each host. You can also use this one-liner to remove the UUID: sed -i 's#/system/uuid.*##' /etc/vmware/esx.conf
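Step by step, this is roughly what that looks like on each host (the backup copy is just my own precaution, not a required step):

# Back up esx.conf before touching it
cp /etc/vmware/esx.conf /etc/vmware/esx.conf.bak

# Strip the /system/uuid line (same as the one-liner above)
sed -i 's#/system/uuid.*##' /etc/vmware/esx.conf

# Confirm the entry is gone, then reboot so a fresh UUID gets generated
grep "/system/uuid" /etc/vmware/esx.conf
reboot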
Post reboot, each host got a new UUID and vSAN was happy again 😉
[root@mgmt-esxi01:~] esxcli system uuid get
6020bec5-47d5-8335-2046-00505685027a

[root@mgmt-esxi02:~] esxcli system uuid get
6035d926-4eb2-b5ae-4d26-00505685eb72

[root@mgmt-esxi03:~] esxcli system uuid get
6035d947-ad1f-34bd-8a27-00505685ca50

[root@mgmt-esxi04:~] esxcli system uuid get
6035d919-9a1e-f8d7-2add-00505685892a
Next, I retried the task in the Cloud Builder UI; everything went smoothly, and within a couple of hours my SDDC was ready to roll.
And that concludes this post.
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing 🙂