Nested ESXi Gotchas with VCF

Nested ESXi is a great way to quickly spin up a test/demo environment and experiment in the lab. I have been doing so for quite a while now. VCF is very dear to my heart, and because VCF needs a hell of a lot of resources, I always test new versions/features in my nested lab.

Nested ESXi doesn’t always behave nicely and sometimes gives you a hard time. I ran into this recently in one of my VCF deployments.

What was the problem and how did it start?

The problem was with the ESXi system UUID, which caused the vSAN configuration to fail. I will talk more about this later in this post.

To save time, I created a nested ESXi template following this article. I deployed a few ESXi hosts from it and everything was working fine. One day I tweaked my template to inject some advanced parameters, which meant booting the template VM. That boot generated a new UUID entry for ESXi in the /etc/vmware/esx.conf file, and I forgot to remove it before powering off the VM and converting it back to a template.

Once the VCF prep work was done and SDDC Bringup was triggered, it kept failing at the stage where vSAN is configured on the cluster. The Cloud Builder log file (vcf-bringup-debug.log) was full of the below error

On digging further and checking the bringup state in the Cloud Builder DB, I found the task was stuck in the failed state.

Next, I logged into the Web Client, checked the vSAN state, and found that the hosts were complaining about reachability.

After performing network validation for vSAN, I ruled out networking issues because the hosts were able to reach each other over the vSAN network (vmkping etc).

As always, Google saved my life and I stumbled upon this article that explains the same issue I was having in my lab.

So what exactly happened?

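When I deployed 4 nested ESXi VMs (after tweaking the template), each ESXi host got the same system UUID, because every clone inherited the UUID entry saved in the template's esx.conf.

The problem was loud and clear. vSAN uses the system UUID to tell cluster members apart, so four hosts with the same UUID drove vSAN crazy.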
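If you want to confirm this kind of collision yourself, comparing the UUID each host reports is usually enough. Below is a minimal sketch, not something from my original troubleshooting: find_duplicate_uuids is a helper name I made up, and it assumes you feed it one "hostname uuid" pair per line, for example collected by running esxcli system uuid get on each host over SSH. The host names in the comment are placeholders.

```shell
#!/bin/sh
# Sketch: report any system UUID that is shared by more than one host.
# Input (stdin): one "hostname uuid" pair per line.
# Output: "uuid: host1 host2 ..." for every UUID seen more than once.
#
# One possible way to build the input (host names are hypothetical):
#   for h in esx01 esx02 esx03 esx04; do
#       echo "$h $(ssh root@"$h" esxcli system uuid get)"
#   done | find_duplicate_uuids
find_duplicate_uuids() {
    awk '
        { hosts[$2] = hosts[$2] " " $1; count[$2]++ }
        END {
            for (u in count)
                if (count[u] > 1)
                    print u ":" hosts[u]
        }'
}
```

No output means every host has a unique UUID; any line printed lists the hosts that share one.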

Important Tip: When dealing with nested ESXi templates, make sure to remove the UUID entry from the esx.conf file before converting the VM back to a template. With the entry gone, a new UUID is generated when each cloned VM boots up.

To fix this, I edited the esx.conf file on all 4 hosts, removed the UUID entry, and rebooted each host. You can also use this one-liner to remove the UUID: sed -i 's#/system/uuid.*##' /etc/vmware/esx.conf

After the reboot each host got a new UUID and vSAN was happy again 😉

Next, I retried the task in the Cloud Builder UI. Everything went smoothly, and within a couple of hours my SDDC was ready to roll.
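If you would rather keep a safety net while doing this, the same cleanup can be wrapped in a small helper that backs up the file first. This is only a sketch under my own assumptions: strip_esx_uuid is a made-up name, the .bak suffix is arbitrary, and the path defaults to /etc/vmware/esx.conf. Note that the sed expression uses # as its delimiter because the pattern itself contains slashes.

```shell
#!/bin/sh
# Sketch: blank out the /system/uuid entry in esx.conf, keeping a backup.
# Usage: strip_esx_uuid [path-to-esx.conf]
strip_esx_uuid() {
    conf="${1:-/etc/vmware/esx.conf}"

    # Keep a copy so the original entry can be restored if needed.
    cp "$conf" "$conf.bak" || return 1

    # '#' is the sed delimiter because the pattern contains '/'.
    # Reading from the backup avoids relying on sed -i.
    sed 's#/system/uuid.*##' "$conf.bak" > "$conf"
}
```

Run it on the host (or on the template VM right before powering it off), then reboot so ESXi generates a fresh UUID.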

And that concludes this post. 

I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing 🙂
