Cleanup NSX After Force Uninstall of vDefend SSP

Recently, while deploying vDefend SSP in my nested lab, I encountered an issue in which the SSP platform became unstable as soon as I activated platform services.

When platform services (Security Intelligence, Rule Analysis, etc.) are activated, SSP creates new pods. If the worker nodes are starved of CPU at that moment because of resource constraints on the physical host, the overall platform health degrades. The pods that make up the core services restart frequently and never return to a healthy state.

You can use the SSPI diagnostic tool to gain visibility into the problematic worker nodes, namespaces, and pods.

To view pod information, SSH to the SSPI VM and list the pods.
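As a rough sketch of that step, the helper below filters pod listings down to the ones worth looking at. It assumes `kubectl` is available and configured on the SSPI VM, which this post does not confirm. The function name `flag_unhealthy_pods` and the default restart threshold of 5 are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints only the pods that are
# not Running/Completed or whose restart count exceeds a threshold.
#
# Example use on the SSPI VM (assumes kubectl is configured there):
#   kubectl get pods -n nsxi-platform --no-headers | flag_unhealthy_pods 5
flag_unhealthy_pods() {
  local max_restarts="${1:-5}"
  awk -v max="$max_restarts" '
    {
      status = $3
      restarts = $4 + 0   # "7 (2m ago)" still parses, because $4 is "7"
      if ((status != "Running" && status != "Completed") || restarts > max)
        print $1, status, restarts
    }'
}
```

Each output line holds the pod name, its status, and its restart count, so pods stuck in a restart loop stand out immediately.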

The SSP UI won’t let you log in and throws unexpected errors. An example is shown below.

If you attempt to query the platform status, you will see a gateway timeout error.

The root cause of this problem was that each worker node is deployed with 16 vCPUs, and a minimum of 4 worker nodes is deployed. If there is CPU contention on the ESXi host, the pods in the nsxi-platform namespace that constitute the SSP platform restart repeatedly and get stuck in a restart loop.
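To put numbers on that, here is a small sketch that works out the total vCPU demand of the worker nodes and the overcommit ratio against a host's logical cores. The function name `worker_vcpu_demand` is hypothetical, and the host core count in the usage example is an assumed value, not a figure from my lab.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total vCPUs requested by the worker nodes and the
# resulting overcommit ratio for a host with a given number of logical cores.
worker_vcpu_demand() {
  local workers="$1" vcpus_per_worker="$2" host_cores="$3"
  if [ "$host_cores" -le 0 ]; then
    echo "host_cores must be a positive integer" >&2
    return 1
  fi
  local total=$(( workers * vcpus_per_worker ))
  awk -v t="$total" -v c="$host_cores" \
    'BEGIN { printf "total_vcpu=%d overcommit=%.2f\n", t, t / c }'
}

# Example: 4 workers x 16 vCPUs on an assumed 32-core host
#   worker_vcpu_demand 4 16 32
```

With the minimum of 4 worker nodes, the workers alone ask for 64 vCPUs, before counting anything else running on the same host in a nested lab.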

I had no option but to force uninstall SSP. But a force uninstall leaves lingering items in the NSX Manager, because NSX was never offboarded from SSP. When you attempt the force uninstall, the system complains that the NSX site is still onboarded. The wizard also refers to KB-382295, which provides a script to offboard NSX and clean up any lingering items.

Before executing the cleanup script, let’s review the items that get created in NSX when it is onboarded in SSP. This will help in understanding how to perform a manual cleanup if the script fails for any reason.

1: Users: SSP creates a few users to facilitate communication with NSX.

2: Certificates: A bunch of certs are assigned to services that run on the SSP and are imported into NSX.

3: Site Registration: The SSP platform is registered as a site in NSX and must be deleted before redeploying the platform.

It’s funny that NSX still refers to SSP as NAPP. SSP is the successor to NAPP (NSX Application Platform).

The Offboarding Script

Depending on the SSP version (5.0/5.1), download the right script from KB-382295. In my environment, I deployed SSP 5.1, so I downloaded the script ‘site-offboarding-cleanup-nsx-ssp5.1.sh’.

Place the script on any Linux machine that has connectivity to the NSX manager. Make the script executable and run the command:
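A minimal sketch of that step follows. The wrapper `run_offboarding_script` is a hypothetical helper, not part of the KB. It only checks that the file exists, sets the execute bit, and runs it. Any arguments or prompts the real script expects are defined by KB-382295 and are simply passed through here.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: make a downloaded script executable and run it,
# forwarding any extra arguments unchanged.
run_offboarding_script() {
  local script="$1"
  shift
  if [ ! -f "$script" ]; then
    echo "script not found: $script" >&2
    return 1
  fi
  chmod +x "$script" || return 1
  # Use an explicit path so the script runs even when "." is not in PATH.
  case "$script" in
    /*|./*|../*) "$script" "$@" ;;
    *)           "./$script" "$@" ;;
  esac
}

# Example (file name from this post; arguments, if any, come from the KB):
#   run_offboarding_script site-offboarding-cleanup-nsx-ssp5.1.sh
```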

The script cleans up all stale entries, and you are then ready to redeploy SSP after fixing the infrastructure issues.

That’s it for this post. I hope you enjoyed reading it. Feel free to share it on social media if it’s worth sharing.
