Troubleshoot Common VMware NSX Installation/Configuration Issues

Troubleshoot NSX Manager Services

If you are facing any NSX-related issues, the NSX Manager UI is the first place to check which service or services are impacted. You can typically check the status of the following services from the NSX Manager UI (https://NSX-FQDN/login.jsp):

  • vPostgres
  • RabbitMQ
  • NSX Management Service
  • NSX Universal Synchronization Service (only when Cross-vCenter NSX is configured)

If any service is in a stopped state, try to start or restart it.

You can also check logs from the NSX Manager CLI to determine what is broken. The two important logs are the NSX Manager log and the system log. You can view them by running the commands show log manager and show log system. Append the word follow to watch the logs in real time (similar to the Linux tail -f command).

 

If any of the services is crashing or not starting, check the bottom of the log for the latest entries; they should give you more information on why it is not starting.

To view the log from the bottom, run the command show log manager reverse and look for the keywords ERROR, WARNING, FATAL, or EXCEPTION.
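
If you have saved a copy of the manager log to a file (for example, from a tech support bundle), the same keyword hunt can be sketched in Python. This is only a rough sketch and not an NSX tool: the function name find_problem_lines and the sample lines are made up for illustration, and real NSX Manager log lines will look different.

```python
KEYWORDS = ("ERROR", "WARNING", "FATAL", "EXCEPTION")


def find_problem_lines(lines, keywords=KEYWORDS, newest_first=True):
    """Return (line_number, text) pairs for lines that contain any keyword.

    Matching is case-insensitive, so a Java-style name such as
    NullPointerException is caught by the EXCEPTION keyword as well.
    """
    hits = [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if any(keyword in line.upper() for keyword in keywords)
    ]
    # Newest entries sit at the bottom of the log, so reverse the hits
    # to mimic reading the log from the end.
    return hits[::-1] if newest_first else hits


# Hypothetical sample lines, not real NSX Manager output.
sample_log = [
    "INFO  service started",
    "WARNING disk usage high",
    "INFO  heartbeat ok",
    "ERROR failed to connect to message bus",
]

for number, text in find_problem_lines(sample_log):
    print(number, text)
```

To run it against a saved log, pass an open file object instead of the sample list, since iterating a file yields its lines.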

If NSX Manager is having connectivity issues either with vCenter Server or the ESXi host, you can run the command: debug connection IP_of_ESXi_or_VC, and examine the output.
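
If you also want a quick reachability check from your own workstation, a plain TCP connect test is often enough to tell "nothing is listening" apart from "something answers". The sketch below does not replace debug connection, which tests from the NSX Manager itself. The function name can_connect is made up for this post, and the host and port you pass in (for example, 443 for the vCenter Server HTTPS endpoint) are assumptions you should adjust to your environment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (placeholder name, replace with your own vCenter or ESXi host):
#   can_connect("vcenter.example.com", 443)
```

A True result only proves that the TCP port is open from where you ran the script; it says nothing about certificates, credentials, or the path from the NSX Manager.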

You can also configure NSX Manager to forward all logs to a centralized syslog server.

In my lab, I have a syslog server configured on one of the Linux boxes, and NSX Manager is forwarding the following logs there.

Download Technical Support Logs from NSX Manager

If you want to provide the NSX Manager logs to the VMware support team for analysis, or keep them for internal investigation later, you can download them from the NSX Manager UI by clicking “Download Tech Support Log”.

The log collection process will start.

Once the process is complete, click the Download button to save the log to your local system.

NSX Manager and controller logs can also be collected from the vCenter Web Client by navigating to Networking & Security > Support Bundle, selecting the components, and then clicking “Start Bundle Collection”.

Once the log bundle generation is complete, click the Download button to get the bundle.

Troubleshoot Host Preparation Issues

When we kick off the ESXi host preparation task, NSX installs the VIBs on the ESXi hosts. Before populating the vSphere clusters with workloads, we have to ensure that all ESXi hosts are prepared and their status is green.

Over time, as changes (a host reimage or an NSX Manager upgrade) are made in the NSX environment, you might see some hosts reporting as not ready.

To resolve host preparation issues, log in to the vCenter Web Client, navigate to Networking & Security > Installation > Host Preparation tab, select the affected host, click the gear icon, and hit Resolve to push the VIBs back to the host.

Sometimes this operation fails and you have to install the NSX VIBs on the host manually. Refer to this article for manual installation of the NSX VIBs.

Verify that the Communication Channel Health status is green at the host/cluster level.

If the issue is related to one particular host, don’t forget to check the DNS records (forward and reverse) for that host as well. An incorrect DNS entry for a host can also cause VIB installation to fail on that host.

If more than one host is affected, verify that the RabbitMQ service on NSX Manager is up and running. Optionally, you can try restarting the service.
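
The forward and reverse check can be scripted so that you run it the same way for every host. Below is a small sketch using Python's standard socket module. The function name check_dns is made up for this post, and it uses whatever resolver the machine running the script is configured with, which may not be the same DNS server that vCenter and NSX Manager use.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Check that hostname -> IP -> hostname round-trips.

    Returns (ok, message). The forward and reverse lookups are parameters
    so the check can be exercised without a live DNS server.
    """
    try:
        ip = forward(hostname)
    except OSError as exc:
        return False, f"forward lookup failed: {exc}"
    try:
        # gethostbyaddr returns (primary_name, alias_list, ip_list).
        reverse_name = reverse(ip)[0]
    except OSError as exc:
        return False, f"reverse lookup for {ip} failed: {exc}"
    # Compare case-insensitively and ignore a trailing root dot.
    if reverse_name.rstrip(".").lower() != hostname.rstrip(".").lower():
        return False, f"{hostname} -> {ip} -> {reverse_name} (mismatch)"
    return True, f"{hostname} -> {ip} -> {reverse_name}"


# Example call (placeholder FQDN, replace with your ESXi host):
#   ok, message = check_dns("esxi01.lab.local")
```

Pass the fully qualified name of the host; a short name will usually be reported as a mismatch because the reverse record normally returns the FQDN.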

Troubleshoot NSX Controller cluster status, roles and connectivity

Troubleshooting controller connectivity isn’t too difficult. If one of the controllers has an issue and it’s not recoverable, you can always delete the borked controller and deploy a new one, which is the easiest and quickest way to fix the issue.

Controller Cluster Status

In case of issues, verify the NSX controller cluster status. Make sure each controller is showing as healthy and its status is connected.

As per this document, VMware recommends deleting the entire controller cluster when one or more of the controllers encounter catastrophic, unrecoverable errors and cannot be fixed. In this case, delete all controllers, even if some of them seem healthy. This sounds weird to me, to be frank.

You can also verify the status of the controllers from the CLI by running the command: show control-cluster status

In the output, a healthy controller is basically telling you: yeah, I am fine. My join status is complete, I am connected to the cluster majority, and I can be safely restarted. Here is my cluster ID and UUID. I am happy, I am activated and ready to go.
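
If you capture that output from several controllers, the three health statements can be checked in bulk. The sketch below simply looks for three phrases in the captured text. The exact phrases ("Join complete", "Connected to cluster majority", "can be safely restarted") are my assumption of how the statements above appear in the output, and the sample strings are hand-written stand-ins rather than real controller output, so compare them with what your controllers actually print before relying on this.

```python
# Assumed wording of the healthy status lines; adjust to your actual output.
EXPECTED_PHRASES = {
    "join": "Join complete",
    "majority": "Connected to cluster majority",
    "restart": "can be safely restarted",
}


def check_cluster_status(output):
    """Return a dict saying which of the expected phrases appear in output."""
    text = output.lower()
    return {name: phrase.lower() in text for name, phrase in EXPECTED_PHRASES.items()}


def is_healthy(output):
    """True only if every expected phrase is present."""
    return all(check_cluster_status(output).values())


# Hand-written stand-ins for captured output (illustrative only).
healthy_sample = """\
Join status:      Join complete
Majority status:  Connected to cluster majority
Restart status:   This controller can be safely restarted
"""
degraded_sample = """\
Join status:      Join complete
Majority status:  <anything other than the healthy phrase>
"""

print(check_cluster_status(degraded_sample))
```

A missing phrase does not tell you what is wrong, only which of the three statements the controller did not make, which is usually enough to decide where to look next.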

Controller Cluster Roles

When a controller cluster is formed, each node participating in the cluster is master of something. You can ask a controller, “Are you the one who is in charge here? Are you the master of anything?”

To check a controller’s roles, run the command: show control-cluster roles

Below is the output from two of my controller nodes.

If a node has disconnected from the cluster because of a failure, you can try to force that node to rejoin the cluster by running the command: join control-cluster Master-Controller-IP force
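
When you compare the roles output from every node by hand, it is easy to miss a role that nobody owns or that two nodes both claim. Here is a small sketch of that cross-check, assuming each role should have exactly one master across the cluster. You fill in the dictionary yourself from what each node reports; the node names and the role names in the sample are just illustrative placeholders, and the function names are made up for this post.

```python
def masters_by_role(roles_per_node):
    """Map each role to the list of nodes that report being its master."""
    masters = {}
    for node, roles in roles_per_node.items():
        for role, is_master in roles.items():
            # setdefault keeps roles visible even when no node is master.
            nodes = masters.setdefault(role, [])
            if is_master:
                nodes.append(node)
    return masters


def roles_with_problems(roles_per_node):
    """Return the roles that do not have exactly one master."""
    return {
        role: nodes
        for role, nodes in masters_by_role(roles_per_node).items()
        if len(nodes) != 1
    }


# Illustrative data typed in by hand from each node's roles output.
sample = {
    "controller-1": {"api_provider": True, "persistence_server": False, "switch_manager": False},
    "controller-2": {"api_provider": False, "persistence_server": True, "switch_manager": False},
    "controller-3": {"api_provider": False, "persistence_server": False, "switch_manager": False},
}

print(roles_with_problems(sample))
```

In this sample no node claims switch_manager, so that role is reported with an empty list of masters.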

You can review the cluster history to ensure there is no sign of host connection flapping, VNI join failures, or abnormal cluster membership changes. To do this, run the command: show control-cluster history

You can check the details of the connections to and from a controller by running command: show network connections of-type tcp

Running this command is similar to running the netstat command.

This document from VMware lists all the CLI commands you can use to troubleshoot NSX controller issues.

Troubleshoot Logical Switch transport zone and NSX Edge mappings

To display a full list of commands for logical switches on an NSX Controller, run show control-cluster logical-switches and hit Enter. 

To find out which controller is responsible for managing which VNI

To check configuration for the specified VNI

To check connections of a specific VNI

To check VTEP table of a VNI

Troubleshoot Logical Router interface and route mappings

To see the full list of commands for troubleshooting DLR issues, run the command show control-cluster logical-routers on the NSX controller.

I have summarized a few handy commands that can be used while troubleshooting DLR issues.

To list all deployed distributed logical routers

Interface summary for a specific DLR

Static routes for a specific DLR

DLR edge connection list

Troubleshoot distributed and edge firewall implementations

The Distributed Firewall module runs inside the ESXi host kernel, so the host has all the information about which policies are configured for the virtual machines running on that host.

We can debug and troubleshoot DFW issues from the ESXi host command line. The two most important commands we will use are summarize-dvfilter and vsipioctl.

summarize-dvfilter: This command takes no arguments; it simply prints all of the dvfilters on the host.

We need to get VM UUID in order to view the DFW policies for a given VM

Once we have the UUID of a VM, we use the vsipioctl command to view the policies.

vsipioctl getfilters: This command shows the UUIDs of all VMs as well as the NICs on which DFW is operating.

To view all DFW rules configured for a particular VM, run the getrules command and specify the filter name of the VM's NIC: vsipioctl getrules -f <filter-name>
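
On a host with many VMs, picking the right filter names out of the summarize-dvfilter output by eye gets tedious. The sketch below pulls the DFW filter names out of saved output and builds the matching vsipioctl getrules -f command lines. It assumes the filter name appears on a "name:" line and contains the string vmware-sfw; the sample text is a hand-written approximation of the layout, not real host output, and the function names are made up for this post.

```python
import re

# Matches a line such as "   name: nic-1000001-eth0-vmware-sfw.2".
# A line like "agentName: vmware-sfw" does not match, because the line
# must start (after whitespace) with "name:".
FILTER_NAME = re.compile(r"^\s*name:\s*(\S+)\s*$")


def dfw_filter_names(summarize_output, agent="vmware-sfw"):
    """Return the filter names that belong to the given agent."""
    names = []
    for line in summarize_output.splitlines():
        match = FILTER_NAME.match(line)
        if match and agent in match.group(1):
            names.append(match.group(1))
    return names


def getrules_commands(summarize_output):
    """Build one 'vsipioctl getrules -f <filter>' command per DFW filter."""
    return [f"vsipioctl getrules -f {name}" for name in dfw_filter_names(summarize_output)]


# Hand-written approximation of the output layout (illustrative only).
sample = """\
world 1000001 vmm0:web-01
 port 50331650 web-01.eth0
  vNic slot 2
   name: nic-1000001-eth0-vmware-sfw.2
   agentName: vmware-sfw
"""

for command in getrules_commands(sample):
    print(command)
```

The script only prints the commands; you still run them yourself on the ESXi host.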

And that’s it for this post.

I hope you find this post informative. Feel free to share it on social media if you think it is worth sharing. Be sociable 🙂
