Troubleshoot NSX Manager Services
If you are facing any NSX-related issues, the NSX Manager UI is the first place to verify which service or services are impacted. Typically you can check the status of the following services from the NSX Manager UI (https://NSX-FQDN/login.jsp):
- vPostgres
- RabbitMQ
- NSX Management Service
- NSX Universal Synchronization Service (Only when you have Cross vCenter NSX Configured)
If any service is in a stopped state, try starting or restarting it.
You can also check logs from the NSX Manager CLI to determine what is broken. The two important logs you can check are the NSX Manager log and the System log. These logs can be viewed by firing the commands: show log manager & show log system. You can append the word follow to watch a log in real time (similar to the Linux tail -f command).
If any service is crashing or not starting, check the bottom of the log for the latest entries; they should give you more information on why it is failing.
To view the log from the bottom, run command: show log manager reverse and check for the keywords ERROR, WARNING, FATAL, or EXCEPTION.
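If you have pulled a copy of the manager log off the appliance, the same keyword triage can be scripted. A minimal sketch (the function name and sample log lines below are mine, not real NSX output):

```python
import re

# Keywords worth scanning for when a service will not start
KEYWORDS = re.compile(r"\b(ERROR|WARNING|FATAL|EXCEPTION)\b")

def last_problem_lines(log_text, limit=20):
    """Return the most recent log lines containing a problem keyword,
    newest first (mimicking 'show log manager reverse')."""
    hits = [line for line in log_text.splitlines() if KEYWORDS.search(line)]
    return list(reversed(hits))[:limit]

# Hypothetical log excerpt for illustration
sample = """2023-06-04 09:38:23 INFO  service starting
2023-06-04 09:38:24 ERROR failed to bind port 443
2023-06-04 09:38:25 FATAL shutting down"""
for line in last_problem_lines(sample):
    print(line)
```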
If NSX Manager is having connectivity issues with either vCenter Server or an ESXi host, run the command: debug connection IP_of_ESXi_or_VC and examine the output.
```
nsxmgr-01a.corp.local> debug connection vcsa-01a.corp.local
PING vcsa-01a.corp.local (192.168.109.113): 56 data bytes
64 bytes from 192.168.109.113: icmp_seq=0 ttl=64 time=0.371 ms
64 bytes from 192.168.109.113: icmp_seq=1 ttl=64 time=1.096 ms
64 bytes from 192.168.109.113: icmp_seq=2 ttl=64 time=0.379 ms
--- vcsa-01a.corp.local ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.371/0.615/1.096/0.340 ms
vcsa-01a.corp.local reachable
vcsa-01a.corp.local reachable over port 443
vcsa-01a.corp.local not reachable over port 902
vcsa-01a.corp.local not reachable over port 903
```
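If you cannot get to the NSX CLI, the TCP port checks that debug connection performs can be approximated from any machine that can reach the endpoints. A hedged sketch (tcp_reachable is my own helper; the hostname and ports are taken from the example output, so expect "not reachable" outside that lab):

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, roughly
    what 'debug connection' reports as 'reachable over port N'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timed out, and DNS failures
        return False

# vCenter-facing ports from the output above (443 = HTTPS,
# 902/903 = host management and console traffic)
for port in (443, 902, 903):
    state = "reachable" if tcp_reachable("vcsa-01a.corp.local", port) else "not reachable"
    print(f"vcsa-01a.corp.local {state} over port {port}")
```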
You can also configure NSX Manager to forward all logs to a centralized syslog server.
In my lab, I have a syslog server configured on one of the Linux boxes, and NSX Manager is forwarding its logs there.
Download Technical Supports logs from NSX Manager
When you want to provide NSX Manager logs to the VMware support team for analysis, or keep them for internal investigation later, you can download them from the NSX Manager UI by clicking on “Download Tech Support Log”.
The log collection process will start.
Once the process is completed, click on the Download button to save the logs to your local system.
NSX Manager/controller logs can be collected from the vCenter Web Client as well by navigating to Networking & Security > Support Bundle, selecting the components, and then clicking on “Start Bundle Collection”.
Once the log bundle generation is completed, click on the Download button to get the bundle.
Troubleshoot Host Preparation Issues
When we kick off the ESXi host preparation task, NSX installs the VIBs on the ESXi hosts. Before populating the vSphere clusters with workloads, we have to ensure that all ESXi hosts are prepared and their status is green.
Over time, as changes (host reimage or NSX Manager upgrade) are made in the NSX environment, you might see some hosts reporting as Not Ready.
To resolve host preparation issues, log in to the vCenter Web Client, navigate to the Networking & Security > Installation > Host Preparation tab, select the affected host, click on the gear icon, and hit Resolve to push the VIBs back onto the host.
Sometimes this operation fails and you have to install the NSX VIBs on the host manually. Refer to this article for manual installation of NSX VIBs.
Verify that the Communication Channel health status is green at the host/cluster level.
If the issue is related to one particular host, don't forget to check the DNS records (forward and reverse) for that host as well. An incorrect DNS entry for a host can also cause VIB installation to fail on that host.
If more than one host is affected, verify that the RabbitMQ service on NSX Manager is up and running. Optionally, you can try restarting the service.
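A quick way to sanity-check that forward and reverse DNS agree for a host is a short script like the following (dns_consistent is my own helper, and the lab hostname is hypothetical):

```python
import socket

def dns_consistent(hostname):
    """Check that forward and reverse DNS for a host agree:
    hostname -> IP -> a PTR name that matches back."""
    try:
        ip = socket.gethostbyname(hostname)           # forward lookup (A record)
    except OSError:
        return False, "forward lookup failed"
    try:
        rname, aliases, _ = socket.gethostbyaddr(ip)  # reverse lookup (PTR record)
    except OSError:
        return False, f"no PTR record for {ip}"
    names = {rname, *aliases}
    return hostname in names, f"{hostname} -> {ip} -> {sorted(names)}"

# Hypothetical lab host; substitute the host failing VIB installation
print(dns_consistent("esxi-01a.corp.local"))
```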
Troubleshoot NSX Controller cluster status, roles and connectivity
Troubleshooting controller connectivity isn’t too difficult. If you have issues with one of the controllers and it is not recoverable, you can always delete the borked controller and deploy a new one, which is the easiest and quickest method to fix the issue.
Controller Cluster Status
In case of issues, verify the NSX Controller cluster status. Make sure each controller is showing as healthy and its status as Connected.
As per this document, VMware recommends deleting the entire controller cluster when one or more of the controllers encounter catastrophic, unrecoverable errors that cannot be fixed. In this case, delete all controllers, even if some of them seem healthy. To be frank, this sounds weird to me.
You can also verify the status of the controllers from CLI by running command: show control-cluster status
You will see output with the controller essentially stating: yeah, I am fine. My join status is Complete, I am connected to the cluster majority, and I can be safely restarted. Here are my cluster ID and UUID. I am happy, my roles are activated, and I am ready to go.
Controller Cluster Roles
When a controller cluster is formed, each node participating in the cluster is the master of something. You can ask a controller: “Are you the one who is in charge here? Are you the master of anything?”
To check a controller’s roles, run command: show control-cluster roles
Below is the output from two of my controller nodes.
If a node has disconnected from the cluster because of a failure, you can try to force it to rejoin the cluster by running command: join control-cluster Master-Controller-IP force
You can also review the cluster history and ensure there is no sign of host connection flapping, VNI join failures, or abnormal cluster membership changes. To verify this, run command: show control-cluster history
```
nsx-controller # show control-cluster history
===================================
Host nsx-controller
Node 25e364d3-1cfc-494f-9637-e52fd5ca8fc9 (192.168.109.251, unknown version)
---------------------------------
05/12 12:38:14: Node restarted
06/04 09:36:45: User shutdown
===================================
Host nsx-controller
Node 25e364d3-1cfc-494f-9637-e52fd5ca8fc9 (192.168.109.251, nicira-nvp-controller.6.4.1.8403660)
---------------------------------
06/04 09:38:23: Node restarted
06/04 09:38:29: Joining cluster via node 192.168.109.250
06/04 09:38:29: Waiting to join cluster
06/04 09:38:29: Role directory_server configured
06/04 09:38:29: Role logical_manager configured
06/04 09:38:29: Role switch_manager configured
06/04 09:38:29: Role api_provider configured
06/04 09:38:29: Role persistence_server configured
06/04 09:38:29: Joining cluster via node 192.168.109.251
06/04 09:38:29: Joining cluster via node 192.168.109.252
06/04 09:39:08: Joined cluster; initializing local components
06/04 09:39:08: Disconnected from cluster majority
06/04 09:39:08: Connected to cluster majority
06/04 09:39:11: Initializing data contact with cluster
06/04 09:39:20: Fetching initial configuration data
06/04 09:39:21: Role persistence_server activated
06/04 09:39:27: Join complete
06/04 09:39:28: Role api_provider activated
06/04 09:39:28: Role directory_server activated
06/04 09:39:28: Role logical_manager activated
06/04 09:39:28: Role switch_manager activated
06/04 09:47:05: Interrupted connection to cluster majority
06/04 09:47:05: Connected to cluster majority
06/04 09:47:05: Interrupted connection to cluster majority
06/04 09:47:05: Connected to cluster majority
```
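When the history is long, you can count flap events in a saved copy of the output instead of eyeballing it. A small sketch (count_flaps is my own helper; the sample lines are trimmed from the history output):

```python
def count_flaps(history_text):
    """Count 'Interrupted connection to cluster majority' events in
    saved 'show control-cluster history' output. Repeated interrupts
    in a short time window suggest connection flapping."""
    return sum(
        1 for line in history_text.splitlines()
        if "Interrupted connection to cluster majority" in line
    )

history = """06/04 09:47:05: Interrupted connection to cluster majority
06/04 09:47:05: Connected to cluster majority
06/04 09:47:05: Interrupted connection to cluster majority
06/04 09:47:05: Connected to cluster majority"""
print(count_flaps(history))  # 2 interrupts within one second looks like flapping
```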
You can check the details of the connections to and from a controller by running command: show network connections of-type tcp
Running this command is similar to running the netstat command on Linux.
This document from VMware lists all the CLI commands you can use to troubleshoot NSX Controller issues.
Troubleshoot Logical Switch transport zone and NSX Edge mappings
To display a full list of commands for logical switches on an NSX Controller, run show control-cluster logical-switches and hit Enter.
To find out which controller is responsible for managing which VNI
```
nsx-controller # show control-cluster logical-switches vni-table
VNI   Controller       BUM-Replication  ARP-Proxy  Connections  VTEPs  Active
8003  192.168.109.250  Enabled          Enabled    0            0      false
5002  192.168.109.252  Enabled          Enabled    2            1      true
5006  192.168.109.251  Enabled          Enabled    0            0      false
8002  192.168.109.250  Enabled          Enabled    0            0      false
5003  192.168.109.250  Enabled          Enabled    0            0      false
5007  192.168.109.252  Enabled          Enabled    1            1      true
8001  192.168.109.251  Enabled          Enabled    0            0      false
5000  192.168.109.252  Enabled          Enabled    2            1      true
5004  192.168.109.250  Enabled          Enabled    0            0      false
5008  192.168.109.250  Enabled          Enabled    0            0      false
8000  192.168.109.252  Enabled          Enabled    2            2      true
5001  192.168.109.251  Enabled          Enabled    0            0      false
5005  192.168.109.252  Enabled          Enabled    2            2      true
```
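If you need to feed this table into other tooling, it parses easily. A sketch assuming the column layout shown above (parse_vni_table is my own helper, not an NSX utility):

```python
def parse_vni_table(output):
    """Parse saved 'show control-cluster logical-switches vni-table'
    output into {vni: controller_ip}, skipping the header line."""
    mapping = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with a numeric VNI; the header starts with 'VNI'
        if len(parts) >= 2 and parts[0].isdigit():
            mapping[int(parts[0])] = parts[1]
    return mapping

sample = """VNI   Controller       BUM-Replication  ARP-Proxy  Connections  VTEPs  Active
5002  192.168.109.252  Enabled          Enabled    2            1      true
5006  192.168.109.251  Enabled          Enabled    0            0      false"""
owners = parse_vni_table(sample)
print(owners[5002])  # -> 192.168.109.252
```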
To check the configuration of a specific VNI
```
nsx-controller # show control-cluster logical-switches vni 5000
VNI   Controller       BUM-Replication  ARP-Proxy  Connections  VTEPs  Active
5000  192.168.109.252  Enabled          Enabled    2            1      true
```
To check connections of a specific VNI
```
nsx-controller # show control-cluster logical-switches connection-table 5001
Host-IP          Port   ID
192.168.109.120  55170  377
192.168.109.115  36458  380
```
To check VTEP table of a VNI
```
nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI   IP               Segment        MAC                Connection-ID  Is-Active  Out-Of-Sync
5001  192.168.109.245  192.168.109.0  00:50:56:64:9e:a6  380            YES        NO
```
Troubleshoot Logical Router interface and route mappings
To see the full list of commands to troubleshoot DLR issues, run command: show control-cluster logical-routers on the NSX Controller.
I have summarized a few handy commands that can be used while troubleshooting DLR issues.
To list all deployed distributed logical routers
```
nsx-controller # show control-cluster logical-routers instance all
LR-Id   LR-Name                                     Universal  Service-Controller  Egress-Locale  In-Sync  Sync-Category
0x1f40  edge-6f60b373-b4e1-4bba-a234-1990a8b1b44f  true       192.168.109.252     local          N/A      N/A
0x1388  edge-30                                     false      192.168.109.252     local          N/A      N/A
```
Interface summary for a specific DLR
```
nsx-controller # show control-cluster logical-routers interface-summary 0x1388
Interface     Type   Id            IP[]
13880000000c  vxlan  5002(0x138a)  172.16.30.1/24
138800000002  vxlan  5003(0x138b)  192.168.10.1/29
13880000000a  vxlan  5001(0x1389)  172.16.20.1/24
13880000000b  vxlan  5000(0x1388)  172.16.10.1/24
```
Static routes for a specific DLR
```
nsx-controller # show control-cluster logical-routers routes 0x1388
Destination      Next-Hop[]    Preference  Locale-Id                             Source
0.0.0.0/0        192.168.10.2  1           00000000-0000-0000-0000-000000000000  CONTROL_VM
192.168.20.0/29  192.168.10.2  200         00000000-0000-0000-0000-000000000000  CONTROL_VM
```
DLR edge connection list
```
nsx-controller # show control-cluster logical-routers edge-connections 0x1388
Id  IP               Version  Locale-Id                             Sync-State
15  192.168.109.120  6.2      00000000-0000-0000-0000-000000000000  OK
```
Troubleshoot distributed and edge firewall implementations
The Distributed Firewall (DFW) module runs inside the ESXi host kernel, so the host has all the information about which policies are configured for the virtual machines running on it.
We can debug and troubleshoot DFW issues from the ESXi host command line. The two most important commands we will be using are summarize-dvfilter and vsipioctl.
summarize-dvfilter: This command takes no arguments; it simply prints out all of the dvfilters as shown below.
```
[root@esxi-01a:~] summarize-dvfilter
Fastpaths:
agent: dvfilter-faulter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter
agent: ESXi-Firewall, refCount: 6, rev: 0x1010000, apiRev: 0x1010000, module: esxfw
agent: dvfilter-generic-vmware, refCount: 2, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter-generic-fastpath
agent: dvfg-igmp, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfg-igmp
agent: dvfilter-generic-vmware-swsec, refCount: 7, rev: 0x1010000, apiRev: 0x1010000, module: nsx-dvfilter-switch-security
agent: bridgelearningfilter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vdrb
agent: vmware-sfw, refCount: 8, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vsip
```
We need to get the VM UUID in order to view the DFW policies for a given VM.
```
[root@esxi-01a:~] summarize-dvfilter | grep App01-New
world 12009418 vmm0:App01-New vcUuid:'50 23 0f 82 c6 c0 c1 7b-ab 1c 76 c9 e5 12 98 f2'
 port 33554491 App01-New.eth0
```
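The vcUuid can also be pulled out of saved summarize-dvfilter output with a short script. A sketch (extract_vc_uuid is my own helper; the sample line is taken from the output above):

```python
import re

def extract_vc_uuid(dvfilter_text, vm_name):
    """Pull the vcUuid for a named VM out of summarize-dvfilter output."""
    pattern = re.compile(r"vmm0:%s\s+vcUuid:'([^']+)'" % re.escape(vm_name))
    m = pattern.search(dvfilter_text)
    return m.group(1) if m else None

sample = ("world 12009418 vmm0:App01-New "
          "vcUuid:'50 23 0f 82 c6 c0 c1 7b-ab 1c 76 c9 e5 12 98 f2'")
print(extract_vc_uuid(sample, "App01-New"))
```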
Once we have the UUID of a VM, we use the vsipioctl command to view the policies.
1: vsipioctl getfilters: This command shows the UUIDs of all VMs as well as the NICs on which DFW is operating.
```
[root@esxi-01a:~] vsipioctl getfilters

Filter Name     : nic-12009418-eth0-vmware-sfw.2
VM UUID         : 50 23 0f 82 c6 c0 c1 7b-ab 1c 76 c9 e5 12 98 f2
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 30241

Filter Name     : nic-12944314-eth0-vmware-sfw.2
VM UUID         : 50 23 5c 2c 98 4c 2d bb-16 a4 79 a8 9d 2e 4b 6f
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 28802

Filter Name     : nic-13247978-eth0-vmware-sfw.2
VM UUID         : 50 23 f9 12 9e 9e ce 28-8b ba 66 83 dd 14 9a c9
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 45088

Filter Name     : nic-13247960-eth0-vmware-sfw.2
VM UUID         : 50 23 d1 17 84 7c a2 42-72 6c b5 b8 14 17 c1 1b
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 28573

Filter Name     : nic-13247967-eth0-vmware-sfw.2
VM UUID         : 50 23 5a 34 48 cb 48 eb-e1 3b fb af 57 a3 40 c5
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 65381

Filter Name     : nic-13249254-eth0-vmware-sfw.2
VM UUID         : 50 23 52 44 c7 66 94 c6-d1 9b 58 26 60 54 3c b6
VNIC Index      : 0
Service Profile : --NOT SET--
```
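To pair each VM UUID with its filter name programmatically, the getfilters output can be parsed line by line. A sketch (parse_getfilters is my own helper; the sample is trimmed from the output above):

```python
def parse_getfilters(output):
    """Map VM UUID -> filter name from saved 'vsipioctl getfilters' output."""
    filters = {}
    name = None
    for line in output.splitlines():
        if line.startswith("Filter Name"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VM UUID") and name:
            filters[line.split(":", 1)[1].strip()] = name
    return filters

sample = """Filter Name     : nic-12009418-eth0-vmware-sfw.2
VM UUID         : 50 23 0f 82 c6 c0 c1 7b-ab 1c 76 c9 e5 12 98 f2
VNIC Index      : 0"""
print(parse_getfilters(sample))
```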
To view all the DFW rules configured for a particular VM, run the getrules command and specify the filter name (as reported by getfilters) as shown below.
```
[root@esxi-01a:~] vsipioctl getrules -f nic-12009418-eth0-vmware-sfw.2
ruleset domain-c7 {
  # Filter rules
  rule 2147483649 at 1 inout protocol any from any to any accept;
  rule 1003 at 2 inout protocol ipv6-icmp icmptype 135 from any to any accept;
  rule 1003 at 3 inout protocol ipv6-icmp icmptype 136 from any to any accept;
  rule 1002 at 4 inout protocol udp from any to any port 67 accept;
  rule 1002 at 5 inout protocol udp from any to any port 68 accept;
  rule 1001 at 6 inout protocol any from any to any accept;
}

ruleset domain-c7_L2 {
  # Filter rules
  rule 1004 at 1 inout ethertype any stateless from any to any accept;
}
```
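For auditing, the rule lines can be broken into fields with a regular expression. A sketch covering the L3 "protocol" rules shown above (the L2 "ethertype" rules would need a second pattern; parse_rules is my own helper):

```python
import re

# rule <id> at <pos> <direction> protocol <proto> ... <action>;
RULE_RE = re.compile(
    r"rule (\d+) at (\d+) (inout|in|out) protocol (\S+) .*? (accept|drop|reject);"
)

def parse_rules(ruleset_text):
    """Extract (rule_id, position, direction, protocol, action) tuples
    from saved 'vsipioctl getrules' output."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3), m.group(4), m.group(5))
        for m in RULE_RE.finditer(ruleset_text)
    ]

sample = """rule 1002 at 4 inout protocol udp from any to any port 67 accept;
rule 1001 at 6 inout protocol any from any to any accept;"""
for r in parse_rules(sample):
    print(r)
```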
And that’s it for this post.
I hope you found this post informative. Feel free to share it on social media if you think it is worth sharing. Be sociable 🙂