Troubleshoot NSX Manager Services
If you are facing any NSX-related issues, the NSX Manager UI is the first place to verify which service or services are impacted. Typically you can check the status of the following services from the NSX Manager UI (https://NSX-FQDN/login.jsp):
- vPostgres
- RabbitMQ
- NSX Management Service
- NSX Universal Synchronization Service (Only when you have Cross vCenter NSX Configured)
If any service is in a stopped state, try starting or restarting it.
You can also check logs from the NSX Manager CLI to determine what is broken. The two important logs you can check are the NSX Manager log and the System log. These logs can be viewed by firing the commands: show log manager & show log system. You can append the word follow to watch a log in real time (similar to the Linux tail -f command).
If any service is crashing or not starting, check the bottom of the log for the latest entries; they should give you more information on why it is failing.
To view the log from the bottom, run command: show log manager reverse and check for the keywords ERROR, WARNING, FATAL, or EXCEPTION.
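If you have pulled a copy of the manager log off the appliance, the same keyword triage can be scripted. A minimal sketch (the function name and sample log lines below are mine, not real NSX output):

```python
import re

# Keywords worth scanning for when a service will not start
KEYWORDS = re.compile(r"\b(ERROR|WARNING|FATAL|EXCEPTION)\b")

def last_problem_lines(log_text, limit=20):
    """Return the most recent log lines containing a problem keyword,
    newest first (mimicking 'show log manager reverse')."""
    hits = [line for line in log_text.splitlines() if KEYWORDS.search(line)]
    return list(reversed(hits))[:limit]

# Hypothetical log excerpt for illustration
sample = """2023-06-04 09:38:23 INFO  service starting
2023-06-04 09:38:24 ERROR failed to bind port 443
2023-06-04 09:38:25 FATAL shutting down"""
for line in last_problem_lines(sample):
    print(line)
```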
If NSX Manager is having connectivity issues with either vCenter Server or an ESXi host, run the command: debug connection IP_of_ESXi_or_VC and examine the output.
```
nsxmgr-01a.corp.local> debug connection vcsa-01a.corp.local
PING vcsa-01a.corp.local (192.168.109.113): 56 data bytes
64 bytes from 192.168.109.113: icmp_seq=0 ttl=64 time=0.371 ms
64 bytes from 192.168.109.113: icmp_seq=1 ttl=64 time=1.096 ms
64 bytes from 192.168.109.113: icmp_seq=2 ttl=64 time=0.379 ms
--- vcsa-01a.corp.local ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.371/0.615/1.096/0.340 ms
vcsa-01a.corp.local reachable
vcsa-01a.corp.local reachable over port 443
vcsa-01a.corp.local not reachable over port 902
vcsa-01a.corp.local not reachable over port 903
```
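If you cannot get to the NSX CLI, the TCP port checks that debug connection performs can be approximated from any machine that can reach the endpoints. A hedged sketch (tcp_reachable is my own helper; the hostname and ports are taken from the example output, so expect "not reachable" outside that lab):

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, roughly
    what 'debug connection' reports as 'reachable over port N'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timed out, and DNS failures
        return False

# vCenter-facing ports from the output above (443 = HTTPS,
# 902/903 = host management and console traffic)
for port in (443, 902, 903):
    state = "reachable" if tcp_reachable("vcsa-01a.corp.local", port) else "not reachable"
    print(f"vcsa-01a.corp.local {state} over port {port}")
```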
You can also configure NSX Manager to forward all logs to a centralized syslog server.
In my lab, I have a syslog server configured on one of the Linux boxes, and NSX Manager is forwarding its logs there.
Download Technical Supports logs from NSX Manager
When you want to provide NSX Manager logs to the VMware support team for analysis, or keep them for internal investigation later, you can download them from the NSX Manager UI by clicking on “Download Tech Support Log”.
The log collection process will start.
Once the process is completed, click on the Download button to save the logs to your local system.
NSX Manager/controller logs can be collected from the vCenter Web Client as well by navigating to Networking & Security > Support Bundle, selecting the components, and then clicking on “Start Bundle Collection”.
Once the log bundle generation is completed, click on the Download button to get the bundle.
Troubleshoot Host Preparation Issues
When we kick off the ESXi host preparation task, NSX installs the VIBs on the ESXi hosts. Before populating the vSphere clusters with workloads, we have to ensure that all ESXi hosts are prepared and their status is green.
Over time, as changes (host reimage or NSX Manager upgrade) are made in the NSX environment, you might see some hosts reporting as Not Ready.
To resolve host preparation issues, log in to the vCenter Web Client, navigate to the Networking & Security > Installation > Host Preparation tab, select the affected host, click on the gear icon, and hit Resolve to push the VIBs back onto the host.
Sometimes this operation fails and you have to install the NSX VIBs on the host manually. Refer to this article for manual installation of NSX VIBs.
Verify that the Communication Channel health status is green at the host/cluster level.
If the issue is related to one particular host, don't forget to check the DNS records (forward and reverse) for that host as well. An incorrect DNS entry for a host can also cause VIB installation to fail on that host.
If more than one host is affected, verify that the RabbitMQ service on NSX Manager is up and running. Optionally, you can try restarting the service.
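A quick way to sanity-check that forward and reverse DNS agree for a host is a short script like the following (dns_consistent is my own helper, and the lab hostname is hypothetical):

```python
import socket

def dns_consistent(hostname):
    """Check that forward and reverse DNS for a host agree:
    hostname -> IP -> a PTR name that matches back."""
    try:
        ip = socket.gethostbyname(hostname)           # forward lookup (A record)
    except OSError:
        return False, "forward lookup failed"
    try:
        rname, aliases, _ = socket.gethostbyaddr(ip)  # reverse lookup (PTR record)
    except OSError:
        return False, f"no PTR record for {ip}"
    names = {rname, *aliases}
    return hostname in names, f"{hostname} -> {ip} -> {sorted(names)}"

# Hypothetical lab host; substitute the host failing VIB installation
print(dns_consistent("esxi-01a.corp.local"))
```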
Troubleshoot NSX Controller cluster status, roles and connectivity
Troubleshooting controller connectivity isn’t too difficult. If you have issues with one of the controllers and it is not recoverable, you can always delete the borked controller and deploy a new one, which is the easiest and quickest method to fix the issue.
Controller Cluster Status
In case of issues, verify the NSX Controller cluster status. Make sure each controller is showing as healthy and its status as Connected.
As per this document, VMware recommends deleting the entire controller cluster when one or more of the controllers encounter catastrophic, unrecoverable errors that cannot be fixed. In this case, delete all controllers, even if some of them seem healthy. To be frank, this sounds weird to me.
You can also verify the status of the controllers from CLI by running command: show control-cluster status
You will see output with the controller essentially stating: yeah, I am fine. My join status is Complete, I am connected to the cluster majority, and I can be safely restarted. Here are my cluster ID and UUID. I am happy, my roles are activated, and I am ready to go.
Controller Cluster Roles
When a controller cluster is formed, each node participating in the cluster is the master of something. You can ask a controller: “Are you the one who is in charge here? Are you the master of anything?”
To check a controller’s roles, run command: show control-cluster roles
Below is the output from two of my controller nodes.
If a node has disconnected from the cluster because of a failure, you can try to force it to rejoin the cluster by running command: join control-cluster Master-Controller-IP force
You can also review the cluster history and ensure there is no sign of host connection flapping, VNI join failures, or abnormal cluster membership changes. To verify this, run command: show control-cluster history
```
nsx-controller # show control-cluster history
===================================
Host nsx-controller
Node 25e364d3-1cfc-494f-9637-e52fd5ca8fc9 (192.168.109.251, unknown version)
---------------------------------
05/12 12:38:14: Node restarted
06/04 09:36:45: User shutdown
===================================
Host nsx-controller
Node 25e364d3-1cfc-494f-9637-e52fd5ca8fc9 (192.168.109.251, nicira-nvp-controller.6.4.1.8403660)
---------------------------------
06/04 09:38:23: Node restarted
06/04 09:38:29: Joining cluster via node 192.168.109.250
06/04 09:38:29: Waiting to join cluster
06/04 09:38:29: Role directory_server configured
06/04 09:38:29: Role logical_manager configured
06/04 09:38:29: Role switch_manager configured
06/04 09:38:29: Role api_provider configured
06/04 09:38:29: Role persistence_server configured
06/04 09:38:29: Joining cluster via node 192.168.109.251
06/04 09:38:29: Joining cluster via node 192.168.109.252
06/04 09:39:08: Joined cluster; initializing local components
06/04 09:39:08: Disconnected from cluster majority
06/04 09:39:08: Connected to cluster majority
06/04 09:39:11: Initializing data contact with cluster
06/04 09:39:20: Fetching initial configuration data
06/04 09:39:21: Role persistence_server activated
06/04 09:39:27: Join complete
06/04 09:39:28: Role api_provider activated
06/04 09:39:28: Role directory_server activated
06/04 09:39:28: Role logical_manager activated
06/04 09:39:28: Role switch_manager activated
06/04 09:47:05: Interrupted connection to cluster majority
06/04 09:47:05: Connected to cluster majority
06/04 09:47:05: Interrupted connection to cluster majority
06/04 09:47:05: Connected to cluster majority
```
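When the history is long, you can count flap events in a saved copy of the output instead of eyeballing it. A small sketch (count_flaps is my own helper; the sample lines are trimmed from the history output):

```python
def count_flaps(history_text):
    """Count 'Interrupted connection to cluster majority' events in
    saved 'show control-cluster history' output. Repeated interrupts
    in a short time window suggest connection flapping."""
    return sum(
        1 for line in history_text.splitlines()
        if "Interrupted connection to cluster majority" in line
    )

history = """06/04 09:47:05: Interrupted connection to cluster majority
06/04 09:47:05: Connected to cluster majority
06/04 09:47:05: Interrupted connection to cluster majority
06/04 09:47:05: Connected to cluster majority"""
print(count_flaps(history))  # 2 interrupts within one second looks like flapping
```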
You can check the details of the connections to and from a controller by running command: show network connections of-type tcp
Running this command is similar to running the netstat command on Linux.
This document from VMware lists all the CLI commands you can use to troubleshoot NSX Controller issues.
Troubleshoot Logical Switch transport zone and NSX Edge mappings
To display a full list of commands for logical switches on an NSX Controller, run show control-cluster logical-switches and hit Enter.
To find out which controller is responsible for managing which VNI
```
nsx-controller # show control-cluster logical-switches vni-table
VNI   Controller       BUM-Replication  ARP-Proxy  Connections  VTEPs  Active
8003  192.168.109.250  Enabled          Enabled    0            0      false
5002  192.168.109.252  Enabled          Enabled    2            1      true
5006  192.168.109.251  Enabled          Enabled    0            0      false
8002  192.168.109.250  Enabled          Enabled    0            0      false
5003  192.168.109.250  Enabled          Enabled    0            0      false
5007  192.168.109.252  Enabled          Enabled    1            1      true
8001  192.168.109.251  Enabled          Enabled    0            0      false
5000  192.168.109.252  Enabled          Enabled    2            1      true
5004  192.168.109.250  Enabled          Enabled    0            0      false
5008  192.168.109.250  Enabled          Enabled    0            0      false
8000  192.168.109.252  Enabled          Enabled    2            2      true
5001  192.168.109.251  Enabled          Enabled    0            0      false
5005  192.168.109.252  Enabled          Enabled    2            2      true
```
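If you need to feed this table into other tooling, it parses easily. A sketch assuming the column layout shown above (parse_vni_table is my own helper, not an NSX utility):

```python
def parse_vni_table(output):
    """Parse saved 'show control-cluster logical-switches vni-table'
    output into {vni: controller_ip}, skipping the header line."""
    mapping = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with a numeric VNI; the header starts with 'VNI'
        if len(parts) >= 2 and parts[0].isdigit():
            mapping[int(parts[0])] = parts[1]
    return mapping

sample = """VNI   Controller       BUM-Replication  ARP-Proxy  Connections  VTEPs  Active
5002  192.168.109.252  Enabled          Enabled    2            1      true
5006  192.168.109.251  Enabled          Enabled    0            0      false"""
owners = parse_vni_table(sample)
print(owners[5002])  # -> 192.168.109.252
```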
To check the configuration of a specific VNI
```
nsx-controller # show control-cluster logical-switches vni 5000
VNI   Controller       BUM-Replication  ARP-Proxy  Connections  VTEPs  Active
5000  192.168.109.252  Enabled          Enabled    2            1      true
```
To check connections of a specific VNI
```
nsx-controller # show control-cluster logical-switches connection-table 5001
Host-IP          Port   ID
192.168.109.120  55170  377
192.168.109.115  36458  380
```
To check VTEP table of a VNI
```
nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI   IP               Segment        MAC                Connection-ID  Is-Active  Out-Of-Sync
5001  192.168.109.245  192.168.109.0  00:50:56:64:9e:a6  380            YES        NO
```
Troubleshoot Logical Router interface and route mappings
To see the full list of commands to troubleshoot DLR issues, run command: show control-cluster logical-routers on the NSX Controller.
I have summarized a few handy commands that can be used while troubleshooting DLR issues.
To list all deployed distributed logical routers
```
nsx-controller # show control-cluster logical-routers instance all
LR-Id   LR-Name                                     Universal  Service-Controller  Egress-Locale  In-Sync  Sync-Category
0x1f40  edge-6f60b373-b4e1-4bba-a234-1990a8b1b44f  true       192.168.109.252     local          N/A      N/A
0x1388  edge-30                                     false      192.168.109.252     local          N/A      N/A
```
Interface summary for a specific DLR
```
nsx-controller # show control-cluster logical-routers interface-summary 0x1388
Interface     Type   Id            IP[]
13880000000c  vxlan  5002(0x138a)  172.16.30.1/24
138800000002  vxlan  5003(0x138b)  192.168.10.1/29
13880000000a  vxlan  5001(0x1389)  172.16.20.1/24
13880000000b  vxlan  5000(0x1388)  172.16.10.1/24
```
Static routes for a specific DLR
```
nsx-controller # show control-cluster logical-routers routes 0x1388
Destination      Next-Hop[]    Preference  Locale-Id                             Source
0.0.0.0/0        192.168.10.2  1           00000000-0000-0000-0000-000000000000  CONTROL_VM
192.168.20.0/29  192.168.10.2  200         00000000-0000-0000-0000-000000000000  CONTROL_VM
```
DLR edge connection list
```
nsx-controller # show control-cluster logical-routers edge-connections 0x1388
Id  IP               Version  Locale-Id                             Sync-State
15  192.168.109.120  6.2      00000000-0000-0000-0000-000000000000  OK
```
Troubleshoot distributed and edge firewall implementations
The Distributed Firewall (DFW) module runs inside the ESXi host kernel, so the host has all the information about which policies are configured for the virtual machines running on it.
We can debug and troubleshoot DFW issues from the ESXi host command line. The two most important commands we will be using are summarize-dvfilter and vsipioctl.
summarize-dvfilter: This command takes no arguments; it simply prints out all of the dvfilters as shown below.
```
[root@esxi-01a:~] summarize-dvfilter
Fastpaths:
agent: dvfilter-faulter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter
agent: ESXi-Firewall, refCount: 6, rev: 0x1010000, apiRev: 0x1010000, module: esxfw
agent: dvfilter-generic-vmware, refCount: 2, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter-generic-fastpath
agent: dvfg-igmp, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfg-igmp
agent: dvfilter-generic-vmware-swsec, refCount: 7, rev: 0x1010000, apiRev: 0x1010000, module: nsx-dvfilter-switch-security
agent: bridgelearningfilter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vdrb
agent: vmware-sfw, refCount: 8, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vsip
```
We need to get the VM UUID in order to view the DFW policies for a given VM.
```
[root@esxi-01a:~] summarize-dvfilter | grep App01-New
world 12009418 vmm0:App01-New vcUuid:'50 23 0f 82 c6 c0 c1 7b-ab 1c 76 c9 e5 12 98 f2'
 port 33554491 App01-New.eth0
```
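The vcUuid can also be pulled out of saved summarize-dvfilter output with a short script. A sketch (extract_vc_uuid is my own helper; the sample line is taken from the output above):

```python
import re

def extract_vc_uuid(dvfilter_text, vm_name):
    """Pull the vcUuid for a named VM out of summarize-dvfilter output."""
    pattern = re.compile(r"vmm0:%s\s+vcUuid:'([^']+)'" % re.escape(vm_name))
    m = pattern.search(dvfilter_text)
    return m.group(1) if m else None

sample = ("world 12009418 vmm0:App01-New "
          "vcUuid:'50 23 0f 82 c6 c0 c1 7b-ab 1c 76 c9 e5 12 98 f2'")
print(extract_vc_uuid(sample, "App01-New"))
```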
Once we have the UUID of a VM, we use the vsipioctl command to view the policies.
1: vsipioctl getfilters: This command shows the UUIDs of all VMs as well as the NICs on which DFW is operating.
```
[root@esxi-01a:~] vsipioctl getfilters

Filter Name     : nic-12009418-eth0-vmware-sfw.2
VM UUID         : 50 23 0f 82 c6 c0 c1 7b-ab 1c 76 c9 e5 12 98 f2
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 30241

Filter Name     : nic-12944314-eth0-vmware-sfw.2
VM UUID         : 50 23 5c 2c 98 4c 2d bb-16 a4 79 a8 9d 2e 4b 6f
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 28802

Filter Name     : nic-13247978-eth0-vmware-sfw.2
VM UUID         : 50 23 f9 12 9e 9e ce 28-8b ba 66 83 dd 14 9a c9
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 45088

Filter Name     : nic-13247960-eth0-vmware-sfw.2
VM UUID         : 50 23 d1 17 84 7c a2 42-72 6c b5 b8 14 17 c1 1b
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 28573

Filter Name     : nic-13247967-eth0-vmware-sfw.2
VM UUID         : 50 23 5a 34 48 cb 48 eb-e1 3b fb af 57 a3 40 c5
VNIC Index      : 0
Service Profile : --NOT SET--
Filter Hash     : 65381

Filter Name     : nic-13249254-eth0-vmware-sfw.2
VM UUID         : 50 23 52 44 c7 66 94 c6-d1 9b 58 26 60 54 3c b6
VNIC Index      : 0
Service Profile : --NOT SET--
```
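To pair each VM UUID with its filter name programmatically, the getfilters output can be parsed line by line. A sketch (parse_getfilters is my own helper; the sample is trimmed from the output above):

```python
def parse_getfilters(output):
    """Map VM UUID -> filter name from saved 'vsipioctl getfilters' output."""
    filters = {}
    name = None
    for line in output.splitlines():
        if line.startswith("Filter Name"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VM UUID") and name:
            filters[line.split(":", 1)[1].strip()] = name
    return filters

sample = """Filter Name     : nic-12009418-eth0-vmware-sfw.2
VM UUID         : 50 23 0f 82 c6 c0 c1 7b-ab 1c 76 c9 e5 12 98 f2
VNIC Index      : 0"""
print(parse_getfilters(sample))
```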
To view all the DFW rules configured for a particular VM, run the getrules command and specify the filter name (as reported by getfilters) as shown below.
```
[root@esxi-01a:~] vsipioctl getrules -f nic-12009418-eth0-vmware-sfw.2
ruleset domain-c7 {
  # Filter rules
  rule 2147483649 at 1 inout protocol any from any to any accept;
  rule 1003 at 2 inout protocol ipv6-icmp icmptype 135 from any to any accept;
  rule 1003 at 3 inout protocol ipv6-icmp icmptype 136 from any to any accept;
  rule 1002 at 4 inout protocol udp from any to any port 67 accept;
  rule 1002 at 5 inout protocol udp from any to any port 68 accept;
  rule 1001 at 6 inout protocol any from any to any accept;
}

ruleset domain-c7_L2 {
  # Filter rules
  rule 1004 at 1 inout ethertype any stateless from any to any accept;
}
```
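For auditing, the rule lines can be broken into fields with a regular expression. A sketch covering the L3 "protocol" rules shown above (the L2 "ethertype" rules would need a second pattern; parse_rules is my own helper):

```python
import re

# rule <id> at <pos> <direction> protocol <proto> ... <action>;
RULE_RE = re.compile(
    r"rule (\d+) at (\d+) (inout|in|out) protocol (\S+) .*? (accept|drop|reject);"
)

def parse_rules(ruleset_text):
    """Extract (rule_id, position, direction, protocol, action) tuples
    from saved 'vsipioctl getrules' output."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3), m.group(4), m.group(5))
        for m in RULE_RE.finditer(ruleset_text)
    ]

sample = """rule 1002 at 4 inout protocol udp from any to any port 67 accept;
rule 1001 at 6 inout protocol any from any to any accept;"""
for r in parse_rules(sample):
    print(r)
```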
And that’s it for this post.
I hope you found this post informative. Feel free to share it on social media if you think it is worth sharing. Be sociable 🙂