Objective 2.3 of the VCAP6-Deploy exam covers the following topics:
- Analyze and resolve storage multi-pathing and failover issues
- Troubleshoot storage device connectivity
- Analyze and resolve Virtual SAN configuration issues
- Troubleshoot iSCSI connectivity issues
- Analyze and resolve NFS issues
- Troubleshoot RDM issues
Let's discuss each topic one by one.
Analyze and resolve storage multi-pathing and failover issues
There can be hundreds of reasons for multipathing and failover issues, and troubleshooting them is a skill that comes with experience. Multipathing problems can originate on the storage side (SAN switch, fibre configuration, etc.) or on the vSphere side. In this post we will focus only on vSphere-side troubleshooting.
In my lab I am using an openfiler appliance for shared storage, and my vSphere hosts are configured to use software iSCSI to reach it. Each host has 2 physical adapters mapped to two distinct portgroups configured for iSCSI connections, and both portgroups are compliant with the iSCSI port binding requirements.
VMware KB-1027963 explains the storage path failover sequence in vSphere in great detail. Messages about path failover are recorded in /var/log/vmkernel.log.
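If you suspect a failover has occurred, a quick grep of the log pulls out the relevant entries; the keywords below are only illustrative, as the exact message text depends on the SATP/PSP in use:
# grep -i failover /var/log/vmkernel.log
# grep -i vmw_psp /var/log/vmkernel.log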
Change multipathing policy and Enable/disable paths manually
Multipathing policy changes and path failover can be triggered manually via the Web Client or the ESXi shell.
Changing the Multipathing Policy: Select an ESXi host from the inventory, navigate to Manage > Storage > Storage Devices, and select a device from the list.
Go to the Properties tab and select Edit Multipathing.
Select one of Fixed, MRU, or Round Robin and hit OK. To know more about these policies in detail, please refer to this article.
Refresh the Web Client to verify that the policy change has taken effect.
Enable/Disable a Path: To enable or disable a path manually, go to the Paths tab instead of the Properties tab of the selected storage device.
Select a path; if it is Active, click the Disable button. If a path is already disabled, the Enable button will be highlighted instead.
Change Multipathing Policy from the Command Line
Connect to the ESXi host over SSH, log in as root, and run the below command to change the multipathing policy:
# esxcli storage nmp device set -d <naa_id_of_device> -P <path_policy>
For example, to change the multipathing policy of a LUN from MRU to Fixed, you would run the below command:
# esxcli storage nmp device set -d t10.F405E46494C45425645447059546D2E6256413D213E61776 -P VMW_PSP_FIXED
Note: The device identifier (naa_id) can be obtained from the Web Client or via the command: esxcli storage nmp device list
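To confirm the policy on a single device, the same list command accepts a device filter; a minimal check (using the device ID from the earlier example) looks like this, with the current Path Selection Policy shown in the output:
# esxcli storage nmp device list -d t10.F405E46494C45425645447059546D2E6256413D213E61776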
Disable a path via command line
To disable a path via the CLI, run the below command:
# esxcli storage core path set --state=off --path=vmhba33:C0:T0:L0
and you will see the path go dead.
To enable the path again, run the below command:
# esxcli storage core path set --state=active --path=vmhba33:C0:T0:L0
In /var/log/vmkernel.log you will see the following log entry for this event:
2017-12-07T15:19:02.813Z cpu3:102867 opID=87ff36cf)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:479: Changing active path from NONE to vmhba33:C0:T0:L0 for device "t10.F405E46494C45425645447059546D2E6256413D213E61776".
In hostd.log you will see the following:
2017-12-07T15:29:27.965Z info hostd[6DA80B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 302 : Path redundancy to storage device t10.F405E46494C45425645447059546D2E6256413D213E61776 (Datastores: iscsi-3) restored. Path vmhba33:C0:T0:L0 is active again.
Changing the Default Pathing Policy
Check the current default PSP for each SATP:
# esxcli storage nmp satp list
Change the Default Path Policy:
# esxcli storage nmp satp set --default-psp=<policy> --satp=<satp_name>
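As an illustration, assuming your array is claimed by the generic active/active SATP, making Round Robin the default PSP for it would look something like the below. Note that existing devices typically keep their already-assigned policy until it is changed per device:
# esxcli storage nmp satp set --default-psp=VMW_PSP_RR --satp=VMW_SATP_DEFAULT_AA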
Monitoring Storage Performance
Storage performance can be monitored via esxtop. VMware KB-1008205 lists the steps to identify storage performance issues
hba view in my lab
device view
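As a quick reference: in interactive esxtop, press d for the disk adapter (HBA) view, u for the disk device view and v for the VM disk view, and keep an eye on the DAVG, KAVG and GAVG latency counters. To capture data for offline analysis, batch mode can be used; the interval and sample count below are arbitrary values:
# esxtop -b -d 10 -n 6 > /tmp/esxtop-storage.csv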
Troubleshoot storage device connectivity
The VMware vSphere Troubleshooting guide explains in detail what to check when troubleshooting storage connectivity. A few of the checks are:
- Check cable connectivity
- Check zoning for FC
- Check the access control configuration; for iSCSI, additionally verify that CHAP, IP-based filtering, and initiator name-based access control are set up correctly
- Make sure cross patching is correct between storage controllers
- After any changes, rescan the HBA/host (see the commands below)
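A rescan after presentation or zoning changes can also be triggered from the CLI; either of the commands below should do (the adapter name is just an example from my lab):
# esxcli storage core adapter rescan --all
# esxcli storage core adapter rescan --adapter=vmhba33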
The maximum queue depth can be changed on an ESXi host. If an ESXi host generates more commands to a LUN than the LUN queue depth can handle, the excess commands are queued in the VMkernel, which increases latency.
To change the queue depth on an FC HBA, run the following and reboot the host:
# esxcli system module parameters set -p parameter=value -m module
For iSCSI, run the following:
# esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=value
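To verify that the new value actually took effect after the reboot, list the current module parameters (shown here for the software iSCSI module):
# esxcli system module parameters list -m iscsi_vmk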
Some useful CLI commands:
- Get info about the storage device you are trying to troubleshoot: esxcli storage core device list
- Check the multipath configuration with the esxcfg-mpath command, for example: esxcfg-mpath -b -d naa.60014054ddcf82083c44f8da7394198a
Analyze and Resolve vSAN Configuration Issues
Host with the vSAN Service enabled is not in the vCenter Cluster: Add the host to the vSAN Cluster
Host is in a vSAN enabled cluster but does not have the vSAN Service enabled: Verify that the vSAN network on the host is present and configured properly.
vSAN network is not configured: Configure the vSAN network on the virtual switches that will connect the vSAN cluster.
Host cannot communicate with all other nodes in the vSAN enabled cluster: Check for network isolation.
The following commands are useful for retrieving vSAN info.
Check the vSAN network configuration:
# esxcli vsan network list
vSAN VMkernel portgroup settings:
# esxcli network ip interface list -i <vSAN VMkernel PG>
Check vSAN cluster status
# esxcli vsan cluster get
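If the cluster output shows the host as isolated or with missing members, a basic connectivity test from the vSAN VMkernel interface to another node's vSAN IP helps rule out network isolation; the interface name and IP below are placeholders from my lab, not defaults:
# vmkping -I vmk3 192.168.110.12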
Troubleshoot iSCSI Connectivity Issues
Common issues that can arise in an environment which is based on iSCSI storage are:
1: No targets from an array are seen by any ESXi host, by a subset of hosts, or by a single host.
2: Targets on the array are visible but one or more LUNs are not visible.
3: An iSCSI LUN is not visible.
4: An iSCSI LUN cannot connect
5: There are connectivity issues to the storage array
6: A LUN is missing
The basic troubleshooting steps for diagnosing the above-mentioned issues are outlined below.
1: Verify connectivity between host and storage array
[root@esxi01:~] vmkping -I vmk2 192.168.106.6
PING 192.168.106.6 (192.168.106.6): 56 data bytes
64 bytes from 192.168.106.6: icmp_seq=0 ttl=64 time=0.432 ms
64 bytes from 192.168.106.6: icmp_seq=1 ttl=64 time=0.450 ms
64 bytes from 192.168.106.6: icmp_seq=2 ttl=64 time=0.378 ms
2: Verify the ESXi host can reach the storage array on port 3260
[root@esxi01:~] nc -z 192.168.106.6 3260
Connection to 192.168.106.6 3260 port [tcp/*] succeeded!
3: Verify Port Binding Settings
For iSCSI-based storage, verify the port binding configuration and make sure the paths are active and compliant.
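The port binding state can also be checked from the CLI; the command below lists the VMkernel ports bound to the software iSCSI adapter (vmhba33 here is the software iSCSI adapter in my lab):
# esxcli iscsi networkportal list --adapter=vmhba33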
4: Verify that large packets can be sent to the storage array (if jumbo frames are configured)
[root@esxi01:~] vmkping -s 1500 192.168.109.6 -d
PING 192.168.109.6 (192.168.109.6): 1500 data bytes
sendto() failed (Message too long)
sendto() failed (Message too long)
sendto() failed (Message too long)
--- 192.168.109.6 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

[root@esxi01:~] vmkping -s 1200 192.168.109.6 -d
PING 192.168.109.6 (192.168.109.6): 1200 data bytes
1208 bytes from 192.168.109.6: icmp_seq=0 ttl=64 time=0.736 ms
1208 bytes from 192.168.109.6: icmp_seq=1 ttl=64 time=0.390 ms
1208 bytes from 192.168.109.6: icmp_seq=2 ttl=64 time=0.373 ms
5: Ensure that the LUNs are presented to the ESXi/ESX hosts.
On the array side, ensure that the LUN IQNs and access control list (ACL) allow the ESXi/ESX host HBAs to access the array targets. For instructions please refer to VMware KB-1003955.
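From the host side it is also worth confirming that dynamic discovery returns the expected targets and that iSCSI sessions are actually established; the commands below again assume the software iSCSI adapter is vmhba33:
# esxcli iscsi adapter discovery sendtarget list --adapter=vmhba33
# esxcli iscsi session list --adapter=vmhba33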
6: Ensure that the host LUN ID on the array does not exceed 255. The maximum LUN ID is 255; any LUN with an ID greater than 255 may not show as available under Storage Adapters.
7: Verify that the host HBAs are able to access the shared storage. For instructions please refer to VMware KB-1003973.
8: Verify CHAP authentication settings
If CHAP is configured on the array, ensure that the authentication settings for the ESXi/ESX hosts are the same as the settings on the array. VMware KB-1004029 has instructions for checking CHAP settings.
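The initiator-side CHAP configuration can be dumped from the CLI as well; a quick check (adapter name again from my lab) might look like:
# esxcli iscsi adapter auth chap get --adapter=vmhba33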
9: Verify that the storage array being used is listed on the Storage/SAN Compatibility Guide. For instructions please refer to VMware KB-1003916.
Analyze and resolve NFS issues
Common issues that can arise for NFS based storage are:
1: The NFS share cannot be mounted by the ESX/ESXi host.
2: The NFS share is mounted, but nothing can be written to it. The following log entries can be seen for this issue:
a) NFS Error: Unable to connect to NFS server
b) WARNING: NFS: 983: Connect failed for client 0xb613340 sock 184683088: I/O error
c) WARNING: NFS: 898: RPC error 12 (RPC failed) trying to get port for Mount Program (100005) Version (3) Protocol (TCP) on Server (xxx.xxx.xxx.xxx)
Since NFS is also IP-based storage, troubleshooting NFS issues starts with checking:
1: Testing NFS server connectivity: Verify that the ESXi host can vmkping the NFS server. Also verify that the NFS server can ping the VMkernel IP of the ESXi host.
2: Verify the NFS port on the server is open. The default port is 2049:
# nc -z <NFS-Server-IP> <NFS-Port>
3: Verify the firewall on both the NFS server side and the ESXi host side. On both sides the firewall should be configured to pass NFS traffic. VMware KB-1007352 explains more on this.
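On the ESXi side, the NFS client ruleset should show up as enabled in the host firewall; a quick way to confirm (ruleset names can differ slightly between NFS 3 and NFS 4.1):
# esxcli network firewall ruleset list | grep -i nfs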
4: Check the vSwitch configuration and verify that the correct VLAN and MTU are specified on the portgroup designated for NFS traffic. If the MTU is set to anything other than 1500 or 9000, test the connectivity using the vmkping command:
# vmkping -I vmkN -s <mtu_size> <NFS_Server_IP>
5: Permission to mount the NFS exported filesystem.
6: Export configuration on the NFS server. The NFS export should allow rw access for the ESXi host subnet.
7: Proper authentication set if using NFS 4.1.
8: If the NFS server is running on a Windows server, verify it is configured properly. Use VMware KB-1004490 for troubleshooting steps.
9: For troubleshooting mount-related issues, enable nfsstat3 logging for enhanced log output. By default this is disabled.
Verify the current setting:
[root@esxi01:~] esxcfg-advcfg -g /NFS/LogNfsStat3
Value of LogNfsStat3 is 0
Enable nfsstat3 logging:
[root@esxi01:~] esxcfg-advcfg -s 1 /NFS/LogNfsStat3
Value of LogNfsStat3 is 1
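Once the underlying issue is fixed, re-listing and re-mounting the datastore from the CLI is straightforward; the host, share and volume name below are placeholders, not values from a real environment:
# esxcli storage nfs list
# esxcli storage nfs add --host=192.168.109.6 --share=/mnt/nfs/export01 --volume-name=nfs-datastore01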
Troubleshoot RDM issues
Storage vendors might require that VMs with RDMs ignore SCSI INQUIRY data cached by ESXi. When a host first connects to a target storage device, it issues the SCSI INQUIRY command to obtain basic identification data from the device. ESXi caches this data, and it remains unchanged afterwards.
To configure a VM with an RDM to ignore the SCSI INQUIRY cache, add the following to the .vmx file:
scsix:y.ignoreDeviceInquiryCache = "true"
Where x = the SCSI controller number and y = the SCSI target number of the RDM
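For completeness, vmkfstools can create an RDM mapping file and query which physical device an existing mapping points to; the device ID and paths below are only examples (-z creates a physical-compatibility RDM, use -r for virtual compatibility):
# vmkfstools -z /vmfs/devices/disks/naa.60014054ddcf82083c44f8da7394198a /vmfs/volumes/datastore1/rdm-vm/rdm-vm_rdm.vmdk
# vmkfstools -q /vmfs/volumes/datastore1/rdm-vm/rdm-vm_rdm.vmdk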
And that’s it for this post.
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing. Be sociable 🙂