Today while investigating one production issue, I came across an incident where a virtual machine restart was performed by HA.
Navigating through task and events for the VM in vCenter, I was seeing message similar to
1 |
"This virtual machine reset by HA. Reason: VMware Tools heartbeat failure.A screenshot is saved at /vmfs/volumes/4c4850e4-0dcce710-28d9-00215-a5d36b8/XYZ/xyz-0.png" |
On investigating further, I came to find out that VMware tools running on VM was out of date.
I checked for HA settings related to VM monitoring on the cluster and did not see any aggressive value set there which might have caused the vm restart. The VM monitoring settings were set at 30 secs for failure interval which means that if for continuous 30 seconds VM is unable to send any heartbeat to HA, a restart of VM will be performed.
Keep in mind that this is not related to host failure where all VM running on failed host is restarted on remaining nodes of the cluster. In case of single VM failure, HA will restart the VM on same host.
I started my investigation with checking hostd.log file of the Esxi host where VM was running and observed following messages which hints on VM restart
1 2 3 4 5 6 7 8 9 10 11 |
2016-11-06T06:00:52.281Z info hostd[50A40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] State Transition (VM_STATE_ON -> VM_STATE_CREATE_SCREENSHOT) 2016-11-06T06:00:52.353Z info hostd[5114CB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] State Transition (VM_STATE_CREATE_SCREENSHOT -> VM_STATE_ON) 2016-11-05T00:26:30.020Z info hostd[50F44B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] Send config update invoked 2016-11-05T00:26:30.049Z info hostd[502C6B70] [Originator@6876 sub=Snmpsvc] VmConfigListener: vm state change received, queueing reload request 2016-11-05T01:58:44.821Z info hostd[50507B70] [Originator@6876 sub=Snmpsvc] VmConfigListener: vm state change received, queueing reload request 2016-11-05T01:58:49.097Z info hostd[50244B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] Upgrade is required for virtual machine, version: 10 |
Next was to check vmware.log for VM XYZ. I observed following log entries for VM reset
1 2 3 4 5 6 |
2016-11-01T18:13:53.321Z| vmx| I120: Vix: [43327 vmxCommands.c:685]: VMAutomation_Reset. Trying hard reset 2016-11-01T18:13:53.321Z| vmx| W110: 2016-11-01T18:13:53.321Z| vmx| W110+ 2016-11-01T18:13:53.321Z| vmx| W110+ VMXRequestReset 2016-11-01T18:13:53.321Z| vmx| I120: Vigor_Reset: Attaching to reset. 2016-11-01T18:13:53.321Z| vmx| I120: Stopping VCPU threads... |
On digging vmware.log more, I found the messages regarding VMware tools version out of date
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
2016-10-27T04:25:34.389Z| vmx| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'oldTools', 'supportedOld', install possible 2016-10-27T04:25:37.656Z| vcpu-3| I120: ToolsSetVersionWork did nothing; new tools version (9344) matches old Tools version 2016-10-27T04:25:37.832Z| Worker#1| I120: ToolsVersion: Status is supported old because this is status of monolithic version, and vsock_win2k3 and 25 more components are missing from the guest. 2016-10-27T04:25:37.832Z| Worker#1| I120: ToolsVersionGetStatusWorkerThread: Tools status 2 derived from environment 2016-10-27T04:25:37.832Z| vmx| I120: ToolsUpdateManifestInfoWorkerThreadDone: Compared tools manifest from host and from the guest. Status = 2. 2016-10-27T04:25:37.832Z| vmx| I120: ToolsUpdateManifestInfoWorkerThreadDone: Updating the manifest info. 2016-10-27T04:25:37.832Z| vmx| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'oldTools', 'supportedOld', install possible 2016-10-27T04:25:37.832Z| vmx| I120: TOOLS installed legacy version 9344, available legacy version 9541 2016-10-27T04:25:37.832Z| vmx| I120: TOOLS manifest update status is 2 2016-10-27T04:25:37.832Z| vmx| I120: TOOLS can be autoupgraded. |
Also found VMware tools heartbeat timing out which eventually forced HA to reset the VM
1 2 3 4 5 6 7 8 9 10 11 12 13 |
2016-10-27T04:33:10.583Z| vcpu-0| I120: Tools: Tools heartbeat timeout. 2016-10-27T04:34:26.229Z| vcpu-3| I120: GuestMsg: Channel 0, Protocol error, state: 0 2016-10-27T04:34:26.230Z| vcpu-3| I120: GuestMsg: Cannot close channel 0: it is not opened 2016-10-27T04:34:28.597Z| vcpu-0| I120: Guest: toolbox: Version: build-1280544 2016-10-27T04:34:28.597Z| vcpu-0| W110: GuestRpc: application toolbox, changing channel 65535 -> 0 2016-10-27T04:34:28.597Z| vcpu-0| I120: GuestRpc: Channel 0, guest application toolbox. 2016-10-27T04:34:29.924Z| vcpu-1| I120: TOOLS autoupgrade protocol version 2 |
From the above logs it was evident that vmware tools were unable to send heartbeat to HA and which caused HA to perform a VM reset.
VMware KB-1027734 was very helpful to me in carrying out my investigation.
Conclusion: From the above I can conclude that Vmware tools update is needed on the VM so that HA should not try to reset VM again.
I hope this post is informational to you. Feel free to share this on social media if it is worth sharing. Be sociable
As I know, the ESXi performs other checks in case of Tools heartbeat time out. For example checks the disk IO.