My previous blog post discussed the VCD Extension for Tanzu Mission Control and covered the end-to-end deployment steps. In this post, I will cover how to troubleshoot a stuck TMC self-managed deployment in VCD.
I was deploying TMC self-managed in a new environment, and during configuration I mistakenly passed an incorrect value for the DNS zone. This left the deployment stuck, and it did not terminate on its own. I waited a couple of hours for the task to fail, but it kept running, preventing me from reinstalling with the correct configuration.
The deployment was stalled in the Creating phase and did not fail.
When I checked the pods in the tmc-local namespace, many of them were stuck in either the "CreateContainerConfigError" or "CrashLoopBackOff" state.
root@jumpbox:~# kubectl get po -n tmc-local | grep CreateContainerConfigError
audit-service-consumer-59b6954688-5fhkd   0/1   CreateContainerConfigError   0   6m27s
audit-service-consumer-59b6954688-mqkrt   0/1   CreateContainerConfigError   0   6m27s
audit-service-server-778d89bf7-lw2ph      0/1   CreateContainerConfigError   0   6m27s
audit-service-server-778d89bf7-q5g6z      0/1   CreateContainerConfigError   0   6m27s
dataprotection-server-65848fb688-62b46    0/1   CreateContainerConfigError   0   6m24s
dataprotection-server-65848fb688-n8cgz    0/1   CreateContainerConfigError   0   6m24s
inspection-server-679cccbc57-kfs5z        0/2   CreateContainerConfigError   0   6m23s
inspection-server-679cccbc57-pldsb        0/2   CreateContainerConfigError   0   6m23s

root@jumpbox:~# kubectl get po -n tmc-local | grep CrashLoopBackOff
agent-gateway-server-5c6b5dd5d4-8dvdb     0/1   CrashLoopBackOff   6 (2m18s ago)   8m45s
agent-gateway-server-5c6b5dd5d4-qw7bz     0/1   CrashLoopBackOff   6 (2m20s ago)   8m45s
api-gateway-server-6c54fd7f86-v4mld       0/1   CrashLoopBackOff   6 (2m19s ago)   8m44s
api-gateway-server-6c54fd7f86-xjd55       0/1   CrashLoopBackOff   6 (2m32s ago)   8m44s
policy-insights-server-5d5458b76d-t25vx   0/1   CrashLoopBackOff   6 (102s ago)    8m39s
policy-insights-server-5d5458b76d-xcgxc   0/1   CrashLoopBackOff   6 (113s ago)    8m39s
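To dig into why the pods were unhealthy, the usual first step is to inspect the pod events and the logs of the crashed containers. Here is a rough sketch of the commands you could run (the pod names are taken from the output above; yours will differ):

# Show events and container status for one of the stuck pods
kubectl describe pod audit-service-consumer-59b6954688-5fhkd -n tmc-local

# Recent events across the whole namespace, newest last
kubectl get events -n tmc-local --sort-by=.lastTimestamp

# For CrashLoopBackOff pods, the logs of the previous (crashed) container are usually the most telling
kubectl logs agent-gateway-server-5c6b5dd5d4-8dvdb -n tmc-local --previous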
In VCD, when I checked the failed task "Execute global 'post-create' action", I found that the installer was complaining that reconciliation of the TMC package installs had failed.
failed to execute trigger hook
exit status 1
vcd-ext/action.(*Program).Execute
    vcd-ext/action/program.go:137
vcd-ext/cmd/instance.(*TriggersActivity).executeTrigger
    vcd-ext/cmd/instance/triggers.go:275
vcd-ext/cmd/instance.(*TriggersActivity).Run
    vcd-ext/cmd/instance/triggers.go:108
vcd-ext/cmd/instance.ActivityList.Execute
    vcd-ext/cmd/instance/activity.go:95
vcd-ext/cmd/instance/create.(*RealizeContext).RealizeInstance
    vcd-ext/cmd/instance/create/realize.go:215
vcd-ext/cmd/instance/create.(*RealizeContext).Realize
    vcd-ext/cmd/instance/create/realize.go:98
vcd-ext/cmd/instance/retry.(*RealizeContext).Realize
    vcd-ext/cmd/instance/retry/realize.go:79
vcd-ext/cmd/instance/retry.(*Options).Run
    vcd-ext/cmd/instance/retry/cmd.go:184
vcd-ext/cmd/instance/retry.NewRetryCommand.func2
    vcd-ext/cmd/instance/retry/cmd.go:36
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.5.0/command.go:872
vcd-ext/cmd/cli.Run
    vcd-ext/cmd/cli/cli.go:60
main.main
    /opt/src/vcd-ext/main.go:13
cause: fail to reconcile the tmc pkgi, error: not all pkgi are ready
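The "not all pkgi are ready" message refers to the Carvel PackageInstall (pkgi) resources that kapp-controller reconciles on the TMC cluster. As a rough sketch (assuming you have kubectl access to the cluster), you can list them to see which packages never reconciled:

# List all PackageInstall resources and their reconcile state
kubectl get pkgi -A

# Describe a failing PackageInstall to see the underlying error (replace the name with a real one from the list)
kubectl describe pkgi <failing-pkgi-name> -n tmc-local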
Because the product is new, the documentation provides little guidance on troubleshooting issues like this. After discussing the problem with the Engineering team, I was able to resolve it; once again, the VCD API saved the day. Here is the conclusion of the discussion:
This is a known issue with the solution add-on agent. The subtask timed out after two hours, but the task status was never updated because VCD killed its agent; VCD uses a fixed two-hour timeout for the add-on agent. It is better to set a smaller timeout value when creating the TMC instance, for example 5400 seconds.
Here are the steps I followed:
Disclaimer: Before running the commands below in a production environment, consult the GSS team.
1: (Optional) Export environment variables
export VCD_EXT_HOST=<vcd fqdn>
export VCD_EXT_USERNAME=<vcd sys admin user>
export VCD_EXT_PASSWORD=<sys admin password>
export VCD_EXT_NAME=<name of the tmc-sm instance>
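For reference, this is roughly how the variables might look in a lab; all values below are placeholders, not real ones from my environment:

export VCD_EXT_HOST=vcd01.lab.local
export VCD_EXT_USERNAME=administrator
export VCD_EXT_PASSWORD='VMware123!'
export VCD_EXT_NAME=tmc-sm-01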
2: Generate VCD Auth Token
export VCLOUD_ACCESS_TOKEN=`curl -ksSL -D - -H "Accept: application/json;version=38.1" \
  -u "${VCD_EXT_USERNAME}@system:${VCD_EXT_PASSWORD}" \
  -X POST https://${VCD_EXT_HOST}/cloudapi/1.0.0/sessions/provider \
  | sed -n -e 's/X-VMWARE-VCLOUD-ACCESS-TOKEN: \(.*\)/\1/ip' | tr -cd '[:alnum:]._-'`
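Before proceeding, it is worth a quick sanity check that the token was actually captured; an empty variable means the login failed:

# The variable should be non-empty; an empty value means authentication failed
[ -n "${VCLOUD_ACCESS_TOKEN}" ] && echo "Token acquired" || echo "Token is empty - check credentials"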
3: Retrieve the TMC-SM RDE
curl -ksSL -H "Accept: application/json;version=38.1" -H "Authorization: Bearer ${VCLOUD_ACCESS_TOKEN}" \n -X GET "https://${VCD_EXT_HOST}/cloudapi/1.0.0/entities/types/vmware/solutions_add_on_instance/1.0.0" \n | jq '.values[] | select((.entity.name == $ENV.VCD_EXT_NAME))' > entity-${VCD_EXT_NAME}.json |
4: Mark the TMC-SM instance as failed
curl -ksSL -H "Accept: application/json;version=38.1" \n -H "Content-Type: application/json" -H "Authorization: Bearer ${VCLOUD_ACCESS_TOKEN}" \n -X PUT -d @entity-${VCD_EXT_NAME}.json \n "https://${VCD_EXT_HOST}/cloudapi/1.0.0/entities/$(cat entity-${VCD_EXT_NAME}.json \n | jq -r .id)" | jq . |
After forcefully failing the TMC-SM instance, the deletion went through without issues and the instance was cleaned up.
And that’s it for this post. In the next post, I will discuss one more troubleshooting scenario that I encountered in my lab. Stay tuned!!!
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.