My previous blog post discussed the VCD Extension for Tanzu Mission Control and covered the end-to-end deployment steps. In this post, I will cover how to troubleshoot a stuck TMC self-managed deployment in VCD.
I was deploying TMC self-managed in a new environment, and during configuration I mistakenly passed an incorrect value for the DNS zone. This left the deployment stuck, and it did not terminate on its own. I waited a couple of hours for the task to fail, but it kept running, preventing me from reinstalling with the correct configuration.
The deployment was stalled in the Creating phase and did not fail.
When I checked the pods in the tmc-local namespace, many of them were stuck in either the "CreateContainerConfigError" or "CrashLoopBackOff" state.
root@jumpbox:~# kubectl get po -n tmc-local | grep CreateContainerConfigError
audit-service-consumer-59b6954688-5fhkd   0/1   CreateContainerConfigError   0   6m27s
audit-service-consumer-59b6954688-mqkrt   0/1   CreateContainerConfigError   0   6m27s
audit-service-server-778d89bf7-lw2ph      0/1   CreateContainerConfigError   0   6m27s
audit-service-server-778d89bf7-q5g6z      0/1   CreateContainerConfigError   0   6m27s
dataprotection-server-65848fb688-62b46    0/1   CreateContainerConfigError   0   6m24s
dataprotection-server-65848fb688-n8cgz    0/1   CreateContainerConfigError   0   6m24s
inspection-server-679cccbc57-kfs5z        0/2   CreateContainerConfigError   0   6m23s
inspection-server-679cccbc57-pldsb        0/2   CreateContainerConfigError   0   6m23s

root@jumpbox:~# kubectl get po -n tmc-local | grep CrashLoopBackOff
agent-gateway-server-5c6b5dd5d4-8dvdb     0/1   CrashLoopBackOff   6 (2m18s ago)   8m45s
agent-gateway-server-5c6b5dd5d4-qw7bz     0/1   CrashLoopBackOff   6 (2m20s ago)   8m45s
api-gateway-server-6c54fd7f86-v4mld       0/1   CrashLoopBackOff   6 (2m19s ago)   8m44s
api-gateway-server-6c54fd7f86-xjd55       0/1   CrashLoopBackOff   6 (2m32s ago)   8m44s
policy-insights-server-5d5458b76d-t25vx   0/1   CrashLoopBackOff   6 (102s ago)    8m39s
policy-insights-server-5d5458b76d-xcgxc   0/1   CrashLoopBackOff   6 (113s ago)    8m39s
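To dig into why the pods were unhealthy, the usual first step is to inspect the pod events and the logs of the crashed containers. Here is a rough sketch of the commands you could run (the pod names are taken from the output above; yours will differ):

# Show events and container status for one of the stuck pods
kubectl describe pod audit-service-consumer-59b6954688-5fhkd -n tmc-local

# Recent events across the whole namespace, newest last
kubectl get events -n tmc-local --sort-by=.lastTimestamp

# For CrashLoopBackOff pods, the logs of the previous (crashed) container are usually the most telling
kubectl logs agent-gateway-server-5c6b5dd5d4-8dvdb -n tmc-local --previous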
In VCD, when I checked the failed task "Execute global 'post-create' action", I found that the installer was complaining that reconciliation of the TMC package installs had failed.
failed to execute trigger hook
exit status 1
vcd-ext/action.(*Program).Execute
    vcd-ext/action/program.go:137
vcd-ext/cmd/instance.(*TriggersActivity).executeTrigger
    vcd-ext/cmd/instance/triggers.go:275
vcd-ext/cmd/instance.(*TriggersActivity).Run
    vcd-ext/cmd/instance/triggers.go:108
vcd-ext/cmd/instance.ActivityList.Execute
    vcd-ext/cmd/instance/activity.go:95
vcd-ext/cmd/instance/create.(*RealizeContext).RealizeInstance
    vcd-ext/cmd/instance/create/realize.go:215
vcd-ext/cmd/instance/create.(*RealizeContext).Realize
    vcd-ext/cmd/instance/create/realize.go:98
vcd-ext/cmd/instance/retry.(*RealizeContext).Realize
    vcd-ext/cmd/instance/retry/realize.go:79
vcd-ext/cmd/instance/retry.(*Options).Run
    vcd-ext/cmd/instance/retry/cmd.go:184
vcd-ext/cmd/instance/retry.NewRetryCommand.func2
    vcd-ext/cmd/instance/retry/cmd.go:36
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.5.0/command.go:872
vcd-ext/cmd/cli.Run
    vcd-ext/cmd/cli/cli.go:60
main.main
    /opt/src/vcd-ext/main.go:13
cause: fail to reconcile the tmc pkgi, error: not all pkgi are ready
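The "not all pkgi are ready" message refers to the Carvel PackageInstall (pkgi) resources that kapp-controller reconciles on the TMC cluster. As a rough sketch (assuming you have kubectl access to the cluster), you can list them to see which packages never reconciled:

# List all PackageInstall resources and their reconcile state
kubectl get pkgi -A

# Describe a failing PackageInstall to see the underlying error (replace the name with a real one from the list)
kubectl describe pkgi <failing-pkgi-name> -n tmc-local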
Because the product is new, the documentation provides little guidance on troubleshooting issues like this. After discussing the problem with the Engineering team, I was able to resolve it; once again, the VCD API saved the day. Here is the conclusion of the discussion:
This is a known issue with the solution add-on agent. The subtask timed out after two hours, but the task status was never updated because VCD killed its agent; VCD uses a fixed two-hour timeout for the add-on agent. It is better to set a smaller timeout value when creating the TMC instance, for example 5400 seconds.
Here are the steps I followed:
Disclaimer: Before running the commands below in a production environment, consult the GSS team.
1: (Optional) Export environment variables
export VCD_EXT_HOST=<vcd fqdn>
export VCD_EXT_USERNAME=<vcd sys admin user>
export VCD_EXT_PASSWORD=<sys admin password>
export VCD_EXT_NAME=<name of the tmc-sm instance>
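For reference, this is roughly how the variables might look in a lab; all values below are placeholders, not real ones from my environment:

export VCD_EXT_HOST=vcd01.lab.local
export VCD_EXT_USERNAME=administrator
export VCD_EXT_PASSWORD='VMware123!'
export VCD_EXT_NAME=tmc-sm-01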
2: Generate VCD Auth Token
export VCLOUD_ACCESS_TOKEN=`curl -ksSL -D - -H "Accept: application/json;version=38.1" \
  -u "${VCD_EXT_USERNAME}@system:${VCD_EXT_PASSWORD}" \
  -X POST https://${VCD_EXT_HOST}/cloudapi/1.0.0/sessions/provider \
  | sed -n -e 's/X-VMWARE-VCLOUD-ACCESS-TOKEN: \(.*\)/\1/ip' | tr -cd '[:alnum:]._-'`
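Before proceeding, it is worth a quick sanity check that the token was actually captured; an empty variable means the login failed:

# The variable should be non-empty; an empty value means authentication failed
[ -n "${VCLOUD_ACCESS_TOKEN}" ] && echo "Token acquired" || echo "Token is empty - check credentials"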
3: Retrieve the TMC-SM RDE
curl -ksSL -H "Accept: application/json;version=38.1" -H "Authorization: Bearer ${VCLOUD_ACCESS_TOKEN}" \n -X GET "https://${VCD_EXT_HOST}/cloudapi/1.0.0/entities/types/vmware/solutions_add_on_instance/1.0.0" \n | jq '.values[] | select((.entity.name == $ENV.VCD_EXT_NAME))' > entity-${VCD_EXT_NAME}.json |
4: Mark the TMC-SM instance as failed
curl -ksSL -H "Accept: application/json;version=38.1" \n -H "Content-Type: application/json" -H "Authorization: Bearer ${VCLOUD_ACCESS_TOKEN}" \n -X PUT -d @entity-${VCD_EXT_NAME}.json \n "https://${VCD_EXT_HOST}/cloudapi/1.0.0/entities/$(cat entity-${VCD_EXT_NAME}.json \n | jq -r .id)" | jq . |
After forcefully failing the TMC-SM instance, the deletion went through without issues and the instance was cleaned up.
And that’s it for this post. In the next post, I will discuss one more troubleshooting scenario that I encountered in my lab. Stay tuned!!!
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.