VCD (10.5) Service Crashing Continuously in CSE Environment

After updating my lab's Container Service Extension (CSE) to version 4.2.0, I observed that the VMware Cloud Director (VCD) service was crashing frequently. Restarting the cell service did not help much, as the VCD user interface (UI) went down again after five minutes. The cell.log file recorded an exception at the time of the crash.

You will find similar log entries in the cell-runtime.log file.

Since the service crashed multiple times, each time creating a Java dump file, the dumps eventually filled up the disk, and there was no space left on the root partition.

I verified this by running the df -h command, which showed the root partition at 100% usage.
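The check above can be sketched as two small helper functions. This is only an illustration: the function names are hypothetical, the default directory is the logs path mentioned later in this post, and `find -printf` assumes GNU find.

```shell
#!/usr/bin/env bash
# Print the usage percentage of the filesystem holding a path (default: /).
fs_usage_pct() {
    df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# List the N largest files directly inside a directory, biggest first.
# Each output line is "<size in bytes> <path>".
largest_files() {
    local dir="${1:-/opt/vmware/vcloud-director/logs}" count="${2:-10}"
    find "$dir" -maxdepth 1 -type f -printf '%s %p\n' | sort -rn | head -n "$count"
}
```

Running `fs_usage_pct /` followed by `largest_files` should make it obvious whether dump files in the logs directory are what is consuming the root partition.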

The immediate fix was to delete all Java dump files from the logs folder and restart the VCD cell service (vmware-vcd). The dump files can all be removed with a single command.
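A minimal sketch of that cleanup follows. It assumes the dumps carry the JVM's default `.hprof` extension, so check the actual file names in your logs folder first. The function name is hypothetical, and `-delete` assumes GNU or BSD find.

```shell
#!/usr/bin/env bash
# Delete Java heap dumps from a log directory and print each removed file.
# Assumption: dumps use the ".hprof" extension; adjust the pattern if yours differ.
clean_java_dumps() {
    local dir="${1:-/opt/vmware/vcloud-director/logs}"
    find "$dir" -maxdepth 1 -type f -name '*.hprof' -print -delete
}
```

Calling `clean_java_dumps` with no argument targets /opt/vmware/vcloud-director/logs. Restricting the match to one directory level and one file pattern keeps regular log files such as cell.log untouched.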

On investigating the issue further, I came across VMware KB-95464, which explains that this is a known issue affecting VCD 10.4 and later versions and that it will be fixed in an upcoming release of VCD. As per the KB article, the root cause is:

This issue can occur when the resolve operation is invoked on a Runtime Defined Entity (RDE) which has a large number of tasks associated with it.

Workaround

Step 1: Remove all Java dump files from the /opt/vmware/vcloud-director/logs directory.

Step 2: Reduce the overall number of task entries associated with the resolve operation. The task details are stored in the BEHAVIOR_INVOCATION table of the VCD database.

Connect to the VCD database:

List the task count and clean it up.
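As a rough sketch of the connect-and-count part, the helpers below wrap psql. They assume an appliance-style deployment where the database is named `vcloud` and is reachable as the `postgres` OS user. Both function names are hypothetical. The cleanup statement itself is deliberately not shown here; take it from KB-95464 rather than guessing at the table's columns.

```shell
#!/usr/bin/env bash
# Run one SQL statement against the VCD database as the postgres OS user.
# Assumption: the database is named "vcloud" (appliance default).
vcd_sql() {
    sudo -i -u postgres psql -d vcloud -t -A -c "$1"
}

# Count the rows currently held in the BEHAVIOR_INVOCATION table.
vcd_task_count() {
    vcd_sql 'SELECT count(*) FROM behavior_invocation;'
}
```

Running `vcd_task_count` before and after the cleanup gives a quick confirmation that the number of task entries actually dropped.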

Disclaimer: In a production environment, perform the database modification with VMware GSS assistance.

Step 3: Restart the VCD cell service: systemctl restart vmware-vcd
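The restart can be paired with a quick status check, as in this small sketch (the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Restart the cell service, then print the state systemd reports for it.
restart_vcd_cell() {
    systemctl restart vmware-vcd || return 1
    systemctl is-active vmware-vcd
}
```

An "active" result only means the service process is running. The UI can take a few more minutes to come up, so it is worth watching /opt/vmware/vcloud-director/logs/cell.log and checking disk usage again to confirm that no new dump files appear.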

I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.
