After updating my lab’s Container Service Extension to version 4.2.0, I observed that the VMware VCD service was crashing frequently. Restarting the cell service did not help much: the VCD user interface (UI) died again after five minutes. The cell.log showed the following exception:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /opt/vmware/vcloud-director/logs/java_pid11530.hprof ...
You will find similar log entries in the cell-runtime.log file.
2024-02-12 15:25:07,873 | FATAL | pc-activity-pool-13 | UncaughtExceptionHandlerStartupAction | Uncaught Exception. Originating thread: Thread[pc-activity-pool-13,5,main]. Message: Java heap space | activity=(com.vmware.vcloud.vimproxy.internal.impl.PCEventProcessingActivity,urn:uuid:5d072602-9517-4ce9-b00c-9e12a46e03bd)
java.lang.OutOfMemoryError: Java heap space
Since the service crashed multiple times, each time creating a Java heap dump file, the dumps eventually filled up the disk, and there was no space left on the root partition.
Dump file is incomplete: No space left on device
log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device
I verified this by running the df -h command, and the root partition was 100% full.
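To see which dump files are taking up the space before removing anything, a small helper like the one below can list them. This is only a sketch: the function name list_heap_dumps is hypothetical, the java_pid*.hprof pattern is taken from the file name shown in the cell.log excerpt, and it assumes find, du and sort are available on the cell.

```shell
#!/usr/bin/env bash
# list_heap_dumps DIR (hypothetical helper, not part of VCD):
# print the size in KiB and the path of every Java heap dump directly
# inside DIR, sorted numerically with the largest size first.
list_heap_dumps() {
    local dir="${1:?usage: list_heap_dumps DIR}"
    # The pattern matches names such as java_pid11530.hprof; nothing is modified.
    find "$dir" -maxdepth 1 -type f -name 'java_pid*.hprof' -exec du -k {} + | sort -rn
}

# Example call on a VCD cell (path taken from the log excerpt above):
#   list_heap_dumps /opt/vmware/vcloud-director/logs
```

The function only reads the directory, so it is safe to run while the disk is still full.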
The immediate step was to delete all Java dump files from the logs folder and restart the VCD cell service (vmware-vcd). You can delete all dump files by running the following one-liner from the /opt/vmware/vcloud-director/logs directory.
# rm -f java_*
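If you would rather not depend on the current working directory, a slightly more defensive variant could look like the sketch below. The function name remove_heap_dumps is hypothetical, the pattern is narrowed to java_pid*.hprof based on the dump file name in the log excerpt, and it assumes a find that supports -delete (GNU findutils does).

```shell
#!/usr/bin/env bash
# remove_heap_dumps DIR (hypothetical helper, not part of VCD):
# delete only java_pid*.hprof files directly inside DIR and print
# each path as it is removed. Other files in DIR are left untouched.
remove_heap_dumps() {
    local dir="${1:?usage: remove_heap_dumps DIR}"
    find "$dir" -maxdepth 1 -type f -name 'java_pid*.hprof' -print -delete
}

# Example call on a VCD cell (path taken from the log excerpt above):
#   remove_heap_dumps /opt/vmware/vcloud-director/logs
```

Passing the directory explicitly means the command cannot remove files from the wrong location if it is run from somewhere other than the logs folder.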
On investigating the issue further, I came across VMware KB-95464 which explains that this is a known issue affecting VCD 10.4 and later versions and will be fixed in the upcoming release of VCD. As per the KB article, the root cause of this is:
This issue can occur when the resolve operation is invoked on a Runtime Defined Entity (RDE) which has a large number of tasks associated with it.
Workaround
Step 1: Remove all Java dump files from the /opt/vmware/vcloud-director/logs directory.
Step 2: Reduce the overall number of task entries associated with the resolve operation. These entries are stored in the VCD database’s jobs table, where the operation column is set to “BEHAVIOR_INVOCATION”.
Connect to the VCD database:
root@vcd01 [ ~ ]# sudo -i -u postgres psql vcloud
psql.bin (14.7 (VMware Postgres 14.7.0-21567549 release))
Type "help" for help.

vcloud=# \c vcloud
You are now connected to database "vcloud" as user "postgres".
Count the matching task entries, then delete them.
Disclaimer: In a production environment, perform the database modification with VMware GSS assistance.
vcloud=# SELECT COUNT(*) FROM jobs WHERE operation = 'BEHAVIOR_INVOCATION';
 count
--------
 120508
(1 row)

vcloud=# delete from jobs where operation = 'BEHAVIOR_INVOCATION';
DELETE 120508
Step 3: Restart the VCD cell service: systemctl restart vmware-vcd
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.