Recently I upgraded NSX ALB from 20.1.4 to 20.1.5 in my lab and ran into odd behavior whenever I tried to deploy or delete a Kubernetes workload of type LoadBalancer.
The Issue
On deploying a new Kubernetes application, AKO was unable to create a load balancer for it. In the NSX ALB UI, I could see that a pool had been created and a VIP assigned, but no virtual service (VS) was present. I also verified that the ‘ako-essential’ role had the permission “PERMISSION_VIRTUALSERVICE”, which is needed to create a new VS.
On attempting to delete a Kubernetes application, the application was deleted on the TKG side, but it left lingering items (VS, pools, etc.) in the ALB UI. To investigate further, I tried deleting the server pool manually and captured the response using the browser's network inspector.
As expected, the delete operation failed with an error stating that the object I was trying to delete is referenced by an ‘L4PolicySet’:
    Request URL: https://192.168.15.20/api/pool/pool-18629d29-535a-49e5-93b5-dc2a4589374b
    Request Method: DELETE

    {
      "error":"Cannot delete, object is referred by: ['L4PolicySet tkc-wld01--default-my-service']",
      "obj_name":"tkc-wld01--default-my-service--8080"
    }
But the l4policyset list came back empty when queried from the controller CLI:
    [admin:172-16-10-11]: > show l4policyset
    No results.
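When cleaning up several lingering objects, it can help to pull the referring object out of the delete error programmatically instead of reading it by eye. The following is a minimal sketch that assumes the error body always has the shape captured above; the function name `referring_objects` is my own and is not part of any Avi SDK.

```python
import json
import re


def referring_objects(error_body: str) -> list[tuple[str, str]]:
    """Return (object_type, object_name) pairs named in a 'referred by' error.

    Assumes the message format seen in this post:
    "Cannot delete, object is referred by: ['<Type> <name>', ...]"
    """
    message = json.loads(error_body).get("error", "")
    match = re.search(r"referred by:\s*\[(.*)\]", message)
    if not match:
        return []
    pairs = []
    # Each referrer is a single-quoted "<Type> <name>" string.
    for item in re.findall(r"'([^']*)'", match.group(1)):
        obj_type, _, obj_name = item.partition(" ")
        pairs.append((obj_type, obj_name))
    return pairs


# Response body captured from the failed DELETE on the pool.
body = '''{
  "error": "Cannot delete, object is referred by: ['L4PolicySet tkc-wld01--default-my-service']",
  "obj_name": "tkc-wld01--default-my-service--8080"
}'''

print(referring_objects(body))
```

For the response in this post, the function reports a single referrer of type `L4PolicySet`, which is exactly the object that `show l4policyset` failed to list.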
Investigation
On checking portal_exception.log on the ALB controller, I found the root cause. The error was loud and clear:
    User 'tkg-wld01-wld01-ako-user' is not authorized to read on resource L4 Policy Set in tenant admin
Log Trace
    [2021-07-30 16:34:08,799] ERROR [error_handler.process_exception:102] Data Exception in GET /api/l4policyset/
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/django/core/handlers/base.py", line 132, in get_response
        response = wrapped_callback(request, *callback_args, **callback_kwargs)
      File "/usr/local/lib/python3.8/dist-packages/django/views/generic/base.py", line 71, in view
        return self.dispatch(request, *args, **kwargs)
      File "/opt/avi/python/lib/avi/rest/views.py", line 1018, in do_get_list
        self.check_user(request, args, kwargs)
      File "/opt/avi/python/lib/avi/rest/api_perf.py", line 116, in _perf
        rsp = f(*args, **kwargs)
      File "/opt/avi/python/lib/avi/rest/views.py", line 385, in check_user
        self.check_user_tenant_resource(user, tenant, self.permission_enum,
      File "/opt/avi/python/lib/avi/rest/api_perf.py", line 116, in _perf
        rsp = f(*args, **kwargs)
      File "/opt/avi/python/lib/avi/rest/views.py", line 628, in check_user_tenant_resource
        raise PermissionError(
    avi.rest.error_list.PermissionError: User 'tkg-wld01-wld01-ako-user' is not authorized to read on resource L4 Policy Set in tenant admin
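If the log is long, the permission failures can be summarized with a small script rather than by scrolling through tracebacks. This is only a sketch: the regular expression is modeled on the single message shown above, and I am assuming other permission errors in the log follow the same wording.

```python
import re

# Pattern modeled on the PermissionError message from portal_exception.log.
PERMISSION_ERROR = re.compile(
    r"User '(?P<user>[^']+)' is not authorized to (?P<action>\w+) "
    r"on resource (?P<resource>.+?) in tenant (?P<tenant>\S+)"
)


def permission_denials(log_text: str) -> list[tuple[str, str, str, str]]:
    """Return unique (user, action, resource, tenant) tuples found in the log."""
    seen = set()
    for m in PERMISSION_ERROR.finditer(log_text):
        seen.add((m["user"], m["action"], m["resource"], m["tenant"]))
    return sorted(seen)


# Last line of the traceback above, repeated to show de-duplication.
sample = (
    "avi.rest.error_list.PermissionError: User 'tkg-wld01-wld01-ako-user' "
    "is not authorized to read on resource L4 Policy Set in tenant admin\n"
) * 2

for user, action, resource, tenant in permission_denials(sample):
    print(f"{user}: cannot {action} '{resource}' in tenant '{tenant}'")
```

In practice you would pass the contents of the log file to `permission_denials` instead of the inline sample.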
On checking with the ALB engineering team, I found out that this is a known bug that is resolved in ALB 20.1.6. To summarize: after the ALB upgrade, the permission ‘PERMISSION_L4POLICYSET’ gets wiped from the AKO user role, which is the role created when you instantiate a new workload cluster.
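To confirm that a role is affected, you can inspect its privilege list for the missing entry. The sketch below works on a plain dictionary and makes no network calls. I am assuming the role object exposes a `privileges` list whose entries have `resource` and `type` fields, mirroring the fields the controller CLI prompts for; the sample role and the `REQUIRED` list are illustrative and cover only the two permissions mentioned in this post, not everything AKO needs.

```python
# Only the two permissions discussed in this post (not a complete AKO list).
REQUIRED = ["PERMISSION_VIRTUALSERVICE", "PERMISSION_L4POLICYSET"]


def missing_write_access(role: dict, required: list[str]) -> list[str]:
    """Return the required resources that the role cannot write to."""
    granted = {
        priv.get("resource"): priv.get("type")
        for priv in role.get("privileges", [])
    }
    return [res for res in required if granted.get(res) != "WRITE_ACCESS"]


# Hypothetical role as it might look after the upgrade (assumed field names).
role_after_upgrade = {
    "name": "ako-essential-role",
    "privileges": [
        {"resource": "PERMISSION_VIRTUALSERVICE", "type": "WRITE_ACCESS"},
    ],
}

print(missing_write_access(role_after_upgrade, REQUIRED))
```

An empty result would mean the role already has write access to every resource in the list.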
The Fix
Manually add PERMISSION_L4POLICYSET to the AKO user role via the Controller CLI.
1: Connect to the ALB controller over SSH and obtain shell access
    admin@172-16-10-11:~$ shell
    Login: admin
    Password:
    [admin:172-16-10-11]: >
2: Find the AKO user role
    [admin:172-16-11-11]: > show role
    +----------------------+-------------------------------------------+
    | Name                 | UUID                                      |
    +----------------------+-------------------------------------------+
    | Application-Admin    | role-cdd294c5-37ee-445c-b1d0-0c10036a3490 |
    | Tenant-Admin         | role-b5d13e88-2983-4e0f-8eec-8d96d30c59c1 |
    | System-Admin         | role-3978b330-b6f5-46a6-9fc5-69c1d3de17cd |
    | Application-Operator | role-0d555f2a-5f30-4f2d-8f18-055a3e666897 |
    | Security-Admin       | role-294da179-450a-46a0-88db-4a1b0af449ea |
    | WAF-Admin            | role-3c77cd65-f490-4480-8cb4-c538a0fc644d |
    | ako-essential-role   | role-4751732f-2b0d-479f-9d8e-210f37885069 |
    +----------------------+-------------------------------------------+
3: Configure the AKO user role to add the necessary permission
    [admin:172-16-11-11]: > configure role ako-essential-role
    [admin:172-16-11-11]: role> privileges
    New object being created
    [admin:172-16-11-11]: role:privileges> type write_access
    [admin:172-16-11-11]: role:privileges> resource permission_l4policyset
    [admin:172-16-11-11]: role:privileges> save
    [admin:172-16-11-11]: role> save
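If you would rather script the change than type it at the CLI, the same edit can be expressed as a transformation of the role object before sending it back to the controller. This is a rough sketch under the assumption that a role carries a `privileges` list of `resource`/`type` entries, as the CLI prompts suggest; `with_privilege` is a hypothetical helper, the sample role is made up, and the actual upload to the controller is deliberately left out.

```python
import copy


def with_privilege(role: dict, resource: str, access: str = "WRITE_ACCESS") -> dict:
    """Return a copy of the role that grants `access` on `resource`.

    Idempotent: an existing entry for the resource is updated in place
    rather than duplicated.
    """
    updated = copy.deepcopy(role)
    privileges = updated.setdefault("privileges", [])
    for priv in privileges:
        if priv.get("resource") == resource:
            priv["type"] = access
            return updated
    privileges.append({"resource": resource, "type": access})
    return updated


# Hypothetical role body (assumed field names, trimmed for brevity).
role = {
    "name": "ako-essential-role",
    "privileges": [
        {"resource": "PERMISSION_VIRTUALSERVICE", "type": "WRITE_ACCESS"},
    ],
}

patched = with_privilege(role, "PERMISSION_L4POLICYSET")
print(patched["privileges"])
```

Returning a copy instead of mutating the input keeps the original role available for comparison, and applying the helper twice gives the same result as applying it once.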
After applying the fix, I tried to create and delete the Kubernetes load balancer application again, and this time things worked like a charm.
And that’s it for this post. I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing.