
containercluster: can't disable workloadIdentityConfig once enabled #437

Open
jonnylangefeld opened this issue Mar 25, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@jonnylangefeld


Bug Description

I created a simple containercluster resource with workload identity enabled.

╰─ kubectl get containercluster paas-jonny1-dev-us-west1 -o jsonpath={.spec.workloadIdentityConfig}
{"identityNamespace":"my-project.svc.id.goog"}

This all works, and on GCP I can see that it's enabled for my cluster:
(screenshot: GCP console showing Workload Identity enabled for the cluster)

However, disabling it on an existing cluster doesn't work.

I'm trying to disable it by running

╰─ kubectl patch containercluster/paas-jonny1-dev-us-west1 -p '{"spec":{"workloadIdentityConfig":null}}' --type=merge
containercluster.container.cnrm.cloud.google.com/paas-jonny1-dev-us-west1 patched

By watching the containercluster in a separate window via

watch kubectl get containercluster paas-jonny1-dev-us-west1 -o jsonpath={.spec.workloadIdentityConfig}

I see that my patch command actually removes .spec.workloadIdentityConfig as expected, but right after, the cnrm-controller-manager adds it back in (probably because it treats what's on GCP as the source of truth).

If I explicitly set spec.workloadIdentityConfig to null, I would expect the feature to be disabled on GCP. I have also tried setting it to {} instead of null, but that doesn't work either.
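
For reference, the empty-object variant of that patch looks like this (the same merge-patch approach as above, just with {} in place of null):

╰─ kubectl patch containercluster/paas-jonny1-dev-us-west1 -p '{"spec":{"workloadIdentityConfig":{}}}' --type=merge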

Additional Diagnostic Information

Kubernetes Cluster Version

Client Version: v1.20.4
Server Version: v1.16.15-gke.7800

Config Connector Version

1.32.0

Config Connector Mode

cluster

Log Output

{"severity":"info","logger":"containercluster-controller","msg":"starting reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"creating/updating underlying resource","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"successfully finished reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"starting reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"underlying resource already up to date","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"successfully finished reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"starting reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"underlying resource already up to date","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"successfully finished reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}

Steps to Reproduce

Steps to reproduce the issue

  1. create a containercluster with workload identity enabled
  2. disable workload identity via the containercluster resource (see the command sketch below)
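
A minimal command-level reproduction, assuming a manifest file named containercluster.yaml containing the sketch from the bug description (the file name is illustrative; the patch and watch commands are the same ones shown above):

# 1. create the cluster with workload identity enabled
kubectl apply -f containercluster.yaml

# 2. try to disable workload identity on the existing cluster
kubectl patch containercluster/paas-jonny1-dev-us-west1 -p '{"spec":{"workloadIdentityConfig":null}}' --type=merge

# watch the controller add .spec.workloadIdentityConfig back in
watch kubectl get containercluster paas-jonny1-dev-us-west1 -o jsonpath={.spec.workloadIdentityConfig}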
@jonnylangefeld jonnylangefeld added the bug Something isn't working label Mar 25, 2021
@jcanseco
Member

Hi @jonnylangefeld, looking now. Could you please elaborate on this issue's impact on you? For example, is this issue a blocker, a friction point, or a nice-to-have?

@jonnylangefeld
Author

jonnylangefeld commented Mar 30, 2021

Thanks for your reply @jcanseco. While the issue of disabling workload identity was merely a friction point, we have now seen this issue on other fields as well, which is actually blocking us from rolling out a controller to production right now.
The current blocker is for containernodepools and the field spec.autoscaling. Autoscaling is a mutable field on node pools and can be enabled and disabled via the GKE API.

However, we cannot disable it via Config Connector. The same thing happens as described above: when the field is removed from the custom resource, the Config Connector controller just adds it back in right away. And by watching with the watch command, I can see that the field is initially removed after my patch.
To make this even more evident, I have scaled down the operator to 0 via

kubectl scale --replicas 0 -n cnrm-system statefulset/cnrm-controller-manager

And then applied my patch via

kubectl patch containernodepool/generic-2 -p '{"spec":{"autoscaling":null}}' --type=merge

Now checking the autoscaling setting:

kubectl get containernodepool generic-2 -o jsonpath={.spec.autoscaling}

actually shows that it's removed. (The same thing works via a kubectl edit of that resource.)
Then scaling the controller back up via

kubectl scale --replicas 1 -n cnrm-system statefulset/cnrm-controller-manager

just brings the original autoscaling setting right back in from the actual GCP resource. But in this case I obviously want the Kubernetes resource to be the source of truth, because that's the one that was just edited.

Here is the original containernodepool YAML:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
    cnrm.cloud.google.com/project-id: my-project
  name: generic-2
  namespace: nodepools
spec:
  autoscaling:
    maxNodeCount: 1000
    minNodeCount: 0
  clusterRef:
    external: paas-jonny1-dev-us-west1
  initialNodeCount: 1
  location: us-west1
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsPerNode: 64
  nodeConfig:
    diskSizeGb: 500
    diskType: pd-ssd
    imageType: COS
    labels:
      instance-family: e2
      node-pool-deployment: generic-1
      node-pool-type: generic
    machineType: e2-standard-32
    metadata:
      disable-legacy-endpoints: "true"
    minCpuPlatform: AUTOMATIC
    oauthScopes:
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    serviceAccountRef:
      external: [email protected]
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    tags:
    - allow-health-checkers
    - paas-gke-node
    - paas-jonny1-dev-us-west1
    workloadMetadataConfig:
      nodeMetadata: GKE_METADATA_SERVER
  nodeCount: 0
  nodeLocations:
  - us-west1-a
  - us-west1-b
  upgradeSettings:
    maxSurge: 20
    maxUnavailable: 0
  version: 1.17.17-gke.1101

This was tested on KCC versions 1.37.0 and 1.32.0.

@caieo
Contributor

caieo commented Mar 31, 2021

Hi @jonnylangefeld, we're very sorry you ran into these scenarios on those Container resources. Ideally, you should be able to disable Workload Identity and Autoscaling in the ways you described, so we're working on improving that behavior for Config Connector.

In the meantime, our recommended workaround is to remove the fields from your config, manually disable those fields on GCP, and then trigger a reconcile. However, we're aware that this workaround is infeasible given the number of clusters/node pools you might have. We found one workaround for the ContainerCluster scenario and have another fix in the works for the ContainerNodePool scenario:

  • For ContainerCluster's spec.workloadIdentityConfig, you should be able to disable the field by simply setting spec.workloadIdentityConfig.identityNamespace to an empty string (see the patch sketch after this list).
  • Unfortunately, ContainerNodePool's spec.autoscaling doesn't have as easy of a workaround, but we are working on a fix that will allow you to disable it by setting an empty object. We're hoping to have this in by next week's release.
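
A sketch of the ContainerCluster workaround from the first bullet, assuming the cluster from this issue (the field path comes from the bullet above; the merge-patch form is just one way to apply it):

kubectl patch containercluster/paas-jonny1-dev-us-west1 --type=merge -p '{"spec":{"workloadIdentityConfig":{"identityNamespace":""}}}'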

Like we mentioned in our sync, we're working on improving our field deletion/disabling user experience, so if you run into other scenarios, we value any additional data points you can provide to help us improve!

@jonnylangefeld
Author

Thank you @caieo for the update!
Can we assume that with next week's release, spec.workloadIdentityConfig can also be disabled by setting it to an empty object (rather than setting the string inside it to an empty string)?
I assume the fix to disable optional GKE API fields by setting them to empty objects would be a global one.

In that case we wouldn't go with the workaround for spec.workloadIdentityConfig and would just wait for next week's fix and apply it globally right away.

@caieo
Contributor

caieo commented Apr 1, 2021

@jonnylangefeld, unfortunately, our fix will be specific to the ContainerNodePool resource. It's meant to be a workaround to unblock your use case while we work on creating a clearer story around field disabling/deletion. The ContainerCluster workaround for spec.workloadIdentityConfig will still need to be done as I mentioned above. We're really sorry for this inconsistency.

@jonnylangefeld
Author

I see. So next week's release will specifically target ContainerNodePool's spec.autoscaling.
Is there an ETA for a global fix, so that any optional GCP API field can be disabled by setting an empty object?
