
containercluster: can't disable workloadIdentityConfig once enabled #437

Open
jonnylangefeld opened this issue Mar 25, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@jonnylangefeld


Bug Description

I created a simple containercluster resource with workload identity enabled.

╰─ kubectl get containercluster paas-jonny1-dev-us-west1 -o jsonpath={.spec.workloadIdentityConfig}
{"identityNamespace":"my-project.svc.id.goog"}

This all works, and on GCP I can see that it's enabled for my cluster:
(screenshot: GCP console showing Workload Identity enabled for the cluster)

However, disabling it on an existing cluster doesn't work.

I'm trying to disable it by running

╰─ kubectl patch containercluster/paas-jonny1-dev-us-west1 -p '{"spec":{"workloadIdentityConfig":null}}' --type=merge
containercluster.container.cnrm.cloud.google.com/paas-jonny1-dev-us-west1 patched

By watching the containercluster in a separate window via

watch kubectl get containercluster paas-jonny1-dev-us-west1 -o jsonpath={.spec.workloadIdentityConfig}

I see that my patch command actually removes .spec.workloadIdentityConfig as expected, but right after, the cnrm-controller-manager adds it back in (probably because it treats what's on GCP as the source of truth).

If I explicitly set spec.workloadIdentityConfig to null, I would expect the feature to be disabled on GCP. I have also tried setting it to {} instead of null, but that doesn't work either.
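
For reference, the empty-object variant of that patch looks like this (the same merge-patch approach as above, just with {} in place of null):

╰─ kubectl patch containercluster/paas-jonny1-dev-us-west1 -p '{"spec":{"workloadIdentityConfig":{}}}' --type=merge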

Additional Diagnostic Information

Kubernetes Cluster Version

Client Version: v1.20.4
Server Version: v1.16.15-gke.7800

Config Connector Version

1.32.0

Config Connector Mode

cluster

Log Output

{"severity":"info","logger":"containercluster-controller","msg":"starting reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"creating/updating underlying resource","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"successfully finished reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"starting reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"underlying resource already up to date","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"successfully finished reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"starting reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"underlying resource already up to date","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}
{"severity":"info","logger":"containercluster-controller","msg":"successfully finished reconcile","resource":{"namespace":"clusters","name":"paas-jonny1-dev-us-west1"}}

Steps to Reproduce

Steps to reproduce the issue

  1. create a containercluster with workload identity enabled
  2. disable workload identity via the containercluster resource (see the command sketch below)
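
A minimal command-level reproduction, assuming a manifest file named containercluster.yaml containing the sketch from the bug description (the file name is illustrative; the patch and watch commands are the same ones shown above):

# 1. create the cluster with workload identity enabled
kubectl apply -f containercluster.yaml

# 2. try to disable workload identity on the existing cluster
kubectl patch containercluster/paas-jonny1-dev-us-west1 -p '{"spec":{"workloadIdentityConfig":null}}' --type=merge

# watch the controller add .spec.workloadIdentityConfig back in
watch kubectl get containercluster paas-jonny1-dev-us-west1 -o jsonpath={.spec.workloadIdentityConfig}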
@jonnylangefeld jonnylangefeld added the bug Something isn't working label Mar 25, 2021
@jcanseco
Member

Hi @jonnylangefeld, looking now. Could you please elaborate on this issue's impact on you? For example, is this issue a blocker, a friction point, or a nice-to-have?

@jonnylangefeld
Author

jonnylangefeld commented Mar 30, 2021

Thanks for your reply @jcanseco. While the issue of disabling workload identity was merely a friction point, we have now seen this issue on other fields as well, which is actually blocking us from rolling out a controller to production right now.
The current blocker is for containernodepools and the field spec.autoscaling. Autoscaling is a mutable field on node pools and can be enabled and disabled via the GKE API.

However, we cannot disable it via Config Connector. The same thing happens as described above: when the field is removed from the custom resource, the Config Connector controller just adds it back in right away. And by watching with the watch command, I can see that the field is initially removed after my patch.
To make this even more evident, I have scaled down the operator to 0 via

kubectl scale --replicas 0 -n cnrm-system statefulset/cnrm-controller-manager

And then applied my patch via

kubectl patch containernodepool/generic-2 -p '{"spec":{"autoscaling":null}}' --type=merge

Now checking the autoscaling setting:

kubectl get containernodepool generic-2 -o jsonpath={.spec.autoscaling}

actually shows that it's removed. (The same thing works via a kubectl edit of that resource.)
Then scaling the controller back up via

kubectl scale --replicas 1 -n cnrm-system statefulset/cnrm-controller-manager

just brings the original autoscaling setting right back in from the actual GCP resource. But in this case I obviously want the Kubernetes resource to be the source of truth, because that's the one that was just edited.

Here is the original containernodepool YAML:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
    cnrm.cloud.google.com/project-id: my-project
  name: generic-2
  namespace: nodepools
spec:
  autoscaling:
    maxNodeCount: 1000
    minNodeCount: 0
  clusterRef:
    external: paas-jonny1-dev-us-west1
  initialNodeCount: 1
  location: us-west1
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsPerNode: 64
  nodeConfig:
    diskSizeGb: 500
    diskType: pd-ssd
    imageType: COS
    labels:
      instance-family: e2
      node-pool-deployment: generic-1
      node-pool-type: generic
    machineType: e2-standard-32
    metadata:
      disable-legacy-endpoints: "true"
    minCpuPlatform: AUTOMATIC
    oauthScopes:
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    serviceAccountRef:
      external: [email protected]
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    tags:
    - allow-health-checkers
    - paas-gke-node
    - paas-jonny1-dev-us-west1
    workloadMetadataConfig:
      nodeMetadata: GKE_METADATA_SERVER
  nodeCount: 0
  nodeLocations:
  - us-west1-a
  - us-west1-b
  upgradeSettings:
    maxSurge: 20
    maxUnavailable: 0
  version: 1.17.17-gke.1101

This was tested on KCC versions 1.37.0 and 1.32.0.

@caieo
Contributor

caieo commented Mar 31, 2021

Hi @jonnylangefeld, we're very sorry you ran into these scenarios on those Container resources. Ideally, you should be able to disable Workload Identity and Autoscaling in the ways you described, so we're working on improving that behavior for Config Connector.

In the meantime, our recommended workaround is to remove the fields from your config, manually disable those fields on GCP, and then trigger a reconcile. However, we're aware that this workaround is infeasible given the number of clusters/node pools you might have. We found one workaround for the ContainerCluster scenario and have another fix in the works for the ContainerNodePool scenario:

  • For ContainerCluster's spec.workloadIdentityConfig, you should be able to disable the field by simply setting spec.workloadIdentityConfig.identityNamespace to an empty string (see the patch sketch after this list).
  • Unfortunately, ContainerNodePool's spec.autoscaling doesn't have as easy of a workaround, but we are working on a fix that will allow you to disable it by setting an empty object. We're hoping to have this in by next week's release.
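
A sketch of the ContainerCluster workaround from the first bullet, assuming the cluster from this issue (the field path comes from the bullet above; the merge-patch form is just one way to apply it):

kubectl patch containercluster/paas-jonny1-dev-us-west1 --type=merge -p '{"spec":{"workloadIdentityConfig":{"identityNamespace":""}}}'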

Like we mentioned in our sync, we're working on improving our field deletion/disabling user experience, so if you run into other scenarios, we value any additional data points you can provide to help us improve!

@jonnylangefeld
Author

Thank you @caieo for the update!
Can we assume that with next week's release, spec.workloadIdentityConfig can also be disabled by setting it to an empty object (rather than setting the string inside it to an empty string)?
I assume the fix to disable optional GKE API fields by setting them to empty objects would be a global one.

In that case we wouldn't go with the workaround for spec.workloadIdentityConfig and would just wait for next week's fix and apply it globally right away.

@caieo
Contributor

caieo commented Apr 1, 2021

@jonnylangefeld, unfortunately, our fix will be specific to the ContainerNodePool resource. It's meant to be a workaround to unblock your use case while we work on creating a clearer story around field disabling/deletion. The ContainerCluster workaround for spec.workloadIdentityConfig will still need to be done as I mentioned above. We're really sorry for this inconsistency.

@jonnylangefeld
Author

I see. So next week's release will specifically target ContainerNodePool's spec.autoscaling.
Is there an ETA for a global fix, so that any optional GCP API field can be disabled by setting an empty object?
