Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KCCM] service controller: add providerID Node predicate #117388

Merged
merged 2 commits into from
Apr 18, 2023

Conversation

alexanderConstantinescu
Copy link
Member

@alexanderConstantinescu alexanderConstantinescu commented Apr 15, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

The providerID is used by the cloud-provider code to correlate the Node to the underlying cloud instance. The service controller in the KCCM does not define the provider ID as a predicate for syncing nodes, hence nodes can be passed to the cloud-provider code with a missing provider ID. Should the cloud-provider not handle the missing provider ID as an error, and instead silently fail: then the load balancers won't be re-synced with the node in question. Moreover: the service controller does not sync the assignment of the providerID on the Node object

This PR adds the provider ID as a predicate for syncing the node. As such we are sure the cloud provider won't error at any time because of this missing attribute, and we always re-sync the node once it gets the provider ID set.

This was overlooked in #109706 because the providerID was never explicitly defined in the service controller for syncing nodes, it was indirectly covered by the fact that the service controller always synced nodes on the Ready condition...which comes after the setting of the providerID on the Node object. Also, others clouds might have other mechanisms for retrieving the instance besides solely relying on the providerID.

Which issue(s) this PR fixes:

Fixes #117375

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a 1.25 regression: [KCCM] service controller: change the cloud controller manager to make `providerID` a predicate when synchronizing nodes. This change allows load balancer integrations to ensure that  the `providerID` is set when configuring load balancers and targets.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig network
/sig cloud-provider

/cc @aojea
/assign @thockin

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 15, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Apr 15, 2023
@k8s-ci-robot k8s-ci-robot requested a review from aojea April 15, 2023 23:21
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 15, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/cloudprovider labels Apr 15, 2023
@sftim
Copy link
Contributor

sftim commented Apr 17, 2023

Changelog suggestion

Changed the cloud controller manager to make `providerID` a predicate when synchronizing nodes.
This change allows load balancer integrations to ensure that  the `providerID` is set when configuring
load balancers and targets.

@sftim
Copy link
Contributor

sftim commented Apr 17, 2023

others clouds might have other mechanisms for retrieving the instance besides solely relying on the providerID.

How might we account for this?

@sftim
Copy link
Contributor

sftim commented Apr 17, 2023

We should document the contract between cloud provider and core of Kubernetes; I worry that right now we can't work out which side isn't keeping to the bargain.

@aojea
Copy link
Member

aojea commented Apr 17, 2023

others clouds might have other mechanisms for retrieving the instance besides solely relying on the providerID.

How might we account for this?

/assign @andrewsykim @nckturner

@alexanderConstantinescu
Copy link
Member Author

Changelog suggestion

I am fine with that. I updated the release-note comment.

How might we account for this?

I don't have a good answer unfortunately. Tracking/finding cloud provider implementations which are depending on the cloud-controller, currently requires finding which cloud provider is implementing the interface (usually: by grokking GitHub), reading the implementation and interpreting it. There is no contract defined, for which Node fields the cloud provider implementations requires as to successfully sync - at least to best of my knowledge.

I would however argue that the providerID was mistakenly overlooked as a required field, even without a contract. It is a common discriminator for many controllers when mapping the Node object to the cloud provider instance ID.

@nckturner
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 056acafd4374ed71d7e6664258e56b18541bbbc8

@dims
Copy link
Member

dims commented Apr 18, 2023

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, dims, mmerkes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2023
@k8s-ci-robot k8s-ci-robot merged commit 8bba479 into kubernetes:master Apr 18, 2023
12 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Apr 18, 2023
@@ -970,6 +975,10 @@ func nodeIncludedPredicate(node *v1.Node) bool {
return !hasExcludeBalancerLabel
}

func nodeHasProviderIDPredicate(node *v1.Node) bool {
return node.Spec.ProviderID != ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no case where someone runs without this set? It is strictly optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good question...I assumed that to be the case, but I can't tell for sure :/

The safest might be to just trigger a resync in shouldSyncUpdatedNode due to changes to the field, but not actually require that nodes have the field set?

@thockin
Copy link
Member

thockin commented Apr 25, 2023

I don't know why it would ever go from having a value to no value, but I guess it's allowed.

If we don't demand a value, we are back to the orginal bug report, no? We will try to sync nodes that are not ready to be synced. I just don't know if there's a use-case for no value...

@alexanderConstantinescu
Copy link
Member Author

If we don't demand a value, we are back to the orginal bug report, no?

No, the original bug report was about the fact that we weren't even reacting to the provider ID being set. I am saying we could react to the change, but not filter out nodes based on it. The corollary is as you say: that we will sync nodes that both have and don't have a provider ID set, which will cause some cloud providers to error for the ones that don't have it, but I think that's fine since it will only be temporary.

@thockin
Copy link
Member

thockin commented Apr 25, 2023

Ahh, subtle, yes. I guess they are already tryoing to sync nodes that have no value here, so this would not be worse than before, right?

SGTM

@alexanderConstantinescu
Copy link
Member Author

Ahh, subtle, yes. I guess they are already tryoing to sync nodes that have no value here, so this would not be worse than before, right?

SGTM

Right, eventually the node will get the provider ID assigned, we'll sync on that change and the cloud provider code will be happy.

Are we fine with filing a PR on master, but holding any backport until someone actually reports this as a bug? We don't really know if some clouds run without setting the provider ID but still expect to provision LBs, so not sure a backport is warranted at this point.

@thockin
Copy link
Member

thockin commented Apr 26, 2023

SGTM

k8s-ci-robot added a commit that referenced this pull request May 4, 2023
…rry-pick-of-#117388-upstream-release-1.26

Automated cherry pick of #117388: Re-work logic in shouldSyncUpdatedNode
k8s-ci-robot added a commit that referenced this pull request May 4, 2023
…rry-pick-of-#117388-upstream-release-1.27

Automated cherry pick of #117388: Re-work logic in shouldSyncUpdatedNode
@liggitt liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Sep 9, 2023
@liggitt
Copy link
Member

liggitt commented Sep 9, 2023

it looks like the change that caused the problem went into 1.25, but the picks of this fix only made it back to 1.26. Does this need to be picked to 1.25 as well?

@alexanderConstantinescu
Copy link
Member Author

Does this need to be picked to 1.25 as well?

No. 1.25 doesn't have the changes which required this fix to begin with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cloud provider updating load balancer before registering the node
10 participants