[Kube-proxy]: Implement KEP-3836 #116470

Merged
merged 2 commits into from
Jul 15, 2023

Conversation

alexanderConstantinescu
Member

@alexanderConstantinescu alexanderConstantinescu commented Mar 10, 2023

I am filing this PR, but I am definitely not convinced that it is safe to go in as-is. I am filing it because I want to track the limitation which currently blocks it from being safe, so that if the KEP does miss its deadline, we know why. I am therefore putting:

/hold

until we've resolved the issues mentioned below.

This patch implements kubernetes/enhancements#3836

This PR implements exactly what was agreed on the KEP, that is to say:

  • for eTP:Cluster services we start failing the HC when the node is unschedulable or marked as deleted by means of having the deletionTimestamp set

The goal is to allow connection draining of terminating nodes to happen.

The current problem: the unschedulable field is not a good indicator for "the node is terminating". It is true that cordoning a node (making it unschedulable) is usually followed by a drain and then a delete, but there is no guarantee for that. In fact: there are cases where I believe this would completely break cluster ingress, specifically for this case which was discussed on the KEP:

I think my company actually does that when we upgrade Kube at one point. We manage the node pools where our workloads run and we do this in manual mode: so when we need to upgrade them we create a new node pool with version N + 1, then we cordon all existing Nodes in the cluster, but don't evict them, we call some service which will evict them later. But we expect ingress connectivity to work on the unschedulable nodes until that eviction service has kicked in and decided to trigger a restart (which might be minutes / hours)....killing ingress to these workloads for that time would be bad.

This PR would connection drain ingress for all eTP:Cluster services on the cluster in the case mentioned above.

/cc @thockin @danwinship @aojea

I believe this can't really be implemented until Kube has another (and more clear-cut) way for expressing "the node is terminating and about to be deleted"

What type of PR is this?

/kind feature

What this PR does / why we need it:

  • Implements connection draining for terminating nodes

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

[Kube-proxy]: implement connection draining for terminating nodes, KEP-3836

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot requested a review from aojea March 10, 2023 13:50
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 10, 2023
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 10, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 10, 2023
@alexanderConstantinescu
Member Author

/kind feature
/sig network

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/network Categorizes an issue or PR as relevant to SIG Network. area/ipvs and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 10, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 10, 2023
@alexanderConstantinescu
Member Author

/retitle [Kube-proxy]: Implement KEP-3836

@k8s-ci-robot k8s-ci-robot changed the title Implement KEP-3836 [Kube-proxy]: Implement KEP-3836 Mar 10, 2023
Member

@thockin thockin left a comment

  • Will you add "/livez" (or "/currentz" or something else more obvious) in this or a different PR?

I believe this can't really be implemented until Kube has another (and more clear-cut) way for expressing "the node is terminating and about to be deleted"

@bobbypage xref https://github.com/kubernetes/kubernetes/issues/115139

@@ -62,6 +68,7 @@ type proxierHealthServer struct {

lastUpdated atomic.Value
oldestPendingQueued atomic.Value
nodeHealthy atomic.Value
Member

atomic.Bool ?

Member Author

Yes, will change
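
For context, the suggestion above could look roughly like the sketch below. It assumes Go 1.19 or newer, where `sync/atomic` provides `atomic.Bool`; `healthState` and its methods are hypothetical names standing in for the relevant part of `proxierHealthServer`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// healthState is a hypothetical stand-in for the part of
// proxierHealthServer that tracks the node flag.
type healthState struct {
	// atomic.Bool is typed, so no type assertion is needed on Load,
	// and its zero value is a usable "false".
	nodeHealthy atomic.Bool
}

func (h *healthState) setNodeHealthy(v bool) { h.nodeHealthy.Store(v) }

func (h *healthState) isNodeHealthy() bool { return h.nodeHealthy.Load() }

func main() {
	var h healthState
	fmt.Println(h.isNodeHealthy()) // zero value: false
	h.setNodeHealthy(true)
	fmt.Println(h.isNodeHealthy())
}
```

With `atomic.Value`, `Load` returns an untyped value that is nil until the first `Store`, so every reader needs a nil check and a type assertion. With `atomic.Bool` the flag starts as false, so a constructor would probably want to call `Store(true)` if nodes should be considered fine until proven otherwise.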

@@ -156,7 +170,14 @@ type healthzHandler struct {
}

func (h healthzHandler) ServeHTTP(resp http.ResponseWriter, req *http.Request) {
var nodeHealthy bool
Member

I think we should not talk about the node being "healthy" here, but "eligible" or "viable" or something?

Member Author

Yeah, I didn't like "healthy" either. I'd like something which goes well with the path: /livez - nodeLive?

Member

It's not about being alive either though - I find "eligible" to be the least awkward so far ?

@thockin thockin self-assigned this Mar 10, 2023
@@ -656,6 +660,9 @@ func (proxier *Proxier) OnNodeDelete(node *v1.Node) {
"eventNode", node.Name, "currentNode", proxier.hostname)
return
}

proxier.healthzServer.SyncNode(node)
Member

It is better if you add your own handler, for example NodeHealthzHandler, and register it during kube-proxy initialization, gating the registration on the feature gate. That way you don't have to plumb it into all the proxies, and the code doesn't get executed when the feature gate is off.

See #111344 for reference

Member Author

But that doesn't respect the criteria of having the feature gate enabled/disabled and immediately experiencing changing behavior as a consequence, right? I mean: if the watcher handler is added depending on if the feature gate is on/off, then we'd need to restart/re-initialize kube-proxy if the feature gate is flipped. In this case we always want to compute/react to the Node event, but consider/not consider it depending on the feature gate.

Member

I can't follow, but I'm a bit slow these days, so I may be missing something

In this case we always want to compute/react to the Node event, but consider/not consider it depending on the feature gate.

Why do you want to compute it if you are never going to use it? You always have to restart ... the feature gate does just that: it does not execute code that is under the feature gate.

Member

Feature gates can't (today) change live - they always require a restart. That said, this plumbing is minor - I can go either way. Antonio has more context on current best-practice :)
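
The registration-time approach discussed in this thread might be sketched as below. Everything here is an assumption made for illustration: `nodeHandler`, `healthzNodeHandler`, and `buildHandlers` are hypothetical names, and `featureEnabled` stands in for whatever feature gate lookup kube-proxy performs at start-up.

```go
package main

import "fmt"

// nodeHandler is a hypothetical stand-in for a node event handler interface.
type nodeHandler interface {
	OnNodeUpdate(name string, eligible bool)
}

// healthzNodeHandler records per-node eligibility for the health server.
type healthzNodeHandler struct {
	eligible map[string]bool
}

func (h *healthzNodeHandler) OnNodeUpdate(name string, eligible bool) {
	h.eligible[name] = eligible
}

// buildHandlers returns the handlers to register during initialization.
// The gate is consulted once; since feature gates need a restart to change,
// nothing has to re-check it on every node event.
func buildHandlers(featureEnabled bool) []nodeHandler {
	var handlers []nodeHandler
	if featureEnabled {
		handlers = append(handlers, &healthzNodeHandler{eligible: map[string]bool{}})
	}
	return handlers
}

func main() {
	fmt.Println(len(buildHandlers(false))) // gate off: no handler, no gated code runs
	fmt.Println(len(buildHandlers(true)))  // gate on: handler is registered
}
```

The trade-off is the one raised in the comments: the alternative keeps the handler always registered and checks the gate inside it, which costs a little plumbing in each proxier.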

@alexanderConstantinescu
Member Author

Will you add "/livez" (or "/currentz" or something else more obvious) in this or a different PR?

Sorry, forgot about that. I will update

@thockin
Member

thockin commented Mar 14, 2023

code-freeze in about 6 hours - are we kicking this to next release?

@alexanderConstantinescu
Member Author

code-freeze in about 6 hours - are we kicking this to next release?

Yeah, unfortunately I doubt we will have the time to update this PR + implement the metrics we agreed on the KEP + review it + possibly address the review. I had urgent non-upstream things I needed to focus on today, sorry about that

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2023
@alexanderConstantinescu
Member Author

@thockin / @aojea : just wanted to remind + ask for a review given the code freeze deadline which is approaching. FYI: I am on PTO until the 19th of July - but I don't think there's anything left to address for what concerns this PR.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2023
TL;DR: we want to start failing the LB HC if a node is tainted with ToBeDeletedByClusterAutoscaler.
This field might need refinement, but it is currently deemed our best way of knowing whether
a node is about to get deleted. We want to do this only for eTP:Cluster services.

The goal is to connection drain terminating nodes.
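
A taint-based check along the lines of this commit message could be sketched as follows. This is an illustration, not the merged code: `taint` is a hypothetical stand-in for `v1.Taint`, and `hasToBeDeletedTaint` is a made-up helper name. Only the taint key itself comes from the message above.

```go
package main

import "fmt"

// toBeDeletedTaint is the taint key named in the commit message.
const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// taint is a hypothetical, trimmed-down stand-in for v1.Taint.
type taint struct {
	Key    string
	Effect string
}

// hasToBeDeletedTaint reports whether the node carries the taint that
// signals imminent deletion, in which case the LB health check for
// eTP:Cluster services should start failing.
func hasToBeDeletedTaint(taints []taint) bool {
	for _, t := range taints {
		if t.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}

func main() {
	cordoned := []taint{{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"}}
	doomed := []taint{{Key: toBeDeletedTaint, Effect: "NoSchedule"}}

	fmt.Println(hasToBeDeletedTaint(cordoned)) // cordoned only: keep passing
	fmt.Println(hasToBeDeletedTaint(doomed))   // about to be deleted: start failing
}
```

Unlike a check on the unschedulable field, a node that is only cordoned keeps passing here, which avoids the upgrade scenario described in the PR description.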
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2023
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jul 10, 2023

@alexanderConstantinescu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-e2e-gci-gce-ipvs | 08dd657 | link | false | /test pull-kubernetes-e2e-gci-gce-ipvs |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@aroradaman
Member

/retest-required

Member

@thockin thockin left a comment

We need a good place to document these URLs and semantics. https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/ is auto-generated. Where should such docs go? Ideally every command would have something like "kube-proxy --man" which would show flags and more.

@sftim ideas?

healthy, lastUpdated, currentTime := h.hs.isHealthy()
resp.Header().Set("Content-Type", "application/json")
resp.Header().Set("X-Content-Type-Options", "nosniff")
if !healthy {
metrics.ProxyLivez503Total.Inc()
Member

Do we want 2 metrics or just 2 labels on one metric?

Member

two labels is much better IMHO

Member

@thockin
Member

thockin commented Jul 11, 2023

I'll be OOO for code-freeze, so I am going to approve this and hold, and we can either merge as-is, or fixup, or decline changes.

/approve
/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jul 11, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 64db7ca76d9ee0c7f820ad33a14f5f26eff3c4b1

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2023
@sftim
Contributor

sftim commented Jul 11, 2023

We need a good place to document these URLs and semantics. https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/ is auto-generated. Where should such docs go? Ideally every command would have something like "kube-proxy --man" which would show flags and more.

If we can make an artefact at build time (eg: JSON with some embedded Markdown), then SIG Docs can consume it.

If we mean URL paths like /healthz, maybe OpenAPI format could work?

@thockin
Member

thockin commented Jul 11, 2023 via email

Comment on lines +196 to +217
// ProxyLivez200Total is the number of returned HTTP Status 200 for each
// livez probe.
ProxyLivez200Total = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      kubeProxySubsystem,
		Name:           "proxy_livez_200_total",
		Help:           "Cumulative proxy livez HTTP status 200",
		StabilityLevel: metrics.ALPHA,
	},
)

// ProxyLivez503Total is the number of returned HTTP Status 503 for each
// livez probe.
ProxyLivez503Total = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      kubeProxySubsystem,
		Name:           "proxy_livez_503_total",
		Help:           "Cumulative proxy livez HTTP status 503",
		StabilityLevel: metrics.ALPHA,
	},
)

Member

Suggested change
- // ProxyLivez200Total is the number of returned HTTP Status 200 for each
- // livez probe.
- ProxyLivez200Total = metrics.NewCounter(
- 	&metrics.CounterOpts{
- 		Subsystem:      kubeProxySubsystem,
- 		Name:           "proxy_livez_200_total",
- 		Help:           "Cumulative proxy livez HTTP status 200",
- 		StabilityLevel: metrics.ALPHA,
- 	},
- )
- // ProxyLivez503Total is the number of returned HTTP Status 503 for each
- // livez probe.
- ProxyLivez503Total = metrics.NewCounter(
- 	&metrics.CounterOpts{
- 		Subsystem:      kubeProxySubsystem,
- 		Name:           "proxy_livez_503_total",
- 		Help:           "Cumulative proxy livez HTTP status 503",
- 		StabilityLevel: metrics.ALPHA,
- 	},
- )
+ // ProxyLivezTotal is the number of returned HTTP Status for each
+ // livez probe.
+ ProxyLivezTotal = metrics.NewCounterVec(
+ 	&metrics.CounterOpts{
+ 		Subsystem:      kubeProxySubsystem,
+ 		Name:           "proxy_livez_total",
+ 		Help:           "Cumulative proxy livez HTTP status",
+ 		StabilityLevel: metrics.ALPHA,
+ 	},
+ 	[]string{"code"},
+ )

@aojea
Member

aojea commented Jul 13, 2023

@aojea
Member

aojea commented Jul 14, 2023

@thockin if @alexanderConstantinescu is on vacation and can't get to this before code freeze and since this is feature gated, I think we can merge and iterate

@thockin
Member

thockin commented Jul 14, 2023

ACK - clear the hold if you are happy!

@aojea
Member

aojea commented Jul 15, 2023

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit f343657 into kubernetes:master Jul 15, 2023
13 of 14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jul 15, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ipvs area/kube-proxy cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
7 participants