fix sync_proxy_rules_iptables_total metric #119140

Merged: 3 commits merged on Jul 14, 2023


@danwinship (Contributor) commented Jul 6, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Reverts the definition of the sync_proxy_rules_iptables_total metric back to the generally-understood pre-MinimizeIPTablesRestore meaning: "the total number of iptables rules that kube-proxy is responsible for". Also adds a new metric, sync_proxy_rules_iptables_last, preserving the behavior that sync_proxy_rules_iptables_total had accidentally slipped into: "the number of iptables rules that kube-proxy reprogrammed on the last sync".

Also fixes a bug noticed while adding unit tests for this: if syncProxyRules() deleted any stale service/endpoint chains, it would count each of those deletions as a "rule" for purposes of the metric, due to carelessness in how it was counting.

Which issue(s) this PR fixes:

Fixes #118978

Does this PR introduce a user-facing change?

The kube-proxy `sync_proxy_rules_iptables_total` metric has reverted to its
pre-1.27 behavior of tracking the total number of iptables rules that
kube-proxy is responsible for, rather than only counting the number of rules
that it re-synced on the last sync. The new `sync_proxy_rules_iptables_last`
metric now gives the latter number.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3453-minimize-iptables-restore/README.md

/sig network
/priority important-soon
/assign @thockin @aojea

This required fixing a small bug in the metric, where it had
previously been counting the "-X" lines that had been passed to
iptables-restore to delete stale chains, rather than only counting the
actual rules.
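The difference between counting every line passed to iptables-restore and counting only actual rules can be illustrated with a small sketch. This is not kube-proxy's code: `countRuleLines`, `countAllCommandLines`, and the chain names in the payload are hypothetical, and the sketch assumes the usual iptables-restore convention that rules are appended with `-A`, chains are declared with `:CHAIN`, and chains are deleted with `-X`.

```go
package main

import (
	"fmt"
	"strings"
)

// countRuleLines counts only rule lines ("-A CHAIN ...") in an
// iptables-restore payload. Chain declarations (":CHAIN - [0:0]") and
// chain deletions ("-X CHAIN") are not rules, so they are skipped.
// Hypothetical helper, for illustration only.
func countRuleLines(payload string) int {
	n := 0
	for _, line := range strings.Split(payload, "\n") {
		if strings.HasPrefix(line, "-A ") {
			n++
		}
	}
	return n
}

// countAllCommandLines mimics the careless count: every line starting
// with "-", including "-X" chain deletions, is treated as a rule.
func countAllCommandLines(payload string) int {
	n := 0
	for _, line := range strings.Split(payload, "\n") {
		if strings.HasPrefix(line, "-") {
			n++
		}
	}
	return n
}

func main() {
	payload := strings.Join([]string{
		"*nat",
		":KUBE-SERVICES - [0:0]",
		":KUBE-SVC-EXAMPLE - [0:0]",
		":KUBE-SVC-STALE - [0:0]",
		"-A KUBE-SERVICES -d 10.0.0.10/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE",
		"-X KUBE-SVC-STALE",
		"COMMIT",
	}, "\n")
	fmt.Println(countAllCommandLines(payload), countRuleLines(payload)) // 2 1
}
```

With one stale chain being deleted, the careless count reports two "rules" where only one exists; every additional stale chain inflates it by one more.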
@k8s-ci-robot added labels release-note (Denotes a PR that will be considered when it comes time to generate release notes.) and size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.) Jul 6, 2023
@k8s-ci-robot added labels kind/bug (Categorizes issue or PR as related to a bug.), sig/network (Categorizes an issue or PR as relevant to SIG Network.), priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) Jul 6, 2023
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jul 6, 2023
Historically, IptablesRulesTotal could have been interpreted as either
"the total number of iptables rules kube-proxy is responsible for" or
"the number of iptables rules kube-proxy rewrote on the last sync".
Post-MinimizeIPTablesRestore, these are very different things (and
IptablesRulesTotal unintentionally became the latter).

Fix IptablesRulesTotal (sync_proxy_rules_iptables_total) to be "the
total number of iptables rules kube-proxy is responsible for" and add
IptablesRulesLastSync (sync_proxy_rules_iptables_last) to be "the
number of iptables rules kube-proxy rewrote on the last sync".
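On a full sync the two readings give the same number; they only diverge on partial syncs. A minimal sketch of how the two metrics relate, assuming a hypothetical per-sync summary of rules written versus rules skipped. The `syncResult` and `ruleMetrics` types are invented for illustration, and plain ints stand in for the real Prometheus gauges.

```go
package main

import "fmt"

// syncResult is a hypothetical summary of one syncProxyRules() pass.
type syncResult struct {
	writtenRules int // rules passed to iptables-restore on this sync
	skippedRules int // rules left untouched by a partial sync
}

// ruleMetrics holds plain-int stand-ins for the two gauges.
type ruleMetrics struct {
	total int // stands in for sync_proxy_rules_iptables_total
	last  int // stands in for sync_proxy_rules_iptables_last
}

// record updates both stand-in gauges after a sync.
func (m *ruleMetrics) record(r syncResult) {
	// Every rule kube-proxy is responsible for, rewritten or not.
	m.total = r.writtenRules + r.skippedRules
	// Only the rules that were actually reprogrammed this time.
	m.last = r.writtenRules
}

func main() {
	var m ruleMetrics

	m.record(syncResult{writtenRules: 1000, skippedRules: 0}) // full sync
	fmt.Println(m.total, m.last)                              // 1000 1000

	m.record(syncResult{writtenRules: 12, skippedRules: 988}) // partial sync
	fmt.Println(m.total, m.last)                              // 1000 12
}
```

Because the full-sync case produces identical values, a single metric could plausibly mean either thing until partial syncs made the ambiguity visible.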
```diff
@@ -852,6 +852,9 @@ func (proxier *Proxier) syncProxyRules() {
 	proxier.natChains.Reset()
 	proxier.natRules.Reset()
 
+	skippedNatChains := &proxyutil.LineBuffer{}
+	skippedNatRules := &proxyutil.LineBuffer{}
```

Review comment from @danwinship (Contributor Author) on the line above:

(yes, this is wasteful; fixed in the next commit)

@danwinship (Contributor Author)

/retest

@aojea (Member) commented Jul 7, 2023

LGTM

The only doubt I have is about switching the pointers so that lines are stored in a different buffer: it is not obvious to future developers that some code in the middle of the loop does that.
That is not a problem, though, as long as any rules added there in the future are also meant to be skipped.

/assign @thockin
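The pointer flip under discussion can be sketched in a stripped-down loop. This is a hypothetical simplification, not the actual syncProxyRules() code: `lineBuffer` is a minimal stand-in for proxyutil.LineBuffer, `service.changed` stands in for whatever change tracking decides that a service's chains can be skipped, and the rule text is placeholder.

```go
package main

import "fmt"

// lineBuffer is a simplified stand-in for proxyutil.LineBuffer.
type lineBuffer struct{ lines []string }

func (b *lineBuffer) Write(line string) { b.lines = append(b.lines, line) }
func (b *lineBuffer) Lines() int        { return len(b.lines) }

// service is hypothetical; changed marks a service whose chains must be rewritten.
type service struct {
	name    string
	changed bool
}

// syncRules sketches the pointer flip: partway through each iteration,
// the local natRules pointer may be redirected to skippedNatRules.
func syncRules(services []service, tryPartialSync bool) (written, skipped int) {
	realNatRules := &lineBuffer{}    // would be sent to iptables-restore
	skippedNatRules := &lineBuffer{} // counted, never sent

	for _, svc := range services {
		natRules := realNatRules
		// Required rules: always written to the real buffer.
		natRules.Write("-A KUBE-SERVICES -j KUBE-SVC-" + svc.name)

		// The flip: for unchanged services, every write below this
		// point silently lands in the skipped buffer instead.
		if tryPartialSync && !svc.changed {
			natRules = skippedNatRules
		}
		natRules.Write("-A KUBE-SVC-" + svc.name + " -j KUBE-SEP-" + svc.name)
	}
	return realNatRules.Lines(), skippedNatRules.Lines()
}

func main() {
	services := []service{{"a", true}, {"b", false}}
	fmt.Println(syncRules(services, true))  // 3 1
	fmt.Println(syncRules(services, false)) // 4 0
}
```

Any write later added after the `if` inherits the redirect for unchanged services, which is the non-obvious behavior raised in the comment above.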

@danwinship (Contributor Author) commented Jul 7, 2023

Yeah, I didn't love that approach, but it seemed simplest...

I guess I could try doing it with separate pointers throughout, like requiredNATRules and fullSyncOnlyNATRules, and it would decide at the top (when it sets tryPartialSync) whether to set fullSyncOnlyNATRules to the same buffer as requiredNATRules or a dummy buffer.

(This would actually make one of the earlier refactorings (#110266) irrelevant; I'd carefully reorganized all of the code in the main sync loop so that the required rules all come first, and then the rules that can be skipped at the bottom. So maybe if we refactored it with requiredNATRules vs fullSyncOnlyNATRules, I should undo some of that reorganization, because maybe it would be more logical to go back to grouping things the old way if we don't have to worry about being able to skip the second half of the loop?)
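A sketch of the separate-pointers alternative, under stated assumptions: the names requiredNATRules, fullSyncOnlyNATRules, and tryPartialSync come from the comment above, while the `lineWriter` interface, both buffer types, and `syncRules` are hypothetical. For brevity it treats every service's full-sync-only rules as skippable whenever tryPartialSync is set, leaving out the per-service change tracking a real partial sync would still need.

```go
package main

import "fmt"

// lineWriter is a hypothetical minimal interface shared by both buffers.
type lineWriter interface {
	Write(line string)
	Lines() int
}

// realBuffer keeps every line; its contents would go to iptables-restore.
type realBuffer struct{ lines []string }

func (b *realBuffer) Write(line string) { b.lines = append(b.lines, line) }
func (b *realBuffer) Lines() int        { return len(b.lines) }

// countingBuffer only counts writes and discards the data.
type countingBuffer struct{ n int }

func (b *countingBuffer) Write(string) { b.n++ }
func (b *countingBuffer) Lines() int   { return b.n }

// syncRules returns how many rules were written for real and how many
// were only counted.
func syncRules(serviceNames []string, tryPartialSync bool) (written, counted int) {
	requiredNATRules := &realBuffer{}

	// Chosen once, up front, where tryPartialSync is decided.
	var fullSyncOnlyNATRules lineWriter = requiredNATRules
	if tryPartialSync {
		fullSyncOnlyNATRules = &countingBuffer{}
	}

	for _, name := range serviceNames {
		// Every write names its destination; nothing is reassigned mid-loop.
		requiredNATRules.Write("-A KUBE-SERVICES -j KUBE-SVC-" + name)
		fullSyncOnlyNATRules.Write("-A KUBE-SVC-" + name + " -j KUBE-SEP-" + name)
	}

	written = requiredNATRules.Lines()
	if cb, ok := fullSyncOnlyNATRules.(*countingBuffer); ok {
		counted = cb.Lines()
	}
	return written, counted
}

func main() {
	names := []string{"a", "b"}
	fmt.Println(syncRules(names, false)) // 4 0
	fmt.Println(syncRules(names, true))  // 2 2
}
```

The trade-off is visible here: each write states its destination explicitly, so rule ordering within the loop no longer decides what gets skipped, but the buffer choice is made far from the code that depends on it.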

```diff
@@ -852,8 +852,8 @@ func (proxier *Proxier) syncProxyRules() {
 	proxier.natChains.Reset()
 	proxier.natRules.Reset()
 
-	skippedNatChains := &proxyutil.LineBuffer{}
-	skippedNatRules := &proxyutil.LineBuffer{}
+	skippedNatChains := proxyutil.NewDummyLineBuffer()
```

Review comment (Member):

A comment here would not hurt.


```go
// NewDummyLineBuffer returns a dummy LineBuffer that counts the number of writes but
// throws away the data.
func NewDummyLineBuffer() LineBuffer {
```

Review comment (Member):

Maybe NewLineCounter() or NewDiscardLineBuffer() would make it feel like less of a test-infra thing?

@thockin (Member) left a comment:

I don't hate the pointer flip. I wonder if we can guard against accidentally writing to proxier.natRules by making some of these helper methods into free functions, but that can be a follow-up, I think.
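The free-function idea, sketched under stated assumptions: a helper without a *Proxier receiver has no way to reach proxier.natRules unless the caller passes it in, so the destination buffer is always explicit at the call site. `lineWriter`, `sliceWriter`, and `writeEndpointJumps` are hypothetical names, and the rule text is a simplified placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// lineWriter is a hypothetical minimal buffer interface.
type lineWriter interface{ Write(line string) }

// sliceWriter collects lines in memory.
type sliceWriter struct{ lines []string }

func (w *sliceWriter) Write(line string) { w.lines = append(w.lines, line) }

// writeEndpointJumps is a free function rather than a *Proxier method:
// with no receiver, it can only write to the buffer it is handed.
func writeEndpointJumps(natRules lineWriter, svcChain string, endpointChains []string) {
	for _, ep := range endpointChains {
		natRules.Write(strings.Join([]string{"-A", svcChain, "-j", ep}, " "))
	}
}

func main() {
	w := &sliceWriter{}
	writeEndpointJumps(w, "KUBE-SVC-EXAMPLE", []string{"KUBE-SEP-1", "KUBE-SEP-2"})
	fmt.Println(len(w.lines)) // 2
	fmt.Println(w.lines[0])   // -A KUBE-SVC-EXAMPLE -j KUBE-SEP-1
}
```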

@thockin (Member) commented Jul 13, 2023

/lgtm

/hold if you want to change the "dummy" name

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jul 13, 2023
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 7c15eef3f6bff22f74285b290ef3f7e262c02a4f

Rather than actually assembling all of the rules we aren't going to
use, just count them and throw them away.
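The idea in this commit message, counting skipped rules instead of assembling them, can be sketched with two implementations of one interface. Neither type is proxyutil's: `lineBuffer`, `realLineBuffer`, and `discardLineBuffer` are invented names, and the variadic `Write` that joins its arguments with spaces is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// lineBuffer is a hypothetical line-oriented buffer interface.
type lineBuffer interface {
	Write(args ...interface{})
	Lines() int
}

// realLineBuffer formats every line and stores it.
type realLineBuffer struct {
	b     strings.Builder
	lines int
}

// Write joins args with single spaces and terminates the line.
func (r *realLineBuffer) Write(args ...interface{}) {
	for i, a := range args {
		if i > 0 {
			r.b.WriteByte(' ')
		}
		fmt.Fprint(&r.b, a)
	}
	r.b.WriteByte('\n')
	r.lines++
}

func (r *realLineBuffer) Lines() int { return r.lines }

// discardLineBuffer counts writes but never formats or stores anything,
// so a skipped rule costs one increment instead of string building.
type discardLineBuffer struct{ lines int }

func (d *discardLineBuffer) Write(args ...interface{}) { d.lines++ }
func (d *discardLineBuffer) Lines() int                { return d.lines }

func main() {
	bufs := []lineBuffer{&realLineBuffer{}, &discardLineBuffer{}}
	for _, buf := range bufs {
		buf.Write("-A", "KUBE-SVC-EXAMPLE", "-j", "KUBE-SEP-1")
		buf.Write("-A", "KUBE-SVC-EXAMPLE", "-j", "KUBE-SEP-2")
		fmt.Println(buf.Lines()) // 2 for both buffers
	}
	fmt.Print(bufs[0].(*realLineBuffer).b.String())
}
```

Both buffers report the same line count, which is all the metric needs, but only the real one pays for formatting and memory.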
@k8s-ci-robot removed the lgtm label Jul 14, 2023
@danwinship (Contributor Author)

updated the name
/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label Jul 14, 2023
@aojea (Member) commented Jul 14, 2023

/lgtm

@k8s-ci-robot added the lgtm label Jul 14, 2023
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: d890cccb04781596c4c122d7396c2add48b39b03

@k8s-ci-robot (Contributor)

@danwinship: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-kubernetes-e2e-gci-gce-ipvs | 883d0c3 | link | false | /test pull-kubernetes-e2e-gci-gce-ipvs |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@k8s-ci-robot merged commit ffa4c26 into kubernetes:master Jul 14, 2023
12 of 13 checks passed
@k8s-ci-robot added this to the v1.28 milestone Jul 14, 2023
@danwinship deleted the iptables-metrics branch October 8, 2023 21:21
Labels
approved, area/ipvs, area/kube-proxy, cncf-cla: yes, kind/bug, lgtm, needs-triage, priority/important-soon, release-note, sig/network, size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

meaning of sync_proxy_rules_iptables_total given MinimizeIPTablesRestore
4 participants