Increase leader-election nominal concurrency shares from 10 to 40 #129646

Open · wants to merge 2 commits into master from more-leader-election-concurrency

Conversation

@MikeSpreitzer (Member) commented Jan 15, 2025

What type of PR is this?

/kind feature
Actually, kind of hard to classify. This is a tweak to an existing feature, based on results from experience and non-CI testing.

What this PR does / why we need it:

This PR increases the default setting for the nominal concurrency shares of the leader-election priority level in the API Priority and Fairness feature. Experience shows that in some stressful situations APF gives too little concurrency to the leader election requests, which causes the corresponding controller(s) to go out of business (and, due to the stress, a replacement has trouble getting up and elected). This PR changes the default setting of that level's nominal concurrency shares from 10 to 40.
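
For scale, a sketch of the seat arithmetic (the proportional-allocation formula is the one documented for API Priority and Fairness; the server concurrency limit of 600 is an assumed example value, not something stated in this PR; the share totals of 245 and 275 come from the review discussion below):

package main

import (
	"fmt"
	"math"
)

// nominalCL approximates a limited priority level's nominal concurrency
// limit under APF: its proportional share of the server's total
// concurrency, rounded up.
func nominalCL(serverCL, shares, totalShares int) int {
	return int(math.Ceil(float64(serverCL) * float64(shares) / float64(totalShares)))
}

func main() {
	const serverCL = 600                      // assumed example value for the apiserver's total concurrency
	fmt.Println(nominalCL(serverCL, 10, 245)) // before this PR: 25 seats for leader-election
	fmt.Println(nominalCL(serverCL, 40, 275)) // after this PR: 88 seats for leader-election
}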

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

The default nominal concurrency shares of the "leader-election" priority level in the API Priority and Fairness feature have been changed from 10 to 40.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 15, 2025
@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jan 15, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MikeSpreitzer
Once this PR has been reviewed and has the lgtm label, please assign smarterclayton for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MikeSpreitzer (Member Author)

/cc @linxiulei

@MikeSpreitzer (Member Author)

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 15, 2025
@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

Because some testing shows leader election being starved.

Signed-off-by: Mike Spreitzer <[email protected]>
@MikeSpreitzer force-pushed the more-leader-election-concurrency branch from 092a634 to e22f46e on January 15, 2025 at 20:20
@fedebongio (Contributor)

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 16, 2025
@MikeSpreitzer (Member Author)

PRs to compare:

Also see:

@MikeSpreitzer (Member Author)

@linxiulei compared two runs, with and without tweaking the nominal concurrency shares of the leader-election priority level. I have graphed the results; see https://docs.google.com/document/d/1DEexuOLOTmtNDAcAgnSOlVGsDXcq9f5_f_V77qYB3ME and https://docs.google.com/document/d/1Bz4t6A_H1z-jmsOL-Fa05EW9e0DL1uvtb9JzqwXw3bY

@@ -208,7 +208,7 @@ var (
 		flowcontrol.PriorityLevelConfigurationSpec{
 			Type: flowcontrol.PriorityLevelEnablementLimited,
 			Limited: &flowcontrol.LimitedPriorityLevelConfiguration{
-				NominalConcurrencyShares: ptr.To(int32(10)),
+				NominalConcurrencyShares: ptr.To(int32(40)),
Member:

Hmm - with this change we're going up from a default total of 245 shares to 275 without changing the lendable percent.
So technically, for use cases that don't need it, we effectively decrease capacity by about 10%.

Shouldn't we also adjust the LendablePercent here?
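
To make that dilution concrete (an illustrative calculation, reusing the assumed server concurrency limit of 600 from the sketch above): a level whose shares are unchanged, such as workload-low with 100 shares, would drop from ceil(600 * 100/245) = 245 nominal seats to ceil(600 * 100/275) = 219, a reduction of roughly 11%.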

@MikeSpreitzer (Member Author):

Interesting suggestion. I have added a commit that increases the lendable percent to 75, thus keeping the unlendable shares at 10, as before this PR. There is a small difference due to the increase in total nominal shares, but I suspect that will be unimportant.
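
Spelling that out (assuming, consistent with the statement above, that the level's LendablePercent was 0 before this PR): a level's unlendable shares are NominalConcurrencyShares * (100 - LendablePercent) / 100, so before the change that is 10 * 100% = 10 and after it 40 * 25% = 10. The residual difference is that those 10 shares are now a fraction of a larger total (275 rather than 245).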

Member:

Thank you!

> There is a small difference due to the increase in total nominal shares, but I suspect that will be unimportant.

I looked into those numbers. There are changes: for typical values that we use in GKE, I see differences in actually available seats of <15%. This isn't negligible, but I also don't know how to assess the risk of such a change.

With borrowing, I think the risk will be mitigated in an overloaded cluster, but we need the ability to disable this change, so I would like to hide it behind a feature gate (though it can be enabled by default).
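
A minimal sketch of what such gating could look like (the gate name LeaderElectionIncreasedConcurrency is hypothetical and not defined by this PR; the wiring follows the standard component-base feature-gate pattern):

package defaults

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// LeaderElectionIncreasedConcurrency is a hypothetical gate name used only
// to illustrate the suggestion above; this PR does not define it.
const LeaderElectionIncreasedConcurrency featuregate.Feature = "LeaderElectionIncreasedConcurrency"

// leaderElectionShares returns the leader-election priority level's nominal
// shares and lendable percent, switching between the pre-PR and proposed
// defaults based on the hypothetical gate.
func leaderElectionShares() (nominalShares, lendablePercent int32) {
	if utilfeature.DefaultFeatureGate.Enabled(LeaderElectionIncreasedConcurrency) {
		return 40, 75 // defaults proposed in this PR
	}
	return 10, 0 // pre-PR defaults
}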

@MikeSpreitzer (Member Author):

Remember that we have two other config changes being considered now:

  1. Segregating events (introduced in Events+weighted borrowing #128974);
  2. rejiggering to stop working around the lack of borrowing ([APF] low priorities have larger effective shares than high priorities #121982 (comment))

@wojtek-t wojtek-t self-assigned this Feb 24, 2025
.. so that it takes no larger bite from system capacity than before
(modulo small difference due to total nominal shares).
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 26, 2025