
sched: Optimize anti-affinity node label check #132071


Open · wants to merge 1 commit into master from inter-pod-affinity

Conversation

utam0k
Member

@utam0k utam0k commented Jun 3, 2025

What type of PR is this?

/kind cleanup
/sig scheduling

What this PR does / why we need it:

The performance trade-off depends on the relative sizes of two collections (see the sketch after this list):

  • Number of node labels (typically 10-15 on GKE, can be 20-30+ with custom labels)
  • Number of entries in existingAntiAffinityCounts
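
For context, here is a minimal sketch of the two iteration directions being compared. These are illustrative fragments assumed to live inside satisfyExistingPodsAntiAffinity; the identifiers mirror the benchmark code later in this thread, and the exact lines in the PR may differ.

```go
// Master: range over the node's labels and probe the anti-affinity map,
// so the cost grows with the number of node labels.
for k, v := range nodeInfo.Node().Labels {
	if state.existingAntiAffinityCounts[topologyPair{key: k, value: v}] > 0 {
		return false
	}
}

// This PR: range over the anti-affinity entries and probe the node's labels,
// so the cost grows with the number of existingAntiAffinityCounts entries.
nodeLabels := nodeInfo.Node().Labels
for tp, count := range state.existingAntiAffinityCounts {
	if count > 0 && nodeLabels[tp.key] == tp.value {
		return false
	}
}
```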
Benchmarking

| Scenario | Master (ns/op) | This PR (ns/op) | Performance |
| --- | --- | --- | --- |
| Few node labels (10) vs Many anti-affinity (100) | 37.17 | 113.5 | 3x slower ❌ |
| Many node labels (30) vs Few anti-affinity (5) | 199.4 | 41.69 | 4.8x faster ✅ |
| Equal size (15 vs 15) | 61.52 | 41.75 | 1.5x faster ✅ |

Scheduling Perf

Note: L = Node Labels, AA = Anti-Affinity rules

1. Direct Execution Time Comparison

[Image: graph1_filter_comparison]

Top 30 test cases showing side-by-side comparison of filter execution times. Green bars (This Branch) are consistently lower than red bars (Master) for these cases.

2. Performance Scatter Plot

[Image: graph2_improvement]

Each point represents a test case. Points below the diagonal line indicate improved performance with dynamic selection.

3. Performance Table (Top 20)

| Test Case | Master (ms) | This branch (ms) | Improvement | Speedup |
| --- | --- | --- | --- | --- |
| 90L-5AA | 0.527 | 0.146 | 72.3% | 3.6x |
| 100L-3AA | 0.523 | 0.168 | 67.8% | 3.1x |
| 100L-40AA | 0.507 | 0.180 | 64.5% | 2.8x |
| 50L-5AA | 0.350 | 0.126 | 64.0% | 2.8x |
| 100L-5AA | 0.416 | 0.159 | 61.8% | 2.6x |
| 100L-75AA | 0.250 | 0.095 | 61.8% | 2.6x |
| 60L-5AA | 0.584 | 0.227 | 61.2% | 2.6x |
| 9000L-5AA | 0.380 | 0.150 | 60.6% | 2.5x |
| 100L-95AA | 0.344 | 0.143 | 58.5% | 2.4x |
| 100L-1AA | 0.445 | 0.194 | 56.4% | 2.3x |
| 100L-70AA | 0.294 | 0.132 | 54.9% | 2.2x |
| 40L-5AA | 0.296 | 0.138 | 53.5% | 2.2x |
| 100L-400AA | 0.361 | 0.176 | 51.2% | 2.1x |
| 900L-5AA | 0.403 | 0.207 | 48.6% | 1.9x |
| 100L-200AA | 0.262 | 0.138 | 47.4% | 1.9x |
| 1000L-5AA | 0.229 | 0.124 | 46.1% | 1.9x |
| 200L-5AA | 0.415 | 0.225 | 45.8% | 1.8x |
| 3000L-5AA | 0.231 | 0.126 | 45.4% | 1.8x |
| 6000L-5AA | 0.378 | 0.210 | 44.5% | 1.8x |
| 100L-100AA | 0.140 | 0.079 | 43.7% | 1.8x |
4. pods/sec

| Node Labels | Non-optimized (pods/sec) | Optimized (pods/sec) | Change | Improvement |
| --- | --- | --- | --- | --- |
| 100 | 466.7 | 489.5 | +22.8 | +4.9% |
| 1,000 | 469.9 | 433.7 | -36.2 | -7.7% |
| 10,000 | 350.3 | 369.0 | +18.7 | +5.3% |
| 20,000 | 330.1 | 395.0 | +64.9 | +19.7% |
| 50,000 | 298.0 | 405.0 | +107.0 | +35.9% |

Summary

  • 78 test configurations tested using scheduler_perf
  • 49 cases (62.8%) showed improvement
  • Maximum improvement: 72.3% (3.6x speedup)
  • Best improvements occur when node labels and anti-affinity counts differ significantly

Which issue(s) this PR fixes:

Fixes #132070

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Improved scheduler performance by optimizing the InterPodAffinity plugin's node label iteration when checking existing pod anti-affinity constraints

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 3, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 3, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: utam0k
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from damemi and macsko June 3, 2025 11:23
@haosdent
Member

haosdent commented Jun 6, 2025

Do you have any performance tests with/without this patch to measure the improvement?

```diff
-	if state.existingAntiAffinityCounts[tp] > 0 {
-		return false
+	// Check if this node matches any anti-affinity topology pairs.
+	nodeLabels := nodeInfo.Node().Labels
```
Member

What if nodeLabels only has a few items while existingAntiAffinityCounts has more items?
Would this patch make the performance worse in this situation?

Member Author

Thanks for your review! Yes, your concern is right; I have already described it in the PR description:

> The performance trade-off depends on the relative sizes of two collections:
> • Number of node labels (typically 10-15 on GKE, can be 20-30+ with custom labels)
> • Number of entries in existingAntiAffinityCounts

If you know of any cases where things go bad in the real world, please let me know.

Member

Do we have any benchmark results that show that this PR is visibly improving the performance?

Member Author

@utam0k utam0k Jun 9, 2025

@macsko Thank you for taking a look. Here are the benchmark results.

Benchmark Results

| Scenario | Main (ns/op) | This PR (ns/op) | Performance |
| --- | --- | --- | --- |
| Few node labels (10) vs Many anti-affinity (100) | 37.17 | 113.5 | 3x slower ❌ |
| Many node labels (30) vs Few anti-affinity (5) | 199.4 | 41.69 | 4.8x faster ✅ |
| Equal size (15 vs 15) | 61.52 | 41.75 | 1.5x faster ✅ |
Code

```go
package interpodaffinity

import (
	"fmt"
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func BenchmarkSatisfyExistingPodsAntiAffinity(b *testing.B) {
	testCases := []struct {
		nodeLabels       int
		antiAffinityKeys int
		desc             string
	}{
		{
			nodeLabels:       10,
			antiAffinityKeys: 100,
			desc:             "FewNodeLabels_ManyAntiAffinityKeys",
		},
		{
			nodeLabels:       30,
			antiAffinityKeys: 5,
			desc:             "ManyNodeLabels_FewAntiAffinityKeys",
		},
		{
			nodeLabels:       15,
			antiAffinityKeys: 15,
			desc:             "EqualNodeLabels_AntiAffinityKeys",
		},
		{
			nodeLabels:       20,
			antiAffinityKeys: 50,
			desc:             "ModerateNodeLabels_ManyAntiAffinityKeys",
		},
		{
			nodeLabels:       5,
			antiAffinityKeys: 5,
			desc:             "FewNodeLabels_FewAntiAffinityKeys",
		},
		{
			nodeLabels:       50,
			antiAffinityKeys: 50,
			desc:             "ManyNodeLabels_ManyAntiAffinityKeys",
		},
	}

	for _, tc := range testCases {
		b.Run(tc.desc, func(b *testing.B) {
			// Create node with specified number of labels
			nodeLabels := make(map[string]string)
			for i := 0; i < tc.nodeLabels; i++ {
				nodeLabels[fmt.Sprintf("label-key-%d", i)] = fmt.Sprintf("label-value-%d", i)
			}

			node := &v1.Node{
				ObjectMeta: metav1.ObjectMeta{
					Name:   "test-node",
					Labels: nodeLabels,
				},
			}
			nodeInfo := framework.NewNodeInfo()
			nodeInfo.SetNode(node)

			// Create preFilterState with specified number of anti-affinity entries
			state := &preFilterState{
				existingAntiAffinityCounts: make(topologyToMatchedTermCount),
			}

			// Add anti-affinity entries, half matching node labels, half not matching
			for i := 0; i < tc.antiAffinityKeys; i++ {
				if i < tc.antiAffinityKeys/2 && i < tc.nodeLabels {
					// Matching entries
					tp := topologyPair{
						key:   fmt.Sprintf("label-key-%d", i),
						value: fmt.Sprintf("label-value-%d", i),
					}
					state.existingAntiAffinityCounts[tp] = 1
				} else {
					// Non-matching entries
					tp := topologyPair{
						key:   fmt.Sprintf("non-existing-key-%d", i),
						value: fmt.Sprintf("non-existing-value-%d", i),
					}
					state.existingAntiAffinityCounts[tp] = 1
				}
			}

			// Run benchmark
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				_ = satisfyExistingPodsAntiAffinity(state, nodeInfo)
			}
		})
	}
}
```
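
To reproduce locally, something along these lines should work, assuming the benchmark above is dropped into the interpodaffinity plugin package as a `_test.go` file:

```
go test -run '^$' -bench=BenchmarkSatisfyExistingPodsAntiAffinity -benchmem ./pkg/scheduler/framework/plugins/interpodaffinity/
```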

The optimization improves performance when nodes have many labels but few anti-affinity rules (common in real clusters), but degrades when the opposite is true.

We could adapt the iteration strategy based on the relative sizes:

```go
if len(nodeLabels) < len(state.existingAntiAffinityCounts) {
	// Iterate over node labels (current master approach)
} else {
	// Iterate over anti-affinity terms (this PR's approach)
}
```

This would choose the cheaper iteration direction in every case. Should I implement this adaptive approach?
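
Folded into a single function, the adaptive version might look roughly like this. It is only a sketch: the function name is made up for illustration, it reuses the identifiers from the benchmark above, and the actual code proposed in the PR may differ.

```go
// Sketch: range over whichever collection is smaller, so the work is
// proportional to min(#node labels, #existing anti-affinity entries).
func satisfyExistingPodsAntiAffinityAdaptive(state *preFilterState, nodeInfo *framework.NodeInfo) bool {
	if len(state.existingAntiAffinityCounts) == 0 {
		return true
	}
	nodeLabels := nodeInfo.Node().Labels
	if len(nodeLabels) < len(state.existingAntiAffinityCounts) {
		// Few labels: probe the anti-affinity map once per label (master's direction).
		for k, v := range nodeLabels {
			if state.existingAntiAffinityCounts[topologyPair{key: k, value: v}] > 0 {
				return false
			}
		}
		return true
	}
	// Few anti-affinity entries: probe the label map once per entry (this PR's direction).
	for tp, count := range state.existingAntiAffinityCounts {
		if count > 0 && nodeLabels[tp.key] == tp.value {
			return false
		}
	}
	return true
}
```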

Member

Could you also run some end-to-end benchmark (maybe modified scheduler_perf), checking if this change has any impact on the scheduling throughput?

> Should I implement this adaptive approach?

You could also test this approach.

Member Author

Increasing the number of anti-affinity rules makes these runs very slow, since throughput drops to single digits to begin with, so those timings aren't very realistic; if there are specific cases you'd like me to try, please let me know.
I have run the benchmarks several times and the trend is the same. They run locally, though, so there will inevitably be some variance. Please let me know what your concerns are; the trend looks consistent across all the tests. Or shall I prepare it so that it can also be run in your environment?

Member

In actual clusters, the number of topology pairs may be greater than the number of node labels, for example, when the topologyKey is set to hostname. However, the opposite can also be true, for instance, when the topologyKey is zone.

Do we also need to update the implementation of the satisfyPodAntiAffinity function?

Member

> Or shall I prepare it so that it can also be run in your environment?

It would be great if you could add a scheduler_perf benchmark to this PR that will verify the throughput for multiple scenarios.

> Do we also need to update the implementation of the satisfyPodAntiAffinity function?

I don't think that would be beneficial. satisfyPodAntiAffinity iterates over a slice, which is much faster than (potentially) iterating over a map.
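
As a rough illustration of that difference, a standalone micro-benchmark could look like the sketch below. It is purely hypothetical (not part of this PR or the Kubernetes tree; package and function names are made up) and just contrasts ranging over a slice with ranging over a map of the same size.

```go
package iterbench

import "testing"

const n = 100

var sink int

// Ranging over a slice walks contiguous memory.
func BenchmarkSliceIteration(b *testing.B) {
	s := make([]int, n)
	for i := range s {
		s[i] = i
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0
		for _, v := range s {
			sum += v
		}
		sink = sum
	}
}

// Ranging over a map walks hash buckets in randomized order, which is
// typically several times slower per element than the slice loop above.
func BenchmarkMapIteration(b *testing.B) {
	m := make(map[int]int, n)
	for i := 0; i < n; i++ {
		m[i] = i
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0
		for _, v := range m {
			sum += v
		}
		sink = sum
	}
}
```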

Member Author

> It would be great if you could add a scheduler_perf benchmark to this PR that will verify the throughput for multiple scenarios.

@macsko
What scenarios do you want? Or may I choose?

Member

> What scenarios do you want? Or may I choose?

Let's select a few combinations of #node labels and #anti-affinity rules.

@utam0k
Member Author

utam0k commented Jun 6, 2025

> Do you have any performance tests with/without this patch to measure the improvement?

I think the trade-off is obvious, but in what cases and how would you like to see it tested?

@utam0k utam0k force-pushed the inter-pod-affinity branch from eb7733e to 67d0783 on June 10, 2025, 13:04
Successfully merging this pull request may close these issues.

sched: InterPodAffinity plugin has inefficient node label iteration in satisfyExistingPodsAntiAffinity