
sched: Optimize anti-affinity node label check #132071


Open · wants to merge 1 commit into master from inter-pod-affinity

Conversation

utam0k
Member

@utam0k utam0k commented Jun 3, 2025

What type of PR is this?

/kind cleanup
/sig scheduling

What this PR does / why we need it:

The performance trade-off depends on the relative sizes of two collections (see the sketch after this list):

  • Number of node labels (typically 10-15 on GKE, can be 20-30+ with custom labels)
  • Number of entries in existingAntiAffinityCounts
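
For context, here is a minimal sketch of the two iteration directions being compared. These are illustrative fragments assumed to live inside satisfyExistingPodsAntiAffinity; the identifiers mirror the benchmark code later in this thread, and the exact lines in the PR may differ.

```go
// Master: range over the node's labels and probe the anti-affinity map,
// so the cost grows with the number of node labels.
for k, v := range nodeInfo.Node().Labels {
	if state.existingAntiAffinityCounts[topologyPair{key: k, value: v}] > 0 {
		return false
	}
}

// This PR: range over the anti-affinity entries and probe the node's labels,
// so the cost grows with the number of existingAntiAffinityCounts entries.
nodeLabels := nodeInfo.Node().Labels
for tp, count := range state.existingAntiAffinityCounts {
	if count > 0 && nodeLabels[tp.key] == tp.value {
		return false
	}
}
```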
Benchmarking

| Scenario | Master (ns/op) | This PR (ns/op) | Performance |
| --- | --- | --- | --- |
| Few node labels (10) vs Many anti-affinity (100) | 37.17 | 113.5 | 3x slower ❌ |
| Many node labels (30) vs Few anti-affinity (5) | 199.4 | 41.69 | 4.8x faster ✅ |
| Equal size (15 vs 15) | 61.52 | 41.75 | 1.5x faster ✅ |

Scheduling Perf

Note: L = Node Labels, AA = Anti-Affinity rules

1. Direct Execution Time Comparison

[Image: graph1_filter_comparison]

Top 30 test cases showing side-by-side comparison of filter execution times. Green bars (This Branch) are consistently lower than red bars (Master) for these cases.

2. Performance Scatter Plot

[Image: graph2_improvement]

Each point represents a test case. Points below the diagonal line indicate improved performance with dynamic selection.

3. Performance Table (Top 20)

| Test Case | Master (ms) | This branch (ms) | Improvement | Speedup |
| --- | --- | --- | --- | --- |
| 90L-5AA | 0.527 | 0.146 | 72.3% | 3.6x |
| 100L-3AA | 0.523 | 0.168 | 67.8% | 3.1x |
| 100L-40AA | 0.507 | 0.180 | 64.5% | 2.8x |
| 50L-5AA | 0.350 | 0.126 | 64.0% | 2.8x |
| 100L-5AA | 0.416 | 0.159 | 61.8% | 2.6x |
| 100L-75AA | 0.250 | 0.095 | 61.8% | 2.6x |
| 60L-5AA | 0.584 | 0.227 | 61.2% | 2.6x |
| 9000L-5AA | 0.380 | 0.150 | 60.6% | 2.5x |
| 100L-95AA | 0.344 | 0.143 | 58.5% | 2.4x |
| 100L-1AA | 0.445 | 0.194 | 56.4% | 2.3x |
| 100L-70AA | 0.294 | 0.132 | 54.9% | 2.2x |
| 40L-5AA | 0.296 | 0.138 | 53.5% | 2.2x |
| 100L-400AA | 0.361 | 0.176 | 51.2% | 2.1x |
| 900L-5AA | 0.403 | 0.207 | 48.6% | 1.9x |
| 100L-200AA | 0.262 | 0.138 | 47.4% | 1.9x |
| 1000L-5AA | 0.229 | 0.124 | 46.1% | 1.9x |
| 200L-5AA | 0.415 | 0.225 | 45.8% | 1.8x |
| 3000L-5AA | 0.231 | 0.126 | 45.4% | 1.8x |
| 6000L-5AA | 0.378 | 0.210 | 44.5% | 1.8x |
| 100L-100AA | 0.140 | 0.079 | 43.7% | 1.8x |
4. pods/sec

| Node Labels | Non-optimized (pods/sec) | Optimized (pods/sec) | Change | Improvement |
| --- | --- | --- | --- | --- |
| 100 | 466.7 | 489.5 | +22.8 | +4.9% |
| 1,000 | 469.9 | 433.7 | -36.2 | -7.7% |
| 10,000 | 350.3 | 369.0 | +18.7 | +5.3% |
| 20,000 | 330.1 | 395.0 | +64.9 | +19.7% |
| 50,000 | 298.0 | 405.0 | +107.0 | +35.9% |

Summary

  • 78 test configurations tested using scheduler_perf
  • 49 cases (62.8%) showed improvement
  • Maximum improvement: 72.3% (3.6x speedup)
  • Best improvements occur when node labels and anti-affinity counts differ significantly

Which issue(s) this PR fixes:

Fixes #132070

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Improved scheduler performance by optimizing the InterPodAffinity plugin's node label iteration when checking existing pod anti-affinity constraints

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 3, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 3, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: utam0k
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from damemi and macsko June 3, 2025 11:23
@haosdent
Member

haosdent commented Jun 6, 2025

Do you have any performance tests with/without this patch to measure the improvement?

```diff
-	if state.existingAntiAffinityCounts[tp] > 0 {
-		return false
+	// Check if this node matches any anti-affinity topology pairs.
+	nodeLabels := nodeInfo.Node().Labels
```
Member

What if nodeLabels only has a few items while existingAntiAffinityCounts has more items?
Would this patch make the performance worse in this situation?

Member Author

Thanks for your review! Yes, your concern is right; I have already described it in the PR description:

> The performance trade-off depends on the relative sizes of two collections:
> • Number of node labels (typically 10-15 on GKE, can be 20-30+ with custom labels)
> • Number of entries in existingAntiAffinityCounts

If you know of any cases where things go bad in the real world, please let me know.

Member

Do we have any benchmark results that show that this PR is visibly improving the performance?

Member Author

@utam0k utam0k Jun 9, 2025

@macsko Thank you for taking a look. Here are the benchmark results.

Benchmark Results

| Scenario | Main (ns/op) | This PR (ns/op) | Performance |
| --- | --- | --- | --- |
| Few node labels (10) vs Many anti-affinity (100) | 37.17 | 113.5 | 3x slower ❌ |
| Many node labels (30) vs Few anti-affinity (5) | 199.4 | 41.69 | 4.8x faster ✅ |
| Equal size (15 vs 15) | 61.52 | 41.75 | 1.5x faster ✅ |
Code

```go
package interpodaffinity

import (
	"fmt"
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func BenchmarkSatisfyExistingPodsAntiAffinity(b *testing.B) {
	testCases := []struct {
		nodeLabels       int
		antiAffinityKeys int
		desc             string
	}{
		{
			nodeLabels:       10,
			antiAffinityKeys: 100,
			desc:             "FewNodeLabels_ManyAntiAffinityKeys",
		},
		{
			nodeLabels:       30,
			antiAffinityKeys: 5,
			desc:             "ManyNodeLabels_FewAntiAffinityKeys",
		},
		{
			nodeLabels:       15,
			antiAffinityKeys: 15,
			desc:             "EqualNodeLabels_AntiAffinityKeys",
		},
		{
			nodeLabels:       20,
			antiAffinityKeys: 50,
			desc:             "ModerateNodeLabels_ManyAntiAffinityKeys",
		},
		{
			nodeLabels:       5,
			antiAffinityKeys: 5,
			desc:             "FewNodeLabels_FewAntiAffinityKeys",
		},
		{
			nodeLabels:       50,
			antiAffinityKeys: 50,
			desc:             "ManyNodeLabels_ManyAntiAffinityKeys",
		},
	}

	for _, tc := range testCases {
		b.Run(tc.desc, func(b *testing.B) {
			// Create node with specified number of labels
			nodeLabels := make(map[string]string)
			for i := 0; i < tc.nodeLabels; i++ {
				nodeLabels[fmt.Sprintf("label-key-%d", i)] = fmt.Sprintf("label-value-%d", i)
			}

			node := &v1.Node{
				ObjectMeta: metav1.ObjectMeta{
					Name:   "test-node",
					Labels: nodeLabels,
				},
			}
			nodeInfo := framework.NewNodeInfo()
			nodeInfo.SetNode(node)

			// Create preFilterState with specified number of anti-affinity entries
			state := &preFilterState{
				existingAntiAffinityCounts: make(topologyToMatchedTermCount),
			}

			// Add anti-affinity entries, half matching node labels, half not matching
			for i := 0; i < tc.antiAffinityKeys; i++ {
				if i < tc.antiAffinityKeys/2 && i < tc.nodeLabels {
					// Matching entries
					tp := topologyPair{
						key:   fmt.Sprintf("label-key-%d", i),
						value: fmt.Sprintf("label-value-%d", i),
					}
					state.existingAntiAffinityCounts[tp] = 1
				} else {
					// Non-matching entries
					tp := topologyPair{
						key:   fmt.Sprintf("non-existing-key-%d", i),
						value: fmt.Sprintf("non-existing-value-%d", i),
					}
					state.existingAntiAffinityCounts[tp] = 1
				}
			}

			// Run benchmark
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				_ = satisfyExistingPodsAntiAffinity(state, nodeInfo)
			}
		})
	}
}
```
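
To reproduce locally, something along these lines should work, assuming the benchmark above is dropped into the interpodaffinity plugin package as a `_test.go` file:

```
go test -run '^$' -bench=BenchmarkSatisfyExistingPodsAntiAffinity -benchmem ./pkg/scheduler/framework/plugins/interpodaffinity/
```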

The optimization improves performance when nodes have many labels but few anti-affinity rules (common in real clusters), but degrades when the opposite is true.

We could adapt the iteration strategy based on the relative sizes:

```go
if len(nodeLabels) < len(state.existingAntiAffinityCounts) {
	// Iterate over node labels (current master approach)
} else {
	// Iterate over anti-affinity terms (this PR's approach)
}
```

This would choose the cheaper iteration direction in every case. Should I implement this adaptive approach?
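
Folded into a single function, the adaptive version might look roughly like this. It is only a sketch: the function name is made up for illustration, it reuses the identifiers from the benchmark above, and the actual code proposed in the PR may differ.

```go
// Sketch: range over whichever collection is smaller, so the work is
// proportional to min(#node labels, #existing anti-affinity entries).
func satisfyExistingPodsAntiAffinityAdaptive(state *preFilterState, nodeInfo *framework.NodeInfo) bool {
	if len(state.existingAntiAffinityCounts) == 0 {
		return true
	}
	nodeLabels := nodeInfo.Node().Labels
	if len(nodeLabels) < len(state.existingAntiAffinityCounts) {
		// Few labels: probe the anti-affinity map once per label (master's direction).
		for k, v := range nodeLabels {
			if state.existingAntiAffinityCounts[topologyPair{key: k, value: v}] > 0 {
				return false
			}
		}
		return true
	}
	// Few anti-affinity entries: probe the label map once per entry (this PR's direction).
	for tp, count := range state.existingAntiAffinityCounts {
		if count > 0 && nodeLabels[tp.key] == tp.value {
			return false
		}
	}
	return true
}
```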

Member

Could you also run some end-to-end benchmark (maybe modified scheduler_perf), checking if this change has any impact on the scheduling throughput?

> Should I implement this adaptive approach?

You could also test this approach.

Member Author

Increasing the number of anti-affinity rules makes these runs very slow, since throughput drops to single digits to begin with, so those timings aren't very realistic; if there are specific cases you'd like me to try, please let me know.
I have run the benchmarks several times and the trend is the same. They run locally, though, so there will inevitably be some variance. Please let me know what your concerns are; the trend looks consistent across all the tests. Or shall I prepare it so that it can also be run in your environment?

Member

In actual clusters, the number of topology pairs may be greater than the number of node labels, for example, when the topologyKey is set to hostname. However, the opposite can also be true, for instance, when the topologyKey is zone.

Do we also need to update the implementation of the satisfyPodAntiAffinity function?

Member

> Or shall I prepare it so that it can also be run in your environment?

It would be great if you could add a scheduler_perf benchmark to this PR that will verify the throughput for multiple scenarios.

> Do we also need to update the implementation of the satisfyPodAntiAffinity function?

I don't think that would be beneficial. satisfyPodAntiAffinity iterates over a slice, which is much faster than (potentially) iterating over a map.
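
As a rough illustration of that difference, a standalone micro-benchmark could look like the sketch below. It is purely hypothetical (not part of this PR or the Kubernetes tree; package and function names are made up) and just contrasts ranging over a slice with ranging over a map of the same size.

```go
package iterbench

import "testing"

const n = 100

var sink int

// Ranging over a slice walks contiguous memory.
func BenchmarkSliceIteration(b *testing.B) {
	s := make([]int, n)
	for i := range s {
		s[i] = i
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0
		for _, v := range s {
			sum += v
		}
		sink = sum
	}
}

// Ranging over a map walks hash buckets in randomized order, which is
// typically several times slower per element than the slice loop above.
func BenchmarkMapIteration(b *testing.B) {
	m := make(map[int]int, n)
	for i := 0; i < n; i++ {
		m[i] = i
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0
		for _, v := range m {
			sum += v
		}
		sink = sum
	}
}
```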

Member Author

> It would be great if you could add a scheduler_perf benchmark to this PR that will verify the throughput for multiple scenarios.

@macsko
What scenarios do you want? Or may I choose?

Member

> What scenarios do you want? Or may I choose?

Let's select a few combinations of #node labels and #anti-affinity rules.

@utam0k
Member Author

utam0k commented Jun 6, 2025

> Do you have any performance tests with/without this patch to measure the improvement?

I think the trade-off is obvious, but in what cases and how would you like to see it tested?

@utam0k utam0k force-pushed the inter-pod-affinity branch from eb7733e to 67d0783 on June 10, 2025, 13:04
Successfully merging this pull request may close these issues.

sched: InterPodAffinity plugin has inefficient node label iteration in satisfyExistingPodsAntiAffinity