sched: Optimize anti-affinity node label check #132071
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: utam0k
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Do you have any performance tests with/without this patch to measure the improvement?
if state.existingAntiAffinityCounts[tp] > 0 {
    return false
// Check if this node matches any anti-affinity topology pairs.
nodeLabels := nodeInfo.Node().Labels
What if nodeLabels only has a few items while existingAntiAffinityCounts
has more items?
Would this patch make the performance worse in this situation?
Thanks for your review! Yes, your concern is right; I have already described it in the description of this PR.
The performance difference depends on the relative sizes of the two collections: master does O(number of node labels) map lookups, while this PR does O(len(existingAntiAffinityCounts)) lookups. The relevant factors are:
- the number of node labels (typically 10-15 on GKE, and can be 20-30+ with custom labels), and
- the number of entries in existingAntiAffinityCounts.
If you know of real-world cases where this makes things worse, please let me know.
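To make the trade-off concrete, here is a minimal, self-contained Go sketch of the two iteration strategies being compared; the types and function names are simplified stand-ins, not the actual plugin code:

```go
package main

import "fmt"

// topologyPair is a simplified stand-in for the scheduler's (key, value)
// topology pair used as the map key in existingAntiAffinityCounts.
type topologyPair struct{ key, value string }

// checkByNodeLabels mirrors the master approach: iterate over the node's
// labels and probe the anti-affinity map, i.e. O(len(nodeLabels)) lookups.
func checkByNodeLabels(nodeLabels map[string]string, counts map[topologyPair]int64) bool {
	for k, v := range nodeLabels {
		if counts[topologyPair{key: k, value: v}] > 0 {
			return false // the node matches an existing anti-affinity term
		}
	}
	return true
}

// checkByAntiAffinityTerms mirrors this PR's approach: iterate over the
// anti-affinity entries and probe the node labels, i.e. O(len(counts)) lookups.
func checkByAntiAffinityTerms(nodeLabels map[string]string, counts map[topologyPair]int64) bool {
	for tp, count := range counts {
		if count > 0 && nodeLabels[tp.key] == tp.value {
			return false
		}
	}
	return true
}

func main() {
	labels := map[string]string{"kubernetes.io/hostname": "node-1", "topology.kubernetes.io/zone": "zone-a"}
	counts := map[topologyPair]int64{{key: "topology.kubernetes.io/zone", value: "zone-a"}: 1}
	// Both strategies return the same answer; only the iteration cost differs.
	fmt.Println(checkByNodeLabels(labels, counts), checkByAntiAffinityTerms(labels, counts))
}
```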
Do we have any benchmark results that show that this PR is visibly improving the performance?
@macsko Thank you for your response. Here are the benchmark results.
Benchmark Results
| Scenario | Main (ns/op) | This PR (ns/op) | Performance |
|---|---|---|---|
| Few node labels (10) vs Many anti-affinity (100) | 37.17 | 113.5 | 3x slower ❌ |
| Many node labels (30) vs Few anti-affinity (5) | 199.4 | 41.69 | 4.8x faster ✅ |
| Equal size (15 vs 15) | 61.52 | 41.75 | 1.5x faster ✅ |
Code
package interpodaffinity
import (
"fmt"
"testing"
v1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
func BenchmarkSatisfyExistingPodsAntiAffinity(b *testing.B) {
testCases := []struct {
nodeLabels int
antiAffinityKeys int
desc string
}{
{
nodeLabels: 10,
antiAffinityKeys: 100,
desc: "FewNodeLabels_ManyAntiAffinityKeys",
},
{
nodeLabels: 30,
antiAffinityKeys: 5,
desc: "ManyNodeLabels_FewAntiAffinityKeys",
},
{
nodeLabels: 15,
antiAffinityKeys: 15,
desc: "EqualNodeLabels_AntiAffinityKeys",
},
{
nodeLabels: 20,
antiAffinityKeys: 50,
desc: "ModerateNodeLabels_ManyAntiAffinityKeys",
},
{
nodeLabels: 5,
antiAffinityKeys: 5,
desc: "FewNodeLabels_FewAntiAffinityKeys",
},
{
nodeLabels: 50,
antiAffinityKeys: 50,
desc: "ManyNodeLabels_ManyAntiAffinityKeys",
},
}
for _, tc := range testCases {
b.Run(tc.desc, func(b *testing.B) {
// Create node with specified number of labels
nodeLabels := make(map[string]string)
for i := 0; i < tc.nodeLabels; i++ {
nodeLabels[fmt.Sprintf("label-key-%d", i)] = fmt.Sprintf("label-value-%d", i)
}
node := &v1.Node{
ObjectMeta: metav1.ObjectMeta{
Name: "test-node",
Labels: nodeLabels,
},
}
nodeInfo := framework.NewNodeInfo()
nodeInfo.SetNode(node)
// Create preFilterState with specified number of anti-affinity entries
state := &preFilterState{
existingAntiAffinityCounts: make(topologyToMatchedTermCount),
}
// Add anti-affinity entries, half matching node labels, half not matching
for i := 0; i < tc.antiAffinityKeys; i++ {
if i < tc.antiAffinityKeys/2 && i < tc.nodeLabels {
// Matching entries
tp := topologyPair{
key: fmt.Sprintf("label-key-%d", i),
value: fmt.Sprintf("label-value-%d", i),
}
state.existingAntiAffinityCounts[tp] = 1
} else {
// Non-matching entries
tp := topologyPair{
key: fmt.Sprintf("non-existing-key-%d", i),
value: fmt.Sprintf("non-existing-value-%d", i),
}
state.existingAntiAffinityCounts[tp] = 1
}
}
// Run benchmark
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = satisfyExistingPodsAntiAffinity(state, nodeInfo)
}
})
}
}
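If you want to reproduce this locally, the benchmark should run with the standard Go tooling, e.g. `go test -run '^$' -bench BenchmarkSatisfyExistingPodsAntiAffinity ./pkg/scheduler/framework/plugins/interpodaffinity/` (the package path is assumed here from the benchmark's package name).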
The optimization improves performance when nodes have many labels but few anti-affinity rules (common in real clusters), but degrades when the opposite is true.
We could adapt the iteration strategy based on the relative sizes:
if len(nodeLabels) < len(state.existingAntiAffinityCounts) {
// Iterate over node labels (current master approach)
} else {
// Iterate over anti-affinity terms (this PR's approach)
}
This would choose the cheaper iteration in either case. Should I implement this adaptive approach?
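For illustration, a self-contained sketch of what this adaptive selection could look like; the helper name and simplified types are hypothetical, not the actual plugin code:

```go
package main

import "fmt"

// topologyPair stands in for the scheduler's (key, value) topology pair.
type topologyPair struct{ key, value string }

// adaptiveAntiAffinityCheck iterates over whichever collection is smaller,
// so the number of map lookups is O(min(len(nodeLabels), len(counts))).
// It returns false when the node matches an existing anti-affinity term.
func adaptiveAntiAffinityCheck(nodeLabels map[string]string, counts map[topologyPair]int64) bool {
	if len(nodeLabels) < len(counts) {
		// Few node labels: probe the anti-affinity map per label (current master approach).
		for k, v := range nodeLabels {
			if counts[topologyPair{key: k, value: v}] > 0 {
				return false
			}
		}
		return true
	}
	// Few anti-affinity terms: probe the node labels per term (this PR's approach).
	for tp, count := range counts {
		if count > 0 && nodeLabels[tp.key] == tp.value {
			return false
		}
	}
	return true
}

func main() {
	labels := map[string]string{"topology.kubernetes.io/zone": "zone-a", "kubernetes.io/hostname": "node-1"}
	counts := map[topologyPair]int64{{key: "topology.kubernetes.io/zone", value: "zone-a"}: 2}
	fmt.Println(adaptiveAntiAffinityCheck(labels, counts)) // false: the node matches an anti-affinity pair
}
```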
Could you also run some end-to-end benchmark (maybe a modified scheduler_perf) to check whether this change has any impact on scheduling throughput?
> Should I implement this adaptive approach?
You could also test this approach
Increasing the number of anti-affinity rules (AA) makes the run very slow because throughput drops to single digits to begin with, so the timings are not very realistic; if there are specific cases you would like me to try, please let me know.
I have run it several times and see the same trend. However, it runs locally, so there will inevitably be some variance. Please let me know what your concerns are; I think the trend is consistent across all tests. Or shall I prepare it so that it can be tested in your environment as well?
In actual clusters, the number of topology pairs may be greater than the number of node labels, for example, when the topologyKey is set to hostname. However, the opposite can also be true, for instance, when the topologyKey is zone.
Do we need to update the implementation of the satisfyPodAntiAffinity function as well?
> Or shall I prepare it so that it can be tested in your environment as well?
It would be great if you could add a scheduler_perf benchmark to this PR that will verify the throughput for multiple scenarios.
> Do we need to update the implementation of the satisfyPodAntiAffinity function as well?
I don't think that would be beneficial. satisfyPodAntiAffinity relies on iterating over a slice, which is much faster than (potentially) iterating over a map.
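For context, ranging over a Go map is generally slower per element than ranging over a slice of the same size; a throwaway micro-benchmark along these lines (hypothetical test file, not part of this PR) illustrates the difference:

```go
// Save as e.g. slicemap_test.go; run with: go test -bench . -run '^$'
package slicemap

import (
	"strconv"
	"testing"
)

type pair struct{ key, value string }

var (
	pairsSlice []pair
	pairsMap   = map[pair]int{}
	sink       int
)

func init() {
	// Populate a slice and a map with the same 100 unique pairs.
	for i := 0; i < 100; i++ {
		p := pair{key: "key-" + strconv.Itoa(i), value: "value-" + strconv.Itoa(i)}
		pairsSlice = append(pairsSlice, p)
		pairsMap[p] = i
	}
}

func BenchmarkSliceIteration(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, p := range pairsSlice {
			sink += len(p.key)
		}
	}
}

func BenchmarkMapIteration(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for p := range pairsMap {
			sink += len(p.key)
		}
	}
}
```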
> It would be great if you could add a scheduler_perf benchmark to this PR that will verify the throughput for multiple scenarios.
@macsko
What scenarios do you want? Or may I choose?
> What scenarios do you want? Or may I choose?
Let's select a few combinations of #node labels and #anti-affinity rules.
I think the trade-off is obvious, but in which cases, and how, would you like to see it tested?
Signed-off-by: utam0k <[email protected]>
Force-pushed from eb7733e to 67d0783.
What type of PR is this?
/kind cleanup
/sig scheduling
What this PR does / why we need it:
The cost of the anti-affinity check depends on the relative sizes of the node's labels and existingAntiAffinityCounts: roughly O(number of node labels) versus O(len(existingAntiAffinityCounts)), depending on which collection is iterated.
Benchmarking
Scheduling Perf
Note: L = Node Labels, AA = Anti-Affinity rules
1. Direct Execution Time Comparison (chart): top 30 test cases, comparing filter execution times side by side; the bars for this branch are consistently lower than those for master in these cases.
2. Performance Scatter Plot (chart): each point represents a test case; points below the diagonal line indicate improved performance with dynamic selection.
3. Performance Table (Top 20): not reproduced here.
4. pods/sec (chart): not reproduced here.
Summary
Which issue(s) this PR fixes:
Fixes #132070
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: