Adding batch handling for popping items from RealFIFO #132240

Open. Wants to merge 1 commit into base: master.

yue9944882
Member

@yue9944882 yue9944882 commented Jun 11, 2025

/kind feature
/sig api-machinery

Summary

Ref: #130767

The issue above reports that write-lock acquisition in DeltaFIFO's processDeltas can have much lower throughput than expected because of read-lock contention:

  • Read-lock attempts come from multiple threads in the classic controller model, so readers have a higher chance of acquiring the lock.
  • The write lock is taken from a single-threaded context, so at least one reader can acquire the lock between two consecutive writes.

As a result, read-lock and write-lock acquisitions are distributed almost 1:1 in a large, busy cluster.

This PR enables DeltaFIFO to process items in batches while preserving their in-queue order, so multiple item writes (add/update/delete) can be applied within a single write-lock acquisition.
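
A minimal sketch of the idea, using only the standard library: the same ordered writes are applied once with one lock acquisition per item and once with a single acquisition for the whole batch. The names `store`, `write`, `applyOne` and `applyBatch` are hypothetical and are not taken from the PR or from client-go.

```go
package main

import (
	"fmt"
	"sync"
)

// op is a hypothetical write kind; it is not the PR's actual type.
type op int

const (
	opUpsert op = iota
	opDelete
)

// write is one queued mutation against the store.
type write struct {
	kind op
	key  string
	val  string
}

// store is a toy stand-in for an informer's thread-safe cache.
type store struct {
	mu       sync.RWMutex
	items    map[string]string
	acquired int // number of write-lock acquisitions, kept for illustration
}

func newStore() *store { return &store{items: map[string]string{}} }

// applyOne takes the write lock once per item (the shape before batching).
func (s *store) applyOne(w write) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.acquired++
	s.applyLocked(w)
}

// applyBatch takes the write lock once and applies all writes in queue order.
func (s *store) applyBatch(ws []write) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.acquired++
	for _, w := range ws {
		s.applyLocked(w)
	}
}

// applyLocked mutates the map; the caller must hold the write lock.
func (s *store) applyLocked(w write) {
	switch w.kind {
	case opUpsert:
		s.items[w.key] = w.val
	case opDelete:
		delete(s.items, w.key)
	}
}

func main() {
	ws := []write{{opUpsert, "a", "1"}, {opUpsert, "b", "2"}, {opDelete, "a", ""}}
	one, batch := newStore(), newStore()
	for _, w := range ws {
		one.applyOne(w)
	}
	batch.applyBatch(ws)
	fmt.Println("per-item acquisitions:", one.acquired)
	fmt.Println("batched acquisitions:", batch.acquired)
}
```

Both stores end in the same state; the batched path gives concurrent readers fewer chances to interleave between writes.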

Release note: NONE

Performance Test

The following unit test simulates a heavy watch event stream. It constructs a simple controller with an informer built from an in-memory fake watch, then uses 100 concurrent goroutines to flood the watch channel with events:

https://gist.github.com/yue9944882/f9ee6bbc81373545253b95abc9a65269
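
The gist is the actual test. As a rough, standard-library-only illustration of the flooding pattern it describes, the sketch below has many producer goroutines push events into one channel while a single consumer drains it. It does not use client-go or a fake watch, and the `flood` function and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// flood starts `producers` goroutines that each send `perProducer` events
// into one shared channel, and returns how many events the single consumer
// received once every producer has finished.
func flood(producers, perProducer int) int {
	events := make(chan int, 1024)
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < perProducer; i++ {
				events <- id
			}
		}(p)
	}
	// Close the channel after all producers are done so the consumer loop ends.
	go func() {
		wg.Wait()
		close(events)
	}()
	received := 0
	for range events {
		received++
	}
	return received
}

func main() {
	fmt.Println("events received:", flood(100, 1000))
}
```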

Screenshot 2025-06-11 at 6:55:48 PM
  • Write throughput reaches ~9x that of reads
  • Overall controller throughput increases by ~6x

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 11, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 11, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 11, 2025
@yue9944882 yue9944882 changed the title [WIP] Adding batch handling for popping items from DeltaFIFO [WIP] Adding batch handling for popping items from Real/DeltaFIFO Jun 11, 2025
@yue9944882 yue9944882 force-pushed the delta-fifo-batch branch 4 times, most recently from f5373ee to d77407b Compare June 11, 2025 22:57
@cartermckinnon
Contributor

/cc

@yue9944882 yue9944882 force-pushed the delta-fifo-batch branch 4 times, most recently from 9b68011 to 2e4faa5 Compare June 12, 2025 00:55
@hakuna-matatah
Contributor

/cc

@yue9944882 yue9944882 force-pushed the delta-fifo-batch branch 2 times, most recently from 9fd34a1 to 83da6cd Compare June 12, 2025 16:38
@yue9944882 yue9944882 changed the title [WIP] Adding batch handling for popping items from Real/DeltaFIFO Adding batch handling for popping items from Real/DeltaFIFO Jun 12, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2025
@dims
Member

dims commented Jun 13, 2025

/assign @thockin @deads2k @liggitt

@yue9944882 yue9944882 changed the title Adding batch handling for popping items from Real/DeltaFIFO Adding batch handling for popping items from RealFIFO Jun 13, 2025
Member

@mengqiy mengqiy left a comment

This is a good performance improvement and, from my understanding, it still preserves the ordering of events (we won't run into https://github.com/kubernetes/kubernetes/pull/127631/files)

if unique.Has(id) {
	break
}
batchSize++
Member

This batchSize would be a useful metric.

Member Author

Good idea. I think we can move the metrics changes into a separate PR so this PR stays focused on the queue change.
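
The snippet under review stops growing the batch when it meets a key that is already in the batch. A minimal sketch of that rule, assuming a plain slice as the queue and a map as the uniqueness set (the `event` type, `popBatch` and `limit` are hypothetical, not the PR's code):

```go
package main

import "fmt"

// event is a hypothetical queue entry: an object key plus a payload.
type event struct {
	key string
	val string
}

// popBatch takes up to limit events from the head of q and stops before the
// first event whose key is already in the batch, so two events for the same
// key never land in one batch and their relative order is preserved.
func popBatch(q []event, limit int) (batch, rest []event) {
	seen := map[string]struct{}{}
	n := 0
	for n < len(q) && n < limit {
		if _, dup := seen[q[n].key]; dup {
			break
		}
		seen[q[n].key] = struct{}{}
		n++
	}
	return q[:n], q[n:]
}

func main() {
	q := []event{{"a", "add"}, {"b", "add"}, {"a", "update"}, {"c", "add"}}
	for len(q) > 0 {
		var b []event
		b, q = popBatch(q, 10)
		fmt.Println("batch size:", len(b), "items:", b)
	}
}
```

Here the second event for key `a` ends the first batch and starts the next one; `len(b)` plays the role of the `batchSize` value that the review suggests exposing as a metric.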

@@ -165,6 +179,38 @@ type cache struct {

var _ Store = &cache{}

func (c *cache) Transaction(txns ...Transaction) error {
Member

IMO it's worth adding a trace log for the transaction operation here, e.g. log something if it takes more than 100ms.

Member Author

I think 100ms might be too tight for larger batches? Added with 500ms.

@yue9944882 yue9944882 force-pushed the delta-fifo-batch branch 3 times, most recently from 6480cc6 to 6b50997 Compare June 16, 2025 21:05
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yue9944882
Once this PR has been reviewed and has the lgtm label, please ask for approval from deads2k. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yue9944882 yue9944882 force-pushed the delta-fifo-batch branch 4 times, most recently from 15a5469 to a29f120 Compare June 16, 2025 21:54
@liggitt liggitt added this to @liggitt Jun 24, 2025
@yue9944882
Member Author

/retest

@xigang
Member

xigang commented Jun 27, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from xigang June 27, 2025 04:02
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

10 participants