Added stacktrace and logging for list and watch panics in reflector.go #131000
Conversation
Hi @ballista01. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```go
buf := make([]byte, 1<<16)
stackSize := golangruntime.Stack(buf, false)
stackTrace := string(buf[:stackSize])
klog.Errorf("Panic occurred in Reflector watch operation: %v\nStack Trace:\n%s", r, stackTrace)
```
No direct klog calls, please. Use the logger instance instead.
What is the advantage of this custom code over runtime.HandleCrashWithContext?
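(For context, a minimal sketch of the suggested approach; the function name `watchSomething` is hypothetical, while `utilruntime.HandleCrashWithContext` and `klog.FromContext` are the existing helpers:)

```go
import (
	"context"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/klog/v2"
)

// watchSomething is a hypothetical stand-in for the Reflector's watch path.
func watchSomething(ctx context.Context) {
	// Logs any panic (including a stack trace) through the registered
	// crash handlers; by default it re-panics afterwards.
	defer utilruntime.HandleCrashWithContext(ctx)

	// Use the contextual logger instead of direct klog.Errorf calls.
	logger := klog.FromContext(ctx)
	logger.V(3).Info("starting watch")
}
```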
Thanks for the suggestion! I used it in `watch` but not `list`. Here are my reasons:

- `runtime.HandleCrashWithContext` by default re-panics after calling the handlers. That doesn't work here because the panic happens inside a goroutine in `list()`, and we need to pass the panic back to the main goroutine via `panicCh`. Re-panicking inside the child goroutine would bypass that and crash the process.
- Setting `ReallyCrash = false` would avoid the re-panic, but it's a global variable and not thread-safe. Toggling it can trigger data races in tests or other concurrent code.
- So instead, I used a local `recover()` block that mimics `HandleCrashWithContext` but without the re-panic. It still calls the registered panic handlers (so we get the logs and stack trace), and then sends the panic over the channel for the caller to handle (see the sketch after this list).
- If we think this pattern is useful more broadly, I'd be happy to propose a patch to `staging/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go` to support it more cleanly.
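A minimal sketch of that pattern (`panicCh` and `ctx` follow the discussion; `debug` is `runtime/debug`, and for brevity this logs directly via the contextual logger rather than invoking the registered handler list):

```go
logger := klog.FromContext(ctx)
defer func() {
	if r := recover(); r != nil {
		// Capture the stack in the goroutine that actually panicked.
		logger.Error(nil, "panic in list goroutine",
			"panic", r, "stacktrace", string(debug.Stack()))
		// Hand the panic to the main goroutine instead of re-panicking,
		// which would crash the process from inside this goroutine.
		panicCh <- r
	}
}()
```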
nvm, just after I wrote these, I found I can make it cleaner by using:

```go
defer func() {
	if r := recover(); r != nil {
		panicCh <- r
	}
}()
defer utilruntime.HandleCrashWithContext(ctx)
```
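(In the snippet as quoted, deferred functions run last-in-first-out: the `HandleCrashWithContext` defer fires first, logs the panic through the crash handlers and, per its default re-panic behavior mentioned above, panics again; the outer `recover()` defer then catches that re-panic and forwards it over `panicCh`.)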
> sends the panic over the channel for the caller to handle

Who needs this? What is the caller supposed to do with the panic?

We seem to have a policy in Kubernetes, implemented by `k8s.io/apimachinery/pkg/util/runtime`: "panics should not occur and are fatal". If you feel that this code here needs a different policy, then I would like to see confirmation from SIG Arch on that.

I agree that for testing purposes it makes sense to apply a different policy, and I agree that the global variable in the runtime package is sub-optimal. If that is what you are trying to fix, then we could look for simpler alternatives. For example, a "don't exit" flag could be set in the context and get checked by the package first before falling back to the global variable.
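(A hypothetical sketch of that context-flag idea; none of these names exist in `k8s.io/apimachinery` today, and `ReallyCrash` is the existing global:)

```go
type crashKey struct{}

// WithReallyCrash would record the desired crash behavior in the context.
func WithReallyCrash(ctx context.Context, crash bool) context.Context {
	return context.WithValue(ctx, crashKey{}, crash)
}

// shouldCrash would be checked by the runtime package first, falling
// back to the existing ReallyCrash global when the context says nothing.
func shouldCrash(ctx context.Context) bool {
	if v, ok := ctx.Value(crashKey{}).(bool); ok {
		return v
	}
	return ReallyCrash
}
```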
Looking at this closer, I now understand why `panicCh` exists: `list` spawns a goroutine, and any panic occurring there is meant to be seen as a panic in `list` itself.

Your latest incarnation doesn't achieve that. When calling `utilruntime.HandleCrashWithContext(ctx)` inside the goroutine, the entire process is getting torn down (default behavior outside of tests), without ever writing anything to `panicCh`.

Perhaps that is the right approach. But then `panicCh` is redundant and should be removed.
To be clear, I'm in favor of this PR as far as it goes, plumbing ctx to the lister so we don't hang a background list. I just think we need to make our ctx adapter more correct if we're relying on it to interrupt the list for a context cancel OR remove the possibility of listerwatchers that don't have context (ideal, but more breaking)
(just a quick look in-tree shows ~9 places we're constructing cache.ListWatch with only ListFunc/WatchFunc methods, not the context equivalents ... looks like we haven't really finished the work to make lister watcher context aware ... so relying on it in this PR isn't super solid yet)
/hold
actually, it looks like we have multiple implementations of ListWithContext now that just drop the context:

```go
// List a set of apiserver resources
func (lw *ListWatch) ListWithContext(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
	// ListWatch is used in Reflector, which already supports pagination.
	// Don't paginate here to avoid duplication.
	if lw.ListWithContextFunc != nil {
		return lw.ListWithContextFunc(ctx, options)
	}
	return lw.ListFunc(options)
}

func (l listerWrapper) ListWithContext(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
	return l.parent.List(options)
}
```
if we're relying on ctx cancellation here, we have to make those actually cancel on context cancel
We can fix all in-tree code, but ListWatch is a public API. There's always the possibility that some client-go consumer only supplies the non-context-aware functions. Perhaps we have to really remove those before we can proceed with this PR?
I think the minimum bar to rely on context cancellation here is:
- converting all in-tree uses to honor context
- making the places we adapt context-unaware listers to context-aware signatures unblock and return on context cancellation (leaving the uncancellable lists running/hanging in the background, I guess, with the panic propagation workaround shifted to those adapters; see the sketch after this list)
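A hedged sketch of such an adapter (not actual client-go code; `metav1` is `k8s.io/apimachinery/pkg/apis/meta/v1`, `runtime` is the apimachinery package providing `runtime.Object`, and `utilruntime` is `k8s.io/apimachinery/pkg/util/runtime`):

```go
// listWithCancel adapts a context-unaware list function so the caller
// unblocks on cancellation, leaving the uncancellable list running in
// the background until it returns on its own.
func listWithCancel(ctx context.Context, list func(metav1.ListOptions) (runtime.Object, error), opts metav1.ListOptions) (runtime.Object, error) {
	type result struct {
		obj runtime.Object
		err error
	}
	ch := make(chan result, 1) // buffered: the goroutine can always send and exit
	go func() {
		// Keep the panic policy inside the adapter instead of
		// forwarding panics over a channel.
		defer utilruntime.HandleCrashWithContext(ctx)
		obj, err := list(opts)
		ch <- result{obj, err}
	}()
	select {
	case r := <-ch:
		return r.obj, r.err
	case <-ctx.Done():
		// Stop waiting; the background list keeps running.
		return nil, ctx.Err()
	}
}
```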
To help move more of the out-of-tree ecosystem to threading context cancellation, we have a spectrum of options:
1. Sweep the internet for things that use our concrete ListWatch struct with ListFunc/WatchFunc and file issues or open PRs to switch them to provide ListWithContextFunc / WatchWithContextFunc instead. This is beneficial even if we never do any subsequent steps.
2. Possibly, after we've done 1 and waited some period of time, remove the non-context ListFunc/WatchFunc fields from `ListWatch` (while keeping the methods to satisfy the interface).
3. Possibly, after we've done 2 and waited some period of time, collapse the ListerWatcher and ListerWatcherWithContext interfaces.
/ok-to-test
/retest
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ballista01

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
I like the direction, but we need to figure out whether "cancellation despite hanging list/watch implementation" is something that has to be preserved. I'll respond to #131000 (comment).
…atchWithContext() to centralize error handling
/retest
```diff
@@ -401,6 +401,7 @@ func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
 func (r *Reflector) ListAndWatchWithContext(ctx context.Context) error {
 	logger := klog.FromContext(ctx)
 	logger.V(3).Info("Listing and watching", "type", r.typeDescription, "reflector", r.name)
+	defer utilruntime.HandleCrashWithContext(ctx)
```
we didn't log panics in this method before, I'm not really sure why this level is where we'd choose to start logging them now
I agree.
@ballista01: with "leave this to the caller" in #131000 (comment) I meant the caller of client-go, not a higher-level function in client-go. Please don't resolve comments without giving your reviewer a chance to check whether their comment really is addressed.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR enhances the `Reflector` in `client-go/tools/cache` by adding panic handling mechanisms in the `list` and `watch` operations. The panic handling captures detailed stack trace information when a panic occurs, which helps to accurately identify the source of the panic. Without this improvement, panic messages may not provide sufficient context, making debugging difficult.

Which issue(s) this PR fixes:

Fixes #130817

Special notes for your reviewer:

The following changes were made:

- Panic handling in the `watch` and `list` methods of `reflector.go`.
- `recover()` mechanisms in `watch`.
- Stack trace capture and logging for the `list` and `watch` operations.

Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: