Added stacktrace and logging for list and watch panics in reflector.go #131000
Conversation
Hi @ballista01. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```go
buf := make([]byte, 1<<16)
stackSize := golangruntime.Stack(buf, false)
stackTrace := string(buf[:stackSize])
klog.Errorf("Panic occurred in Reflector watch operation: %v\nStack Trace:\n%s", r, stackTrace)
```
No direct klog calls, please. Use the logger instance instead.
What is the advantage of this custom code over runtime.HandleCrashWithContext?
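(For context, a minimal sketch of the suggested approach; the function name `watchSomething` is hypothetical, while `utilruntime.HandleCrashWithContext` and `klog.FromContext` are the existing helpers:)

```go
import (
	"context"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/klog/v2"
)

// watchSomething is a hypothetical stand-in for the Reflector's watch path.
func watchSomething(ctx context.Context) {
	// Logs any panic (including a stack trace) through the registered
	// crash handlers; by default it re-panics afterwards.
	defer utilruntime.HandleCrashWithContext(ctx)

	// Use the contextual logger instead of direct klog.Errorf calls.
	logger := klog.FromContext(ctx)
	logger.V(3).Info("starting watch")
}
```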
Thanks for the suggestion! I used it in `watch` but not `list`. Here are my reasons:

- `runtime.HandleCrashWithContext` by default re-panics after calling the handlers. That doesn't work here because the panic happens inside a goroutine in `list()`, and we need to pass the panic back to the main goroutine via `panicCh`. Re-panicking inside the child goroutine would bypass that and crash the process.
- Setting `ReallyCrash = false` would avoid the re-panic, but it's a global variable and not thread-safe. Toggling it can trigger data races in tests or other concurrent code.
- So instead, I used a local `recover()` block that mimics `HandleCrashWithContext` but without the re-panic. It still calls the registered panic handlers (so we get the logs and stack trace), and then sends the panic over the channel for the caller to handle (see the sketch after this list).
- If we think this pattern is useful more broadly, I'd be happy to propose a patch to `staging/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go` to support it more cleanly.
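A minimal sketch of that pattern (`panicCh` and `ctx` follow the discussion; `debug` is `runtime/debug`, and for brevity this logs directly via the contextual logger rather than invoking the registered handler list):

```go
logger := klog.FromContext(ctx)
defer func() {
	if r := recover(); r != nil {
		// Capture the stack in the goroutine that actually panicked.
		logger.Error(nil, "panic in list goroutine",
			"panic", r, "stacktrace", string(debug.Stack()))
		// Hand the panic to the main goroutine instead of re-panicking,
		// which would crash the process from inside this goroutine.
		panicCh <- r
	}
}()
```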
nvm, just after I wrote these, I found I can make it cleaner by using:

```go
defer func() {
	if r := recover(); r != nil {
		panicCh <- r
	}
}()
defer utilruntime.HandleCrashWithContext(ctx)
```
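(In the snippet as quoted, deferred functions run last-in-first-out: the `HandleCrashWithContext` defer fires first, logs the panic through the crash handlers and, per its default re-panic behavior mentioned above, panics again; the outer `recover()` defer then catches that re-panic and forwards it over `panicCh`.)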
> sends the panic over the channel for the caller to handle

Who needs this? What is the caller supposed to do with the panic?

We seem to have a policy in Kubernetes, implemented by `k8s.io/apimachinery/pkg/util/runtime`: "panics should not occur and are fatal". If you feel that this code here needs a different policy, then I would like to see confirmation from SIG Arch on that.

I agree that for testing purposes it makes sense to apply a different policy, and I agree that the global variable in the runtime package is sub-optimal. If that is what you are trying to fix, then we could look for simpler alternatives. For example, a "don't exit" flag could be set in the context and get checked by the package first before falling back to the global variable.
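(A hypothetical sketch of that context-flag idea; none of these names exist in `k8s.io/apimachinery` today, and `ReallyCrash` is the existing global:)

```go
type crashKey struct{}

// WithReallyCrash would record the desired crash behavior in the context.
func WithReallyCrash(ctx context.Context, crash bool) context.Context {
	return context.WithValue(ctx, crashKey{}, crash)
}

// shouldCrash would be checked by the runtime package first, falling
// back to the existing ReallyCrash global when the context says nothing.
func shouldCrash(ctx context.Context) bool {
	if v, ok := ctx.Value(crashKey{}).(bool); ok {
		return v
	}
	return ReallyCrash
}
```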
Looking at this closer, I now understand why `panicCh` exists: `list` spawns a goroutine, and any panic occurring there is meant to be seen as a panic in `list` itself.

Your latest incarnation doesn't achieve that. When calling `utilruntime.HandleCrashWithContext(ctx)` inside the goroutine, the entire process is getting torn down (default behavior outside of tests), without ever writing anything to `panicCh`.

Perhaps that is the right approach. But then `panicCh` is redundant and should be removed.
To be clear, I'm in favor of this PR as far as it goes, plumbing ctx to the lister so we don't hang a background list. I just think we need to make our ctx adapter more correct if we're relying on it to interrupt the list for a context cancel OR remove the possibility of listerwatchers that don't have context (ideal, but more breaking)
(just a quick look in-tree shows ~9 places we're constructing cache.ListWatch with only ListFunc/WatchFunc methods, not the context equivalents ... looks like we haven't really finished the work to make lister watcher context aware ... so relying on it in this PR isn't super solid yet)
/hold
actually, it looks like we have multiple implementations of ListWithContext now that just drop the context:

```go
// List a set of apiserver resources
func (lw *ListWatch) ListWithContext(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
	// ListWatch is used in Reflector, which already supports pagination.
	// Don't paginate here to avoid duplication.
	if lw.ListWithContextFunc != nil {
		return lw.ListWithContextFunc(ctx, options)
	}
	return lw.ListFunc(options)
}

func (l listerWrapper) ListWithContext(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
	return l.parent.List(options)
}
```
if we're relying on ctx cancellation here, we have to make those actually cancel on context cancel
We can fix all in-tree code, but ListWatch is a public API. There's always the possibility that some client-go consumer only supplies the non-context-aware functions. Perhaps we have to really remove those before we can proceed with this PR?
I think the minimum bar to rely on context cancellation here is:
- converting all in-tree uses to honor context
- making the places we adapt context-unaware listers to context-aware signatures unblock and return on context cancellation (leaving the uncancellable lists running/hanging in the background, I guess, with the panic propagation workaround shifted to those adapters; see the sketch after this list)
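A hedged sketch of such an adapter (not actual client-go code; `metav1` is `k8s.io/apimachinery/pkg/apis/meta/v1`, `runtime` is the apimachinery package providing `runtime.Object`, and `utilruntime` is `k8s.io/apimachinery/pkg/util/runtime`):

```go
// listWithCancel adapts a context-unaware list function so the caller
// unblocks on cancellation, leaving the uncancellable list running in
// the background until it returns on its own.
func listWithCancel(ctx context.Context, list func(metav1.ListOptions) (runtime.Object, error), opts metav1.ListOptions) (runtime.Object, error) {
	type result struct {
		obj runtime.Object
		err error
	}
	ch := make(chan result, 1) // buffered: the goroutine can always send and exit
	go func() {
		// Keep the panic policy inside the adapter instead of
		// forwarding panics over a channel.
		defer utilruntime.HandleCrashWithContext(ctx)
		obj, err := list(opts)
		ch <- result{obj, err}
	}()
	select {
	case r := <-ch:
		return r.obj, r.err
	case <-ctx.Done():
		// Stop waiting; the background list keeps running.
		return nil, ctx.Err()
	}
}
```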
To help move more of the out-of-tree ecosystem to threading context cancellation, we have a spectrum of options:
1. Sweep the internet for things that use our concrete ListWatch struct with ListFunc/WatchFunc and file issues or open PRs to switch them to provide ListWithContextFunc / WatchWithContextFunc instead. This is beneficial even if we never do any subsequent steps.
2. Possibly, after we've done 1 and waited some period of time, remove the non-context ListFunc/WatchFunc fields from `ListWatch` (while keeping the methods to satisfy the interface).
3. Possibly, after we've done 2 and waited some period of time, collapse the ListerWatcher and ListerWatcherWithContext interfaces.
/ok-to-test
/retest
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ballista01

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
I like the direction, but we need to figure out whether "cancellation despite hanging list/watch implementation" is something that has to be preserved. I'll respond to #131000 (comment).
…atchWithContext() to centralize error handling
/retest
```diff
@@ -401,6 +401,7 @@ func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
 func (r *Reflector) ListAndWatchWithContext(ctx context.Context) error {
 	logger := klog.FromContext(ctx)
 	logger.V(3).Info("Listing and watching", "type", r.typeDescription, "reflector", r.name)
+	defer utilruntime.HandleCrashWithContext(ctx)
```
we didn't log panics in this method before, I'm not really sure why this level is where we'd choose to start logging them now
I agree.
@ballista01: with "leave this to the caller" in #131000 (comment) I meant the caller of client-go, not a higher-level function in client-go. Please don't resolve comments without giving your reviewer a chance to check whether their comment really is addressed.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR enhances the `Reflector` in `client-go/tools/cache` by adding panic handling mechanisms in the `list` and `watch` operations. The panic handling captures detailed stack trace information when a panic occurs, which helps to accurately identify the source of the panic. Without this improvement, panic messages may not provide sufficient context, making debugging difficult.

Which issue(s) this PR fixes:

Fixes #130817

Special notes for your reviewer:

The following changes were made:

- Panic handling in the `watch` and `list` methods of `reflector.go`.
- `recover()` mechanisms in `watch`.
- Stack trace capture and logging for the `list` and `watch` operations.

Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: