Skip to content

Added stacktrace and logging for list and watch panics in reflector.go #131000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

ballista01
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR enhances the Reflector in client-go/tools/cache by adding panic handling mechanisms in the list and watch operations. The panic handling captures detailed stack trace information when a panic occurs, which helps to accurately identify the source of the panic. Without this improvement, panic messages may not provide sufficient context, making debugging difficult.

Which issue(s) this PR fixes:

Fixes #130817

Special notes for your reviewer:

The following changes were made:

  • Improved logging by providing complete stack trace information in watch and list methods of reflector.go.
  • Added recover() mechanisms in watch.
  • Added test cases to ensure proper panic handling and logging for both list and watch operations.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 23, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @ballista01. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 23, 2025
@ballista01 ballista01 changed the title Added stacktrace and logging for list and watch panics in reflecter.go Added stacktrace and logging for list and watch panics in reflector.go Mar 23, 2025
buf := make([]byte, 1<<16)
stackSize := golangruntime.Stack(buf, false)
stackTrace := string(buf[:stackSize])
klog.Errorf("Panic occurred in Reflector watch operation: %v\nStack Trace:\n%s", r, stackTrace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No direct klog calls, please. Use the logger instance instead.

What is the advantage of this custom code over runtime.HandleCrashWithContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I used it in watch but not list. Here are my reasons:

  1. runtime.HandleCrashWithContext by default re-panics after calling the handlers. That doesn't work here because the panic happens inside a goroutine in list(), and we need to pass the panic back to the main goroutine via panicCh. Re-panicking inside the child goroutine would bypass that and crash the process.

  2. Setting ReallyCrash = false would avoid the re-panic, but it's a global variable and not thread-safe. Toggling it can trigger data races in tests or other concurrent code.

  3. So instead, I used a local recover() block that mimics HandleCrashWithContext but without the re-panic. It still calls the registered panic handlers (so we get the logs and stack trace), and then sends the panic over the channel for the caller to handle.

  4. If we think this pattern is useful more broadly, I’d be happy to propose a patch to staging/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go to support it more cleanly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, just after I wrote these, I found I can make it cleaner by using

	defer func() {
		if r := recover(); r != nil {
			panicCh <- r
		}
	}()

	defer utilruntime.HandleCrashWithContext(ctx)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sends the panic over the channel for the caller to handle

Who needs this? What is the caller supposed to do with the panic?

We seem to have a policy in Kubernetes which is implemented by k8s.io/apimachinery/pkg/util/runtime: "panics should not occur and are fatal". If you feel that this code here needs a different policy, then I would like to see confirmation from SIG Arch on that.

I agree that for testing purposes it makes sense to apply a different policy, and I agree that the global variable in the runtime package is sub-optimal. If that is what you are trying to fix, then we could look for simpler alternatives. For example, a "don't exit" flag could be set in the context and get checked by the package first before falling back to the global variable.

Copy link
Contributor

@pohly pohly Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this closer, I now understand why panicCh exists: list spawns a goroutine and any panic occurring there is meant to be seen as a panic in list itself.

Your latest incarnation doesn't achieve that. When calling utilruntime.HandleCrashWithContext(ctx) inside the goroutine, the entire process is getting torn down (default behavior outside of tests), without ever writing anything to panicCh.

Perhaps that is the right approach. But then the panicCh is redundant and should be removed.

Copy link
Member

@liggitt liggitt May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I'm in favor of this PR as far as it goes, plumbing ctx to the lister so we don't hang a background list. I just think we need to make our ctx adapter more correct if we're relying on it to interrupt the list for a context cancel OR remove the possibility of listerwatchers that don't have context (ideal, but more breaking)

(just a quick look in-tree shows ~9 places we're constructing cache.ListWatch with only ListFunc/WatchFunc methods, not the context equivalents ... looks like we haven't really finished the work to make lister watcher context aware ... so relying on it in this PR isn't super solid yet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, it looks like we have multiple implementations of ListWithContext now that just drop the context:

// List a set of apiserver resources
func (lw *ListWatch) ListWithContext(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
	// ListWatch is used in Reflector, which already supports pagination.
	// Don't paginate here to avoid duplication.
	if lw.ListWithContextFunc != nil {
		return lw.ListWithContextFunc(ctx, options)
	}
	return lw.ListFunc(options)
}
func (l listerWrapper) ListWithContext(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
	return l.parent.List(options)
}

if we're relying on ctx cancellation here, we have to make those actually cancel on context cancel

Copy link
Contributor

@pohly pohly May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can fix all in-tree code, but ListWatch is a public API. There's always the possibility that some client-go consumer only supplies the non-context-aware functions. Perhaps we have to really remove those before we can proceed with this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the minimum bar to rely on context cancellation here is:

  1. converting all in-tree uses to honor context
  2. making the places we adapt context-unaware listers to context-aware signatures unblock and return on context cancellation (leaving the uncancellable lists running/hanging in the background, I guess, with the panic propagation workaround shifted to those adapters)

To help move more of the out-of-tree ecosystem to threading context cancellation, we have a spectrum of options:

  1. Sweep the internet for things that use our concrete ListWatch struct with ListFunc/WatchFunc and file issues or open PRs to switch them to provide ListWithContextFunc / WatchWithContextFunc instead. This is beneficial even if we never do any subsequent steps.
  2. Possibly, after we've done 1 and waited some period of time, remove the non-context ListFunc/WatchFunc fields from ListWatch (while keeping the methods to satisfy the interface)
  3. Possibly, after we've done 2 and waited some period of time, collapse the ListerWatcher and ListerWatcherWithContext interfaces.

@seans3
Copy link
Contributor

seans3 commented Mar 25, 2025

/ok-to-test
/triage accepted

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2025
@ballista01
Copy link
Contributor Author

/retest

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ballista01
Once this PR has been reviewed and has the lgtm label, please assign jpbetz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction, but we need to figure out whether "cancellation despite hanging list/watch implementation" is something that has to be preserved. I'll respond to #131000 (comment).

@ballista01
Copy link
Contributor Author

/retest

@@ -401,6 +401,7 @@ func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
func (r *Reflector) ListAndWatchWithContext(ctx context.Context) error {
logger := klog.FromContext(ctx)
logger.V(3).Info("Listing and watching", "type", r.typeDescription, "reflector", r.name)
defer utilruntime.HandleCrashWithContext(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we didn't log panics in this method before, I'm not really sure why this level is where we'd choose to start logging them now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

@ballista01: with "leave this to the caller" in #131000 (comment) I meant the caller of client-go, not a higher-level function in client-go. Please don't resolve comments without giving your reviewer a chance to check whether their comment really is addressed.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 2, 2025
@liggitt liggitt added this to @liggitt May 3, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 26, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

Reflector panic doesn't provide any debug information
6 participants