Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

component-base/logs: improve handling of re-applying a configuration #117108

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Apr 5, 2023

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

Normal binaries should never have re-apply the configuration. It's not safe when there are already some goroutines running which might do logging. Therefore the new default is to return an error when a binary accidentally re-applies.

The unit tests of component-base/logs ensure that there are no goroutines and have to call the functions more than once.

Integration tests use the same code as the normal binaries. To make reuse of that code safe, component-base/logs can be configured to silently ignore any additional calls. This addresses data races that were found when enabling -race for integration tests.

To avoid having to modify all integration tests which start test servers, reconfiguring component-base/logs is done by the test server packages.

Which issue(s) this PR fixes:

Data race in k8s.io/kubernetes/test/integration/serving.TestComponentSecureServingAndAuth.

WARNING: DATA RACE
Write at 0x00000b345548 by goroutine 6459:
  k8s.io/kubernetes/vendor/k8s.io/klog/v2.ClearLogger()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/contextual.go:137 +0x524
  k8s.io/kubernetes/vendor/k8s.io/component-base/logs/api/v1.apply()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/logs/api/v1/options.go:194 +0x514
  k8s.io/kubernetes/vendor/k8s.io/component-base/logs/api/v1.validateAndApply()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/logs/api/v1/options.go:102 +0xab
  k8s.io/kubernetes/vendor/k8s.io/component-base/logs/api/v1.ValidateAndApply()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/logs/api/v1/options.go:68 +0xbb
  k8s.io/kubernetes/vendor/k8s.io/cloud-provider/app.NewCloudControllerManagerCommand.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/cloud-provider/app/controllermanager.go:94 +0x91
  k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:916 +0xbc5
  k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:1040 +0x5da
  k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:968 +0x2e
  k8s.io/kubernetes/vendor/k8s.io/cloud-provider/app/testing.StartTestServer.func4()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/cloud-provider/app/testing/testserver.go:169 +0xe5

Previous read at 0x00000b345548 by goroutine 2137:
  k8s.io/kubernetes/vendor/k8s.io/klog/v2.newVerbose()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1295 +0x6aa
  k8s.io/kubernetes/vendor/k8s.io/klog/v2.VDepth()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1359 +0x6b7
  k8s.io/kubernetes/vendor/k8s.io/klog/v2.V()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1322 +0x72a
  k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi.(*AggregationController).processNextWorkItem()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi/controller.go:122 +0x737
  k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi.(*AggregationController).runWorker()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi/controller.go:92 +0x39
  k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi.(*AggregationController).runWorker-fm()
      <autogenerated>:1 +0x1d
  k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x48
  k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xce
  k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x10d
  k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x48
  k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi.(*AggregationController).Run.func4()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi/controller.go:86 +0x4c

Goroutine 6459 (running) created at:
  k8s.io/kubernetes/vendor/k8s.io/cloud-provider/app/testing.StartTestServer()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/cloud-provider/app/testing/testserver.go:165 +0x1ac4
  k8s.io/kubernetes/test/integration/serving.cloudControllerManagerTester.StartTestServer()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/serving/serving_test.go:62 +0x87
  k8s.io/kubernetes/test/integration/serving.(*cloudControllerManagerTester).StartTestServer()
      <autogenerated>:1 +0x88
  k8s.io/kubernetes/test/integration/serving.testComponentWithSecureServing.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/serving/serving_test.go:228 +0x231
  testing.tRunner()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.2.linux.amd64/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.2.linux.amd64/src/testing/testing.go:1629 +0x47

Goroutine 2137 (running) created at:
  k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi.(*AggregationController).Run()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/controllers/openapi/controller.go:86 +0x351
  k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/apiserver.(*APIAggregator).PrepareRun.func1.1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/apiserver/apiserver.go:386 +0x47

Does this PR introduce a user-facing change?

component-base/logs is now more strict about not applying configurations multiple times and will return an error when that is attempted. Can be overridden by binaries which need to do that.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2023
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.27 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.27.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Apr 5 10:36:42 UTC 2023.

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver area/cloudprovider area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 5, 2023
@k8s-triage-robot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@pohly
Copy link
Contributor Author

pohly commented Apr 6, 2023

/retest

@logicalhan
Copy link
Member

/triage accepted

1 similar comment
@cici37
Copy link
Contributor

cici37 commented Apr 6, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 6, 2023
@pohly
Copy link
Contributor Author

pohly commented May 23, 2023

/assign @liggitt

Can you perhaps take a look? You were involved in designing the k8s.io/component-base/logs/api/v1 before.

default:
return fmt.Errorf("invalid value %d for ReapplyHandling", ReapplyHandling)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make more sense to actually detect if the config being applied differed, and always error if so?

that would avoid false positives where identical config was specified, and avoid silently ignoring config changes in tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should continue to be an error in applications (iReapplyHandlingError) f they apply twice, even if the config is the same: the config might only be the same under the current circumstances and different under others. I think it's better to fail under all circumstances because it makes obvious that something is wrong.

Comparing configs might make sense for ReapplyHandlingIgnore: if an integration test tries to run with different configs, then it's better to raise that as an error instead of silently ignoring it.

For ReapplyHandlingAllow it's most likely a different config, so we can compare and skip if equal, it just won't change much.

Copy link
Member

@liggitt liggitt May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collapsing the test-caller cases to ignore a reapply if identical and error if different seems good... if a test is reapplying a different config it is not testing what it thinks it is, and I would want to know so I could move my test to a different package or something to get an isolated test process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit tests where ReapplyHandlingAllow is used do need to apply different configs - that's their purpose 😅

So combining both cases doesn't work. I've pushed a version where the config get compared for the ReapplyHandlingIgnore case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying we should not have ReapplyHandlingAllow, only ReapplyHandlingIgnoreIfEqual and ReapplyHandlingError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of those unit tests are elsewhere:

I would move the unit tests, either into the package that can reset the state, or into separate integration packages, rather than enable an "ignore ineffective config" option

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either into the package that can reset the state

That would be k8s.io/component-base/logs/api/v1. Then that package inherits new dependencies (like zapr) that are undesirable and/or we get circular dependencies.

separate integration packages

Isn't that what test/integration/logs/benchmark/load_test.go is? How would such a package reset the state in k8s.io/component-base/logs/api/v1 if we don't offer an API for that?

Hmm, perhaps we can add a k8s.io/component-base/logs/api/v1.Reset? That has to come with a big warning that it should only be used when no goroutines are running and is only meant for such unit tests - basically the same semantic as ReapplyHandlingAllow, except that it is a separate call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems ok. we've done things like that before:

// SetServiceResolverForTests allows the service resolver to be overridden during tests.
// Tests using this function must run serially as this function is not safe to call concurrently with server start.
func SetServiceResolverForTests(resolver webhook.ServiceResolver) func() {
// SetForTests sets capabilities for tests.  Convenience method for testing.  This should only be called from tests.
func SetForTests(c Capabilities) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pohly pohly force-pushed the test-integration-race-detection-component-base-logs branch from 0bbac15 to 47b086e Compare May 24, 2023 11:17
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 22, 2023
@pohly pohly force-pushed the test-integration-race-detection-component-base-logs branch from 47b086e to f9334fe Compare June 26, 2023 12:35
@pohly
Copy link
Contributor Author

pohly commented Jun 26, 2023

Rebased due to a conflict.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 26, 2023
@pohly pohly force-pushed the test-integration-race-detection-component-base-logs branch from f9334fe to 64e1273 Compare June 29, 2023 15:35
@pohly
Copy link
Contributor Author

pohly commented Jun 29, 2023

This PR is getting tested in combination with enabled race detection over in #117108.

@pohly pohly force-pushed the test-integration-race-detection-component-base-logs branch from 64e1273 to 4c953f5 Compare July 5, 2023 15:41
@pohly pohly force-pushed the test-integration-race-detection-component-base-logs branch from 4c953f5 to c031805 Compare July 5, 2023 15:50
Comment on lines 278 to 281
// Reset restores the default settings. This is not thread-safe and should only
// be used when there are no goroutines running. The intented users are unit
// tests in other packages.
func Reset(featureGate featuregate.FeatureGate) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Reset restores the default settings. This is not thread-safe and should only
// be used when there are no goroutines running. The intented users are unit
// tests in other packages.
func Reset(featureGate featuregate.FeatureGate) error {
// ResetForTest restores the default settings. This is not thread-safe and should only
// be used when there are no goroutines running. The intended users are unit
// tests in other packages.
func ResetForTest(featureGate featuregate.FeatureGate) error {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@pohly pohly force-pushed the test-integration-race-detection-component-base-logs branch from c031805 to bd4b902 Compare July 5, 2023 17:08
Normal binaries should never have to do this. It's not safe when there are
already some goroutines running which might do logging. Therefore the new
default is to return an error when a binary accidentally re-applies.

A few unit ensure that there are no goroutines and have to call the functions
more then once. The new ResetForTest API gets used by those to enable changing the
logging settings more than once in the same process.

Integration tests use the same code as the normal binaries. To make reuse of
that code safe, component-base/logs can be configured to silently ignore any
additional calls. This addresses data races that were found when enabling -race
for integration tests. To catch cases where the integration test does want
to modify the config, the old and new config get compared and an error is
raised when it's not the same.

To avoid having to modify all integration tests which start test servers,
reconfiguring component-base/logs is done by the test server packages.
@pohly pohly force-pushed the test-integration-race-detection-component-base-logs branch from bd4b902 to 02efe09 Compare July 5, 2023 17:09
@liggitt
Copy link
Member

liggitt commented Jul 5, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 5, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ce34b562b69c93999bd078d18803ceaffbb46fdb

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 5, 2023
@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

2 similar comments
@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit e5efa0a into kubernetes:master Jul 6, 2023
13 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/cloudprovider area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants