Skip to content

etcd output interception #131350

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 26, 2025
Merged

etcd output interception #131350

merged 1 commit into from
Jun 26, 2025

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Apr 17, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

etcd: better logging

Previously, etcd wrote to stderr in JSON format:

{"level":"warn","ts":"2025-04-11T03:32:06.676527Z","caller":"embed/config.go:689","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"warn","ts":"2025-04-11T03:32:06.676707Z","caller":"embed/config.go:689","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"warn","ts":"2025-04-11T03:32:06.677056Z","caller":"etcdmain/etcd.go:146","msg":"failed to start etcd","error":"listen tcp 127.0.0.1:37803: bind: address already in use"}
{"level":"fatal","ts":"2025-04-11T03:32:06.677104Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"listen tcp 127.0.0.1:37803: bind: address already in use","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:272"}

This has several drawbacks:

  • Not very readable.
  • When used in tests which start etcd themselves (for example, scheduler_perf),
    the output is not associated with the current test.
  • Contains warnings that are confusing for developers who don't know that they
    are harmless.

Intercepting output, parsing it and reformating makes the output nicer. Instead
of a mixture of JSON messages (see above) and normal test output, we now get
the etcd output embedded inside the test output. We can also filter out some
known harmless messages. Cleaning up more output or avoiding it in the first place might be a good
next step.

With go test -v ./test/integration/scheduler_perf/dra:

=== RUN   TestSchedulerPerf
=== RUN   TestSchedulerPerf/SchedulingWithResourceClaimTemplate
=== RUN   TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast
I0411 13:21:03.353458   65212 feature_gate.go:385] feature gates: {map[SchedulerQueueingHints:false]}
...
I0411 13:21:10.552975   65212 cidrallocator.go:210] stopping ServiceCIDR Allocator Controller
I0411 13:21:10.567327   65212 etcd.go:210] "etcd output" logger="TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast" error="accept tcp 127.0.0.1:42245: use of closed network connection" level="warn" ts="2025-04-11T13:21:10.567045+0200" caller="embed/serve.go:160" msg="stopping insecure grpc server due to error"
I0411 13:21:10.567398   65212 etcd.go:210] "etcd output" logger="TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast" ts="2025-04-11T13:21:10.567198+0200" caller="embed/serve.go:162" msg="stopped insecure grpc server due to error" error="accept tcp 127.0.0.1:42245: use of closed network connection" level="warn"
I0411 13:21:15.567917   65212 etcd.go:227] "etcd didn't exit in 5 seconds, killing it" logger="TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast"
I0411 13:21:15.567964   65212 etcd.go:234] "etcd exited" logger="TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast" err="signal: terminated"

With per-test output go test -v ./test/integration/scheduler_perf/dra -args -use-testing-log:

=== RUN   TestSchedulerPerf
=== RUN   TestSchedulerPerf/SchedulingWithResourceClaimTemplate
=== RUN   TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast
I0411 13:19:03.540497   28645 feature_gate.go:385] feature gates: {map[DynamicResourceAllocation:true]}
...
I0411 13:19:10.519994   28645 cidrallocator.go:210] stopping ServiceCIDR Allocator Controller
    etcd.go:210: I0411 13:19:10.533131] etcd output msg="stopping insecure grpc server due to error" error="accept tcp 127.0.0.1:46637: use of closed network connection" level="warn" ts="2025-04-11T13:19:10.532900+0200" caller="embed/serve.go:160"
    etcd.go:210: I0411 13:19:10.533274] etcd output msg="stopped insecure grpc server due to error" error="accept tcp 127.0.0.1:46637: use of closed network connection" level="warn" ts="2025-04-11T13:19:10.533022+0200" caller="embed/serve.go:162"
    etcd.go:227: I0411 13:19:15.533715] etcd didn't exit in 5 seconds, killing it
    etcd.go:234: I0411 13:19:15.533803] etcd exited err="signal: terminated"

Special notes for your reviewer:

Based on #131349 and #131259 which got merged.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels Apr 17, 2025
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.33 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.33.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Thu Apr 17 07:35:58 UTC 2025.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 17, 2025
@k8s-ci-robot k8s-ci-robot added area/apiserver area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 17, 2025
Previously, etcd wrote to stderr in JSON format:

   {"level":"warn","ts":"2025-04-11T03:32:06.676527Z","caller":"embed/config.go:689","msg":"Running http and grpc server on single port. This is not recommended for production."}
   {"level":"warn","ts":"2025-04-11T03:32:06.676707Z","caller":"embed/config.go:689","msg":"Running http and grpc server on single port. This is not recommended for production."}
   {"level":"warn","ts":"2025-04-11T03:32:06.677056Z","caller":"etcdmain/etcd.go:146","msg":"failed to start etcd","error":"listen tcp 127.0.0.1:37803: bind: address already in use"}
   {"level":"fatal","ts":"2025-04-11T03:32:06.677104Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"listen tcp 127.0.0.1:37803: bind: address already in use","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:272"}

This has several drawbacks:
- Not very readable.
- When used in tests which start etcd themselves (for example, scheduler_perf),
  the output is not associated with the current test.
- Contains warnings that are confusing for developers who don't know that they
  are harmless.

Intercepting output, parsing it and reformating makes the output nicer. Instead
of a mixture of JSON messages (see above) and normal test output, we now get
the etcd output embedded inside the test output. We can also filter out some
known harmless messages. Cleaning up more output or avoiding it in the first
place might be a good next step.

With `go test -v ./test/integration/scheduler_perf/dra`:

    === RUN   TestSchedulerPerf
    === RUN   TestSchedulerPerf/SchedulingWithResourceClaimTemplate
    === RUN   TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast
    I0411 13:21:03.353458   65212 feature_gate.go:385] feature gates: {map[SchedulerQueueingHints:false]}
    ...
    I0411 13:21:10.552975   65212 cidrallocator.go:210] stopping ServiceCIDR Allocator Controller
    I0411 13:21:10.567327   65212 etcd.go:210] "etcd output" logger="TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast" error="accept tcp 127.0.0.1:42245: use of closed network connection" level="warn" ts="2025-04-11T13:21:10.567045+0200" caller="embed/serve.go:160" msg="stopping insecure grpc server due to error"
    I0411 13:21:10.567398   65212 etcd.go:210] "etcd output" logger="TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast" ts="2025-04-11T13:21:10.567198+0200" caller="embed/serve.go:162" msg="stopped insecure grpc server due to error" error="accept tcp 127.0.0.1:42245: use of closed network connection" level="warn"
    I0411 13:21:15.567917   65212 etcd.go:227] "etcd didn't exit in 5 seconds, killing it" logger="TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast"
    I0411 13:21:15.567964   65212 etcd.go:234] "etcd exited" logger="TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast" err="signal: terminated"

With per-test output `go test -v ./test/integration/scheduler_perf/dra -args -use-testing-log`:

    === RUN   TestSchedulerPerf
    === RUN   TestSchedulerPerf/SchedulingWithResourceClaimTemplate
    === RUN   TestSchedulerPerf/SchedulingWithResourceClaimTemplate/fast
    I0411 13:19:03.540497   28645 feature_gate.go:385] feature gates: {map[DynamicResourceAllocation:true]}
    ...
    I0411 13:19:10.519994   28645 cidrallocator.go:210] stopping ServiceCIDR Allocator Controller
        etcd.go:210: I0411 13:19:10.533131] etcd output msg="stopping insecure grpc server due to error" error="accept tcp 127.0.0.1:46637: use of closed network connection" level="warn" ts="2025-04-11T13:19:10.532900+0200" caller="embed/serve.go:160"
        etcd.go:210: I0411 13:19:10.533274] etcd output msg="stopped insecure grpc server due to error" error="accept tcp 127.0.0.1:46637: use of closed network connection" level="warn" ts="2025-04-11T13:19:10.533022+0200" caller="embed/serve.go:162"
        etcd.go:227: I0411 13:19:15.533715] etcd didn't exit in 5 seconds, killing it
        etcd.go:234: I0411 13:19:15.533803] etcd exited err="signal: terminated"
@pohly pohly changed the title WIP: etcd output interception etcd output interception Apr 24, 2025
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 24, 2025
@pohly pohly marked this pull request as ready for review April 24, 2025 06:57
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2025
@k8s-ci-robot k8s-ci-robot requested review from macsko and SataQiu April 24, 2025 06:57
@pohly
Copy link
Contributor Author

pohly commented Apr 24, 2025

/assign @serathius @aojea @BenTheElder

The other two etcd enhancements landed after code thaw, let's continue with the output improvement.

@siyuanfoundation
Copy link
Contributor

/triage accepted
/cc @siyuanfoundation

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2025
@dims
Copy link
Member

dims commented Jun 26, 2025

thanks @pohly this looks good to me, let's try it out!

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 4e61fea50e7d7932569b4fb792934368c1aa8efd

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 28e4f3f into kubernetes:master Jun 26, 2025
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants