Webhook conversion metrics [request/error counts and latency metrics] #118292

cchapla · 2023-05-26T19:40:19Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Adds webhook conversion metrics: request counts, split by success and failure, and request latency.

Which issue(s) this PR fixes:

Ref #117167

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Kube-apiserver adds two new alpha metrics `conversion_webhook_request_total` and `conversion_webhook_duration_seconds` that allow users to monitor requests to CRD conversion webhooks, split by result and, in case of failure, by failure_type.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

linux-foundation-easycla · 2023-05-26T19:40:22Z

The committers listed above are authorized under a signed CLA.

✅ login: cchapla / name: CC (8df1a5e)

k8s-ci-robot · 2023-05-26T19:40:28Z

Welcome @cchapla!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2023-05-26T19:40:28Z

Hi @cchapla. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cici37 · 2023-05-30T20:12:26Z

/triage accepted
/sig instrumentation
/cc @logicalhan

shyamjvs · 2023-05-30T21:31:54Z

/ok-to-test

shyamjvs · 2023-05-30T21:45:55Z

/priority important-soon

logicalhan · 2023-05-30T21:47:10Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics.go

+ &metrics.HistogramOpts{
+ Name: "webhook_conversion_duration_seconds",
+ Help: "Webhook conversion request latency",
+ Buckets: metrics.ExponentialBuckets(0.001, 2, 15),


It's not obvious to me what the actual buckets are from looking at this.
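For reference, `ExponentialBuckets(start, factor, count)` yields `count` upper bounds, starting at `start` and multiplying by `factor` each time. The sketch below reimplements that rule locally so it runs without dependencies. The name `exponentialBuckets` is hypothetical, and the sketch assumes component-base's `metrics.ExponentialBuckets` follows the Prometheus client's semantics. With the arguments above it prints 15 bounds from 0.001 s to 16.384 s.

```go
package main

import "fmt"

// exponentialBuckets is a hypothetical, dependency-free stand-in that mirrors
// the Prometheus semantics: count upper bounds, starting at start, each one
// factor times the previous.
func exponentialBuckets(start, factor float64, count int) []float64 {
	if count < 1 || start <= 0 || factor <= 1 {
		panic("exponentialBuckets: need count >= 1, start > 0, factor > 1")
	}
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Same arguments as the HistogramOpts under review.
	for i, b := range exponentialBuckets(0.001, 2, 15) {
		fmt.Printf("bucket %2d: le=%gs\n", i, b)
	}
}
```

The last line printed is `bucket 14: le=16.384s`, so anything slower than about 16.4 s is counted only in the implicit `+Inf` bucket.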

logicalhan · 2023-05-30T21:47:46Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

@@ -0,0 +1,268 @@
+/*
+Copyright 2019 The Kubernetes Authors.


Suggested change

-Copyright 2019 The Kubernetes Authors.
+Copyright 2023 The Kubernetes Authors.

logicalhan · 2023-05-30T21:48:40Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+ webhookConversionLatency: Metrics.webhookConversionLatency,
+ },
+ args: args{
+ ctx: context.TODO(),


I'd just eliminate this and pass it in to the function directly.

logicalhan · 2023-05-30T21:48:47Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+ webhookConversionLatency: Metrics.webhookConversionLatency,
+ },
+ args: args{
+ ctx: context.TODO(),


logicalhan · 2023-05-30T21:49:01Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+ webhookConversionLatency: Metrics.webhookConversionLatency,
+ },
+ args: args{
+ ctx: context.TODO(),


logicalhan · 2023-05-30T21:49:15Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+ webhookConversionLatency: Metrics.webhookConversionLatency,
+ },
+ args: args{
+ ctx: context.TODO(),


logicalhan · 2023-05-30T21:49:22Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+ webhookConversionLatency: Metrics.webhookConversionLatency,
+ },
+ args: args{
+ ctx: context.TODO(),


logicalhan · 2023-05-30T21:49:29Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+ webhookConversionLatency: Metrics.webhookConversionLatency,
+ },
+ args: args{
+ ctx: context.TODO(),


logicalhan · 2023-05-30T21:51:24Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+func expectCounterValue(t *testing.T, name string, labelFilter map[string]string, wantCount int) {
+ metrics, err := legacyregistry.DefaultGatherer.Gather()
+ if err != nil {
+ t.Fatalf("Failed to gather metrics: %s", err)
+ }
+
+ counterSum := 0
+ for _, mf := range metrics {
+ if mf.GetName() != name {
+ continue // Ignore other metrics.
+ }
+ for _, metric := range mf.GetMetric() {
+ if !testutil.LabelsMatch(metric, labelFilter) {
+ continue
+ }
+ counterSum += int(metric.GetCounter().GetValue())
+ }
+ }
+ if wantCount != counterSum {
+ t.Errorf("Wanted count %d, got %d for metric %s with labels %#+v", wantCount, counterSum, name, labelFilter)
+ for _, mf := range metrics {
+ if mf.GetName() == name {
+ for _, metric := range mf.GetMetric() {
+ t.Logf("\tnear match: %s", metric.String())
+ }
+ }
+ }
+ }
+}
+
+func expectHistogramCountTotal(t *testing.T, name string, labelFilter map[string]string, wantCount int) {
+ metrics, err := legacyregistry.DefaultGatherer.Gather()
+ if err != nil {
+ t.Fatalf("Failed to gather metrics: %s", err)
+ }
+
+ counterSum := 0
+ for _, mf := range metrics {
+ if mf.GetName() != name {
+ continue // Ignore other metrics.
+ }
+ for _, metric := range mf.GetMetric() {
+ if !testutil.LabelsMatch(metric, labelFilter) {
+ continue
+ }
+ counterSum += int(metric.GetHistogram().GetSampleCount())
+ }
+ }
+ if wantCount != counterSum {
+ t.Errorf("Wanted count %d, got %d for metric %s with labels %#+v", wantCount, counterSum, name, labelFilter)
+ for _, mf := range metrics {
+ if mf.GetName() == name {
+ for _, metric := range mf.GetMetric() {


I would move these into component-base/metrics/testutils, make them public and rename them assertXCount or whatnot.
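The two helpers above differ only in which value they read from each matching series (`GetCounter().GetValue()` vs `GetHistogram().GetSampleCount()`). One way to share the loop is to pass the value extractor in as a function. The sketch below shows this with simplified, hypothetical types (`family`, `sample`) in place of the real gathered `MetricFamily` protobufs, so the example stays self-contained. Public `assert…Count` helpers in a testutil package could wrap these and compare the sum against the wanted count.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for gathered metric data.
type sample struct {
	labels         map[string]string
	counterValue   float64
	histogramCount uint64
}

type family struct {
	name    string
	samples []sample
}

// labelsMatch reports whether every key/value in filter is present on s.
func labelsMatch(s sample, filter map[string]string) bool {
	for k, v := range filter {
		if s.labels[k] != v {
			return false
		}
	}
	return true
}

// sumMatching is the shared loop; only the per-sample value differs.
func sumMatching(families []family, name string, filter map[string]string, value func(sample) int) int {
	total := 0
	for _, f := range families {
		if f.name != name {
			continue // Ignore other metrics.
		}
		for _, s := range f.samples {
			if labelsMatch(s, filter) {
				total += value(s)
			}
		}
	}
	return total
}

// Hypothetical wrappers for the counter and histogram cases.
func counterCount(families []family, name string, filter map[string]string) int {
	return sumMatching(families, name, filter, func(s sample) int { return int(s.counterValue) })
}

func histogramSampleCount(families []family, name string, filter map[string]string) int {
	return sumMatching(families, name, filter, func(s sample) int { return int(s.histogramCount) })
}

func main() {
	families := []family{{
		name: "requests_total",
		samples: []sample{
			{labels: map[string]string{"result": "success"}, counterValue: 3},
			{labels: map[string]string{"result": "failure"}, counterValue: 1},
		},
	}}
	fmt.Println(counterCount(families, "requests_total", map[string]string{"result": "success"})) // 3
	fmt.Println(counterCount(families, "requests_total", nil))                                     // 4
}
```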

logicalhan · 2023-05-30T21:53:33Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics.go

+func newWebhookConversionMetrics() *WebhookConversionMetrics {
+ webhookConversionRequest := metrics.NewCounterVec(
+ &metrics.CounterOpts{
+ Name: "webhook_conversion_requests",


Subsystem should be "apiserver".

Counters should be suffixed _total.

Do we also provide "Namespace = apiextensions-apiserver"?

No, I'd just use "apiserver" as the Namespace. Otherwise the metric name will be prefixed apiserver_apiextensions_apiserver_

Not clear.
Do you mean both subsystem and namespace will be "apiserver", or should I just provide the namespace without a subsystem?

Making both "apiserver" would create a name like "apiserver_apiserver_webhook_conversion_duration_seconds".

The metric name is comprised as <Namespace>_<Subsystem>_<Name>. So if you specify "apiextensions-apiserver" as a namespace and "apiserver" as a subsystem, you end up with apiextensions_apiserver_apiserver as a prefix to your metric name.

I'm just saying only use one of {Namespace, Subsystem}; do not use both. And use "apiserver", since that's what we use everywhere else.

… etc
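To make the naming rule concrete, the fully qualified name is `<Namespace>_<Subsystem>_<Name>`, with empty parts skipped. The sketch below mirrors that joining rule. It is modeled on `prometheus.BuildFQName` from client_golang; `fqName` itself is a hypothetical local name. Because Prometheus metric names cannot contain `-`, the sketch uses an underscore form for the hypothetical separate namespace.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName joins namespace, subsystem and name with "_", skipping empty parts.
// An empty name yields an empty result.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	const name = "webhook_conversion_duration_seconds"
	// Both set to "apiserver": the prefix is doubled.
	fmt.Println(fqName("apiserver", "apiserver", name))
	// A separate namespace plus a subsystem stacks both prefixes.
	fmt.Println(fqName("apiextensions_apiserver", "apiserver", name))
	// The reviewer's recommendation: only one of the two, set to "apiserver".
	fmt.Println(fqName("apiserver", "", name))
}
```

This prints `apiserver_apiserver_webhook_conversion_duration_seconds`, `apiextensions_apiserver_apiserver_webhook_conversion_duration_seconds` and `apiserver_webhook_conversion_duration_seconds`, in that order.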

logicalhan · 2023-05-31T01:38:41Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics.go

+ Name: "webhook_conversion_duration_seconds",
+ Namespace: namespace,
+ Help: "Webhook conversion request latency",
+ // 0.001, 0.002, 0.004, .... 16.384 [1ms, 2ms, 4ms, ...., 16,384 ms]


16 seconds is a weird upper bound; maybe add one more bucket? Webhooks time out at 10 seconds by default, but can be configured to time out at 30.

Yeah, now that you point out the 16.384-second bound.

How about directly using:
0.01, 0.02, 0.05, 1, 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60
Or maybe we could go directly from 30 to 60, just in case.

That is much better.

logicalhan · 2023-05-31T19:16:09Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+ wantLabels map[string]string
+ expectedRequestValue int
+ }{
+ // TODO: Add test cases.


Suggested change

-// TODO: Add test cases.

logicalhan · 2023-05-31T19:16:25Z

staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/metrics_test.go

+ expectedRequestValue int
+ expectedLatencyCount int
+ }{
+ // TODO: Add test cases.


Suggested change

-// TODO: Add test cases.

logicalhan

/lgtm
/approve

Thanks for the iterations!

k8s-ci-robot · 2023-05-31T19:27:49Z

LGTM label has been added.

Git tree hash: 930dd39309502bbe473eb161321e89cdccacf91f

cchapla · 2023-06-01T14:25:32Z

/assign @deads2k

dims · 2023-06-06T21:24:21Z

/approved

(Applying approved here, as @logicalhan's approval does not seem to cover the staging/src/k8s.io/apiextensions-apiserver directory. Yes! The changes look good to me as well!)

cc @deads2k @sttts @jpbetz

dims · 2023-06-06T23:59:52Z

/approve

(whoops typo!)

k8s-ci-robot · 2023-06-07T00:00:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cchapla, dims, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/apiextensions-apiserver/OWNERS~~ [dims]
~~staging/src/k8s.io/component-base/metrics/OWNERS~~ [dims,logicalhan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shyamjvs · 2023-06-07T16:36:40Z

@cchapla can you update the release-note to make the new metrics and their meaning clear? Something like:

Kube-apiserver adds two new alpha metrics `webhook_conversion_request_total` and `webhook_conversion_duration_seconds` that allow users to monitor requests to CRD conversion webhooks, split by result.

Also @kubernetes/sig-instrumentation-approvers - I know it's a bit late, but can someone quickly check if the metric convention used here is ok? Is conversion_webhook a better prefix than webhook_conversion? @cchapla can fix it in a follow-up PR if needed.

Webhook conversion metrics

8df1a5e

k8s-ci-robot requested a review from alexzielenski May 26, 2023 19:40

k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 26, 2023

k8s-ci-robot requested a review from logicalhan May 26, 2023 19:40

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 26, 2023

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 26, 2023

k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 30, 2023

logicalhan reviewed May 30, 2023

Review comments, added metric namespace, moved utility functions, and…

705c6ff

… etc

k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label May 31, 2023

logicalhan reviewed May 31, 2023

Changes to histogram buckets

6426962

cchapla requested a review from logicalhan May 31, 2023 19:13

logicalhan reviewed May 31, 2023

Changes to buckets and comments

c539c73

logicalhan reviewed May 31, 2023

k8s-ci-robot assigned logicalhan May 31, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2023

k8s-ci-robot assigned deads2k Jun 1, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2023

k8s-ci-robot merged commit 9ede836 into kubernetes:master Jun 7, 2023
12 checks passed

k8s-ci-robot added this to the v1.28 milestone Jun 7, 2023

cchapla deleted the crd_webhook_metrics branch June 7, 2023 18:22

This was referenced Jun 7, 2023

Updating names from webhookconversion to conversionwebhook for apiserver #118542

Merged

Monitoring gaps in apiserver extension mechanisms #117167

Open
