Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Authorization tracking request/error counts and latency metrics #117211

Merged
merged 1 commit into from
May 5, 2023

Conversation

HirazawaUi
Copy link
Contributor

@HirazawaUi HirazawaUi commented Apr 11, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

add Authentication tracking request/error counts and latency metrics

Which issue(s) this PR fixes:

Fixes #117167

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kube-apiserver adds two new metrics `authorization_attempts_total` and `authorization_duration_seconds` that allow users to monitor requests to authorization webhooks, split by result.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 11, 2023
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.27 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.27.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Tue Apr 11 14:07:05 UTC 2023.

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 11, 2023
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 11, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @HirazawaUi. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 11, 2023
@cici37
Copy link
Contributor

cici37 commented Apr 11, 2023

/cc @shyamjvs issue owner for initial review. Thank you
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2023
resultLabel = successLabel
case authorized == authorizer.DecisionNoOpinion:
// continue to the next authorizer
resultLabel = successLabel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be quite confusing to return a success when the authorizer didn't make a decision. I would understand a result="success" label of authorization_attempts_total to represent an authorization, which is not the case with DecisionNoOpinion.

Maybe we could just have a decision label instead of result?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be quite confusing to return a success when the authorizer didn't make a decision. I would understand a result="success" label of authorization_attempts_total to represent an authorization, which is not the case with DecisionNoOpinion.

Maybe we could just have a decision label instead of result?

In the Authorize interface, the action of Authorize does not exactly match the presence of error, when error is not nil and action is not allowed, withAuthorization will return an error, this situation cannot be described by the decision label
So, I don't think it is possible to simply use the decision label instead of result.
metrics should be divided into the following cases in order

  1. when action is allow, the error is ignored and the count of metrics with decision="allow" and result="success" is added to 1
  2. when error is non-empty, the metrics count of {result="error"} is added to 1
  3. When action is forbid, also ignore the error and add 1 to the metrics count of decision="fobid and result="success".
  4. When action is NoOpinion, add 1 to the metrics count of {decision="no opinion"}.

The above is just a shallow understanding after reading the code, do you think this is reasonable?
Please review my latest commit code, it follows the above logic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I see thank you for the clarification. I have the feeling that separating decision and result will just be confusing to the general users and that the difference is an implementation detail that users don't necessarily need to be aware of.

I have two suggestions to make it simpler for the users:

  1. a single metric:
authorization_attempts_total{result="allowed|forbidden|no-opinion|error"}
  1. two metrics:
authorization_attempts_total{decision="allowed|forbidden|no-opinion|"}
authorization_attempt_errors_total{}

The second is more idiomatic when computing error rate since you don't have to do requests_total - requests_total{error} in order to get the total number of errors, but at the same time, I am not super happy with having an empty decision label in case of errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explanation, I agree with you and have used the first suggestion to make changes, please review code


authorizationDecisionAnnotationCounter = metrics.NewCounterVec(
&metrics.CounterOpts{
Name: "authorization_decision_annotations_total",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share some lights on the use-case for this metric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share some lights on the use-case for this metric?

It is the metrics of the decision label, it seems that the name here is not very accurate, in the latest commit has been modified, its use case in the last comment in the reply I have described, please review, and hope you provide comments to tell me whether it is reasonable

@shyamjvs
Copy link
Member

Please change the PR title from Authentication -> Authorization for accuracy.

/ok-to-test
/sig auth (for reviewing the metric)
/sig instrumentation (for approving the metric style)

@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 14, 2023
@k8s-ci-robot
Copy link
Contributor

@shyamjvs: The label(s) sig/(for, sig/reviewing, sig/the, sig/metric), sig/(for, sig/approving, sig/the, sig/metric, sig/style) cannot be applied, because the repository doesn't have them.

In response to this:

Please change the PR title from Authentication -> Authorization for accuracy.

/ok-to-test
/sig auth (for reviewing the metric)
/sig instrumentation (for approving the metric style)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shyamjvs
Copy link
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 14, 2023
@shyamjvs
Copy link
Member

@HirazawaUi IMO this change deserves a release note, to let users know that they can now track their authz webhooks through these new apiserver metrics.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 14, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 247a4837d2bc5b2cefc7d33caa2e1b2331bc5c63

@dgrisonnet
Copy link
Member

@HirazawaUi could you perhaps update the release note with the correct metric names and description?

@HirazawaUi
Copy link
Contributor Author

@HirazawaUi could you perhaps update the release note with the correct metric names and description?

Okay, I've modified it

@shyamjvs
Copy link
Member

I still see the incorrect metric names in the release note - authorizationAttemptsCounter and authorizationLatency.
@HirazawaUi Could you please fix it?

@shyamjvs
Copy link
Member

I recommend this for the release note:

Apiserver adds two new metrics `authorization_attempts_total` and `authorization_duration_seconds` that allow users to monitor requests to authorization webhooks, split by result.

@@ -38,6 +39,10 @@ const (
successLabel = "success"
failureLabel = "failure"
errorLabel = "error"

allowedLabel = "allowed"
forbiddenLabel = "forbidden"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
forbiddenLabel = "forbidden"
deniedLabel = "denied"

deny|denied is the language used in the Kubernetes authorization docs and the API to refer to this case, although I see that the forbid|forbidden language is used in the code. IMO we should align the metric name with the language in the public facing docs and the API instead of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, changing it to denied seems to be a more reasonable behavior

Comment on lines 109 to 110
//Follow the behavior in withAuthorization, even encounter evaluation errors and still allow the request
//Will not judge whether there is an error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//Follow the behavior in withAuthorization, even encounter evaluation errors and still allow the request
//Will not judge whether there is an error
// In withAuthorization(), it is possible to allow the request and still encounter an evaluation error.

But also, doesn't this mean we should have a separate errors_total metric? Because in the case where there are errors and they are also allowed, we are not reporting them in metrics, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue was discussed with @dgrisonnet earlier and the conclusion was that whether or not to expose the metrics as an result error but still authorize successfully depends on whether or not the user should know the implementation details of the code
But we all feel as if the user doesn't need to care about how it is implemented
If it's because I left a comment that causes confusion, I'll remove it

@HirazawaUi
Copy link
Contributor Author

I recommend this for the release note:

Apiserver adds two new metrics `authorization_attempts_total` and `authorization_duration_seconds` that allow users to monitor requests to authorization webhooks, split by result.

Okay, I've changed it as recommended

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2023
@shyamjvs
Copy link
Member

/retest (unrelated failure)
/lgtm

@k8s-ci-robot
Copy link
Contributor

@shyamjvs: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test pull-cadvisor-e2e-kubernetes
  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-coverage-unit
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-go-canary
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-big-performance
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-cos
  • /test pull-kubernetes-e2e-gce-cos-canary
  • /test pull-kubernetes-e2e-gce-cos-no-stage
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-scale-performance-manual
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-integration-go-canary
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-unit
  • /test pull-kubernetes-unit-go-canary
  • /test pull-kubernetes-update
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-verify-go-canary

The following commands are available to trigger optional jobs:

  • /test check-dependency-stats
  • /test pull-ci-kubernetes-unit-windows
  • /test pull-e2e-gce-cloud-provider-disabled
  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
  • /test pull-kubernetes-cross
  • /test pull-kubernetes-e2e-autoscaling-hpa-cm
  • /test pull-kubernetes-e2e-autoscaling-hpa-cpu
  • /test pull-kubernetes-e2e-capz-azure-disk
  • /test pull-kubernetes-e2e-capz-azure-disk-vmss
  • /test pull-kubernetes-e2e-capz-azure-file
  • /test pull-kubernetes-e2e-capz-azure-file-vmss
  • /test pull-kubernetes-e2e-capz-conformance
  • /test pull-kubernetes-e2e-capz-windows
  • /test pull-kubernetes-e2e-capz-windows-alpha-features
  • /test pull-kubernetes-e2e-capz-windows-serial-slow-hpa
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-cos-alpha-features
  • /test pull-kubernetes-e2e-gce-cos-kubetest2
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gce-serial
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-gci-gce-ingress
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-e2e-kind-ipvs-dual-canary
  • /test pull-kubernetes-e2e-kind-kms
  • /test pull-kubernetes-e2e-kind-multizone
  • /test pull-kubernetes-e2e-kops-aws
  • /test pull-kubernetes-e2e-kubelet-credential-provider
  • /test pull-kubernetes-e2e-ubuntu-gce-network-policies
  • /test pull-kubernetes-kind-dra
  • /test pull-kubernetes-kind-json-logging
  • /test pull-kubernetes-kind-text-logging
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-local-e2e
  • /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-node-crio-e2e-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-alpha-features
  • /test pull-kubernetes-node-e2e-containerd-features
  • /test pull-kubernetes-node-e2e-containerd-features-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-sidecar-containers
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode-all-alpha
  • /test pull-kubernetes-node-kubelet-credential-provider
  • /test pull-kubernetes-node-kubelet-serial-containerd
  • /test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-kubelet-serial-memory-manager
  • /test pull-kubernetes-node-kubelet-serial-pod-disruption-conditions
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
  • /test pull-kubernetes-node-memoryqos-cgrpv2
  • /test pull-kubernetes-node-swap-fedora
  • /test pull-kubernetes-node-swap-fedora-serial
  • /test pull-kubernetes-node-swap-ubuntu-serial
  • /test pull-kubernetes-unit-experimental
  • /test pull-kubernetes-verify-strict-lint
  • /test pull-publishing-bot-validate

Use /test all to run the following jobs that were automatically triggered:

  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-gce
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-integration
  • pull-kubernetes-node-e2e-containerd
  • pull-kubernetes-typecheck
  • pull-kubernetes-unit
  • pull-kubernetes-verify

In response to this:

/retest (unrelated failure)
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 7cf7a9b1bf4089ef2d9f63d155b09be97e8ff58f

@shyamjvs
Copy link
Member

/retest

@shyamjvs
Copy link
Member

@lavalamp / @wojtek-t can one of you help approve this? Thanks

@wojtek-t
Copy link
Member

wojtek-t commented May 5, 2023

This is great - thanks!

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, HirazawaUi, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2023
@HirazawaUi
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit 7e25f12 into kubernetes:master May 5, 2023
11 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone May 5, 2023
wongma7 pushed a commit to wongma7/kubernetes that referenced this pull request Jul 24, 2023
…tency metrics

Description:
  "Apiserver adds two new metrics `authorization_attempts_total` and `authorization_duration_seconds` that allow users
  to monitor requests to authorization webhooks, split by result." (PR description)

Upstream PR, Issue, KEP, etc. links:
  * Take from upstream Kubernetes PR kubernetes#117211 (kubernetes#117211), which, at the time
  this patch was added, had not been merged yet.
  * Partially fixes Kubernetes Issue kubernetes#117167 (kubernetes#117167)

If this patch is based on an upstream commit, how (if at all) do this patch and the upstream source differ?
  TBD after change is merged by upstream

If this patch's changes have not been added by upstream, why not?
  Change is proposed to upstream

Other patches related to this patch:
  None

Changes made to this patch after its initial creation and reasons for these changes:
  None

Kubernetes version this patch can be dropped:
  TBD after change is merged by upstream
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Monitoring gaps in apiserver extension mechanisms
7 participants