Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add new metric for the internal client-go cache size #117295

Merged
merged 3 commits into from
May 11, 2023

Conversation

aojea
Copy link
Member

@aojea aojea commented Apr 13, 2023

/kind feature

client-go exposes two new metrics to monitor the client-go logic that
generate http.Transports for the clients.

- rest_client_transport_cache_entries is a gauge metric
with the number of existin entries in the internal cache

- rest_client_transport_create_calls_total is a counter
that increments each time a new transport is created, storing
the result of the operation needed to generate it: hit, miss
or uncacheable

/sig api-machinery
/sig instrumentation

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 13, 2023
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Apr 13, 2023
@aojea
Copy link
Member Author

aojea commented Apr 13, 2023

/assign @enj @liggitt @logicalhan

@benluddy
Copy link
Contributor

Is it useful to add counters for cache accesses (hit and miss) too?

@mouuii
Copy link
Contributor

mouuii commented Apr 14, 2023

/lgtm

@k8s-ci-robot
Copy link
Contributor

@mouuii: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Copy link
Member Author

aojea commented Apr 14, 2023

Is it useful to add counters for cache accesses (hit and miss) too?

/hold

that is indeed a great idea

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 14, 2023
@aojea aojea force-pushed the transport_cache_metrics branch 3 times, most recently from b870c3b to 6e0d25f Compare April 14, 2023 13:42
@aojea
Copy link
Member Author

aojea commented Apr 14, 2023

added

Is it useful to add counters for cache accesses (hit and miss) too?

/hold

that is indeed a great idea

/hold cancel

added Ben suggestion, in addition record no-cacheable calls too

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 14, 2023
@enj
Copy link
Member

enj commented Apr 14, 2023

@aojea if you are feeling particularly generous you could add an integration test that #117250 (comment):

Trigger many APIServiceRegistrationController resyncs by repeatedly patching an APIService (e.g. update an annotation with random bytes).

At the end you could assert that we are indeed no longer leaking memory by checking the metrics :)

Example integration test that asserts metrics:

func TestCSRDuration(t *testing.T) {
t.Parallel()
s := kubeapiservertesting.StartTestServerOrDie(t, nil, nil, framework.SharedEtcd())
t.Cleanup(s.TearDownFn)
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
t.Cleanup(cancel)
// assert that the metrics we collect during the test run match expectations
// we have 7 valid test cases below that request a duration of which 6 should have their duration honored
wantMetricStrings := []string{
`apiserver_certificates_registry_csr_honored_duration_total{signerName="kubernetes.io/kube-apiserver-client"} 6`,
`apiserver_certificates_registry_csr_requested_duration_total{signerName="kubernetes.io/kube-apiserver-client"} 7`,
}

@k8s-ci-robot k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 14, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2023
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Archive-it in SIG Node CI/Test Board Apr 26, 2023
@aojea
Copy link
Member Author

aojea commented May 10, 2023

ok, Jordan is right ... again 😬

    client_test.go:1421: metric rest_client_transport_cache_entries 32
    client_test.go:1421: metric rest_client_transport_create_calls_total{result="hit"} 113
    client_test.go:1421: metric rest_client_transport_create_calls_total{result="miss"} 3
    client_test.go:1421: expected as many entries 32 in the cache as misses, got 3

most probably is racing with other test 🤔

@logicalhan
Copy link
Member

most probably is racing with other test 🤔

Yeah there are usually two ways around this. The first is to call Reset at the beginning of the test. The second (slightly more involved), is to register the metric to a non-global registry (one that's isolated to your test).

@aojea aojea force-pushed the transport_cache_metrics branch 2 times, most recently from d2ce753 to c18c50a Compare May 10, 2023 22:44
@dims
Copy link
Member

dims commented May 10, 2023

@aojea you doubted Jordan?!!

@aojea
Copy link
Member Author

aojea commented May 11, 2023

Kubernetes e2e suite: [It] [sig-node] Pods should run through the lifecycle of Pods and PodStatus [Conformance] expand_less

unrelated
/test pull-kubernetes-conformance-kind-ga-only-parallel

@aojea you doubted Jordan?!!

lol, I never doubted it

Yeah there are usually two ways around this. The first is to call Reset at the beginning of the test

I'm doing reset and moving the test to its own folder dedicated only to metrics, so we control the type of tests that are going to run together

Add two new metrics to monitor the client-go logic that
generate http.Transports for the clients.

- rest_client_transport_cache_entries is a gauge metrics
with the number of existin entries in the internal cache

- rest_client_transport_create_calls_total is a counter
that increments each time a new transport is created, storing
the result of the operation needed to generate it: hit, miss
or uncacheable

Change-Id: I2d8bde25281153d8f8e8faa249385edde3c1cb39
@liggitt
Copy link
Member

liggitt commented May 11, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ce0b0d6c6870889043ac59a2cdefdb638cd1a411

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, dgrisonnet, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2023
@k8s-ci-robot k8s-ci-robot merged commit 64af2d9 into kubernetes:master May 11, 2023
13 checks passed
SIG Node CI/Test Board automation moved this from Archive-it to Done May 11, 2023
SIG Node PR Triage automation moved this from Needs Approver to Done May 11, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kube-proxy area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

10 participants