Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client-go: add DNS resolver latency metrics #115357

Merged
merged 5 commits into from
Jun 28, 2023

Conversation

mfojtik
Copy link
Contributor

@mfojtik mfojtik commented Jan 27, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

DNS failures and resolving latencies today are impossible to track in client-go metrics. If the request timeout is, for example, set to 5s, and DNS resolving takes 4s, then the request might timeout, but the reason why it failed is hidden from admins.
This PR adds a new Prometheus metric rest_client_dns_resolution_duration_seconds that observes the DNS resolver latencies and existing request latency.

It’s not DNS
There’s no way it’s DNS
It was DNS

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Allow to monitor client-go DNS resolver latencies via `rest_client_dns_resolution_duration_seconds` Prometheus metric

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 27, 2023
@mfojtik
Copy link
Contributor Author

mfojtik commented Jan 27, 2023

/cc @deads2k @sttts

@mfojtik
Copy link
Contributor Author

mfojtik commented Jan 27, 2023

/sig apimachinery

@k8s-ci-robot
Copy link
Contributor

@mfojtik: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachinery

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dgrisonnet
Copy link
Member

/cc @aojea

@aojea
Copy link
Member

aojea commented Jan 27, 2023

you don't need to wrap the dialer, and we shouldn't , the http trace already provide dns metrics.

I've already implemented it in client-go, when you use verbosity 9 you can see this metrics

#105156

it seems you just want to hook on that http.trace, but we should avoid adding a new dialer

/cc @liggitt @enj

@dgrisonnet
Copy link
Member

Thank you for pointing out this effort @aojea, I wasn't aware of it.

One gap I still see is that since the DNS lookup is not taken into account in the current rest client latency metric, it is impossible to detect DNS issues. Even though we would now be able to tell by increasing the verbosity of the logs that DNS latency is high, we don't have an high-level information available at all time that tells us that there is something wrong with the DNS server.

Do you think there could be a way today to record this information in a metric?

@mfojtik
Copy link
Contributor Author

mfojtik commented Jan 27, 2023

@aojea thanks! I wasn't aware as well that this debugging exists. However, you don't usually run things with verbosity level 9, and it would be nice to see the overall DNS latency measured in case your DNS server is slow or has a problem. I'm open to any solution that does not involve wrapping up dialer :)

@@ -938,7 +968,7 @@ func (r *Request) request(ctx context.Context, fn func(*http.Request, *http.Resp

client := r.c.Client
if client == nil {
client = http.DefaultClient
client = HttpClientResolverMetricsWrapper(http.DefaultClient)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use an interface instead? this will allow more composability, also we can write better unit tests for this

type ClientDecorator interface {
    Decorate(*http.Client) *http.Client
}

we can compose it in Request

type Request struct {
     ClientDecorator ClientDecorator
}

@aojea
Copy link
Member

aojea commented Jan 27, 2023

Do you think there could be a way today to record this information in a metric?

you just need to pass an httptrace in the context of the request, the same as I did in the example.

The dialer is not used in http2 if there is already a connection, http2 is the default

@aojea
Copy link
Member

aojea commented Jan 31, 2023

to be more precise, what I'm suggesting is that if you want to get those metrics, you should always pass the httprace I've added as a context and you get the metrics from there, you also have to keep the logging to be printed only with -v9

@dgrisonnet
Copy link
Member

Thanks for the pointer @aojea :)

@cici37
Copy link
Contributor

cici37 commented Jan 31, 2023

/remove-sig api-machinery

@aojea
Copy link
Member

aojea commented May 23, 2023

LGTM once Jordan concern is addressed https://github.com/kubernetes/kubernetes/pull/115357/files#r1161696762

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 8, 2023

/retest

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 8, 2023

@aojea
Copy link
Member

aojea commented Jun 8, 2023

@aojea https://github.com/kubernetes/kubernetes/pull/115357/files#r1202694132 looks addressed :-)

but that confirms the race, should we not avoid it?

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 9, 2023

@aojea https://github.com/kubernetes/kubernetes/pull/115357/files#r1202694132 looks addressed :-)

but that confirms the race, should we not avoid it?

oh sorry, I misread it. I added locking as separate commit, I hope DNSDone is called also in case of error ;-)

@@ -929,11 +930,35 @@ func (r *Request) newHTTPRequest(ctx context.Context) (*http.Request, error) {
if err != nil {
return nil, err
}
req = req.WithContext(ctx)
req = req.WithContext(httptrace.WithClientTrace(ctx, newDNSMetricsTrace(ctx)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're touching this, do you want to fold this and the above into a single call to http.NewRequestWithContext? That would avoid a needless shallow copy of every Request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 44 to 45
// resolverLatency is a Prometheus Histogram metric type partitioned by
// "network", and "address" labels. It is used for the rest client DNS resolver latency metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that it's consistent with the comment on requestLatency, but isn't this all self-evident? The comment (and now that I'm looking closer, the help text too) has already gone out of sync with the actual set of labels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

req.Header = r.headers
return req, nil
}

// newDNSMetricsTrace returns an HTTP trace that tracks time spend on DNS lookups per network/host.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// newDNSMetricsTrace returns an HTTP trace that tracks time spend on DNS lookups per network/host.
// newDNSMetricsTrace returns an HTTP trace that tracks time spent on DNS lookups per host.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@dinhxuanvu dinhxuanvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benluddy All fixed.

@@ -929,11 +930,35 @@ func (r *Request) newHTTPRequest(ctx context.Context) (*http.Request, error) {
if err != nil {
return nil, err
}
req = req.WithContext(ctx)
req = req.WithContext(httptrace.WithClientTrace(ctx, newDNSMetricsTrace(ctx)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 44 to 45
// resolverLatency is a Prometheus Histogram metric type partitioned by
// "network", and "address" labels. It is used for the rest client DNS resolver latency metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

req.Header = r.headers
return req, nil
}

// newDNSMetricsTrace returns an HTTP trace that tracks time spend on DNS lookups per network/host.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@dinhxuanvu
Copy link
Contributor

@aojea @deads2k Ready for lgtm and approve :)

@dinhxuanvu
Copy link
Contributor

/test pull-kubernetes-e2e-gce

@aojea
Copy link
Member

aojea commented Jun 26, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 4b997858766c367a9049ff23d53d8d0263f51d91

@aojea
Copy link
Member

aojea commented Jun 26, 2023

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 26, 2023
@deads2k
Copy link
Contributor

deads2k commented Jun 28, 2023

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, dgrisonnet, mfojtik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2023
@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit 1c7e87c into kubernetes:master Jun 28, 2023
12 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jun 28, 2023
rayowang pushed a commit to rayowang/kubernetes that referenced this pull request Feb 9, 2024
* client-go: add DNS resolver latency metrics

* client-go: add locking to DNS latency metrics

* client-go: add locking for whole DNSStart and DNSDone

Signed-off-by: Vu Dinh <[email protected]>

* Fix a mismatched ctx on the request

Signed-off-by: Vu Dinh <[email protected]>

* Clean up request code and fix comments

Signed-off-by: Vu Dinh <[email protected]>

---------

Signed-off-by: Vu Dinh <[email protected]>
Co-authored-by: Vu Dinh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet