client-go: add DNS resolver latency metrics #115357
Conversation
/sig apimachinery
@mfojtik: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @aojea
You don't need to wrap the dialer, and we shouldn't; the http trace already provides DNS metrics. I've already implemented it in client-go: when you use verbosity 9 you can see these metrics. It seems you just want to hook into that httptrace, but we should avoid adding a new dialer.
Thank you for pointing out this effort @aojea, I wasn't aware of it. One gap I still see: since the DNS lookup is not taken into account in the current rest client latency metric, it is impossible to detect DNS issues. Even though we would now be able to tell that DNS latency is high by increasing the verbosity of the logs, we don't have high-level information available at all times that tells us something is wrong with the DNS server. Do you think there is a way today to record this information in a metric?
@aojea thanks! I also wasn't aware that this debugging exists. However, you don't usually run things at verbosity level 9, and it would be nice to see the overall DNS latency measured in case your DNS server is slow or has a problem. I'm open to any solution that does not involve wrapping the dialer :)
@@ -938,7 +968,7 @@ func (r *Request) request(ctx context.Context, fn func(*http.Request, *http.Resp
 	client := r.c.Client
 	if client == nil {
-		client = http.DefaultClient
+		client = HttpClientResolverMetricsWrapper(http.DefaultClient)
Can we use an interface instead? That would allow more composability, and we could write better unit tests for this:
type ClientDecorator interface {
Decorate(*http.Client) *http.Client
}
we can compose it in Request
type Request struct {
ClientDecorator ClientDecorator
}
You just need to pass an httptrace in the context of the request, the same as I did in the example. The dialer is not used in http2 if there is already a connection, and http2 is the default.
To be more precise, what I'm suggesting is: if you want those metrics, you should always pass the httptrace I've added via the context and get the metrics from there. You also have to keep the logging printed only with -v9.
Thanks for the pointer @aojea :)
/remove-sig api-machinery
LGTM once Jordan's concern is addressed https://github.com/kubernetes/kubernetes/pull/115357/files#r1161696762
/retest
@aojea https://github.com/kubernetes/kubernetes/pull/115357/files#r1202694132 looks addressed :-)
But that confirms the race, should we not avoid it?
Oh sorry, I misread it. I added locking as a separate commit; I hope DNSDone is called also in case of error ;-)
Force-pushed from 6769f47 to c836f4a.
Signed-off-by: Vu Dinh <[email protected]>
@@ -929,11 +930,35 @@ func (r *Request) newHTTPRequest(ctx context.Context) (*http.Request, error) {
 	if err != nil {
 		return nil, err
 	}
-	req = req.WithContext(ctx)
+	req = req.WithContext(httptrace.WithClientTrace(ctx, newDNSMetricsTrace(ctx)))
Since you're touching this, do you want to fold this and the above into a single call to http.NewRequestWithContext? That would avoid a needless shallow copy of every Request.
Done
// resolverLatency is a Prometheus Histogram metric type partitioned by
// "network", and "address" labels. It is used for the rest client DNS resolver latency metrics.
I see that it's consistent with the comment on requestLatency, but isn't this all self-evident? The comment (and now that I'm looking closer, the help text too) has already gone out of sync with the actual set of labels.
Done
	req.Header = r.headers
	return req, nil
}
// newDNSMetricsTrace returns an HTTP trace that tracks time spend on DNS lookups per network/host. |
- // newDNSMetricsTrace returns an HTTP trace that tracks time spend on DNS lookups per network/host.
+ // newDNSMetricsTrace returns an HTTP trace that tracks time spent on DNS lookups per host.
Done
Signed-off-by: Vu Dinh <[email protected]>
@benluddy All fixed.
/test pull-kubernetes-e2e-gce
/lgtm
LGTM label has been added. Git tree hash: 4b997858766c367a9049ff23d53d8d0263f51d91
/label tide/merge-method-squash
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, dgrisonnet, mfojtik. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
/retest
* client-go: add DNS resolver latency metrics
* client-go: add locking to DNS latency metrics
* client-go: add locking for whole DNSStart and DNSDone
* Fix a mismatched ctx on the request
* Clean up request code and fix comments

Signed-off-by: Vu Dinh <[email protected]>
Co-authored-by: Vu Dinh <[email protected]>
What type of PR is this?
/kind feature
What this PR does / why we need it:
DNS failures and resolution latencies are impossible to track in client-go metrics today. If the request timeout is set to, say, 5s and DNS resolution takes 4s, the request might time out, but the reason it failed is hidden from admins.
This PR adds a new Prometheus metric, rest_client_dns_resolution_duration_seconds, that observes DNS resolver latency alongside the existing request latency metric.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE