Add support for CRI ErrSignatureValidationFailed
#117717
Conversation
/priority important-soon
@kubernetes/sig-node-pr-reviews PTAL
/assign @mrunalp @SergeyKanzhelev the string check feels a little brittle, but generally it's OK from me.
/triage accepted
pkg/kubelet/images/image_manager.go (outdated)

@@ -164,6 +164,10 @@ func (m *imageManager) EnsureImageExists(ctx context.Context, pod *v1.Pod, conta
 		msg := fmt.Sprintf("image pull failed for %s because the registry is unavailable.", container.Image)
 		return "", msg, imagePullResult.err
 	}
+	if imagePullResult.err.Error() == ErrInvalidSignature.Error() {
+		msg := fmt.Sprintf("image pull failed for %s because the signature is invalid.", container.Image)
+		return "", msg, imagePullResult.err
+	}
That's a good question. Right now all CRI errors behave in the same way:
- Initial pull via the CRI
- If it failed, show the error for 10s in the status, like:
  NAMESPACE   NAME   READY   STATUS                RESTARTS   AGE
  default     test   0/1     RegistryUnavailable   0          1s
- On backoff time hit, show a new status:
  NAMESPACE   NAME   READY   STATUS             RESTARTS   AGE
  default     test   0/1     ImagePullBackOff   0          16s
- After the backoff times out, we retry and show the desired error for another 10s:
  NAMESPACE   NAME   READY   STATUS                RESTARTS   AGE
  default     test   0/1     RegistryUnavailable   0          52s
- …
Since the backoff time increases, we will mostly show ImagePullBackOff to end users, and they have to check the events to get an idea about what is going on:
> k describe pod test | tail -n 4
Warning Failed 2m59s (x4 over 4m28s) kubelet Failed to pull image "localhost:5000/foo:1.0.0": RegistryUnavailable
Warning Failed 2m59s (x4 over 4m28s) kubelet Error: RegistryUnavailable
Warning Failed 2m47s (x6 over 4m27s) kubelet Error: ImagePullBackOff
Normal BackOff 2m35s (x7 over 4m27s) kubelet Back-off pulling image "localhost:5000/foo:1.0.0"
The backoff has no direct API to specify an "unlimited" time, as far as I can see. Removing the entry within the backoff would simply re-pull every 10s.
That means that to exclude an image from pulling, we would need something like an exclude map storing the last error (reason) for which we do not re-pull it again. I can follow up on that, but it seems out of scope for this PR.
fair enough, agreed
You could imagine a transient signature check failure, e.g. this error being reported when the signature fails to fetch due to a network flake. I think the kubelet should probably keep retrying.
I agree, we should keep retrying and not change the loop.
Force-pushed from 5bdbe17 to 1bf0803.
/lgtm
ErrInvalidSignature → ErrSignatureValidationFailed
/retest
Force-pushed from 5539fc3 to 50aa9a8.
still
LGTM label has been added. Git tree hash: 0e6e39de84c8f1c8938b3e4adb7000acb67b0a64
pkg/kubelet/images/image_manager.go (outdated)

func evalCRIPullErr(container *v1.Container, err error) (errMsg string, errRes error) {
	// Error assertions via errors.Is is not supported by gRPC (remote runtime) errors right now.
	// See https://github.com/grpc/grpc-go/issues/3616
	if err.Error() == ErrRegistryUnavailable.Error() {
Should those errors be defined in the CRI API package, or have a comment in the API like the one for OOMKilled?
// Must be set to "OOMKilled" for containers terminated by cgroup-based Out-of-Memory killer.
I moved them into the CRI API errors package, would that fit better?
Force-pushed from 50aa9a8 to 50ca0b7.
This allows container runtimes to propagate an image signature verification error through the CRI and display that to the end user during image pull. There is no other behavioral difference compared to a regular image pull failure. Signed-off-by: Sascha Grunert <[email protected]>
Force-pushed from 50ca0b7 to 63b69dd.
/lgtm
Thank you for moving the errors to the CRI API package!
LGTM label has been added. Git tree hash: 048b480c5fdc0c78bbb78943c33e31fe88d022e7
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: endocrimes, mrunalp, saschagrunert, SergeyKanzhelev

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
This allows to not re-pull the image if the signature verification failed by using a simple map for storing the previous error. Follow-up on: kubernetes#117717 (comment) Signed-off-by: Sascha Grunert <[email protected]>
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This allows container runtimes to propagate an image signature verification error through the CRI and display that to the end user during image pull. There is no other behavioral difference compared to a regular image pull failure.
Which issue(s) this PR fixes:
Follow-up on #117612
Special notes for your reviewer:
None
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: