Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CRI ErrSignatureValidationFailed #117717

Merged

Conversation

saschagrunert
Copy link
Member

@saschagrunert saschagrunert commented May 2, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This allows container runtimes to propagate an image signature verification error through the CRI and display that to the end user during image pull. There is no other behavioral difference compared to a regular image pull failure.

Which issue(s) this PR fixes:

Follow-up on #117612

Special notes for your reviewer:

None

Does this PR introduce a user-facing change?

Allow container runtimes to use `ErrSignatureValidationFailed` as possible image pull failure.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 2, 2023
@saschagrunert
Copy link
Member Author

/priority important-soon
/sig node

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 2, 2023
@saschagrunert
Copy link
Member Author

@kubernetes/sig-node-pr-reviews PTAL

@dims
Copy link
Member

dims commented May 2, 2023

/assign @mrunalp @SergeyKanzhelev

the string check feels a little brittle. but generally ok from me.

@endocrimes
Copy link
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 2, 2023
@endocrimes endocrimes moved this from Triage to Waiting on Author in SIG Node PR Triage May 2, 2023
@@ -164,6 +164,10 @@ func (m *imageManager) EnsureImageExists(ctx context.Context, pod *v1.Pod, conta
msg := fmt.Sprintf("image pull failed for %s because the registry is unavailable.", container.Image)
return "", msg, imagePullResult.err
}
if imagePullResult.err.Error() == ErrInvalidSignature.Error() {
msg := fmt.Sprintf("image pull failed for %s because the signature is invalid.", container.Image)
return "", msg, imagePullResult.err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be a difference in behavior? like should the kubelet stop trying to pull or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. Right now all CRI errors behave in the same way:

  1. Initial pull via the CRI
  2. If failed, show the error for 10s in the status, like:
    NAMESPACE     NAME                      READY   STATUS                RESTARTS   AGE
    default       test                      0/1     RegistryUnavailable   0          1s
    
  3. On backoff time hit, show a new status:
    NAMESPACE     NAME                      READY   STATUS             RESTARTS   AGE
    default       test                      0/1     ImagePullBackOff   0          16s
    
  4. After the backoff times out, we retry and show the desired error for another 10s:
    NAMESPACE     NAME                      READY   STATUS                RESTARTS   AGE
    default       test                      0/1     RegistryUnavailable   0          52s
    

The backoff time increases, means we mostly will show ImagePullBackOff to the end users and they have to check the events to get an idea about what is going on:

> k describe pod test | tail -n 4
  Warning  Failed     2m59s (x4 over 4m28s)  kubelet            Failed to pull image "localhost:5000/foo:1.0.0": RegistryUnavailable
  Warning  Failed     2m59s (x4 over 4m28s)  kubelet            Error: RegistryUnavailable
  Warning  Failed     2m47s (x6 over 4m27s)  kubelet            Error: ImagePullBackOff
  Normal   BackOff    2m35s (x7 over 4m27s)  kubelet            Back-off pulling image "localhost:5000/foo:1.0.0"

The backoff has no direct API to specify a time "unlimited" as far as I can see. Removing the entry within the backoff will simply re-pull every 10s.

Means to exclude an image from pull, we would require something like an exclude map storing the last error (reason) why we do not re-pull it again. I can follow-up on that but it seems out of scope of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, agreed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could imagine a transient signature check failure due to e.g. reporting this error when signature fails to fetch due to a network flake etc? I think it should probably keep retrying

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we should keep trying it again and not changing the loop.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 3, 2023
Copy link
Member

@endocrimes endocrimes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@saschagrunert saschagrunert changed the title Add support for CRI ErrInvalidSignature Add support for CRI ErrSignatureValidationFailed May 3, 2023
@saschagrunert
Copy link
Member Author

/retest

@haircommander
Copy link
Contributor

still
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 3, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 0e6e39de84c8f1c8938b3e4adb7000acb67b0a64

func evalCRIPullErr(container *v1.Container, err error) (errMsg string, errRes error) {
// Error assertions via errors.Is is not supported by gRPC (remote runtime) errors right now.
// See https://github.com/grpc/grpc-go/issues/3616
if err.Error() == ErrRegistryUnavailable.Error() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should those errors be defined in CRI API package or have a comment for API list for OOMKilled?

// Must be set to "OOMKilled" for containers terminated by cgroup-based Out-of-Memory killer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved them into the CRI API errors package, would that fit better?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2023
This allows container runtimes to propagate an image signature
verification error through the CRI and display that to the end user
during image pull. There is no other behavioral difference compared to a
regular image pull failure.

Signed-off-by: Sascha Grunert <[email protected]>
Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

thank you for moving errors to the cri api package!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 048b480c5fdc0c78bbb78943c33e31fe88d022e7

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: endocrimes, mrunalp, saschagrunert, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2023
@k8s-ci-robot k8s-ci-robot merged commit af92da5 into kubernetes:master May 5, 2023
12 checks passed
SIG Node PR Triage automation moved this from Waiting on Author to Done May 5, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone May 5, 2023
@saschagrunert saschagrunert deleted the invalid-signature-error branch May 8, 2023 07:11
saschagrunert added a commit to saschagrunert/kubernetes that referenced this pull request May 8, 2023
This allows to not re-pull the image if the signature verification
failed by using a simple map for storing the previous error.

Follow-up on: kubernetes#117717 (comment)

Signed-off-by: Sascha Grunert <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

8 participants