Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LegacyServiceAccountTokenCleanUp alpha #115554

Merged
merged 1 commit into from
May 26, 2023
Merged

Conversation

yt2985
Copy link
Contributor

@yt2985 yt2985 commented Feb 6, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Start to clean up auto-generated service account token.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kube-controller-manager: The `LegacyServiceAccountTokenCleanUp` feature gate is now available as alpha (off by default). When enabled, the `legacy-service-account-token-cleaner` controller loop removes service account token secrets that have not been used in the time specified by `--legacy-service-account-token-clean-up-period` (defaulting to one year), **and are** referenced from the `.secrets` list of a ServiceAccount object, **and are not** referenced from pods.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

-[KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/2799-reduction-of-secret-based-service-account-token


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 6, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @yt2985. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Feb 6, 2023
@yt2985 yt2985 changed the title Clean sa LegacyServiceAccountTokenCleanUp alpha Feb 6, 2023
@k8s-ci-robot k8s-ci-robot added area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 6, 2023
@yt2985
Copy link
Contributor Author

yt2985 commented Feb 6, 2023

Open the PR first. I am still working on the integration test and may be the e2e test.

@k8s-triage-robot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 7, 2023
@yt2985 yt2985 force-pushed the cleanSA branch 2 times, most recently from ed33782 to ad9bebc Compare February 7, 2023 18:58
@cici37
Copy link
Contributor

cici37 commented Feb 7, 2023

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 7, 2023
@yt2985
Copy link
Contributor Author

yt2985 commented May 16, 2023

/test pull-kubernetes-conformance-kind-ga-only-parallel

Copy link
Member

@liggitt liggitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks pretty close, I haven't looked at test coverage yet

Thinking about someone rolling this out, wondering if we'll be happy with all long-unused token secrets getting deleted at the first kube-controller-manager startup. Things like dry-run behavior or rate-limiting come to mind as things we might want to make sure the results are what we expect.

cc @deads2k for any thoughts along those lines on this?

cmd/kube-controller-manager/app/core.go Show resolved Hide resolved
Comment on lines 171 to 196
if podMountedSecrets[secretNamespace] != nil {
return podMountedSecrets[secretNamespace], nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look for existence rather than non-nil:

if secrets, ok := podMountedSecrets[secretNamespace]; ok {
  return secrets
}

if podMountedSecrets[secretNamespace] != nil {
return podMountedSecrets[secretNamespace], nil
}
podMountedSecrets[secretNamespace] = sets.NewString()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't store here, otherwise we cache an empty set in an error case, then have the possibility of returning the cached-and-wrong empty set the next time getMountedSecretNames is invoked on the namespace

Comment on lines 180 to 186
for _, pod := range podList {
podutil.VisitPodSecretNames(pod, func(secretName string) bool {
podMountedSecrets[secretNamespace].Insert(secretName)
return true
})
}
return podMountedSecrets[secretNamespace], nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a nil set until we know we need to store secret names:

	var secrets sets.String
	for _, pod := range podList {
		podutil.VisitPodSecretNames(pod, func(secretName string) bool {
			if secrets == nil {
				secrets = sets.NewString()
			}
			secrets.Insert(secretName)
			return true
		})
	}

then after we've constructed the set without errors, cache it:

	podMountedSecrets[secretNamespace] = secrets
	return secrets, nil

if err != nil {
return time.Time{}, fmt.Errorf("error parsing trackedSince time: %v", err)
}
return trackedSinceTime.Add(24 * time.Hour), nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to do trackedSinceTime.AddDate(0,0,1) rather than assume 24 hours from 00:00 on the tracked-since date covers all possible times in that date (daylight saving time can put 25 hours in a nominal date)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also deserves a comment on this line and on the godoc for the function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and latestPossibleTrackedSinceTime might be a better function name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also deserves a comment on this line and on the godoc for the function

Hi Jordan, may I know what is the godoc here?

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 18, 2023
@yt2985
Copy link
Contributor Author

yt2985 commented May 18, 2023

/test pull-kubernetes-e2e-kind

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 23, 2023
@deads2k
Copy link
Contributor

deads2k commented May 23, 2023

Thinking about someone rolling this out, wondering if we'll be happy with all long-unused token secrets getting deleted at the first kube-controller-manager startup. Things like dry-run behavior or rate-limiting come to mind as things we might want to make sure the results are what we expect.

cc @deads2k for any thoughts along those lines on this?

I've thought a bit about this and I can't think of a "gentle" way to remove a long-unused token gradually via dry-run or rate-limiting because no one will notice until they're broken and we're talking about something that runs infrequently to even hit this. The rate-limit would have to be order of years for someone to notice in time to matter.

I did have one different thought, but it would push out the deletion timeframe. What if tokens that hadn't been used in a year were made invalid for use against the kube-apiserver, but remained in the API. That would force client-side failures, with a cluster-admin able to "reset" a secret by removing the last-used annotation. The pattern this would produce is

  1. nothing uses token for a very long time
  2. instead of deleting the secret, it's marked as "not valid against kube-apiserver"
  3. rarely run client fails with, "secret is no longer valid"
  4. client now knows it needs to update itself
  5. for the current run, the last-used annotation can be reset to avoid impending bad-things
  6. another year after being marked, "not valid against kube", the secret is deleted.

A year is a long time to not use a token. I don't have strong opinions about whether the extra step is worth the effort, but it does provide a way to break people so they know they must change, but also provide immediate relief that doesn't require distributing new tokens. That ability to avoid distribution is the only meaningful difference between this approach and telling someone to create a new SA token secret.

@deads2k deads2k closed this May 23, 2023
@deads2k deads2k reopened this May 23, 2023
@zshihang
Copy link
Contributor

I did have one different thought, but it would push out the deletion timeframe. What if tokens that hadn't been used in a year were made invalid for use against the kube-apiserver, but remained in the API. That would force client-side failures, with a cluster-admin able to "reset" a secret by removing the last-used annotation. The pattern this would produce is

  1. nothing uses token for a very long time
  2. instead of deleting the secret, it's marked as "not valid against kube-apiserver"
  3. rarely run client fails with, "secret is no longer valid"
  4. client now knows it needs to update itself
  5. for the current run, the last-used annotation can be reset to avoid impending bad-things
  6. another year after being marked, "not valid against kube", the secret is deleted.

A year is a long time to not use a token. I don't have strong opinions about whether the extra step is worth the effort, but it does provide a way to break people so they know they must change, but also provide immediate relief that doesn't require distributing new tokens. That ability to avoid distribution is the only meaningful difference between this approach and telling someone to create a new SA token secret.

this two-phase deletion approach equals to the existing approach:

  1. assume waiting for a year before marking as "not valid against kube-apiserver"
  2. if a year passed after being marked, delete.

does this act as the same as the existing approach where we can set the wait time to be two years?

do we think one year is long enough to consider a auto legacy token being unused? i think it is.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 23, 2023
@liggitt
Copy link
Member

liggitt commented May 24, 2023

this two-phase deletion approach equals to the existing approach:

  1. assume waiting for a year before marking as "not valid against kube-apiserver"
  2. if a year passed after being marked, delete.

does this act as the same as the existing approach where we can set the wait time to be two years?

Not quite the same, it makes the revocation reversible... marking a token as invalid without deleting it means that action can be undone without needing to distribute a new credential to the impacted user.

A year is a long time to not use a token. I don't have strong opinions about whether the extra step is worth the effort, but it does provide a way to break people so they know they must change, but also provide immediate relief that doesn't require distributing new tokens.

I also don't have strong feelings on this. There's two aspects I'm trying to decide if we should provide:

  • ability to recover if it deletes something that turned out to be very-rarely-used (your suggestion would cover this)
  • ability to dry-run this controller in some way ("what would you delete (or at least, how many things would you delete)?")

@liggitt
Copy link
Member

liggitt commented May 24, 2023

that said, I don't think we need those for the alpha... let's add an item to the KEP to resolve this question before graduating to beta

@liggitt
Copy link
Member

liggitt commented May 26, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 26, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 2274f3d84ec93368cd6d5b3db6cf5581ce01289a

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, yt2985

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2023
@k8s-ci-robot k8s-ci-robot merged commit c35a277 into kubernetes:master May 26, 2023
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/code-generation area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

9 participants