
Add a way (annotations?) to mark a Tekton Resource to be skipped #979

Open
vdemeester opened this issue Oct 31, 2023 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@vdemeester
Member

Feature request

Provide a way to mark a Tekton resource (TaskRun or PipelineRun) to be skipped by the controller.
One possible idea would be to add an extra value to chains.tekton.dev/signed (a skip value), or an entirely new annotation (chains.tekton.dev/skip).
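
For illustration, a minimal sketch of what the controller-side check could look like, assuming the hypothetical chains.tekton.dev/skip annotation proposed here (nothing below exists in Chains today):

```go
package chainsskip

// SkipAnnotation is the hypothetical key proposed in this issue; it is not
// part of Chains today.
const SkipAnnotation = "chains.tekton.dev/skip"

// ShouldSkip reports whether a TaskRun or PipelineRun has been explicitly
// marked to be ignored by the Chains controller.
func ShouldSkip(annotations map[string]string) bool {
	return annotations[SkipAnnotation] == "true"
}
```

The reconciler would then return early, before signing, whenever ShouldSkip(run.Annotations) is true.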

Use case

One of the use cases for this is enabling chains on a cluster that is already heavy with Tekton resources (like 5000 existing PipelineRuns, or more…). At startup, the chains controller will load them all and try to sign them all, more or less at once. This can generate a crazy amount of load, if not kill the controller outright. One idea to reduce this would be to allow a user to mark previous / existing resources as not to be signed, making chains essentially a no-op when it starts (or very close to one).

It could also be useful if one is experimenting with a Pipeline or Task and doesn't want chains to sign it.

cc @lcarva @concaf

vdemeester added the kind/feature label on Oct 31, 2023
@concaf
Contributor

concaf commented Oct 31, 2023

i agree with the feature request but not sure if it's the right solution for the use case - if i install / enable chains on a cluster with a heavy load of tekton workloads, then i need to annotate all the older workloads, which is... weird and clunky??

wdyt about a configuration option that tells chains not to sign older / already existing workloads on the cluster? something like artifacts.taskrun.sign-existing: true/false?

@vdemeester
Member Author

@concaf the trick would be: what happens if the chains controller is shut down for an hour or two (during an upgrade or something)? One would have to enable that field, and disable it later on. In addition, how do we "decide" what is "existing"? The ones that were created before the controller started, or the ones that were done at that point?

I do think the skip annotation approach does not necessarily prevent a "controller configuration" from being put in place later on.

i agree with the feature request but not sure if it's the right solution for the use case - if i install / enable chains on a cluster with a heavy load of tekton workloads, then i need to annotate all the older workloads, which is... weird and clunky??

This is relatively easy to script, and it has the additional benefit of making it easy to know that those resources were skipped on purpose (versus having to dig kind-of deep to understand why some resources are signed and some are not).
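
As a rough sketch of that script, using the dynamic client against the v1beta1 PipelineRun CRD; the chains.tekton.dev/skip key is still the hypothetical annotation from this issue, and the namespace is hard-coded for brevity:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "pipelineruns"}
	// Hypothetical annotation from this issue; merge-patched onto each run.
	patch := []byte(`{"metadata":{"annotations":{"chains.tekton.dev/skip":"true"}}}`)

	ctx := context.Background()
	list, err := client.Resource(gvr).Namespace("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pr := range list.Items {
		if _, err := client.Resource(gvr).Namespace(pr.GetNamespace()).Patch(
			ctx, pr.GetName(), types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			log.Printf("failed to annotate %s/%s: %v", pr.GetNamespace(), pr.GetName(), err)
			continue
		}
		fmt.Printf("marked %s/%s to be skipped\n", pr.GetNamespace(), pr.GetName())
	}
}
```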

@lcarva
Contributor

lcarva commented Oct 31, 2023

An alternative is to introduce a new Chains configuration to ignore any resource created prior to a given date, e.g.

ignore-older-resources: 2023-01-01T00:00:00Z

That could potentially give a hook for an installer to better match the user's expectations.
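
A small sketch of how such a cutoff could be applied, assuming a hypothetical ignore-older-resources key in the Chains ConfigMap parsed as RFC 3339 (none of this exists today):

```go
package chainsconfig

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// parseCutoff parses the hypothetical ignore-older-resources value, e.g.
// "2023-01-01T00:00:00Z"; an empty value means "no cutoff".
func parseCutoff(raw string) (time.Time, error) {
	if raw == "" {
		return time.Time{}, nil
	}
	return time.Parse(time.RFC3339, raw)
}

// ignoreResource reports whether a run created before the configured cutoff
// should be left unsigned by the controller.
func ignoreResource(cutoff time.Time, created metav1.Time) bool {
	return !cutoff.IsZero() && created.Time.Before(cutoff)
}
```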

@vdemeester
Member Author

cc @tektoncd/chains-maintainers 🙏🏼

@concaf
Contributor

concaf commented Oct 31, 2023

@concaf the trick would be: what happens if the chains controller is shut down for an hour or two (during an upgrade or something)? One would have to enable that field, and disable it later on. In addition, how do we "decide" what is "existing"? The ones that were created before the controller started, or the ones that were done at that point?

we could go via a timestamp approach as well, i.e. "don't sign anything that was created before 5 pm, 31 oct 2023"


This is relatively easy to script, and it has the additional benefit of making it easy to know that those resources were skipped on purpose (versus having to dig kind-of deep to understand why some resources are signed and some are not).

i agree that it's easy to script, however it feels a bit excessive to me; building on your approach, there could also be a whitelisting flag, which would be both explicit and obvious.

if you set chains.tekton.dev/skip: true, then you effectively blacklist that workload from getting signed; what do you think about chains.tekton.dev/allow: true which will act as a whitelist - it would mean that only the workloads with this annotation get signed by chains; this also reduces pressure on the controller to sign everything and transfers control over to the user - say i want only specific workloads to be signed and not others, then i can whitelist those.

there can be prioritization of sorts - by default, chains watches and signs all workloads - then there should be a configuration that tells chains to only sign the workloads with the whitelist annotation / label.

another reason why i prefer whitelisting is that it's more explicit than blacklisting - say that a user (who knows nothing about chains) creates a tekton workload, then chains is going to sign it implicitly - in order to deny chains, they need to add an annotation; while on the other hand if whitelisting is "enabled", then only the workloads with that explicit annotation / label will be signed.

wdyt?
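
A short sketch of how that opt-in mode could sit next to the skip check, where both annotation keys and the allow-list switch are hypothetical:

```go
package chainspolicy

// Both keys and the allow-list mode are hypothetical, as discussed above.
const (
	skipAnnotation  = "chains.tekton.dev/skip"
	allowAnnotation = "chains.tekton.dev/allow"
)

// shouldSign decides whether Chains signs a run. In the default mode
// everything is signed unless explicitly skipped; with the allow-list mode
// enabled, only explicitly allowed runs are signed.
func shouldSign(annotations map[string]string, allowListMode bool) bool {
	if annotations[skipAnnotation] == "true" {
		return false
	}
	if allowListMode {
		return annotations[allowAnnotation] == "true"
	}
	return true
}
```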

@wlynch
Member

wlynch commented Oct 31, 2023

One possible idea would be to add an extra value to chains.tekton.dev/signed (a skip value), or an entirely new annotation (chains.tekton.dev/skip).

In general I would not recommend allowing pipeline users to have a skip mode - you don't want nefarious users to be able to suppress provenance generation to mask what they're doing.

Opt-in for testing things out / ramp up seems reasonable (like we do for manual mode), but you don't want users to have control over this for normal operation.

At startup, the chains controller will load them all and try to sign them all, more or less at once. This can generate a crazy amount of load, if not kill the controller outright.

Are you seeing CPU load, memory load, both? Any metrics you can share? 👀
What resource requests/limits is the controller being subjected to?

By default, Knative controllers are rate limited (default 10 qps / 100 burst). If this is taking up a crazy amount of resources then this may be a sign of a leak.

My first thought would be to try and profile the controller and see if there's any low hanging fruit here.

An alternative is to introduce a new Chains configuration to ignore any resource created prior to a given date

I'd be okay with this, though the same problem could still pop up in other scenarios. 🤷

@vdemeester
Member Author

Opt-in for testing things out / ramp up seems reasonable (like we do for manual mode), but you don't want users to have control over this for normal operation.

That could be an option I guess, but as soon as the user switched it to true, it would probably generate that load again? Also, isn't that only about the transparency log upload (and not the signing)?

Are you seeing CPU load, memory load, both? Any metrics you can share? 👀
What resource requests/limits is the controller being subjected to?

I don't have all the details, but it's happening on a cluster with about 4000 to 5000 (if not more) existing PipelineRuns when they switch chains on (and a bit less than 1000 PipelineRuns per hour). They had to raise their resource request to at least 4G of memory for the controller not to be killed. However, the controller takes... a lot of time to handle all the existing PipelineRuns, and thus the queue to sign any "live" / new PipelineRun (or TaskRun) keeps growing.

There is probably some exploration to do around the HA setup (similar to Pipeline, with multiple replicas, …), and around fiddling with the controller's flags for the rate limits, etc.

@concaf
Contributor

concaf commented Nov 1, 2023

thinking out loud here - another implicit way to accommodate this use case would be that the chains controller processes the current / active / incoming workloads first before moving to older workloads on the cluster which get queued up; chains can start processing older workloads when it gets some free cycles and a chance to breathe - wdyt?

@vdemeester
Member Author

thinking out loud here - another implicit way to accommodate this use case would be that the chains controller processes the current / active / incoming workloads first before moving to older workloads on the cluster which get queued up; chains can start processing older workloads when it gets some free cycles and a chance to breathe - wdyt?

I don't think there is any way to control this from knative/pkg code though.

@wlynch
Member

wlynch commented Nov 8, 2023

You can! Knative has slow/fast queues built in - active items come in on the fast queue, resyncs come in on the slow queue and are picked off as needed.

What you could do is have a threshold for recently completed Runs, and if a Run is outside the threshold, kick its key out to the slow queue to be handled later.
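
For illustration, a rough sketch of that routing in the controller setup, assuming knative.dev/pkg's controller.Impl exposes EnqueueSlow alongside Enqueue (as the two-lane work queue does) and using the Tekton v1beta1 types; the one-hour threshold is an arbitrary example:

```go
package chainscontroller

import (
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	"k8s.io/client-go/tools/cache"
	"knative.dev/pkg/controller"
)

// backlogThreshold is an arbitrary example cutoff: runs that completed
// longer ago than this are treated as backlog.
const backlogThreshold = time.Hour

// enqueueByAge routes fresh work to the fast lane and the pre-existing
// backlog to the slow lane, so new runs get signed first.
func enqueueByAge(impl *controller.Impl) cache.ResourceEventHandler {
	return controller.HandleAll(func(obj interface{}) {
		pr, ok := obj.(*v1beta1.PipelineRun)
		if !ok {
			impl.Enqueue(obj)
			return
		}
		if ct := pr.Status.CompletionTime; ct != nil && time.Since(ct.Time) > backlogThreshold {
			// Picked up when the fast lane has spare cycles.
			impl.EnqueueSlow(obj)
			return
		}
		// Active or recently completed work goes to the fast lane.
		impl.Enqueue(obj)
	})
}
```

The PipelineRun informer would then use this handler instead of the plain controller.HandleAll(impl.Enqueue).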

@vdemeester
Member Author

You can! Knative has slow/fast queues built in - active items come in on the fast queue, resyncs come in on the slow queue and are picked off as needed.

Ohh that's interesting 🙃
