
Readiness Tracker Deadlock for Terminating Resources #660

Closed
theMagicalKarp opened this issue Jun 2, 2020 · 7 comments · Fixed by #662 or #683
Labels
bug Something isn't working

Comments

@theMagicalKarp
Contributor

What

This was brought to my attention and discovered by 💪 @SimKev2 💪

Resources which are marked for termination when gatekeeper starts up cause gatekeeper to fail its readiness probes indefinitely (assuming gatekeeper is trying to sync those resources). This seems to be because gatekeeper tries to ensure it has successfully loaded the sync cache before it handles any traffic: it "expects" terminating resources but never "observes" them, resulting in a deadlock for the readiness check.
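
To illustrate the shape of the problem, here is a minimal, self-contained sketch of this kind of expectation-based readiness tracking (the type and method names are simplified stand-ins, not gatekeeper's actual code): every object returned by the initial list is expected, readiness passes only once every expectation has been observed, and a terminating object that the sync controller skips is never observed, so readiness never passes.

package main

import (
    "fmt"
    "sync"
)

// objectTracker is a toy stand-in for the readiness tracker: keys from the
// initial list are "expected", keys handled by the sync controller are
// "observed", and readiness requires every expected key to be observed.
type objectTracker struct {
    mu       sync.Mutex
    expected map[string]bool
    observed map[string]bool
}

func newTracker() *objectTracker {
    return &objectTracker{expected: map[string]bool{}, observed: map[string]bool{}}
}

func (t *objectTracker) Expect(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.expected[key] = true
}

func (t *objectTracker) Observe(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.observed[key] = true
}

// Satisfied reports whether every expected key has been observed.
func (t *objectTracker) Satisfied() bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    for key := range t.expected {
        if !t.observed[key] {
            return false
        }
    }
    return true
}

func main() {
    tr := newTracker()
    tr.Expect("default")  // healthy namespace from the initial list
    tr.Expect("rob-test") // terminating namespace, also in the initial list

    tr.Observe("default") // the sync controller syncs and observes it
    // "rob-test" has a deletion timestamp, so the sync controller skips it
    // and never observes it.
    fmt.Println(tr.Satisfied()) // false -- the readiness probe fails forever
}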

Steps

  1. Create the following namespace.
apiVersion: v1
kind: Namespace
metadata:
  finalizers:
  - rob.test.io/abcdefg
  name: rob-test
  2. Delete it with kubectl delete ns rob-test (this should hang, since the finalizer won't resolve). The purpose of this is to put a resource permanently into the terminating state.

  3. Ensure Namespace is in the sync config

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
    - kind: Namespace
      version: v1
  4. Start Gatekeeper

After gatekeeper starts up it should fail its readiness checks indefinitely.

This seems to be because, when setting the expectations for the objectTracker, we take the rob-test namespace into consideration (even though it's terminating). This becomes a problem later when the sync controller runs, since it doesn't observe resources marked for termination.

I think what makes sense is to run Observe on resources which have been marked for termination.

So add r.tracker.ForData(gvk).Observe(instance) inside this branch of the sync controller:

if !instance.GetDeletionTimestamp().IsZero() {
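
A hedged sketch of what that change would do (the object, dataTracker, and reconcile names below are hypothetical stand-ins; only the Observe call and the deletion-timestamp check correspond to the code referenced above): the terminating branch reports the object as observed before returning early, so the readiness expectation set at startup can be satisfied.

package main

import (
    "fmt"
    "time"
)

// object is a stand-in for an unstructured Kubernetes resource; a non-nil
// deletionTimestamp means the object is terminating (the analogue of
// !instance.GetDeletionTimestamp().IsZero()).
type object struct {
    name              string
    deletionTimestamp *time.Time
}

// dataTracker is a stand-in for the per-GVK readiness tracker returned by
// tracker.ForData(gvk).
type dataTracker struct {
    observed map[string]bool
}

func (t *dataTracker) Observe(o object) { t.observed[o.name] = true }

// reconcile mimics the sync controller's handling of a single object.
func reconcile(t *dataTracker, o object) {
    if o.deletionTimestamp != nil {
        // The proposed fix: mark the terminating object as observed so the
        // readiness tracker's expectation for it can be satisfied, then skip
        // the normal sync work.
        t.Observe(o)
        return
    }
    // Normal path: add the object to the sync cache, then observe it.
    t.Observe(o)
}

func main() {
    now := time.Now()
    t := &dataTracker{observed: map[string]bool{}}
    reconcile(t, object{name: "rob-test", deletionTimestamp: &now})
    fmt.Println(t.observed["rob-test"]) // true -- readiness can now succeed
}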

FYI @shomron

Environment:

  • Gatekeeper version: v3.1.0-beta.9
  • Kubernetes version: 1.15
@theMagicalKarp theMagicalKarp added the bug Something isn't working label Jun 2, 2020
@maxsmythe
Contributor

Thanks for finding this! We should also cancel expectations for any observed deletes. Otherwise there is a race condition where an object is deleted sometime after the initial list is gathered but before the operator begins syncing.

@maxsmythe
Contributor

This also involves modifying the cancel-expectation function to short-circuit if expectations are already satisfied, to avoid a memory leak.

@maxsmythe
Contributor

It looks like we are already calling CancelExpect for all deleted constraint templates, so short-circuiting-if-populated would fix a memory leak there
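
Putting the comments above together, here is a sketch of how a tracker like the simplified one earlier in this issue could cancel expectations on observed deletes and short-circuit once satisfied (hypothetical names again, not the code that eventually landed; the one-way satisfied flag plays the role of the circuit breaker in the fix):

package tracker

import "sync"

// objectTracker extends the earlier toy sketch with cancellation and a
// one-way "satisfied" flag. Canceled keys are remembered so a late Expect for
// an already-deleted object is ignored; that memory is what would grow
// without the short-circuit, since CancelExpect keeps being called for
// deletes (e.g., deleted constraint templates) long after startup.
type objectTracker struct {
    mu        sync.Mutex
    satisfied bool // circuit breaker: trips once, never resets
    expected  map[string]bool
    observed  map[string]bool
    canceled  map[string]bool
}

func newObjectTracker() *objectTracker {
    return &objectTracker{
        expected: map[string]bool{},
        observed: map[string]bool{},
        canceled: map[string]bool{},
    }
}

func (t *objectTracker) Expect(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.satisfied || t.canceled[key] {
        return
    }
    t.expected[key] = true
}

// CancelExpect handles an observed delete: stop expecting the object and
// remember the cancellation -- unless the breaker has already tripped, in
// which case there is nothing left to track and the call is a no-op.
func (t *objectTracker) CancelExpect(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.satisfied {
        return
    }
    delete(t.expected, key)
    t.canceled[key] = true
}

func (t *objectTracker) Observe(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.satisfied {
        return
    }
    t.observed[key] = true
}

// Satisfied trips the breaker and frees all tracking state the first time
// every remaining expectation has been observed.
func (t *objectTracker) Satisfied() bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.satisfied {
        return true
    }
    for key := range t.expected {
        if !t.observed[key] {
            return false
        }
    }
    t.satisfied = true
    t.expected, t.observed, t.canceled = nil, nil, nil
    return true
}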

@shomron
Contributor

shomron commented Jun 9, 2020

@theMagicalKarp thank you again!
@maxsmythe I can pick up the remaining work, need to confirm the scope.

@maxsmythe
Contributor

ack, lemme know. Happy to help if the scope is too large.

shomron added a commit to shomron/gatekeeper that referenced this issue Jun 15, 2020
Introduces a circuit breaker into objectTracker which is tripped once
expectations have been met. When tripped, internal state tracking memory
can be freed and subsequent operations will not consume additional
memory in the tracker.

Closes open-policy-agent#660

Signed-off-by: Oren Shomron <[email protected]>
shomron added a commit to shomron/gatekeeper that referenced this issue Jun 16, 2020
maxsmythe added a commit that referenced this issue Jun 17, 2020
Introduces a circuit breaker into objectTracker which is tripped once
expectations have been met. When tripped, internal state tracking memory
can be freed and subsequent operations will not consume additional
memory in the tracker.

Closes #660

Signed-off-by: Oren Shomron <[email protected]>

Co-authored-by: Max Smythe <[email protected]>
@niroowns

Hi @maxsmythe @shomron - is there a workaround for this without having to update to the latest image? We're running version v3.1.0-beta.9 of the controller. We're seeing what appears to be a similar issue impacting the pod resource, but I don't see anything in a "Terminating" state. Removing pods from the config file and deleting the controller pod does allow it to come up and the readiness probe to pass. I'm just not sure which is the offending pod within the cluster that's causing the readiness probe to return a 500. Any help would be appreciated.

@shomron
Contributor

shomron commented Aug 28, 2020

@niroowns without updating the image, your best bet would be to remove the readinessProbe from the Gatekeeper manifest. But upgrading would be preferable :)
