
Readiness Tracker Deadlock for Terminating Resources #660

Closed
theMagicalKarp opened this issue Jun 2, 2020 · 7 comments · Fixed by #662 or #683
Labels
bug Something isn't working

Comments

@theMagicalKarp
Contributor

What

This was brought to my attention and discovered by 💪 @SimKev2 💪

Resources which are marked for termination when gatekeeper starts up cause gatekeeper to fail its readiness probes indefinitely (assuming gatekeeper is trying to sync those resources). This seems to be because gatekeeper tries to ensure it has successfully loaded the sync cache before it handles any traffic: it "expects" terminating resources but never "observes" them, resulting in a deadlock for the readiness check.
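
To illustrate the shape of the problem, here is a minimal, self-contained sketch of this kind of expectation-based readiness tracking (the type and method names are simplified stand-ins, not gatekeeper's actual code): every object returned by the initial list is expected, readiness passes only once every expectation has been observed, and a terminating object that the sync controller skips is never observed, so readiness never passes.

package main

import (
    "fmt"
    "sync"
)

// objectTracker is a toy stand-in for the readiness tracker: keys from the
// initial list are "expected", keys handled by the sync controller are
// "observed", and readiness requires every expected key to be observed.
type objectTracker struct {
    mu       sync.Mutex
    expected map[string]bool
    observed map[string]bool
}

func newTracker() *objectTracker {
    return &objectTracker{expected: map[string]bool{}, observed: map[string]bool{}}
}

func (t *objectTracker) Expect(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.expected[key] = true
}

func (t *objectTracker) Observe(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.observed[key] = true
}

// Satisfied reports whether every expected key has been observed.
func (t *objectTracker) Satisfied() bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    for key := range t.expected {
        if !t.observed[key] {
            return false
        }
    }
    return true
}

func main() {
    tr := newTracker()
    tr.Expect("default")  // healthy namespace from the initial list
    tr.Expect("rob-test") // terminating namespace, also in the initial list

    tr.Observe("default") // the sync controller syncs and observes it
    // "rob-test" has a deletion timestamp, so the sync controller skips it
    // and never observes it.
    fmt.Println(tr.Satisfied()) // false -- the readiness probe fails forever
}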

Steps

  1. Create the following namespace.
apiVersion: v1
kind: Namespace
metadata:
  finalizers:
  - rob.test.io/abcdefg
  name: rob-test
  2. Delete it with kubectl delete ns rob-test (this should hang, since the finalizer won't resolve). The purpose of this is to put a resource permanently into the terminating state.

  3. Ensure Namespace is in the sync config

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
    - kind: Namespace
      version: v1
  4. Start Gatekeeper

After gatekeeper starts up it should fail its readiness checks indefinitely.

This seems to be because, when setting the expectations for the objectTracker, we take the rob-test namespace into consideration (even though it's terminating). This becomes a problem later when the sync controller runs, since it doesn't observe resources marked for termination.

I think what makes sense is to run Observe on resources which have been marked for termination.

So add r.tracker.ForData(gvk).Observe(instance) inside this branch of the sync controller:

if !instance.GetDeletionTimestamp().IsZero() {
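
A hedged sketch of what that change would do (the object, dataTracker, and reconcile names below are hypothetical stand-ins; only the Observe call and the deletion-timestamp check correspond to the code referenced above): the terminating branch reports the object as observed before returning early, so the readiness expectation set at startup can be satisfied.

package main

import (
    "fmt"
    "time"
)

// object is a stand-in for an unstructured Kubernetes resource; a non-nil
// deletionTimestamp means the object is terminating (the analogue of
// !instance.GetDeletionTimestamp().IsZero()).
type object struct {
    name              string
    deletionTimestamp *time.Time
}

// dataTracker is a stand-in for the per-GVK readiness tracker returned by
// tracker.ForData(gvk).
type dataTracker struct {
    observed map[string]bool
}

func (t *dataTracker) Observe(o object) { t.observed[o.name] = true }

// reconcile mimics the sync controller's handling of a single object.
func reconcile(t *dataTracker, o object) {
    if o.deletionTimestamp != nil {
        // The proposed fix: mark the terminating object as observed so the
        // readiness tracker's expectation for it can be satisfied, then skip
        // the normal sync work.
        t.Observe(o)
        return
    }
    // Normal path: add the object to the sync cache, then observe it.
    t.Observe(o)
}

func main() {
    now := time.Now()
    t := &dataTracker{observed: map[string]bool{}}
    reconcile(t, object{name: "rob-test", deletionTimestamp: &now})
    fmt.Println(t.observed["rob-test"]) // true -- readiness can now succeed
}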

FYI @shomron

Environment:

  • Gatekeeper version: v3.1.0-beta.9
  • Kubernetes version: 1.15
@theMagicalKarp theMagicalKarp added the bug Something isn't working label Jun 2, 2020
@maxsmythe
Contributor

Thanks for finding this! We should also cancel expectations for any observed deletes. Otherwise there is a race condition where an object is deleted sometime after the initial list is gathered but before the operator begins syncing.

@maxsmythe
Contributor

This also involves modifying the cancel-expectation function to short-circuit if expectations are already satisfied, to avoid a memory leak.

@maxsmythe
Contributor

It looks like we are already calling CancelExpect for all deleted constraint templates, so short-circuiting-if-populated would fix a memory leak there
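
Putting the comments above together, here is a sketch of how a tracker like the simplified one earlier in this issue could cancel expectations on observed deletes and short-circuit once satisfied (hypothetical names again, not the code that eventually landed; the one-way satisfied flag plays the role of the circuit breaker in the fix):

package tracker

import "sync"

// objectTracker extends the earlier toy sketch with cancellation and a
// one-way "satisfied" flag. Canceled keys are remembered so a late Expect for
// an already-deleted object is ignored; that memory is what would grow
// without the short-circuit, since CancelExpect keeps being called for
// deletes (e.g., deleted constraint templates) long after startup.
type objectTracker struct {
    mu        sync.Mutex
    satisfied bool // circuit breaker: trips once, never resets
    expected  map[string]bool
    observed  map[string]bool
    canceled  map[string]bool
}

func newObjectTracker() *objectTracker {
    return &objectTracker{
        expected: map[string]bool{},
        observed: map[string]bool{},
        canceled: map[string]bool{},
    }
}

func (t *objectTracker) Expect(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.satisfied || t.canceled[key] {
        return
    }
    t.expected[key] = true
}

// CancelExpect handles an observed delete: stop expecting the object and
// remember the cancellation -- unless the breaker has already tripped, in
// which case there is nothing left to track and the call is a no-op.
func (t *objectTracker) CancelExpect(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.satisfied {
        return
    }
    delete(t.expected, key)
    t.canceled[key] = true
}

func (t *objectTracker) Observe(key string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.satisfied {
        return
    }
    t.observed[key] = true
}

// Satisfied trips the breaker and frees all tracking state the first time
// every remaining expectation has been observed.
func (t *objectTracker) Satisfied() bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.satisfied {
        return true
    }
    for key := range t.expected {
        if !t.observed[key] {
            return false
        }
    }
    t.satisfied = true
    t.expected, t.observed, t.canceled = nil, nil, nil
    return true
}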

@shomron
Contributor

shomron commented Jun 9, 2020

@theMagicalKarp thank you again!
@maxsmythe I can pick up the remaining work, need to confirm the scope.

@maxsmythe
Contributor

ack, lemme know. Happy to help if the scope is too large.

shomron added a commit to shomron/gatekeeper that referenced this issue Jun 15, 2020
Introduces a circuit breaker into objectTracker which is tripped once
expectations have been met. When tripped, internal state tracking memory
can be freed and subsequent operations will not consume additional
memory in the tracker.

Closes open-policy-agent#660

Signed-off-by: Oren Shomron <[email protected]>
shomron added a commit to shomron/gatekeeper that referenced this issue Jun 16, 2020
maxsmythe added a commit that referenced this issue Jun 17, 2020
Introduces a circuit breaker into objectTracker which is tripped once
expectations have been met. When tripped, internal state tracking memory
can be freed and subsequent operations will not consume additional
memory in the tracker.

Closes #660

Signed-off-by: Oren Shomron <[email protected]>

Co-authored-by: Max Smythe <[email protected]>
@niroowns

Hi @maxsmythe @shomron - is there a workaround for this without having to update to the latest image? We're running version v3.1.0-beta.9 of the controller. We're seeing what appears to be a similar issue impacting the pod resource, but I don't see anything in a "Terminating" state. Removing pods from the config file and deleting the controller pod does allow it to come up and the readiness probe to pass. I'm just not sure which is the offending pod within the cluster that's causing the readiness probe to return a 500. Any help would be appreciated.

@shomron
Contributor

shomron commented Aug 28, 2020

@niroowns without updating the image, your best bet would be to remove the readinessProbe from the Gatekeeper manifest. But upgrading would be preferable :)
