chore: don't cache errors in file cache #18555
Btw @Emyrk, the test as you originally wrote it assumed that any second caller would refetch, regardless of timing. But we discussed loosening it a bit so that any caller after the actual errored load would refetch, which is much more timing-dependent. I can't really think of a good way to definitively test this behavior: waiting until after the first fetch errors before running the second fetch means we're also waiting until the refcount hits zero, which would clear the entry regardless of error state anyway. But if we call any earlier, most of the time the second caller just gets the shared error, and only rarely does it take long enough to trigger a refetch. Maybe we could add some method to "leak" a reference for testing purposes to ensure that the file is refetched anyway, but I'm never a fan of adding extra complexity just to make something testable.
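For illustration only, the "leak a reference" idea could be as small as a test-only hook like the sketch below. LeakForTest and the data field name are invented here and are not part of the real cache:

// LeakForTest is a hypothetical test-only helper: it takes an extra reference
// that is never released, so that if a later Acquire refetches, it must be
// because the errored entry was purged, not because the refcount hit zero.
func (c *Cache) LeakForTest(fileID uuid.UUID) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if entry, ok := c.data[fileID]; ok {
		entry.refCount.Add(1)
	}
}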
Yes, 100%, the original test is not really relevant anymore.
I wonder if we can make something work with an
Overall looking good.
There is a way to make the test queue up the Acquires, at least for the first discrete group that gets an error. The value of the test can definitely be questioned 🤷
// TestCancelledFetch2 runs 2 Acquire calls in a queue, and ensures both return
// the same error.
func TestCancelledFetch2(t *testing.T) {
	t.Parallel()

	fileID := uuid.New()
	rdy := make(chan struct{})
	dbM := dbmock.NewMockStore(gomock.NewController(t))
	expectedErr := xerrors.New("expected error")

	// First call will fail with a custom error that all callers will return with.
	dbM.EXPECT().GetFileByID(gomock.Any(), gomock.Any()).DoAndReturn(func(mTx context.Context, fileID uuid.UUID) (database.File, error) {
		// Wait long enough for the second call to be queued up.
		<-rdy
		return database.File{}, expectedErr
	})

	//nolint:gocritic // Unit testing
	ctx := dbauthz.AsFileReader(testutil.Context(t, testutil.WaitShort))

	// Expect 2 calls to Acquire before we continue the test.
	var acquiresQueued sync.WaitGroup
	acquiresQueued.Add(2)

	rawCache := files.New(prometheus.NewRegistry(), &coderdtest.FakeAuthorizer{})
	var cache files.FileAcquirer = &acquireHijack{
		cache: rawCache,
		hook: func(_ context.Context, _ database.Store, _ uuid.UUID) {
			acquiresQueued.Done()
		},
	}

	var wg sync.WaitGroup
	wg.Add(2)

	// First call that will fail.
	go func() {
		_, err := cache.Acquire(ctx, dbM, fileID)
		assert.ErrorIs(t, err, expectedErr)
		wg.Done()
	}()

	// Second call that should fail with the same shared error.
	go func() {
		_, err := cache.Acquire(ctx, dbM, fileID)
		assert.ErrorIs(t, err, expectedErr)
		wg.Done()
	}()

	// We need that second Acquire call to be queued up before releasing the first.
	acquiresQueued.Wait()

	// Release the first Acquire call, which should make both calls return with the
	// expected error.
	close(rdy)

	// Wait for both goroutines to assert their errors and finish.
	wg.Wait()
	require.Equal(t, 0, rawCache.Count())
}
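The acquireHijack helper used above isn't shown in this comment. A minimal sketch of what it could look like is below; the exact FileAcquirer signature (assumed here to return *files.CloseFS) may differ from the real interface:

// acquireHijack wraps a real cache and calls a hook before delegating each
// Acquire, so the test can tell when both Acquire calls have been queued.
type acquireHijack struct {
	cache files.FileAcquirer
	hook  func(ctx context.Context, db database.Store, fileID uuid.UUID)
}

func (a *acquireHijack) Acquire(ctx context.Context, db database.Store, fileID uuid.UUID) (*files.CloseFS, error) {
	a.hook(ctx, db, fileID)
	return a.cache.Acquire(ctx, db, fileID)
}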
LGTM, I think we should just use the cache lock to control entry ref counts, and get rid of that race condition.
coderd/files/cache.go
	entry.refCount.Add(-1)
	c.currentOpenFileReferences.Dec()
	// Safety: Another thread could grab a reference to this value between
	// this check and entering `purge`, which will grab the cache lock. This
	// is annoying, and may lead to temporary duplication of the file in
	// memory, but is better than the deadlocking potential of other
	// approaches we tried to solve this.
	if entry.refCount.Load() > 0 {
		return
	}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refcount changes should just use the cache lock. Then we don't have the issue the comment refers to. A refcount++/-- is so quick that it's not going to hold up the cache performance-wise. And if the count hits 0, purge requires the cache lock anyway.
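To make that concrete, a rough sketch of a release that holds the cache lock for the whole decrement-and-purge sequence might look like this. Field and method names are assumed from the snippets in this thread, not the exact code in the PR:

// Sketch: with the cache lock held for the whole sequence, the refcount can be
// a plain int and the check-then-purge race described above goes away.
func (c *Cache) release(fileID uuid.UUID, entry *cacheEntry) {
	c.lock.Lock()
	defer c.lock.Unlock()

	entry.refCount--
	c.currentOpenFileReferences.Dec()
	if entry.refCount > 0 {
		return
	}
	// Assumes purge (or a purgeLocked variant) expects the lock to already be held.
	c.purgeLocked(fileID)
}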
Co-authored-by: Steven Masley <[email protected]>
	c.lock.Lock()
	defer c.lock.Unlock()
The 3 other early returns do not lock the cache on e.Close.
By design, concurrent calls to Acquire in the file cache all share a single database fetch, so that everyone can share in the success of whoever asked for the file first. That's kind of what caches do!

But one problem with the current implementation is that errors are also shared. This is mostly fine, because once all of the references are dropped, the cache entry will be freed, and the next Acquire will trigger a new fetch. However, if enough people are trying to load the same file at once, you could imagine how they might keep retrying and the reference count never quite hits zero.

To combat this, just immediately and forcibly remove errors from the cache, even if they still have references. Whoever is the first to retry afterwards will trigger a new fetch (like we want), which can then again be shared by others who retry.

Relatedly, one opportunity we have to reduce the potential for errors is to use context.Background() for the database fetch, so that a canceled request context cannot disrupt others who may be waiting for the file. We can then manually check the context outside of the Load, just like we already do with authorization.
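For illustration only, here is a minimal sketch of how those two ideas could fit together. The package, types, and the shape of Acquire below are assumptions made for this example (refcounting and the real return type are omitted); it is not the PR's actual implementation:

package filecache // illustrative sketch, not the real coderd/files package

import (
	"context"
	"sync"

	"github.com/google/uuid"

	"github.com/coder/coder/v2/coderd/database"
	"github.com/coder/coder/v2/coderd/database/dbauthz"
)

type sketchEntry struct {
	done chan struct{}
	file database.File
	err  error
}

type sketchCache struct {
	lock sync.Mutex
	data map[uuid.UUID]*sketchEntry
}

func (c *sketchCache) Acquire(ctx context.Context, db database.Store, fileID uuid.UUID) (database.File, error) {
	c.lock.Lock()
	entry, ok := c.data[fileID]
	if !ok {
		entry = &sketchEntry{done: make(chan struct{})}
		c.data[fileID] = entry
		go func() {
			defer close(entry.done)
			// Fetch with a background context (still as the file reader) so one
			// caller's canceled request cannot fail the fetch for everyone else.
			//nolint:gocritic // the fetch is shared across callers
			entry.file, entry.err = db.GetFileByID(dbauthz.AsFileReader(context.Background()), fileID)
			if entry.err != nil {
				// Don't cache errors: evict the entry immediately, even if other
				// callers still hold references, so the next Acquire refetches.
				c.lock.Lock()
				delete(c.data, fileID)
				c.lock.Unlock()
			}
		}()
	}
	c.lock.Unlock()

	<-entry.done
	// Check the caller's context manually, outside the shared fetch, the same
	// way authorization is checked per caller.
	if err := ctx.Err(); err != nil {
		return database.File{}, err
	}
	return entry.file, entry.err
}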