chore: don't cache errors in file cache #18555
Btw @Emyrk, the test as you originally wrote it assumed that any second caller would refetch, regardless of timing. But we discussed loosening it a bit so that any caller after the actual errored load would refetch, which is much more timing-dependent. I can't really think of a good way to definitively test this behavior: waiting until after the first fetch errors before running the second fetch means we're also waiting until the refcount hits zero, which would clear the entry regardless of error state anyway. But if we call any earlier, most of the time the second caller just gets the shared error, and only rarely does it take long enough to trigger a refetch. Maybe we could add some method to "leak" a reference for testing purposes to ensure that the file is refetched anyway, but I'm never a fan of adding extra complexity just to make something testable.
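For illustration only, the "leak a reference" idea could be as small as a test-only hook like the sketch below. LeakForTest and the data field name are invented here and are not part of the real cache:

// LeakForTest is a hypothetical test-only helper: it takes an extra reference
// that is never released, so that if a later Acquire refetches, it must be
// because the errored entry was purged, not because the refcount hit zero.
func (c *Cache) LeakForTest(fileID uuid.UUID) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if entry, ok := c.data[fileID]; ok {
		entry.refCount.Add(1)
	}
}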
Yes, 100%, the original test is not really relevant anymore.
I wonder if we can make something work with an
Overall looking good.
There is a way to make the test queue up the Acquires, at least for the first discrete group that gets an error. The value of the test can definitely be questioned 🤷
// TestCancelledFetch2 runs 2 Acquire calls in a queue, and ensures both return
// the same error.
func TestCancelledFetch2(t *testing.T) {
	t.Parallel()

	fileID := uuid.New()
	rdy := make(chan struct{})
	dbM := dbmock.NewMockStore(gomock.NewController(t))
	expectedErr := xerrors.New("expected error")

	// First call will fail with a custom error that all callers will return with.
	dbM.EXPECT().GetFileByID(gomock.Any(), gomock.Any()).DoAndReturn(func(mTx context.Context, fileID uuid.UUID) (database.File, error) {
		// Wait long enough for the second call to be queued up.
		<-rdy
		return database.File{}, expectedErr
	})

	//nolint:gocritic // Unit testing
	ctx := dbauthz.AsFileReader(testutil.Context(t, testutil.WaitShort))

	// Expect 2 calls to Acquire before we continue the test.
	var acquiresQueued sync.WaitGroup
	acquiresQueued.Add(2)

	rawCache := files.New(prometheus.NewRegistry(), &coderdtest.FakeAuthorizer{})
	var cache files.FileAcquirer = &acquireHijack{
		cache: rawCache,
		hook: func(_ context.Context, _ database.Store, _ uuid.UUID) {
			acquiresQueued.Done()
		},
	}

	var wg sync.WaitGroup
	wg.Add(2)

	// First call that will fail.
	go func() {
		_, err := cache.Acquire(ctx, dbM, fileID)
		assert.ErrorIs(t, err, expectedErr)
		wg.Done()
	}()

	// Second call that should fail with the same shared error.
	go func() {
		_, err := cache.Acquire(ctx, dbM, fileID)
		assert.ErrorIs(t, err, expectedErr)
		wg.Done()
	}()

	// We need that second Acquire call to be queued up before releasing the first.
	acquiresQueued.Wait()

	// Release the first Acquire call, which should make both calls return with the
	// expected error.
	close(rdy)

	// Wait for both goroutines to assert their errors and finish.
	wg.Wait()
	require.Equal(t, 0, rawCache.Count())
}
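The acquireHijack helper used above isn't shown in this comment. A minimal sketch of what it could look like is below; the exact FileAcquirer signature (assumed here to return *files.CloseFS) may differ from the real interface:

// acquireHijack wraps a real cache and calls a hook before delegating each
// Acquire, so the test can tell when both Acquire calls have been queued.
type acquireHijack struct {
	cache files.FileAcquirer
	hook  func(ctx context.Context, db database.Store, fileID uuid.UUID)
}

func (a *acquireHijack) Acquire(ctx context.Context, db database.Store, fileID uuid.UUID) (*files.CloseFS, error) {
	a.hook(ctx, db, fileID)
	return a.cache.Acquire(ctx, db, fileID)
}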
LGTM, I think we should just use the cache lock to control entry ref counts, and get rid of that race condition.
coderd/files/cache.go
	entry.refCount.Add(-1)
	c.currentOpenFileReferences.Dec()
	// Safety: Another thread could grab a reference to this value between
	// this check and entering `purge`, which will grab the cache lock. This
	// is annoying, and may lead to temporary duplication of the file in
	// memory, but is better than the deadlocking potential of other
	// approaches we tried to solve this.
	if entry.refCount.Load() > 0 {
		return
	}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refcount changes should just use the cache lock. Then we don't have the issue the comment refers to. A refcount++/-- is so quick that it's not going to hold up the cache performance-wise. And if the count hits 0, purge requires the cache lock anyway.
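To make that concrete, a rough sketch of a release that holds the cache lock for the whole decrement-and-purge sequence might look like this. Field and method names are assumed from the snippets in this thread, not the exact code in the PR:

// Sketch: with the cache lock held for the whole sequence, the refcount can be
// a plain int and the check-then-purge race described above goes away.
func (c *Cache) release(fileID uuid.UUID, entry *cacheEntry) {
	c.lock.Lock()
	defer c.lock.Unlock()

	entry.refCount--
	c.currentOpenFileReferences.Dec()
	if entry.refCount > 0 {
		return
	}
	// Assumes purge (or a purgeLocked variant) expects the lock to already be held.
	c.purgeLocked(fileID)
}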
Co-authored-by: Steven Masley <[email protected]>
	c.lock.Lock()
	defer c.lock.Unlock()
The 3 other early returns do not lock the cache on e.Close.
By design, concurrent calls to Acquire in the file cache all share a single database fetch, so that everyone can share in the success of whoever asked for the file first. That's kind of what caches do!

But one problem with the current implementation is that errors are also shared. This is mostly fine, because once all of the references are dropped, the cache entry will be freed, and the next Acquire will trigger a new fetch. However, if enough people are trying to load the same file at once, you could imagine how they might keep retrying and the reference count never quite hits zero.

To combat this, just immediately and forcibly remove errors from the cache, even if they still have references. Whoever is the first to retry afterwards will trigger a new fetch (like we want), which can then again be shared by others who retry.

Relatedly, one opportunity we have to reduce the potential for errors is to use context.Background() for the database fetch, so that a canceled request context cannot disrupt others who may be waiting for the file. We can then manually check the context outside of the Load, just like we already do with authorization.
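For illustration only, here is a minimal sketch of how those two ideas could fit together. The package, types, and the shape of Acquire below are assumptions made for this example (refcounting and the real return type are omitted); it is not the PR's actual implementation:

package filecache // illustrative sketch, not the real coderd/files package

import (
	"context"
	"sync"

	"github.com/google/uuid"

	"github.com/coder/coder/v2/coderd/database"
	"github.com/coder/coder/v2/coderd/database/dbauthz"
)

type sketchEntry struct {
	done chan struct{}
	file database.File
	err  error
}

type sketchCache struct {
	lock sync.Mutex
	data map[uuid.UUID]*sketchEntry
}

func (c *sketchCache) Acquire(ctx context.Context, db database.Store, fileID uuid.UUID) (database.File, error) {
	c.lock.Lock()
	entry, ok := c.data[fileID]
	if !ok {
		entry = &sketchEntry{done: make(chan struct{})}
		c.data[fileID] = entry
		go func() {
			defer close(entry.done)
			// Fetch with a background context (still as the file reader) so one
			// caller's canceled request cannot fail the fetch for everyone else.
			//nolint:gocritic // the fetch is shared across callers
			entry.file, entry.err = db.GetFileByID(dbauthz.AsFileReader(context.Background()), fileID)
			if entry.err != nil {
				// Don't cache errors: evict the entry immediately, even if other
				// callers still hold references, so the next Acquire refetches.
				c.lock.Lock()
				delete(c.data, fileID)
				c.lock.Unlock()
			}
		}()
	}
	c.lock.Unlock()

	<-entry.done
	// Check the caller's context manually, outside the shared fetch, the same
	// way authorization is checked per caller.
	if err := ctx.Err(); err != nil {
		return database.File{}, err
	}
	return entry.file, entry.err
}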