Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Firestore: Fix spurious "Backend didn't respond within 10 seconds" errors when network just slow #8145

Merged
merged 9 commits into from
Apr 18, 2024

Conversation

dconeybe
Copy link
Contributor

@dconeybe dconeybe commented Apr 9, 2024

Firestore was erroneously concluding that the connection to the backend was broken when the response to the initial listen request was large (e.g. 1000 documents) and the network speed was slow.

This bug resulted in erroneous errors that looked like this:

Could not reach Cloud Firestore backend. Backend didn't respond within 10 seconds.
This typically indicates that your device does not have a healthy Internet connection at the moment. The client will operate in offline mode until it is able to successfully connect to the backend.

The fix is to propagate the first "OPEN" message from WebChannel to the online state detector so that it will consider the connection to be "alive" even though no proto messages have been received on the stream.

Googlers see b/325591749

@dconeybe dconeybe self-assigned this Apr 9, 2024
Copy link

changeset-bot bot commented Apr 9, 2024

🦋 Changeset detected

Latest commit: d309eaf

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@firebase/firestore Patch
firebase Patch
@firebase/firestore-compat Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@google-oss-bot
Copy link
Contributor

google-oss-bot commented Apr 9, 2024

Size Report 1

Affected Products

  • @firebase/firestore

    TypeBase (d1eae3a)Merge (e03fd3f)Diff
    browser377 kB377 kB+275 B (+0.1%)
    esm5362 kB362 kB+345 B (+0.1%)
    main580 kB580 kB+464 B (+0.1%)
    module377 kB377 kB+275 B (+0.1%)
    react-native377 kB377 kB+275 B (+0.1%)
  • bundle

    TypeBase (d1eae3a)Merge (e03fd3f)Diff
    firestore (CSI Auto Indexing Disable and Delete)268 kB269 kB+113 B (+0.0%)
    firestore (CSI Auto Indexing Enable)268 kB269 kB+113 B (+0.0%)
    firestore (Persistence)303 kB303 kB+172 B (+0.1%)
    firestore (Query Cursors)247 kB241 kB-6.36 kB (-2.6%)
    firestore (Query)245 kB238 kB-6.36 kB (-2.6%)
    firestore (Read data once)233 kB226 kB-6.36 kB (-2.7%)
    firestore (Read Write w Persistence)323 kB323 kB+172 B (+0.1%)
    firestore (Realtime updates)235 kB229 kB-6.36 kB (-2.7%)
    firestore (Transaction)212 kB205 kB-6.47 kB (-3.1%)
    firestore (Write data)212 kB205 kB-6.39 kB (-3.0%)
  • firebase

    TypeBase (d1eae3a)Merge (e03fd3f)Diff
    firebase-compat.js786 kB786 kB+165 B (+0.0%)
    firebase-firestore-compat.js342 kB342 kB+165 B (+0.0%)
    firebase-firestore.js436 kB436 kB+275 B (+0.1%)

Test Logs

  1. https://storage.googleapis.com/firebase-sdk-metric-reports/EdgLCb4PBr.html

@google-oss-bot
Copy link
Contributor

google-oss-bot commented Apr 9, 2024

Size Analysis Report 1

This report is too large (568,673 characters) to be displayed here in a GitHub comment. Please use the below link to see the full report on Google Cloud Storage.

Test Logs

  1. https://storage.googleapis.com/firebase-sdk-metric-reports/GoAKfWvA3c.html

@thomasdao
Copy link

thomasdao commented Apr 13, 2024

@dconeybe thanks for working on this issue. I tested the PR, the query can load about 100 documents then found the below error

Error: FIRESTORE (10.6.0) INTERNAL ASSERTION FAILED: Unexpected state
    at fail (index.esm2017.js:194:1)
    at __PRIVATE_AsyncQueueImpl.Iu (index.esm2017.js:18538:1)
    at __PRIVATE_AsyncQueueImpl.enqueue (index.esm2017.js:18454:1)
    at __PRIVATE_AsyncQueueImpl.enqueueAndForget (index.esm2017.js:18439:1)
    at DelayedOperation.handleDelayElapsed (index.esm2017.js:15141:1)
    at index.esm2017.js:15123:1

@Valansch
Copy link

Hi @dconeybe thank you for working on the fix.
Do you have a tip how I can build the sdk and import it as a dart package?

@dconeybe
Copy link
Contributor Author

thomasdao

Please try these steps:

  1. cd packages/firestore
  2. yarn build:debug
  3. cp -r dist ~/YOUR_PROJECT/node_modules/@firebase/firestore

The build:debug script will build dist/index.esm2017.js without all of the optimizations and code mangling. This will make your stack traces easier to interpret.

Now, please reproduce again and post an updated stack trace. Hopefully the stack trace from unmangled code will make more sense.

@dconeybe
Copy link
Contributor Author

Hi @dconeybe thank you for working on the fix. Do you have a tip how I can build the sdk and import it as a dart package?

@Valansch Unfortunately, no, I don't have any experience with dart or flutter. Would you be able to ask this question on the https://github.com/firebase/flutterfire repository?

@thomasdao
Copy link

@dconeybe please see the error below

Error log
firebase.worker.js:50 Error: FIRESTORE (10.6.0) INTERNAL ASSERTION FAILED: AsyncQueue is already failed: Error: FIRESTORE (10.6.0) INTERNAL ASSERTION FAILED: `pendingResponses` is less than 0. Actual value: -1. This indicates that the SDK received more target acks from the server than expected. The SDK should not continue to operate.
  at fail (index.esm2017.js:213:1)
  at hardAssert (index.esm2017.js:223:1)
  at TargetState.recordTargetResponse (index.esm2017.js:8464:1)
  at index.esm2017.js:8523:1
  at Array.forEach (<anonymous>)
  at WatchChangeAggregator.forEachTarget (index.esm2017.js:8570:1)
  at WatchChangeAggregator.handleTargetChange (index.esm2017.js:8512:1)
  at onWatchStreamChange (index.esm2017.js:20721:1)
  at PersistentListenStream.onMessage (index.esm2017.js:20010:1)
  at index.esm2017.js:19945:1
  at fail (index.esm2017.js:213:1)
  at AsyncQueueImpl.verifyNotFailed (index.esm2017.js:25633:1)
  at AsyncQueueImpl.enqueue (index.esm2017.js:25539:1)
  at AsyncQueueImpl.enqueueAndForget (index.esm2017.js:25521:1)
  at DelayedOperation.handleDelayElapsed (index.esm2017.js:21168:1)
  at index.esm2017.js:21145:1

@dconeybe
Copy link
Contributor Author

@thomasdao Thanks for providing the updated stack trace. That error looks like the bug that you reported in #7652 has resurfaced. That is surprising because the bug you reported should have been fixed in v10.5.2 (and you are using v10.6.0).

To rule out this issue, could you try the following:

  1. Upgrade your dependency on firebase in package.json to the latest: "firebase": "^10.11.0"
  2. Delete your package-lock.json and node_modules directory.
  3. Run npm install in your application.
  4. Make sure that the firebase-js-sdk that you checked out from GitHub is at the latest commit in the master branch.

Thanks!

@thomasdao
Copy link

@dconeybe I updated Firebase to 10.11.0, then copy the firestore dist folder, and the query still hangs with below error:

index.esm2017.js:78 [2024-04-16T15:19:20.510Z]  @firebase/firestore: Firestore (10.11.0): WebChannelConnection RPC 'Listen' stream 0x5a8c4d87 transport errored: 
Qd {type: 'c', target: Q, g: Q, defaultPrevented: false, status: 1}

@dconeybe
Copy link
Contributor Author

@thomasdao Can you confirm that you have the dconeybe/WebChannelOnOpenFix_Bug325591749 branch checked out in your local clone of this repository? If so, could you try re-running the yarn build:debug and cp -r dist ~/YOUR_PROJECT/node_modules/@firebase/firestore steps?

Also, if you're willing, could you enable Firestore debug logging (i.e. call setLogLevel('debug')) and provide the entire log that is produced? Feel free to email your log to me directly (my email is my github username at google.com) if you're concerned about posting the logs publicly.

@thomasdao
Copy link

thomasdao commented Apr 17, 2024

@dconeybe I've sent the debug log to your email. I've checked the branch again and it's correct:

$ git pull
Already up to date.

$ git rev-parse --abbrev-ref HEAD
dconeybe/WebChannelOnOpenFix_Bug325591749

$ git branch --show-current
dconeybe/WebChannelOnOpenFix_Bug325591749

@dconeybe
Copy link
Contributor Author

@thomasdao Thank you for sending me the logs. Unfortunately, based on the logs I can confirm that this PR will NOT fix the issue you are reporting (#7860). One thing, however, that I noticed from your logs is that your app seems to using Firestore before a user is signed in, and then continue to use it after the user is signed in. This could be innocuous, or could be undesirable behavior, depending on what your app is doing. I just wanted to point it out.

@dconeybe dconeybe marked this pull request as ready for review April 18, 2024 14:55
@dconeybe dconeybe requested review from a team as code owners April 18, 2024 14:55
Copy link
Contributor

@MarkDuckworth MarkDuckworth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with comments.

@@ -113,6 +113,7 @@ export interface Connection {
* be called if the stream successfully established a connection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify the difference between onOpen and onConnected in the documentation? Currently the docs state that "onOpen will only be called if the stream successfully established a connection." That definition of onOpen is what I would expect for onConneted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the docs. LMK if they clarify.

The naming "onOpen" is somewhat unfortunate, as it does not actually indicate that an opened connection with the backend has been established. I think at one point this was the meaning, but the channel is now optimized to send the "handshake" with the first message to avoid an extra round-trip with the backend. So WebChannelConnection now sends onOpen as soon as it is ready to send a message so that it can send the handshake along with the first message. I added onConnected to send an event when an actual connection with the backend has been established.

@@ -148,6 +156,14 @@ describe('Watch Stream', () => {
});
});
});

it('gets connected event before first message', () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the title of this a little misleading because the test doesn't appear to test for any messages. Or does it and I'm just not seeing how?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The awaitCallback() will fail if some other callback occurs (such as a message). The test doesn't care about if the "message" callback occurs, only that "connected" occurs before it. Does that clarify?

@dconeybe dconeybe merged commit 2244194 into master Apr 18, 2024
44 checks passed
@dconeybe dconeybe deleted the dconeybe/WebChannelOnOpenFix_Bug325591749 branch April 18, 2024 17:47
@thomasdao
Copy link

One thing, however, that I noticed from your logs is that your app seems to using Firestore before a user is signed in, and then continue to use it after the user is signed in.

@dconeybe beside user data, I also store public settings in Firestore, so I need to setup Firestore even if the user does not sign in.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants