Large changesets over slow connections result in uncontrollable usage-metered long-polling loops #2683

Closed
djsweet opened this issue Feb 27, 2020 · 12 comments
@djsweet commented Feb 27, 2020

[REQUIRED] Describe your environment

  • Operating System version: Windows 10 Version 1909
  • Browser version: Version 80.0.3987.116 (Official Build) (64-bit)
  • Firebase SDK version: 7.9.1
  • Firebase Product: firestore

[REQUIRED] Describe the problem

We have encountered sporadic periods of quadrupled document read rates since early January 2020. We have narrowed the cause down to pathological behavior in the Firebase SDK when it handles large changesets of 15,000 deleted documents and 15,000 added documents over slow Internet connections.

Clients with slow Internet connections will enter an uncontrollable, endless long-polling loop when these changes are encountered in snapshot listeners. The document additions seem to be reported to SDK consumers, but document removals are not. Each of these long-polling sessions charges the entire deleted document count against our read quota, even though nothing is passed to the SDK consumer.

Steps to reproduce:

I have prepared a sample project that simulates the exact conditions necessary to trigger this bug. A "server" script deletes and then creates 15,000 documents against a Firestore database, and a client attempts to react to these changes. The client's network connection is throttled using the Slow 3G preset in the Google Chrome Developer Console.
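
For reference, the "server" script amounts to something like the following sketch. This is an illustration rather than the exact code in the linked repository; the collection name, document IDs, and use of the Node.js Admin SDK are simplifications.

// Deletes and then recreates 15,000 documents, in batches of 500
// (the maximum size of a Firestore batched write).
const admin = require("firebase-admin");

admin.initializeApp(); // credentials come from GOOGLE_APPLICATION_CREDENTIALS
const db = admin.firestore();

const BATCH_SIZE = 500;
const TOTAL_DOCS = 15000;

async function rewriteCollection() {
  const col = db.collection("rows"); // placeholder collection name

  // Delete every existing document.
  const existing = await col.get();
  for (let i = 0; i < existing.docs.length; i += BATCH_SIZE) {
    const batch = db.batch();
    existing.docs.slice(i, i + BATCH_SIZE).forEach((doc) => batch.delete(doc.ref));
    await batch.commit();
  }

  // Recreate 15,000 new documents.
  for (let i = 0; i < TOTAL_DOCS; i += BATCH_SIZE) {
    const batch = db.batch();
    for (let j = i; j < Math.min(i + BATCH_SIZE, TOTAL_DOCS); j++) {
      batch.set(col.doc(`row-${j}`), { version: j });
    }
    await batch.commit();
  }
}

rewriteCollection();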

We see an initial HTTP connection that takes 270s to complete, transmitting 13.2 MB of data, which results in the client reporting all of the document additions but none of the document removals.

[Screenshot: Read Spike First Contact]

After this, we enter a state where follow-up HTTP connections, each taking 70 seconds, are retried endlessly. Each connection transmits 3.3 MB of data, but the consumer of the SDK is not passed any information regarding changes.

[Screenshot: Project State 45 minutes later]

After running this client for an hour with no actual database changes, we see that we will be charged for 800,000 document reads (roughly consistent with ~50 of these 70-second sessions per hour, each re-reading the 15,000 deleted documents, on top of the initial snapshot).

[Screenshot: Project State Full Hour Later]

Because nothing has actually changed in the database during this period, and we are not actually able to observe changes from the client, we expect to not be charged for any reads.

Relevant Code:

See https://github.com/djsweet/firestore-read-spike-bug-report for an environment which reproduces this problem.

@wilhuff (Contributor) commented Feb 27, 2020

The issue you're encountering here is limbo resolution performed by the client.

As you've noticed, the Firestore server does not remember deletes indefinitely, and a client that has been disconnected for long enough will just see deleted documents omitted from the result set rather than receiving delete notifications.

Limbo resolution is designed to counter this limitation. The idea is that if the client has documents in its cache that match your query but the server isn't returning them, then the cache must be wrong and needs fixing.

The issue is that when the client listens to a query, it first emits a snapshot based on the contents of its cache. Once the server has returned results for the query, the client emits another snapshot with those results, marked as in sync. Without limbo resolution, after some other client modifies a document so that it no longer matches the query, the first snapshot (from cache) will always include it and the second snapshot will not. Especially on slower connections this flicker is obvious, and it won't self-correct. The extra read performed by limbo resolution brings the cache up to date, so the local document no longer matches the query and won't end up in the first snapshot.
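
To make the two snapshots concrete, here is a minimal sketch of how a listener can tell them apart via the snapshot metadata (the collection name is just a placeholder):

const db = firebase.firestore();

db.collection("rows").onSnapshot({ includeMetadataChanges: true }, (snapshot) => {
  if (snapshot.metadata.fromCache) {
    // First snapshot: served from the local cache; it can still contain
    // documents the server has since deleted (potential limbo documents).
    console.log("cache snapshot:", snapshot.size, "documents");
  } else {
    // Later snapshot: the server has confirmed the result set and the
    // listener is in sync.
    console.log("server snapshot:", snapshot.size, "documents");
  }
});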

Note that in the general case it's not enough to just delete limbo documents, because even though a document might not match this query it might match another one.
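
For example (hypothetical queries; the field names and handlers are made up), a cached document with { status: "archived", version: 7 } can stop matching one active query while still matching another:

const db = firebase.firestore();

// The document no longer matches this query after the remote update...
db.collection("rows").where("status", "==", "active").onSnapshot(handleActive);

// ...but it still matches this one, so the client can't simply drop it from
// the cache when the first query stops returning it.
db.collection("rows").where("version", ">", 5).onSnapshot(handleVersions);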

There are additional complications with this method, involving concurrent remote modifications while a limbo resolution is in flight, that cause us to implement limbo resolutions as listens; the client doesn't consider itself done with limbo resolution until all of those listens are in sync.

So what's happening in your case is that over slow connections, performing 15k document lookups takes a long time. It might take long enough that most clients never actually finish the process. If they quit and come back the limbo resolution process will start over. I suspect this is why the number of reads is so large.

In your reproduction (by the way, thanks for this, it's super helpful), your client is listening to a whole collection. Is that how your application actually works? That is--you're not using any where clauses or order-bys? If so, you're in luck: there's a pretty simple fix we can make. Currently the client performs limbo resolution for all queries, but full collection queries are actually a case where the client can delete limbo documents outright without asking.

If you are using where clauses or order-bys, then there's no simple fix for this. We'll need to think about and test methods for limiting the number of active limbo resolutions, or otherwise guaranteeing that clients can make progress without restarting in these kinds of pathological cases. We might also look into adjusting the algorithm so that it doesn't use listens. However, these aren't easy fixes.

In the meantime a workaround might be to clear persistence for this particular client, though sending this kind of instruction remotely is tricky (since you only want to clear once).

@djsweet (Author) commented Feb 27, 2020

Hi @wilhuff. Thanks for the response -- I was expecting one by tomorrow at the earliest.

Unfortunately, our actual application does use inequality queries, so the proposed fix won't actually work for us.

Something I've been wondering about is whether it would be possible to communicate that limbo resolution is failing to the onError callback passed to onSnapshot. If we could just be told that limbo resolution is failing, we could actually disconnect and fall back to other mechanisms to read the data we need out of the snapshot. We wouldn't even necessarily need the listener to be terminated implicitly--we already have the cancellation function in-scope and could do that ourselves.
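
To illustrate, something like the following is what we have in mind. The error code name below is purely hypothetical (the SDK doesn't report limbo-resolution failures today), and query, handler, and fallBackToBlobRead stand in for our own code:

const unsubscribe = query.onSnapshot(handler, (error) => {
  // Hypothetical error code: not something the SDK actually emits.
  if (error.code === "limbo-resolution-failed") {
    // We already have the cancellation function in scope, so we can stop
    // listening ourselves and read the data from our Cloud Storage blob instead.
    unsubscribe();
    fallBackToBlobRead();
  }
});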

This is something of a time-sensitive issue for us. We've already lost a non-trivial amount of money to this bug this year. The faster we can deploy a mitigation, the better.

@wilhuff (Contributor) commented Feb 27, 2020

The onError callback signals that the listener has completely failed so that doesn't seem like the right way to report this.

Can you show an example of the query you're actually running against this collection?

Are there any other queries you run against this collection? Another potential fix is to delete documents that are in limbo but have only ever matched a single query.

The simplest mitigation right now is to clear persistence for affected clients. Something like this:

let db = ...;
if (isAffectedClient() && window.localStorage.getItem("persistence.cleared") == null) {
  await db.clearPersistence();
  window.localStorage.setItem("persistence.cleared", "1");
}

I've also put together a possible way to get the client to delete all limbo documents for itself: https://github.com/firebase/firebase-js-sdk/compare/wilhuff/limbo-hack, essentially forcing a delete-everything-in-limbo policy. I'm not convinced this is the right way to implement that, though, and need to discuss it with the team. If you wanted, you could build it using the instructions in our README and kick the tires on it.

I'm currently looking more closely at your repro to see what's going on here:

Our testing indicates that we should expect a very long download session, lasting approximately 270 seconds with the Slow 3G throttling preset in Google Chrome, followed by a perpetual loop of 70 second download sessions.

The initial long interval makes sense to me, but the loop doesn't seem right.

@wilhuff (Contributor) commented Feb 27, 2020

OK, I've figured out what's going on with your 270-second delay and 70-second loops.

With the "Slow 3G" setting, the initial 270 seconds is what it takes to download the new 15k documents. The client writes these out and notifies the listener.

The client then starts the limbo resolution process. Under the "Slow 3G" preset it takes about 10 seconds to send all 15k requests. The client then receives a resource-exhausted error from the backend indicating that it's not going to let a single client listen to 15k new things at once. The client's behavior when it encounters a resource-exhausted error is to wait with maximum backoff, which comes out to about 60 seconds.

It turns out that slowing the connection down isn't necessary to reproduce this loop.

Next steps are to look into how to add throttling to the limbo resolver.

@wilhuff (Contributor) commented Feb 27, 2020

One thing I just noticed here is that your reproduction doesn't have persistence enabled. Is your production app similarly configured? If so, one of the easier solutions might actually apply.

wilhuff self-assigned this Feb 27, 2020
@wilhuff (Contributor) commented Feb 27, 2020

@djsweet In addition to the technical discussion here, you should open a regular support case so that we can credit you for the excessive reads. In it, please document when this large deletion happened, reference this GitHub issue, and note that you're working with me ([email protected]). From there I'll be able to escalate this to the right people.

@djsweet (Author) commented Feb 28, 2020

Hi again @wilhuff. My apologies for the delay in responding; yesterday was quite hectic for us.

As some background on what our listeners are doing: we are listening to rows in per-app virtual tables that developers create on our platform. The target collection names all take the form /data-store/{app}/tables/{table}/rows, and in any given app we only have one listener open on the entire table/collection.

We use a query 99.9% of the time in this listener, but sometimes have to fall back to a bare listen. Our app data is also stored in Cloud Storage to speed up initial app loading, and all changes to the data are tracked by a monotonically increasing counter. Every time a document is affected by a change, its own copy of this counter increases. So what we often do looks like

firebase
    .firestore()
    .collection("/data-store/{app}/tables/{table}/rows")
    .where("version", ">", blobVersion)
    .onSnapshot(handler)

This avoids the network overhead of downloading the entire table again when we've already materialized it from a blob in Cloud Storage, while still allowing us to report updates as they happen.

As you saw in the reproduction project, we do not use Firestore persistence.

We really appreciate the offer to help us out with GCP support on the financial side of this. Thanks!

@wilhuff (Contributor) commented Mar 2, 2020

Update: our plan to address this issue is to follow two tracks.

On the first track we're currently testing a small change that disables limbo resolution. You should be able to use this version of the SDK temporarily and it should address the immediate concern.

The second track involves more fundamental changes to limbo resolution so that these kinds of events can't cause this kind of problem. We believe that this circumstance is one we could automatically avoid. Additionally we'll be adding limits and throttling so that even if we somehow determine that a huge number of limbo documents do need to be resolved the client won't fail.

@wilhuff (Contributor) commented Mar 4, 2020

While we've been working on a build that disables limbo resolution, it occurred to me that in your specific circumstances there's a potentially much easier workaround.

The core of the idea is to keep track of the latest version seen (from the blob or Firestore snapshots). When the app loses network connectivity, stop your listener. Then, when there's a chance there's a network connection, start the listener again from the latest version you've seen.

This works because when persistence is disabled, Firestore only caches active queries and pending writes. If you stop listening the client will forget about the query and any documents that matched it (that don't match some other active query). When you start the listener later there's no chance for limbo resolution because there are no cached documents to get into limbo.

A potential way to implement this would be to subscribe to offline and online events. In your example it would look something like this:

let unsubscribe = null;

function setupListener() {
  const db = firebase.firestore();
  unsubscribe = db.collection(targetCollection).onSnapshot(...);
}

function networkEventListener(e) {
  if (navigator.onLine) {
    // Back online: start listening again if we aren't already.
    if (unsubscribe == null) {
      setupListener();
    }
  } else {
    // Offline: stop the listener so the client forgets the query and its
    // cached documents, leaving nothing to get into limbo later.
    if (unsubscribe != null) {
      unsubscribe();
      unsubscribe = null;
    }
  }
}

window.addEventListener("online", networkEventListener);
window.addEventListener("offline", networkEventListener);

Instead of testing with the Disable Network/Enable Network button, change the Online state to Offline in the Network tab of Chrome's developer tools. Testing your example app this way shows that there's no longer any catastrophic limbo resolution loop. Of course, the example doesn't keep track of versions, so it re-downloads everything each iteration, but altering it to keep track of the latest version seen should be straightforward.
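
For instance, tracking the latest version inside the handler could look roughly like this (untested sketch; blobVersion, targetCollection, and the version field follow the earlier snippets, and unsubscribe is the variable from the code above):

let latestVersion = blobVersion; // start from the version baked into the Cloud Storage blob

function setupListener() {
  const db = firebase.firestore();
  unsubscribe = db
    .collection(targetCollection)
    .where("version", ">", latestVersion)
    .onSnapshot((snapshot) => {
      snapshot.docChanges().forEach((change) => {
        const v = change.doc.get("version");
        if (typeof v === "number" && v > latestVersion) {
          latestVersion = v; // remember the newest version we've seen
        }
        // ...apply the change to application state...
      });
    });
}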

Another signal you might use is the visibilitychange event, to make the listener run only when the current tab is active and has a network connection.
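
A rough sketch of that, reusing setupListener and unsubscribe from the code above:

document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "visible" && navigator.onLine) {
    if (unsubscribe == null) {
      setupListener(); // tab is active and online: resume listening
    }
  } else if (unsubscribe != null) {
    unsubscribe(); // tab hidden or offline: drop the listener and its cached query
    unsubscribe = null;
  }
});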

Let me know if this addresses the immediate problem. We'll continue working on making limbo resolution smarter, but making a build that disables limbo resolution has proven far more difficult than is really practical for a short term mitigation like this.

@djsweet (Author) commented Mar 11, 2020

@wilhuff My apologies for missing your replies up until this point.

It's interesting that you suggested shutting off network connectivity based on visibility. We actually did develop a mechanism for disabling the Firestore network connection when our app isn't visible, thinking it would be a valid mitigation for the issue. We don't re-establish listeners yet, but this may not be too hard for us.

Today, we hit an unexpected read loop for approximately an hour. This is the first time the issue hasn't persisted, so I think it's safe to say that our partial mitigation is working. We'll look into re-establishing listeners on visibility and network changes as well.

@wilhuff (Contributor) commented Mar 11, 2020

OK, that sounds promising. Let us know how that goes.

In the meantime we're getting closer to having a working throttler for Limbo resolution and are making progress on design changes that mitigate specific circumstances like yours.

I'm going to leave this open until we ship a throttler that avoids the loop you saw. If you see further issues related to this, don't hesitate to post more.

@dconeybe (Contributor) commented Sep 23, 2020

Limbo resolution throttling was implemented in all three client SDKs 6 months ago.

I think we just forgot to close this ticket. I'm closing it, but feel free to re-open or add a comment if the issue is not resolved by limbo resolution throttling.

firebase locked and limited conversation to collaborators Oct 24, 2020