Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve scalability of the Watch transform #18459

Open
kennknowles opened this issue Jun 3, 2022 · 1 comment
Open

Improve scalability of the Watch transform #18459

kennknowles opened this issue Jun 3, 2022 · 1 comment

Comments

@kennknowles
Copy link
Member

#3565 introduces the Watch transform http://s.apache.org/beam-watch-transform.

The implementation leaves several scalability-related TODOs:

  1. The state stores hashes and timestamps of outputs that have already been output and should be omitted from future polls. We could garbage-collect this state, e.g. dropping elements from "completed" and from addNewAsPending() if their timestamp is more than X behind the watermark.
  2. When a poll returns a huge number of elements, we don't necessarily have to add all of them into state.pending - instead we could add only N oldest elements and ignore others, relying on future poll rounds to provide them, in order to avoid blowing up the state. Combined with garbage collection of GrowthState.completed, this would make the transform scalable to very large poll results.

Imported from Jira BEAM-2680. Original Jira may contain additional context.
Reported by: jkff.

@zhengbuqian
Copy link
Contributor

This has caused a dataflow customer to experience error: too many hashes/timestamps stored causing this ValueState to exceed max allowed single value size limit of 80MB(https://cloud.google.com/dataflow/quotas#limits).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants