
Streaming interface #215

Open · patrickt opened this issue Sep 24, 2019 · 7 comments

Comments

@patrickt
Contributor

Right now, all parsing assumes you have the entire source file paged into memory as a strict ByteString. I would love to see an option that incorporates streaming-bytestring, which is, in my experience, the best way to express lazily read byte streams:

parseStreaming :: Unmarshal t => Ptr TS.Language -> Streaming.ByteString.ByteString IO () -> IO (Either String t)
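
For concreteness, here is a minimal sketch of how this signature could be satisfied today by delegating to the existing strict entry point (I'm assuming a parseByteString-shaped function and the newer streaming-bytestring module naming, where the stream type is called ByteStream). Note that this is not yet incremental; the whole input is still materialised before parsing begins:

import           Foreign.Ptr (Ptr)
import qualified Streaming.ByteString as SB
import qualified TreeSitter.Language as TS
import           TreeSitter.Unmarshal (Unmarshal, parseByteString) -- names assumed

-- Minimal sketch: drain the stream, then delegate to the strict parser.
-- This matches the proposed signature, but the full input still ends up
-- resident in memory before parsing starts.
parseStreaming :: Unmarshal t => Ptr TS.Language -> SB.ByteStream IO () -> IO (Either String t)
parseStreaming language stream = do
  bytes <- SB.toStrict_ stream  -- O(n): concatenates the chunks into one buffer
  parseByteString language bytes

A genuinely streaming version would instead hand chunks to the parser as they arrive, which is where the trade-offs discussed below come in.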
@robrix
Member

robrix commented Sep 24, 2019

I’m generally in favour of streaming/incrementality, but I think there are a couple of challenges here we’ll have to overcome for this to actually result in any performance improvements:

  1. The resulting AST will be entirely resident in memory.
  2. Almost anything reported to a user will require arbitrary slices from the source.

@patrickt
Contributor Author

  1. Yeah, there’s not much we can do about that. At least we can have predictable resource consumption when ingesting streams of bytes.
  2. I have done something like this. What you do is copy the input bytestring, feeding the copy into the parser and keeping the original around from which to slice. (A sketch follows this list.)
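
Concretely, the trick looks something like the sketch below. Here feedParser is a hypothetical stand-in for whatever pushes one chunk into the parser, and the retained copy is accumulated strictly only to keep the example short:

import           Control.Monad.Trans.Class (lift)
import qualified Data.ByteString as B
import qualified Streaming.ByteString as SB
import qualified Streaming.Prelude as S

-- Hypothetical stand-in for feeding one chunk to the parser.
feedParser :: B.ByteString -> IO ()
feedParser _chunk = pure ()

-- Feed one copy of the stream to the parser chunk by chunk, while the
-- other copy is retained to slice from. SB.copy duplicates only the
-- stream's constructors; the underlying chunks are shared between the
-- two copies.
parseAndRetain :: SB.ByteStream IO () -> IO B.ByteString
parseAndRetain stream =
  SB.toStrict_                           -- inner layer: retain the bytes
    (S.mapM_ (lift . feedParser)         -- outer layer: feed the parser
             (SB.toChunks (SB.copy stream)))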

@robrix
Member

robrix commented Sep 24, 2019

> I have done something like this. What you do is copy the input bytestring, feeding the copy into the parser and keeping the original around from which to slice.

I think I must be misunderstanding you; wouldn’t that mean we’ve got the entire file in memory twice?

@patrickt
Contributor Author

Nope, copying a streaming bytestring is O(1). It doesn't copy the concrete chunks, only the constructors associated with them.
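
For reference, copy has this shape in streaming-bytestring (the stream type is named ByteString in older releases and ByteStream in newer ones):

copy :: Monad m => ByteStream m r -> ByteStream (ByteStream m) r

Each chunk is re-emitted at both layers, but the strict chunk payloads themselves are shared, so no bytes are duplicated.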

@robrix
Member

robrix commented Sep 25, 2019

Ah, ok; but we still end up having the whole thing paged into memory, right? Which kind of obviates the value of streaming, no?

@robrix
Member

robrix commented Sep 25, 2019

> Ah, ok; but we still end up having the whole thing paged into memory, right? Which kind of obviates the value of streaming, no?

Answering my own question: we don’t keep the whole thing paged in at once unless we slice the whole thing. We (presumably? I don’t actually know this for sure) lose O(1) slicing, but we limit space consumption for the common case of slicing only parts of the document out.

Losing O(1) slicing could still be a challenge, but it’s one we can probably work with since we don’t tend to traverse the tree random-access.
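
As a rough illustration, slicing a node's byte range out of a retained stream would look something like this sketch (SB is Streaming.ByteString; the helper name is hypothetical):

import           Data.Int (Int64)
import qualified Data.ByteString as B
import qualified Streaming.ByteString as SB

-- Slice [start, start + len) out of a retained stream. Reaching the
-- start offset is O(start), since every chunk before it must be
-- traversed, unlike O(1) take/drop on a strict ByteString.
sliceRange :: Monad m => Int64 -> Int64 -> SB.ByteStream m r -> m B.ByteString
sliceRange start len = SB.toStrict_ . SB.take len . SB.drop start

And since the stream is consumed as it's sliced, slicing in document order is the favourable case, which lines up with how we tend to walk the tree anyway.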

Plus we might also be able to take advantage of tree-sitter’s incrementality at some point, albeit at the cost of keeping the Tree around.

Ok, I think I have my head wrapped around this now; sounds like a plan!

@dcreager

In production, don't we already have a maximum file size that we're willing to process? I think that's true both when we receive blob content via a network RPC call and when we load the blob from disk. (/cc @tclem @joshvera to confirm) If so, the limit should be small enough that paging an entire copy of the file into memory isn't that big of an issue; we're not dealing with multi-GB data files here.
