
Streaming interface #215

Open · patrickt opened this issue Sep 24, 2019 · 7 comments

Comments

@patrickt
Contributor

Right now, all parsing assumes you have the entire source file paged into memory as a strict ByteString. I would love to see an option that incorporates streaming-bytestring, which is, in my experience, the best way to express lazily read byte streams:

parseStreaming :: Unmarshal t => Ptr TS.Language -> Streaming.ByteString.ByteString IO () -> IO (Either String t)
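
For concreteness, here is a minimal sketch of how this signature could be satisfied today by delegating to the existing strict entry point (I'm assuming a parseByteString-shaped function and the newer streaming-bytestring module naming, where the stream type is called ByteStream). Note that this is not yet incremental; the whole input is still materialised before parsing begins:

import           Foreign.Ptr (Ptr)
import qualified Streaming.ByteString as SB
import qualified TreeSitter.Language as TS
import           TreeSitter.Unmarshal (Unmarshal, parseByteString) -- names assumed

-- Minimal sketch: drain the stream, then delegate to the strict parser.
-- This matches the proposed signature, but the full input still ends up
-- resident in memory before parsing starts.
parseStreaming :: Unmarshal t => Ptr TS.Language -> SB.ByteStream IO () -> IO (Either String t)
parseStreaming language stream = do
  bytes <- SB.toStrict_ stream  -- O(n): concatenates the chunks into one buffer
  parseByteString language bytes

A genuinely streaming version would instead hand chunks to the parser as they arrive, which is where the trade-offs discussed below come in.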
@robrix
Member

robrix commented Sep 24, 2019

I’m generally in favour of streaming/incrementality, but I think there are a couple of challenges here we’ll have to overcome for this to actually result in any performance improvements:

  1. The resulting AST will be entirely resident in memory.
  2. Almost anything reported to a user will require arbitrary slices from the source.

@patrickt
Contributor Author

  1. Yeah, there’s not much we can do about that. At least we can have predictable resource consumption when ingesting streams of bytes.
  2. I have done something like this. What you do is copy the input bytestring, feeding the copy into the parser and keeping the original around from which to slice. (A sketch follows this list.)
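
Concretely, the trick looks something like the sketch below. Here feedParser is a hypothetical stand-in for whatever pushes one chunk into the parser, and the retained copy is accumulated strictly only to keep the example short:

import           Control.Monad.Trans.Class (lift)
import qualified Data.ByteString as B
import qualified Streaming.ByteString as SB
import qualified Streaming.Prelude as S

-- Hypothetical stand-in for feeding one chunk to the parser.
feedParser :: B.ByteString -> IO ()
feedParser _chunk = pure ()

-- Feed one copy of the stream to the parser chunk by chunk, while the
-- other copy is retained to slice from. SB.copy duplicates only the
-- stream's constructors; the underlying chunks are shared between the
-- two copies.
parseAndRetain :: SB.ByteStream IO () -> IO B.ByteString
parseAndRetain stream =
  SB.toStrict_                           -- inner layer: retain the bytes
    (S.mapM_ (lift . feedParser)         -- outer layer: feed the parser
             (SB.toChunks (SB.copy stream)))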

@robrix
Member

robrix commented Sep 24, 2019

> I have done something like this. What you do is copy the input bytestring, feeding the copy into the parser and keeping the original around from which to slice.

I think I must be misunderstanding you; wouldn’t that mean we’ve got the entire file in memory twice?

@patrickt
Contributor Author

Nope, copying a streaming bytestring is O(1). It doesn't copy the concrete chunks, only the constructors associated with them.
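
For reference, copy has this shape in streaming-bytestring (the stream type is named ByteString in older releases and ByteStream in newer ones):

copy :: Monad m => ByteStream m r -> ByteStream (ByteStream m) r

Each chunk is re-emitted at both layers, but the strict chunk payloads themselves are shared, so no bytes are duplicated.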

@robrix
Member

robrix commented Sep 25, 2019

Ah, ok; but we still end up having the whole thing paged into memory, right? Which kind of obviates the value of streaming, no?

@robrix
Member

robrix commented Sep 25, 2019

> Ah, ok; but we still end up having the whole thing paged into memory, right? Which kind of obviates the value of streaming, no?

Answering my own question: we don’t keep the whole thing paged in at once unless we slice the whole thing. We (presumably? I don’t actually know this for sure) lose O(1) slicing, but we limit space consumption for the common case of slicing only parts of the document out.

Losing O(1) slicing could still be a challenge, but it’s one we can probably work with since we don’t tend to traverse the tree random-access.
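
As a rough illustration, slicing a node's byte range out of a retained stream would look something like this sketch (SB is Streaming.ByteString; the helper name is hypothetical):

import           Data.Int (Int64)
import qualified Data.ByteString as B
import qualified Streaming.ByteString as SB

-- Slice [start, start + len) out of a retained stream. Reaching the
-- start offset is O(start), since every chunk before it must be
-- traversed, unlike O(1) take/drop on a strict ByteString.
sliceRange :: Monad m => Int64 -> Int64 -> SB.ByteStream m r -> m B.ByteString
sliceRange start len = SB.toStrict_ . SB.take len . SB.drop start

And since the stream is consumed as it's sliced, slicing in document order is the favourable case, which lines up with how we tend to walk the tree anyway.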

Plus we might also be able to take advantage of tree-sitter’s incrementality at some point, albeit at the cost of keeping the Tree around.

Ok, I think I have my head wrapped around this now; sounds like a plan!

@dcreager

In production, don't we already have a maximum file size that we're willing to process? I think that's true both when we receive blob content via a network RPC call and when we load the blob from disk. (/cc @tclem @joshvera to confirm) If so, the limit should be small enough that paging an entire copy of the file into memory isn't that big of an issue; we're not dealing with multi-GB data files here.
