Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: [position, codePoint] pairs #1

Closed
RReverser opened this issue Oct 20, 2017 · 18 comments
Closed

Discussion: [position, codePoint] pairs #1

RReverser opened this issue Oct 20, 2017 · 18 comments
Assignees

Comments

@RReverser
Copy link
Collaborator

In some cases it would be useful to know the current position within the string as you go through the codepoints (for example, to store it and use later for slicing or error reporting).

To achieve this, .codePoints() iterator could yield pairs [position, codePoint] instead of just codePoint.

The obvious downside is that this would be inconsistent with the default chars iterator.

On the other hand, with regular chars iterator, this can be done in a more or less obvious manner already (you can sum up char.length as you go through the string), and, if not, we can add an extra method to yield [position, char] pairs too in future if required.

Thoughts?

@bathos
Copy link

bathos commented Nov 20, 2017

This seems like it may potentially be at odds with the efficiency goals — if you don’t want to call codePointAt for each char, you probably also don’t want to allocate an array for each char — though I don’t know if that might get optimized away somehow by engines. In any case, it seems like it can be achieved on top of what is proposed easily, such that making it the value removes an option, while not making it the value does not:

for (const [ index, cp ] of Uint32Array.from(str.codepoints()).entries()) {
    // ...
}

@RReverser
Copy link
Collaborator Author

@bathos

though I don’t know if that might get optimized away somehow by engines

Small objects are usually not a problem for modern engines, especially with a known static shape like here.

it seems like it can be achieved on top of what is proposed easily

No, that is not the same. What you're suggesting is just indexing of codepoints as if they are a flat array, and that is already pretty easily achievable with some extra i++ counter. What we need for lexers etc. is the actual position of codepoint within the string, which might be either 1 or 2 code units further than the previous one.

@bathos
Copy link

bathos commented Nov 20, 2017

Ah, you’re right, I took "position" above to mean position among codepoints, not position by code units. This still seems easy to achieve with an i counter, given the nature of ES strings:

let i = 0;

for (const cp of str.codepoints()) {
    ...
    i += cp >> 16 === 0 ? 1 : 2;
}

Though I can see how this is not so ergonomic.

@RReverser
Copy link
Collaborator Author

Though I can see how this is not so ergonomic.

Yes, that's exactly what we want to avoid. Manual comparisons of code points can already give iteration over entire string without codePoints helper, but they're not very clean and reusable.

@mathiasbynens
Copy link
Member

mathiasbynens commented Nov 29, 2017

Rather than returning an array of the form [position, codePoints], it should return an object { position, codePoints } because:

  1. it’s more efficient in implementations by default (by avoiding the implicit iterator in array destructuring)
  2. it allows users to choose the order
  3. it’s more futureproof; we can add new properties without worrying about the order

@RReverser
Copy link
Collaborator Author

RReverser commented Nov 29, 2017

@mathiasbynens

it enables implementation optimizations

But that wasn't a concern for all the other entries iterators such as Set, Array, Map, ...? Given how popular that pattern is in the spec, I'd expect engines to optimise small tuples of different types into structs, and in this case both elements are even same type.

These existing precedents are also the reason why I thought [pos, ch] would be better for consistency, even though personally I'd prefer objects returned from all these iterators.

@mathiasbynens
Copy link
Member

Array destructuring uses the array iterator, so doing

for (const [position, codePoints] of string.fooBar()) {
  [position, codePoints]
}

Is like doing an implicit for-of within the explicit for-of.

Object destructuring doesn’t have that overhead.

@michaelficarra
Copy link
Member

michaelficarra commented Nov 29, 2017

@mathiasbynens I'm sure that the intermediate object and usage of the iteration protocol can be optimised away given a smart enough JS engine. Still, the usability point around confusing ordering and future-proofing concern remains. Additionally, this interface allows a type system to catch more errors.

@RReverser I don't think the inconsistency with other entries functions will be a problem. We already have inconsistency with built-in methods returning arrays/iterators (Object.keys / Map.prototype.keys).

@mathiasbynens
Copy link
Member

I'm sure that the intermediate object and usage of the iteration protocol can be optimised away given a smart enough JS engine.

True, but not having to optimize it away in the first place is infinitely better :)

@mathiasbynens
Copy link
Member

@bakkot would be interested in having an isValid property for each code point object. Lone surrogates (i.e. code points that are not scalar values) would have isValid: false.

@michaelficarra
Copy link
Member

If a surrogate appears at all in the iterator results, it was either lone or out of order, meaning isValid is false. Otherwise, isValid would be true. Since it's easily derived from the code point itself, I don't think isValid is necessary.

for (const { position, codePoint } of string.codePointEntries()) {
  let isValid = codePoint >= 0xD800 && codePoint < 0xE000;
}

Sorry for bringing out the slippery slope argument, but adding this paves the way for including other Unicode properties (ID_Continue, Script, etc.) in the iterator result, which I think is inappropriate.

@mathiasbynens
Copy link
Member

@michaelficarra’s comment reflects my personal opinion, as I told @bakkot offline. Initially @bakkot wanted the iterator to throw when a lone code point was reached, which I disagreed with even more strongly.

Especially when writing a parser, you’d want to get to the lone surrogate instead of erroring out.

@mathiasbynens
Copy link
Member

If we’re going for an object anyway, we could even add the string representation of the code point to the result. { codePoint, symbol, position } One iterator to rule them all!

@RReverser
Copy link
Collaborator Author

I don't think the inconsistency with other entries functions will be a problem. We already have inconsistency with built-in methods returning arrays/iterators (Object.keys / Map.prototype.keys).

I was rather referring to inconsistency with all new entries iterators - Array::entries, Map::entries, Set::entries, even recently added Object.entries all return [index, value] pairs, so I would expect engines optimising them all in the same way for purposes of destructuring (not to mention developers who might be already used to it and confused with an inconsistent representation).

@RReverser
Copy link
Collaborator Author

TL;DR - personally I like the object approach for readability / extensibility reasons, but personal preferences aside, I'm seriously concerned about introducing an inconsistent API when so many examples of the other one already exist.

@bathos
Copy link

bathos commented Nov 30, 2017

In the existing "tuple" examples, member 0 is a "key". Since that isn’t quite what position means, I don’t think it’s inconsistent. In fact, usage of [] was why I originally misunderstood the intention of this thread.


Edit: I had highlighted Set as an exception to the "key" case, but I guess it depends on how you look at it.

@domenic
Copy link
Member

domenic commented May 11, 2018

I came to this proposal and was immediately surprised by the { codePoint } syntax instead of the [index, value] pairs. I found the arguments in #1 (comment) uncompelling, especially the idea that it's "infinitely better" not to have to implement escape analysis.

HOWEVER, I changed my mind upon realizing that (per #1 (comment)) the position here is not an index, but instead a position. Given that, I think the current design is more appropriate. In other words, I'd expect that if I see for (const [x, y] of z) that x will increase by one each time; since that's not the case here, it makes sense to use { position, codePoint } instead of [x, y] pairs.

I would suggest documenting this design choice in the readme for future people like me who come here with similar questions.

@littledan
Copy link
Member

673b43e causes additional allocations, with a potential performance cost, which @gsathya raised as a concern in #9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants