Discussion: [position, codePoint] pairs #1

RReverser · 2017-10-20T00:19:47Z

In some cases it would be useful to know the current position within the string as you go through the code points (for example, to store it and use it later for slicing or error reporting).

To achieve this, the .codePoints() iterator could yield [position, codePoint] pairs instead of just the code point.

The obvious downside is that this would be inconsistent with the default chars iterator.

On the other hand, with the regular chars iterator this can already be done fairly easily (you can sum up char.length as you go through the string). If that turns out not to be enough, we can add an extra method that yields [position, char] pairs in the future.
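A minimal sketch of the "sum up char.length" approach using only the existing string iterator. The helper name `charsWithPositions` is made up for illustration and is not part of any proposal:

```javascript
// Hypothetical helper: pairs each symbol from the default string iterator
// with its code unit offset in the original string.
function* charsWithPositions(str) {
  let position = 0;
  for (const char of str) {
    yield [position, char];
    // A symbol outside the BMP occupies two code units, so length is 1 or 2.
    position += char.length;
  }
}

const pairs = [...charsWithPositions('a\u{1F4A9}b')];
// pairs: [[0, 'a'], [1, '\u{1F4A9}'], [3, 'b']]
```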

Thoughts?


bathos · 2017-11-20T19:40:01Z

This seems like it may potentially be at odds with the efficiency goals — if you don’t want to call codePointAt for each char, you probably also don’t want to allocate an array for each char — though I don’t know if that might get optimized away somehow by engines. In any case, it seems like it can be achieved on top of what is proposed easily, such that making it the value removes an option, while not making it the value does not:

for (const [ index, cp ] of Uint32Array.from(str.codePoints()).entries()) {
    // ...
}

RReverser · 2017-11-20T20:42:01Z

@bathos

> though I don’t know if that might get optimized away somehow by engines

Small objects are usually not a problem for modern engines, especially with a known static shape like here.

> it seems like it can be achieved on top of what is proposed easily

No, that is not the same. What you're suggesting just indexes the code points as if they were a flat array, which is already easy to achieve with an extra i++ counter. What we need for lexers etc. is the actual position of the code point within the string, which might be either 1 or 2 code units further than the previous one.
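To illustrate the distinction drawn here, the following sketch compares the index among code points with the position in code units for the same string. It relies only on the existing string iterator; the variable names are illustrative:

```javascript
const str = 'a\u{1F600}b'; // U+1F600 needs a surrogate pair (two code units)

// Index among code points: 0, 1, 2 (what a plain i++ counter gives).
const codePointIndices = Array.from(str).map((_, i) => i);

// Position in code units: 0, 1, 3 (what slicing or an error offset needs).
const codeUnitPositions = [];
let pos = 0;
for (const ch of str) {
  codeUnitPositions.push(pos);
  pos += ch.length;
}

// Only the code unit position can be used to slice the original string.
const tail = str.slice(codeUnitPositions[2]); // 'b'
```

Slicing with the code point index instead (`str.slice(2)`) would start in the middle of the surrogate pair.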

bathos · 2017-11-20T22:46:44Z

Ah, you’re right, I took "position" above to mean position among codepoints, not position by code units. This still seems easy to achieve with an i counter, given the nature of ES strings:

let i = 0;

for (const cp of str.codePoints()) {
    // ...
    i += (cp >> 16) === 0 ? 1 : 2; // code points above U+FFFF take two code units
}

Though I can see how this is not so ergonomic.
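For comparison, the counter can be wrapped in a reusable generator. This is only a sketch: `codePointPositions` is a hypothetical name, and because `codePoints()` is still just proposed, it falls back on the default string iterator plus `codePointAt(0)`:

```javascript
// Hypothetical helper; String.prototype.codePoints is only a proposal,
// so this sketch uses the default string iterator plus codePointAt(0).
function* codePointPositions(str) {
  let i = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    yield [i, cp];
    // Code points above U+FFFF take two UTF-16 code units.
    i += cp > 0xFFFF ? 2 : 1;
  }
}

const result = [...codePointPositions('x\u{10000}y')];
// result: [[0, 0x78], [1, 0x10000], [3, 0x79]]
```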

RReverser · 2017-11-20T23:40:31Z

> Though I can see how this is not so ergonomic.

Yes, that's exactly what we want to avoid. Manual comparisons of code points can already be used to iterate over an entire string without a codePoints helper, but the resulting code is neither clean nor reusable.

mathiasbynens · 2017-11-29T20:11:53Z

Rather than returning an array of the form [position, codePoint], it should return an object { position, codePoint } because:

- it’s more efficient in implementations by default (by avoiding the implicit iterator in array destructuring)
- it allows users to choose the order
- it’s more futureproof; we can add new properties without worrying about the order

RReverser · 2017-11-29T20:42:21Z

@mathiasbynens

> it enables implementation optimizations

But that wasn't a concern for all the other entries iterators such as Set, Array, Map, ...? Given how popular that pattern is in the spec, I'd expect engines to optimise small tuples of different types into structs, and in this case both elements are even the same type.

These existing precedents are also the reason why I thought [pos, ch] would be better for consistency, even though personally I'd prefer objects returned from all these iterators.

mathiasbynens · 2017-11-29T22:30:38Z

Array destructuring uses the array iterator, so doing

for (const [position, codePoints] of string.fooBar()) {
  [position, codePoints]
}

is like doing an implicit for-of within the explicit for-of.

Object destructuring doesn’t have that overhead.
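The difference can be observed at the language level with a small experiment. This sketch overrides `Symbol.iterator` on one array to count how often it is requested; it shows the specified semantics only and says nothing about what engines optimise away in practice:

```javascript
let iteratorCalls = 0;

// A pair whose own iterator counts how often it is requested.
const pair = [0, 97];
pair[Symbol.iterator] = function () {
  iteratorCalls += 1;
  return Array.prototype[Symbol.iterator].call(this);
};

const [position, codePoint] = pair; // goes through pair[Symbol.iterator]()
const afterArrayPattern = iteratorCalls;

const { 0: p, 1: cp } = pair;       // plain property reads, no iterator
const afterObjectPattern = iteratorCalls;
```

After both statements the counter has been incremented exactly once, by the array pattern.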

michaelficarra · 2017-11-29T23:53:01Z

@mathiasbynens I'm sure that the intermediate object and usage of the iteration protocol can be optimised away given a smart enough JS engine. Still, the usability point around confusing ordering and future-proofing concern remains. Additionally, this interface allows a type system to catch more errors.

@RReverser I don't think the inconsistency with other entries functions will be a problem. We already have inconsistency with built-in methods returning arrays/iterators (Object.keys / Map.prototype.keys).

mathiasbynens · 2017-11-30T00:45:07Z

> I'm sure that the intermediate object and usage of the iteration protocol can be optimised away given a smart enough JS engine.

True, but not having to optimize it away in the first place is infinitely better :)

mathiasbynens · 2017-11-30T00:48:34Z

@bakkot would be interested in having an isValid property for each code point object. Lone surrogates (i.e. code points that are not scalar values) would have isValid: false.

michaelficarra · 2017-11-30T01:02:39Z

If a surrogate appears at all in the iterator results, it was either lone or out of order, meaning isValid is false. Otherwise, isValid would be true. Since it's easily derived from the code point itself, I don't think isValid is necessary.

for (const { position, codePoint } of string.codePointEntries()) {
  let isValid = codePoint < 0xD800 || codePoint >= 0xE000; // false only for surrogates
}

Sorry for bringing out the slippery slope argument, but adding this paves the way for including other Unicode properties (ID_Continue, Script, etc.) in the iterator result, which I think is inappropriate.
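As a sketch of how a caller could derive validity without a dedicated property, assuming iteration pairs up well-formed surrogates the way the existing string iterator does (`isScalarValue` is a hypothetical helper name):

```javascript
// Hypothetical helper: true for Unicode scalar values, false for surrogates.
function isScalarValue(codePoint) {
  return codePoint < 0xD800 || codePoint > 0xDFFF;
}

// The default string iterator pairs up well-formed surrogates, so any
// surrogate it yields on its own is lone or out of order.
const flags = Array.from(
  'a\uD83D\uDE00\uD83Db',
  ch => isScalarValue(ch.codePointAt(0))
);
// flags: [true, true, false, true]
```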

mathiasbynens · 2017-11-30T03:21:10Z

@michaelficarra’s comment reflects my personal opinion, as I told @bakkot offline. Initially @bakkot wanted the iterator to throw when a lone surrogate was reached, which I disagreed with even more strongly.

Especially when writing a parser, you’d want to get to the lone surrogate instead of erroring out.

mathiasbynens · 2017-11-30T03:31:43Z

If we’re going for an object anyway, we could even add the string representation of the code point to the result: { codePoint, symbol, position }. One iterator to rule them all!
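A rough sketch of what such a result shape could look like, built on the existing string iterator. The generator name `entriesWithSymbol` is hypothetical and this is not the proposal's implementation:

```javascript
// Hypothetical generator producing the { codePoint, symbol, position } shape.
function* entriesWithSymbol(str) {
  let position = 0;
  for (const symbol of str) {
    yield { codePoint: symbol.codePointAt(0), symbol, position };
    position += symbol.length; // 1 or 2 code units per symbol
  }
}

const entries = [...entriesWithSymbol('\u{1F600}!')];
// entries[0]: { codePoint: 0x1F600, symbol: '\u{1F600}', position: 0 }
// entries[1]: { codePoint: 0x21, symbol: '!', position: 2 }
```

Note that `symbol` is also recoverable from `codePoint` via `String.fromCodePoint`, so including it would be a convenience rather than new information.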

RReverser · 2017-11-30T17:08:45Z

> I don't think the inconsistency with other entries functions will be a problem. We already have inconsistency with built-in methods returning arrays/iterators (Object.keys / Map.prototype.keys).

I was rather referring to inconsistency with all the newer entries iterators: Array::entries, Map::entries, Set::entries, and even the recently added Object.entries all return [index, value] pairs, so I would expect engines to optimise them all in the same way for the purposes of destructuring (not to mention developers who might already be used to that shape and would be confused by an inconsistent representation).

RReverser · 2017-11-30T17:31:04Z

TL;DR - personally I like the object approach for readability / extensibility reasons, but personal preferences aside, I'm seriously concerned about introducing an inconsistent API when so many examples of the other one already exist.

bathos · 2017-11-30T17:50:28Z

In the existing "tuple" examples, member 0 is a "key". Since that isn’t quite what position means, I don’t think it’s inconsistent. In fact, usage of [] was why I originally misunderstood the intention of this thread.

Edit: I had highlighted Set as an exception to the "key" case, but I guess it depends on how you look at it.

domenic · 2018-05-11T16:28:15Z

I came to this proposal and was immediately surprised by the { codePoint } syntax instead of the [index, value] pairs. I found the arguments in #1 (comment) uncompelling, especially the idea that it's "infinitely better" not to have to implement escape analysis.

HOWEVER, I changed my mind upon realizing that (per #1 (comment)) the position here is not an index, but instead a position. Given that, I think the current design is more appropriate. In other words, I'd expect that if I see for (const [x, y] of z) that x will increase by one each time; since that's not the case here, it makes sense to use { position, codePoint } instead of [x, y] pairs.

I would suggest documenting this design choice in the readme for future people like me who come here with similar questions.

littledan · 2018-06-24T13:05:49Z

673b43e causes additional allocations, with a potential performance cost, which @gsathya raised as a concern in #9.

zloirock mentioned this issue Dec 8, 2017

core-js@3 zloirock/core-js#325

Merged

RReverser closed this as completed in 673b43e May 8, 2018

michaelficarra mentioned this issue May 12, 2018

Integer versus string representation of code points #6

Closed

mathiasbynens reopened this May 18, 2018

mathiasbynens assigned RReverser May 18, 2018

RReverser closed this as completed in 50c13a9 May 19, 2018
