Kinda wondering what the rules are: code points? Bytes? What if the page is UTF-32 or ASCII? (Hopefully that insanity is gone)
Thanks to your link I did some digging and came to the same conclusion. It even says that JavaScript strings are UTF-16. However, a quick check in JavaScript on both Firefox and Safari shows the JS implementation is the same. Kinda weird that the HTML5 spec suggests UTF-8. (Also, Mastodon counts 👩‍👩‍👧‍👧 as a single character.)

@DevWouter @simevidas Yes, JavaScript strings have been UTF-16 since the beginning of time. I think that’s where many of the compatibility issues come from. The Go language, eg, has a more modern approach combining UTF-8 byte sequences and code points for characters (“runes”).

@DevWouter @simevidas From an end-user point of view, the only concept that would make sense as a measure of length IMO is what Unicode calls a “grapheme cluster”, ie, a sequence of code points that displays or prints as ONE visible symbol, ONE (possibly complex composite) emoji, or ONE (possibly multiply accented) character.

@DevWouter @simevidas Unfortunately, the W3C defines “length” as UTF-16 code units: https://infra.spec.whatwg.org/#string-length So Safari’s behavior is technically wrong.

@chucker @DevWouter However, the spec defines maxlength both as a “length” and as a “number of characters”, and “characters” is defined as code points, not code units. In this case the “length” is 11 and the “number of characters” is 7; the spec is malformed.

@chucker I feel quite confident that any correction will be towards the UTF-16 interpretation, for “compatibility”.
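To make the numbers in the thread concrete, here is a minimal sketch in plain JavaScript (not from any of the posts above; the variable names are illustrative) showing the three ways of counting the family emoji. It should run in any modern browser console or Node, though `Intl.Segmenter` may be missing in older engines.

```js
// The family emoji is a ZWJ sequence: four emoji code points joined by three zero-width joiners.
const family = "👩‍👩‍👧‍👧";

// UTF-16 code units, i.e. what String.prototype.length (and the Infra spec's "length") counts.
console.log(family.length);                   // 11

// Unicode code points, i.e. what iterating the string yields.
console.log([...family].length);              // 7

// Grapheme clusters, the "one visible symbol" notion; requires Intl.Segmenter.
const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
console.log([...seg.segment(family)].length); // 1
```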
@DevWouter @simevidas As I understand the spec, it’s “code units”, ie, 2-byte UTF-16 units, for historical or compatibility reasons probably. Wouldn’t make sense IMO if you started in a modern “codepoint” world. https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#attr-fe-maxlength
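To illustrate why the two readings of maxlength diverge, a hypothetical comparison (the limit of 10 is just an illustrative value, not something from the spec or the thread); which count a given browser actually enforces is exactly the open question here.

```js
// Compare the two candidate counts for the family emoji against an example limit.
const family = "👩‍👩‍👧‍👧";
const maxlength = 10; // illustrative value only

const byCodeUnits  = family.length;       // 11, the Infra spec's "length"
const byCodePoints = [...family].length;  // 7, the "number of characters" read as code points

console.log(byCodeUnits  <= maxlength);   // false: rejected under the code-unit reading
console.log(byCodePoints <= maxlength);   // true:  accepted under the code-point reading
```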