@simevidas dang not unicode aware :')


Top-level

Jonathan Kingston

24 Feb 2023 at 14:50 | Open on mastodon.social

7 comments

Ben Ramsey

@jkt @simevidas In this case, Safari is the one that’s Unicode aware. The other browsers are treating maxlength as the number of bytes rather than the number of characters. 🙂

24 Feb 2023 at 19:57 | Open on phpc.social

Ben Ramsey

@jkt @simevidas

Following up with that, as I was thinking of some examples of what I mean...

Take kanji, for example. 漢字 is 2 characters, but it's 6 bytes, so is the length 2 or 6?

Or the phrase "Góða nótt" in Icelandic. It's 9 characters (counting the space in the middle), but it's 12 bytes. So, should this fail the maxlength check, if the maxlength is 10?

24 Feb 2023 at 20:05 | Open on phpc.social
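
The counts quoted in the post above can be checked with a short Python sketch. The helper name `count_units` is made up for illustration. It reports three of the possible "lengths" of a string: Unicode code points, UTF-8 bytes, and UTF-16 code units (the unit that JavaScript's string `length` counts). Grapheme clusters are left out because Python's standard library has no built-in segmenter for them.

```python
def count_units(text: str) -> dict:
    """Return the length of `text` under three different definitions."""
    return {
        # Python's len() on a str counts Unicode code points.
        "code_points": len(text),
        # Byte count depends on the encoding; UTF-8 is assumed here.
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; "-le" avoids a byte-order mark.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


if __name__ == "__main__":
    for sample in ("漢字", "Góða nótt", "🙂"):
        print(sample, count_units(sample))
```

For 漢字 this gives 2 code points and 6 UTF-8 bytes, and for "Góða nótt" 9 code points and 12 UTF-8 bytes, which matches the numbers in the post. The emoji is the case where code points (1) and UTF-16 code units (2) disagree.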

f4grx Sebastien (OLD ACCOUNT)

@ramsey @jkt @simevidas length is 2 characters, size is 6 bytes when encoded in utf8 I believe?

24 Feb 2023 at 21:00 | Open on mastodon.social

Ben Ramsey

@f4grx @jkt @simevidas The size is always 6 bytes, but yes, when encoded in utf-8, the length is 2 characters.

24 Feb 2023 at 21:06 | Open on phpc.social

Johannes ✔️

@ramsey @jkt @simevidas bytes assume an encoding. Codepoints vs. grapheme clusters is the distinction in experience, I guess.

24 Feb 2023 at 21:02 | Open on det.social

Ben Ramsey

@johannes @jkt @simevidas I thought it would be the other way around. The same grouping of bytes could represent different codepoints, based on the encoding.

24 Feb 2023 at 21:05 | Open on phpc.social

Johannes ✔️

@ramsey @jkt @simevidas yes, but working on bytes means that the encoding has to be carried through the different layers and might cut utf-8 sequences apart (assuming utf-8 being the default encoding)

With either codepoints or grapheme clusters you at least get some valid (while not always sensible) result.

24 Feb 2023 at 21:09 | Open on det.social
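
The point about cutting UTF-8 sequences apart can be illustrated with a minimal Python sketch. The helper names `truncate_bytes` and `truncate_code_points` are hypothetical and do not describe how any browser implements `maxlength`. A byte-level cut can land in the middle of a multi-byte sequence and leave invalid UTF-8. A code-point cut always yields a valid string, although it may still split what a reader sees as one character.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Cut the UTF-8 encoding of `text` after `limit` bytes."""
    return text.encode("utf-8")[:limit]


def truncate_code_points(text: str, limit: int) -> str:
    """Cut `text` after `limit` Unicode code points."""
    return text[:limit]


if __name__ == "__main__":
    # 漢字 is 6 bytes in UTF-8 (3 per character); cutting at 4 bytes
    # leaves one whole character plus a dangling lead byte.
    cut = truncate_bytes("漢字", 4)
    try:
        cut.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("byte cut is not valid UTF-8:", exc.reason)

    # Cutting by code points keeps the result decodable.
    print(truncate_code_points("漢字", 1))

    # Valid but not always sensible: "e" followed by a combining acute
    # accent is two code points, so a cut at 1 drops the accent.
    print(truncate_code_points("e\u0301", 1))
```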
