Šime Vidas

Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers.

I filed a WebKit bug: bugs.webkit.org/show_bug.cgi?i
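
(A minimal sketch of the three measures at play in this thread, assuming a JavaScript engine that ships Intl.Segmenter; the rainbow-flag emoji is just an illustrative example:)

    const s = '🏳️‍🌈';
    // UTF-16 code units — what String.prototype.length reports
    console.log(s.length);                    // 6
    // Unicode code points
    console.log([...s].length);               // 4
    // Grapheme clusters — what a user would call "one character"
    const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
    console.log([...seg.segment(s)].length);  // 1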

Ben Ramsey

@jkt @simevidas In this case, Safari is the one that’s Unicode aware. The other browsers are treating maxlength as the number of bytes rather than the number of characters. 🙂

Ben Ramsey

@jkt @simevidas

Following up with that, as I was thinking of some examples of what I mean...

Take kanji, for example. 漢字 is 2 characters, but it's 6 bytes, so is the length 2 or 6?

Or the phrase "Góða nótt" in Icelandic. It's 9 characters (counting the space in the middle), but it's 12 bytes. So, should this fail the maxlength check, if the maxlength is 10?
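
(Those numbers check out in any JavaScript console, assuming precomposed characters and a UTF-8 TextEncoder:)

    const enc = new TextEncoder();                 // encodes to UTF-8
    console.log([...'漢字'].length);               // 2 characters
    console.log(enc.encode('漢字').length);        // 6 bytes
    console.log([...'Góða nótt'].length);          // 9 characters
    console.log(enc.encode('Góða nótt').length);   // 12 bytes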

f4grx Sebastien (OLD ACCOUNT)

@ramsey @jkt @simevidas length is 2 characters, size is 6 bytes when encoded in utf8 I believe?

Ben Ramsey

@f4grx @jkt @simevidas The size is always 6 bytes, but yes, when encoded in utf-8, the length is 2 characters.

Johannes ✔️

@ramsey @jkt @simevidas bytes assume an encoding. Codepoints vs. grapheme clusters is the distinction in experience, I guess.

Ben Ramsey

@johannes @jkt @simevidas I thought it would be the other way around. The same grouping of bytes could represent different codepoints, based on the encoding.

Johannes ✔️

@ramsey @jkt @simevidas yes, but working on bytes means that the encoding has to be carried through the different layers and might cut UTF-8 sequences apart (assuming UTF-8 being the default encoding)

With either codepoints or grapheme clusters you at least get some valid (while not always sensible) result.
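
(A small illustration of that point, assuming TextEncoder/TextDecoder are available; the truncation offset is arbitrary:)

    const bytes = new TextEncoder().encode('漢字');            // 6 bytes: E6 BC A2 E5 AD 97
    // Cutting at an arbitrary byte boundary can split a character in half...
    console.log(new TextDecoder().decode(bytes.slice(0, 4)));  // "漢�" — broken sequence
    // ...whereas cutting by code points always yields valid, if shortened, text
    console.log([...'漢字'].slice(0, 1).join(''));             // "漢"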

DevWouter

@simevidas

Kinda wondering what the rules are: code points, bytes? What if the page is UTF-32 or ASCII? (Hopefully that insanity is gone)

John Ulrik

@DevWouter @simevidas As I understand the spec, it’s “code units”, ie, 2-byte UTF-16 units, for historical or compatibility reasons probably. Wouldn’t make sense IMO if you started in a modern “codepoint” world. html.spec.whatwg.org/multipage

DevWouter

@ujay68 @simevidas

Thanks to your link I did some digging and came to the same conclusion. It even says that JavaScript strings are UTF-16. However, a quick check in JavaScript on both Firefox and Safari shows that the JS behaviour is the same.

Kinda weird that the HTML5 spec suggests UTF-8. (Also, Mastodon counts 👩‍👩‍👧‍👧 as a single character.)

John Ulrik

@DevWouter @simevidas Yes, JavaScript strings have been UTF-16 since the beginning of time. I think that’s where many of the compatibility issues come from. The Go language, eg, has a more modern approach combining UTF-8 byte sequences and codepoints for characters (“runes”).
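
(The UTF-16 nature of JS strings is easy to observe in any engine; the grinning-face emoji here is just an example:)

    const s = '😀';                              // U+1F600, outside the BMP
    console.log(s.length);                       // 2 — stored as a surrogate pair
    console.log(s.charCodeAt(0).toString(16));   // "d83d" — high surrogate
    console.log(s.charCodeAt(1).toString(16));   // "de00" — low surrogate
    console.log(s.codePointAt(0).toString(16));  // "1f600" — the actual code point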

John Ulrik

@DevWouter @simevidas From an end-user point of view, the only concept that would make sense as a measure of length IMO is what Unicode calls a “glyph”, ie, a sequence of code points that display or print as ONE visible symbol, ONE (possibly complex composite) emoji or ONE (possibly multiply accented) character.

Sören

@DevWouter @simevidas unfortunately, W3C defines “length” as UTF-16 code units. infra.spec.whatwg.org/#string-

So Safari’s behavior is technically wrong.

Jens Ayton

@chucker @DevWouter However, the spec defines maxlength both as a “length” and a “number of characters”, and “characters” is defined as code points, not code units. In this case the “length” is 11 and the “number of characters” is 7; the spec is malformed.
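
(Those two counts are reproducible, assuming the emoji in question is the 👩‍👩‍👧‍👧 mentioned upthread and Intl.Segmenter is available:)

    const s = '👩‍👩‍👧‍👧';                       // 4 emoji code points joined by 3 zero-width joiners
    console.log(s.length);                       // 11 — UTF-16 code units ("length")
    console.log([...s].length);                  // 7  — code points ("characters")
    const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
    console.log([...seg.segment(s)].length);     // 1  — grapheme clusters, what Safari appears to count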

Jens Ayton

@chucker I feel quite confident that any correction will be towards the UTF-16 interpretation, for “compatibility”

John Ulrik

@simevidas While I find Safari’s behaviour more relatable for end users (how is one supposed to know that an emoji is not a single character?) the spec says that maxlength is to be measured in 16-bit “code units” (sigh): html.spec.whatwg.org/multipage Even if you tried to measure in Unicode “codepoints”, that wouldn’t be intuitive for anyone who’s not a Unicode expert. AFAIK, the birdsite counts every emoji as a fixed number of characters (2?), independent of its technical representation.

wizzwizz4

@ujay68 @simevidas WHATWG specs are less specs, and more guidelines. Browser developers have a moral obligation to break them now and then, especially when the spec says silly things like this.

W3C specs SHOULD be respected, unless they're just a snapshot of a WHATWG spec, or you have a really compelling reason.

John Ulrik

@wizzwizz4 “html.spec.whatwg.org/multipage is the current HTML standard. It obsoletes all other previously-published HTML specifications.” w3.org/html/

Mark Koek

@simevidas seems to me that Safari's behaviour is correct -- but that it doesn't really matter as the maxlength attribute shouldn't be used to begin with, as it's trivially bypassed

Nordern

@mkoek @simevidas That doesn't mean it shouldn't be used, it means it shouldn't be relied on. In the end the server should check everything, but limiting the characters in the input field itself gives more immediate feedback to the user.

Clicking something like submit and getting errors afterwards is an inherently unsatisfying user experience

m. libby

@mkoek @simevidas Disagree. maxlength should be used to help people avoid inputting strings that are too long.

But you should never depend on web form validations; enforce the restrictions elsewhere as well, especially server-side. As you say, they are trivially bypassed.

Saying not to use maxlength is like saying to not use dropdown menus because the user could alter the menu or the selected value before submitting the form.

Vivien the Trumpeting Elephant

@mkoek @simevidas This should not be a bug. This should be the default behavior for all browsers (and the server should check the length in the same way).

Sören

@mkoek @simevidas Safari *should* be correct, but the spec unfortunately does go by byte length (16-bit code units) rather than grapheme cluster count. infra.spec.whatwg.org/#string-

Ryan Kennedy

@simevidas naive implementations of substring will also do undesirable things like trim off the skin color modifier
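
(A sketch of that failure mode, assuming Intl.Segmenter is available; the medium-skin-tone thumbs-up is just an example:)

    const s = '👍🏽';                              // U+1F44D + skin tone modifier U+1F3FD = 4 code units
    // A naive code-unit substring drops the modifier:
    console.log(s.slice(0, 2));                  // "👍"
    // Grapheme-aware truncation keeps the cluster intact:
    const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
    const first = [...seg.segment(s)].map(x => x.segment).slice(0, 1).join('');
    console.log(first);                          // "👍🏽"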

Samuel

@simevidas from a user's perspective Safari is the only one doing it right.

ocdtrekkie

@samueljohn @simevidas Yeah, this really should be a bug report against Chromium and friends.

Andreas Hartl

@simevidas you could add a reference to infra.spec.whatwg.org/#string- that specifies that the length of a string is the number of UTF-16 code units.

(Alas, I personally would prefer that graphemes were the length – disappearing children and the like tend to surprise users)

Zau

@simevidas the term of the day is "Extended Grapheme Cluster"!

Juno Jove

@simevidas There's a buffer overrun hiding in there somewhere, I'm tellin' ya!
