Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers.
I filed a WebKit bug: https://bugs.webkit.org/show_bug.cgi?id=252900
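For context, the discrepancy can be reproduced from script. A minimal sketch in plain JavaScript, assuming the ZWJ family emoji 👩‍👩‍👧‍👧 and a browser that supports Intl.Segmenter:

```js
// Three ways to "count" the same emoji string.
const value = "👩‍👩‍👧‍👧"; // woman + ZWJ + woman + ZWJ + girl + ZWJ + girl

// UTF-16 code units: what String.prototype.length reports,
// and what the HTML spec's maxlength is defined against
console.log(value.length); // 11

// Unicode code points
console.log([...value].length); // 7

// Grapheme clusters: what a user perceives as a single emoji
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(value)].length); // 1
```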
@jkt @simevidas In this case, Safari is the one that’s Unicode aware. The other browsers are treating maxlength as the number of bytes rather than the number of characters. 🙂

Following up on that, as I was thinking of some examples of what I mean... Take kanji, for example. 漢字 is 2 characters, but it's 6 bytes, so is the length 2 or 6? Or the phrase "Góða nótt" in Icelandic. It's 9 characters (counting the space in the middle), but it's 12 bytes. So, should this fail the maxlength check if the maxlength is 10?

@ramsey @jkt @simevidas Length is 2 characters, size is 6 bytes when encoded in UTF-8, I believe?

@f4grx @jkt @simevidas The size is always 6 bytes, but yes, when encoded in UTF-8, the length is 2 characters.

@ramsey @jkt @simevidas Bytes assume an encoding. Codepoints vs. grapheme clusters is the distinction in experience, I guess.

@johannes @jkt @simevidas I thought it would be the other way around. The same grouping of bytes could represent different codepoints, based on the encoding.

@ramsey @jkt @simevidas Yes, but working on bytes means that the encoding has to be carried through the different layers and might cut UTF-8 sequences apart (assuming UTF-8 is the default encoding). With either codepoints or grapheme clusters you at least get some valid (while not always sensible) result.

Kinda wondering what the rules are: codepoints, bytes? What if the page is UTF-32 or ASCII? (Hopefully that insanity is gone.)

@DevWouter @simevidas As I understand the spec, it’s “code units”, i.e. 2-byte UTF-16 units, for historical or compatibility reasons probably. Wouldn’t make sense IMO if you started in a modern “codepoint” world. https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#attr-fe-maxlength

Thanks to your link I did some digging and I came to the same conclusion. It even says that JavaScript strings are UTF-16. However, a quick check in JavaScript on both Firefox and Safari shows that the JS implementation is the same. Kinda weird that the HTML5 spec suggests UTF-8. (Also, Mastodon counts 👩👩👧👧 as a single character.)

@DevWouter @simevidas Yes, JavaScript strings have been UTF-16 since the beginning of time. I think that’s where many of the compatibility issues come from. The Go language, e.g., has a more modern approach combining UTF-8 byte sequences and codepoints for characters (“runes”).

@DevWouter @simevidas From an end-user point of view, the only concept that would make sense as a measure of length IMO is what Unicode calls a “glyph”, i.e. a sequence of code points that display or print as ONE visible symbol, ONE (possibly complex composite) emoji, or ONE (possibly multiply accented) character.

@DevWouter @simevidas Unfortunately, W3C defines “length” as UTF-16 code units: https://infra.spec.whatwg.org/#string-length So Safari’s behavior is technically wrong.

@chucker @DevWouter However, the spec defines maxlength both as a “length” and a “number of characters”, and “characters” is defined as code points, not code units. In this case the “length” is 11 and the “number of characters” is 7; the spec is malformed.

@chucker I feel quite confident that any correction will be towards the UTF-16 interpretation, for “compatibility”.

@ujay68 @simevidas WHATWG specs are less specs, and more guidelines. Browser developers have a moral obligation to break them now and then, especially when the spec says silly things like this. W3C specs SHOULD be respected, unless they're just a snapshot of a WHATWG spec, or you have a really compelling reason.
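A small sketch of the byte-versus-character counts discussed above, assuming UTF-8 as the byte encoding (TextEncoder always produces UTF-8):

```js
const encoder = new TextEncoder(); // always encodes to UTF-8

const kanji = "漢字";
console.log([...kanji].length);            // 2 characters (code points)
console.log(encoder.encode(kanji).length); // 6 bytes

const phrase = "Góða nótt";
console.log([...phrase].length);            // 9 characters
console.log(encoder.encode(phrase).length); // 12 bytes
```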
@wizzwizz4 “https://html.spec.whatwg.org/multipage/ is the current HTML standard. It obsoletes all other previously-published HTML specifications.” https://www.w3.org/html/

@simevidas Seems to me that Safari's behaviour is correct -- but that it doesn't really matter, as the maxlength attribute shouldn't be used to begin with: it's trivially bypassed.

@mkoek @simevidas That doesn't mean it shouldn't be used; it means it shouldn't be relied on. In the end the server should check everything, but limiting the characters in the input itself gives more immediate feedback to the user. Clicking something like submit and getting errors afterwards is an inherently unsatisfying user experience.

@mkoek @simevidas Disagree. maxlength should be used to help people avoid inputting strings that are too long. But you should never depend on web form validations and restrictions anywhere else, especially server-side. As you say, they are trivially bypassed. Saying not to use maxlength is like saying not to use dropdown menus because the user could alter the menu or the selected value before submitting the form.

@mkoek @simevidas This should not be a bug. This should be the default behavior for all browsers (and the server should check the length in the same way).

@mkoek @simevidas Safari *should* be correct, but the spec unfortunately does go by byte length (16-bit code units) rather than grapheme cluster count. https://infra.spec.whatwg.org/#string-length

@simevidas Hah! I guess https://bugs.webkit.org/show_bug.cgi?id=93196 (from 2012) can be closed now then. A relevant spec issue seems to be https://github.com/whatwg/html/issues/7861

@simevidas Naive implementations of substring will also do undesirable things like trim off the skin color modifier.

@samueljohn @simevidas Yeah, this really should be a bug report against Chromium and friends.

@simevidas You could add a reference to https://infra.spec.whatwg.org/#string-length which specifies that the length of a string is the number of UTF-16 code units. (Alas, I personally would prefer that graphemes were the length – disappearing children or others tend to surprise users.)

@simevidas In Firefox, the attachment page (https://bug-252900-attachments.webkit.org/attachment.cgi?id=465151) looks wrong.
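On the substring point a few posts up, a short sketch (the emoji and slice indices are only illustrative) of how slicing by UTF-16 code units can strip a skin-tone modifier or cut through a surrogate pair:

```js
const wave = "👋🏽"; // U+1F44B waving hand + U+1F3FD medium skin tone: 4 code units

console.log(wave.length);      // 4
console.log(wave.slice(0, 2)); // "👋"  (the skin tone modifier is trimmed off)
console.log(wave.slice(0, 3)); // "👋\ud83c"  (ends in a lone surrogate)
```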
@simevidas dang not unicode aware :')