If you drag an emoji family with a string size of 11 into an input with maxlength=10, one of the children will disappear.
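For reference, a minimal Python sketch of the arithmetic behind the post. It assumes the family emoji is man, woman, girl, boy joined by U+200D (the exact emoji in the demo may differ), and measures it in code points, UTF-16 code units, and UTF-8 bytes. The `truncate_utf16` helper is hypothetical: it only approximates what a browser might do when enforcing maxlength=10 and is not taken from any browser's source.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Assumed composition of the family emoji: man, woman, girl, boy.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])


def utf16_length(s: str) -> int:
    """Number of UTF-16 code units (what JavaScript's .length reports)."""
    return len(s.encode("utf-16-le")) // 2


def truncate_utf16(s: str, max_units: int) -> str:
    """Hypothetical helper: keep whole code points while they fit in max_units."""
    out = []
    used = 0
    for ch in s:
        units = 2 if ord(ch) > 0xFFFF else 1  # astral code points need a surrogate pair
        if used + units > max_units:
            break
        out.append(ch)
        used += units
    return "".join(out)


code_points = len(family)                 # 7
code_units = utf16_length(family)         # 11
utf8_bytes = len(family.encode("utf-8"))  # 25
trimmed = truncate_utf16(family, 10)      # the last family member no longer fits

print(code_points, code_units, utf8_bytes, utf16_length(trimmed))
```

In this sketch the trimmed string still ends with a dangling joiner; how a real browser treats that trailing code point is not something the sketch claims to know.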
@simevidas emoji family? Wtf? I thought skin colors/genders were the most Unicode could do.
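To illustrate what an "emoji family" is made of, here is a small Python sketch that lists the code points of one such sequence. The particular sequence (man, woman, girl, boy) is an assumption chosen for illustration; other family combinations are built the same way.

```python
import unicodedata

# Assumed example sequence: four person emoji glued together with U+200D.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"

# Look up the official Unicode name of every code point in the sequence.
names = [unicodedata.name(ch) for ch in family]
for ch, name in zip(family, names):
    print(f"U+{ord(ch):04X}  {name}")

# Splitting on the joiner recovers the individual family members.
people = family.split("\u200d")
print(len(people))
```

Fonts that know the sequence draw it as one picture; fonts that do not will typically show the four people side by side.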
Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers. I filed a WebKit bug: https://bugs.webkit.org/show_bug.cgi?id=252900
@jkt @simevidas In this case, Safari is the one that’s Unicode aware. The other browsers are treating maxlength as the number of bytes rather than the number of characters. 🙂
Following up on that, as I was thinking of some examples of what I mean... Take kanji, for example. 漢字 is 2 characters, but it's 6 bytes, so is the length 2 or 6? Or the phrase "Góða nótt" in Icelandic. It's 9 characters (counting the space in the middle), but it's 12 bytes. So, should this fail the maxlength check, if the maxlength is 10?
@ramsey @jkt @simevidas length is 2 characters, size is 6 bytes when encoded in utf8 I believe?
@f4grx @jkt @simevidas The size is always 6 bytes, but yes, when encoded in utf-8, the length is 2 characters.
@ramsey @jkt @simevidas bytes assume an encoding. Codepoints vs. grapheme clusters is the distinction in experience, I guess.
@johannes @jkt @simevidas I thought it would be the other way around. The same grouping of bytes could represent different codepoints, based on the encoding.
@ramsey @jkt @simevidas yes, but working on bytes means that the encoding has to be carried through the different layers and might cut utf-8 sequences apart (assuming utf-8 being the default encoding). With either codepoints or grapheme clusters you at least get some valid (while not always sensible) result.
Kinda wondering what the rules are: CodePoints, bytes? What if the page is UTF32 or ASCII? (Hopefully that insanity is gone)
@DevWouter @simevidas As I understand the spec, it’s “code units”, ie, 2-byte UTF-16 units, for historical or compatibility reasons probably. Wouldn’t make sense IMO if you started in a modern “codepoint” world. https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#attr-fe-maxlength
Thanks to your link I did some digging and I came to the same conclusion. It even says that JavaScript strings are UTF-16. However, a quick check in JavaScript on both Firefox and Safari shows the JS implementation is the same. Kinda weird that the HTML5 spec suggests UTF-8. (also mastodon counts 👩👩👧👧 as a single character)
@DevWouter @simevidas Yes, JavaScript strings have been UTF-16 since the beginning of time. I think that’s where many of the compatibility issues come from. The Go language, eg, has a more modern approach combining UTF-8 byte sequences and codepoints for characters (“runes”).
@DevWouter @simevidas From an end-user point of view, the only concept that would make sense as a measure of length IMO is what Unicode calls a “glyph”, ie, a sequence of code points that display or print as ONE visible symbol, ONE (possibly complex composite) emoji or ONE (possibly multiply accented) character.
@DevWouter @simevidas unfortunately, W3C defines “length” as UTF-16 code units. https://infra.spec.whatwg.org/#string-length So Safari’s behavior is technically wrong.
@chucker @DevWouter However, the spec defines maxlength both as a “length” and a “number of characters”, and “characters” is defined as code points, not code units. In this case the “length” is 11 and the “number of characters” is 7; the spec is malformed.
@chucker I feel quite confident that any correction will be towards the UTF-16 interpretation, for “compatibility”.
@ujay68 @simevidas WHATWG specs are less specs, and more guidelines. Browser developers have a moral obligation to break them now and then, especially when the spec says silly things like this. W3C specs SHOULD be respected, unless they're just a snapshot of a WHATWG spec, or you have a really compelling reason.
@wizzwizz4 “https://html.spec.whatwg.org/multipage/ is the current HTML standard. It obsoletes all other previously-published HTML specifications.” https://www.w3.org/html/
@simevidas seems to me that Safari's behaviour is correct -- but that it doesn't really matter as the maxlength attribute shouldn't be used to begin with, as it's trivially bypassed
@mkoek @simevidas That doesn't mean it shouldn't be used, it means it shouldn't be relied on. In the end the server should check everything, but limiting the characters in the input items itself gives more immediate feedback to the user. Clicking something like submit and getting errors afterwards is an inherently unsatisfying user experience
@mkoek @simevidas Disagree. maxlength should be used to help people avoid inputting strings that are too long. But you should never depend on web form validations and restrictions anywhere else, especially server-side. As you say, they are trivially bypassed. Saying not to use maxlength is like saying to not use dropdown menus because the user could alter the menu or the selected value before submitting the form.
@mkoek @simevidas This should not be a bug. This should be the default behavior for all browsers (and the server should check the length in the same way).
@mkoek @simevidas Safari *should* be correct, but the spec unfortunately does go by byte length (16-bit code units) rather than grapheme cluster count. https://infra.spec.whatwg.org/#string-length
@simevidas Hah! I guess https://bugs.webkit.org/show_bug.cgi?id=93196 (from 2012) can be closed now then.
A relevant spec issue seems to be https://github.com/whatwg/html/issues/7861
@simevidas naive implementations of substring will also do undesirable things like trim off the skin color modifier
@samueljohn @simevidas Yeah, this really should be a bug report against Chromium and friends.
@simevidas you could add a reference to https://infra.spec.whatwg.org/#string-length that specifies that the length of a string is the number of UTF-16 code units. (Alas, I personally would prefer that graphemes would be the length – disappearing children or others tend to surprise users)
@simevidas In Firefox, the attachment page (https://bug-252900-attachments.webkit.org/attachment.cgi?id=465151) looks wrong
@simevidas Paste any family emoji into the new Toot field and repeatedly hit backspace to delete people one at a time. 👨👩👧👧 👨👩👧 👨👩 👨
@hackerfriendly @simevidas ngl, that kind of fucked me up...you were not lying. 👀 Interesting.
@LevelUp @simevidas The secret is Zero-Width Joining. Works with most of these:
@hackerfriendly @simevidas there are more options than I'd initially thought. That's quite fascinating, thanks for sharing. :D
@simevidas Webkit is correct. HTML “characters” are Unicode code points, not UTF-16 code units, so the length is 7.
@jens @simevidas https://infra.spec.whatwg.org/#string-length seems to say the opposite (sigh)
@simevidas That is awesome, and it's the sort of thing I'd spend a week trying to figure out and then have the answer driving to the supermarket. Is there a web page or something describing this? I'd love to share it w/my coworkers, but I'm pretty sure Mastodon is firewalled.
@simevidas The alt is for describing the image, you can't follow links in it, so that's best in the post itself
@simevidas I guess this is where all the tech companies got their “family” ideas from.
@simevidas New emoji movie spinoff remake of home alone? They get to the input of the plane and find out they left Kevin
@simevidas Wait. So emoji are stored as characters, and specific strings of characters are reserved for translating to specific emoji?
@atatassault @simevidas emoji are stored as characters, and often as *combinations* of characters. For example, “black bird” combines “Bird” (itself an emoji), “Zero Width Joiner” and “Black Large Square” to form a new emoji. (Think of ZWJ as an addition operator.) Similarly, the family emoji just join different people together. This is also how skin color works. That way, they avoid having thousands more emoji characters.
@simevidas Very interesting! What program are you using in this screenshot? I love how the input is on one side and the output is on the other like this. (I'm sure there are many but this one looked cool.)
@simevidas that reminds me that at one point Facebook rendered families with skin tone modifiers on the members, but only if all members had the same modifiers.
Yes, an input field maxlength is supposed to count the glyphs but it counts the UTF-8 bytes instead. I hope this is a browser specific bug and that other browsers are correct.
@simevidas Imagine trying to explain the sequence of events that led to this to a victorian child "So we decided to organize our letters, right? But THEN we decided to put PICTURES in with the letters..."
@simevidas yes this is because that emoji is actually four emojis combined with the zero width joiner. You can see the same effect by backspacing over the emoji. This is an unfortunate effect of how that works.
@simevidas How can we use this to get a kid to move out of the house? Asking for a friend.
@simevidas Correct me if I’m wrong, but is the main difference between software development and web development that in the latter one no error is truly treated like an error? Like the opposite of -Werror? I feel like browsers are way too tolerant and just produce “some behavior”.
@simevidas OMG this is insane, I suppose next Unicode revision will require 10 billion bytes to hold the stars of the galaxy emoji?
@simevidas not sure if anyone mentioned that using Intl.Segmenter() will count emojis as a single “character” (not sure if that’s the correct unit to use here?). Wes Bos made a video demonstrating its usage: https://www.instagram.com/reel/CpAxbczN4yc/?igshid=NDk5N2NlZjQ=
@simevidas Don't understand any of the reasons, but I find it very funny...watched it several times, still tittering.
@simevidas What is this strange black magic you do? Looks like you’re performing Lego exorcisms.
@simevidas Biggest lies I’ve learned in my life: 1. Santa Claus exists.
@simevidas This was a very sad story.
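Several replies in the thread argue that length should count user-perceived characters, and one points to Intl.Segmenter for that. As a rough illustration of the idea in Python, the sketch below groups code points into approximate clusters. The `approx_clusters` function is hypothetical and deliberately simplified: it only handles zero width joiners, combining marks, the emoji variation selector, and skin tone modifiers, and it is not a full implementation of Unicode's grapheme cluster rules.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
VS16 = "\ufe0f"  # VARIATION SELECTOR-16 (emoji presentation)


def is_extender(ch: str) -> bool:
    """Code points that attach to the previous one in this simplified model."""
    return (
        ch == VS16
        or unicodedata.combining(ch) != 0     # combining accents
        or 0x1F3FB <= ord(ch) <= 0x1F3FF      # skin tone modifiers
    )


def approx_clusters(s: str) -> list[str]:
    """Hypothetical, simplified grouping into user-perceived characters."""
    clusters: list[str] = []
    join_next = False  # True right after a ZWJ
    for ch in s:
        if clusters and (join_next or ch == ZWJ or is_extender(ch)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = ch == ZWJ
    return clusters


family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
kanji = "\u6f22\u5b57"                # 漢字
icelandic = "G\u00f3\u00f0a n\u00f3tt"  # Góða nótt
decomposed = unicodedata.normalize("NFD", icelandic)

for label, text in [("family", family), ("kanji", kanji),
                    ("icelandic", icelandic), ("decomposed", decomposed)]:
    print(label, len(approx_clusters(text)), len(text), len(text.encode("utf-8")))
```

Under this model the family counts as one unit regardless of how many code points or bytes it occupies, which is the behavior the thread attributes to Safari. Production code would normally rely on a proper segmentation library rather than a hand-rolled rule set like this one.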