Šime Vidas

If you drag an emoji family with a string size of 11 into an input with maxlength=10, one of the children will disappear.
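
For readers wondering where the 11 comes from, here is a rough Python sketch (not browser code). It assumes the family is the man/woman/girl/boy sequence joined by zero width joiners, and it counts UTF-16 code units, which is what JavaScript's `.length` reports. `utf16_units` and `truncate_to_units` are hypothetical helper names, and the truncation rule (keep whole code points while they fit) is only one guess at what a browser might do.

```python
# Family emoji as code points: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def utf16_units(text: str) -> int:
    """Count UTF-16 code units (the quantity JavaScript's .length reports)."""
    return len(text.encode("utf-16-le")) // 2


def truncate_to_units(text: str, max_units: int) -> str:
    """Hypothetical helper: keep whole code points while they still fit."""
    kept = []
    used = 0
    for ch in text:
        cost = utf16_units(ch)
        if used + cost > max_units:
            break  # the next code point would exceed the limit
        kept.append(ch)
        used += cost
    return "".join(kept)


cut = truncate_to_units(FAMILY, 10)
print(utf16_units(FAMILY))  # 11 code units
print(len(FAMILY))          # 7 code points
print(utf16_units(cut))     # 9: the boy (2 code units) no longer fits
```

Under these assumptions the boy is the only member that does not fit, which matches what the post describes.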

135 comments
Owashii :corgi:🐾:therian:

@simevidas Is this the solution to overpopulation? :blobhyperthink:

Mein Hund :oh_no_bubble:

@simevidas
Obviously, though, it's „just“ the boy who disappears, right? 🤔

James Cocker

@simevidas 🤯 I thought this was a joke when I first saw it. Oh my.

Wilmhit until demonetized

@simevidas Emoji family? Wtf? I thought skin colors/genders were the most Unicode could do.

Šime Vidas

Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers.

I filed a WebKit bug: bugs.webkit.org/show_bug.cgi?i

Boo Ramsey 🧛🏻‍♂️🧟‍♂️👻🎃

@jkt @simevidas In this case, Safari is the one that’s Unicode aware. The other browsers are treating maxlength as the number of bytes rather than the number of characters. 🙂

Boo Ramsey 🧛🏻‍♂️🧟‍♂️👻🎃

@jkt @simevidas

Following up with that, as I was thinking of some examples of what I mean...

Take kanji, for example. 漢字 is 2 characters, but it's 6 bytes, so is the length 2 or 6?

Or the phrase "Góða nótt" in Icelandic. It's 9 characters (counting the space in the middle), but it's 12 bytes. So, should this fail the maxlength check, if the maxlength is 10?
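
The numbers in these two examples can be checked with a short Python sketch. `measure` is a hypothetical helper. The UTF-16 column is added for comparison only, since later replies in the thread point to code units as the quantity the spec uses.

```python
def measure(text: str) -> dict:
    """Return three different 'lengths' of the same string."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


KANJI = "\u6f22\u5b57"               # 漢字
GREETING = "G\u00f3\u00f0a n\u00f3tt"  # Góða nótt (precomposed letters)

for sample in (KANJI, GREETING):
    print(sample, measure(sample))
```

Both strings are longer in UTF-8 bytes than in code points, but in UTF-16 code units they come out at 2 and 9, so a limit of 10 counted in code units would accept either one.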

f4grx Sebastien (OLD ACCOUNT)

@ramsey @jkt @simevidas length is 2 characters, size is 6 bytes when encoded in utf8 I believe?

Boo Ramsey 🧛🏻‍♂️🧟‍♂️👻🎃

@f4grx @jkt @simevidas The size is always 6 bytes, but yes, when encoded in utf-8, the length is 2 characters.

Johannes ✔️

@ramsey @jkt @simevidas bytes assume an encoding. Codepoints vs. grapheme clusters is the distinction in experience, I guess.

Boo Ramsey 🧛🏻‍♂️🧟‍♂️👻🎃

@johannes @jkt @simevidas I thought it would be the other way around. The same grouping of bytes could represent different codepoints, based on the encoding.

Johannes ✔️

@ramsey @jkt @simevidas yes, but working on bytes means that the encoding has to be carried through the different layers, and it might cut UTF-8 sequences apart (assuming UTF-8 is the default encoding).

With either codepoints or grapheme clusters you at least get some valid (while not always sensible) result.
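
A small Python sketch of both points, using 漢字 as the sample. `decode_or_none` is a hypothetical helper, and Latin-1 is just an arbitrary second encoding chosen for illustration.

```python
from typing import Optional


def decode_or_none(raw: bytes, encoding: str = "utf-8") -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


TEXT = "\u6f22\u5b57"       # 漢字
RAW = TEXT.encode("utf-8")  # 6 bytes, 3 per character

# Cutting at a byte offset can land inside a multi-byte sequence.
print(decode_or_none(RAW[:4]))  # None: the second character is cut apart

# Cutting at a code point offset always leaves a decodable string.
print(TEXT[:1])

# The same bytes read under another encoding give different characters.
print(len(RAW.decode("latin-1")))  # 6 characters instead of 2
```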

DevWouter

@simevidas

Kinda wondering what the rules are: CodePoints, bytes? What if the page is UTF32 or ASCII? (Hopefully that insanity is gone)

John Ulrik

@DevWouter @simevidas As I understand the spec, it’s “code units”, ie, 2-byte UTF-16 units, for historical or compatibility reasons probably. Wouldn’t make sense IMO if you started in a modern “codepoint” world. html.spec.whatwg.org/multipage

DevWouter

@ujay68 @simevidas

Thanks to your link I did some digging and came to the same conclusion. It even says that JavaScript strings are UTF-16. However, a quick check in JavaScript on both Firefox and Safari shows that the JS implementation is the same.

Kinda weird that the HTML5 spec suggests UTF-8. (Also, Mastodon counts 👩‍👩‍👧‍👧 as a single character.)

John Ulrik

@DevWouter @simevidas Yes, JavaScript strings have been UTF-16 since the beginning of time. I think that’s where many of the compatibility issues come from. The Go language, eg, has a more modern approach combining UTF-8 byte sequences and codepoints for characters (“runes”).

John Ulrik

@DevWouter @simevidas From an end-user point of view, the only concept that would make sense as a measure of length IMO is what Unicode calls a “glyph”, ie, a sequence of code points that display or print as ONE visible symbol, ONE (possibly complex composite) emoji or ONE (possibly multiply accented) character.

Sören

@DevWouter @simevidas unfortunately, W3C defines “length” as UTF-16 code units. infra.spec.whatwg.org/#string-

So Safari’s behavior is technically wrong.

Jens Ayton

@chucker @DevWouter However, the spec defines maxlength both as a “length” and a “number of characters”, and “characters” is defined as code points, not code units. In this case the “length” is 11 and the “number of characters” is 7; the spec is malformed.

Jens Ayton

@chucker I feel quite confident that any correction will be towards the UTF-16 interpretation, for “compatibility”

John Ulrik

@simevidas While I find Safari’s behaviour more relatable for end users (how is one supposed to know that an emoji is not a single character?), the spec says that maxlength is to be measured in 16-bit “code units” (sigh): html.spec.whatwg.org/multipage Even if you tried to measure in Unicode “Codepoints”, that wouldn’t be intuitive for anyone who’s not a Unicode expert. AFAIK, the birdsite counts every emoji as a fixed number of characters (2?), independent of its technical representation.

wizzwizz4

@ujay68 @simevidas WHATWG specs are less specs, and more guidelines. Browser developers have a moral obligation to break them now and then, especially when the spec says silly things like this.

W3C specs SHOULD be respected, unless they're just a snapshot of a WHATWG spec, or you have a really compelling reason.

John Ulrik

@wizzwizz4 “html.spec.whatwg.org/multipage is the current HTML standard. It obsoletes all other previously-published HTML specifications.” w3.org/html/

Mark Koek

@simevidas seems to me that Safari's behaviour is correct -- but that it doesn't really matter as the maxlength attribute shouldn't be used to begin with, as it's trivially bypassed

Nordern

@mkoek @simevidas That doesn't mean it shouldn't be used, it means it shouldn't be relied on. In the end the server should check everything, but limiting the characters in the input items itself gives more immediate feedback to the user.

Clicking something like submit and getting errors afterwards is an inherently unsatisfying user experience

DELETED

@mkoek @simevidas Disagree. maxlength should be used to help people avoid inputting strings that are too long.

But you should never depend on web form validations and restrictions anywhere else, especially server-side. As you say, they are trivially bypassed.

Saying not to use maxlength is like saying to not use dropdown menus because the user could alter the menu or the selected value before submitting the form.

Vivien the Trumpeting Elephant

@mkoek @simevidas This should not be a bug. This should be the default behavior for all browsers (and the server should check the length in the same way).

Sören

@mkoek @simevidas Safari *should* be correct, but the spec unfortunately does go by byte length (16-bit code units) rather than grapheme cluster count. infra.spec.whatwg.org/#string-

Ryan Kennedy

@simevidas naive implementations of substring will also do undesirable things like trim off the skin color modifier
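
A minimal Python illustration of that effect, assuming a thumbs-up emoji followed by a medium skin tone modifier. `naive_prefix` is a hypothetical stand-in for any substring call that counts code points.

```python
# Thumbs up (U+1F44D) followed by a skin tone modifier (U+1F3FD).
THUMBS = "\U0001F44D\U0001F3FD"


def naive_prefix(text: str, count: int) -> str:
    """Keep the first `count` code points, ignoring what they mean together."""
    return text[:count]


trimmed = naive_prefix(THUMBS, 1)
print(len(THUMBS))              # 2 code points render as one symbol
print(trimmed == "\U0001F44D")  # True: the modifier was trimmed off
```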

Samuel

@simevidas from a user's perspective Safari is the only one doing it right.

ocdtrekkie

@samueljohn @simevidas Yeah, this really should be a bug report against Chromium and friends.

Andreas Hartl

@simevidas you could add a reference to infra.spec.whatwg.org/#string- that specifies that the length of a string is the number of UTF-16 code units.

(Alas, I personally would prefer that graphemes were the length; disappearing children or others tend to surprise users.)

Zau

@simevidas the term of the day is "Extended Grapheme Cluster"!

Juno Jove

@simevidas There's a buffer overrun hiding in there somewhere, I'm tellin' ya!

Rob Flickenger ⚡️

@simevidas Paste any family emoji into the new Toot field and repeatedly hit backspace to delete people one at a time. 👨‍👩‍👧‍👧 👨‍👩‍👧 👨‍👩 👨‍

Ivan Demchuk

@simevidas And in Firefox hitting Backspace deletes people one at a time

DELETED

@demivan @simevidas when the homophobic parents have a gay kid

Ludwig Behm

If you take his wife away too, the man suddenly looks old!

ge ricci

@simevidas Effect known as demographic control ;)

Jens Ayton

@simevidas Webkit is correct. HTML “characters” are Unicode code points, not UTF-16 code units, so the length is 7.

Lizard

@simevidas That is awesome, and it's the sort of thing I'd spend a week trying to figure out and then have the answer driving to the supermarket.

Lizard

Is there a web page or something describing this? I'd love to share it w/my coworkers, but I'm pretty sure Mastodon is firewalled.

casey is remote

@simevidas Can you generate a family with an arbitrary amount of children?

Joe Lanman

@simevidas The alt is for describing the image, you can't follow links in it, so that's best in the post itself

Henrik Brameus

@simevidas I guess this is where all the tech companies got their “family” ideas from.

Ashley Blewer

@simevidas the tenth plague of egypt, emoji edition

Rose (krrr...)

@simevidas

Well. Someone had to pay the price the fey demand for such magic...

M4lu

So this is how we are going to put an end to the Brazilian family.

John Francis

@simevidas had to be sold for scientific experiments

Joseph Richardson

@simevidas New emoji movie spinoff remake of home alone?

They get to the input of the plane and find out they left Kevin

AT-AT Assault :verifiedtrans:

@simevidas Wait. So emoji are stored as characters, and specific strings of characters are reserved for translating to specific emoji?

Sören

@atatassault @simevidas emoji are stored as characters, and often as *combinations* of characters. For example, “black bird” combines “Bird” (itself an emoji), ‍”Zero Width Joiner” and “Black Large Square” to form a new emoji. (Think of ZWJ as an addition operator.)

Similarly, the family emoji just join different people together. This is also how skin color works. That way, they avoid having thousands more emoji characters.
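
To look inside such a sequence, Python's standard `unicodedata` module can list the name of each code point. `describe` is a hypothetical helper. One caveat, to the best of my understanding: skin tone modifiers follow the base emoji directly instead of being attached with a ZWJ, so the last example contains no joiner.

```python
import unicodedata

ZWJ = "\u200D"

BLACK_BIRD = "\U0001F426" + ZWJ + "\u2B1B"
FAMILY = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
THUMBS_WITH_TONE = "\U0001F44D\U0001F3FD"  # base emoji + modifier, no ZWJ


def describe(text: str) -> list:
    """Return the Unicode name of every code point in the string."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in text]


print(describe(BLACK_BIRD))
print(describe(FAMILY))
print(ZWJ in THUMBS_WITH_TONE)  # False
```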

🇨🇦 Luvmykids 🧙‍♀️☕️🎨🐶

@simevidas Very interesting! What program are you using in this screenshot? I love how the input is on one side and the output is on the other like this. (I'm sure there are many but this one looked cool.)

gRuFtY

@simevidas I presume this is the Schrodinger family, then?

AlisonW

@simevidas
If the lengths were 4 and 3 it would feel more logical. 11 to 10 requires knowledge of emoji/utf construction.

MigMit

One more reason to hate emojis, I guess.

KyleDavidE

@simevidas that reminds me that at one point Facebook rendered families with skin tone modifiers on the members, but only if all members had the same modifiers.

nifela

@simevidas Some emojis (or Unicode characters in general) are actually a combination of multiple characters with a special "combine these" character between them (called a Zero Width Joiner).
This family is a combination of Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy.
When you backspace you first delete the Boy, then the ZWJ, then the girl...

Also, HTML (or JavaScript) text length does not count Unicode characters, but essentially the storage size they take up.
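
The backspace sequence described above can be imitated in Python by dropping one code point at a time. This is only a sketch: `backspace_steps` is a hypothetical helper, and real text fields differ in how much a single Backspace removes.

```python
ZWJ = "\u200D"
MAN, WOMAN, GIRL, BOY = "\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"
FAMILY = ZWJ.join([MAN, WOMAN, GIRL, BOY])


def backspace_steps(text: str) -> list:
    """Return the string after each simulated one-code-point Backspace."""
    steps = []
    while text:
        text = text[:-1]  # drop the last code point
        steps.append(text)
    return steps


steps = backspace_steps(FAMILY)
print(len(steps))  # 7 presses to clear 7 code points
print(steps[1] == ZWJ.join([MAN, WOMAN, GIRL]))  # True: boy and his ZWJ gone
```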

Suran

@nifela @simevidas

Yes, an input field maxlength is supposed to count the glyphs but it counts the UTF-8 bytes instead.

I hope this is a browser specific bug and that other browsers are correct.

hauntedhorns

@simevidas i mean it's definitely a type of character limit

Zunel

@simevidas Imagine trying to explain the sequence of events that led to this to a Victorian child

"So we decided to organize our letters, right? But THEN we decided to put PICTURES in with the letters..."

Renée

@simevidas this bug is some amazing Hemingway-level visual poetry.

mehulkar

@simevidas Whoa I didn’t even know they grouped like that

Philip J. Hollenback

@simevidas yes this is because that emoji is actually four emojis combined with the zero width joiner. You can see the same effect by backspacing over the emoji. This is an unfortunate effect of how that works.

Rob ☘️:d20:

@simevidas How can we use this to get a kid to move out of the house?

Asking for a friend.

Laura Hermanns

@simevidas Correct me if I’m wrong, but is the main difference between software development and web development that in the latter one no error is truly treated like an error? Like the opposite of -Werror? I feel like browsers are way too tolerant and just produce “some behavior”.

Yefim

@simevidas I knew the first child was the favorite

Christian V

@simevidas this is the future feminists want 🙄🙄🙄

PK Rockin

@simevidas I.. didn't know you could concatenate emojis like this? Interesting!

mmu_man

@simevidas OMG this is insane, I suppose the next Unicode revision will require 10 billion bytes to hold the stars of the galaxy emoji?

Chris Henrick

@simevidas not sure if anyone mentioned that using Intl.Segmenter() will count emojis as a single “character” (not sure if that’s the correct unit to use here?). Wes Bos made a video demonstrating its usage: instagram.com/reel/CpAxbczN4yc
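
`Intl.Segmenter` is a JavaScript API, so it cannot be shown in the Python used for the other sketches in this thread. What follows is a deliberately rough approximation of counting user-perceived characters. It is an assumption-laden simplification, not the Unicode segmentation algorithm: it only glues ZWJ sequences, combining marks, variation selectors and skin tone modifiers onto the preceding symbol, and it ignores other cases such as flag pairs. `is_extender` and `rough_grapheme_count` are hypothetical names.

```python
import unicodedata

ZWJ = "\u200D"


def is_extender(ch: str) -> bool:
    """True if this code point attaches to the one before it (simplified)."""
    return (
        ch == ZWJ
        or unicodedata.combining(ch) != 0
        or unicodedata.category(ch) in ("Mn", "Me")
        or 0x1F3FB <= ord(ch) <= 0x1F3FF  # skin tone modifiers
    )


def rough_grapheme_count(text: str) -> int:
    """Approximate the number of symbols a user would see."""
    count = 0
    prev = ""
    for ch in text:
        # Start a new symbol unless this code point extends or is joined.
        if prev == "" or not (is_extender(ch) or prev == ZWJ):
            count += 1
        prev = ch
    return count


FAMILY = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
print(rough_grapheme_count(FAMILY))                    # 1
print(rough_grapheme_count("G\u00f3\u00f0a n\u00f3tt"))  # 9
```

For anything beyond a demonstration, a library that implements the full segmentation rules would be the safer choice.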

Shield Maiden

@simevidas Don't understand any of the reasons, but I find it very funny...watched it several times, still tittering.

Jay

@simevidas I have 3 kids. Does it work the other way?

hybrid havoc

@simevidas And the rest of the family doesn’t even care or notice. That’s horrible.

Northbaybanter

@simevidas What is this strange black magic you do? Looks like you’re performing Lego exorcisms.

DELETED

@simevidas Biggest lies I’ve learned in my life:

1. Santa Claus exists.
2. Columbus discovered Americas.
3. Most applications support Unicode.
