Niki Tonsky

New post — all things Unicode!

One of the most interesting pieces of research I've had to do. Enjoy!

tonsky.me/blog/unicode/

26 comments
Hugo 雨果

@nikitonsky One line reads:

> Any pure ASCII text is also a valid UTF-8 text, and vice versa.

The "vice versa" part doesn't sound right. Not every valid UTF-8 text is valid ASCII.
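A minimal sketch of the asymmetry, using only Python's built-in codecs. The helper name `is_valid` and the sample strings are made up for illustration; they do not come from the article.

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample data: one pure-ASCII text, one text with a non-ASCII letter.
ascii_bytes = "hello".encode("ascii")
utf8_bytes = "h\u00e9llo".encode("utf-8")  # "héllo"; é becomes two bytes, 0xC3 0xA9

# ASCII bytes are always valid UTF-8 ...
ascii_is_utf8 = is_valid(ascii_bytes, "utf-8")
# ... but UTF-8 bytes are valid ASCII only when every byte is below 0x80.
utf8_is_ascii = is_valid(utf8_bytes, "ascii")

print(ascii_is_utf8, utf8_is_ascii)
```

So the statement holds in one direction only: ASCII is a subset of UTF-8, not the other way around.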

Hugo 雨果

@nikitonsky Huh, now I understand why all the Chinese subtitles I find online use GBK encoding. UTF-8 is a poor choice for subtitle files because different characters share the same code point.

liilliil 🇫🇯🇱🇨🇱🇧

@nikitonsky
> Cyrillic Lowercase K and Bulgarian Lowercase K

They are both Cyrillic, just drawn in different styles

Niki Tonsky

@liilliil says who? Look, a and а are written exactly the same, and both are Latin, yet their codes are different

liilliil 🇫🇯🇱🇨🇱🇧

@nikitonsky in your example the codes are identical

Niki Tonsky

@liilliil well, that's exactly the point! Why did one language get its own codes and the other didn't?

liilliil 🇫🇯🇱🇨🇱🇧

@nikitonsky the point is that you called the Russian one "Cyrillic" and the Bulgarian one "Bulgarian". They are both Cyrillic

Ivan Habunek

@nikitonsky Elixir also gets this right. Even though my terminal renders an extra space after the emoji.

Niki Tonsky

@ihabunek nice! Does it come from Erlang, or is this Elixir-specific?

Ivan Habunek

@nikitonsky Just had a look, and it simply calls Erlang.

Erlang implementation is here:
github.com/erlang/otp/blob/016

Erlang has a script which generates the Unicode util code from the Unicode spec. Macros would have been nice for that. :)
github.com/erlang/otp/blob/mas

Lynn «Кофеман»

@nikitonsky just made this screenshot today. Shame on you Zoom 😅

Niki Tonsky

In case you were wondering why cursors

Charan

@nikitonsky did your site go down for a few moments? I was seeing the nginx error.

Felix Niklas

@nikitonsky fun effect! I had to remove the DOM element though, to be able to focus on the content.

­

@nikitonsky
> I make sure it works very well

Nick Matthews

@nikitonsky I'll admit, I was curious. Was quite confusing at first, but devtools let me hide them quickly, and then confirmed the websockets by looking at the source. For me though, this was incredibly distracting, and made reading your post significantly harder until I turned them off.

Philipp Defner

@nikitonsky Maybe I shouldn’t be, but I was surprised how boring the HN crowd was in these comments. Usually quirky “old web” is much appreciated over there.

Björn Ebbinghaus

@nikitonsky
The cursors were too distracting for me, so I turned on the reader mode. Unfortunately, the dark mode of the reader broke the images.

Perhaps this could be a future topic for your blog. "Content that gets consumed in different ways. A11y, color schemes, reader mode, on b/w epaper, tiny screens, as audio, etc."

I feel like this is an important topic, that's often overlooked. And I guess you could have some insight, given your work on UI.

Der Teilweise

@nikitonsky Great article.

There’s another place UTF-16 is unlikely to be removed from: USB.

In places the spec says “UTF-16LE”, but it also says “The Unicode Standard, Version 1.1, is the newest version of the Unicode™ Standard.” and “The character content and encoding of Unicode 1.1 is thus identical to that of the ISO/IEC 10646-1 UCS-2 (the two-octet form).”

usb.org/sites/default/files/hu (Issue date January 26, 2023!)
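Why the UCS-2 versus UTF-16LE wording matters can be sketched in a few lines of Python. This is only an illustration of the two encodings, not anything taken from the USB spec; the sample characters and the helper `fits_in_ucs2` are assumptions made for the example.

```python
# Hypothetical sample characters, written as escapes to avoid ambiguity.
bmp_char = "\u00e9"          # é, inside the Basic Multilingual Plane
astral_char = "\U0001F600"   # an emoji, outside the Basic Multilingual Plane

# UTF-16LE uses 2-byte code units; count how many units each character needs.
bmp_units = len(bmp_char.encode("utf-16-le")) // 2
astral_units = len(astral_char.encode("utf-16-le")) // 2


def fits_in_ucs2(text: str) -> bool:
    """Simplified check: UCS-2 has exactly one 16-bit unit per character,
    so anything above U+FFFF cannot be represented at all."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(bmp_units, astral_units)
print(fits_in_ucs2(bmp_char), fits_in_ucs2(astral_char))
```

A consumer that treats the data as two-octet UCS-2 would see the surrogate pair as two separate units, while a UTF-16LE consumer would see one character.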

Jonah

@nikitonsky I'd note that the advice to use grapheme clusters as the atomic unit of strings is dangerous. First, because the clusters are ill-defined (as you note); second, because using them to e.g. parse a message can cause security issues!

For example, imagine a JSON string:
"string"
and use a combining char at the start:
"ˇstring"
(typing on mobile, the hacek is not actually combining)
Now it's not a string anymore. This is an error, but it's possible to imagine an injection attack.

Niki Tonsky

@vjon how is this an injection attack? It’s just an invalid string

Jonah

@nikitonsky Sorry, message length limits.

The security issue is potential. In JSON, it's not possible to construct an attack this way -- at least not with a strict parser. In another format, it might turn into an actual injection attack.

But a spurious syntax error in trusted data is an issue on its own, because the string is actually valid. A parser based on grapheme clusters would reject it incorrectly. That means using grapheme clusters to parse JSON, or any other textual format, is not possible.
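A rough sketch of that failure mode in Python. The function `naive_clusters` is a made-up approximation that only glues combining marks onto the preceding character; it is not the full Unicode grapheme segmentation algorithm. A real combining caron (U+030C) is used in place of the spacing hacek typed above.

```python
import json
import unicodedata


def naive_clusters(text: str) -> list[str]:
    """Hypothetical, simplified clustering: attach each combining mark
    to the character before it. Real segmentation has many more rules."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


COMBINING_CARON = "\u030c"
doc = '"' + COMBINING_CARON + 'string"'

# A code-point-based parser accepts the document: the string value simply
# starts with a combining mark.
parsed = json.loads(doc)

# A cluster-based tokenizer sees the opening quote fused with the caron,
# so its first "character" is no longer a plain quote.
first_cluster = naive_clusters(doc)[0]

print(repr(parsed), repr(first_cluster), first_cluster == '"')
```

The document is valid JSON at the code point level, but a tokenizer working on clusters never sees a bare opening quote and would reject it.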
