Niki Tonsky

Niki Tonsky

New post — all things Unicode!

One of the most interesting pieces of research I've had to do. Enjoy!

https://tonsky.me/blog/unicode/

2 Oct 2023 at 9:30 | Open on mastodon.online

26 comments

Hugo 雨果

@nikitonsky One line reads:

> Any pure ASCII text is also a valid UTF-8 text, and vice versa.

The "vice versa" part doesn't sound right: not every valid UTF-8 text is valid ASCII.
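A minimal sketch of that asymmetry in Python, using only the built-in codecs. The sample strings and the helper name `is_valid_ascii` are illustrative assumptions, not something from the article.

```python
# Pure ASCII bytes are also valid UTF-8: the UTF-8 codec decodes them unchanged.
ascii_bytes = "Hello, world".encode("ascii")
ascii_as_utf8 = ascii_bytes.decode("utf-8")

# The reverse breaks as soon as the text contains a non-ASCII character.
utf8_bytes = "na\u00efve".encode("utf-8")  # U+00EF takes two bytes in UTF-8


def is_valid_ascii(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as 7-bit ASCII (hypothetical helper)."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_ascii(ascii_bytes))
print(is_valid_ascii(utf8_bytes))
```

So ASCII is a subset of UTF-8, but not the other way around.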

2 Oct 2023 at 9:55 | Open on fosstodon.org

Niki Tonsky

@whynothugo True, thanks!

2 Oct 2023 at 11:27 | Open on mastodon.online

shikanoko nokonoko

@nikitonsky the other cursors freaked me out huhuh

2 Oct 2023 at 10:24 | Open on emacs.ch

Hugo 雨果

@nikitonsky Huh, I now understand why all the Chinese subtitles I find online use GBK encoding. UTF-8 is a poor choice for subtitle files because, with Han unification, different regional variants of a character share the same code point.

2 Oct 2023 at 10:29 | Open on fosstodon.org

liilliil 🇫🇯🇱🇨🇱🇧

@nikitonsky
>Cyrillic Lowercase K and Bulgarian Lowercase K

They're both Cyrillic, just in different letterforms

2 Oct 2023 at 12:19 | Open on mastodon.online

Niki Tonsky

@liilliil says who? Look, a and а are written exactly the same, and both are Latin, yet their codes are different
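For reference, the two look-alike letters in this reply can be inspected with Python's `unicodedata` module. This is only a quick check; the variable names are assumptions, and the letters are written as escapes so they cannot be confused.

```python
import unicodedata

latin_a = "\u0061"     # the first letter in the comment
cyrillic_a = "\u0430"  # the second, visually identical letter

# Print each code point and its official Unicode character name.
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The glyphs match, but the code points (and the scripts they belong to) do not.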

2 Oct 2023 at 13:10 | Open on mastodon.online

liilliil 🇫🇯🇱🇨🇱🇧

@nikitonsky in your example the codes are identical

2 Oct 2023 at 13:12 | Open on mastodon.online

Niki Tonsky

@liilliil well, that's exactly the point! Why did one language get its own codes and the other didn't?

2 Oct 2023 at 14:04 | Open on mastodon.online

liilliil 🇫🇯🇱🇨🇱🇧

@nikitonsky the point is that you called the Russian one "Cyrillic" and the Bulgarian one "Bulgarian". They're both Cyrillic

2 Oct 2023 at 14:06 | Open on mastodon.online

Ivan Habunek

@nikitonsky Elixir also gets this right. Even though my terminal renders an extra space after the emoji.

2 Oct 2023 at 14:16 | Open on mastodon.social

Niki Tonsky

@ihabunek nice! Does it come from Erlang, or is this Elixir specifically?

2 Oct 2023 at 16:25 | Open on mastodon.online

Ivan Habunek

@nikitonsky Just had a look, and it simply calls Erlang.

The Erlang implementation is here:
https://github.com/erlang/otp/blob/0164d3db05739fc1fad67ac1f5bf3e2aea15cd45/lib/stdlib/src/string.erl#L147

Erlang has a script which generates the Unicode util code from the Unicode spec. Macros would have been nice for that. :)
https://github.com/erlang/otp/blob/master/lib/stdlib/uc_spec/gen_unicode_mod.escript

2 Oct 2023 at 17:28 | Open on mastodon.social

Lynn «Кофеман»

@nikitonsky just made this screenshot today. Shame on you Zoom 😅

2 Oct 2023 at 16:25 | Open on mas.to

Niki Tonsky

@alexeyten excellent example!

2 Oct 2023 at 16:25 | Open on mastodon.online

Niki Tonsky

In case you were wondering why cursors

2 Oct 2023 at 16:26 | Open on mastodon.online

Charan

@nikitonsky did your site go down for a few moments? I was seeing the nginx error.

2 Oct 2023 at 16:49 | Open on hachyderm.io

Felix Niklas

@nikitonsky fun effect! I had to remove the DOM element though, to be able to focus on the content.

2 Oct 2023 at 17:06 | Open on mastodon.social

@nikitonsky
> I make sure it works very well

2 Oct 2023 at 17:58 | Open on lor.sh

Nick Matthews

@nikitonsky I'll admit, I was curious. Was quite confusing at first, but devtools let me hide them quickly, and then confirmed the websockets by looking at the source. For me though, this was incredibly distracting, and made reading your post significantly harder until I turned them off.

2 Oct 2023 at 18:58 | Open on fosstodon.org

Philipp Defner

@nikitonsky Maybe I shouldn’t be, but I was surprised how boring the HN crowd was in these comments. Usually quirky “old web” is much appreciated over there.

3 Oct 2023 at 12:49 | Open on mastodon.social

Björn Ebbinghaus

@nikitonsky
The cursors were too distracting for me, so I turned on the reader mode. Unfortunately, the dark mode of the reader broke the images.

Perhaps this could be a future topic for your blog. "Content that gets consumed in different ways. A11y, color schemes, reader mode, on b/w epaper, tiny screens, as audio, etc."

I feel like this is an important topic that's often overlooked. And I guess you could have some insight, given your work on UI.

4 Oct 2023 at 10:18 | Open on fosstodon.org

Thomas Lavergne

@nikitonsky thanks Niki. I learned a lot.

3 Oct 2023 at 7:20 | Open on fediscience.org

Der Teilweise

@nikitonsky Great article.

There’s another place UTF-16 is unlikely to be removed from: USB.

In places the spec says “UTF-16LE”, but then again it also says “The Unicode Standard, Version 1.1, is the newest version of the Unicode™ Standard.” and “The character content and encoding of Unicode 1.1 is thus identical to that of the ISO/IEC 10646-1 UCS-2 (the two-octet form).”

https://www.usb.org/sites/default/files/hut1_4.pdf (Issue date January 26, 2023!)
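The difference between the two labels matters for characters outside the Basic Multilingual Plane. A small Python sketch, assuming an arbitrary emoji as the sample character (nothing here comes from the USB document itself):

```python
import struct

text = "\U0001F600"  # sample code point above U+FFFF (an assumption for illustration)

# UTF-16LE encodes it as a surrogate pair: two 16-bit code units, four bytes.
encoded = text.encode("utf-16-le")
units = struct.unpack("<2H", encoded)
print([hex(u) for u in units])

# A BMP character still fits in a single 16-bit unit, as it would in UCS-2.
bmp = "A".encode("utf-16-le")
print(len(bmp))
```

UCS-2, being a fixed two-octet form, has no way to represent that first character, which is why the spec's mixed wording is ambiguous.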

5 Oct 2023 at 7:44 | Open on layer8.space

Jonah

@nikitonsky I'd note that the advice to use grapheme clusters as the atomic unit of strings is dangerous. First because the clusters are ill-defined (as you note), but second because using them to e.g. parse a message can cause security issues!

For example, imagine a JSON string:
"string"
and use a combining char at the start:
"ˇstring"
(typing on mobile, the hacek is not actually combining)
Now it's not a string anymore. This is an error, but it's possible to imagine an injection attack.
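A rough Python sketch of the scenario above. `naive_clusters` is a hypothetical, heavily simplified stand-in for real grapheme segmentation (it only glues combining marks onto the preceding character and is not the UAX #29 algorithm), and U+030C COMBINING CARON is assumed as the combining character.

```python
import json
import unicodedata


def naive_clusters(text: str) -> list[str]:
    """Attach each combining mark to the character before it (simplified, hypothetical)."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # nonzero combining class: merge into previous cluster
        else:
            clusters.append(ch)
    return clusters


raw = '"\u030cstring"'  # opening quote, COMBINING CARON, then the payload

# A code-point-based parser accepts this as a valid JSON string.
parsed = json.loads(raw)

# A cluster-based tokenizer would see quote+caron as one unit, not a bare quote.
first_cluster = naive_clusters(raw)[0]
print(repr(parsed), repr(first_cluster))
```

Under these assumptions, the first "character" a cluster-based parser sees is no longer the `"` delimiter, even though the input is valid JSON at the code point level.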

5 Oct 2023 at 20:02 | Open on mastodon.online

Niki Tonsky

@vjon how is this an injection attack? It’s just an invalid string

6 Oct 2023 at 8:55 | Open on mastodon.online

Jonah

@nikitonsky Sorry, message length limits.

The security issue is potential. In JSON, it's not possible to construct an attack this way -- at least not with a strict parser. In another format, it might turn into an actual injection attack.

But a syntactic error in trusted data is an issue on its own, because the string is valid. A parser based on grapheme clusters would reject it incorrectly. And it means that using them to parse JSON, or any other textual format, is not possible.

6 Oct 2023 at 13:11 | Open on mastodon.online