New post — all things Unicode!
One of the most interesting pieces of research I've had to do. Enjoy!
26 comments
@nikitonsky Huh, I now understand why all the Chinese subtitles I find online use GBK encoding. UTF-8 is a poor choice for subtitle files because different characters share the same code point.

@nikitonsky They are both Cyrillic, just in different letterforms.

@nikitonsky The point is that the Russian one was labeled "Cyrillic" and the Bulgarian one "Bulgarian". They are both Cyrillic.

@nikitonsky Elixir also gets this right, even though my terminal renders an extra space after the emoji.

@nikitonsky Just had a look, and it just calls Erlang. Erlang implementation is here: Erlang has a script which generates the Unicode util code from the Unicode spec. Macros would have been nice for that. :)

@nikitonsky Fun effect! I had to remove the DOM element, though, to be able to focus on the content.

@nikitonsky I'll admit, I was curious. It was quite confusing at first, but devtools let me hide them quickly, and I then confirmed the websockets by looking at the source. For me, though, this was incredibly distracting and made reading your post significantly harder until I turned them off.

@nikitonsky Maybe I shouldn't be, but I was surprised how boring the HN crowd was in these comments. Usually quirky "old web" is much appreciated over there.

@nikitonsky Great article. There's another place UTF-16 is unlikely to be removed from: USB. In places the spec says "UTF-16LE", but then it also says "The Unicode Standard, Version 1.1, is the newest version of the Unicode™ Standard." and "The character content and encoding of Unicode 1.1 is thus identical to that of the ISO/IEC 10646-1 UCS-2 (the two-octet form)." https://www.usb.org/sites/default/files/hut1_4.pdf (Issue date January 26, 2023!)

@nikitonsky I'd note that the advice to use grapheme clusters as the atomic unit of strings is dangerous. First, because the clusters are ill-defined (as you note), and second, because using them to, e.g., parse a message can cause security issues!
For example, imagine a JSON string:

@nikitonsky Sorry, message length limits. The security issue is only potential. In JSON it's not possible to construct an attack this way, at least not with a strict parser. In another format, it might turn into an actual injection attack. But a spurious syntax error on trusted data is a problem in its own right: the string is valid, yet a parser based on grapheme clusters would incorrectly reject it. That means grapheme clusters cannot be used to parse JSON, or any other textual format.
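The commenter's own JSON example did not survive in the thread. As one possible illustration (an assumption, not the original example), consider a JSON string whose value starts with a combining mark. The sketch below uses a hypothetical `naive_clusters` helper that only glues combining marks onto the preceding character. This is a rough stand-in for real grapheme segmentation (UAX #29), which Python's standard library does not provide, but it should be enough to show the opening quote being swallowed into a cluster.

```python
import json
import unicodedata


def naive_clusters(text):
    """Hypothetical, simplified clustering: attach each combining mark
    to the character before it. This is NOT full UAX #29 segmentation."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Assumed payload: the value is a single raw U+0301 (combining acute accent).
# The Python escape puts the actual character into the JSON text.
payload = '{"k": "\u0301"}'

# A parser that works on code points accepts the document.
value = json.loads(payload)["k"]

# Cluster view: the opening quote of the value merges with U+0301,
# so a tokenizer looking for a bare '"' cluster would miss it.
clusters = naive_clusters(payload)
bare_quotes = clusters.count('"')
```

Under these assumptions, the payload has four quote characters but only three of them remain standalone clusters, so a cluster-based tokenizer would see unbalanced quotes in a document that `json.loads` parses without complaint.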
@nikitonsky One line reads:
> Any pure ASCII text is also a valid UTF-8 text, and vice versa.
The "vice versa" part doesn't sound right. Not every valid UTF-8 text is valid ASCII.
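The asymmetry is easy to check. This is a minimal sketch using Python's built-in codecs; the helper names `is_ascii` and `is_utf8` are made up for illustration.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_ascii(data: bytes) -> bool:
    """Return True if the bytes decode as ASCII (every byte below 0x80)."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


ascii_text = "Hello, world".encode("ascii")
utf8_text = "na\u00efve".encode("utf-8")  # contains U+00EF, a two-byte sequence

# Pure ASCII is also valid UTF-8, and decodes to the same text.
ascii_is_utf8 = is_utf8(ascii_text)

# The reverse fails as soon as a non-ASCII character appears.
utf8_is_ascii = is_ascii(utf8_text)
```

The reverse direction holds only for UTF-8 text that happens to contain nothing but ASCII characters.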