Devil Lu Linvega

Added UTF-8 multi-byte support to a text editor today. I've always been scared to get into it, since it looked messy and confusing from a distance. But the design makes it pretty accessible, even for a system as small as uxn.

The rule is pretty simple:

- starting bytes are 11xx xxxx
- continuation bytes are 10xx xxxx
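
The rule above can be sketched in C as a pair of bit tests. This is only an illustration, not Left's Uxntal code: the names `utf8_kind` and `utf8_seq_len` are hypothetical, and the sketch adds the ASCII case (0xxx xxxx) that the two rules leave implicit.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not taken from Left. */
enum byte_kind { BYTE_ASCII, BYTE_LEAD, BYTE_CONT };

/* Classify one byte by its top bits. */
static enum byte_kind utf8_kind(uint8_t b)
{
    if ((b & 0x80) == 0x00) return BYTE_ASCII; /* 0xxx xxxx */
    if ((b & 0xC0) == 0x80) return BYTE_CONT;  /* 10xx xxxx */
    return BYTE_LEAD;                          /* 11xx xxxx */
}

/* Number of bytes announced by the first byte of a sequence.
 * Returns 0 for a byte that cannot start a sequence. */
static int utf8_seq_len(uint8_t b)
{
    if ((b & 0x80) == 0x00) return 1; /* 0xxx xxxx */
    if ((b & 0xE0) == 0xC0) return 2; /* 110x xxxx */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110 xxxx */
    if ((b & 0xF8) == 0xF0) return 4; /* 1111 0xxx */
    return 0;
}
```

An editor that only needs to step over glyphs can get by with `utf8_kind` alone; the length helper is there for code that wants to know the size up front.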

The entire implementation to handle multi-byte characters is a mere 30ish bytes long.

wiki.xxiivv.com/site/utf8.html

example implementation: git.sr.ht/~rabbits/left/tree/m

25 comments
la ninpre

@neauoire whaaaaat
but i think utf-8 input is still not possible in uxnemu, because it only sends one byte into the Controller/key port. i was trying to make it work, but i was distracted by other things and never finished it as a result. i am still interested though!

Devil Lu Linvega

@la_ninpre sending utf8 out works! In left, if you ctrl+p a utf8 selection, it'll print it correctly

la ninpre

@neauoire i was talking about typing on a keyboard

here: git.sr.ht/~rabbits/uxn/tree/ma

if i type some character that's not ascii, for example, á (U+00E1), only the first byte (0xc3) of the utf-8 sequence 0xc3a1 gets recorded.
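
To see where the two bytes come from, here is a rough C sketch of encoding a code point into UTF-8. The function name `utf8_encode` is an assumption made up for this example and is not part of uxnemu; the sketch rejects surrogates and out-of-range values but is otherwise minimal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder for illustration. Writes up to 4 bytes into out
 * and returns how many were written, or 0 if cp is not encodable. */
static size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid code points to encode */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+00E1 this takes the two-byte branch: the top bits give 0xc3 and the low six bits give 0xa1, which is why a port that carries a single byte only ever sees 0xc3.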

la ninpre

@neauoire wow, that's so cool. i'll go study the code then.
still i wonder if it will work with keyboard input in SDL...
one idea is to just send the text to the console port instead xD

la ninpre

@neauoire i was trying to say: what if we just send all input from the keyboard to the console port instead of the controller? i know that this is not how it is supposed to work, but that would make multibyte input possible without rewriting uxntal code that expects one value from the Controller/key...
i hope i'm not confusing you, haha

Devil Lu Linvega

@la_ninpre Oh, haha. No, that's not the way to do this. You want to run a controller vector evaluation for each byte.
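
A rough sketch of that idea in C, under loud assumptions: this is not uxnemu's actual code, and `key_vector_fn`, `send_text_as_keys`, `key_log` and `log_key` are all hypothetical names. It only shows the shape of "one vector evaluation per byte" for a chunk of typed UTF-8 text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "write the byte to the key port and
 * evaluate the controller vector". */
typedef void (*key_vector_fn)(uint8_t key, void *ctx);

/* Feed a NUL-terminated UTF-8 string to the vector one byte at a time.
 * Returns the number of bytes sent. */
static size_t send_text_as_keys(const char *text, key_vector_fn vector, void *ctx)
{
    size_t n = 0;
    for (; text[n] != '\0'; n++)
        vector((uint8_t)text[n], ctx);
    return n;
}

/* Example receiver that just records the bytes it was given. */
struct key_log {
    uint8_t bytes[8];
    size_t len;
};

static void log_key(uint8_t key, void *ctx)
{
    struct key_log *log = ctx;
    if (log->len < sizeof log->bytes)
        log->bytes[log->len++] = key;
}
```

With this shape the program on the receiving end still reads a single byte per evaluation, so existing code that expects one value from the key port keeps working and a multi-byte character simply arrives as several events.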

la ninpre

@neauoire yes, i understand. but this means pretty much the same code has to be in both controller and console vectors, right?

Devil Lu Linvega

@la_ninpre It depends on the program, but in left for example, both vectors run the same <insert> subroutine.
git.sr.ht/~rabbits/left/tree/m

la ninpre replied to Devil Lu Linvega

@neauoire can you review a patch for uxnemu then? sent to your email.

la ninpre replied to Devil Lu Linvega

@neauoire i am sorry, did you get it? i suspect i could've sent it to the wrong address.

Devil Lu Linvega replied to la

@la_ninpre I did! I merged it locally but I want to test it some more, can you gimme 3-4 hours? if I hit an issue I'll let you know, otherwise it'll be merged

la ninpre replied to Devil Lu Linvega

@neauoire of course, it's okay. i asked not because i wanted you to review it asap, haha. i was just unsure about email address, i remember you were on gmail in the past

ThaCuber

@neauoire oh heck yeah, now that you're talking about the topic, here's a C function I really like that returns the codepoint that a sequence of UTF-8 bytes refers to, made by "x_rxi" on Twitter for their text editor, "lite": github.com/rxi/lite/blob/maste
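
For readers who can't follow the link, here is a minimal sketch of what such a decoder can look like. It is written from scratch for this example and is not rxi's function; the name `utf8_decode_one` is hypothetical, and it assumes a NUL-terminated input and does not reject overlong encodings.

```c
#include <stdint.h>

/* Hypothetical decoder for illustration. Reads one code point from the
 * NUL-terminated string p, stores it in *cp, and returns a pointer to
 * the byte after the part it consumed. Malformed input yields U+FFFD. */
static const char *utf8_decode_one(const char *p, uint32_t *cp)
{
    const unsigned char *s = (const unsigned char *)p;
    uint32_t c = s[0];
    int extra;

    if (c < 0x80) {
        extra = 0;                 /* 0xxx xxxx */
    } else if ((c & 0xE0) == 0xC0) {
        c &= 0x1F; extra = 1;      /* 110x xxxx */
    } else if ((c & 0xF0) == 0xE0) {
        c &= 0x0F; extra = 2;      /* 1110 xxxx */
    } else if ((c & 0xF8) == 0xF0) {
        c &= 0x07; extra = 3;      /* 1111 0xxx */
    } else {
        *cp = 0xFFFD;              /* stray continuation or invalid byte */
        return p + 1;
    }
    s++;
    while (extra-- > 0) {
        if ((*s & 0xC0) != 0x80) { /* expected 10xx xxxx */
            *cp = 0xFFFD;
            return (const char *)s;
        }
        c = (c << 6) | (uint32_t)(*s++ & 0x3F);
    }
    *cp = c;
    return (const char *)s;
}
```

Each continuation byte contributes six bits, so the loop shifts the accumulated value left by six and ORs in the low bits of the next byte.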

blallo

@neauoire Literally UTF8 is a miracle. Its entire spec was written on a napkin.

🍂Evan Balster🍂

@neauoire If you haven't already, make sure to read about the difference between code points vs graphemes vs grapheme clusters. Thinking in terms of "characters" tends to lead to trouble because the word "character" could refer to any of those things.

joelonsoftware.com/2003/10/08/

Depending on your level of commitment to international text support it might also be worth getting your head around e.g. the role of technologies like HarfBuzz in text rendering.

Devil Lu Linvega

@evan I doubt I will cover much more than what I need to do my work. My goal here was merely that if I encounter a multi-byte glyph, I can walk over it, erase it and select it properly, instead of, say, having to walk over 4 spaces in memory for a 4-byte glyph. That part, at least, is well designed in UTF-8, and I haven't yet found exceptions that Left couldn't handle.
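
Walking over a glyph in one step can be sketched in C as skipping continuation bytes. This is an illustration only, not a translation of Left's Uxntal routines; `utf8_next` and `utf8_prev` are hypothetical names, and the sketch moves by code point, not by grapheme cluster.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor helpers for illustration; positions are byte offsets. */

/* Move right: step one byte, then skip any continuation bytes (10xx xxxx). */
static size_t utf8_next(const uint8_t *buf, size_t len, size_t pos)
{
    if (pos >= len)
        return len;
    pos++;
    while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}

/* Move left: step back one byte, then keep going while on a continuation byte. */
static size_t utf8_prev(const uint8_t *buf, size_t pos)
{
    if (pos == 0)
        return 0;
    pos--;
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

Erasing or selecting a glyph then comes down to taking the byte range between the current position and the result of one of these calls.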

Devil Lu Linvega

@evan Just to be clear, I'm not inventing a new specification, but following the utf8 codepoint map. The permacomputing approach to text editing is likely not to build some sort of all-encompassing textarea system, but rather to have people implement enough of their own language that they can get their tasks done.

In any case, I thought UTF8 would be out of reach for something like uxn, but it isn't that much trouble after all, at least at the scale that I need.

mcc

@neauoire Does the text display anticipate RTL?

Devil Lu Linvega

@mcc no, it doesn't, but it could, although I don't speak any RTL languages, so it's unlikely that I will add coverage for those. Someone else might take on that task, and I could help them.
