洪民憙 (Hong Minhee)

Wow, English-only people (or Western languages, for that matter) are so naïve. In case you didn't know, the lang attribute is very important in East Asian languages.

https://lobste.rs/s/9ck6y9/what_programming_language_is_this_code#c_0zuhqs

https://jsfiddle.net/8sa8ndLj/2/

#CJK #language #EastAsian

28 August at 2:06 | Open on fosstodon.org

73 comments

Central Illumination Agency

@hongminhee “Oh, there are other languages besides English? Even with different scripts?!!

This surprises and confuses me”

🙄🙄🙄

28 August at 4:53 | Open on chaos.social

Cat Head Eagle

@slothrop @hongminhee some people don’t know about the existence of time zones, nothing surprising

30 August at 6:33 | Open on mastodon.ml

Andrew

@hongminhee "It's trivial to determine computationally,"

meanwhile, my laptop when I ask it to translate Chinese: https://hackers.town/@cinebox/113032603710915348

28 August at 5:00 | Open on hackers.town

Dr. hc.* Grober Unfug

@hongminhee
Bloody hell... I can't stand some of the Software folks 😭

28 August at 5:37 | Open on mastodon.social

Ariel Mirage

@hongminhee probably they think the only language actually existing is English.

28 August at 5:48 | Open on fedi.absturztau.be

Christopher Giffard

@hongminhee Han unification plus western exceptionalism is paying off in spades.

28 August at 6:24 | Open on aus.social

Paul McO'Smith III

@hongminhee are the pictograms so precise when written that people in Korea would giggle if you didn't get that small vertical line exactly vertical? or vice versa?

28 August at 6:47 | Open on theblower.au

洪民憙 (Hong Minhee)

@pavsmith No, in handwriting, it's not that important. It's like the difference between an a and an ɑ, or the difference between crossing out a 7 or not, but in print, people feel awkward.

28 August at 7:16 | Open on fosstodon.org

Paul McO'Smith III

@hongminhee thanks. had always kinda wondered, as most of what i see is really detailed calligraphy, which can't be practical when writing notes in a meeting!

i suspect the problem is like right handers trying to decipher left handed writing, especially when writing at speed and you start to smudge all of the ink!

28 August at 7:36 | Open on theblower.au

洪民憙 (Hong Minhee)

@pavsmith Of course, there's cursive script in East Asia too. Also, each single Chinese character is more like a word than a single letter, so the information density of a sentence is high (hence a sentence is short).

https://en.wikipedia.org/wiki/Cursive_script_(East_Asia)

28 August at 7:42 | Open on fosstodon.org

Janne Moren

@pavsmith @hongminhee
They look "off" in print. You can tell that something is wrong or badly typeset, even if you can't discover the exact character.

Kind of like a second language speaker with perfect pronunciation, but their choice of words and expressions doesn't match that of a native. You can tell it's not native speech even though you might not be able to pinpoint why.

A couple of the ones above are really obvious though, and would jump out immediately.

28 August at 11:13 | Open on fosstodon.org

źmicier | зміцер

@pavsmith With such a small difference they will probably read it correctly, but this can be a problem with some dictionaries or input methods that rely on splitting characters into elements: 丨 and 丶 might be different elements.

E.g. while Cangjie inputs both 房 as HSYHS (／尸丄／𠃌), it would split ⻆ into NBG (乛⺆土) or NBQ(乛⺆扌), 直 into JBMM (十⺆一一) or JBUV (十⺆凵𠃊). Another example, 今, might be OIN (人丶乛) or OMN (人一乛). An input method only accepts one of these. Typing with wrong font can be painful.

28 August at 12:16 | Open on polyglot.city

Paul McO'Smith III

@zmicier thank you. that is quite amazing! and there are regional variations! it must be quite an experience to learn it. my usual question, though: can left handers do it routinely? the shape of some of the pictograms look like they'd be tricky, kinda like having to write backwards.

28 August at 12:46 | Open on theblower.au

Rimu

@hongminhee Thank you for this post.

I have added the lang attribute to posts and comments on https://piefed.social #PieFed

28 August at 7:19 | Open on mastodon.nzoss.nz

洪民憙 (Hong Minhee)

@rimu Oh, I'm glad you found my post helpful!

28 August at 7:20 | Open on fosstodon.org

Franchesko

@hongminhee reminded me of this: https://github.com/kdeldycke/awesome-falsehood

28 August at 7:26 | Open on mastodon.social

洪民憙 (Hong Minhee)

@franchesko Thanks for sharing this great resource. 😄

28 August at 7:27 | Open on fosstodon.org

Braw ☕🏳️‍🌈

@hongminhee it's also very important for screen readers so they don't attempt to read foreign language with the wrong synthesiser

28 August at 8:04 | Open on mstdn.social

Leonardo Ferreira Fontenelle

@brawaru @hongminhee came here to say that. I use it regularly to mark English words in my Portuguese blog.

28 August at 14:49 | Open on mastodon.social

Alberto de Murga

@hongminhee Those are the same people who then believe that every person has a single name and surname of 4 to 10 letters, and the address has a state 🙄

28 August at 8:27 | Open on mastodon.social

洪民憙 (Hong Minhee)

@threkk Every East Asian sighs every time they see the first/last name fields.

28 August at 8:28 | Open on fosstodon.org

Alberto de Murga

@hongminhee Same for Spanish people, we got two surnames xD

28 August at 8:39 | Open on mastodon.social

Leonardo Ferreira Fontenelle

@threkk @hongminhee here in Brazil is a different flavor of weird, because our main surname is the last one, so we end up putting our first surname with the given name(s)

28 August at 14:51 | Open on mastodon.social

Janne Moren

@hongminhee @threkk
And as a non-native I sigh each time I see a japanese site with only first and family name, and nowhere to put my middle name...

28 August at 11:15 | Open on fosstodon.org

GwenTheKween :verifiedtrans: :neofox_nom_verified:

@hongminhee is not all western people. Looking up anything in Portuguese on duckduckgo (so not using location information for context) and about 1 in 4 cases I get all results in Spanish instead.

I would more likely say people who only speak one language and aren't exposed to another with any regularity could think that, but maybe I'm overselling the interest of 2-language speakers....

Edit: I just realized I worded the start like a dumbass. I should have said "this also affects western people" instead, but sleepy doesn't lend itself to writing well. Sorry if I sounded like an apologist, I wanted to highlight how that is even more of a narrow view of language than it seemed

28 August at 10:06 | Open on tech.lgbt

Dr. Evan J. Gowan

@hongminhee I remember back when I first started learning Japanese and my phone was seemingly incapable of using Japanese fonts, and I was learning the wrong way to write characters like 過.

28 August at 10:38 | Open on fediscience.org

洪民憙 (Hong Minhee)

@DrEvanGowan Haha, that's funny! 😆

28 August at 10:46 | Open on fosstodon.org

esa

@hongminhee

Bulgarian also afaik, with example under "unicode is locale-dependent" here:
https://tonsky.me/blog/unicode/

Us non-english westerners mostly just have to deal with anglo systems changing some letters and the meaning of the word, though. I occasionally wish q wasn't in ascii just so they'd have to deal with it as well.

28 August at 10:43 | Open on snabelen.no

Bruno Girin

@hongminhee even in European languages: it can impact sorting order or capitalisation to start with. Variations like the ones in East Asian languages are also present. The only reason why Western languages can mostly ignore that problem is because Unicode has a large number of glyph variations for the Latin alphabet but that creates other problems such as canonicalisation.

Anyway, all this to say I agree and anyone who says the lang attribute is useless has some learning to do.

28 August at 11:04 | Open on mastodon.me.uk
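
To illustrate the canonicalisation problem mentioned in the comment above, here is a minimal Python sketch using the standard library's unicodedata module. The example letter "é" is an assumption chosen for illustration, not something from the thread: it can be stored either as one precomposed code point or as a base letter plus a combining accent, and the two forms only compare equal after normalisation.

```python
import unicodedata

precomposed = "\u00e9"   # é stored as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Both strings render the same, but they are different code point sequences.
same_before = precomposed == decomposed

# NFC normalisation composes the base letter and the accent into one code point.
same_after = unicodedata.normalize("NFC", decomposed) == precomposed

print(same_before, same_after)
```

Normalising both sides before comparing or indexing text is the usual way around this, though which normalisation form fits best depends on the application.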

gullevek ☢️

@hongminhee @jannem Biggest mistake in Unicode. Ever.

28 August at 11:46 | Open on famichiki.jp

Janne Moren

@gullevek @hongminhee
Yep. These should all have had their own codepoints.

28 August at 13:18 | Open on fosstodon.org

Residual Entropy

@hongminhee@fosstodon.org Yeah sooo many people just make assumptions like that :(

28 August at 12:49 | Open on kitsunes.club

reviewer 2 :Schwerified:

@hongminhee it’s not trivial to determine per se, but a cross-entropy classifier on character bigrams (that is, 1990s NLP) is surprisingly accurate at determining the language of a string.

However—and this is the big caveat—it’s only trivial if (1) you know where the language boundaries are and (2) the string is long enough to get robust bigram statistics.

Even if you weren’t to specify the language, “lang” solves problem (1) readily.

28 August at 12:54 | Open on lingo.lol
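
The character-bigram approach described in the comment above can be sketched roughly as follows. This is a toy illustration under loud assumptions: the two training sentences, the add-one smoothing, and the function names (bigrams, train, cross_entropy, guess_language) are all made up for this example, and a real classifier would be trained on far more text.

```python
import math
from collections import Counter

def bigrams(text):
    """Return the character bigrams of text, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(samples):
    """Build one bigram count table per language from a dict of sample texts."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}

def cross_entropy(text, counts, vocab_size):
    """Average negative log2 probability of the text's bigrams (add-one smoothing)."""
    total = sum(counts.values())
    grams = bigrams(text)
    return -sum(
        math.log2((counts[g] + 1) / (total + vocab_size)) for g in grams
    ) / len(grams)

def guess_language(text, models):
    """Pick the language whose bigram model gives the lowest cross-entropy."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen bigrams
    return min(models, key=lambda lang: cross_entropy(text, models[lang], vocab_size))

# Toy training data (assumption: real models need much larger corpora).
models = train({
    "en": "the quick brown fox jumps over the lazy dog while the other dogs "
          "think this is something that they should know",
    "de": "der schnelle braune fuchs springt über den faulen hund während die "
          "anderen hunde denken dass sie das wissen sollten",
})

print(guess_language("the dog", models))
print(guess_language("der hund", models))
```

The shorter the input, the fewer bigrams there are to score, which is why the approach becomes unreliable on very short strings.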

洪民憙 (Hong Minhee)

@thedansimonson Yeah, but strings in East Asian languages are often too short, e.g., 孤立無援, which is a valid sentence in Korean, Chinese, and Japanese. 😅

28 August at 12:58 | Open on fosstodon.org

reviewer 2 :Schwerified:

@hongminhee oh yea totally—but is it common to mix strings of that length on a website containing multiple languages? I’d suspect generally, where things get mixed up, the shortest you might have is a link or button.

28 August at 13:15 | Open on lingo.lol

洪民憙 (Hong Minhee)

@thedansimonson You're right, the shortest ones are buttons or links in a navigation bar.

28 August at 13:19 | Open on fosstodon.org

Martin Seeger

@hongminhee The German Umlaut division, Ü-Battalion agrees 😁. ASCII is for people playing life in easy mode.

28 August at 12:55 | Open on infosec.exchange

King Calyo Delphi

@hongminhee Oh, today I learned something! 🤯

I don't use east asian languages but knowing that the lang attribute also affects TTS is something I'm gonna take with me to the accessibility bank 100%. 👌💯

28 August at 13:21 | Open on rubber.social

James Wood

@hongminhee How accurately are `lang` attributes placed in practice? I remember seeing “直” displayed wrong for the intended language on social media before (and, by the way, I don't think it's possible for me to specify the intended language in my quote there), and I often see people on the Fediverse who set-and-forget their language and then post in a different language, which you can tell on the client I use because it offers to translate the message.

28 August at 14:27 | Open on mathstodon.xyz

洪民憙 (Hong Minhee)

@mudri Yes, in practice, people often don't even specify the lang attribute at all, and as you said, even on fediverse, there are many people who post without setting the language correctly. 🤦

28 August at 14:34 | Open on fosstodon.org

Leonardo Ferreira Fontenelle

@mudri @hongminhee I guess the most common problem with lang is it often not being used when it should

28 August at 14:56 | Open on mastodon.social

arclight

@hongminhee Sometimes it's worth reading the comments. I'm certain that I should have been called out like that back when I was more of a 25-year-old dumbass. The internet has exposed us all to many things we'd otherwise be ignorant of, giving us a much greater chance to step in it and a bigger microphone to broadcast our jackassery.

28 August at 15:40 | Open on oldbytes.space

Frost, Wolffucker 🐺:therian:

@hongminhee Fuck Han unification!

28 August at 16:20 | Open on masto.brightfur.net

Erik

@hongminhee I think it's (a portion of) the English. Not a programmer but I use a minimum of three languages daily and work with lots of disabled students. This seems hyper relevant

28 August at 19:46 | Open on warhammer.social

Seiðr

@hongminhee Sincerely, I believe this is mostly a very USAmerican and English thing. The moment you are not English-native and have to interact with tech, you start from a "the world is not made for me" position that they usually take for granted.

28 August at 20:23 | Open on mstdn.social

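@hongminhee But the Korean web itself doesn't seem to care about this either... and Misskey, which is made in Japan, doesn't offer a feature to attach it.

Korean Mastodon users don't seem to pay much attention to the language setting either.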

29 August at 5:40 | Open on mastodon.social

洪民憙 (Hong Minhee)

@ssharp That's right…

29 August at 5:41 | Open on fosstodon.org

Kote Isaev

@hongminhee Wow! I could not imagine that it is _that_ bad!
As the comments in the thread point out, the `lang` attribute is necessary and useful not only for East Asian languages, but also in cases like a Polish word on an English wiki page, or German within English or vice versa. So the issue the lang attribute addresses exists for any region or language, not just East Asian ones.
I never thought there were such culturally blind people in the IT sphere. Maybe this is because I have been working in a distributed and diverse team for years.

29 August at 10:46 | Open on mastodon.online

Richard Barrell

@hongminhee I feel silly because for a moment there I was wondering why changing the lang attribute was causing little red circles to get drawn on some of the glyphs. 😅

29 August at 10:56 | Open on unstable.systems

Marek

@hongminhee It is not for nothing that the W3C Validator spits out a warning if it is missing.
But I think it's also valuable for search engines and the like to know what language a document is in.

29 August at 11:45 | Open on layer8.space

Jernej Simončič

@hongminhee I still think it's stupid that Unicode hasn't separated the different-looking CJK glyphs into separate codepoints. If we can have A, Α, А and Ａ as separate characters, why couldn't that have been done for CJK?

29 August at 11:58 | Open on infosec.exchange

洪民憙 (Hong Minhee)

@jernej__s I'm in favor of Han unification though. See also this:

https://fosstodon.org/@hongminhee/113039545387576150

29 August at 12:00 | Open on fosstodon.org

Jernej Simončič

@hongminhee But why? If the character looks different, why wouldn't it be represented by a different codepoint?

29 August at 12:15 | Open on infosec.exchange

洪民憙 (Hong Minhee)

@jernej__s Because they are the same characters, even though they look slightly different. “Unicode encodes characters, not glyphs.” —Unicode FAQ. It's like Arabic numeral 7 is encoded as a single codepoint whether it has an extra horizontal line drawn across it or not.

https://upload.wikimedia.org/wikipedia/commons/5/5c/Hand_Written_7.svg

29 August at 12:23 | Open on fosstodon.org
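
The distinction drawn in this exchange can be checked with Python's standard unicodedata module. This is a small sketch, not part of the original thread: the look-alike letters A, Α, А and Ａ each have their own code point because Unicode treats them as different characters, while the ideograph 直 has a single code point (U+76F4) no matter which regional glyph a font draws for it.

```python
import unicodedata

# Latin, Greek, Cyrillic and fullwidth "A": four separate characters.
lookalikes = ["A", "\u0391", "\u0410", "\uff21"]
names = [unicodedata.name(ch) for ch in lookalikes]

# The unified ideograph 直: one character, one code point. Which glyph is
# drawn for it is left to the font and the language of the text.
han = "\u76f4"
han_name = unicodedata.name(han)

for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")
print(f"U+{ord(han):04X} {han_name}")
```

Nothing in the string itself records whether 直 is meant as Chinese, Japanese, or Korean text, so that information has to come from outside the string, for example from a lang attribute.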

mirabilos

@hongminhee for screenreaders, too

signed, a central european polyglot (mostly only western languages, but still aware, as are many of us, in contrast to many english people)

29 August at 13:01 | Open on toot.mirbsd.org

Mikołaj Hołysz

@hongminhee Serious question. How do platforms that accept user-generated content handle this?

Take Mastodon for example, if three users send a post, one in Chinese, one in Korean, one in Japanese, and the app is international, how would this be handled? How should this be handled?

Are apps targeting the Asian market requiring the user to correctly fill in the "language" field each time? Are you effectively required to include AI-based language detection in each product? Are browsers truly unable to figure this out on their own when there's no lang attribute present?

29 August at 14:21 | Open on dragonscave.space

洪民憙 (Hong Minhee)

@miki On the web, it's common to specify the lang attribute in the top-level <html> tag. Internationalized apps will prefer the user's locale setting.

29 August at 14:29 | Open on fosstodon.org
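
For the mixed-timeline case raised in the question, the lang attribute can also be set on individual elements rather than only on the top-level <html> tag. The following Python sketch shows one possible way to do that, assuming each post already carries a language tag; render_post is a hypothetical helper invented for this example, not any platform's actual code.

```python
from html import escape

def render_post(text: str, lang: str) -> str:
    """Wrap one post in a paragraph tagged with its language (hypothetical helper)."""
    return f'<p lang="{escape(lang, quote=True)}">{escape(text)}</p>'

# Three posts containing the same code point (直, U+76F4), each tagged with a
# different language so a browser can choose the matching regional glyph.
posts = [("zh-Hans", "\u76f4"), ("ko", "\u76f4"), ("ja", "\u76f4")]
timeline = "\n".join(render_post(text, lang) for lang, text in posts)
print(timeline)
```

This only helps when the stored language tag is correct, so it still depends on users or the client setting the language of each post properly.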

Kevin Boyd

@miki @hongminhee Mastodon lets users specify a default language & change the language on a per-post basis when they are composing posts.

29 August at 19:21 | Open on phpc.social

Chris Abbey

@kboyd @miki @hongminhee sadly, a lot of users don’t bother to do either. A couple of hashtags I follow are very popular amongst users who speak languages I don’t, and while I have configured my client to only show posts in the three languages I have any chance at all with, I still scroll past a lot of content I can’t read. (Which claims to be posted in English, but absolutely isn’t.)

30 August at 6:38 | Open on phpc.social

lily 🏳️‍⚧️

@hongminhee@fosstodon.org even if you're not going to be racist about it i've seen sites without lang attributes be detected as things like spanish and german when they're in english so a lang attribute is essential

29 August at 14:27 | Open on possum.city

esanikerruzegibat

@hongminhee As a native speaker of Basque, even being full Western... I FEEL YOU.

29 August at 20:47 | Open on mastodon.eus

PsyMar aka Sam

@hongminhee It's definitely English And Maybe Spanish only. I've definitely seen sentences in Dutch that could be mistaken for badly spelled English.

29 August at 21:45 | Open on unstable.systems

Thomas Touhey

@hongminhee
So a Unicode codepoint can correspond to different glyphs in the same font depending on the language? This seems like a big oversight by Unicode, unless it's a conscious decision?

29 August at 23:28 | Open on social.touhey.org

洪民憙 (Hong Minhee)

@thomas It's called Han unification. See also the following thread:

https://fosstodon.org/@hongminhee/113039545387576150

29 August at 23:48 | Open on fosstodon.org

Thomas Touhey

@hongminhee I see, I dove a little bit into the subject and it comes down to Unicode primitives I don't understand yet. Thanks for the pointers :)

30 August at 0:28 | Open on social.touhey.org

Anneke

@hongminhee Reddit is oftentimes very USA/English centered too, it’s a little ridiculous. But those people in the comments really take the cake with their comments, good lord! Thank you for sharing!!

30 August at 7:43 | Open on front-end.social

Nick

@hongminhee it can also determine the language engine in screen readers so you don't have an English engine trying to read Chinese and completely blowing up

30 August at 8:07 | Open on toot.cat

Jan Eden

@hongminhee The lang attribute is important even when only languages with Latin characters are involved because of the different quotation marks (a highly contentious matter).

30 August at 13:34 | Open on social.eden.one

[Yaseenist] CauseOfBSOD

@hongminhee wait so the shared unicode characters arent even the same depending on the language?

another thing ill probably end up worrying about when writing code even though i doubt anyone will ever localise software i develop

4 September at 19:31 | Open on wetdry.world

Melody :cat_inside:

@hongminhee@fosstodon.org anything that improves accessibility, be it for disabled or people of different language, should not be a suggestion but mandatory