洪 民憙 (Hong Minhee)

Wow, English-only people (or Western languages, for that matter) are so naïve. In case you didn't know, the lang attribute is very important in East Asian languages.

lobste.rs/s/9ck6y9/what_progra

jsfiddle.net/8sa8ndLj/2/

#CJK #language #EastAsian

73 comments
Central Illumination Agency

@hongminhee “Oh, there are other languages besides English? Even with different scripts?!!

This surprises and confuses me”

🙄🙄🙄

Cat Head Eagle

@slothrop @hongminhee some people don’t know about existence of time zones, nothing surprising

Andrew

@hongminhee "It's trivial to determine computationally,"

meanwhile, my laptop when I ask it to translate Chinese: hackers.town/@cinebox/11303260

Dr. hc.* Grober Unfug

@hongminhee
Bloody hell... I can't stand some of the Software folks 😭

Ariel Mirage
@hongminhee probably they think the only language actually existing is English.
Christopher Giffard

@hongminhee Han unification plus western exceptionalism is paying off in spades.

Paul McO'Smith III

@hongminhee are the pictograms so precise when written that people in Korea would giggle if you didn't get that small vertical line exactly vertical? or vice versa?

洪 民憙 (Hong Minhee)

@pavsmith No, in handwriting, it's not that important. It's like the difference between an a and an ɑ, or the difference between crossing out a 7 or not, but in print, people feel awkward.

Paul McO'Smith III

@hongminhee thanks. had always kinda wondered, as most of what i see is really detailed calligraphy, which can't be practical when writing notes in a meeting!

i suspect the problem is like right handers trying to decipher left handed writing, especially when writing at speed and you start to smudge all of the ink!

洪 民憙 (Hong Minhee)

@pavsmith Of course, there's cursive script in East Asia too. Also, each single Chinese character is more like a word than a single letter, so the information density of a sentence is high (hence a sentence is short).

en.wikipedia.org/wiki/Cursive_

Janne Moren

@pavsmith @hongminhee
They look "off" in print. You can tell that something is wrong or badly typeset, even if you can't discover the exact character.

Kind of like a second language speaker with perfect pronunciation, but their choice of words and expressions doesn't match that of a native. You can tell it's not native speech even though you might not be able to pinpoint why.

A couple of the ones above are really obvious though, and would jump out immediately.

źmicier | зміцер

@pavsmith With such a small difference they will probably read it correctly, but this can be a problem with some dictionaries or input methods that rely on splitting characters into elements: 丨 and 丶 might be different elements.

E.g. while Cangjie inputs both variants of 房 as HSYHS (/尸丄/𠃌), it would split ⻆ into NBG (乛⺆土) or NBQ (乛⺆扌), and 直 into JBMM (十⺆一一) or JBUV (十⺆凵𠃊). Another example, 今, might be OIN (人丶乛) or OMN (人一乛). An input method only accepts one of these. Typing with the wrong font can be painful.

Paul McO'Smith III

@zmicier thank you. that is quite amazing! and there are regional variations! it must be quite an experience to learn it. my usual question, though: can left handers do it routinely? the shape of some of the pictograms looks like it'd be tricky, kinda like having to write backwards.

Rimu

@hongminhee Thank you for this post.

I have added the lang attribute to posts and comments on piefed.social #PieFed

Braw ☕🏳️‍🌈

@hongminhee it's also very important for screen readers so they don't attempt to read foreign language with the wrong synthesiser

Leonardo Ferreira Fontenelle

@brawaru @hongminhee came here to say that. I use it regularly to mark English words in my Portuguese blog.

Alberto de Murga

@hongminhee Those are the same people who then believe that every person has a single name and surname of 4 to 10 letters, and the address has a state 🙄

洪 民憙 (Hong Minhee)

@threkk Every East Asian sighs every time they see the first/last name fields.

Alberto de Murga

@hongminhee Same for Spanish people, we got two surnames xD

Leonardo Ferreira Fontenelle

@threkk @hongminhee here in Brazil it's a different flavor of weird, because our main surname is the last one, so we end up putting our first surname with the given name(s)

Janne Moren

@hongminhee @threkk
And as a non-native I sigh each time I see a japanese site with only first and family name, and nowhere to put my middle name...

GwenTheKween :verifiedtrans: :neofox_nom_verified:

@hongminhee it's not all Western people. When I look up anything in Portuguese on DuckDuckGo (so no location information is used for context), in about 1 in 4 cases I get all results in Spanish instead.

I would more likely say people who only speak one language and aren't exposed to another with any regularity could think that, but maybe I'm overselling the interest of 2-language speakers....

Edit: I just realized I worded the start like a dumbass. I should have said "this also affects western people" instead, but sleepy doesn't lend itself to writing well. Sorry if I sounded like an apologist, I wanted to highlight how that is even more of a narrow view of language than it seemed

Dr. Evan J. Gowan

@hongminhee I remember back when I first started learning Japanese and my phone was seemingly incapable of using Japanese fonts, and I was learning the wrong way to write characters like 過.

esa

@hongminhee

Bulgarian also afaik, with example under "unicode is locale-dependent" here:
tonsky.me/blog/unicode/

Us non-english westerners mostly just have to deal with anglo systems changing some letters and the meaning of the word, though. I occasionally wish q wasn't in ascii just so they'd have to deal with it as well.

Bruno Girin

@hongminhee even in European languages: it can impact sorting order or capitalisation to start with. Variations like the ones in East Asian languages are also present. The only reason why Western languages can mostly ignore that problem is because Unicode has a large number of glyph variations for the Latin alphabet but that creates other problems such as canonicalisation.

Anyway, all this to say I agree and anyone who says the lang attribute is useless has some learning to do.
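
The canonicalisation problem mentioned above can be sketched in a few lines of Python using the standard `unicodedata` module. This is only a minimal illustration: the sample string and the helper name `canonical_equal` are assumptions made for the example, not something from the thread.

```python
import unicodedata

# Two ways to encode the same visible text "é":
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A naive comparison treats them as different strings.
naive_equal = precomposed == decomposed


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(naive_equal)                               # the raw strings differ
print(canonical_equal(precomposed, decomposed))  # equal once normalised
```

Normalising before comparing or indexing is one common way to deal with this; which normalisation form fits best depends on the application.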

Janne Moren

@gullevek @hongminhee
Yep. These should all have had their own codepoints.

Residual Entropy

@hongminhee@fosstodon.org Yeah sooo many people just make assumptions like that :(

reviewer 2 :Schwerified:

@hongminhee it’s not trivial to determine per se, but a cross-entropy classifier on character bigrams (that is, 1990s NLP) is surprisingly accurate at determining the language of a string.

However—and this is the big caveat—it’s only trivial if (1) you know where the language boundaries are and (2) the string is long enough to get robust bigram statistics.

Even if you weren’t to specify the language, “lang” solves problem (1) readily.
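
For readers who have not seen one, a character-bigram classifier of the kind described above might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the function names, the add-one smoothing, and the two toy training sentences. A real classifier would be trained on far more text.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the overlapping character bigrams of text, lowercased."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(samples):
    """Build one bigram Counter per language from {language: sample text}."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}


def cross_entropy(text, model, vocab_size):
    """Average negative log2 probability of text's bigrams under one model."""
    grams = bigrams(text)
    if not grams:
        # Strings shorter than two characters carry no bigram evidence at all.
        return float("inf")
    total = sum(model.values())
    # Add-one smoothing so unseen bigrams get a small non-zero probability.
    return -sum(
        math.log2((model[g] + 1) / (total + vocab_size)) for g in grams
    ) / len(grams)


def guess_language(text, models):
    """Pick the language whose model gives the lowest cross-entropy."""
    vocab = set(bigrams(text))
    for model in models.values():
        vocab.update(model)
    return min(models, key=lambda lang: cross_entropy(text, models[lang], len(vocab)))


# Toy training data (hypothetical, far too small for real use).
models = train({
    "en": "the quick brown fox jumps over the lazy dog and then the other thing",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann das andere ding",
})

print(guess_language("the lazy dog", models))
print(guess_language("den faulen hund", models))
```

Both caveats from the post show up directly: the function assumes the whole input is in one language, and very short inputs give it almost nothing to work with.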

洪 民憙 (Hong Minhee)

@thedansimonson Yeah, but East Asian text is often too short, e.g., 孤立無援, which is a valid sentence in Korean, Chinese, and Japanese. 😅
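
To see why a string like this gives a detector nothing to hold on to, here is a small Python sketch that looks for script-level hints (Hangul for Korean, kana for Japanese) by Unicode character name. The function name `script_hints` and the hint rules are assumptions for illustration only; text written purely in Han characters yields no hint at all.

```python
import unicodedata


def script_hints(text):
    """Collect language hints from scripts that are specific to one language."""
    hints = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("HANGUL"):
            hints.add("ko")
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            hints.add("ja")
        # Han ideographs are shared by Chinese, Japanese, and Korean,
        # so they contribute no hint.
    return hints


print(script_hints("孤立無援"))  # Han only: nothing to go on
print(script_hints("한국어"))
print(script_hints("ひらがな"))
```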

reviewer 2 :Schwerified:

@hongminhee oh yea totally—but is it common to mix strings of that length on a website containing multiple languages? I’d suspect generally, where things get mixed up, the shortest you might have is a link or button.

洪 民憙 (Hong Minhee)

@thedansimonson You're right, the shortest ones are buttons or links in a navigation bar.

Martin Seeger

@hongminhee The German Umlaut division, Ü-Battalion agrees 😁. ASCII is for people playing life in easy mode.

King Calyo Delphi

@hongminhee Oh, today I learned something! 🤯

I don't use east asian languages but knowing that the lang attribute also affects TTS is something I'm gonna take with me to the accessibility bank 100%. 👌💯

James Wood

@hongminhee How accurately are `lang` attributes placed in practice? I remember seeing “直” displayed wrong for the intended language on social media before (and, by the way, I don't think it's possible for me to specify the intended language in my quote there), and I often see people on the Fediverse who set-and-forget their language and then post in a different language, which you can tell on the client I use because it offers to translate the message.

洪 民憙 (Hong Minhee)

@mudri Yes, in practice, people often don't even specify the lang attribute at all, and as you said, even on fediverse, there are many people who post without setting the language correctly. 🤦

Leonardo Ferreira Fontenelle

@mudri @hongminhee I guess the most common problem with lang is it often not being used when it should

arclight

@hongminhee Sometimes it's worth reading the comments. I'm certain that I should have been called out like that back when I was more of a 25-year-old dumbass. The internet has exposed us all to many things we'd otherwise be ignorant of, giving us a much greater chance to step in it and a bigger microphone to broadcast our jackassery.

Erik

@hongminhee I think it's (a portion of) the English. Not a programmer but I use a minimum of three languages daily and work with lots of disabled students. This seems hyper relevant

Seiðr

@hongminhee Sincerely, I believe this is mostly a very USAmerican and English thing. The moment you are not English-native and have to interact with tech, you start from a "the world is not made for me" position that they usually take for granted.

S#

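@hongminhee But Korean websites don't really seem to care about this either... and Misskey, which is made in Japan, doesn't offer a feature for adding it.

Korean Mastodon users don't seem to pay much attention to their language settings either.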

Kote Isaev

@hongminhee Wow! I could not imagine that it is _that_ bad!
As the comments in this thread show, the `lang` attribute is necessary and useful not only for East Asian languages, but also in cases like a Polish word on an English wiki page, or German within English or vice versa. So the issue the lang attribute addresses exists for any region or language, not just for East Asia.
I never thought there were such culturally blind people in IT. Maybe this is because I have been working in a distributed and diverse team for years.

Richard Barrell

@hongminhee I feel silly because for a moment there I was wondering why changing the lang attribute was causing little red circles to get drawn on some of the glyphs. 😅

Marek

@hongminhee It is not for nothing that the W3C Validator spits out a warning if it is missing.
But I think it's also valuable for search engines and the like to know what language a document is in.

Jernej Simončič

@hongminhee I still think it's stupid that Unicode hasn't separated the different-looking CJK glyphs into separate codepoints. If we can have A, Α, А and A as separate characters, why couldn't that have been done for CJK?

Jernej Simončič

@hongminhee But why? If the character looks different, why wouldn't it be represented by a different codepoint?

洪 民憙 (Hong Minhee)

@jernej__s Because they are the same characters, even though they look slightly different. “Unicode encodes characters, not glyphs.” —Unicode FAQ. It's like Arabic numeral 7 is encoded as a single codepoint whether it has an extra horizontal line drawn across it or not.

upload.wikimedia.org/wikipedia
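
The difference between the two situations can be checked with Python's standard `unicodedata` module. This is a minimal sketch, and the variable names are made up for the example: look-alike capital letters from Latin, Greek, and Cyrillic are separate characters with separate code points, while 直 is one character with one code point, so the regional glyph choice has to come from the font and the declared language.

```python
import unicodedata

# Look-alike capital letters from three scripts: three separate code points.
lookalikes = ["\u0041", "\u0391", "\u0410"]
names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")

# The ideograph discussed in the thread: a single code point, whichever
# language the surrounding text is in.
zhi = "\u76f4"  # 直
code_point = f"U+{ord(zhi):04X}"
print(code_point, unicodedata.name(zhi))
```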

mirabilos

@hongminhee for screenreaders, too

signed, a central european polyglot (mostly only western languages, but still aware, as are many of us, in contrast to many english people)

Mikołaj Hołysz

@hongminhee Serious question. How do platforms that accept user-generated content handle this?

Take Mastodon for example, if three users send a post, one in Chinese, one in Korean, one in Japanese, and the app is international, how would this be handled? How should this be handled?

Are apps targeting the Asian market requiring the user to correctly fill in the "language" field each time? Are you effectively required to include AI-based language detection in each product? Are browsers truly unable to figure this out on their own when there's no lang attribute present?

洪 民憙 (Hong Minhee)

@miki On the web, it's common to specify the lang attribute in the top-level <html> tag. Internationalized apps will prefer the user's locale setting.
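
For user-generated content, one possible approach is to attach the language to each post's markup and fall back to the author's locale when the post has no language of its own. The Python sketch below is only an illustration of that idea. The function `render_post` and its parameters are hypothetical, and it is not how Mastodon or any particular platform implements this.

```python
from html import escape


def render_post(text, post_lang=None, user_locale=None):
    """Wrap a post in a <p> element, tagging it with a language if one is known.

    post_lang and user_locale are expected to be language tags such as
    "ja", "ko", or "zh-Hant". The post's own language wins; the author's
    locale is only a fallback.
    """
    lang = post_lang or user_locale
    attr = f' lang="{escape(lang, quote=True)}"' if lang else ""
    return f"<p{attr}>{escape(text)}</p>"


print(render_post("直", post_lang="ja"))
print(render_post("直", user_locale="ko"))
print(render_post("no language known"))
```

With the attribute in place, the browser can choose a suitable font for each post even when several languages share one page.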

Kevin Boyd

@miki @hongminhee Mastodon lets users specify a default language & change the language on a per-post basis when they are composing posts.

Chris Abbey

@kboyd @miki @hongminhee sadly, a lot of users don’t bother to do either. A couple of hashtags I follow are very popular amongst users who speak languages I don’t, and while I have configured my client to only show posts in the three languages I have any chance at all with, I still scroll past a lot of content I can’t read. (Which claims to be posted in English, but absolutely isn’t.)

lily 🏳️‍⚧️

@hongminhee@fosstodon.org even if you're not going to be racist about it i've seen sites without lang attributes be detected as things like spanish and german when they're in english so a lang attribute is essential

esanikerruzegibat

@hongminhee As a native speaker of Basque, even being full Western... I FEEL YOU.

PsyMar aka Sam

@hongminhee It's definitely English And Maybe Spanish only. I've definitely seen sentences in Dutch that could be mistaken for badly spelled English.

Thomas Touhey

@hongminhee
So a Unicode codepoint can correspond to different glyphs in the same font depending on the language? This seems like a big oversight by Unicode, unless it's a conscious decision?

Thomas Touhey

@hongminhee I see, I dove a little bit into the subject and it comes down to Unicode primitives I don't understand yet. Thanks for the pointers :)

Anneke

@hongminhee Reddit is oftentimes very USA/English centered too, it’s a little ridiculous. But those people in the comments really take the cake with their comments, good lord! Thank you for sharing!!

Nick

@hongminhee it can also determine the language engine in screen readers so you don't have an English engine trying to read Chinese and completely blowing up

Jan Eden

@hongminhee The lang attribute is important even when only languages with Latin characters are involved because of the different quotation marks (a highly contentious matter).

[Yaseenist] CauseOfBSOD

@hongminhee wait so the shared unicode characters arent even the same depending on the language?

another thing ill probably end up worrying about when writing code even though i doubt anyone will ever localise software i develop

Melody :cat_inside:

@hongminhee@fosstodon.org anything that improves accessibility, be it for disabled or people of different language, should not be a suggestion but mandatory
