Wow, people who only know English (or only Western languages, for that matter) are so naïve. In case you didn't know, the lang attribute is very important for East Asian languages.
https://lobste.rs/s/9ck6y9/what_programming_language_is_this_code#c_0zuhqs
@slothrop @hongminhee some people don’t know about the existence of time zones, nothing surprising
@hongminhee "It's trivial to determine computationally," meanwhile, my laptop when I ask it to translate Chinese: https://hackers.town/@cinebox/113032603710915348
@hongminhee Han unification plus western exceptionalism is paying off in spades.
@hongminhee are the pictograms so precise when written that people in Korea would giggle if you didn't get that small vertical line exactly vertical? or vice versa?
@pavsmith No, in handwriting, it's not that important. It's like the difference between an a and an ɑ, or between writing a 7 with or without a crossbar. In print, though, the wrong variant looks awkward to people.
@hongminhee thanks. had always kinda wondered, as most of what i see is really detailed calligraphy, which can't be practical when writing notes in a meeting! i suspect the problem is like right handers trying to decipher left handed writing, especially when writing at speed and you start to smudge all of the ink!
@pavsmith Of course, there's cursive script in East Asia too. Also, each single Chinese character is more like a word than a single letter, so the information density of a sentence is high (hence a sentence is short).
@pavsmith With such a small difference they will probably read it correctly, but this can be a problem with some dictionaries or input methods that rely on splitting characters into elements: 丨 and 丶 might be different elements. E.g. while Cangjie inputs both 房 as HSYHS (/尸丄/𠃌), it would split ⻆ into NBG (乛⺆土) or NBQ (乛⺆扌), 直 into JBMM (十⺆一一) or JBUV (十⺆凵𠃊). Another example, 今, might be OIN (人丶乛) or OMN (人一乛). An input method only accepts one of these. Typing with the wrong font can be painful.
@zmicier thank you. that is quite amazing! and there are regional variations! it must be quite an experience to learn it. my usual question, though: can left handers do it routinely?
the shapes of some of the pictograms look like they'd be tricky, kinda like having to write backwards.
@hongminhee Thank you for this post. I have added the lang attribute to posts and comments on https://piefed.social #PieFed
@hongminhee it's also very important for screen readers so they don't attempt to read a foreign language with the wrong synthesiser
@brawaru @hongminhee came here to say that. I use it regularly to mark English words in my Portuguese blog.
@hongminhee Those are the same people who then believe that every person has a single name and surname of 4 to 10 letters, and that every address has a state 🙄
@threkk @hongminhee here in Brazil it's a different flavor of weird, because our main surname is the last one, so we end up putting our first surname with the given name(s)
@hongminhee @threkk @hongminhee I remember back when I first started learning Japanese and my phone was seemingly incapable of using Japanese fonts, and I was learning the wrong way to write characters like 過.
Bulgarian also afaik, with example under "unicode is locale-dependent" here: Us non-english westerners mostly just have to deal with anglo systems changing some letters and the meaning of the word, though. I occasionally wish q wasn't in ascii just so they'd have to deal with it as well.
@hongminhee even in European languages: it can impact sorting order or capitalisation to start with. Variations like the ones in East Asian languages are also present. The only reason why Western languages can mostly ignore that problem is that Unicode has a large number of glyph variations for the Latin alphabet, but that creates other problems such as canonicalisation. Anyway, all this to say I agree, and anyone who says the lang attribute is useless has some learning to do.
@hongminhee it’s not trivial to determine per se, but a cross-entropy classifier on character bigrams (that is, 1990s NLP) is surprisingly accurate at determining the language of a string.
However, and this is the big caveat, it’s only trivial if (1) you know where the language boundaries are and (2) the string is long enough to get robust bigram statistics. Even if you weren’t to specify the language, “lang” solves problem (1) readily.
@thedansimonson Yeah, but East Asian text is often too short, e.g., 孤立無援, which is a valid sentence in Korean, Chinese, and Japanese. 😅
@hongminhee oh yea totally, but is it common to mix strings of that length on a website containing multiple languages? I’d suspect generally, where things get mixed up, the shortest you might have is a link or button.
@thedansimonson You're right, the shortest ones are buttons or links in a navigation bar.
@hongminhee The German Umlaut division, Ü-Battalion agrees 😁. ASCII is for people playing life in easy mode.
@hongminhee Oh, today I learned something! 🤯 I don't use east asian languages but knowing that the lang attribute also affects TTS is something I'm gonna take with me to the accessibility bank 100%. 👌💯
@hongminhee How accurately are `lang` attributes placed in practice? I remember seeing “直” displayed wrong for the intended language on social media before (and, by the way, I don't think it's possible for me to specify the intended language in my quote there), and I often see people on the Fediverse who set-and-forget their language and then post in a different language, which you can tell on the client I use because it offers to translate the message.
@mudri Yes, in practice, people often don't even specify the lang attribute at all, and as you said, even on the fediverse, there are many people who post without setting the language correctly. 🤦
@mudri @hongminhee I guess the most common problem with lang is it often not being used when it should be
@hongminhee Sometimes it's worth reading the comments. I'm certain that I should have been called out like that back when I was more of a 25-year-old dumbass.
The internet has exposed us all to many things we'd otherwise be ignorant of, giving us a much greater chance to step in it and a bigger microphone to broadcast our jackassery.
@hongminhee I think it's (a portion of) the English. Not a programmer but I use a minimum of three languages daily and work with lots of disabled students. This seems hyper relevant
@hongminhee Sincerely, I believe this is mostly a very USAmerican and English thing. The moment you are not English-native and have to interact with tech, you start from a "the world is not made for me" position, while they usually take the opposite for granted.
@hongminhee But the Korean web doesn't seem to care about this either... and Misskey, which is made in Japan, doesn't offer a feature for attaching it. Korean Mastodon users don't seem to pay much attention to their language setting either.
@hongminhee Wow! I could not imagine that it is _that_ bad!
@hongminhee I feel silly because for a moment there I was wondering why changing the lang attribute was causing little red circles to get drawn on some of the glyphs. 😅
@hongminhee It is not for nothing that the W3C Validator spits out a warning if it is missing.
@hongminhee I still think it's stupid that Unicode hasn't separated the different-looking CJK glyphs into separate codepoints. If we can have A, Α, А and A as separate characters, why couldn't that have been done for CJK?
@hongminhee But why? If the character looks different, why wouldn't it be represented by a different codepoint?
@jernej__s Because they are the same characters, even though they look slightly different. “Unicode encodes characters, not glyphs.” (Unicode FAQ) It's like the Arabic numeral 7 being encoded as a single codepoint whether it has an extra horizontal line drawn across it or not. https://upload.wikimedia.org/wikipedia/commons/5/5c/Hand_Written_7.svg
@hongminhee for screenreaders, too. signed, a central european polyglot (mostly only western languages, but still aware, as are many of us, in contrast to many english people)
@miki On the web, it's common to specify the lang attribute in the top-level <html> tag.
Internationalized apps will prefer the user's locale setting.
@miki @hongminhee Mastodon lets users specify a default language & change the language on a per-post basis when they are composing posts.
@kboyd @miki @hongminhee sadly, a lot of users don’t bother to do either. A couple of hashtags I follow are very popular amongst users who speak languages I don’t, and while I have configured my client to only show posts in the three languages I have any chance at all with, I still scroll past a lot of content I can’t read. (Which claims to be posted in English, but absolutely isn’t.)
@hongminhee@fosstodon.org even if you're not going to be racist about it i've seen sites without
@hongminhee It's definitely English And Maybe Spanish only. I've definitely seen sentences in Dutch that could be mistaken for badly spelled English.
@hongminhee @hongminhee I see, I dove a little bit into the subject and it comes down to Unicode primitives I don't understand yet. Thanks for the pointers :)
@hongminhee Reddit is oftentimes very USA/English centered too, it’s a little ridiculous. But those people in the comments really take the cake with their comments, good lord! Thank you for sharing!!
@hongminhee it can also determine the language engine in screen readers so you don't have an English engine trying to read Chinese and completely blowing up
@hongminhee The lang attribute is important even when only languages with Latin characters are involved because of the different quotation marks (a highly contentious matter).
@hongminhee wait so the shared unicode characters arent even the same depending on the language? another thing ill probably end up worrying about when writing code even though i doubt anyone will ever localise software i develop
@hongminhee@fosstodon.org anything that improves accessibility, be it for disabled people or people of a different language, should not be a suggestion but mandatory
@hongminhee not just East Asian. Some other languages suffer from this as well.
Unicode is a fuck.
https://tonsky.me/blog/unicode/
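The "characters, not glyphs" point made earlier in the thread can be checked with a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration, and the fourth look-alike letter is assumed to be the fullwidth A, since the thread shows four visually identical letters without naming them.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Look-alike capital letters that Unicode encodes separately, one per script.
# The fullwidth form is an assumption about what the fourth "A" in the
# thread was meant to be.
LOOKALIKES = ["\u0041", "\u0391", "\u0410", "\uFF21"]

# 直 is stored as this single code point in Chinese, Japanese and Korean text.
HAN_EXAMPLE = "\u76F4"

if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(describe(ch))
    print(describe(HAN_EXAMPLE))
```

The string holding 直 carries no language information of its own, so which regional shape gets drawn is left to the font, and a correct lang value is what lets a renderer pick a suitable one.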
@hongminhee “Oh, there are other languages besides English? Even with different scripts?!!
This surprises and confuses me”
🙄🙄🙄
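For readers curious about the character-bigram classifier mentioned earlier in the thread, here is a rough Python sketch of the idea. It is not anyone's actual implementation: the function names, the two tiny training sentences and the add-one smoothing are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

# Toy training text, made up for this sketch. A real model needs far more.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog sleeps in the sun all day",
    "pt": "a rápida raposa marrom pula sobre o cão preguiçoso e depois "
          "o cão dorme ao sol o dia todo",
}


def bigrams(text):
    """Split text into overlapping character pairs, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(samples):
    """Count character bigrams per language."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}


def cross_entropy(text, counts, vocab_size):
    """Average negative log2 probability of the text's bigrams.

    Add-one smoothing (an assumption of this sketch) keeps unseen
    bigrams from having zero probability.
    """
    total = sum(counts.values())
    grams = bigrams(text)
    log_probs = (
        math.log2((counts[g] + 1) / (total + vocab_size)) for g in grams
    )
    return -sum(log_probs) / len(grams)


def guess_language(text, models):
    """Pick the language whose bigram model gives the lowest cross-entropy."""
    vocab_size = len(set().union(*models.values())) + 1  # +1 for unseen
    return min(
        models, key=lambda lang: cross_entropy(text, models[lang], vocab_size)
    )


MODELS = train(SAMPLES)

if __name__ == "__main__":
    print(guess_language("the dog sleeps", MODELS))
    print(guess_language("o cão dorme", MODELS))
```

The caveats raised in the thread show up directly here: the function assumes the whole input is in one language, and for a short string such as 孤立無援, where none of the bigrams were seen in training, the score only reflects how much training text each model had, so the answer is meaningless.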