@hongminhee it’s not trivial to determine per se, but a cross-entropy classifier on character bigrams (that is, 1990s NLP) is surprisingly accurate at determining the language of a string.

However—and this is the big caveat—it’s only trivial if (1) you know where the language boundaries are and (2) the string is long enough to get robust bigram statistics.

Even if you weren’t to specify the language, “lang” solves problem (1) readily.
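
For anyone curious what that classifier looks like, here is a minimal sketch, assuming add-one smoothing and a toy two-language training set. The names `TOY_SAMPLES`, `train`, `cross_entropy`, and `classify` are all made up for illustration; a real system would train on far more text per language.

```python
import math
from collections import Counter

# Hypothetical toy training text; real models need much larger samples.
TOY_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}


def bigrams(text):
    """Return the character bigrams of text, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(samples):
    """Build one bigram count table per language."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}


def cross_entropy(text, counts, vocab_size):
    """Average negative log2 probability of the text's bigrams under one model.

    Uses add-one smoothing (an assumption of this sketch) so that unseen
    bigrams get a small nonzero probability.
    """
    total = sum(counts.values())
    grams = bigrams(text)
    log_probs = (math.log2((counts[g] + 1) / (total + vocab_size)) for g in grams)
    return -sum(log_probs) / len(grams)


def classify(text, models):
    """Pick the language whose bigram model gives the lowest cross-entropy."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for bigrams never seen in training
    return min(models, key=lambda lang: cross_entropy(text, models[lang], vocab_size))


if __name__ == "__main__":
    models = train(TOY_SAMPLES)
    for sample in ("the dog", "der hund"):
        print(sample, "->", classify(sample, models))
```

Caveat (2) shows up directly here: a four-character string only yields five padded bigrams, so the average is dominated by whatever happens to be in the training counts.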

28 August at 12:54 | Open on lingo.lol

3 comments

洪民憙 (Hong Minhee)

@thedansimonson Yeah, but strings in East Asian languages are often too short, e.g., 孤立無援, which is a valid sentence in Korean, Chinese, and Japanese. 😅

28 August at 12:58 | Open on fosstodon.org

reviewer 2 :Schwerified:

@hongminhee oh yea totally—but is it common to mix strings of that length on a website containing multiple languages? I’d suspect generally, where things get mixed up, the shortest you might have is a link or button.

28 August at 13:15 | Open on lingo.lol

洪民憙 (Hong Minhee)

@thedansimonson You're right, the shortest ones are buttons or links in a navigation bar.

28 August at 13:19 | Open on fosstodon.org