reviewer 2 :Schwerified:

@hongminhee it’s not trivial to determine per se, but a cross-entropy classifier on character bigrams (that is, 1990s NLP) is surprisingly accurate at determining the language of a string.

However, and this is the big caveat, it only works well if (1) you know where the language boundaries are and (2) the string is long enough to get robust bigram statistics.
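For anyone curious what that kind of classifier looks like, here is a minimal sketch in Python. It is an illustration under assumptions, not a reference implementation: the names `BigramModel` and `guess_language`, the add-alpha smoothing, the `vocab_size` constant, and the two tiny training sentences are all made up for the example. A real model would be trained on far more text per language.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the character bigrams of text, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramModel:
    """Character-bigram model with add-alpha smoothing (hypothetical helper)."""

    def __init__(self, sample, alpha=0.5, vocab_size=10_000):
        self.counts = Counter(bigrams(sample))
        self.total = sum(self.counts.values())
        self.alpha = alpha
        # Assumed number of possible bigrams; only used for smoothing.
        self.vocab_size = vocab_size

    def cross_entropy(self, text):
        """Average negative log2 probability per bigram of text under this model."""
        grams = bigrams(text)
        denom = self.total + self.alpha * self.vocab_size
        log_prob = sum(
            math.log2((self.counts[g] + self.alpha) / denom) for g in grams
        )
        return -log_prob / len(grams)


def guess_language(text, models):
    """Pick the language whose model gives the lowest cross-entropy."""
    return min(models, key=lambda lang: models[lang].cross_entropy(text))


# Toy training data, far too small for real use (assumption for the example).
models = {
    "en": BigramModel(
        "the quick brown fox jumps over the lazy dog "
        "and then the other thing happens"
    ),
    "de": BigramModel(
        "der schnelle braune fuchs springt über den faulen hund "
        "und dann passiert das andere"
    ),
}

print(guess_language("the dog and the fox", models))
print(guess_language("der hund und der fuchs", models))
```

The score is averaged over very few bigrams when the input is short, which is where caveat (2) comes from: a four-character string gives the model almost nothing to work with.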

Even if you weren’t to specify the language, “lang” solves problem (1) readily.
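To make the boundary point concrete, here is a rough sketch that splits a page into text segments wherever a "lang" attribute changes, assuming "lang" here means the HTML `lang` attribute. It uses Python's standard `html.parser`; the names `LangSegmenter` and `segments_by_lang` are hypothetical, and the handling of unclosed tags is deliberately simplistic.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class LangSegmenter(HTMLParser):
    """Collect (lang, text) pairs, inheriting lang from enclosing elements."""

    def __init__(self, default_lang=None):
        super().__init__()
        # Stack of (tag, effective lang); the first entry is the document default.
        self.stack = [(None, default_lang)]
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # An element without its own lang inherits the enclosing one.
        lang = dict(attrs).get("lang") or self.stack[-1][1]
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append((self.stack[-1][1], text))


def segments_by_lang(html, default_lang=None):
    """Return a list of (lang, text) segments found in an HTML string."""
    parser = LangSegmenter(default_lang)
    parser.feed(html)
    parser.close()
    return parser.segments


page = '<p lang="en">Read <a lang="ko" href="#">한국어</a> next</p>'
for lang, text in segments_by_lang(page):
    print(lang, text)
```

Each segment could then be handed to a per-string classifier on its own. A segment that reports no language at all (`None`) still tells you where one run of text ends and the next begins.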

3 comments
洪 民憙 (Hong Minhee)

@thedansimonson Yeah, but strings in East Asian languages are often too short, e.g., 孤立無援, which is a valid sentence in Korean, Chinese, and Japanese. 😅

reviewer 2 :Schwerified:

@hongminhee oh yea totally—but is it common to mix strings of that length on a website containing multiple languages? I’d suspect generally, where things get mixed up, the shortest you might have is a link or button.

洪 民憙 (Hong Minhee)

@thedansimonson You're right, the shortest ones are buttons or links in a navigation bar.
