Piko Starsider :verified_paw:

@zanagb @bedast VLC will do it on-device, not sending anything anywhere.

Whisper models are terrible at transcribing casual conversations between doctors and patients because the training data doesn't reflect that kind of speech and environment. But they excel at transcribing movies etc. because a lot of the training data is closed captions. So this would actually work reasonably well. One can put some text with the names of characters, places, etc. as context, and that makes it transcribe those names very well. (source: I've been using Whisper models at work, and occasionally I've been putting the mic towards the speaker with some show I'm watching to test) (also: I haven't sent any data to OpenAI nor paid them anything)

3 comments
ZanaGB

@starsider @bedast the CES demo makes it clear the transcription is **off-device**, i.e., syphoning data. And besides, there are already many built-in tools for that on macOS and Linux.

If I wanted fucked-up nonsense on my videos I would watch a raunchy YouTube poop from the early 2010s.

I'd rather have a phoneme-based system where at least you can tell what the gibberish came from, you can tell it's an error, and you can even reconstruct the sentence.

We do not need this.

Piko Starsider :verified_paw:

@zanagb @bedast What makes it clear that it's off-device? Can you provide a link?

What tools are you talking about? I use Linux, what should I search for? I would like to compare them with the tool I'm building as part of my day job (for which I compile the *whole* source code incl. all dependencies, so I know for a fact that nothing is ever syphoned).

About fucked-up nonsense, here's what I see on YouTube all the time: YouTube's automatic subtitles are beyond terrible. With automatic translations to my native language they're even worse. Family members use them and I can't fathom how they can get anything out of them. No pauses, no punctuation, full of mistakes.

Using Whisper is a 1000x improvement over YouTube's. It adds all the correct punctuation and everything. It only fails with proper names (unless it's given a context) and with speech with a lot of background noise. That holds in all 4 languages I've been testing it with.

For regular casual speech it doesn't work _that_ well, but my work project takes that into account by marking all the dubious words. It also discards whole sentences with too many dubious words because they tend to be gibberish from random noise. That's why I shudder when I read about the model being used as-is for conversations without regard for confidence levels, without using the context feature, and using naive stitching (since it can only transcribe 30 seconds at a time). The results are awful, as I would have expected.
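
To make the dubious-word idea concrete, here is a minimal sketch (not the actual project code). It assumes each transcribed word already comes with a confidence value between 0 and 1; the `Word` class, the `filter_sentence` function, and both thresholds are made up for illustration and would need tuning on real audio.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    probability: float  # assumed per-word confidence in [0, 1]


def filter_sentence(
    words: List[Word],
    word_threshold: float = 0.5,     # hypothetical: below this a word is "dubious"
    max_dubious_ratio: float = 0.4,  # hypothetical: above this the sentence is dropped
) -> Optional[str]:
    """Mark dubious words, or return None if the sentence has too many of them."""
    if not words:
        return None
    dubious = [w.probability < word_threshold for w in words]
    if sum(dubious) / len(words) > max_dubious_ratio:
        # Too many low-confidence words: most likely gibberish from noise.
        return None
    # Keep the sentence, but flag each dubious word so the reader can see it.
    return " ".join(
        f"[{w.text}?]" if is_dubious else w.text
        for w, is_dubious in zip(words, dubious)
    )


mostly_ok = [Word("hello", 0.9), Word("wrld", 0.3), Word("again", 0.8)]
noise = [Word("uh", 0.2), Word("zk", 0.1), Word("the", 0.6)]
print(filter_sentence(mostly_ok))
print(filter_sentence(noise))
```

The first sentence is kept with the low-confidence word flagged; the second is dropped entirely because two of its three words are dubious.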

ZanaGB

@starsider @bedast and... If you think Whisper is anywhere near adequate for the job, clearly you do not rely on subtitles to hear, nor consume information and media through foreign sources. The pitfalls are very apparent and very damaging for the actual purpose of "understanding what is actually happening". Random Hindi-speaking people with tutorials about the weird obscure software you are trying to debug are always an incredibly easy test, and these... abominations always fail it.