@zanagb @bedast What makes it clear that it's off-device? Can you provide a link?
What tools are you talking about? I use Linux, so what should I search for? I would like to compare it with the tool I'm building as part of my day job (for which I compile the *whole* source code, incl. all dependencies, so I know for a fact that nothing is ever siphoned off).
As for fucked-up nonsense, here's what I see on YouTube all the time: YouTube's automatic subtitles are beyond terrible. With automatic translation to my native language they're even worse. Family members use them and I can't fathom how they can get anything out of them. No pauses, no punctuation, full of mistakes.
Whisper is a 1000x improvement over YouTube's subtitles. It adds all the correct punctuation and everything. It only fails with proper names (unless it's given context) and with speech that has a lot of background noise. That holds for all 4 languages I've been testing it with.
For regular casual speech it doesn't work _that_ well, but my work project takes that into account by marking all the dubious words. It also discards whole sentences with too many dubious words, because they tend to be gibberish from random noise. That's why I shudder when I read about the model being used as-is for conversations, with no regard for confidence levels, without the context feature, and with naive stitching (it can only transcribe 30 seconds at a time). The results are awful, as I would have expected.
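
Roughly, the dubious-word filtering idea looks something like the sketch below. This is an illustration only, not the actual project code: the `Word` type, the `[word?]` marker, and both thresholds are made up for the example, and it assumes the transcriber hands you a per-word confidence between 0 and 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    probability: float  # assumed per-word confidence in [0, 1]


def mark_dubious(
    words: List[Word],
    word_threshold: float = 0.5,     # hypothetical cutoff for a "dubious" word
    max_dubious_ratio: float = 0.4,  # hypothetical cutoff for dropping a sentence
) -> Optional[str]:
    """Mark low-confidence words; return None if the sentence should be discarded."""
    if not words:
        return None
    dubious = [w.probability < word_threshold for w in words]
    # Too many dubious words: likely gibberish from noise, so drop the sentence.
    if sum(dubious) / len(words) > max_dubious_ratio:
        return None
    # Otherwise keep the sentence and flag the dubious words for the reader.
    return " ".join(
        f"[{w.text}?]" if is_dubious else w.text
        for w, is_dubious in zip(words, dubious)
    )
```

With these made-up thresholds, a sentence with one shaky word out of three is kept with that word flagged, while a sentence where two out of three words are shaky is dropped entirely.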