@nixCraft I’d like to understand what it does to the binary sizes. I would assume the model weights are likely big enough to double the download size, and if that’s the case, I’d prefer it to be in a plugin. Whisper’s tiny model is as big as the entire VLC download was the last time I paid attention to its size.
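If you want a rough sense of the numbers, here’s a minimal check against whisper.cpp’s published ggml model files. This assumes an in-player integration would ship weights of roughly this shape; the URL is whisper.cpp’s upstream model repo, not anything VLC has confirmed shipping:

```python
# Minimal sketch: HEAD-request the tiny model and report its size.
# Assumption: whisper.cpp's model repo on Hugging Face is representative
# of what an in-player Whisper integration would need to download.
import urllib.request

URL = "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin"

req = urllib.request.Request(URL, method="HEAD")
with urllib.request.urlopen(req) as resp:  # follows the CDN redirect
    size_bytes = int(resp.headers["Content-Length"])

# The tiny model alone is on the order of 75 MiB -- comparable to a
# full VLC installer download -- and the larger models run to gigabytes.
print(f"ggml-tiny.bin: {size_bytes / 2**20:.0f} MiB")
```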
Beyond that, my concerns about this kind of thing are:
License. It looks as if Whisper is MIT licensed. Some of the Facebook models have a clause saying that, by using them, you agree not to sue Facebook for any copyright infringement; in the absence of such a clause, this matters less.
Ethics of training. Where did the training data come from? Did the people who created that data consent to its use for training? I don’t see any statement to this effect on the Whisper site. This may also carry legal liability: lossy compression of a copyrighted work generally still counts as copyright infringement, and a neural network is, in effect, a lossy compression of its training data. This is not yet settled precedent and may vary between jurisdictions.
Accuracy. Mechanically generated subtitles are still pretty bad. I’ve recently been watching some older things with subtitles, and the people who wrote them did a great job: they replace words with shorter synonyms to make the subtitles easier to read, yet they capture the meaning. Machine-generated ones are increasingly common, and they fairly regularly replace words with homophones and often miss the key word, especially if there’s any kind of pun.
Social implications. By making it easy to generate bad subtitles on the client device, you reduce the incentive to create good ones. This already seems to be happening on streaming services, and I don’t want to encourage it.
On the positive side, in the short term, bad subtitles are better than no subtitles. Do the costs outweigh the benefits? To me, probably yes.