@drewharwell I find this bit just perplexing:
> For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.
It’s not that far off saying “I can play any tune on the piano, as long as it sounds just like chopsticks”