@drewharwell I find this bit just perplexing:

> For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.

It’s not that far off saying “I can play any tune on the piano, as long as it sounds just like Chopsticks.”