Fortunately, I don't need to do this in real time: I can combine the audio in advance, before it's played in the voice channel. But I don't know how to do the combining automatically.
As far as I understand, I need to call ffmpeg from my Go code?
Here's a rough schema of what I want to accomplish. Note that the voice files can be of arbitrary length (anywhere from a couple of seconds to several minutes), and their number varies from three to five. They all play one after another, starting about 90 minutes into the music.