One of the key features of the Song Translator app is song synthesis using SonicAPI. The app’s synthesizer requires three inputs: the original music file, the karaoke version of the file (instrumental and backup vocals only), and a voice recording of the song’s lyrics. The voice recording can come either from Bing Translator’s built-in text-to-speech function or from the user’s own recording. The result is an auto-tuned voice recording that matches the melody of the song.

The SonicAPI Web Service

To synthesize songs, the app uses the API at sonicapi.com, a web service for online music processing. The app wraps this service in a class named SonicApiWrapper, which provides the synthesis methods and stores pitches and song segments in separate object types.

Steps for Synthesis

Step 1: Upload the original music and karaoke files to the synthesizer for later use. SonicApiWrapper provides several overloaded methods for uploading songs in different formats. Once one of these methods is called, the web service processes the songs.

Step 2: Add a song segment to the synthesizer. The method for doing this takes in a music stream and starting and ending timestamps, and it cuts the music to create a segment. It then stores properties such as duration and speed.
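As a rough illustration of the bookkeeping in Step 2, the sketch below cuts a segment out of a list of audio samples using start and end timestamps and records its duration and speed. The names SongSegment and cut_segment, the sample-list representation, and the sample rate are assumptions made for this example; they are not taken from SonicApiWrapper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SongSegment:
    """Hypothetical container for one cut segment (not the app's actual type)."""
    samples: List[float]
    start_s: float
    end_s: float
    speed: float = 1.0  # playback speed relative to the source; 1.0 = unchanged

    @property
    def duration_s(self) -> float:
        """Length of the segment in seconds."""
        return self.end_s - self.start_s


def cut_segment(samples: List[float], sample_rate: int,
                start_s: float, end_s: float) -> SongSegment:
    """Return the samples between start_s and end_s as a SongSegment."""
    total_s = len(samples) / sample_rate
    if not 0 <= start_s < end_s <= total_s:
        raise ValueError("timestamps must satisfy 0 <= start < end <= length")
    first = round(start_s * sample_rate)
    last = round(end_s * sample_rate)
    return SongSegment(samples[first:last], start_s, end_s)
```

For instance, calling cut_segment on three seconds of audio with timestamps 0.5 and 1.5 would yield a one-second segment with the default speed of 1.0.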

Step 3: Process the music files and auto-tune the recording. The web service first processes the original music file and splits the song segment into smaller segments, each with a particular duration and pitch. It then does the same for the voice recording. Here are some sample tables showing the processed files:

 

Step 4: The synthesizer alters the duration and pitch of each segment in the voice recording to match those of the corresponding segment in the original music file. In this example, the voice recording is slowed down to 0.53x speed.
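The adjustment in Step 4 comes down to two values per segment: a playback speed and a pitch shift. The sketch below is an assumption about how those values could be computed, not the web service's actual algorithm; the Note type, the function names, and the sample values are hypothetical. It relies on the standard relation that a frequency ratio r corresponds to 12 * log2(r) semitones.

```python
import math
from dataclasses import dataclass


@dataclass
class Note:
    """Hypothetical analysed sub-segment: how long it lasts and its pitch."""
    duration_s: float
    pitch_hz: float


def speed_factor(voice: Note, target: Note) -> float:
    """Playback speed to apply to the voice so it lasts as long as the target.

    Values below 1.0 slow the voice down; values above 1.0 speed it up.
    """
    return voice.duration_s / target.duration_s


def pitch_shift_semitones(voice: Note, target: Note) -> float:
    """Semitones to shift the voice so its pitch matches the target."""
    return 12 * math.log2(target.pitch_hz / voice.pitch_hz)


# Illustrative values only (not taken from the app's output).
spoken = Note(duration_s=0.53, pitch_hz=220.0)
sung = Note(duration_s=1.00, pitch_hz=440.0)
print(speed_factor(spoken, sung))           # below 1.0, so the voice is slowed
print(pitch_shift_semitones(spoken, sung))  # doubling the frequency is one octave up
```

With these made-up values, a spoken syllable lasting 0.53 seconds has to be played at about half speed to fill a one-second sung note, which is the kind of slowdown the 0.53x figure describes.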

Below is a sample table showing the voice recording after auto-tuning:

Step 5: To complete the synthesis, the karaoke music is layered underneath the auto-tuned voice recording to create a final, cohesive audio track.
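Layering one track under another, as in Step 5, is commonly done by summing samples with a gain applied to the backing track. The sketch below shows that general idea on mono tracks of float samples; it is not the service's mixing code, and mix_tracks and its backing_gain parameter are hypothetical names chosen for this example.

```python
from typing import List


def mix_tracks(voice: List[float], backing: List[float],
               backing_gain: float = 0.6) -> List[float]:
    """Mix two mono tracks of float samples in the range [-1.0, 1.0].

    The shorter track is padded with silence, the backing track is scaled by
    backing_gain so it sits underneath the voice, and the sum is clipped.
    """
    length = max(len(voice), len(backing))
    mixed = []
    for i in range(length):
        v = voice[i] if i < len(voice) else 0.0
        b = backing[i] if i < len(backing) else 0.0
        sample = v + backing_gain * b
        mixed.append(max(-1.0, min(1.0, sample)))  # hard clip to the valid range
    return mixed
```

Hard clipping keeps the example short; a production mixer would more likely normalise or limit the signal to avoid audible distortion.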

This process is currently slow: a song segment about 10 seconds long can take up to 2 minutes to synthesize. The translator team is working on updates to improve the speed and quality of the synthesis function.