
How-To: Convert Neural Voice Audio from Amazon Polly (mp3) to Spark AR (m4a)

Currently, Facebook’s Spark AR Studio is restrictive about the audio formats it supports: only M4A with specific settings is allowed. This short tutorial explains how to convert artificially generated neural voices (in this case, an mp3 file produced by Amazon Polly) to the m4a format accepted by Spark AR. I’m using the free Audacity tool, which integrates the open-source FFmpeg plug-in.

Spark AR has the following requirements on audio files:

  • M4A format
  • Mono
  • 44.1 kHz sample rate
  • 16-bit depth

Generating Audio using Text-to-Speech (mp3 / PCM)

Neither Amazon Polly nor the Microsoft Azure Text-to-Speech cognitive service can directly produce an m4a audio file. In its additional settings, Polly offers MP3, OGG, PCM and Speech Marks. MP3 goes up to a sample rate of 24000 Hz, while PCM is limited to 16000 Hz.

Choose either PCM for uncompressed sound or go with MP3. It doesn’t make much difference if you don’t plan many intermediate processing steps, as the final file has to be very small anyway to publish your finished project to Facebook or Instagram.

Configure Amazon Polly with a good file format (e.g., mp3), enter the text and choose a neural speech engine. Make sure the AWS region and language support neural voices!

Next, choose the language and the voice to use, and enter the text to speak. Most languages already support neural voices, which I’d recommend, as they sound more natural. They are not yet available in all AWS regions and not for all languages. Plus, neural engines are more expensive; however, for a few short, static text blocks, the cost savings of the standard voices are not worth giving up the huge quality boost of the neural voices.

You can always click on “Listen” to hear a preview. Once you’re happy with the spoken text, click on “Download” to get the mp3 file. To quickly test, you can also download my generated sound file.
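The console steps above could also be scripted. The following is a minimal sketch using the `boto3` SDK, not the workflow used in this article. The voice `Joanna`, the region `us-east-1`, and the helper names `build_polly_request` and `synthesize_to_file` are assumptions chosen for illustration; check that your voice and region support the neural engine.

```python
# Sample rates Polly accepts per output format (the API expects strings).
VALID_SAMPLE_RATES = {
    "mp3": {"8000", "16000", "22050", "24000"},
    "ogg_vorbis": {"8000", "16000", "22050", "24000"},
    "pcm": {"8000", "16000"},
}


def build_polly_request(text, voice_id="Joanna", output_format="mp3",
                        sample_rate="24000"):
    """Return keyword arguments for Polly's synthesize_speech call.

    voice_id is an assumed example voice; pick one that supports "neural".
    """
    allowed = VALID_SAMPLE_RATES.get(output_format)
    if allowed is None:
        raise ValueError(f"unsupported output format: {output_format}")
    if sample_rate not in allowed:
        raise ValueError(f"{output_format} does not support {sample_rate} Hz")
    return {
        "Text": text,
        "VoiceId": voice_id,
        "Engine": "neural",
        "OutputFormat": output_format,
        "SampleRate": sample_rate,
    }


def synthesize_to_file(text, path, region_name="us-east-1"):
    """Call Polly and write the returned audio stream to `path`.

    Requires boto3 and configured AWS credentials; region_name is an
    assumed example region.
    """
    import boto3  # imported lazily so build_polly_request works without it

    polly = boto3.client("polly", region_name=region_name)
    response = polly.synthesize_speech(**build_polly_request(text))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

Calling `synthesize_to_file("Hello from Spark AR", "speech.mp3")` would then produce an mp3 comparable to the one downloaded from the console, which still needs the conversion described in the rest of this article.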

Converting Audio for Spark AR M4A with Audacity

First, download and install the open-source Audacity tool. Make sure you get it from the official Audacity website.

FFmpeg for Windows and Linux

While Audacity itself is immensely powerful, it doesn’t offer m4a / mp4 support out of the box on Windows and Linux, due to legal reasons (patent concerns). However, there is a great open-source library called FFmpeg. The framework provides only the source files, which you’d have to compile yourself. As this would require setting up quite some technical infrastructure, you can simply go with pre-compiled libraries provided by the community.

Download and install a pre-compiled version of FFmpeg through the Audacity settings.

Directly in the Audacity menu Edit > Preferences > Libraries, you will find a button that takes you to the FFmpeg download page. Make sure you close Audacity while you install FFmpeg. If you stay with the default location and settings during installation, Audacity will automatically find and start using the library. You can always verify the FFmpeg location and version in the Audacity settings, or manually locate the install path if auto-detection didn’t work.

Adapting the Sample Rate to 44100 Hz

Open the created mp3 file with Audacity. While the sound file is already mono, it’s in the wrong format: 32-bit float with a 24000 Hz sample rate.

The mp3 opened in Audacity.
The mp3 generated by Amazon Polly opened in Audacity. You still need to convert the sample rate and depth.

In the lower left corner of the main Audacity window, change the “Project Rate (Hz)” to 44100. This step is easy to miss; if you skip it, the export fails with an error!

Set the Audacity Project Rate (Hz) to 44100.
Change the Project Rate (Hz) to 44100 in the drop down in the lower left corner of Audacity.

Next, let’s change the track sample rate. You need to be careful here: simply switching the rate would speed up the sound file, making the voice sound like Mickey Mouse. The correct tool to use is: Tracks > Resample…

Select 44100 as the “New sample rate (Hz)”.

Audacity menu to resample the track.
Resample the track to 44100 Hz to prevent distortion.
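To see why merely switching the rate distorts the voice, here is a small arithmetic sketch. It only models the numbers involved, not Audacity's actual resampling filter; the function names are made up for this example.

```python
from fractions import Fraction

SOURCE_RATE = 24000  # Hz, sample rate of the mp3 produced by Polly
TARGET_RATE = 44100  # Hz, sample rate required by Spark AR


def playback_speed_if_relabelled(source_rate, target_rate):
    """Speed-up factor when the samples are kept but played at target_rate."""
    return Fraction(target_rate, source_rate)


def resampled_length(num_samples, source_rate, target_rate):
    """Number of samples needed to keep the same duration at target_rate."""
    return round(num_samples * target_rate / source_rate)


three_seconds = 3 * SOURCE_RATE  # 72000 samples at 24000 Hz

# Only relabelling the rate plays the same samples faster (and higher-pitched).
print(float(playback_speed_if_relabelled(SOURCE_RATE, TARGET_RATE)))  # 1.8375

# Resampling instead computes new samples so the duration stays 3 seconds.
print(resampled_length(three_seconds, SOURCE_RATE, TARGET_RATE))  # 132300
```

In other words, relabelling would play the clip almost twice as fast, whereas resampling interpolates additional samples so that speed and pitch stay the same.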

Change the Format to 16-bit PCM

Another requirement of Spark AR is the 16-bit PCM format. The generated mp3 file had a 32-bit float format. The easiest way to perform the conversion is by clicking the little down-arrow at the top of the track menu. Next, select Format > 16-bit PCM.

Convert the track format to 16-bit PCM.
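Conceptually, the format change maps each floating-point sample to a 16-bit integer. The following is a simplified sketch of that idea. It assumes samples in the range -1.0 to 1.0 and does not model any dithering an audio editor may apply; `float_to_int16` is a hypothetical helper, not an Audacity function.

```python
def float_to_int16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit signed integers."""
    converted = []
    for sample in samples:
        # Clip first so out-of-range values cannot overflow 16 bits.
        clipped = max(-1.0, min(1.0, sample))
        converted.append(int(round(clipped * 32767)))
    return converted


print(float_to_int16([0.0, 0.5, -1.0, 1.2]))  # [0, 16384, -32767, 32767]
```

The last value shows the clipping: a sample of 1.2 is outside the valid range and is limited to the largest 16-bit value.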

Export the Speech File as m4a

All conversions are done, now we only need to save the file in the format required by Spark AR.

Go to File > Export > Export Audio… If you are on a Mac or have installed FFmpeg correctly, you will see the option to export as M4A. For Spark AR, try to balance quality against file size; don’t pick the highest quality setting by default, as a large audio file could prevent you from publishing a larger project to Facebook or Instagram.

Export Audio dialog in Audacity.
Export the audio file to an M4A (AAC) file through FFmpeg in Audacity.

In the following dialog, you can keep the defaults. To save every bit of storage (Spark AR effects have extremely strict size limitations), you can delete the tag about the software. Listen to the exported m4a file to check that it still sounds correct. As an additional check, you can download my generated and converted sound file and compare it with your results.

Possible Errors

I struggled for some time with the following error message “FFmpeg Error – Can’t open audio codec “aac” (0x11a)”:

Error dialog in Audacity: FFmpeg Error - Can't open audio codec "aac" (0x11a)
Error dialog in Audacity when exporting the m4a through FFmpeg.

In this case, given that FFmpeg is correctly installed, the cause is that the project sample rate in the lower left corner of Audacity was not changed to 44100 Hz. Make sure that both the project sample rate and the track sample rate are 44100 Hz.
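To double-check an exported file outside of Audacity, you could inspect it with `ffprobe`, which ships with FFmpeg. The sketch below only parses ffprobe's JSON output. The `sample_output` string is a hand-written example of what such output might look like, not a real log, and `meets_spark_ar_requirements` is a hypothetical helper.

```python
import json

# Append the file name to this list and run it, e.g. with subprocess.run(...).
FFPROBE_COMMAND = [
    "ffprobe", "-v", "error",
    "-select_streams", "a:0",
    "-show_entries", "stream=codec_name,sample_rate,channels",
    "-of", "json",
]


def meets_spark_ar_requirements(ffprobe_json):
    """Check codec, sample rate and channel count of the first audio stream."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return (
        stream["codec_name"] == "aac"
        and int(stream["sample_rate"]) == 44100
        and int(stream["channels"]) == 1
    )


# Hand-written example output for a correctly converted file (assumption).
sample_output = (
    '{"streams": [{"codec_name": "aac", '
    '"sample_rate": "44100", "channels": 1}]}'
)
print(meets_spark_ar_requirements(sample_output))  # True
```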

Playing the M4A Sound File in Spark AR Studio

Now, create a Sharing Experience project in Spark AR. In the Assets panel in the bottom left corner, import the exported m4a file from your computer.

Spark AR screenshot: Import the m4a.
Import the text-to-speech m4a audio file to a new Spark AR project.

When the audio file is selected, you see its properties in the properties panel on the right.

Let’s configure the project so that it plays the audio file shortly after the project starts. Activate the Patch Editor (e.g., through the Workspace > Show/Hide Patch Editor in the upper left corner).

Next, add the Runtime patch (purple) to the canvas. This patch outputs the time in seconds since the project started. Connect its output to the First Input of a Greater Than patch. In the Second Input, enter “1” so that we start the audio after 1 second.

Spark AR screenshot: the first patches to start a timed effect.
Connect the Spark AR Runtime patch to a Greater Than patch to control after how many seconds the audio will start.

Adding the Sound File, Player, Controller and Speaker

Next, drag the sound file from the Assets panel into the Patch Editor. It will get an orange highlight and has a single output. Connect that output to an Audio Player patch. The Audio Player has a second input; connect a Single-Clip Controller patch to it. This patch can start playing the sound file when a specific event occurs.

Now, we simply need to connect the output of the Greater Than patch to the Play input of the Single-Clip Controller patch. This doesn’t work directly. However, if you just drag the line between the patches, Spark AR automatically inserts the required Pulse patch in the middle.

Spark AR screenshot: pulse patch, audio player, sound file and controller.
Spark AR automatically inserts the Pulse patch to trigger the single-clip controller for playing audio after a condition is met.
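The logic of this patch chain can be modeled in a few lines. This is a conceptual simulation, not Spark AR code: it assumes the Pulse patch fires once whenever its boolean input changes from false to true, and `simulate_patch_graph` is a made-up name.

```python
def simulate_patch_graph(frame_times, threshold=1.0):
    """Return the frame times at which the Play input would be triggered."""
    play_events = []
    previous = False
    for runtime in frame_times:
        condition = runtime > threshold  # Greater Than patch
        if condition and not previous:   # Pulse: fires on false -> true only
            play_events.append(runtime)
        previous = condition
    return play_events


frames = [i / 10 for i in range(0, 31)]  # 0.0 s to 3.0 s in 0.1 s steps
print(simulate_patch_graph(frames))  # [1.1]
```

The condition stays true for every frame after the first second, but the pulse is emitted only once, which is why the clip starts a single time instead of restarting on every frame.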

However, we still won’t hear anything. We still need to connect the patches to a speaker. In the Scene panel, under Camera, right click to Add Objects > Speaker. This adds a speaker0 object.

Spark AR screenshot: add a speaker to the scene.
Add a speaker to our Spark AR scene so that we can play sounds.

Make sure the speaker is selected in the scene. In the properties panel on the right, click the right-arrow next to Audio. This will insert an input patch into the Patch Editor.

Spark AR screenshot: audio input of the speaker.
Make the audio input of the speaker visible in the Spark AR patch editor.

The final step is to connect the Audio Player to the yellow speaker0 patch. To test, restart the scene from the icons on the left side. After one second, the neural speech file will start playing!

Finished Spark AR project playing a sound file upon its start. The artificial speech was generated using the Amazon Polly text-to-speech neural voice engine.