From Robby the Robot to Jarvis, science fiction writers have long understood how important it is for an artificial being to sound as lifelike as possible. Speech is central to human interaction, and beyond words, it helps us express feelings and emotions: who can forget HAL 9000’s haunting final scene in 2001: A Space Odyssey?
In the real world, things are more complicated of course. Decades before the term ‘artificial intelligence’ had even been coined, scientists were designing systems that tried to mimic the human voice. In 1937, almost 20 years before the seminal Dartmouth workshop, Homer Dudley invented the Voder, the first attempt to synthesize human speech with electronic components: this video has sound samples and extra information on this incredible device.
We’ve come a long way since then! At AWS re:Invent 2016, we announced Polly, a managed service that turns text into lifelike speech, allowing customers to create applications that talk, and build entirely new categories of speech-enabled products. Zero machine learning expertise required: just call an API and get the job done! Since then, the team has regularly added new voices, for a current total of 29 languages and 59 voices.
Today, we’re happy to announce two major new features for Polly: Neural Text-To-Speech, and a groundbreaking newscaster style.
Introducing Neural Text-To-Speech (NTTS)
Through a new machine learning approach, NTTS delivers significant improvements in speech quality. It increases naturalness and expressiveness, two key factors in synthesizing lifelike speech that is closer than ever to human voices. Here’s an example of the quality you can expect.
As of today, NTTS is available for 11 voices, both in real-time and in batch mode.
Why not head over to the AWS console for a quick test?
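If you prefer the command line, here's a minimal sketch for listing the voices that support a given engine. It assumes the AWS CLI is installed and configured; `list_voices` is a hypothetical wrapper name of mine, not an AWS command.

```shell
# Hypothetical wrapper (not part of the AWS CLI): print the IDs of the
# voices that support the given engine. Defaults to the neural engine.
list_voices() {
  aws polly describe-voices --engine "${1:-neural}" \
    --query "Voices[].Id" --output text
}
```

Calling `list_voices` prints the neural voices, and `list_voices standard` prints the standard ones.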
Introducing the newscaster style
Speech quality is certainly important, but more can be done to make a synthetic voice sound even more realistic and engaging. What about style? Human ears can easily tell the difference between a newscast, a sportscast, a university class, and so on; indeed, most humans adopt the right style of speech for the right context, and this helps in getting their message across.
Thanks to NTTS, it’s possible to apply styles to synthesized speech, and you can now use a newscaster style with Polly. Here’s an example.
From news to blog posts, this makes narration sound even more realistic, and customers like The Globe and Mail already use it today. Thanks to Polly and the newscaster style, their readers (or should we say listeners now?) can enjoy articles read in a high-quality voice that sounds like what they might expect to hear on the TV or radio. Adding Amazon Translate, they can also listen to articles that are automatically translated to a language they understand.
As of today, the newscaster style is available for two US English voices (Joanna and Matthew), both in real-time and in batch mode. Again, you can head over to the AWS console for a quick test, and here’s the same clip as above with the newscaster style.
Using Polly APIs with the NTTS voices and the newscaster style is extremely easy. Let me show you how to get started with both.
Using NTTS Voices and the Newscaster Style
Let’s grab a bit of text for Polly to read: how about this paragraph from the 2006 announcement of Amazon Simple Storage Service (S3)?
“Earlier today we rolled out Amazon S3, our reliable, highly scalable, and low-latency data storage service. Using SOAP and REST interfaces, developers can easily store any number of blocks of data in S3. Each block can be up to 5 GB in length, and is associated with a user-defined key and additional key-value metadata pairs. Further, each block is protected by an ACL (Access Control List) allowing the developer to keep the data private, share it for reading, or share it for reading and writing, as desired. The system was designed to provide a data availability factor of 99.99%; all data is transparently stored in multiple locations”.
I will use batch mode in order to save sound files in S3 and let you grab them: I explicitly changed permissions to make them public, but don’t worry, your own files are completely private by default.
Let’s first try the standard Matthew voice.
$ aws polly start-speech-synthesis-task \
    --voice-id Matthew --text file://s3.txt \
    --output-s3-bucket-name "jsimon-polly" --output-format mp3 \
    --query "SynthesisTask.TaskId"
"e3db409c-419d-4a31-a3a7-72c1e712fe23"
$ wget https://jsimon-polly.s3.amazonaws.com/e3db409c-419d-4a31-a3a7-72c1e712fe23.mp3 -O matthew-standard.mp3
Tell us a bit about S3, Matthew.
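One practical note: batch synthesis tasks run asynchronously, so the MP3 file may not be in the bucket yet when the command returns. Here's a minimal sketch of a polling loop, assuming the AWS CLI is configured. `wait_for_task` is a hypothetical helper name of mine, not part of the AWS CLI, and the attempt count and delay are arbitrary defaults.

```shell
# Hypothetical helper (not part of the AWS CLI): poll a Polly batch task
# until it completes, fails, or the attempts run out.
# Usage: wait_for_task TASK_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_task() {
  task_id="$1"
  max_attempts="${2:-30}"   # assumed default, tune to taste
  delay="${3:-2}"           # assumed default, in seconds
  status="unknown"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Ask Polly for the current status of the task
    status=$(aws polly get-speech-synthesis-task \
      --task-id "$task_id" \
      --query "SynthesisTask.TaskStatus" --output text) || return 1
    case "$status" in
      completed) return 0 ;;
      failed)    echo "task $task_id failed" >&2; return 1 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "task $task_id still '$status' after $max_attempts attempts" >&2
  return 1
}
```

With this in place, something like `wait_for_task "$TASK_ID" && wget ...` only downloads the file once Polly reports the task as completed.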
Now, let’s use the NTTS version of the same voice: all we have to do is set the ‘engine’ parameter to ‘neural’.
$ aws polly start-speech-synthesis-task \
    --voice-id Matthew --engine neural --text file://s3.txt \
    --output-s3-bucket-name "jsimon-polly" --output-format mp3 \
    --query "SynthesisTask.TaskId"
"e3902335-c1e6-450b-b6e9-f913d6d52055"
$ wget https://jsimon-polly.s3.amazonaws.com/e3902335-c1e6-450b-b6e9-f913d6d52055.mp3 -O matthew-neural.mp3
You should immediately notice the quality improvement that NTTS brings. Of course, Polly has correctly picked up technical abbreviations, numbers, etc.
Now let’s spice things up and apply the newscaster style. This requires that we use the SSML markup language. All we need to do is to enclose the text like so:
<speak>
<amazon:domain name="news">
Earlier today we rolled out Amazon S3, our reliable, highly scalable, and low-latency data storage service. Using SOAP and REST interfaces, developers can easily store any number of blocks of data in S3. Each block can be up to 5 GB in length, and is associated with a user-defined key and additional key-value metadata pairs. Further, each block is protected by an ACL (Access Control List) allowing the developer to keep the data private, share it for reading, or share it for reading and writing, as desired. The system was designed to provide a data availability factor of 99.99%; all data is transparently stored in multiple locations.
</amazon:domain>
</speak>
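Rather than editing the file by hand, you could generate it from the plain-text version. The following is a minimal sketch: `to_newscaster_ssml` is a hypothetical helper of mine, and it only escapes `&`, `<` and `>`, which is what the text content needs to stay well-formed XML.

```shell
# Hypothetical helper: wrap a plain-text file in the newscaster SSML tags.
# Escapes &, < and > so that the text remains well-formed inside <speak>.
# Usage: to_newscaster_ssml INPUT_TEXT_FILE > OUTPUT_SSML_FILE
to_newscaster_ssml() {
  printf '<speak>\n<amazon:domain name="news">\n'
  # The ampersand must be escaped first, or it would re-escape &lt; and &gt;
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' "$1"
  printf '</amazon:domain>\n</speak>\n'
}
```

For the example in this post, that would be `to_newscaster_ssml s3.txt > s3.ssml`.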
Let’s synthesize this text again, making sure to set text type to SSML.
$ aws polly start-speech-synthesis-task \
    --voice-id Matthew --engine neural \
    --text file://s3.ssml --text-type ssml \
    --output-s3-bucket-name "jsimon-polly" --output-format mp3 \
    --query "SynthesisTask.TaskId"
"25c18bda-b32b-4485-a45f-eb9b757a513b"
$ wget https://jsimon-polly.s3.amazonaws.com/25c18bda-b32b-4485-a45f-eb9b757a513b.mp3 -O matthew-neural-newscaster.mp3
I’m sure you can immediately tell the difference! Doesn’t this sound like a news reporter reading our text?
If you’re curious about the Joanna voice, here are the equivalent clips: standard, neural, and neural with newscaster style.
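Everything above uses batch mode. For real-time mode, the AWS CLI's `synthesize-speech` command writes the audio straight to a local file, which suits short passages like this one. Here's a minimal sketch, assuming the AWS CLI is configured; `speak_now` is a hypothetical wrapper name, and the file names are placeholders.

```shell
# Hypothetical wrapper (not part of the AWS CLI): synthesize an SSML file
# in real time with a neural voice and save the audio locally.
# Usage: speak_now VOICE_ID SSML_FILE OUTPUT_MP3
speak_now() {
  aws polly synthesize-speech \
    --voice-id "$1" --engine neural \
    --text-type ssml --text "file://$2" \
    --output-format mp3 \
    "$3"
}
```

For instance, `speak_now Joanna s3.ssml joanna-neural-newscaster.mp3` would produce the newscaster clip for Joanna without going through S3.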
Available Now!
As you can see, it’s extremely easy to use these new features, and they are available today in US East (N. Virginia), US West (Oregon) and Europe (Ireland). The free tier offers 1 million characters for NTTS voices per month for the first 12 months, starting from your first request for speech (standard or NTTS).
We’re looking forward to your feedback! Please post it to the AWS Forum for Polly, or send it to your usual AWS support contacts.
— Julien;
Source: AWS News