With Azure Cognitive Services you can translate text into other languages and synthesize the text to speech. You can also add voice effects such as a voice style.
A voice style adds realism by giving the synthesized voice an emotion. The voices are already neural network trained, and adding a voice style makes the synthesized speech even more realistic and versatile.
The GitHub repo for this is available here as a .NET MAUI Blazor client written with .NET 8:
MultiLingual translator DEMO GitHub repo
Not all voices in Azure Cognitive Services support voice styles. An overview of the voices that do is shown here:
https://learn.microsoft.com/nb-no/azure/ai-services/speech-service/language-support?tabs=tts#voice-styles-and-roles
More and more synthetic voices in Azure Cognitive Services are getting voice styles that express emotions. For now, most of the voices that support styles are English (en-US) or Chinese (zh-CN), and a few other languages have a handful of such voices.
This will most likely improve in the future, either as more neural voices are trained in voice styles or as some generic voice style algorithm is developed that can infer emotions on a general level, although that still sounds a bit sci-fi.
Azure Cognitive Services text-to-speech voices with support for emotions / voice styles
Voice | Styles | Roles |
de-DE-ConradNeural | cheerful | Not supported |
en-GB-SoniaNeural | cheerful, sad | Not supported |
en-US-AriaNeural | angry, chat, cheerful, customerservice, empathetic, excited, friendly, hopeful, narration-professional, newscast-casual, newscast-formal, sad, shouting, terrified, unfriendly, whispering | Not supported |
en-US-DavisNeural | angry, chat, cheerful, excited, friendly, hopeful, sad, shouting, terrified, unfriendly, whispering | Not supported |
en-US-GuyNeural | angry, cheerful, excited, friendly, hopeful, newscast, sad, shouting, terrified, unfriendly, whispering | Not supported |
en-US-JaneNeural | angry, cheerful, excited, friendly, hopeful, sad, shouting, terrified, unfriendly, whispering | Not supported |
en-US-JasonNeural | angry, cheerful, excited, friendly, hopeful, sad, shouting, terrified, unfriendly, whispering | Not supported |
en-US-JennyNeural | angry, assistant, chat, cheerful, customerservice, excited, friendly, hopeful, newscast, sad, shouting, terrified, unfriendly, whispering | Not supported |
en-US-NancyNeural | angry, cheerful, excited, friendly, hopeful, sad, shouting, terrified, unfriendly, whispering | Not supported |
en-US-SaraNeural | angry, cheerful, excited, friendly, hopeful, sad, shouting, terrified, unfriendly, whispering | Not supported |
en-US-TonyNeural | angry, cheerful, excited, friendly, hopeful, sad, shouting, terrified, unfriendly, whispering | Not supported |
es-MX-JorgeNeural | chat, cheerful | Not supported |
fr-FR-DeniseNeural | cheerful, sad | Not supported |
fr-FR-HenriNeural | cheerful, sad | Not supported |
it-IT-IsabellaNeural | chat, cheerful | Not supported |
ja-JP-NanamiNeural | chat, cheerful, customerservice | Not supported |
pt-BR-FranciscaNeural | calm | Not supported |
zh-CN-XiaohanNeural | affectionate, angry, calm, cheerful, disgruntled, embarrassed, fearful, gentle, sad, serious | Not supported |
zh-CN-XiaomengNeural | chat | Not supported |
zh-CN-XiaomoNeural | affectionate, angry, calm, cheerful, depressed, disgruntled, embarrassed, envious, fearful, gentle, sad, serious | Boy, Girl, OlderAdultFemale, OlderAdultMale, SeniorFemale, SeniorMale, YoungAdultFemale, YoungAdultMale |
zh-CN-XiaoruiNeural | angry, calm, fearful, sad | Not supported |
zh-CN-XiaoshuangNeural | chat | Not supported |
zh-CN-XiaoxiaoNeural | affectionate, angry, assistant, calm, chat, chat-casual, cheerful, customerservice, disgruntled, fearful, friendly, gentle, lyrical, newscast, poetry-reading, sad, serious, sorry, whisper | Not supported |
zh-CN-XiaoyiNeural | affectionate, angry, cheerful, disgruntled, embarrassed, fearful, gentle, sad, serious | Not supported |
zh-CN-XiaozhenNeural | angry, cheerful, disgruntled, fearful, sad, serious | Not supported |
zh-CN-YunfengNeural | angry, cheerful, depressed, disgruntled, fearful, sad, serious | Not supported |
zh-CN-YunhaoNeural | advertisement-upbeat | Not supported |
zh-CN-YunjianNeural | angry, cheerful, depressed, disgruntled, documentary-narration, narration-relaxed, sad, serious, sports-commentary, sports-commentary-excited | Not supported |
zh-CN-YunxiaNeural | angry, calm, cheerful, fearful, sad | Not supported |
zh-CN-YunxiNeural | angry, assistant, chat, cheerful, depressed, disgruntled, embarrassed, fearful, narration-relaxed, newscast, sad, serious | Boy, Narrator, YoungAdultMale |
zh-CN-YunyangNeural | customerservice, narration-professional, newscast-casual | Not supported |
zh-CN-YunyeNeural | angry, calm, cheerful, disgruntled, embarrassed, fearful, sad, serious | Boy, Girl, OlderAdultFemale, OlderAdultMale, SeniorFemale, SeniorMale, YoungAdultFemale, YoungAdultMale |
zh-CN-YunzeNeural | angry, calm, cheerful, depressed, disgruntled, documentary-narration, fearful, sad, serious | OlderAdultMale, SeniorMale |
Screenshot from the DEMO showing its user interface.
You enter the text to translate at the top, and the language of the text is detected using the language detection functionality in Azure Cognitive Services. You can then select which language to translate the text into. The demo makes a REST call to Azure Cognitive Services to translate the text, and you can also listen to the translated text as speech. The demo now also lets you add a voice style. Use the table above to select a voice actor that supports the voice style you want to test. As noted, voice styles are still limited to a few languages and voice actors. If the selected voice actor does not support the chosen style, you will hear the voice in its normal, neutral style.
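The fallback to a neutral voice can be sketched as a simple lookup against the table. This is a minimal sketch and not code from the demo repo: the VoiceStyleCatalog class and its ResolveStyle method are hypothetical names, the dictionary holds only three rows copied from the table, and "normal-neutral" is the value the demo's own code uses for "no style".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the demo repo): checks a requested style
// against the styles a voice supports and falls back to the neutral style.
public static class VoiceStyleCatalog
{
    public const string NeutralStyle = "normal-neutral";

    // Three rows copied from the table; a real catalog would hold all of them.
    private static readonly Dictionary<string, HashSet<string>> StylesByVoice =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase)
        {
            ["en-GB-SoniaNeural"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "cheerful", "sad" },
            ["fr-FR-DeniseNeural"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "cheerful", "sad" },
            ["pt-BR-FranciscaNeural"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "calm" },
        };

    // Returns the requested style only if the voice is known and supports it.
    public static string ResolveStyle(string voiceName, string? requestedStyle)
    {
        if (!string.IsNullOrWhiteSpace(requestedStyle)
            && StylesByVoice.TryGetValue(voiceName, out var styles)
            && styles.Contains(requestedStyle))
        {
            return requestedStyle;
        }
        return NeutralStyle;
    }
}
```

Calling VoiceStyleCatalog.ResolveStyle("pt-BR-FranciscaNeural", "angry") returns the neutral style, because that voice only lists "calm" in the table.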
Let's look at some code for this DEMO too. You can study the GitHub repo and clone it to test it out yourself.
The TextToSpeechUtil class handles much of the logic for creating voice from text input. It also creates the SSML-XML contents and performs the REST API call that creates the voice file.
SSML here stands for Speech Synthesis Markup Language.
SSML is a W3C standard that other vendors such as Google have adopted too. Microsoft's implementation is documented here:
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup
using Microsoft.Extensions.Configuration;
using MultiLingual.Translator.Lib.Models;
using System;
using System.Security;
using System.Text;
using System.Xml.Linq;

namespace MultiLingual.Translator.Lib
{
    public class TextToSpeechUtil : ITextToSpeechUtil
    {
        public TextToSpeechUtil(IConfiguration configuration)
        {
            _configuration = configuration;
        }

        public async Task<TextToSpeechResult> GetSpeechFromText(string text, string language, TextToSpeechLanguage[] actorVoices,
            string? preferredVoiceActorId, string? preferredVoiceStyle)
        {
            var result = new TextToSpeechResult();
            result.Transcript = GetSpeechTextXml(text, language, actorVoices, preferredVoiceActorId, preferredVoiceStyle, result);
            result.ContentType = _configuration[TextToSpeechSpeechContentType];
            result.OutputFormat = _configuration[TextToSpeechSpeechXMicrosoftOutputFormat];
            result.UserAgent = _configuration[TextToSpeechSpeechUserAgent];
            result.AvailableVoiceActorIds = ResolveAvailableActorVoiceIds(language, actorVoices);
            result.LanguageCode = language;
            string? token = await GetUpdatedToken();
            HttpClient httpClient = GetTextToSpeechWebClient(token);
            string ttsEndpointUrl = _configuration[TextToSpeechSpeechEndpoint];
            // Post the SSML document; the response body is the synthesized audio
            var response = await httpClient.PostAsync(ttsEndpointUrl, new StringContent(result.Transcript, Encoding.UTF8, result.ContentType));
            using (var memStream = new MemoryStream())
            {
                var responseStream = await response.Content.ReadAsStreamAsync();
                await responseStream.CopyToAsync(memStream);
                result.VoiceData = memStream.ToArray();
            }
            return result;
        }

        private async Task<string?> GetUpdatedToken()
        {
            string? token = _token?.ToNormalString();
            // Use TotalMinutes: Minutes is only the 0-59 minutes component and would miss e.g. 63 elapsed minutes
            if (_lastTimeTokenFetched == null || DateTime.Now.Subtract(_lastTimeTokenFetched.Value).TotalMinutes > 8)
            {
                token = await GetIssuedToken();
            }
            return token;
        }

        private HttpClient GetTextToSpeechWebClient(string? token)
        {
            var httpClient = new HttpClient();
            httpClient.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", token);
            httpClient.DefaultRequestHeaders.Add("X-Microsoft-OutputFormat", _configuration[TextToSpeechSpeechXMicrosoftOutputFormat]);
            httpClient.DefaultRequestHeaders.Add("User-Agent", _configuration[TextToSpeechSpeechUserAgent]);
            return httpClient;
        }

        public string GetSpeechTextXml(string text, string language, TextToSpeechLanguage[] actorVoices, string? preferredVoiceActorId,
            string? preferredVoiceStyle, TextToSpeechResult result)
        {
            result.VoiceActorId = ResolveVoiceActorId(language, preferredVoiceActorId, actorVoices);
            // Escape the text so characters such as '&' and '<' do not break the XML
            string speechXml = $@"
            <speak version='1.0' xml:lang='en-US' xmlns:mstts='https://www.w3.org/2001/mstts'>
                <voice xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice {result.VoiceActorId}'>
                    <prosody rate='1'>{SecurityElement.Escape(text)}</prosody>
                </voice>
            </speak>";
            speechXml = AddVoiceStyleEffectIfDesired(preferredVoiceStyle, speechXml);
            return speechXml;
        }

        /// <summary>
        /// Adds voice style / expression to the SSML markup for the voice
        /// </summary>
        private static string AddVoiceStyleEffectIfDesired(string? preferredVoiceStyle, string speechXml)
        {
            if (!string.IsNullOrWhiteSpace(preferredVoiceStyle) && preferredVoiceStyle != "normal-neutral")
            {
                var voiceDoc = XDocument.Parse(speechXml); //https://learn.microsoft.com/nb-no/azure/ai-services/speech-service/speech-synthesis-markup-voice#use-speaking-styles-and-roles
                XElement? prosody = voiceDoc.Descendants("prosody").FirstOrDefault();
                if (prosody?.Value != null)
                {
                    // Create the <mstts:express-as> element, for now skip the ':' letter and replace at the end
                    var expressedAsWrappedElement = new XElement("msttsexpress-as",
                        new XAttribute("style", preferredVoiceStyle));
                    expressedAsWrappedElement.Value = prosody.Value;
                    prosody.ReplaceWith(expressedAsWrappedElement);
                    speechXml = voiceDoc.ToString().Replace(@"msttsexpress-as", "mstts:express-as");
                }
            }
            return speechXml;
        }

        private List<string> ResolveAvailableActorVoiceIds(string language, TextToSpeechLanguage[] actorVoices)
        {
            if (actorVoices?.Any() == true)
            {
                // Match on the full language key (e.g. "en-US") or on its language part only (e.g. "en")
                var voiceActorIds = actorVoices.Where(v => v.LanguageKey == language || v.LanguageKey.Split("-")[0] == language).SelectMany(v => v.VoiceActors).Select(v => v.VoiceId).ToList();
                return voiceActorIds;
            }
            return new List<string>();
        }

        private string ResolveVoiceActorId(string language, string? preferredVoiceActorId, TextToSpeechLanguage[] actorVoices)
        {
            string actorVoiceId = "(en-AU, NatashaNeural)"; //default to a select voice actor id
            if (actorVoices?.Any() == true)
            {
                var voiceActorsForLanguage = actorVoices.Where(v => v.LanguageKey == language || v.LanguageKey.Split("-")[0] == language).SelectMany(v => v.VoiceActors).Select(v => v.VoiceId).ToList();
                if (voiceActorsForLanguage != null)
                {
                    if (voiceActorsForLanguage.Any() == true)
                    {
                        var resolvedPreferredVoiceActorId = voiceActorsForLanguage.FirstOrDefault(v => v == preferredVoiceActorId);
                        if (!string.IsNullOrWhiteSpace(resolvedPreferredVoiceActorId))
                        {
                            return resolvedPreferredVoiceActorId!;
                        }
                        // Preferred voice actor not available for this language, use the first one
                        actorVoiceId = voiceActorsForLanguage.First();
                    }
                }
            }
            return actorVoiceId;
        }

        private async Task<string> GetIssuedToken()
        {
            var httpClient = new HttpClient();
            string? textToSpeechSubscriptionKey = Environment.GetEnvironmentVariable("AZURE_TEXT_SPEECH_SUBSCRIPTION_KEY", EnvironmentVariableTarget.Machine);
            httpClient.DefaultRequestHeaders.Add(OcpApiSubscriptionKeyHeaderName, textToSpeechSubscriptionKey);
            string tokenEndpointUrl = _configuration[TextToSpeechIssueTokenEndpoint];
            var response = await httpClient.PostAsync(tokenEndpointUrl, new StringContent("{}"));
            _token = (await response.Content.ReadAsStringAsync()).ToSecureString();
            _lastTimeTokenFetched = DateTime.Now;
            return _token.ToNormalString();
        }

        public async Task<List<string>> GetVoiceStyles()
        {
            var voiceStyles = new List<string>
            {
                "normal-neutral",
                "advertisement_upbeat",
                "affectionate",
                "angry",
                "assistant",
                "calm",
                "chat",
                "cheerful",
                "customerservice",
                "depressed",
                "disgruntled",
                "documentary-narration",
                "embarrassed",
                "empathetic",
                "envious",
                "excited",
                "fearful",
                "friendly",
                "gentle",
                "hopeful",
                "lyrical",
                "narration-professional",
                "narration-relaxed",
                "newscast",
                "newscast-casual",
                "newscast-formal",
                "poetry-reading",
                "sad",
                "serious",
                "shouting",
                "sports_commentary",
                "sports_commentary_excited",
                "whispering",
                "terrified",
                "unfriendly"
            };
            return await Task.FromResult(voiceStyles);
        }

        private const string OcpApiSubscriptionKeyHeaderName = "Ocp-Apim-Subscription-Key";
        private const string TextToSpeechIssueTokenEndpoint = "TextToSpeechIssueTokenEndpoint";
        private const string TextToSpeechSpeechEndpoint = "TextToSpeechSpeechEndpoint";
        private const string TextToSpeechSpeechContentType = "TextToSpeechSpeechContentType";
        private const string TextToSpeechSpeechUserAgent = "TextToSpeechSpeechUserAgent";
        private const string TextToSpeechSpeechXMicrosoftOutputFormat = "TextToSpeechSpeechXMicrosoftOutputFormat";

        private readonly IConfiguration _configuration;
        private DateTime? _lastTimeTokenFetched = null;
        private SecureString? _token = null;
    }
}
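The token refresh in GetUpdatedToken depends on measuring how much time has passed since the token was fetched. Access tokens for the Speech service are valid for 10 minutes, so the class refreshes after 8. The following is a minimal sketch of that check in isolation, not code from the repo: the TokenFreshness class and NeedsRefresh method are hypothetical names. It compares whole TimeSpan values, which avoids the trap that TimeSpan.Minutes is only the minutes component (0 to 59) while TimeSpan.TotalMinutes is the full elapsed time.

```csharp
using System;

// Hypothetical helper (not from the demo repo) for deciding when to refresh a token.
public static class TokenFreshness
{
    // Assumption: refresh after 8 minutes, the threshold used in the class above.
    public static readonly TimeSpan MaxAge = TimeSpan.FromMinutes(8);

    // Returns true when no token was fetched yet or the token is older than MaxAge.
    public static bool NeedsRefresh(DateTime? lastFetchedUtc, DateTime nowUtc)
    {
        if (lastFetchedUtc is null)
        {
            return true;
        }
        // Comparing TimeSpan values directly uses the full elapsed time.
        return nowUtc - lastFetchedUtc.Value > MaxAge;
    }
}
```

With a token fetched 63 minutes ago, NeedsRefresh returns true, whereas a check based on the Minutes component would see only 3 and keep the expired token.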
The REST call that generates the voice file uses the following setup:
TTS endpoint URL:
https://norwayeast.tts.speech.microsoft.com/cognitiveservices/v1
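The request sent to this endpoint can be sketched as follows. This is a minimal sketch that only builds the request without sending it. The TtsRequestFactory class is a hypothetical name, and the output format and User-Agent values are assumptions for illustration, since the demo reads its actual values from configuration. The content type application/ssml+xml is the one the Speech REST API expects for SSML bodies.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper (not from the demo repo): builds the POST request without sending it.
public static class TtsRequestFactory
{
    public static HttpRequestMessage Create(string endpointUrl, string bearerToken, string ssml)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, endpointUrl)
        {
            // The SSML document is the request body.
            Content = new StringContent(ssml, Encoding.UTF8, "application/ssml+xml")
        };
        // The bearer token comes from the issue-token endpoint.
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", bearerToken);
        // Assumed output format; the demo takes this value from configuration.
        request.Headers.Add("X-Microsoft-OutputFormat", "audio-16khz-32kbitrate-mono-mp3");
        // Assumed application name for the User-Agent header.
        request.Headers.Add("User-Agent", "MultiLingualTranslatorDemo");
        return request;
    }
}
```

The returned HttpRequestMessage can be passed to HttpClient.SendAsync, and the response body then contains the audio bytes.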
In my test, the transcript (the text to turn into speech) is the following SSML-XML document:
<speak version="1.0" xml:lang="en-US" xmlns:mstts="https://www.w3.org/2001/mstts">
<voice xml:gender="Male" name="Microsoft Server Speech Text to Speech Voice (en-US, JaneNeural)">
<mstts:express-as style="angry">I listen to Eurovision and cheer for Norway</mstts:express-as>
</voice>
</speak>
The SSML also uses the mstts extension namespace, which adds features to SSML such as the express-as element, here set to the voice style or emotion "angry". Not all emotions or voice styles are supported by every voice actor in Azure Cognitive Services.
The following voice styles could be supported. Which ones actually are depends on the voice actor you choose (and therefore on the language).
- "normal-neutral"
- "advertisement_upbeat"
- "affectionate"
- "angry"
- "assistant"
- "calm"
- "chat"
- "cheerful"
- "customerservice"
- "depressed"
- "disgruntled"
- "documentary-narration"
- "embarrassed"
- "empathetic"
- "envious"
- "excited"
- "fearful"
- "friendly"
- "gentle"
- "hopeful"
- "lyrical"
- "narration-professional"
- "narration-relaxed"
- "newscast"
- "newscast-casual"
- "newscast-formal"
- "poetry-reading"
- "sad"
- "serious"
- "shouting"
- "sports_commentary"
- "sports_commentary_excited"
- "whispering"
- "terrified"
- "unfriendly"
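The mstts:express-as element from the SSML example can also be produced with LINQ to XML namespaces rather than by string replacement. The following is a minimal sketch under stated assumptions, not the demo's implementation: the SsmlBuilder class is a hypothetical name, the namespace URI is copied from the SSML example in this post, and the xml:gender attribute is left out for brevity.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper (not from the demo repo): builds styled SSML with XNamespace.
public static class SsmlBuilder
{
    // Namespace URI copied from the SSML example in this post.
    private static readonly XNamespace Mstts = "https://www.w3.org/2001/mstts";

    public static string Build(string voiceName, string style, string text)
    {
        var speak = new XElement("speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xml + "lang", "en-US"),
            // Declares the mstts prefix so express-as is serialized as mstts:express-as.
            new XAttribute(XNamespace.Xmlns + "mstts", Mstts.NamespaceName),
            new XElement("voice",
                new XAttribute("name", voiceName),
                // XElement escapes the text content, so '&' and '<' in user input are safe.
                new XElement(Mstts + "express-as",
                    new XAttribute("style", style),
                    text)));
        return speak.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because the element is created in the mstts namespace from the start, no placeholder name or final string replacement is needed, and special characters in the text are escaped by the XML writer.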
Microsoft has come a long way since its early work on SAPI (Microsoft Speech API) and Microsoft Sam around 2000. Synthetic voices from more than 20 years ago were rather crude and robotic. Nowadays, the voice actors provided by the Azure cloud computing platform, as shown here, are neural network trained and very realistic, based on training from real voice actors, and more and more of these voices support emotions or voice styles.
The uses of this are diverse. Speech synthesis can serve in automated answering services and apps in fields such as healthcare, public services, education and more.
Making this demo has been fun for me. It can be used to learn languages, and with the voice functionality you can practice not only translation but also pronunciation.