The Speech service uses AI-trained voices to provide natural-sounding speech with ease of use. You can just provide text and get it read out aloud.
An overview of supported languages in the Speech service is shown here:
You can create a text-to-speech (TTS) service using Azure AI services for this. The Speech service in this demo uses the NuGet library Microsoft.CognitiveServices.Speech.
This repo contains a simple demo of Azure AI speech synthesis using Azure.CognitiveServices.SpeechSynthesis.
It provides a simple way of synthesizing text to speech using Azure AI services. Its usage is shown here:
The code provides a simple builder for creating a SpeechSynthesizer instance.
using Microsoft.CognitiveServices.Speech;

namespace ToreAurstadIT.AzureAIDemo.SpeechSynthesis;

public class Program
{
    private static async Task Main(string[] args)
    {
        Console.WriteLine("Your text to speech input");
        string? text = Console.ReadLine();

        using (var synthesizer = SpeechSynthesizerBuilder.Instance.WithSubscription().Build())
        {
            using (var result = await synthesizer.SpeakTextAsync(text))
            {
                string reasonResult = result.Reason switch
                {
                    ResultReason.SynthesizingAudioCompleted => $"The following text was synthesized successfully: {text}",
                    _ => $"Result of speech synthesis: {result.Reason}"
                };
                Console.WriteLine(reasonResult);
            }
        }
    }
}
The builder looks like this:
using Microsoft.CognitiveServices.Speech;

namespace ToreAurstadIT.AzureAIDemo.SpeechSynthesis;

public class SpeechSynthesizerBuilder
{
    private string? _subscriptionKey = null;
    private string? _subscriptionRegion = null;

    public static SpeechSynthesizerBuilder Instance => new SpeechSynthesizerBuilder();

    public SpeechSynthesizerBuilder WithSubscription(string? subscriptionKey = null, string? region = null)
    {
        _subscriptionKey = subscriptionKey ?? Environment.GetEnvironmentVariable("AZURE_AI_SERVICES_SPEECH_KEY", EnvironmentVariableTarget.User);
        _subscriptionRegion = region ?? Environment.GetEnvironmentVariable("AZURE_AI_SERVICES_SPEECH_REGION", EnvironmentVariableTarget.User);
        return this;
    }

    public SpeechSynthesizer Build()
    {
        var config = SpeechConfig.FromSubscription(_subscriptionKey, _subscriptionRegion);
        var speechSynthesizer = new SpeechSynthesizer(config);
        return speechSynthesizer;
    }
}
Note that I observed that the audio could get cut off at the very end. It might be a temporary issue, but if you encounter it too, you can add an initial pause to work around it:
string? initialPause = " .... "; //this is added to avoid the speech being cut off
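A minimal sketch of how that workaround could look in the Program.cs demo above (the variable name is just illustrative):

string? initialPause = " .... "; //short leading pause as a workaround against the audio being cut off
string? text = Console.ReadLine();
using (var result = await synthesizer.SpeakTextAsync(initialPause + text))
{
    //inspect result.Reason as shown earlier
}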
This article presents code that shows how you can connect to the Azure OpenAI GPT-4 chat service.
The repository for the code presented is the following GitHub repo:
The repo contains useful helper methods for using the Azure AI service and creating an AzureOpenAIClient, or the more specific ChatClient, which is a chat client obtained from the AzureOpenAIClient for a given AI model (the default model is 'gpt-4').
The chat client is created using a class with a builder pattern. The builder class looks like this, and a usage example follows right after it:
using Azure.AI.OpenAI;
using OpenAI.Chat;
using System.ClientModel;

namespace ToreAurstadIT.OpenAIDemo
{
    /// <summary>
    /// Creates AzureOpenAIClient or ChatClient (default model is "gpt-4")
    /// Suggestion:
    /// Create user-specific environment variables for: AZURE_AI_SERVICES_KEY and AZURE_AI_SERVICES_ENDPOINT to avoid exposing endpoint and key in source code.
    /// Then use the 'WithDefault' methods to use the two user-specific environment variables, which must be set.
    /// </summary>
    public class AzureOpenAIClientBuilder
    {
        private const string AZURE_AI_SERVICES_KEY = nameof(AZURE_AI_SERVICES_KEY);
        private const string AZURE_AI_SERVICES_ENDPOINT = nameof(AZURE_AI_SERVICES_ENDPOINT);

        private string? _endpoint = null;
        private ApiKeyCredential? _key = null;

        public AzureOpenAIClientBuilder WithEndpoint(string endpoint) { _endpoint = endpoint; return this; }

        /// <summary>
        /// Usage: Provide a user-specific environment variable called 'AZURE_AI_SERVICES_ENDPOINT'
        /// </summary>
        public AzureOpenAIClientBuilder WithDefaultEndpointFromEnvironmentVariable() { _endpoint = Environment.GetEnvironmentVariable(AZURE_AI_SERVICES_ENDPOINT, EnvironmentVariableTarget.User); return this; }

        public AzureOpenAIClientBuilder WithKey(string key) { _key = new ApiKeyCredential(key); return this; }

        public AzureOpenAIClientBuilder WithKeyFromEnvironmentVariable(string key) { _key = new ApiKeyCredential(Environment.GetEnvironmentVariable(key) ?? "N/A"); return this; }

        /// <summary>
        /// Usage: Provide a user-specific environment variable called 'AZURE_AI_SERVICES_KEY'
        /// </summary>
        public AzureOpenAIClientBuilder WithDefaultKeyFromEnvironmentVariable() { _key = new ApiKeyCredential(Environment.GetEnvironmentVariable(AZURE_AI_SERVICES_KEY, EnvironmentVariableTarget.User) ?? "N/A"); return this; }

        public AzureOpenAIClient? Build() => !string.IsNullOrWhiteSpace(_endpoint) && _key != null ? new AzureOpenAIClient(new Uri(_endpoint), _key) : null;

        /// <summary>
        /// Default model will be set to 'gpt-4'
        /// </summary>
        public ChatClient? BuildChatClient(string aiModel = "gpt-4") => Build()?.GetChatClient(aiModel);

        public static AzureOpenAIClientBuilder Instance => new AzureOpenAIClientBuilder();
    }
}
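With the builder above, creating a chat client can look like this (a minimal sketch, assuming the two environment variables mentioned below are already set):

ChatClient? chatClient = AzureOpenAIClientBuilder
    .Instance
    .WithDefaultEndpointFromEnvironmentVariable()
    .WithDefaultKeyFromEnvironmentVariable()
    .BuildChatClient(); //defaults to the 'gpt-4' model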
It is of course highly recommended not to store the endpoint and key for your Azure AI service in the source code repository. Keep them somewhere else, for example in user-specific environment variables, Azure Key Vault, or a similar place that is hard to reach for malicious use; otherwise someone could use your account to route a lot of traffic to GPT-4, and you end up being billed for it.
The code provides some 'default' methods which look for environment variables. Add the key and endpoint of your Azure AI resource to these user-specific environment variables (a snippet for setting them follows the list):
AZURE_AI_SERVICES_KEY
AZURE_AI_SERVICES_ENDPOINT
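If you prefer to set these from code instead of via the system settings dialog, a one-off snippet like the following could be used (the endpoint format is the usual one for an Azure OpenAI resource; the placeholder values are of course hypothetical):

//user-level environment variables, matching what the builder's 'WithDefault' methods read
Environment.SetEnvironmentVariable("AZURE_AI_SERVICES_KEY", "<your-key>", EnvironmentVariableTarget.User);
Environment.SetEnvironmentVariable("AZURE_AI_SERVICES_ENDPOINT", "https://<your-resource>.openai.azure.com/", EnvironmentVariableTarget.User);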
The following code shows how to use the chat client:
ChatGptDemo.cs
public async Task<string?> RunChatGptQuery(ChatClient? chatClient, string msg)
{
    if (chatClient == null)
    {
        Console.WriteLine("Sorry, the demo failed. The chatClient did not initialize properly.");
        return null;
    }

    var stopWatch = Stopwatch.StartNew();

    string reply = await chatClient.GetStreamedReplyStringAsync(msg, outputToConsole: true);

    Console.WriteLine($"The operation took: {stopWatch.ElapsedMilliseconds} ms");
    Console.WriteLine();

    return reply;
}
The line that communicates with the Azure OpenAI GPT-4 service is this one:
string reply = await chatClient.GetStreamedReplyStringAsync(msg, outputToConsole: true);
The GPT-4 service returns the data streamed, so you can output the result as quickly as possible. I have tested it using the Standard S0 tier; it is a bit slower than the default speed you get in the browser using Copilot, but it works, and if you output to the console you get a similar experience.
The code here can be used in different environments; the repo contains a .NET 8.0 console app written in C#, as shown in the code.
Here are the helper methods for the ChatClient, provided as extension methods.
ChatclientExtensions.cs
using OpenAI.Chat;
using System.ClientModel;
using System.Text;

namespace OpenAIDemo
{
    public static class ChatclientExtensions
    {
        /// <summary>
        /// Provides a streamed result from the ChatClient service using Azure AI services.
        /// </summary>
        /// <param name="chatClient">ChatClient instance</param>
        /// <param name="message">The message to send and communicate to the AI model</param>
        /// <returns>Streamed chat reply / result. Consume using 'await foreach'</returns>
        public static AsyncCollectionResult<StreamingChatCompletionUpdate> GetStreamedReplyAsync(this ChatClient chatClient, string message) =>
            chatClient.CompleteChatStreamingAsync(
                [new SystemChatMessage("You are a helpful, wonderful AI assistant"), new UserChatMessage(message)]);

        public static async Task<string> GetStreamedReplyStringAsync(this ChatClient chatClient, string message, bool outputToConsole = false)
        {
            var sb = new StringBuilder();
            await foreach (var update in GetStreamedReplyAsync(chatClient, message))
            {
                foreach (var textReply in update.ContentUpdate.Select(cu => cu.Text))
                {
                    sb.Append(textReply);
                    if (outputToConsole)
                    {
                        Console.Write(textReply);
                    }
                }
            }
            return sb.ToString();
        }
    }
}
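If you want to consume the streaming variant directly instead of the string helper, the same 'await foreach' pattern used inside GetStreamedReplyStringAsync applies (a minimal sketch, given a chatClient created as shown earlier):

await foreach (var update in chatClient.GetStreamedReplyAsync("Tell me about Azure AI services"))
{
    foreach (var textReply in update.ContentUpdate.Select(cu => cu.Text))
    {
        Console.Write(textReply); //output each streamed text fragment as it arrives
    }
}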
The code presented here should make it a bit easier to communicate with the Azure OpenAI GPT-4 chat service. See the repository to test out the code.
The screenshot below shows the demo running in a console against the Azure OpenAI GPT-4 service:
The speech synthesis service of Azure AI is accessed via a REST API. You can actually test it out first in Postman by retrieving an access token from the token endpoint and then calling the text-to-speech endpoint using the access token as a bearer token.
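For reference, a minimal sketch of those two REST calls from C# (the region-based URLs below are the typical public endpoints, here assuming a resource in 'westeurope'; verify them against your own Speech resource in the Azure Portal, and note that the voice name and output format are just examples):

string region = "westeurope"; //assumption: adjust to the region of your Speech resource
string subscriptionKey = Environment.GetEnvironmentVariable("AZURE_TEXT_SPEECH_SUBSCRIPTION_KEY", EnvironmentVariableTarget.Machine)!;

using var http = new HttpClient();

//1. fetch an access token from the issueToken endpoint using the subscription key
http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
string token = await (await http.PostAsync(
    $"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken",
    new StringContent("{}"))).Content.ReadAsStringAsync();

//2. call the text-to-speech endpoint with the token as a bearer token and an SSML body
string ssml = "<speak version='1.0' xml:lang='en-US'><voice name='en-US-JennyNeural'>Hello world</voice></speak>";
http.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", token);
http.DefaultRequestHeaders.Add("X-Microsoft-OutputFormat", "audio-24khz-48kbitrate-mono-mp3");
var ttsResponse = await http.PostAsync(
    $"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
    new StringContent(ssml, System.Text.Encoding.UTF8, "application/ssml+xml"));
byte[] mp3Bytes = await ttsResponse.Content.ReadAsByteArrayAsync();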
To get the demo working, you have to create the necessary resources / services inside the Azure Portal. This article focuses on the Speech service.
Important: if you want to test out the demo yourself, remember to put the keys into environment variables so they are not exposed via source control.
To get started with speech synthesis in Azure Cognitive Services, add a Speech Service resource via the Azure Portal.
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview
We also need to add audio capability to our demo, which is a .NET MAUI Blazor app. The NuGet package used is the following:
MultiLingual.Translator.csproj
This NuGet package's website is here:
https://github.com/jfversluis/Plugin.Maui.Audio
The MauiProgram.cs looks like the following; make note of AudioManager.Current, which is registered as a singleton (a short sketch of how the injected IAudioManager can be used follows the code).
MauiProgram.cs
using Microsoft.Extensions.Configuration;
using MultiLingual.Translator.Lib;
using Plugin.Maui.Audio;
namespace MultiLingual.Translator;

public static class MauiProgram
{
    public static MauiApp CreateMauiApp()
    {
        var builder = MauiApp.CreateBuilder();
        builder
            .UseMauiApp<App>()
            .ConfigureFonts(fonts =>
            {
                fonts.AddFont("OpenSans-Regular.ttf", "OpenSansRegular");
            });

        builder.Services.AddMauiBlazorWebView();

#if DEBUG
        builder.Services.AddBlazorWebViewDeveloperTools();
#endif

        builder.Services.AddSingleton(AudioManager.Current);
        builder.Services.AddTransient<MainPage>();

        builder.Services.AddScoped<IDetectLanguageUtil, DetectLanguageUtil>();
        builder.Services.AddScoped<ITranslateUtil, TranslateUtil>();
        builder.Services.AddScoped<ITextToSpeechUtil, TextToSpeechUtil>();

        var config = new ConfigurationBuilder().AddJsonFile("appsettings.json").Build();
        builder.Configuration.AddConfiguration(config);

        return builder.Build();
    }
}
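As a side note, here is a minimal sketch of how the registered IAudioManager could be used to play the MP3 bytes that the speech service returns (the TextToSpeechResult and its VoiceData property are shown later in TextToSpeechUtil; the exact wiring in the repo's Index.razor may differ):

//resolved from DI, e.g. via @inject IAudioManager AudioManager in a razor component
IAudioManager audioManager = AudioManager.Current;

//result is a TextToSpeechResult; VoiceData holds the MP3 byte array returned from the service
var audioStream = new MemoryStream(result.VoiceData);
IAudioPlayer player = audioManager.CreatePlayer(audioStream);
player.Play();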
Next up, let's look at TextToSpeechUtil. This class is a service that does two things against the REST API of the Azure Cognitive Services text-to-speech service:
Fetch an access token
Synthesize text to speech
TextToSpeechUtil.cs
using Microsoft.Extensions.Configuration;
using MultiLingual.Translator.Lib.Models;
using System.Security;
using System.Text;

namespace MultiLingual.Translator.Lib
{
    public class TextToSpeechUtil : ITextToSpeechUtil
    {
        public TextToSpeechUtil(IConfiguration configuration)
        {
            _configuration = configuration;
        }

        public async Task<TextToSpeechResult> GetSpeechFromText(string text, string language, TextToSpeechLanguage[] actorVoices, string? preferredVoiceActorId)
        {
            var result = new TextToSpeechResult();

            result.Transcript = GetSpeechTextXml(text, language, actorVoices, preferredVoiceActorId, result);
            result.ContentType = _configuration[TextToSpeechSpeechContentType];
            result.OutputFormat = _configuration[TextToSpeechSpeechXMicrosoftOutputFormat];
            result.UserAgent = _configuration[TextToSpeechSpeechUserAgent];
            result.AvailableVoiceActorIds = ResolveAvailableActorVoiceIds(language, actorVoices);
            result.LanguageCode = language;

            string? token = await GetUpdatedToken();
            HttpClient httpClient = GetTextToSpeechWebClient(token);

            string ttsEndpointUrl = _configuration[TextToSpeechSpeechEndpoint];
            var response = await httpClient.PostAsync(ttsEndpointUrl, new StringContent(result.Transcript, Encoding.UTF8, result.ContentType));

            using (var memStream = new MemoryStream())
            {
                var responseStream = await response.Content.ReadAsStreamAsync();
                responseStream.CopyTo(memStream);
                result.VoiceData = memStream.ToArray();
            }

            return result;
        }

        private async Task<string?> GetUpdatedToken()
        {
            string? token = _token?.ToNormalString();
            if (_lastTimeTokenFetched == null || DateTime.Now.Subtract(_lastTimeTokenFetched.Value).Minutes > 8)
            {
                token = await GetIssuedToken();
            }
            return token;
        }

        private HttpClient GetTextToSpeechWebClient(string? token)
        {
            var httpClient = new HttpClient();
            httpClient.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", token);
            httpClient.DefaultRequestHeaders.Add("X-Microsoft-OutputFormat", _configuration[TextToSpeechSpeechXMicrosoftOutputFormat]);
            httpClient.DefaultRequestHeaders.Add("User-Agent", _configuration[TextToSpeechSpeechUserAgent]);
            return httpClient;
        }

        private string GetSpeechTextXml(string text, string language, TextToSpeechLanguage[] actorVoices, string? preferredVoiceActorId, TextToSpeechResult result)
        {
            result.VoiceActorId = ResolveVoiceActorId(language, preferredVoiceActorId, actorVoices);
            string speechXml = $@"
            <speak version='1.0' xml:lang='en-US'>
              <voice xml:lang='en-US' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice {result.VoiceActorId}'>
                <prosody rate='1'>{text}</prosody>
              </voice>
            </speak>";
            return speechXml;
        }

        private List<string> ResolveAvailableActorVoiceIds(string language, TextToSpeechLanguage[] actorVoices)
        {
            if (actorVoices?.Any() == true)
            {
                var voiceActorIds = actorVoices.Where(v => v.LanguageKey == language || v.LanguageKey.Split("-")[0] == language).SelectMany(v => v.VoiceActors).Select(v => v.VoiceId).ToList();
                return voiceActorIds;
            }
            return new List<string>();
        }

        private string ResolveVoiceActorId(string language, string? preferredVoiceActorId, TextToSpeechLanguage[] actorVoices)
        {
            string actorVoiceId = "(en-AU, NatashaNeural)"; //default to a selected voice actor id
            if (actorVoices?.Any() == true)
            {
                var voiceActorsForLanguage = actorVoices.Where(v => v.LanguageKey == language || v.LanguageKey.Split("-")[0] == language).SelectMany(v => v.VoiceActors).Select(v => v.VoiceId).ToList();
                if (voiceActorsForLanguage != null)
                {
                    if (voiceActorsForLanguage.Any() == true)
                    {
                        var resolvedPreferredVoiceActorId = voiceActorsForLanguage.FirstOrDefault(v => v == preferredVoiceActorId);
                        if (!string.IsNullOrWhiteSpace(resolvedPreferredVoiceActorId))
                        {
                            return resolvedPreferredVoiceActorId!;
                        }
                        actorVoiceId = voiceActorsForLanguage.First();
                    }
                }
            }
            return actorVoiceId;
        }

        private async Task<string> GetIssuedToken()
        {
            var httpClient = new HttpClient();
            string? textToSpeechSubscriptionKey = Environment.GetEnvironmentVariable("AZURE_TEXT_SPEECH_SUBSCRIPTION_KEY", EnvironmentVariableTarget.Machine);
            httpClient.DefaultRequestHeaders.Add(OcpApiSubscriptionKeyHeaderName, textToSpeechSubscriptionKey);
            string tokenEndpointUrl = _configuration[TextToSpeechIssueTokenEndpoint];
            var response = await httpClient.PostAsync(tokenEndpointUrl, new StringContent("{}"));
            _token = (await response.Content.ReadAsStringAsync()).ToSecureString();
            _lastTimeTokenFetched = DateTime.Now;
            return _token.ToNormalString();
        }

        private const string OcpApiSubscriptionKeyHeaderName = "Ocp-Apim-Subscription-Key";
        private const string TextToSpeechIssueTokenEndpoint = "TextToSpeechIssueTokenEndpoint";
        private const string TextToSpeechSpeechEndpoint = "TextToSpeechSpeechEndpoint";
        private const string TextToSpeechSpeechContentType = "TextToSpeechSpeechContentType";
        private const string TextToSpeechSpeechUserAgent = "TextToSpeechSpeechUserAgent";
        private const string TextToSpeechSpeechXMicrosoftOutputFormat = "TextToSpeechSpeechXMicrosoftOutputFormat";

        private readonly IConfiguration _configuration;
        private DateTime? _lastTimeTokenFetched = null;
        private SecureString _token = null;
    }
}
Let's look at the appsettings.json file. The Ocp-Apim-Subscription-Key is put into an environment variable; this is a secret key you do not want to expose, to avoid leaking the key and running up costs for usage of the service.
Appsettings.json
Next up, I have gathered all the voice actor ids for the languages in Azure Cognitive Services that have voice actor ids. These cover the most well-known of the roughly 150 languages Azure supports; see the following json for an overview of voice actor ids.
For example, the Norwegian language has three voice actors that are neural-net-trained AI voices for realistic speech synthesis.
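As a rough sketch, the shape that TextToSpeechUtil relies on when this json is deserialized could look like the following. Only the properties actually used in the code above (LanguageKey, VoiceActors, VoiceId) are included; the element type name VoiceActor and the example values are assumptions, and the repo's actual model classes may have more properties:

public class TextToSpeechLanguage
{
    public string LanguageKey { get; set; } = string.Empty; //e.g. "nb-NO"
    public List<VoiceActor> VoiceActors { get; set; } = new();
}

public class VoiceActor
{
    public string VoiceId { get; set; } = string.Empty; //e.g. "(nb-NO, FinnNeural)"
}

//deserializing the bundled json file into the list of voice actors
var actorVoices = System.Text.Json.JsonSerializer.Deserialize<TextToSpeechLanguage[]>(json);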
Let's look at the source code for calling the TextToSpeechUtil.cs shown above from a MAUI Blazor app view, Index.razor.
The code below shows two private methods that do the work of retrieving the audio file from the Azure Speech service. First, all the voice actor ids are loaded from the bundled json file of voice actors displayed above and deserialized into a list of voice actors.
The call that retrieves the audio file passes in the translated text for which to generate synthesized speech, together with the target language, all available actor voices, and the preferred voice actor id, if set.
What is retrieved is metadata plus the audio itself in MP3 format, a format that Windows, for example, recognizes without any additional codec libraries installed.
Index.razor (Inside the @code block { .. } of that razor file)
The screenshot below shows how the demo app now looks. You can translate text into another language and then have the speech synthesis in Azure AI Cognitive Services generate realistic audio speech of the translated text, so you can see not only how the text is translated, but also how it is pronounced.