Showing posts with label HtmlAgilityPack. Show all posts
Showing posts with label HtmlAgilityPack. Show all posts

Monday, 3 April 2023

Using Azure Cognitive Services to summarize articles

I have added a repo on Github for a web scraping app written in .NET MAUI Blazor. It uses Azure Cognitive Services to summarize articles. https://github.com/toreaurstadboss/DagbladetWebscrapper The web scrapper uses the Nuget package for Html agility pack to handle the DOM after downloading articles from the Internet. As the name of the repo suggests, it can be used to read for example Dagbladet articles, without having to waddle through ads. 'Website Scraping' is a term that means extracting data from web sites, or content in general. The following libraries are used in the Razor lib containing the text handling methods to scrap web pages:

<PackageReference Include="Azure.AI.TextAnalytics" Version="5.3.0" />
<PackageReference Include="HtmlAgilityPack" Version="1.11.52" />
<PackageReference Include="Microsoft.AspNetCore.Components.Web" Version="6.0.19" />
Let's first look at the SummarizationUtil class. This uses TextAnalyticsClient in Azure.AI.TextAnalytics. We will summarize articles into five sentence summaries using the
Azure AI text analytics client.


using Azure.AI.TextAnalytics;
using System.Text;

namespace Webscrapper.Lib
{
	public class SummarizationUtil : ISummarizationUtil
	{

		public async Task<List<ExtractiveSummarizeResult>> GetExtractiveSummarizeResult(string document, TextAnalyticsClient client)
		{
			var batchedDocuments = new List<string>
			{
				document
			};
			var result = new List<ExtractiveSummarizeResult>();
			var options = new ExtractiveSummarizeOptions
			{
				 MaxSentenceCount = 5
			};
			var operation = await client.ExtractiveSummarizeAsync(Azure.WaitUntil.Completed, batchedDocuments, options: options);
			await foreach (ExtractiveSummarizeResultCollection documentsInPage in operation.Value)
			{
				foreach (ExtractiveSummarizeResult documentResult in documentsInPage)
				{
					result.Add(documentResult);
				}
			}
			return result;
		}

		public async Task<string> GetExtractiveSummarizeSentectesResult(string document, TextAnalyticsClient client)
		{
			List<ExtractiveSummarizeResult> summaries = await GetExtractiveSummarizeResult(document, client);
			return string.Join(Environment.NewLine, summaries.Select(s => s.Sentences).SelectMany(x => x).Select(x => x.Text));
		}

	}

}

We set up the extraction here to return a maximum of five sentences. Note the use of await foreach here. (async ienumerable) Here is a helper method to get a string from a ExtractiveSummarizeResult.

using Azure.AI.TextAnalytics;
using System.Text;

namespace Webscrapper.Lib
{

	public static class SummarizationExtensions
	{

		public static string GetExtractiveSummarizeResultInfo(this ExtractiveSummarizeResult documentResults)
		{
			var sb = new StringBuilder();

			if (documentResults.HasError)
			{
				sb.AppendLine($"Error!");
				sb.AppendLine($"Document error code: {documentResults.Error.ErrorCode}.");
				sb.AppendLine($"Message: {documentResults.Error.Message}");
			}
			else
			{
				sb.AppendLine($"SUCCESS. There are no errors encountered while summarizing the document");
			}

			sb.AppendLine($"Extracted the following {documentResults.Sentences.Count} sentence(s):");
			sb.AppendLine();

			foreach (ExtractiveSummarySentence sentence in documentResults.Sentences)
			{
				sb.AppendLine($"Sentence: {sentence.Text} Offset: {sentence.Offset} Rankscore: {sentence.RankScore} Length:{sentence.Length}");
				sb.AppendLine();
			}
			return sb.ToString();
		}
	}

}



Here is a factory method to create a TextAnalyticsClient.


using Azure;
using Azure.AI.TextAnalytics;

namespace Webscrapper.Lib
{
    public static class TextAnalyticsClientFactory
    {

        public static TextAnalyticsClient CreateClient()
        {
            string? uri = Environment.GetEnvironmentVariable("AZURE_COGNITIVE_SERVICE_ENDPOINT", EnvironmentVariableTarget.Machine);
            string? key = Environment.GetEnvironmentVariable("AZURE_COGNITIVE_SERVICE_KEY", EnvironmentVariableTarget.Machine);
            if (uri == null)
            {
                throw new ArgumentNullException(nameof(uri), "Could not get system environment variable named 'AZURE_COGNITIVE_SERVICE_ENDPOINT' Set this variable first.");
            }
            if (uri == null)
            {
                throw new ArgumentNullException(nameof(uri), "Could not get system environment variable named 'AZURE_COGNITIVE_SERVICE_KEY' Set this variable first.");
            }
            var client = new TextAnalyticsClient(new Uri(uri!), new AzureKeyCredential(key!));
            return client;
        }

    }
}


To use Azure Cognitive Services, you have to get the endpoint (an url) and a service key for your account in Azure portal after having activated Azure Cognitive Services. The page extraction util looks like this, note the use of Html Agility pack.


using HtmlAgilityPack;
using System.Text;

namespace Webscrapper.Lib
{
	public class PageExtractionUtil : IPageExtractionUtil
	{

		public async Task<string?> ExtractHtml(string url, bool includeTags)
		{
			if (string.IsNullOrEmpty(url)) 
				return null;
			var httpClient = new HttpClient();

			string pageHtml = await httpClient.GetStringAsync(url);
			if (string.IsNullOrEmpty(pageHtml))
			{
				return null;
			}

			var htmlDoc = new HtmlDocument(); 
			htmlDoc.LoadHtml(pageHtml);
			var textNodes = htmlDoc.DocumentNode.SelectNodes("//h1|//h2|//h3|//h4|//h5|//h6|//p")
				.Where(n => !string.IsNullOrWhiteSpace(n.InnerText)).ToList();
			var sb = new StringBuilder();
			foreach (var textNode in textNodes)
			{
				var text = textNode.InnerText;
				if (includeTags)
				{
					sb.AppendLine($"<{textNode.Name}>{textNode.InnerText}</{textNode.Name}>");
				}
				else
				{
					sb.AppendLine($"{textNode.InnerText}");
				}
			}
			return sb.ToString();
		}
	}
}



Let's look at an example usage :

@page "/"
@inject ISummarizationUtil SummarizationUtil
@inject IPageExtractionUtil PageExtractionUtil

@using DagbladetWebscrapper.Models;

<h1>Dagbladet Artikkel Oppsummering</h1>

<EditForm Model="@Model" OnValidSubmit="@Submit" class="form-group">
    <DataAnnotationsValidator />
    <ValidationSummary />
  
    <div class="form-group row">
    <label for="Model.ArticleUrl">Url til artikkel</label>
    <InputText @bind-Value="Model!.ArticleUrl" placeholder="Skriv inn url til artikkel i Dagbladet" />
    </div>

    <div class="form-group row">
    <span>Artikkelens oppsummering</span>
    <InputTextArea readonly="readonly" placeholder="Her dukker opp artikkelens oppsummering" @bind-Value="Model!.SummarySentences" rows="5"></InputTextArea>
    </div>

    <div class="form-group row">
    <span>Artikkelens tekst</span>
    <InputTextArea readonly="readonly" placeholder="Her dukker opp teksten til artikkelen" @bind-Value="Model!.ArticleText" rows="5"></InputTextArea>
    </div>
    
    <button type="submit">Submit</button>


</EditForm>

@code {
    private Azure.AI.TextAnalytics.TextAnalyticsClient _client;

    public IndexModel Model { get; set; } = new();

    private async void Submit()
    {
        string articleText = await PageExtractionUtil.ExtractHtml(Model!.ArticleUrl, false);
        Model.ArticleText = articleText;
        if (_client == null)
        {
            _client = TextAnalyticsClientFactory.CreateClient();
        }
        string summaryText = await SummarizationUtil.GetExtractiveSummarizeSentectesResult(articleText, _client);
        Model.SummarySentences = summaryText;

        StateHasChanged();
    }   

}


The view model class for the form looks like this.


using System.ComponentModel.DataAnnotations;

namespace DagbladetWebscrapper.Models
{
	public class IndexModel
	{
        [Required]
        public string? ArticleUrl { get; set; }

        public string SummarySentences { get; set; }

        public string ArticleText { get; set; }
    }
}


Let's look at a screen shot that shows the app in use. It targets an article on the tabloid newspaper Dagbladet in Norway. This tabloid is notorious for writing sensational titles of articles so you have to click into the article (e.g. 'clickbait') and then inside the article, you have to wade through lots of ads. Here, you now have an app, where you can open up www.dagbladet.no and find a link to an article and now extract the text and get a five sentence summary using Azure AI Cognitive services in a .NET MAUI app.