I have added a repo on GitHub for a web scraping app written in .NET MAUI Blazor. It uses Azure Cognitive Services to summarize articles.
https://github.com/toreaurstadboss/DagbladetWebscrapper
The web scraper uses the NuGet package Html Agility Pack to handle the DOM after downloading articles from the Internet.
As the name of the repo suggests, it can be used to read, for example, Dagbladet articles without having to wade through ads. 'Web scraping'
is a term for extracting data, or content in general, from websites.
The Razor library containing the text handling methods for scraping web pages uses two NuGet packages, as the using directives in the code below show: HtmlAgilityPack and Azure.AI.TextAnalytics.
Let's first look at the SummarizationUtil class. It uses TextAnalyticsClient from Azure.AI.TextAnalytics. We will summarize each article into a five-sentence summary using the Azure AI
text analytics client.
using Azure.AI.TextAnalytics;
using System.Text;

namespace Webscrapper.Lib
{
    public class SummarizationUtil : ISummarizationUtil
    {
        public async Task<List<ExtractiveSummarizeResult>> GetExtractiveSummarizeResult(string document, TextAnalyticsClient client)
        {
            var batchedDocuments = new List<string>
            {
                document
            };
            var result = new List<ExtractiveSummarizeResult>();
            var options = new ExtractiveSummarizeOptions
            {
                MaxSentenceCount = 5
            };
            var operation = await client.ExtractiveSummarizeAsync(Azure.WaitUntil.Completed, batchedDocuments, options: options);
            // The results arrive in pages; flatten them into a single list.
            await foreach (ExtractiveSummarizeResultCollection documentsInPage in operation.Value)
            {
                foreach (ExtractiveSummarizeResult documentResult in documentsInPage)
                {
                    result.Add(documentResult);
                }
            }
            return result;
        }

        public async Task<string> GetExtractiveSummarizeSentectesResult(string document, TextAnalyticsClient client)
        {
            List<ExtractiveSummarizeResult> summaries = await GetExtractiveSummarizeResult(document, client);
            // One summary sentence per line.
            return string.Join(Environment.NewLine, summaries.Select(s => s.Sentences).SelectMany(x => x).Select(x => x.Text));
        }
    }
}
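We configure the extraction to return at most five sentences. Note the use of await foreach: operation.Value is an AsyncPageable<ExtractiveSummarizeResultCollection>, which implements IAsyncEnumerable<T>, so the result pages are consumed asynchronously as they arrive.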
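To show the await foreach pattern on its own, here is a minimal sketch that does not call Azure at all. FetchPagesAsync is a hypothetical stand-in for operation.Value: an async iterator that yields pages of sentences one at a time. CollectAsync flattens them into a single list, in the same way as the nested loops in GetExtractiveSummarizeResult. Both names are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class AwaitForeachSketch
{
    // Hypothetical stand-in for operation.Value: yields one "page" of results at a time.
    public static async IAsyncEnumerable<List<string>> FetchPagesAsync()
    {
        await Task.Delay(10); // simulate waiting for the service
        yield return new List<string> { "First sentence.", "Second sentence." };
        await Task.Delay(10);
        yield return new List<string> { "Third sentence." };
    }

    // Flattens all pages into one list, page by page, without blocking the caller.
    public static async Task<List<string>> CollectAsync()
    {
        var result = new List<string>();
        await foreach (List<string> page in FetchPagesAsync())
        {
            foreach (string sentence in page)
            {
                result.Add(sentence);
            }
        }
        return result;
    }
}
```

Calling await AwaitForeachSketch.CollectAsync() returns the three sentences in the order they were yielded.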
Here is an extension method that formats an ExtractiveSummarizeResult as a string.
using Azure.AI.TextAnalytics;
using System.Text;

namespace Webscrapper.Lib
{
    public static class SummarizationExtensions
    {
        public static string GetExtractiveSummarizeResultInfo(this ExtractiveSummarizeResult documentResults)
        {
            var sb = new StringBuilder();
            if (documentResults.HasError)
            {
                sb.AppendLine("Error!");
                sb.AppendLine($"Document error code: {documentResults.Error.ErrorCode}.");
                sb.AppendLine($"Message: {documentResults.Error.Message}");
                // A failed result has no sentences to read, so return the error details only.
                return sb.ToString();
            }

            sb.AppendLine("SUCCESS. No errors were encountered while summarizing the document.");
            sb.AppendLine($"Extracted the following {documentResults.Sentences.Count} sentence(s):");
            sb.AppendLine();
            foreach (ExtractiveSummarySentence sentence in documentResults.Sentences)
            {
                sb.AppendLine($"Sentence: {sentence.Text} Offset: {sentence.Offset} Rankscore: {sentence.RankScore} Length: {sentence.Length}");
                sb.AppendLine();
            }
            return sb.ToString();
        }
    }
}
Here is a factory method to create a TextAnalyticsClient.
using Azure;
using Azure.AI.TextAnalytics;

namespace Webscrapper.Lib
{
    public static class TextAnalyticsClientFactory
    {
        public static TextAnalyticsClient CreateClient()
        {
            // Both values are read from machine-level (system) environment variables.
            string? uri = Environment.GetEnvironmentVariable("AZURE_COGNITIVE_SERVICE_ENDPOINT", EnvironmentVariableTarget.Machine);
            string? key = Environment.GetEnvironmentVariable("AZURE_COGNITIVE_SERVICE_KEY", EnvironmentVariableTarget.Machine);
            if (uri == null)
            {
                throw new ArgumentNullException(nameof(uri), "Could not get system environment variable named 'AZURE_COGNITIVE_SERVICE_ENDPOINT'. Set this variable first.");
            }
            // Check the key as well (not the endpoint a second time).
            if (key == null)
            {
                throw new ArgumentNullException(nameof(key), "Could not get system environment variable named 'AZURE_COGNITIVE_SERVICE_KEY'. Set this variable first.");
            }
            var client = new TextAnalyticsClient(new Uri(uri), new AzureKeyCredential(key));
            return client;
        }
    }
}
To use Azure Cognitive Services, you need the endpoint (a URL) and a service key for your account. You find both in the Azure portal after you have activated Azure Cognitive Services.
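One caveat about how the endpoint and key are read: EnvironmentVariableTarget.Machine refers to machine-wide variables on Windows, and as far as I know .NET returns null for that target on non-Windows platforms. Since .NET MAUI can target other platforms too, a fallback to the process environment may be useful. The sketch below is an assumption about how that could look, not code from the repo; EnvironmentReader and GetSetting are hypothetical names.

```csharp
using System;

public static class EnvironmentReader
{
    // Hypothetical helper: tries the machine-level variable first, then the current process.
    public static string GetSetting(string name)
    {
        string? value = Environment.GetEnvironmentVariable(name, EnvironmentVariableTarget.Machine)
            ?? Environment.GetEnvironmentVariable(name, EnvironmentVariableTarget.Process);

        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Environment variable '{name}' is not set. Set it before starting the app.");
        }
        return value;
    }
}
```

With such a helper, the factory could read the two values with EnvironmentReader.GetSetting("AZURE_COGNITIVE_SERVICE_ENDPOINT") and EnvironmentReader.GetSetting("AZURE_COGNITIVE_SERVICE_KEY").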
The page extraction util looks like this. Note the use of Html Agility Pack.
using HtmlAgilityPack;
using System.Text;

namespace Webscrapper.Lib
{
    public class PageExtractionUtil : IPageExtractionUtil
    {
        public async Task<string?> ExtractHtml(string url, bool includeTags)
        {
            if (string.IsNullOrEmpty(url))
            {
                return null;
            }

            using var httpClient = new HttpClient();
            string pageHtml = await httpClient.GetStringAsync(url);
            if (string.IsNullOrEmpty(pageHtml))
            {
                return null;
            }

            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(pageHtml);

            // Select headings and paragraphs. SelectNodes returns null when nothing matches.
            var nodes = htmlDoc.DocumentNode.SelectNodes("//h1|//h2|//h3|//h4|//h5|//h6|//p");
            if (nodes == null)
            {
                return null;
            }

            var sb = new StringBuilder();
            foreach (var textNode in nodes.Where(n => !string.IsNullOrWhiteSpace(n.InnerText)))
            {
                var text = textNode.InnerText;
                // Optionally wrap the text in its original tag, e.g. <h1>...</h1>.
                sb.AppendLine(includeTags ? $"<{textNode.Name}>{text}</{textNode.Name}>" : text);
            }
            return sb.ToString();
        }
    }
}
Let's look at an example usage:
@page "/"
@inject ISummarizationUtil SummarizationUtil
@inject IPageExtractionUtil PageExtractionUtil
@using DagbladetWebscrapper.Models;

@* The UI text is in Norwegian; the heading reads "Dagbladet article summary". *@
<h1>Dagbladet Artikkel Oppsummering</h1>

<EditForm Model="@Model" OnValidSubmit="@Submit" class="form-group">
    <DataAnnotationsValidator />
    <ValidationSummary />
    <div class="form-group row">
        <label for="Model.ArticleUrl">Url til artikkel</label>
        <InputText @bind-Value="Model!.ArticleUrl" placeholder="Skriv inn url til artikkel i Dagbladet" />
    </div>
    <div class="form-group row">
        <span>Artikkelens oppsummering</span>
        <InputTextArea readonly="readonly" placeholder="Her dukker opp artikkelens oppsummering" @bind-Value="Model!.SummarySentences" rows="5"></InputTextArea>
    </div>
    <div class="form-group row">
        <span>Artikkelens tekst</span>
        <InputTextArea readonly="readonly" placeholder="Her dukker opp teksten til artikkelen" @bind-Value="Model!.ArticleText" rows="5"></InputTextArea>
    </div>
    <button type="submit">Submit</button>
</EditForm>

@code {
    private Azure.AI.TextAnalytics.TextAnalyticsClient? _client;

    public IndexModel Model { get; set; } = new();

    // Returning Task instead of void lets Blazor await the handler and re-render when it completes.
    private async Task Submit()
    {
        string? articleText = await PageExtractionUtil.ExtractHtml(Model!.ArticleUrl, false);
        if (string.IsNullOrWhiteSpace(articleText))
        {
            return; // nothing to summarize
        }
        Model.ArticleText = articleText;

        // Create the client lazily on first use.
        _client ??= TextAnalyticsClientFactory.CreateClient();
        string summaryText = await SummarizationUtil.GetExtractiveSummarizeSentectesResult(articleText, _client);
        Model.SummarySentences = summaryText;
    }
}
The view model class for the form, IndexModel, exposes the three properties the form binds to: ArticleUrl, SummarySentences and ArticleText.
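A minimal sketch of what such a view model could look like, inferred only from the bindings in the form and the @using DagbladetWebscrapper.Models directive. The [Required] attribute and its message are assumptions, added because the form uses DataAnnotationsValidator; the class in the repo may differ.

```csharp
using System.ComponentModel.DataAnnotations;

namespace DagbladetWebscrapper.Models
{
    public class IndexModel
    {
        // Assumption: the URL is the only field the user fills in, so it is the only one validated.
        [Required(ErrorMessage = "An article URL is required.")]
        public string? ArticleUrl { get; set; }

        // Filled in by the Submit handler with the summary sentences.
        public string? SummarySentences { get; set; }

        // Filled in by the Submit handler with the extracted article text.
        public string? ArticleText { get; set; }
    }
}
```

With an attribute like this, OnValidSubmit only fires once the user has entered a URL; otherwise ValidationSummary shows the error message.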
Let's look at a screenshot that shows the app in use. It targets an article in the Norwegian tabloid newspaper Dagbladet. This tabloid is notorious for sensational article titles that you have to click through to read (i.e. 'clickbait'), and inside the article you have to wade through lots of ads. With this app, you can open www.dagbladet.no, find a link to an article, extract its text and get a five-sentence summary using Azure Cognitive Services from a .NET MAUI app.