Sunday, 20 July 2025

Object detection with Machine Learning

In this article, a demo performing object detection in images using machine learning is presented. ML.NET is used to train the model, and the demo itself is a Blazor server-side app. The demo shows how you can train a machine learning model to detect stop signs in the United States from some 50 images downloaded from the Unsplash website. You can of course choose whatever object you want to detect and obtain sample images from sites such as Unsplash. There is no official lower limit on the number of images needed for reliable object detection, but your images should show the object from different angles and partly obscured, with variations such as weathering effects (snow et cetera) and different lighting conditions (day, night, dusk, dawn and so on).

High-quality object detection will use thousands of training images and powerful GPUs in the cloud. I trained the model on my own PC using a CPU; it took 830 seconds, some 13 minutes in all. ML.NET also supports training models in the cloud, where some cost must of course be expected compared to running on cheaper hardware such as your own PC (CPU). The benefit of using GPUs in the cloud is that the training can explore more candidate models and compare which one is optimal, and this demands a lot of computing. Note that the Model Builder of ML.NET is used in VS 2022; it is available after installing the ML.NET workload in VS 2022. I have also used VoTT to tag the 50+ images in the training set. I have added the source code to my GitHub repo here:

https://github.com/toreaurstadboss/ObjectDetectionMachineLearning

VoTT - Visual Object Tagging Tool

ML.NET object detection uses input from the VoTT tool, where a human (you!) has tagged the object(s) of interest in each image. Note that a stop sign along a road or street can appear in multiple places in an image; in some pictures we see the object two or three times. Tagging and labelling the objects of interest in VoTT is a straightforward process, and doing this carefully across the training data (the pictures!) is your main objective. VoTT is available for download here; click into the releases of the GitHub repo to find the latest stable installer (exe):


https://github.com/microsoft/VoTT/releases/tag/v2.2.0


https://github.com/microsoft/VoTT (VoTT repo on GitHub)
Here is a screenshot of VoTT while I am tagging pictures: you can open a folder with several pictures (I am using the .png format) and then use the polygon tool to 'lasso around' the contours of each object of interest in the image. A label must also be assigned. This process is repeated manually. A minimum of about 50 images should be considered, but a reliably trained machine learning model in production should probably be trained on a lot more, in the several thousands. In the beginning, however, it is beneficial to keep the image count low so that finding the optimal machine learning algorithm is easier without a large training set. In ML.NET, the algorithm ObjectDetectionMulti won, although only one model was explored. Using GPUs in the cloud would at least reduce computation time and allow exploring more candidate models.

Screenshot of the demo

The following screenshot shows three detected objects in the test image. The machine-learned model is trained to detect stop signs used as road signs along roads and streets in the United States. Interestingly, the model also detected the stop sign pointing the other way for oncoming traffic, so a total of three stop signs were detected in the railroad crossing sample image. A fourth stop sign was not detected; however, the model was not trained on road signs rotated away from the camera. Training it to also detect stop signs in such cases would improve this further. But I am quite pleased with the results when I test the machine-learned model.

First, let's look at the code behind the demo shown in the screenshot above.

Machine learning model exposed via Web API

Using ML.NET we can not only train a machine learning model to detect objects in images, we can also have ML.NET generate the API that processes images and returns the objects found, if any. ML.NET object detection supports images in the .png, .jpeg and .bmp formats. In the demo, .png images were used. The demo actually used high-resolution images from the Unsplash website, which provides beautiful, royalty-free images to use with, for example, machine learning.

The API and endpoint for processing images is shown next. The client specifies the file name of the image to analyse, and after processing, the detected objects are returned. If any objects were found, we get them in summary format as bounding boxes and detected labels from the machine-learned model. ML.NET generated the Web API code here; note that the POST endpoint was added manually by me. We turn off caching through response headers, and note that the model returns bounding box coordinates scaled to an 800x600 virtual image. The bounding boxes are returned as quadruples (four floats per object) in the returned float array. The POST endpoint in the minimal API below shows how these bounding boxes are calculated and returned. I have not refactored this code into a helper method yet, but a sketch of such a helper follows the listing, and you can use this code as a reference for rescaling the calculated bounding boxes from the virtual 800x600 image back to the original image's pixel width and height.


Program.cs | StopSignDetection_WebApi1 project




// This file was auto-generated by ML.NET Model Builder.
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.ML;
using Microsoft.OpenApi.Models;
using Microsoft.ML.Data;
using System.Drawing;
using System.IO;
using System.Threading.Tasks;
using StopSignDetection_WebApi1;
using Microsoft.AspNetCore.Mvc;

// Configure app
var builder = WebApplication.CreateBuilder(args);
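
// Register a pooled prediction engine (thread-safe and reused across requests)
// for the Model Builder-generated input/output types, loading the trained .mlnet model file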
builder.Services.AddPredictionEnginePool<StopSignDetection.ModelInput, StopSignDetection.ModelOutput>()
    .FromFile("StopSignDetection.mlnet");

builder.Services.AddEndpointsApiExplorer();

builder.Services.AddSwaggerGen(c =>
{
    c.SwaggerDoc("v1", new OpenApiInfo { Title = "Object detection - Stop sign detection", Description = "Docs for my API", Version = "v1" });
});
var app = builder.Build();

app.UseSwagger();

if (app.Environment.IsDevelopment())
{
    app.UseSwaggerUI(c =>
    {
        c.SwaggerEndpoint("/swagger/v1/swagger.json", "Object detection - Machine learning with ML.NET");
    });
}

// Define prediction route & handler
app.MapPost("/predict",
    async (HttpContext context, PredictionEnginePool<StopSignDetection.ModelInput, StopSignDetection.ModelOutput> predictionEnginePool, [FromBody] PredictRequest request) =>
    {
        context.Response.Headers["Cache-Control"] = "no-store, no-cache, must-revalidate, max-age=0";
        context.Response.Headers["Pragma"] = "no-cache";
        context.Response.Headers["Expires"] = "0";

        var image = MLImage.CreateFromFile(request.ImagePath);

        var input = new StopSignDetection.ModelInput()
        {
            Image = image,
        };

        int originalWidth = image.Width;
        int originalHeight = image.Height;

        const int virtualWidth = 800;
        const int virtualHeight = 600;

        var prediction = predictionEnginePool.Predict(input);
        var boxes = prediction.PredictedBoundingBoxes;

        for (int i = 0; i < boxes.Length; i += 4)
        {
            float left = boxes[i];
            float top = boxes[i + 1];
            float width = boxes[i + 2];
            float height = boxes[i + 3];

            float scaledLeft = left * originalWidth / virtualWidth;
            float scaledTop = top * originalHeight / virtualHeight;
            float scaledWidth = width * originalWidth / virtualWidth;
            float scaledHeight = height * originalHeight / virtualHeight;

            Console.WriteLine($"Box {i / 4}: X={scaledLeft}, Y={scaledTop}, Width={scaledWidth}, Height={scaledHeight}");

            (boxes[i], boxes[i + 1], boxes[i + 2], boxes[i + 3]) = (scaledLeft, scaledTop, scaledWidth, scaledHeight);  //assign using tuples and update the float array per object
        }

        return await Task.FromResult(prediction);
    });

app.Run();


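The rescaling loop above could be factored into a small helper method. Below is a minimal sketch of such a helper; the class and method names are my own and not part of the generated code, and it assumes the same flat quadruple layout (left, top, width, height per detected object) and the 800x600 virtual image size used by the model.

public static class BoundingBoxScaler
{
    // Rescales the flat bounding box array in place, from the virtual image size
    // the model predicts against back to the original image's pixel dimensions
    public static void RescaleToOriginal(float[] boxes, int originalWidth, int originalHeight,
        int virtualWidth = 800, int virtualHeight = 600)
    {
        for (int i = 0; i < boxes.Length; i += 4)
        {
            boxes[i] = boxes[i] * originalWidth / virtualWidth;           // left
            boxes[i + 1] = boxes[i + 1] * originalHeight / virtualHeight; // top
            boxes[i + 2] = boxes[i + 2] * originalWidth / virtualWidth;   // width
            boxes[i + 3] = boxes[i + 3] * originalHeight / virtualHeight; // height
        }
    }
}

With this in place, the loop in the endpoint could be replaced by a single call: BoundingBoxScaler.RescaleToOriginal(boxes, originalWidth, originalHeight);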

PredictRequest is a simple class used as the request object when POST-ing.


PredictRequest.cs | StopSignDetection_WebApi1 project

namespace StopSignDetection_WebApi1
{
    public class PredictRequest
    {
        public string ImagePath { get; set; } = string.Empty;
    }
}

Over to the client, which is a Blazor server-side app. The following markup is the UI of the Home component; its code-behind is shown further below.

Home.razor




@page "/"
@using Microsoft.AspNetCore.Components.Forms

<PageTitle>Home</PageTitle>

<h1>Object detection using Machine learning</h1>

<script src="js/home.js" type="text/javascript"></script>

<p>
    Upload an image to use the object detection demo. The machine-learned ML.NET model will detect <em>stop signs</em> and
    display bounding boxes around each stop sign in the image. The model is trained on stop signs used as traffic signs
    along streets and roads in the United States.
</p>

<div class="container">

    <div class="row align-items-start">
        <div class="col">
            <label><b>Select a picture to run stop sign object detection</b></label><br />
            <InputFile OnChange="@OnInputFile" accept=".jpeg,.jpg,.png" />
            <br />
            <code class="alert-secondary">Supported file formats: .jpeg, .jpg and .png (.bmp is also supported). Max image file upload size: 10 MB</code>
            <br />
        </div>
    </div>

    <div class="row align-items-start">
        <div class="col">
            <label><b>Detected objects (stop-signs) in the loaded image:</b></label><br />

            @if (LatestPrediction?.predictedLabel?.Count > 0)
            {
                <table class="table table-bordered table-striped table-hover mt-3">
                    <thead class="table-dark">
                        <tr>
                            <th>#</th>
                            <th>Label</th>
                            <th>X1</th>
                            <th>Y1</th>
                            <th>X2</th>
                            <th>Y2</th>
                            <th>Confidence</th>
                        </tr>
                    </thead>
                    <tbody>
                        @for (int i = 0; i < LatestPrediction.predictedLabel.Count; i++)
                        {
                            var label = LatestPrediction.predictedLabel.ElementAt(i);
                            var bbox = LatestPrediction.predictedBoundingBoxes.Skip(i * 4).Take(4).ToArray();
                            var score = LatestPrediction.score.ElementAt(i);

                            <tr>
                                <td>@(i + 1)</td>
                                <td>@label</td>
                                <td>@bbox[0].ToString("0.00")</td>
                                <td>@bbox[1].ToString("0.00")</td>
                                <td>@bbox[2].ToString("0.00")</td>
                                <td>@bbox[3].ToString("0.00")</td>
                                <td>@score.ToString("0.0000")</td>
                            </tr>
                        }
                    </tbody>
                </table>
            }
            else
            {
                <p class="text-muted">No predictions available.</p>
            }
        </div>
    </div>

    <div class="row align-items-start">
        <div class="col overflow-scroll">
            <label class="alert-info">Preview of the selected image</label>
            <div>
                <img id="PreviewImage" style="border:1px solid black;" src="@UploadedImagePreview" /><br />
            </div>
        </div>
        <div class="col overflow-scroll">
            <label class="alert-info">Image with bounding boxes</label>
            <canvas height="400" id="PreviewImageBbox" style="border:solid 1px black">
            </canvas>
            <br />
        </div>
    </div>

</div>



The image you want to analyze is uploaded using an InputFile component in Blazor, which uses the HTML file upload control. After a file is uploaded, the API is called in the OnInputFile event handler. The code-behind of the Home component is shown next.

Home.razor.cs



using Microsoft.AspNetCore.Components;
using Microsoft.AspNetCore.Components.Forms;
using Microsoft.JSInterop;
using ObjectDetectionMachineLearning.Web.Models;
using System.Text;
using System.Text.Json;

namespace ObjectDetectionMachineLearning.Web.Components.Pages
{
    partial class Home
    {

        [Inject]
        private HttpClient Http { get; set; } = default!;

        [Inject]
        private IJSRuntime JsRunTime { get; set; } = default!;

        private string? UploadedImagePreview;

        private MLPrediction LatestPrediction = default!;

        /// <summary>
        /// Handles an uploaded image: saves it to disk, sets the UploadedImagePreview property
        /// to display it, and calls the prediction API to detect objects in it
        /// </summary>
        /// <param name="e">Event args for the chosen file</param>
        /// <returns></returns>
        private async Task OnInputFile(InputFileChangeEventArgs e)
        {
            var file = e.File;

            if (file != null && (file.ContentType == "image/jpeg" || file.ContentType == "image/png"))
            {
                string savedUploadedImageFullPath = await SaveUploadedImage(file);

                // Optional: Set preview if you still want to show it in the UI
                using var ms = new MemoryStream();
                using var previewStream = file.OpenReadStream(maxAllowedSize: 10 * 1024 * 1024);
                await previewStream.CopyToAsync(ms);
                var bytes = ms.ToArray();
                UploadedImagePreview = $"data:{file.ContentType};base64,{Convert.ToBase64String(bytes)}";

                string? prediction = await CallPredictApiAsync(savedUploadedImageFullPath);

                var jsonBboxes = CreateBoundingBoxJson(prediction);
                await JsRunTime.InvokeVoidAsync("InitLoadBoundingBoxes", jsonBboxes);

                Console.WriteLine($"Prediction {prediction}");

                //StateHasChanged();
            }
        }

        private string CreateBoundingBoxJson(string? prediction)
        {
            if (string.IsNullOrEmpty(prediction))
                return "[]";
            try
            {
                var mlPrediction = JsonSerializer.Deserialize<MLPrediction>(prediction);
                LatestPrediction = mlPrediction;
                return ConvertMLPredictionToBoundingBoxJson(mlPrediction!);
            }
            catch (JsonException ex)
            {
                Console.WriteLine($"Error deserializing prediction: {ex.Message}");
                return "[]";
            }

        }

        public static string ConvertMLPredictionToBoundingBoxJson(MLPrediction prediction)
        {
            var boxes = prediction.predictedBoundingBoxes;
            var labels = prediction.predictedLabel ?? new List<string>();
            var scores = prediction.score ?? new List<float>();

            if (boxes == null || boxes.Count % 4 != 0)
                return "[]";

            var results = new List<object>();

            for (int i = 0; i < boxes.Count; i += 4)
            {
                float x1 = boxes[i];
                float y1 = boxes[i + 1];
                float x2 = boxes[i + 2];
                float y2 = boxes[i + 3];

                float width = x2 - x1;
                float height = y2 - y1;

                results.Add(new
                {
                    Name = i / 4 < labels.Count ? labels[i / 4] : "Unknown",
                    X = x1,
                    Y = y1,
                    Width = width,
                    Height = height,
                    Confidence = i / 4 < scores.Count ? scores[i / 4].ToString("0.0000") : "0.0000"
                });
            }

            return JsonSerializer.Serialize(results, new JsonSerializerOptions { WriteIndented = true });
        }

        private static async Task<string> SaveUploadedImage(IBrowserFile file)
        {
            var uploadsFolder = Path.Combine(Environment.CurrentDirectory, "UploadedImages");
            Directory.CreateDirectory(uploadsFolder); // Ensure folder exists

            var fileName = $"{Guid.NewGuid()}_{file.Name}";
            var filePath = Path.Combine(uploadsFolder, fileName);

            using (var stream = file.OpenReadStream(maxAllowedSize: 10 * 1024 * 1024))
            using (var fileStream = new FileStream(filePath, FileMode.Create))
            {
                await stream.CopyToAsync(fileStream);
            }

            return filePath;
        }

        private async Task<string?> CallPredictApiAsync(string imagePath)
        {
            try
            {
                var payload = new { imagePath = imagePath };
                var content = new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json");

                var response = await Http.PostAsync("https://localhost:65194/predict", content);

                if (response.IsSuccessStatusCode)
                {
                    var result = await response.Content.ReadAsStringAsync();
                    Console.WriteLine("Prediction result: " + result);
                    return result;
                }
                else
                {
                    Console.WriteLine($"API call failed: {response.StatusCode}");
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error calling API: " + ex.Message);
            }

            return null;
        }

    }
}
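
The MLPrediction model class referenced above is not listed in this article. A minimal sketch consistent with the camelCase JSON the /predict endpoint returns could look like the following; the property names are assumed from the usage in the Home component and the default camelCase serialization of minimal APIs:

using System.Collections.Generic;

namespace ObjectDetectionMachineLearning.Web.Models
{
    public class MLPrediction
    {
        public List<float>? predictedBoundingBoxes { get; set; } // flat list: four floats (left, top, width, height) per detected object
        public List<string>? predictedLabel { get; set; }        // one label per detected object
        public List<float>? score { get; set; }                  // one confidence score per detected object
    }
}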



home.js

Over to the client-side JavaScript next, which handles drawing the bounding boxes using the HTML5 canvas.


var colorPalette = [
    "red", "yellow", "blue", "green", "fuchsia",
    "moccasin", "purple", "magenta", "aliceblue",
    "lightyellow", "lightgreen"
];

function rescaleCanvas() {
    var img = document.getElementById('PreviewImage');
    var canvas = document.getElementById('PreviewImageBbox');
    var displayWidth = img.clientWidth;
    var displayHeight = img.clientHeight;
    canvas.width = displayWidth;
    canvas.height = displayHeight;
}

function LoadBoundingBoxes(objectDescriptions) {
    if (!objectDescriptions) {
        alert('No objects found in image.');
        return;
    }

    console.log(new Date() + ' ' + 'home.js : Loading bounding boxes from returned results ..');

    var objectDesc = typeof objectDescriptions === "string"
        ? JSON.parse(objectDescriptions)
        : objectDescriptions;

    var canvas = document.getElementById('PreviewImageBbox');
    var img = document.getElementById('PreviewImage');
    var ctx = canvas.getContext('2d');

    rescaleCanvas(); // size the canvas to the displayed preview image before computing scale factors

    var scaleX = canvas.width / img.naturalWidth;
    var scaleY = canvas.height / img.naturalHeight;

    ctx.clearRect(0, 0, canvas.width, canvas.height);
    ctx.drawImage(img, 0, 0, canvas.width, canvas.height);
    ctx.font = "10px Verdana";

    console.log(`ctx.drawImage Canvas width: ${canvas.width} Canvas height: ${canvas.height} ScaleX ${scaleX} ScaleY ${scaleY}`);

    for (var i = 0; i < objectDesc.length; i++) {
        const obj = objectDesc[i];
        const x = obj.X * scaleX;
        const y = obj.Y * scaleY;
        const width = obj.Width * scaleX;
        const height = obj.Height * scaleY;

        ctx.fillStyle = "black"; // label text color (no stroke is drawn in this pass)
        ctx.fillText(obj.Name, x + width / 2, y + height / 2);
        ctx.fillText("Confidence: " + obj.Confidence, x + width / 2, 10 + y + height / 2);
    }

    for (var i = 0; i < objectDesc.length; i++) {
        const obj = objectDesc[i];
        const x = obj.X * scaleX;
        const y = obj.Y * scaleY;
        const width = obj.Width * scaleX;
        const height = obj.Height * scaleY;

        const color = getColor();
        ctx.fillStyle = color;
        ctx.globalAlpha = 0.2;
        ctx.fillRect(x, y, width, height);

        ctx.globalAlpha = 1.0;
        ctx.lineWidth = 3;
        ctx.strokeStyle = "blue";
        ctx.strokeRect(x, y, width, height);

        ctx.fillStyle = "black";
        ctx.fillText("Color: " + color, x + width / 2, 20 + y + height / 2); // report the same color that was used for the fill
    }

    console.log('Bounding boxes:', objectDesc);
}

function getColor() {
    var colorIndex = Math.floor(Math.random() * colorPalette.length);
    return colorPalette[colorIndex];
}

function InitLoadBoundingBoxes(objectDescriptions) {
    const img = document.getElementById('PreviewImage');

    const draw = () => {
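        // Small delay so the preview <img> has been rendered and laid out before drawing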
        setTimeout(() => {
            LoadBoundingBoxes(objectDescriptions);
        }, 1000);
    };

    if (!img.complete) {
        img.onload = draw;
    } else {
        draw();
    }
}



ML.NET project for object detection

The repo also contains a project with the object detection training data. You will need the ML.NET workload installed to work with ML.NET in VS 2022.


Model Builder in VS 2022

Training with pictures - training data

The ML.NET project allows you, via the built-in Model Builder UI, to select the pictures with which you train the machine learning model.
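
Model Builder also generates consumption code, so the trained model can be invoked directly from C# without going through the Web API. Below is a minimal sketch, assuming the generated StopSignDetection wrapper class used by the Web API project; the image path is just an illustrative placeholder:

using Microsoft.ML.Data;         // for MLImage
using StopSignDetection_WebApi1; // namespace of the generated wrapper class

// Load an image (placeholder path) and run the trained model directly
var image = MLImage.CreateFromFile(@"C:\temp\sample-stop-sign.png");
var input = new StopSignDetection.ModelInput { Image = image };
var output = StopSignDetection.Predict(input);

// The output carries one label and score per detected object, plus four bounding box floats each
Console.WriteLine($"Detected {output.PredictedLabel?.Length ?? 0} object(s)");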



Concluding notes

We have seen how we can fairly conveniently train a machine learning model to analyze images and detect objects. In this demo, stop signs along roads and streets in the United States are used for the training. Note that with VoTT, the Visual Object Tagging Tool, you can enter multiple labels, i.e. you can train the machine learning model to detect and interpret multiple types of objects, for example speed signs in addition to stop signs. A self-driving car would use machine learning to intelligently interpret such road signs in real time and combine them with additional input such as GPS, road databases and LIDAR to ultimately achieve the situational awareness needed to drive a car. Of course, we have just trained a machine learning model to detect stop signs from some 50 training images, but you should now have a better understanding of how you can train machine learning models to detect objects in images.
