https://github.com/jeffprosise/ML.NET/blob/master/MLN-SimpleRegression/MLN-SimpleRegression/Data/poverty.csv

This article uses LINQPad 8. First, add the following two NuGet packages:
- Microsoft.ML
- Microsoft.ML.Mkl.Components
Plotutils.cs
void PlotScatterGraph<T>(MLContext mlContext, IDataView trainData, Func<T, PointItem> pointCreator, string chartTitle) where T : class, new()
{
    // Convert the IDataView to an enumerable collection
    var data = mlContext.Data.CreateEnumerable<T>(trainData, reuseRowObject: false).Select(x => pointCreator(x)).ToList();

    // Calculate trendline (simple linear regression)
    double avgX = data.Average(d => d.X);
    double avgY = data.Average(d => d.Y);
    double slope = data.Sum(d => (d.X - avgX) * (d.Y - avgY)) / data.Sum(d => (d.X - avgX) * (d.X - avgX));
    double intercept = avgY - slope * avgX;
    var trendline = data.Select(d => new { X = d.X, Y = slope * d.X + intercept }).ToList();

    // Plot the scatter graph
    var plot = data.Chart(d => d.X)
        .AddYSeries(d => d.Y, LINQPad.Util.SeriesType.Point, chartTitle)
        .AddYSeries(d => trendline.FirstOrDefault(t => t.X == d.X)?.Y ?? 0, Util.SeriesType.Line, "Trendline")
        .ToWindowsChart();
    plot.AntiAliasing = System.Windows.Forms.DataVisualization.Charting.AntiAliasingStyles.All;
    plot.Dump();
}
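The trendline calculation in PlotScatterGraph is the closed-form least squares fit for a single variable. Written out as formulas, with x and y standing for the X and Y properties of PointItem, m for the variable slope and b for the variable intercept (the symbols are chosen here for illustration), the code computes:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i

m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
b = \bar{y} - m\,\bar{x}

\hat{y}_i = m\,x_i + b
```

Each trendline point is then the pair of x_i and the fitted value on the last line.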
Let's look at the code that loads the CSV data into the MLContext and then uses the method TrainTestSplit to split the data into training data and test data.
Note also the classes Input and Output and the usage of the attributes LoadColumn and ColumnName.
Program.cs
void Main()
{
    // LINQPad-specific: resolve the CSV path relative to the location of the current query file
    string inputFile = Path.Combine(Path.GetDirectoryName(Util.CurrentQueryPath)!, @"Sampledata\poverty2.csv");
    var context = new MLContext(seed: 0);

    // Load the data from the CSV file
    var data = context.Data
        .LoadFromTextFile<Input>(inputFile, hasHeader: true, separatorChar: ';');

    // Split data into training and test sets
    var split = context.Data.TrainTestSplit(data, testFraction: 0.2);
    var trainData = split.TrainSet;
    var testData = split.TestSet;

    // Define the pipeline and train the model
    var pipeline = context
        .Transforms.NormalizeMinMax("PovertyRate")
        .Append(context.Transforms.Concatenate("Features", "PovertyRate"))
        .Append(context.Regression.Trainers.Ols());
    var model = pipeline.Fit(trainData);

    // Use the model to make a prediction
    var predictor = context.Model.CreatePredictionEngine<Input, Output>(model);
    var input = new Input { PovertyRate = 8.4f };
    var actual = 36.8f; // Actual value, used for comparison with the prediction
    var prediction = predictor.Predict(input);
    Console.WriteLine($"Input poverty rate: {input.PovertyRate} . Predicted birth rate per 1000: {prediction.TeenageBirthRate:0.##}");
    Console.WriteLine($"Actual birth rate per 1000: {actual}");

    // Evaluate the regression model
    var predictions = model.Transform(testData);
    var metrics = context.Regression.Evaluate(predictions);
    Console.WriteLine($"R-squared: {metrics.RSquared:0.##}");
    Console.WriteLine($"Root Mean Squared Error: {metrics.RootMeanSquaredError:0.##}");
    Console.WriteLine($"Mean Absolute Error: {metrics.MeanAbsoluteError:0.##}");
    Console.WriteLine($"Mean Squared Error: {metrics.MeanSquaredError:0.##}");

    // Plot the training data together with a trendline
    PlotScatterGraph<Input>(context, trainData, (Input row) =>
        new PointItem { X = (float) Math.Round(row.PovertyRate, 2), Y = (float) Math.Round(row.TeenageBirthRate, 2) },
        "Poverty rate (%) vs Teenage Pregnancies per 1,000 births");
}
public class PointItem
{
    public float X { get; set; }
    public float Y { get; set; }
}

void PlotScatterGraph<T>(MLContext mlContext, IDataView trainData, Func<T, PointItem> pointCreator, string chartTitle) where T : class, new()
{
    // Convert the IDataView to an enumerable collection
    var data = mlContext.Data.CreateEnumerable<T>(trainData, reuseRowObject: false).Select(x => pointCreator(x)).ToList();

    // Calculate trendline (simple linear regression)
    double avgX = data.Average(d => d.X);
    double avgY = data.Average(d => d.Y);
    double slope = data.Sum(d => (d.X - avgX) * (d.Y - avgY)) / data.Sum(d => (d.X - avgX) * (d.X - avgX));
    double intercept = avgY - slope * avgX;
    var trendline = data.Select(d => new { X = d.X, Y = slope * d.X + intercept }).ToList();

    // Plot the scatter graph
    var plot = data.Chart(d => d.X)
        .AddYSeries(d => d.Y, LINQPad.Util.SeriesType.Point, chartTitle)
        .AddYSeries(d => trendline.FirstOrDefault(t => t.X == d.X)?.Y ?? 0, Util.SeriesType.Line, "Trendline")
        .ToWindowsChart();
    plot.AntiAliasing = System.Windows.Forms.DataVisualization.Charting.AntiAliasingStyles.All;
    plot.Dump();
}

public class Input
{
    [LoadColumn(1)]
    public float PovertyRate;

    [LoadColumn(5), ColumnName("Label")]
    public float TeenageBirthRate { get; set; }
}

public class Output
{
    [ColumnName("Score")]
    public float TeenageBirthRate;
}
The code defines a machine learning pipeline and uses it as follows:
- The method NormalizeMinMax transforms the poverty rate into a normalized scale between 0 and 1.
- The method Concatenate specifies the "Features" column. In this case the column PovertyRate is the only feature, and from it we want to predict a score: the rate of teenage pregnancy births per 1,000 births. Note that our CSV data set contains more columns, but this is a simple regression where only one variable is taken into account.
- The trainer used to train the machine learning model is Ols, Ordinary Least Squares.
- The method Fit trains the model using the training data produced by the method TrainTestSplit.
- The resulting model is used to create a prediction engine.
- The Predict method of the prediction engine predicts a value for a single input item. Our prediction engine takes objects of type Input and returns objects of type Output.
- Calling the method Transform on the model with the test data gives us multiple predictions, and it is possible to evaluate the regression from these predictions to check how accurate the regression model is.
- The evaluation returns, among other metrics, R-squared. It describes how much of the total variation in the actual values the regression model explains. A value close to 1.0 means that the residuals, the offsets between the actual data points in a scatter graph and what the regression model predicts, are small. A value close to 0 means that the model explains little of the variation.
- Other metrics such as RMSE (root mean squared error) and MSE (mean squared error) are absolute error measures. RMSE is expressed in the same unit as the label and MSE in that unit squared, so they are not scaled to a fixed range the way R-squared is.
- Using the code above we got a fairly accurate regression model, but taking additional factors into account would likely give more accuracy.
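To make the evaluation metrics from the list above more concrete, here is a minimal sketch that computes MAE, MSE, RMSE and R-squared by hand using their textbook definitions. It assumes that the values ML.NET reports follow these definitions. The two arrays are made-up illustrative numbers, not rows from the poverty data set, and the helper names Mae, Mse, Rmse and RSquared are hypothetical. The sketch uses top-level statements and should also run as a "C# Statements" query in LINQPad.

```csharp
using System;
using System.Linq;

// Illustrative values only (assumption): NOT taken from poverty.csv.
double[] actual = { 30, 40, 50, 60 };
double[] predicted = { 32, 38, 53, 57 };

// Mean Absolute Error: average absolute residual, in the unit of the label.
double Mae(double[] y, double[] yHat) =>
    y.Zip(yHat, (a, p) => Math.Abs(a - p)).Average();

// Mean Squared Error: average squared residual, in the unit of the label squared.
double Mse(double[] y, double[] yHat) =>
    y.Zip(yHat, (a, p) => (a - p) * (a - p)).Average();

// Root Mean Squared Error: square root of MSE, back in the unit of the label.
double Rmse(double[] y, double[] yHat) => Math.Sqrt(Mse(y, yHat));

// R-squared: 1 - (sum of squared residuals / total sum of squares around the mean).
double RSquared(double[] y, double[] yHat)
{
    double mean = y.Average();
    double ssRes = y.Zip(yHat, (a, p) => (a - p) * (a - p)).Sum();
    double ssTot = y.Sum(a => (a - mean) * (a - mean));
    return 1 - ssRes / ssTot;
}

Console.WriteLine($"MAE: {Mae(actual, predicted):0.###}");
Console.WriteLine($"MSE: {Mse(actual, predicted):0.###}");
Console.WriteLine($"RMSE: {Rmse(actual, predicted):0.###}");
Console.WriteLine($"R-squared: {RSquared(actual, predicted):0.###}");
```

For these made-up numbers the residuals are -2, 2, -3 and 3, so MAE is 2.5, MSE is 6.5 and R-squared is 1 - 26/500 = 0.948.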
Output from the LINQPad 8 query shown in this article:
Input poverty rate: 8,4 . Predicted birth rate per 1000: 35,06
Actual birth rate per 1000: 36,8
R-squared: 0,59
Root Mean Squared Error: 8,99
Mean Absolute Error: 8,01
Mean Squared Error: 80,83
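As a quick sanity check of the output above, the reported metrics can be compared with each other. RMSE is the square root of MSE, so squaring the printed RMSE should give roughly the printed MSE, and MAE should not exceed RMSE. The numbers below are the rounded values from the output, written with a decimal point:

```latex
\mathrm{RMSE}^2 = 8.99^2 = 80.8201 \approx 80.83 = \mathrm{MSE}

\mathrm{MAE} = 8.01 \le 8.99 = \mathrm{RMSE}
```

The small difference in the first line is what one would expect from rounding RMSE to two decimals before squaring.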
Please note that there are some standard column names used for machine learning in ML.NET:
- Label: Represents the target variable (the value to predict).
- Features: Contains the input features used for training.
- Score: Stores the predicted value in regression models.
- PredictedLabel: Holds the predicted class in classification models.
- Probability: Represents the probability of a predicted class.
- FeatureContributions: Shows how much each feature contributes to a prediction.
In the code above, the column names "Label", "Features" and "Score" were used to tell ML.NET which columns the regression should use. The attribute ColumnName assigned "Label" and "Score" in the classes Input and Output, and the method Concatenate created the "Features" column.