Introduction

The step of capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.

Features

The columns that are input into our model (and later used to make predictions) are called features. By convention, the feature data is called X.

Prediction Target

The column to be predicted is called the prediction target. By convention, the prediction target is called y.
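
As a minimal sketch of this convention, the snippet below splits a small table into features X and a prediction target y. The rows and the column names (rooms, area, price) are invented purely for illustration.

```python
# Hypothetical data: each row describes one house (all values are invented).
rows = [
    {"rooms": 2, "area": 50, "price": 150},
    {"rooms": 3, "area": 75, "price": 210},
    {"rooms": 4, "area": 120, "price": 330},
]

feature_names = ["rooms", "area"]  # columns fed into the model
target_name = "price"              # column to be predicted

# X holds one list of feature values per row; y holds one target value per row.
X = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]

print(X)
print(y)
```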

Mean Absolute Error

With the MAE metric, we take the absolute value of each error (the difference between the actual value and the predicted value). This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be read as:

On average, our predictions are off by about X.
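
The calculation can be sketched in a few lines of plain Python. The actual and predicted values below are invented for illustration; they do not come from a real model.

```python
# Hypothetical actual values and model predictions (invented for illustration).
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# Absolute value of each error, so every error counts as a positive number.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# Average of the absolute errors.
mae = sum(abs_errors) / len(abs_errors)

print(mae)  # 0.875
```

Here the predictions are off by about 0.875 on average.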

Mean, Median, and Mode

In Machine Learning (and in mathematics) there are often three values that interest us:

Mean - The average value
Median - The mid point value
Mode - The most common value
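
A short sketch of all three, using the car speeds from the scatter plot example in these notes. It uses Python's built-in statistics module, which is a choice made here for brevity rather than something these notes prescribe.

```python
import statistics

# Speeds of 13 cars (the same list as in the scatter plot example).
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

mean_speed = statistics.mean(speed)      # sum of the values divided by their count
median_speed = statistics.median(speed)  # middle value after sorting
mode_speed = statistics.mode(speed)      # value that appears most often

print(round(mean_speed, 2))  # 89.77
print(median_speed)          # 87
print(mode_speed)            # 86
```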

Standard Deviation

Standard deviation is a number that describes how spread out the values are. A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range.

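Example: we have registered the speed of 7 cars:
speed = [86,87,88,86,87,85,86]
The standard deviation is 0.9, meaning that most of the values are within 0.9 of the mean value, which is 86.4.

Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
The standard deviation is 37.85, meaning that most of the values are within 37.85 of the mean value, which is 77.4.

The NumPy module has a method to calculate the standard deviation: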

import numpy

speed = [32,111,138,28,59,77,97]
# numpy.std returns the population standard deviation by default
x = numpy.std(speed)
print(x)

Variance

If you multiply the standard deviation by itself, you get the variance. Equivalently, the standard deviation is the square root of the variance.

Standard deviation is often represented by the symbol sigma: σ
Variance is often represented by the symbol sigma squared: σ²
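
The relationship can be checked step by step. This sketch computes the variance by hand for the wider-range speed list above, takes its square root, and compares both results with NumPy.

```python
import math
import numpy

speed = [32, 111, 138, 28, 59, 77, 97]

# Step by step: mean, squared differences from the mean, then their average.
mean = sum(speed) / len(speed)
variance = sum((v - mean) ** 2 for v in speed) / len(speed)
std_dev = math.sqrt(variance)

print(round(variance, 2))  # 1432.24
print(round(std_dev, 2))   # 37.85

# The manual results agree with NumPy's population variance and standard deviation.
print(math.isclose(variance, numpy.var(speed)))  # True
print(math.isclose(std_dev, numpy.std(speed)))   # True
```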

Percentile

A percentile is the value below which a given percentage of the values in a data set fall. The example below finds the 90th percentile of a list of ages, that is, the age that 90% of the people are at or below:

import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
# 90th percentile of the ages
x = numpy.percentile(ages, 90)
print(x)  # 61.0
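
To see what the result means, this small sketch counts how many of the ages are at or below the 90th percentile. The share comes out close to 90%, not exactly 90%, because the list holds only 21 values.

```python
import numpy

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

p90 = numpy.percentile(ages, 90)

# Share of the ages that are at or below the 90th percentile.
share = sum(1 for age in ages if age <= p90) / len(ages)

print(p90)
print(round(share, 2))
```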

Histogram

To visualize the data set we can draw a histogram with the data we collected.

We will use the Python module Matplotlib to draw a histogram:

import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
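
The numbers behind such a plot can also be inspected without drawing anything. As a sketch, numpy.histogram returns the count of values in each of the 5 bins along with the bin boundaries; the data is random, so the counts differ from run to run.

```python
import numpy

# 250 random values between 0.0 and 5.0, as in the example above.
x = numpy.random.uniform(0.0, 5.0, 250)

# counts[i] is the number of values that fall into bin i;
# edges holds the 6 boundaries of the 5 bins.
counts, edges = numpy.histogram(x, 5)

print(counts)
print(edges)
```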

Scatter Plot

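A scatter plot is a diagram where each value in the data set is represented by a dot. It needs two lists of the same length, one for the values on the x-axis and one for the values on the y-axis.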

import matplotlib.pyplot as plt
# x is age of the car
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
# y is the speed of each car
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()

Coefficient

The coefficient is a factor that describes the relationship with an unknown variable.

Example: if x is a variable, then 2x is x two times. x is the unknown variable, and the number 2 is the coefficient.

Linear Regression

The term regression is used when you try to find the relationship between variables. In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events.

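Linear regression uses the relationship between the data points to draw the straight line that fits them best. The line usually does not pass through every point; it is the line that lies as close as possible to all of them.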

This line can be used to predict future values.
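
As a minimal sketch, the car ages and speeds from the scatter plot example can be fitted with numpy.polyfit using degree 1, which gives a least-squares straight line. The helper name predict_speed is introduced here for illustration only.

```python
import numpy

# x is the age of each car, y is its speed (same data as the scatter plot example).
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# A degree-1 polynomial fit is a straight line: y = slope * x + intercept.
slope, intercept = numpy.polyfit(x, y, 1)

def predict_speed(age):
    """Predict the speed of a car of the given age from the fitted line."""
    return slope * age + intercept

print(round(slope, 2))              # -1.75
print(round(intercept, 2))          # 103.11
print(round(predict_speed(10), 2))  # 85.59
```

The negative slope says that, in this data, older cars tend to be slower; the fitted line predicts a speed of about 85.6 for a 10-year-old car.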

Polynomial Regression

If your data points clearly do not follow a straight line, linear regression is a poor fit, and polynomial regression may be a better choice.

Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points. The difference is that the line is allowed to curve.
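
A minimal sketch with numpy.polyfit follows. The data is hypothetical: it is generated from a known curve, y = 2x² - 3x + 1, so that the fitted coefficients can be checked against the values that produced it.

```python
import numpy

# Hypothetical data generated from a known curve: y = 2x^2 - 3x + 1.
x = numpy.arange(10)
y = 2 * x**2 - 3 * x + 1

# Fit a degree-2 polynomial; coefficients are returned highest power first.
coefficients = numpy.polyfit(x, y, 2)

# poly1d wraps the coefficients in a callable model for predictions.
model = numpy.poly1d(coefficients)

print(coefficients)
print(model(12))
```

The fit recovers coefficients close to 2, -3 and 1, and the model predicts a value close to 253 for x = 12. Real data is noisy, so the degree has to be chosen by looking at the data, for example in a scatter plot.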

Multiple Regression

Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.
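
One way to sketch this is a least-squares fit with numpy.linalg.lstsq. The data is hypothetical, built from a known relationship, y = 3·x1 + 2·x2 + 5, so the result can be verified; the variable names are assumptions made for this example.

```python
import numpy

# Hypothetical data: two independent variables and a target built from them.
x1 = numpy.array([1, 2, 3, 4, 5], dtype=float)
x2 = numpy.array([2, 1, 4, 3, 6], dtype=float)
y = 3 * x1 + 2 * x2 + 5

# Design matrix: one column per independent variable,
# plus a column of ones for the intercept.
A = numpy.column_stack([x1, x2, numpy.ones_like(x1)])

# Least-squares solution of A @ coef = y.
coef, residuals, rank, singular_values = numpy.linalg.lstsq(A, y, rcond=None)
b1, b2, intercept = coef

# Predict the target for a new observation with x1 = 6 and x2 = 5.
predicted = b1 * 6 + b2 * 5 + intercept

print(coef)
print(predicted)
```

The fit recovers coefficients close to 3 and 2 and an intercept close to 5, so the prediction for the new observation is close to 33.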