Introduction
The step of capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.
Features
The columns that are input into our model (and later used to make predictions) are called features.
By convention, the features are called x.
Prediction Target
The column to be predicted is called the prediction target.
By convention, the prediction target is called y.
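A minimal sketch of this naming convention, assuming a hypothetical toy table (the column names and values are made up for illustration):

```python
# Hypothetical toy data: each row is one house (values are illustrative only).
rows = [
    {"rooms": 3, "area": 70, "price": 200},
    {"rooms": 4, "area": 95, "price": 260},
    {"rooms": 2, "area": 50, "price": 150},
]

feature_names = ["rooms", "area"]  # columns fed into the model
target_name = "price"              # column to be predicted

# x holds the features, y holds the prediction target.
x = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]

print(x)  # [[3, 70], [4, 95], [2, 50]]
print(y)  # [200, 260, 150]
```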
Mean Absolute Error
With the MAE metric, we take the absolute value of each error (the difference between the actual value and the predicted value). This converts each error to a non-negative number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be read as:
On average, our predictions are off by about X.
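The calculation can be sketched in plain Python. The helper name `mean_absolute_error` and the sample values are assumptions chosen for illustration, not part of the original notes:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    errors = [abs(actual - predicted) for actual, predicted in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Hypothetical values, for illustration only.
actual = [100, 150, 200]
predicted = [110, 140, 230]

# The absolute errors are 10, 10 and 30, so the MAE is 50 / 3.
print(mean_absolute_error(actual, predicted))
```

Here the predictions are off by about 16.7 on average.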
Mean, Median, and Mode
In Machine Learning (and in mathematics) there are often three values that interest us:
Mean - The average value
Median - The midpoint value
Mode - The most common value
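As a small sketch, all three values can be computed with Python's standard `statistics` module. The list reuses the car speeds from the scatter plot example in these notes:

```python
import statistics

# Car speeds, taken from the scatter plot example in these notes.
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

mean_speed = statistics.mean(speed)      # sum of the values divided by their count
median_speed = statistics.median(speed)  # middle value once the list is sorted
mode_speed = statistics.mode(speed)      # value that appears most often

print(mean_speed)    # about 89.77
print(median_speed)  # 87
print(mode_speed)    # 86
```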
Standard Deviation
Standard deviation is a number that describes how spread out the values are. A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range.
Example: we have registered the speed of 7 cars:
speed = [86,87,88,86,87,85,86]
The standard deviation is: 0.9
Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
The standard deviation is: 37.85
import numpy
speed = [32,111,138,28,59,77,97]
# numpy.std returns the population standard deviation by default
x = numpy.std(speed)
print(x)
Variance
If you multiply the standard deviation by itself, you get the variance.
Standard deviation is often represented by the symbol sigma: σ
Variance is often represented by the symbol sigma squared: σ²
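The relationship can be checked with a short sketch. It uses the standard `statistics` module rather than NumPy; `pstdev` and `pvariance` are the population versions, which is also what `numpy.std` computes by default:

```python
import statistics

speed = [32, 111, 138, 28, 59, 77, 97]

std_dev = statistics.pstdev(speed)      # population standard deviation
variance = statistics.pvariance(speed)  # population variance

print(round(std_dev, 2))  # 37.85, the value quoted in the standard deviation example
print(round(variance, 2))

# Squaring the standard deviation gives the variance (up to floating-point rounding).
print(abs(std_dev ** 2 - variance) < 1e-9)
```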
Percentile
A percentile is the value below which a given percentage of the values fall. For example, the 90th percentile of the ages below is the age that 90% of the values are lower than.
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
# Age that 90% of the values are lower than
x = numpy.percentile(ages, 90)
print(x)
61.0
Histogram
To visualize the data set we can draw a histogram with the data we collected.
We will use the Python module Matplotlib to draw a histogram:
import numpy
import matplotlib.pyplot as plt
# 250 random values drawn uniformly between 0.0 and 5.0
x = numpy.random.uniform(0.0, 5.0, 250)
# Draw the histogram with 5 bars (bins)
plt.hist(x, 5)
plt.show()
Scatter Plot
A scatter plot is a diagram where each value in the data set is represented by a dot.
import matplotlib.pyplot as plt
# x is age of the car
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
# y is the speed of each car
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Coefficient
The coefficient is a factor that describes the relationship with an unknown variable.
Example: if x is a variable, then 2x is x two times. x is the unknown variable, and the number 2 is the coefficient.
Linear Regression
The term regression is used when you try to find the relationship between variables. In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events.
Linear regression uses the relationship between the data points to fit a straight line that passes as close as possible to all of them.
This line can be used to predict future values.
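A minimal sketch of fitting such a line with ordinary least squares in plain Python, reusing the car data from the scatter plot example. The `predict` helper is a name introduced here for illustration; libraries such as SciPy or scikit-learn provide equivalent fitting routines:

```python
# Car data from the scatter plot example: x is age, y is speed.
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for a single feature:
# slope = covariance(x, y) / variance(x); the line passes through (mean_x, mean_y).
covariance_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
variance_x = sum((xi - mean_x) ** 2 for xi in x)
slope = covariance_xy / variance_x
intercept = mean_y - slope * mean_x

def predict(age):
    """Predicted speed for a car of the given age, read off the fitted line."""
    return slope * age + intercept

print(slope, intercept)  # roughly -1.75 and 103.11
print(predict(10))       # roughly 85.59
```

The slope is the coefficient of x: in this data, each extra year of age lowers the predicted speed by about 1.75.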
Polynomial Regression
If your data points clearly do not follow a straight line, polynomial regression may be a better fit.
Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the curve that best fits the data points.
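A short sketch using NumPy's `polyfit` and `poly1d`. The data is hypothetical and constructed to follow y = x² + 1 exactly, so the fitted coefficients are easy to check; the degree (2) is likewise an assumption chosen for this example:

```python
import numpy

# Hypothetical data that follows a curve (exactly y = x**2 + 1), for illustration only.
x = [0, 1, 2, 3, 4]
y = [1, 2, 5, 10, 17]

# Fit a degree-2 polynomial by least squares; coefficients come highest power first.
coefficients = numpy.polyfit(x, y, 2)
model = numpy.poly1d(coefficients)

print(coefficients)  # approximately [1, 0, 1]
print(model(5))      # approximately 26
```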
Multiple Regression
Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.
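One way to sketch this is with NumPy's least-squares solver. The data below is hypothetical and built from y = 2a + 3b + 1, so the recovered coefficients can be verified; the names `features`, `target` and `design` are chosen here for illustration:

```python
import numpy

# Hypothetical data: two independent variables (a, b) per row, illustrative only.
features = numpy.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Target built as y = 2*a + 3*b + 1 so the result is easy to check.
target = numpy.array([9.0, 8.0, 19.0, 18.0, 26.0])

# Add a column of ones so the fit also learns an intercept.
design = numpy.column_stack([features, numpy.ones(len(features))])
solution, *_ = numpy.linalg.lstsq(design, target, rcond=None)
coef_a, coef_b, intercept = solution

# Predict for a = 6, b = 2 (the trailing 1.0 multiplies the intercept).
prediction = numpy.array([6.0, 2.0, 1.0]) @ solution

print(coef_a, coef_b, intercept)  # approximately 2, 3 and 1
print(prediction)                 # approximately 19
```

Each independent variable gets its own coefficient, which describes how much the prediction changes when that variable increases by one and the others stay fixed.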