This post describes what cost functions are in Machine Learning as they relate to a linear regression supervised learning algorithm.
A function, in programming and in mathematics, describes a process of pairing each input value with exactly one output value.
The set of all possible input values of a function is called the function’s domain. The set of corresponding output values is called the function’s codomain or range. Each value in the domain is connected to exactly one value in the codomain. In programming, a function that always returns the same output for the same input, and has no side effects, is also called a pure function.
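As a small sketch of that idea (the function name and values here are my own, chosen to match the example below), a pure function in Python simply maps each input to exactly one output:

```python
def add_three(x):
    # A pure function: the same input always produces the same output,
    # and nothing outside the function is read or modified.
    return x + 3

print(add_three(1))  # 4
print(add_three(2))  # 5
```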
In a linear regression supervised learning problem, the goal is to find a hypothesis function that, given an input value, most accurately produces the “correct” output value across a continuous range. To find the best hypothesis function, we must provide training data, which serves as the basis for selecting the hypothesis function. The training data consists of examples of “correct” input and output pairs.
For example, the values in the set X, {1, 2, 3}, can be inputs which map to the values in the corresponding positions in the set Y, {4, 5, 6}, which represent the outputs. The ordered pairs (1, 4), (2, 5), and (3, 6) describe the training data, where the first value of each pair is the input and the second value is the output. The hypothesis function whose predicted outputs most closely match the expected outputs from the training data, across all input values, is the best function to use in a linear regression problem.
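As a minimal sketch of this setup (the variable names are my own, and h(x) = x + 3 is just one candidate hypothesis that happens to fit these pairs), the training data and a hypothesis function could look like:

```python
# Training data: (input, expected output) pairs from the example above.
training_data = [(1, 4), (2, 5), (3, 6)]

def h(x):
    # One candidate hypothesis function; h(x) = x + 3 happens to match
    # every pair in this particular training set.
    return x + 3

for x, y in training_data:
    print(x, h(x), y)  # the prediction h(x) equals the expected output y
```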
It takes some work to find this best function. To help with this task, we can check how closely a hypothesis function’s output value matches actual data for a given input value.
For example, if we call the hypothesis function h(x) with the value 1, then we would want h(1) to return 4. This is the exact output value we had in the training data pair (1, 4). The difference between the output value h(1) = 4 and the actual value would equal 0, which can be expressed as 4 - 4 = 0. More generally, the difference is h(x) - y, where x is the input to the hypothesis function h(x) and y is the actual expected output. The resulting value of this expression is called the error of the hypothesis function.
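A short sketch of that error calculation, using the same hypothetical h(x) = x + 3 and the training pair (1, 4):

```python
def h(x):
    # Hypothetical hypothesis function used for this example.
    return x + 3

x, y = 1, 4        # a training pair: input 1, expected output 4
error = h(x) - y   # difference between the prediction and the actual value
print(error)       # 0, since h(1) = 4 matches the training data exactly
```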
We would want to repeat the above process for every pair in the training data and take the total sum of the errors. A smaller sum means a function predicts the output values more accurately than the other candidate hypothesis functions.
I mentioned we’d “take the total sum of the errors”. It is important to notice that an error amount may be a negative number. This occurs when h(x) is smaller than y. To work around this, we can sum the absolute value of each error instead. This can be done by squaring the difference and then taking the principal square root, like so: √((h(x) - y)²) = |h(x) - y|.
While this equation would give a positive total to judge whether or not a function accurately predicts the data, it treats a prediction that is off by 2 from the actual value as only twice as bad, or costly, as a prediction that is off by 1. The same reasoning would hold for being off by 4 instead of 2. Rather than treating the differences as a linear progression of how inaccurate the function is, we could instead square each error amount before adding it to the total sum of errors. This would mean a prediction that is off by 4 contributes 16 to the total sum of errors, which is four times worse than being off by 2, which contributes only 4.
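A short sketch contrasting the two ways of accumulating error, using made-up error amounts of 2 and 4:

```python
errors = [2, 4]  # example differences between predictions and actual values

absolute_total = sum(abs(e) for e in errors)  # 2 + 4 = 6
squared_total = sum(e ** 2 for e in errors)   # 4 + 16 = 20

# Squaring penalizes the larger miss much more heavily: being off by 4
# contributes 16, four times what being off by 2 contributes.
print(absolute_total, squared_total)
```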
Summing the squared errors of a function’s outputs, as illustrated above, is known as the Sum of Squared Errors. The Sum of Squared Errors is a commonly used technique for creating a cost function. A cost function is used to determine how accurately a hypothesis function predicts the data and which parameters should be used in the hypothesis function.
Parameters are passed as arguments into the cost function. The cost function then uses these parameters to define a hypothesis function, which is judged on its accuracy using a scoring technique such as the Sum of Squared Errors. Different sets of parameters are passed into the cost function until a hypothesis function is found that minimizes the cost function to the greatest extent possible. This hypothesis function is then used to predict future values for unseen data. The cost function described above is the Squared Error Function.
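As a hedged sketch of that idea (the parameter names theta0 and theta1, and the linear form of the hypothesis, are assumptions chosen to match a simple linear regression setup, not something stated in the post), a cost function takes the parameters, builds a hypothesis, and scores it with the Sum of Squared Errors:

```python
training_data = [(1, 4), (2, 5), (3, 6)]

def cost(theta0, theta1):
    # Build the hypothesis h(x) = theta0 + theta1 * x from the parameters,
    # then score it with the Sum of Squared Errors over the training data.
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in training_data)

print(cost(3, 1))  # 0  -> h(x) = 3 + x fits this training data perfectly
print(cost(0, 1))  # 27 -> h(x) = x is a much worse fit
```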
Above, we described how a cost function receives parameters as input and returns a number representing the amount of error the hypothesis function produces when using those parameters. To expand on this idea, we can find the best parameters to use by incrementally increasing or decreasing a parameter’s value. For example, a parameter could be any real number along an axis, in the positive or negative direction. By moving along this axis and plotting the output of the cost function, a cost function graph can be created. Depending on how many parameters the cost function has, the graph can look like a 2D curve for a one-parameter cost function or a 3D surface plot for a two-parameter cost function.
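A minimal sketch of building that kind of picture for the one-parameter case (here theta0 is fixed at 3 and only theta1 varies, which is an assumption made purely for illustration; printing the values stands in for an actual plot):

```python
training_data = [(1, 4), (2, 5), (3, 6)]

def cost(theta1):
    # One-parameter cost: hypothesis h(x) = 3 + theta1 * x,
    # scored with the Sum of Squared Errors.
    return sum((3 + theta1 * x - y) ** 2 for x, y in training_data)

# Sweep theta1 along an axis and record the cost at each point.
# Plotting these (theta1, cost) pairs traces out a 2D curve whose
# minimum sits at theta1 = 1.
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, cost(theta1))
```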
If the above graphs are available, then finding a minimum or maximum point becomes trivial to do visually. In order to systematically find a minimum, the Gradient Descent algorithm can be used.
This was a description of what a cost function is used for in a linear regression supervised learning problem. Read more about Unsupervised Machine Learning in this post.
Questions to answer in this post or in another post later:
- What is Gradient Descent?
- Why is the mean/average used for the summation?
- What is a derivative?
- How does 1/2 help with the Squared Error Function after taking the derivative?
- Why is this algorithm called Batch Gradient Descent?