While reading the description of the Housing Prices Competition for Kaggle Learn Users, I wanted to better understand how users’ submissions are evaluated. What follows is an exploration of the metric used.
Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
The metric uses a logarithm to convert errors from a scale based on individual dollar units to a logarithmic scale based on the proportional difference between the predicted and the observed sales price.
For illustration, I’ll denote the units to be powers of 10 which can be scaled up to make the result more realistic. A number like can be scaled to by multiplying the result by .
Let’s say there were two houses sold: a cheap house and an expensive house. The cheap house was sold for $0.10 (10^-1) and the expensive house was sold for $10 (10^1). If a prediction of $1 (10^0) was made for both houses, then the differences without taking the logarithm would be 0.9 and 9 respectively. The corresponding RMSE would be 0.9 and 9.
Just by looking at the numbers, one may think the prediction was better for the cheaper house than for the more expensive house. Actually, the predictions differ from the observed sales price by the same amount. This can be seen by switching from measuring on a dollar unit scale to a proportional scale.
Between the numbers 1 and 10, there are 9 discrete units (1…10). The same goes for 0.1 and 1 when counting in steps of 0.1 (0.1…1). Since the relative unit distances between the prediction and the observed sales price are the same, the relative prediction error should also be the same.
This can be shown by |log10(1) - log10(10)| = |log10(1) - log10(0.1)| = 1. The absolute value is taken since this is the effect of taking RMSE on a single data point.
Here I use a geometric mean to show that a sales price halfway, on the logarithmic scale, between the prediction and the order of magnitude above it results in the same amount of error as a sales price halfway between the prediction and the order of magnitude below it.
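The comparison above can be checked numerically. The following is a minimal sketch, not the competition's scoring code: it assumes a $1 prediction and sale prices of $0.10 and $10, and it uses base-10 logarithms to match the powers-of-10 illustration (a different base would only rescale the errors). The helper names `log_error` and `rmse` are hypothetical.

```python
import math

def log_error(predicted, observed):
    """Absolute difference between the base-10 logs of two prices."""
    return abs(math.log10(predicted) - math.log10(observed))

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

prediction = 1.0
cheap, expensive = 0.1, 10.0  # assumed sale prices, one order of magnitude either side

# Errors on the dollar scale differ by a factor of ten.
dollar_errors = [abs(prediction - cheap), abs(prediction - expensive)]

# Errors on the logarithmic scale are the same for both houses.
log_errors = [log_error(prediction, cheap), log_error(prediction, expensive)]

# On a single data point, RMSE reduces to the absolute error.
single_point_rmse = rmse([log_errors[0]])

# Geometric means sit halfway between the prediction and each sale price
# on the logarithmic scale, so both are the same log distance away.
halfway_up = math.sqrt(prediction * expensive)
halfway_down = math.sqrt(prediction * cheap)

print("dollar errors:", dollar_errors)
print("log errors:", log_errors)
print("halfway log errors:",
      log_error(prediction, halfway_up),
      log_error(prediction, halfway_down))
```

Both houses end up with the same log error, and the two geometric-mean prices are each half that distance from the prediction.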
A derivative is a concept from Calculus: it is the instantaneous rate of change, or slope, at a given point. The slope of a line can be found by rise/run or Δy/Δx or (y2 - y1)/(x2 - x1).
In the linear equation f(x) = 2x, the slope is 2 because for each unit change of x the output value changes by 2.
For curves such as a quadratic, the idea of a slope still applies, and it is instead called the derivative. g(x) = x^2 has a derivative of 2x because the slope at any point is twice the input: at x = 2, the slope is 4. That is, g'(x) = 2x and g'(2) = 4. We say the instantaneous rate of change, or the derivative, is 2x. Instead of giving one slope for the whole equation as we did for f(x) above, we can use the Leibniz notation, dg/dx = 2x.
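To see the difference between a constant slope and a slope that depends on x, the derivative can be approximated numerically. This is a rough sketch that assumes the linear example is f(x) = 2x and the quadratic example is g(x) = x^2; `numerical_derivative` is a hypothetical helper using a central difference.

```python
def f(x):
    return 2 * x   # assumed linear example: slope is 2 everywhere

def g(x):
    return x ** 2  # assumed quadratic example: slope is 2x

def numerical_derivative(func, x, h=1e-6):
    """Central difference approximation of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (1.0, 2.0, 3.0):
    print(x, numerical_derivative(f, x), numerical_derivative(g, x))
```

The slope of f stays near 2 at every point, while the slope of g tracks 2x.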
Gradient Descent is a commonly used algorithm in Machine Learning to find what parameters would minimize a cost function and find the best hypothesis function to predict a dataset.
The principle of Gradient Descent is to incrementally update parameters until a combination of parameter values creates a hypothesis function which most accurately predicts the data. To accurately predict the data, the hypothesis function would necessarily have a small amount of error. The error is measured by the cost function. Once the decrease in the cost function per iteration falls below a certain threshold, the Gradient Descent algorithm is considered to have converged on a local minimum, which may not be the global minimum. To look for a global minimum, different starting parameters need to be tried.
In a single variable (single feature) problem, Gradient Descent starts from a cost function and initial parameter values. The partial derivative of the cost function is found with respect to one of the parameters. Once the partial derivative is found, it is multiplied by a learning rate and subtracted from that parameter’s current value. The same is done for the remaining parameter, and the process repeats until parameter values are found that minimize the cost function.
In multiple variable Gradient Descent, the process is the same except there are more parameters to update in each iteration.
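The update loop described above might be sketched as follows for the single variable case. Several things here are assumptions rather than anything the text specifies: the hypothesis form theta0 + theta1 * x, the squared error cost averaged as 1/(2m), the learning rate of 0.1, the convergence tolerance, and the small hypothetical dataset.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost, averaged as 1/(2m) (an assumed convention)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, learning_rate=0.1, tolerance=1e-12,
                     max_iterations=10_000):
    theta0, theta1 = 0.0, 0.0  # initial parameter values
    m = len(xs)
    previous_cost = cost(theta0, theta1, xs, ys)
    for _ in range(max_iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters from the same set of errors.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
        current_cost = cost(theta0, theta1, xs, ys)
        # Stop once the cost decrease falls below the threshold.
        if previous_cost - current_cost < tolerance:
            break
        previous_cost = current_cost
    return theta0, theta1

# Hypothetical data that lies exactly on y = 3 + x.
xs, ys = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)
```

On this data the loop should settle close to theta0 = 3 and theta1 = 1. Extending it to more features means computing one partial derivative per parameter in each iteration.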
Previously I described the cost function for single variable linear regression, which used a single feature to define a hypothesis. In this post, I describe how a cost function can be extended to use multiple features.
A single variable linear equation takes the form h(x) = θ0 + θ1·x, where θ0 and θ1 are the parameters of the function and x is a feature in the dataset. An equation with two or more features adds one parameter per feature, as in h(x) = θ0 + θ1·x1 + θ2·x2. The extra features can also be powers of a single input, which gives polynomial equations such as a quadratic or a cubic (θ0 + θ1·x + θ2·x^2 or θ0 + θ1·x + θ2·x^2 + θ3·x^3).
Using Linear Algebra, the parameters and features can be contained in separate vectors. The parameter vector can be transposed and then multiplied with the features vector to get the hypothesis function prediction. This is a nice concise way to find the result of a hypothesis function without writing out the whole equation.
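As a small illustration of the vector form, the prediction is the dot product of the parameter vector and the feature vector. The sketch below uses plain Python lists and assumes the common convention of prepending a constant feature of 1 so that the first parameter acts as the intercept. The function name and the numbers are hypothetical.

```python
def hypothesis(parameters, features):
    """Dot product of the parameter vector with [1, feature_1, ..., feature_n]."""
    x = [1.0] + list(features)  # constant 1 pairs with the intercept parameter
    if len(parameters) != len(x):
        raise ValueError("expected one parameter per feature plus an intercept")
    return sum(p * xi for p, xi in zip(parameters, x))

parameters = [3.0, 1.0, 0.5]               # hypothetical parameter values
prediction = hypothesis(parameters, [2.0, 4.0])
print(prediction)                          # 3 + 1*2 + 0.5*4 = 7.0
```

With a numerical library the same computation would be a single transpose-and-multiply, which is what makes the vector form concise.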
A cost function is used along with the Gradient Descent algorithm to find the best parameters for the hypothesis function.
Now that there are more variables, they may be of different scales. This is where feature scaling comes in. For a feature, find the maximum value and minimum value of its range. Subtracting the minimum from the maximum gives the range size. Then subtract the feature’s mean from each feature value and divide by the range size, and use the scaled values in the cost function. The result is feature values between -1 and 1.
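One common way to do this is mean normalization: subtract the feature's mean, then divide by the range size. The sketch below assumes that variant; `scale_feature` and the square-footage values are hypothetical.

```python
def scale_feature(values):
    """Mean normalization: (value - mean) / (max - min)."""
    lowest, highest = min(values), max(values)
    range_size = highest - lowest
    if range_size == 0:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    mean = sum(values) / len(values)
    return [(v - mean) / range_size for v in values]

sizes = [800.0, 1200.0, 2000.0, 3000.0]  # hypothetical square footage
scaled = scale_feature(sizes)
print(scaled)
```

After scaling, every value lies between -1 and 1 and the feature is centered on 0, so features measured in different units contribute on a similar scale.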
The learning rate can be tuned by decreasing or increasing it. A learning rate that is too large can overshoot the minimum and fail to converge. A learning rate that is too small results in slow convergence. It is a good idea to test several learning rates in order to strike a good balance between convergence speed and accuracy.
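The effect of the learning rate can be seen on a toy problem. This sketch assumes a one-parameter cost function J(theta) = theta^2, whose derivative is 2 * theta and whose minimum is at 0. The three learning rates are arbitrary illustrative choices.

```python
def descend(learning_rate, start=1.0, steps=20):
    """Run a fixed number of gradient descent steps on J(theta) = theta**2."""
    theta = start
    for _ in range(steps):
        gradient = 2 * theta
        theta -= learning_rate * gradient
    return theta

for learning_rate in (0.01, 0.4, 1.1):
    print(learning_rate, descend(learning_rate))
```

After 20 steps the smallest rate has moved only part of the way toward 0, the middle rate is essentially at 0, and the largest rate has overshot on every step and moved further away from the minimum than where it started.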
This post describes what cost functions are in Machine Learning as they relate to a linear regression supervised learning algorithm.
A function in programming and in mathematics describes a process of pairing each input value with exactly one output value. All the possible input values of a function are called the function’s domain. The set of values the outputs are drawn from is called the function’s codomain, and the outputs the function actually produces are called its range. One value in the domain will only be connected to one value in the codomain, although several inputs may share the same output. In programming, a function which always returns the same output for the same input and has no side effects is called a Pure Function.
In a linear regression supervised learning problem, the goal is to find a hypothesis function which most accurately provides the “correct” output value across a continuous range given an input value. In order to find the best hypothesis function, we must provide training data which forms the basis for selecting the hypothesis function. The training data consists of examples of “correct” input and output pairs.
For example, the values in set X, {1, 2, 3}, can be inputs which map to values in the corresponding position in the set Y, {4, 5, 6}, which represent the outputs. The ordered pairs (1, 4), (2, 5), and (3, 6) describe the training data, where the first value of each pair is the input and the second value is the expected output. The hypothesis function whose predicted outputs, across all input values, most closely match the expected outputs from the training data is the best function to use in a linear regression problem.
It takes some work to find this best function. To help with this task, we can check how closely a hypothesis function’s output value matches actual data for a given input value.
For example, if we call the hypothesis function h(x) with the value 1, then we would want h(1) to return 4. This is the exact output value we had in the training data (1, 4). The difference between the output value h(1) = 4 and the actual value would equal 0. This can be expressed as 4 - 4 = 0. More generally, the difference is h(x) - y, where x is the input to the hypothesis function h(x) and y is the actual expected output. The resulting value of this equation is called the error of the hypothesis function.
We would want to repeat the above process for all available training data and take the total sum of errors. A smaller sum would mean a function was able to more accurately predict the output values relative to the other candidate hypothesis functions for a given input.
I mentioned we’d “take the total sum of the errors”. It is important to notice the error amount may be a negative number. This occurs when h(x) is smaller than y, and a negative error could cancel out a positive one in the sum. To work around this, you can sum the absolute values instead. This can be done by squaring the difference and then taking the principal square root, like so: sqrt((h(x) - y)^2) = |h(x) - y|.
While this equation would give a positive sum to judge whether or not a function accurately predicts the data, it treats a prediction which is off by 2 from the actual value as twice as bad, or costly, as a prediction which is off by 1. The same reasoning would hold for being off by 4 instead of 2. Rather than letting the penalty grow linearly with how far the prediction is from the actual value, we could instead add the square of the error to the total sum of errors. This would mean a prediction which is off by 4 contributes 16 to the total sum of errors. This is four times worse than being off by 2, which contributes only 4 to the total sum of errors.
Summing the squared errors of a function’s output, as illustrated above, describes Sum of Squared Errors. Sum of Squared Errors is a commonly used technique to create a cost function. A cost function is used to determine how accurately a hypothesis function predicts the data, and what parameters should be used in the hypothesis function.
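Here is a small sketch of Sum of Squared Errors on the training pairs (1, 4), (2, 5), and (3, 6) used earlier. The two candidate hypothesis functions, `h_good` and `h_poor`, are made up for illustration.

```python
training_data = [(1, 4), (2, 5), (3, 6)]

def sum_of_errors(h, data):
    """Plain sum of h(x) - y; negative errors can hide how far off h is."""
    return sum(h(x) - y for x, y in data)

def sum_of_squared_errors(h, data):
    """Sum of (h(x) - y)^2 over every training pair."""
    return sum((h(x) - y) ** 2 for x, y in data)

def h_good(x):
    return x + 3   # hypothetical candidate that matches every pair

def h_poor(x):
    return 2 * x   # hypothetical candidate that misses two of the pairs

for name, h in (("h_good", h_good), ("h_poor", h_poor)):
    print(name, sum_of_errors(h, training_data),
          sum_of_squared_errors(h, training_data))
```

For `h_poor` the individual errors are -2, -1, and 0, so the squared errors add up to 4 + 1 + 0 = 5, while `h_good` scores 0.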
Parameters are passed as arguments into a cost function. The cost function would then use these parameters to define a hypothesis function. The hypothesis function is judged on its accuracy with the use of some scoring technique such as Sum of Squared Errors. Several parameters are tested and passed into the cost function until a hypothesis function is found to have minimized the cost function to the greatest extent possible. This hypothesis function would then be the function used to predict future values for unseen data. The described cost function above is the Squared Errors Function.
Above we described how a cost function receives parameters as input and how it results in a number which represents the amount of error a hypothesis function had when using these parameters. To expand on this idea, we can find the best parameters to use with a cost function by incrementally increasing or decreasing a parameter’s value. For example, a parameter could be any real number in the positive or negative direction along an axis. By moving along this axis and plotting the output of the cost function, a cost function graph can be created. Depending on how many parameters are defined with the cost function, the graph can look like a 2d curve for a 1-parameter cost function or a 3d surface plot for a 2-parameter cost function.
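The idea of moving along a parameter's axis and recording the cost can be sketched as below. The assumptions are mine: a hypothesis of the form theta0 + theta1 * x with theta0 held fixed at 3, the training pairs (1, 4), (2, 5), (3, 6), and a coarse grid of candidate values for theta1.

```python
training_data = [(1, 4), (2, 5), (3, 6)]

def cost(theta1, data, theta0=3.0):
    """Sum of squared errors for the hypothesis theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in data)

# Candidate values for theta1 from -1.0 to 3.0 in steps of 0.5.
candidates = [i * 0.5 for i in range(-2, 7)]
curve = [(theta1, cost(theta1, training_data)) for theta1 in candidates]

for theta1, value in curve:
    print(theta1, value)   # these points trace the 2d cost curve

best_theta1, best_cost = min(curve, key=lambda point: point[1])
print("lowest cost at theta1 =", best_theta1)
```

Plotting the pairs in `curve` gives the bowl-shaped 2d curve for a 1-parameter cost function. Sweeping a second parameter in the same way would give the points of a 3d surface.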
If the above graphs are available, then finding a minimum or maximum point becomes trivial to do visually. In order to systematically find a minimum, the Gradient Descent algorithm can be used.
This was a description of what a cost function is used for in a linear regression supervised learning problem. Read more about Unsupervised Machine Learning in this post.
Unsupervised Learning is a class of problems or algorithms in Machine Learning where unstructured data is given as input with the goal of finding more structure in the data as output. The word “structure” refers to whether the input data includes a “right” answer, label, or grouping.
For example, you may be given health data of several patients sharing a common illness. The health data are the variables, and the common illness is the relationship. Using this data, you may be able to predict whether a patient has this common illness. This is an example of Supervised Learning, specifically a classification problem.
Suppose that, instead of being given the information about the patients’ shared illness, you used the health data alone to determine whether any structure, commonalities, or relationships can be found. This could mean a learning algorithm would sort patients into 2 groupings: Group 1 and Group 2. The algorithm is not aware of what this grouping means, but it was able to notice enough similarity among the patients in each group to recognize two distinct groupings. This is an example of Unsupervised Learning, specifically a clustering problem.
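To make the clustering idea concrete, here is a minimal sketch that splits unlabeled values into two groups. It uses a simple one-dimensional k-means with two centers, which is only one of several clustering techniques and is not named in the text above. The function `two_means` and the temperature readings are hypothetical.

```python
def two_means(values, iterations=10):
    """Split one-dimensional values into two groups around two centers."""
    centers = [min(values), max(values)]  # start the centers at the extremes
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to whichever center is nearer.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers

# Hypothetical, unlabeled body temperature readings.
readings = [36.5, 36.8, 37.0, 39.2, 39.5, 40.1]
groups, centers = two_means(readings)
print("Group 1:", groups[0])
print("Group 2:", groups[1])
```

The algorithm is never told which readings are "normal" or "fever". It only reports that the readings fall into two groups, and interpreting those groups is left to whoever reads the result.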