Kaggle Housing Prices Competition Evaluation Metric

While I was reading through the Housing Prices Competition for Kaggle Learn Users description, I wanted to get a better understanding of how user’s submissions were evaluated. What follows is an exploration of the metric used.

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

The metric is using a logarithm to convert measures from a scale based on individual dollar units to a logarithmic scale based on proportional differences between the predicted and the observed sales price.

For illustration, I’ll denote the units to be powers of 10 which can be scaled up to make the result more realistic. A number like 10^1 = 10 can be scaled to 100,000 by multiplying the result by 10,000.

Let’s say there were two houses sold. A cheap house and an expensive house. The cheap house was sold for 10^{-1} and the expensive house was sold for 10^1. If a prediction of $1 (10^0) was made for both houses, then the differences without taking the logarithm would be -0.9 (10^{-1} - 10^0 = 0.1 - 1 = -0.9) and 9 (10^1 - 10^0 = 10 - 1 = 9) respectively. The corresponding RMSE would be 0.9 and 9.

Just by looking at the numbers, one may think the prediction was better for the cheaper house than the more expensive house. Actually, the predictions are equal in the amount that they differ from the observed sales price. This can be concluded by switching from measuring on a dollar unit scale to a proportional scale.

Between the numbers 1 (10^0) and 10 (10^1), there are 9 discrete units (1…10). The same goes for 0.1 (10^{-1}) and 1 (10^0) by (0.1…1). Since the relative unit distances of the prediction and the observed sales price are the same, the relative prediction error should also be the same.

This can be shown by |log(10^0) - log(10^{-1})| = |log(10^0) - log(10^1)| = 1. The absolute value is taken since this would be the effect of taking RMSE on a single data point like this. \sqrt{\frac{(log(10^0) - log(10^{-1}))^2}{1}} = 1

Here I use a geometric mean to show that the difference between halfway between the order of magnitudes above and below the predict error will result in the same amount of error.

|log(10^0) - log(\sqrt{log(10^0) \cdot log(10^1)})| = |log(10^0) - log(\sqrt{log(10^0) \cdot log(10^{-1})}| = 0.5


Derivatives is a concept introduced in Calculus where it is the instantaneous rate of change or slope at a given point. The slope of a line can be found by \frac{rise}{run} or \frac{height}{width} or \frac{\delta y}{\delta x}.

In the linear equation of f(x) = 2x, the slope is 2 because for each incremental change of x the output value would be 2. \frac{2}{1}

For curves such as a quadratic, the idea of a slope still applies and it is instead called the derivative. g(x) = x^2 has a derivative of 2x because for a change in x such as 2, the resulting output of the function g(2) is twice the input or in this case 4. g(2) = 4. \delta x = 2 \delta y = 4. \frac{\delta y}{\delta x} = \frac{4}{2}. We say the instantaneous rate of change or the derivative is 2x. Instead of saying the whole equation’s slope is 2x as we did for f(x) above, we can use the Leibniz notation.

Further reading: Derivatives Wikipedia

Gradient Descent

Gradient Descent is a commonly used algorithm in Machine Learning to find what parameters would minimize a cost function and find the best hypothesis function to predict a dataset.

The principle of Gradient Descent is to incrementally update parameters until a combination of parameter values creates a hypothesis function which most accurately predicts the data. To accurately predict the data, the hypothesis function would necessarily have a small amount of error. The error is measured by the cost function. After the rate of cost function decrease falls below a certain threshold, the Gradient Descent algorithm has converged on a local minimum point which may not be the global minima. To find a global minima, different starting parameters needs to be used.

In a single variable or feature problem, Gradient Descent works by defining an initial cost function with initial parameter values. The partial derivative of the cost function is found for one, a part, of the parameters. Once the partial derivative is found, the initial parameter value used is subtracted by the partial derivative. This continues for the remaining parameter and until parameter values are found to minimize the cost function.

In multiple variable Gradient Descent, the process is the same except there are more parameters update in each iteration.

Multivariate Linear Regression Cost Function

Previously I described a machine learning single variable linear regression cost function algorithm which used a single feature to define a hypothesis. In this post, I describe how a cost function can be extended to use multiple features.

A single variable or linear equation takes the form of h(x) = \theta_0 + \theta_1 x where \theta are the parameters of the function and x is a feature in the dataset. An equation which takes two or more variables are polynomial equations such as a quadratic or cubic equations (h(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 or h(x_1, x_2, x_3) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3).

Using Linear Algebra, the parameters and features can be contained in separate vectors. The parameter vector can be transposed and then multiplied with the features vector to get the hypothesis function prediction. This is a nice concise way to find the result of a hypothesis function without writing out the whole equation.

A cost function is used along with the Gradient Descent algorithm to find the best parameters .

Now that there are more variables, they may be of different scales. This is where feature scaling comes in. For a feature, find the maximum value and minimum value of the value range. Subtracting the minimum from the maximum results in a range size. Modify the cost function to divide by the range size. The result are feature values between -1 and 1.

A learning rate can be tuned by either decreasing or increasing the learning rate. A larger learning rate results in overshooting and not finding convergence. A small learning rate results in a slow convergence. It is a good idea to test out several learning rates in order to strike a good balance between convergence speed and accuracy.

Single Variable Linear Regression Cost Functions

This post describes what cost functions are in Machine Learning as it relates to a linear regression supervised learning algorithm.

A function in programming and in mathematics describes a process of pairing unique input values with unique output values.
All the possible input values of a function is called the function’s domain. The corresponding list of output values of a function is called the function’s codomain or range. One value in the domain will only be connected to one value in the codomain and vice versa. A function following this strict definition is also called a Pure Function.

In a linear regression supervised learning problem, the goal is to find a hypothesis function which most accurately provides the “correct” output value across a continuous range given an input value. In order to find the best hypothesis function, we must provide training data which provides the basis for the hypothesis function selection. The training data would input examples of “correct” input and output pairs.

For example, the values in set X, {1, 2, 3}, can be inputs which map to values in the corresponding position in the set Y, {4, 5, 6}, which represent the outputs. The ordered pairs (1, 4), (2, 5), and (3, 6) describes the training data where the first value of the pair, the input, which results in the second value, the output. The hypothesis function which takes all input values, predicts outputs most closely matches the expected outputs from the training data is the best function to use in a linear regression problem.

It takes some work to find this best function. To help with this task, we can check how closely a hypothesis function’s output value matches actual data for a given input value.

For example, if we call the hypothesis function h(x) with the value 1, then we would want the h(1) to return 4. This is the exact output value we had in the training data (1, 4). The difference of the output value of h(1)=4 and the actual value would equal 0. This can be expressed as 4 - 4 = 0. More generally, h(x) - y where x is the input to the hypothesis function h(x) and y is the actual expected output. The resulting value of this equation is called the error of the hypothesis function.

We would want to repeat the above process for all available training data and take the total sum of errors. A smaller sum would mean a function was able to more accurately predict the output values relative to the other candidate hypothesis functions for a given input.

I mentioned we’d, “take the total sum of the errors”. It is important to notice the error amount may be a negative number. This occurs when h(x) is smaller than y. To work around this, you can sum the absolute value instead. This can be done by squaring the difference then taking the principal square root like so \sqrt((h(x) - y)^2).

While this equation would give a positive sum number to judge whether or not a function accurately predicts the data, the equation treats making a predict off by 2 from the actual value as twice as bad or costly as predicting a value 1 off from the actual value. The same reasoning would hold for being off by 4 instead of being 2 off. Rather than treating differences as a linear progression of how inaccurately the function is from the actual value, we could instead exponentially add the error amount to the total sum of errors. This would mean having predicting a value off by 4 contributes 16 to the total sum of errors. This is four times worst than being off by 2 which contributes only 4 to the total sum of errors.

Summing the squared errors of a function’s output, as illustrated above, describes Sum of Squared Errors. Sum of Squared Errors is a commonly used technique to create a cost function. A cost function is used to determine how accurate a hypothesis function predicts the data, and what parameters should be used in the hypothesis function.

Parameters are passed as arguments into a cost function. The cost function would then use these parameters to define a hypothesis function. The hypothesis function is judged on its accuracy with the use of some scoring technique such as Sum of Squared Errors. Several parameters are tested and passed into the cost function until a hypothesis function is found to have minimized the cost function to the greatest extent possible. This hypothesis function would then be the function used to predict future values for unseen data. The described cost function above is the Squared Errors Function.

Above we described how a cost function receives parameters as input and how it results in a number which represents the amount of error a hypothesis function had when using these parameters. To expand on this idea, we can find the best parameters to use with a cost function by incrementally increasing or decreasing a parameter’s value. For example, a parameter \theta_0 could be any real number in the positive or negative direction along an axis. By moving along this axis and plotting the output of the cost function, a cost function graph can be created. Depending on how many parameters are defined with the cost function, the graph can look like a 2d curve for a 1-parameter cost function or a 3d surface plot for a 2-parameter cost function.

A possible parabola curve of a 1 parameter cost function
A possible surface plot of a 2 parameter cost function

If the above graphs are available, then finding a minimum or maximum point becomes trivial to do visually. In order to systematically find a minimum, the Gradient Descent algorithm can be used.

This was a description of what a cost function is used for in a linear regression unsupervised learning problem. Read more about Unsupervised Machine Learning in this post.

Questions to answer in this post or in another post later:

What is Unsupervised Machine Learning?

Unsupervised Learning is a class of problems or algorithms in Machine Learning where unstructured data is given as input with the goal of finding more structure in the data output. The word “Structure” can reference whether the input data includes a “right” answer, label, or grouping.

In an example, you may be given health data of several patients sharing a common illness. The health data are the variables, and the common illness is the relationship. Using this data, you may be able to predict whether a patient has this common illness. This is an example of Supervised Learning. Specifically a classification problem.

Instead of providing the bit of information about the patients’ shared common illness relationship, you instead attempted to use the health data to determine if there were any structure, commonalities, or relationships which can be found. This could mean a learning algorithm would output patients into 2 groupings: Group 1 and Group 2. The algorithm is not aware of what this grouping means, but it was able to notice there was enough of a similarity with patients in each group to result in two distinct groupings were recognized. This is an example of Unsupervised Learning. Specifically a clustering problem.

Examples of Inversion of Control

In the post Inversion of Control (IoC), I described what was. Here I describe several examples of Inversion of Control.

Dependency Injection

When creating a class or module, it is good practice to reduce concrete dependencies when possible (Dependency Inversion Principle) . Concrete means having a hard dependency on another piece of code rather than a dependency on an abstraction.

This can be illustrated by defining a class. For example, a BillingForm may depend on a PaymentProcessor:

class BillingForm {
  constructor() {
    this.processor = new PaymentProcessor();
  processForm(...) {

Contrast above with the following example where the dependency is added as an argument when constructing a BillingForm instance.

class BillingForm {
  constructor(paymentProcessor) {
    this.processor = paymentProcessor;
  processForm(...) {
const paymentProcessor = new PaymentProcessor();
const billingForm = new BillingForm(paymentProcessor);

This is an example of IoC because the control of when a dependency is added is controlled by when the BillingForm is instantiated rather than immediately during the BillingForm instantiation.

React Lifecycle Methods

The Javascript frontend framework React.js has component lifecycle methods for components. Examples like componentWillMount, componentDidMount, and componentWillUnmount are defined in a component. These methods can be overridden by a component. The methods are then called at the appropriate time by the React.js framework.

Other examples includes: Command Line Interface, Javascript callbacks, Route Handlers, Object oriented class frameworks, and User Interfaces

Inversion of Control

Inversion of Control (IoC) is usually foundin programming, but it is a general concept. The concept relates to a program’s flow of control.

A program’s flow of control can be thought of as the direction a program executes. The simplest illustration of this is top-to-bottom line-by-line execution of a program.


The numbers 1, 2, 3 will be logged into the console in this order. The written program has control of what and when the numbers are shown.

Contrast the above example with:

.addEventListener('click', function(e) {
.addEventListener('click', function(e) {

This code add event listeners to 2 buttons button-1 and button-2. The program describes what will happen (logging a number), but it has left when to the browser which in turn leaves it to the user of the browser. It is said that the control has been inverted away from the application/program (the code which defines the number logging) to the framework (the browser)

The word “framework” is an abstract and necessary concept when dealing with IoC since it would decide when certain methods shall be called.

To read more, check out Examples of Inversion of Control

Delta propagation’s view-update problem

In reactive programming, the “push” implementation can take several forms of one node (computation) passing along a directed edge (a path from one node to another) some change information to another node. Two forms of change information that can be passed is:

  1. Complete node state
  2. Delta propagation

A node could pass the complete node state to the next node. The node receiving the complete state of the previous node could then use this information to compute its own new state. This is a straight-forward implementation, but it can be inefficient if the next node doesn’t need the complete state of a previous node to compute its own state.

To optimize, one could use delta propagation which only passes along the changed data of a node. This would be enough for the next node to compute its new state depending on the exact implementation.

The view-update problem occurs when a program needs to maintain the state of nodes while updating their states upon an update in data. Updates in data should only affect nodes where there is a path from the node which has a change in its data and connected nodes. In a directed acyclic graph with nodes connected with directed edges, the view of nodes passed the node with the change should be updated.

Here is an example:
node A -> node B
node B -> node C
node B -> node D

If there was an update in node B, then node views for C and D should be updated 1

The view-update problem is often described within the context of databases since relational database management system maintain different levels of views of its data. There is the external level where an end-user of a database would interact with to get data. There is the lowest level where the actual data is stored on physical hardware (disks, tape, SSDs). In between these two levels is the conceptual level which is relevant to the database software coordinating the two levels.

Each of these levels maintain their own view of the data. The external level is more interested in the data and less interested in how the data is stored on the hardware. This results in the database view-update problem. The hardware could be optimized by creating better indexes of the data or by replacing it with a faster technology. The external level should not need to know of this hardware update to generate its own view of the data.

1: For the sake of example, node A does not need to be the cause of the change in node B.


What is a runtime?

In programming, a problem have different phases in its overall lifecycle. Depending on the specific programming language, the program may require compiling. The phase when a program is compiling is called the compile time. Similarly, when a programming is executing or running this phase is called the execution time or runtime.

During a program’s runtime there may be errors. These are called runtime errors. Examples of runtime errors are dividing my zero, attempting to read a file on a file system that does not exist, and making a request to a website which does not exist.