When hearing about regression for the first time, many people wonder why we choose the squared error as the objective so often. Don’t we care more about absolute error? Or are there properties of the squared error that make it mathematically convenient?
In this introductory post, I answer this question from a probabilistic modeling perspective. While I find this a particularly useful view, there are other interpretations of the squared error. For a comprehensive overview, I recommend reading the posts by Ben Kuhn and Morgan Giraud.
What is a probabilistic model?
Feel free to skip this section if you know what a probabilistic model is. A probabilistic model is a procedure that describes how we think our data set could have been created. Let’s say we have a data set of prices of used cars given their age. We are interested in predicting the price and thus in modeling the conditional distribution $p(\text{price} \mid \text{age})$.
Why does this need to be a probability? One possible model might just say the price in USD is generated by subtracting the age in years from the value 30, squaring the result, and multiplying by 50:
$$\text{price} = 50 \cdot (30 - \text{age})^2$$
The problem with this model is that every car of the same age sells for the same price. To generate or explain data where the same input can result in different outputs, we need randomness. For example, with a slight abuse of notation, we can define our model as
Here we draw a uniform random number between zero and a fixed upper bound before squaring and multiplying by 50. As a result, the model allows cars of the same age to be sold for a range of prices.
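To make the difference concrete, here is a minimal Python sketch of both models. The deterministic formula follows the description above. The noisy variant is an assumption: it adds the uniform draw to 30 minus the age, and the upper bound `noise_max` is a hypothetical value chosen only for illustration.

```python
import random

def price_deterministic(age):
    # Deterministic model from the text: 50 * (30 - age)^2.
    return 50 * (30 - age) ** 2

def price_random(age, noise_max=5.0, rng=random):
    # Hypothetical noisy variant: add a uniform draw from [0, noise_max]
    # before squaring. Both noise_max and where the noise enters are assumptions.
    noise = rng.uniform(0.0, noise_max)
    return 50 * (30 - age + noise) ** 2

rng = random.Random(0)
same_age_prices = [price_random(10, rng=rng) for _ in range(5)]

print(price_deterministic(10))  # 20000, identical for every 10-year-old car
print(same_age_prices)          # five different prices for the same age
```

The deterministic function returns one price per age, while repeated calls to the noisy function spread the prices of equally old cars over a range.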
Gaussian data distribution
In practice, we often use models that compute the mean of a Gaussian, and sample the final value from it. Since a Gaussian distribution is non-zero everywhere, this means a car of a given age could be sold for any price; prices far away from the mean are just very rare.
Another way to interpret modeling data as a Gaussian is to imagine a deterministic true value with added measurement noise. The central limit theorem gives a justification of why measurement noise that originates from many different sources should look Gaussian in combination.
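The central limit theorem argument can be checked with a small simulation. The sketch below assumes twelve independent noise sources, each uniform on [-0.5, 0.5]. These numbers are illustrative choices, picked so that the summed noise has a standard deviation of about one.

```python
import random
import statistics

rng = random.Random(0)

def measurement_noise(num_sources=12):
    # Each source contributes a small, non-Gaussian (uniform) error.
    return sum(rng.uniform(-0.5, 0.5) for _ in range(num_sources))

samples = [measurement_noise() for _ in range(20000)]
mean = statistics.mean(samples)
std = statistics.pstdev(samples)

# For a Gaussian, roughly 68% of the samples fall within one standard deviation.
within_one_std = sum(abs(s - mean) <= std for s in samples) / len(samples)
print(mean, std, within_one_std)
```

Although every individual source is uniform, the fraction of combined samples within one standard deviation comes out close to the Gaussian value of about 68%.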
In the next section, we will see that using a squared error corresponds to modeling the data as a Gaussian with fixed standard deviation and mean predicted by our neural network. In our example, the network takes the age and computes the mean of a Gaussian, from which the final price is sampled.
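As a rough sketch of this generative process, the code below replaces the neural network with a fixed function. The function `predicted_mean` and the standard deviation of 1000 USD are assumptions made for illustration, not part of the original model.

```python
import random

def predicted_mean(age):
    # Stand-in for the neural network: any function from age to a mean price.
    # Reusing the toy formula 50 * (30 - age)^2 here is an assumption.
    return 50 * (30 - age) ** 2

def sample_price(age, std=1000.0, rng=random):
    # The final price is drawn from a Gaussian with the predicted mean
    # and a fixed standard deviation.
    return rng.gauss(predicted_mean(age), std)

rng = random.Random(0)
prices = [sample_price(10, rng=rng) for _ in range(10000)]
average_price = sum(prices) / len(prices)
print(average_price)  # close to predicted_mean(10) = 20000
```

Individual prices vary around the prediction, but their average approaches the predicted mean as more cars are sampled.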
Maximizing the likelihood
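To train our Gaussian model, we write down a loss function whose minimization maximizes the likelihood of the data. For a single data point with input $x$ (the age) and target $y$ (the price), the likelihood is exactly the Gaussian density
$$p(y \mid x) = \frac{1}{Z} \exp\left(-\frac{(y - \mu(x))^2}{2\sigma^2}\right)$$
with the mean $\mu(x)$ predicted by the model, the standard deviation $\sigma$, and the normalization constant $Z = \sqrt{2\pi\sigma^2}$.
Taking the logarithm of this makes it easier to compute and does not change the solution, since the logarithm is a monotonically increasing function. Moreover, we flip the sign to obtain a quantity to minimize, which gives the typical log loss
$$-\ln p(y \mid x) = \frac{(y - \mu(x))^2}{2\sigma^2} + \ln Z$$
Now, if we plug in a fixed standard deviation of $\sigma = 1$, this becomes the squared error scaled by a factor of $\frac{1}{2}$, plus a constant that does not depend on the prediction:
$$-\ln p(y \mid x) = \frac{1}{2}(y - \mu(x))^2 + \ln\sqrt{2\pi}$$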
Therefore, whenever we minimize a squared error, we effectively maximize the likelihood under a Gaussian with fixed standard deviation and mean predicted by our neural network.
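This equivalence can be checked numerically. The sketch below assumes a standard deviation of 1 and a model that predicts a single constant; the target values and the candidate grid are made up for illustration.

```python
import math

def gaussian_nll(y, mean, std=1.0):
    # Negative log of the Gaussian density:
    # (y - mean)^2 / (2 std^2) + ln sqrt(2 pi std^2).
    return (y - mean) ** 2 / (2 * std ** 2) + 0.5 * math.log(2 * math.pi * std ** 2)

def half_squared_error(y, mean):
    return 0.5 * (y - mean) ** 2

targets = [1.0, 2.5, 4.0]
candidates = [i / 100 for i in range(0, 501)]  # candidate constant predictions

def total(loss, mean):
    return sum(loss(y, mean) for y in targets)

best_nll = min(candidates, key=lambda m: total(gaussian_nll, m))
best_mse = min(candidates, key=lambda m: total(half_squared_error, m))

# The two losses differ only by a constant that ignores the prediction.
offset = gaussian_nll(2.0, 0.5) - half_squared_error(2.0, 0.5)

print(best_nll, best_mse)  # both 2.5, the mean of the targets
print(offset)              # equals ln sqrt(2 pi)
```

Because the two losses differ only by an additive constant, they select the same prediction.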
By the way, without the factor of $\frac{1}{2}$ we would implicitly assume a standard deviation of $\frac{1}{\sqrt{2}} \approx 0.71$. But the factor only scales the gradient, so it can be absorbed into the learning rate anyway.
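A small gradient descent experiment illustrates why the constant factor does not matter in practice. The setup is an assumption made for illustration: a model that predicts a single constant `m`, three made-up targets, and hand-picked learning rates.

```python
def descend(loss_scale, learning_rate, targets, steps=20):
    # Gradient descent on loss_scale * mean((y - m)^2) for a constant prediction m.
    m = 0.0
    for _ in range(steps):
        grad = loss_scale * 2 * sum(m - y for y in targets) / len(targets)
        m -= learning_rate * grad
    return m

targets = [1.0, 2.5, 4.0]

# Halving the loss while doubling the learning rate gives the same updates.
with_half = descend(loss_scale=0.5, learning_rate=0.2, targets=targets)
without_half = descend(loss_scale=1.0, learning_rate=0.1, targets=targets)
print(with_half, without_half)
```

Both runs follow the same trajectory toward the mean of the targets, so dropping the factor is equivalent to choosing a different learning rate.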
Thinking about machine learning from a probabilistic perspective can be a good way to understand the assumptions underlying various methods. I hope this explanation gave you an intuition for what it means to use the squared error as a loss function. Please feel free to ask for clarification or post follow-up questions in the comments.