Regularization from a Bayesian Viewpoint

I’ve read far too many articles on generalized linear models (GLMs) where regularization is introduced as a magical trick that can help you avoid overfitting your model to the data.  In this short post, I’d like to show in a relatively introductory way how many regularization methods such as l_2 regularization emerge naturally when GLMS are considered in a Bayesian context.

Estimation Techniques

The ultimate goal for many modeling techniques is to estimate a parameter Y using some data X.  One way to solve this problem is to choose a model for which you can easily compute Pr(X|Y), and use the value of Y which was most likely to generate the data we’ve seen.

For example, suppose we had a 6-sided die that may or not be fair. At the outset, we make no assumptions about the probability of each number appearing.  Let’s suppose we roll it five times, and observe that we roll three 4s, and two 1s.  Based only on this data, the most probable distribution for each value of the die is 3/5 for 4 and 2/5 for 1, and 0 for all other values.  Many would argue that this conclusion is absurd – how can we possibly conclude anything about the die based on such little data?  But what if rolled the die 120 times and observed 25 1s, 20 2s, 23 3s, 15 4s, 17 5s and 20 6s? The most likely probabilities would be the frequency of each values over 120, not 1/6 uniformly like our intuition might suggest.  Is that a more reasonable thing to conclude?  At what point do we say that we have seen enough data to make our final conclusion?  The above method is sometimes known as the “frequentist” approach.

A follower of the Bayesian school of thought would say that we should not draw conclusions solely based on the data we’ve seen. Data can be noisy, and you can never be sure you’ve seen enough data to obtain a true signal about the quantity you are trying to estimate.  Instead, we should integrate our prior beliefs about the mechanism being observed, and allow the data to influence our beliefs.  As we see more data, our beliefs develop to reflect what we’ve observed.  This approach allows us to balance the influence of the data we see with what we believe to be reasonable a priori assumptions about the measurement of interest.  The approach all flows from Bayes’ Formula, which states the following:

Pr(Y|X) = \frac{Pr(X|Y)Pr(Y)}{Pr(X)}

The goal now will be to find the Y that maximizes Pr(Y|X) – we’ve turned the frequentist method on its’ head!  Let’s pick apart Bayes’ Formula and look at the individual ingredients on the right-hand side.  Pr(X|Y) is something that can often be easily computed based on a mathematical model of the situation we are analyzing.  For example, if Y is the parameter determining the likelihood that a coin flip gives “heads”, and X is the number of heads we observe in 10 flips of a coin, then we can compute Pr(X=k|Y) using the binomial distribution.  Pr(X) is the probability of seeing our data and can in certain circumstances be very difficult to compute, as it often involves integration over our distribution for Y. The good news is that since we only care about maximizing Pr(Y|X) with respect to Y (and don’t usually care about the specific value of this maximal probability), when we take the derivative of Bayes’ Formula with respect to Y, Pr(X) will just be some constant and not affect our computation of the optimal Y.  So let’s just sweep that under the rug.   Pr(Y) is drawn from what is known as our prior distribution for the quantity Y that we are trying to estimate.   It represents our “reasonable initial assumptions” about the possible values Y could take.

If we’re flipping a coin, our prior distribution for the probability of heads occurring could look something similar to a Gaussian distribution centered at 1/2 or it could be a uniform distribution on the interval [0,1], for example.  The most mathematically convenient choice for our prior distribution in this particular case is called a beta distribution, which the uniform distribution is a special case of.  The convenience comes from the fact that Pr(Y|X) \propto Pr(X|Y)Pr(Y) turns out to look like a beta distribution with different parameters, reflecting the additional “information” that the data X gives us.  The mathematical terminology for this compatibility would be to say that the beta distribution is a conjugate prior for the binomial distribution.

This compatibility allows us to iterate on our estimations as new data comes in, without having to do a brand new analysis each time: we can apply Bayes’ Formula with new data X’ by replacing our initial prior distribution Pr(Y) with our updated posterior distribution Pr(Y|X), and repeat this process for as long as we like, updating our posterior distribution for Y as we see more and more data.

To summarize both approaches outlined above:

Frequentist Approach
  • Choose a mathematical model for Pr(X|Y)
  • Choose the Y which maximizes this conditional probability.
Bayesian Approach
  • Choose a mathematical model for Pr(X|Y)
  • Choose a prior distribution for Y
  •  Choose the Y which maximizes Pr(Y|X) using Bayes’ Formula

While the Bayesian approach can be more computationally intensive and is more complex mathematically, it can help you avoid the mistake of over-relying on noisy data, and is my preferred approach to making estimates.

Regularization from Bayesian Assumptions

This section will assume you are familiar with some of the basic facts about linear regression. The fundamental assumption of linear regression is that our quantity of interest Q depends on our dependent variables in a linear fashion, plus an error term which is normally distributed and centered at 0. That is,

Q = f(\mathbf{X}) = a_0X_0 + a_1X_1 + ... + a_kX_k + N(0,\sigma)

The parameter we want to determine in this situation is the vector of coefficients, \mathbf{W} = (a_0,...,a_k). Given a bunch of data points (Q_i,\mathbf{X_i}), we can compute Pr(Q_i|\mathbf{W}, \mathbf{X_i}) for any particular choice of \mathbf{W}. The familiar least squares solution to linear regression is obtained by finding the vector of coefficients which maximizes the product \prod Pr(Q_i|\mathbf{W}, \mathbf{X_i}), akin to the frequentist approach in the first section.  This method can break down in many scenarios, such as when there are many independent variables and not a correspondingly high number of data points.  In this case, the model may rely too heavily on the scant data that is given, and your model can fail to produce viable results in the future.

Since we want to act like Bayesians here, a natural thing to do here is to introduce prior distributions for each coefficient we’re trying to estimate.  Let us assume a priori that each coefficient is distributed normally around 0 with a variance of c, and see where the math takes us.  Let’s look at a single data point (Q,\mathbf{X}) for simplicity.  The full solution can be obtained by simply multiplying the probabilities of the individual data points (since we usually assume that each observation is independent).  Let us first collect the building blocks of our analysis:

    Pr((Q,\mathbf{X})|\mathbf{W}) \propto e ^  {-\frac{(Q - f(\mathbf{X}))^2} {2\sigma^2}} \propto e ^ {-\frac{(Q - a_0X_0 + a_1X_1 + ... + a_kX_k)^2}   {2\sigma^2}}

                    Pr(\mathbf{W}) \propto \prod \limits_{i=1}^{k} e ^ {-\frac{a_i^2}{2c^2}}

Note that we’ve dropped some constants from the above statements that don’t affect the conclusion which follows. Now, let us apply Bayes’ Formula to determine  Pr(\mathbf{W}|Q,\mathbf{X}):

Pr(\mathbf{W}|Q,\mathbf{X}) \propto Pr((Q,\mathbf{X})|\mathbf{W}) Pr(\textbf{W})

Pr(\mathbf{W}|Q,\mathbf{X}) \propto e ^ {-\frac{(Q - a_0X_0 + a_1X_1 + ... + a_kX_k)^2} {\sigma^2}} \prod \limits_{i=1}^{k} e ^ {-\frac{a_i^2}{c^2}}

We want to find the \mathbf{W} that maximizes this probability, which looks to be quite messy to compute.  Let’s take the negative natural logarithm to turn the nasty product into a sum, and find the optimal \mathbf{W} that minimizes our new function instead.  This composed function, called the negative log likelihood (NLL) will have its minimum where our original function had its maximum, because the natural logarithm is monotonically increasing.

 NLL \propto \frac{(Q - a_0X_0 + a_1X_1 + ... + a_kX_k)^2} {\sigma^2} + \frac{1}{c^2}\sum\limits_{i=1}^{k} {a_i^2}
 NLL \propto (Q - a_0X_0 + a_1X_1 + ... + a_kX_k)^2 + \frac{\sigma^2}{c^2}\sum\limits_{i=1}^{k} {a_i^2}

This now looks very much like the “standard” formula for the l_2 regularized linear regression loss function:

  (Q - \mathbf{W} \cdot \mathbf{X})^2 + \lambda ||\mathbf{W}||^2

except now we have an interpretation for what \lambda really means!  As we increase the variance c^2 of our prior distribution, the corresponding \lambda decreases, imposing less of a penalty on the coefficients of \mathbf{W}.  In other words, the weaker our prior beliefs (larger variance), the weaker the regularization, and vice versa.  If we choose different priors for \mathbf{W}, we get different regularizations.  If we assume that all of our priors are Laplace distributions with the same parameter, we get l_1 regularization.  We can even choose different priors for different coefficients to get hybrid regularization methods.  It’s a deep rabbit hole we could start going down. In the end, it’s important to remember that we’re trying to build models that help us make predictions using new unseen data, so cross validation using various forms of regularization is a good way to try out different regularization methods.

For me, the best way to learn a concept is to build a logical narrative that leads naturally to the new idea. When I first encountered regression, regularization, and model building (in a haphazard self-taught way) I didn’t fully appreciate how connected the various concepts were, until they were boiled down to their mathematical essence.  I hope I have constructed a useful narrative here that sheds some light on one of these connections.