Using quadratic programming to solve L1-norm regularization
When doing regression modeling, one will often want to use some sort of regularization to penalize model complexity, for reasons that I have discussed in many other posts. In the case of a linear regression, a popular choice is to penalize the L1-norm (sum of absolute values) of the coefficient weights, as this results in the LASSO estimator which has the attractive property that many of the (low-correlation) features are given zero weight[1].
The LASSO estimator can be formulated as either of two equivalent minimization problems: Ivanov regularization \(\eqref{eq:ivanov}\) or Tikhonov regularization \(\eqref{eq:tikhonov}\). The latter turns the constrained problem into an unconstrained one, which is generally easier to solve. However, formulating the problem with Ivanov regularization gives a more interpretable measure of model complexity. This short post will review what quadratic programming is, show its relation to the LASSO estimator, and provide the R code necessary to solve these problems.
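For reference, the two problems in their standard form (with design matrix \(X\), response \(y\), and coefficient vector \(\beta\)) are:

\[\begin{align} \min_{\beta} \hspace{3mm} & \|y - X\beta\|_2^2 \hspace{5mm} \text{s.t.} \hspace{3mm} \sum_{j=1}^p |\beta_j| \leq t \tag{1}\label{eq:ivanov} \\ \min_{\beta} \hspace{3mm} & \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| \tag{2}\label{eq:tikhonov} \end{align}\]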
A simple quadratic programming problem
Consider the following problem as shown in equation \(\eqref{eq:simp}\). This is an example of a quadratic programming problem (QPP) because there is a quadratic objective function with linear constraints.
\[\begin{align} \min_{x_1,x_2,x_3}& x_1^2 + 2x_2^2 + 4x_3^2 - x_1 - x_2 + 5x_3 \tag{3}\label{eq:simp} \\ \text{s.t.}\hspace{2mm}& x_1 + x_3 \leq 1 \nonumber \\ & x_1 \geq 5 \nonumber \\ & x_2 \leq 0 \nonumber \end{align}\]This QPP can be solved in R using the quadprog library. However, the package requires that the problem be written in the form:
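\[\begin{align*} \min_{b} \hspace{3mm} & \tfrac{1}{2} b^T D b - d^T b \\ \text{s.t.} \hspace{3mm} & A^T b \geq b_0 \end{align*}\]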
The specific example shown above matches the general form when:
\[\begin{align*} b &= (x_1,x_2,x_3)^{T} \\ D &= \begin{pmatrix} 2 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 8 \end{pmatrix} \\ d &= (1,1,-5)^{T} \\ A^{T} &= \begin{pmatrix} -1 & 0 & -1 \\ 1 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix} \\ b_0 &= (-1,5,0)^T \end{align*}\]Note that because of the factor of \(\tfrac{1}{2}\) in the objective, \(D\) carries twice the coefficients of the quadratic terms. Implementing this is fairly simple:
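A minimal sketch with quadprog::solve.QP (note that solve.QP expects one column per constraint, so Amat below is the transpose of the \(A^T\) written above):

```r
library(quadprog)

# (1/2) b'Db must reproduce x1^2 + 2*x2^2 + 4*x3^2, hence the factor of two
Dmat <- diag(c(2, 4, 8))
dvec <- c(1, 1, -5)
# Constraints are t(Amat) %*% b >= bvec
Amat <- t(matrix(c(-1,  0, -1,
                    1,  0,  0,
                    0, -1,  0), nrow = 3, byrow = TRUE))
bvec <- c(-1, 5, 0)

qp <- solve.QP(Dmat, dvec, Amat, bvec)
round(qp$solution, 3)
```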
QPP and LASSO
The LASSO formulation in equation \(\eqref{eq:ivanov}\) is also a QPP, since the residual sum of squares satisfies \(RSS(\beta) \propto \beta^T (X^T X) \beta - 2y^T X \beta\), which corresponds to \(D=2 (X^TX)\) and \(d^{T}=2y^TX\). However, writing out \(A\) requires \(2^p\) inequality constraints. For example, when \(p=2\), the constraint \(|\beta_1| + |\beta_2| \leq t\) is equivalent to:
\[\begin{align*} \beta_1 + \beta_2 &\leq t \\ -\beta_1 + \beta_2 &\leq t \\ \beta_1 - \beta_2 &\leq t \\ -\beta_1 - \beta_2 &\leq t \\ \begin{pmatrix} -1 & -1 \\ 1 & -1 \\ -1 & 1 \\ 1 & 1 \end{pmatrix} &= A^{T} \\ (-t,-t,-t,-t)^T &= b_0 \end{align*}\]That \(2^p\) can quickly get out of hand was well known even to Ibn Khallikan. Nevertheless, this QP approach will be demonstrated on the simple Boston data set from the MASS package, by fitting a LASSO estimator that predicts median house prices (medv) from the average number of rooms (rm) and the distance to employment centers (dis), shrinking the L1-norm of the parameters to half of its unconstrained OLS value.
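A minimal sketch of this fit (centering medv, rm, and dis so that no intercept term is needed):

```r
library(MASS)       # Boston data
library(quadprog)

# Center the response and predictors so that no intercept term is needed
X <- scale(as.matrix(Boston[, c("rm", "dis")]), center = TRUE, scale = FALSE)
y <- Boston$medv - mean(Boston$medv)

# Unconstrained OLS coefficients and the L1 budget (50% shrinkage)
beta_ols <- as.vector(solve(t(X) %*% X, t(X) %*% y))
t_budget <- 0.5 * sum(abs(beta_ols))

# RSS corresponds to D = 2*X'X and d = 2*X'y in the (1/2) b'Db - d'b form
Dmat <- 2 * t(X) %*% X
dvec <- as.vector(2 * t(X) %*% y)

# All 2^p = 4 sign combinations of |b1| + |b2| <= t, written as A'b >= b0
At   <- matrix(c(-1, -1,
                  1, -1,
                 -1,  1,
                  1,  1), ncol = 2, byrow = TRUE)
bvec <- rep(-t_budget, 4)

beta_qp <- solve.QP(Dmat, dvec, Amat = t(At), bvec = bvec)$solution
beta_qp
```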
One nice thing about Ivanov regularization is that it is much more intuitive than Tikhonov regularization, as the constraint term \(t\) can be thought of as a shrinkage parameter: \(s=t/t_{\text{OLS}}\). We can now show a first example of the mapping between the \(t\) constraint from equation \(\eqref{eq:ivanov}\) and the \(\lambda\) penalty from equation \(\eqref{eq:tikhonov}\). The glmnet package can estimate a LASSO for a given \(\lambda\).
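Continuing from the sketch above, one way to find the matching \(\lambda\) is simply to scan the glmnet path for the fit whose coefficient L1-norm is closest to the budget \(t\):

```r
library(glmnet)

# Fit the full LASSO path on the centered data (no intercept, no re-standardization)
fit <- glmnet(X, y, alpha = 1, standardize = FALSE, intercept = FALSE)

# Pick the lambda whose coefficients have an L1-norm closest to the QP budget t
l1_norms    <- colSums(abs(as.matrix(fit$beta)))
lambda_star <- fit$lambda[which.min(abs(l1_norms - t_budget))]

cbind(qp = beta_qp, glmnet = as.vector(coef(fit, s = lambda_star))[-1])
```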
QP with moderately sized \(p\)
In order to scale the quadratic programming problem to a moderate number of features, the LowRankQP package can be used, which implements an interior point method for solving quadratic programming problems of the form[2]:
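\[\begin{align*} \min_{\alpha} \hspace{3mm} & \tfrac{1}{2}\alpha^T V \alpha + d^T \alpha \\ \text{s.t.} \hspace{3mm} & A\alpha = b, \hspace{3mm} 0 \preceq \alpha \preceq u \end{align*}\](At least, this is my reading of the LowRankQP() interface: only equality constraints \(A\alpha = b\) plus box constraints on \(\alpha\) are supported, so general inequality constraints have to be massaged into this form.)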
Using the old trick of writing \(\beta = \beta^+ - \beta^-\) with \(\beta^+, \beta^- \succeq 0\), the LASSO minimization problem can be re-written as:
\[\begin{align*} \min_{\beta^+,\beta^-} \hspace{3mm} &(\beta^+-\beta^-)^T X^TX (\beta^+-\beta^-) - 2y^T X (\beta^+-\beta^-) \\ \text{s.t.} \hspace{3mm} &\sum_{j=1}^p (\beta_j^+ + \beta_j^-) \leq t \\ &\beta^+ \succeq 0, \hspace{2mm} \beta^- \succeq 0 \end{align*}\]This problem requires only \(2p+1\) constraints rather than \(2^p\). The LowRankQP package also handles quadratic terms that are positive semi-definite rather than strictly positive definite, which is needed here because the doubled Hessian over \((\beta^+,\beta^-)\) is singular.
Writing the function in R:
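A sketch of how such a function might look. I assume the LowRankQP(Vmat, dvec, Amat, bvec, uvec) interface with method = "LU" (so that Vmat is the full Hessian) and that the solution is returned in the $alpha component; I also treat the L1 budget as an equality constraint (valid whenever the shrinkage factor is below one, since the constraint is then active) and bound each \(\beta_j^{\pm}\) above by \(t\):

```r
library(LowRankQP)

# LASSO by quadratic programming with the beta = beta+ - beta- split.
# X: centered design matrix, y: centered response, s: shrinkage factor in (0, 1)
lasso_qp <- function(X, y, s) {
  p        <- ncol(X)
  XtX      <- t(X) %*% X
  beta_ols <- as.vector(solve(XtX, t(X) %*% y))
  tt       <- s * sum(abs(beta_ols))              # L1 budget

  # Hessian and linear term over alpha = (beta+, beta-), reproducing the RSS
  Vmat <- 2 * rbind(cbind(XtX, -XtX), cbind(-XtX, XtX))
  dvec <- c(-2 * t(X) %*% y, 2 * t(X) %*% y)

  # sum(beta+ + beta-) = t (active when s < 1); each component lies in [0, t]
  Amat <- matrix(1, nrow = 1, ncol = 2 * p)
  bvec <- tt
  uvec <- rep(tt, 2 * p)

  sol <- LowRankQP(Vmat, dvec, Amat, bvec, uvec, method = "LU")
  as.vector(sol$alpha[1:p] - sol$alpha[(p + 1):(2 * p)])
}
```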
A comparison between the custom function and glmnet will be made for the well-known prostate data set[3], available in the ElemStatLearn library.
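A sketch of such a comparison, re-using the (hypothetical) lasso_qp() helper from the previous sketch and assuming the usual layout of the prostate data (eight predictors in the first eight columns, lpsa as the response):

```r
library(ElemStatLearn)  # prostate data
library(glmnet)

# Standardize the predictors and center the response, as is conventional for the LASSO
X <- scale(as.matrix(prostate[, 1:8]))
y <- prostate$lpsa - mean(prostate$lpsa)

s       <- 0.5                      # shrink the L1-norm to half the OLS value
beta_qp <- lasso_qp(X, y, s)

# Match against the glmnet path entry with the closest L1-norm
fit         <- glmnet(X, y, alpha = 1, standardize = FALSE, intercept = FALSE)
t_budget    <- s * sum(abs(coef(lm(y ~ X - 1))))
lambda_star <- fit$lambda[which.min(abs(colSums(abs(as.matrix(fit$beta))) - t_budget))]

round(cbind(qp = beta_qp, glmnet = as.vector(coef(fit, s = lambda_star))[-1]), 3)
```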
The output shows that both approaches yield very similar estimates. For medium-sized data sets, when performing model selection with cross-validation, it is nice to be able to pick a range of \(t\)’s from equation \(\eqref{eq:ivanov}\) that have a clean interpretation: 10%, 25%, …, 99% shrinkage of the unregularized least-squares estimate. However, quadratic programming will not scale as well to very large data sets, and solving the regularization problem in the unconstrained form will be necessary.
1. This is a desirable property in that it embeds model selection (picking variables) and results in an interpretable model.
2. For more details see the course notes for ST-810.
3. This was the original data set used in Tibshirani’s 1996 paper.