Using quadratic programming to solve L1-norm regularization

When doing regression modeling, one will often want to use some sort of regularization to penalize model complexity, for reasons that I have discussed in many other posts. In the case of a linear regression, a popular choice is to penalize the L1-norm (sum of absolute values) of the coefficient weights, as this results in the LASSO estimator, which has the attractive property that many of the (low-correlation) features are given exactly zero weight[1].

The LASSO estimator can be formulated as either of two equivalent minimization problems: Ivanov regularization \(\eqref{eq:ivanov}\) or Tikhonov regularization \(\eqref{eq:tikhonov}\). The latter transforms a constrained problem into an unconstrained one, and is generally easier to solve. However, formulating the problem with Ivanov regularization allows for a more interpretable model complexity measure. This short post will review what quadratic programming is, show its relation to the LASSO estimator, and provide the R code necessary to solve these problems.

\[\newcommand{\bbeta}{\pmb{\beta}}\] \[\begin{align} \text{Ivanov:} \hspace{1cm} \min_{\bbeta}& \sum_{i=1}^n (y_i - x_i^{T}\bbeta)^2 \hspace{4mm} \text{s.t.} \hspace{3mm} \sum_{j=1}^p |\beta_j| \leq t \tag{1}\label{eq:ivanov} \\ \text{Tikhonov:} \hspace{1cm} \min_{\bbeta}& (\textbf{y}-\textbf{X}\bbeta)^T(\textbf{y}-\textbf{X}\bbeta) + \lambda\sum_{j=1}^p |\beta_j| \tag{2}\label{eq:tikhonov} \end{align}\]

A simple quadratic programming problem

Consider the following problem as shown in equation \(\eqref{eq:simp}\). This is an example of a quadratic programming problem (QPP) because there is a quadratic objective function with linear constraints.

\[\begin{align} \min_{x_1,x_2,x_3}& x_1^2 + 2x_2^2 + 4x_3^2 - x_1 - x_2 + 5x_3 \tag{3}\label{eq:simp} \\ \text{s.t.}\hspace{2mm}& x_1 + x_3 \leq 1 \nonumber \\ & x_1 \geq 5 \nonumber \\ & x_2 \leq 0 \nonumber \end{align}\]

This QPP can be solved in R using the quadprog library. However, the package requires that the problem be written in the form of:

\[\min_b \frac{1}{2} b^{T}Db - d^{T}b \hspace{1cm} \text{s.t. } A^{T}b \geq b_0\]

The specific example shown above matches the general form when:

\[\begin{align*} b &= (x_1,x_2,x_3)^{T} \\ D &= \begin{pmatrix} 2 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 8 \end{pmatrix} \\ d &= (1,1,-5)^{T} \\ A^{T} &= \begin{pmatrix} -1 & 0 & -1 \\ 1 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix} \\ b_0 &= (-1,5,0)^{T} \end{align*}\]

Note that \(D\) is twice the matrix of quadratic coefficients because of the \(\tfrac{1}{2}\) in the objective.

Implementing this is fairly simple:

library(quadprog)
# Each row of A below is one constraint (this is A^T in the notation above);
# quadprog wants constraints as columns, hence Amat=t(A)
A <- cbind(c(-1,1,0),c(0,0,-1),c(-1,0,0))
b0 <- c(-1,5,0)
# Dmat is twice the quadratic coefficients because of the 1/2 in the objective
D <- 2*diag(c(1,2,4))
d <- c(1,1,-5)
xHat <- solve.QP(Dmat=D,dvec=d,Amat=t(A),bvec=b0)
xHat$solution
## [1]  5  0 -4

QPP and LASSO

The LASSO formulation in equation \(\eqref{eq:ivanov}\) is also a QPP, since the residual sum of squares can be written (up to a constant that does not depend on \(\bbeta\)) as \(RSS(\bbeta) \propto \bbeta^T (X^T X) \bbeta - 2y^T X \bbeta\), which matches the general form with \(D=2 (X^TX)\) and \(d^{T}=2y^TX\). However, encoding the constraint \(\sum_{j=1}^p |\beta_j| \leq t\) with linear inequalities requires \(2^p\) constraints. For example, when \(p=2\), the constraint \(|\beta_1| + |\beta_2| \leq t\) is equivalent to:

\[\begin{align*} \beta_1 + \beta_2 &\leq t \\ -\beta_1 + \beta_2 &\leq t \\ \beta_1 - \beta_2 &\leq t \\ -\beta_1 - \beta_2 &\leq t \\ A^{T} &= \begin{pmatrix} -1 & -1 \\ 1 & -1 \\ -1 & 1 \\ 1 & 1 \end{pmatrix} \\ b_0 &= (-t,-t,-t,-t)^T \end{align*}\]
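
Incidentally, this sign matrix does not need to be written out by hand: all \(2^p\) sign combinations can be generated with expand.grid, which is how the lasso.qp function below constructs its constraint matrix (a one-line sketch):

# All sign combinations for p = 2; the rows coincide with the rows of A^T above
as.matrix(expand.grid(rep(list(c(-1, 1)), 2)))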

That \(2^p\) constraints can quickly get out of hand was well known even to Ibn Khallikan. Nevertheless, this QP approach will be demonstrated on the simple Boston data set from the MASS package: a LASSO estimator is fit to predict median house prices (medv) from the average number of rooms (rm) and the distance to employment centers (dis), with the L1-norm of the coefficients shrunk to half of its unconstrained OLS value.

# Load in the data
library(MASS)      # contains the Boston data set
library(magrittr)  # provides the %>% pipe and set_rownames/set_colnames
X <- as.matrix(Boston[,c('rm','dis')]) %>% scale(center=T,scale=F)
y <- as.matrix(Boston[,'medv']) %>% scale(center=T,scale=F)
# Find the OLS estimates and the unconstrained L1-norm
ls.beta <- lm(y~-1+X)
t.max <- ls.beta %>% coef %>% abs %>% sum
# Define the function
lasso.qp <- function(X,y,t) {
  p <- dim(X)[2]
  # Quadratic and linear terms: D = 2*X'X and d = 2*y'X as derived above
  D <- 2 * t(X) %*% X
  d <- 2 * t(y) %*% X
  # All 2^p sign combinations, one row per linear constraint
  A <- as.matrix(expand.grid(rep(list(c(-1,1)),p)))
  b0 <- rep(-t,2^p)
  beta.lasso <- solve.QP(Dmat=D,dvec=d,Amat=t(A),bvec=b0)
  return(beta.lasso$solution)
}
# Run the LASSO for 50% shrinkage
lasso.qp(X=X,y=y,t=0.5*t.max) %>% round(2)
## [1] 4.36 0.28

One thing that is nice about Ivanov regularization is that it is much more intuitive than Tikhonov regularization, as the constraint term \(t\) can be thought of as a shrinkage parameter: \(s=t/t_{\text{OLS}}\). We can now show the first example of a mapping between the \(t\) constraint from equation \(\eqref{eq:ivanov}\) and the \(\lambda\) penalty from equation \(\eqref{eq:tikhonov}\). The glmnet package can estimate a LASSO for a given \(\lambda\) (glmnet scales the squared-error loss by \(1/(2n)\), so its \(\lambda\) is not on exactly the same scale as equation \(\eqref{eq:tikhonov}\); this does not matter here because only the fitted coefficients are used to back out the implied shrinkage).

library(glmnet)
# Run the LASSO
lasso.tikhonov <- glmnet(x=X,y=y,intercept=F,standardize=F,alpha=1,lambda=1)
# Find the implied shrinkage
tikhonov.beta <- lasso.tikhonov %>% coef %>% as.matrix
tikhonov.shrink <- (tikhonov.beta %>% abs %>% sum)/t.max
# Use the lasso.qp function
ivanov.beta <- lasso.qp(X,y,t=tikhonov.shrink*t.max)
data.frame(tikhonov=tikhonov.beta[-1],ivanov=ivanov.beta) %>%
  set_rownames(c('rm','dis'))
##      tikhonov    ivanov
## rm  6.8276126 6.8276126
## dis 0.3980483 0.3980484

QP with moderately sized \(p\)

To scale the quadratic programming problem to a moderate number of features, the LowRankQP package can be used. It implements an interior point method for solving quadratic programs of the form[2]:

\[\begin{align*} \min_x \hspace{3mm} &\frac{1}{2} x^T V V^T x + c^T x \\ &Ax = b \\ &l \preceq x \preceq u \end{align*}\]
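
Before turning to the LASSO, it may help to see the interface on a toy problem. The sketch below is my own and assumes the same argument order as the lasso.lr call further down (the V matrix, the linear term, the equality constraints, their right-hand side, and the upper bounds), with the solution returned in the alpha element:

library(LowRankQP)
# Toy problem: minimize (1/2)*||x||^2 - x1 - 2*x2  s.t.  x1 + x2 + x3 = 1,  0 <= x <= 1
V.toy <- diag(3)                    # Hessian is V %*% t(V) = identity
d.toy <- c(-1, -2, 0)               # linear term
A.toy <- matrix(1, nrow=1, ncol=3)  # one equality constraint: sum(x) = 1
b.toy <- 1
u.toy <- rep(1, 3)                  # upper bounds (the lower bound used here is zero)
toy <- LowRankQP(V.toy, d.toy, A.toy, b.toy, u.toy, method="LU", verbose=FALSE)
round(toy$alpha, 4)                 # the minimizer is (0, 1, 0)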

Using the old trick that any number can be written as \(q = q^+ - q^-\) with \(q^+, q^- \geq 0\) (so that \(|q| = q^+ + q^-\) at the optimum), the LASSO minimization problem can be re-written as:

\[\begin{align*} \min_{\beta^+,\beta^-} \hspace{3mm} &(\beta^+-\beta^-)^T X^TX (\beta^+-\beta^-) - 2y^TX (\beta^+-\beta^-) \\ \text{s.t.} \hspace{3mm} &\sum_{j=1}^p (\beta_j^+ + \beta_j^-) \leq t \\ &\beta^+ \succeq 0, \hspace{2mm} \beta^- \succeq 0 \end{align*}\]

This problem requires only \(2p+1\) linear constraints (\(2p\) non-negativity constraints plus the single sum constraint) rather than \(2^p\). Another advantage is that LowRankQP can handle Hessian matrices that are only positive semi-definite rather than strictly positive definite.

\[\newcommand{\tbeta}{\tilde{\beta}}\] Stacking the two parts into \(\tbeta = ((\beta^+)^T, (\beta^-)^T)^T\) and defining \(Q = (I_p \;\; {-I_p})\), so that \(\beta = Q\tbeta\) and the L1 constraint becomes \(\mathbf{1}^T\tbeta \leq t\), the problem can be written as:

\[\begin{align*} \min_{\tbeta} \hspace{3mm} &\tbeta^T Q^TX^TXQ\tbeta - 2y^TXQ\tbeta \\ \text{s.t.} \hspace{3mm} & -\mathbf{1}^T\tbeta \geq -t \\ & \tbeta \succeq 0 \end{align*}\]
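
A quick sanity check of the decomposition (a standalone sketch with an arbitrary two-dimensional coefficient vector):

# Split an arbitrary coefficient vector into its positive and negative parts
beta <- c(1.5, -0.3)
beta.plus <- pmax(beta, 0)
beta.minus <- pmax(-beta, 0)
Q <- cbind(diag(2), -diag(2))
all.equal(as.numeric(Q %*% c(beta.plus, beta.minus)), beta)  # TRUE: beta = Q %*% beta.tilde
sum(beta.plus + beta.minus) == sum(abs(beta))                # TRUE: the L1-norm is the sum of the parts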

Because LowRankQP only accepts equality constraints together with box bounds, the function below introduces a slack variable so that \(\mathbf{1}^T\tbeta + \text{slack} = \min(t, t_{\text{OLS}})\), and uses the absolute values of the OLS coefficients as upper bounds on \(\beta^+\) and \(\beta^-\). Writing the function in R:

library(LowRankQP)
lasso.lr <- function(X,y,t) {
  p <- dim(X)[2]
  # Stacked design [X, -X, 0] for the variables (beta+, beta-, slack)
  Vn <- X %*% cbind(diag(p), -diag(p), 0)
  # Linear term: (-2*X'y, 2*X'y, 0)
  yX <- apply(sweep(X, MARGIN=1, -y, '*'), 2, sum)
  Zn <- c(2*yX, -2*yX, 0)
  # Unconstrained OLS coefficients supply the upper bounds
  bOls <- lm.fit(X, y)$coefficients
  u <- c(abs(bOls), abs(bOls), sum(abs(bOls)))
  # Single equality constraint: sum(beta+) + sum(beta-) + slack = min(t, t.ols)
  A <- matrix(c(rep(1, 2*p), 1), nrow=1)
  b <- c(min(t, sum(abs(bOls))))
  soln <- LowRankQP(sqrt(2)*t(Vn), Zn, A, b, u, method="LU", verbose=F)
  # Recover beta = beta+ - beta-
  return(round(soln$alpha[1:p] - soln$alpha[(p+1):(2*p)], digits=5))
}

A comparison between the custom function and glmnet will be made for the well-known prostate data set[3], available in the ElemStatLearn library.

library(ElemStatLearn)  # provides the prostate data
X <- prostate[,1:8] %>% as.matrix %>% scale(center=T,scale=T)
y <- prostate[,'lpsa'] %>% as.matrix %>% scale(center=T,scale=F)
# Fit for a random lambda with glmnet
prostate.glmnet <- glmnet(x=X,y=y,intercept=F,standardize=F,lambda=0.25)
beta.glmnet <- prostate.glmnet %>% coef %>% as.matrix %>% as.numeric
t.glmnet <- beta.glmnet %>% abs %>% sum
# Get the implied shrinkage
t.max <- lm(y~-1+X) %>% coef %>% abs %>% sum
shrink.glmnet <- t.glmnet/t.max
# Use the lasso.lr function
beta.lr <- lasso.lr(X,y,t=shrink.glmnet*t.max)
## LowRankQP CONVERGED IN 15 ITERATIONS
## 
##     Primal Feasibility    =   3.9920603e-13
##     Dual Feasibility      =   1.1102230e-16
##     Complementarity Value =   9.2018684e-11
##     Duality Gap           =   9.2015284e-11
##     Termination Condition =   1.2965388e-12
# Compare
rbind(beta.glmnet[-1],beta.lr) %>% data.frame %>%
  set_rownames(c('glmnet','lasso.lr')) %>%
  set_colnames(rownames(coef(prostate.glmnet))[-1]) %>%
  round(5)
##           lcavol lweight age lbph     svi lcp gleason pgg45
## glmnet   0.51014 0.08699   0    0 0.11259   0       0     0
## lasso.lr 0.51013 0.08699   0    0 0.11260   0       0     0

The output shows that both approaches yield essentially the same estimates. For medium-sized data sets, when performing model selection with cross-validation it is nice to be able to pick a range of \(t\)’s from equation \(\eqref{eq:ivanov}\) that have a natural interpretation: 10%, 25%, …, 99% shrinkage of the unregularized least squares estimate, as sketched below. However, quadratic programming will not scale as well to very large data sets, and solving the regularization problem in the unconstrained form \(\eqref{eq:tikhonov}\) will be necessary.
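
To illustrate, the mapping from shrinkage level to coefficients can be traced out directly. This is a small sketch of my own that reuses the prostate objects defined above; the grid of shrinkage levels is purely illustrative, and a full analysis would wrap these fits in a cross-validation loop:

# Fit the QP LASSO over an interpretable grid of shrinkage levels
shrink.grid <- c(0.10, 0.25, 0.50, 0.75, 0.99)
beta.path <- sapply(shrink.grid, function(s) lasso.lr(X, y, t=s*t.max))
rownames(beta.path) <- colnames(X)
colnames(beta.path) <- paste0(100*shrink.grid, '% shrinkage')
beta.path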


  1. This is a desirable property in that it embeds model selection (picking variables) and results in an interpretable model. 

  2. For more details, see the course notes for ST-810.

  3. This was the original data set used in Tibshirani’s 1996 paper.

Written on February 28, 2017