Building an ElasticNet Cox Model with TimeDependent covariates
Introduction
In survival analysis the goal is to model the process by which the rate of events happen. In the applied statistics setting this usually means identifying covariates that are likely to lead to a higher or lower event rate compared to some baseline. When the event is remission or death, a higher event rate corresponds to a riskier patient. One of the unique data challenges in survival modelling is the presence of censoring, which occurs when a patient’s followup time is recorded but the event has not yet (or may never) happen.[^{1}] For example, highrisk cancer patients may be monitored for a full year at the end of which some will have passed away and some will still be alive – the latter group having an censored observation at the oneyear mark.
The most common model in medical research is the proportional hazards model (which has been discussed in a previous post) which tries to capture the relative order at which patients experience events. In this framework, censored patients can contribute information to modelling the event rate as a censored patient who has lived longer than some patient who experienced the event is likely to be a less risky patient.[^{2}] In many studies with multiple followup points a patient’s covariate information changes over time. For example in the heart
dataset in the survival
package, patient survival is measured while they are on a waiting list for a heart transplant. Below is a snippet of the data.
As we can see patient (id) number 3 waited one month before receiving a transplant, and then lived another 15 months before dying. The format of a timedependent dataset must be put in the longformat as seen above, where every new measurement period is treated as though it were a new observation. This padding additional rows can be thought of as adding leftcensored observations (in addition to the standard rightcensoring discussed above), as this new “patient” contributes no information to the left of their start time.
While many datasets in survival or timetoevent analysis have these timedependent properties, there is currently no support for estimating a regularized version of these models using either glmnet
or fastcox
, the benchmark libraries used for fitting elasticnet models to the proportional hazards loss function. In this post we will show how to build a custom proximal gradient descent algorithm that can incorporate timedependent covariates.
Elasticnet cox model
To establish the notation, define each observation as having the following tuple: \((t_i^l, t_i^u,\delta_i, \bxi)\), there \(t_i^l\) and \(t_i^u\) are the time periods in which this patient \(i\) had this covariate information, \(\delta_i\) is an indicator of whether the event happened to the patient at \(t_i^u\) and \(\bxi\) is that patient’s vector of covariate measurements. The partial likelihood of the Cox model can be easily accommodated to handle this timedependent covariate situation.
\[\begin{align} &\text{Partial Likelihood} \nonumber \\ L(\bbeta) &= \prod_{i=1}^N \Bigg( \frac{e^{\bxi^T \bbeta}}{\sum_{j \in R(t_i)}^N e^{\bx_j^T \bbeta}} \Bigg)^{\delta_i} \\ &\text{Partial LogLikelihood} \nonumber \\ \ell(\bbeta) &= \sum_{i=1}^N \delta_i \Bigg\{ \bxi^T \bbeta  \log \Bigg[\sum_{j\in R(t_i)}^N \exp(\bx_j^T \bbeta) \Bigg] \Bigg\} \tag{1}\label{eq:logpartial} \end{align}\]Where \(\bxi=\bxi(t)\) is now some function of time, and \(R(t_i)\) is the riskset or index of patients who were alive/noncensored at the event time \(t_i\). Specifically:
\[\begin{align*} R(t_i) &= \{ j \hspace{1mm} : \hspace{1mm} (t_j^u \geq t_i) \wedge (t_j^l < t_i) \} \end{align*}\]The first condition ensures that patient \(j\) either experienced the event or was censored at a later time point than \(t_i\) (and was hence alive when patient \(i\) experienced the event) and the second condition ensures that the start time occurred before the event. Notice that in the heart
dataset, the \([t_3^l, t_3^u]=(0,1]\) and \([t_4^l, t_4^u]=(1,16]\) ensuring that patient 3 could never be in his own riskset twice. It will be useful for downstream computations to use a onehot encoding matrix of \([\bY]_{ij}=y_i(t_j)\) where \(y_i(t_j)=I[i \in R(t_j)]\).
For high dimensional datasets and prediction problems, the research goal is to find some \(\bbeta\) that balances both model fit (i.e. using all information in the data) as well as maintaining high generalization capabilities (i.e. ignoring datasetspecific noise). Regularization is a technique that returns a coefficient vector which is “smaller” than what would otherwise have been returned, thereby reducing the variance of our model estimate and improving generalization. Furthermore, in highdimensional settings when there are more features than observations, regularization is also a way to ensure the existence of a unique solution. The elasticnet model combines a weighted L1 and L2 penalty term of the coefficient vector, the former which can lead to sparsity (i.e. coefficients which are strictly zero) and the latter which ensures smooth coefficient shrinkage. The elasticnet optimization is as follows.
\[\begin{align} &\text{Elasticnet loss for the Cox model} \nonumber \\ \bbetah &= \arg \min_{\bbeta} \hspace{2mm} \sum_{i=1}^N \delta_i \Bigg\{ \bxi^T \bbeta  \log \Bigg[\sum_{j\in R(t_i)}^N \exp(\bx_j^T \bbeta) \Bigg] \Bigg\} + \lambda \big( \alpha \ \bbeta\_1 + 0.5 (1\alpha) \\bbeta\_2 \big) \tag{2}\label{eq:elnet_loss} \end{align}\]The hyperparameter \(\lambda\) in \eqref{eq:elnet_loss} determines the overall level of regularization and \(\alpha\) determines the balance between the sparsest solution possible (\(\alpha=1\) which is the Lasso model) and the zero sparsity approach (\(\alpha=0\) which is the Ridge model). The level of each hyperparameter can be adjusted using methods like crossvalidation.
Code base
In the survival
package, Surv()
objects are used to store the time/event information and can be converted into the \(\bY\) with the following function.
Here is an example of two different time processes, one that is timeinvariant (i.e. start time is always zero) and one that is timedependent.
Because the details of using proximal gradient descent have been outlined in this post I will merely recapitulate the gradient update target we are interested in.
\[\begin{align} &\text{Elasticnet Cox proximal update} \nonumber \\ \bbeta^{(k)} &= S_{\gamma\alpha\lambda}\Bigg(\beta^{(k1)} + \gamma^{(k)} \Big[ \frac{1}{N}\bX^T(\bdelta  \bP\bdelta)  \lambda(1\alpha)\bbeta^{(k1)} \Big] \Bigg) \tag{3}\label{eq:prox_step} \end{align}\]Where \(\gamma^{(k)}\) is the gradient step size at each iteration. Below we outline the code necessary for each component of this update step:
\(S():\)
\(\bP\):
\(\bdelta  \bP\bdelta\):
\(\frac{1}{N}\bX^T(\bdelta  \bP\bdelta) + \lambda(1\alpha)\bbeta^{(k1)}\)
Equation \eqref{eq:prox_step}:
Equation \eqref{eq:elnet_loss}:
Lastly, whenever doing convex optimization I prefer to use the BarziliaBorwien step size adjustment rather than more computational demanding approaches like line search. However for a fullydeveloped code certain circuit breaks should be included to ensure the step size does not explode in any one iteration.
All of these functions can be combined in a single wrapper elnet.prox.cox
shown below.
Examples
We first want to make sure that elnet.prox.cox
can recapitulate the output from coxph
in the timedependent case and glmnet
in the timeinvariant situation.
Of course the point of writing the elnet.prox.cox
function is not simply to replicate coxph
or glmnet
but rather apply it to a regularized timedependent Cox models. In the code below we will fit the elasticnet solution path for \(\alpha = [1/3, 2/3, 1]\) on the cgd
dataset.
This post has shown how to make minor adjustments to firstorder gradient methods to solve the elasticnet Cox model that can handle timedependent covariate information. In future posts we will explore further algorithmic approaches for estimating a proportional hazard model.
Footnotes

The latter scenario technically corresponds to a cure model framework. ↩

I say “likely” to be less risky as there is always stochasticity around any process, so that even patients with the same risk profile will realize the event at different times due to the nature of a random processes. Also note that patient who is censored before another patient experiences the event can contribute no information to that pairwise ordering. ↩