Canadian cancer statistics - a mixed story

I recently reviewed Azra Raza’s new book The First Cell, whose primary thesis is that cancer research has failed to translate into effective treatments due to a combination or poor incentives and a flawed experimental paradigm. The cancer that Dr. Raza treats, myelodysplastic syndromes (MDS) and its evolutionary successor, acute...
Read More

Implementing the bias-corrected and accelerated bootstrap in Python

The bootstrap is a powerful tool for carrying out inference on statistics whose distribution is unknown. The non-parametric version of the bootstrap obtains variation around the point estimate of a statistic by randomly resampling the data with replacement and recalculating the bootstrap-statistic based on these resamples. This simulated distribution can...
Read More

A winner's curse adjustment for a single test statistic

Background One of the primary culprits in the reproducability crisis in scientific research is the naive application of applied statistics for conducting inference. Even excluding cases of scientific misconduct, cited research findings are likely to be inaccurate due to 1) the file drawer problem, 2) researchers’ degrees of freedom and...
Read More

Calibrating a machine learning model for a statistical trial

Binary classification tasks are one of the most common applications of machine learning models in applied practice. After a model has been trained, various evaluation metrics exist to allow researchers to benchmark performance and assess application viability. Some metrics, like accuracy, sensitivity, and specificity, require a threshold to be established...
Read More

Pediatric incubation time for COVID-19 using CORD-19 data

This post replicates my recently-posted Kaggle notebook using the the CORD-19 dataset which has more 37K full-text COVID-related articles. The goal of post is to show how to filter articles that are look for incubation period of the disease with the goal of finding a subset of articles that have...
Read More

AI Deployment Symposium

I am excited to share the AI Deployment Symposium Report that went live today on Vector’s website. This report provides many examples of real-world ML tools that have been deployed in a clinical setting. Despite the volume of articles about the potential for AI to improve patient outcomes and health...
Read More

Direct AUROC optimization with PyTorch

\(\newcommand{\by}{\boldsymbol{y}}\) \(\newcommand{\beta}{\boldsymbol{\eta}}\) \(\newcommand{\bw}{\boldsymbol{w}}\) \(\newcommand{\bx}{\boldsymbol{x}}\) In this post I’ll discuss how to directly optimize the Area Under the Receiver Operating Characteristic Curve (AUROC), which measures the discriminatory ability of a model across a range of sensitivity and specificity thresholds for binary classification. The AUROC is often used as method to benchmark...
Read More

Ordinal regression using the threshold approach from scratch

\[\newcommand{\bx}{\boldsymbol{x}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\balpha}{\boldsymbol{\alpha}} \newcommand{\btheta}{\boldsymbol{\theta}}\] Ordinal regression models are used when an outcome has \(K\) ordered possibilities: \(y \in [1,\dots,K]\), e.g. \(y \in [\text{low},\text{mid},\text{high}]\). While ordinal outcomes are discrete, their cumulative and probability mass functions can be written for each level: \[\begin{align*} P(y \leq k; \bx,\bbeta,\btheta ) &= F(\theta_k - \bx^T...
Read More

Elastic Net Exponential model for censored data

\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bt}{\boldsymbol{t}} \newcommand{\biota}{\boldsymbol{\iota}} \newcommand{\bdelta}{\boldsymbol{\delta}}\] Introduction There are many situations where a statistical model needs to be fit to data whose values are only partially known. When the outcome, or label, of a dataset is known to be at least or at most some amount, this...
Read More

The HRT for mixed data types (Python implementation)

Introduction In my last post I showed how the holdout random test (HRT) could be used to obtain valid p-values for any machine learning model by sampling from the conditional distribution of the design matrix. Like the permutation-type approaches used to assess variable importance for decision trees, this method sees...
Read More

Parameter inference through black box predictions: the Holdout Randomization Test (HRT)

\[\newcommand{\bX}{\boldsymbol{X}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\bXt}{\tilde{\bX}} \newcommand{\btheta}{\boldsymbol{\theta}} \newcommand{\bThet}{\boldsymbol{\Theta}} \newcommand{\bSigma}{\boldsymbol{\Sigma}}\] Background It is an exciting time for statistical inference in high-dimensional and machine learning settings. Techniques like Selective Inference (SI) have shown that certain classes of models (such as the Lasso or forward stepwise regression) can have valid inference procedures performed after...
Read More

Linear time AUC optimization

\[\newcommand{\mub}{\boldsymbol{\mu}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\thetab}{\boldsymbol{\theta}} \newcommand{\betab}{\boldsymbol{\beta}} \newcommand{\xb}{\boldsymbol{x}} \newcommand{\xbi}{\xb_i} \newcommand{\Xb}{\boldsymbol{X}} \newcommand{Xs}{\mathcal{X}} \newcommand{\Sigmab}{\boldsymbol{\Sigma}}\] Introduction In the binary classification setting linear models such as logistic regression, suport vector machines, and the linear discriminant analyis are often used when a researcher is interested in obtaining a simple model that can be fit quickly and returns interpretable...
Read More

A convex approximation of the concordance index (C-index)

\[\newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\betab}{\boldsymbol{\beta}} \newcommand{\xb}{\boldsymbol{x}} \newcommand{\xbi}{\xb_i} \newcommand{\Xb}{\boldsymbol{X}}\] Introduction When comparing the performance of an individual or competing group of survival models, the most common choice of scoring metric is the concordance index (C-index). In the time-to-event setting, with \(N\) observations, the quality of a risk score can be determined by comparing how...
Read More

Building an Elastic-Net Cox Model with Time-Dependent covariates

\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i} \newcommand{\ei}{\varepsilon_i}\] Introduction In survival analysis the goal is to model the process by which the rate of events happen. In the applied statistics setting this usually means identifying covariates that are likely to lead to a higher...
Read More

Building a survival-neuralnet from scratch in base R

\[\newcommand{\bX}{\boldsymbol{X}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bA}{\boldsymbol{A}} \newcommand{\ba}{\boldsymbol{a}} \newcommand{\bb}{\boldsymbol{b}} \newcommand{\bB}{\boldsymbol{B}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\bZ}{\boldsymbol{Z}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\biota}{\boldsymbol{\iota}} \newcommand{\RR}{\mathbb{R}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\LL}{\mathcal{L}}\] Motivation After having completed two of Andrew Ng’s excellent Deep Learning courses on Coursera, which are written in Python, I was inspired to write a post about how to create a neural network using base R....
Read More

Stratified survival analysis as a form of multitask/transfer learning

\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bt}{\boldsymbol{t}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bB}{\boldsymbol{B}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\mp}{\mathcal{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i}\] Introduction In my previous post I discussed how gradient methods could be used to optimize the partial likelihood from the Cox-PH model. This post will use the notations and equations established there previously. In...
Read More

Gradient descent for the elastic net Cox-PH model

\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i} \newcommand{\ei}{\varepsilon_i}\] Introduction Machine learning in the survival modelling context is relatively small area of research, but has been gaining attention in recent years. See this arXiv paper for a good overview. Unlike the classical supervised learning scenario where...
Read More

OLS under covariate shift

\[\newcommand{\be}{\boldsymbol{e}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\br}{\boldsymbol{r}} \newcommand{\N}{\mathcal{N}} \newcommand{\D}{\mathcal{D}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\X}{\mathcal{X}} \newcommand{\bmu}{\boldsymbol{\mu}} \newcommand{\bpi}{\boldsymbol{\pi}} \newcommand{\bSig}{\boldsymbol{\Sigma}} \newcommand{\Err}{\text{Err}} \newcommand{\Var}{\text{Var}} \newcommand{\MSE}{\text{MSE}} \newcommand{\Bias}{\text{Bias}}\] Introduction In statistical learning theory the goal of a prediction model is to learn regularities of a data generating process from a sample and then obtain the best possible generalization error, which...
Read More

Trade-offs in apportioning training, validation, and test sets

\[\newcommand{\Real}{\boldsymbol{R}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bxi}{\bx_i} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\boldsymbol{u}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\fH}{\mathcal{H}} \newcommand{\fR}{\mathcal{R}} \newcommand{\fX}{\mathcal{X}} \newcommand{\fY}{\mathcal{Y}}\] Background and motivation Before machine learning (ML) algorithms can be deployed in real world settings, an estimate of their generalization error is required. Because data is finite, a model can at most be trained on \(N\)...
Read More

Selective Inference: A useful technique for high-throughput biology

\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\mathbb{y}} \newcommand{\bx}{\mathbb{x}} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\mathbb{u}} \newcommand{\bX}{\mathbb{X}}\] Background The falling cost of sequencing technology has lead to a proliferation of biological datasets that are able to measure nuclear states such as relative amounts of protein expression via RNA-seq or targeted genotyping through next generation sequencing (NGS) platforms. These...
Read More

Theoretical properties of the Lasso: avoiding CV in high-dimensional datasets

\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bbetaz}{\bbeta_0} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\mathbb{y}} \newcommand{\bx}{\mathbb{x}} \newcommand{\be}{\mathbb{e}} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\mathbb{u}} \newcommand{\bX}{\mathbb{X}} \newcommand{\gram}{\bX^T\bX} \newcommand{\sqrtl}{\sqrt{\text{Lasso}} } \newcommand{\sqrtln}{\sqrt{\text{Lasso}}}\] Background Twenty years ago Tibshirani developed a penalized regression model known as the Lasso. Instead of using the L0-norm as the penalty term (equivalent to best-subset selection), the Lasso makes use of an L1-norm. This...
Read More

Hyperparameter learning via bi-level optimization

\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bbetahl}{\bbetah_\lambda} \newcommand{\LT}{\mathcal{L}_T} \newcommand{\PLT}{\mathcal{PL}_T} \newcommand{\LR}{\mathcal{L}_R} \newcommand{\by}{\mathbb{y}} \newcommand{\bX}{\mathbb{X}} \newcommand{\byR}{\by_R} \newcommand{\bXR}{\bX_R} \newcommand{\byT}{\by_T} \newcommand{\bXT}{\bX_T} \newcommand{\bU}{\mathbb{U}} \newcommand{\bV}{\mathbb{V}} \newcommand{\bD}{\mathbb{D}} \newcommand{\hone}{\hspace{1mm}} \newcommand{\hthree}{\hspace{3mm}} \newcommand{\hlam}{\hat{\lambda}}\] Background The superior prediction accuracy that machine learning (ML) models tend to have over traditional statistical approaches is largely driven by: Selecting parameter weights for prediction accuracy rather than statistical inference,...
Read More

Logistic regression from A to Z

Logistic regression (LR) is a type of classification model that is able to predict discrete or qualitative response categories. While LR usually refers to the two-class case (binary LR) it can also generalize to a multiclass system (multinomial LR) or the category-ordered situation (ordinal LR)[1]. By using a logit link,...
Read More

Adjusting survival curves with inverse probability weights

\(\newcommand{\bzi}{\mathbf{z}_i}\) \(\newcommand{\bZi}{\mathbf{Z}_i}\) \(\newcommand{\bmu}{\boldsymbol\mu}\) Introduction and motivation This post will consider how to adjust Kaplan-Meier (KM) survival curves, or non-parametric survival curves more generally, for an observational study setting by using inverse probability weights (IPWs). KM curves are frequently employed in survival analysis and clinical studies as they provide a visually...
Read More

Using quadratic programming to solve L1-norm regularization

When doing regression modeling, one will often want to use some sort of regularization to penalize model complexity, for reasons that I have discussed in many other posts. In the case of a linear regression, a popular choice is to penalize the L1-norm (sum of absolute values) of the coefficient...
Read More

Machine learning and causal inference

Introduction Machine learning and traditional statistical inference have, until very recently, been running along separate tracks. In broad strokes, machine learning researchers were interested in developing algorithms which maximized predictive accuracy. Natural processes were seen as a black box which could be approximated by creative data mining procedures. This approach...
Read More

Delta method

When fitting a distribution to a survival model it is often useful to re-parameterize it so that it has a more tractable scale[1]. However, estimating the parameters that index a distribution via likelihood methods is often easier in the original form, and therefore it is useful to be able to...
Read More

Introduction to R and Bioconductor

This post was created for students taking CISC 875 (Bioinformatics) and has two goals: (1) introduce the R programming language, and (2) demonstrate how to use some of the important Bioconductor packages for the analysis of gene expression datasets. R has become the dominant programming language for statistical computing in...
Read More

Cure models, genomic data, and the TCGA dataset

Background The advent of next-generation sequencing technology has given biologists a detailed resource with which they can better understand how cellular states are expressed in RNA sequence counts[1]. Statisticians have also been taking advantage of the NGS revolution by using machine learning algorithms to handle these over-determined datasets[2] and classify...
Read More

Introduction to survival analysis

Understanding the dynamics of survival times in clinical settings is important to both medical practitioners and patients. In statistics, time-to-event analysis models a continuous random variable \(T\), which represents the duration of a state. If the state is “being alive”, then the time to event is mortality, and we refer...
Read More

miRNA data for species classification

Introduction Micro RNAs (miRNAs) are a small RNA molecule, around 22 base pairs long[1], that are able to regulate gene expression by silencing specific RNAs. While these molecules were first discovered in the 1990s, their biological significance wasn’t fully appreciated until the early 2000s when they were found in C....
Read More

Batch effects

Introduction For my Advanced Biostatistics course this semester I gave a presentation about the problem of batch effects in microarray data analysis and I feel that it is worth expanding on in a post. DNA microarrays allow for the simultaneous measurement of thousands of genes from a cell sample. The...
Read More

Ridge regression

Introduction and motivation In Economics, my previous field of study, empirical research was often interested in estimating the parameters of a theoretical model. Such parameters of interest might include the risk premium embedded in the term structure of interest rates or the elasticity of the labour supply. The main data...
Read More

RMarkdown+Jekyll

My recent experience in blogging with the Ghost platform has been fairly satisfactory. When it comes to writing book reviews, or making short posts, writing up a post in Markdown/html is a fairly efficient process. However as a fledging Bioeconometrician, many of the topics I want to post about relate...
Read More