trialML: Preparing a machine learning model for a statistical trial

This post summarizes a newly released Python package: trialML. This package is designed to help researchers and practitioners prepare their machine learning models for a statistical trial that establishes a lower bound on model performance. Specifically, this package helps calibrate the operating threshold of a binary classifier and carry out...

SurvSet: An open-source time-to-event dataset repository

This post summarizes a newly released Python package: SurvSet, the first open-source time-to-event dataset repository. The goal of SurvSet is to allow researchers and practitioners to benchmark machine learning models and assess statistical methods. All datasets in this repository are consistently formatted to enable rapid prototyping and inference. The...

Statistically validating a model for a point on the ROC curve

Background Validating machine learning (ML) models in a prospective setting is a necessary, but not sufficient, condition for demonstrating the possibility of algorithmic utility. Most models designed on research datasets will fail to translate to a real-world setting. This problem is referred to as the “AI chasm”.[1] There are numerous...

Computational tools to support enciphered poetry

Overview In cryptography, substitution ciphers are considered a weak form of encryption because the ciphertext shares the same empirical distribution of the plaintext language used to write the message. This allows the cipher to be easily cracked. Despite being a poor form of encryption, substitution ciphers present an interesting opportunity...

Understanding the fragility index

Summary This post reviews the fragility index, a statistical technique proposed by Walsh et al. (2014) to provide an intuitive measure of the robustness of study findings. I show that the distribution of the fragility index can be approximated by a truncated Gaussian whose expectation is directly related to the...

Using a modified Hausman likelihood to adjust for binary label error

(1) Overview In most statistical models randomness is assumed to come from a parametric relationship. For example, a random variable $$y$$ might have a Gaussian $$y \sim N(\mu,\sigma^2)$$ or exponential distribution $$y \sim \text{Exp}(\lambda)$$ centred around some point of central tendency $$E[y] = \mu$$ or $$E[y] = \lambda^{-1}$$. Conceptually, we...
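To preview where the post is headed, a Hausman-style adjustment for binary label error (the notation here is illustrative, not necessarily the post's) replaces the usual success probability $$F(x^\top\beta)$$ with

$$ P(\tilde{y} = 1 \mid x) = \alpha_0 + (1 - \alpha_0 - \alpha_1)\,F(x^\top \beta), $$

where $$\alpha_0$$ and $$\alpha_1$$ are the false-positive and false-negative label rates, and the likelihood is then maximized jointly over $$(\alpha_0, \alpha_1, \beta)$$.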

Shorting the Canadian housing market with REITs

(1) Executive summary This post considers how well the Canadian housing market can be shorted (i.e. bet against) using publicly traded equities such as real estate investment trusts (REITs). There are structural reasons why the Canadian housing market is difficult for investors to short: Between cities, house price changes are imperfectly...

The sum of a normal and a truncated normal


A case of matching methods being poorly suited to analysing harm reduction policies

Executive summary A recently published paper in JAMA Network Open by Lee et al. (2021) (hereafter Lee) uses an econometric method that they claim finds a deleterious association between the adoption of harm reduction policies at the state level and overdose deaths.[1] Their method estimates that both Naloxone and Good...

Vectorizing t-tests and F-tests for unequal variances

Almost all modern data science tasks begin with an exploratory data analysis (EDA) phase. Visualizing summary statistics and testing for associations forms the basis of hypothesis generation and subsequent modelling. Applied statisticians need to be careful not to over-interpret the results of EDA since the p-values generated during this phase...
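The column-wise Welch t-test at the heart of this post can be sketched with NumPy broadcasting (the function name and array shapes here are illustrative, not the post's):

```python
import numpy as np
from scipy import stats

def welch_tests(X, Y):
    """Welch t-tests for each column of X against the same column of Y.

    X: (n1, k) array, Y: (n2, k) array; returns k t-stats and p-values.
    """
    n1, n2 = X.shape[0], Y.shape[0]
    m1, m2 = X.mean(axis=0), Y.mean(axis=0)
    v1, v2 = X.var(axis=0, ddof=1), Y.var(axis=0, ddof=1)
    se2 = v1 / n1 + v2 / n2
    tstat = (m1 - m2) / np.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom, computed column-wise
    df = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))
    pval = 2 * stats.t.sf(np.abs(tstat), df)
    return tstat, pval

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 3)), rng.normal(size=(40, 3))
t_vec, p_vec = welch_tests(X, Y)
```

Each column is tested in one vectorized pass rather than a Python loop, which is what makes this usable for EDA over thousands of features.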

An analysis of neighbourhood level population changes in Toronto and Vancouver

I recently reviewed House Divided, a book which discusses the regulatory reasons behind the “missing” low- to medium-density housing structures that are absent in Toronto. The problem of regulatory constraints is not a Toronto-specific problem. A virtually identical problem exists in Vancouver. The political economy of Toronto and Vancouver has...

Confidence interval approximations for the AUROC

The area under the receiver operating characteristic curve (AUROC) is one of the most commonly used performance metrics for binary classification. Visually, the AUROC is the area obtained by plotting sensitivity against the false positive rate across all thresholds of a binary classifier. The AUROC can also be shown to be...
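The probabilistic interpretation hinted at above, that the AUROC equals the probability a randomly chosen positive out-scores a randomly chosen negative (the Mann-Whitney statistic), can be sketched as (illustrative code, not the post's):

```python
import numpy as np

def auroc(y, score):
    """Rank-based AUROC: P(score of random positive > random negative),
    counting ties as 1/2 (equivalent to the Mann-Whitney U statistic)."""
    y = np.asarray(y).astype(bool)
    pos, neg = np.asarray(score)[y], np.asarray(score)[~y]
    # All pairwise comparisons; fine for a sketch, O(n_pos * n_neg) memory
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

y = np.array([0, 0, 1, 1, 1, 0])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])
auc = auroc(y, s)  # 8 of the 9 positive-negative pairs are ordered correctly
```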

Canadian cancer statistics - a mixed story

I recently reviewed Azra Raza’s new book The First Cell, whose primary thesis is that cancer research has failed to translate into effective treatments due to a combination of poor incentives and a flawed experimental paradigm. The cancer that Dr. Raza treats, myelodysplastic syndromes (MDS) and its evolutionary successor, acute...

Running a statistical trial for a machine learning regression model

Imagine you have been given an imaging dataset and have trained a convolutional neural network to count the number of cells in the image for a medical-based task. On a held-out test set you observe an average error of ±5 cells. In order to be considered reliable enough for clinical use,...

Implementing the bias-corrected and accelerated bootstrap in Python

The bootstrap is a powerful tool for carrying out inference on statistics whose distribution is unknown. The non-parametric version of the bootstrap obtains variation around the point estimate of a statistic by randomly resampling the data with replacement and recalculating the bootstrap-statistic based on these resamples. This simulated distribution can...
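The resampling loop described above can be sketched in a few lines of NumPy. This is a plain percentile interval, the post itself covers the fancier bias-corrected and accelerated (BCa) adjustment, and all names here are my own:

```python
import numpy as np

def bootstrap_ci(x, stat=np.median, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for a statistic (no BCa correction)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Resample with replacement and recompute the statistic each time
    boots = np.array([stat(rng.choice(x, size=n, replace=True))
                      for _ in range(n_boot)])
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return lo, hi

x = np.random.default_rng(0).exponential(size=100)
lo, hi = bootstrap_ci(x)  # 95% interval for the sample median
```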

A winner's curse adjustment for a single test statistic

Background One of the primary culprits in the reproducibility crisis in scientific research is the naive application of applied statistics for conducting inference. Even excluding cases of scientific misconduct, cited research findings are likely to be inaccurate due to 1) the file drawer problem, 2) researchers’ degrees of freedom and...

Preparing a binary classifier for a statistical trial

Binary classification tasks are one of the most common applications of machine learning models in applied practice. After a model has been trained, various evaluation metrics exist to allow researchers to benchmark performance and assess application viability. Some metrics, like accuracy, sensitivity, and specificity, require a threshold to be established...
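As a minimal illustration of the threshold-dependent metrics mentioned above (these names are mine, not trialML's API):

```python
import numpy as np

def threshold_metrics(y, score, thresh):
    """Sensitivity and specificity of the rule yhat = (score >= thresh)."""
    y = np.asarray(y).astype(bool)
    yhat = np.asarray(score) >= thresh
    sens = yhat[y].mean()        # TP / (TP + FN)
    spec = (~yhat[~y]).mean()    # TN / (TN + FP)
    return sens, spec

y = [1, 1, 1, 0, 0, 0, 0]
score = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4]
sens, spec = threshold_metrics(y, score, 0.5)
```

Moving the threshold trades one metric off against the other, which is why the threshold must be fixed before a prospective trial.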

Pediatric incubation time for COVID-19 using CORD-19 data

This post replicates my recently-posted Kaggle notebook using the CORD-19 dataset, which has more than 37K full-text COVID-related articles. The goal of this post is to show how to filter for articles that discuss the incubation period of the disease, in order to find a subset of articles that have...

AI Deployment Symposium

I am excited to share the AI Deployment Symposium Report that went live today on Vector’s website. This report provides many examples of real-world ML tools that have been deployed in a clinical setting. Despite the volume of articles about the potential for AI to improve patient outcomes and health...

Direct AUROC optimization with PyTorch


The HRT for mixed data types (Python implementation)

Introduction In my last post I showed how the holdout randomization test (HRT) could be used to obtain valid p-values for any machine learning model by sampling from the conditional distribution of the design matrix. Like the permutation-type approaches used to assess variable importance for decision trees, this method sees...

Parameter inference through black box predictions: the Holdout Randomization Test (HRT)


Linear time AUC optimization


A convex approximation of the concordance index (C-index)


Building an Elastic-Net Cox Model with Time-Dependent covariates


Building a survival-neuralnet from scratch in base R


Stratified survival analysis as a form of multitask/transfer learning


Gradient descent for the elastic net Cox-PH model


OLS under covariate shift


Trade-offs in apportioning training, validation, and test sets


Selective Inference: A useful technique for high-throughput biology


Theoretical properties of the Lasso: avoiding CV in high-dimensional datasets


Hyperparameter learning via bi-level optimization


Logistic regression from A to Z

Logistic regression (LR) is a type of classification model that is able to predict discrete or qualitative response categories. While LR usually refers to the two-class case (binary LR) it can also generalize to a multiclass system (multinomial LR) or the category-ordered situation (ordinal LR)[1]. By using a logit link,...
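The logit link mentioned above maps a probability onto the whole real line; a minimal statement of the binary LR model is

$$ \text{logit}(p_i) = \log\frac{p_i}{1 - p_i} = x_i^\top \beta \quad \Longleftrightarrow \quad p_i = \frac{1}{1 + \exp(-x_i^\top \beta)}. $$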

Adjusting survival curves with inverse probability weights


Using quadratic programming to solve L1-norm regularization

When doing regression modeling, one will often want to use some sort of regularization to penalize model complexity, for reasons that I have discussed in many other posts. In the case of a linear regression, a popular choice is to penalize the L1-norm (sum of absolute values) of the coefficient...
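As a sketch of the reformulation the post builds on: splitting $$\beta = \beta^+ - \beta^-$$ with $$\beta^\pm \geq 0$$ turns the L1 penalty into a linear term, making the problem a bound-constrained quadratic program. Here a generic bound-constrained solver (L-BFGS-B) stands in for a dedicated QP solver; all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def lasso_qp(X, y, lam):
    """L1-penalized least squares via the split beta = b_pos - b_neg,
    b_pos, b_neg >= 0, which makes the objective a smooth quadratic
    plus a linear term over the nonnegative orthant."""
    n, p = X.shape
    Z = np.hstack([X, -X])  # design for the stacked nonnegative variables

    def obj(w):
        r = Z @ w - y
        return 0.5 * r @ r + lam * w.sum()

    def grad(w):
        return Z.T @ (Z @ w - y) + lam

    res = minimize(obj, np.zeros(2 * p), jac=grad, method="L-BFGS-B",
                   bounds=[(0, None)] * (2 * p))
    w = res.x
    return w[:p] - w[p:]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)
beta_hat = lasso_qp(X, y, lam=5.0)  # nonzero coefs shrunk, zeros near zero
```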

Cancer classification using plasma cirDNA: A small N and large p environment

Over the last fifteen years the field of biology has undergone a significant cultural change. The pipette is being replaced by the piping operator. At the recent Software Carpentry workshop that occurred at Queen’s University this week, I noticed that most of the people there to learn about UNIX programming...

Machine learning and causal inference

Introduction Machine learning and traditional statistical inference have, until very recently, been running along separate tracks. In broad strokes, machine learning researchers were interested in developing algorithms which maximized predictive accuracy. Natural processes were seen as a black box which could be approximated by creative data mining procedures. This approach...

Delta method

When fitting a distribution to a survival model it is often useful to re-parameterize it so that it has a more tractable scale[1]. However, estimating the parameters that index a distribution via likelihood methods is often easier in the original form, and therefore it is useful to be able to...
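As a refresher, the univariate delta method says that if $$\sqrt{n}(\hat\theta - \theta) \to N(0, \sigma^2)$$ and $$g$$ is differentiable with $$g'(\theta) \neq 0$$, then

$$ \sqrt{n}\big(g(\hat\theta) - g(\theta)\big) \to N\big(0, [g'(\theta)]^2 \sigma^2\big), $$

which is what lets one move between a distribution's original and re-parameterized scales.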

Introduction to R and Bioconductor

This post was created for students taking CISC 875 (Bioinformatics) and has two goals: (1) introduce the R programming language, and (2) demonstrate how to use some of the important Bioconductor packages for the analysis of gene expression datasets. R has become the dominant programming language for statistical computing in...

Cure models, genomic data, and the TCGA dataset

Background The advent of next-generation sequencing technology has given biologists a detailed resource with which they can better understand how cellular states are expressed in RNA sequence counts[1]. Statisticians have also been taking advantage of the NGS revolution by using machine learning algorithms to handle these over-determined datasets[2] and classify...

Introduction to survival analysis

Understanding the dynamics of survival times in clinical settings is important to both medical practitioners and patients. In statistics, time-to-event analysis models a continuous random variable $$T$$, which represents the duration of a state. If the state is “being alive”, then the time to event is mortality, and we refer...
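In the notation above, the two functions the post builds on are the survival and hazard functions,

$$ S(t) = P(T > t) = 1 - F(t), \qquad h(t) = \lim_{\delta \downarrow 0} \frac{P(t \le T < t + \delta \mid T \ge t)}{\delta} = \frac{f(t)}{S(t)}. $$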

miRNA data for species classification

Introduction Micro RNAs (miRNAs) are small RNA molecules, around 22 base pairs long[1], that are able to regulate gene expression by silencing specific RNAs. While these molecules were first discovered in the 1990s, their biological significance wasn’t fully appreciated until the early 2000s when they were found in C....

Batch effects

Introduction For my Advanced Biostatistics course this semester I gave a presentation about the problem of batch effects in microarray data analysis and I feel that it is worth expanding on in a post. DNA microarrays allow for the simultaneous measurement of thousands of genes from a cell sample. The...