sntn: A package to support data carving

This post accompanies the release of the sntn package, which implements a scipy-like class for doing inference on a sum of a normal and a trunctated normal (SNTN) distribution (see Kim (2006) and Arnold (1993)). I’ve written about this distribution before. The SNTN distribution can be used in a variety...
Read More

Using ESMFold to predict Cystic Fibrosis outcomes

Executive summary Embeddings from the ESMFold model can be used to predict clinical outcomes for different Cystic Fibrosis (CF) genotypes found in the CFTR2 database. The predictive correlation is both non-trivial and statistically significant. This work goes beyond the usual pathogenic/benign distinction and seeks to estimate average clinical outcomes for...
Read More

Can a fine-tuned GPT-3 write univocalic poems (Eunoia-style)?

Executive summary I developed a set of computational approaches to help a human (myself) write univocalic poems in the style of Christian Bök’s Eunoia. NOTE, this exercise was for research purposes only. I employed a four different strategies to create these poems: Fine-tuning a GPT-3 model on excerpts from Eunoia,...
Read More

Fine tuning GPT-3 to sound like a podcast host

Executive summary I developed four fine-tuned versions of OpenAI’s GPT-3 models to sound like Russ Roberts who is the host of the podcast EconTalk, using more than 10 years of episode transcripts. I calibrated my experiments for a \$40 USD budget (\$10 per model) and tested the fine-tuned models on...
Read More

paranet: Parametric survival models with elastic net regularization

This posts outlines the paranet package, which allows for the fitting of elastic net regularized parametric survival models with right-censored time-to-event data in python. I became interested in this topic when I realized that the a parametric modelling excercise would be useful for a business case, but was unable to...
Read More

trialML: Preparing a machine learning model for a statistical trial

This post summarizes a newly released python package: trialML. This package is designed to help researchers and practitioners prepare their machine learning models for a statistical trial to establish a lower-bound on model performance. Specifically, this package helps to calibrating the operating threshold of a binary classifier and carry out...
Read More

SurvSet: An open-source time-to-event dataset repository

This post summarizes a newly released python package: SurvSet the first ever open-source time-to-event dataset repository. The goal of SurvSet is to allow researchers and practioneeres to benchmark machine learning models and assess statistical methods. All datasets in this repository are consisently formatted to enable rapid prototyping and inference. The...
Read More

Statistically validating a model for a point on the ROC curve

Background Validating machine learning (ML) models in a prospective setting is a necessary, but not a sufficient condition, for demonstrating the possibility algorithmic utility. Most models designed on research datasets will fail to translate in a real-word setting. This problem is referred to as the “AI chasm”.[1] There are numerous...
Read More

Computational tools to support enciphered poetry

Please note that Heroku no longer provides free hosting of Dash apps and the links to the application in this post are currently not working. I am looking into alternative free hosting services. Overview In cryptography, substitution ciphers are considered a weak form of encryption because the ciphertext shares the...
Read More

Understanding the fragility index

Summary This post reviews the fragility index, a statistical technique proposed by Walsh et. al (2014) to provide an intuitive measure of the robustness of study findings. I show that the distribution of the fragility index can be approximated by a truncated Gaussian whose expectation is directly related to the...
Read More

Using a modified Hausman likelihood to adjust for binary label error

(1) Overview In most statistical models randomness is assumed to come from a parametric relationship. For example, a random variable \(y\) might have a Gaussian \(y \sim N(\mu,\sigma^2)\) or exponential distribution \(y \sim \text{Exp}(\lambda)\) centred around some point of central tendency \(E[y] = \mu\) or \(E[y] = \lambda^{-1}\). Conceptually, we...
Read More

Shorting the Canadian housing market with REITs

(1) Executive summary This post considers how well the Canadian housing market can be shorted (i.e. bet against) using publicly traded equities such as real estate investment trusts (REITs). There are structural reasons why the Canadian housing is difficult for investors to short: Between cities, house price changes are imperfectly...
Read More

The sum of a normal and a truncated normal

\[\newcommand{\bX}{\boldsymbol{X}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bXS}{\bX_S} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bhat}{\hat{\bbeta}} \newcommand{\bhatS}{\bhat_S} \newcommand{\bhato}{\bhat^{\text{OLS}}} \newcommand{\bhats}{\bhat^{\text{SI}}} \newcommand{\bhatc}{\bhat^{\text{Carv}}}\] (1) Background Consider the sum of a normal \(Z_1 \sim N(\mu_1,\tau_1^2)\) and a truncated normal \(Z_2 \sim TN(\mu_2,\tau_2^2,a,b)\), denoted \(W=Z_1+Z_2\). How can this normal-truncated sum (NTS) be characterized? One approach for doing inference on \(W\) would be to simply sample...
Read More

Vectorizing t-tests and F-tests for unequal variances

Almost all modern data science tasks begin with an exploratory data analysis (EDA) phase. Visualizing summary statistics and testing for associations forms the basis of hypothesis generation and subsequent modelling. Applied statisticians need to be careful not to over-interpret the results of EDA since the p-values generated during this phase...
Read More

An analysis of neighbourhood level population changes in Toronto and Vancouver

I recently reviewed House Divided, a book which discusses the regulatory reasons behind the “missing” low- to medium-density housing structures that are absent in Toronto. The problem of regulatory constraints is not a Toronto-specific problem. A virtually identical problem exists in Vancouver. The political economy of Toronto and Vancouver has...
Read More

Confidence interval approximations for the AUROC

The area under the receiver operating characteristic curve (AUROC) is one of the most commonly used performance metrics for binary classification. Visually, the AUROC is the integral between the sensitivity and false positive rate curves across all thresholds for a binary classifier. The AUROC can also be shown to be...
Read More

Canadian cancer statistics - a mixed story

I recently reviewed Azra Raza’s new book The First Cell, whose primary thesis is that cancer research has failed to translate into effective treatments due to a combination or poor incentives and a flawed experimental paradigm. The cancer that Dr. Raza treats, myelodysplastic syndromes (MDS) and its evolutionary successor, acute...
Read More

Implementing the bias-corrected and accelerated bootstrap in Python

The bootstrap is a powerful tool for carrying out inference on statistics whose distribution is unknown. The non-parametric version of the bootstrap obtains variation around the point estimate of a statistic by randomly resampling the data with replacement and recalculating the bootstrap-statistic based on these resamples. This simulated distribution can...
Read More

A winner's curse adjustment for a single test statistic

Background One of the primary culprits in the reproducability crisis in scientific research is the naive application of applied statistics for conducting inference. Even excluding cases of scientific misconduct, cited research findings are likely to be inaccurate due to 1) the file drawer problem, 2) researchers’ degrees of freedom and...
Read More

Preparing a binary classifier for a statistical trial

Binary classification tasks are one of the most common applications of machine learning models in applied practice. After a model has been trained, various evaluation metrics exist to allow researchers to benchmark performance and assess application viability. Some metrics, like accuracy, sensitivity, and specificity, require a threshold to be established...
Read More

Pediatric incubation time for COVID-19 using CORD-19 data

This post replicates my recently-posted Kaggle notebook using the the CORD-19 dataset which has more 37K full-text COVID-related articles. The goal of post is to show how to filter articles that are look for incubation period of the disease with the goal of finding a subset of articles that have...
Read More

AI Deployment Symposium

I am excited to share the AI Deployment Symposium Report that went live today on Vector’s website. This report provides many examples of real-world ML tools that have been deployed in a clinical setting. Despite the volume of articles about the potential for AI to improve patient outcomes and health...
Read More

Direct AUROC optimization with PyTorch

\(\newcommand{\by}{\boldsymbol{y}}\) \(\newcommand{\beta}{\boldsymbol{\eta}}\) \(\newcommand{\bw}{\boldsymbol{w}}\) \(\newcommand{\bx}{\boldsymbol{x}}\) In this post I’ll discuss how to directly optimize the Area Under the Receiver Operating Characteristic Curve (AUROC), which measures the discriminatory ability of a model across a range of sensitivity and specificity thresholds for binary classification. The AUROC is often used as method to benchmark...
Read More

Ordinal regression using the threshold approach from scratch

\[\newcommand{\bx}{\boldsymbol{x}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\balpha}{\boldsymbol{\alpha}} \newcommand{\btheta}{\boldsymbol{\theta}}\] Ordinal regression models are used when an outcome has \(K\) ordered possibilities: \(y \in [1,\dots,K]\), e.g. \(y \in [\text{low},\text{mid},\text{high}]\). While ordinal outcomes are discrete, their cumulative and probability mass functions can be written for each level: \[\begin{align*} P(y \leq k; \bx,\bbeta,\btheta ) &= F(\theta_k - \bx^T...
Read More

Elastic Net Exponential model for censored data

\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bt}{\boldsymbol{t}} \newcommand{\biota}{\boldsymbol{\iota}} \newcommand{\bdelta}{\boldsymbol{\delta}}\] Introduction There are many situations where a statistical model needs to be fit to data whose values are only partially known. When the outcome, or label, of a dataset is known to be at least or at most some amount, this...
Read More

The HRT for mixed data types (Python implementation)

Introduction In my last post I showed how the holdout random test (HRT) could be used to obtain valid p-values for any machine learning model by sampling from the conditional distribution of the design matrix. Like the permutation-type approaches used to assess variable importance for decision trees, this method sees...
Read More

Parameter inference through black box predictions: the Holdout Randomization Test (HRT)

\[\newcommand{\bX}{\boldsymbol{X}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\bXt}{\tilde{\bX}} \newcommand{\btheta}{\boldsymbol{\theta}} \newcommand{\bThet}{\boldsymbol{\Theta}} \newcommand{\bSigma}{\boldsymbol{\Sigma}}\] Background It is an exciting time for statistical inference in high-dimensional and machine learning settings. Techniques like Selective Inference (SI) have shown that certain classes of models (such as the Lasso or forward stepwise regression) can have valid inference procedures performed after...
Read More

Linear time AUC optimization

\[\newcommand{\mub}{\boldsymbol{\mu}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\thetab}{\boldsymbol{\theta}} \newcommand{\betab}{\boldsymbol{\beta}} \newcommand{\xb}{\boldsymbol{x}} \newcommand{\xbi}{\xb_i} \newcommand{\Xb}{\boldsymbol{X}} \newcommand{Xs}{\mathcal{X}} \newcommand{\Sigmab}{\boldsymbol{\Sigma}}\] Introduction In the binary classification setting linear models such as logistic regression, suport vector machines, and the linear discriminant analyis are often used when a researcher is interested in obtaining a simple model that can be fit quickly and returns interpretable...
Read More

A convex approximation of the concordance index (C-index)

\[\newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\betab}{\boldsymbol{\beta}} \newcommand{\xb}{\boldsymbol{x}} \newcommand{\xbi}{\xb_i} \newcommand{\Xb}{\boldsymbol{X}}\] Introduction When comparing the performance of an individual or competing group of survival models, the most common choice of scoring metric is the concordance index (C-index). In the time-to-event setting, with \(N\) observations, the quality of a risk score can be determined by comparing how...
Read More

Building an Elastic-Net Cox Model with Time-Dependent covariates

\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i} \newcommand{\ei}{\varepsilon_i}\] Introduction In survival analysis the goal is to model the process by which the rate of events happen. In the applied statistics setting this usually means identifying covariates that are likely to lead to a higher...
Read More

Building a survival-neuralnet from scratch in base R

\[\newcommand{\bX}{\boldsymbol{X}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bA}{\boldsymbol{A}} \newcommand{\ba}{\boldsymbol{a}} \newcommand{\bb}{\boldsymbol{b}} \newcommand{\bB}{\boldsymbol{B}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\bZ}{\boldsymbol{Z}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\biota}{\boldsymbol{\iota}} \newcommand{\RR}{\mathbb{R}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\LL}{\mathcal{L}}\] Motivation After having completed two of Andrew Ng’s excellent Deep Learning courses on Coursera, which are written in Python, I was inspired to write a post about how to create a neural network using base R....
Read More

Stratified survival analysis as a form of multitask/transfer learning

\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bt}{\boldsymbol{t}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bB}{\boldsymbol{B}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\mp}{\mathcal{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i}\] Introduction In my previous post I discussed how gradient methods could be used to optimize the partial likelihood from the Cox-PH model. This post will use the notations and equations established there previously. In...
Read More

Gradient descent for the elastic net Cox-PH model

\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i} \newcommand{\ei}{\varepsilon_i}\] Introduction Machine learning in the survival modelling context is relatively small area of research, but has been gaining attention in recent years. See this arXiv paper for a good overview. Unlike the classical supervised learning scenario where...
Read More

OLS under covariate shift

\[\newcommand{\be}{\boldsymbol{e}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\br}{\boldsymbol{r}} \newcommand{\N}{\mathcal{N}} \newcommand{\D}{\mathcal{D}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\X}{\mathcal{X}} \newcommand{\bmu}{\boldsymbol{\mu}} \newcommand{\bpi}{\boldsymbol{\pi}} \newcommand{\bSig}{\boldsymbol{\Sigma}} \newcommand{\Err}{\text{Err}} \newcommand{\Var}{\text{Var}} \newcommand{\MSE}{\text{MSE}} \newcommand{\Bias}{\text{Bias}}\] Introduction In statistical learning theory the goal of a prediction model is to learn regularities of a data generating process from a sample and then obtain the best possible generalization error, which...
Read More

Trade-offs in apportioning training, validation, and test sets

\[\newcommand{\Real}{\boldsymbol{R}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bxi}{\bx_i} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\boldsymbol{u}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\fH}{\mathcal{H}} \newcommand{\fR}{\mathcal{R}} \newcommand{\fX}{\mathcal{X}} \newcommand{\fY}{\mathcal{Y}}\] Background and motivation Before machine learning (ML) algorithms can be deployed in real world settings, an estimate of their generalization error is required. Because data is finite, a model can at most be trained on \(N\)...
Read More

Selective Inference: A useful technique for high-throughput biology

\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\mathbb{y}} \newcommand{\bx}{\mathbb{x}} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\mathbb{u}} \newcommand{\bX}{\mathbb{X}}\] Background The falling cost of sequencing technology has lead to a proliferation of biological datasets that are able to measure nuclear states such as relative amounts of protein expression via RNA-seq or targeted genotyping through next generation sequencing (NGS) platforms. These...
Read More

Theoretical properties of the Lasso: avoiding CV in high-dimensional datasets

\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bbetaz}{\bbeta_0} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\mathbb{y}} \newcommand{\bx}{\mathbb{x}} \newcommand{\be}{\mathbb{e}} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\mathbb{u}} \newcommand{\bX}{\mathbb{X}} \newcommand{\gram}{\bX^T\bX} \newcommand{\sqrtl}{\sqrt{\text{Lasso}} } \newcommand{\sqrtln}{\sqrt{\text{Lasso}}}\] Background Twenty years ago Tibshirani developed a penalized regression model known as the Lasso. Instead of using the L0-norm as the penalty term (equivalent to best-subset selection), the Lasso makes use of an L1-norm. This...
Read More

Hyperparameter learning via bi-level optimization

\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bbetahl}{\bbetah_\lambda} \newcommand{\LT}{\mathcal{L}_T} \newcommand{\PLT}{\mathcal{PL}_T} \newcommand{\LR}{\mathcal{L}_R} \newcommand{\by}{\mathbb{y}} \newcommand{\bX}{\mathbb{X}} \newcommand{\byR}{\by_R} \newcommand{\bXR}{\bX_R} \newcommand{\byT}{\by_T} \newcommand{\bXT}{\bX_T} \newcommand{\bU}{\mathbb{U}} \newcommand{\bV}{\mathbb{V}} \newcommand{\bD}{\mathbb{D}} \newcommand{\hone}{\hspace{1mm}} \newcommand{\hthree}{\hspace{3mm}} \newcommand{\hlam}{\hat{\lambda}}\] Background The superior prediction accuracy that machine learning (ML) models tend to have over traditional statistical approaches is largely driven by: Selecting parameter weights for prediction accuracy rather than statistical inference,...
Read More

Logistic regression from A to Z

Logistic regression (LR) is a type of classification model that is able to predict discrete or qualitative response categories. While LR usually refers to the two-class case (binary LR) it can also generalize to a multiclass system (multinomial LR) or the category-ordered situation (ordinal LR)[1]. By using a logit link,...
Read More

Adjusting survival curves with inverse probability weights

\(\newcommand{\bzi}{\mathbf{z}_i}\) \(\newcommand{\bZi}{\mathbf{Z}_i}\) \(\newcommand{\bmu}{\boldsymbol\mu}\) Introduction and motivation This post will consider how to adjust Kaplan-Meier (KM) survival curves, or non-parametric survival curves more generally, for an observational study setting by using inverse probability weights (IPWs). KM curves are frequently employed in survival analysis and clinical studies as they provide a visually...
Read More

Using quadratic programming to solve L1-norm regularization

When doing regression modeling, one will often want to use some sort of regularization to penalize model complexity, for reasons that I have discussed in many other posts. In the case of a linear regression, a popular choice is to penalize the L1-norm (sum of absolute values) of the coefficient...
Read More

Machine learning and causal inference

Introduction Machine learning and traditional statistical inference have, until very recently, been running along separate tracks. In broad strokes, machine learning researchers were interested in developing algorithms which maximized predictive accuracy. Natural processes were seen as a black box which could be approximated by creative data mining procedures. This approach...
Read More

Delta method

When fitting a distribution to a survival model it is often useful to re-parameterize it so that it has a more tractable scale[1]. However, estimating the parameters that index a distribution via likelihood methods is often easier in the original form, and therefore it is useful to be able to...
Read More

Introduction to R and Bioconductor

This post was created for students taking CISC 875 (Bioinformatics) and has two goals: (1) introduce the R programming language, and (2) demonstrate how to use some of the important Bioconductor packages for the analysis of gene expression datasets. R has become the dominant programming language for statistical computing in...
Read More

Cure models, genomic data, and the TCGA dataset

Background The advent of next-generation sequencing technology has given biologists a detailed resource with which they can better understand how cellular states are expressed in RNA sequence counts[1]. Statisticians have also been taking advantage of the NGS revolution by using machine learning algorithms to handle these over-determined datasets[2] and classify...
Read More

Introduction to survival analysis

Understanding the dynamics of survival times in clinical settings is important to both medical practitioners and patients. In statistics, time-to-event analysis models a continuous random variable \(T\), which represents the duration of a state. If the state is “being alive”, then the time to event is mortality, and we refer...
Read More

miRNA data for species classification

Introduction Micro RNAs (miRNAs) are a small RNA molecule, around 22 base pairs long[1], that are able to regulate gene expression by silencing specific RNAs. While these molecules were first discovered in the 1990s, their biological significance wasn’t fully appreciated until the early 2000s when they were found in C....
Read More

Batch effects

Introduction For my Advanced Biostatistics course this semester I gave a presentation about the problem of batch effects in microarray data analysis and I feel that it is worth expanding on in a post. DNA microarrays allow for the simultaneous measurement of thousands of genes from a cell sample. The...
Read More

Ridge regression

Introduction and motivation In Economics, my previous field of study, empirical research was often interested in estimating the parameters of a theoretical model. Such parameters of interest might include the risk premium embedded in the term structure of interest rates or the elasticity of the labour supply. The main data...
Read More

RMarkdown+Jekyll

My recent experience in blogging with the Ghost platform has been fairly satisfactory. When it comes to writing book reviews, or making short posts, writing up a post in Markdown/html is a fairly efficient process. However as a fledging Bioeconometrician, many of the topics I want to post about relate...
Read More