This post accompanies the release of the sntn package, which implements a scipy-like class for doing inference on a sum of a normal and a trunctated normal (SNTN) distribution (see Kim (2006) and Arnold (1993)). I’ve written about this distribution before. The SNTN distribution can be used in a variety...

Read More
Executive summary Embeddings from the ESMFold model can be used to predict clinical outcomes for different Cystic Fibrosis (CF) genotypes found in the CFTR2 database. The predictive correlation is both non-trivial and statistically significant. This work goes beyond the usual pathogenic/benign distinction and seeks to estimate average clinical outcomes for...

Read More
Executive summary I developed a set of computational approaches to help a human (myself) write univocalic poems in the style of Christian Bök’s Eunoia. NOTE, this exercise was for research purposes only. I employed a four different strategies to create these poems: Fine-tuning a GPT-3 model on excerpts from Eunoia,...

Read More
Executive summary I developed four fine-tuned versions of OpenAI’s GPT-3 models to sound like Russ Roberts who is the host of the podcast EconTalk, using more than 10 years of episode transcripts. I calibrated my experiments for a \$40 USD budget (\$10 per model) and tested the fine-tuned models on...

Read More
This posts outlines the paranet package, which allows for the fitting of elastic net regularized parametric survival models with right-censored time-to-event data in python. I became interested in this topic when I realized that the a parametric modelling excercise would be useful for a business case, but was unable to...

Read More
This post summarizes a newly released python package: trialML. This package is designed to help researchers and practitioners prepare their machine learning models for a statistical trial to establish a lower-bound on model performance. Specifically, this package helps to calibrating the operating threshold of a binary classifier and carry out...

Read More
This post summarizes a newly released python package: SurvSet the first ever open-source time-to-event dataset repository. The goal of SurvSet is to allow researchers and practioneeres to benchmark machine learning models and assess statistical methods. All datasets in this repository are consisently formatted to enable rapid prototyping and inference. The...

Read More
Background Validating machine learning (ML) models in a prospective setting is a necessary, but not a sufficient condition, for demonstrating the possibility algorithmic utility. Most models designed on research datasets will fail to translate in a real-word setting. This problem is referred to as the “AI chasm”.[1] There are numerous...

Read More
Please note that Heroku no longer provides free hosting of Dash apps and the links to the application in this post are currently not working. I am looking into alternative free hosting services. Overview In cryptography, substitution ciphers are considered a weak form of encryption because the ciphertext shares the...

Read More
Summary This post reviews the fragility index, a statistical technique proposed by Walsh et. al (2014) to provide an intuitive measure of the robustness of study findings. I show that the distribution of the fragility index can be approximated by a truncated Gaussian whose expectation is directly related to the...

Read More
(1) Overview In most statistical models randomness is assumed to come from a parametric relationship. For example, a random variable \(y\) might have a Gaussian \(y \sim N(\mu,\sigma^2)\) or exponential distribution \(y \sim \text{Exp}(\lambda)\) centred around some point of central tendency \(E[y] = \mu\) or \(E[y] = \lambda^{-1}\). Conceptually, we...

Read More
(1) Executive summary This post considers how well the Canadian housing market can be shorted (i.e. bet against) using publicly traded equities such as real estate investment trusts (REITs). There are structural reasons why the Canadian housing is difficult for investors to short: Between cities, house price changes are imperfectly...

Read More
\[\newcommand{\bX}{\boldsymbol{X}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bXS}{\bX_S} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bhat}{\hat{\bbeta}} \newcommand{\bhatS}{\bhat_S} \newcommand{\bhato}{\bhat^{\text{OLS}}} \newcommand{\bhats}{\bhat^{\text{SI}}} \newcommand{\bhatc}{\bhat^{\text{Carv}}}\] (1) Background Consider the sum of a normal \(Z_1 \sim N(\mu_1,\tau_1^2)\) and a truncated normal \(Z_2 \sim TN(\mu_2,\tau_2^2,a,b)\), denoted \(W=Z_1+Z_2\). How can this normal-truncated sum (NTS) be characterized? One approach for doing inference on \(W\) would be to simply sample...

Read More
Executive summary A recently published paper in JAMA Network Open by Lee et. al (2021) (hereafter Lee) uses an econometric method which claims to find a deleterious association between the adoption of harm reduction policies at the state level and overdose deaths.[1] Their method estimates that both Naloxone and Good...

Read More
Almost all modern data science tasks begin with an exploratory data analysis (EDA) phase. Visualizing summary statistics and testing for associations forms the basis of hypothesis generation and subsequent modelling. Applied statisticians need to be careful not to over-interpret the results of EDA since the p-values generated during this phase...

Read More
I recently reviewed House Divided, a book which discusses the regulatory reasons behind the “missing” low- to medium-density housing structures that are absent in Toronto. The problem of regulatory constraints is not a Toronto-specific problem. A virtually identical problem exists in Vancouver. The political economy of Toronto and Vancouver has...

Read More
The area under the receiver operating characteristic curve (AUROC) is one of the most commonly used performance metrics for binary classification. Visually, the AUROC is the integral between the sensitivity and false positive rate curves across all thresholds for a binary classifier. The AUROC can also be shown to be...

Read More
I recently reviewed Azra Raza’s new book The First Cell, whose primary thesis is that cancer research has failed to translate into effective treatments due to a combination or poor incentives and a flawed experimental paradigm. The cancer that Dr. Raza treats, myelodysplastic syndromes (MDS) and its evolutionary successor, acute...

Read More
Imagine you have been given an imaging dataset and have trained a convolutional neural network to count the number of cells in the image for a medical-based task. On a held-out test set you observe an average error ±5 cells. In order to be considered reliable enough for clinical use,...

Read More
The bootstrap is a powerful tool for carrying out inference on statistics whose distribution is unknown. The non-parametric version of the bootstrap obtains variation around the point estimate of a statistic by randomly resampling the data with replacement and recalculating the bootstrap-statistic based on these resamples. This simulated distribution can...

Read More
Background One of the primary culprits in the reproducability crisis in scientific research is the naive application of applied statistics for conducting inference. Even excluding cases of scientific misconduct, cited research findings are likely to be inaccurate due to 1) the file drawer problem, 2) researchers’ degrees of freedom and...

Read More
Binary classification tasks are one of the most common applications of machine learning models in applied practice. After a model has been trained, various evaluation metrics exist to allow researchers to benchmark performance and assess application viability. Some metrics, like accuracy, sensitivity, and specificity, require a threshold to be established...

Read More
This post replicates my recently-posted Kaggle notebook using the the CORD-19 dataset which has more 37K full-text COVID-related articles. The goal of post is to show how to filter articles that are look for incubation period of the disease with the goal of finding a subset of articles that have...

Read More
I am excited to share the AI Deployment Symposium Report that went live today on Vector’s website. This report provides many examples of real-world ML tools that have been deployed in a clinical setting. Despite the volume of articles about the potential for AI to improve patient outcomes and health...

Read More
\(\newcommand{\by}{\boldsymbol{y}}\) \(\newcommand{\beta}{\boldsymbol{\eta}}\) \(\newcommand{\bw}{\boldsymbol{w}}\) \(\newcommand{\bx}{\boldsymbol{x}}\) In this post I’ll discuss how to directly optimize the Area Under the Receiver Operating Characteristic Curve (AUROC), which measures the discriminatory ability of a model across a range of sensitivity and specificity thresholds for binary classification. The AUROC is often used as method to benchmark...

Read More
\[\newcommand{\bx}{\boldsymbol{x}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\balpha}{\boldsymbol{\alpha}} \newcommand{\btheta}{\boldsymbol{\theta}}\] Ordinal regression models are used when an outcome has \(K\) ordered possibilities: \(y \in [1,\dots,K]\), e.g. \(y \in [\text{low},\text{mid},\text{high}]\). While ordinal outcomes are discrete, their cumulative and probability mass functions can be written for each level: \[\begin{align*} P(y \leq k; \bx,\bbeta,\btheta ) &= F(\theta_k - \bx^T...

Read More
\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bt}{\boldsymbol{t}} \newcommand{\biota}{\boldsymbol{\iota}} \newcommand{\bdelta}{\boldsymbol{\delta}}\] Introduction There are many situations where a statistical model needs to be fit to data whose values are only partially known. When the outcome, or label, of a dataset is known to be at least or at most some amount, this...

Read More
Introduction In my last post I showed how the holdout random test (HRT) could be used to obtain valid p-values for any machine learning model by sampling from the conditional distribution of the design matrix. Like the permutation-type approaches used to assess variable importance for decision trees, this method sees...

Read More
\[\newcommand{\bX}{\boldsymbol{X}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\bXt}{\tilde{\bX}} \newcommand{\btheta}{\boldsymbol{\theta}} \newcommand{\bThet}{\boldsymbol{\Theta}} \newcommand{\bSigma}{\boldsymbol{\Sigma}}\] Background It is an exciting time for statistical inference in high-dimensional and machine learning settings. Techniques like Selective Inference (SI) have shown that certain classes of models (such as the Lasso or forward stepwise regression) can have valid inference procedures performed after...

Read More
\[\newcommand{\mub}{\boldsymbol{\mu}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\thetab}{\boldsymbol{\theta}} \newcommand{\betab}{\boldsymbol{\beta}} \newcommand{\xb}{\boldsymbol{x}} \newcommand{\xbi}{\xb_i} \newcommand{\Xb}{\boldsymbol{X}} \newcommand{Xs}{\mathcal{X}} \newcommand{\Sigmab}{\boldsymbol{\Sigma}}\] Introduction In the binary classification setting linear models such as logistic regression, suport vector machines, and the linear discriminant analyis are often used when a researcher is interested in obtaining a simple model that can be fit quickly and returns interpretable...

Read More
\[\newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\betab}{\boldsymbol{\beta}} \newcommand{\xb}{\boldsymbol{x}} \newcommand{\xbi}{\xb_i} \newcommand{\Xb}{\boldsymbol{X}}\] Introduction When comparing the performance of an individual or competing group of survival models, the most common choice of scoring metric is the concordance index (C-index). In the time-to-event setting, with \(N\) observations, the quality of a risk score can be determined by comparing how...

Read More
\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i} \newcommand{\ei}{\varepsilon_i}\] Introduction In survival analysis the goal is to model the process by which the rate of events happen. In the applied statistics setting this usually means identifying covariates that are likely to lead to a higher...

Read More
\[\newcommand{\bX}{\boldsymbol{X}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bA}{\boldsymbol{A}} \newcommand{\ba}{\boldsymbol{a}} \newcommand{\bb}{\boldsymbol{b}} \newcommand{\bB}{\boldsymbol{B}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\bZ}{\boldsymbol{Z}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\biota}{\boldsymbol{\iota}} \newcommand{\RR}{\mathbb{R}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\LL}{\mathcal{L}}\] Motivation After having completed two of Andrew Ng’s excellent Deep Learning courses on Coursera, which are written in Python, I was inspired to write a post about how to create a neural network using base R....

Read More
\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bt}{\boldsymbol{t}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bB}{\boldsymbol{B}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\mp}{\mathcal{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i}\] Introduction In my previous post I discussed how gradient methods could be used to optimize the partial likelihood from the Cox-PH model. This post will use the notations and equations established there previously. In...

Read More
\[\newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bW}{\boldsymbol{W}} \newcommand{\bp}{\boldsymbol{p}} \newcommand{\etab}{\boldsymbol{\eta}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bP}{\boldsymbol{P}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bw}{\boldsymbol{w}} \newcommand{\bxi}{\bx_i} \newcommand{\ei}{\varepsilon_i}\] Introduction Machine learning in the survival modelling context is relatively small area of research, but has been gaining attention in recent years. See this arXiv paper for a good overview. Unlike the classical supervised learning scenario where...

Read More
\[\newcommand{\be}{\boldsymbol{e}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bY}{\boldsymbol{Y}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\br}{\boldsymbol{r}} \newcommand{\N}{\mathcal{N}} \newcommand{\D}{\mathcal{D}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\X}{\mathcal{X}} \newcommand{\bmu}{\boldsymbol{\mu}} \newcommand{\bpi}{\boldsymbol{\pi}} \newcommand{\bSig}{\boldsymbol{\Sigma}} \newcommand{\Err}{\text{Err}} \newcommand{\Var}{\text{Var}} \newcommand{\MSE}{\text{MSE}} \newcommand{\Bias}{\text{Bias}}\] Introduction In statistical learning theory the goal of a prediction model is to learn regularities of a data generating process from a sample and then obtain the best possible generalization error, which...

Read More
\[\newcommand{\Real}{\boldsymbol{R}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bxi}{\bx_i} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\boldsymbol{u}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\fH}{\mathcal{H}} \newcommand{\fR}{\mathcal{R}} \newcommand{\fX}{\mathcal{X}} \newcommand{\fY}{\mathcal{Y}}\] Background and motivation Before machine learning (ML) algorithms can be deployed in real world settings, an estimate of their generalization error is required. Because data is finite, a model can at most be trained on \(N\)...

Read More
\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\mathbb{y}} \newcommand{\bx}{\mathbb{x}} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\mathbb{u}} \newcommand{\bX}{\mathbb{X}}\] Background The falling cost of sequencing technology has lead to a proliferation of biological datasets that are able to measure nuclear states such as relative amounts of protein expression via RNA-seq or targeted genotyping through next generation sequencing (NGS) platforms. These...

Read More
\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bbetaz}{\bbeta_0} \newcommand{\bhatk}{\hat{\beta}_k} \newcommand{\by}{\mathbb{y}} \newcommand{\bx}{\mathbb{x}} \newcommand{\be}{\mathbb{e}} \newcommand{\bxk}{\bx_k} \newcommand{\bu}{\mathbb{u}} \newcommand{\bX}{\mathbb{X}} \newcommand{\gram}{\bX^T\bX} \newcommand{\sqrtl}{\sqrt{\text{Lasso}} } \newcommand{\sqrtln}{\sqrt{\text{Lasso}}}\] Background Twenty years ago Tibshirani developed a penalized regression model known as the Lasso. Instead of using the L0-norm as the penalty term (equivalent to best-subset selection), the Lasso makes use of an L1-norm. This...

Read More
\[\newcommand{\Real}{\mathbb{R}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bbetah}{\hat{\bbeta}} \newcommand{\bbetahl}{\bbetah_\lambda} \newcommand{\LT}{\mathcal{L}_T} \newcommand{\PLT}{\mathcal{PL}_T} \newcommand{\LR}{\mathcal{L}_R} \newcommand{\by}{\mathbb{y}} \newcommand{\bX}{\mathbb{X}} \newcommand{\byR}{\by_R} \newcommand{\bXR}{\bX_R} \newcommand{\byT}{\by_T} \newcommand{\bXT}{\bX_T} \newcommand{\bU}{\mathbb{U}} \newcommand{\bV}{\mathbb{V}} \newcommand{\bD}{\mathbb{D}} \newcommand{\hone}{\hspace{1mm}} \newcommand{\hthree}{\hspace{3mm}} \newcommand{\hlam}{\hat{\lambda}}\] Background The superior prediction accuracy that machine learning (ML) models tend to have over traditional statistical approaches is largely driven by: Selecting parameter weights for prediction accuracy rather than statistical inference,...

Read More
Logistic regression (LR) is a type of classification model that is able to predict discrete or qualitative response categories. While LR usually refers to the two-class case (binary LR) it can also generalize to a multiclass system (multinomial LR) or the category-ordered situation (ordinal LR)[1]. By using a logit link,...

Read More
\(\newcommand{\bzi}{\mathbf{z}_i}\) \(\newcommand{\bZi}{\mathbf{Z}_i}\) \(\newcommand{\bmu}{\boldsymbol\mu}\) Introduction and motivation This post will consider how to adjust Kaplan-Meier (KM) survival curves, or non-parametric survival curves more generally, for an observational study setting by using inverse probability weights (IPWs). KM curves are frequently employed in survival analysis and clinical studies as they provide a visually...

Read More
When doing regression modeling, one will often want to use some sort of regularization to penalize model complexity, for reasons that I have discussed in many other posts. In the case of a linear regression, a popular choice is to penalize the L1-norm (sum of absolute values) of the coefficient...

Read More
Over the last fifteen years the field of biology has undergone a significant cultural change. The pipette is being replaced by the piping operator. At the recent Software Carpentry workshop that occurred at Queen’s University this week, I noticed that most of the people there to learn about UNIX programming...

Read More
Introduction Machine learning and traditional statistical inference have, until very recently, been running along separate tracks. In broad strokes, machine learning researchers were interested in developing algorithms which maximized predictive accuracy. Natural processes were seen as a black box which could be approximated by creative data mining procedures. This approach...

Read More
When fitting a distribution to a survival model it is often useful to re-parameterize it so that it has a more tractable scale[1]. However, estimating the parameters that index a distribution via likelihood methods is often easier in the original form, and therefore it is useful to be able to...

Read More
This post was created for students taking CISC 875 (Bioinformatics) and has two goals: (1) introduce the R programming language, and (2) demonstrate how to use some of the important Bioconductor packages for the analysis of gene expression datasets. R has become the dominant programming language for statistical computing in...

Read More
Background The advent of next-generation sequencing technology has given biologists a detailed resource with which they can better understand how cellular states are expressed in RNA sequence counts[1]. Statisticians have also been taking advantage of the NGS revolution by using machine learning algorithms to handle these over-determined datasets[2] and classify...

Read More
Understanding the dynamics of survival times in clinical settings is important to both medical practitioners and patients. In statistics, time-to-event analysis models a continuous random variable \(T\), which represents the duration of a state. If the state is “being alive”, then the time to event is mortality, and we refer...

Read More
Introduction Micro RNAs (miRNAs) are a small RNA molecule, around 22 base pairs long[1], that are able to regulate gene expression by silencing specific RNAs. While these molecules were first discovered in the 1990s, their biological significance wasn’t fully appreciated until the early 2000s when they were found in C....

Read More
Introduction For my Advanced Biostatistics course this semester I gave a presentation about the problem of batch effects in microarray data analysis and I feel that it is worth expanding on in a post. DNA microarrays allow for the simultaneous measurement of thousands of genes from a cell sample. The...

Read More
Introduction and motivation In Economics, my previous field of study, empirical research was often interested in estimating the parameters of a theoretical model. Such parameters of interest might include the risk premium embedded in the term structure of interest rates or the elasticity of the labour supply. The main data...

Read More
My recent experience in blogging with the Ghost platform has been fairly satisfactory. When it comes to writing book reviews, or making short posts, writing up a post in Markdown/html is a fairly efficient process. However as a fledging Bioeconometrician, many of the topics I want to post about relate...

Read More