Vasco Yasenov

Advanced Topics in Statistical Data Science

 

The Kolmogorov–Smirnov Test as a Goodness-of-Fit Test

statistical inference
The Kolmogorov–Smirnov (KS) test is a staple in the statistical toolbox for checking how well data fit a hypothesized distribution. It comes in both a one-sample and a…
May 5, 2025
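A minimal sketch of the one-sample version in Python (the standard-normal null hypothesis and the simulated data are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.2, scale=1.0, size=500)  # data whose fit we want to check

# One-sample KS test of the data against a hypothesized N(0, 1) distribution
stat, pval = stats.kstest(x, "norm")
print(f"KS statistic = {stat:.3f}, p-value = {pval:.4f}")
```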
 

Jackknife vs. Bootstrap: A Tale of Two Resamplers

bootstrap
statistical inference
If you’ve ever dived into resampling methods, you’ve likely come across the jackknife and the bootstrap. They both aim to help us estimate uncertainty or bias without…
May 4, 2025
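A quick side-by-side illustration of the two resamplers, estimating the standard error of a sample mean on synthetic data (the closed-form SE is printed as a sanity check):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)
n = len(x)

# Jackknife: recompute the mean leaving out one observation at a time
jack = np.array([np.delete(x, i).mean() for i in range(n)])
se_jack = np.sqrt((n - 1) / n * np.sum((jack - jack.mean()) ** 2))

# Bootstrap: recompute the mean on samples drawn with replacement
boot = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(2000)])
se_boot = boot.std(ddof=1)

print(se_jack, se_boot, x.std(ddof=1) / np.sqrt(n))  # all three should be close
```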
 

The Roles of Covariates in Randomized Experiments

randomized experiments
causal inference
Properly implemented randomized experiments—such as randomized controlled trials (RCTs) and A/B tests—guarantee unbiased estimates of the causal effect of a treatment \(T\) o…
May 2, 2025
 

Causal Inference with Residualized Regressions

causal inference
linear models
The Frisch-Waugh-Lovell (FWL) theorem offers an elegant alternative to standard multivariate linear regression when estimating causal effects. Instead of running a full…
May 1, 2025
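A small numerical sketch of the FWL result using plain least squares on simulated data (variable names and the data-generating process are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 2))                 # covariates
T = 0.5 * X[:, 0] + rng.normal(size=n)      # treatment, correlated with X
y = 2.0 * T + X @ np.array([1.0, -1.0]) + rng.normal(size=n)

def ols_resid(a, B):
    """Residuals from regressing a on B (with an intercept)."""
    B1 = np.column_stack([np.ones(len(B)), B])
    coef, *_ = np.linalg.lstsq(B1, a, rcond=None)
    return a - B1 @ coef

# FWL: regress the residualized outcome on the residualized treatment
y_res, t_res = ols_resid(y, X), ols_resid(T, X)
beta_fwl = (t_res @ y_res) / (t_res @ t_res)

# Full multivariate regression for comparison
Z = np.column_stack([np.ones(n), T, X])
beta_full = np.linalg.lstsq(Z, y, rcond=None)[0][1]
print(beta_fwl, beta_full)  # numerically identical
```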
 

Causal vs. Predictive Modeling: Subtle, but Crucial Differences

causal inference
machine learning
It’s one of the most common mix-ups I see among data scientists—especially those coming from a machine learning background: confusing causal modeling with predictive…
Apr 30, 2025
 

The Two Types of Weights in Causal Inference

weights
causal inference
Causal inference fundamentally seeks to answer: What is the effect of a treatment or intervention? The challenge lies in ensuring that the comparison groups—treated versus…
Feb 28, 2025

Binscatter: A New Visual Tool for Data Analysis

correlation
In the realm of data visualization, the classical scatter plot has long been a staple for exploring bivariate relationships. However, as datasets grow larger and more…
Feb 9, 2025
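A bare-bones version of the idea, assuming a simple quantile-binning scheme; the full binscatter machinery goes further, but this captures the core of the plot:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.normal(size=50_000)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)

# Cut x into 20 equal-sized bins and plot the bin means of x and y
df = pd.DataFrame({"x": x, "y": y})
bins = df.groupby(pd.qcut(df["x"], q=20, duplicates="drop"), observed=True).mean()

plt.scatter(bins["x"], bins["y"])
plt.xlabel("x (bin means)")
plt.ylabel("y (bin means)")
plt.show()
```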
 

Filling in Missing Data with MCMC

missing data
Every dataset inevitably contains missing or incomplete values. Practitioners then face the dilemma of how to address these missing observations. A common approach, though…
Jan 31, 2025
 

The Limits of Semiparametric Models: The Efficiency Bound

statistical inference
semiparametric models
The efficiency bound is a cornerstone of the academic literature on semiparametric models, and it’s easy to see why. This bound quantifies the potential loss in efficiency…
Jan 22, 2025

The Limits of Nonparametric Models

statistical inference
nonparametric models
Nonparametric statistics offers a powerful toolkit for data analysis when the underlying data-generating process is too complex or unknown to be captured by parametric…
Jan 22, 2025
 

The Limits of Parametric Models: The Cramér-Rao Bound

statistical inference
parametric models
Obtaining the lowest possible variance is a primary goal for anyone working with statistical models. Efficiency (or precision), as the jargon goes, is a cornerstone of…
Jan 12, 2025
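In its simplest form, for an unbiased estimator \(\hat{\theta}\) based on \(n\) i.i.d. observations, the bound reads:

\[
\operatorname{Var}(\hat{\theta}) \ge \frac{1}{n\, I(\theta)},
\qquad
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial \theta} \log f(X;\theta)\right)^{2}\right],
\]

where \(I(\theta)\) is the per-observation Fisher information.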

The Three Classes of Statistical Models

statistical models
Statistical modeling is among the most exciting elements of working with data. When mentoring junior data scientists, I never fail to see the spark in their eyes when our…
Jan 12, 2025
 

The Delta Method: Simplifying Confidence Intervals for Complex Estimators

statistical inference
You’ve likely encountered this scenario: you’ve calculated an estimate for a particular parameter, and now you require a confidence interval. Seems straightforward, doesn’t…
Jan 10, 2025
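The approximation at the heart of the method, for a smooth transformation \(g\) of an asymptotically normal estimator:

\[
\sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{d} N(0, \sigma^{2})
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(\hat{\theta}) - g(\theta)\bigr) \xrightarrow{d} N\!\bigl(0,\; [g'(\theta)]^{2}\sigma^{2}\bigr),
\]

which yields the approximate interval \(g(\hat{\theta}) \pm z_{1-\alpha/2}\, |g'(\hat{\theta})|\, \hat{\sigma}/\sqrt{n}\).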
 

Stein’s Paradox: A Simple Illustration

statistical inference
paradox
In the realm of statistics, few findings are as counterintuitive and fascinating as Stein’s paradox. It defies our common sense about estimation and provides a glimpse into…
Jan 10, 2025

Mutual Information: What, Why, How, and When

correlation
When exploring dependencies between variables, the data scientist’s toolbox often relies on correlation measures to reveal relationships and potential patterns. But what if…
Jan 2, 2025
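A small sketch contrasting the Pearson correlation with an estimated mutual information on a nonmonotonic relationship (simulated data; scikit-learn's k-NN-based estimator is used for convenience):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(4)
x = rng.uniform(-np.pi, np.pi, size=5_000)
y = np.cos(x) + rng.normal(scale=0.2, size=x.size)   # strong but nonmonotonic dependence

print(np.corrcoef(x, y)[0, 1])                       # Pearson correlation is near 0
print(mutual_info_regression(x.reshape(-1, 1), y))   # mutual information is clearly positive
```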

Generating Variables with Predefined Correlation

correlation
Suppose you are working on a project where the relationship between two variables is influenced by an unobserved confounder, and you want to simulate data that reflects this…
Dec 20, 2024
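One common way to do this in the bivariate normal case, sketched with NumPy (the target correlation of 0.6 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, rho = 10_000, 0.6  # sample size and target Pearson correlation

# Draw from a bivariate normal with the desired correlation structure
cov = np.array([[1.0, rho],
                [rho, 1.0]])
x, y = rng.multivariate_normal(mean=[0, 0], cov=cov, size=n).T

print(np.corrcoef(x, y)[0, 1])  # close to 0.6
```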
 

Stratified Sampling with Continuous Variables

randomized experiments
causal inference
Stratified sampling is a foundational technique in survey design, ensuring that observations capture key characteristics of a population. By dividing the data into distinct…
Dec 18, 2024
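A minimal sketch of one approach: discretize the continuous variable into quantile-based strata and sample within each stratum (the income variable, number of strata, and sampling fraction are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=10_000)})

# Turn the continuous variable into decile strata, then sample within each
df["stratum"] = pd.qcut(df["income"], q=10, labels=False)
sample = df.groupby("stratum", group_keys=False).sample(frac=0.1, random_state=0)

print(sample["stratum"].value_counts().sort_index())  # balanced across strata
```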
 

Column-Sampling Bootstrap?

bootstrap
statistical inference
The bootstrap is a versatile resampling technique traditionally focused on rows. Let’s add a twist to the plain vanilla bootstrap. Imagine you have a wide dataset—many…
Dec 16, 2024
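A rough sketch of what resampling columns rather than rows might look like, under my own choice of statistic (a row-level composite score); this is not necessarily the scheme the post develops:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 40))  # wide dataset: 200 rows, 40 columns

# Statistic: a composite score per row, defined as the mean across columns
def composite(mat):
    return mat.mean(axis=1)

# Column-sampling bootstrap: resample *columns* with replacement to gauge
# how sensitive the composite is to the particular set of columns observed
draws = []
for _ in range(1000):
    cols = rng.integers(0, X.shape[1], size=X.shape[1])
    draws.append(composite(X[:, cols]))
draws = np.array(draws)                 # shape (1000, 200)
se_per_row = draws.std(axis=0, ddof=1)  # column-bootstrap SE of each row's score
print(se_per_row[:5])
```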
 

The Bootstrap and its Limitations

bootstrap
statistical inference
The bootstrap is a powerful resampling technique used to estimate the sampling distribution of a statistic. By repeatedly drawing observations with replacement from the…
Dec 16, 2024

Simpson’s Paradox: A Simple Illustration

paradox
causal inference
Simpson’s paradox is one of the most counterintuitive phenomena in data analysis. It describes situations where a trend observed within groups disappears—or even…
Dec 6, 2024
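A tiny illustrative table (made-up numbers) in which the treated group wins within every stratum yet loses in the pooled comparison:

```python
import pandas as pd

# Treatment is used mostly in the harder stratum, which drives the reversal
df = pd.DataFrame({
    "stratum":   ["easy", "easy", "hard", "hard"],
    "group":     ["treated", "control", "treated", "control"],
    "successes": [90, 850, 300, 30],
    "trials":    [100, 1000, 500, 60],
})
df["rate"] = df["successes"] / df["trials"]
print(df)  # treated beats control within each stratum

pooled = df.groupby("group")[["successes", "trials"]].sum()
print(pooled["successes"] / pooled["trials"])  # ...but loses in the pooled data
```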
 

Causation without Correlation

causal inference
correlation
While most people understand that correlation doesn’t imply causation, it might surprise many to learn that causation doesn’t always result in correlation. In the absence of…
Nov 21, 2024
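A quick demonstration of the classic case, where the outcome is entirely caused by \(X\) yet the Pearson correlation is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)                     # the cause
y = x**2 + rng.normal(scale=0.1, size=x.size)    # outcome fully driven by x

print(np.corrcoef(x, y)[0, 1])  # ~0: causation without (linear) correlation
```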
 

Gradient Boosting Methods: A Brief Overview

machine learning
Gradient boosting has emerged as one of the most powerful techniques for predictive modeling. In its simplest form, we can think of gradient boosting as having a team of…
Nov 6, 2024

Bayesian Analysis of Randomized Experiments: A Modern Approach

bayesian methods
randomized experiments
Imagine you’re a data scientist evaluating an A/B test of a new recommendation algorithm. The results show a modest but promising 0.5% lift in conversion rate—up from \(8\%\)…
Oct 29, 2024
 

Weights in Statistical Analyses

weights
statistical inference
Weights in statistical analyses offer a way to assign varying importance to observations in a dataset. Although powerful, they can be quite confusing due to the various…
Sep 18, 2024
 

Causality without Experiments, Unconfoundedness, or Instruments

causal inference
instrumental variables
Causality is central to many practical data-related questions. Conventional methods for isolating causal relationships rely on experimentation, assume unconfoundedness, or…
Aug 12, 2024
 

FOCI: A New Variable Selection Method

variable selection
machine learning
In our data-abundant world, we often have access to tens, hundreds, or even thousands of variables. Most of these features are usually irrelevant or redundant, leading to…
Jun 11, 2024
 

Nonlinear Correlations and Chatterjee’s Coefficient

correlation
Much of data science is concerned with learning about the relationships between different variables. The most basic tool to quantify relationship strength is the correlation…
Apr 12, 2024
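A quick implementation of the no-ties version of Chatterjee's \(\xi\) for intuition (simulated data; a sketch, not production code):

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank coefficient (formula for data without ties)."""
    n = len(x)
    order = np.argsort(x)                           # sort observations by x
    ranks = np.argsort(np.argsort(y[order])) + 1    # ranks of y in that order
    return 1 - 3 * np.sum(np.abs(np.diff(ranks))) / (n**2 - 1)

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, size=2_000)
y = x**2 + rng.normal(scale=0.1, size=x.size)       # strong but nonmonotonic dependence
print(np.corrcoef(x, y)[0, 1], chatterjee_xi(x, y))  # Pearson ~0, xi clearly positive
```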
 

A Brief Introduction to Conformal Inference

machine learning
Traditional confidence intervals estimate the range in which a population parameter, such as a mean or regression coefficient, is likely to fall with a specified level of…
Dec 20, 2023
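A minimal split-conformal sketch for regression, using a generic scikit-learn model (the model choice, data, and 10% miscoverage level are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=(3_000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=len(X))

# Split conformal: fit on one half, calibrate absolute residuals on the other
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_fit, y_fit)

alpha = 0.1
resid = np.abs(y_cal - model.predict(X_cal))
q = np.quantile(resid, np.ceil((1 - alpha) * (len(resid) + 1)) / len(resid))

x_new = np.array([[1.0]])
pred = model.predict(x_new)[0]
print(pred - q, pred + q)  # ~90% prediction interval for a new observation
```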
 

Using Conformal Inference for Variable Importance in Machine Learning

machine learning
Many machine learning (ML) methods operate as opaque systems, generating predictions when given a dataset as input. Identifying which variables have the greatest impact on…
Dec 20, 2023
 

New Developments in False Discovery Rate

multiple testing
statistical inference
A while back I wrote an article summarizing various approaches to correcting for multiple hypothesis testing. The dominant framework, False Discovery Rate (FDR), controls…
Oct 27, 2023
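For reference, the standard Benjamini-Hochberg adjustment is a one-liner with statsmodels (simulated p-values):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(11)
# 90 null p-values plus 10 p-values from a genuine signal
pvals = np.concatenate([rng.uniform(size=90), rng.uniform(0, 0.001, size=10)])

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject.sum(), "discoveries at FDR = 5%")
```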
 

ML-Based Regression Adjustments in Randomized Experiments

machine learning
randomized experiments
Randomized experiments are the gold standard for measuring causal relationships with data. In settings with small treatment effects or underpowered designs, a…
Aug 1, 2023
 

The Alphabet of Learners for Heterogeneous Treatment Effects

machine learning
randomized experiments
heterogeneous treatment effects
Numerous tales illustrate the inadequacy of the average to capture meaningful quantities. Statisticians love these. In my favorite one the protagonist places her head in a…
Jul 28, 2023
 

Lasso for Heterogeneous Treatment Effects Estimation

heterogeneous treatment effects
causal inference
Lasso is one of my favorite machine learning algorithms. It is so simple, elegant, and powerful. My feelings aside, Lasso indeed has a lot to offer. While, admittedly, it is…
Jun 30, 2023
 

An Overview of Machine Learning Methods in Causal Inference

machine learning
causal inference
The most exciting trend in causal inference over the last decade has been the infusion of machine learning (ML) techniques. Supervised machine learning is designed to find…
Apr 30, 2023
 

The Variance of Propensity Score Matching Estimators

propensity score
causal inference
Propensity score matching (PSM) is among the most popular methods for estimating causal effects with observational data. It lends its fame to both its power and simplicity.…
Mar 30, 2023
 

Correlation is a Cosine

correlation
statistical inference
You might have come across the statement, “correlation is a cosine,” but never taken the time to explore its precise meaning. It certainly sounds intriguing—how can the…
Feb 9, 2023
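The statement in symbols, for mean-centered data vectors:

\[
\operatorname{corr}(X, Y)
= \frac{\langle \tilde{x}, \tilde{y} \rangle}{\lVert \tilde{x} \rVert\, \lVert \tilde{y} \rVert}
= \cos\theta,
\qquad \tilde{x}_i = x_i - \bar{x},\quad \tilde{y}_i = y_i - \bar{y},
\]

where \(\theta\) is the angle between the two centered data vectors.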
 

Correlation is Not (Always) Transitive

correlation
statistical inference
At first, I found this really puzzling. \(X\) is correlated (Pearson) with \(Y\), and \(Y\) is correlated with \(Z\). Does this mean \(X\) is necessarily correlated with \(Z\)?…
Dec 22, 2022
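A quick counterexample: with \(Y = X + Z\) and \(X\), \(Z\) independent, \(Y\) correlates with both while \(X\) and \(Z\) do not correlate at all:

```python
import numpy as np

rng = np.random.default_rng(12)
n = 100_000
x = rng.normal(size=n)
z = rng.normal(size=n)   # independent of x
y = x + z                # correlated with both x and z

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print(corr(x, y), corr(y, z), corr(x, z))  # ~0.71, ~0.71, ~0
```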
 

Lord’s Paradox: A Simple Illustration

correlation
paradox
Lord’s paradox presents a fascinating challenge in causal inference and statistics. It highlights how different statistical methods applied to the same data can lead to…
Dec 18, 2022
 

Hypothesis Testing in Linear Machine Learning Models

hypothesis testing
machine learning
Machine learning models are an indispensable part of data science. They are incredibly good at what they are designed for – making excellent predictions. They fall short in…
Nov 6, 2022

Multiple Testing: Methods Overview

multiple testing
statistical inference
The abundance of data around us is a major factor making the data science field so attractive. It enables all kinds of impactful, interesting, or fun analyses. I admit this…
Oct 22, 2022
 

Hypothesis Testing with Population Data

hypothesis testing
statistical inference
Classical statistical theory is built on the idea of working with a sample of data from a given population of interest. Our software packages compute confidence intervals to…
Sep 23, 2022
 

Overlapping Confidence Intervals and Statistical (In)Significance

statistical inference
hypothesis testing
This is a mistake I’ve made myself—more times than I’d like to admit. Even seasoned professors and expert data scientists sometimes fall into the same trap.
Aug 12, 2022