A Brief Introduction to Conformal Inference

machine learning

Published

December 20, 2023

Background

Traditional confidence intervals estimate the range in which a population parameter, such as a mean or regression coefficient, is likely to fall with a specified level of confidence (e.g., 95%). They reflect the uncertainty in estimating a fixed but unknown parameter based on observing only a sample of the population. But what if we are interested in measuring uncertainty in the context of making predictions? Can we use traditional methods, or do we need a new framework?

Prediction intervals provide a range within which a future individual observation is expected to fall with a given probability. Since they account for both the uncertainty in estimating the mean and the inherent variability of individual observations, they are typically wider than confidence intervals. While a confidence interval narrows as sample size increases, a prediction interval remains relatively wide due to the irreducible noise in individual data points. For simplicity, I will use both terms interchangeably.

Conformal inference offers a formalized framework for building such prediction intervals in machine learning models. This article provides a gentle introduction to the main idea behind conformal inference, presenting a new way of thinking about uncertainty in the context of machine learning (i.e., prediction) models.

As a teaser, the basic idea behind the method rests on a simple result about sample quantiles. Let me explain.

Notation

Let’s imagine a size n i.i.d. sample of an outcome variable $Y$ and a covariate vector $X$, $(X_1, Y_1) \dots (X_n, Y_n)$. Conformal inference is concerned with building a “confidence interval” for a new outcome observation $Y_{n+1}$ from a new feature realization $X_{n+1}$.

Importantly, this interval should be valid:

in finite samples (i.e., non-asymptotically),
without assumptions on the data generating process, and
for any estimator of the regression function, $\mu(x)=E[Y \mid X=x]$.

In mathematical notation, given a significance level , we want to construct a confidence interval $CI(X_{n+1})$ satisfying the above properties and such that:

\[P(Y_{n+1} \in CI(X_{n+1})) \geq 1-\alpha.\]

A Closer Look

Refresher on Sample Quantiles

I will start with reviewing sample quantiles. Given an i.i.d. sample, $U_1, \dots, U_n$, the ($1-\alpha$)th quantile is the value $\hat{q}_{1-\alpha}$ such that approximately $(1-\alpha)\times100\%$ of the data is smaller than it. For instance, the $95$th quantile (sometimes also called percentile) is the value for which $95\%$ of the observations are at least as small.

So, given a new observation $U_{n+1}$, we know that:

\[P(U_{n+1}\leq \hat{q}_{1-\alpha})\geq 1-\alpha.\]

The Naïve Approach

Let’s turn back to the regression example with Y and X. We are given a new observation $X_{n+1}$ and our focus is on $Y_{n+1}$. Following the fact described above, a naïve way to construct a confidence interval for $Y_{n+1}$ is as follows:

\[CI^{\text{naïve}}_{1-\alpha}(X_{n+1}) = \left[ \hat{\mu}(X_{n+1}) \pm \hat{q}^{\mid \hat{u} \mid}_{1-\alpha} \right].\]

Here $\mu(\cdot)$ is an estimate of the regression function $E[Y \mid X]$, and $\hat{q}^{\mid \hat{u} \mid}_{1-\alpha}$ is the $(1-\alpha)$th quantile of empirical distribution function of the fitted residuals $\mid Y-\hat{\mu}(X) \mid$.

Put simply, we can look at an interval around our best prediction for $Y_{n+1}$ (i.e., $\hat{\mu}(X_{n+1})$) defined by the residuals estimated on the original data.

It turns out this interval is too narrow. In a series of papers Vladimir Vovk and co-authors show that the empirical distribution function of the fitted residuals is often biased downward and hence this interval is invalid. This is where conformal inference comes in.

Conformal Inference

Consider the following strategy. For each $y$ we fit a regression $_y $ on the sample $(Y_1, X_1),\dots (Y_n, X_n), (y, X_{n+1})$. We calculate the residuals $R^y_i$ for $i=1,\dots,n$ and $R^y_{n+1}$ and count the proportion of $R^y_i$’s smaller than $R^y_{n+1}$. Let’s call this number $\sigma(y)$. That is,

\[\sigma(y) = \frac{1}{n+1}\sum_{i=1}^{n+1} I (R^y_i \leq R^y_{n+1}),\]

where $I(\cdot)$ is the indicator function equal to one when the statement in the parenthesis is true and 0 if when it is not.

The test statistic $\sigma({Y_{n+1}})$ is uniformly distributed over the set $\{ \frac{1}{n+1}, \frac{2}{n+1},\dots, 1\}$, implying we can use $1-\sigma({Y_{n+1}})$ as a valid p-value for testing the null that $Y_{n+1}=y.$ Then, using the sample quantiles logic outlined above we arrive at the following confidence interval for $Y_{n+1}$:

\[ CI^{\text{conformal}}_{1-\alpha}(X_{n+1}) \approx \{ y\in \mathbb{R} : \sigma(y)\leq 1-\alpha \}.\]

This is summarized in the following procedure:

Algorithm:

For each value $y$:

Fit the regression function $\mu(\cdot)$ on $(X_1, Y_1), \dots, (X_n, Y_n), (X_{n+1}, y)$ using your favorite estimator/learner.
Calculate the $n+1$ residuals.
Calculate the proportion $\sigma(y)$.

Construct $CI = \{y: \sigma(y) \leq (1-\alpha)\}$.

Software Package: conformalInference

Two notes. First, conformal inference guarantees unconditional coverage. This is conceptually different and should not be confused with the conditional statement $P(Y_{n+1}\in CI(x) \mid X_{n+1}=x)\geq 1-\alpha$. The latter is stronger and more difficult to assert, requiring additional assumptions such as consistency of our estimator of $\mu(\cdot)$.

Second, this procedure can be computationally expensive. For a given value $X_{n+1}$ we need to fit a regression model and compute residuals for every $y$ which we consider including in the confidence interval. This is where split conformal inference comes in.

Split Conformal Inference

Split conformal inference is a modification of the original algorithm that requires significantly less computation power. The idea is to split the fitting and ranking steps, so that the former is done only once. Here is the algorithm.

Algorithm:

Randomly split the data in two equal-sized bins.
Get $\hat{\mu}$ on the first bin.
Calculate the residuals for each observation in the second bin.
Let $d$ be the $s$-th smallest residual, where $s=(\frac{n}{2}+1)(1-\alpha)$.
Construct $CI^{\text{split}}=[\hat{\mu}-d,\hat{\mu}+d]$.

A downside of this splitting approach is the introduction of extra randomness. One way to mitigate this is to perform the split multiple times and construct a final confidence interval by taking the intersection of all intervals. The aggregation decreases the variability from a single data split and, as this paper shows, still remains valid. Similar random split aggregation has also been used in the context of statistical significance in high-dimensional models.

An Example

I used the popular iris dataset to try out the R package conformalInference. Like most of my data demos, this is meant to be a mere illustration and you should not take the results seriously.

The outcome variable was Sepal.Length, and the matrix $X$ included sepal.width, petal.length, petal.width, species_setosa, species_versicolor, and species_virginica. Some of these were categorical in which case I converted them to a bunch of binary variables. I used the first $148$ observations to estimate the regression function $\mu(X)$ using lasso and the $149$th row to form the prediction (i.e., the test set).

Here is the code.

# clear workspace and load packages
rm(list=ls())
library(conformalInference)
library(glmnet)
library(tidyverse)

# load the iris dataset
data <- iris

# clean data
colnames(data) <- tolower(colnames(data))

# one-hot encode the species variable
data <- data %>%
  mutate(species_setosa = as.integer(species == 'setosa'),
         species_versicolor = as.integer(species == 'versicolor'),
         species_virginica = as.integer(species == 'virginica')) %>%
  dplyr::select(-species)

# check for missing values (none in iris, but included for completeness)
data <- na.omit(data)

# split training/test data
data0 <- data %>% filter(row_number() == nrow(data))  # last row as test set
data <- data %>% filter(row_number() < nrow(data))   # remaining rows as training set

# select variables X, Y
y <- data$sepal.length  # target variable
x <- data %>% dplyr::select(-sepal.length)  # predictors
x <- as.matrix(x)
x0 <- data0 %>% dplyr::select(-sepal.length)  # test predictors
x0 <- as.matrix(x0)
n <- nrow(x)

# use lasso to estimate mu
out.gnet = glmnet(x, y, nlambda=100, lambda.min.ratio=1e-3)
lambda = min(out.gnet$lambda)
funs = lasso.funs(lambda=lambda)

# run conformal inference
out.conf = conformal.pred(x, y, x0, 
                          alpha=0.1,
                          train.fun=funs$train, 
                          predict.fun=funs$predict, 
                          verb=TRUE)

# run split conformal inference
out.split = conformal.pred.split(x, y, x0, 
                                 alpha=0.1,
                                 train.fun=funs$train, 
                                 predict.fun=funs$predict, 
                                 verb=TRUE)

# print results
paste('The lower bound is', out.conf$lo, 'and the upper bound is', out.conf$up)
> [1] "The lower bound is 5.89 and the upper bound is 6.68"
out.conf$pred
>         [,1]
> [1,] 6.316882

# print results for split conformal inference
paste('The lower bound is', out.split$lo, 'and the upper bound is', out.split$up)
> [1] "The lower bound is 5.74 and the upper bound is 6.93"
out.split$pred
>          [,1]
> [1,] 6.33556

The actual age value in the test set was $6.2$ while the conformal inference approach computed a confidence interval ($5.88, 6.68$). The splitting algorithm gave similar results.

Bottom Line

Conformal inference offers a novel approach for constructing valid finite-sample prediction intervals in machine learning models.

Where to Learn More

Conformal inference in machine learning is an ongoing research topic and I do not know of any review papers or textbook treatments of the subject. If you are interested in learning more, check the paper referenced below.

References

Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094-1111.

Lei, J., Rinaldo, A., & Wasserman, L. (2015). A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74, 29-43.

Shafer, G., & Vovk, V. (2008). A Tutorial on Conformal Prediction. Journal of Machine Learning Research, 9(3).

Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world (Vol. 29). New York: Springer.