Linear Models and \(p > n\): What Can You Do?
Background
It seems intuitive that you cannot fit a regression with more predictors than observations. With \(p\) parameters and only \(n < p\) data points, the reasoning goes, you have more unknowns than equations, the system is underdetermined, and the whole enterprise collapses. Geneticists know this well.
This intuition is half right. It is true that when \(p > n\) you cannot recover the coefficient vector \(\beta\) — not even in principle, not with infinite sample or computing power. But it does not follow that the model is useless. The thing you usually care about — the prediction \(X\beta\) — can be perfectly well-defined even when \(\beta\) itself is hopelessly ambiguous. Coefficients that are individually unknowable can combine into a quantity that is pinned down exactly. I find this one of the more beautiful facts in linear-model theory, and it sits at the heart of a recent expository paper by Ronald Christensen (2026).
I will spend most of the post on why \(\beta\) is non-identifiable and what survives anyway, then briefly sketch the menu of estimators people actually reach for in this regime, and what “prediction” even means when half the parameter space is invisible to the data.
Notation
Start with the standard linear model
\[Y = X\beta + e, \qquad \mathbb{E}(e) = 0, \qquad \text{cov}(e) = \sigma^2 I_n,\]
where
- \(Y\) is an \(n\)-vector of outcomes,
- \(X\) is a known \(n \times p\) design matrix,
- \(\beta\) is the \(p\)-vector of unknown coefficients, and
- \(e\) is mean-zero noise.
The case of interest is \(p > n\), which forces the rank of \(X\) to satisfy \(r(X) \le n < p\). Two subspaces do all the work.
- The column space \(C(X) \subseteq \mathbb{R}^n\) is the set of all fitted-value vectors \(X\beta\) the model can produce.
- The row space \(C(X') \subseteq \mathbb{R}^p\) is the span of the rows of \(X\) — the directions in coefficient space that the data actually “see.”
Let \(N\) denote the (orthogonal) projection onto the row space \(C(X')\), so that any coefficient vector splits cleanly into
\[\beta = \underbrace{N\beta}_{\text{in the row space}} + \underbrace{(I - N)\beta}_{\text{orthogonal to it}}.\]
That decomposition is the whole story, so it is worth keeping in view.
A Closer Look
Why \(\beta\) cannot be identified
Here is the crux. Every row of \(X\) lives in \(C(X')\) by definition, which means \(X\) annihilates anything orthogonal to the row space: \(X(I - N) = 0\). Multiply the model through and the consequence is immediate —
\[X\beta = X\big(N\beta + (I-N)\beta\big) = X N\beta.\]
The component \((I - N)\beta\) contributes nothing to \(X\beta\), and therefore nothing to \(\mathbb{E}(Y)\). The data are generated by \(N\beta\) alone. Whatever value \((I-N)\beta\) takes — zero, enormous, anything — the distribution of \(Y\) is exactly the same. No estimator can distinguish between two coefficient vectors that agree on their row-space part and differ off it, because they produce statistically identical data.
When \(p \le n\) and \(X\) has full column rank, the row space is all of \(\mathbb{R}^p\), the projection \(N\) is the identity, \((I-N)\beta\) vanishes, and this issue never arises — which is why we rarely think about it. But the moment \(p > n\), the row space is a proper subspace of \(\mathbb{R}^p\), the orthogonal complement is non-trivial, and a whole \((p - r)\)-dimensional flat of coefficient vectors fits the data equally well. That is non-identifiability, stated geometrically: \(\beta\) is unknowable precisely along the directions the design never probes.
Non-unique coefficients, unique predictions
Now the payoff. The fitted-value vector is
\[X\hat\beta = X N\hat\beta,\]
and this object is unique even though \(\hat\beta\) is not. Geometrically, fitting projects \(Y\) onto the column space \(C(X)\); that projection is a single, well-defined point regardless of which of the infinitely many least-squares solutions you happened to compute. The same is true of the row-space part of the coefficients: \(N\beta\) is identifiable and estimable, while \((I - N)\beta\) is not. So the situation is not “everything is lost,” but rather a clean partition — the row-space shadow of \(\beta\) is recoverable, its orthogonal complement is forever hidden, and predictions live entirely in the recoverable part.
This is why high-dimensional regression is not the doomed exercise the equation-counting heuristic suggests. The coefficients are a fiction the data cannot fully resolve, but the model’s predictions are perfectly real. It is also a useful corrective to a habit many of us have — reading individual coefficients as if they meant something. With \(p > n\), an individual \(\hat\beta_j\) is whatever your estimator’s tie-breaking rule decided to make it; only its projection onto the row space carries information. (This tension between estimating coefficients and estimating predictions shows up even in well-posed problems; I have written about it in the difference between causal and predictive models.)
Four ways to pick a \(\beta\)
If infinitely many coefficient vectors fit, every estimator is really a rule for choosing one — equivalently, a rule for deciding what to do off the row space, where the data are silent. Christensen lays out four that practitioners actually use, and the unifying thread is that they are all functions of the ordinary least-squares fit; they differ only in how they pin down the ambiguous directions.
Minimum-norm least squares. Among all least-squares solutions, take the one with the smallest \(\|\beta\|\). This is exactly \(\hat\beta_m = N\hat\beta\) — the solution that sets the orthogonal component to zero and lives entirely in the row space \(C(X')\). It is unique, it is what gradient descent from the origin converges to, and it is the natural reference point for the others.
Penalized (regularized) least squares. Minimize \(\|Y - X\beta\|^2 + \lambda\, \mathcal{P}(\beta)\) for a penalty \(\mathcal{P}\) — ridge (\(\ell_2\)), the LASSO (\(\ell_1\)), elastic net, and so on. The penalty breaks the tie by preferring small or sparse coefficients, which generally makes the estimate unique.
Spectral shrinkage. Working from the singular value decomposition of \(X'X\), shrink the components of the fit associated with each eigen-direction by a chosen factor. Ridge regression is the canonical special case (it shrinks the direction with eigenvalue \(s\) by \(s/(s + \lambda)\)), and as \(\lambda \to 0\) ridge converges to the minimum-norm solution.
Principal component regression. Keep only the leading principal components of \(X\) and regress on those, discarding the low-variance directions entirely. This is a hard-thresholding cousin of spectral shrinkage — zero weight below the cutoff, full weight above.
The differences among them are real but secondary to the shared structure: each one estimates the identifiable part \(N\beta\) from the data and then imposes some discipline on the unidentifiable part. For the theory behind when these choices recover the truth, see my post on the LASSO’s theoretical guarantees.
Bottom Line
- With \(p > n\), the coefficient vector \(\beta\) is not identifiable — the design only sees its projection \(N\beta\) onto the row space, and the orthogonal part \((I-N)\beta\) is invisible to the data and to any prior.
- Yet the fitted values \(X\hat\beta\) and the row-space part \(N\beta\) are unique and estimable. Non-unique coefficients, unique predictions: that is the central fact, and it is what keeps high-dimensional regression alive.
- Every estimator is a rule for choosing among the infinitely many fits, agreeing on \(N\beta\) and differing only off the row space.
References
Christensen, R. (2026). Linear Model Estimation and Prediction for \(p > n\). The American Statistician, 80(2), 232–240.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). New York: Springer.