Vasco Yasenov
  • About Me
  • CV
  • Blog
  • Research
  • Software
  • Others
    • Methods Map
    • Kids Books
    • TV Series Ratings

On this page

  • Background
  • Notation
  • A Closer Look
    • Prediction Error
    • Estimation Error
    • Support Recovery
  • Bottom Line
  • Where to Learn More
  • References

Lasso and Its Theoretical Guarantees

machine learning
variable selection
statistical inference
Published

June 11, 2026

7 min read

Background

Every data scientist has heard of the Lasso, and I have written plenty about it here as well — its many flavors, its oracle property, and its use in estimating heterogeneous treatment effects, among other things. It is appealing for several reasons: simplicity, cheap and fast computation, interpretability, and a solid theoretical foundation.

In this note I focus on that last item — what makes the Lasso attractive from a theoretical standpoint. The goal is to give a high-level overview of the kinds of guarantees the Lasso promises. The literature here is heavy and daunting, so I keep the mathematical rigor to a minimum. I am after intuition: what can the Lasso deliver and under what conditions?

The punchline, which I will return to repeatedly, is that these guarantees are not free. Each one rests on assumptions about the design matrix, the noise, and the true signal — and those assumptions get stronger as the promises get bolder.

Notation

Consider the usual linear model

\[Y = X\beta^* + \varepsilon, \quad \varepsilon \sim (0, \sigma^2 I_n),\]

where \(Y \in \mathbb{R}^n\) is the outcome, \(X \in \mathbb{R}^{n \times p}\) is the design matrix, \(\beta^* \in \mathbb{R}^p\) is the true coefficient vector, and \(\varepsilon\) is mean-zero noise with variance \(\sigma^2\). The interesting regime is high-dimensional, where \(p\) can be comparable to or much larger than \(n\).

The signal is assumed sparse: only a few coordinates of \(\beta^*\) are non-zero. Let

\[S = \{j : \beta^*_j \neq 0\}, \quad s = |S|\]

denote the support and its size. Sparsity is what makes the problem tractable when \(p \gg n\).

The Lasso can be written in two equivalent ways. The penalized (Lagrangian) form is

\[\hat\beta = \arg\min_{\beta} \left( \frac{1}{2n} \| Y - X\beta \|_2^2 + \lambda \|\beta\|_1 \right),\]

and the constrained form is

\[\hat\beta = \arg\min_{\beta} \frac{1}{2n} \| Y - X\beta \|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \leq t.\]

For every \(\lambda\) there is a \(t\) that yields the same solution. The \(\ell_1\) penalty \(\|\beta\|_1 = \sum_j |\beta_j|\) is what does the work: unlike the \(\ell_2\) (ridge) penalty, it has corners that produce exact zeros, giving simultaneous shrinkage and selection. The tuning parameter \(\lambda \geq 0\) controls the strength of regularization.

A Closer Look

I will examine three broad groups of theoretical results. Each comes with its own assumptions and works only under specific conditions — sparse true support, a particular growth rate for \(n\) relative to \(p\), a well-behaved design matrix, and so on. I have ordered the sections in roughly ascending order of how demanding those assumptions are. This ordering is the most important takeaway of the post: prediction is cheap, estimation costs more, and exact support recovery is expensive.


Prediction Error

This is the most basic guarantee. It says the Lasso will predict “well enough” — the in-sample prediction error will not exceed a certain bound.

Formally, with the penalty chosen at the canonical rate \(\lambda \asymp \sigma\sqrt{\tfrac{\log p}{n}}\), one can show that with high probability

\[\frac{1}{n} \| X(\hat\beta - \beta^*) \|_2^2 \;\lesssim\; \lambda \, \|\beta^*\|_1 \;\asymp\; \sigma \, \|\beta^*\|_1 \sqrt{\frac{\log p}{n}}.\]

This is the so-called slow rate.

What is remarkable is how little it requires: essentially no conditions on the design matrix \(X\) beyond standardizing the columns. All you need is for \(\lambda\) to be large enough to dominate the noise — which is exactly what the \(\sqrt{\log p / n}\) scaling ensures. The \(\log p\) term is the price of searching over \(p\) candidate predictors, and it enters only logarithmically, which is why the Lasso tolerates enormous \(p\).

If you are willing to assume more about the design (see the next section), the bound sharpens to the fast rate,

\[\frac{1}{n} \| X(\hat\beta - \beta^*) \|_2^2 \;\lesssim\; \frac{s\, \sigma^2 \log p}{n},\]

which scales with the number of true signals \(s\) rather than with \(\|\beta^*\|_1\).


Estimation Error

Here the question shifts from predictions \(X\hat\beta\) to the coefficients \(\hat\beta\) themselves — how close is \(\hat\beta\) to \(\beta^*\)? This is a harder problem, and it requires a condition on the design matrix.

The standard assumption is the restricted eigenvalue (RE) condition. Loosely, it requires that \(\tfrac{1}{n} X^\top X\) behave like a well-conditioned matrix — bounded away from singularity — but only over the directions that matter, namely vectors that are approximately sparse. We cannot ask \(X^\top X\) to be invertible outright, since \(p > n\) makes that impossible; the restriction to a sparse cone is what rescues the argument.

Under the RE condition with constant \(\kappa > 0\), and again \(\lambda \asymp \sigma\sqrt{\tfrac{\log p}{n}}\),

\[\|\hat\beta - \beta^*\|_2 \;\lesssim\; \frac{\sigma}{\kappa} \sqrt{\frac{s \log p}{n}}, \qquad \|\hat\beta - \beta^*\|_1 \;\lesssim\; \frac{\sigma s}{\kappa} \sqrt{\frac{\log p}{n}}.\]

The intuition is that strong correlations among predictors flatten the objective function in certain directions, making the coefficients poorly identified even when predictions are fine. The RE condition rules out the worst of this. When it holds, the \(\ell_2\) error decays at essentially the rate an oracle who knew \(S\) would achieve, up to the \(\log p\) factor.

You know to never take the coefficients — or the implied \(p\)-values — from a Lasso fit too seriously without an extreme degree of caution. The shrinkage that makes the Lasso work also biases every non-zero coefficient toward zero, and standard inference ignores the selection step entirely. I have written about hypothesis testing in linear ML models elsewhere on the blog; the estimation bounds above are about rates, not about valid confidence intervals.


Support Recovery

This is the most intriguing guarantee and, fittingly, the most expensive. It is relevant for model selection rather than prediction: it asks whether the Lasso picks out exactly the right set of variables, \(\hat S = S\), where \(\hat S = \{j : \hat\beta_j \neq 0\}\).

The goal is selection consistency,

\[\mathbb{P}(\hat S = S) \to 1.\]

Two strong conditions are typically needed. The first is the irrepresentable condition, which limits how strongly the irrelevant predictors (\(X_{S^c}\)) can be correlated with the relevant ones (\(X_S\)):

\[\big\| X_{S^c}^\top X_S (X_S^\top X_S)^{-1} \operatorname{sign}(\beta^*_S) \big\|_\infty \leq 1 - \eta\]

for some \(\eta > 0\). This is close to necessary, not merely sufficient — if it fails, the Lasso tends to bring spurious variables into the model. The second is a beta-min condition: the smallest non-zero coefficient must be detectably large, \(\min_{j \in S} |\beta^*_j| \gtrsim \sqrt{\tfrac{\log p}{n}}\). Signals weaker than the noise floor simply cannot be distinguished from zero.

The practical implication is sobering. Exact support recovery is fragile: it demands that the truly important variables be only mildly correlated with the unimportant ones, and that no real signal be too small. Both are easy to violate with observational data. This is also why I treated this topic separately in my post on the oracle property — standard Lasso generally does not achieve oracle-style selection, and the fixes (adaptive Lasso, nonconvex penalties like SCAD and MCP) exist precisely to relax the irrepresentable condition.


Bottom Line

  • The Lasso is among the most popular machine learning methods, and a good part of its appeal comes from a strong theoretical foundation.
  • Under suitable conditions, one can show that the Lasso performs well across three distinct tasks — prediction, coefficient estimation, and exact support recovery..
  • Prediction error is nearly free; estimation error needs a restricted-eigenvalue-type condition on the design; support recovery additionally needs an irrepresentable condition and a minimum-signal-strength condition.
  • Not all of these guarantees will hold on any given dataset. The researcher must evaluate the regularity conditions carefully for each claim she makes.

Where to Learn More

For a focused, modern treatment, “Statistical Learning with Sparsity” by Hastie, Tibshirani, and Wainwright (2015) is the natural starting point, with “The Elements of Statistical Learning” (2009) as gentler background. For the theory nerds, Bühlmann and van de Geer’s “Statistics for High-Dimensional Data” (2011) and Wainwright’s “High-Dimensional Statistics” (2019) are the standard references, though both are too dense for most mortals.

References

  • Belloni, A., Chernozhukov, V., & Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4), 791–806.

  • Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4), 1705–1732.

  • Bühlmann, P., & van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.

  • Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.

  • Raskutti, G., Wainwright, M. J., & Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over \(\ell_q\)-balls. IEEE Transactions on Information Theory, 57(10), 6976–6994.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

  • van de Geer, S., Bühlmann, P., Ritov, Y., & Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.

  • Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell_1\)-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.

  • Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563.

© 2025 Vasco Yasenov

 

Powered by Quarto