The Many Flavors of Principal Component Analysis
Background
Principal component analysis (PCA) is one of those methods that everyone learns early and then quietly keeps using for years. The appeal is obvious: take a high-dimensional data matrix, rotate it into orthogonal directions of maximum variance, and keep only the first few directions. That gives you compression, visualization, denoising, and sometimes a useful preprocessing step for downstream models.
The common misconception is that PCA is a generic tool for finding the “most important” variables or the “true latent factors” in the data. It is neither. Classical PCA finds directions of high variance. That is often useful, but it is not the same thing as finding predictive features, interpretable components, or nonlinear structure. Once you keep that distinction straight, the many PCA variants make much more sense: each flavor modifies classical PCA to target a different practical goal.
In this post I will use the standard PCA formulation as the baseline and then focus on four variants that I think matter most in applied work. The goal is to get a broad sense of some of the most popular ways PCA has evolved over the years.
Notation
Let \(X \in \mathbb{R}^{n \times p}\) be a data matrix with rows as observations and columns as variables. Assume the columns have been centered so that
\[ \frac{1}{n}\sum_{i=1}^n X_{ij}=0 \qquad \text{for } j=1,\dots,p. \]
When variables are on very different scales, it is often better to standardize them as well and work with the correlation matrix rather than the covariance matrix. I will write the empirical covariance matrix as
\[ S = \frac{1}{n}X'X. \]
The first principal component loading vector \(v_1 \in \mathbb{R}^p\) solves
\[ v_1 = \arg\max_{\|v\|_2=1} v'Sv. \]
Subsequent components solve the same problem subject to orthogonality constraints. If \(V_k = [v_1,\dots,v_k]\), the corresponding score matrix is
\[ Z_k = XV_k. \]
Equivalently, if \(X = UDV'\) is the singular value decomposition, the columns of \(V\) are the loading vectors and the diagonal entries of \(D^2/n\) are the explained variances.
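This equivalence is easy to verify numerically. Here is a quick check with NumPy on a small simulated matrix (the dimensions are arbitrary): the squared singular values divided by \(n\) reproduce the eigenvalues of \(S\), and the right singular vectors match the loading vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)           # center the columns

S = X.T @ X / n                  # empirical covariance, S = X'X / n
eigvals, eigvecs = np.linalg.eigh(S)
eigvals = eigvals[::-1]          # eigh returns ascending order; flip to descending

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# d^2 / n reproduces the eigenvalues of S; the rows of Vt are the
# loading vectors, up to an arbitrary sign per component.
print(np.allclose(d**2 / n, eigvals))  # → True
```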
A Closer Look
Classical PCA
Classical PCA is the benchmark because its optimization problem is clean and its geometry is transparent. The first component is the unit vector that captures the most sample variance; the second is the best such vector orthogonal to the first; and so on. If the singular values of \(X\) are \(d_1 \ge \cdots \ge d_r\), then the proportion of variance explained by the first \(k\) components is
\[ \frac{\sum_{j=1}^k d_j^2}{\sum_{j=1}^r d_j^2}. \]
In practice, two issues matter more than the derivation. First, PCA is extremely sensitive to scaling. If one variable is measured in dollars and another in percentages, the dollar variable may dominate the first component unless the data are standardized. Second, variance is not the same thing as signal. A noisy feature with large variance can easily drive the first component. I treat classical PCA as a compression tool, not as an automatic discovery engine.
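To see the scaling point concretely, here is a small simulation (the variable names are made up for the demo): one variable measured in dollars and one in percentages. On the raw data, the first component is essentially the dollar variable; standardizing removes that artifact.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
income_dollars = rng.normal(50_000, 10_000, n)   # variance on the order of 1e8
rate_percent = rng.normal(5.0, 1.0, n)           # variance on the order of 1
X = np.column_stack([income_dollars, rate_percent])

pca_raw = PCA(n_components=1).fit(X)
pca_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

# Without standardization, PC1 puts essentially all its weight on dollars.
print(np.round(np.abs(pca_raw.components_[0]), 3))   # ~ [1.0, 0.0]
print(np.round(np.abs(pca_std.components_[0]), 3))
```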
library(stats)
set.seed(1988)
n <- 300
p <- 8
# Simulate a low-rank signal with two latent factors
latent_factors <- matrix(rnorm(n * 2), n, 2)
loadings_true <- matrix(c(
0.9, 0.0,
0.8, 0.1,
0.7, -0.1,
0.6, 0.2,
0.0, 0.8,
0.1, 0.7,
-0.1, 0.6,
0.2, 0.5
), nrow = p, byrow = TRUE)
X <- latent_factors %*% t(loadings_true) + matrix(rnorm(n * p, sd = 0.3), n, p)
colnames(X) <- paste0("feature_", 1:p)
# Standardize before PCA because variables may be on different scales
pca_fit <- prcomp(X, center = TRUE, scale. = TRUE)
# Explained variance ratio
explained_var <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(explained_var[1:4], 3)
# First two loading vectors
round(pca_fit$rotation[, 1:2], 3)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
np.random.seed(1988)
n = 300
p = 8
latent_factors = np.random.randn(n, 2)
loadings_true = np.array([
[0.9, 0.0],
[0.8, 0.1],
[0.7, -0.1],
[0.6, 0.2],
[0.0, 0.8],
[0.1, 0.7],
[-0.1, 0.6],
[0.2, 0.5],
])
X = latent_factors @ loadings_true.T + np.random.randn(n, p) * 0.3
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
pca = PCA(n_components=4)
pca.fit(X_std)
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("First two loading vectors:\n", np.round(pca.components_[:2].T, 3))
Sparse PCA
Sparse PCA modifies the loading vectors so that many coordinates are exactly zero. A convenient way to write the idea is
\[ \max_{\|v\|_2=1} v'Sv \qquad \text{subject to } \|v\|_1 \le c, \]
or, equivalently, with an \(\ell_1\) penalty on the loadings. The point is not to improve the mathematics of PCA. The point is to make the components readable.
This matters when \(p\) is large and the classical loading vectors spread small weight across almost every variable. In genomics, marketing, or text applications, that is often useless from a substantive perspective. Sparse PCA forces each component to be built from a smaller set of variables. The tradeoff is that you lose some variance explained, orthogonality becomes less clean, and the components can be more sensitive to tuning choices. In practice, I reach for Sparse PCA when interpretation matters at least as much as compression.
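The variance tradeoff can be made concrete. The sketch below (an illustrative simulation; the `alpha` value is an arbitrary choice, not a recommendation) compares the sample variance captured by the dense first component with that captured by a sparse first component on the same standardized data. By construction the dense component can never capture less.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1988)
n, p = 200, 12
latent = rng.standard_normal(n)
X = np.column_stack([
    latent + rng.standard_normal(n) * 0.2,
    0.9 * latent + rng.standard_normal(n) * 0.2,
    0.8 * latent + rng.standard_normal(n) * 0.2,
    0.7 * latent + rng.standard_normal(n) * 0.2,
    rng.standard_normal((n, p - 4)),
])
X = StandardScaler().fit_transform(X)

def captured_variance(v):
    # Sample variance of the projection onto the unit direction v
    norm = np.linalg.norm(v)
    if norm == 0:
        return 0.0
    return np.var(X @ (v / norm))

dense_v = PCA(n_components=1).fit(X).components_[0]
sparse_v = SparsePCA(n_components=1, alpha=1.0, random_state=0).fit(X).components_[0]

print("dense: ", round(captured_variance(dense_v), 3),
      "nonzeros:", np.count_nonzero(dense_v))
print("sparse:", round(captured_variance(sparse_v), 3),
      "nonzeros:", np.count_nonzero(sparse_v))
```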
# install.packages("elasticnet")
library(elasticnet)
set.seed(1988)
n <- 200
p <- 12
latent_factor <- rnorm(n)
X <- cbind(
latent_factor + rnorm(n, sd = 0.2),
0.9 * latent_factor + rnorm(n, sd = 0.2),
0.8 * latent_factor + rnorm(n, sd = 0.2),
0.7 * latent_factor + rnorm(n, sd = 0.2),
matrix(rnorm(n * (p - 4)), n, p - 4)
)
X <- scale(X)
# Ask for two sparse components with at most 4 nonzero loadings each
spca_fit <- spca(X, K = 2, type = "predictor", sparse = "varnum", para = c(4, 4))
round(spca_fit$loadings[, 1:2], 3)
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.preprocessing import StandardScaler
np.random.seed(1988)
n = 200
p = 12
latent_factor = np.random.randn(n)
X = np.column_stack([
latent_factor + np.random.randn(n) * 0.2,
0.9 * latent_factor + np.random.randn(n) * 0.2,
0.8 * latent_factor + np.random.randn(n) * 0.2,
0.7 * latent_factor + np.random.randn(n) * 0.2,
np.random.randn(n, p - 4)
])
X_std = StandardScaler().fit_transform(X)
spca = SparsePCA(n_components=2, alpha=1.0, random_state=1988)
spca.fit(X_std)
print("Sparse loadings:\n", np.round(spca.components_.T, 3))
Kernel PCA
Kernel PCA keeps the variance-maximization logic but applies it in a nonlinear feature space. Instead of diagonalizing the covariance matrix of \(X\), we diagonalize a centered kernel matrix
\[ K_{ij} = k(x_i, x_j), \]
where \(k(\cdot,\cdot)\) might be a radial basis function kernel or a polynomial kernel. PCA is then performed on the centered version of \(K\) rather than on the original variables.
This is useful when the data lie on a curved manifold rather than near a linear subspace. The classic example is concentric circles: ordinary PCA sees almost no useful low-dimensional linear structure, while Kernel PCA can often unfold the geometry. The price is interpretability. Classical PCA gives loading vectors in the original variables; Kernel PCA gives components in an implicit feature space. In practice, that makes it more of a nonlinear embedding method than a variable-summary tool. It is also sensitive to kernel choice and scale, so I do not treat it as a push-button replacement for standard PCA.
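The centering step can be written out explicitly: with \(\mathbf{1}_n\) the \(n \times n\) matrix whose entries are all \(1/n\), the centered kernel is \(K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n\). A quick numerical check of the bookkeeping, using sklearn's pairwise kernel helper:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = rbf_kernel(X, gamma=1.0)

n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Row and column means of the centered kernel are (numerically) zero,
# which is exactly what centering in the implicit feature space amounts to.
print(np.allclose(K_centered.mean(axis=0), 0))  # → True
```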
# install.packages("kernlab")
library(kernlab)
set.seed(1988)
n <- 300
angles <- runif(n, 0, 2 * pi)
radius <- rep(c(1, 2), each = n / 2) + rnorm(n, sd = 0.05)
X_circle <- cbind(
radius * cos(angles),
radius * sin(angles)
)
kpca_fit <- kpca(
x = X_circle,
kernel = "rbfdot",
kpar = list(sigma = 5),
features = 2
)
head(rotated(kpca_fit))
import numpy as np
from sklearn.decomposition import KernelPCA
np.random.seed(1988)
n = 300
angles = np.random.uniform(0, 2 * np.pi, size=n)
radius = np.repeat([1.0, 2.0], repeats=n // 2) + np.random.randn(n) * 0.05
X_circle = np.column_stack([
radius * np.cos(angles),
radius * np.sin(angles),
])
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5)
X_embedded = kpca.fit_transform(X_circle)
print(np.round(X_embedded[:5], 3))
Probabilistic PCA
Probabilistic PCA (PPCA) replaces the deterministic projection view with a latent variable model:
\[ x_i = \mu + W z_i + \varepsilon_i, \qquad z_i \sim N(0, I_q), \qquad \varepsilon_i \sim N(0, \sigma^2 I_p). \]
Here \(z_i\) is a \(q\)-dimensional latent factor and \(\varepsilon_i\) is isotropic Gaussian noise. Under maximum likelihood, the fitted \(W\) spans the classical principal subspace, and in the limit \(\sigma^2 \to 0\) the model recovers ordinary PCA exactly. The formulation buys you something important: a likelihood, uncertainty quantification, and a principled way to deal with missing values.
That makes PPCA attractive when PCA is part of a generative modeling workflow rather than just a preprocessing step. I especially like it when the data matrix has moderate missingness and I do not want to impute first and hope for the best. The main caveat is the isotropic-noise assumption. If feature-specific noise levels differ substantially, PPCA can be too restrictive and factor analysis may be the better model.
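The subspace claim can be checked numerically. Tipping and Bishop (1999) give closed-form ML estimates: \(\hat\sigma^2\) is the average of the discarded eigenvalues of \(S\), and \(W = U_q(\Lambda_q - \hat\sigma^2 I)^{1/2}\) up to rotation, so the fitted \(W\) spans the same subspace as the top \(q\) eigenvectors. A sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 6, 2
X = rng.standard_normal((n, q)) @ rng.standard_normal((q, p)) \
    + rng.standard_normal((n, p)) * 0.2
X = X - X.mean(axis=0)

S = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# Closed-form ML estimates under the PPCA model
sigma2_ml = eigvals[q:].mean()
W_ml = eigvecs[:, :q] @ np.diag(np.sqrt(eigvals[:q] - sigma2_ml))

# Compare projection matrices: span(W_ml) equals the principal subspace
proj_pca = eigvecs[:, :q] @ eigvecs[:, :q].T
Q, _ = np.linalg.qr(W_ml)
proj_ppca = Q @ Q.T
print(np.allclose(proj_pca, proj_ppca))  # → True
```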
# install.packages("pcaMethods")
library(pcaMethods)
set.seed(1988)
n <- 150
p <- 6
latent_factors <- matrix(rnorm(n * 2), n, 2)
loadings_true <- matrix(rnorm(p * 2), p, 2)
X <- latent_factors %*% t(loadings_true) + matrix(rnorm(n * p, sd = 0.2), n, p)
# Introduce missing values
missing_index <- sample(length(X), size = 0.1 * length(X))
X[missing_index] <- NA
ppca_fit <- pca(X, method = "ppca", nPcs = 2, seed = 1988)
# Completed data and estimated scores
X_completed <- completeObs(ppca_fit)
scores(ppca_fit)
# pip install ppca
import numpy as np
from ppca import PPCA
np.random.seed(1988)
n = 150
p = 6
latent_factors = np.random.randn(n, 2)
loadings_true = np.random.randn(p, 2)
X = latent_factors @ loadings_true.T + np.random.randn(n, p) * 0.2
# Introduce missing values; the ppca package handles NaNs directly
missing_mask = np.random.rand(n, p) < 0.10
X[missing_mask] = np.nan
ppca = PPCA()
ppca.fit(data=X, d=2)
scores = ppca.transform()
print("First five latent scores:\n", np.round(scores[:5], 3))
Truncated PCA
This last flavor is a little different. Truncated PCA does not change the statistical target. It changes the computation. Instead of computing the full singular value decomposition, we directly approximate the top \(k\) singular vectors:
\[ X \approx U_k D_k V_k'. \]
When \(n\) and \(p\) are large, or when \(X\) is sparse, that distinction matters a lot. If all you want are the first few components, computing the full decomposition is wasted effort.
For practitioners, this is often the most useful PCA variant of all because it makes the classical method scale. The catch is conceptual rather than mathematical: randomized or truncated PCA is not discovering a different notion of component. It is approximating the same principal subspace more cheaply. If the approximation error is small, great. If not, you have a computational shortcut, not a new estimator.
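To make the idea concrete, here is a minimal sketch of the randomized scheme of Halko, Martinsson, and Tropp (2011), with a small fixed oversampling and none of the power iterations a production implementation would add:

```python
import numpy as np

def randomized_svd_sketch(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Sketch the column space of X with a random Gaussian test matrix.
    Omega = rng.standard_normal((X.shape[1], k + 5))  # small oversampling
    Q, _ = np.linalg.qr(X @ Omega)                    # orthonormal basis for the sketch
    # 2. Project X into the sketched basis and take a small exact SVD.
    B = Q.T @ X
    U_small, d, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], d[:k], Vt[:k]

rng = np.random.default_rng(0)
# Numerically low-rank test matrix: rank-5 signal plus small noise
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 200)) \
    + 0.01 * rng.standard_normal((1000, 200))
U, d, Vt = randomized_svd_sketch(X, k=5)
d_exact = np.linalg.svd(X, compute_uv=False)[:5]
# Relative error in the top singular values is tiny for a matrix like this
print(np.round(np.abs(d - d_exact) / d_exact, 6))
```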
# install.packages("irlba")
library(irlba)
set.seed(1988)
n <- 1000
p <- 200
latent_factors <- matrix(rnorm(n * 5), n, 5)
loadings_true <- matrix(rnorm(p * 5), p, 5)
X_large <- latent_factors %*% t(loadings_true) + matrix(rnorm(n * p, sd = 0.5), n, p)
# Fast approximation to the first 5 principal components
pca_fast <- prcomp_irlba(X_large, n = 5, center = TRUE, scale. = TRUE)
# Divide by the total variance, not the sum over only the five
# computed components, to get the true explained variance ratio
pca_fast$sdev^2 / pca_fast$totalvar
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
np.random.seed(1988)
n = 1000
p = 200
latent_factors = np.random.randn(n, 5)
loadings_true = np.random.randn(p, 5)
X_large = latent_factors @ loadings_true.T + np.random.randn(n, p) * 0.5
X_large = StandardScaler().fit_transform(X_large)
# Randomized SVD computes an approximate leading subspace
pca_fast = PCA(n_components=5, svd_solver="randomized", random_state=1988)
pca_fast.fit(X_large)
print(np.round(pca_fast.explained_variance_ratio_, 3))
Bottom Line
- Classical PCA is a variance-maximizing compression tool, not a generic device for finding the “most important” variables or latent causes.
- Sparse PCA is the right upgrade when interpretability matters and dense loading vectors are getting in the way.
- Kernel PCA is useful for nonlinear geometry, but you give up the clean loading-vector interpretation that makes ordinary PCA attractive.
- Probabilistic PCA is worth using when likelihood, uncertainty, or missing data matter; otherwise classical PCA is usually simpler.
- Truncated PCA is often the most practical choice on large matrices because it targets the same principal subspace at a much lower computational cost.
Where to Learn More
For the classical theory, Jolliffe’s Principal Component Analysis is still the standard reference and Jolliffe and Cadima (2016) is a concise modern review. Zou, Hastie, and Tibshirani (2006) is the canonical sparse PCA paper. Schölkopf, Smola, and Müller (1998) remains the core reference for Kernel PCA, while Tipping and Bishop (1999) is the paper to read for the probabilistic view. If your main concern is computation at scale, Halko, Martinsson, and Tropp (2011) is the right randomized linear algebra entry point.
References
Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288.
Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611-622.
Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265-286.