Causal vs. Predictive Modeling: Subtle, but Crucial Differences
Background
It’s one of the most common mix-ups I see among data scientists—especially those coming from a machine learning background: confusing causal modeling with predictive modeling. On the surface, they look similar. You build a model, you include some variables, you fit it, and then you do… something with the results. But under the hood, these two approaches serve fundamentally different goals and require very different mindsets.
Predictive modeling is about building a model that can forecast outcomes. Causal modeling is about understanding how the world works. And mixing them up can lead to some really bad decisions—like launching a product based on a spurious correlation or controlling for the wrong variables and wiping out your treatment effect.
This post is for all the data scientists who’ve ever wondered any of the following:
- “Why can’t I just throw everything into my causal model like I do with my random forest?”
- “This causal model is great. Can’t we just use it for prediction as well?”
- “What exactly is the difference between the two?”
Let’s unpack it.
A Closer Look
Predictive Modeling
Let’s start with what most folks are familiar with: predictive models.
In predictive modeling, you’re judged by how well you can forecast an outcome, \(Y\). That’s it. You can (and often do) throw in everything and the kitchen sink—lagged outcomes, future values of other variables (careful though!), variables that are correlated with the outcome but not necessarily meaningful in a causal sense.
It’s all good as long as it helps you reduce Root Mean Squared Error (RMSE), increase Area Under the Curve (AUC), or minimize cross-entropy loss. Data leakage is your main enemy; otherwise, the bar for “what goes in the model” is pretty low. You can often get away with building a decent model without deep institutional or contextual knowledge. The complex algorithms take care of that for you.
No one cares why your model works, only that it does.
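To make the contrast concrete, here’s a minimal sketch of the predictive workflow (simulated data and arbitrary features, using scikit-learn): everything goes into the model, and the model is judged purely on held-out error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data: 10 features, some predictive, none vetted for causal meaning.
rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 10))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)

# The predictive mindset: throw everything in, evaluate on held-out RMSE only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"held-out RMSE: {rmse:.3f}")  # the only number anyone asks about
```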
Causal Modeling
Now, enter the world of causal inference. The rules are completely different. The ancient Greek philosopher Democritus captured just how hard it is to pin down causality:
“I would rather understand one cause than be King of Persia.”
In causal modeling, the goal is not prediction, but isolation of the effect of a treatment \(T\) on \(Y\). And to do that, you need to control for confounders—variables that affect both the treatment and the outcome. But here’s the catch: not all variables should be controlled for.
This is where the concept of bad controls comes in: variables that are affected by the treatment (post-treatment variables), or colliders that, once conditioned on, open non-causal paths and induce spurious associations.
In other words, in causal inference:
- Including the wrong variable can make things worse.
- You must think hard about the causal structure of your data.
- Domain knowledge is critical.
Throwing in “everything” like in a predictive model can completely destroy your estimate. Moreover, in causal inference you can almost never get away without deep knowledge of every aspect of your analysis: the environment, the intervention, the sample, and so on. In many ways, causality is significantly more challenging than prediction.
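To see a bad control backfire, here’s a small simulation (a hypothetical data-generating process, sketched with numpy and statsmodels): \(T\) causes \(Y\), and both feed into a collider \(C\). The simple regression recovers the true effect; adding the “control” \(C\) badly biases it.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical process: T -> Y, and both T and Y -> C (a collider).
rng = np.random.default_rng(42)
n = 100_000
T = rng.normal(size=n)
Y = 2.0 * T + rng.normal(size=n)   # true causal effect of T on Y is 2.0
C = T + Y + rng.normal(size=n)     # collider: a consequence of both T and Y

# Correct specification: Y on T alone recovers ~2.0.
print(sm.OLS(Y, sm.add_constant(T)).fit().params[1])

# "Bad control": conditioning on the collider biases the coefficient (~0.5 here).
print(sm.OLS(Y, sm.add_constant(np.column_stack([T, C]))).fit().params[1])
```

Nothing about \(C\) looks suspicious in a predictive sense; it is highly correlated with \(Y\) and would improve any forecast. That is exactly why domain knowledge about the causal structure, not a fit statistic, has to decide what goes in.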
Propensity Scores
One place where this confusion often plays out is in propensity score modeling.
To recap, the propensity score \(e(X) = P(T = 1 \mid X)\) is the probability of receiving treatment given covariates. It’s often estimated via a logistic regression or ML model. Then, you use this score to adjust for differences between treated and control groups (e.g., via weighting or matching).
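For concreteness, one standard way to use the score is inverse probability weighting (IPW), which estimates the average treatment effect as

\[
\hat{\tau}_{\text{IPW}} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{T_i\,Y_i}{\hat{e}(X_i)} - \frac{(1 - T_i)\,Y_i}{1 - \hat{e}(X_i)}\right).
\]

Every \(\hat{e}(X_i)\) lands in a denominator, so what matters is that the resulting weights balance the covariates, not that the model classifies treatment well.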
And here’s the key point: your goal is not to get the best prediction of treatment. Your goal is to use the propensity score to balance covariates between groups. That’s it.
So even if a fancy XGBoost model gives you higher prediction accuracy, it may overfit or fail to achieve covariate balance, which defeats the purpose. In fact, some of the best-performing propensity score models (for causal purposes) may have terrible predictive accuracy but excel at achieving balance.
There’s a trade-off here:
- Predictive ML models focus on minimizing error.
- Propensity score models should optimize covariate balance.
And that trade-off is why a more accurate model is not necessarily better for causal inference.
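Here’s a minimal sketch of what that looks like in practice (simulated data; the weighting scheme and balance metric are standard, but the setup is invented for illustration): fit a propensity model, weight by it, and judge it on standardized mean differences rather than accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data: treatment assignment depends on the first two covariates.
rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=(n, 3))
p_true = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
T = rng.binomial(1, p_true)

# Estimate e(X) and form inverse-probability weights.
e = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
w = np.where(T == 1, 1 / e, 1 / (1 - e))

def smd(x, t, weights=None):
    """Standardized mean difference between treated and control for one covariate."""
    weights = np.ones_like(x) if weights is None else weights
    m1 = np.average(x[t == 1], weights=weights[t == 1])
    m0 = np.average(x[t == 0], weights=weights[t == 0])
    pooled_sd = np.sqrt((x[t == 1].var() + x[t == 0].var()) / 2)
    return (m1 - m0) / pooled_sd

# The causal yardstick: balance before vs. after weighting, not AUC.
for j in range(X.shape[1]):
    print(f"covariate {j}: SMD raw = {smd(X[:, j], T):+.3f}, "
          f"weighted = {smd(X[:, j], T, w):+.3f}")
```

If the post-weighting SMDs aren’t near zero (a common rule of thumb is below 0.1), a better AUC won’t save you.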
An Example
The distinction between machine learning and causal inference is best illustrated with an example. Suppose we build a causal model to understand what drives student success (\(Y\)), and we find that hard work (\(T\)) has a causal effect on academic achievement. This means that if we were to increase a given student’s effort, we would expect, on average, a corresponding improvement in that student’s success. Causality is about what will happen if we intervene (i.e., manipulate \(T\)). Intervene is the key word here. And here’s the crazy thing: in observational data, student success and effort might not even be correlated!
So, this doesn’t necessarily help us predict with high accuracy which students will be most successful. For that task, other variables—like parental income (\(X_1\)), neighborhood quality (\(X_2\)), or prior test scores (\(X_3\))—might outperform hard work as predictors, even if they are not causes. In short, causal inference answers what-if questions about interventions, while machine learning focuses on associational patterns that are useful for prediction, regardless of whether they are causal.
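How can a real causal effect leave no trace in the raw correlation? Confounding can mask it. Here’s a hypothetical simulation (the data-generating process is invented for illustration): prior ability raises success directly but also reduces how hard a student needs to work, so effort’s true effect of 1.0 is invisible in the marginal correlation.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical process: ability raises success directly but lowers effort.
rng = np.random.default_rng(1)
n = 200_000
ability = rng.normal(size=n)
effort = -ability + rng.normal(size=n)                # high ability, less grind
success = effort + 2 * ability + rng.normal(size=n)   # true effect of effort: 1.0

# Marginal correlation is ~0: effort looks useless as a predictor...
print(np.corrcoef(effort, success)[0, 1])

# ...yet adjusting for the confounder recovers the causal effect (~1.0).
design = sm.add_constant(np.column_stack([effort, ability]))
print(sm.OLS(success, design).fit().params[1])
```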
Bottom Line
- Predictive models are about forecasting outcomes; causal models are about estimating effects.
- In causal inference, you must think carefully about what to include in the model: “bad controls” can bias results.
- Propensity scores should be judged by how well they balance covariates, not by how well they predict treatment.
- More context and domain knowledge are usually required for causal models than for predictive ones.