Stats · Multiple Linear Regression
Lesson · Statistics for Practitioners

Multiple Linear Regression, Made Useful

A working understanding of how regression with several predictors really behaves — the math you need, the assumptions you can't ignore, and the pitfalls that quietly corrupt published results.

Estimated time
26 minutes
Level
College / Early grad
Format
Interactive · Tool-agnostic
Module 1

What multiple regression actually does

Reading time · 2 min

Simple linear regression draws a line through a cloud of points to predict y from one variable x. Multiple linear regression generalizes that to many predictors at once: instead of a line in two dimensions, you fit a hyperplane in p+1 dimensions where p is the number of predictors. The output is still one continuous number — what changes is that several inputs help you predict it.

The headline benefit is partial effects. A coefficient in a multiple regression tells you how the response changes when one predictor moves by one unit while every other predictor is held constant. That "holding constant" part is what people are reaching for when they say "controlling for". It's a statistical control — different from a controlled experiment — but it's how observational researchers attempt to isolate the effect of one variable from another.

Why this matters
Most of the world's data is observational and confounded. House price depends on size and location and age. If you only fit price ~ size, the size coefficient absorbs everything correlated with size. Multiple regression lets you separate the strands.
If the embed doesn't load, open on YouTube.
Module 2

The model: equation, OLS, and what the coefficients mean

Reading time · 4 min

The multiple linear regression model is written:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

where β₀ is the intercept, each βⱼ is the slope on predictor xⱼ, and ε is the irreducible error — everything that the predictors don't explain. We assume ε has mean zero and constant variance σ².

Ordinary Least Squares (OLS) picks the β values that minimize the sum of squared residuals — the squared distance between each observed y and the predicted ŷ. In matrix form, with X the design matrix (a column of 1s plus your predictors) and y the response vector, the closed-form OLS solution is:

β̂ = (XᵀX)⁻¹ Xᵀ y

You almost never compute that inverse by hand. Software uses QR decomposition or singular value decomposition for numerical stability, but the coefficients are mathematically identical to the formula above.

Interpreting a coefficient — the right way

Imagine a model predicting starting salary in thousands of dollars from years of education and years of experience:

salary = 18 + 4·education + 1.5·experience

Each extra year of education is associated with $4,000 more in starting salary, holding experience constant. The "holding constant" qualifier is mandatory. Drop it and your interpretation is wrong, because in real data education and experience tend to be negatively correlated (more school means less time working), so the marginal effect of one is contaminated by the other unless you partial it out.

Interactive: tune a two-predictor model

Move the sliders to change the intercept and slopes. The plot shows predicted salary surfaces; the readouts show predictions for three sample employees.

Bachelor + 2yr exp
Master + 5yr exp
PhD + 8yr exp

Notice that flipping a coefficient negative immediately flips the slope of the surface. In real fitting, OLS hands you those numbers; your job is interpretation.

If the embed doesn't load, open on YouTube.

If you'd like a refresher on the simple, single-predictor case before going further:

If the embed doesn't load, open on YouTube.
Module 3

The assumptions you cannot ignore (LINE)

Reading time · 3 min

OLS gives unbiased, minimum-variance coefficient estimates only when a short list of conditions hold. The mnemonic is LINE: Linearity, Independence, Normality, Equal variance. Modern texts add a fifth — no perfect multicollinearity — which we cover separately in Module 5.

Click any assumption to see what it means and how to check it.
Linearity. The expected value of y is a linear combination of the predictors. Curved relationships violate this. Check it with a scatter plot of residuals against each predictor and against fitted values — patterns (U shapes, fans) signal trouble. Fix it with transformations (log, square root) or by adding polynomial or interaction terms.

A violation of any one of these does not necessarily kill the model — it changes what your output means. Non-linearity biases the coefficients themselves. Heteroscedasticity (unequal variance) leaves the coefficients unbiased but breaks standard errors, so your p-values lie. Non-independent errors (think time-series autocorrelation) similarly distort inference. Non-normal residuals are mostly a small-sample concern; with a few hundred observations the central limit theorem rescues you.

Common misconception
Multiple regression doesn't require the predictors themselves to be normally distributed — only the residuals. Categorical dummies and skewed predictors are fine; what matters is what the model leaves over.
Module 4

Categorical predictors and interaction terms

Reading time · 3 min

Real datasets are not all continuous. Region, treatment group, browser type, gender, blood type — these are categorical. To use them in a linear model you encode them as dummy variables: one binary 0/1 column per level, with one level dropped as a reference.

If region has four levels (North, South, East, West), you create three dummy columns — say, South, East, West — with North as the baseline. The coefficient on South is then the average difference in y between South and North, holding everything else constant. There is no separate North column because the intercept already encodes the baseline. Including all four would cause perfect multicollinearity (the dreaded "dummy variable trap").

y = β₀ + β₁·age + β₂·South + β₃·East + β₄·West + ε

Modern statistical software (R's lm(), Python's statsmodels, scikit-learn with OneHotEncoder(drop="first")) handles dummy creation automatically. You still need to choose a sensible reference category — usually the largest group or the natural control.

Interaction terms

Sometimes the effect of one predictor depends on the value of another. The effect of an extra year of experience on salary might be larger for those with advanced degrees. You model that with a product term:

salary = β₀ + β₁·education + β₂·experience + β₃·(education × experience) + ε

If β₃ is significantly different from zero, the slopes are not parallel — the predictors interact. Interactions almost always need to be motivated by theory or hypothesis; throwing in every pairwise product inflates your variance budget and produces brittle coefficients.

Always include the main effects
When you include an interaction term x·z, keep x and z in the model on their own as well. Dropping a main effect while keeping its interaction makes coefficients essentially uninterpretable.
Module 5

Multicollinearity and the Variance Inflation Factor

Reading time · 3 min

When two or more predictors carry similar information, OLS struggles to decide how to split the credit between them. The coefficients become unstable: small changes in the data flip signs, blow up standard errors, and produce p-values that are wildly inconsistent with predictive performance. This is multicollinearity.

The standard diagnostic is the Variance Inflation Factor (VIF). For each predictor xⱼ, regress it on every other predictor and grab the resulting R². The VIF is:

VIFj = 1 / (1 − R²j)

VIF = 1 means xⱼ is uncorrelated with the rest. As correlation grows, R²j approaches 1 and VIF explodes. Common rules of thumb: VIF above 5 deserves a look, VIF above 10 is a serious problem. (These thresholds are guidelines, not laws — see the citations.)

VIF calculator

Set the R² of the auxiliary regression — i.e., how well the rest of the predictors explain this one — and watch the VIF respond.

2.00
Healthy

Fixes range from gentle to aggressive: drop one of the offending predictors, combine them into a single index, center the variables (helps with polynomial collinearity), or move to a regularized estimator like ridge or LASSO that handles correlated predictors gracefully.

When you can ignore it
Multicollinearity hurts inference about individual coefficients. If you only care about overall predictions or about coefficients on uncollinear predictors, a high VIF on a control variable may not matter at all.
Module 6

Goodness of fit: R², adjusted R², and the F-test

Reading time · 3 min

is the proportion of variance in y that the model explains. It runs from 0 to 1, and it always increases when you add a predictor — even a useless one — because the optimizer can always find some way to reduce residuals. That makes raw R² a terrible model selection criterion in multiple regression.

Adjusted R² penalizes complexity. The formula deflates R² by (n−1)/(n−p−1), so adding a predictor only raises adjusted R² if the predictor's contribution beats the cost of the extra degree of freedom. If you add a junk variable, adjusted R² falls.

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

Watch adjusted R² punish overfitting

Add useless predictors and see what happens. With n = 50 observations and a true R² of 0.65, every extra random predictor inches raw R² up but pushes adjusted R² down.

0.650
Adjusted R²
0.627

The overall F-test asks a different question: are any of the predictors useful, taken together? Its null hypothesis is β₁ = β₂ = … = βₚ = 0. Reject it and you have evidence that at least one coefficient differs from zero. It does not tell you which one.

For comparing nested models — model B with extra predictors versus model A — use a partial F-test. For comparing non-nested models with different predictor sets, use information criteria like AIC or BIC.

A high R² is not a virtue
In physics you can hit R² = 0.99. In behavioral science, 0.30 might be excellent. The only honest standard is what's reasonable for your domain and what the model is for. Adjusted R² also has no fixed "good" threshold — it's only meaningful as a comparison.
Module 7

Diagnostics and residual plots

Reading time · 4 min

Reading the fit summary tells you nothing about whether the assumptions hold. For that you have to look at the residuals. There are four standard diagnostic plots; learn to read them and you'll catch most modeling mistakes before they ship.

Residuals vs Fitted — should be a featureless cloud around zero. A funnel shape signals heteroscedasticity. A U or arc means non-linearity. Q-Q plot of residuals — points should hug the diagonal; heavy tails mean non-normality. Scale-Location (square-rooted standardized residuals vs fitted) — flat horizontal trend means equal variance. Residuals vs Leverage — points outside Cook's distance contours are unduly influencing the fit.

Spot the assumption violation

Three of these residual plots show problems. Click the one that looks healthy.

Plot A
Plot B
Plot C
Plot D

Beyond visual diagnostics, several formal tests are useful: the Breusch-Pagan test for heteroscedasticity, the Durbin-Watson statistic for autocorrelated errors (values near 2 indicate independence; near 0 or 4 indicate trouble), and Cook's distance for influential observations. None of these substitute for actually plotting the residuals — they catch some violations and miss others.

If the embed doesn't load, open on YouTube.
Module 8

Best practices that hold up

Reading time · 2 min

Most regression failures are not technical — they're procedural. The list below condenses the practices that working analysts and recent peer-reviewed methodological papers consistently endorse.

Module 9

Pitfalls, ethics, and responsible use

Reading time · 2 min

Multiple regression is one of the most cited and most misused techniques in applied statistics. The technical traps are well known. The ethical ones are quieter but more damaging.

Technical pitfalls

Ethics and responsible use

When regression models inform decisions about people — credit, hiring, healthcare, insurance — coefficient choices have real consequences. Several considerations recur in modern data-ethics guidance and regulations such as the EU AI Act (2024) and the GDPR's right to explanation.

A coefficient is not a truth
It's a number that came out of the data you happened to collect, the model you happened to specify, and the assumptions you happened to make. State your uncertainty as clearly as you state your point estimate.
Knowledge check

Test yourself: 10 questions

Reading time · 5 min

One correct answer per question. Pick all your answers, then click "Score me" to see explanations.

Glossary

Quick reference

Open glossary (14 terms)
Coefficient (β)
The estimated change in the response variable for a one-unit change in a predictor, holding all other predictors constant.
Design matrix (X)
The matrix of predictor values, conventionally with a leading column of 1s for the intercept.
Dummy variable
A binary 0/1 column encoding membership in one level of a categorical variable. A k-level categorical predictor needs k−1 dummies.
Heteroscedasticity
Non-constant variance of residuals. Violates the equal-variance assumption; biases standard errors but not coefficients.
Homoscedasticity
Constant variance of residuals across all fitted values. The healthy condition.
Interaction term
A product of two predictors included as a third predictor. Lets the slope of one variable depend on the value of another.
Leverage
How extreme an observation's predictor values are. High-leverage points pull the fit toward themselves; combined with a residual, they become influential.
Multicollinearity
Strong linear dependence among predictors. Inflates coefficient variance and destabilizes inference.
OLS (Ordinary Least Squares)
The estimator that picks β to minimize the sum of squared residuals.
R² (coefficient of determination)
Fraction of variance in y explained by the model. Always between 0 and 1; always rises when predictors are added.
Residual
The difference between observed and predicted y for a single observation: e = y − ŷ.
Standard error
The standard deviation of a coefficient's sampling distribution. Drives the t-statistic and confidence interval.
VIF (Variance Inflation Factor)
1 / (1 − R²ⱼ) where R²ⱼ comes from regressing predictor j on the other predictors. Diagnostic for multicollinearity.
Adjusted R²
R² penalized by the number of predictors. Drops when junk variables are added; useful for comparing models.
Sources

References cited in this lesson