Multiple Linear Regression 101 — A Practical Lesson

Module 1

What multiple regression actually does

Reading time · 2 min

Simple linear regression draws a line through a cloud of points to predict y from one variable x. Multiple linear regression generalizes that to many predictors at once: instead of a line in two dimensions, you fit a hyperplane in p+1 dimensions where p is the number of predictors. The output is still one continuous number — what changes is that several inputs help you predict it.

The headline benefit is partial effects. A coefficient in a multiple regression tells you how the response changes when one predictor moves by one unit while every other predictor is held constant. That "holding constant" part is what people are reaching for when they say "controlling for". It's a statistical control — different from a controlled experiment — but it's how observational researchers attempt to isolate the effect of one variable from another.

Why this matters

Most of the world's data is observational and confounded. House price depends on size and location and age. If you only fit price ~ size, the size coefficient absorbs everything correlated with size. Multiple regression lets you separate the strands.

If the embed doesn't load, open on YouTube.

Module 2

The model: equation, OLS, and what the coefficients mean

Reading time · 4 min

The multiple linear regression model is written:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

where β₀ is the intercept, each βⱼ is the slope on predictor xⱼ, and ε is the irreducible error — everything that the predictors don't explain. We assume ε has mean zero and constant variance σ².

Ordinary Least Squares (OLS) picks the β values that minimize the sum of squared residuals — the squared distance between each observed y and the predicted ŷ. In matrix form, with X the design matrix (a column of 1s plus your predictors) and y the response vector, the closed-form OLS solution is:

β̂ = (XᵀX)⁻¹ Xᵀ y

You almost never compute that inverse by hand. Software uses QR decomposition or singular value decomposition for numerical stability, but the coefficients are mathematically identical to the formula above.

Interpreting a coefficient — the right way

Imagine a model predicting starting salary in thousands of dollars from years of education and years of experience:

salary = 18 + 4·education + 1.5·experience

Each extra year of education is associated with $4,000 more in starting salary, holding experience constant. The "holding constant" qualifier is mandatory. Drop it and your interpretation is wrong, because in real data education and experience tend to be negatively correlated (more school means less time working), so the marginal effect of one is contaminated by the other unless you partial it out.

Interactive: tune a two-predictor model

Move the sliders to change the intercept and slopes. The plot shows predicted salary surfaces; the readouts show predictions for three sample employees.

Intercept β₀ 18

β₁ (education) 4.0

β₂ (experience) 1.5

Bachelor + 2yr exp

—

Master + 5yr exp

—

PhD + 8yr exp

—

Notice that flipping a coefficient negative immediately flips the slope of the surface. In real fitting, OLS hands you those numbers; your job is interpretation.

If the embed doesn't load, open on YouTube.

If you'd like a refresher on the simple, single-predictor case before going further:

If the embed doesn't load, open on YouTube.

Module 3

The assumptions you cannot ignore (LINE)

Reading time · 3 min

OLS gives unbiased, minimum-variance coefficient estimates only when a short list of conditions hold. The mnemonic is LINE: Linearity, Independence, Normality, Equal variance. Modern texts add a fifth — no perfect multicollinearity — which we cover separately in Module 5.

Click any assumption to see what it means and how to check it.

Linearity. The expected value of y is a linear combination of the predictors. Curved relationships violate this. Check it with a scatter plot of residuals against each predictor and against fitted values — patterns (U shapes, fans) signal trouble. Fix it with transformations (log, square root) or by adding polynomial or interaction terms.

A violation of any one of these does not necessarily kill the model — it changes what your output means. Non-linearity biases the coefficients themselves. Heteroscedasticity (unequal variance) leaves the coefficients unbiased but breaks standard errors, so your p-values lie. Non-independent errors (think time-series autocorrelation) similarly distort inference. Non-normal residuals are mostly a small-sample concern; with a few hundred observations the central limit theorem rescues you.

Common misconception

Multiple regression doesn't require the predictors themselves to be normally distributed — only the residuals. Categorical dummies and skewed predictors are fine; what matters is what the model leaves over.

Module 4

Categorical predictors and interaction terms

Reading time · 3 min

Real datasets are not all continuous. Region, treatment group, browser type, gender, blood type — these are categorical. To use them in a linear model you encode them as dummy variables: one binary 0/1 column per level, with one level dropped as a reference.

If region has four levels (North, South, East, West), you create three dummy columns — say, South, East, West — with North as the baseline. The coefficient on South is then the average difference in y between South and North, holding everything else constant. There is no separate North column because the intercept already encodes the baseline. Including all four would cause perfect multicollinearity (the dreaded "dummy variable trap").

y = β₀ + β₁·age + β₂·South + β₃·East + β₄·West + ε

Modern statistical software (R's lm(), Python's statsmodels, scikit-learn with OneHotEncoder(drop="first")) handles dummy creation automatically. You still need to choose a sensible reference category — usually the largest group or the natural control.

Interaction terms

Sometimes the effect of one predictor depends on the value of another. The effect of an extra year of experience on salary might be larger for those with advanced degrees. You model that with a product term:

salary = β₀ + β₁·education + β₂·experience + β₃·(education × experience) + ε

If β₃ is significantly different from zero, the slopes are not parallel — the predictors interact. Interactions almost always need to be motivated by theory or hypothesis; throwing in every pairwise product inflates your variance budget and produces brittle coefficients.

Always include the main effects

When you include an interaction term x·z, keep x and z in the model on their own as well. Dropping a main effect while keeping its interaction makes coefficients essentially uninterpretable.

Module 5

Multicollinearity and the Variance Inflation Factor

Reading time · 3 min

When two or more predictors carry similar information, OLS struggles to decide how to split the credit between them. The coefficients become unstable: small changes in the data flip signs, blow up standard errors, and produce p-values that are wildly inconsistent with predictive performance. This is multicollinearity.

The standard diagnostic is the Variance Inflation Factor (VIF). For each predictor xⱼ, regress it on every other predictor and grab the resulting R². The VIF is:

VIF_j = 1 / (1 − R²_j)

VIF = 1 means xⱼ is uncorrelated with the rest. As correlation grows, R²_j approaches 1 and VIF explodes. Common rules of thumb: VIF above 5 deserves a look, VIF above 10 is a serious problem. (These thresholds are guidelines, not laws — see the citations.)

VIF calculator

Set the R² of the auxiliary regression — i.e., how well the rest of the predictors explain this one — and watch the VIF respond.

Auxiliary R² 0.50

2.00

Healthy

Fixes range from gentle to aggressive: drop one of the offending predictors, combine them into a single index, center the variables (helps with polynomial collinearity), or move to a regularized estimator like ridge or LASSO that handles correlated predictors gracefully.

When you can ignore it

Multicollinearity hurts inference about individual coefficients. If you only care about overall predictions or about coefficients on uncollinear predictors, a high VIF on a control variable may not matter at all.

Module 6

Goodness of fit: R², adjusted R², and the F-test

Reading time · 3 min

R² is the proportion of variance in y that the model explains. It runs from 0 to 1, and it always increases when you add a predictor — even a useless one — because the optimizer can always find some way to reduce residuals. That makes raw R² a terrible model selection criterion in multiple regression.

Adjusted R² penalizes complexity. The formula deflates R² by (n−1)/(n−p−1), so adding a predictor only raises adjusted R² if the predictor's contribution beats the cost of the extra degree of freedom. If you add a junk variable, adjusted R² falls.

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

The overall F-test asks a different question: are any of the predictors useful, taken together? Its null hypothesis is β₁ = β₂ = … = βₚ = 0. Reject it and you have evidence that at least one coefficient differs from zero. It does not tell you which one.

For comparing nested models — model B with extra predictors versus model A — use a partial F-test. For comparing non-nested models with different predictor sets, use information criteria like AIC or BIC.

A high R² is not a virtue

In physics you can hit R² = 0.99. In behavioral science, 0.30 might be excellent. The only honest standard is what's reasonable for your domain and what the model is for. Adjusted R² also has no fixed "good" threshold — it's only meaningful as a comparison.

Module 7

Diagnostics and residual plots

Reading time · 4 min

Reading the fit summary tells you nothing about whether the assumptions hold. For that you have to look at the residuals. There are four standard diagnostic plots; learn to read them and you'll catch most modeling mistakes before they ship.

Residuals vs Fitted — should be a featureless cloud around zero. A funnel shape signals heteroscedasticity. A U or arc means non-linearity. Q-Q plot of residuals — points should hug the diagonal; heavy tails mean non-normality. Scale-Location (square-rooted standardized residuals vs fitted) — flat horizontal trend means equal variance. Residuals vs Leverage — points outside Cook's distance contours are unduly influencing the fit.

Spot the assumption violation

Three of these residual plots show problems. Click the one that looks healthy.

Plot A

Plot B

Plot C

Plot D

Beyond visual diagnostics, several formal tests are useful: the Breusch-Pagan test for heteroscedasticity, the Durbin-Watson statistic for autocorrelated errors (values near 2 indicate independence; near 0 or 4 indicate trouble), and Cook's distance for influential observations. None of these substitute for actually plotting the residuals — they catch some violations and miss others.

If the embed doesn't load, open on YouTube.

Module 8

Best practices that hold up

Reading time · 2 min

Most regression failures are not technical — they're procedural. The list below condenses the practices that working analysts and recent peer-reviewed methodological papers consistently endorse.

Plot first, model second. Look at every predictor against the response, and at predictors against each other. Half of all surprises in a final model are visible in the data the day you load it.
Hold out a test set (or cross-validate) for any model used for prediction. In-sample R² always overstates how the model will perform on new data.
Report effect sizes with confidence intervals, not just p-values. A 95% CI tells the reader how precise your estimate is and lets them judge practical significance.
Check VIF before you trust any coefficient. Rule of thumb: investigate above 5, fix above 10.
Examine residuals every time. All four standard plots — fitted vs residual, Q-Q, scale-location, leverage — take seconds and catch most assumption violations.
Pre-register your model for confirmatory analyses. Specifying predictors and tests in advance protects against p-hacking.
Report the full model, including non-significant predictors. Selective reporting biases the literature.
Centre or standardize predictors when interactions or polynomials are involved — it tames collinearity and makes coefficients interpretable at the mean of the data rather than at zero.

Module 9

Pitfalls, ethics, and responsible use

Reading time · 2 min

Multiple regression is one of the most cited and most misused techniques in applied statistics. The technical traps are well known. The ethical ones are quieter but more damaging.

Technical pitfalls

Stepwise selection. Algorithms that add or drop predictors based on p-values inflate Type-I error and produce models that don't replicate. Use theory, regularization, or out-of-sample validation instead.
Extrapolation. A linear model is only credible inside the range of the predictors you fit. Predicting beyond it is guesswork.
Correlation ≠ causation. A regression coefficient describes association, not effect. Causal interpretation requires either an experiment or carefully argued identification.
Outliers and leverage points. A handful of extreme observations can dominate the fit. Use Cook's distance and decide deliberately whether to keep, downweight, or remove them — and document the choice.

Ethics and responsible use

When regression models inform decisions about people — credit, hiring, healthcare, insurance — coefficient choices have real consequences. Several considerations recur in modern data-ethics guidance and regulations such as the EU AI Act (2024) and the GDPR's right to explanation.

Disparate impact. Even when sensitive attributes (race, sex, age) are excluded, correlated proxies (zip code, name, school) can encode them. Test for differential error rates across protected groups.
Disclosure of methods. Anyone affected by an automated decision is entitled to know, in plain language, what variables drove it. Multiple regression is among the most transparent models — use that as a feature, not just a fallback.
Data provenance. Be honest about how the data was collected, what's missing, and who was excluded. Coefficients on a non-representative sample do not generalize, no matter how tight the standard errors.
No fishing. Running dozens of specifications and reporting the one that "works" turns regression into a publication trick. Pre-registration and full reporting are the antidotes.

A coefficient is not a truth

It's a number that came out of the data you happened to collect, the model you happened to specify, and the assumptions you happened to make. State your uncertainty as clearly as you state your point estimate.

Knowledge check

Test yourself: 10 questions

Reading time · 5 min

One correct answer per question. Pick all your answers, then click "Score me" to see explanations.

Glossary

Quick reference

Open glossary (14 terms)

Coefficient (β): The estimated change in the response variable for a one-unit change in a predictor, holding all other predictors constant.
Design matrix (X): The matrix of predictor values, conventionally with a leading column of 1s for the intercept.
Dummy variable: A binary 0/1 column encoding membership in one level of a categorical variable. A k-level categorical predictor needs k−1 dummies.
Heteroscedasticity: Non-constant variance of residuals. Violates the equal-variance assumption; biases standard errors but not coefficients.
Homoscedasticity: Constant variance of residuals across all fitted values. The healthy condition.
Interaction term: A product of two predictors included as a third predictor. Lets the slope of one variable depend on the value of another.
Leverage: How extreme an observation's predictor values are. High-leverage points pull the fit toward themselves; combined with a residual, they become influential.
Multicollinearity: Strong linear dependence among predictors. Inflates coefficient variance and destabilizes inference.
OLS (Ordinary Least Squares): The estimator that picks β to minimize the sum of squared residuals.
R² (coefficient of determination): Fraction of variance in y explained by the model. Always between 0 and 1; always rises when predictors are added.
Residual: The difference between observed and predicted y for a single observation: e = y − ŷ.
Standard error: The standard deviation of a coefficient's sampling distribution. Drives the t-statistic and confidence interval.
VIF (Variance Inflation Factor): 1 / (1 − R²ⱼ) where R²ⱼ comes from regressing predictor j on the other predictors. Diagnostic for multicollinearity.
Adjusted R²: R² penalized by the number of predictors. Drops when junk variables are added; useful for comparing models.

Sources

Multiple Linear Regression, Made Useful

What multiple regression actually does

The model: equation, OLS, and what the coefficients mean

Interpreting a coefficient — the right way

Interactive: tune a two-predictor model

The assumptions you cannot ignore (LINE)

Categorical predictors and interaction terms

Interaction terms

Multicollinearity and the Variance Inflation Factor

VIF calculator

Goodness of fit: R², adjusted R², and the F-test

Watch adjusted R² punish overfitting

Diagnostics and residual plots

Spot the assumption violation

Best practices that hold up

Pitfalls, ethics, and responsible use

Technical pitfalls

Ethics and responsible use

Test yourself: 10 questions

Quick reference

References cited in this lesson