Multiple Linear Regression, Made Useful
A working understanding of how regression with several predictors really behaves — the math you need, the assumptions you can't ignore, and the pitfalls that quietly corrupt published results.
What multiple regression actually does
Simple linear regression draws a line through a cloud of points to predict y from one variable x. Multiple linear regression generalizes that to many predictors at once: instead of a line in two dimensions, you fit a hyperplane in p+1 dimensions where p is the number of predictors. The output is still one continuous number — what changes is that several inputs help you predict it.
The headline benefit is partial effects. A coefficient in a multiple regression tells you how the response changes when one predictor moves by one unit while every other predictor is held constant. That "holding constant" part is what people are reaching for when they say "controlling for". It's a statistical control — different from a controlled experiment — but it's how observational researchers attempt to isolate the effect of one variable from another.
The model: equation, OLS, and what the coefficients mean
The multiple linear regression model is written:
y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε
where β₀ is the intercept, each βⱼ is the slope on predictor xⱼ, and ε is the irreducible error — everything that the predictors don't explain. We assume ε has mean zero and constant variance σ².
Ordinary Least Squares (OLS) picks the β values that minimize the sum of squared residuals, that is, the squared differences between each observed y and its predicted ŷ. In matrix form, with X the design matrix (a column of 1s plus your predictors) and y the response vector, the closed-form OLS solution is:
β̂ = (XᵀX)⁻¹Xᵀy
You almost never compute that inverse by hand. Software uses QR decomposition or singular value decomposition for numerical stability, but the coefficients are mathematically identical to the formula above.
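To make the equivalence concrete, here is a minimal sketch in NumPy. The data, the coefficient values, and the variable names are assumptions made up for illustration. It solves the normal equations directly and compares the result with `np.linalg.lstsq`, which uses an SVD-based routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical predictors and coefficients, chosen only for illustration.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
true_beta = np.array([1.0, 2.0, -0.5])        # intercept, slope on x1, slope on x2
X = np.column_stack([np.ones(n), x1, x2])     # design matrix with a leading column of 1s
y = X @ true_beta + rng.normal(scale=0.3, size=n)

# Closed form: solve (X'X) beta = X'y rather than forming the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least-squares solver, closer to what statistical software does.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
```

On a well-conditioned problem like this one the two answers agree to floating-point precision. The decomposition-based route matters when predictors are nearly collinear and XᵀX is close to singular.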
Interpreting a coefficient — the right way
Imagine a model predicting starting salary in thousands of dollars from years of education and years of experience, with a fitted slope of 4 on education:
Salary = β₀ + 4·Education + β₂·Experience
Each extra year of education is associated with $4,000 more in starting salary, holding experience constant. The "holding constant" qualifier is mandatory. Drop it and your interpretation is wrong, because in real data education and experience tend to be negatively correlated (more school means less time working), so the marginal effect of one is contaminated by the other unless you partial it out.
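A small simulation can show why the qualifier matters. Everything below is an assumption for illustration: the data are simulated, the experience slope of 2 and the strength of the negative education-experience relationship are invented, and `ols` is a hypothetical helper. The sketch fits salary on education alone and then on both predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# Simulated, hypothetical data: more schooling means fewer years worked.
education = rng.normal(16, 2, size=n)
experience = 30 - 1.2 * education + rng.normal(0, 2, size=n)
# Assumed generating model (thousands of dollars): 4 per year of education,
# 2 per year of experience. The value 2 is invented for this sketch.
salary = 10 + 4 * education + 2 * experience + rng.normal(0, 5, size=n)

def ols(columns, y):
    """Least-squares coefficients for an intercept plus the given predictor columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

simple = ols([education], salary)                 # education only
multiple = ols([education, experience], salary)   # education, holding experience constant

print("education slope, simple regression:  ", simple[1])
print("education slope, multiple regression:", multiple[1])
```

Under these assumptions the education-only slope lands well below 4, because each extra year of schooling comes with less experience and the simple regression folds that loss into the education coefficient. The multiple regression recovers a slope near 4.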
Flipping the sign of a coefficient flips the direction in which the fitted surface tilts along that predictor. In real fitting, OLS hands you those numbers; your job is interpretation.
The assumptions you cannot ignore (LINE)
OLS gives unbiased, minimum-variance coefficient estimates, and standard errors you can trust, only when a short list of conditions holds. The mnemonic is LINE: Linearity, Independence, Normality, Equal variance. Modern texts add a fifth, no perfect multicollinearity, which we cover separately in Module 5.
Linearity. The expected value of y is a linear combination of the predictors. Curved relationships violate this. Check it with a scatter plot of residuals against each predictor and against fitted values: patterns (U shapes, fans) signal trouble. Fix it with transformations (log, square root) or by adding polynomial or interaction terms.
A violation of any one of these does not necessarily kill the model — it changes what your output means. Non-linearity biases the coefficients themselves. Heteroscedasticity (unequal variance) leaves the coefficients unbiased but breaks standard errors, so your p-values lie. Non-independent errors (think time-series autocorrelation) similarly distort inference. Non-normal residuals are mostly a small-sample concern; with a few hundred observations the central limit theorem rescues you.
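The claim about heteroscedasticity can be checked by simulation. The sketch below is illustrative only: the design, the coefficients, and the way the error spread grows with x are all assumptions. It refits the same model on many simulated datasets, then compares how much the slope estimate really varies with the standard error that the usual constant-variance formula reports.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000
x = np.linspace(0, 10, n)                    # fixed design
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)             # tiny 2x2 inverse, reused in every replication
true_beta = np.array([1.0, 2.0])             # hypothetical intercept and slope

slopes = np.empty(reps)
classical_se = np.empty(reps)
for r in range(reps):
    eps = rng.normal(size=n) * 0.1 * x**2    # error spread grows with x (heteroscedastic)
    y = X @ true_beta + eps
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)         # usual constant-variance estimate
    slopes[r] = beta[1]
    classical_se[r] = np.sqrt(sigma2 * XtX_inv[1, 1])

print("mean slope estimate:        ", slopes.mean())
print("actual spread of the slope: ", slopes.std(ddof=1))
print("average reported std. error:", classical_se.mean())
```

In this setup the average slope stays close to the true value of 2, so the coefficient is not biased, while the reported standard error understates the real sampling spread. That gap is what makes the p-values unreliable.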
Categorical predictors and interaction terms
Real datasets are not all continuous. Region, treatment group, browser type, gender, blood type — these are categorical. To use them in a linear model you encode them as dummy variables: one binary 0/1 column per level, with one level dropped as a reference.
If region has four levels (North, South, East, West), you create three dummy columns — say, South, East, West — with North as the baseline. The coefficient on South is then the average difference in y between South and North, holding everything else constant. There is no separate North column because the intercept already encodes the baseline. Including all four would cause perfect multicollinearity (the dreaded "dummy variable trap").
Modern statistical software (R's lm(), the formula interface in Python's statsmodels, scikit-learn with OneHotEncoder(drop="first")) handles dummy creation automatically. You still need to choose a sensible reference category, usually the largest group or the natural control.
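The region example can be worked by hand in NumPy. The eight observations and their y values below are invented for illustration. The sketch builds three dummies with North as the reference, shows that the fitted coefficients are differences in group means, and shows that adding a fourth dummy makes the design matrix rank deficient.

```python
import numpy as np

# Hypothetical toy data: two observations per region.
region = np.array(["North", "South", "East", "West", "South", "North", "East", "West"])
y = np.array([50.0, 58.0, 54.0, 61.0, 62.0, 52.0, 56.0, 59.0])
levels = ["North", "South", "East", "West"]       # first level is the reference

# One 0/1 column per non-reference level: South, East, West.
dummies = np.column_stack([(region == lvl).astype(float) for lvl in levels[1:]])
X_ok = np.column_stack([np.ones(len(region)), dummies])
beta, *_ = np.linalg.lstsq(X_ok, y, rcond=None)
print("intercept (North mean) and differences from North:", beta)

# Dummy variable trap: an intercept plus all four dummies.
all_four = np.column_stack([(region == lvl).astype(float) for lvl in levels])
X_trap = np.column_stack([np.ones(len(region)), all_four])
print("rank vs columns, k-1 dummies:", np.linalg.matrix_rank(X_ok), X_ok.shape[1])
print("rank vs columns, k dummies:  ", np.linalg.matrix_rank(X_trap), X_trap.shape[1])
```

With no other predictors in the model, the intercept equals the North mean and each dummy coefficient equals that region's mean minus the North mean. In the trap version the four dummy columns sum to the intercept column, so the rank falls one short of the column count.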
Interaction terms
Sometimes the effect of one predictor depends on the value of another. The effect of an extra year of experience on salary might be larger for those with advanced degrees. You model that with a product term:
y = β₀ + β₁x + β₂z + β₃(x·z) + ε
If β₃ is significantly different from zero, the slopes are not parallel: the predictors interact. Interactions almost always need to be motivated by theory or hypothesis; throwing in every pairwise product inflates the variance of your estimates and produces brittle coefficients.
Whenever you include an interaction x·z, keep x and z in the model on their own as well. Dropping a main effect while keeping its interaction makes coefficients essentially uninterpretable.
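As an illustration of a product term with both main effects kept, the sketch below simulates the experience and advanced-degree example. All numbers are assumptions: the generating coefficients are invented, and the binary `advanced` indicator is a hypothetical stand-in for z.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
experience = rng.uniform(0, 20, size=n)                 # x
advanced = rng.integers(0, 2, size=n).astype(float)     # z: 1 = advanced degree (hypothetical)
# Assumed generating model: the experience slope is 1.5 without an advanced
# degree and 1.5 + 1.0 with one.
salary = (40 + 1.5 * experience + 5 * advanced
          + 1.0 * experience * advanced + rng.normal(0, 3, size=n))

# Main effects x and z stay in the model alongside the product x*z.
X = np.column_stack([np.ones(n), experience, advanced, experience * advanced])
b0, b1, b2, b3 = np.linalg.lstsq(X, salary, rcond=None)[0]

slope_without = b1          # experience slope when z = 0
slope_with = b1 + b3        # experience slope when z = 1
print("interaction coefficient b3:", b3)
print("experience slope, z = 0:", slope_without)
print("experience slope, z = 1:", slope_with)
```

The coefficient b1 is the experience slope only for the reference group (z = 0). The slope for the other group is b1 + b3, which is why a nonzero β₃ means the lines are not parallel.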
Multicollinearity and the Variance Inflation Factor
When two or more predictors carry similar information, OLS struggles to decide how to split the credit between them. The coefficients become unstable: small changes in the data flip signs, blow up standard errors, and produce p-values that are wildly inconsistent with predictive performance. This is multicollinearity.
The standard diagnostic is the Variance Inflation Factor (VIF). For each predictor xⱼ, regress it on every other predictor and take the resulting R², written R²ⱼ. The VIF is:
VIFⱼ = 1 / (1 − R²ⱼ)
VIF = 1 means xⱼ is uncorrelated with the rest. As correlation grows, R²ⱼ approaches 1 and VIF explodes. Common rules of thumb: VIF above 5 deserves a look, VIF above 10 is a serious problem. (These thresholds are guidelines, not laws; see the citations.)
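The auxiliary-regression recipe can be sketched in a few lines of NumPy. The helper names `r_squared` and `vif` and the simulated predictors are assumptions for illustration, not a library API. The last loop shows how the formula responds to the auxiliary R² alone.

```python
import numpy as np

def r_squared(X, y):
    """R² of an OLS fit of y on X (X must already contain the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def vif(predictors):
    """VIF for each column of an (n, p) predictor array (no intercept column)."""
    n, p = predictors.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(predictors, j, axis=1)
        X_aux = np.column_stack([np.ones(n), others])   # regress x_j on the rest
        out[j] = 1.0 / (1.0 - r_squared(X_aux, predictors[:, j]))
    return out

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)   # strongly tied to x1
x3 = rng.normal(size=n)                                       # unrelated to both
vifs = vif(np.column_stack([x1, x2, x3]))
print("VIFs for x1, x2, x3:", vifs)

# The formula on its own: auxiliary R² in, VIF out.
table = {r2: 1.0 / (1.0 - r2) for r2 in (0.0, 0.5, 0.8, 0.9, 0.95)}
for r2, value in table.items():
    print(f"R² = {r2:.2f}  ->  VIF = {value:.1f}")
```

The rule-of-thumb thresholds map directly onto the auxiliary fit: a VIF of 5 corresponds to R²ⱼ = 0.8, and a VIF of 10 to R²ⱼ = 0.9.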
Fixes range from gentle to aggressive: drop one of the offending predictors, combine them into a single index, center the variables (helps with polynomial collinearity), or move to a regularized estimator like ridge or LASSO that handles correlated predictors gracefully.
Goodness of fit: R², adjusted R², and the F-test
R² is the proportion of variance in y that the model explains. For an OLS fit with an intercept it runs from 0 to 1, and it never decreases when you add a predictor, even a useless one, because the optimizer can always set the new coefficient to zero and do no worse than before. That makes raw R² a terrible model selection criterion in multiple regression.
Adjusted R² penalizes complexity. The formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1): it inflates the unexplained share 1 − R² by (n−1)/(n−p−1), so adding a predictor only raises adjusted R² if the predictor's contribution beats the cost of the extra degree of freedom. If you add a junk variable, adjusted R² usually falls.
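A quick numerical check of both behaviours, as a sketch on simulated data. The sample size, the coefficients, and the five junk predictors are assumptions, and `fit_stats` is a hypothetical helper.

```python
import numpy as np

def fit_stats(X, y):
    """Return (R², adjusted R²) for an OLS fit; X includes the intercept column."""
    n, k = X.shape
    p = k - 1                                    # number of predictors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(5)
n = 60
x = rng.normal(size=n)
y = 2 + 1.5 * x + rng.normal(size=n)
junk = rng.normal(size=(n, 5))                   # five predictors unrelated to y

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, junk])
r2_small, adj_small = fit_stats(X_small, y)
r2_big, adj_big = fit_stats(X_big, y)

print("one real predictor:    R² =", r2_small, " adjusted R² =", adj_small)
print("plus five junk columns: R² =", r2_big, " adjusted R² =", adj_big)
```

R² for the larger model cannot be lower than for the smaller one. Adjusted R² always sits below R² once there is at least one predictor, and with junk columns it will typically drop, although any single random sample can behave differently.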
The overall F-test asks a different question: are any of the predictors useful, taken together? Its null hypothesis is β₁ = β₂ = … = βₚ = 0. Reject it and you have evidence that at least one coefficient differs from zero. It does not tell you which one.
For comparing nested models — model B with extra predictors versus model A — use a partial F-test. For comparing non-nested models with different predictor sets, use information criteria like AIC or BIC.
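Both F-tests come from the same calculation on residual sums of squares: the overall F-test is the partial F-test with an intercept-only model as model A. The sketch below computes the statistic only. The data and the helper names `rss` and `partial_f` are assumptions for illustration.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def partial_f(X_a, X_b, y):
    """F statistic for nested models, where X_a's columns are a subset of X_b's."""
    n = len(y)
    q = X_b.shape[1] - X_a.shape[1]           # number of extra coefficients in B
    df_b = n - X_b.shape[1]                   # residual degrees of freedom of B
    rss_a, rss_b = rss(X_a, y), rss(X_b, y)
    f_stat = ((rss_a - rss_b) / q) / (rss_b / df_b)
    return f_stat, q, df_b

rng = np.random.default_rng(7)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 1.0 * x1 + 1.0 * x2 + 0.0 * x3 + rng.normal(size=n)   # x3 is irrelevant by construction

ones = np.ones(n)
X_null = ones[:, None]                         # intercept only
X_a = np.column_stack([ones, x1])
X_b = np.column_stack([ones, x1, x2, x3])

F_overall, q0, df0 = partial_f(X_null, X_b, y)    # H0: beta1 = beta2 = beta3 = 0
F_partial, q1, df1 = partial_f(X_a, X_b, y)       # H0: beta2 = beta3 = 0
print("overall F:", F_overall, "on", (q0, df0), "degrees of freedom")
print("partial F:", F_partial, "on", (q1, df1), "degrees of freedom")
```

To turn either statistic into a p-value, compare it with an F distribution on (q, df_b) degrees of freedom, for instance with `scipy.stats.f.sf(F, q, df_b)` if SciPy is available.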
Diagnostics and residual plots
Reading the fit summary tells you nothing about whether the assumptions hold. For that you have to look at the residuals. There are four standard diagnostic plots; learn to read them and you'll catch most modeling mistakes before they ship.
- Residuals vs Fitted: should be a featureless cloud around zero. A funnel shape signals heteroscedasticity. A U or arc means non-linearity.
- Q-Q plot of residuals: points should hug the diagonal; heavy tails mean non-normality.
- Scale-Location (square-rooted standardized residuals vs fitted): flat horizontal trend means equal variance.
- Residuals vs Leverage: points outside Cook's distance contours are unduly influencing the fit.
Beyond visual diagnostics, several formal tests and statistics are useful: the Breusch-Pagan test for heteroscedasticity, the Durbin-Watson statistic for autocorrelated errors (values near 2 indicate no first-order autocorrelation; values near 0 indicate positive and values near 4 negative autocorrelation), and Cook's distance for influential observations. None of these substitute for actually plotting the residuals; they catch some violations and miss others.
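The Durbin-Watson statistic is simple enough to compute directly: the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below is illustrative, with a simulated time trend and an AR(1) error process whose coefficient of 0.8 is an assumption.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def ols_resid(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(6)
n = 500
t = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), t])

# Positively autocorrelated errors: each error carries over 80% of the previous one.
shocks = rng.normal(size=n)
ar1 = np.empty(n)
ar1[0] = shocks[0]
for i in range(1, n):
    ar1[i] = 0.8 * ar1[i - 1] + shocks[i]

y_indep = 1 + 0.05 * t + rng.normal(size=n)    # independent errors
y_auto = 1 + 0.05 * t + ar1                    # autocorrelated errors

dw_indep = durbin_watson(ols_resid(X, y_indep))
dw_auto = durbin_watson(ols_resid(X, y_auto))
print("Durbin-Watson, independent errors:   ", dw_indep)
print("Durbin-Watson, autocorrelated errors:", dw_auto)
```

With independent errors the statistic sits near 2. With positive autocorrelation of strength ρ it drifts toward roughly 2(1 − ρ), so well below 1 in this setup.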
Best practices that hold up
Most regression failures are not technical — they're procedural. The list below condenses the practices that working analysts and recent peer-reviewed methodological papers consistently endorse.
- Plot first, model second. Look at every predictor against the response, and at predictors against each other. Half of all surprises in a final model are visible in the data the day you load it.
- Hold out a test set (or cross-validate) for any model used for prediction. In-sample R² always overstates how the model will perform on new data.
- Report effect sizes with confidence intervals, not just p-values. A 95% CI tells the reader how precise your estimate is and lets them judge practical significance.
- Check VIF before you trust any coefficient. Rule of thumb: investigate above 5, fix above 10.
- Examine residuals every time. All four standard plots — fitted vs residual, Q-Q, scale-location, leverage — take seconds and catch most assumption violations.
- Pre-register your model for confirmatory analyses. Specifying predictors and tests in advance protects against p-hacking.
- Report the full model, including non-significant predictors. Selective reporting biases the literature.
- Centre or standardize predictors when interactions or polynomials are involved — it tames collinearity and makes coefficients interpretable at the mean of the data rather than at zero.
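The last point in the list can be seen numerically. The sketch below uses simulated data (the range of x and the curved relationship are assumptions). It compares the correlation between a predictor and its square before and after centring, and confirms that the fitted values do not change.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1000
x = rng.uniform(10, 20, size=n)
y = 5 + 0.8 * x + 0.3 * x**2 + rng.normal(size=n)    # hypothetical curved relationship

xc = x - x.mean()                                    # centred predictor
corr_raw = np.corrcoef(x, x**2)[0, 1]
corr_centered = np.corrcoef(xc, xc**2)[0, 1]
print("corr(x, x²) before centring:", corr_raw)
print("corr(x, x²) after centring: ", corr_centered)

# Same model, two parameterizations: the predictions are identical.
X_raw = np.column_stack([np.ones(n), x, x**2])
X_cen = np.column_stack([np.ones(n), xc, xc**2])
fit_raw = X_raw @ np.linalg.lstsq(X_raw, y, rcond=None)[0]
fit_cen = X_cen @ np.linalg.lstsq(X_cen, y, rcond=None)[0]
print("largest difference in fitted values:", np.max(np.abs(fit_raw - fit_cen)))
```

Centring is a reparameterization, not a different model. It changes what the lower-order coefficients mean (the slope at the mean of x rather than at x = 0) and makes them far less entangled with the squared term.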
Pitfalls, ethics, and responsible use
Multiple regression is one of the most cited and most misused techniques in applied statistics. The technical traps are well known. The ethical ones are quieter but more damaging.
Technical pitfalls
- Stepwise selection. Algorithms that add or drop predictors based on p-values inflate Type-I error and produce models that don't replicate. Use theory, regularization, or out-of-sample validation instead.
- Extrapolation. A linear model is only credible inside the range of the predictors you fit. Predicting beyond it is guesswork.
- Correlation ≠ causation. A regression coefficient describes association, not effect. Causal interpretation requires either an experiment or carefully argued identification.
- Outliers and leverage points. A handful of extreme observations can dominate the fit. Use Cook's distance and decide deliberately whether to keep, downweight, or remove them — and document the choice.
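For the last pitfall, Cook's distance can be computed from the residuals and the leverages. The sketch below is a minimal NumPy version with a hypothetical helper name and simulated data in which one extreme point has been planted on purpose.

```python
import numpy as np

def cooks_distance(X, y):
    """Return (Cook's distance, leverage) for each observation of an OLS fit."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage is the diagonal of the hat matrix; with X = QR it equals
    # the row sums of Q squared.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    s2 = resid @ resid / (n - k)                  # residual variance estimate
    D = resid ** 2 / (k * s2) * h / (1 - h) ** 2
    return D, h

rng = np.random.default_rng(9)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(scale=0.5, size=50)
# Planted observation: far from the other x values and far off the trend.
x = np.append(x, 8.0)
y = np.append(y, -10.0)

X = np.column_stack([np.ones(len(x)), x])
D, h = cooks_distance(X, y)
worst = int(np.argmax(D))
print("most influential observation:", worst)
print("its Cook's distance:", D[worst], " leverage:", h[worst])
```

The planted point is flagged because it combines high leverage with a large residual. Whether to keep, downweight, or remove such a point is still a judgment call that the statistic cannot make for you.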
Ethics and responsible use
When regression models inform decisions about people — credit, hiring, healthcare, insurance — coefficient choices have real consequences. Several considerations recur in modern data-ethics guidance and regulations such as the EU AI Act (2024) and the GDPR's right to explanation.
- Disparate impact. Even when sensitive attributes (race, sex, age) are excluded, correlated proxies (zip code, name, school) can encode them. Test for differential error rates across protected groups.
- Disclosure of methods. Anyone affected by an automated decision is entitled to know, in plain language, what variables drove it. Multiple regression is among the most transparent models — use that as a feature, not just a fallback.
- Data provenance. Be honest about how the data was collected, what's missing, and who was excluded. Coefficients on a non-representative sample do not generalize, no matter how tight the standard errors.
- No fishing. Running dozens of specifications and reporting the one that "works" turns regression into a publication trick. Pre-registration and full reporting are the antidotes.
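One simple way to start the differential-error check mentioned under disparate impact is to compare residuals by group after fitting. The sketch below is an assumption-laden illustration: the data are simulated, `group` is a hypothetical protected attribute that the model never sees, and the size of the built-in group difference is invented.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 2000
group = rng.integers(0, 2, size=n)                  # hypothetical protected attribute, not a model input
x = rng.normal(size=n)
y = 1 + 2 * x + 1.5 * group + rng.normal(size=n)    # simulated outcome that differs by group

X = np.column_stack([np.ones(n), x])                # the model excludes the attribute
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Audit step: summarize the errors separately for each group.
for g in (0, 1):
    r = resid[group == g]
    print(f"group {g}: mean error = {r.mean():.3f}, mean absolute error = {np.abs(r).mean():.3f}")

gap = resid[group == 1].mean() - resid[group == 0].mean()
print("gap in mean error between groups:", gap)
```

A mean error that differs by group means the model systematically over-predicts for one group and under-predicts for the other, even though the attribute was left out. A real audit would use the fairness metrics appropriate to the decision at hand; this only shows the mechanics.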
Quick reference
Glossary (14 terms)
- Coefficient (β): The estimated change in the response variable for a one-unit change in a predictor, holding all other predictors constant.
- Design matrix (X): The matrix of predictor values, conventionally with a leading column of 1s for the intercept.
- Dummy variable: A binary 0/1 column encoding membership in one level of a categorical variable. A k-level categorical predictor needs k−1 dummies.
- Heteroscedasticity: Non-constant variance of residuals. Violates the equal-variance assumption; biases standard errors but not coefficients.
- Homoscedasticity: Constant variance of residuals across all fitted values. The healthy condition.
- Interaction term: A product of two predictors included as a third predictor. Lets the slope of one variable depend on the value of another.
- Leverage: How extreme an observation's predictor values are. High-leverage points pull the fit toward themselves; combined with a large residual, they become influential.
- Multicollinearity: Strong linear dependence among predictors. Inflates coefficient variance and destabilizes inference.
- OLS (Ordinary Least Squares): The estimator that picks β to minimize the sum of squared residuals.
- R² (coefficient of determination): Fraction of variance in y explained by the model. Between 0 and 1 for an OLS fit with an intercept; never falls when predictors are added.
- Residual: The difference between observed and predicted y for a single observation: e = y − ŷ.
- Standard error: The standard deviation of a coefficient's sampling distribution. Drives the t-statistic and confidence interval.
- VIF (Variance Inflation Factor): 1 / (1 − R²ⱼ) where R²ⱼ comes from regressing predictor j on the other predictors. Diagnostic for multicollinearity.
- Adjusted R²: R² penalized by the number of predictors. Tends to drop when junk variables are added; useful for comparing models.
References cited in this lesson
- Statistical Power, Reliability, and Model Assumptions in Multiple Linear Regression — Measurement and Evaluation in Counseling and Development (2026)
- The Five Assumptions of Multiple Linear Regression — Statology
- Detecting Multicollinearity Using Variance Inflation Factors — Penn State STAT 462
- Multicollinearity in Regression Analysis — Statistics By Jim
- Addressing Multicollinearity — UVA Library
- When Can You Safely Ignore Multicollinearity? — Statistical Horizons
- Interpreting Adjusted R-Squared and Predicted R-Squared — Statistics By Jim
- Understanding Diagnostic Plots for Linear Regression Analysis — UVA Library
- Identifying Specific Problems Using Residual Plots — Penn State STAT 462
- Interaction Between Dummy-Coded Categorical Variables — The Analysis Factor
- Applied Statistics with R, Ch. 11: Categorical Predictors and Interactions
- Coefficient of determination — Wikipedia