Learning Module 10: Simple Linear Regression

50 questions available

Estimation and Core Formulas
Simple linear regression (SLR) models the relation between a dependent variable Y and a single independent variable X as Y = b0 + b1 X + epsilon, where b0 is the intercept, b1 the slope, and epsilon the error term. Ordinary least squares (OLS) fits the line by choosing estimates b_hat0 and b_hat1 that minimize the sum of squared residuals SSE = sum (Yi - Yhat_i)^2. The slope estimator equals the sample covariance of X and Y divided by the sample variance of X: b_hat1 = sum (Yi - Ybar)(Xi - Xbar) / sum (Xi - Xbar)^2; the intercept is b_hat0 = Ybar - b_hat1 Xbar. Residuals e_i = Yi - Yhat_i sum to zero by construction.

The sample correlation r between X and Y determines the slope through b_hat1 = r (sY / sX); in SLR, R^2 = r^2 and equals SSR / SST, where SST = sum (Yi - Ybar)^2 (total variation), SSR = sum (Yhat_i - Ybar)^2 (explained variation), and SSE = sum (Yi - Yhat_i)^2 (unexplained variation), so that SST = SSR + SSE. Goodness-of-fit measures include the coefficient of determination R^2; the F-statistic F = MSR / MSE, with MSR = SSR / k and MSE = SSE / (n - k - 1), where k = 1 in SLR; and the standard error of the estimate se = sqrt(MSE).

Hypothesis tests on the slope and intercept use t-statistics: t = (b_hat1 - B1) / SE(b_hat1) and t_intercept = (b_hat0 - B0) / SE(b_hat0), each with n - 2 degrees of freedom in SLR, where SE(b_hat1) = se / sqrt(sum (Xi - Xbar)^2). Testing whether the correlation r equals zero uses t = r sqrt((n - 2)/(1 - r^2)), which is identical to the t-test of slope = 0.

SLR assumes: (1) linearity (the true relation is linear in the parameters and residuals are random); (2) homoskedasticity (constant error variance E(epsilon_i^2) = sigma_e^2 across observations); (3) independence (errors uncorrelated across observations); and (4) normality of residuals (needed for finite-sample inference). Residual plots (residuals versus X, residuals versus time) and histograms or normal probability plots are diagnostic tools for detecting nonlinearity, heteroskedasticity (a pattern or fan shape), autocorrelation (patterns over time or seasonality), and non-normality or outliers.
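A minimal sketch of the OLS formulas above in plain Python; the x and y values are made up for illustration only:

```python
# Toy data; x and y values are made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Slope = sample covariance of X and Y over sample variance of X;
# intercept follows from the sample means.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# ANOVA decomposition SST = SSR + SSE and R^2 = SSR / SST = 1 - SSE / SST
yhat = [b0 + b1 * xi for xi in x]
sst = sum((yi - ybar) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
r2 = 1 - sse / sst

# Residuals sum to zero by construction (up to rounding)
resid_sum = sum(yi - yh for yi, yh in zip(y, yhat))
```

Because the intercept is fitted from the sample means, the residuals sum to zero regardless of the data chosen.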
Heteroskedasticity and autocorrelation invalidate standard error estimates and test statistics; remedies include variable transformation, weighted least squares, adding regime indicator variables, or time-series models. An analysis of variance (ANOVA) table for a regression displays SSR, SSE, and SST with their degrees of freedom and mean squares; the ANOVA F-test evaluates whether the model explains more variance than it leaves unexplained (i.e., whether the slope differs from zero).

Prediction: the predicted Y at Xf is Yhat_f = b_hat0 + b_hat1 Xf, with standard error of the forecast sf = se sqrt(1 + 1/n + (Xf - Xbar)^2 / sum (Xi - Xbar)^2). A (1 - alpha) prediction interval is Yhat_f ± t_{alpha/2, n-2} sf. Prediction uncertainty increases with se, decreases with n, and increases as Xf moves away from Xbar.

Indicator (dummy) independent variables take the values 0 and 1; in SLR with a dummy regressor, the intercept equals the mean of Y when the dummy is 0, and the slope equals the difference in means between the two groups.

Functional forms: when the relationship is nonlinear, transform the variables. The log-lin form (ln Y = b0 + b1 X) yields the percentage change in Y for a unit change in X; the lin-log form (Y = b0 + b1 ln X) yields the absolute change in Y for a percentage change in X; the log-log form (ln Y = b0 + b1 ln X) makes b1 an elasticity: the percent change in Y for a percent change in X. Use fit statistics (R^2, se, F) and residual patterns to choose among transformations.

Outliers and data errors can dramatically alter the slope, R^2, and se; always inspect the data and correct erroneous observations. For inference, choose a significance level alpha, compute t or F statistics, and/or compute p-values; small p-values indicate coefficients significantly different from their hypothesized values. As sample size increases, critical values shrink and smaller correlations or slopes become statistically significant; conversely, large samples can make trivial effects statistically significant, so weigh economic significance alongside statistical significance.
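The prediction formulas can be sketched the same way; every input below (n, the coefficients, SSE, Xbar, the sum of squared X deviations, and the critical t) is a hypothetical value chosen for illustration:

```python
import math

# All inputs are hypothetical values chosen for illustration:
# a fitted SLR with n = 20 observations, b0 = 2.0, b1 = 0.8,
# SSE = 90, Xbar = 4.0, and sum (Xi - Xbar)^2 = 40.
n, b0, b1 = 20, 2.0, 0.8
sse, xbar, sxx = 90.0, 4.0, 40.0

mse = sse / (n - 2)      # k = 1 slope in SLR, so df = n - 2
se = math.sqrt(mse)      # standard error of the estimate

xf = 6.0
yhat_f = b0 + b1 * xf    # point forecast

# Standard error of the forecast: grows as Xf moves away from Xbar
sf = se * math.sqrt(1 + 1 / n + (xf - xbar) ** 2 / sxx)

t_crit = 2.101           # approximate two-sided 5% critical t with 18 df
lo, hi = yhat_f - t_crit * sf, yhat_f + t_crit * sf
```

Note that sf always exceeds se because of the extra 1/n and distance terms, so prediction intervals are wider than the residual scatter alone would suggest.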

Key Points

  • OLS estimates b_hat1 and b_hat0 minimize sum of squared residuals.
  • SST = SSR + SSE; R^2 = SSR / SST = r^2 in SLR.
  • t-tests for slope and intercept use n - 2 df; F-test compares MSR to MSE.
  • Prediction interval uses se and increases with distance of Xf from Xbar.
  • Transformations (log/lin) change interpretation: log-log gives elasticity.
Assumptions and Diagnostics
SLR assumptions are linearity, homoskedasticity, independence, and normality of residuals. Linearity requires that the conditional mean of Y be linear in X (or in a transformation of X); nonlinearity shows up as systematic patterns in residual-versus-X plots. Homoskedasticity (constant variance) is violated if residual spread varies with X or across regimes; detect it by plotting residuals or with formal tests (e.g., Breusch-Pagan). Independence is violated when residuals are autocorrelated (common in time series with seasonality); detect it with residual-versus-time plots and the Durbin-Watson test. Non-normal residuals are a concern mainly in small samples; use histograms, normal probability plots, or tests (e.g., Shapiro-Wilk).

Outliers and high-leverage points can unduly influence the estimates; examine the data and correct errors, or consider robust methods. When the assumptions do not hold, remedies include transforming variables, adding dummy variables for regimes, weighted least squares, generalized least squares, or time-series approaches (ARIMA models, HAC standard errors) for autocorrelation.

Interpret coefficients carefully: the intercept may be meaningless if X = 0 lies outside the feasible range; the slope is the marginal change in Y per unit change in X; and indicator-variable slopes represent differences in group means. Consider both statistical and practical (economic) significance when evaluating results.
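Two of the diagnostics above can be computed directly from the residuals. The residual series here is invented for illustration, and the spread comparison is a crude stand-in for a formal test such as Breusch-Pagan:

```python
# Invented residual series for illustration (note the alternating signs,
# which suggest negative first-order autocorrelation).
resid = [0.5, -0.3, 0.4, -0.6, 0.2, -0.1, 0.7, -0.5, 0.3, -0.4]

# Durbin-Watson statistic: summed squared first differences over summed
# squared residuals. Values near 2 suggest no first-order autocorrelation;
# near 0, positive autocorrelation; near 4, negative autocorrelation.
num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
den = sum(e ** 2 for e in resid)
dw = num / den

# Crude heteroskedasticity check: compare average squared residuals in the
# first and second halves of the sample (ordered by X). A ratio far from 1
# hints at non-constant variance; a formal test (e.g., Breusch-Pagan) is
# preferable in practice.
half = len(resid) // 2
spread1 = sum(e ** 2 for e in resid[:half]) / half
spread2 = sum(e ** 2 for e in resid[half:]) / (len(resid) - half)
ratio = spread2 / spread1
```

For this alternating series the Durbin-Watson statistic lands well above 2, consistent with negative autocorrelation.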

Key Points

  • Examine residual plots to detect violations.
  • Heteroskedasticity and autocorrelation affect standard errors and tests.
  • Remedies: transform variables, use weighted or generalized least squares, add dummies, correct data errors.
  • Outliers can distort slope and R^2; always inspect the raw data and correct errors.
  • Normality matters more in small samples; large samples may rely on CLT for inference.
Hypothesis Tests, ANOVA, and Prediction
Use t-tests to evaluate hypotheses about individual coefficients, two-sided or one-sided. The standard error of the slope depends on se and the spread of X: SE(b_hat1) = se / sqrt(sum (Xi - Xbar)^2). The F-statistic in the ANOVA table tests the null that all slope coefficients are zero; in SLR, F = t^2 for the slope test. The ANOVA table displays SSR, SSE, and SST with degrees of freedom and mean squares, and se = sqrt(MSE).

For prediction, compute Yhat_f = b_hat0 + b_hat1 Xf and the standard error of the forecast sf = se sqrt(1 + 1/n + (Xf - Xbar)^2 / sum (Xi - Xbar)^2). The (1 - alpha) prediction interval is Yhat_f ± t_{alpha/2, n-2} sf. Forecast uncertainty declines with larger n, smaller se, and Xf closer to Xbar.

The p-value is the smallest alpha at which the null is rejected; small p-values indicate stronger evidence against H0. To test slope = 1 or another specific value, use t = (b_hat1 - hypothesized value) / SE(b_hat1). With indicator variables, the slope measures the difference in group means (with pooled variance if needed). Beware of multiple testing and of the effect of sample size on statistical significance; interpret effect sizes as well.
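A quick sketch of the slope t-test, the F = t^2 identity, and a test against a nonzero hypothesized value, using hypothetical estimates:

```python
# Hypothetical fitted values for illustration: b_hat1 = 1.25 with
# SE(b_hat1) = 0.3124 from a small sample (df = n - 2 = 4).
b1, se_b1 = 1.25, 0.3124

# t-statistic for H0: b1 = 0
t_stat = (b1 - 0.0) / se_b1

# In SLR, the ANOVA F-statistic equals the squared slope t-statistic
f_stat = t_stat ** 2

# The same formula tests any hypothesized value, e.g. H0: b1 = 1
t_vs_one = (b1 - 1.0) / se_b1
```

The F = t^2 identity only holds in SLR; with several regressors the F-test pools all slopes while each t-test examines one coefficient.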

Key Points

  • t-statistic for slope uses SE that depends on se and X variance.
  • F-statistic for model fit compares explained to unexplained variance; in SLR F = t^2.
  • Prediction intervals incorporate model uncertainty and distance of Xf from Xbar.
  • p-values quantify smallest alpha at which H0 would be rejected.
  • Testing nonzero constants (e.g., slope = 1) uses the same t formula with hypothesized value.
Functional Forms and Practical Considerations
When the true relation is nonlinear, transform the variables. In the log-lin form (ln Y = b0 + b1 X), b1 approximates the proportional change in Y per unit change in X; in the lin-log form (Y = b0 + b1 ln X), b1 is the absolute change in Y for a proportional change in X (approximately b1 × d ln X); in the log-log form (ln Y = b0 + b1 ln X), b1 is an elasticity: the percent change in Y per percent change in X.

Compare fits using R^2, se, the F-statistic, and residual randomness. Do not compare R^2 across models with different dependent-variable transformations without care; compare se or fit metrics on a common scale instead, and use domain knowledge when selecting a functional form.

Use indicator variables to capture regime shifts, announcement effects, or group differences; the slope on a dummy is the difference in group means. Verify data quality: errors and outliers can produce misleadingly high R^2 and distorted slope estimates, so correct obvious errors and consider robust regression if outliers are genuine. Finally, large samples can make trivial effects statistically significant; assess economic importance in addition to statistical tests.
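The log-log elasticity interpretation can be verified numerically: generate data from an exact power relation Y = c X^b (c and b are arbitrary choices for this sketch), regress ln Y on ln X, and the slope recovers b:

```python
import math

# Generate data from an exact power relation Y = c * X^b; the values
# c = 2.0 and b = 0.5 are arbitrary choices for this sketch.
c, b = 2.0, 0.5
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [c * xi ** b for xi in x]

# Regress ln Y on ln X with the usual OLS formulas
lx = [math.log(xi) for xi in x]
ly = [math.log(yi) for yi in y]
n = len(lx)
lxbar = sum(lx) / n
lybar = sum(ly) / n
b1 = (sum((a - lxbar) * (d - lybar) for a, d in zip(lx, ly))
      / sum((a - lxbar) ** 2 for a in lx))
b0 = lybar - b1 * lxbar

# Because ln Y = ln c + b ln X exactly here, the slope recovers the
# elasticity b and the intercept recovers ln c.
```

With real (noisy) data the recovered slope is only an estimate of the elasticity, but the interpretation is the same.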

Key Points

  • Select functional form based on residual patterns and fit metrics.
  • Log transformations change coefficient interpretation (elasticities and percent changes).
  • Indicator variables let SLR compare group means and capture regime shifts.
  • Always check and clean data; outliers can materially change results.
  • Balance statistical significance with economic/practical significance.

Questions

Question 1

In a simple linear regression of Y on X using OLS, which expression gives the estimated slope coefficient b_hat1?

Question 2

Which equality holds in a correctly estimated simple linear regression with an intercept?

Question 3

If the sample correlation between X and Y in SLR is r = 0.8 and SD(X)=2 and SD(Y)=5, what is the estimated slope b_hat1 (approx)?

Question 4

Which statement best describes R-squared in simple linear regression?

Question 5

You estimate SLR with n = 30 and find SSE = 180. What is the standard error of the estimate se?

Question 6

Which assumption is violated if residuals plotted versus X show a clear U-shaped pattern?

Question 7

In testing H0: b1 = 0 versus Ha: b1 ≠ 0 in SLR with n observations, what is the degrees of freedom for the t-statistic?

Question 8

If sample size n increases while sample correlation r remains fixed, what happens to the t-statistic for testing r = 0?

Question 9

Which of the following changes would reduce the standard error of the slope estimate SE(b_hat1) in SLR?

Question 10

In SLR, the F-statistic for testing whether the model explains variance equals:

Question 11

Which diagnostic plot would best help detect heteroskedasticity in a regression model?

Question 12

When residuals in a time-series regression show seasonally higher positive values every fourth quarter, which assumption is violated?

Question 13

You estimate Y on X and obtain b_hat1 = 1.25, SE(b_hat1) = 0.3124, and df = 4. For a two-sided 5% test of H0: b1 = 0, the critical t is ±2.776. Which conclusion is correct?

Question 14

Which change will widen a 95% prediction interval for Yhat at a specific Xf?

Question 15

What is the proper interpretation of the intercept b_hat0 in SLR?

Question 16

If you regress monthly returns on an indicator variable EARN that equals 1 for months with earnings announcements and 0 otherwise, what does the slope coefficient represent?

Question 17

Which functional form lets you interpret the slope b1 directly as the elasticity of Y with respect to X?

Question 18

You fit SLR and find one observation has unusually large X and large residual; this point is best described as:

Question 19

Which remedy is appropriate if residuals show increasing spread as X increases (heteroskedasticity)?

Question 20

In SLR, you observe R^2 = 0.80 and se = 3.46. Which statement is most accurate?

Question 21

You estimate ln(Y) = b0 + b1 X. A one-unit increase in X leads to what approximate change in Y?

Question 22

Which is true regarding p-values reported for regression coefficients?

Question 23

You have SLR with estimated b_hat1=0.98 and SE(b_hat1)=0.052. Test H0: b1 = 1.0 at 5% level (two-sided). Which result is correct? (t = (0.98-1)/0.052 = -0.385).

Question 24

Which statement about the ANOVA decomposition SST = SSR + SSE is correct?

Question 25

You forecast Y at Xf=6 given b_hat0=4.875, b_hat1=1.25. What is Yhat?

Question 26

Which of the following increases the power of the t-test for a slope coefficient in SLR?

Question 27

If residuals are not normally distributed in a small sample regression, which consequence is most direct?

Question 28

Which of the following is an effect of an outlier caused by data entry error far from the bulk of observations?

Question 29

Which statement about prediction intervals vs. confidence intervals for mean response is true?

Question 30

When comparing two nested models (Model A with only intercept, Model B with intercept and one X), which test evaluates whether X adds explanatory power?

Question 31

If you estimate ln Y = 0.6 + 0.2951 FATO with standard error of the estimate se = 0.2631, what is the interpretation of the coefficient 0.2951?

Question 32

An analyst finds a slope p-value of 0.044 in a regression with n = 6. What is the correct inference at the 5% level?

Question 33

Which phrase best describes heteroskedasticity?

Question 34

You estimate SLR for CPI forecasts: intercept 0.0001 (SE 0.0002), slope 0.9830 (SE 0.0155), n=60. Test H0: slope = 1.0 at 5% two-sided. t = (0.9830 - 1)/0.0155 ≈ -1.097. What conclusion?

Question 35

Which data situation favors use of Spearman rank correlation over Pearson correlation?

Question 36

You have SLR and want a prediction interval for Y at Xf. Which of these reduces width of that interval?

Question 37

Which functional form would you try if scatter of Y vs X shows curvature with increasing slope (convex)?

Question 38

In SLR, what does the standardized residual equal?

Question 39

Which test statistic equals the t-statistic squared in simple linear regression?

Question 40

When should you prefer a lin-log model (Y = b0 + b1 ln X) over lin-lin?

Question 41

Which of these is a direct consequence of autocorrelated residuals in a time-series regression?

Question 42

Which statistic would you compute to examine whether residuals follow a normal distribution in a small-sample regression?

Question 43

You fit SLR and obtain residuals with markedly fatter tails than normal in a small sample. Best action?

Question 44

Which of these is a valid form for a log-lin regression where percent-change interpretation applies?

Question 45

In a time-series SLR of revenue on time, residuals show upward jump each fourth quarter. A suitable regression modification is:

Question 46

Which of the following best describes the standard error of the forecast sf used in prediction intervals?

Question 47

Which of these indicates a good reason to use weighted least squares (WLS)?

Question 48

You test H0: b0 ≤ 3 vs Ha: b0 > 3 and calculate t_intercept = 0.79 with critical one-sided t=2.132. Which decision?

Question 49

Which of these best justifies transforming variables before regression (e.g., log transform)?

Question 50

Which of the following is the most important first step before trusting regression outputs?
