Chapter 8

8.1 Line Fitting, Residuals, and Correlation

In this section, we investigate bivariate data — data collected on two numerical variables at once. We examine criteria for identifying a linear model and introduce a new bivariate summary called correlation. We'll answer questions like:

  • How do we quantify the strength of the linear association between two numerical variables?
  • What does it mean for two variables to have no association, or to have a nonlinear association?
  • Once we fit a model, how do we measure the error in the model's predictions?

Context Pause

Bivariate data shows up everywhere in the real world — student test scores vs. study hours, medication dosage vs. blood pressure, family income vs. college aid. The whole game of regression is to turn "these two quantities look related" into "here's the line that best describes the relationship, and here's how much it misses by." Lines are the simplest possible model, and almost every more sophisticated model in statistics starts with a line as the baseline.

Insight Note

A linear model is a promise: "when \(x\) goes up by one unit, \(y\) goes up by \(b\) units on average." That single promise — a slope — is the most compact useful prediction you can make about two variables. The machinery of this chapter is about (1) when that promise is honest, (2) how to compute the best-fitting slope, and (3) how wrong the predictions typically are.

Learning Objectives

Source: Main Text

By the end of this section, you should be able to:

  1. Distinguish between the observed data point \(y\) and the predicted value \(\hat{y}\) based on a model.
  2. Calculate a residual and draw a residual plot.
  3. Interpret the standard deviation of the residuals.
  4. Interpret the correlation coefficient and estimate it from a scatterplot.
  5. Know and apply the properties of the correlation coefficient.

8.1.1 Fitting a Line to Data

Requests from twelve separate buyers were simultaneously placed with a trading company to purchase Target Corporation stock (ticker TGT, April 26th, 2012). We let \(x\) be the number of shares purchased and \(y\) be the total cost. Because the cost is computed using a linear formula, the linear fit is perfect, and the equation for the line is:

$$ y = 5 + 57.49 x. $$

If we know the number of shares purchased, we can determine the cost from this linear equation with no error. Additionally, we can say that each additional share of the stock cost \$57.49 and that there was a \$5 fee for the transaction.
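As a quick check of the cost formula above, here is a minimal Python sketch. The function name `trade_cost` and the share counts in the loop are illustrative choices, not the twelve actual orders.

```python
FEE = 5.00     # flat transaction fee in dollars (intercept of the equation above)
PRICE = 57.49  # price per share in dollars (slope of the equation above)

def trade_cost(shares):
    """Total cost of a trade: y = 5 + 57.49 x."""
    return FEE + PRICE * shares

# Hypothetical share counts, chosen only to exercise the formula.
for shares in (1, 10, 25):
    print(shares, round(trade_cost(shares), 2))
```

Because the cost is defined by the formula, every (shares, cost) pair produced this way lies exactly on the line.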

Figure 8.1: Total cost of a trade against number of shares purchased.

Perfect linear relationships are unrealistic in almost any natural process. For example, if we took family income (\(x\)), this value would provide some useful information about how much financial support a college may offer a prospective student (\(y\)). However, the prediction would be far from perfect, since other factors play a role in financial support beyond a family's income.

It is rare for all of the data to fall perfectly on a straight line. Instead, it's more common for data to appear as a cloud of points, such as those shown in Figure 8.2. In each case, the data fall around a straight line, even if none of the observations fall exactly on the line. The first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between \(x\) and \(y\). The second plot shows an upward trend that, while evident, is not as strong as the first. The last plot shows a very weak downward trend in the data, so slight we can hardly notice it.

In each of these examples, we can consider how to draw a "best fit line". For instance, we might wonder: should we move the line up or down a little? Should we tilt it more or less? As we move forward in this chapter, we will learn different criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.

Figure 8.2: Three data sets where a linear model may be useful even though the data do not all fall exactly on the line.

We will also see examples in this chapter where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful. One such case is shown in Figure 8.3 where there is a very strong relationship between the variables even though the trend is not linear.

Figure 8.3: A linear model is not useful in this nonlinear case. These data are from an introductory physics experiment.

Context Pause

Almost any real data set has two competing stories: the underlying relationship and the random noise. When the TGT stock cost was computed by a formula, there was no noise — just the formula. But when we measure things in the world (heights, incomes, test scores) the same \(x\)-value produces different \(y\)-values across observations. Line fitting is the act of drawing the cleanest possible signal through that noise.

Insight Note

"Fitting a line by eye" is surprisingly good for getting a rough estimate, but it's also surprisingly inconsistent — ten students drawing lines through the same scatterplot will produce ten noticeably different lines. That's why statisticians eventually defined least squares as the objective criterion: among all the lines someone could draw, least squares picks exactly one, based on minimizing squared prediction error.

Try It Now 8.1.1

Source: Main Text

A coffee shop's daily muffin revenue \(y\) (in dollars) is given exactly by \(y = 2.50 x\), where \(x\) is the number of muffins sold. No fees, no discounts.

(a) If the shop sells 120 muffins, what's the revenue? (b) Does this relationship have any noise? Why or why not? (c) If a student went out and recorded one real coffee shop's muffin sales for 30 days, would you expect the relationship between muffin count and revenue to be exactly linear? Why or why not?

Solution

(a) \(y = 2.50 \times 120 = 300\) dollars.

(b) No — the relationship is defined by a formula, so every data point falls exactly on the line \(y = 2.50x\). There is no residual error.

(c) No — real data almost always includes factors that break a perfect formula: promotional discounts for regulars, day-old muffin markdowns, staff giving free samples, different prices for different muffin types. The scatterplot would still show a strong positive linear trend, but individual points would scatter around the line.

8.1.2 Using Linear Regression to Predict Possum Head Lengths

Brushtail possums are marsupials that live in Australia. A photo of one is shown in Figure 8.4. Researchers captured 104 of these animals and took body measurements before releasing the animals back into the wild. We consider two of these measurements: the total length of each possum, from head to tail, and the head length of each possum.

Figure 8.5 shows a scatterplot for the head length and total length of the 104 possums. Each point represents a single observation from the data.

Figure 8.4: The common brushtail possum of Australia.

Photo by Peter Firminger on Flickr: http://flic.kr/p/6aPTn CC BY 2.0 license.

Figure 8.5: A scatterplot showing head length against total length for 104 brushtail possums. A point representing a possum with head length 94.1 mm and total length 89 cm is highlighted.

The head and total length variables are associated: possums with an above-average total length also tend to have above-average head lengths. While the relationship is not perfectly linear, it could be helpful to partially explain the connection between these variables with a straight line.

We want to describe the relationship between the head length and total length variables in the possum data set using a line. In this example, we will use the total length \(x\) to explain or predict a possum's head length \(y\). When we use \(x\) to predict \(y\), we usually call \(x\) the explanatory variable or predictor variable, and we call \(y\) the response variable. We could fit the linear relationship by eye, as in Figure 8.6. The equation for this line is

$$ \hat{y} = 41 + 0.59 x. $$

A "hat" on \(y\) is used to signify that this is a predicted value, not an observed value. We can use this line to discuss properties of possums. For instance, the equation predicts a possum with a total length of 80 cm will have a head length of

$$ \hat{y} = 41 + 0.59(80) = 88.2 \text{ mm}. $$

The value \(\hat{y}\) may be viewed as an average: the equation predicts that possums with a total length of 80 cm will have an average head length of \(88.2\) mm. The value \(\hat{y}\) is also a prediction: absent further information about an 80 cm possum, this is our best prediction for the head length of a single 80 cm possum.
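The prediction step can be sketched in a few lines of Python. The function name `predict_head_length` is a hypothetical label for the by-eye line \(\hat{y} = 41 + 0.59 x\); nothing here re-fits the model.

```python
def predict_head_length(total_length_cm):
    """Predicted head length in mm from the by-eye line y-hat = 41 + 0.59 x."""
    return 41 + 0.59 * total_length_cm

# Reproduce the worked value for an 80 cm possum.
y_hat_80 = predict_head_length(80)
print(round(y_hat_80, 1))
```

The returned number is a prediction (an estimated average for possums of that length), not a measurement of any individual animal.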

Context Pause

The distinction between \(y\) and \(\hat{y}\) is not cosmetic — it's central. \(y\) is a measured value: a number from the real world, with all its noise and particularity. \(\hat{y}\) is a model's guess: what the line would predict for that \(x\). Every interesting quantity in regression (residuals, standard errors, \(R^2\)) is built from the gap between these two. Keep the notation straight and the chapter gets much easier.

Insight Note

The equation \(\hat{y} = 41 + 0.59 x\) is also an answer to a practical question: "if I tell you a possum is 80 cm long, what's your best guess for its head length?" — \(41 + 0.59 \cdot 80 = 88.2\) mm. Every regression line gives you a lookup table for predictions, with the intercept as the baseline and the slope as the per-unit adjustment.

Try It Now 8.1.2

Source: Main Text

Using the same possum model \(\hat{y} = 41 + 0.59 x\) (where \(x\) is total length in cm and \(y\) is head length in mm):

(a) Predict the head length of a possum with total length 90 cm. (b) A particular 90 cm possum in the data set has head length 94 mm. What's the difference between the observed head length and the predicted head length? (c) If someone told you that \(x = 80\) cm implies \(\hat{y} = 88.2\) mm exactly, would you agree? Why or why not?

Solution

(a) \(\hat{y} = 41 + 0.59 \times 90 = 41 + 53.1 = 94.1\) mm.

(b) Observed 94 mm − predicted 94.1 mm = \(-0.1\) mm. The model slightly over-predicted the head length for this particular possum.

(c) The prediction \(88.2\) mm is exact as an output of the formula, but it is an average across possums with total length 80 cm — individual 80 cm possums will have head lengths scattered around \(88.2\) mm, not exactly \(88.2\) mm.

8.1.3 Residuals

Residuals are the leftover variation in the response variable after fitting a model. Each observation will have a residual, and three of the residuals for the linear model we fit for the possum data are shown in Figure 8.6. If an observation is above the regression line, then its residual — the vertical distance from the observation to the line — is positive. Observations below the line have negative residuals. One goal in picking the right linear model is for these residuals to be as small as possible.

Figure 8.6: A reasonable linear model was fit to represent the relationship between head length and total length.

Let's look closer at the three residuals featured in Figure 8.6. The observation marked by an "×" has a small negative residual of about \(-1\); the observation marked by "+" has a large residual of about \(+7\); and the observation marked by "△" has a moderate residual of about \(-4\). The size of a residual is usually discussed in terms of its absolute value. For example, the residual for "△" is larger than that of "×" because \(|{-4}|\) is larger than \(|{-1}|\).

Residual: Difference Between Observed and Expected

The residual for a particular observation \((x, y)\) is the difference between the observed response and the response we would predict based on the model:

$$ \text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y}. $$

We typically identify \(\hat{y}\) by plugging \(x\) into the model. For the \(i^{th}\) observation, we often write \(e_i = y_i - \hat{y}_i\).

Example 8.1.1

Source: Main Text

The linear fit shown in Figure 8.6 is given as \(\hat{y} = 41 + 0.59 x\). Based on this line, compute and interpret the residual of the observation \((77.0, 85.3)\). This observation is denoted by "×" on the plot. Recall that \(x\) is the total length measured in cm and \(y\) is head length measured in mm.

Solution:

We first compute the predicted value based on the model:

$$ \hat{y} = 41 + 0.59(77.0) = 86.4. $$

Next, we compute the difference of the actual head length and the predicted head length:

$$ \text{residual} = y - \hat{y} = 85.3 - 86.4 = -1.1. $$

The residual for this point is \(-1.1\) mm, which is very close to the visual estimate of \(-1\) mm. For this particular possum with total length 77 cm, the model's prediction for its head length was \(1.1\) mm too high.

Guided Practice 8.2

Source: Main Text

If a model underestimates an observation, will the residual be positive or negative? What about if it overestimates the observation?

Solution

If the model underestimates the observation, then \(\hat{y} < y\), so \(y - \hat{y} > 0\) — the residual is positive.

If the model overestimates the observation, then \(\hat{y} > y\), so \(y - \hat{y} < 0\) — the residual is negative.

Mnemonic: the residual takes the sign of "observed above the line" (positive) vs. "observed below the line" (negative).

Guided Practice 8.3

Source: Main Text

Compute the residual for the observation \((95.5, 94.0)\), denoted by "△" in Figure 8.6, using the linear model \(\hat{y} = 41 + 0.59 x\).

Solution

Predicted value: \(\hat{y} = 41 + 0.59(95.5) = 41 + 56.345 = 97.345\).

Residual: \(y - \hat{y} = 94.0 - 97.345 = -3.345 \approx -3.3\) mm.

This matches the visual estimate of roughly \(-4\) mm for the "△" point. The model predicted a head length about \(3.3\) mm longer than what was actually observed for this 95.5 cm possum.
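The residual calculations in Example 8.1.1 and Guided Practice 8.3 can be sketched in Python as follows. The function names and the text labels for the two points are illustrative; the coordinates and the line come from the examples above.

```python
def predict(x):
    """Predicted head length (mm) from total length x (cm)."""
    return 41 + 0.59 * x

def residual(x, y):
    """Residual = observed y minus predicted y."""
    return y - predict(x)

# The two highlighted observations: (total length in cm, head length in mm).
points = {"x-mark": (77.0, 85.3), "triangle": (95.5, 94.0)}

for label, (x, y) in points.items():
    e = residual(x, y)
    position = "above" if e > 0 else "below"
    print(f"{label}: residual {e:+.1f} mm, point lies {position} the line")
```

A negative residual means the observation sits below the line, so the model overestimated that possum's head length.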

Example 8.1.2

Source: Main Text

Using Figure 8.7, estimate the standard deviation of the residuals for predicting head length from total length with the line \(\hat{y} = 41 + 0.59 x\). Also, interpret the quantity in context.

Solution:

To estimate this graphically, we use the residual plot. The approximate "68–95 rule" for standard deviations applies: approximately \(2/3\) of the points are within \(\pm 2.5\), and approximately \(95\%\) of the points are within \(\pm 5\). So \(2.5\) is a good estimate for the standard deviation of the residuals. The typical error when predicting head length using this model is about \(2.5\) mm.

Figure 8.7: Left: Scatterplot of head length versus total length for 104 brushtail possums. Three particular points have been highlighted. Right: Residual plot for the model shown in left panel.

Standard Deviation of the Residuals

The standard deviation of the residuals, often denoted by \(s\), tells us the typical error in the predictions using the regression model. It can be estimated from a residual plot.
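When the residuals themselves are available, \(s\) can also be computed directly rather than read off a plot. The sketch below assumes the usual convention for a fitted line, \(s = \sqrt{\sum e_i^2 / (n - 2)}\), which this section does not state explicitly. The residual values are made up for illustration and are not the 104 possum residuals.

```python
import math

def residual_sd(residuals, n_params=2):
    """Typical prediction error: sqrt(sum of squared residuals / (n - n_params)).

    n_params=2 assumes a line with an estimated slope and intercept.
    """
    n = len(residuals)
    return math.sqrt(sum(e * e for e in residuals) / (n - n_params))

# Made-up residuals in mm, for illustration only.
example_residuals = [-3.0, 2.5, 1.0, -0.5, 3.0, -2.0, 0.5, -1.5]

s = residual_sd(example_residuals)
# Share of residuals within one s of zero (roughly 2/3 is expected for bell-shaped residuals).
within_1s = sum(abs(e) <= s for e in example_residuals) / len(example_residuals)
print(round(s, 2), within_1s)
```

With only eight values the share within one \(s\) is a rough check at best; the rule of thumb works better on larger data sets such as the possum residuals.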

Example 8.1.3

Source: Main Text

One purpose of residual plots is to identify characteristics or patterns still apparent in data after fitting a model. Figure 8.8 shows three scatterplots with linear models in the first row and residual plots in the second row. Can you identify any patterns remaining in the residuals?

Solution:

In the first data set (first column), the residuals show no obvious patterns. The residuals appear to be scattered randomly around the dashed line that represents \(0\).

The second data set shows a pattern in the residuals. There is some curvature in the scatterplot, which is more obvious in the residual plot. We should not use a straight line to model these data. Instead, a more advanced technique should be used.

The last plot shows very little upward trend, and the residuals also show no obvious patterns. It is reasonable to try to fit a linear model to the data. However, it is unclear whether there is statistically significant evidence that the slope parameter is different from zero. The slope of the sample regression line is not zero, but we might wonder if this could be due to random variation. We will address this sort of scenario in Section 8.4.

Figure 8.8: Sample data with their best-fitting lines (top row) and their corresponding residual plots (bottom row).

Context Pause

A residual plot is the single most useful diagnostic in all of regression. If you look at a residual plot and it looks like random static, your linear model is doing its job. If you see a clear curve, a fanning shape, or a cluster of extreme points, your linear model is failing in a specific way that the plot will tell you about. Always make the residual plot before you trust the line.

Insight Note

A curved pattern in the residuals means the model's errors are systematic, not random — the line is missing real structure in the data. Fanning (residuals growing with \(x\)) means the line's precision isn't constant across \(x\); that violates an assumption we'll need in Section 8.4. You can read a lot of statistics off a single residual plot if you know what to look for.
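To see what "systematic" residuals look like numerically, the sketch below fits a line to deliberately curved, made-up data and prints the residuals. It uses the standard closed-form least squares slope and intercept, which this section mentions but does not derive, so treat the helper `least_squares_line` as an assumption brought in for illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up curved data: y = x^2 on x = 0..6.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

intercept, slope = least_squares_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)
```

The residuals are positive at both ends and negative in the middle. That U-shaped run of signs is the numerical counterpart of the curve you would see in a residual plot.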

Try It Now 8.1.3

Source: Main Text

Suppose the model \(\hat{y} = 41 + 0.59 x\) is used to predict head length from total length. Five possums have the following measurements:

| \(x\) (cm) | \(y\) observed (mm) |
|---:|---:|
| 75 | 82 |
| 80 | 91 |
| 85 | 92 |
| 90 | 94 |
| 95 | 100 |

(a) Compute the predicted \(\hat{y}\) for each possum. (b) Compute the residual \(y - \hat{y}\) for each possum. (c) Which possum's head length was most overpredicted? Most underpredicted?

Solution

(a) Predictions:

| \(x\) | \(\hat{y} = 41 + 0.59 x\) |
|---:|---:|
| 75 | \(41 + 0.59(75) = 85.25\) |
| 80 | \(41 + 0.59(80) = 88.20\) |
| 85 | \(41 + 0.59(85) = 91.15\) |
| 90 | \(41 + 0.59(90) = 94.10\) |
| 95 | \(41 + 0.59(95) = 97.05\) |

(b) Residuals (\(y - \hat{y}\)):

| \(x\) | \(y\) | \(\hat{y}\) | residual |
|---:|---:|---:|---:|
| 75 | 82 | 85.25 | \(-3.25\) |
| 80 | 91 | 88.20 | \(+2.80\) |
| 85 | 92 | 91.15 | \(+0.85\) |
| 90 | 94 | 94.10 | \(-0.10\) |
| 95 | 100 | 97.05 | \(+2.95\) |

(c) Most overpredicted: \(x = 75\) — the model predicted 85.25 mm but the possum's head was only 82 mm (residual \(-3.25\)). Most underpredicted: \(x = 95\) — the model predicted 97.05 mm but the head was 100 mm (residual \(+2.95\)).
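The whole exercise can be reproduced with a short Python sketch. The variable names are illustrative; the five measurements and the line are the ones given in the problem.

```python
def predict(x):
    """Predicted head length (mm) from total length x (cm)."""
    return 41 + 0.59 * x

# (total length in cm, observed head length in mm) for the five possums.
possums = [(75, 82), (80, 91), (85, 92), (90, 94), (95, 100)]

# Each row: x, observed y, predicted y-hat, residual y - y-hat.
rows = [(x, y, predict(x), y - predict(x)) for x, y in possums]

for x, y, y_hat, e in rows:
    print(f"{x:>3} {y:>4} {y_hat:7.2f} {e:+6.2f}")

most_over = min(rows, key=lambda row: row[3])   # most negative residual
most_under = max(rows, key=lambda row: row[3])  # most positive residual
print("most overpredicted: x =", most_over[0])
print("most underpredicted: x =", most_under[0])
```

Sorting by the signed residual, rather than its absolute value, is what separates "overpredicted" from "underpredicted".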

8.1.4 Describing Linear Relationships with Correlation

When a linear relationship exists between two variables, we can quantify the strength and direction of the linear relation with the correlation coefficient, or just correlation for short. Figure 8.9 shows eight plots and their corresponding correlations.

Panel labels recovered from Figure 8.9 (negative-trend panels): \(r = -0.08\), \(r = -0.64\), \(r = -0.92\), \(r = -1.00\).

Figure 8.9: Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a low value in the other.

Only when the relationship is perfectly linear is the correlation either \(-1\) or \(+1\). If the linear relationship is strong and positive, the correlation will be near \(+1\). If it is strong and negative, it will be near \(-1\). If there is no apparent linear relationship between the variables, then the correlation will be near zero.

Correlation Measures the Strength of a Linear Relationship

Correlation, which always takes values between \(-1\) and \(1\), describes the direction and strength of the linear relationship between two numerical variables. The strength can be strong, moderate, or weak.

We compute the correlation using a formula, just as we did with the sample mean and standard deviation. Formally, we can compute the correlation for observations \((x_1, y_1)\), \((x_2, y_2)\), …, \((x_n, y_n)\) using

$$ r = \frac{1}{n - 1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right), $$

where \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) are the sample means and standard deviations for each variable. This formula is rather complex, and we generally perform the calculations on a computer or calculator. We can note, though, that the computation involves taking, for each point, the product of the Z-scores that correspond to the \(x\) and \(y\) values.
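The formula translates almost line for line into Python. The sketch below is one possible implementation using the standard library's `mean` and `stdev` (the sample standard deviation, with divisor \(n - 1\)); the five data points are the small possum table from Try It Now 8.1.3, used only as a convenient test case.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: sum of products of Z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    z_products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (n - 1)

total_length_cm = [75, 80, 85, 90, 95]
head_length_mm = [82, 91, 92, 94, 100]

r = correlation(total_length_cm, head_length_mm)
print(round(r, 2))
```

Points where both Z-scores share a sign contribute positive products, which is why an upward trend pushes \(r\) toward \(+1\).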

Example 8.1.4

Source: Main Text

Take a look at Figure 8.6. How would the correlation between head length and total body length of possums change if head length were measured in cm rather than mm? What if head length were measured in inches rather than mm?

Solution:

Here, changing the units of \(y\) corresponds to multiplying all the \(y\) values by a certain number. This would change the mean and the standard deviation of \(y\), but it would not change the correlation. To see this, imagine dividing every number on the vertical axis by 10. The units of \(y\) are now in cm rather than in mm, but the graph has remained exactly the same. The units of \(y\) have changed, but the relative distance of the \(y\) values about the mean is the same — that is, the Z-scores corresponding to the \(y\) values have remained the same. Converting to inches works the same way: a different multiplier, but Z-scores (and therefore \(r\)) don't budge.
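A quick numerical check of this argument, as a sketch: compute \(r\) with head length in mm, then again after converting to cm and to inches. The five data points come from Try It Now 8.1.3; the conversion factor 25.4 mm per inch and the helper name `correlation` are supplied here for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation as the sum of Z-score products over n - 1."""
    x_bar, y_bar, s_x, s_y = mean(xs), mean(ys), stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

total_cm = [75, 80, 85, 90, 95]
head_mm = [82, 91, 92, 94, 100]
head_cm = [y / 10 for y in head_mm]    # mm to cm
head_in = [y / 25.4 for y in head_mm]  # mm to inches

r_mm = correlation(total_cm, head_mm)
r_cm = correlation(total_cm, head_cm)
r_in = correlation(total_cm, head_in)
print(r_mm, r_cm, r_in)
```

The three values agree up to floating-point rounding, because rescaling \(y\) rescales \(\bar{y}\) and \(s_y\) by the same factor and leaves every Z-score unchanged.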

Example 8.1.5

Source: Main Text

What other variables might help us predict the head length of a possum besides its total length?

Solution:

Perhaps the relationship would be a little different for male possums than female possums, or perhaps it would differ for possums from one region of Australia versus another region. In Chapter 9, we'll learn about how we can include more than one predictor. Before we get there, we first need to better understand how to best build a simple linear model with one predictor.

Changing Units of x and y Does Not Affect the Correlation

The correlation \(r\) between two variables is not dependent on the units in which the variables are recorded. Correlation itself has no units.

Correlation is intended to quantify the strength of a linear trend. Nonlinear trends, even when strong, sometimes produce correlations that do not reflect the strength of the relationship; see three such examples in Figure 8.10.

Panel correlations in Figure 8.10, from first plot to last: \(r = -0.23\), \(r = 0.31\), \(r = 0.50\).

Figure 8.10: Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, the correlation is not very strong, and the relationship is not linear.

Guided Practice 8.7

Source: Main Text

It appears no straight line would fit any of the datasets represented in Figure 8.10. Try drawing nonlinear curves on each plot. Once you create a curve for each, describe what is important in your fit.

Solution

For the first plot (\(r = -0.23\)), a shallow U-shape fits well — the relationship is roughly quadratic.

For the middle plot (\(r = 0.31\)), a piecewise or logarithmic curve would capture the initial rapid rise followed by a plateau.

For the last plot (\(r = 0.50\)), a cubic or S-shaped curve captures the steep middle with flatter ends.

What's important in each fit: (1) the curve follows the direction of the data at each \(x\) value, (2) the vertical distance from points to the curve is roughly constant and small across the range of \(x\), and (3) the curve doesn't extrapolate wildly beyond the data. In every case, the correlation coefficient \(r\) would dramatically understate the strength of the relationship because \(r\) measures only the linear portion of the association.

Example 8.1.6

Source: Main Text

Consider the four scatterplots in Figure 8.11. In which scatterplot is the correlation between \(x\) and \(y\) the strongest?

Solution:

All four data sets have the exact same correlation of \(r = 0.816\) as well as the same equation for the best-fit line. This group of four graphs, known as Anscombe's Quartet, reminds us that knowing the value of the correlation does not tell us what the corresponding scatterplot looks like. It is always important to first graph the data.

Investigate Anscombe's Quartet in Desmos: https://www.desmos.com/calculator/paknt6oneh.

Figure 8.11: Four scatterplots from Desmos with best-fit line drawn in.
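The claim that very different data sets can share one correlation can be checked numerically. The sketch below uses the first two data sets of Anscombe's Quartet as they are commonly tabulated; the values are not listed in this text, so treat them as an assumption and compare them against the Desmos page above before relying on them.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation as the sum of Z-score products over n - 1."""
    x_bar, y_bar, s_x, s_y = mean(xs), mean(ys), stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Anscombe data sets I and II share the same x values (assumed from the standard table).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]  # roughly linear
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]   # smooth curve

r1 = correlation(x, y1)
r2 = correlation(x, y2)
print(round(r1, 3), round(r2, 3))
```

Both correlations round to the same value even though the second data set follows a curve, which is the reason for graphing the data before trusting \(r\).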

Guided Practice 8.8

Source: Main Text

Should we have concerns about applying least squares regression to the four data sets in Figure 8.11?

Solution

Yes — at least until we see a residual plot. Anscombe's Quartet shows that the same correlation and the same best-fit line can come from four very different underlying datasets, only one of which is well-described by a linear model. Three of the four would fail one of the four standard regression conditions (linearity, constant variability, nearly normal residuals, independence). Without a residual plot and a scatterplot check, applying least squares blindly to any data set is a mistake.

Context Pause

Correlation and a scatterplot are partners, not substitutes. A single \(r\) value compresses the whole relationship into one number, which is convenient but can hide a curve, an outlier, or a clustering. If you ever see a correlation reported without a scatterplot, be suspicious — especially if any action (policy, purchase, publication) depends on it.

Insight Note

The sign of \(r\) tells you direction; the magnitude tells you strength. \(|r| \approx 1\) means points hug a line tightly; \(|r| \approx 0\) means no linear trend (possibly no trend at all, possibly a nonlinear trend). A useful rule of thumb: \(|r| > 0.8\) is strong, \(0.5 < |r| < 0.8\) is moderate, and \(|r| < 0.3\) is weak, with values between \(0.3\) and \(0.5\) sitting between weak and moderate. These aren't hard thresholds, but they give you intuition while reading scatterplots.

Try It Now 8.1.4

Source: Main Text

For each of the following, estimate the correlation coefficient \(r\) (just the sign and a rough magnitude) from the described scatterplot.

(a) Points tightly clustered along a line that rises from lower-left to upper-right. (b) Points scattered in a roughly circular cloud with no visible trend. (c) Points forming a clear downward line with small scatter. (d) Points forming a perfect U-shape — steeply down then steeply up.

Solution

(a) \(r \approx +0.9\) — strong positive linear.

(b) \(r \approx 0\) — no linear relationship.

(c) \(r \approx -0.9\) — strong negative linear.

(d) \(r \approx 0\) — counter-intuitive but correct. A perfect U-shape has no linear trend (the up and down portions cancel), so \(r\) is near zero. The relationship is very strong, but it is not linear. This is exactly why \(r\) by itself can mislead — see Example 8.1.6.

Section Summary

  • In Chapter 2 we introduced a bivariate display called a scatterplot, which shows the relationship between two numerical variables. When we use \(x\) to predict \(y\), we call \(x\) the explanatory variable or predictor variable, and we call \(y\) the response variable.
  • A linear model for bivariate numerical data can be useful for prediction when the association between the variables follows a constant, linear trend. Linear models should not be used if the trend between the variables is curved.
  • When we write a linear model, we use \(\hat{y}\) to indicate that it is the model or the prediction. The value \(\hat{y}\) can be understood as a prediction for \(y\) based on a given \(x\), or as an average of the \(y\) values for a given \(x\).
  • The residual is the error between the true value and the modeled value, computed as \(y - \hat{y}\). The order of the difference matters, and the sign of the residual tells us if the model over- or under-predicted a particular data point.
  • The symbol \(s\) in a linear model denotes the standard deviation of the residuals, and it measures the typical prediction error by the model.
  • A residual plot is a scatterplot with the residuals on the vertical axis. The residuals are often plotted against \(x\) on the horizontal axis, but they can also be plotted against \(y\), \(\hat{y}\), or other variables. Two important uses of a residual plot:
      • Residual plots help us see patterns in the data that may not have been apparent in the scatterplot.
      • The standard deviation of the residuals is easier to estimate from a residual plot than from the original scatterplot.
  • Correlation, denoted \(r\), measures the strength and direction of a linear relationship. Important facts:
      • The value of \(r\) is always between \(-1\) and \(1\), inclusive. \(r = -1\) indicates a perfect negative linear relationship and \(r = 1\) indicates a perfect positive linear relationship.
      • \(r = 0\) indicates no linear association between the variables, though a quadratic or other nonlinear association may still exist.
      • Like Z-scores, the correlation has no units. Changing the units in which \(x\) or \(y\) are measured does not affect the correlation.
      • Correlation is sensitive to outliers. Adding or removing a single point can have a big effect on the correlation.
      • Correlation is not causation. Even a very strong correlation cannot prove causation; only a well-designed, controlled, randomized experiment can.
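The sensitivity of \(r\) to a single outlier is easy to demonstrate. In the sketch below, both the five on-the-line points and the added outlier are made-up values chosen for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation as the sum of Z-score products over n - 1."""
    x_bar, y_bar, s_x, s_y = mean(xs), mean(ys), stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Five made-up points lying exactly on the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_before = correlation(xs, ys)

# Add one made-up outlier far to the right and well below the trend.
r_after = correlation(xs + [10], ys + [0])

print(round(r_before, 3), round(r_after, 3))
```

One added point is enough to turn a perfect positive correlation into a weak negative one, so a scatterplot check for outliers should come before any interpretation of \(r\).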

Problem Set

Source: Main Text

Problem 1: Visualize the residuals. The scatterplots shown below each have a superimposed regression line. If we were to construct a residual plot (residuals versus \(x\)) for each, describe what those plots would look like.

(a)

(b)

Problem 8.1 Solution

Step 1 — Describe the residual plot (a): For plot (a), if the scatterplot points hug the regression line fairly tightly with no obvious curvature, the residual plot will look like a random cloud of points centered on zero, with roughly constant spread across \(x\).

Step 2 — Describe the residual plot (b): For plot (b), if the scatterplot shows any curvature (points above the line in the middle, below at the ends, or vice versa), the residual plot will show a U-shape or inverted-U pattern — a systematic curve rather than a random cloud. If the residual plot curves, a linear model is inadequate.

Answer: (a) Random cloud centered at 0. (b) A curved pattern (U-shape or upside-down U) — the residual plot "keeps the curve" the scatterplot already hinted at.

Problem 2: Trends in the residuals. Shown below are two plots of residuals remaining after fitting a linear model to two different sets of data. Describe important features and determine if a linear model would be appropriate for these data. Explain your reasoning.

(a)

(b)

Problem 8.2 Solution

Step 1 — Evaluate residual plot (a): If the plot shows a U-shape or inverted-U (residuals curve systematically), the linearity condition is violated and a linear model is not appropriate. A transformation or a nonlinear model would fit better.

Step 2 — Evaluate residual plot (b): If the plot shows fanning (residuals get larger as \(x\) increases) or clustering at extreme \(x\)-values, the constant-variability condition is violated. A linear model can still be fit, but standard error estimates and p-values will be unreliable — consider a variance-stabilizing transformation.

Answer: Both plots show systematic patterns, so a linear model is not appropriate without first addressing the violations (curve → transform the variables; fanning → log-transform \(y\) or use weighted regression).

Problem 3: Identify relationships, Part I. For each of the six plots, identify the strength of the relationship (e.g. weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.

(a)

(b)

(c)

(d)

(e)

(f)

Problem 8.3 Solution

Step 1 – Match strengths to typical visual patterns: Assess each plot by (1) how tightly points cluster around an imaginable line, and (2) whether the pattern looks straight or curved.

- Strong linear → points hug the line tightly, no curvature.
- Moderate linear → clear trend but visible scatter.
- Weak linear → trend faint; cloud almost round.
- Nonlinear → strong pattern but curved; linear model inappropriate.

Step 2 – Apply to each plot: Without seeing the specific scatterplots, the standard answers for the six typical panels are:

- (a) Strong, positive, linear: linear model reasonable.
- (b) Weak, no clear trend: linear model not very useful.
- (c) Moderate, negative, linear: reasonable.
- (d) Strong but clearly curved: linear model not reasonable; use transformation.
- (e) Strong positive linear: reasonable.
- (f) Near-zero correlation or scattered: not reasonable for a linear fit.

Answer: Linear models are reasonable only when (1) the pattern is roughly straight (not curved) and (2) the spread around the imagined line is roughly constant. Strong correlation alone doesn't justify a linear fit if the shape is curved.

Problem 4: Identify relationships, Part II. For each of the six plots, identify the strength of the relationship (e.g. weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.

Problem 8.4 Solution

Step 1 — Same framework as 8.3: For each panel, judge strength (how tightly clustered) and linearity (straight vs. curved).

Step 2 — Typical answers for the six panels:

- (a) Moderate positive linear: reasonable.
- (b) Strong, clearly curved: not reasonable for a linear fit.
- (c) Weak, noisy, may have a slight negative trend: a linear fit would be of limited value.
- (d) Moderate negative linear: reasonable.
- (e) Strong positive linear: reasonable.
- (f) Data clusters into two groups (bimodal pattern): a single linear fit is misleading; fit separate models or include a group variable.

Answer: Use a linear fit when the pattern is both roughly straight AND not grouped. Strong curvature or clustering are signs to use a different model.

Problem 5: Exams and grades. The two scatterplots below show the relationship between final and mid-semester exam grades recorded during several years for a Statistics course at a university.

(a) Based on these graphs, which of the two exams has the stronger correlation with the final exam grade? Explain.

(b) Can you think of a reason why the correlation between the exam you chose in part (a) and the final exam is higher?

Problem 8.5 Solution

Step 1 — Compare the two correlations visually (a): Whichever scatterplot shows a tighter cluster of points around an imaginary regression line — i.e. less vertical scatter — has the stronger correlation. In a typical course setup, the second midterm (taken closer in time to the final) usually correlates more strongly with the final exam grade than the first.

Step 2 — Possible reasons (b):

- Closer in time → skills tested are more similar.
- Students' study habits and effort have stabilized by the second midterm.
- The earlier midterm may test foundational material that later gets subsumed into higher-level questions on the final.
- Grade trajectory: a student who does well on midterm 2 has usually been doing well all along.

Answer: (a) Whichever plot shows tighter clustering — typically the second midterm. (b) Proximity in time and skill overlap with the final exam.

Problem 6: Spouses, Part I. The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married women in Britain, recording the age (in years) and heights (converted here to inches) of the women and their spouses. The scatterplot on the left shows the spouse's age plotted against the woman's age, and the plot on the right shows spouse's height plotted against the woman's height.

(a) Describe the relationship between the ages of women in the sample and their spouses' ages.

(b) Describe the relationship between the heights of women in the sample and their spouses' heights.

(c) Which plot shows a stronger correlation? Explain your reasoning.

(d) Data on heights were originally collected in centimeters, and then converted to inches. Does this conversion affect the correlation between heights of women in the sample and their spouses' heights?

Problem 8.6 Solution

Step 1 — Age vs. age (a): The scatterplot of spouse's age vs. woman's age typically shows a strong positive linear relationship — partners in a marriage tend to be close in age, with scatter of a few years.

Step 2 — Height vs. height (b): The scatterplot of spouse's height vs. woman's height shows a moderate positive linear relationship — taller women tend to have taller spouses, but with substantially more scatter than the age plot.

Step 3 — Stronger correlation (c): The ages plot usually shows the stronger correlation. Age is a very tight constraint in marriage (most couples are within a few years of each other), whereas height has much more variability.

Step 4 — Unit conversion (d): No — converting centimeters to inches is a linear rescaling. It changes the means and standard deviations but leaves the Z-scores (and therefore \(r\)) unchanged.

Answer: (a) Strong positive linear. (b) Moderate positive linear. (c) Ages — tighter clustering. (d) No — correlation is unaffected by linear unit conversion.

Problem 7: Match the correlation, Part I. Match each correlation to the corresponding scatterplot.

(a) \(r = -0.7\)

(b) \(r = 0.45\)

(c) \(r = 0.06\)

(d) \(r = 0.92\)

(1)

(2)

(3)

Problem 8.7 Solution

Step 1 — Match by strength: Rank correlations by \(|r|\): \(|0.92| > |{-0.7}| > |0.45| > |0.06|\).

Step 2 — Match each correlation:

- \(r = 0.92\) → plot with tight upward-sloping cluster along a line.
- \(r = -0.7\) → plot with clear downward trend, moderate scatter.
- \(r = 0.45\) → plot with weak upward trend, visible but noisy.
- \(r = 0.06\) → near-circular cloud of points, essentially no trend.

Answer: Strongest-to-weakest: (d) \(0.92\) → tightest positive line; (a) \(-0.7\) → clear downward trend; (b) \(0.45\) → weak positive; (c) \(0.06\) → no trend.

Problem 8: Match the correlation, Part II. Match each correlation to the corresponding scatterplot.

(a) \(r = 0.49\)

(b) \(r = -0.48\)

(c) \(r = -0.03\)

(d) \(r = -0.85\)

(3)

(4)

Problem 8.8 Solution

Step 1 — Match by strength: Rank by \(|r|\): \(|-0.85| > |0.49| \approx |-0.48| > |-0.03|\).

Step 2 — Match each correlation:

- \(r = -0.85\) → clear downward line, relatively tight.
- \(r = 0.49\) → weak-to-moderate upward trend.
- \(r = -0.48\) → weak-to-moderate downward trend, similar amount of scatter as \(0.49\) but descending.
- \(r = -0.03\) → scattered cloud with no visible trend.

Answer: (d) \(-0.85\) → tightest downward line; (a) \(0.49\) and (b) \(-0.48\) → moderate opposite-direction trends; (c) \(-0.03\) → no trend.

Problem 9: Speed and height. 1,302 UCLA students were asked to fill out a survey where they were asked about their height, fastest speed they have ever driven, and gender. The scatterplot on the left displays the relationship between height and fastest speed, and the scatterplot on the right displays the breakdown by gender in this relationship.

(a) Describe the relationship between height and fastest speed.

(b) Why do you think these variables are positively associated?

(c) What role does gender play in the relationship between height and fastest driving speed?

Problem 8.9 Solution

Step 1 — Describe the relationship (a): The scatterplot typically shows a weak-to-moderate positive trend: taller students report higher fastest driving speeds. The scatter is substantial — many individual data points deviate from any imaginable line.

Step 2 — Why positive association (b): Gender is a lurking variable. On average, men tend to be taller than women AND men tend to report higher fastest speeds (due to a combination of higher risk-taking, more experience with sports cars, etc.). So height and fastest speed are both correlated with gender, making them correlated with each other even though height doesn't cause fast driving.

Step 3 — Role of gender (c): The right-hand scatterplot broken down by gender likely shows that within each gender group, the height-speed relationship is much weaker or nonexistent. Height only appears correlated with speed because gender drives both. This is a classic example of a confounding variable or Simpson-style pattern.

Answer: (a) Weak-to-moderate positive. (b) Both variables are correlated with gender. (c) Gender is a confounder; within-group association is weak, and the apparent overall association is largely an artifact of the gender mixture.

Problem 10: Guess the correlation. Eduardo and Rosie are both collecting data on number of rainy days in a year and the total rainfall for the year. Eduardo records rainfall in inches and Rosie in centimeters. How will their correlation coefficients compare?

Problem 8.10 Solution

Step 1 — Effect of unit change: Converting from inches to centimeters (or vice versa) multiplies all rainfall values by a constant (\(1 \text{ in} = 2.54 \text{ cm}\)). This is a linear rescaling.

Step 2 — Effect on correlation: Linear rescaling of a variable changes its mean and standard deviation but leaves all Z-scores unchanged. Correlation is computed from Z-scores, so \(r\) is unchanged by the unit conversion.

Answer: Eduardo's and Rosie's correlation coefficients will be identical. Unit conversion does not affect correlation.
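The Z-score argument can be checked numerically. Below is a minimal Python sketch; the five yearly records are made-up values for illustration (not data from the text), and `correlation` is a hypothetical helper that averages the products of Z-scores using \(n - 1\).

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of products of Z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical yearly records (assumed values): rainy days and rainfall
rainy_days = [90, 110, 75, 130, 100]
rain_in = [30.0, 41.5, 24.0, 44.0, 36.5]   # Eduardo: inches
rain_cm = [2.54 * v for v in rain_in]      # Rosie: centimeters

r_inches = correlation(rainy_days, rain_in)
r_cm = correlation(rainy_days, rain_cm)
print(round(r_inches, 6), round(r_cm, 6))  # the two values agree
```

Multiplying every rainfall value by 2.54 multiplies both the deviations from the mean and the standard deviation by 2.54, so the factor cancels inside each Z-score.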

Problem 11: The Coast Starlight, Part I. The Coast Starlight Amtrak train runs from Seattle to Los Angeles. The scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).

(a) Describe the relationship between distance and travel time.

(b) How would the relationship change if travel time was instead measured in hours, and distance was instead measured in kilometers?

(c) Correlation between travel time (in minutes) and distance (in miles) is \(r = 0.636\). What is the correlation between travel time (in hours) and distance (in kilometers)?

Problem 8.11 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a moderate-to-strong positive linear relationship (consistent with \(r = 0.636\) in part (c)): longer segments of track take longer to traverse. There is visible scatter (some segments are slower due to grades, stops, or speed restrictions), but the trend is clear.

Step 2 — Effect of unit changes (b): Converting minutes → hours divides travel time by 60; converting miles → kilometers multiplies distance by 1.609. Both are linear rescalings. The shape of the scatterplot is preserved (just the axis labels and tick values change), and the correlation is unchanged.

Step 3 — New correlation (c): Because both conversions are linear rescalings, \(r\) is invariant:

$$ r_{\text{new}} = r_{\text{original}} = 0.636. $$

Answer: (a) Moderate-to-strong positive linear. (b) Shape preserved; correlation unchanged. (c) \(r = 0.636\).

Problem 12: Crawling babies, Part I. A study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. Infants born during the study year were split into twelve groups, one for each birth month. We consider the average crawling age of babies in each group against the average temperature when the babies are six months old (that's when babies often begin trying to crawl). Temperature is measured in degrees Fahrenheit (°F) and age is measured in weeks.

(a) Describe the relationship between temperature and crawling age.

(b) How would the relationship change if temperature was measured in degrees Celsius (°C) and age was measured in months?

(c) The correlation between temperature in °F and age in weeks was \(r = -0.70\). If we converted the temperature to °C and age to months, what would the correlation be?

Problem 8.12 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a moderate-to-strong negative linear relationship: babies whose six-month mark falls in warmer months tend to start crawling at a younger age (in weeks) than babies whose six-month mark falls in colder months. Interpretation: heavy winter clothing delays crawling.

Step 2 — Effect of unit change (b): \(°F \to °C\) and weeks \(\to\) months are both linear transformations (°C = (°F − 32) × 5/9; months = weeks ÷ 4.345). Linear transformations preserve correlation.

Step 3 — New correlation (c): $$ r_{\text{new}} = r_{\text{original}} = -0.70. $$

Answer: (a) Moderate-to-strong negative linear. (b) Correlation is unchanged. (c) \(r = -0.70\).
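The invariance claimed in Step 2 can be checked from the definition of a Z-score. A sketch, writing a converted variable as \(x' = c + d\,x\) with \(d > 0\) (the symbols \(c\) and \(d\) are introduced here only for illustration):

$$ \bar{x}' = c + d\,\bar{x}, \qquad s_{x'} = d\,s_{x}, \qquad \frac{x' - \bar{x}'}{s_{x'}} = \frac{d\,(x - \bar{x})}{d\,s_{x}} = \frac{x - \bar{x}}{s_{x}}. $$

For °F to °C, \(c = -160/9\) and \(d = 5/9\); for weeks to months, \(c = 0\) and \(d = 1/4.345\). Every Z-score is unchanged, so \(r\) is unchanged. (If \(d\) were negative for one of the variables, its Z-scores would change sign, and so would \(r\).)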

Problem 13: Body measurements, Part I. Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender for 507 physically active individuals. The scatterplot below shows the relationship between height and shoulder girth (over deltoid muscles), both measured in centimeters.

(a) Describe the relationship between shoulder girth and height.

(b) How would the relationship change if shoulder girth was measured in inches while the units of height remained in centimeters?

Problem 8.13 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a moderate positive linear relationship: individuals with larger shoulder girth tend to be taller. There is noticeable scatter, but the trend looks straight rather than curved.

Step 2 — Effect of unit change (b): Converting shoulder girth cm → inches is a linear rescaling (divide by 2.54). Correlation is unchanged. The shape of the scatterplot is preserved; only the \(x\)-axis labels change.

Answer: (a) Moderate positive linear. (b) No change in correlation or shape.

Problem 14: Body measurements, Part II. The scatterplot below shows the relationship between weight measured in kilograms and hip girth measured in centimeters from the data described in Exercise 8.13.

(a) Describe the relationship between hip girth and weight.

(b) How would the relationship change if weight was measured in pounds while the units for hip girth remained in centimeters?

Problem 8.14 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a strong positive linear relationship — larger hip girth corresponds to heavier weight. The relationship is tighter than height-shoulder because hip girth is a direct measure of body mass distribution.

Step 2 — Effect of unit change (b): Converting weight kg → pounds is linear (multiply by 2.2046). Correlation is unchanged; the scatterplot's shape and strength are preserved.

Answer: (a) Strong positive linear. (b) Unchanged.

Problem 15: Correlation, Part I. What would be the correlation between the ages of a set of women and their spouses if the set of women always married someone who was

(a) 3 years younger than themselves?

(b) 2 years older than themselves?

(c) half as old as themselves?

Problem 8.15 Solution

Step 1 — Case (a): 3 years younger: Every woman's spouse is exactly \(3\) years younger, so spouse age = woman's age \(-\ 3\). This is a perfect linear relationship with slope \(+1\) and intercept \(-3\). Correlation: \(r = +1\) (perfectly positive).

Step 2 — Case (b): 2 years older: Spouse age = woman's age \(+\ 2\). Again a perfect linear relationship with slope \(+1\). Correlation: \(r = +1\).

Step 3 — Case (c): half as old: Spouse age = \(0.5 \times\) woman's age. Perfect linear relationship with slope \(+0.5\) (still positive). Correlation: \(r = +1\).

Insight: The magnitude of the slope doesn't affect \(r\) — any perfect linear relationship with positive slope gives \(r = +1\).

Answer: (a) \(r = +1\). (b) \(r = +1\). (c) \(r = +1\). All three scenarios are perfectly linear with positive slope.
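A quick numerical check of all three cases, as a minimal Python sketch. The women's ages are hypothetical values chosen for illustration, and `correlation` is an illustrative helper built from Z-scores, not a function from the text.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of products of Z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

women = [25, 31, 40, 52, 68]  # hypothetical ages (assumed values)
cases = {
    "3 years younger": [age - 3 for age in women],
    "2 years older": [age + 2 for age in women],
    "half as old": [0.5 * age for age in women],
}

r_values = {label: correlation(women, spouse) for label, spouse in cases.items()}
for label, r in r_values.items():
    print(label, round(r, 6))  # each case prints 1.0
```

Any other set of ages with some variation would give the same result, because each rule is an exact line with positive slope.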

Problem 16: Correlation, Part II. What would be the correlation between the annual salaries of males and females at a company if for a certain type of position men always made

(a) \$5,000 more than women?

(b) \(25\%\) more than women?

(c) \(15\%\) less than women?

Problem 8.16 Solution

Step 1 — Case (a): \$5,000 more: Male salary = female salary \(+\ 5{,}000\). Perfect linear relationship with slope \(+1\). \(r = +1\).

Step 2 — Case (b): \(25\%\) more: Male salary = female salary \(\times\ 1.25\). Perfect linear relationship with slope \(+1.25\). \(r = +1\).

Step 3 — Case (c): \(15\%\) less: Male salary = female salary \(\times\ 0.85\). Perfect linear relationship with slope \(+0.85\) (still positive). \(r = +1\).

Insight: Same as 8.15 — any perfect linear relationship with positive slope has \(r = +1\), regardless of the specific slope value or whether the relationship is additive or multiplicative.

Answer: (a) \(r = +1\). (b) \(r = +1\). (c) \(r = +1\).

8.2 Fitting a Line by Least Squares Regression

In this section we answer questions that matter for anyone who uses data to make predictions:

  • How well can we predict financial aid based on family income for a particular college?
  • How do we find, interpret, and apply the least squares regression line?
  • How do we measure the fit of a model and compare different models to each other?
  • Why do models sometimes make predictions that are ridiculous — or even impossible?
Context Pause

Fitting a line by eye (like we did in Section 8.1) is a great first instinct, but it is not reproducible: two people will draw two different lines through the same scatterplot, and neither can defend their choice as best. The least squares regression line solves that problem by giving us an objective, formula-driven answer: out of every possible line, it picks the one that makes the sum of the squared prediction errors as small as possible. It is the default criterion in statistics software and the one behind a graphing calculator's LinReg button.

Insight Note

Squaring the residuals, instead of just adding up their absolute values, sounds arbitrary, but it has teeth. A residual of 4 is not twice as bad as a residual of 2; it is four times as bad (\(4^2 = 16\) vs. \(2^2 = 4\)) under the squared-error criterion. That heavy penalty on large misses is why the least squares line works hard to avoid any single large miss. The flip side is that one unusual point can pull the line noticeably toward itself.

Learning Objectives

Source: Main Text

By the end of this section, you should be able to:

  1. Calculate the slope and \(y\)-intercept of the least squares regression line using the relevant summary statistics. Interpret these quantities in context.
  2. Understand why the least squares regression line is called the least squares regression line.
  3. Interpret the explained variance \(R^2\).
  4. Understand the concept of extrapolation and why it is dangerous.
  5. Identify outliers and influential points in a scatterplot.

8.2.1 An Objective Measure for Finding the Best Line

Fitting linear models by eye is open to criticism since it depends on an individual's preference. In this section, we use least squares regression as a more rigorous approach.

This section considers family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois. Gift aid is financial aid that does not need to be paid back, as opposed to a loan. A scatterplot of the data is shown in Figure 8.12 along with two linear fits. The lines follow a negative trend in the data — students who have higher family incomes tended to have lower gift aid from the university.

Figure 8.12: Gift aid and family income for a random sample of 50 freshman students from Elmhurst College. Two lines are fit to the data, the solid line being the least squares line.

We begin by thinking about what we mean by best. Mathematically, we want a line that has small residuals. One criterion could be to minimize the sum of the residual magnitudes:

$$ \left| y_{1} - \hat{y}_{1} \right| + \left| y_{2} - \hat{y}_{2} \right| + \dots + \left| y_{n} - \hat{y}_{n} \right| $$

which we could accomplish with a computer program. The resulting dashed line in Figure 8.12 shows this fit can be quite reasonable. However, a more common practice is to choose the line that minimizes the sum of the squared residuals:

$$ \left(y_{1} - \hat{y}_{1}\right)^{2} + \left(y_{2} - \hat{y}_{2}\right)^{2} + \dots + \left(y_{n} - \hat{y}_{n}\right)^{2} $$

The line that minimizes the sum of the squared residuals is represented as the solid line in Figure 8.12. This is commonly called the least squares line.

Both lines seem reasonable, so why do data scientists prefer the least squares regression line? One reason is that it is easier to compute by hand and in most statistical software. Another, more compelling, reason is that in many applications a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the residuals accounts for this discrepancy.

In Figure 8.13, we imagine the squared error about a line as actual squares. The least squares regression line minimizes the sum of the areas of these squared errors. In the figure, the sum of the squared error is \(4 + 1 + 1 = 6\). There is no other line about which the sum of the squared error will be smaller.
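The claim that no other line has a smaller sum of squared errors can be spot-checked numerically. The sketch below uses a small hypothetical data set (not the points in Figure 8.13) and an illustrative helper `sse`; it compares the least squares line against a handful of nudged lines, which demonstrates the property without proving it.

```python
from statistics import mean

def sse(xs, ys, a, b):
    """Sum of squared residuals about the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data (assumed values, not taken from Figure 8.13)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

# Least squares slope and intercept
x_bar, y_bar = mean(xs), mean(ys)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

best = sse(xs, ys, a, b)
# Nudge the intercept and slope: the sum of squares never gets smaller
nudged = [sse(xs, ys, a + da, b + db)
          for da in (-0.5, 0.0, 0.5) for db in (-0.2, 0.0, 0.2)]
print(round(best, 4), round(min(nudged), 4))
```

For these points the least squares line is \(\hat{y} = 1.8 + 0.8x\) with a sum of squared residuals of 2.4, and every nudged line does at least as badly.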

Figure 8.13: A visualization of least squares regression using Desmos. Try out this and other interactive Desmos activities at openintro.org/ahss/desmos.

Context Pause

The choice between "sum of absolute residuals" and "sum of squared residuals" is a choice about what kind of mistakes hurt most. Squared error says large misses should be punished disproportionately — a single bad miss of 10 counts as much as 100 small misses of 1. If you were picking a line for a safety-critical setting (medical dosing, structural engineering), that punishment is exactly what you want: avoid the rare catastrophic miss even at the cost of more small misses.

Insight Note

"Least squares" is not a statement about what is true — it is a statement about what is computable and defensible. The line has no magical claim to reality; it is simply the line with the smallest sum of squared vertical gaps. The appeal is that two analysts working with the same data will compute exactly the same line, every time. That reproducibility is worth a lot more than "fitting by eye."

Try It Now 8.2.1

Source: Main Text

Suppose you have four data points and two candidate lines. The residuals from Line A are \(+2, -3, +1, -2\). The residuals from Line B are \(+5, 0, 0, -1\).

(a) Compute the sum of absolute residuals for each line. Which line is "best" by that criterion?

(b) Compute the sum of squared residuals for each line. Which line is "best" by least squares?

(c) Explain in one sentence why the two criteria might pick different winners.

Solution

(a) Sum of absolute residuals:

- Line A: \(|{+2}| + |{-3}| + |{+1}| + |{-2}| = 2 + 3 + 1 + 2 = 8\).
- Line B: \(|{+5}| + |0| + |0| + |{-1}| = 5 + 0 + 0 + 1 = 6\).

Line B wins on absolute residuals.

(b) Sum of squared residuals:

- Line A: \(2^2 + (-3)^2 + 1^2 + (-2)^2 = 4 + 9 + 1 + 4 = 18\).
- Line B: \(5^2 + 0^2 + 0^2 + (-1)^2 = 25 + 0 + 0 + 1 = 26\).

Line A wins on squared residuals.

(c) The squared-residual criterion penalizes big misses disproportionately. Line B's single miss of size 5 contributes 25 on its own, more than Line A's entire sum of squares (18), even though Line A has the larger total absolute error (8 vs. 6).
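The arithmetic in this exercise is easy to reproduce. A minimal Python sketch using the residuals given in the problem (the function names `sum_abs` and `sum_sq` are just illustrative):

```python
# Residuals for the two candidate lines, as given in the exercise
line_a = [2, -3, 1, -2]
line_b = [5, 0, 0, -1]

def sum_abs(residuals):
    """Sum of absolute residuals."""
    return sum(abs(e) for e in residuals)

def sum_sq(residuals):
    """Sum of squared residuals."""
    return sum(e ** 2 for e in residuals)

print(sum_abs(line_a), sum_abs(line_b))  # 8 6   -> Line B is smaller
print(sum_sq(line_a), sum_sq(line_b))    # 18 26 -> Line A is smaller
```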

8.2.2 Finding the Least Squares Line

For the Elmhurst College data, we could fit a least squares regression line for predicting gift aid based on a student's family income and write the equation as:

$$ \widehat{aid} = a + b \times \text{family\_income} $$

Here \(a\) is the \(y\)-intercept of the least squares regression line and \(b\) is the slope of the least squares regression line. \(a\) and \(b\) are both statistics that can be calculated from the data. In the next section we will consider the corresponding population parameters that these statistics attempt to estimate.

We can enter all the data into a statistical software package and easily find the values of \(a\) and \(b\). However, we can also calculate these values by hand, using only the summary statistics:

  • The slope of the least squares line is given by
$$ b = r \, \frac{s_{y}}{s_{x}} $$

where \(r\) is the correlation between the variables \(x\) and \(y\), and \(s_{x}\) and \(s_{y}\) are the sample standard deviations of \(x\) (the explanatory variable) and \(y\) (the response variable).

  • The point of averages \((\bar{x}, \bar{y})\) is always on the least squares line. Plugging this point in for \(x\) and \(y\) in the least squares equation and solving for \(a\) gives:
$$ \bar{y} = a + b\,\bar{x} $$ $$ a = \bar{y} - b\,\bar{x} $$

Finding the Slope and Intercept of the Least Squares Regression Line

The least squares regression line for predicting \(y\) based on \(x\) can be written as \(\hat{y} = a + bx\), with:

$$ b = r\,\frac{s_{y}}{s_{x}} \qquad a = \bar{y} - b\,\bar{x} $$

We first find \(b\), the slope, and then we solve for \(a\), the \(y\)-intercept.

Identifying the Least Squares Line from Summary Statistics

To identify the least squares line from summary statistics (here \(b_{1}\) and \(b_{0}\) are alternative names for the slope \(b\) and the intercept \(a\) above):

- Estimate the slope parameter: \(b_{1} = r\,(s_{y}/s_{x})\).

- Noting that the point \((\bar{x}, \bar{y})\) is on the least squares line, use \(x_{0} = \bar{x}\) and \(y_{0} = \bar{y}\) with the point-slope equation: \(y - \bar{y} = b_{1}(x - \bar{x})\).

- Simplify the equation, which reveals that \(b_{0} = \bar{y} - b_{1}\,\bar{x}\).

Guided Practice 8.9

Source: Main Text

Figure 8.14 shows the sample means for the family income and gift aid as \$101,800 and \$19,940, respectively. Plot the point \((101.8, 19.94)\) on Figure 8.12 to verify it falls on the least squares line (the solid line).

Solution

The point of averages \((\bar{x}, \bar{y}) = (101.8, 19.94)\) must lie on the least squares line, because the line is constructed so that \(a = \bar{y} - b\,\bar{x}\), i.e. \(\bar{y} = a + b\,\bar{x}\). If you plot \((101.8, 19.94)\) on Figure 8.12 and visually check, the point lies exactly on the solid line (the least squares line). This is a universal property of least squares regression, not a coincidence.

Example 8.2.1

Source: Main Text

Using the summary statistics in Figure 8.14, find the equation of the least squares regression line for predicting gift aid based on family income.

Solution:

Step 1 — Compute the slope \(b\) using \(b = r \cdot s_y/s_x\):

$$ b = r\,\frac{s_{y}}{s_{x}} = (-0.499)\,\frac{5.46}{63.2} = -0.0431. $$

Step 2 — Compute the intercept \(a\) using \(a = \bar{y} - b\,\bar{x}\):

$$ a = \bar{y} - b\,\bar{x} = 19.94 - (-0.0431)(101.8) = 19.94 + 4.388 = 24.3. $$

Step 3 — Write out the equation of the least squares line:

$$ \hat{y} = 24.3 - 0.0431\,x \qquad \text{or} \qquad \widehat{aid} = 24.3 - 0.0431 \times \text{family\_income}. $$
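The two-step recipe translates directly into code. A minimal Python sketch, where `least_squares_from_summary` is a hypothetical helper name and the inputs are the summary statistics quoted in this example (in \$1,000s):

```python
def least_squares_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (a, b) for y-hat = a + b*x, using b = r*s_y/s_x and a = y_bar - b*x_bar."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Summary statistics used in Example 8.2.1 (units of $1,000)
a, b = least_squares_from_summary(
    r=-0.499, s_x=63.2, s_y=5.46, x_bar=101.8, y_bar=19.94
)
print(round(b, 4), round(a, 1))  # -0.0431 24.3
```

The code carries the unrounded slope into the intercept calculation, so later digits can differ slightly from a by-hand computation that rounds the slope first.
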
Guided Practice 8.10

Source: Main Text

Using the summary statistics in Figure 8.14, compute the slope for the regression line of gift aid against family income.

Solution

Using \(b = r \cdot s_y/s_x\):

$$ b = (-0.499)\,\frac{5.46}{63.2} = -0.04311\ldots \approx -0.0431. $$

The slope is \(-0.0431\), the same value found in Step 1 of Example 8.2.1.

Example 8.2.2

Source: Main Text

Using the point \((101{,}780,\, 19{,}940)\) from the sample means (measured in raw dollars rather than \$1,000s) and the slope estimate \(b_{1} = -0.0431\) from Guided Practice 8.10, find the least-squares line for predicting aid based on family income.

Solution:

Step 1 — Apply the point-slope equation with \((x_0, y_0) = (101{,}780, 19{,}940)\) and \(b_1 = -0.0431\):

$$ y - y_{0} = b_{1}(x - x_{0}) $$ $$ y - 19{,}940 = -0.0431\,(x - 101{,}780). $$

Step 2 — Expand the right side:

$$ y - 19{,}940 = -0.0431\,x + 0.0431 \times 101{,}780 = -0.0431\,x + 4{,}386.7. $$

Step 3 — Add 19,940 to both sides to isolate \(y\):

$$ y = -0.0431\,x + 4{,}386.7 + 19{,}940 = 24{,}326.7 - 0.0431\,x. $$

The equation simplifies to:

$$ \widehat{aid} = 24{,}327 - 0.0431 \times \text{family\_income}. $$

Here we have replaced \(y\) with \(\widehat{aid}\) and \(x\) with family income to put the equation in context. The final equation should always include a "hat" on the variable being predicted, whether it is a generic \(y\) or a named variable like "aid."

A computer is usually used to compute the least squares line, and a summary table generated using software for the Elmhurst regression line is shown in Figure 8.15. The first column of numbers provides estimates for \(b_{0}\) (the intercept) and \(b_{1}\) (the slope), respectively. These results match Example 8.2.2 (with minor rounding).

Figure 8.15: Summary of least squares fit for the Elmhurst data (raw dollar units). Compare the parameter estimates in the first column to the results of Example 8.2.2.

                 Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      24319.3    1291.5       18.83     <0.0001
family_income    -0.0431    0.0108       -3.98     0.0002

Example 8.2.3

Source: Main Text

Say we wanted to predict a student's family income based on the amount of gift aid received. Would the least squares regression line be the following?

$$ \widehat{aid} = 24.3 - 0.0431 \times \text{family\_income} $$

Solution:

No. The equation we found was for predicting aid, not for predicting family income. To predict family income from aid, we must calculate a new regression line — letting \(y\) be family income and \(x\) be aid.

Step 1 — Recompute the slope with the roles of \(x\) and \(y\) swapped:

$$ b = r\,\frac{s_{y}}{s_{x}} = (-0.499)\,\frac{63.2}{5.46} = -5.776. $$

Step 2 — Recompute the intercept (note \(\bar{y}\) is now family income's mean, 101.8, and \(\bar{x}\) is aid's mean, 19.94):

$$ a = \bar{y} - b\,\bar{x} = 101.8 - (-5.776)(19.94) = 101.8 + 115.17 = 216.97 \approx 217. $$

The fitted line for predicting family income from aid is therefore:

$$ \widehat{\text{family\_income}} = 217 - 5.776 \times aid $$

in units of \$1,000s. (The source prints 607.3 here due to a typo — the correct arithmetic yields roughly 217 for the intercept.) The key lesson is that swapping explanatory and response variables produces a different regression line, not the algebraic inverse of the original.
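To see concretely that the swapped regression is not the algebraic inverse, compare the new slope with the reciprocal of the original one. A short Python sketch using the summary statistics from this example (variable names are illustrative):

```python
# Summary statistics from the example (units of $1,000)
r, s_income, s_aid = -0.499, 63.2, 5.46

slope_aid_on_income = r * s_aid / s_income   # predicts aid from income
slope_income_on_aid = r * s_income / s_aid   # predicts income from aid
inverse_of_first = 1 / slope_aid_on_income   # what solving the first line for x would give

print(round(slope_income_on_aid, 3))  # -5.776
print(round(inverse_of_first, 1))     # -23.2
# The product of the two regression slopes equals r squared, not 1
print(round(slope_aid_on_income * slope_income_on_aid, 3), round(r ** 2, 3))  # 0.249 0.249
```

The two slopes would be reciprocals only if \(r^2 = 1\), that is, only when the points fall exactly on a line.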

A summary table based on computer output is shown in Figure 8.16 for the Elmhurst College data in \$1,000 units. The first column of numbers provides estimates for \(b_{0}\) and \(b_{1}\), respectively.

Figure 8.16: Summary of least squares fit for the Elmhurst College data (in \$1,000s). Compare the parameter estimates in the first column to Example 8.2.1.

                 Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      24.3193    1.2915       18.83     0.0000
family_income    -0.0431    0.0108       -3.98     0.0002

Example 8.2.4

Source: Main Text

Examine the second, third, and fourth columns in Figure 8.16. Can you guess what they represent?

Solution:

We focus on the second row (the slope).

- First column (Estimate = −0.0431): our best estimate for the slope of the population regression line. We call this point estimate \(b\).
- Second column (Std. Error = 0.0108): the standard error of this point estimate, that is, how much the estimate would be expected to vary if we drew repeated samples of fifty freshmen.
- Third column (t value = −3.98): the \(t\) test statistic for the null hypothesis that the slope of the population regression line is zero, computed as \(\text{Estimate}/\text{Std. Error}\). (The rounded values shown give \(-0.0431/0.0108 \approx -3.99\); the software works from unrounded values, which yields \(-3.98\).)
- Fourth column (Pr(>|t|) = 0.0002): the p-value for a two-sided \(t\)-test of that null hypothesis. A p-value this small is strong evidence that the true slope is not zero.

We will get into more details in Section 8.4.

Example 8.2.5

Source: Main Text

What do the first and second columns of Figure 8.17 (below) represent, for a hypothetical regression of seat changes on unemployment rate?

$$ \hat{y} = -7.3644 - 0.8897\,x $$

where \(\hat{y}\) is the predicted change in the number of seats for the president's party, and \(x\) represents the unemployment rate.

Solution:

The entries in the first column represent the least squares estimates \(b_{0}\) and \(b_{1}\), and the values in the second column correspond to the standard errors of each estimate. Using the estimates, we can write the regression line as shown above.

We previously used a \(t\)-test statistic for hypothesis testing in the context of numerical data. Regression is very similar. Since the null value for the slope is 0, the test statistic is:

$$ T = \frac{\text{estimate} - \text{null value}}{\text{SE}} = \frac{-0.8897 - 0}{0.8350} = -1.07. $$

This corresponds to the third column of the regression table.
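The test statistic is a one-line calculation. A minimal Python sketch, where `t_statistic` is a hypothetical helper name and the inputs are the estimates and standard errors quoted in the text:

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """Test statistic: (estimate - null value) / standard error."""
    return (estimate - null_value) / std_error

# Slope for the seats vs. unemployment regression
print(round(t_statistic(-0.8897, 0.8350), 2))  # -1.07

# Slope row of Figure 8.16, using the rounded values printed in the table
print(round(t_statistic(-0.0431, 0.0108), 2))  # -3.99
```

The second result differs from the tabled \(-3.98\) in the last digit because the table's estimate and standard error are themselves rounded.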

Example 8.2.6

Source: Main Text

Suppose a high school senior is considering Elmhurst College. Can she simply use the linear equation we have found to calculate her financial aid from the university?

Solution:

No. Using the equation will provide a prediction or estimate, not a guarantee. As we see in the scatterplot, there is a lot of variability around the line. While the linear equation is good at capturing the trend in the data, there will be significant error in predicting an individual student's aid. Additionally, the data all come from one freshman class, and the way aid is determined by the university may change from year to year. A prediction for a current senior is a rough extrapolation of last year's aid policy.

Try It Now 8.2.2

Source: Main Text

A different college has summary statistics \(\bar{x} = 80\) (family income, in \$1,000s), \(\bar{y} = 15\) (gift aid, in \$1,000s), \(s_x = 50\), \(s_y = 4\), and \(r = -0.6\).

(a) Compute the slope of the least squares regression line for predicting gift aid from family income. (b) Compute the \(y\)-intercept. (c) Write out the regression equation \(\widehat{aid} = a + b \times \text{family\_income}\).

Solution

(a) Slope:

$$ b = r \cdot \frac{s_y}{s_x} = (-0.6) \cdot \frac{4}{50} = -0.048. $$

(b) Intercept:

$$ a = \bar{y} - b\,\bar{x} = 15 - (-0.048)(80) = 15 + 3.84 = 18.84. $$

(c) Regression equation:

$$ \widehat{aid} = 18.84 - 0.048 \times \text{family\_income} \quad (\text{in \$1,000s}). $$
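
The same three steps can be checked in code. The sketch below assumes a hypothetical helper named `least_squares_from_summary`; it only encodes the two formulas \(b = r\,s_y/s_x\) and \(a = \bar{y} - b\,\bar{x}\) used in the solution.

```python
def least_squares_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept a, slope b) of the least squares line from summary statistics."""
    b = r * s_y / s_x          # slope
    a = y_bar - b * x_bar      # the line passes through (x_bar, y_bar)
    return a, b

# Summary statistics from Try It Now 8.2.2 (both variables in $1,000s).
a, b = least_squares_from_summary(x_bar=80, y_bar=15, s_x=50, s_y=4, r=-0.6)
print(f"aid_hat = {a:.2f} + ({b:.3f}) * family_income")
```
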

Context Pause

Notice how little work \(b = r\,s_y/s_x\) actually requires: three summary numbers go in, one slope comes out. You do not need the individual data points, just the correlation and the two standard deviations. That is why 19th-century statisticians (long before computers) could fit regression lines by hand for problems in astronomy and biology. For a single predictor, modern regression software arrives at the same slope, just faster.

Insight Note

The slope formula \(b = r\,s_y/s_x\) is a unit-conversion machine in disguise. Think of \(r\) as the slope in standardized units: when \(x\) is one standard deviation higher, the line predicts \(y\) to be \(r\) standard deviations higher. The ratio \(s_y/s_x\) converts back to the original units, since one standard deviation of \(y\) is \(s_y\) units and one standard deviation of \(x\) is \(s_x\) units. Multiply them and you get "units of \(y\) per unit of \(x\) along the line of best fit," which is exactly the slope.

8.2.3 Interpreting the Coefficients of a Regression Line

Interpreting the coefficients in a regression model is often one of the most important steps in the analysis. A number without interpretation is just a number; in context, the same number is a prediction, an effect size, a policy claim.

Example 8.2.7

Source: Main Text

The slope for the Elmhurst College data for predicting gift aid based on family income was calculated as \(-0.0431\). Interpret this quantity in the context of the problem.

Solution:

You might recall from algebra that slope is change in \(y\) over change in \(x\). Here, both \(x\) and \(y\) are in thousands of dollars. So if \(x\) is one unit (one thousand dollars) higher, the line predicts \(y\) will change by \(-0.0431\) thousand dollars. In other words: for each additional thousand dollars of family income, on average, students receive \(0.0431\) thousand — about \$43.10 — less in gift aid. A higher family income corresponds to less aid because the slope is negative.

Example 8.2.8

Source: Main Text

Examine Figure 8.16, which relates the Elmhurst College aid and student family income. How sure are you that the slope is statistically significantly different from zero? That is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?

Solution:

While the relationship between the variables is not perfect, there is an evident decreasing trend in the data. The p-value in Figure 8.16 is \(0.0002\), which is far below any conventional threshold (like \(0.05\) or \(0.01\)). This strongly suggests the hypothesis test will reject the null claim that the slope is zero.

Example 8.2.9

Source: Main Text

The \(y\)-intercept for the Elmhurst College data for predicting gift aid based on family income was calculated as \(24.3\). Interpret this quantity in the context of the problem.

Solution:

The intercept \(a\) describes the predicted value of \(y\) when \(x = 0\). The predicted gift aid is \(24.3\) thousand dollars if a student's family has no income. The meaning of the intercept is relevant to this application because the family income for some students at Elmhurst is \$0. In other applications, the intercept may have little or no practical value if there are no observations where \(x\) is near zero. Here, it would be acceptable to say that the average gift aid is \$24,300 among students whose families have \$0 in income.

> Interpreting Coefficients in a Linear Model
>
> - The slope, \(b\), describes the average increase or decrease in the \(y\) variable if the explanatory variable \(x\) is one unit larger.
> - The \(y\)-intercept, \(a\), describes the predicted outcome of \(y\) if \(x = 0\). The linear model must be valid all the way to \(x = 0\) for this interpretation to make sense, which in many applications is not the case.

> Interpreting Parameters Estimated by Least Squares
>
> The slope describes the estimated difference in the \(y\) variable if the explanatory variable \(x\) for a case happened to be one unit larger. The intercept describes the average outcome of \(y\) if \(x = 0\) and the linear model is valid all the way to \(x = 0\), which in many applications is not the case.

Guided Practice 8.11

Source: Main Text

In the previous chapter, we encountered a data set that compared the price of new textbooks for UCLA courses at the UCLA Bookstore and on Amazon. We fit a linear model for predicting price at UCLA Bookstore from price on Amazon and get:

$$ \hat{y} = 1.86 + 1.03\,x $$

where \(x\) is the price on Amazon (in dollars) and \(y\) is the price at the UCLA bookstore (in dollars). Interpret the coefficients in this model and discuss whether the interpretations make sense in this context.

Solution

Slope (\(b = 1.03\)): For each additional \$1.00 in the Amazon price of a textbook, the UCLA Bookstore price is about \$1.03 higher on average. This makes sense — books that cost more at one vendor also tend to cost more at the other.

Intercept (\(a = 1.86\)): If Amazon's price were \$0, the UCLA Bookstore price would be predicted to be \$1.86. Since a \$0 Amazon price is essentially outside the realistic data range, this intercept has limited practical meaning; it is a mathematical artifact that sets the height of the line so that it fits the data.

Both interpretations make sense as descriptions of the fitted line; only the slope has real-world applicability to typical textbook prices.

Guided Practice 8.12

Source: Main Text

Can we conclude that if Amazon raises the price of a textbook by \$1.00, the UCLA Bookstore will raise the price of the textbook by \$1.03?

Solution

No. The slope describes an average cross-sectional relationship — looking across many textbooks at one point in time, the UCLA Bookstore prices are on average \$1.03 higher for each extra \$1.00 in Amazon price. That is not the same as a causal, dynamic response of one vendor's pricing to another's. If Amazon raises a specific textbook's price by \$1.00 tomorrow, there is no guarantee the UCLA Bookstore will do anything in response. The slope is a description of the pattern in the data, not a prediction of vendor behavior.

Exercise Caution When Interpreting Coefficients of a Linear Model

- The slope tells us only the average change in \(y\) for each unit change in \(x\); it does not tell us how much \(y\) might change based on a change in \(x\) for any particular individual. Moreover, in most cases, the slope cannot be interpreted in a causal way.

- When a value of \(x = 0\) doesn't make sense in an application, the interpretation of the \(y\)-intercept won't have any practical meaning.

Context Pause

The difference between "the slope is \(-0.0431\)" and "for every extra \$1,000 of family income, the average aid drops by about \$43" is everything. The first is an artifact of the regression. The second is a sentence a student, a parent, or a financial-aid officer can actually use. In practice, interpreting coefficients in plain English is the whole point of fitting a regression in the first place.

Insight Note

A very common mistake is turning a cross-sectional slope into a causal policy claim — "if we cut family income by \$1,000, aid will go up by \$43 on average." That claim requires more than regression can deliver, because confounders (grades, legacy status, major, demonstrated financial need) may be driving both variables. Correlation is not causation, and slope is a numerical summary of correlation. Never treat \(b\) as a policy lever without an experimental design to back it up.

Try It Now 8.2.3

Source: Main Text

A regression of a child's shoe size \(y\) on age (in years) \(x\) gives \(\hat{y} = 3 + 0.8\,x\) for children aged 4 to 12.

(a) Interpret the slope in plain English. (b) Interpret the \(y\)-intercept. Does it have practical meaning? (c) Would you use this equation to predict the shoe size of a newborn (age 0)? Why or why not?

Solution

(a) For each additional year of age, a child's shoe size increases by \(0.8\) on average.

(b) The intercept is \(3\), which is the predicted shoe size at age \(0\). Since the data only cover ages 4 to 12, this intercept has no practical meaning: a newborn does not wear a size 3 on the shoe-sizing scale used here. It is a mathematical anchor for the line, not a real prediction.

(c) No. Age \(0\) is outside the range of the data (ages 4 to 12). Predicting for a newborn would be extrapolation, which is unreliable — see the next subsection.

8.2.4 Extrapolation Is Treacherous

When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February \(6^{\text{th}}\) it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.

Stephen Colbert, April 6, 2010

Linear models can be used to approximate the relationship between two variables. However, these models have real limitations. Linear regression is simply a modeling framework. The truth is almost always much more complex than a simple line. For example, we do not know how the data outside of our limited window will behave.

Example 8.2.10

Source: Main Text

Use the model \(\widehat{aid} = 24.3 - 0.0431 \times \text{family\_income}\) to estimate the aid of another freshman student whose family had income of \$1 million.

Solution:

Step 1 — Convert the income into the model's units. The units of family income are \$1,000s, so \$1,000,000 becomes \(\text{family\_income} = 1000\).

Step 2 — Plug into the model:

$$ \widehat{aid} = 24.3 - 0.0431(1000) = 24.3 - 43.1 = -18.8. $$

The model predicts this student will have \(-\$18{,}800\) in aid — that is, the student would pay \$18,800 on top of tuition. Elmhurst College cannot (or at least does not) require students with high-income families to pay extra on top of tuition to attend. The prediction is obviously ridiculous.

Using a model to predict \(y\)-values for \(x\)-values outside the domain of the original data is called extrapolation. Generally, a linear model is only an approximation of the real relationship between two variables. If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.
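
One way to guard against accidental extrapolation is to have the prediction code report whether the input lies inside the range of the data. The sketch below uses the fitted Elmhurst model from the example; the function name `predict_aid` and the default `observed_range` of 0 to 300 (in \$1,000s) are placeholder assumptions for illustration, not values taken from the data set.

```python
def predict_aid(family_income, observed_range=(0.0, 300.0)):
    """Predict gift aid (in $1,000s) from family income (in $1,000s).

    Returns (prediction, is_extrapolation). The default observed_range
    is a hypothetical placeholder; replace it with the actual minimum
    and maximum income in the data used to fit the line.
    """
    low, high = observed_range
    prediction = 24.3 - 0.0431 * family_income
    is_extrapolation = not (low <= family_income <= high)
    return prediction, is_extrapolation

for income in (100, 1000):
    aid, flagged = predict_aid(income)
    note = " (extrapolation: do not trust)" if flagged else ""
    print(f"income {income}: predicted aid {aid:.2f}{note}")
```

The check does not make the extrapolated number any better; it only makes the problem visible before the number is quoted.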

> Extrapolation Warning
>
> A regression line is valid only within the range of the data used to fit it. Outside that range, the line is a guess, and the guess becomes less trustworthy the farther you go. A model that is excellent at predicting within the data can be catastrophically wrong outside it.

Context Pause

The Colbert quote above is a joke, but the joke works because extrapolation is a real mistake that real people make all the time — projecting short-term trends (spring warming, stock returns, COVID case counts) forward indefinitely and getting apocalyptic or utopian predictions. The math of a line says "just plug in any \(x\)," but the data says "we only looked at \(x\) in this range." When those two disagree, trust the data.

Insight Note

A good check: before you plug an \(x\) into a regression equation, ask "is this \(x\) close to values that were actually in the data?" If not, do not trust the prediction. If you must extrapolate, at minimum acknowledge it explicitly, widen your uncertainty, and think about what physical or economic constraints the line is ignoring (like "aid cannot be negative").

Try It Now 8.2.4

Source: Main Text

Suppose a regression of temperature \(y\) in degrees Fahrenheit on calendar date \(x\) (from February 6th through April 6th in the Colbert quote) gives \(\hat{y} = 10 + 1.17\,x\), where \(x\) is the number of days after February 6th.

(a) Predict the temperature on April 6th (\(x = 59\) days). Does this match the quote? (b) Use the model to "predict" the temperature on August 6th (\(x = 181\) days). (c) Why is the answer in part (b) nonsense, and what is the correct name for this mistake?

Solution

(a) \(\hat{y} = 10 + 1.17 \times 59 = 10 + 69.03 = 79.03\) degrees, or about 79. This agrees with the quote's "almost 80."

(b) \(\hat{y} = 10 + 1.17 \times 181 = 10 + 211.77 = 221.77 \approx 222\) degrees — very close to Colbert's "220 degrees."

(c) Temperatures do not rise monotonically from February through August — they peak in July and fall again. The linear model was fit to two months of data and has no information about the summer's seasonal peak or fall's cooling. Using the equation for August is extrapolation, and the prediction is ridiculous.
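
The day counts and predictions in this exercise can be verified with Python's standard `datetime` module. This is a small sketch; the year 2010 is taken from the date of the quote, and the function name `predicted_temp` is made up for illustration.

```python
from datetime import date

def predicted_temp(days_after_feb6):
    """Linear model from the exercise: degrees Fahrenheit vs. days after February 6th."""
    return 10 + 1.17 * days_after_feb6

start = date(2010, 2, 6)
days_to_april = (date(2010, 4, 6) - start).days
days_to_august = (date(2010, 8, 6) - start).days

print(days_to_april, round(predicted_temp(days_to_april), 2))
print(days_to_august, round(predicted_temp(days_to_august), 2))
```
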

8.2.5 Using R-squared to Describe the Strength of a Fit

We evaluated the strength of the linear relationship between two variables earlier using the correlation, \(r\). However, it is more common to explain the fit of a model using \(R^2\), called R-squared or the explained variance. If provided with a linear model, we might want to describe how closely the data cluster around the linear fit.

Figure 8.17: Gift aid and family income for a random sample of 50 freshman students from Elmhurst College, shown with the least squares regression line \(\hat{y}\) and the average line \(\bar{y}\).

We are interested in how well a model accounts for, or explains, the location of the \(y\) values. The \(R^2\) of a linear model describes how much smaller the variance (in the \(y\) direction) about the regression line is than the variance about the horizontal line \(\bar{y}\). For the Elmhurst College data shown in Figure 8.17, the variance of the response variable (aid received) is

$$ s_{aid}^{2} = 29.8. $$

If we apply our least squares line, the variability in the residuals describes how much variation remains after using the model:

$$ s_{RES}^{2} = 22.4. $$

The reduction in the variance is:

$$ \frac{s_{aid}^{2} - s_{RES}^{2}}{s_{aid}^{2}} = \frac{29.8 - 22.4}{29.8} = \frac{7.4}{29.8} \approx 0.25. $$

(Roughly 25% of the variance in aid is explained by family income.) If we used the simple standard deviation of the residuals, this computation would give exactly \(R^2\). However, the standard way of computing the standard deviation of the residuals is slightly more sophisticated. To avoid any trouble, we can instead use a sum of squares method. If we call the sum of the squared errors about the regression line \(SSRes\) and the sum of the squared errors about the mean \(SSM\), we can define \(R^2\) as:

$$ R^{2} = \frac{SSM - SSRes}{SSM} = 1 - \frac{SSRes}{SSM}. $$

Figure 8.18: (a) The regression line is equivalent to \(\bar{y}\); \(R^{2} = 0\). (b) The regression line passes through all of the points; \(R^{2} = 1\). Try out this and other interactive Desmos activities at openintro.org/ahss/desmos.

Guided Practice 8.13

Source: Main Text

Using the formula for \(R^2\), confirm that in Figure 8.18 (a), \(R^{2} = 0\), and that in Figure 8.18 (b), \(R^{2} = 1\).

Solution

(a) Regression line equals \(\bar{y}\): If the regression line is horizontal at \(\bar{y}\), then every prediction \(\hat{y}_i = \bar{y}\), and the residuals are \(y_i - \hat{y}_i = y_i - \bar{y}\). So \(SSRes = \sum(y_i - \bar{y})^2 = SSM\). Therefore:

$$ R^{2} = 1 - \frac{SSRes}{SSM} = 1 - \frac{SSM}{SSM} = 1 - 1 = 0. $$

(b) Regression line passes through all points: Every \(\hat{y}_i = y_i\), so every residual is \(0\), meaning \(SSRes = 0\). Therefore:

$$ R^{2} = 1 - \frac{0}{SSM} = 1 - 0 = 1. $$
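
The two extreme cases can also be checked numerically with the sum of squares definition \(R^2 = 1 - SSRes/SSM\). The sketch below uses two tiny made-up data sets chosen so that the least squares line is flat in the first and fits perfectly in the second; the function names are illustrative.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y_hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def r_squared(x, y):
    """R^2 = 1 - SSRes / SSM, using the least squares line."""
    a, b = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_m = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_m

x = [1, 2, 3, 4]
flat = r_squared(x, [2, 4, 4, 2])      # least squares line is horizontal at y_bar
perfect = r_squared(x, [3, 5, 7, 9])   # points lie exactly on y = 1 + 2x

print(flat, perfect)
```
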

\(R^2\) Is the Explained Variance

\(R^{2}\) is always between \(0\) and \(1\), inclusive. It tells us the proportion of variation in the \(y\) values that is explained by a regression model. The higher the value of \(R^{2}\), the better the model "explains" the response variable.

The value of \(R^{2}\) is, in fact, equal to \(r^{2}\), where \(r\) is the correlation. This means that \(r = \pm\sqrt{R^{2}}\). Use this fact to answer the next two practice problems.

Guided Practice 8.14

Source: Main Text

If a linear model has a very strong negative relationship with a correlation of \(-0.97\), how much of the variation in the response variable is explained by the linear model?

Solution

Since \(R^2 = r^2\):

$$ R^{2} = (-0.97)^{2} = 0.9409. $$

About \(94\%\) of the variation in the response variable is explained by the linear model. (Very strong relationships translate into very high \(R^2\) values — squaring a number close to \(\pm 1\) stays close to 1.)

Guided Practice 8.15

Source: Main Text

If a linear model has an \(R^{2}\) or explained variance of \(0.94\), what is the correlation?

Solution

Since \(r = \pm\sqrt{R^{2}}\):

$$ r = \pm\sqrt{0.94} \approx \pm 0.970. $$

The sign of \(r\) is whatever the slope's sign is. If the regression line has a positive slope, \(r \approx +0.97\); if negative, \(r \approx -0.97\). Without the scatterplot or the slope, we cannot decide the sign from \(R^2\) alone.

Context Pause

Reporting \(R^2 = 0.25\) is a more actionable statement than reporting "\(r = -0.499\)." Both are correct, but \(R^2\) gives a proportion — "this model explains 25% of the variation in aid, leaving 75% unexplained." That framing helps a reader decide whether the model is good enough for their purpose, or whether they need to collect more variables (a multivariable regression in Chapter 9).

Insight Note

\(R^2\) also goes by a more formal name, common in machine learning and economics: the coefficient of determination. It answers the question "how much of what is going on in \(y\) is captured by the model?" If \(R^2 = 0.95\), the model is doing almost all the work; if \(R^2 = 0.05\), the model is barely doing any. A high \(R^2\) is not a guarantee that the model is correct (Anscombe's Quartet from Section 8.1 has \(R^2 \approx 0.67\) for four very different datasets), and a low \(R^2\) tells you that the line leaves most of the variation in \(y\) unexplained, so other factors matter.

Try It Now 8.2.5

Source: Main Text

A regression of a student's final exam score \(y\) on their midterm score \(x\) produces \(r = 0.80\).

(a) Compute \(R^2\) and interpret it in context. (b) What percentage of the variation in final exam scores is not explained by midterm scores? (c) If a different class gives \(R^2 = 0.49\), what would be the correlation (assuming a positive slope)?

Solution

(a) \(R^2 = 0.80^2 = 0.64\). About 64% of the variation in final exam scores is explained by the linear regression on midterm scores.

(b) \(1 - R^2 = 1 - 0.64 = 0.36\), so 36% of the variation in final exam scores is unexplained — it must come from other factors (study habits, different topics on the final, test-day performance, etc.).

(c) \(r = \sqrt{0.49} = 0.70\) (positive by assumption). A weaker relationship than the first class.
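
Converting between \(r\) and \(R^2\) is a one-line calculation in each direction. The sketch below uses made-up helper names; note that going from \(R^2\) back to \(r\) requires the sign of the slope as an extra input, since squaring discards it.

```python
import math

def r_squared_from_r(r):
    """Explained variance for a simple linear regression with correlation r."""
    return r ** 2

def r_from_r_squared(r_squared, slope_sign):
    """Correlation from R^2; slope_sign should be +1 or -1."""
    return math.copysign(math.sqrt(r_squared), slope_sign)

print(round(r_squared_from_r(0.80), 4))        # part (a)
print(round(1 - r_squared_from_r(0.80), 4))    # part (b)
print(round(r_from_r_squared(0.49, +1), 4))    # part (c)
```
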

8.2.6 Technology: Linear Correlation and Regression

Get started quickly with the Desmos LinReg Calculator (available at openintro.org/ahss/desmos).

Calculator Instructions

TI-84: Finding \(a\), \(b\), \(R^2\), and \(r\) for a Linear Model

Use STAT, CALC, LinReg(a + bx).

1. Press STAT.

2. Right arrow to CALC.

3. Down arrow and choose 8:LinReg(a+bx).

- Caution: choosing 4:LinReg(ax+b) will reverse \(a\) and \(b\).

4. Let Xlist be L1 and Ylist be L2 (enter the \(x\) and \(y\) values in L1 and L2 first).

5. Leave FreqList blank.

6. Leave Store RegEQ blank.

7. Choose Calculate and hit ENTER, which returns:

- a — the \(y\)-intercept of the best fit line

- b — the slope of the best fit line

- r² — \(R^2\), the explained variance

- r — the correlation coefficient

TI-83: Do steps 1-3, then enter the \(x\) list and \(y\) list separated by a comma, e.g. LinReg(a+bx) L1, L2, then hit ENTER.

What to Do If \(R^2\) and \(r\) Do Not Show Up on a TI-83/84

If \(r^2\) and \(r\) do not show up when doing STAT, CALC, LinReg, diagnostics must be turned on. This only needs to be done once and the diagnostics will remain on.

1. Hit 2ND 0 (i.e. CATALOG).

2. Scroll down until the arrow points at DiagnosticOn.

3. Hit ENTER and ENTER again. The screen should now say:

DiagnosticOn

Done

What to Do If a TI-83/84 Returns: `ERR: DIM MISMATCH`

This error means that the lists, generally L1 and L2, do not have the same length.

1. Choose 1:Quit.

2. Choose STAT, Edit and make sure that the lists have the same number of entries.

Casio fx-9750GII: Finding \(a\), \(b\), \(R^2\), and \(r\) for a Linear Model

1. Navigate to STAT (MENU button, then hit the 2 button or select STAT).

2. Enter the \(x\) and \(y\) data into 2 separate lists, e.g. \(x\) values in List 1 and \(y\) values in List 2. Observation ordering should be the same in the two lists. For example, if \((5, 4)\) is the second observation, then the second value in the \(x\) list should be 5 and the second value in the \(y\) list should be 4.

3. Navigate to CALC (F2) and then SET (F6) to set the regression context.

- To change the 2Var XList, navigate to it, select List (F1), and enter the proper list number. Similarly, set 2Var YList to the proper list.

4. Hit EXIT.

5. Select REG (F3), X (F1), and a+bx (F2), which returns:

- a — the \(y\)-intercept of the best fit line

- b — the slope of the best fit line

- r — the correlation coefficient

- r² — \(R^2\), the explained variance

- MSe — Mean squared error, which you can ignore

If you select ax+b (F1), the \(a\) and \(b\) meanings will be reversed.

Guided Practice 8.16

Source: Main Text

The data set loan50, introduced in Chapter 1, contains information on randomly sampled loans offered through Lending Club. A subset of the data matrix is shown in Figure 8.19. Use a calculator to find the equation of the least squares regression line for predicting loan amount from total income based on this subset.

Figure 8.19: Sample of data from loan50.

| | total_income | loan_amount |
|---|---|---|
| 1 | 59000 | 22000 |
| 2 | 60000 | 6000 |
| 3 | 75000 | 25000 |
| 4 | 75000 | 6000 |
| 5 | 254000 | 25000 |
| 6 | 67000 | 6400 |
| 7 | 28800 | 3000 |

Solution

Entering total_income in L1 and loan_amount in L2 on a TI-84 and running LinReg(a+bx) gives approximately:

- \(a \approx 6{,}497\)
- \(b \approx 0.0774\)
- \(r \approx 0.57\)
- \(R^2 \approx 0.33\)

So the fitted line is approximately:

$$ \widehat{\text{loan\_amount}} = 6{,}497 + 0.0774 \times \text{total\_income}. $$

(Displayed values may differ slightly depending on the software's rounding.) The \(R^2\) tells us that total income explains only about 33% of the variation in loan amount for this small sample; roughly two-thirds of the variation comes from other factors.
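
If no calculator is at hand, the same least squares computation can be sketched in plain Python from the seven rows of Figure 8.19. The variable names below are chosen for illustration; the formulas are the usual ones, \(b = S_{xy}/S_{xx}\), \(a = \bar{y} - b\,\bar{x}\), and \(r = S_{xy}/\sqrt{S_{xx} S_{yy}}\).

```python
import math

# The seven rows of Figure 8.19 (subset of loan50).
total_income = [59000, 60000, 75000, 75000, 254000, 67000, 28800]
loan_amount = [22000, 6000, 25000, 6000, 25000, 6400, 3000]

n = len(total_income)
x_bar = sum(total_income) / n
y_bar = sum(loan_amount) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in total_income)
syy = sum((y - y_bar) ** 2 for y in loan_amount)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(total_income, loan_amount))

b = sxy / sxx                    # slope
a = y_bar - b * x_bar            # intercept
r = sxy / math.sqrt(sxx * syy)   # correlation

print(f"a = {a:.0f}, b = {b:.4f}, r = {r:.2f}, R^2 = {r * r:.2f}")
```

Running a check like this alongside the calculator is a quick way to catch a mistyped list entry.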

Context Pause

Calculator and software output is a safety net, not a replacement for the slope and intercept formulas. If you blindly press LinReg and get a line, you won't catch typos in your data, unit mismatches, or outliers that are pulling the line in a bad direction. Always plot the data first (STAT PLOT on a TI, geom_point() in ggplot, etc.) and check that the line makes sense before quoting it.

Insight Note

The labels on calculator output vary in confusing ways: some calculators show LinReg(ax+b) (slope first, intercept second) and some show LinReg(a+bx) (intercept first, slope second). Always confirm which label is which before writing down your equation. A regression line with its two coefficients swapped (for example, the loan model from Guided Practice 8.16 written with the intercept in the slope's position) is obviously wrong, but only if you notice the swap.

Try It Now 8.2.6

Source: Main Text

A calculator returns the following output after running a linear regression on five data points:

- \(a = 15.2\)
- \(b = 2.5\)
- \(r = 0.9\)
- \(r^2 = 0.81\)

(a) Write the regression equation. (b) Interpret \(R^2\) in the context of this model. (c) What is the correlation \(r\), and what does its sign tell you about the scatterplot?

Solution

(a) \(\hat{y} = 15.2 + 2.5\,x\).

(b) About 81% of the variation in \(y\) is explained by the linear model. About 19% is left unexplained.

(c) \(r = 0.9\). The positive sign tells you that as \(x\) increases, \(y\) tends to increase — the scatterplot should show an upward-sloping pattern.

8.2.7 Types of Outliers in Linear Regression

Outliers in regression are observations that fall far from the "cloud" of points. These points are especially important because they can have a strong influence on the least squares line.

Example 8.2.11

Source: Main Text

There are six plots shown in Figure 8.20 along with the least squares line and residual plots. For each scatterplot and residual plot pair, identify any obvious outliers and note how they influence the least squares line. Recall that an outlier is any point that doesn't appear to belong with the vast majority of the other points.

Solution:

1. There is one outlier far from the other points, though it only appears to slightly influence the line.
2. There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn't very influential.
3. There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud doesn't appear to fit very well.
4. There is a primary cloud and then a small secondary cloud of four outliers. The secondary cloud appears to be influencing the line somewhat strongly, making the least squares line fit poorly almost everywhere. There might be an interesting explanation for the dual clouds, which is something that could be investigated.
5. There is no obvious trend in the main cloud of points, and the outlier on the right appears to largely control the slope of the least squares line.
6. There is one outlier far from the cloud; however, it falls quite close to the least squares line and does not appear to be very influential.

Examine the residual plots in Figure 8.20. You will probably find that there is some trend in the main clouds of (3) and (4). In these cases, the outliers influenced the slope of the least squares lines. In (5), data with no clear trend were assigned a line with a large trend simply due to one outlier.

> Leverage
>
> Points that fall horizontally far from the center of the cloud tend to pull harder on the line, so we call them points with high leverage. These points can strongly influence the slope of the least squares line. If a high-leverage point does appear to exert its influence on the slope of the line, as in cases (3), (4), and (5) above, then we call it an influential point. Usually we can say a point is influential if, had we fitted the line without it, the point would have been unusually far from the least squares line.

It is tempting to remove outliers. Don't do this without a very good reason. Models that ignore exceptional (and interesting) cases often perform poorly. For instance, if a financial firm ignored the largest market swings — the "outliers" — it would soon go bankrupt by making poorly thought-out investments.

> Don't Ignore Outliers When Fitting a Final Model
>
> If there are outliers in the data, they should not be removed or ignored without a good reason. Whatever final model is fit to the data would not be very helpful if it ignores the most exceptional cases.

Figure 8.20: Six plots, labeled (1) through (6), each with a least squares line and residual plot. All data sets have at least one outlier.

> Outlier Terminology
>
> - Outlier: an observation that falls far from the bulk of the data.
> - High-leverage point: a point with an \(x\)-value far from the center of the \(x\) distribution. High leverage does not imply high influence; a high-leverage point on the regression line can be harmless.
> - Influential point: a high-leverage point whose removal would noticeably change the slope (or intercept) of the least squares line.

Context Pause

Outliers are one of the most common reasons a linear model is misleading. A single extreme observation can swing the slope, inflate or deflate \(R^2\), and make the line tell a different story than the bulk of the data. Always make a scatterplot before trusting a regression, and pay special attention to points at the extreme ends of the \(x\) range — those are the ones with the most leverage.

Insight Note

The decision to keep or drop an outlier is a scientific judgment, not a statistical one. A data-entry error or a malfunctioning sensor is a defensible reason to drop a point. "It doesn't fit my model" is not. When unsure, report the regression twice — once with the outlier and once without — and let the reader see how the result depends on that point.

Try It Now 8.2.7

Source: Main Text

A scatterplot shows 19 points in a tight upward line, plus one point far to the right of the cloud. Consider three candidates for this outlier:

- Candidate A: the outlier sits right on the line the other 19 points would produce.
- Candidate B: the outlier sits far above the line the other 19 points would produce.
- Candidate C: the outlier sits on the line of the 19 points but at \(x\) twice as far as any other \(x\).

For each candidate, classify the point as (i) high leverage, yes/no and (ii) influential, yes/no.

Solution

- Candidate A: Far right → high leverage yes. On the line → would not change the slope if removed → influential no.
- Candidate B: Far right → high leverage yes. Off the line → removing it would visibly change the slope → influential yes.
- Candidate C: Far right, on the line → high leverage yes, influential no (same reasoning as A).

Key takeaway: high leverage + far-from-line = influential. High leverage alone is not enough.
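
The difference between Candidates A and B can be illustrated numerically. The sketch below uses 19 made-up points close to the line \(y = 2 + 0.5x\), then adds a far-right point at \(x = 40\) either on the line fitted to those 19 points (Candidate A) or 18 units above it (Candidate B). All data values and names here are invented for illustration.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y_hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# 19 made-up points in a tight upward pattern (small alternating noise).
xs = list(range(1, 20))
ys = [2 + 0.5 * x + (0.1 if x % 2 == 0 else -0.1) for x in xs]
a0, b0 = fit_line(xs, ys)

x_out = 40                      # far to the right of the cloud: high leverage
y_on_line = a0 + b0 * x_out     # Candidate A: exactly on the 19-point line
y_off_line = y_on_line + 18     # Candidate B: far above that line

_, slope_with_a = fit_line(xs + [x_out], ys + [y_on_line])
_, slope_with_b = fit_line(xs + [x_out], ys + [y_off_line])

print(f"slope of 19 points:     {b0:.3f}")
print(f"slope with Candidate A: {slope_with_a:.3f}")
print(f"slope with Candidate B: {slope_with_b:.3f}")
```

Both added points have the same leverage because they share the same \(x\)-value; only the one that sits off the line changes the slope.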

8.2.8 Categorical Predictors with Two Levels (Special Topic)

Categorical variables are also useful in predicting outcomes. Here we consider a categorical predictor with two levels (recall that a level is the same as a category). We'll consider eBay auctions for a video game, Mario Kart for the Nintendo Wii, where both the total price of the auction and the condition of the game were recorded. Here we want to predict total price based on game condition, which takes values used and new. A plot of the auction data is shown in Figure 8.21.

Figure 8.21: Total auction prices for the game Mario Kart, divided into used (\(x = 0\)) and new (\(x = 1\)) condition games with the least squares regression line shown.

Figure 8.22: Least squares regression summary for the Mario Kart data.

To incorporate the game condition variable into a regression equation, we must convert the categories into a numerical form. We do so using an indicator variable called cond_new, which takes value \(1\) when the game is new and \(0\) when the game is used. Using this indicator variable, the linear model may be written as:

$$ \widehat{price} = \alpha + \beta \times \text{cond\_new}. $$

The fitted model is summarized in the table below, and the model with its parameter estimates is given as:

$$ \widehat{price} = 42.87 + 10.90 \times \text{cond\_new}. $$

For categorical predictors with two levels, the linearity assumption will always be satisfied. However, we must evaluate whether the residuals in each group are approximately normal with equal variance. Based on Figure 8.21, both of these conditions are reasonably satisfied.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 42.87 | 0.81 | 52.67 | 0.0000 |
| cond_new | 10.90 | 1.26 | 8.66 | 0.0000 |

Example 8.2.12

Source: Main Text

Interpret the two parameters estimated in the model for the price of Mario Kart in eBay auctions.

Solution:

Intercept (\(\alpha = 42.87\)): The estimated price when cond_new takes value \(0\) — i.e. when the game is in used condition. The average selling price of a used version of the game is \$42.87.

Slope (\(\beta = 10.90\)): On average, new games sell for about \$10.90 more than used games. In this categorical setup, the slope is no longer a per-unit change — it is the group difference between the two levels of the categorical variable.

> Interpreting Model Estimates for Categorical Predictors
>
> The estimated intercept is the value of the response variable for the first category (the category corresponding to an indicator value of 0). The estimated slope is the average change in the response variable between the two categories.

Context Pause

Indicator variables are the bridge between categorical data and the regression machinery we built for numerical data. If you know how to handle the Mario Kart used-vs-new split, you can handle male-vs-female, treatment-vs-control, or AM-vs-PM. For categorical predictors with more than two levels (three genres, four cities, five brands), you add more indicator variables — one for each level except one "reference" level. Chapter 9 covers the details.

Insight Note

When a regression has only a single two-level categorical predictor, the slope equals the difference between the two group means. That is: \(\beta = \bar{y}_{\text{new}} - \bar{y}_{\text{used}}\). In this case, \$42.87 + \$10.90 = \$53.77 is the mean new-game price. The machinery of least squares reduces to an everyday "what's the gap between two group averages" calculation — a two-sample \(t\)-test in disguise.
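The identity in this note can be verified directly. The sketch below uses made-up prices (not the actual Mario Kart auction data) and hypothetical variable names; it fits a least squares line to a 0/1 indicator and compares the result with the two group means.

```python
from statistics import mean

# Hypothetical auction prices (made-up numbers, not the Mario Kart data).
used_prices = [38.0, 41.5, 44.0, 45.5, 43.0]
new_prices = [50.0, 55.5, 52.0, 57.5]

# Indicator variable: 0 for used, 1 for new.
x = [0] * len(used_prices) + [1] * len(new_prices)
y = used_prices + new_prices

# Least squares slope and intercept from the usual sums of squares.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

print(f"intercept = {intercept:.2f}, mean of used group = {mean(used_prices):.2f}")
print(f"slope = {slope:.2f}, difference in group means = {mean(new_prices) - mean(used_prices):.2f}")
```

The intercept matches the mean of the group coded 0, and the slope matches the gap between the two group means, whatever numbers are placed in the two lists.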

Try It Now 8.2.8

Source: Main Text

A regression of a used car's selling price \(y\) (in \$1,000s) on whether the car has leather seats (leather = 1 for yes, 0 for no) gives:

$$ \widehat{price} = 18.2 + 4.7 \times \text{leather}. $$

(a) What is the average selling price of a car with no leather seats? (b) What is the average selling price of a car with leather seats? (c) Interpret the slope in plain English.

Solution

(a) When leather = 0: \(\widehat{price} = 18.2 + 4.7(0) = 18.2\) thousand dollars, i.e. \$18,200.

(b) When leather = 1: \(\widehat{price} = 18.2 + 4.7(1) = 22.9\) thousand dollars, i.e. \$22,900.

(c) On average, used cars with leather seats sell for \$4,700 more than used cars without leather seats. The slope for a binary categorical predictor is the gap between the two group means.

Section Summary

  • We define the best fit line as the line that minimizes the sum of the squared residuals (errors) about the line. That is, we find the line that minimizes \((y_{1} - \hat{y}_{1})^{2} + (y_{2} - \hat{y}_{2})^{2} + \cdots + (y_{n} - \hat{y}_{n})^{2} = \sum(y_{i} - \hat{y}_{i})^{2}\). We call this line the least squares regression line.
  • We write the least squares regression line in the form \(\hat{y} = a + bx\), and we can calculate \(a\) and \(b\) based on the summary statistics as follows:
$$ b = r\,\frac{s_{y}}{s_{x}} \qquad \text{and} \qquad a = \bar{y} - b\,\bar{x}. $$
  • Interpreting the slope and \(y\)-intercept of a linear model:
      • The slope, \(b\), describes the average increase or decrease in the \(y\) variable if the explanatory variable \(x\) is one unit larger.
      • The \(y\)-intercept, \(a\), describes the average or predicted outcome of \(y\) if \(x = 0\). The linear model must be valid all the way to \(x = 0\) for this to make sense, which in many applications is not the case.
  • Two important considerations about the regression line:
      • The regression line provides estimates or predictions, not actual values. It is important to know how large \(s\), the standard deviation of the residuals, is in order to know about how much error to expect in these predictions.
      • The regression line estimates are only reasonable within the domain of the data. Predicting \(y\) for \(x\) values outside the domain (extrapolation) is unreliable and may produce ridiculous results.
  • Using \(R^2\) to assess the fit of the model:
      • \(R^{2}\), called R-squared or the explained variance, is a measure of how well the model explains or fits the data. \(R^{2}\) is always between 0 and 1, inclusive (or between 0% and 100%, inclusive). The higher the value of \(R^{2}\), the better the model "fits" the data.
      • The \(R^{2}\) for a linear model describes the proportion of variation in the \(y\) variable that is explained by the regression line.
      • \(R^{2}\) applies to any type of model, not just a linear model, and can be used to compare the fit among various models.
      • The correlation is \(r = -\sqrt{R^{2}}\) or \(r = \sqrt{R^{2}}\). The value of \(R^{2}\) is never negative, so it cannot tell us the direction of the association. If finding \(r\) based on \(R^{2}\), use either the scatterplot or the slope of the regression line to determine the sign of \(r\).
  • When a residual plot of the data appears as a random cloud of points, a linear model is generally appropriate. If a residual plot has any type of pattern or curvature (such as a U-shape), a linear model is not appropriate.
  • Outliers in regression are observations that fall far from the "cloud" of points.
  • An influential point is a point that has a big effect or pull on the slope of the regression line. Points that are outliers in the \(x\) direction will have more pull on the slope of the regression line and are more likely to be influential points.
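The formulas in this summary can be tied together in a few lines. The sketch below uses made-up data and hypothetical variable names; it computes \(r\), then \(b = r\,s_y/s_x\) and \(a = \bar{y} - b\,\bar{x}\), and checks that \(R^2\) computed from the residuals agrees with \(r^2\). The \(n - 2\) divisor used for \(s\) is an assumption about how the standard deviation of the residuals is defined earlier in the chapter.

```python
from statistics import mean, stdev

# Hypothetical data (made-up numbers for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation as the (n - 1)-averaged product of Z-scores.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)

# Least squares slope and intercept from the summary statistics.
b = r * s_y / s_x
a = y_bar - b * x_bar

# Residuals, R-squared, and the standard deviation of the residuals.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst
s = (sse / (n - 2)) ** 0.5  # assumed definition: divide by n - 2

print(f"y-hat = {a:.3f} + {b:.3f} x")
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R^2 = {r_squared:.4f}, s = {s:.4f}")
```

For a least squares line the residuals sum to zero, and \(R^2\) from the residuals matches \(r^2\) up to floating-point rounding.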

Problem Set

Source: Main Text

Problem 1: Units of regression. Consider a regression predicting weight (kg) from height (cm) for a sample of adult males. What are the units of the correlation coefficient, the intercept, and the slope?

Problem 8.17 Solution

Step 1 — Units of the correlation coefficient: The correlation coefficient \(r\) is computed from Z-scores, which have no units (Z-scores are dimensionless because they divide a length by a length, or a mass by a mass, etc.).

Step 2 — Units of the intercept: The intercept \(a\) is the predicted value of \(y\) when \(x = 0\). Here \(y\) is weight in kg, so the intercept has units of kg.

Step 3 — Units of the slope: The slope \(b\) is change in \(y\) per unit change in \(x\), so its units are "units of \(y\)" per "unit of \(x\)" — here, kg per cm (or kg/cm).

Answer: Correlation: unitless. Intercept: kg. Slope: kg/cm.

Problem 2: Which is higher? Determine if I or II is higher or if they are equal. Explain your reasoning. For a regression line, the uncertainty associated with the slope estimate, \(b\), is higher when:

I. there is a lot of scatter around the regression line, or II. there is very little scatter around the regression line.

Problem 8.18 Solution

Step 1 — Relate scatter to slope uncertainty: The standard error of the slope is roughly \(\text{SE}_b = s / (s_x \sqrt{n-1})\), where \(s\) is the standard deviation of the residuals. More scatter around the line means a larger \(s\), which means a larger SE, which means more uncertainty in the slope estimate.

Step 2 — Compare cases I and II: Case I (lots of scatter) produces a large \(s\) and thus a large \(\text{SE}_b\). Case II (little scatter) produces a small \(s\) and a small \(\text{SE}_b\).

Answer: I is higher. The slope estimate is less certain (larger standard error) when there is more scatter around the regression line.
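As a quick numerical illustration of the formula quoted in Step 1, the sketch below evaluates \(\text{SE}_b = s / (s_x \sqrt{n-1})\) for two made-up scenarios that differ only in the residual scatter \(s\). The function name `se_slope` and all input values are hypothetical.

```python
import math


def se_slope(s, s_x, n):
    """Standard error of the slope, SE_b = s / (s_x * sqrt(n - 1))."""
    return s / (s_x * math.sqrt(n - 1))


# Hypothetical values: same x-spread and sample size, different residual scatter.
se_lots_of_scatter = se_slope(s=5.0, s_x=2.0, n=26)
se_little_scatter = se_slope(s=1.0, s_x=2.0, n=26)

print(f"Case I  (s = 5.0): SE_b = {se_lots_of_scatter:.2f}")
print(f"Case II (s = 1.0): SE_b = {se_little_scatter:.2f}")
```

With \(s_x\) and \(n\) held fixed, the standard error scales in direct proportion to \(s\), so five times the scatter gives five times the uncertainty in the slope.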

Problem 3: Over-under, Part I. Suppose we fit a regression line to predict the shelf life of an apple based on its weight. For a particular apple, we predict the shelf life to be \(4.6\) days. The apple's residual is \(-0.6\) days. Did we over- or under-estimate the shelf-life of the apple? Explain your reasoning.

Problem 8.19 Solution

Step 1 — Recall the residual definition: \(\text{residual} = y - \hat{y}\), where \(y\) is the observed value and \(\hat{y}\) is the predicted value.

Step 2 — Solve for the observed \(y\): Given \(\hat{y} = 4.6\) and residual \(= -0.6\):

$$ y = \hat{y} + \text{residual} = 4.6 + (-0.6) = 4.0 \text{ days}. $$

Step 3 — Compare prediction and observation: The prediction was 4.6 days but the actual shelf life was only 4.0 days. The predicted value was too high.

Answer: We over-estimated the shelf life. A negative residual always means the model predicted higher than the actual value.

Problem 4: Over-under, Part II. Suppose we fit a regression line to predict the number of incidents of skin cancer per 1,000 people from the number of sunny days in a year. For a particular year, we predict the incidence of skin cancer to be \(1.5\) per 1,000 people, and the residual for this year is \(0.5\). Did we over- or under-estimate the incidence of skin cancer? Explain your reasoning.

Problem 8.20 Solution

Step 1 — Solve for the observed \(y\): Given \(\hat{y} = 1.5\) per 1,000 people and residual \(= 0.5\):

$$ y = \hat{y} + \text{residual} = 1.5 + 0.5 = 2.0 \text{ per 1,000 people}. $$

Step 2 — Compare prediction and observation: The prediction was 1.5 but the actual value was 2.0. The prediction was too low.

Answer: We under-estimated the incidence of skin cancer. A positive residual always means the model predicted lower than the actual value.
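The sign rule used in the last two problems can be written as a small helper. This is only a sketch; the function name `describe_prediction` is hypothetical, and it relies on the definition \(\text{residual} = y - \hat{y}\).

```python
def describe_prediction(predicted, residual):
    """Recover the observed value and say whether the model over- or under-estimated it."""
    observed = predicted + residual  # since residual = observed - predicted
    if residual < 0:
        verdict = "over-estimate"   # prediction was above the observed value
    elif residual > 0:
        verdict = "under-estimate"  # prediction was below the observed value
    else:
        verdict = "exact"
    return observed, verdict


apple_observed, apple_verdict = describe_prediction(4.6, -0.6)
cancer_observed, cancer_verdict = describe_prediction(1.5, 0.5)

print(f"Apple shelf life: observed {apple_observed:.1f} days, {apple_verdict}")
print(f"Skin cancer: observed {cancer_observed:.1f} per 1,000, {cancer_verdict}")
```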

Problem 5: Tourism spending. The Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. Three plots are provided: scatterplot showing the relationship between these two variables along with the least squares fit, residuals plot, and histogram of residuals.

(a) Describe the relationship between number of tourists and spending. (b) What are the explanatory and response variables? (c) Why might we want to fit a regression line to these data? (d) Do the data meet the conditions required for fitting a least squares line? In addition to the scatterplot, use the residual plot and histogram to answer this question.

Problem 8.21 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a strong, positive, linear relationship between number of tourists and spending. As the number of tourists increases, tourist spending also increases in a roughly proportional way.

Step 2 — Identify variables (b): The explanatory (predictor) variable is the number of foreign tourists. The response variable is tourist spending. We use the number of tourists to predict or explain spending.

Step 3 — Reason for fitting a regression line (c):
- Quantify the average spending per additional tourist (the slope).
- Predict spending in a future year given a forecast of tourist numbers.
- Summarize the relationship with a single equation suitable for reporting and planning.

Step 4 — Check conditions (d): The four conditions for least squares are (i) linearity — scatterplot roughly straight; (ii) nearly normal residuals — histogram of residuals roughly bell-shaped; (iii) constant variability — residual plot has roughly constant spread; (iv) independent observations. In a time-series of tourism years, independence is suspect (successive years are correlated). If the residual plot shows fanning or curvature, the constant-variability and linearity conditions would also fail.

Answer: (a) Strong positive linear. (b) Explanatory: number of tourists; response: spending. (c) To quantify and predict spending per tourist. (d) Conditions are mostly satisfied based on the three plots, but independence is questionable because data are collected over consecutive years. Any inferences should be made cautiously.

Problem 6: Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

(a) Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain. (b) In this scenario, what are the explanatory and response variables? (c) Why might we want to fit a regression line to these data? (d) Do these data meet the conditions required for fitting a least squares line?

Problem 8.22 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a moderate, positive, linear relationship between calories and carbohydrates. Items with more calories tend to contain more carbs. There is noticeable scatter because calories also come from fat and protein, not just carbs.

Step 2 — Identify variables (b):
- Explanatory variable: calories (this is what Starbucks displays).
- Response variable: amount of carbohydrates (what we want to predict).

Step 3 — Reason for fitting the line (c): A customer who knows only the calorie count (displayed prominently) can estimate how many carbs a menu item has — useful for diet tracking (e.g. diabetes, low-carb diets).

Step 4 — Check conditions (d): Using the scatterplot and residual plot:
- Linearity: roughly satisfied, with no obvious curve.
- Constant variability: need to inspect the residual plot for fanning. Often these food-item plots show some fanning at higher calorie counts, which is a mild violation.
- Nearly normal residuals: need the histogram, which is usually approximately symmetric.
- Independence: the menu items are distinct food items, so independence is reasonable.

Answer: (a) Moderate positive linear. (b) Explanatory: calories; response: carbs. (c) To predict carbs from the posted calorie count. (d) Generally yes, with only mild concerns about constant variability.

Problem 7: The Coast Starlight, Part II. Exercise 8.11 introduces data on the Coast Starlight Amtrak train that runs from Seattle to Los Angeles. The mean travel time from one stop to the next on the Coast Starlight is 129 minutes, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. The correlation between travel time and distance is \(0.636\).

(a) Write the equation of the regression line for predicting travel time. (b) Interpret the slope and the intercept in this context. (c) Calculate \(R^{2}\) of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret \(R^{2}\) in the context of the application. (d) The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities. (e) It actually takes the Coast Starlight about 168 minutes to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value. (f) Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?

Problem 8.23 Solution

Step 1 — Write the regression equation (a): With \(\bar{x} = 108\) miles, \(\bar{y} = 129\) minutes, \(s_x = 99\), \(s_y = 113\), \(r = 0.636\):

$$ b = r\,\frac{s_y}{s_x} = 0.636 \cdot \frac{113}{99} = 0.636 \times 1.1414 \approx 0.726. $$

$$ a = \bar{y} - b\,\bar{x} = 129 - 0.726 \times 108 = 129 - 78.4 = 50.6. $$

$$ \widehat{\text{time}} = 50.6 + 0.726 \times \text{distance}. $$

Step 2 — Interpret slope and intercept (b):
- Slope (0.726 min/mile): For each additional mile between stops, the travel time increases by about 0.73 minutes on average. Equivalently, the marginal speed implied by the slope is about \(60 / 0.726 \approx 83\) mph.
- Intercept (50.6 min): The predicted travel time when the distance is 0 miles is about 51 minutes. No trip has a distance of 0, so this value has little practical meaning on its own; it can be read loosely as fixed time overhead (loading, unloading, acceleration).

Step 3 — Compute and interpret \(R^2\) (c):

$$ R^2 = r^2 = 0.636^2 = 0.4045 \approx 40.5\%. $$

About 40.5% of the variability in travel time between stops is explained by the distance between them. The remaining 59.5% is due to grades, stops, track conditions, speed limits, etc.

Step 4 — Predict for 103 miles (d):

$$ \widehat{\text{time}} = 50.6 + 0.726 \times 103 = 50.6 + 74.78 = 125.4 \text{ minutes}. $$

Step 5 — Compute residual (e):

$$ \text{residual} = 168 - 125.4 = 42.6 \text{ minutes}. $$

The actual travel time exceeded the prediction by about 42.6 minutes. The model under-predicted this segment, likely because the route into Los Angeles has congestion or speed restrictions not captured by distance alone.

Step 6 — 500-mile extrapolation (f): The distances between stops have a mean of 108 miles and a standard deviation of 99 miles, so a 500-mile stretch sits about \((500 - 108)/99 \approx 4\) standard deviations above the mean, far beyond the typical distance between stops. Predicting at that distance is extrapolation, so it is not appropriate to use the model to predict travel time for a 500-mile segment.

Answer: (a) \(\widehat{\text{time}} = 50.6 + 0.726\,\text{distance}\). (b) Slope: +0.73 min/mile; intercept: ≈51 min (no practical meaning). (c) \(R^2 \approx 0.40\); 40% of variability explained. (d) ≈125 minutes. (e) Residual ≈ +42.6 min (under-predicted). (f) No — extrapolation well beyond data range.
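The arithmetic in this solution can be reproduced from the summary statistics alone. The sketch below uses a hypothetical helper, `line_from_summary`, and carries full precision through each step, so its unrounded values differ slightly from the hand calculation before agreeing after rounding.

```python
def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept, slope) using b = r * s_y / s_x and a = y_bar - b * x_bar."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


# Summary statistics quoted in the problem: distance in miles (x), time in minutes (y).
a, b = line_from_summary(x_bar=108, y_bar=129, s_x=99, s_y=113, r=0.636)

predicted = a + b * 103     # Santa Barbara to Los Angeles, 103 miles
residual = 168 - predicted  # observed minus predicted
r_squared = 0.636 ** 2

print(f"time-hat = {a:.1f} + {b:.3f} * distance")
print(f"R^2 = {r_squared:.4f}")
print(f"predicted = {predicted:.1f} min, residual = {residual:.1f} min")
```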

Problem 8: Body measurements, Part III. Exercise 8.13 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is \(0.67\).

(a) Write the equation of the regression line for predicting height. (b) Interpret the slope and the intercept in this context. (c) Calculate \(R^{2}\) of the regression line for predicting height from shoulder girth, and interpret it in the context of the application. (d) A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model. (e) The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means. (f) A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?

Problem 8.24 Solution

Step 1 — Write the regression equation (a): \(\bar{x} = 107.2\) (shoulder girth, cm), \(\bar{y} = 171.14\) (height, cm), \(s_x = 10.37\), \(s_y = 9.41\), \(r = 0.67\):

$$ b = r\,\frac{s_y}{s_x} = 0.67 \cdot \frac{9.41}{10.37} = 0.67 \times 0.9074 \approx 0.608. $$

$$ a = \bar{y} - b\,\bar{x} = 171.14 - 0.608 \times 107.2 = 171.14 - 65.18 = 105.96. $$

$$ \widehat{\text{height}} = 105.96 + 0.608 \times \text{shoulder\_girth}. $$

Step 2 — Interpret coefficients (b):
- Slope (0.608 cm/cm): For each additional 1 cm of shoulder girth, height increases by about 0.61 cm on average.
- Intercept (105.96 cm): The predicted height when shoulder girth is 0 cm. Not meaningful in practice (no one has a shoulder girth of 0).

Step 3 — Compute \(R^2\) (c):

$$ R^2 = 0.67^2 = 0.4489 \approx 44.9\%. $$

About 44.9% of the variation in height is explained by shoulder girth. The rest comes from other body-size and genetic factors.

Step 4 — Predict for 100 cm shoulder girth (d):

$$ \widehat{\text{height}} = 105.96 + 0.608 \times 100 = 105.96 + 60.8 = 166.76 \text{ cm}. $$

Step 5 — Compute residual (e):

$$ \text{residual} = 160 - 166.76 = -6.76 \text{ cm}. $$

The student is about 6.76 cm shorter than the model predicted. The model over-predicted this student's height by about 6.76 cm.

Step 6 — One-year-old shoulder girth of 56 cm (f): The sample's shoulder girths have a mean of 107.2 cm and a standard deviation of 10.37 cm, so a one-year-old's 56 cm has a Z-score of \((56 - 107.2)/10.37 \approx -4.9\). That is about 4.9 standard deviations below the mean, well outside the data range. The sample also consisted of adults, not infants. Using the model for a one-year-old would be extrapolation and is not appropriate.

Answer: (a) \(\widehat{\text{height}} = 105.96 + 0.608\,\text{shoulder\_girth}\). (b) Slope: 0.61 cm per cm; intercept: 105.96 cm (no practical meaning). (c) \(R^2 \approx 0.45\). (d) ≈166.76 cm. (e) Residual ≈ −6.76 cm (over-predicted). (f) No — extrapolation.

Problem 9: Murders and poverty, Part I. The following regression output is for predicting annual murders per million from percentage living in poverty in a random sample of 20 metropolitan areas.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -29.901 | 7.789 | -3.839 | 0.001 |
| poverty% | 2.559 | 0.390 | 6.562 | 0.000 |

\(s = 5.512\), \(R^{2} = 70.52\%\), \(R^{2}_{adj} = 68.89\%\)

(a) Write out the linear model. (b) Interpret the intercept. (c) Interpret the slope. (d) Interpret \(R^{2}\). (e) Calculate the correlation coefficient.

Problem 8.25 Solution

Step 1 — Write out the linear model (a): From the regression output, intercept = \(-29.901\) and slope on poverty% = \(2.559\):

$$ \widehat{\text{murders}} = -29.901 + 2.559 \times \text{poverty\%}. $$

Step 2 — Interpret the intercept (b): The predicted annual murders per million when the poverty percentage is 0% is about \(-29.9\). A negative murder rate is impossible, so the intercept has no practical meaning here — no real area has 0% poverty, so \(x = 0\) is outside the data range (extrapolation).

Step 3 — Interpret the slope (c): For each additional 1 percentage point of population living in poverty, the annual murder rate increases by about 2.559 per million on average.

Step 4 — Interpret \(R^2\) (d): \(R^2 = 70.52\%\). About 70.5% of the variability in the annual murder rate across metro areas is explained by the poverty percentage. The remaining ~30% is not explained by the model and reflects other factors (education, employment, population density, drug policy, etc.).

Step 5 — Compute the correlation coefficient (e): \(r = \pm \sqrt{R^2} = \pm \sqrt{0.7052} \approx \pm 0.8398\). The slope is positive (\(b = 2.559 > 0\)), so:

$$ r \approx +0.840. $$

Answer: (a) \(\widehat{\text{murders}} = -29.901 + 2.559\,\text{poverty\%}\). (b) Intercept has no practical meaning (extrapolation). (c) +2.559 murders/million per 1 pp increase in poverty. (d) 70.5% of variability explained. (e) \(r \approx +0.840\).
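Part (e) and the table's t value can be checked with a short script. The helper name `correlation_from_r_squared` is hypothetical; the inputs are the values printed in the regression output above.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Return r with magnitude sqrt(R^2) and the same sign as the slope."""
    # math.copysign attaches the sign of `slope` to the magnitude sqrt(R^2).
    return math.copysign(math.sqrt(r_squared), slope)


r = correlation_from_r_squared(0.7052, slope=2.559)

# The t value reported for a coefficient is its estimate divided by its standard error.
t_slope = 2.559 / 0.390

print(f"r = {r:+.3f}")
print(f"t value for the slope = {t_slope:.2f}")
```

The recomputed t value differs from the table's 6.562 only in the last digit, because the table's inputs are themselves rounded.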

Problem 10: Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -0.357 | 0.692 | -0.515 | 0.607 |
| body wt | 4.034 | 0.250 | 16.119 | 0.000 |

\(s = 1.452\), \(R^{2} = 64.66\%\), \(R^{2}_{adj} = 64.41\%\)

(a) Write out the linear model. (b) Interpret the intercept. (c) Interpret the slope. (d) Interpret \(R^{2}\). (e) Calculate the correlation coefficient.

Problem 8.26 Solution

Step 1 — Write out the linear model (a):

$$ \widehat{\text{heart\_wt}} = -0.357 + 4.034 \times \text{body\_wt}. $$

Step 2 — Interpret the intercept (b): The predicted heart weight when a cat's body weight is 0 kg is \(-0.357\) g — impossible (body weight 0 means no cat), so the intercept is not meaningful practically. It is a mathematical artifact.

Step 3 — Interpret the slope (c): For each additional kilogram of body weight, a cat's heart weight increases by about 4.034 grams on average. The relationship is positive and physically sensible. The slope alone does not measure how strong the relationship is; that is what \(R^2\) and \(r\) describe.

Step 4 — Interpret \(R^2\) (d): \(R^2 = 64.66\%\). About 65% of the variability in cats' heart weights is explained by body weight. The rest is due to breed, age, sex, fitness, etc.

Step 5 — Compute the correlation (e): \(r = \pm \sqrt{0.6466} \approx \pm 0.804\). Slope is positive (+4.034), so:

$$ r \approx +0.804. $$

Answer: (a) \(\widehat{\text{heart\_wt}} = -0.357 + 4.034\,\text{body\_wt}\). (b) Intercept not meaningful. (c) Heart weight increases by about 4.03 g per kg of body weight. (d) \(R^2 \approx 65\%\) — 65% of variability in heart weight explained by body weight. (e) \(r \approx +0.804\).

Problem 11: Outliers, Part I. Identify the outliers in the scatterplots shown below, and determine what type of outliers they are. Explain your reasoning.

[Scatterplots (a), (b), and (c) not shown.]

Problem 8.27 Solution

Step 1 — Outlier-type framework: An outlier can be described using three related ideas:
- High leverage means the point has an \(x\)-value far from the rest of the data.
- Influential means removing the point would noticeably change the slope or intercept.
- Large residual only means the point is vertically far from the line but has an \(x\)-value near the center of the data, so it does not have much pull on the slope.

Step 2 — Typical answers for three panels:
- (a) A point far to the right of the cloud, on the line → high leverage, not influential.
- (b) A point in the middle of the \(x\) range but far above the line → large residual, not high leverage, not very influential (slope barely changes).
- (c) A point far to the right of the cloud and far from the line → high leverage AND influential.

Answer: (a) High leverage, not influential. (b) Large residual, not high leverage, minimally influential. (c) High leverage AND influential — this point pulls the slope noticeably.

Problem 12: Outliers, Part II. Identify the outliers in the scatterplots shown below and determine what type of outliers they are. Explain your reasoning.

[Scatterplots (a), (b), and (c) not shown.]

Problem 8.28 Solution

Step 1 — Reapply the outlier-type framework from Problem 8.27.

Step 2 — Typical answers for three panels:
- (a) A point in the middle of the \(x\) range but far below the line → large residual, low leverage, low influence.
- (b) A point far to the right, off the line → high leverage AND influential.
- (c) A small cluster of points off to one side (e.g. far right) → collectively high leverage and influential; the cluster shifts the regression line.

Answer: (a) Large residual, low leverage, low influence. (b) High leverage AND influential. (c) The cluster acts as a collective high-leverage, influential group.

Problem 13: Urban homeowners, Part I. The scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas. There are 52 observations, each corresponding to a state in the US. Puerto Rico and District of Columbia are also included.

(a) Describe the relationship between the percent of families who own their home and the percent of the population living in urban areas. (b) The outlier at the bottom right corner is District of Columbia, where \(100\%\) of the population is considered urban. What type of an outlier is this observation?

Problem 8.29 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a moderate negative linear relationship: states with a higher percentage of urban population tend to have a lower percentage of families owning their home. Apartments are more common in urban areas, driving down ownership rates.

Step 2 — Classify the District of Columbia outlier (b): DC sits at the far right (100% urban) with a much lower homeownership percentage than the trend would predict. Its \(x\)-value is far from the center of the \(x\) distribution → high leverage. It also falls well away from the regression line the other 51 observations would produce → influential. DC is therefore both an outlier with high leverage and an influential point.

Answer: (a) Moderate negative linear — more urban means less ownership. (b) DC is a high-leverage, influential outlier.

Problem 14: Crawling babies, Part II. Exercise 8.12 introduces data on the average monthly temperature during the month babies first try to crawl (about 6 months after birth) and the average first crawling age for babies born in a given month. A scatterplot of these two variables reveals a potential outlying month when the average temperature is about \(53^{\circ}\text{F}\) and average crawling age is about 28.5 weeks. Does this point have high leverage? Is it an influential point?

Problem 8.30 Solution

Step 1 — Recall the Crawling Babies setup: The study has 12 data points (one per birth month). The outlying month has \(x = 53^\circ\text{F}\) and \(y = 28.5\) weeks.

Step 2 — Is it high leverage? The mean temperature across the 12 months is somewhere in the middle of the observed range (probably around 50–60°F). An \(x\)-value of 53°F is close to the center of the \(x\) distribution, not far from \(\bar{x}\). Therefore this point does NOT have high leverage.

Step 3 — Is it influential? A point is influential when removing it would noticeably change the slope or intercept of the regression line, which in practice requires high leverage combined with a position away from the trend. Since this point does not have high leverage, it cannot be strongly influential: removing it would barely move the slope. However, it is a point with an unusually large residual (vertical distance from the line).

Answer: No, it does not have high leverage, and it is not influential. It is an observation with a large residual: a point that is vertically far from the line but whose \(x\)-value is near the middle of the data, so it has very little pull on the slope.

8.3 Transformations for Skewed Data

County population size among the counties in the US is very strongly right skewed. Can we apply a transformation to make the distribution more symmetric? How would such a transformation affect the scatterplot and residual plot when another variable is graphed against this variable? In this section, we will see the power of transformations for very skewed data.

Context Pause

Most of the data you meet in a first statistics course is bell-shaped or close to it, but the real world is full of variables that are not. County populations, household incomes, insurance claim sizes, earthquake magnitudes, and city sizes are all famously right-skewed: a huge pile of small values and a long thin tail of very large ones. Plain linear regression assumes the spread of points around the line is roughly constant and that the relationship is roughly linear — both assumptions that skewed data routinely violate. Transformations are the standard fix: we rescale the variable so its distribution looks more symmetric, and the regression tools we built in Sections 8.1 and 8.2 suddenly work again.

Insight Note

A transformation is just a new set of axis labels. We do not change the data — we change the ruler we use to measure it. The \(\log_{10}\) transformation, for example, relabels "10" as 1, "100" as 2, "1,000" as 3, "10,000" as 4, and so on. Every factor of 10 gets the same amount of axis space. Distances that used to be compressed into a corner of the plot are now spread out, and tiny wiggles that were invisible become clearly visible. Nothing about the underlying counties, trucks, or incomes has changed — we are just looking at them through a different lens.
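The relabeling described in this note is easy to see numerically. In the sketch below, the two "county populations" at the end are made-up values chosen purely for illustration.

```python
import math

# Each power of 10 is relabeled by its exponent.
values = [10, 100, 1_000, 10_000]
labels = [math.log10(v) for v in values]
for v, label in zip(values, labels):
    print(f"{v:>7,} -> {label:.0f}")

# Equal ratios become equal distances on the log scale.
small, large = 2_500, 250_000  # hypothetical county populations, a factor of 100 apart
gap = math.log10(large) - math.log10(small)
print(f"log10 gap between {small:,} and {large:,}: {gap:.1f}")
```

A factor of 100 always occupies 2 units on the \(\log_{10}\) axis, no matter where on the axis it starts.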

Learning Objectives

Source: Main Text

By the end of this section, you should be able to:

  1. See how a log transformation can bring symmetry to an extremely skewed variable.
  2. Recognize that data can often be transformed to produce a linear relationship, and that this transformation often involves log of the \(y\)-values and sometimes log of the \(x\)-values.
  3. Use residual plots to assess whether a linear model for transformed data is reasonable.

8.3.1 Transformations to Reduce Skew

Example 8.3.1

Source: Main Text

Consider the histogram of county populations shown in Figure 8.22(a), which shows extreme skew. What isn't useful about this plot?

Solution:

Nearly all of the data fall into the left-most bin, and the extreme skew obscures many of the potentially interesting details in the data.

There are some standard transformations that may be useful for strongly right-skewed data where much of the data is positive but clustered near zero. A transformation is a rescaling of the data using a function. For instance, a plot of the logarithm (base 10) of county populations results in the new histogram in Figure 8.22(b). The transformed data are much more symmetric, and any potential outliers appear much less extreme than in the original data set. By reining in the outliers and extreme skew, transformations like this often make it easier to build statistical models for the data.

Transformations can also be applied to one or both variables in a scatterplot. A scatterplot of the population change from 2010 to 2017 against the population in 2010 is shown in Figure 8.23(a). In this first scatterplot, it's hard to decipher any interesting patterns because the population variable is so strongly skewed. However, if we apply a \(\log_{10}\) transformation to the population variable, as shown in Figure 8.23(b), a positive association between the variables is revealed. While fitting a line to predict population change (2010 to 2017) from population (in 2010) does not seem reasonable, fitting a line to predict population change from \(\log_{10}(\text{population})\) does seem reasonable.

Transformations other than the logarithm can be useful, too. For instance, the square root (\(\sqrt{\text{original observation}}\)) and inverse (\(\frac{1}{\text{original observation}}\)) are commonly used by data scientists. Common goals in transforming data are to see the data structure differently, reduce skew, assist in modeling, or straighten a nonlinear relationship in a scatterplot.

Figure 8.22: (a) A histogram of the populations of all US counties. (b) A histogram of \(\log_{10}\)-transformed county populations. For this plot, the x-value corresponds to the power of 10, e.g. "4" on the x-axis corresponds to \(10^{4} = 10{,}000\).

Figure 8.23: (a) Scatterplot of population change against the population before the change. (b) A scatterplot of the same data but where the population size has been log-transformed.

> Common Transformations for Right-Skewed Data
> Three transformations show up over and over again when a variable is positive and right-skewed:
>
> - Log transformation, \(x \mapsto \log_{10}(x)\) or \(x \mapsto \ln(x)\): compresses large values heavily, leaves small values roughly alone. This is the workhorse.
> - Square root, \(x \mapsto \sqrt{x}\): milder compression than the log. Often used for counts (number of events in a fixed time window).
> - Inverse, \(x \mapsto 1/x\): very aggressive compression of large values. Sometimes used when rates or reciprocals are the natural quantity.
>
> The goal in every case is the same: produce a distribution (or scatterplot) that is more symmetric, more linear, and better suited to the tools we already have.

Try It Now 8.3.1

Source: Main Text

The following ten hypothetical county populations (rounded, in thousands) are strongly right-skewed:

\(\{5, 8, 12, 15, 20, 40, 80, 200, 900, 5000\}\)

(a) By eye, describe the shape of the distribution of the raw counts. (b) Apply the \(\log_{10}\) transformation to each value. (You may round to two decimals.) (c) Describe the shape of the transformed distribution — is it more symmetric than the original?

Solution

(a) Strongly right-skewed: most values are small (between 5 and 40) and there is a long tail of large values (200, 900, 5000). The mean is pulled far to the right of the median by the big values.

(b) Taking \(\log_{10}\) of each (populations in thousands):

| Raw \(x\) | \(\log_{10}(x)\) |
|---:|---:|
| 5 | 0.70 |
| 8 | 0.90 |
| 12 | 1.08 |
| 15 | 1.18 |
| 20 | 1.30 |
| 40 | 1.60 |
| 80 | 1.90 |
| 200 | 2.30 |
| 900 | 2.95 |
| 5000 | 3.70 |

(c) Much more symmetric. The raw range is 5 to 5000 (three orders of magnitude); the transformed range is 0.70 to 3.70 (three units on the \(\log_{10}\) scale). The values no longer pile up at the small end: they are spread far more evenly across the range, so the distribution is only mildly right-skewed rather than severely so.
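The arithmetic in this exercise can be checked with a few lines of Python. The following is a minimal sketch using only the standard library; the variable names are illustrative, and comparing the mean with the median is only one rough way to gauge skew.

```python
import math
import statistics

# Hypothetical county populations (in thousands) from Try It Now 8.3.1.
populations = [5, 8, 12, 15, 20, 40, 80, 200, 900, 5000]

# Apply the log10 transformation and round to two decimals.
log_populations = [round(math.log10(x), 2) for x in populations]

# A mean far above the median is one rough sign of right skew.
raw_mean = statistics.mean(populations)
raw_median = statistics.median(populations)
log_mean = statistics.mean(log_populations)
log_median = statistics.median(log_populations)

print(log_populations)
print(f"raw:   mean={raw_mean}, median={raw_median}")
print(f"log10: mean={log_mean:.2f}, median={log_median:.2f}")
```

For the raw values the mean (628) sits far above the median (30). After the transformation the two are much closer (about 1.76 versus 1.45), which matches the visual impression that most of the skew is gone.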

8.3.2 Transformations to Achieve Linearity

Figure 8.24: Variable \(y\) is plotted against \(x\). A nonlinear relationship is evident by the \(\cup\)-pattern shown in the residual plot. The curvature is also visible in the original plot.

Example 8.3.2

Source: Main Text

Consider the scatterplot and residual plot in Figure 8.24. The regression output is also provided. Is the linear model \(\hat{y} = -52.3564 + 2.7842 x\) a good model for the data?

Solution:

The regression equation is

$$ \hat{y} = -52.3564 + 2.7842 x $$

| Predictor | Coef | SE Coef | T | P |
|---|---:|---:|---:|---:|
| Constant | -52.3564 | 7.2757 | -7.196 | 3e-08 |
| x | 2.7842 | 0.1768 | 15.752 | \(< 2\)e-16 |

\(S = 13.76\), \(R\)-Sq \(= 88.26\%\), \(R\)-Sq(adj) \(= 87.91\%\).

We can note the \(R^{2}\) value is fairly large. However, this alone does not mean that the model is good. Another model might be much better. When assessing the appropriateness of a linear model, we should look at the residual plot. The \(\cup\)-pattern in the residual plot tells us the original data is curved. If we inspect the two plots, we can see that for small and large values of \(x\) we systematically underestimate \(y\), whereas for middle values of \(x\), we systematically overestimate \(y\). The curved trend can also be seen in the original scatterplot. Because of this, the linear model is not appropriate, and it would not be appropriate to perform a \(t\)-test for the slope because the conditions for inference are not met. However, we might be able to use a transformation to linearize the data.

Regression analysis is easier to perform on linear data. When data are nonlinear, we sometimes transform the data in a way that makes the resulting relationship linear. The most common transformation is log of the \(y\)-values. Sometimes we also apply a transformation to the \(x\)-values. We generally use the residuals as a way to evaluate whether the transformed data are more linear. If so, we can say that a better model has been found.
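The idea of comparing residuals before and after a transformation can be sketched in code. The example below uses hypothetical data generated from an exact exponential curve (it is not the data behind Figure 8.24), and the helper names `fit_line` and `residuals` are illustrative. It fits a least squares line to \(y\) and to \(\log(y)\), then compares the residuals.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line for ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(xs, ys):
    """Observed minus predicted values for the least squares line."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data: y grows exponentially in x (no noise, for clarity).
xs = list(range(10, 61, 5))
ys = [math.exp(1.7 + 0.05 * x) for x in xs]

raw_resid = residuals(xs, ys)                         # line fit to y
log_resid = residuals(xs, [math.log(y) for y in ys])  # line fit to log(y)

# The signs of the raw residuals trace out the curve the line missed.
print(["+" if r > 0 else "-" for r in raw_resid])
# Largest residual (in absolute value) after the log transformation.
print(max(abs(r) for r in log_resid))
```

In this noise-free setup the raw residuals are positive at both ends and negative in the middle, which is the \(\cup\)-pattern, while the residuals for the \(\log(y)\) fit are essentially zero. Real data will be noisier, so the residual plot still has to be inspected by eye.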

Example 8.3.3

Source: Main Text

Using the regression output for the transformed data, write the new linear regression equation.

Solution:

The regression equation is

$$ \widehat{\log(y)} = 1.722540 + 0.052985 x $$

| Predictor | Coef | SE Coef | T | P |
|---|---:|---:|---:|---:|
| Constant | 1.722540 | 0.056731 | 30.36 | \(< 2\)e-16 |
| x | 0.052985 | 0.001378 | 38.45 | \(< 2\)e-16 |

$$ S = 0.1073, \quad R\text{-Sq} = 97.82\%, \quad R\text{-Sq(adj)} = 97.75\% $$

The linear regression equation can be written as: \(\widehat{\log(y)} = 1.723 + 0.053 x\).

Figure 8.25: A plot of \(\log(y)\) against \(x\). The residuals don't show any evident patterns, which suggests the transformed data is well-fit by a linear model.

Example 8.3.4

Source: Main Text

Compute the \(95\%\) confidence interval for the family income coefficient using the regression output from Table 8.21.

Solution:

The point estimate is \(-0.0431\) and the standard error is \(SE = 0.0108\). When constructing a confidence interval for a model coefficient, we generally use a \(t\)-distribution. The degrees of freedom for the distribution are noted in the regression output, \(df = 48\), allowing us to identify \(t_{48}^{\star} = 2.01\) for use in the confidence interval.

We can now construct the confidence interval in the usual way:

$$ \text{point estimate} \pm t_{48}^{\star} \times SE \quad \rightarrow \quad -0.0431 \pm 2.01 \times 0.0108 \quad \rightarrow \quad (-0.0648,\ -0.0214) $$

We are \(95\%\) confident that with each dollar increase in family income, the university's gift aid is predicted to decrease on average by \$0.0214 to \$0.0648.

Guided Practice 8.28

Source: Main Text

Which of the following statements are true? There may be more than one.

(a) There is an apparent linear relationship between \(x\) and \(y\).

(b) There is an apparent linear relationship between \(x\) and \(\widehat{\log(y)}\).

(c) The model provided by Regression I \((\hat{y} = -52.3564 + 2.7842 x)\) yields a better fit.

(d) The model provided by Regression II \((\widehat{\log(y)} = 1.723 + 0.053 x)\) yields a better fit.

Solution

True statements: (b) and (d).

(a) False. The original scatterplot shows clear curvature, and the residual plot in Figure 8.24 shows a \(\cup\)-pattern — strong evidence that \(x\) and \(y\) are not linearly related.

(b) True. After applying the log transformation to \(y\), the residuals in Figure 8.25 show no pattern, which is the diagnostic for a good linear fit. The scatterplot of \(\log(y)\) versus \(x\) also looks linear.

(c) False. Regression I has \(R^{2} = 88.26\%\), but the curved residual pattern invalidates it — the high \(R^{2}\) is misleading because the model is fundamentally the wrong shape.

(d) True. Regression II has \(R^{2} = 97.82\%\) and a clean residual plot. Both indicators agree: the log-transformed model is a much better fit.

The pattern in residuals always trumps a single summary statistic like \(R^{2}\). A model with smaller \(R^{2}\) but clean residuals is almost always better than one with larger \(R^{2}\) but systematic residual patterns.

Context Pause

When you find a curve in a residual plot, your first question should be: what transformation might straighten it? For right-skewed \(x\), try \(\log(x)\). For exponential growth or decay in \(y\), try \(\log(y)\). For data in which \(y\) bends like a square root of \(x\), try \(\sqrt{x}\). There is no universal recipe: you try a transformation, make a new residual plot, and check whether the pattern has disappeared. If it has, you have earned the right to use linear regression on the transformed data. If not, try a different transformation or accept that a nonlinear model is needed.

Insight Note

The phrase "a better model has been found" deserves a caveat: a model on \(\log(y)\) is a different model than a model on \(y\). The slope in Regression II tells us how \(\log(y)\) changes with \(x\), which translates to a multiplicative change in \(y\) itself. The size of the multiplier depends on the base of the logarithm, which the output does not state: with a base-10 log, each unit increase in \(x\) multiplies \(y\) by roughly \(10^{0.053} \approx 1.13\) (a \(13\%\) increase per unit); with a natural log, the multiplier is \(e^{0.053} \approx 1.054\) (about \(5.4\%\) per unit). Interpreting a log-linear regression line means keeping that multiplicative story straight. We will see this again when we interpret slopes in the exercises at the end of this section.
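The conversion from a slope on the log scale to a percent change in \(y\) can be written as a small helper. This is a sketch only: the function name `percent_change_per_unit` is hypothetical, and because the base of the logarithm in Regression II is not stated in the output, both common bases are shown.

```python
import math

def percent_change_per_unit(slope, base=math.e):
    """Percent change in y per one-unit increase in x when the model is
    log_base(y) = intercept + slope * x."""
    factor = base ** slope  # multiplicative change in y
    return (factor - 1) * 100

slope = 0.053  # slope from Regression II

# The base must match the logarithm used when the model was fit.
print(round(percent_change_per_unit(slope, base=10), 1))      # base-10 log
print(round(percent_change_per_unit(slope, base=math.e), 1))  # natural log
```

The same helper applies to any model fit on a logged response, provided the base passed in matches the one used in the transformation.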

Try It Now 8.3.2

Source: Main Text

A researcher fits a regression of \(y\) on \(x\) for 50 observations and gets \(R^{2} = 0.92\), a large slope \(t\)-statistic, and a residual plot that has a clear \(\cap\)-shape (up-then-down, like an inverted bowl).

(a) Based on \(R^{2}\) alone, does the linear model look good? (b) Based on the residual plot, should we trust the linear model? (c) What is one thing the researcher should try before reporting the results of the regression?

Solution

(a) Yes — \(R^{2} = 0.92\) means the linear model explains \(92\%\) of the variability in \(y\). Taken alone, this looks like a strong fit.

(b) No. The \(\cap\)-pattern in the residual plot means the model is systematically wrong: the residuals are negative at the extremes and positive in the middle, so the line overestimates \(y\) at the extremes and underestimates \(y\) in the middle. Systematic error is exactly what a linear model is not supposed to have. The \(R^{2}\) is misleading because the model is the wrong shape, not because it is noisy.

(c) Try a transformation. With a \(\cap\)-shape, candidates include transforming \(x\) (e.g., \(x \mapsto \log(x)\) or \(x \mapsto x^{2}\), depending on the direction of the trend) or raising \(y\) to a power such as \(y^{2}\); taking \(\log(y)\) or \(\sqrt{y}\) is typically better suited to a \(\cup\)-shaped pattern. After the transformation, make a new residual plot: if it looks like random static, the transformed model is a better fit. Only report the transformed model if the residuals support it.

Section Summary

  • A transformation is a rescaling of the data using a function. When data are very skewed, a log transformation often results in more symmetric data.
  • Regression analysis is easier to perform on linear data. When data are nonlinear, we sometimes transform the data in a way that results in a linear relationship. The most common transformation is log of the \(y\)-values. Sometimes we also apply a transformation to the \(x\)-values.
  • To assess the model, we look at the residual plot of the transformed data. If the residual plot of the original data has a pattern, but the residual plot of the transformed data has no pattern, a linear model for the transformed data is reasonable, and the transformed model provides a better fit than the simple linear model.
  • \(R^{2}\) alone is not enough to judge a model. A high \(R^{2}\) with a patterned residual plot still indicates a poor fit. Always make the residual plot.

Problem Set

Source: Main Text

Problem 1: Used trucks. The scatterplot below shows the relationship between year and price (in thousands of dollars) of a random sample of 42 pickup trucks. Also shown is a residuals plot for the linear model for predicting price from year.

(a) Describe the relationship between these two variables and comment on whether a linear model is appropriate for modeling the relationship between year and price.

(b) The scatterplot below shows the relationship between logged (natural log) price and year of these trucks, as well as the residuals plot for modeling these data. Comment on which model (linear model from earlier or logged model presented here) is a better fit for these data.

(c) The output for the logged model is given below. Interpret the slope in context of the data.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---:|---:|---:|---:|
| (Intercept) | -271.981 | 25.042 | -10.861 | 0.000 |
| Year | 0.137 | 0.013 | 10.937 | 0.000 |

Problem 1 Solution

Part (a) — Assess the original linear model.

The scatterplot of truck price versus year shows a curved, nonlinear relationship: prices stay low and flat for older trucks, then rise sharply (accelerating) for more recent years. The residual plot for the linear fit reveals this clearly — there is a pronounced \(\cup\)-shape: residuals are positive for very old and very new trucks and negative for middle-aged trucks. A curved residual pattern means the linear model systematically under- or over-predicts in a nonrandom way, so a simple linear model is not appropriate for these data.

Part (b) — Compare linear vs. log-transformed model.

After transforming price with the natural log, the relationship between \(\ln(\text{price})\) and year looks much closer to linear, and the residual plot loses its clear \(\cup\)-pattern — residuals are scattered randomly around zero with roughly constant spread. Because the residual plot for the log-transformed model has no obvious pattern while the original does, the logged model is a better fit for the data.

Part (c) — Interpret the slope in context.

The logged model is \(\widehat{\ln(\text{price})} = -271.981 + 0.137 \cdot \text{Year}\). The slope is \(0.137\) on the log scale, so each additional year of the truck's model year is associated with a \(0.137\) increase in \(\ln(\text{price})\). On the original price scale this is a multiplicative change: \(e^{0.137} \approx 1.147\). Each additional model year is associated with roughly a \(14.7\%\) increase in predicted price (in thousands of dollars), on average.

In context: holding all else constant, a pickup truck that is one model year newer is predicted to be priced about \(14.7\%\) higher.

Problem 2: Income and hours worked. The scatterplot below shows the relationship between income and hours worked for a random sample of 787 Americans. Also shown is a residuals plot for the linear model for predicting income from hours worked. The data come from the 2012 American Community Survey.

(a) Describe the relationship between these two variables and comment on whether a linear model is appropriate for modeling the relationship between income and hours worked.

(b) The scatterplot below shows the relationship between logged (natural log) income and hours worked, as well as the residuals plot for modeling these data. Comment on which model (linear model from earlier or logged model presented here) is a better fit for these data.

(c) The output for the logged model is given below. Interpret the slope in context of the data.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---:|---:|---:|---:|
| (Intercept) | 1.017 | 0.113 | 9.000 | 0.000 |
| hrs_work | 0.058 | 0.003 | 21.086 | 0.000 |

Problem 2 Solution

Part (a) — Assess the original linear model.

The scatterplot of income versus hours worked shows a weak, fan-shaped pattern: income generally increases with hours worked, but the spread in income grows dramatically as hours worked increases. The residual plot confirms this: residuals are small for small values of hours worked and become far more spread out for larger values, producing a clear fan or funnel shape. This violates the "constant variability" condition for linear regression. A simple linear model is not appropriate for these data.

Part (b) — Compare linear vs. log-transformed model.

After applying the natural log transformation to income, the relationship between \(\ln(\text{income})\) and hours worked is more linear and, importantly, the residual plot for the logged model shows residuals scattered around zero with roughly constant spread — the fan pattern is largely gone. Because the log-transformed model has cleaner residuals (no pattern, constant variability), the logged model is a better fit than the original linear model.

Part (c) — Interpret the slope in context.

The logged model is \(\widehat{\ln(\text{income})} = 1.017 + 0.058 \cdot \text{hrs\_work}\). The slope \(0.058\) is on the log scale, so each additional hour worked per week is associated with a \(0.058\) increase in \(\ln(\text{income})\). On the original income scale this corresponds to multiplying income by \(e^{0.058} \approx 1.060\). Each additional hour worked is associated with roughly a \(6.0\%\) increase in predicted income, on average.

In context: for a random American in this 2012 sample, working one additional hour per week is associated with a predicted income that is about \(6\%\) higher.

8.4 Inference for the Slope of a Regression Line

Here we encounter our last confidence interval and hypothesis test procedures, this time for making inferences about the slope of the population regression line. We can use these tools to answer questions such as:

  • Is the unemployment rate a significant linear predictor for the loss of the President's party in the House of Representatives?
  • On average, how much less in college gift aid do students receive when their parents earn an additional \$1,000 in income?
Context Pause

Until now, the regression line has been a description — a best-fit line drawn through the data we actually collected. But the points we have are just one sample from a much larger population of possible observations. The slope we computed could have come out a bit higher or a bit lower if we had sampled different people. Slope inference is how we turn "the line we drew" into a statement about "the line that lives in the population" — confidence intervals give us a plausible range for the true slope, and hypothesis tests let us decide whether the true slope could be zero.

Insight Note

If you can run a \(t\)-test for a mean, you can run a \(t\)-test for a slope. The machinery is identical: point estimate, standard error, degrees of freedom, \(t\)-statistic, p-value. The only new thing is where the numbers come from — you read them off a regression output table instead of computing \(\bar{x}\) and \(s\) by hand.

Learning Objectives

Source: Main Text

By the end of this section, you should be able to:

In this section, you will learn to:
  1. Recognize that the slope of the sample regression line is a point estimate and has an associated standard error.
  2. Read the results of computer regression output and identify the quantities needed for inference for the slope of the regression line, specifically the slope of the sample regression line, the SE of the slope, and the degrees of freedom.
  3. State and verify whether the conditions are met for inference on the slope of the regression line using the \(t\)-distribution.
  4. Carry out a complete confidence interval procedure for the slope of the regression line.
  5. Carry out a complete hypothesis test for the slope of the regression line.
  6. Distinguish between when to use the \(t\)-test for the slope of a regression line and when to use the one-sample \(t\)-test for a mean of differences.

8.4.1 The Role of Inference for Regression Parameters

Previously, we found the equation of the regression line for predicting gift aid from family income at Elmhurst College. The slope, \(b\), was equal to \(-0.0431\). This is the slope for our sample data. However, the sample was taken from a larger population. We would like to use the slope computed from our sample data to estimate the slope of the population regression line.

The equation for the population regression line can be written as

$$ \mu_y = \alpha + \beta x $$

Here, \(\alpha\) and \(\beta\) represent two model parameters — the \(y\)-intercept and the slope of the true (population) regression line. (This use of \(\alpha\) and \(\beta\) has nothing to do with the \(\alpha\) and \(\beta\) we used previously to represent the probabilities of Type I and Type II errors!) The parameters \(\alpha\) and \(\beta\) are estimated using data. We can look at the equation of the regression line calculated from a particular data set:

$$ \hat{y} = a + b x $$

and see that \(a\) and \(b\) are point estimates for \(\alpha\) and \(\beta\), respectively. If we plug in the values of \(a\) and \(b\), the regression equation for predicting gift aid based on family income is:

$$ \hat{y} = 24{.}3193 - 0{.}0431 x $$

The slope of the sample regression line, \(-0.0431\), is our best estimate for the slope of the population regression line, but there is variability in this estimate since it is based on a sample. A different sample would produce a somewhat different estimate of the slope. The standard error of the slope tells us the typical variation in the slope of the sample regression line and the typical error in using this slope to estimate the slope of the population regression line.
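The claim that a different sample would give a somewhat different slope can be illustrated with a short simulation. This is a sketch under stated assumptions: the population line \(\mu_y = 2 + 0.5x\), the residual standard deviation of 1, the sample size of 30, and the helper name `sample_slope` are all hypothetical choices for illustration and have nothing to do with the Elmhurst data.

```python
import random
import statistics

def sample_slope(rng, n, alpha, beta, sigma):
    """Draw n points from y = alpha + beta*x + noise and return the
    least squares slope b for that sample."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [alpha + beta * x + rng.gauss(0, sigma) for x in xs]
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population line: mu_y = 2 + 0.5 x, residual SD of 1.
slopes = [sample_slope(rng, n=30, alpha=2.0, beta=0.5, sigma=1.0)
          for _ in range(500)]

print(f"mean of sample slopes: {statistics.mean(slopes):.3f}")
print(f"SD of sample slopes:   {statistics.stdev(slopes):.3f}")
```

The sample slopes cluster around the population slope of 0.5, and their standard deviation across the 500 samples is what the standard error of the slope estimates from a single sample.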

We would like to construct a 95% confidence interval for \(\beta\), the slope of the population regression line. As with means, inference for the slope of a regression line is based on the \(t\)-distribution.

Context Pause

The leap from \(b\) to \(\beta\) is the same leap you've been making all along — from \(\bar{x}\) to \(\mu\), from \(\hat{p}\) to \(p\). Sample statistic to population parameter. The slope is just another statistic that wobbles from sample to sample, and \(\beta\) is the fixed (but unknown) quantity we'd love to pin down.

Example 8.4.1

Source: Main Text

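The intercept and slope estimates for the Elmhurst data are \(a = 24{,}319\) (in dollars; the regression output reports \(24.3193\) because aid is recorded in thousands of dollars) and \(b = -0.0431\). What do these numbers really mean?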

Solution:

Interpreting the slope parameter is helpful in almost any application. For each additional \$1,000 of family income, we would expect a student to receive a net difference of \$1,000 times the slope, \(-0.0431\), which equals \$43.10 less in aid on average. Note that higher family income corresponds to less aid because the coefficient of family income is negative in the model.

We must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational. That is, increasing a student's family income may not cause the student's aid to drop. (It would be reasonable to contact the college and ask if the relationship is causal — i.e. if Elmhurst College's aid decisions are partially based on students' family income.)

The estimated intercept \(a = 24{,}319\) describes the average aid, in dollars, if a student's family had no income. The meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is \$0. In other applications, the intercept may have little or no practical value if there are no observations where \(x\) is near zero.

> Inference for the Slope of a Regression Line
> Inference for the slope of a regression line is based on the \(t\)-distribution with \(n - 2\) degrees of freedom, where \(n\) is the number of paired observations.

Once we verify that conditions for using the \(t\)-distribution are met, we will be able to construct the confidence interval for the slope using a critical value \(t^{\star}\) based on \(n - 2\) degrees of freedom. We will use a table of the regression summary to find the point estimate and standard error for the slope.

Insight Note

Why \(n - 2\) degrees of freedom instead of \(n - 1\)? A regression line is determined by two quantities — a slope and an intercept — so two pieces of information are "used up" when we estimate them from the data. Every time we estimate an additional parameter, we lose one degree of freedom.

Try It Now 8.4.1

Source: Main Text

A study of 18 city blocks regresses observed traffic speed (mph) on posted speed limit (mph). The software output reports a sample slope of \(b = 0.84\) with standard error \(SE = 0.12\).

(a) What is the point estimate for the population slope \(\beta\)? (b) How many degrees of freedom should you use for inference on \(\beta\)? (c) In one sentence, interpret the slope in context.

Solution

(a) The point estimate for \(\beta\) is the sample slope \(b = 0.84\).

(b) With \(n = 18\) blocks, the degrees of freedom for slope inference are \(df = n - 2 = 18 - 2 = 16\).

(c) For each additional 1 mph increase in the posted speed limit, observed traffic speed increases by about 0.84 mph on average, based on the sample.

8.4.2 Conditions for the Least Squares Line

Conditions for inference in the context of regression can be more complicated than when dealing with means or proportions.

Inference for parameters of a regression line involves the following assumptions:

  • Linearity. The true relationship between the two variables follows a linear trend. We check whether this is reasonable by examining whether the data follow a linear trend. If there is a nonlinear trend (e.g. left panel of Figure 8.26), an advanced regression method from another book or later course should be applied.
  • Nearly normal residuals. For each \(x\)-value, the residuals should be nearly normal. When this assumption is found to be unreasonable, it is usually because of outliers or concerns about influential points. An example that suggests non-normal residuals is shown in the second panel of Figure 8.26. If the sample size \(n \ge 30\), then this assumption is not necessary.
  • Constant variability. The variability of points around the true least squares line is constant for all values of \(x\). An example of non-constant variability is shown in the third panel of Figure 8.26.
  • Independent. The observations are independent of one another. The observations can be considered independent when they are collected from a random sample or randomized experiment. Be careful of data collected sequentially in what is called a time series. An example of data collected in such a fashion is shown in the fourth panel of Figure 8.26.

We see in Figure 8.26 that patterns in the residual plots suggest that the assumptions for regression inference are not met in those four examples. In fact, nonlinear trends, outliers, and non-constant variability are often easier to detect in a residual plot than in a scatterplot.

We note that the second assumption regarding nearly normal residuals is particularly difficult to assess when the sample size is small. We can make a graph, such as a histogram, of the residuals, but we cannot expect a small data set to be nearly normal. All we can do is look for excessive skew or outliers. Outliers and influential points in the data can be seen from the residual plot as well as from a histogram of the residuals.

Conditions for Inference on the Slope of a Regression Line

1. The data is collected from a random sample or randomized experiment.

2. The residual plot appears as a random cloud of points and does not have any patterns or significant outliers that would suggest that the linearity, nearly normal residuals, constant variability, or independence assumptions are unreasonable.

Figure 8.26: Four examples showing when the inference methods in this chapter are insufficient to apply to the data. In the left panel, a straight line does not fit the data. In the second panel, there are outliers: two points on the left are relatively distant from the rest of the data, and one of these points is very far away from the line. In the third panel, the variability of the data around the line increases with larger values of \(x\). In the last panel, a time series data set is shown, where successive observations are highly correlated.

Figure 8.27: Left: Scatterplot of gift aid versus family income for 50 freshmen at Elmhurst College. Right: Residual plot for the model shown in left panel.

Context Pause

Why so many conditions just to test a slope? Because every one of them corresponds to a way the \(t\)-distribution can lie to you. Violating linearity means the regression line itself is the wrong model. Violating constant variability means the standard error is wrong. Violating independence means the whole sampling-distribution story breaks down. The residual plot is your single best diagnostic — one picture, four conditions.

Insight Note

A residual plot is essentially a scatterplot with the linear trend subtracted out. If the line genuinely captures the relationship, what's left over should look like random static. Any pattern you can see in the residuals is a pattern the line failed to capture.

Try It Now 8.4.2

Source: Main Text

You fit a regression line to 40 observations and your residual plot shows the residuals fanning out — small near the left side of the plot and much larger near the right side. Which regression condition is violated, and what does this imply about using the \(t\)-procedures for the slope?

Solution

A fan-shaped residual plot violates the constant variability condition. The typical size of the residuals is not the same for all \(x\)-values, so the standard error for the slope reported by the software will not be trustworthy. The \(t\)-interval and \(t\)-test for the slope should not be applied directly to this data set. A transformation of the variables (like taking a log of \(y\)) or a more advanced regression method would be needed before inference is appropriate.
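A rough numerical companion to the visual check is to compare the spread of the residuals on the left and right halves of the plot. The sketch below uses hypothetical fan-shaped data and an illustrative helper, `residual_spread_by_half`; it is a crude aid and not a substitute for looking at the residual plot itself.

```python
import statistics

def residual_spread_by_half(xs, ys):
    """Fit a least squares line, then return the SD of the residuals for
    the lower half and the upper half of the x-values."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    # Sort by x so the two halves correspond to left and right.
    resid = [y - (intercept + slope * x) for x, y in sorted(zip(xs, ys))]
    left, right = resid[: n // 2], resid[n // 2:]
    return statistics.stdev(left), statistics.stdev(right)

# Hypothetical fan-shaped data: scatter around y = 3 + 2x grows with x.
xs = list(range(1, 41))
ys = [3 + 2 * x + (0.2 * x if i % 2 == 0 else -0.2 * x)
      for i, x in enumerate(xs)]

left_sd, right_sd = residual_spread_by_half(xs, ys)
print(f"left-half SD:  {left_sd:.2f}")
print(f"right-half SD: {right_sd:.2f}")
```

A right-half standard deviation that is much larger than the left-half one is consistent with the fan shape described above, and signals that the software's standard error for the slope should not be trusted as is.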

8.4.3 Constructing a Confidence Interval for the Slope of a Regression Line

We would like to construct a confidence interval for the slope of the regression line for predicting gift aid based on family income for all freshmen at Elmhurst College.

Do conditions seem to be satisfied? We recall that the 50 freshmen in the sample were randomly chosen, so the observations are independent. Next, we need to look carefully at the scatterplot and the residual plot.

Always Check Conditions

Do not blindly apply formulas or rely on regression output; always first look at a scatterplot or a residual plot. If conditions for fitting the regression line are not met, the methods presented here should not be applied.

The scatterplot seems to show a linear trend, which matches the fact that there is no curved trend apparent in the residual plot. Also, the standard deviation of the residuals is mostly constant for different \(x\) values and there are no outliers or influential points. There are no patterns in the residual plot that would suggest that a linear model is not appropriate, so the conditions are reasonably met. We are now ready to calculate the 95% confidence interval.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---:|---:|---:|---:|
| (Intercept) | 24.3193 | 1.2915 | 18.83 | 0.0000 |
| family_income | -0.0431 | 0.0108 | -3.98 | 0.0002 |

Figure 8.28: Summary of least squares fit for the Elmhurst College data, where we are predicting gift aid by the university based on the family income of students.

Example 8.4.2

Source: Main Text

Construct a 95% confidence interval for the slope of the regression line for predicting gift aid from family income at Elmhurst College.

Solution:

As usual, the confidence interval will take the form:

$$ \text{point estimate} \;\pm\; \text{critical value} \times SE\ \text{of estimate} $$

The point estimate for the slope of the population regression line is the slope of the sample regression line: \(-0.0431\). The standard error of the slope can be read from the table as \(0.0108\). Note that we do not need to divide \(0.0108\) by \(\sqrt{n}\) or do any further calculations on \(0.0108\); \(0.0108\) is the \(SE\) of the slope. Note that the value of \(t\) given in the table refers to the test statistic, not the critical value \(t^{\star}\).

To find \(t^{\star}\), we use a \(t\)-table. Here \(n = 50\), so \(df = 50 - 2 = 48\). Using a \(t\)-table, we round down to row \(df = 40\) and estimate the critical value \(t^{\star} = 2.021\) for a 95% confidence level. The confidence interval is calculated as:

$$ -0{.}0431 \pm 2{.}021 \times 0{.}0108 = (-0{.}065,\ -0{.}021) $$

Note: \(t^{\star}\) using exactly 48 degrees of freedom is equal to \(2.01\) and gives the same interval of \((-0.065, -0.021)\).
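The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the helper name `slope_confidence_interval` is made up for illustration, and the inputs are the values read from the regression table and the \(t\)-table above.

```python
def slope_confidence_interval(b, se, t_star):
    """Return (lower, upper) for: point estimate +/- t_star * SE."""
    margin = t_star * se
    return b - margin, b + margin


# Values read from the Elmhurst regression table in this example.
b = -0.0431   # slope of the sample regression line
se = 0.0108   # SE of the slope, used as-is (no division by sqrt(n))

# t* from the table row df = 40, then the value quoted for df = 48.
for t_star in (2.021, 2.01):
    lower, upper = slope_confidence_interval(b, se, t_star)
    print(f"t* = {t_star}: ({lower:.3f}, {upper:.3f})")
```

Both choices of \(t^{\star}\) print the same interval once the endpoints are rounded to three decimal places, which is the point of the note above.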

Example 8.4.3

Source: Main Text

Interpret the confidence interval in context. What can we conclude?

Solution:

We are 95% confident that the slope of the population regression line (the true average change in gift aid for each additional \$1,000 in family income) is between \(-0.065\) and \(-0.021\) thousand dollars. That is, we are 95% confident that, on average, when family income is \$1,000 higher, gift aid is between \$21 and \$65 lower.

Because the entire interval is negative, we have evidence that the slope of the population regression line is less than \(0\). In other words, we have evidence that there is a significant negative linear relationship between gift aid and family income.

Constructing a Confidence Interval for the Slope of a Regression Line

To carry out a complete confidence interval procedure to estimate the slope of the population regression line \(\beta\):

Identify: Identify the parameter and the confidence level, C%.

The parameter will be a slope of the population regression line — e.g. the slope of the population regression line relating air quality index to average rainfall per year for each city in the United States.

Choose: Choose the correct interval procedure and identify it by name.

To estimate the slope of a regression model we use a \(t\)-interval for the slope.

Check: Check conditions for using a \(t\)-interval for the slope.

1. Independence: Data should come from a random sample or randomized experiment. If sampling without replacement, check that the sample size is less than 10% of the population size.
2. Linearity: Check that the scatterplot does not show a curved trend and that the residual plot shows no ∪-shape pattern.
3. Constant variability: Use the residual plot to check that the standard deviation of the residuals is constant across all \(x\)-values.
4. Normality: The population of residuals is nearly normal or the sample size is \(\ge 30\). If the sample size is less than 30, check for strong skew or outliers in the sample residuals. If neither is found, then the condition that the population of residuals is nearly normal is considered reasonable.

Calculate: Calculate the confidence interval and record it in interval form.

$$ \text{point estimate} \;\pm\; t^{\star} \times SE\ \text{of estimate}, \quad df = n - 2 $$

- point estimate: the slope \(b\) of the sample regression line
- \(SE\) of estimate: \(SE\) of slope (find using computer output)
- \(t^{\star}\): use a \(t\)-distribution with \(df = n - 2\) and confidence level C%

Conclude: Interpret the interval and, if applicable, draw a conclusion in context.

We are C% confident that the true slope of the regression line — the average change in \(y\) for each unit increase in \(x\) — is between ___ and ___. If applicable, draw a conclusion based on whether the interval is entirely above, is entirely below, or contains the value \(0\).
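When a \(t\)-table does not list the exact \(df\), the critical value \(t^{\star}\) can also be approximated numerically. The sketch below is one possible approach using only Python's standard library: it integrates the \(t\) density with Simpson's rule and then uses bisection to find the point where the central area equals the confidence level. The function names are illustrative, and this is not a description of how any particular calculator or software package does it.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def central_area(t, df, steps=2000):
    """Area under the t density between -t and t (Simpson's rule)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # doubled because the density is symmetric


def t_star(df, confidence=0.95):
    """Bisection search for t* whose central area equals the confidence level."""
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2
        if central_area(mid, df) < confidence:
            low = mid
        else:
            high = mid
    return (low + high) / 2


for df in (40, 48, 100, 102):
    print(df, round(t_star(df), 3))
```

The values for \(df = 40\) and \(df = 100\) should land very close to the table entries \(2.021\) and \(1.984\) used in this section, and the values for \(df = 48\) and \(df = 102\) show how little is lost by rounding down to a nearby table row.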

Figure 8.29: Left: Scatterplot of head length versus total length for 104 brushtail possums. Right: Residual plot for the model shown in left panel.

Example 8.4.4

Source: Main Text

The regression summary below shows statistical software output from fitting the least squares regression line for predicting head length from total length for 104 brushtail possums. The scatterplot and residual plot are shown above.

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 42.70979 | 5.17281 | 8.257 | 5.66e-13 |
| total_length | 0.57290 | 0.05933 | 9.657 | 4.68e-16 |

S = 2.595, R-Sq = 47.76%, R-Sq(adj) = 47.25%

Construct a 95% confidence interval for the slope of the regression line. Is there convincing evidence that there is a positive, linear relationship between head length and total length?

Solution:

Identify: The parameter of interest is the slope of the population regression line for predicting head length from total length. We want to estimate this at the 95% confidence level.

Choose: Because the parameter to be estimated is the slope of a regression line, we will use the \(t\)-interval for the slope.

Check: These data come from a random sample. The residual plot shows no pattern, so a linear model seems reasonable. The residual plot also shows that the residuals have constant standard deviation. Finally, \(n = 104 \ge 30\) so we do not have to check for skew in the residuals. All four conditions are met.

Calculate: We will calculate the interval: \(\text{point estimate} \pm t^{\star} \times SE\ \text{of estimate}\).

We read the slope of the sample regression line and the corresponding \(SE\) from the table. The point estimate is \(b = 0.57290\). The \(SE\) of the slope is \(0.05933\), which can be found next to the slope of \(0.57290\). The degrees of freedom is \(df = n - 2 = 104 - 2 = 102\). As before, we find the critical value \(t^{\star}\) using a \(t\)-table (the \(t^{\star}\) value is not the same as the \(T\)-statistic for the hypothesis test). Using the \(t\)-table at row \(df = 100\) (round down since 102 is not on the table) and confidence level 95%, we get \(t^{\star} = 1.984\).

So the 95% confidence interval is given by:

$$ 0{.}57290 \pm 1{.}984 \times 0{.}05933 = (0{.}455,\ 0{.}691) $$

Conclude: We are 95% confident that the slope of the population regression line is between \(0.455\) and \(0.691\). That is, we are 95% confident that the true average increase in head length for each additional cm in total length is between \(0.455\) mm and \(0.691\) mm. Because the interval is entirely above \(0\), we do have evidence of a positive linear association between head length and total length for brushtail possums.

Guided Practice 8.22

Source: Main Text

Figure 8.21 shows statistical software output from fitting the least squares regression line shown in Figure 8.15. Use this output to formally evaluate the following hypotheses.

\(H_0\): The true coefficient for family income is zero.

\(H_A\): The true coefficient for family income is not zero.

Figure 8.21: Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 24319.3 | 1291.5 | 18.83 | <0.0001 |
| family_income | -0.0431 | 0.0108 | -3.98 | 0.0002 |

df = 48

Solution

From the table, the point estimate is \(b = -0.0431\) with \(SE = 0.0108\) on \(df = 48\). The reported test statistic is

$$ T = \frac{-0{.}0431 - 0}{0{.}0108} = -3{.}98, $$

and the two-sided p-value is \(0.0002\). Since the p-value is far below any standard significance level (e.g. \(\alpha = 0.05\)), we reject \(H_0\). The data provide convincing evidence that the true coefficient for family income is not zero — i.e. there is a real linear relationship between family income and gift aid at Elmhurst College.

Insight Note

Every confidence interval for a slope is also a two-sided hypothesis test in disguise: a 95% interval corresponds to a test at \(\alpha = 0.05\). If \(0\) is inside the interval, you cannot rule out "no linear relationship." If \(0\) is outside, you can. For the Elmhurst data the interval is \((-0.065, -0.021)\), entirely negative, which matches the test's conclusion that \(\beta \ne 0\).

Try It Now 8.4.3

Source: Main Text

A regression of a child's reading score on time spent reading at home (hours/week) for \(n = 25\) students gives \(b = 1.6\) with \(SE = 0.55\). Assume the four conditions for slope inference are met.

(a) What is the critical value \(t^{\star}\) for a 95% confidence interval, using \(df = n - 2\)? (Use \(df = 20\) from the table.)

(b) Compute the 95% confidence interval for \(\beta\).

(c) Does the interval contain \(0\)? What does that tell you?

Solution

(a) \(df = 25 - 2 = 23\). Rounding down to \(df = 20\) in the \(t\)-table, \(t^{\star} \approx 2.086\).

(b) CI: \(1.6 \pm 2.086 \times 0.55 = 1.6 \pm 1.147 = (0.453,\ 2.747)\).

(c) The interval is entirely above \(0\), so we have evidence of a positive linear relationship between time spent reading at home and reading score. Each additional hour per week is associated with an increase of roughly \(0.45\) to \(2.75\) points on the reading score, on average.

Context Pause

The critical value \(t^{\star}\) and the test statistic \(T\) are easy to confuse because they're both on the \(t\)-distribution and both look like "t-something." Keep them distinct: \(t^{\star}\) is a table lookup determined only by \(df\) and the confidence level, so it depends on your data only through the sample size. \(T\) is computed from your data and measures how far the observed slope is from zero in standard-error units. Mixing them up is a common slope-inference error.

8.4.4 Midterm Elections and Unemployment

Elections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S. Presidential election. The set of House elections occurring during the middle of a Presidential term are called midterm elections. In America's two-party system, one political theory suggests the higher the unemployment rate, the worse the President's party will do in the midterm elections.

To assess the validity of this claim, we can compile historical data and look for a connection. We consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression. Figure 8.30 shows these data and the least-squares regression line:

$$ \text{percent change in House seats for President's party} = -7{.}36 - 0{.}89 \times (\text{unemployment rate}) $$

We consider the percent change in the number of seats of the President's party (e.g. percent change in the number of seats for Republicans in 2018) against the unemployment rate.

Examining the data, there are no clear deviations from linearity, the constant variance condition, or the normality of residuals. While the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.

Figure 8.30: The percent change in House seats for the President's party in each election from 1898 to 2018 plotted against the unemployment rate. The two points for the Great Depression have been removed, and a least squares regression line has been fit to the data.

Guided Practice 8.32

Source: Main Text

The data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively. Do you agree that they should be removed for this investigation? Why or why not?

Solution

There is a reasonable case for removing these two observations. Unemployment rates of 21% and 18% are extreme outliers compared to every other midterm election in the data set — they are far outside the typical range of unemployment rates during the last century. As influential points, they could pull the regression line dramatically and make it less representative of the typical midterm election relationship we are trying to describe.

That said, removing outliers is never neutral. A reader of the analysis should be told that these points were dropped and given the reasoning. A responsible write-up reports the results both with and without them, so the audience can judge how sensitive the conclusion is to the Great Depression observations.

Testing for the Slope Using a Cutoff of 0.05

What does it mean to say that the slope of the population regression line is significantly greater than \(0\)? And why do we tend to use a cutoff of \(\alpha = 0.05\)? See the 5-minute interactive task at www.openintro.org/why05 for an explanation.

Context Pause

Statistical significance is not the same as practical significance. A slope of \(-0.89\) means roughly \(0.89\) percentage points of seats lost for every 1 percentage point higher unemployment, and that is a real political story. But whether the data provide convincing evidence for that story is a separate question, and it's the one the hypothesis test answers.

Insight Note

When the data refuse to reject \(H_0\), it doesn't mean "there is no relationship." It means "with this sample size and this amount of noise, we can't rule out zero." The absence of evidence is not evidence of absence — it is a call for more data or a different study.

Try It Now 8.4.4

Source: Main Text

Using the midterm-election regression (slope \(b = -0.8897\), \(SE = 0.8350\), \(n = 27\)), carry out a one-sided test of \(H_0: \beta = 0\) vs. \(H_A: \beta < 0\) at \(\alpha = 0.05\).

(a) What is the test statistic \(T\)?

(b) What are the degrees of freedom?

(c) Given that the two-sided p-value from the output is \(0.2961\), what is the one-sided p-value?

(d) State the conclusion in context.

Solution

(a) \(T = \dfrac{-0{.}8897 - 0}{0{.}8350} = -1{.}07\).

(b) \(df = n - 2 = 27 - 2 = 25\).

(c) The one-sided p-value is half the two-sided value: \(0.2961 / 2 \approx 0.148\). This halving is valid only because our observed slope (\(-0.8897\)) is in the direction of \(H_A\) (negative).

(d) The one-sided p-value \(\approx 0.148\) is much larger than \(\alpha = 0.05\), so we fail to reject \(H_0\). The data do not provide convincing evidence that higher unemployment is associated with a larger loss of House seats for the President's party in midterm elections.

8.4.5 Understanding Regression Output from Software

The residual plot shown in Figure 8.31 shows no pattern that would indicate that a linear model is inappropriate. Therefore we can carry out a test on the population slope using the sample slope as our point estimate. Just as for other point estimates we have seen before, we can compute a standard error and test statistic for \(b\). The test statistic \(T\) follows a \(t\)-distribution with \(n - 2\) degrees of freedom.

Figure 8.31: The residual plot shows no pattern that would indicate that a linear model is inappropriate.

Hypothesis Tests on the Slope of the Regression Line

Use a \(t\)-test with \(n - 2\) degrees of freedom when performing a hypothesis test on the slope of a regression line.

We will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course. Figure 8.32 shows software output for the least squares regression line in Figure 8.30. The row labeled unemp represents the information for the slope, which is the coefficient of the unemployment variable.

Figure 8.32: Least squares regression summary for the percent change in seats of President's party in House of Representatives based on percent unemployment.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -7.3644 | 5.1553 | -1.43 | 0.1646 |
| unemp | -0.8897 | 0.8350 | -1.07 | 0.2961 |

Figure 8.33: The distribution shown here is the sampling distribution for \(b\), if the null hypothesis were true. The shaded tail represents the p-value for the hypothesis test evaluating whether there is convincing evidence that higher unemployment corresponds to a greater loss of House seats for the President's party during a midterm election.

Example 8.4.5

Source: Main Text

What does the first column of numbers in the regression summary represent?

Solution:

The entries in the first column represent the least squares estimates for the \(y\)-intercept and slope, \(a\) and \(b\) respectively. Using this information, we could write the equation for the least squares regression line as

$$ \hat{y} = -7{.}3644 - 0{.}8897 x, $$

where \(y\) in this case represents the percent change in the number of seats for the President's party, and \(x\) represents the unemployment rate.

We previously used a test statistic \(T\) for hypothesis testing in the context of means. Regression is very similar. Here, the point estimate is \(b = -0.8897\). The \(SE\) of the estimate is \(0.8350\), which is given in the second column next to the estimate of \(b\). This \(SE\) represents the typical error when using the slope of the sample regression line to estimate the slope of the population regression line.

The null value for the slope is \(0\), so we now have everything we need to compute the test statistic. We have:

$$ T = \frac{\text{point estimate} - \text{null value}}{SE\ \text{of estimate}} = \frac{-0{.}8897 - 0}{0{.}8350} = -1{.}07 $$

This value corresponds to the \(T\)-score reported in the regression output in the third column along the unemp row.

Example 8.4.6

Source: Main Text

In this example, the sample size \(n = 27\). Identify the degrees of freedom and p-value for the hypothesis test.

Solution:

The degrees of freedom are \(df = n - 2 = 27 - 2 = 25\). For a two-sided test, the p-value is the area in the two tails beyond \(\pm T = \pm 1.07\) on a \(t\)-distribution with \(df = 25\), which equals \(0.2961\) as reported in the last column of the table.

Because the p-value is so large, we do not reject the null hypothesis. That is, the data do not provide convincing evidence that a higher unemployment rate is associated with a larger loss for the President's party in the House of Representatives in midterm elections.
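The tail area described in this example can be approximated without a table. The following is a rough sketch using only the standard library: it integrates the \(t\) density between \(-|T|\) and \(|T|\) with Simpson's rule and subtracts the result from 1. The function names are illustrative, and because the slope and \(SE\) are taken from the rounded table entries, the result may differ slightly from the software's \(0.2961\) in the last digit.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """Area in both tails beyond +/- |t_stat| under the t-distribution."""
    t = abs(t_stat)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # area between -t and t, by symmetry
    return 1 - central


# Slope row of the midterm-election regression output.
b, se, n = -0.8897, 0.8350, 27
t_stat = (b - 0) / se   # null value for the slope is 0
df = n - 2
p = two_sided_p_value(t_stat, df)
print(f"T = {t_stat:.2f}, df = {df}, two-sided p-value = {p:.3f}")
```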

Don't Carelessly Use the P-Value from Regression Output

The last column in regression output often lists p-values for one particular hypothesis: a two-sided test where the null value is zero. If your test is one-sided and the point estimate is in the direction of \(H_A\), then you can halve the software's p-value to get the one-tail area. If neither of these scenarios matches your hypothesis test, be cautious about using the software output to obtain the p-value.

Example 8.4.7

Source: Main Text

Use the table in Figure 8.32 to determine the p-value for the hypothesis test.

Solution:

The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: \(0.2961\). Because this p-value is much larger than \(0.05\), we do not reject \(H_0\). That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections.

Hypothesis Test for the Slope of a Regression Line

To carry out a complete hypothesis test for the claim that there is no linear relationship between two numerical variables, i.e. that \(\beta = 0\):

Identify: Identify the hypotheses and the significance level, \(\alpha\).

$$ H_0: \beta = 0 $$

$$ H_A: \beta \ne 0, \quad H_A: \beta > 0, \quad \text{or} \quad H_A: \beta < 0 $$

Choose: Choose the correct test procedure and identify it by name.

To test hypotheses about the slope of a regression model we use a \(t\)-test for the slope.

Check: Check conditions for using a \(t\)-test for the slope (the same four conditions as for the interval).

Calculate: Calculate the \(t\)-statistic, \(df\), and p-value.

$$ T = \frac{\text{point estimate} - \text{null value}}{SE\ \text{of estimate}}, \quad df = n - 2 $$

- point estimate: the slope \(b\) of the sample regression line
- \(SE\) of estimate: \(SE\) of slope (find using computer output)
- null value: \(0\)
- p-value: based on the \(t\)-statistic, the \(df\), and the direction of \(H_A\)

Conclude: Compare the p-value to \(\alpha\), and draw a conclusion in context.

- If the p-value \(< \alpha\), reject \(H_0\); there is sufficient evidence that [\(H_A\) in context].
- If the p-value \(> \alpha\), do not reject \(H_0\); there is not sufficient evidence that [\(H_A\) in context].

Example 8.4.8

Source: Main Text

The regression summary below shows statistical software output from fitting the least squares regression line for predicting gift aid based on family income for 50 randomly selected freshman students at Elmhurst College. The scatterplot and residual plot were shown in Figure 8.27.

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 24.31933 | 1.29145 | 18.831 | < 2e-16 |
| family_income | -0.04307 | 0.01081 | -3.985 | 0.000229 |

$$ S = 4{.}783 \quad R\text{-}Sq = 24{.}86\% \quad R\text{-}Sq(adj) = 23{.}29\% $$

Do these data provide convincing evidence that there is a negative, linear relationship between family income and gift aid? Carry out a complete hypothesis test at the 0.05 significance level. Use the five step framework to organize your work.

Solution:

Identify: We will test the following hypotheses at the \(\alpha = 0.05\) significance level.

\(H_0\): \(\beta = 0\). There is no linear relationship.

\(H_A\): \(\beta < 0\). There is a negative linear relationship.

Here, \(\beta\) is the slope of the population regression line for predicting gift aid from family income at Elmhurst College.

Choose: Because the hypotheses are about the slope of a regression line, we choose the \(t\)-test for a slope.

Check: The data come from a random sample of less than 10% of the total population of freshman students at Elmhurst College. The lack of any pattern in the residual plot indicates that a linear model is reasonable. Also, the residual plot shows that the residuals have constant variance. Finally, \(n = 50 \ge 30\) so we do not have to worry too much about any skew in the residuals. All four conditions are met.

Calculate: We will calculate the \(t\)-statistic, degrees of freedom, and the p-value.

We read the slope of the sample regression line and the corresponding \(SE\) from the table.

- The point estimate is: \(b = -0.04307\).
- The \(SE\) of the slope is: \(SE = 0.01081\).

$$ T = \frac{-0{.}04307 - 0}{0{.}01081} = -3{.}985 $$

Because \(H_A\) uses a less-than sign (\(<\)), meaning that it is a lower-tail test, the p-value is the area to the left of \(t = -3.985\) under the \(t\)-distribution with \(50 - 2 = 48\) degrees of freedom.

$$ \text{p-value} = \tfrac{1}{2}(0{.}000229) \approx 0{.}0001 $$

Conclude: The p-value of \(0.0001\) is \(< 0.05\), so we reject \(H_0\); there is sufficient evidence that there is a negative linear relationship between family income and gift aid at Elmhurst College.

Guided Practice 8.36

Source: Main Text

In context, interpret the p-value from the previous example.

Solution

The p-value of roughly \(0.0001\) is the probability of observing a sample slope as extreme as \(b = -0.04307\) — or more extreme in the direction of \(H_A\) — if the true slope of the population regression line really were zero (i.e., if there were no linear relationship between family income and gift aid).

Because this probability is so tiny, the observed slope is very hard to explain by random chance alone. That is why we reject \(H_0\): the data are inconsistent with the "no-relationship" hypothesis, and we instead conclude that higher family income is genuinely associated with lower gift aid at Elmhurst College.

Try It Now 8.4.5

Source: Main Text

In a regression summary, the row for the predictor study_hours shows Estimate \(= 3.2\), Std. Error \(= 0.9\), t value \(= 3.56\), and \(\Pr(>|t|) = 0.0010\). The sample size is \(n = 42\).

(a) What are the hypotheses corresponding to the reported p-value?

(b) Explain what the \(t\) value column was computed from.

(c) At \(\alpha = 0.01\), what do you conclude about \(\beta\)?

Solution

(a) The reported p-value is two-sided, so \(H_0: \beta = 0\) vs. \(H_A: \beta \ne 0\).

(b) \(t = b / SE = 3.2 / 0.9 \approx 3.56\). It's the standardized distance from the null slope of \(0\), measured in standard errors.

(c) \(p = 0.0010 < 0.01\), so we reject \(H_0\). There is convincing evidence that the true slope of the population regression line is not zero; that is, study_hours has a nonzero linear association with the response variable.

Context Pause

Regression output tables look intimidating but carry only four pieces of information per row: the point estimate, its standard error, the \(t\)-statistic, and the two-sided p-value. That's it. Software packages label the columns differently (Coef, Estimate, Coefficient, b), but the meaning is the same. Once you can read one table you can read them all.

Insight Note

The p-values reported in regression output are typically for two-sided tests of \(H_0: \beta = 0\). If your research question is one-sided ("does \(x\) increase \(y\)?"), halve the reported p-value, but only if your observed slope is in the direction of \(H_A\). If the observed slope is opposite to \(H_A\), the correct one-sided p-value is actually \(1 - p/2\), which is always at least \(0.5\). Mindlessly halving without checking direction is a common mistake.
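The direction check described in this note is easy to encode. The sketch below assumes the software's p-value is for a two-sided test with null value \(0\); the function name `one_sided_p_value` and the `"less"`/`"greater"` labels are illustrative choices, not part of any particular software package.

```python
def one_sided_p_value(two_sided_p, estimate, alternative):
    """Convert a two-sided p-value (null value 0) to a one-sided p-value.

    alternative: "less" for H_A: beta < 0, "greater" for H_A: beta > 0.
    """
    if alternative not in ("less", "greater"):
        raise ValueError("alternative must be 'less' or 'greater'")
    in_direction = estimate < 0 if alternative == "less" else estimate > 0
    half = two_sided_p / 2
    # Halve only when the estimate points the same way as H_A.
    return half if in_direction else 1 - half


# Midterm elections: slope -0.8897, two-sided p-value 0.2961, H_A: beta < 0.
print(one_sided_p_value(0.2961, -0.8897, "less"))

# Same output row, but with the (mismatched) alternative H_A: beta > 0.
print(one_sided_p_value(0.2961, -0.8897, "greater"))

# Elmhurst: slope -0.04307, two-sided p-value 0.000229, H_A: beta < 0.
print(one_sided_p_value(0.000229, -0.04307, "less"))
```

The second call shows what goes wrong with mindless halving: the observed slope is negative, so a test of \(H_A: \beta > 0\) gets a p-value well above \(0.5\), not \(0.148\).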

8.4.6 Technology: the \(t\)-test/Interval for the Slope

We generally rely on regression output from statistical software programs to provide us with the necessary quantities: \(b\) and \(SE\) of \(b\). However we can also find the test statistic, p-value, and confidence interval using Desmos or a handheld calculator.

Get started quickly with the Desmos T-Test/Interval Calculator (available at openintro.org/ahss/desmos).

For instructions on implementing the T-Test/Interval on the TI or Casio, see the Graphing Calculator Guides at openintro.org/ahss.

Inference for Regression

We usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice. However, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met.

Context Pause

Software makes slope inference nearly effortless — you feed it two columns and it hands you a full regression table. The price of that convenience is a tendency to skip the residual plot. A significant p-value from a regression that violates linearity or constant variability is worse than no p-value at all, because it feels authoritative. Always look at the picture before you quote the number.

Insight Note

Every entry in the regression table has a familiar analog: Estimate is \(b\), Std. Error is \(SE\), t value is \(b/SE\), and Pr(>|t|) is the two-sided p-value. Once you see the pattern, every new software package becomes readable in seconds.
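To make the pattern concrete, the sketch below splits one row of output into its four numbers and recomputes the \(t\) value from the first two. The helper `parse_coefficient_row` is hypothetical, and it assumes a whitespace-separated row whose last column is a plain number; entries such as `< 2e-16` or `<0.0001` would need extra handling.

```python
def parse_coefficient_row(line):
    """Split a whitespace-separated output row into its labelled pieces."""
    name, estimate, std_error, t_value, p_value = line.split()
    return {
        "term": name,
        "estimate": float(estimate),    # b
        "std_error": float(std_error),  # SE of b
        "t_value": float(t_value),      # reported b / SE
        "p_value": float(p_value),      # two-sided p-value
    }


# Slope row from the midterm-election output (Figure 8.32).
row = parse_coefficient_row("unemp -0.8897 0.8350 -1.07 0.2961")

# The t value column is just estimate / SE, rounded by the software.
recomputed_t = row["estimate"] / row["std_error"]
print(row["term"], round(recomputed_t, 2), row["t_value"])
```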

Try It Now 8.4.6

Source: Main Text

You type a data set of \(n = 30\) paired observations into software and the row for your predictor reads Estimate \(= -0.42\), Std. Error \(= 0.25\), t value \(= -1.68\), \(\Pr(>|t|) = 0.104\).

(a) Verify the reported \(t\) value by hand from the Estimate and Std. Error.

(b) What is the two-sided p-value for \(H_0: \beta = 0\)?

(c) Build a 95% confidence interval for \(\beta\) using \(t^{\star} \approx 2.048\) (from \(df = 28\)).

(d) Does your interval include \(0\)? Is that consistent with the reported p-value at \(\alpha = 0.05\)?

Solution

(a) \(t = -0.42 / 0.25 = -1.68\). Matches the table.

(b) The two-sided p-value is \(0.104\) — read directly from the output.

(c) CI: \(-0.42 \pm 2.048 \times 0.25 = -0.42 \pm 0.512 = (-0.932,\ 0.092)\).

(d) Yes, the interval contains \(0\). This is consistent with the p-value \(0.104 > 0.05\): we fail to reject \(H_0\) at the 5% level, which is exactly what "zero is in the interval" tells us.
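The agreement in part (d) is not a coincidence: the interval \(b \pm t^{\star} \times SE\) contains \(0\) exactly when \(|T| = |b / SE| \le t^{\star}\), which is the same as failing to reject \(H_0\) in a two-sided test at the matching level. The short sketch below checks both conditions side by side for this exercise and for the Elmhurst slope; the function names are illustrative.

```python
def interval_contains_zero(b, se, t_star):
    """True when 0 lies inside b +/- t_star * SE."""
    lower, upper = b - t_star * se, b + t_star * se
    return lower <= 0 <= upper


def fails_to_reject(b, se, t_star):
    """Two-sided test of beta = 0 at the matching level: |T| <= t*."""
    return abs(b / se) <= t_star


examples = {
    "Try It Now 8.4.6": (-0.42, 0.25, 2.048),   # df = 28
    "Elmhurst": (-0.0431, 0.0108, 2.021),       # table row df = 40
}
for label, (b, se, t_star) in examples.items():
    print(label,
          interval_contains_zero(b, se, t_star),
          fails_to_reject(b, se, t_star))
```

For each data set the two answers agree: both are `True` for this exercise and both are `False` for the Elmhurst slope.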

8.4.7 Which Inference Procedure to Use for Paired Data?

In Section 7.2.4, we looked at a set of paired data involving the price of textbooks for UCLA courses at the UCLA Bookstore and on Amazon. The left panel of Figure 8.34 shows the difference in price (UCLA Bookstore − Amazon) for each book. Because we have two data points on each textbook, it also makes sense to construct a scatterplot, as seen in the right panel of Figure 8.34.

Figure 8.34: Left: histogram of the difference (UCLA Bookstore price − Amazon price) for each book sampled. Right: scatterplot of Amazon Price versus UCLA Bookstore price.

Example 8.4.9

Source: Main Text

What additional information does the scatterplot provide about the price of textbooks at UCLA Bookstore and on Amazon?

Solution:

With a scatterplot, we see the relationship between the variables. We can see that when UCLA Bookstore price is larger, Amazon price also tends to be larger. We can consider the strength of the correlation, and we can fit the regression line for predicting Amazon price from UCLA Bookstore price.

Example 8.4.10

Source: Main Text

Which test should we do if we want to check whether:

1. prices for textbooks for UCLA courses are higher at the UCLA Bookstore than on Amazon;
2. there is a significant, positive linear relationship between UCLA Bookstore price and Amazon price?

Solution:

In the first case, we are interested in whether the differences (UCLA Bookstore − Amazon) for all UCLA textbooks are, on average, greater than \(0\), so we would do a one-sample \(t\)-test for a mean of differences. In the second case, we are interested in whether the slope of the regression line for predicting Amazon price from UCLA Bookstore price is significantly greater than \(0\), so we would do a \(t\)-test for the slope of a regression line.

Likewise, a one-sample \(t\)-interval for a mean of differences would provide an interval of reasonable values for the mean of differences in textbook price between UCLA Bookstore and Amazon (for all UCLA textbooks), while a \(t\)-interval for the slope would provide an interval of reasonable values for the slope of the regression line for predicting Amazon price from UCLA Bookstore price (for all UCLA textbooks).

Inference for Paired Data

A one-sample \(t\)-interval or \(t\)-test for a mean of differences only makes sense when we are asking whether, on average, one variable is greater than, less than, or different from another (think histogram of the differences). A \(t\)-interval or \(t\)-test for the slope of a regression line makes sense when we are interested in the linear relationship between them (think scatterplot).
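To see how the same two columns feed two different questions, here is a small sketch. The five price pairs are hypothetical values invented for illustration (they are not the UCLA textbook data), and the variable names are arbitrary. The first half computes the quantities behind the one-sample \(t\)-test for a mean of differences; the second half computes the slope that a \(t\)-test for the slope would be about.

```python
import math
import statistics

# Hypothetical prices in dollars for five textbooks (not the UCLA data).
bookstore = [50.0, 80.0, 120.0, 30.0, 95.0]
amazon = [45.0, 70.0, 110.0, 32.0, 85.0]
n = len(bookstore)

# Question 1 (histogram view): is the mean difference greater than 0?
diffs = [u - a for u, a in zip(bookstore, amazon)]
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(n)
t_paired = mean_diff / se_diff            # compare to a t with df = n - 1

# Question 2 (scatterplot view): slope for predicting Amazon price
# from bookstore price, using the least squares formula Sxy / Sxx.
x_bar = statistics.mean(bookstore)
y_bar = statistics.mean(amazon)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bookstore, amazon))
sxx = sum((x - x_bar) ** 2 for x in bookstore)
slope = sxy / sxx                         # its t-test uses df = n - 2

print(f"mean difference = {mean_diff:.2f}, paired T = {t_paired:.2f}")
print(f"slope = {slope:.3f}")
```

The mean difference is measured in dollars and answers "how much more does the bookstore charge on average?", while the slope is a rate and answers "how does Amazon price change as bookstore price changes?". Neither number can be recovered from the other.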

Example 8.4.11

Source: Main Text

Previously, we looked at the relationship between total length and head length for brushtail possums. We also looked at the relationship between gift aid and family income for freshmen at Elmhurst College. Could we do a one-sample \(t\)-test in either of these scenarios?

Solution:

We have to ask ourselves: does it make sense to ask whether, on average, total length is greater than head length? Similarly, does it make sense to ask whether, on average, gift aid is greater than family income? These don't seem to be meaningful research questions; a one-sample \(t\)-test for a mean of differences would not be useful here.

Guided Practice 8.40

Source: Main Text

A teacher gives her class a pretest and a posttest. Does this result in paired data? If so, which hypothesis test should she use?

Solution

Yes, this is paired data — each student contributes two observations (pretest score, posttest score), and the pairing is meaningful because the two scores come from the same individual.

The natural question is "did scores change, on average, between pretest and posttest?" That's a question about the mean of the differences (posttest − pretest), so she should use a one-sample \(t\)-test for a mean of differences. A \(t\)-test for a slope would only be appropriate if she wanted to model posttest score as a linear function of pretest score — a different question.

Try It Now 8.4.7

Source: Main Text

For each research question, state whether the appropriate procedure is a one-sample \(t\)-test for a mean of differences or a \(t\)-test for the slope of a regression line.

(a) Among married couples in a city, is the husband's age, on average, greater than the wife's age?

(b) Among the same couples, is there a positive linear relationship between husband's age and wife's age?

(c) In a study of twins, is the first-born twin's birth weight typically different from the second-born's?

(d) In the same study, does a heavier first-born twin tend to predict a heavier second-born twin?

Solution

(a) One-sample \(t\)-test for a mean of differences. The question is about the typical difference (husband − wife) being different from \(0\).

(b) \(t\)-test for the slope. The question is whether the two variables have a linear association.

(c) One-sample \(t\)-test for a mean of differences. The question asks about the average difference in birth weight between first- and second-born twins.

(d) \(t\)-test for the slope. The question is about a linear relationship between the two weights, not the average difference.

Context Pause

"Paired data" by itself doesn't dictate which test to use — what matters is the question you're asking. A histogram-shaped question ("is \(A\) on average bigger than \(B\)?") is a paired \(t\)-test. A scatterplot-shaped question ("when \(A\) goes up, does \(B\) go up too?") is a slope test. The same two columns of numbers can appropriately feed either test; the research question chooses.

Insight Note

The paired \(t\)-test and the slope test are not substitutes — they answer orthogonal questions. It's perfectly reasonable for a dataset to show a significant paired difference but a non-significant slope, or vice versa. If both tests feel relevant, you probably need both: report the mean difference and the correlation.
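To see the two questions side by side, here is a minimal sketch that computes both test statistics from the same two columns. The five pairs of scores are hypothetical (they do not come from the text), and the standard error of the slope uses the standard formula \(SE_b = s / \sqrt{\sum (x - \bar{x})^2}\), which this section does not state explicitly.

```python
import math

# Hypothetical paired scores for five students (illustration only).
pre = [10, 20, 30, 40, 50]
post = [20, 40, 50, 40, 50]
n = len(pre)

# Question 1: is the average change different from 0?
# One-sample t statistic for the mean of the differences, df = n - 1.
diffs = [post_i - pre_i for pre_i, post_i in zip(pre, post)]
d_bar = sum(diffs) / n
s_d = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
t_diff = d_bar / (s_d / math.sqrt(n))

# Question 2: is there a linear relationship between pre and post?
# t statistic for the slope of the least squares line, df = n - 2.
x_bar = sum(pre) / n
y_bar = sum(post) / n
sxx = sum((x - x_bar) ** 2 for x in pre)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre, post))
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(pre, post))
s = math.sqrt(sse / (n - 2))      # standard deviation of the residuals
se_b = s / math.sqrt(sxx)         # assumed standard formula for SE of the slope
t_slope = b / se_b

print(f"mean difference = {d_bar:.1f}, T = {t_diff:.3f}, df = {n - 1}")
print(f"slope = {b:.2f}, T = {t_slope:.3f}, df = {n - 2}")
```

The two statistics come from different summaries of the data (the differences versus the fitted line), so there is no reason to expect them to agree.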

Section Summary

In Chapter 6, we used a \(\chi^2\) test for independence to test for association between two categorical variables. In this section, we test for association/correlation between two numerical variables.

  • We use the slope \(b\) as a point estimate for the slope \(\beta\) of the population regression line. The slope of the population regression line is the true average change in \(y\) for each unit increase in \(x\). If the slope of the population regression line is \(0\), there is no linear relationship between the two variables.
  • Under certain assumptions, the sampling distribution for \(b\) is normal and the distribution of the standardized test statistic using the standard error of the slope follows a \(t\)-distribution with \(n - 2\) degrees of freedom.
  • When there is \((x, y)\) data and the parameter of interest is the slope of the population regression line:
      • Estimate \(\beta\) at the C% confidence level using a \(t\)-interval for the slope.
      • Test \(H_0: \beta = 0\) at the \(\alpha\) significance level using a \(t\)-test for the slope.
  • The conditions for the \(t\)-interval and \(t\)-test for the slope of a regression line are the same:
      1. Independence: Data come from a random sample or randomized experiment. If sampling without replacement, check that the sample size is less than 10% of the population size.
      2. Linearity: Check that the scatterplot does not show a curved trend and that the residual plot shows no ∪-shape pattern.
      3. Constant variability: Use the residual plot to check that the standard deviation of the residuals is constant across all \(x\)-values.
      4. Normality: The population of residuals is nearly normal or the sample size is \(\ge 30\). If the sample size is less than 30, check for strong skew or outliers in the sample residuals.
  • The confidence interval and test statistic are calculated as follows:
$$ \text{CI: } b \pm t^{\star} \times SE_b $$ $$ \text{Test: } T = \frac{b - 0}{SE_b}, \quad df = n - 2 $$
  • The confidence interval for the slope of the population regression line estimates the true average increase in the \(y\)-variable for each unit increase in the \(x\)-variable.
  • The \(t\)-test for the slope and the one-sample \(t\)-test for a mean of differences both involve paired, numerical data. However, the \(t\)-test for the slope asks if the two variables have a linear relationship — specifically, if the slope of the population regression line is different from \(0\). The one-sample \(t\)-test for a mean of differences asks if the two variables are, on average, different — specifically, if the mean of the population differences is not equal to \(0\).
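The confidence interval and test statistic formulas above can be sketched in a few lines of code. This is a minimal illustration, not a full regression routine: the function name `slope_inference` and the input values are hypothetical, and the critical value \(t^{\star}\) is assumed to be looked up in a \(t\)-table and passed in by hand.

```python
def slope_inference(b, se_b, n, t_star):
    """Return (T, df, (lower, upper)) for the slope of a regression line.

    b      : sample slope (point estimate of beta)
    se_b   : standard error of the slope
    n      : number of (x, y) pairs
    t_star : critical value from a t-table at df = n - 2
    """
    df = n - 2
    t_stat = (b - 0) / se_b          # test statistic under H0: beta = 0
    margin = t_star * se_b           # margin of error for the interval
    return t_stat, df, (b - margin, b + margin)


# Hypothetical inputs: b = 1.5, SE_b = 0.5, n = 12.
# t* = 2.228 is the 95% critical value at df = 10.
t_stat, df, ci = slope_inference(1.5, 0.5, 12, 2.228)
print(f"T = {t_stat:.2f}, df = {df}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the interval here excludes \(0\), a two-sided test at the matching significance level would reject \(H_0: \beta = 0\).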

Problem Set — Section 8.4

Source: Main Text

Problem 1: Body measurements, Part IV. The scatterplot and least squares summary below show the relationship between weight measured in kilograms and height measured in centimeters of 507 physically active individuals.

               Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)   -105.0113        7.5394    -13.93     0.0000
height           1.0176        0.0440     23.13     0.0000

(a) Describe the relationship between height and weight.

(b) Write the equation of the regression line. Interpret the slope and intercept in context.

(c) Do the data provide strong evidence that an increase in height is associated with an increase in weight? State the null and alternative hypotheses, report the p-value, and state your conclusion.

(d) The correlation coefficient for height and weight is \(0.72\). Calculate \(R^2\) and interpret it in context.

Problem 8.33 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a positive, roughly linear association between height and weight for physically active individuals: taller people tend to weigh more. The spread is moderate and reasonably constant across the range of heights, with no obvious outliers.

Step 2 — Write the regression equation (b): Using the Estimate column: $$\widehat{\text{weight}} = -105.0113 + 1.0176 \times \text{height}.$$ Slope interpretation: For each additional 1 cm in height, we predict an increase of about \(1.018\) kg in weight, on average. Intercept interpretation: The intercept, about \(-105\) kg, is the predicted weight at a height of \(0\) cm. It has no practical meaning here, since nobody is 0 cm tall and a negative weight is impossible; it only sets the height of the line.

Step 3 — Hypothesis test for the slope (c): Hypotheses: \(H_0: \beta = 0\) vs. \(H_A: \beta \neq 0\). The regression output reports \(T = 23.13\) and a p-value that rounds to \(0.0000\) for the height row. Since \(p < 0.05\), we reject \(H_0\). Because the estimated slope is positive, the data provide very strong evidence that increases in height are associated with increases in weight.

Step 4 — Compute \(R^2\) (d): $$R^2 = r^2 = (0.72)^2 = 0.5184.$$ About 52% of the variability in weight among these physically active individuals is explained by a linear model using height as the predictor.

Answer: (a) Positive, roughly linear. (b) \(\widehat{\text{weight}} = -105.01 + 1.018 \times \text{height}\). (c) Reject \(H_0\); strong evidence of a positive linear association (\(p \approx 0\)). (d) \(R^2 \approx 0.52\).

Problem 2: MCU, predict US theater sales. The Marvel Cinematic Universe movies were an international movie sensation, containing 23 movies at the time of this writing. Here we consider a model predicting an MCU film's gross theater sales in the US based on the first weekend sales performance in the US. The data are presented below in both a scatterplot and the model in a regression table. Scientific notation is used below; for example, 42.5e6 corresponds to \(42.5 \times 10^6\).

                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)             42.5e6       26.6e6      1.60     0.1251
opening_weekend_us      2.4361       0.1739     14.01     0.0000

(a) Describe the relationship between gross theater sales in the US and first weekend sales in the US.

(b) Write the equation of the regression line. Interpret the slope and intercept in context.

(c) Do the data provide strong evidence that higher opening weekend sales are associated with higher gross theater sales? State the null and alternative hypotheses, report the p-value, and state your conclusion.

(d) The correlation coefficient for gross sales and first weekend sales is \(0.950\). Calculate \(R^2\) and interpret it in context.

(e) Suppose we consider a set of all films ever released. Do you think the relationship between opening weekend sales and total sales would be as strong as what we see with the MCU films?

Problem 8.34 Solution

Step 1 — Describe the relationship (a): Gross US theater sales increase strongly and approximately linearly with opening-weekend US sales: higher opening weekends correspond to higher totals, with little deviation from a straight line.

Step 2 — Regression equation (b): $$\widehat{\text{gross}} = 42.5 \times 10^{6} + 2.4361 \times \text{opening weekend sales}.$$ Slope: Each additional \$1 of opening-weekend sales is associated with an additional \$2.44 in total US gross, on average. Intercept: The predicted gross for a hypothetical film with \$0 in opening-weekend sales is about \$42.5 million. No film in the data is near \$0, so this is extrapolation outside the observed range and should be interpreted with caution.

Step 3 — Hypothesis test (c): \(H_0: \beta = 0\) vs. \(H_A: \beta \neq 0\). The table shows \(T = 14.01\) and p-value \(0.0000\). Since \(p < 0.05\), we reject \(H_0\). Higher opening weekend sales are strongly associated with higher gross sales.

Step 4 — Compute \(R^2\) (d): $$R^2 = (0.950)^2 = 0.9025.$$ About 90% of the variability in MCU gross sales is explained by opening-weekend sales.

Step 5 — Generalize (e): Probably not. MCU films are a highly curated, heavily marketed franchise with similar audiences. In a set of all films, budgets, genres, and distribution vary far more, so opening weekends would still correlate with total sales but almost certainly with a much lower \(R^2\) and more scatter.

Answer: (a) Strong positive linear. (b) \(\hat{y} = 42.5\text{M} + 2.44 x\). (c) Reject \(H_0\); very strong positive association. (d) \(R^2 \approx 0.90\). (e) Expect a weaker relationship for all films.

Problem 3: Spouses, Part II. The scatterplot below summarizes women's heights and their spouses' heights for a random sample of 170 married women in Britain, where both partners' ages are below 65 years. Summary output of the least squares fit for predicting spouse's height from the woman's height is also provided in the table.

                 Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)       43.5755       4.6842      9.30     0.0000
height_spouse      0.2863       0.0686      4.17     0.0000

(a) Is there strong evidence in this sample that taller women have taller spouses? State the hypotheses and include any information used to conduct the test.

(b) Write the equation of the regression line for predicting the height of a woman's spouse based on the woman's height.

(c) Interpret the slope and intercept in the context of the application.

(d) Given that \(R^2 = 0.09\), what is the correlation of heights in this data set?

(e) You meet a married woman from Britain who is 5'9" (69 inches). What would you predict her spouse's height to be? How reliable is this prediction?

(f) You meet another married woman from Britain who is 6'7" (79 inches). Would it be wise to use the same linear model to predict her spouse's height? Why or why not?

Problem 8.35 Solution

Step 1 — State and evaluate the hypotheses (a): \(H_0: \beta = 0\) vs. \(H_A: \beta \neq 0\). Using the regression table: slope \(b = 0.2863\), \(SE = 0.0686\), \(T = 4.17\), p-value \(0.0000\). With \(n = 170\), \(df = 168\). Since \(p < 0.05\), we reject \(H_0\) and conclude there is strong evidence that taller women have taller spouses.

Step 2 — Regression equation (b): $$\widehat{\text{spouse\_height}} = 43.5755 + 0.2863 \times \text{woman\_height}.$$

Step 3 — Slope and intercept in context (c): Slope: For each additional inch in a woman's height, the spouse's predicted height increases by about \(0.29\) inches, on average. Intercept: A woman of height 0 inches is predicted to have a spouse of height \(43.6\) inches — not meaningful; it is only a mathematical baseline.

Step 4 — Correlation from \(R^2\) (d): $$r = \pm\sqrt{0.09} = \pm 0.30.$$ The slope is positive, so \(r = +0.30\).

Step 5 — Prediction for 69 inches (e): $$\widehat{\text{spouse}} = 43.5755 + 0.2863 \times 69 = 43.5755 + 19.7547 \approx 63.33 \text{ in.}$$ Because \(R^2 = 0.09\) is small, individual predictions are unreliable: the woman's height explains only about \(9\%\) of the variation in spouses' heights.

Step 6 — Prediction for 79 inches (f): No — \(79\) inches (6'7") is far above the sampled range. Using the line there is extrapolation, and the linear relationship may not hold at the extremes.

Answer: (a) Reject \(H_0\); strong evidence. (b) \(\hat{y} = 43.58 + 0.286 x\). (c) See step 3. (d) \(r \approx 0.30\). (e) \(\approx 63.3\) in; weak prediction. (f) No — extrapolation.

Problem 4: Urban homeowners, Part II. Exercise 8.29 gives a scatterplot displaying the relationship between the percent of families that own their home and the percent of the population living in urban areas. Below is a similar scatterplot, excluding the District of Columbia, as well as the residual plot. There were 51 cases.

(a) For these data, \(R^2 = 0.28\). What is the correlation? How can you tell if it is positive or negative?

(b) Examine the residual plot. What do you observe? Is a simple least squares fit appropriate for these data?

Problem 8.36 Solution

Step 1 — Correlation from \(R^2\) (a): $$r = \pm \sqrt{0.28} = \pm 0.529.$$ The sign matches the slope of the regression line. Because home-ownership tends to decrease as urbanization rises in the displayed scatterplot (negative slope), \(r \approx -0.53\).

Step 2 — Evaluate the residual plot (b): If the residual plot shows a random cloud of points centered on zero with roughly constant spread and no curved pattern, the linearity and constant-variability conditions are reasonable and a simple least-squares fit is appropriate. If the plot shows a U-shape, fanning, or obvious outliers, those conditions fail and a simple linear model should not be used as-is.

Answer: (a) \(r \approx -0.53\) (negative because higher urbanization is associated with lower homeownership). (b) Appropriateness depends on the residual plot; a random scatter with constant spread supports using a simple linear fit.

Problem 5: Murders and poverty, Part II. Exercise 8.25 presents regression output from a model for predicting annual murders per million from percentage living in poverty based on a random sample of 20 metropolitan areas. The model output is also provided below.

               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     -29.901        7.789    -3.839      0.001
poverty%          2.559        0.390     6.562      0.000
$$ s = 5{.}512 \quad R^2 = 70{.}52\% \quad R_{adj}^2 = 68{.}89\% $$

(a) What are the hypotheses for evaluating whether poverty percentage is a significant predictor of murder rate?

(b) State the conclusion of the hypothesis test from part (a) in context of the data.

(c) Calculate a 95% confidence interval for the slope of poverty percentage, and interpret it in context of the data.

(d) Do your results from the hypothesis test and the confidence interval agree? Explain.

Problem 8.37 Solution

Step 1 — State hypotheses (a): $$H_0: \beta = 0 \qquad H_A: \beta \neq 0,$$ where \(\beta\) is the slope of the population regression line for predicting murders per million from poverty percentage.

Step 2 — Conclusion of the test (b): From the table, \(b = 2.559\), \(SE = 0.390\), \(T = 6.562\), p-value \(= 0.000\). Since \(p < 0.05\), we reject \(H_0\). There is strong evidence that poverty percentage is a significant predictor of murder rate in these metropolitan areas.

Step 3 — 95% CI for the slope (c): \(df = n - 2 = 20 - 2 = 18\); \(t^{\star} \approx 2.101\) from the \(t\)-table at \(df = 18\), 95% confidence. $$2.559 \pm 2.101 \times 0.390 = 2.559 \pm 0.819 = (1.740,\ 3.378).$$ We are 95% confident that each additional percentage point of poverty is associated with between \(1.74\) and \(3.38\) additional murders per million, on average.

Step 4 — Agreement check (d): The interval \((1.74, 3.38)\) does not contain \(0\), which matches the rejection of \(H_0\) in part (b). Both procedures point to the same conclusion: poverty is significantly associated with murder rate.

Answer: (a) \(H_0: \beta=0\) vs \(H_A: \beta \neq 0\). (b) Reject \(H_0\); strong evidence. (c) \((1.74, 3.38)\) murders/million per 1% increase in poverty. (d) Yes — interval excludes 0 and test rejects \(H_0\).

Problem 6: Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks). The estimated regression line is

$$ \widehat{\text{head circumference}} = 3{.}91 + 0{.}78 \times \text{gestational age} $$

The standard error for the coefficient of gestational age is \(0.35\). Is there significant evidence that gestational age has a positive linear association with head circumference? Use the Identify, Choose, Check, Calculate, Conclude framework and make sure to identify any assumptions used in the test.

Problem 8.38 Solution

Step 1 — Identify: Parameter: \(\beta\), the slope of the population regression line relating head circumference (cm) to gestational age (weeks) for low birth-weight babies. Hypotheses: \(H_0: \beta = 0\) vs. \(H_A: \beta > 0\) (positive association). Use \(\alpha = 0.05\).

Step 2 — Choose: Because the parameter is the slope of a regression line, use the \(t\)-test for the slope.

Step 3 — Check conditions: The problem does not say how the 25 babies were selected, so we assume they can be treated as a random sample of low birth-weight babies. Linearity, constant variability, and nearly normal residuals should be assessed from a residual plot; with \(n = 25 < 30\), the normality condition matters and we'd want to verify there is no strong skew in the residuals. No plots are provided, so we proceed under the assumption that these conditions are reasonably met, and we state that assumption explicitly, as the problem asks.

Step 4 — Calculate: $$T = \frac{b - 0}{SE_b} = \frac{0.78 - 0}{0.35} = 2.229.$$ Degrees of freedom: \(df = 25 - 2 = 23\). For a one-sided upper-tail test, the p-value is the area to the right of \(2.229\) under the \(t_{23}\) distribution, which is \(p \approx 0.018\) (between \(0.01\) and \(0.025\) on a standard \(t\)-table).

Step 5 — Conclude: \(p \approx 0.018 < 0.05\), so we reject \(H_0\). The data provide convincing evidence that gestational age is positively associated with head circumference at birth for low birth-weight babies.

Answer: Reject \(H_0\); \(T \approx 2.23\), \(df = 23\), \(p \approx 0.018\). Gestational age has a statistically significant positive linear association with head circumference.
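A \(t\)-table only brackets the p-value between two tail areas. As a rough check, the sketch below computes the upper-tail area numerically using only the Python standard library. The helper names `t_pdf` and `t_upper_tail` are illustrative, and integrating the \(t\) density with Simpson's rule is one simple approach chosen for this sketch, not the method used in the text.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: one half minus the area from 0 to t.

    The area from 0 to t is approximated with Simpson's rule
    (steps must be even).
    """
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Values from the problem: slope 0.78, SE 0.35, n = 25 babies.
b, se_b, n = 0.78, 0.35, 25
t_stat = b / se_b
p_value = t_upper_tail(t_stat, n - 2)
print(f"T = {t_stat:.3f}, df = {n - 2}, one-sided p-value = {p_value:.4f}")
```

The result should land between \(0.01\) and \(0.025\), in line with the table lookup in Step 4.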

Chapter Highlights

This chapter focused on describing the linear association between two numerical variables and fitting a linear model.

  • The correlation coefficient, \(r\), measures the strength and direction of the linear association between two variables. However, \(r\) alone cannot tell us whether data follow a linear trend or whether a linear model is appropriate.
  • The explained variance, \(R^2\), measures the proportion of variation in the \(y\) values explained by a given model. Like \(r\), \(R^2\) alone cannot tell us whether data follow a linear trend or whether a linear model is appropriate.
  • Every analysis should begin with graphing the data using a scatterplot in order to see the association and any deviations from the trend (outliers or influential values). A residual plot helps us better see patterns in the data.
  • When the data show a linear trend, we fit a least squares regression line of the form \(\hat{y} = a + b x\), where \(a\) is the \(y\)-intercept and \(b\) is the slope. It is important to be able to calculate \(a\) and \(b\) using the summary statistics and to interpret them in the context of the data.
  • A residual, \(y - \hat{y}\), measures the error for an individual point. The standard deviation of the residuals, \(s\), measures the typical size of the residuals.
  • \(\hat{y} = a + b x\) provides the best-fit line for the observed data. To estimate or hypothesize about the slope of the population regression line, first confirm that the residual plot has no pattern and that a linear model is reasonable, then use a \(t\)-interval for the slope or a \(t\)-test for the slope with \(n - 2\) degrees of freedom.

In this chapter we focused on simple linear models with one explanatory variable. More complex methods of prediction, such as multiple regression (more than one explanatory variable) and nonlinear regression, can be studied in a future course.
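The quantities in these highlights can all be computed from raw data in one pass. The sketch below is a minimal illustration on a hypothetical five-point dataset; the function name `least_squares` is an assumption of this example, and the standard deviation of the residuals is computed with \(n - 2\) in the denominator, which is assumed here to match the chapter's definition of \(s\).

```python
import math


def least_squares(x, y):
    """Fit y-hat = a + b*x and return the chapter's summary quantities."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx                     # slope
    a = y_bar - b * x_bar             # intercept: line passes through (x_bar, y_bar)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
    return {"a": a, "b": b, "r": r, "r_squared": r ** 2,
            "s": s, "residuals": residuals}


# Hypothetical data (illustration only).
x = [0, 1, 2, 3, 4]
y = [1, 3, 2, 5, 4]
fit = least_squares(x, y)
print(f"y-hat = {fit['a']:.2f} + {fit['b']:.2f} x")
print(f"r = {fit['r']:.2f}, R^2 = {fit['r_squared']:.2f}, s = {fit['s']:.3f}")
```

The residuals of a least squares line always sum to zero (up to rounding), which makes a handy sanity check on any hand calculation.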

Problem Set — Chapter 8 Review

Source: Main Text

Problem 7: True / False. Determine if the following statements are true or false. If false, explain why.

(a) A correlation coefficient of \(-0.90\) indicates a stronger linear relationship than a correlation of \(0.5\).

(b) Correlation is a measure of the association between any two variables.

Problem 8.39 Solution

Step 1 — Evaluate (a): TRUE. The strength of the linear relationship is measured by \(|r|\), not by its sign. Since \(|-0.90| = 0.90 > 0.5\), a correlation of \(-0.90\) represents a stronger linear association than a correlation of \(+0.5\). The negative sign just tells us the direction.

Step 2 — Evaluate (b): FALSE. Correlation (the Pearson correlation coefficient \(r\)) measures the strength of a linear relationship between two numerical variables. It is not defined for categorical variables, and it can mislead for numerical variables whose relationship is strongly nonlinear.

Answer: (a) TRUE. (b) FALSE: correlation measures only the strength and direction of the linear association between two numerical variables.

Problem 8: Cats, Part II. Exercise 8.26 presents regression output from a model for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats. The model output is also provided below. Assume that conditions for inference on the slope are met.

               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      -0.357        0.692    -0.515      0.607
body wt           4.034        0.250    16.119      0.000
$$ s = 1{.}452 \quad R^2 = 64{.}66\% \quad R_{adj}^2 = 64{.}41\% $$

(a) What are the hypotheses for evaluating whether body weight is associated with heart weight in cats?

(b) State the conclusion of the hypothesis test from part (a) in context of the data.

(c) Calculate a 95% confidence interval for the slope of body weight, and interpret it in context of the data.

(d) Do your results from the hypothesis test and the confidence interval agree? Explain.

Problem 8.40 Solution

Step 1 — Hypotheses (a): $$H_0: \beta = 0 \qquad H_A: \beta \neq 0,$$ where \(\beta\) is the slope of the population regression line predicting heart weight (g) from body weight (kg) in cats.

Step 2 — Conclusion of the test (b): From the table, \(b = 4.034\), \(SE = 0.250\), \(T = 16.119\), p-value \(= 0.000\). We reject \(H_0\). There is overwhelming evidence that body weight is associated with heart weight in cats: heavier cats tend to have heavier hearts.

Step 3 — 95% CI for the slope (c): \(df = n - 2 = 144 - 2 = 142\). If the \(t\)-table has no row for \(df = 142\), use the next smaller listed value, \(df = 100\), which gives a slightly wider (conservative) interval: \(t^{\star} \approx 1.984\). $$4.034 \pm 1.984 \times 0.250 = 4.034 \pm 0.496 = (3.538,\ 4.530).$$ We are 95% confident that each additional 1 kg of body weight is associated with between \(3.54\) and \(4.53\) additional grams of heart weight, on average.

Step 4 — Agreement (d): The interval does not contain \(0\), consistent with rejecting \(H_0\). Both approaches agree: body weight is a significant predictor of heart weight.

Answer: (a) \(H_0: \beta = 0\) vs \(H_A: \beta \neq 0\). (b) Reject \(H_0\); very strong association. (c) \((3.54, 4.53)\) g per kg. (d) Yes — agree.
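Rounding the degrees of freedom down to a table row costs very little here. The sketch below estimates \(t^{\star}\) at \(df = 142\) directly, using only the Python standard library, and recomputes the interval. The helper names (`t_pdf`, `t_cdf`, `t_star`) are illustrative, and the numerical approach (Simpson's rule plus bisection) is an assumption of this sketch rather than anything the text prescribes.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (steps even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_star(conf_level, df):
    """Critical value with central area conf_level, found by bisection."""
    target = 1 - (1 - conf_level) / 2
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Values from the cats regression table: slope 4.034, SE 0.250, n = 144.
b, se_b, n = 4.034, 0.250, 144
ts = t_star(0.95, n - 2)
ci = (b - ts * se_b, b + ts * se_b)
print(f"t* at df = {n - 2}: {ts:.4f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The critical value at \(df = 142\) is only slightly smaller than the table value \(1.984\), so the interval narrows by a few thousandths and the conclusion is unchanged.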

Problem 9: Nutrition at Starbucks, Part II. Exercise 8.22 introduced a data set on nutrition information on Starbucks food menu items. Based on the scatterplot and the residual plot provided, describe the relationship between the protein content and calories of these menu items, and determine if a simple linear model is appropriate to predict amount of protein from the number of calories.

Problem 8.41 Solution

Step 1 — Describe the relationship: From the scatterplot, protein content generally increases with calories, but the trend is moderate rather than strong and there is substantial spread around any line. The association looks positive but noisy.

Step 2 — Evaluate appropriateness from the residual plot: If the residual plot shows a fairly random cloud centered on zero with roughly constant spread, a simple linear model is reasonable for predicting protein from calories. If the residual plot shows fanning (increasing spread with calories) or curvature, then a simple linear model is not appropriate — a transformation or nonlinear method would be preferred.

Step 3 — Practical judgment: For Starbucks food items, menu items bundle protein with fat and carbs, so calories alone explain only part of the protein content. Use the linear model only for rough predictions and be cautious at the extremes of the calorie range.

Answer: The relationship appears positive but moderate. A simple linear model is appropriate only if the residual plot shows no clear pattern and constant spread; otherwise it should not be used.

Problem 10: Helmets and lunches. The scatterplot shows the relationship between socioeconomic status measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (lunch) and the percentage of bike riders in the neighborhood wearing helmets (helmet). The average percentage of children receiving reduced-fee lunches is \(30.8\%\) with a standard deviation of \(26.7\%\), and the average percentage of bike riders wearing helmets is \(38.8\%\) with a standard deviation of \(16.9\%\).

(a) If the \(R^2\) for the least-squares regression line for these data is \(72\%\), what is the correlation between lunch and helmet?

(b) Calculate the slope and intercept for the least-squares regression line for these data.

(c) Interpret the intercept of the least-squares regression line in the context of the application.

(d) Interpret the slope of the least-squares regression line in the context of the application.

(e) What would the value of the residual be for a neighborhood where \(40\%\) of the children receive reduced-fee lunches and \(40\%\) of the bike riders wear helmets? Interpret the meaning of this residual in the context of the application.

Problem 8.42 Solution

Step 1 — Correlation from \(R^2\) (a): $$r = \pm\sqrt{0.72} = \pm 0.849.$$ Higher reduced-lunch percentages are expected to correspond to lower helmet-wearing percentages (a neighborhood-SES pattern), so the slope is negative and \(r \approx -0.849\).

Step 2 — Slope of the least-squares line (b): $$b = r \cdot \frac{s_y}{s_x} = -0.849 \cdot \frac{16.9}{26.7} \approx -0.537.$$

Step 3 — Intercept: The regression line passes through \((\bar{x}, \bar{y}) = (30.8, 38.8)\): $$a = \bar{y} - b \bar{x} = 38.8 - (-0.537)(30.8) = 38.8 + 16.54 \approx 55.3.$$ So $$\widehat{\text{helmet}} = 55.3 - 0.537 \times \text{lunch}.$$

Step 4 — Interpret the intercept (c): At a hypothetical neighborhood where \(0\%\) of children receive reduced-fee lunches, the predicted helmet-wearing rate is about \(55.3\%\). Interpret cautiously — this may extrapolate beyond the data.

Step 5 — Interpret the slope (d): For each additional 1 percentage point of children receiving reduced-fee lunches, the predicted percentage of bike riders wearing helmets decreases by about \(0.54\) percentage points.

Step 6 — Residual at (40%, 40%) (e): $$\widehat{\text{helmet}} = 55.3 - 0.537 \times 40 = 55.3 - 21.48 = 33.82.$$ Residual \(= y - \hat{y} = 40 - 33.82 = 6.18\). The actual helmet rate is about \(6.2\) percentage points higher than the model predicts — the neighborhood does better than expected.

Answer: (a) \(r \approx -0.85\). (b) \(b \approx -0.54\), \(a \approx 55.3\). (c)–(d) See steps 4–5. (e) Residual \(\approx +6.2\) percentage points.
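The same calculation can be carried through without rounding intermediate values. This sketch uses the summary statistics given in the problem; the negative sign on \(r\) is the assumption made in Step 1 of the solution (the scatterplot is not reproduced here), and the variable names are illustrative.

```python
import math

# Summary statistics from the problem statement.
r_squared = 0.72
x_bar, s_x = 30.8, 26.7    # lunch: % of children receiving reduced-fee lunches
y_bar, s_y = 38.8, 16.9    # helmet: % of bike riders wearing helmets

# Assumption from Step 1: the association is negative, so take the negative root.
r = -math.sqrt(r_squared)

b = r * s_y / s_x          # slope from summary statistics
a = y_bar - b * x_bar      # line passes through (x_bar, y_bar)

# Residual for a neighborhood with lunch = 40 and helmet = 40.
y_hat = a + b * 40
residual = 40 - y_hat

print(f"r = {r:.3f}, slope = {b:.3f}, intercept = {a:.2f}")
print(f"predicted helmet % = {y_hat:.2f}, residual = {residual:.2f}")
```

Carrying full precision gives a residual of roughly \(6.1\) rather than \(6.18\); the small gap comes from rounding \(a\) and \(b\) before predicting in the hand calculation.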

Problem 11: Match the correlation, Part III. Match each correlation to the corresponding scatterplot.

(a) \(r = -0.72\)

(b) \(r = 0.07\)

(c) \(r = 0.86\)

(d) \(r = 0.99\)

[Figure: four scatterplots, labeled (1) through (4)]

Problem 8.43 Solution

Step 1 — Rank the correlations by strength: Order by \(|r|\): \(|0.99| > |0.86| > |{-0.72}| > |0.07|\).

Step 2 — Match strongest positive (\(r = 0.99\)): The plot that looks like an almost perfect straight line with positive slope and very little scatter — very tight cluster around a rising line.

Step 3 — Match strong positive (\(r = 0.86\)): Still clearly positive and linear but with visibly more scatter than the \(0.99\) plot.

Step 4 — Match strong negative (\(r = -0.72\)): A plot with a clearly decreasing linear trend and moderate scatter.

Step 5 — Match near-zero (\(r = 0.07\)): A plot with essentially no visible trend — a near-circular cloud of points.

Answer: Strongest-to-weakest by |r|: (d) \(0.99\), (c) \(0.86\), (a) \(-0.72\), (b) \(0.07\). The unique pairing is the tightest positive line \(\to 0.99\), the noisier positive line \(\to 0.86\), the decreasing line \(\to -0.72\), and the patternless cloud \(\to 0.07\).

Problem 12: Rate my professor. Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching-related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and regression output is provided for predicting teaching evaluation score from beauty score.

               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)       4.010       0.0255    157.21     0.0000
beauty             ____       0.0322      4.13     0.0000

(a) Given that the average standardized beauty score is \(-0.0883\) and average teaching evaluation score is \(3.9983\), calculate the slope. Alternatively, the slope may be computed using just the information provided in the model summary table.

(b) Do these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive? Explain your reasoning.

(c) List the conditions required for linear regression and check if each one is satisfied for this model based on the following diagnostic plots.

Problem 8.44 Solution

Step 1 — Calculate the slope (a): The regression line must pass through \((\bar{x}, \bar{y}) = (-0.0883,\ 3.9983)\), so $$\bar{y} = a + b \bar{x} \;\Longrightarrow\; 3.9983 = 4.010 + b(-0.0883).$$ Solve: $$b = \frac{3.9983 - 4.010}{-0.0883} = \frac{-0.0117}{-0.0883} \approx 0.133.$$ (Equivalently, the t value of \(4.13\) with \(SE = 0.0322\) gives \(b = 4.13 \times 0.0322 \approx 0.133\), confirming the estimate.)

Step 2 — Positive slope test (b): Hypotheses: \(H_0: \beta = 0\) vs. \(H_A: \beta > 0\). The reported two-sided p-value is \(0.0000\) and the observed slope is positive, so the one-sided p-value is essentially \(0\). We reject \(H_0\): yes, there is convincing evidence that higher beauty scores are associated with higher teaching evaluation scores.

Step 3 — Check conditions (c): Using the diagnostic plots:
  • Linearity: Residual plot should show no ∪-shape or curved pattern.
  • Constant variability: Residuals should have roughly the same spread across \(x\).
  • Nearly normal residuals: \(n = 463 \ge 30\), so this condition is automatically satisfied by the large sample size.
  • Independence: Assume data come from a reasonably representative sample of professors.

If the residual/Q-Q plots look like random scatter with constant spread and roughly normal residuals, all four conditions are met. The large \(n\) in particular makes the inference robust.

Answer: (a) \(b \approx 0.133\). (b) Reject \(H_0\); evidence of a positive linear relationship. (c) All four conditions are reasonably satisfied, assuming diagnostic plots show no strong patterns.
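The two routes to the missing slope in part (a) can be checked against each other in a few lines. This is a small sketch using only numbers given in the problem; the variable names are illustrative.

```python
# Values from the problem statement and regression table.
a = 4.010                        # intercept estimate
x_bar, y_bar = -0.0883, 3.9983   # mean beauty score, mean evaluation score
se_b, t_value = 0.0322, 4.13     # beauty row: standard error and t value

# Route 1: the least squares line passes through (x_bar, y_bar),
# so y_bar = a + b * x_bar can be solved for b.
b_from_means = (y_bar - a) / x_bar

# Route 2: the t value is b / SE_b, so b = t * SE_b.
b_from_table = t_value * se_b

print(f"slope from the point of means: {b_from_means:.4f}")
print(f"slope from t value times SE:   {b_from_table:.4f}")
```

The two answers differ slightly in the fourth decimal place because the table entries are rounded; both round to \(0.133\).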

Problem 13: Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the ground.

(a) Describe the relationship between volume and height of these trees.

(b) Describe the relationship between volume and diameter of these trees.

(c) Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.

Problem 8.45 Solution

Step 1 — Volume vs. height (a): The scatterplot shows a positive association between volume and height: taller trees tend to contain more timber. The relationship looks roughly linear but with substantial scatter — some tall trees have moderate volume and vice versa.

Step 2 — Volume vs. diameter (b): The scatterplot shows a strong, positive, and clearly curved association: volume increases sharply with diameter. Because volume scales roughly with \(\text{diameter}^2\) (think of a cylinder: \(V \propto d^2 h\)), the relationship is not linear — it bends upward.

Step 3 — Which predictor to use (c): Diameter is the better predictor of timber volume because the diameter-volume relationship is much stronger than the height-volume relationship (higher \(R^2\)). However, the curvature in the diameter-volume plot means a simple linear regression may not be ideal; a transformation (e.g., predicting volume from \(\text{diameter}^2\) or using \(\log(\text{volume})\) vs. \(\log(\text{diameter})\)) would usually fit much better. If we insist on a single linear predictor, diameter still wins, but the residual plot should be checked to justify the linearity assumption.

Answer: (a) Moderate positive, roughly linear association. (b) Strong positive but clearly curved association. (c) Diameter — stronger relationship — though a transformation would improve the linearity.