Chapter 8

\(x\)	\(\hat{y} = 41 + 0.59 x\)
75	\(41 + 0.59(75) = 85.25\)
80	\(41 + 0.59(80) = 88.20\)
85	\(41 + 0.59(85) = 91.15\)
90	\(41 + 0.59(90) = 94.10\)
95	\(41 + 0.59(95) = 97.05\)

\(x\)	\(y\)	\(\hat{y}\)	residual
75	82	85.25	\(-3.25\)
80	91	88.20	\(+2.80\)
85	92	91.15	\(+0.85\)
90	94	94.10	\(-0.10\)
95	100	97.05	\(+2.95\)

8.2 Fitting a Line by Least Squares Regression

In this section we answer questions that matter for anyone who uses data to make predictions:

How well can we predict financial aid based on family income for a particular college?
How do we find, interpret, and apply the least squares regression line?
How do we measure the fit of a model and compare different models to each other?
Why do models sometimes make predictions that are ridiculous — or even impossible?

Context Pause

Fitting a line by eye (as we did in Section 8.1) is a great first instinct, but it is not reproducible: two people will draw two different lines through the same scatterplot, and neither can defend their choice as best. The least squares regression line solves that problem by giving us an objective, formula-driven answer: out of every possible line, it picks the one that makes the sum of the squared prediction errors as small as possible. It is the default criterion in statistical software packages and behind a graphing calculator's LinReg button.

Insight Note

Squaring the residuals, instead of just adding up their absolute values, sounds arbitrary, but it has teeth. Under the squared-error criterion, a residual of 4 is not twice as bad as a residual of 2; it is four times as bad ($4^2 = 16$ vs. $2^2 = 4$). That heavy penalty on large misses is why the least squares line works hard to avoid any single large residual. The flip side is that a few wild outliers can pull the line toward themselves, away from the bulk of the data.

Learning Objectives

Source: Main Text

By the end of this section, you should be able to:

Calculate the slope and $y$-intercept of the least squares regression line using the relevant summary statistics. Interpret these quantities in context.
Understand why the least squares regression line is called the least squares regression line.
Interpret the explained variance $R^2$.
Understand the concept of extrapolation and why it is dangerous.
Identify outliers and influential points in a scatterplot.

8.2.1 An Objective Measure for Finding the Best Line

Fitting linear models by eye is open to criticism since it depends on an individual's preference. In this section, we use least squares regression as a more rigorous approach.

This section considers family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois. Gift aid is financial aid that does not need to be paid back, as opposed to a loan. A scatterplot of the data is shown in Figure 8.12 along with two linear fits. The lines follow a negative trend in the data — students who have higher family incomes tended to have lower gift aid from the university.

Figure 8.12: Gift aid and family income for a random sample of 50 freshman students from Elmhurst College. Two lines are fit to the data, the solid line being the least squares line.

We begin by thinking about what we mean by best. Mathematically, we want a line that has small residuals. One criterion could be to minimize the sum of the residual magnitudes:

$$ \left| y_{1} - \hat{y}_{1} \right| + \left| y_{2} - \hat{y}_{2} \right| + \dots + \left| y_{n} - \hat{y}_{n} \right| $$

which we could accomplish with a computer program. The resulting dashed line in Figure 8.12 shows this fit can be quite reasonable. However, a more common practice is to choose the line that minimizes the sum of the squared residuals:

$$ \left(y_{1} - \hat{y}_{1}\right)^{2} + \left(y_{2} - \hat{y}_{2}\right)^{2} + \dots + \left(y_{n} - \hat{y}_{n}\right)^{2} $$

The line that minimizes the sum of the squared residuals is represented as the solid line in Figure 8.12. This is commonly called the least squares line.

Both lines seem reasonable, so why do data scientists prefer the least squares regression line? One reason is that it is easier to compute by hand and in most statistical software. Another, more compelling, reason is that in many applications a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the residuals accounts for this discrepancy.

In Figure 8.13, we imagine the squared error about a line as actual squares. The least squares regression line minimizes the sum of the areas of these squared errors. In the figure, the sum of the squared error is $4 + 1 + 1 = 6$. There is no other line about which the sum of the squared error will be smaller.

Figure 8.13: A visualization of least squares regression using Desmos. Try out this and other interactive Desmos activities at openintro.org/ahss/desmos.
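The claim that no other line has a smaller sum of squared errors can be checked numerically. The sketch below uses three hypothetical points (chosen for illustration, not read off Figure 8.13) whose least squares line happens to give the same total of $4 + 1 + 1 = 6$. The slope is computed from the standard closed-form sums, which is algebraically equivalent to the summary-statistic formula introduced later in this section.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical gaps between the points and the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical points, not the ones plotted in Figure 8.13.
xs = [0, 1, 2]
ys = [0, 4, 2]

# Least squares slope and intercept from the closed-form sums.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

best = sum_squared_residuals(xs, ys, a, b)
print(a, b, best)  # 1.0 1.0 6.0

# Nudge the intercept and slope: every nearby line does worse.
nudges = [(da, db) for da in (-0.5, 0, 0.5) for db in (-0.5, 0, 0.5)
          if (da, db) != (0, 0)]
worse = [sum_squared_residuals(xs, ys, a + da, b + db) for da, db in nudges]
print(min(worse) > best)  # True
```

Trying a handful of nudged lines is only a spot check, not a proof, but it shows the pattern: any change to the intercept or slope increases the total area of the squares.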

Context Pause

The choice between "sum of absolute residuals" and "sum of squared residuals" is a choice about what kind of mistakes hurt most. Squared error says large misses should be punished disproportionately — a single bad miss of 10 counts as much as 100 small misses of 1. If you were picking a line for a safety-critical setting (medical dosing, structural engineering), that punishment is exactly what you want: avoid the rare catastrophic miss even at the cost of more small misses.

Insight Note

"Least squares" is not a statement about what is true — it is a statement about what is computable and defensible. The line has no magical claim to reality; it is simply the line with the smallest sum of squared vertical gaps. The appeal is that two analysts working with the same data will compute exactly the same line, every time. That reproducibility is worth a lot more than "fitting by eye."

Try It Now 8.2.1

Source: Main Text

Suppose you have four data points and two candidate lines. The residuals from Line A are $+2, -3, +1, -2$. The residuals from Line B are $+5, 0, 0, -1$.

(a) Compute the sum of absolute residuals for each line. Which line is "best" by that criterion?

(b) Compute the sum of squared residuals for each line. Which line is "best" by least squares?

(c) Explain in one sentence why the two criteria might pick different winners.

Solution

(a) Sum of absolute residuals:

- Line A: $|{+2}| + |{-3}| + |{+1}| + |{-2}| = 2 + 3 + 1 + 2 = 8$.
- Line B: $|{+5}| + |0| + |0| + |{-1}| = 5 + 0 + 0 + 1 = 6$.

Line B wins on absolute residuals.

(b) Sum of squared residuals:

- Line A: $2^2 + (-3)^2 + 1^2 + (-2)^2 = 4 + 9 + 1 + 4 = 18$.
- Line B: $5^2 + 0^2 + 0^2 + (-1)^2 = 25 + 0 + 0 + 1 = 26$.

Line A wins on squared residuals.

(c) The squared-residual criterion penalizes big misses disproportionately. Line B has one miss of size 5 (squared = 25), which by itself outweighs Line A's entire sum of squares (18), even though Line A's total absolute error is larger (8 vs. 6).
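The two criteria in this exercise are easy to compare in a few lines of Python. This is a minimal sketch using the residuals given in the problem; the helper names `sum_abs` and `sum_sq` are made up for illustration.

```python
def sum_abs(residuals):
    """Sum of the residual magnitudes."""
    return sum(abs(e) for e in residuals)

def sum_sq(residuals):
    """Sum of the squared residuals."""
    return sum(e ** 2 for e in residuals)

line_a = [2, -3, 1, -2]
line_b = [5, 0, 0, -1]

print(sum_abs(line_a), sum_abs(line_b))  # 8 6   -> Line B is better
print(sum_sq(line_a), sum_sq(line_b))    # 18 26 -> Line A is better
```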

8.2.2 Finding the Least Squares Line

For the Elmhurst College data, we could fit a least squares regression line for predicting gift aid based on a student's family income and write the equation as:

$$ \widehat{aid} = a + b \times \text{family\_income} $$

Here $a$ is the $y$-intercept of the least squares regression line and $b$ is the slope of the least squares regression line. $a$ and $b$ are both statistics that can be calculated from the data. In the next section we will consider the corresponding population parameters that these statistics attempt to estimate.

We can enter all the data into a statistical software package and easily find the values of $a$ and $b$. However, we can also calculate these values by hand, using only the summary statistics:

The slope of the least squares line is given by

$$ b = r \, \frac{s_{y}}{s_{x}} $$

where $r$ is the correlation between the variables $x$ and $y$, and $s_{x}$ and $s_{y}$ are the sample standard deviations of $x$ (the explanatory variable) and $y$ (the response variable).

The point of averages $(\bar{x}, \bar{y})$ is always on the least squares line. Plugging this point in for $x$ and $y$ in the least squares equation and solving for $a$ gives:

$$ \bar{y} = a + b\,\bar{x} $$

$$ a = \bar{y} - b\,\bar{x} $$

Finding the Slope and Intercept of the Least Squares Regression Line

The least squares regression line for predicting $y$ based on $x$ can be written as $\hat{y} = a + bx$, with:
$$ b = r\,\frac{s_{y}}{s_{x}} \qquad a = \bar{y} - b\,\bar{x} $$
We first find $b$, the slope, and then we solve for $a$, the $y$-intercept.

Identifying the Least Squares Line from Summary Statistics

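To identify the least squares line from summary statistics (here $b_{0}$ and $b_{1}$ are a common alternative notation for the intercept $a$ and the slope $b$):

- Estimate the slope parameter: $b_{1} = (s_{y}/s_{x})\,r$, where $r$ is the correlation.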

- Noting that the point $(\bar{x}, \bar{y})$ is on the least squares line, use $x_{0} = \bar{x}$ and $y_{0} = \bar{y}$ with the point-slope equation: $y - \bar{y} = b_{1}(x - \bar{x})$.

- Simplify the equation, which reveals that $b_{0} = \bar{y} - b_{1}\,\bar{x}$.

Guided Practice 8.9

Source: Main Text

Figure 8.14 shows the sample means for the family income and gift aid as \$101,800 and \$19,940, respectively. Plot the point $(101.8, 19.94)$ on Figure 8.12 to verify it falls on the least squares line (the solid line).

Solution

The point of averages $(\bar{x}, \bar{y}) = (101.8, 19.94)$ must lie on the least squares line, because the line is constructed so that $a = \bar{y} - b\,\bar{x}$, i.e. $\bar{y} = a + b\,\bar{x}$. If you plot $(101.8, 19.94)$ on Figure 8.12 and visually check, the point lies exactly on the solid line (the least squares line). This is a universal property of least squares regression, not a coincidence.
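The same property can be verified on raw data. The sketch below uses a small hypothetical data set (not the Elmhurst sample), computes $r$, $s_x$, and $s_y$ with the standard library, builds the line from $b = r\,s_y/s_x$ and $a = \bar{y} - b\,\bar{x}$, and confirms that the point of averages lands on it. The function name `least_squares_line` is an assumption made for this example.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r*s_y/s_x and a = y_bar - b*x_bar."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    # Sample correlation: sum of products of z-scores divided by n - 1.
    r = sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical family incomes and gift aid, both in $1,000s.
incomes = [20, 60, 100, 140, 180]
aid = [25, 21, 20, 19, 15]

a, b = least_squares_line(incomes, aid)
x_bar, y_bar = mean(incomes), mean(aid)

# The prediction at x_bar equals y_bar (up to floating-point error).
print(abs((a + b * x_bar) - y_bar) < 1e-9)  # True
```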

Example 8.2.1

Source: Main Text

Using the summary statistics in Figure 8.14, find the equation of the least squares regression line for predicting gift aid based on family income.

Solution:

Step 1 — Compute the slope $b$ using $b = r \cdot s_y/s_x$:

$$ b = r\,\frac{s_{y}}{s_{x}} = (-0.499)\,\frac{5.46}{63.2} = -0.0431. $$

Step 2 — Compute the intercept $a$ using $a = \bar{y} - b\,\bar{x}$:

$$ a = \bar{y} - b\,\bar{x} = 19.94 - (-0.0431)(101.8) = 19.94 + 4.388 = 24.3. $$

Step 3 — Write out the equation of the least squares line:

$$ \hat{y} = 24.3 - 0.0431\,x \qquad \text{or} \qquad \widehat{aid} = 24.3 - 0.0431 \times \text{family\_income}. $$
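As a cross-check of the hand calculation, here is a minimal Python sketch of the two formulas. The function name `line_from_summary` is hypothetical; the inputs are the summary statistics used in the steps above.

```python
def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (a, b) for y-hat = a + b*x from summary statistics alone."""
    b = r * s_y / s_x        # slope first
    a = y_bar - b * x_bar    # then the intercept
    return a, b

a, b = line_from_summary(x_bar=101.8, y_bar=19.94, s_x=63.2, s_y=5.46, r=-0.499)
print(round(b, 4), round(a, 1))  # -0.0431 24.3
```

The code carries the unrounded slope into the intercept calculation, so intermediate digits can differ slightly from the hand-worked steps, but the rounded results agree.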

Guided Practice 8.10

Source: Main Text

Using the summary statistics in Figure 8.14, compute the slope for the regression line of gift aid against family income.

Solution

Using $b = r \cdot s_y/s_x$:

$$ b = (-0.499)\,\frac{5.46}{63.2} = -0.04311\ldots \approx -0.0431. $$

The slope is $-0.0431$, matching Step 1 of Example 8.2.1, as it should: both use the same formula and the same summary statistics.

Example 8.2.2

Source: Main Text

Using the point $(101{,}780,\, 19{,}940)$ from the sample means (measured in raw dollars rather than \$1,000s) and the slope estimate $b_{1} = -0.0431$ from Guided Practice 8.10, find the least-squares line for predicting aid based on family income.

Solution:

Step 1 — Apply the point-slope equation with $(x_0, y_0) = (101{,}780, 19{,}940)$ and $b_1 = -0.0431$:

$$ y - y_{0} = b_{1}(x - x_{0}) $$

$$ y - 19{,}940 = -0.0431\,(x - 101{,}780). $$

Step 2 — Expand the right side:

$$ y - 19{,}940 = -0.0431\,x + 0.0431 \times 101{,}780 = -0.0431\,x + 4{,}386.7. $$

Step 3 — Add 19,940 to both sides to isolate $y$:

$$ y = -0.0431\,x + 4{,}386.7 + 19{,}940 = 24{,}326.7 - 0.0431\,x. $$

The equation simplifies to:

$$ \widehat{aid} = 24{,}327 - 0.0431 \times \text{family\_income}. $$

Here we have replaced $y$ with $\widehat{aid}$ and $x$ with family income to put the equation in context. The final equation should always include a "hat" on the variable being predicted, whether it is a generic $y$ or a named variable like "aid."

A computer is usually used to compute the least squares line, and a summary table generated using software for the Elmhurst regression line is shown in Figure 8.15. The first column of numbers provides estimates for $b_{0}$ (the intercept) and $b_{1}$ (the slope), respectively. These results match Example 8.2.2 up to rounding: the software reports an intercept of 24,319.3, while the hand calculation with rounded inputs gave 24,327.

Figure 8.15: Summary of least squares fit for the Elmhurst data (raw dollar units). Compare the parameter estimates in the first column to the results of Example 8.2.2.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	24319.3	1291.5	18.83	<0.0001
family_income	-0.0431	0.0108	-3.98	0.0002

Example 8.2.3

Source: Main Text

Say we wanted to predict a student's family income based on the amount of gift aid received. Would the least squares regression line be the following?

$$ \widehat{aid} = 24.3 - 0.0431 \times \text{family\_income} $$

Solution:

No. The equation we found was for predicting aid, not for predicting family income. To predict family income from aid, we must calculate a new regression line — letting $y$ be family income and $x$ be aid.

Step 1 — Recompute the slope with the roles of $x$ and $y$ swapped:

$$ b = r\,\frac{s_{y}}{s_{x}} = (-0.499)\,\frac{63.2}{5.46} = -5.776. $$

Step 2 — Recompute the intercept (note $\bar{y}$ is now family income's mean, 101.8, and $\bar{x}$ is aid's mean, 19.94):

$$ a = \bar{y} - b\,\bar{x} = 101.8 - (-5.776)(19.94) = 101.8 + 115.17 = 216.97 \approx 217. $$

The fitted line for predicting family income from aid is therefore:

$$ \widehat{\text{family\_income}} = 217 - 5.776 \times aid $$

in units of \$1,000s. (The source prints 607.3 here due to a typo — the correct arithmetic yields roughly 217 for the intercept.) The key lesson is that swapping explanatory and response variables produces a different regression line, not the algebraic inverse of the original.
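A short numerical sketch makes the last point concrete. It uses the same summary statistics as above; the variable names are chosen for illustration. If the reversed line were simply the algebraic inverse of the original, its slope would be $1/b$, and it is not. The product of the two slopes equals $r^2$, so the two lines coincide only when $r = \pm 1$.

```python
# Summary statistics from the examples above (incomes and aid in $1,000s).
r = -0.499
s_income, s_aid = 63.2, 5.46
mean_income, mean_aid = 101.8, 19.94

# Slope for predicting aid from income, and for predicting income from aid.
b_aid_on_income = r * s_aid / s_income
b_income_on_aid = r * s_income / s_aid
a_income_on_aid = mean_income - b_income_on_aid * mean_aid

print(round(b_income_on_aid, 3), round(a_income_on_aid))  # -5.776 217

# Slope of the algebraic inverse of the original line, for comparison.
print(round(1 / b_aid_on_income, 1))  # -23.2

# The two regression slopes multiply to r squared.
print(abs(b_aid_on_income * b_income_on_aid - r ** 2) < 1e-12)  # True
```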

A summary table based on computer output is shown in Figure 8.16 for the Elmhurst College data in \$1,000 units. The first column of numbers provides estimates for $b_{0}$ and $b_{1}$, respectively.

Figure 8.16: Summary of least squares fit for the Elmhurst College data (in \$1,000s). Compare the parameter estimates in the first column to Example 8.2.1.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	24.3193	1.2915	18.83	<0.0001
family_income	-0.0431	0.0108	-3.98	0.0002

Example 8.2.4

Source: Main Text

Examine the second, third, and fourth columns in Figure 8.16. Can you guess what they represent?

Solution:

We focus on the second row (the slope).

- First column (Estimate = $-0.0431$): our best estimate for the slope of the population regression line. We call this point estimate $b$.
- Second column (Std. Error = 0.0108): the standard error of this point estimate, which describes how much the estimate would be expected to vary if we drew repeated samples of fifty freshmen.
- Third column (t value = $-3.98$): the $t$ test statistic for the null hypothesis that the slope of the population regression line is zero, computed as $\text{Estimate}/\text{Std. Error}$. Dividing the rounded table entries gives $-0.0431/0.0108 \approx -3.99$; the software reports $-3.98$, presumably because it divides the unrounded values.
- Fourth column (Pr(>|t|) = 0.0002): the p-value for a two-sided $t$-test of that null hypothesis. A p-value this small is strong evidence that the true slope is not zero.

We will get into more details in Section 8.4.

Example 8.2.5

Source: Main Text

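What do the first and second columns of a regression summary table (laid out like Figure 8.16) represent? Consider a hypothetical regression of seat changes on unemployment rate, whose table reports a standard error of 0.8350 for the slope and whose fitted line is: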

$$ \hat{y} = -7.3644 - 0.8897\,x $$

where $\hat{y}$ is the predicted change in the number of seats for the president's party, and $x$ represents the unemployment rate.

Solution:

The entries in the first column represent the least squares estimates $b_{0}$ and $b_{1}$, and the values in the second column correspond to the standard errors of each estimate. Using the estimates, we can write the regression line as shown above.

We previously used a $t$-test statistic for hypothesis testing in the context of numerical data. Regression is very similar. Since the null value for the slope is 0, the test statistic is:

$$ T = \frac{\text{estimate} - \text{null value}}{\text{SE}} = \frac{-0.8897 - 0}{0.8350} = -1.07. $$

This corresponds to the third column of the regression table.
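The test statistic is a one-line calculation. The sketch below applies it to the slope in this example and to the rounded Elmhurst slope from Figure 8.16; the function name `t_statistic` is an assumption made for illustration.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """T = (estimate - null value) / SE."""
    return (estimate - null_value) / std_error

# Seat change vs. unemployment rate (this example).
print(round(t_statistic(-0.8897, 0.8350), 2))  # -1.07

# Elmhurst slope, using the rounded entries printed in Figure 8.16.
print(round(t_statistic(-0.0431, 0.0108), 2))  # -3.99
```

The second result differs from the table's $-3.98$ in the last digit, most likely because the software divides unrounded values while this sketch divides the rounded ones.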

Example 8.2.6

Source: Main Text

Suppose a high school senior is considering Elmhurst College. Can she simply use the linear equation we have found to calculate her financial aid from the university?

Solution:

No. Using the equation will provide a prediction or estimate, not a guarantee. As we see in the scatterplot, there is a lot of variability around the line. While the linear equation is good at capturing the trend in the data, there will be significant error in predicting an individual student's aid. Additionally, the data all come from one freshman class, and the way aid is determined by the university may change from year to year. A prediction for a current senior is a rough extrapolation of last year's aid policy.

Try It Now 8.2.2

Source: Main Text

A different college has summary statistics $\bar{x} = 80$ (family income, in \$1,000s), $\bar{y} = 15$ (gift aid, in \$1,000s), $s_x = 50$, $s_y = 4$, and $r = -0.6$.

(a) Compute the slope of the least squares regression line for predicting gift aid from family income. (b) Compute the $y$-intercept. (c) Write out the regression equation $\widehat{aid} = a + b \times \text{family\_income}$.

Solution

(a) Slope:

$$ b = r \cdot \frac{s_y}{s_x} = (-0.6) \cdot \frac{4}{50} = -0.048. $$

(b) Intercept:

$$ a = \bar{y} - b\,\bar{x} = 15 - (-0.048)(80) = 15 + 3.84 = 18.84. $$

(c) Regression equation:

$$ \widehat{aid} = 18.84 - 0.048 \times \text{family\_income} \quad (\text{in \$1,000s}). $$

Context Pause

Notice how little work $b = r\,s_y/s_x$ actually takes: three summary numbers go in, one slope comes out. You do not need the individual data points, just the correlation and the two standard deviations. That is why 19th-century statisticians (long before computers) could fit regression lines by hand for problems in astronomy and biology. Modern regression software arrives at the same line, just faster.

Insight Note

The slope formula $b = r\,s_y/s_x$ is a unit-conversion machine in disguise. Think of $s_y/s_x$ as the exchange rate between the two scales: the slope the line would have if one standard deviation of $x$ corresponded to one full standard deviation of $y$. The correlation $r$ then shrinks that rate (and sets its sign) according to how strongly the two variables move together. Multiply them and you get units of $y$ per unit of $x$ along the line of best fit, which is exactly the slope.

8.2.3 Interpreting the Coefficients of a Regression Line

Interpreting the coefficients in a regression model is often one of the most important steps in the analysis. A number without interpretation is just a number; in context, the same number is a prediction, an effect size, a policy claim.

Example 8.2.7

Source: Main Text

The slope for the Elmhurst College data for predicting gift aid based on family income was calculated as $-0.0431$. Interpret this quantity in the context of the problem.

Solution:

You might recall from algebra that slope is change in $y$ over change in $x$. Here, both $x$ and $y$ are in thousands of dollars. So if $x$ is one unit (one thousand dollars) higher, the line predicts $y$ will change by $-0.0431$ thousand dollars. In other words: for each additional thousand dollars of family income, on average, students receive $0.0431$ thousand — about \$43.10 — less in gift aid. A higher family income corresponds to less aid because the slope is negative.

Example 8.2.8

Source: Main Text

Examine Figure 8.16, which relates the Elmhurst College aid and student family income. How sure are you that the slope is statistically significantly different from zero? That is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?

Solution:

While the relationship between the variables is not perfect, there is an evident decreasing trend in the data. The p-value in Figure 8.16 is $0.0002$, which is far below any conventional threshold (like $0.05$ or $0.01$). This strongly suggests the hypothesis test will reject the null claim that the slope is zero.

Example 8.2.9

Source: Main Text

The $y$-intercept for the Elmhurst College data for predicting gift aid based on family income was calculated as $24.3$. Interpret this quantity in the context of the problem.

Solution:

The intercept $a$ describes the predicted value of $y$ when $x = 0$. The predicted gift aid is $24.3$ thousand dollars if a student's family has no income. The meaning of the intercept is relevant to this application because the family income for some students at Elmhurst is \$0. In other applications, the intercept may have little or no practical value if there are no observations where $x$ is near zero. Here, it would be acceptable to say that the average gift aid is \$24,300 among students whose families have \$0 in income.

Interpreting Coefficients in a Linear Model

- The slope, $b$, describes the average increase or decrease in the $y$ variable if the explanatory variable $x$ is one unit larger.
- The $y$-intercept, $a$, describes the predicted outcome of $y$ if $x = 0$. The linear model must be valid all the way to $x = 0$ for this interpretation to make sense, which in many applications is not the case.

Interpreting Parameters Estimated by Least Squares

The slope describes the estimated difference in the $y$ variable if the explanatory variable $x$ for a case happened to be one unit larger. The intercept describes the average outcome of $y$ if $x = 0$ and the linear model is valid all the way to $x = 0$, which in many applications is not the case.

Guided Practice 8.11

Source: Main Text

In the previous chapter, we encountered a data set that compared the price of new textbooks for UCLA courses at the UCLA Bookstore and on Amazon. We fit a linear model for predicting price at UCLA Bookstore from price on Amazon and get:

$$ \hat{y} = 1.86 + 1.03\,x $$

where $x$ is the price on Amazon (in dollars) and $y$ is the price at the UCLA bookstore (in dollars). Interpret the coefficients in this model and discuss whether the interpretations make sense in this context.

Solution

Slope ($b = 1.03$): For each additional \$1.00 in the Amazon price of a textbook, the UCLA Bookstore price is about \$1.03 higher on average. This makes sense — books that cost more at one vendor also tend to cost more at the other.

Intercept ($a = 1.86$): If Amazon's price were \$0, the UCLA Bookstore price would be predicted to be \$1.86. Since a \$0 Amazon price is outside the realistic range of the data, this intercept has limited practical meaning; it simply positions the line so that it fits the observed prices.

Both interpretations make sense as descriptions of the fitted line; only the slope has real-world applicability to typical textbook prices.

Guided Practice 8.12

Source: Main Text

Can we conclude that if Amazon raises the price of a textbook by \$1.00, the UCLA Bookstore will raise the price of the textbook by \$1.03?

Solution

No. The slope describes an average cross-sectional relationship — looking across many textbooks at one point in time, the UCLA Bookstore prices are on average \$1.03 higher for each extra \$1.00 in Amazon price. That is not the same as a causal, dynamic response of one vendor's pricing to another's. If Amazon raises a specific textbook's price by \$1.00 tomorrow, there is no guarantee the UCLA Bookstore will do anything in response. The slope is a description of the pattern in the data, not a prediction of vendor behavior.

Exercise Caution When Interpreting Coefficients of a Linear Model

- The slope tells us only the average change in $y$ for each unit change in $x$; it does not tell us how much $y$ might change based on a change in $x$ for any particular individual. Moreover, in most cases, the slope cannot be interpreted in a causal way.

- When a value of $x = 0$ doesn't make sense in an application, the interpretation of the $y$-intercept won't have any practical meaning.

Context Pause

The difference between "the slope is $-0.0431$" and "for every extra \$1,000 of family income, the average aid drops by about \$43" is everything. The first is an artifact of the regression. The second is a sentence a student, a parent, or a financial-aid officer can actually use. In practice, interpreting coefficients in plain English is the whole point of fitting a regression in the first place.

Insight Note

A very common mistake is turning a cross-sectional slope into a causal policy claim — "if we cut family income by \$1,000, aid will go up by \$43 on average." That claim requires more than regression can deliver, because confounders (grades, legacy status, major, demonstrated financial need) may be driving both variables. Correlation is not causation, and slope is a numerical summary of correlation. Never treat $b$ as a policy lever without an experimental design to back it up.

Try It Now 8.2.3

Source: Main Text

A regression of a child's shoe size $y$ on age (in years) $x$ gives $\hat{y} = 3 + 0.8\,x$ for children aged 4 to 12.

(a) Interpret the slope in plain English. (b) Interpret the $y$-intercept. Does it have practical meaning? (c) Would you use this equation to predict the shoe size of a newborn (age 0)? Why or why not?

Solution

(a) For each additional year of age, a child's shoe size increases by $0.8$ on average.

(b) The intercept is $3$, which is the predicted shoe size at age $0$. Since the data only cover ages 4 to 12, this intercept has no practical meaning: the model was never checked against newborns, and there is no reason to expect a newborn to wear a size 3 on the same sizing scale. It is a mathematical anchor for the line, not a real prediction.

(c) No. Age $0$ is outside the range of the data (ages 4 to 12). Predicting for a newborn would be extrapolation, which is unreliable — see the next subsection.

8.2.4 Extrapolation Is Treacherous

When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February $6^{\text{th}}$ it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.

— Stephen Colbert, April 6, 2010

Linear models can be used to approximate the relationship between two variables. However, these models have real limitations. Linear regression is simply a modeling framework. The truth is almost always much more complex than a simple line. For example, we do not know how the data outside of our limited window will behave.

Example 8.2.10

Source: Main Text

Use the model $\widehat{aid} = 24.3 - 0.0431 \times \text{family\_income}$ to estimate the aid of another freshman student whose family had income of \$1 million.

Solution:

Step 1 — Convert the income into the model's units. The units of family income are \$1,000s, so \$1,000,000 becomes $\text{family\_income} = 1000$.

Step 2 — Plug into the model:

$$ \widehat{aid} = 24.3 - 0.0431(1000) = 24.3 - 43.1 = -18.8. $$

The model predicts this student will have $-\$18{,}800$ in aid — that is, the student would pay \$18,800 on top of tuition. Elmhurst College cannot (or at least does not) require students with high-income families to pay extra on top of tuition to attend. The prediction is obviously ridiculous.

Using a model to predict $y$-values for $x$-values outside the domain of the original data is called extrapolation. Generally, a linear model is only an approximation of the real relationship between two variables. If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.

Extrapolation Warning

A regression line is valid only within the range of the data used to fit it. Outside that range, the line is a guess, and guesses compound the farther you go. A model that is excellent at predicting within the data can be catastrophically wrong outside it.
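One practical safeguard is to have the prediction code report whether an input falls outside the range of the data. The sketch below wraps the Elmhurst model in a hypothetical helper, `predict_aid`. The observed income range of \$0 to \$300,000 is an assumed placeholder, not a value taken from the data set; substitute the actual minimum and maximum of the sample.

```python
def predict_aid(family_income, observed_min, observed_max):
    """Predict gift aid (in $1,000s) and flag inputs outside the observed range."""
    prediction = 24.3 - 0.0431 * family_income
    is_extrapolation = not (observed_min <= family_income <= observed_max)
    return prediction, is_extrapolation

# Assumed range of family incomes in the sample, in $1,000s (hypothetical values).
OBSERVED_MIN, OBSERVED_MAX = 0, 300

for income in (100, 1000):
    aid, flagged = predict_aid(income, OBSERVED_MIN, OBSERVED_MAX)
    note = "extrapolation, do not trust" if flagged else "within observed range"
    print(f"income {income}: predicted aid {aid:.1f} ({note})")
```

The flag does not fix the prediction; it only makes the analyst decide explicitly whether to use a number that the data cannot support.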

Context Pause

The Colbert quote above is a joke, but the joke works because extrapolation is a real mistake that real people make all the time — projecting short-term trends (spring warming, stock returns, COVID case counts) forward indefinitely and getting apocalyptic or utopian predictions. The math of a line says "just plug in any $x$," but the data says "we only looked at $x$ in this range." When those two disagree, trust the data.

Insight Note

A good check: before you plug an $x$ into a regression equation, ask "is this $x$ close to values that were actually in the data?" If not, do not trust the prediction. If you must extrapolate, at minimum acknowledge it explicitly, widen your uncertainty, and think about what physical or economic constraints the line is ignoring (like "aid cannot be negative").

Try It Now 8.2.4

Source: Main Text

Suppose a regression of temperature $y$ in degrees Fahrenheit on calendar date $x$ (from February 6th through April 6th in the Colbert quote) gives $\hat{y} = 10 + 1.17\,x$, where $x$ is the number of days after February 6th.

(a) Predict the temperature on April 6th ($x = 59$ days). Does this match the quote? (b) Use the model to "predict" the temperature on August 6th ($x = 181$ days). (c) Why is the answer in part (b) nonsense, and what is the correct name for this mistake?

Solution

(a) $\hat{y} = 10 + 1.17 \times 59 = 10 + 69.03 = 79.03$ degrees, which agrees with the quote's "almost 80."

(b) $\hat{y} = 10 + 1.17 \times 181 = 10 + 211.77 = 221.77 \approx 222$ degrees — very close to Colbert's "220 degrees."

(c) Temperatures do not keep climbing at a steady rate from February through August: they peak around July and then fall again. The linear model was fit to two months of data and has no information about the summer's seasonal peak or fall's cooling. Using the equation for August is extrapolation, and the prediction is ridiculous.

8.2.5 Using R-squared to Describe the Strength of a Fit

We evaluated the strength of the linear relationship between two variables earlier using the correlation, $r$. However, it is more common to explain the fit of a model using $R^2$, called R-squared or the explained variance. If provided with a linear model, we might want to describe how closely the data cluster around the linear fit.

Figure 8.17: Gift aid and family income for a random sample of 50 freshman students from Elmhurst College, shown with the least squares regression line $\hat{y}$ and the average line $\bar{y}$.

We are interested in how well a model accounts for, or explains, the location of the $y$ values. The $R^2$ of a linear model describes how much smaller the variance (in the $y$ direction) about the regression line is than the variance about the horizontal line $\bar{y}$. For the Elmhurst College data shown in Figure 8.17, the variance of the response variable (aid received) is

$$ s_{aid}^{2} = 29.8. $$

If we apply our least squares line, the variability in the residuals describes how much variation remains after using the model:

$$ s_{RES}^{2} = 22.4. $$

The reduction in the variance is:

$$ \frac{s_{aid}^{2} - s_{RES}^{2}}{s_{aid}^{2}} = \frac{29.8 - 22.4}{29.8} = \frac{7.4}{29.8} \approx 0.25. $$

Roughly 25% of the variance in aid is explained by family income. If we used the simple standard deviation of the residuals, this computation would give exactly $R^2$. However, the standard way of computing the standard deviation of the residuals is slightly more sophisticated (it divides by $n - 2$ rather than $n - 1$). To avoid any trouble, we can instead use a sum of squares method. If we call the sum of the squared errors about the regression line $SSRes$ and the sum of the squared errors about the mean $SSM$, we can define $R^2$ as:

$$ R^{2} = \frac{SSM - SSRes}{SSM} = 1 - \frac{SSRes}{SSM}. $$

Figure 8.18: (a) The regression line is equivalent to $\bar{y}$; $R^{2} = 0$. (b) The regression line passes through all of the points; $R^{2} = 1$. Try out this and other interactive Desmos activities at openintro.org/ahss/desmos.

Guided Practice 8.13

Source: Main Text

Using the formula for $R^2$, confirm that in Figure 8.18 (a), $R^{2} = 0$, and that in Figure 8.18 (b), $R^{2} = 1$.

Solution

(a) Regression line equals $\bar{y}$: If the regression line is horizontal at $\bar{y}$, then every prediction $\hat{y}_i = \bar{y}$, and the residuals are $y_i - \hat{y}_i = y_i - \bar{y}$. So $SSRes = \sum(y_i - \bar{y})^2 = SSM$. Therefore:

$$ R^{2} = 1 - \frac{SSRes}{SSM} = 1 - \frac{SSM}{SSM} = 1 - 1 = 0. $$

(b) Regression line passes through all points: Every $\hat{y}_i = y_i$, so every residual is $0$, meaning $SSRes = 0$. Therefore:

$$ R^{2} = 1 - \frac{0}{SSM} = 1 - 0 = 1. $$

$R^2$ Is the Explained Variance

$R^{2}$ is always between $0$ and $1$, inclusive. It tells us the proportion of variation in the $y$ values that is explained by a regression model. The higher the value of $R^{2}$, the better the model "explains" the response variable.

The value of $R^{2}$ is, in fact, equal to $r^{2}$, where $r$ is the correlation. This means that $r = \pm\sqrt{R^{2}}$. Use this fact to answer the next two practice problems.

Guided Practice 8.14

Source: Main Text

If a linear model has a very strong negative relationship with a correlation of $-0.97$, how much of the variation in the response variable is explained by the linear model?

Solution

Since $R^2 = r^2$:

$$ R^{2} = (-0.97)^{2} = 0.9409. $$

About $94\%$ of the variation in the response variable is explained by the linear model. (Very strong relationships translate into very high $R^2$ values — squaring a number close to $\pm 1$ stays close to 1.)

Guided Practice 8.15

Source: Main Text

If a linear model has an $R^{2}$ or explained variance of $0.94$, what is the correlation?

Solution

Since $r = \pm\sqrt{R^{2}}$:

$$ r = \pm\sqrt{0.94} \approx \pm 0.970. $$

The sign of $r$ matches the sign of the slope. If the regression line has a positive slope, $r \approx +0.97$; if negative, $r \approx -0.97$. Without the scatterplot or the slope, we cannot decide the sign from $R^2$ alone.
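The rule $r = \pm\sqrt{R^{2}}$, with the sign taken from the slope, can be sketched in a few lines of Python. This is a minimal illustration; the function name `correlation_from_r_squared` and the slope values passed to it are hypothetical.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Return r = +/- sqrt(R^2), taking the sign from the regression slope."""
    if not 0 <= r_squared <= 1:
        raise ValueError("R^2 must be between 0 and 1")
    magnitude = math.sqrt(r_squared)
    # copysign attaches the sign of the slope to the magnitude of r.
    return math.copysign(magnitude, slope)


# Same R^2, opposite slopes (slope values are made up for illustration).
print(round(correlation_from_r_squared(0.94, slope=-1.3), 3))
print(round(correlation_from_r_squared(0.94, slope=2.0), 3))
```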

Context Pause

Reporting $R^2 = 0.25$ is a more actionable statement than reporting "$r = -0.499$." Both are correct, but $R^2$ gives a proportion — "this model explains 25% of the variation in aid, leaving 75% unexplained." That framing helps a reader decide whether the model is good enough for their purpose, or whether they need to collect more variables (a multivariable regression in Chapter 9).

Insight Note

$R^2$ is widely known, especially in machine learning and economics, as the coefficient of determination. It answers the question "how much of what is going on in $y$ is captured by the model?" If $R^2 = 0.95$, the model is doing almost all the work; if $R^2 = 0.05$, the model is barely doing any. A high $R^2$ is not a guarantee that the model is correct (Anscombe's Quartet from Section 8.1 has $R^2 \approx 0.67$ for four very different datasets), and a low $R^2$ does not prove the model is wrong; it only says that most of the variation in $y$ is left unexplained, whether because important predictors are missing or because the response is inherently noisy.

Try It Now 8.2.5

Source: Main Text

A regression of a student's final exam score $y$ on their midterm score $x$ produces $r = 0.80$.

(a) Compute $R^2$ and interpret it in context. (b) What percentage of the variation in final exam scores is not explained by midterm scores? (c) If a different class gives $R^2 = 0.49$, what would be the correlation (assuming a positive slope)?

Solution

(a) $R^2 = 0.80^2 = 0.64$. About 64% of the variation in final exam scores is explained by the linear regression on midterm scores.

(b) $1 - R^2 = 1 - 0.64 = 0.36$, so 36% of the variation in final exam scores is unexplained by midterm scores. It reflects other factors (study habits, different topics on the final, test-day performance, etc.).

(c) Since the slope is positive, $r = +\sqrt{0.49} = 0.70$.

8.2.6 Technology: Linear Correlation and Regression

Get started quickly with the Desmos LinReg Calculator (available at openintro.org/ahss/desmos).

Calculator Instructions

TI-84: Finding $a$, $b$, $R^2$, and $r$ for a Linear Model

Use STAT, CALC, LinReg(a + bx).

1. Press STAT.

2. Right arrow to CALC.

3. Down arrow and choose 8:LinReg(a+bx).

- Caution: choosing 4:LinReg(ax+b) will reverse $a$ and $b$.

4. Let Xlist be L1 and Ylist be L2 (enter the $x$ and $y$ values in L1 and L2 first).

5. Leave FreqList blank.

6. Leave Store RegEQ blank.

7. Choose Calculate and hit ENTER, which returns:

- a — the $y$-intercept of the best fit line

- b — the slope of the best fit line

- r² — $R^2$, the explained variance

- r — the correlation coefficient

TI-83: Do steps 1-3, then enter the $x$ list and $y$ list separated by a comma, e.g. LinReg(a+bx) L1, L2, then hit ENTER.

What to Do If $R^2$ and $r$ Do Not Show Up on a TI-83/84

If $r^2$ and $r$ do not show up when doing STAT, CALC, LinReg, diagnostics must be turned on. This only needs to be done once and the diagnostics will remain on.

1. Hit 2ND 0 (i.e. CATALOG).

2. Scroll down until the arrow points at DiagnosticOn.

3. Hit ENTER and ENTER again. The screen should now say:

DiagnosticOn

Done

What to Do If a TI-83/84 Returns: `ERR: DIM MISMATCH`

This error means that the lists, generally L1 and L2, do not have the same length.

1. Choose 1:Quit.

2. Choose STAT, Edit and make sure that the lists have the same number of entries.

Casio fx-9750GII: Finding $a$, $b$, $R^2$, and $r$ for a Linear Model

1. Navigate to STAT (MENU button, then hit the 2 button or select STAT).

2. Enter the $x$ and $y$ data into 2 separate lists, e.g. $x$ values in List 1 and $y$ values in List 2. Observation ordering should be the same in the two lists. For example, if $(5, 4)$ is the second observation, then the second value in the $x$ list should be 5 and the second value in the $y$ list should be 4.

3. Navigate to CALC (F2) and then SET (F6) to set the regression context.

- To change the 2Var XList, navigate to it, select List (F1), and enter the proper list number. Similarly, set 2Var YList to the proper list.

4. Hit EXIT.

5. Select REG (F3), X (F1), and a+bx (F2), which returns:

- a — the $y$-intercept of the best fit line

- b — the slope of the best fit line

- r — the correlation coefficient

- r² — $R^2$, the explained variance

- MSe — Mean squared error, which you can ignore

If you select ax+b (F1), the $a$ and $b$ meanings will be reversed.

Guided Practice 8.16

Source: Main Text

The data set loan50, introduced in Chapter 1, contains information on randomly sampled loans offered through Lending Club. A subset of the data matrix is shown in Figure 8.19. Use a calculator to find the equation of the least squares regression line for predicting loan amount from total income based on this subset.

Figure 8.19: Sample of data from loan50.

	total_income	loan_amount
1	59000	22000
2	60000	6000
3	75000	25000
4	75000	6000
5	254000	25000
6	67000	6400
7	28800	3000

Solution

Entering total_income in L1 and loan_amount in L2 on a TI-84 and running LinReg(a+bx) gives approximately:

- $a \approx 6{,}497$

- $b \approx 0.0774$

- $r \approx 0.57$

- $R^2 \approx 0.33$

So the fitted line is approximately:

$$ \widehat{\text{loan\_amount}} = 6{,}497 + 0.0774 \times \text{total\_income}. $$

(Exact values depend on the software's rounding.) The $R^2$ tells us that total income explains only about 33% of the variation in loan amount for this small sample; most of the variation comes from other factors.
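The calculator result can be cross-checked with a few lines of code. Below is a minimal Python sketch, assuming the seven rows of Figure 8.19 are typed in as lists; the helper name `least_squares` is hypothetical and uses only the standard library.

```python
# Rows copied from Figure 8.19 (the loan50 subset).
total_income = [59000, 60000, 75000, 75000, 254000, 67000, 28800]
loan_amount = [22000, 6000, 25000, 6000, 25000, 6400, 3000]


def least_squares(x, y):
    """Return (a, b, r) for the least squares line y-hat = a + b*x."""
    n = len(x)
    if n != len(y):
        # Same situation a TI-83/84 reports as ERR: DIM MISMATCH.
        raise ValueError("x and y must have the same length")
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx                    # slope
    a = y_bar - b * x_bar              # y-intercept
    r = s_xy / (s_xx * s_yy) ** 0.5    # correlation
    return a, b, r


a, b, r = least_squares(total_income, loan_amount)
print(f"a = {a:.0f}, b = {b:.4f}, r = {r:.2f}, R^2 = {r * r:.2f}")
```

The printed values can be compared line by line with the calculator's a, b, r, and r² output.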

Context Pause

Calculator and software output is a safety net, not a replacement for the slope and intercept formulas. If you blindly press LinReg and get a line, you won't catch typos in your data, unit mismatches, or outliers that are pulling the line in a bad direction. Always plot the data first (STAT PLOT on a TI, geom_point() in ggplot, etc.) and check that the line makes sense before quoting it.

Insight Note

The labels on calculator output vary in confusing ways: some calculators show LinReg(ax+b) (slope first, intercept second) and some show LinReg(a+bx) (intercept first, slope second). Always confirm which label is which before writing down your equation. A regression line that says "$\hat{y} = 0.0774 + 6497\,x$" instead of "$\hat{y} = 6497 + 0.0774\,x$" is obviously wrong, but only if you notice the swap.

Try It Now 8.2.6

Source: Main Text

A calculator returns the following output after running a linear regression on five data points:

- $a = 15.2$

- $b = 2.5$

- $r = 0.9$

- $r^2 = 0.81$

(a) Write the regression equation. (b) Interpret $R^2$ in the context of this model. (c) What is the correlation $r$, and what does its sign tell you about the scatterplot?

Solution

(a) $\hat{y} = 15.2 + 2.5\,x$.

(b) About 81% of the variation in $y$ is explained by the linear model. About 19% is left unexplained.

(c) $r = 0.9$. The positive sign tells you that as $x$ increases, $y$ tends to increase — the scatterplot should show an upward-sloping pattern.

8.2.7 Types of Outliers in Linear Regression

Outliers in regression are observations that fall far from the "cloud" of points. These points are especially important because they can have a strong influence on the least squares line.

Example 8.2.11

Source: Main Text

There are six plots shown in Figure 8.20 along with the least squares line and residual plots. For each scatterplot and residual plot pair, identify any obvious outliers and note how they influence the least squares line. Recall that an outlier is any point that doesn't appear to belong with the vast majority of the other points.

Solution:

1. There is one outlier far from the other points, though it only appears to slightly influence the line.

2. There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn't very influential.

3. There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud doesn't appear to fit very well.

4. There is a primary cloud and then a small secondary cloud of four outliers. The secondary cloud appears to be influencing the line somewhat strongly, making the least squares line fit poorly almost everywhere. There might be an interesting explanation for the dual clouds, which is something that could be investigated.

5. There is no obvious trend in the main cloud of points and the outlier on the right appears to largely control the slope of the least squares line.

6. There is one outlier far from the cloud, however, it falls quite close to the least squares line and does not appear to be very influential.

Examine the residual plots in Figure 8.20. You will probably find that there is some trend in the main clouds of (3) and (4). In these cases, the outliers influenced the slope of the least squares lines. In (5), data with no clear trend were assigned a line with a large trend simply due to one outlier.

> Leverage
>
> Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage.

Points with high leverage can strongly influence the slope of the least squares line. If one of these high-leverage points does appear to exert its influence on the slope of the line, as in cases (3), (4), and (5) above, then we call it an influential point. Usually we can say a point is influential if, had we fitted the line without it, the influential point would have been unusually far from the least squares line.

It is tempting to remove outliers. Don't do this without a very good reason. Models that ignore exceptional (and interesting) cases often perform poorly. For instance, if a financial firm ignored the largest market swings — the "outliers" — it would soon go bankrupt by making poorly thought-out investments.

> Don't Ignore Outliers When Fitting a Final Model
>
> If there are outliers in the data, they should not be removed or ignored without a good reason. Whatever final model is fit to the data would not be very helpful if it ignores the most exceptional cases.

Figure 8.20: Six plots, each with a least squares line and residual plot. All data sets have at least one outlier.

> Outlier Terminology
>
> - Outlier: an observation that falls far from the bulk of the data.
> - High-leverage point: a point with an $x$-value far from the center of the $x$ distribution. High leverage does not imply high influence; a high-leverage point on the regression line can be harmless.
> - Influential point: a high-leverage point whose removal would noticeably change the slope (or intercept) of the least squares line.

Context Pause

Outliers are one of the most common reasons a linear model is misleading. A single extreme observation can swing the slope, inflate or deflate $R^2$, and make the line tell a different story than the bulk of the data. Always make a scatterplot before trusting a regression, and pay special attention to points at the extreme ends of the $x$ range — those are the ones with the most leverage.

Insight Note

The decision to keep or drop an outlier is a scientific judgment, not a statistical one. A data-entry error or a malfunctioning sensor is a defensible reason to drop a point. "It doesn't fit my model" is not. When unsure, report the regression twice — once with the outlier and once without — and let the reader see how the result depends on that point.

Try It Now 8.2.7

Source: Main Text

A scatterplot shows 19 points in a tight upward line, plus one point far to the right of the cloud. Consider three candidates for this outlier:

- Candidate A: the outlier sits right on the line the other 19 points would produce.

- Candidate B: the outlier sits far above the line the other 19 points would produce.

- Candidate C: the outlier sits on the line of the 19 points but at $x$ twice as far as any other $x$.

For each candidate, classify the point as (i) high leverage, yes/no and (ii) influential, yes/no.

Solution

- Candidate A: Far right → high leverage yes. On the line → would not change the slope if removed → influential no.

- Candidate B: Far right → high leverage yes. Off the line → removing it would visibly change the slope → influential yes.

- Candidate C: Far right, on the line → high leverage yes, influential no (same reasoning as A).

Key takeaway: high leverage + far-from-line = influential. High leverage alone is not enough.
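The difference between leverage and influence can be seen by fitting the line with and without the far-right point. The sketch below is a minimal Python illustration on made-up data (19 points in a tight upward line plus one point at $x = 40$); the helper `fit_line` and all numeric values are assumptions chosen for the example.

```python
def fit_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


# Made-up cloud: 19 points in a tight upward line with small alternating jitter.
x_cloud = list(range(1, 20))
y_cloud = [2 + 0.5 * xi + (0.2 if xi % 2 else -0.2) for xi in x_cloud]
a0, b0 = fit_line(x_cloud, y_cloud)

x_far = 40                     # far to the right of the cloud: high leverage
on_line = a0 + b0 * x_far      # like Candidate A: on the cloud's own line
off_line = on_line + 15        # like Candidate B: far above that line

_, b_a = fit_line(x_cloud + [x_far], y_cloud + [on_line])
_, b_b = fit_line(x_cloud + [x_far], y_cloud + [off_line])

print(f"slope, cloud only:      {b0:.3f}")
print(f"slope with Candidate A: {b_a:.3f}")
print(f"slope with Candidate B: {b_b:.3f}")
```

A point that lies exactly on the fitted line has a residual of zero, so adding it leaves the slope unchanged; the point placed far above the line pulls the slope upward.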

8.2.8 Categorical Predictors with Two Levels (Special Topic)

Categorical variables are also useful in predicting outcomes. Here we consider a categorical predictor with two levels (recall that a level is the same as a category). We'll consider eBay auctions for a video game, Mario Kart for the Nintendo Wii, where both the total price of the auction and the condition of the game were recorded. Here we want to predict total price based on game condition, which takes values used and new. A plot of the auction data is shown in Figure 8.21.

Figure 8.21: Total auction prices for the game Mario Kart, divided into used ($x = 0$) and new ($x = 1$) condition games with the least squares regression line shown.

Figure 8.22: Least squares regression summary for the Mario Kart data.

To incorporate the game condition variable into a regression equation, we must convert the categories into a numerical form. We do so using an indicator variable called cond_new, which takes value $1$ when the game is new and $0$ when the game is used. Using this indicator variable, the linear model may be written as:

$$ \widehat{price} = \alpha + \beta \times \text{cond\_new}. $$

The fitted model is summarized in the table below, and the model with its parameter estimates is given as:

$$ \widehat{price} = 42.87 + 10.90 \times \text{cond\_new}. $$

For categorical predictors with two levels, the linearity assumption will always be satisfied. However, we must evaluate whether the residuals in each group are approximately normal with equal variance. Based on Figure 8.21, both of these conditions are reasonably satisfied.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	42.87	0.81	52.67	0.0000
cond_new	10.90	1.26	8.66	0.0000

Example 8.2.12

Source: Main Text

Interpret the two parameters estimated in the model for the price of Mario Kart in eBay auctions.

Solution:

Intercept ($\alpha = 42.87$): The estimated price when cond_new takes value $0$ — i.e. when the game is in used condition. The average selling price of a used version of the game is \$42.87.

Slope ($\beta = 10.90$): On average, new games sell for about \$10.90 more than used games. In this categorical setup, the slope is no longer a per-unit change — it is the group difference between the two levels of the categorical variable.

> Interpreting Model Estimates for Categorical Predictors > The estimated intercept is the value of the response variable for the first category (the category corresponding to an indicator value of 0). The estimated slope is the average change in the response variable between the two categories.

Context Pause

Indicator variables are the bridge between categorical data and the regression machinery we built for numerical data. If you know how to handle the Mario Kart used-vs-new split, you can handle male-vs-female, treatment-vs-control, or AM-vs-PM. For categorical predictors with more than two levels (three genres, four cities, five brands), you add more indicator variables — one for each level except one "reference" level. Chapter 9 covers the details.

Insight Note

When a regression has only a single two-level categorical predictor, the slope equals the difference between the two group means. That is: $\beta = \bar{y}_{\text{new}} - \bar{y}_{\text{used}}$. In this case, \$42.87 + \$10.90 = \$53.77 is the mean new-game price. The machinery of least squares reduces to an everyday "what's the gap between two group averages" calculation — a two-sample $t$-test in disguise.
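The claim that the slope equals the gap between the two group means can be verified directly. The following minimal Python sketch uses made-up prices (not the actual Mario Kart auction data) and computes the least squares line on a 0/1 indicator by hand; all variable names are hypothetical.

```python
# Made-up auction prices in dollars; not the actual Mario Kart data.
used_prices = [40.0, 45.5, 38.0, 47.0, 43.5]
new_prices = [52.0, 55.5, 50.0, 57.5]

# Indicator variable: 0 for used, 1 for new.
cond_new = [0] * len(used_prices) + [1] * len(new_prices)
price = used_prices + new_prices

# Least squares slope and intercept for price on cond_new.
n = len(price)
x_bar, y_bar = sum(cond_new) / n, sum(price) / n
s_xx = sum((xi - x_bar) ** 2 for xi in cond_new)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(cond_new, price))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Group means for comparison.
mean_used = sum(used_prices) / len(used_prices)
mean_new = sum(new_prices) / len(new_prices)

print(f"intercept = {intercept:.2f}, mean of used = {mean_used:.2f}")
print(f"slope = {slope:.2f}, mean new - mean used = {mean_new - mean_used:.2f}")
```

The intercept matches the mean of the group coded 0, and the slope matches the difference between the two group means.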

Try It Now 8.2.8

Source: Main Text

A regression of a used car's selling price $y$ (in \$1,000s) on whether the car has leather seats (leather = 1 for yes, 0 for no) gives:

$$ \widehat{price} = 18.2 + 4.7 \times \text{leather}. $$

(a) What is the average selling price of a car with no leather seats? (b) What is the average selling price of a car with leather seats? (c) Interpret the slope in plain English.

Solution

(a) When leather = 0: $\widehat{price} = 18.2 + 4.7(0) = 18.2$ thousand dollars, i.e. \$18,200.

(b) When leather = 1: $\widehat{price} = 18.2 + 4.7(1) = 22.9$ thousand dollars, i.e. \$22,900.

(c) On average, used cars with leather seats sell for \$4,700 more than used cars without leather seats. The slope for a binary categorical predictor is the gap between the two group means.

Section Summary

We define the best fit line as the line that minimizes the sum of the squared residuals (errors) about the line. That is, we find the line that minimizes $(y_{1} - \hat{y}_{1})^{2} + (y_{2} - \hat{y}_{2})^{2} + \cdots + (y_{n} - \hat{y}_{n})^{2} = \sum(y_{i} - \hat{y}_{i})^{2}$. We call this line the least squares regression line.
We write the least squares regression line in the form $\hat{y} = a + bx$, and we can calculate $a$ and $b$ based on the summary statistics as follows:

$$ b = r\,\frac{s_{y}}{s_{x}} \qquad \text{and} \qquad a = \bar{y} - b\,\bar{x}. $$

Interpreting the slope and $y$-intercept of a linear model:
The slope, $b$, describes the average increase or decrease in the $y$ variable if the explanatory variable $x$ is one unit larger.
The $y$-intercept, $a$, describes the average or predicted outcome of $y$ if $x = 0$. The linear model must be valid all the way to $x = 0$ for this to make sense, which in many applications is not the case.
Two important considerations about the regression line:
The regression line provides estimates or predictions, not actual values. It is important to know how large $s$, the standard deviation of the residuals, is in order to know about how much error to expect in these predictions.
The regression line estimates are only reasonable within the domain of the data. Predicting $y$ for $x$ values outside the domain — extrapolation — is unreliable and may produce ridiculous results.
Using $R^2$ to assess the fit of the model:
$R^{2}$, called R-squared or the explained variance, is a measure of how well the model explains or fits the data. $R^{2}$ is always between 0 and 1, inclusive (or between 0% and 100%, inclusive). The higher the value of $R^{2}$, the better the model "fits" the data.
The $R^{2}$ for a linear model describes the proportion of variation in the $y$ variable that is explained by the regression line.
$R^{2}$ applies to any type of model, not just a linear model, and can be used to compare the fit among various models.
The correlation $r = -\sqrt{R^{2}}$ or $r = \sqrt{R^{2}}$. The value of $R^{2}$ is always positive and cannot tell us the direction of the association. If finding $r$ based on $R^{2}$, use either the scatterplot or the slope of the regression line to determine the sign of $r$.
When a residual plot of the data appears as a random cloud of points, a linear model is generally appropriate. If a residual plot has any type of pattern or curvature (such as a U-shape), a linear model is not appropriate.
Outliers in regression are observations that fall far from the "cloud" of points.
An influential point is a point that has a big effect or pull on the slope of the regression line. Points that are outliers in the $x$ direction will have more pull on the slope of the regression line and are more likely to be influential points.

Problem Set

Source: Main Text

Problem 1: Units of regression. Consider a regression predicting weight (kg) from height (cm) for a sample of adult males. What are the units of the correlation coefficient, the intercept, and the slope?

Problem 1 Solution

Step 1 — Units of the correlation coefficient: The correlation coefficient $r$ is computed from Z-scores, which have no units (Z-scores are dimensionless because they divide a length by a length, or a mass by a mass, etc.), so $r$ itself is unitless.

Step 2 — Units of the intercept: The intercept $a$ is the predicted value of $y$ when $x = 0$. Here $y$ is weight in kg, so the intercept has units of kg.

Step 3 — Units of the slope: The slope $b$ is change in $y$ per unit change in $x$, so its units are "units of $y$" per "unit of $x$" — here, kg per cm (or kg/cm).

Answer: Correlation: unitless. Intercept: kg. Slope: kg/cm.

Problem 2: Which is higher? Determine if I or II is higher or if they are equal. Explain your reasoning. For a regression line, the uncertainty associated with the slope estimate, $b$, is higher when:

I. there is a lot of scatter around the regression line, or II. there is very little scatter around the regression line.

Problem 2 Solution

Step 1 — Relate scatter to slope uncertainty: The standard error of the slope is roughly $\text{SE}_b = s / (s_x \sqrt{n-1})$, where $s$ is the standard deviation of the residuals. More scatter around the line means a larger $s$, which means a larger SE, which means more uncertainty in the slope estimate.

Step 2 — Compare cases I and II: Case I (lots of scatter) produces a large $s$ and thus a large $\text{SE}_b$. Case II (little scatter) produces a small $s$ and a small $\text{SE}_b$.

Answer: I is higher. The slope estimate is less certain (larger standard error) when there is more scatter around the regression line.
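The link between scatter and slope uncertainty can be illustrated numerically. This is a minimal Python sketch, assuming made-up data that share the same underlying line but differ in the amount of scatter; the helper name `slope_standard_error` is hypothetical, and the residual standard deviation is computed with the $n - 2$ divisor.

```python
import math


def slope_standard_error(x, y):
    """Return SE_b = s / (s_x * sqrt(n - 1)) for the least squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))     # standard deviation of the residuals
    s_x = math.sqrt(s_xx / (n - 1))     # sample standard deviation of x
    return s / (s_x * math.sqrt(n - 1))


# Made-up data: same underlying line, two different amounts of scatter.
x = list(range(1, 11))
pattern = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
little_scatter = [3 + 2 * xi + 0.5 * e for xi, e in zip(x, pattern)]
lots_of_scatter = [3 + 2 * xi + 5.0 * e for xi, e in zip(x, pattern)]

print(f"SE_b, little scatter:  {slope_standard_error(x, little_scatter):.3f}")
print(f"SE_b, lots of scatter: {slope_standard_error(x, lots_of_scatter):.3f}")
```

Since the $x$ values and $n$ are identical in both cases, the only thing that changes is $s$, and the standard error grows in proportion to it.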

Problem 3: Over-under, Part I. Suppose we fit a regression line to predict the shelf life of an apple based on its weight. For a particular apple, we predict the shelf life to be $4.6$ days. The apple's residual is $-0.6$ days. Did we over- or under-estimate the shelf-life of the apple? Explain your reasoning.

Problem 3 Solution

Step 1 — Recall the residual definition: $\text{residual} = y - \hat{y}$, where $y$ is the observed value and $\hat{y}$ is the predicted value.

Step 2 — Solve for the observed $y$: Given $\hat{y} = 4.6$ and residual $= -0.6$:

$$ y = \hat{y} + \text{residual} = 4.6 + (-0.6) = 4.0 \text{ days}. $$

Step 3 — Compare prediction and observation: The prediction was 4.6 days but the actual shelf life was only 4.0 days. The predicted value was too high.

Answer: We over-estimated the shelf life. A negative residual always means the model predicted higher than the actual value.

Problem 4: Over-under, Part II. Suppose we fit a regression line to predict the number of incidents of skin cancer per 1,000 people from the number of sunny days in a year. For a particular year, we predict the incidence of skin cancer to be $1.5$ per 1,000 people, and the residual for this year is $0.5$. Did we over- or under-estimate the incidence of skin cancer? Explain your reasoning.

Problem 4 Solution

Step 1 — Solve for the observed $y$: Given $\hat{y} = 1.5$ per 1,000 people and residual $= 0.5$:

$$ y = \hat{y} + \text{residual} = 1.5 + 0.5 = 2.0 \text{ per 1,000 people}. $$

Step 2 — Compare prediction and observation: The prediction was 1.5 but the actual value was 2.0. The prediction was too low.

Answer: We under-estimated the incidence of skin cancer. A positive residual always means the model predicted lower than the actual value.

Problem 5: Tourism spending. The Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. Three plots are provided: scatterplot showing the relationship between these two variables along with the least squares fit, residuals plot, and histogram of residuals.

(a) Describe the relationship between number of tourists and spending. (b) What are the explanatory and response variables? (c) Why might we want to fit a regression line to these data? (d) Do the data meet the conditions required for fitting a least squares line? In addition to the scatterplot, use the residual plot and histogram to answer this question.

Problem 5 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a strong, positive, linear relationship between number of tourists and spending. As the number of tourists increases, tourist spending also increases in a roughly proportional way.

Step 2 — Identify variables (b): The explanatory (predictor) variable is the number of foreign tourists. The response variable is tourist spending. We use the number of tourists to predict or explain spending.

Step 3 — Reason for fitting a regression line (c):

- Quantify the average spending per additional tourist (the slope).

- Predict spending in a future year given a forecast of tourist numbers.

- Summarize the relationship with a single equation suitable for reporting and planning.

Step 4 — Check conditions (d): The four conditions for least squares are (i) linearity — scatterplot roughly straight; (ii) nearly normal residuals — histogram of residuals roughly bell-shaped; (iii) constant variability — residual plot has roughly constant spread; (iv) independent observations. In a time-series of tourism years, independence is suspect (successive years are correlated). If the residual plot shows fanning or curvature, the constant-variability and linearity conditions would also fail.

Answer: (a) Strong positive linear. (b) Explanatory: number of tourists; response: spending. (c) To quantify and predict spending per tourist. (d) Conditions are mostly satisfied based on the three plots, but independence is questionable because data are collected over consecutive years. Any inferences should be made cautiously.

Problem 6: Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

(a) Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain. (b) In this scenario, what are the explanatory and response variables? (c) Why might we want to fit a regression line to these data? (d) Do these data meet the conditions required for fitting a least squares line?

Problem 6 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a moderate, positive, linear relationship between calories and carbohydrates. Items with more calories tend to contain more carbs. There is noticeable scatter because calories also come from fat and protein, not just carbs.

Step 2 — Identify variables (b):

- Explanatory variable: calories (this is what Starbucks displays).

- Response variable: amount of carbohydrates (what we want to predict).

Step 3 — Reason for fitting the line (c): A customer who knows only the calorie count (displayed prominently) can estimate how many carbs a menu item has — useful for diet tracking (e.g. diabetes, low-carb diets).

Step 4 — Check conditions (d): Using the scatterplot and residual plot:

- Linearity: roughly satisfied; no obvious curve.

- Constant variability: need to inspect the residual plot for fanning. Often these food-item plots show some fanning at higher calorie counts, which is a mild violation.

- Nearly normal residuals: need the histogram; usually approximately symmetric.

- Independence: the menu items are distinct food items, so independence is reasonable.

Answer: (a) Moderate positive linear. (b) Explanatory: calories; response: carbs. (c) To predict carbs from the posted calorie count. (d) Generally yes, with only mild concerns about constant variability.

Problem 7: The Coast Starlight, Part II. Exercise 8.11 introduces data on the Coast Starlight Amtrak train that runs from Seattle to Los Angeles. The mean travel time from one stop to the next on the Coast Starlight is 129 minutes, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. The correlation between travel time and distance is $0.636$.

(a) Write the equation of the regression line for predicting travel time. (b) Interpret the slope and the intercept in this context. (c) Calculate $R^{2}$ of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret $R^{2}$ in the context of the application. (d) The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities. (e) It actually takes the Coast Starlight about 168 minutes to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value. (f) Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?

Problem 7 Solution

Step 1 — Write the regression equation (a): With $\bar{x} = 108$ miles, $\bar{y} = 129$ minutes, $s_x = 99$, $s_y = 113$, $r = 0.636$:

$$ b = r\,\frac{s_y}{s_x} = 0.636 \cdot \frac{113}{99} = 0.636 \times 1.1414 \approx 0.726. $$

$$ a = \bar{y} - b\,\bar{x} = 129 - 0.726 \times 108 = 129 - 78.4 = 50.6. $$

$$ \widehat{\text{time}} = 50.6 + 0.726 \times \text{distance}. $$

Step 2 — Interpret slope and intercept (b):

- Slope (0.726 min/mile): For each additional mile between stops, the travel time increases by about 0.73 minutes on average. Equivalently, each extra mile is covered at an implied speed of about $60 / 0.726 \approx 83$ mph.

- Intercept (50.6 min): The predicted travel time when the distance is 0 miles is about 51 minutes. This has little practical meaning (no trip has distance 0), but it can be read loosely as fixed time overhead (loading, unloading, acceleration).

Step 3 — Compute and interpret $R^2$ (c):

$$ R^2 = r^2 = 0.636^2 = 0.4045 \approx 40.5\%. $$

About 40.5% of the variability in travel time between stops is explained by the distance between them. The remaining 59.5% is left unexplained by distance and presumably reflects other factors such as grades, track conditions, and speed limits.

Step 4 — Predict for 103 miles (d):

$$ \widehat{\text{time}} = 50.6 + 0.726 \times 103 = 50.6 + 74.78 = 125.4 \text{ minutes}. $$

Step 5 — Compute residual (e):

$$ \text{residual} = 168 - 125.4 = 42.6 \text{ minutes}. $$

The actual travel time exceeded the prediction by about 42.6 minutes. The model under-predicted this segment, likely because the route into Los Angeles has congestion or speed restrictions not captured by distance alone.

Step 6 — 500-mile extrapolation (f): The distances between stops have a mean of 108 miles and a standard deviation of 99 miles, so the observed distances likely extend to a few hundred miles at most. A 500-mile stretch is about $(500 - 108)/99 \approx 4$ standard deviations above the mean, far beyond the typical distance between stops, so predicting its travel time would be an extrapolation. It is not appropriate to use the model for a 500-mile segment.

Answer: (a) $\widehat{\text{time}} = 50.6 + 0.726\,\text{distance}$. (b) Slope: +0.73 min/mile; intercept: ≈51 min (no practical meaning). (c) $R^2 \approx 0.40$; 40% of variability explained. (d) ≈125 minutes. (e) Residual ≈ +42.6 min (under-predicted). (f) No — extrapolation well beyond data range.
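The arithmetic in Steps 1 through 5 can be reproduced with a short Python sketch. The function name `regression_from_summary` and the variable names are illustrative choices, not part of the exercise; only the summary statistics come from the problem statement. Because the code carries full precision, intermediate values may differ in the last digit from the hand-rounded ones above.

```python
def regression_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept a, slope b) of the least squares line y-hat = a + b*x."""
    b = r * s_y / s_x        # slope: correlation times ratio of standard deviations
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Summary statistics quoted in Problem 8.23 (distance in miles, time in minutes)
r = 0.636
a, b = regression_from_summary(x_bar=108, y_bar=129, s_x=99, s_y=113, r=r)

r_squared = r ** 2              # share of variability in time explained by distance
predicted = a + b * 103         # part (d): Santa Barbara to Los Angeles
residual = 168 - predicted      # part (e): observed minus predicted

print(f"time-hat = {a:.1f} + {b:.3f} * distance")
print(f"R^2 = {r_squared:.3f}")
print(f"prediction = {predicted:.1f} min, residual = {residual:.1f} min")
```

The same function applies to any exercise that supplies the two means, the two standard deviations, and the correlation.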

Problem 8: Body measurements, Part III. Exercise 8.13 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is $0.67$.

(a) Write the equation of the regression line for predicting height. (b) Interpret the slope and the intercept in this context. (c) Calculate $R^{2}$ of the regression line for predicting height from shoulder girth, and interpret it in the context of the application. (d) A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model. (e) The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means. (f) A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?

Problem 8.24 Solution

Step 1 — Write the regression equation (a): $\bar{x} = 107.2$ (shoulder girth, cm), $\bar{y} = 171.14$ (height, cm), $s_x = 10.37$, $s_y = 9.41$, $r = 0.67$:

$$ b = r\,\frac{s_y}{s_x} = 0.67 \cdot \frac{9.41}{10.37} = 0.67 \times 0.9074 \approx 0.608. $$ $$ a = \bar{y} - b\,\bar{x} = 171.14 - 0.608 \times 107.2 = 171.14 - 65.18 = 105.96. $$ $$ \widehat{\text{height}} = 105.96 + 0.608 \times \text{shoulder\_girth}. $$

Step 2 — Interpret coefficients (b):

- Slope (0.608 cm/cm): For each additional 1 cm of shoulder girth, height increases by about 0.61 cm on average.
- Intercept (105.96 cm): The predicted height when shoulder girth is 0 cm. Not meaningful in practice (no one has a shoulder girth of 0).

Step 3 — Compute $R^2$ (c):

$$ R^2 = 0.67^2 = 0.4489 \approx 44.9\%. $$

About 44.9% of the variation in height is explained by shoulder girth. The rest comes from other body-size and genetic factors.

Step 4 — Predict for 100 cm shoulder girth (d):

$$ \widehat{\text{height}} = 105.96 + 0.608 \times 100 = 105.96 + 60.8 = 166.76 \text{ cm}. $$

Step 5 — Compute residual (e):

$$ \text{residual} = 160 - 166.76 = -6.76 \text{ cm}. $$

The student is about 6.76 cm shorter than the model predicted. The model over-predicted this student's height by about 6.76 cm.

Step 6 — One-year-old shoulder girth of 56 cm (f): The sample's shoulder girths have mean 107.2 cm with sd 10.37 cm. A one-year-old's 56 cm has a standardized value of $(56 - 107.2)/10.37 \approx -4.9$, that is, about 4.9 standard deviations below the mean, well outside the data range. Also, the sample was of adults, not infants. Using the model for a one-year-old would be extrapolation and is not appropriate.

Answer: (a) $\widehat{\text{height}} = 105.96 + 0.608\,\text{shoulder\_girth}$. (b) Slope: 0.61 cm per cm; intercept: 105.96 cm (no practical meaning). (c) $R^2 \approx 0.45$. (d) ≈166.76 cm. (e) Residual ≈ −6.76 cm (over-predicted). (f) No — extrapolation.

Problem 9: Murders and poverty, Part I. The following regression output is for predicting annual murders per million from percentage living in poverty in a random sample of 20 metropolitan areas.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-29.901	7.789	-3.839	0.001
poverty%	2.559	0.390	6.562	0.000
s = 5.512	R² = 70.52%	R²adj = 68.89%

(a) Write out the linear model. (b) Interpret the intercept. (c) Interpret the slope. (d) Interpret $R^{2}$. (e) Calculate the correlation coefficient.

Problem 8.25 Solution

Step 1 — Write out the linear model (a): From the regression output, intercept = $-29.901$ and slope on poverty% = $2.559$:

$$ \widehat{\text{murders}} = -29.901 + 2.559 \times \text{poverty\%}. $$

Step 2 — Interpret the intercept (b): The predicted annual murders per million when the poverty percentage is 0% is about $-29.9$. A negative murder rate is impossible, so the intercept has no practical meaning here — no real area has 0% poverty, so $x = 0$ is outside the data range (extrapolation).

Step 3 — Interpret the slope (c): For each additional 1 percentage point of population living in poverty, the annual murder rate increases by about 2.559 per million on average.

Step 4 — Interpret $R^2$ (d): $R^2 = 70.52\%$. About 70.5% of the variability in the annual murder rate across metro areas is explained by the poverty percentage. The remaining ~30% is not explained by the model and reflects other factors (education, employment, population density, drug policy, etc.).

Step 5 — Compute the correlation coefficient (e): $r = \pm \sqrt{R^2} = \pm \sqrt{0.7052} \approx \pm 0.8398$. The slope is positive ($b = 2.559 > 0$), so:

$$ r \approx +0.840. $$

Answer: (a) $\widehat{\text{murders}} = -29.901 + 2.559\,\text{poverty\%}$. (b) Intercept has no practical meaning (extrapolation). (c) +2.559 murders/million per 1 pp increase in poverty. (d) 70.5% of variability explained. (e) $r \approx +0.840$.
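Step 5 can be sketched in Python as follows. The helper name `correlation_from_r_squared` is a hypothetical choice made for illustration; the only inputs are the $R^2$ and slope values read from the regression output.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Recover r from R^2; the sign of r always matches the sign of the slope."""
    magnitude = math.sqrt(r_squared)
    return math.copysign(magnitude, slope)


# Values read from the murders-and-poverty regression output
r = correlation_from_r_squared(r_squared=0.7052, slope=2.559)
print(f"r = {r:+.3f}")
```

Taking the square root alone loses the direction of the relationship, which is why the slope has to be passed in as well.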

Problem 10: Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-0.357	0.692	-0.515	0.607
body wt	4.034	0.250	16.119	0.000
s = 1.452	R² = 64.66%	R²adj = 64.41%

(a) Write out the linear model. (b) Interpret the intercept. (c) Interpret the slope. (d) Interpret $R^{2}$. (e) Calculate the correlation coefficient.

Problem 8.26 Solution

Step 1 — Write out the linear model (a):

$$ \widehat{\text{heart\_wt}} = -0.357 + 4.034 \times \text{body\_wt}. $$

Step 2 — Interpret the intercept (b): The predicted heart weight when a cat's body weight is 0 kg is $-0.357$ g — impossible (body weight 0 means no cat), so the intercept is not meaningful practically. It is a mathematical artifact.

Step 3 — Interpret the slope (c): For each additional kilogram of body weight, a cat's heart weight increases by about 4.034 grams on average. The relationship is positive and physically sensible. Its strength is measured by $R^2$ and $r$ (parts d and e), not by the size of the slope.

Step 4 — Interpret $R^2$ (d): $R^2 = 64.66\%$. About 65% of the variability in cats' heart weights is explained by body weight. The rest is due to breed, age, sex, fitness, etc.

Step 5 — Compute the correlation (e): $r = \pm \sqrt{0.6466} \approx \pm 0.804$. Slope is positive (+4.034), so:

$$ r \approx +0.804. $$

Answer: (a) $\widehat{\text{heart\_wt}} = -0.357 + 4.034\,\text{body\_wt}$. (b) Intercept not meaningful. (c) Heart weight increases by about 4.03 g per kg of body weight. (d) $R^2 \approx 65\%$ — 65% of variability in heart weight explained by body weight. (e) $r \approx +0.804$.

Problem 11: Outliers, Part I. Identify the outliers in the scatterplots shown below, and determine what type of outliers they are. Explain your reasoning.

(a)

(b)

(c)

Problem 8.27 Solution

Step 1 — Outlier-type framework: An outlier can be described using three related ideas:

- High leverage means the point has an $x$-value far from the rest of the data.
- Influential means removing the point would noticeably change the slope or intercept.
- Large residual only means the point is vertically far from the line but has an $x$-value near the center, so it does not have much pull on the slope.

Step 2 — Typical answers for three panels:

- (a) A point far to the right of the cloud, on the line → high leverage, not influential.
- (b) A point in the middle of the $x$ range but far above the line → large residual, not high leverage, not very influential (slope barely changes).
- (c) A point far to the right of the cloud and far from the line → high leverage AND influential.

Answer: (a) High leverage, not influential. (b) Large residual, not high leverage, minimally influential. (c) High leverage AND influential — this point pulls the slope noticeably.
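The three typical cases can be checked numerically. The sketch below uses a small made-up cloud lying exactly on $y = 2x$ and three made-up extra points; none of these numbers come from the exercise's scatterplots. Leverage is computed with the standard simple-regression formula $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, and influence is measured as the change in slope when the point is added. The function names are illustrative.

```python
import statistics


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leverage(xs, i):
    """Leverage of point i in simple regression: 1/n + (x_i - x_bar)^2 / Sxx."""
    x_bar = statistics.fmean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return 1 / len(xs) + (xs[i] - x_bar) ** 2 / sxx


def diagnose(cloud_x, cloud_y, point):
    """Add one point to the cloud; return its leverage and the slope shift it causes."""
    xs = cloud_x + [point[0]]
    ys = cloud_y + [point[1]]
    _, slope_with = fit_line(xs, ys)
    _, slope_without = fit_line(cloud_x, cloud_y)
    return leverage(xs, len(xs) - 1), slope_with - slope_without


# Hypothetical data: six points exactly on y = 2x, plus one extra point per case
cloud_x = [1, 2, 3, 4, 5, 6]
cloud_y = [2, 4, 6, 8, 10, 12]
panels = {
    "far right, on the line": (20, 40),
    "middle, far above the line": (4, 20),
    "far right, off the line": (20, 10),
}

results = {label: diagnose(cloud_x, cloud_y, pt) for label, pt in panels.items()}
for label, (h, shift) in results.items():
    print(f"{label}: leverage = {h:.2f}, slope shift = {shift:+.2f}")
```

In this toy data the two far-right points share the same leverage (about 0.94), but only the one off the line moves the slope substantially (by about 1.7). The middle point has low leverage (about 0.15) and shifts the slope by only about 0.3.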

Problem 12: Outliers, Part II. Identify the outliers in the scatterplots shown below and determine what type of outliers they are. Explain your reasoning.

(a)

(b)

(c)

Problem 8.28 Solution

Step 1 — Reapply the outlier-type framework from Problem 8.27.

Step 2 — Typical answers for three panels:

- (a) A point in the middle of the $x$ range but far below the line → large residual, low leverage, low influence.
- (b) A point far to the right, off the line → high leverage AND influential.
- (c) A small cluster of points off to one side (e.g. far right) → collectively high leverage and influential; the cluster shifts the regression line.

Answer: (a) Large residual, low leverage, low influence. (b) High leverage AND influential. (c) The cluster acts as a collective high-leverage, influential group.

Problem 13: Urban homeowners, Part I. The scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas. There are 52 observations, each corresponding to a state in the US. Puerto Rico and District of Columbia are also included.

(a) Describe the relationship between the percent of families who own their home and the percent of the population living in urban areas. (b) The outlier at the bottom right corner is District of Columbia, where $100\%$ of the population is considered urban. What type of an outlier is this observation?

Problem 8.29 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a moderate negative linear relationship: states with a higher percentage of urban population tend to have a lower percentage of families owning their home. Apartments are more common in urban areas, driving down ownership rates.

Step 2 — Classify the District of Columbia outlier (b): DC sits at the far right (100% urban) with a much lower homeownership percentage than the trend would predict. Its $x$-value is far from the center of the $x$ distribution → high leverage. It also falls well away from the regression line the other 51 observations would produce → influential. DC is therefore both an outlier with high leverage and an influential point.

Answer: (a) Moderate negative linear — more urban means less ownership. (b) DC is a high-leverage, influential outlier.

Problem 14: Crawling babies, Part II. Exercise 8.12 introduces data on the average monthly temperature during the month babies first try to crawl (about 6 months after birth) and the average first crawling age for babies born in a given month. A scatterplot of these two variables reveals a potential outlying month when the average temperature is about $53^{\circ}\text{F}$ and average crawling age is about 28.5 weeks. Does this point have high leverage? Is it an influential point?

Problem 8.30 Solution

Step 1 — Recall the Crawling Babies setup: The study has 12 data points (one per birth month). The outlying month has $x = 53^\circ\text{F}$ and $y = 28.5$ weeks.

Step 2 — Is it high leverage? The mean temperature across the 12 months is somewhere in the middle of the observed range (probably around 50–60°F). An $x$-value of 53°F is close to the center of the $x$ distribution, not far from $\bar{x}$. Therefore this point does NOT have high leverage.

Step 3 — Is it influential? In this framework, an influential point is one that has high leverage and whose removal would noticeably change the regression line's slope or intercept. Since this point does not have high leverage, it cannot be strongly influential: removing it would barely move the slope. However, it is a point with an unusually large residual (vertical distance from the line).

Answer: No, it does not have high leverage, and it is not influential. It is just an observation with a large residual — a point that is vertically far from the line but whose $x$-value is near the middle of the data, so it cannot pull the slope in either direction.

Raw \(x\)	\(\log_{10}(x)\)
5	0.70
8	0.90
12	1.08
15	1.18
20	1.30
40	1.60
80	1.90
200	2.30
900	2.95
5000	3.70

8.4 Inference for the Slope of a Regression Line

Here we encounter our last confidence interval and hypothesis test procedures, this time for making inferences about the slope of the population regression line. We can use these tools to answer questions such as:

Is the unemployment rate a significant linear predictor for the loss of the President's party in the House of Representatives?
On average, how much less in college gift aid do students receive when their parents earn an additional \$1,000 in income?

Context Pause

Until now, the regression line has been a description — a best-fit line drawn through the data we actually collected. But the points we have are just one sample from a much larger population of possible observations. The slope we computed could have come out a bit higher or a bit lower if we had sampled different people. Slope inference is how we turn "the line we drew" into a statement about "the line that lives in the population" — confidence intervals give us a plausible range for the true slope, and hypothesis tests let us decide whether the true slope could be zero.

Insight Note

If you can run a $t$-test for a mean, you can run a $t$-test for a slope. The machinery is identical: point estimate, standard error, degrees of freedom, $t$-statistic, p-value. The only new thing is where the numbers come from — you read them off a regression output table instead of computing $\bar{x}$ and $s$ by hand.

Learning Objectives

Source: Main Text

By the end of this section, you should be able to:

In this section, you will learn to:

1. Recognize that the slope of the sample regression line is a point estimate and has an associated standard error.

2. Read the results of computer regression output and identify the quantities needed for inference for the slope of the regression line — specifically the slope of the sample regression line, the SE of the slope, and the degrees of freedom.

3. State and verify whether the conditions are met for inference on the slope of the regression line using the $t$-distribution.

4. Carry out a complete confidence interval procedure for the slope of the regression line.

5. Carry out a complete hypothesis test for the slope of the regression line.

6. Distinguish between when to use the $t$-test for the slope of a regression line and when to use the one-sample $t$-test for a mean of differences.

8.4.1 The Role of Inference for Regression Parameters

Previously, we found the equation of the regression line for predicting gift aid from family income at Elmhurst College. The slope, $b$, was equal to $-0.0431$. This is the slope for our sample data. However, the sample was taken from a larger population. We would like to use the slope computed from our sample data to estimate the slope of the population regression line.

The equation for the population regression line can be written as

$$ \mu_y = \alpha + \beta x $$

Here, $\alpha$ and $\beta$ represent two model parameters — the $y$-intercept and the slope of the true (population) regression line. (This use of $\alpha$ and $\beta$ has nothing to do with the $\alpha$ and $\beta$ we used previously to represent the probabilities of Type I and Type II errors!) The parameters $\alpha$ and $\beta$ are estimated using data. We can look at the equation of the regression line calculated from a particular data set:

$$ \hat{y} = a + b x $$

and see that $a$ and $b$ are point estimates for $\alpha$ and $\beta$, respectively. If we plug in the values of $a$ and $b$, the regression equation for predicting gift aid based on family income is:

$$ \hat{y} = 24{.}3193 - 0{.}0431 x $$

The slope of the sample regression line, $-0.0431$, is our best estimate for the slope of the population regression line, but there is variability in this estimate since it is based on a sample. A different sample would produce a somewhat different estimate of the slope. The standard error of the slope tells us the typical variation in the slope of the sample regression line and the typical error in using this slope to estimate the slope of the population regression line.

We would like to construct a 95% confidence interval for $\beta$, the slope of the population regression line. As with means, inference for the slope of a regression line is based on the $t$-distribution.

Context Pause

The leap from $b$ to $\beta$ is the same leap you've been making all along — from $\bar{x}$ to $\mu$, from $\hat{p}$ to $p$. Sample statistic to population parameter. The slope is just another statistic that wobbles from sample to sample, and $\beta$ is the fixed (but unknown) quantity we'd love to pin down.
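One way to see this wobble is to simulate it. The sketch below invents a population in which the true line is known, draws many samples of 50, and fits a slope to each. Every number in it is a hypothetical assumption chosen for illustration (a true intercept of 24.3, a true slope of $-0.043$, incomes spread uniformly from 0 to 250, and normal noise with standard deviation 5); none of it is the Elmhurst data.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least squares slope b = Sxy / Sxx."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def simulate_slopes(alpha, beta, sigma, n, reps, seed=0):
    """Draw `reps` samples of size n from mu_y = alpha + beta*x and fit each one."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    slopes = []
    for _ in range(reps):
        xs = [rng.uniform(0, 250) for _ in range(n)]
        ys = [alpha + beta * x + rng.gauss(0, sigma) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return slopes


# Hypothetical population parameters (assumptions, not estimates from real data)
slopes = simulate_slopes(alpha=24.3, beta=-0.043, sigma=5.0, n=50, reps=2000)

print(f"mean of sample slopes: {statistics.fmean(slopes):.4f}")
print(f"sd of sample slopes:   {statistics.stdev(slopes):.4f}")
```

The sample slopes cluster around the true $\beta$, and their standard deviation is the quantity that the standard error of the slope estimates from a single sample.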

Example 8.4.1

Source: Main Text

The intercept and slope estimates for the Elmhurst data are $a = 24{,}319$ and $b = -0.0431$ when both gift aid and family income are measured in dollars. What do these numbers really mean?

Solution:

Interpreting the slope parameter is helpful in almost any application. For each additional \$1,000 of family income, we would expect the change in aid to be \$1,000 times the slope, $-0.0431$, which is \$43.10 less in aid on average. Note that higher family income corresponds to less aid because the coefficient of family income is negative in the model.

We must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational. That is, increasing a student's family income may not cause the student's aid to drop. (It would be reasonable to contact the college and ask if the relationship is causal, i.e. if Elmhurst College's aid decisions are partially based on students' family income.)

The estimated intercept $a = 24{,}319$ describes the average aid if a student's family had no income. The meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is \$0. In other applications, the intercept may have little or no practical value if there are no observations where $x$ is near zero.

> Inference for the Slope of a Regression Line
>
> Inference for the slope of a regression line is based on the $t$-distribution with $n - 2$ degrees of freedom, where $n$ is the number of paired observations.

Once we verify that conditions for using the $t$-distribution are met, we will be able to construct the confidence interval for the slope using a critical value $t^{\star}$ based on $n - 2$ degrees of freedom. We will use a table of the regression summary to find the point estimate and standard error for the slope.

Insight Note

Why $n - 2$ degrees of freedom instead of $n - 1$? A regression line is determined by two quantities — a slope and an intercept — so two pieces of information are "used up" when we estimate them from the data. Every time we estimate an additional parameter, we lose one degree of freedom.

Try It Now 8.4.1

Source: Main Text

A study of 18 city blocks regresses observed traffic speed (mph) on posted speed limit (mph). The software output reports a sample slope of $b = 0.84$ with standard error $SE = 0.12$.

(a) What is the point estimate for the population slope $\beta$? (b) How many degrees of freedom should you use for inference on $\beta$? (c) In one sentence, interpret the slope in context.

Solution

(a) The point estimate for $\beta$ is the sample slope $b = 0.84$.

(b) With $n = 18$ blocks, the degrees of freedom for slope inference are $df = n - 2 = 18 - 2 = 16$.

(c) For each additional 1 mph increase in the posted speed limit, observed traffic speed increases by about 0.84 mph on average, based on the sample.

8.4.2 Conditions for the Least Squares Line

Conditions for inference in the context of regression can be more complicated than when dealing with means or proportions.

Inference for parameters of a regression line involves the following assumptions:

Linearity. The true relationship between the two variables follows a linear trend. We check whether this is reasonable by examining whether the data follow a linear trend. If there is a nonlinear trend (e.g. left panel of Figure 8.26), an advanced regression method from another book or later course should be applied.

Nearly normal residuals. For each $x$-value, the residuals should be nearly normal. When this assumption is found to be unreasonable, it is usually because of outliers or concerns about influential points. An example that suggests non-normal residuals is shown in the second panel of Figure 8.26. If the sample size $n \ge 30$, then this assumption is not necessary.

Constant variability. The variability of points around the true least squares line is constant for all values of $x$. An example of non-constant variability is shown in the third panel of Figure 8.26.

Independent. The observations are independent of one another. The observations can be considered independent when they are collected from a random sample or randomized experiment. Be careful of data collected sequentially in what is called a time series. An example of data collected in such a fashion is shown in the fourth panel of Figure 8.26.

We see in Figure 8.26 that patterns in the residual plots suggest that the assumptions for regression inference are not met in those four examples. In fact, nonlinear trends, outliers, and non-constant variability are often easier to detect in a residual plot than in a scatterplot.

We note that the second assumption regarding nearly normal residuals is particularly difficult to assess when the sample size is small. We can make a graph, such as a histogram, of the residuals, but we cannot expect a small data set to be nearly normal. All we can do is look for excessive skew or outliers. Outliers and influential points in the data can be seen from the residual plot as well as from a histogram of the residuals.

Conditions for Inference on the Slope of a Regression Line

1. The data are collected from a random sample or randomized experiment.

2. The residual plot appears as a random cloud of points and does not have any patterns or significant outliers that would suggest that the linearity, nearly normal residuals, constant variability, or independence assumptions are unreasonable.

Figure 8.26: Four examples showing when the inference methods in this chapter are insufficient to apply to the data. In the left panel, a straight line does not fit the data. In the second panel, there are outliers: two points on the left are relatively distant from the rest of the data, and one of these points is very far away from the line. In the third panel, the variability of the data around the line increases with larger values of $x$. In the last panel, a time series data set is shown, where successive observations are highly correlated.

Figure 8.27: Left: Scatterplot of gift aid versus family income for 50 freshmen at Elmhurst College. Right: Residual plot for the model shown in left panel.

Context Pause

Why so many conditions just to test a slope? Because every one of them corresponds to a way the $t$-distribution can lie to you. Violating linearity means the regression line itself is the wrong model. Violating constant variability means the standard error is wrong. Violating independence means the whole sampling-distribution story breaks down. The residual plot is your single best diagnostic — one picture, four conditions.

Insight Note

A residual plot is essentially a scatterplot with the linear trend subtracted out. If the line genuinely captures the relationship, what's left over should look like random static. Any pattern you can see in the residuals is a pattern the line failed to capture.
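The idea of subtracting out the linear trend can be made concrete with a few lines of Python. The six data points below are made up for illustration, and the function names are arbitrary. The sketch fits the least squares line, computes the residuals, and checks two properties that always hold for a least squares fit: the residuals sum to zero and have no remaining linear association with $x$.

```python
import statistics


def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residuals(xs, ys):
    """Observed minus predicted for every point: what is left after removing the line."""
    a, b = least_squares(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data, roughly linear with a little scatter
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

res = residuals(xs, ys)
sum_of_residuals = sum(res)
sum_of_x_times_residual = sum(x * e for x, e in zip(xs, res))

print([round(e, 3) for e in res])
print(f"sum of residuals: {sum_of_residuals:.2e}")
print(f"sum of x * residual: {sum_of_x_times_residual:.2e}")
```

Both sums come out at zero up to floating-point error, so any structure that still shows up when the residuals are plotted against $x$ must be curvature, changing spread, or outliers rather than leftover linear trend.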

Try It Now 8.4.2

Source: Main Text

You fit a regression line to 40 observations and your residual plot shows the residuals fanning out — small near the left side of the plot and much larger near the right side. Which regression condition is violated, and what does this imply about using the $t$-procedures for the slope?

Solution

A fan-shaped residual plot violates the constant variability condition. The typical size of the residuals is not the same for all $x$-values, so the standard error for the slope reported by the software will not be trustworthy. The $t$-interval and $t$-test for the slope should not be applied directly to this data set. A transformation of the variables (like taking a log of $y$) or a more advanced regression method would be needed before inference is appropriate.

8.4.3 Constructing a Confidence Interval for the Slope of a Regression Line

We would like to construct a confidence interval for the slope of the regression line for predicting gift aid based on family income for all freshmen at Elmhurst College.

Do conditions seem to be satisfied? We recall that the 50 freshmen in the sample were randomly chosen, so the observations are independent. Next, we need to look carefully at the scatterplot and the residual plot.

Always Check Conditions

Do not blindly apply formulas or rely on regression output; always first look at a scatterplot or a residual plot. If conditions for fitting the regression line are not met, the methods presented here should not be applied.

The scatterplot seems to show a linear trend, which matches the fact that there is no curved trend apparent in the residual plot. Also, the standard deviation of the residuals is mostly constant for different $x$ values and there are no outliers or influential points. There are no patterns in the residual plot that would suggest that a linear model is not appropriate, so the conditions are reasonably met. We are now ready to calculate the 95% confidence interval.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	24.3193	1.2915	18.83	0.0000
family_income	-0.0431	0.0108	-3.98	0.0002

Figure 8.28: Summary of least squares fit for the Elmhurst College data, where we are predicting gift aid by the university based on the family income of students.

Example 8.4.2

Source: Main Text

Construct a 95% confidence interval for the slope of the regression line for predicting gift aid from family income at Elmhurst College.

Solution:

As usual, the confidence interval will take the form:

$$ \text{point estimate} \;\pm\; \text{critical value} \times SE\ \text{of estimate} $$

The point estimate for the slope of the population regression line is the slope of the sample regression line: $-0.0431$. The standard error of the slope can be read from the table as $0.0108$. Note that we do not need to divide $0.0108$ by $\sqrt{n}$ or do any further calculations on $0.0108$; $0.0108$ is the $SE$ of the slope. Note that the value of $t$ given in the table refers to the test statistic, not the critical value $t^{\star}$.

To find $t^{\star}$, we use a $t$-table. Here $n = 50$, so $df = 50 - 2 = 48$. Using a $t$-table, we round down to row $df = 40$ and estimate the critical value $t^{\star} = 2.021$ for a 95% confidence level. The confidence interval is calculated as:

$$ -0{.}0431 \pm 2{.}021 \times 0{.}0108 = (-0{.}065,\ -0{.}021) $$

Note: $t^{\star}$ using exactly 48 degrees of freedom is equal to $2.01$ and gives the same interval of $(-0.065, -0.021)$.
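The interval arithmetic in this example can be sketched in Python. The function name `slope_confidence_interval` is an illustrative choice, and the critical value is passed in by hand from the $t$-table rather than computed. If SciPy happens to be installed, `scipy.stats.t.ppf(0.975, 48)` should return the exact-df critical value, but this sketch does not depend on it.

```python
def slope_confidence_interval(b, se, t_star):
    """Return (lower, upper) for: point estimate +/- t_star * SE."""
    margin = t_star * se
    return b - margin, b + margin


# Slope and SE read from the regression table; t* = 2.021 is the table value at df = 40
low, high = slope_confidence_interval(b=-0.0431, se=0.0108, t_star=2.021)
print(f"95% CI for the slope: ({low:.3f}, {high:.3f})")

# Using t* = 2.01 (df = 48) gives the same interval to three decimal places
low_48, high_48 = slope_confidence_interval(b=-0.0431, se=0.0108, t_star=2.01)
print(f"with df = 48:         ({low_48:.3f}, {high_48:.3f})")
```

Keeping $t^{\star}$ as an explicit argument mirrors the hand calculation, where the critical value is looked up separately from the regression output.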

Example 8.4.3

Source: Main Text

Interpret the confidence interval in context. What can we conclude?

Solution:

We are 95% confident that the slope of the population regression line (the true average change in gift aid, in thousands of dollars, for each additional \$1,000 in family income) is between $-0.065$ and $-0.021$. That is, we are 95% confident that, on average, when family income is \$1,000 higher, gift aid is between \$21 and \$65 lower.

Because the entire interval is negative, we have evidence that the slope of the population regression line is less than $0$. In other words, we have evidence that there is a significant negative linear relationship between gift aid and family income.

> Constructing a Confidence Interval for the Slope of a Regression Line
>
> To carry out a complete confidence interval procedure to estimate the slope of the population regression line $\beta$:

Identify: Identify the parameter and the confidence level, C%.

The parameter will be a slope of the population regression line — e.g. the slope of the population regression line relating air quality index to average rainfall per year for each city in the United States.

Choose: Choose the correct interval procedure and identify it by name.

To estimate the slope of a regression model we use a $t$-interval for the slope.

Check: Check conditions for using a $t$-interval for the slope.

1. Independence: Data should come from a random sample or randomized experiment. If sampling without replacement, check that the sample size is less than 10% of the population size.

2. Linearity: Check that the scatterplot does not show a curved trend and that the residual plot shows no ∪-shape pattern.

3. Constant variability: Use the residual plot to check that the standard deviation of the residuals is constant across all $x$-values.

4. Normality: The population of residuals is nearly normal or the sample size is $\ge 30$. If the sample size is less than 30 check for strong skew or outliers in the sample residuals. If neither is found, then the condition that the population of residuals is nearly normal is considered reasonable.

Calculate: Calculate the confidence interval and record it in interval form.

$$ \text{point estimate} \;\pm\; t^{\star} \times SE\ \text{of estimate}, \quad df = n - 2 $$

- point estimate: the slope $b$ of the sample regression line
- $SE$ of estimate: $SE$ of slope (find using computer output)
- $t^{\star}$: use a $t$-distribution with $df = n - 2$ and confidence level C%

Conclude: Interpret the interval and, if applicable, draw a conclusion in context.

We are C% confident that the true slope of the regression line — the average change in $y$ for each unit increase in $x$ — is between ___ and ___. If applicable, draw a conclusion based on whether the interval is entirely above, is entirely below, or contains the value $0$.

Figure 8.29: Left: Scatterplot of head length versus total length for 104 brushtail possums. Right: Residual plot for the model shown in left panel.

Example 8.4.4

Source: Main Text

The regression summary below shows statistical software output from fitting the least squares regression line for predicting head length from total length for 104 brushtail possums. The scatterplot and residual plot are shown above.

Predictor	Coef	SE Coef	T	P
Constant	42.70979	5.17281	8.257	5.66e-13
total_length	0.57290	0.05933	9.657	4.68e-16
S = 2.595	R-Sq = 47.76%	R-Sq(adj) = 47.25%

Construct a 95% confidence interval for the slope of the regression line. Is there convincing evidence that there is a positive, linear relationship between head length and total length?

Solution:

Identify: The parameter of interest is the slope of the population regression line for predicting head length from total length. We want to estimate this at the 95% confidence level.

Choose: Because the parameter to be estimated is the slope of a regression line, we will use the $t$-interval for the slope.

Check: These data come from a random sample. The residual plot shows no pattern, so a linear model seems reasonable. The residual plot also shows that the residuals have constant standard deviation. Finally, $n = 104 \ge 30$ so we do not have to check for skew in the residuals. All four conditions are met.

Calculate: We will calculate the interval: $\text{point estimate} \pm t^{\star} \times SE\ \text{of estimate}$.

We read the slope of the sample regression line and the corresponding $SE$ from the table. The point estimate is $b = 0.57290$. The $SE$ of the slope is $0.05933$, which can be found next to the slope of $0.57290$. The degrees of freedom is $df = n - 2 = 104 - 2 = 102$. As before, we find the critical value $t^{\star}$ using a $t$-table (the $t^{\star}$ value is not the same as the $T$-statistic for the hypothesis test). Using the $t$-table at row $df = 100$ (round down since 102 is not on the table) and confidence level 95%, we get $t^{\star} = 1.984$.

So the 95% confidence interval is given by:

$$ 0{.}57290 \pm 1{.}984 \times 0{.}05933 = 0{.}57290 \pm 0{.}11771 $$ $$ (0{.}455,\ 0{.}691) $$

Conclude: We are 95% confident that the slope of the population regression line is between $0.455$ and $0.691$. That is, we are 95% confident that the true average increase in head length for each additional cm in total length is between $0.455$ mm and $0.691$ mm. Because the interval is entirely above $0$, we have convincing evidence of a positive linear association between head length and total length for brushtail possums.
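
The Calculate step can be checked with a few lines of Python. This is a minimal sketch rather than part of the original example: the function name `slope_confidence_interval` is hypothetical, and $t^{\star}$ is passed in from the $t$-table (row $df = 100$) instead of being computed.

```python
def slope_confidence_interval(b, se, t_star):
    """Return (lower, upper) for the interval b +/- t_star * se."""
    margin = t_star * se  # margin of error
    return b - margin, b + margin


# Values read from the possum regression output:
# slope b = 0.57290, SE = 0.05933, t* = 1.984 (t-table, row df = 100).
lower, upper = slope_confidence_interval(0.57290, 0.05933, 1.984)
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
print("Interval entirely above 0:", lower > 0)
```

Passing $t^{\star}$ as an argument keeps the sketch free of external libraries and mirrors the table lookup used in the text.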

Guided Practice 8.22

Source: Main Text

Figure 8.21 shows statistical software output from fitting the least squares regression line shown in Figure 8.15. Use this output to formally evaluate the following hypotheses.

$H_0$: The true coefficient for family income is zero.

$H_A$: The true coefficient for family income is not zero.

Figure 8.21: Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	24319.3	1291.5	18.83	<0.0001
family_income	-0.0431	0.0108	-3.98	0.0002
				df = 48

Solution

From the table, the point estimate is $b = -0.0431$ with $SE = 0.0108$ on $df = 48$. The test statistic is

$$ T = \frac{-0{.}0431 - 0}{0{.}0108} \approx -3{.}99, $$

which agrees with the reported $-3.98$ up to rounding of the table entries, and the two-sided p-value is $0.0002$. Since the p-value is far below any standard significance level (e.g. $\alpha = 0.05$), we reject $H_0$. The data provide convincing evidence that the true coefficient for family income is not zero; that is, there is a linear relationship between family income and gift aid at Elmhurst College.

Insight Note

Every confidence interval for a slope is also a hypothesis test in disguise. If $0$ is inside the interval, you cannot rule out "no linear relationship." If $0$ is outside, you can. For the Elmhurst data the interval is $(-0.065, -0.021)$ — entirely negative — which matches the test's conclusion that $\beta \ne 0$.

Try It Now 8.4.3

Source: Main Text

A regression of a child's reading score on time spent reading at home (hours/week) for $n = 25$ students gives $b = 1.6$ with $SE = 0.55$. Assume the four conditions for slope inference are met.

(a) What is the critical value $t^{\star}$ for a 95% confidence interval, using $df = n - 2$? (Use $df = 20$ from the table.) (b) Compute the 95% confidence interval for $\beta$. (c) Does the interval contain $0$? What does that tell you?

Solution

(a) $df = 25 - 2 = 23$. Rounding down to $df = 20$ in the $t$-table, $t^{\star} \approx 2.086$.

(b) CI: $1.6 \pm 2.086 \times 0.55 = 1.6 \pm 1.147 = (0.453,\ 2.747)$.

(c) The interval is entirely above $0$, so we have evidence of a positive linear relationship between time spent reading at home and reading score. Each additional hour per week is associated with an increase of roughly $0.45$ to $2.75$ points on the reading score, on average.

Context Pause

The critical value $t^{\star}$ and the test statistic $T$ are easy to confuse because they're both on the $t$-distribution and both look like "t-something." Keep them distinct: $t^{\star}$ is a table lookup determined only by $df$ and the confidence level; it never depends on your data. $T$ is computed from your data and measures how far the observed slope is from zero in standard-error units. Mixing them up is a common slope-inference error.

8.4.4 Midterm Elections and Unemployment

Elections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S. Presidential election. The set of House elections occurring during the middle of a Presidential term are called midterm elections. In America's two-party system, one political theory suggests the higher the unemployment rate, the worse the President's party will do in the midterm elections.

To assess the validity of this claim, we can compile historical data and look for a connection. We consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression. Figure 8.30 shows these data and the least-squares regression line:

$$ \text{percent change in House seats for President's party} = -7{.}36 - 0{.}89 \times (\text{unemployment rate}) $$

We consider the percent change in the number of seats of the President's party (e.g. percent change in the number of seats for Republicans in 2018) against the unemployment rate.

Examining the data, there are no clear deviations from linearity, the constant variance condition, or the normality of residuals. While the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.

Figure 8.30: The percent change in House seats for the President's party in each election from 1898 to 2018 plotted against the unemployment rate. The two points for the Great Depression have been removed, and a least squares regression line has been fit to the data.

Guided Practice 8.32

Source: Main Text

The data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively. Do you agree that they should be removed for this investigation? Why or why not?

Solution

There is a reasonable case for removing these two observations. Unemployment rates of 21% and 18% are extreme outliers compared to every other midterm election in the data set — they are far outside the typical range of unemployment rates during the last century. As influential points, they could pull the regression line dramatically and make it less representative of the typical midterm election relationship we are trying to describe.

That said, removing outliers is never neutral. A reader of the analysis should be told that these points were dropped and given the reasoning. A responsible write-up reports the results both with and without them, so the audience can judge how sensitive the conclusion is to the Great Depression observations.

Testing for the Slope Using a Cutoff of 0.05

What does it mean to say that the slope of the population regression line is significantly greater than $0$? And why do we tend to use a cutoff of $\alpha = 0.05$? See the 5-minute interactive task at www.openintro.org/why05 for an explanation.

Context Pause

Statistical significance is not the same as practical significance. A slope of $-0.89$ means roughly $0.89$ percentage points of seats lost for every 1 percentage point higher unemployment — that is a real political story. But whether the data prove that story is a separate question, and it's the one the hypothesis test answers.

Insight Note

When the data refuse to reject $H_0$, it doesn't mean "there is no relationship." It means "with this sample size and this amount of noise, we can't rule out zero." The absence of evidence is not evidence of absence — it is a call for more data or a different study.

Try It Now 8.4.4

Source: Main Text

Using the midterm-election regression (slope $b = -0.8897$, $SE = 0.8350$, $n = 27$), carry out a one-sided test of $H_0: \beta = 0$ vs. $H_A: \beta < 0$ at $\alpha = 0.05$.

(a) What is the test statistic $T$? (b) What are the degrees of freedom? (c) Given that the two-sided p-value from the output is $0.2961$, what is the one-sided p-value? (d) State the conclusion in context.

Solution

(a) $T = \dfrac{-0{.}8897 - 0}{0{.}8350} = -1{.}07$.

(b) $df = n - 2 = 27 - 2 = 25$.

(c) The one-sided p-value is half the two-sided value: $0.2961 / 2 \approx 0.148$. This halving is valid only because our observed slope ($-0.8897$) is in the direction of $H_A$ (negative).

(d) The one-sided p-value $\approx 0.148$ is much larger than $\alpha = 0.05$, so we fail to reject $H_0$. The data do not provide convincing evidence that higher unemployment is associated with a larger loss of House seats for the President's party in midterm elections.

8.4.5 Understanding Regression Output from Software

The residual plot shown in Figure 8.31 shows no pattern that would indicate that a linear model is inappropriate. Therefore we can carry out a test on the population slope using the sample slope as our point estimate. Just as for other point estimates we have seen before, we can compute a standard error and test statistic for $b$. The test statistic $T$ follows a $t$-distribution with $n - 2$ degrees of freedom.

Figure 8.31: The residual plot shows no pattern that would indicate that a linear model is inappropriate.

Hypothesis Tests on the Slope of the Regression Line

Use a $t$-test with $n - 2$ degrees of freedom when performing a hypothesis test on the slope of a regression line.

We will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course. Figure 8.32 shows software output for the least squares regression line in Figure 8.30. The row labeled unemp represents the information for the slope, which is the coefficient of the unemployment variable.

Figure 8.32: Least squares regression summary for the percent change in seats of President's party in House of Representatives based on percent unemployment.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-7.3644	5.1553	-1.43	0.1646
unemp	-0.8897	0.8350	-1.07	0.2961

Figure 8.33: The distribution shown here is the sampling distribution for $b$ if the null hypothesis were true. The shaded tail represents the p-value for the hypothesis test evaluating whether there is convincing evidence that higher unemployment corresponds to a greater loss of House seats for the President's party during a midterm election.

Example 8.4.5

Source: Main Text

What does the first column of numbers in the regression summary represent?

Solution:

The entries in the first column represent the least squares estimates for the $y$-intercept and slope, $a$ and $b$ respectively. Using this information, we could write the equation for the least squares regression line as

$$ \hat{y} = -7{.}3644 - 0{.}8897 x, $$

where $y$ in this case represents the percent change in the number of seats for the President's party, and $x$ represents the unemployment rate.

We previously used a test statistic $T$ for hypothesis testing in the context of means. Regression is very similar. Here, the point estimate is $b = -0.8897$. The $SE$ of the estimate is $0.8350$, which is given in the second column next to the estimate of $b$. This $SE$ represents the typical error when using the slope of the sample regression line to estimate the slope of the population regression line.

The null value for the slope is $0$, so we now have everything we need to compute the test statistic. We have:

$$ T = \frac{\text{point estimate} - \text{null value}}{SE\ \text{of estimate}} = \frac{-0{.}8897 - 0}{0{.}8350} = -1{.}07 $$

This value corresponds to the $T$-score reported in the regression output in the third column along the unemp row.

Example 8.4.6

Source: Main Text

In this example, the sample size $n = 27$. Identify the degrees of freedom and p-value for the hypothesis test.

Solution:

The degrees of freedom are $df = n - 2 = 27 - 2 = 25$. For a two-sided test, the p-value is the area in the two tails beyond $\pm T = \pm 1.07$ on a $t$-distribution with $df = 25$, which equals $0.2961$ as reported in the last column of the table.

Because the p-value is so large, we do not reject the null hypothesis. That is, the data do not provide convincing evidence that a higher unemployment rate is associated with a larger loss for the President's party in the House of Representatives in midterm elections.
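
The test statistic, degrees of freedom, and two-sided p-value from the last two examples can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the method the software uses: the helper names `t_density` and `two_sided_p_value` are hypothetical, and the tail area is approximated with Simpson's rule on the $t$ density so that only the Python standard library is needed. Because the inputs are the rounded table entries, the result matches the reported $0.2961$ only approximately.

```python
import math


def t_density(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Area in both tails beyond +/-|t_stat| (steps must be even)."""
    b = abs(t_stat)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2  # Simpson's rule weights
        total += weight * t_density(i * h, df)
    central_half = total * h / 3  # area between 0 and |t_stat|
    return 1 - 2 * central_half


# Values from the unemp row of the regression output, with n = 27.
b_hat, se, n = -0.8897, 0.8350, 27
t_stat = (b_hat - 0) / se   # (point estimate - null value) / SE
df = n - 2
p_value = two_sided_p_value(t_stat, df)
print(f"T = {t_stat:.2f}, df = {df}, two-sided p-value = {p_value:.3f}")
```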

> Don't Carelessly Use the P-Value from Regression Output
>
> The last column in regression output often lists p-values for one particular hypothesis: a two-sided test where the null value is zero. If your test is one-sided and the point estimate is in the direction of $H_A$, then you can halve the software's p-value to get the one-tail area. If neither of these scenarios matches your hypothesis test, be cautious about using the software output to obtain the p-value.

Example 8.4.7

Source: Main Text

Use the table in Figure 8.32 to determine the p-value for the hypothesis test.

Solution:

The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: $0.2961$. Because this p-value is well above $0.05$, we do not reject $H_0$. That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections.

> Hypothesis Test for the Slope of a Regression Line
>
> To carry out a complete hypothesis test for the claim that there is no linear relationship between two numerical variables, i.e. that $\beta = 0$:

Identify: Identify the hypotheses and the significance level, $\alpha$.

$$ H_0: \beta = 0 $$ $$ H_A: \beta \ne 0, \quad H_A: \beta > 0, \quad \text{or} \quad H_A: \beta < 0 $$

Choose: Choose the correct test procedure and identify it by name.

To test hypotheses about the slope of a regression model we use a $t$-test for the slope.

Check: Check conditions for using a $t$-test for the slope (the same four conditions as for the interval).

Calculate: Calculate the $t$-statistic, $df$, and p-value.

$$ T = \frac{\text{point estimate} - \text{null value}}{SE\ \text{of estimate}}, \quad df = n - 2 $$

- point estimate: the slope $b$ of the sample regression line
- $SE$ of estimate: $SE$ of slope (find using computer output)
- null value: $0$
- p-value: based on the $t$-statistic, the $df$, and the direction of $H_A$

Conclude: Compare the p-value to $\alpha$, and draw a conclusion in context.

- If the p-value $< \alpha$, reject $H_0$; there is sufficient evidence that [$H_A$ in context].
- If the p-value $> \alpha$, do not reject $H_0$; there is not sufficient evidence that [$H_A$ in context].

Example 8.4.8

Source: Main Text

The regression summary below shows statistical software output from fitting the least squares regression line for predicting gift aid based on family income for 50 randomly selected freshman students at Elmhurst College. The scatterplot and residual plot were shown in Figure 8.27.

Predictor	Coef	SE Coef	T	P
Constant	24.31933	1.29145	18.831	< 2e-16
family_income	-0.04307	0.01081	-3.985	0.000229

$$ S = 4{.}783 \quad R\text{-}Sq = 24{.}86\% \quad R\text{-}Sq(adj) = 23{.}29\% $$

Do these data provide convincing evidence that there is a negative, linear relationship between family income and gift aid? Carry out a complete hypothesis test at the 0.05 significance level. Use the five-step framework to organize your work.

Solution:

Identify: We will test the following hypotheses at the $\alpha = 0.05$ significance level.

$H_0$: $\beta = 0$. There is no linear relationship.

$H_A$: $\beta < 0$. There is a negative linear relationship.

Here, $\beta$ is the slope of the population regression line for predicting gift aid from family income at Elmhurst College.

Choose: Because the hypotheses are about the slope of a regression line, we choose the $t$-test for a slope.

Check: The data come from a random sample of less than 10% of the total population of freshman students at Elmhurst College. The lack of any pattern in the residual plot indicates that a linear model is reasonable. Also, the residual plot shows that the residuals have constant variance. Finally, $n = 50 \ge 30$ so we do not have to worry too much about any skew in the residuals. All four conditions are met.

Calculate: We will calculate the $t$-statistic, degrees of freedom, and the p-value.

We read the slope of the sample regression line and the corresponding $SE$ from the table.

- The point estimate is: $b = -0.04307$.
- The $SE$ of the slope is: $SE = 0.01081$.

$$ T = \frac{-0{.}04307 - 0}{0{.}01081} = -3{.}985 $$

Because $H_A$ uses a less-than sign ($<$), meaning that it is a lower-tail test, the p-value is the area to the left of $t = -3.985$ under the $t$-distribution with $50 - 2 = 48$ degrees of freedom.

$$ \text{p-value} = \tfrac{1}{2}(0{.}000229) \approx 0{.}0001 $$

Conclude: The p-value of $0.0001$ is $< 0.05$, so we reject $H_0$; there is sufficient evidence that there is a negative linear relationship between family income and gift aid at Elmhurst College.

Guided Practice 8.36

Source: Main Text

In context, interpret the p-value from the previous example.

Solution

The p-value of roughly $0.0001$ is the probability of observing a sample slope as extreme as $b = -0.04307$ — or more extreme in the direction of $H_A$ — if the true slope of the population regression line really were zero (i.e., if there were no linear relationship between family income and gift aid).

Because this probability is so tiny, the observed slope is very hard to explain by random chance alone. That is why we reject $H_0$: the data are inconsistent with the "no-relationship" hypothesis, and we instead conclude that higher family income is genuinely associated with lower gift aid at Elmhurst College.

Try It Now 8.4.5

Source: Main Text

In a regression summary, the row for the predictor study_hours shows Estimate $= 3.2$, Std. Error $= 0.9$, t value $= 3.56$, and $\Pr(>|t|) = 0.0006$. The sample size is $n = 42$.

(a) What are the hypotheses corresponding to the reported p-value? (b) Explain what the $t$ value column was computed from. (c) At $\alpha = 0.01$, what do you conclude about $\beta$?

Solution

(a) The reported p-value is two-sided, so $H_0: \beta = 0$ vs. $H_A: \beta \ne 0$.

(b) $t = b / SE = 3.2 / 0.9 \approx 3.56$. It's the standardized distance from the null slope of $0$, measured in standard errors.

(c) $p = 0.0006 < 0.01$, so we reject $H_0$. There is convincing evidence that the true slope of the population regression line is not zero — study_hours has a nonzero linear association with the response variable.

Context Pause

Regression output tables look intimidating but carry only four pieces of information per row: the point estimate, its standard error, the $t$-statistic, and the two-sided p-value. That's it. Every statistical-software vendor uses different column labels (Coef, Estimate, Coefficient, b), but the meaning is the same. Once you can read one table you can read them all.

Insight Note

The p-values reported in regression output are, by default in most software, for two-sided tests of $H_0: \beta = 0$. If your research question is one-sided ("does $x$ increase $y$?"), halve the reported p-value, but only if your observed slope is in the direction of $H_A$. If the observed slope is opposite to $H_A$, the correct one-sided p-value is $1 - p/2$, which is at least $0.5$. Mindlessly halving without checking direction is a common mistake.
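
The direction check described in this note can be written as a small helper. This is a sketch under assumptions: the function name `one_sided_p_value` and the labels `"less"` and `"greater"` are hypothetical, and the input is assumed to be a two-sided p-value for $H_0: \beta = 0$.

```python
def one_sided_p_value(two_sided_p, estimate, alternative):
    """Convert a two-sided p-value for H0: beta = 0 to a one-sided one.

    alternative: "less" for H_A: beta < 0, "greater" for H_A: beta > 0.
    """
    if alternative not in ("less", "greater"):
        raise ValueError("alternative must be 'less' or 'greater'")
    if alternative == "less":
        in_direction = estimate < 0
    else:
        in_direction = estimate > 0
    half = two_sided_p / 2
    # Halve only when the estimate points the same way as H_A.
    return half if in_direction else 1 - half


# Midterm-election slope: b = -0.8897, two-sided p-value = 0.2961.
p_less = one_sided_p_value(0.2961, -0.8897, "less")
p_greater = one_sided_p_value(0.2961, -0.8897, "greater")
print(f"H_A: beta < 0 -> p = {p_less:.3f}")     # 0.148
print(f"H_A: beta > 0 -> p = {p_greater:.3f}")  # 0.852
```

With the negative midterm-election slope, testing $H_A: \beta < 0$ gives the halved value, while testing $H_A: \beta > 0$ gives the complement.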

8.4.6 Technology: the $t$-test/Interval for the Slope

We generally rely on regression output from statistical software programs to provide us with the necessary quantities: $b$ and $SE$ of $b$. However, we can also find the test statistic, p-value, and confidence interval using Desmos or a handheld calculator.

Get started quickly with the Desmos T-Test/Interval Calculator (available at openintro.org/ahss/desmos).

For instructions on implementing the T-Test/Interval on the TI or Casio, see the Graphing Calculator Guides at openintro.org/ahss.

Inference for Regression

We usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice. However, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met.

Context Pause

Software makes slope inference nearly effortless — you feed it two columns and it hands you a full regression table. The price of that convenience is a tendency to skip the residual plot. A significant p-value from a regression that violates linearity or constant variability is worse than no p-value at all, because it feels authoritative. Always look at the picture before you quote the number.

Insight Note

Every entry in the regression table has a familiar analog: Estimate is $b$, Std. Error is $SE$, t value is $b/SE$, and Pr(>|t|) is the two-sided p-value. Once you see the pattern, every new software package becomes readable in seconds.

Try It Now 8.4.6

Source: Main Text

You type a data set of $n = 30$ paired observations into software and the row for your predictor reads Estimate $= -0.42$, Std. Error $= 0.25$, t value $= -1.68$, $\Pr(>|t|) = 0.104$.

(a) Verify the reported $t$ value by hand from the Estimate and Std. Error. (b) What is the two-sided p-value for $H_0: \beta = 0$? (c) Build a 95% confidence interval for $\beta$ using $t^{\star} \approx 2.048$ (from $df = 28$). (d) Does your interval include $0$? Is that consistent with the reported p-value at $\alpha = 0.05$?

Solution

(a) $t = -0.42 / 0.25 = -1.68$. Matches the table.

(b) The two-sided p-value is $0.104$, read directly from the output.

(c) CI: $-0.42 \pm 2.048 \times 0.25 = -0.42 \pm 0.512 = (-0.932,\ 0.092)$.

(d) Yes, the interval contains $0$. This is consistent with the p-value $0.104 > 0.05$: we fail to reject $H_0$ at the 5% level, which is exactly what "zero is in the interval" tells us.
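
The agreement between the interval and the test can be checked directly. The following is a minimal sketch using the numbers from this exercise. The function name `interval_and_test` is hypothetical, and the comparison is only meaningful for a two-sided test whose $\alpha$ matches the confidence level (here $\alpha = 0.05$ with a 95% interval).

```python
def interval_and_test(b, se, t_star, p_value, alpha):
    """Return the CI, whether it contains 0, and whether H0 is retained."""
    lower, upper = b - t_star * se, b + t_star * se
    zero_inside = lower <= 0 <= upper
    fail_to_reject = p_value > alpha
    return (lower, upper), zero_inside, fail_to_reject


# Estimate = -0.42, Std. Error = 0.25, t* = 2.048 (df = 28), p = 0.104.
(lower, upper), zero_inside, fail_to_reject = interval_and_test(
    b=-0.42, se=0.25, t_star=2.048, p_value=0.104, alpha=0.05
)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
print("0 inside interval:", zero_inside)
print("Fail to reject H0:", fail_to_reject)
```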

8.4.7 Which Inference Procedure to Use for Paired Data?

In Section 7.2.4, we looked at a set of paired data involving the price of textbooks for UCLA courses at the UCLA Bookstore and on Amazon. The left panel of Figure 8.34 shows the difference in price (UCLA Bookstore − Amazon) for each book. Because we have two data points on each textbook, it also makes sense to construct a scatterplot, as seen in the right panel of Figure 8.34.

Figure 8.34: Left: histogram of the difference (UCLA Bookstore price − Amazon price) for each book sampled. Right: scatterplot of Amazon Price versus UCLA Bookstore price.

Example 8.4.9

Source: Main Text

What additional information does the scatterplot provide about the price of textbooks at UCLA Bookstore and on Amazon?

Solution:

With a scatterplot, we see the relationship between the variables. We can see that when UCLA Bookstore price is larger, Amazon price also tends to be larger. We can consider the strength of the correlation, and we can find the least squares regression line for predicting Amazon price from UCLA Bookstore price.

Example 8.4.10

Source: Main Text

Which test should we do if we want to check whether:

1. prices for textbooks for UCLA courses are higher at the UCLA Bookstore than on Amazon;
2. there is a significant, positive linear relationship between UCLA Bookstore price and Amazon price?

Solution:

In the first case, we are interested in whether the differences (UCLA Bookstore − Amazon) for all UCLA textbooks are, on average, greater than $0$, so we would do a one-sample $t$-test for a mean of differences. In the second case, we are interested in whether the slope of the regression line for predicting Amazon price from UCLA Bookstore price is significantly greater than $0$, so we would do a $t$-test for the slope of a regression line.

Likewise, a one-sample $t$-interval for a mean of differences would provide an interval of reasonable values for the mean of differences in textbook price between UCLA Bookstore and Amazon (for all UCLA textbooks), while a $t$-interval for the slope would provide an interval of reasonable values for the slope of the regression line for predicting Amazon price from UCLA Bookstore price (for all UCLA textbooks).

> Inference for Paired Data
>
> A one-sample $t$-interval or $t$-test for a mean of differences only makes sense when we are asking whether, on average, one variable is greater than, less than, or different from another (think histogram of the differences). A $t$-interval or $t$-test for the slope of a regression line makes sense when we are interested in the linear relationship between them (think scatterplot).

Example 8.4.11

Source: Main Text

Previously, we looked at the relationship between body length and head length for brushtail possums. We also looked at the relationship between gift aid and family income for freshmen at Elmhurst College. Could we do a one-sample $t$-test in either of these scenarios?

Solution:

We have to ask ourselves: does it make sense to ask whether, on average, body length is greater than head length? Similarly, does it make sense to ask whether, on average, gift aid is greater than family income? These don't seem to be meaningful research questions; a one-sample $t$-test for a mean of differences would not be useful here.

Guided Practice 8.40

Source: Main Text

A teacher gives her class a pretest and a posttest. Does this result in paired data? If so, which hypothesis test should she use?

Solution

Yes, this is paired data — each student contributes two observations (pretest score, posttest score), and the pairing is meaningful because the two scores come from the same individual.

The natural question is "did scores change, on average, between pretest and posttest?" That's a question about the mean of the differences (posttest − pretest), so she should use a one-sample $t$-test for a mean of differences. A $t$-test for a slope would only be appropriate if she wanted to model posttest score as a linear function of pretest score — a different question.

Try It Now 8.4.7

Source: Main Text

For each research question, state whether the appropriate procedure is a one-sample $t$-test for a mean of differences or a $t$-test for the slope of a regression line.

(a) Among married couples in a city, is the husband's age, on average, greater than the wife's age? (b) Among the same couples, is there a positive linear relationship between husband's age and wife's age? (c) In a study of twins, is the first-born twin's birth weight typically different from the second-born's? (d) In the same study, does a heavier first-born twin tend to predict a heavier second-born twin?

Solution

(a) One-sample $t$-test for a mean of differences. The question is about the typical difference (husband − wife) being different from $0$.

(b) $t$-test for the slope. The question is whether the two variables have a linear association.

(c) One-sample $t$-test for a mean of differences. The question asks about the average difference in birth weight between first- and second-born twins.

(d) $t$-test for the slope. The question is about a linear relationship between the two weights, not the average difference.

Context Pause

"Paired data" by itself doesn't dictate which test to use — what matters is the question you're asking. A histogram-shaped question ("is $A$ on average bigger than $B$?") is a paired $t$-test. A scatterplot-shaped question ("when $A$ goes up, does $B$ go up too?") is a slope test. The same two columns of numbers can appropriately feed either test; the research question chooses.

Insight Note

The paired $t$-test and the slope test are not substitutes — they answer orthogonal questions. It's perfectly reasonable for a dataset to show a significant paired difference but a non-significant slope, or vice versa. If both tests feel relevant, you probably need both: report the mean difference and the correlation.
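
To see how the same two columns feed both procedures, the sketch below computes both test statistics from one small data set. Everything here is an assumption made for illustration: the prices are hypothetical (not the UCLA data), the function names are made up, and the standard error of the slope uses the standard formula $SE_b = \sqrt{SSE/(n-2)} \,/\, \sqrt{\sum (x_i - \bar{x})^2}$, which this text leaves to software.

```python
import math


def paired_t_statistic(a, b):
    """T for H0: mean of differences (a - b) equals 0; df = n - 1."""
    diffs = [ai - bi for ai, bi in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))


def slope_t_statistic(x, y):
    """T for H0: beta = 0 in the regression of y on x; df = n - 2."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return slope / se_slope


# Hypothetical prices for five books at two sellers.
bookstore = [10, 20, 30, 40, 50]
amazon = [8, 17, 27, 33, 45]

t_diff = paired_t_statistic(bookstore, amazon)   # histogram-shaped question
t_slope = slope_t_statistic(bookstore, amazon)   # scatterplot-shaped question
print(f"Paired T = {t_diff:.3f} (df = {len(amazon) - 1})")
print(f"Slope  T = {t_slope:.3f} (df = {len(amazon) - 2})")
```

The two statistics use different degrees of freedom and answer different questions, so neither value can stand in for the other.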

Section Summary

In Chapter 6, we used a $\chi^2$ test for independence to test for association between two categorical variables. In this section, we test for association/correlation between two numerical variables.

We use the slope $b$ of the sample regression line as a point estimate for the slope $\beta$ of the population regression line. The slope of the population regression line is the true average change in $y$ for each unit increase in $x$. If the slope of the population regression line is $0$, there is no linear relationship between the two variables.

Under certain assumptions, the sampling distribution for $b$ is normal and the distribution of the standardized test statistic using the standard error of the slope follows a $t$-distribution with $n - 2$ degrees of freedom.

When there is $(x, y)$ data and the parameter of interest is the slope of the population regression line:
Estimate $\beta$ at the C% confidence level using a $t$-interval for the slope.
Test $H_0: \beta = 0$ at the $\alpha$ significance level using a $t$-test for the slope.

The conditions for the $t$-interval and $t$-test for the slope of a regression line are the same:
1. Independence: Data come from a random sample or randomized experiment. If sampling without replacement, check that the sample size is less than 10% of the population size.
2. Linearity: Check that the scatterplot does not show a curved trend and that the residual plot shows no ∪-shape pattern.
3. Constant variability: Use the residual plot to check that the standard deviation of the residuals is constant across all $x$-values.
4. Normality: The population of residuals is nearly normal or the sample size is $\ge 30$. If the sample size is less than 30, check for strong skew or outliers in the sample residuals.

The confidence interval and test statistic are calculated as follows:

$$ \text{CI: } b \pm t^{\star} \times SE_b $$ $$ \text{Test: } T = \frac{b - 0}{SE_b}, \quad df = n - 2 $$

The confidence interval for the slope of the population regression line estimates the true average increase in the $y$-variable for each unit increase in the $x$-variable.

The $t$-test for the slope and the one-sample $t$-test for a mean of differences both involve paired, numerical data. However, the $t$-test for the slope asks if the two variables have a linear relationship — specifically, if the slope of the population regression line is different from $0$. The one-sample $t$-test for a mean of differences asks if the two variables are, on average, different — specifically, if the mean of the population differences is not equal to $0$.

Problem Set — Section 8.4

Source: Main Text

Problem 1: Body measurements, Part IV. The scatterplot and least squares summary below show the relationship between weight measured in kilograms and height measured in centimeters of 507 physically active individuals.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-105.0113	7.5394	-13.93	0.0000
height	1.0176	0.0440	23.13	0.0000

(a) Describe the relationship between height and weight.

(b) Write the equation of the regression line. Interpret the slope and intercept in context.

(c) Do the data provide strong evidence that an increase in height is associated with an increase in weight? State the null and alternative hypotheses, report the p-value, and state your conclusion.

(d) The correlation coefficient for height and weight is $0.72$. Calculate $R^2$ and interpret it in context.

Problem 8.33 Solution

Step 1 — Describe the relationship (a): The scatterplot shows a positive, roughly linear association between height and weight for physically active individuals: taller people tend to weigh more. The spread is moderate and reasonably constant across the range of heights, with no obvious outliers.

Step 2 — Write the regression equation (b): Using the Estimate column: $$\widehat{\text{weight}} = -105.0113 + 1.0176 \times \text{height}.$$ Slope interpretation: For each additional 1 cm in height, we predict an increase of about $1.018$ kg in weight, on average. Intercept interpretation: The intercept is the predicted weight when height is $0$ cm — not meaningful here, since nobody is 0 cm tall.

Step 3 — Hypothesis test for the slope (c): Hypotheses: $H_0: \beta = 0$ vs. $H_A: \beta \neq 0$. The regression output reports $T = 23.13$ and p-value $0.0000$ for the height row. Since $p < 0.05$, we reject $H_0$. The data provide very strong evidence that increases in height are associated with increases in weight.

Step 4 — Compute $R^2$ (d): $$R^2 = r^2 = (0.72)^2 = 0.5184.$$ About 52% of the variability in weight among these physically active individuals is explained by a linear model using height as the predictor.

Answer: (a) Positive, roughly linear. (b) $\widehat{\text{weight}} = -105.01 + 1.018 \times \text{height}$. (c) Reject $H_0$; strong evidence of a positive linear association ($p \approx 0$). (d) $R^2 \approx 0.52$.
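
The arithmetic behind parts (c) and (d) can be checked in a few lines. This is a minimal sketch, not the output of any statistics package: the variable names are illustrative, and the numbers are copied from the height row of the table and from part (d).

```python
# Values copied from the height row of the regression table in Problem 1.
estimate = 1.0176    # slope estimate (kg per cm)
std_error = 0.0440   # standard error of the slope
r = 0.72             # correlation given in part (d)

# Test statistic for H0: beta = 0 is (estimate - null value) / SE.
t_stat = (estimate - 0) / std_error

# For a least squares line with one predictor, R^2 is the square of r.
r_squared = r ** 2

print(f"T = {t_stat:.2f}")       # T = 23.13
print(f"R^2 = {r_squared:.4f}")  # R^2 = 0.5184
```

The computed $T$ matches the t value column, which is a quick way to confirm that a regression table has been read correctly.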

Problem 2: MCU, predict US theater sales. The Marvel Cinematic Universe movies were an international movie sensation, comprising 23 movies at the time of this writing. Here we consider a model predicting an MCU film's gross theater sales in the US based on the first weekend sales performance in the US. The data are presented below in both a scatterplot and the model in a regression table. Scientific notation is used below; for example, 42.5e6 corresponds to $42.5 \times 10^6$.

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	42.5e6	26.6e6	1.60	0.1251
opening_weekend_us	2.4361	0.1739	14.01	0.0000

(a) Describe the relationship between gross theater sales in the US and first weekend sales in the US.

(b) Write the equation of the regression line. Interpret the slope and intercept in context.

(c) Do the data provide strong evidence that higher opening weekend sales are associated with higher gross theater sales? State the null and alternative hypotheses, report the p-value, and state your conclusion.

(d) The correlation coefficient for gross sales and first weekend sales is $0.950$. Calculate $R^2$ and interpret it in context.

(e) Suppose we consider a set of all films ever released. Do you think the relationship between opening weekend sales and total sales would be as strong as what we see with the MCU films?

Problem 8.34 Solution

Step 1 — Describe the relationship (a): Gross US theater sales increase strongly and approximately linearly with opening-weekend US sales: higher opening weekends correspond to higher totals, with little deviation from a straight line.

Step 2 — Regression equation (b): $$\widehat{\text{gross}} = 42.5 \times 10^{6} + 2.4361 \times \text{opening\_weekend\_us}.$$ Slope: Each additional \$1 of opening weekend sales is associated with an additional \$2.44 in total US gross, on average. Intercept: The predicted gross for a hypothetical film with \$0 opening weekend sales is about \$42.5 million. This value lies outside the observed range of opening weekends, so it is an extrapolation and should be interpreted with caution.

Step 3 — Hypothesis test (c): $H_0: \beta = 0$ vs. $H_A: \beta \neq 0$. The table shows $T = 14.01$ and a p-value of $0.0000$ (that is, less than $0.00005$). Since $p < 0.05$, we reject $H_0$. Because the estimated slope is positive, the data provide strong evidence that higher opening weekend sales are associated with higher gross theater sales.

Step 4 — Compute $R^2$ (d): $$R^2 = (0.950)^2 = 0.9025.$$ About 90% of the variability in MCU gross sales is explained by opening-weekend sales.

Step 5 — Generalize (e): Probably not. MCU films are a highly curated, heavily marketed franchise with similar audiences. In a set of all films, budgets, genres, and distribution vary far more, so opening weekends would still correlate with total sales but almost certainly with a much lower $R^2$ and more scatter.

Answer: (a) Strong positive linear. (b) $\hat{y} = 42.5\text{M} + 2.44 x$. (c) Reject $H_0$; very strong positive association. (d) $R^2 \approx 0.90$. (e) Expect a weaker relationship for all films.

Problem 3: Spouses, Part II. The scatterplot below summarizes women's heights and their spouses' heights for a random sample of 170 married women in Britain, where both partners' ages are below 65 years. Summary output of the least squares fit for predicting spouse's height from the woman's height is also provided in the table.

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	43.5755	4.6842	9.30	0.0000
height_spouse	0.2863	0.0686	4.17	0.0000

(a) Is there strong evidence in this sample that taller women have taller spouses? State the hypotheses and include any information used to conduct the test.

(b) Write the equation of the regression line for predicting the height of a woman's spouse based on the woman's height.

(c) Interpret the slope and intercept in the context of the application.

(d) Given that $R^2 = 0.09$, what is the correlation of heights in this data set?

(e) You meet a married woman from Britain who is 5'9" (69 inches). What would you predict her spouse's height to be? How reliable is this prediction?

(f) You meet another married woman from Britain who is 6'7" (79 inches). Would it be wise to use the same linear model to predict her spouse's height? Why or why not?

Problem 8.35 Solution

Step 1 — State and evaluate the hypotheses (a): $H_0: \beta = 0$ vs. $H_A: \beta \neq 0$. Using the regression table: slope $b = 0.2863$, $SE = 0.0686$, $T = 4.17$, p-value $0.0000$. With $n = 170$, $df = 168$. Since $p < 0.05$, we reject $H_0$ and conclude there is strong evidence that taller women have taller spouses.

Step 2 — Regression equation (b): $$\widehat{\text{spouse\_height}} = 43.5755 + 0.2863 \times \text{woman\_height}.$$

Step 3 — Slope and intercept in context (c): Slope: For each additional inch in a woman's height, the spouse's predicted height increases by about $0.29$ inches, on average. Intercept: A woman of height 0 inches is predicted to have a spouse of height $43.6$ inches — not meaningful; it is only a mathematical baseline.

Step 4 — Correlation from $R^2$ (d): $$r = \pm\sqrt{0.09} = \pm 0.30.$$ The slope is positive, so $r = +0.30$.

Step 5 — Prediction for 69 inches (e): $$\widehat{\text{spouse}} = 43.5755 + 0.2863 \times 69 = 43.5755 + 19.7547 \approx 63.33 \text{ in.}$$ Because $R^2 = 0.09$ is small, individual predictions are unreliable: the woman's height explains only about $9\%$ of the variation in spouse height.

Step 6 — Prediction for 79 inches (f): No — $79$ inches (6'7") is far above the sampled range. Using the line there is extrapolation, and the linear relationship may not hold at the extremes.

Answer: (a) Reject $H_0$; strong evidence. (b) $\hat{y} = 43.58 + 0.286 x$. (c) See step 3. (d) $r \approx 0.30$. (e) $\approx 63.3$ in; weak prediction. (f) No — extrapolation.
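
Parts (d) and (e) reduce to two small calculations, sketched here. The function names are hypothetical helpers introduced only for illustration; the coefficients come from the regression table above.

```python
import math

INTERCEPT = 43.5755  # from the regression table (inches)
SLOPE = 0.2863       # inches of spouse height per inch of the woman's height


def predict_spouse_height(woman_height_in):
    """Predicted spouse height (inches) from the fitted line."""
    return INTERCEPT + SLOPE * woman_height_in


def correlation_from_r_squared(r_squared, slope):
    """Recover r from R^2; r always takes the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)


prediction = predict_spouse_height(69)
r = correlation_from_r_squared(0.09, SLOPE)

print(f"{prediction:.2f}")  # 63.33
print(f"{r:.2f}")           # 0.30
```

The code will return a number for any input, including 79 inches. Nothing in the formula warns about extrapolation, so the judgment in part (f) has to come from knowing the range of the data.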

Problem 4: Urban homeowners, Part II. Exercise 8.29 gives a scatterplot displaying the relationship between the percent of families that own their home and the percent of the population living in urban areas. Below is a similar scatterplot, excluding District of Columbia, as well as the residuals plot. There were 51 cases.

(a) For these data, $R^2 = 0.28$. What is the correlation? How can you tell if it is positive or negative?

(b) Examine the residual plot. What do you observe? Is a simple least squares fit appropriate for these data?

Problem 8.36 Solution

Step 1 — Correlation from $R^2$ (a): $$r = \pm \sqrt{0.28} = \pm 0.529.$$ The sign matches the slope of the regression line. Because home-ownership tends to decrease as urbanization rises in the displayed scatterplot (negative slope), $r \approx -0.53$.

Step 2 — Evaluate the residual plot (b): If the residual plot shows a random cloud of points centered on zero with roughly constant spread and no curved pattern, the linearity and constant-variability conditions are reasonable and a simple least-squares fit is appropriate. If the plot shows a U-shape, fanning, or obvious outliers, those conditions fail and a simple linear model should not be used as-is.

Answer: (a) $r \approx -0.53$ (negative because higher urbanization is associated with lower homeownership). (b) Appropriateness depends on the residual plot; a random scatter with constant spread supports using a simple linear fit.

Problem 5: Murders and poverty, Part II. Exercise 8.25 presents regression output from a model for predicting annual murders per million from percentage living in poverty based on a random sample of 20 metropolitan areas. The model output is also provided below.

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	-29.901	7.789	-3.839	0.001
poverty%	2.559	0.390	6.562	0.000

$$ s = 5{.}512 \quad R^2 = 70{.}52\% \quad R_{adj}^2 = 68{.}89\% $$

(a) What are the hypotheses for evaluating whether poverty percentage is a significant predictor of murder rate?

(b) State the conclusion of the hypothesis test from part (a) in context of the data.

(c) Calculate a 95% confidence interval for the slope of poverty percentage, and interpret it in context of the data.

(d) Do your results from the hypothesis test and the confidence interval agree? Explain.

Problem 8.37 Solution

Step 1 — State hypotheses (a): $$H_0: \beta = 0 \qquad H_A: \beta \neq 0,$$ where $\beta$ is the slope of the population regression line for predicting murders per million from poverty percentage.

Step 2 — Conclusion of the test (b): From the table, $b = 2.559$, $SE = 0.390$, $T = 6.562$, p-value $= 0.000$. Since $p < 0.05$, we reject $H_0$. There is strong evidence that poverty percentage is a significant predictor of murder rate in these metropolitan areas.

Step 3 — 95% CI for the slope (c): $df = n - 2 = 20 - 2 = 18$; $t^{\star} \approx 2.101$ from the $t$-table at $df = 18$, 95% confidence. $$2.559 \pm 2.101 \times 0.390 = 2.559 \pm 0.819 = (1.740,\ 3.378).$$ We are 95% confident that each additional percentage point of poverty is associated with between $1.74$ and $3.38$ additional murders per million, on average.

Step 4 — Agreement check (d): The interval $(1.74, 3.38)$ does not contain $0$, which matches the rejection of $H_0$ in part (b). Both procedures point to the same conclusion: poverty is significantly associated with murder rate.

Answer: (a) $H_0: \beta=0$ vs $H_A: \beta \neq 0$. (b) Reject $H_0$; strong evidence. (c) $(1.74, 3.38)$ murders/million per 1% increase in poverty. (d) Yes — interval excludes 0 and test rejects $H_0$.
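
The interval in Step 3 and the agreement check in Step 4 can be sketched as follows. The helper name `slope_confidence_interval` is an assumption made for this example, and $t^{\star} = 2.101$ is entered by hand from the $t$-table rather than computed.

```python
def slope_confidence_interval(estimate, std_error, t_star):
    """Return (lower, upper) for estimate +/- t_star * std_error."""
    margin = t_star * std_error
    return estimate - margin, estimate + margin


# Slope and SE from the poverty% row; t_star is the table value
# for df = 18 at 95% confidence (typed in by hand, not computed).
lower, upper = slope_confidence_interval(2.559, 0.390, 2.101)

# A two-sided test at the 5% level rejects H0: beta = 0 exactly when
# the 95% interval excludes 0.
contains_zero = lower <= 0 <= upper

print(f"({lower:.3f}, {upper:.3f})")  # (1.740, 3.378)
print(contains_zero)                  # False
```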

Problem 6: Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks). The estimated regression line is

$$ \widehat{\text{head circumference}} = 3{.}91 + 0{.}78 \times \text{gestational age} $$

The standard error for the coefficient of gestational age is $0.35$. Is there significant evidence that gestational age has a positive linear association with head circumference? Use the Identify, Choose, Check, Calculate, Conclude framework and make sure to identify any assumptions used in the test.

Problem 8.38 Solution

Step 1 — Identify: Parameter: $\beta$, the slope of the population regression line relating head circumference (cm) to gestational age (weeks) for low birth-weight babies. Hypotheses: $H_0: \beta = 0$ vs. $H_A: \beta > 0$ (positive association). Use $\alpha = 0.05$.

Step 2 — Choose: Because the parameter is the slope of a regression line, use the $t$-test for the slope.

Step 3 — Check conditions: The problem does not say how the 25 babies were selected, so we assume they can be treated as a random sample. Linearity, constant variability, and nearly normal residuals should be assessed from a residual plot; with $n = 25 < 30$, the normality condition matters, and we would want to verify that the residuals show no strong skew. No plot is provided, so we proceed under the stated assumption that these conditions are reasonably met.

Step 4 — Calculate: $$T = \frac{b - 0}{SE_b} = \frac{0.78 - 0}{0.35} = 2.229.$$ Degrees of freedom: $df = 25 - 2 = 23$. For a one-sided upper-tail test, the p-value is the area to the right of $2.229$ under the $t_{23}$ distribution, which is approximately $p \approx 0.018$ (between $0.01$ and $0.025$ on a standard $t$-table).

Step 5 — Conclude: $p \approx 0.018 < 0.05$, so we reject $H_0$. The data provide convincing evidence that gestational age is positively associated with head circumference at birth for low birth-weight babies.

Answer: Reject $H_0$; $T \approx 2.23$, $df = 23$, $p \approx 0.018$. Gestational age has a statistically significant positive linear association with head circumference.
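
For readers who want the tail area without a table, the sketch below integrates the $t$ density numerically using only the standard library. It is one possible approach under stated assumptions, not the method a calculator or statistics package uses: the function names are illustrative, and Simpson's rule with 2000 steps is an arbitrary choice.

```python
import math


def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_p_value(t_stat, df, steps=2000):
    """Area to the right of t_stat (for t_stat >= 0).

    Integrates the density on [0, t_stat] with Simpson's rule (steps must
    be even) and subtracts from 0.5, the total area to the right of 0.
    """
    h = t_stat / steps
    total = t_density(0, df) + t_density(t_stat, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    return 0.5 - total * h / 3


t_stat = (0.78 - 0) / 0.35  # slope estimate divided by its standard error
df = 25 - 2
p_value = upper_tail_p_value(t_stat, df)

print(f"T = {t_stat:.3f}, df = {df}")  # T = 2.229, df = 23
print(f"one-sided p-value = {p_value:.4f}")
```

The printed p-value should fall between $0.01$ and $0.025$, the same bracket read from the $t$-table in Step 4.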

Chapter Highlights

This chapter focused on describing the linear association between two numerical variables and fitting a linear model.

The correlation coefficient, $r$, measures the strength and direction of the linear association between two variables. However, $r$ alone cannot tell us whether data follow a linear trend or whether a linear model is appropriate.

The explained variance, $R^2$, measures the proportion of variation in the $y$ values explained by a given model. Like $r$, $R^2$ alone cannot tell us whether data follow a linear trend or whether a linear model is appropriate.

Every analysis should begin with graphing the data using a scatterplot in order to see the association and any deviations from the trend (outliers or influential values). A residual plot helps us better see patterns in the data.

When the data show a linear trend, we fit a least squares regression line of the form $\hat{y} = a + b x$, where $a$ is the $y$-intercept and $b$ is the slope. It is important to be able to calculate $a$ and $b$ using the summary statistics and to interpret them in the context of the data.

A residual, $y - \hat{y}$, measures the error for an individual point. The standard deviation of the residuals, $s$, measures the typical size of the residuals.

$\hat{y} = a + b x$ provides the best-fit line for the observed data. To estimate or hypothesize about the slope of the population regression line, first confirm that the residual plot has no pattern and that a linear model is reasonable, then use a $t$-interval for the slope or a $t$-test for the slope with $n - 2$ degrees of freedom.

In this chapter we focused on simple linear models with one explanatory variable. More complex methods of prediction, such as multiple regression (more than one explanatory variable) and nonlinear regression, can be studied in a future course.

Problem Set — Chapter 8 Review

Source: Main Text

Problem 7: True / False. Determine if the following statements are true or false. If false, explain why.

(a) A correlation coefficient of $-0.90$ indicates a stronger linear relationship than a correlation of $0.5$.

(b) Correlation is a measure of the association between any two variables.

Problem 8.39 Solution

Step 1 — Evaluate (a): TRUE. The strength of the linear relationship is measured by $|r|$, not by its sign. Since $|-0.90| = 0.90 > 0.5$, a correlation of $-0.90$ represents a stronger linear association than a correlation of $+0.5$. The negative sign just tells us the direction.

Step 2 — Evaluate (b): FALSE. Correlation (the Pearson correlation coefficient $r$) measures the strength of a linear relationship between two numerical variables. It is not defined for categorical variables, and it can mislead for numerical variables whose relationship is strongly nonlinear.

Answer: (a) TRUE. (b) FALSE — correlation applies only to numerical variables with a roughly linear relationship.

Problem 8: Cats, Part II. Exercise 8.26 presents regression output from a model for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats. The model output is also provided below. Assume that conditions for inference on the slope are met.

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	-0.357	0.692	-0.515	0.607
body wt	4.034	0.250	16.119	0.000

$$ s = 1{.}452 \quad R^2 = 64{.}66\% \quad R_{adj}^2 = 64{.}41\% $$

(a) What are the hypotheses for evaluating whether body weight is associated with heart weight in cats?

(b) State the conclusion of the hypothesis test from part (a) in context of the data.

(c) Calculate a 95% confidence interval for the slope of body weight, and interpret it in context of the data.

(d) Do your results from the hypothesis test and the confidence interval agree? Explain.

Problem 8.40 Solution

Step 1 — Hypotheses (a): $$H_0: \beta = 0 \qquad H_A: \beta \neq 0,$$ where $\beta$ is the slope of the population regression line predicting heart weight (g) from body weight (kg) in cats.

Step 2 — Conclusion of the test (b): From the table, $b = 4.034$, $SE = 0.250$, $T = 16.119$, p-value $= 0.000$. We reject $H_0$. There is overwhelming evidence that body weight is associated with heart weight in cats: heavier cats tend to have heavier hearts.

Step 3 — 95% CI for the slope (c): $df = n - 2 = 144 - 2 = 142$. A typical $t$-table has no row for $df = 142$, so use the next lower row, $df = 100$, which gives a slightly wider (more conservative) interval: $t^{\star} \approx 1.984$. $$4.034 \pm 1.984 \times 0.250 = 4.034 \pm 0.496 = (3.538,\ 4.530).$$ We are 95% confident that each additional 1 kg of body weight is associated with between $3.54$ and $4.53$ additional grams of heart weight, on average.

Step 4 — Agreement (d): The interval does not contain $0$, consistent with rejecting $H_0$. Both approaches agree: body weight is a significant predictor of heart weight.

Answer: (a) $H_0: \beta = 0$ vs $H_A: \beta \neq 0$. (b) Reject $H_0$; very strong association. (c) $(3.54, 4.53)$ g per kg. (d) Yes — agree.

Problem 9: Nutrition at Starbucks, Part II. Exercise 8.22 introduced a data set on nutrition information on Starbucks food menu items. Based on the scatterplot and the residual plot provided, describe the relationship between the protein content and calories of these menu items, and determine if a simple linear model is appropriate to predict amount of protein from the number of calories.

Problem 8.41 Solution

Step 1 — Describe the relationship: From the scatterplot, protein content generally increases with calories, but the trend is moderate rather than strong and there is substantial spread around any line. The association looks positive but noisy.

Step 2 — Evaluate appropriateness from the residual plot: If the residual plot shows a fairly random cloud centered on zero with roughly constant spread, a simple linear model is reasonable for predicting protein from calories. If the residual plot shows fanning (increasing spread with calories) or curvature, then a simple linear model is not appropriate — a transformation or nonlinear method would be preferred.

Step 3 — Practical judgment: For Starbucks food items, menu items bundle protein with fat and carbs, so calories alone explain only part of the protein content. Use the linear model only for rough predictions and be cautious at the extremes of the calorie range.

Answer: The relationship appears positive but moderate. A simple linear model is appropriate only if the residual plot shows no clear pattern and constant spread; otherwise it should not be used.

Problem 10: Helmets and lunches. The scatterplot shows the relationship between socioeconomic status measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (lunch) and the percentage of bike riders in the neighborhood wearing helmets (helmet). The average percentage of children receiving reduced-fee lunches is $30.8\%$ with a standard deviation of $26.7\%$, and the average percentage of bike riders wearing helmets is $38.8\%$ with a standard deviation of $16.9\%$.

(a) If the $R^2$ for the least-squares regression line for these data is $72\%$, what is the correlation between lunch and helmet?

(b) Calculate the slope and intercept for the least-squares regression line for these data.

(c) Interpret the intercept of the least-squares regression line in the context of the application.

(d) Interpret the slope of the least-squares regression line in the context of the application.

(e) What would the value of the residual be for a neighborhood where $40\%$ of the children receive reduced-fee lunches and $40\%$ of the bike riders wear helmets? Interpret the meaning of this residual in the context of the application.

Problem 8.42 Solution

Step 1 — Correlation from $R^2$ (a): $$r = \pm\sqrt{0.72} = \pm 0.849.$$ Higher reduced-lunch percentages are expected to correspond to lower helmet-wearing percentages (a neighborhood-SES pattern), so the slope is negative and $r \approx -0.849$.

Step 2 — Slope of the least-squares line (b): $$b = r \cdot \frac{s_y}{s_x} = -0.849 \cdot \frac{16.9}{26.7} \approx -0.537.$$

Step 3 — Intercept: The regression line passes through $(\bar{x}, \bar{y}) = (30.8, 38.8)$: $$a = \bar{y} - b \bar{x} = 38.8 - (-0.537)(30.8) = 38.8 + 16.54 \approx 55.3.$$ So $$\widehat{\text{helmet}} = 55.3 - 0.537 \times \text{lunch}.$$

Step 4 — Interpret the intercept (c): At a hypothetical neighborhood where $0\%$ of children receive reduced-fee lunches, the predicted helmet-wearing rate is about $55.3\%$. Interpret cautiously — this may extrapolate beyond the data.

Step 5 — Interpret the slope (d): For each additional 1 percentage point of children receiving reduced-fee lunches, the predicted percentage of bike riders wearing helmets decreases by about $0.54$ percentage points.

Step 6 — Residual at (40%, 40%) (e): $$\widehat{\text{helmet}} = 55.3 - 0.537 \times 40 = 55.3 - 21.48 = 33.82.$$ Residual $= y - \hat{y} = 40 - 33.82 = 6.18$. The actual helmet rate is about $6.2$ percentage points higher than the model predicts — the neighborhood does better than expected.

Answer: (a) $r \approx -0.85$. (b) $b \approx -0.54$, $a \approx 55.3$. (c)–(d) See steps 4–5. (e) Residual $\approx +6.2$ percentage points.
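
Steps 1 through 6 can be reproduced from the summary statistics alone. The sketch below is illustrative: the function name is hypothetical, and the negative sign on $r$ is the assumption made in Step 1 rather than something the code can infer.

```python
import math


def least_squares_from_summary(x_mean, x_sd, y_mean, y_sd, r):
    """Slope b = r * s_y / s_x; intercept a = y_mean - b * x_mean."""
    slope = r * y_sd / x_sd
    intercept = y_mean - slope * x_mean
    return intercept, slope


# R^2 = 0.72; the sign of r is taken as negative (see Step 1).
r = -math.sqrt(0.72)
intercept, slope = least_squares_from_summary(30.8, 26.7, 38.8, 16.9, r)

# Residual for a neighborhood with lunch = 40 and helmet = 40.
predicted = intercept + slope * 40
residual = 40 - predicted

print(f"slope = {slope:.3f}, intercept = {intercept:.2f}")
print(f"residual = {residual:.2f}")
```

Carrying full precision gives a residual near $6.1$ rather than the $6.18$ found by hand. The small gap comes from rounding the intercept to $55.3$ before predicting, and it does not change the interpretation.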

Problem 11: Match the correlation, Part III. Match each correlation to the corresponding scatterplot.

(a) $r = -0.72$

(b) $r = 0.07$

(c) $r = 0.86$

(d) $r = 0.99$

(1)

(2)

(3)

(4)

Problem 8.43 Solution

Step 1 — Rank the correlations by strength: Order by $|r|$: $|0.99| > |0.86| > |{-0.72}| > |0.07|$.

Step 2 — Match strongest positive ($r = 0.99$): The plot that looks like an almost perfect straight line with positive slope and very little scatter — very tight cluster around a rising line.

Step 3 — Match strong positive ($r = 0.86$): Still clearly positive and linear but with visibly more scatter than the $0.99$ plot.

Step 4 — Match strong negative ($r = -0.72$): A plot with a clearly decreasing linear trend and moderate scatter.

Step 5 — Match near-zero ($r = 0.07$): A plot with essentially no visible trend — a near-circular cloud of points.

Answer: Strongest-to-weakest by |r|: (d) $0.99$, (c) $0.86$, (a) $-0.72$, (b) $0.07$. The unique pairing is the tightest positive line $\to 0.99$, the noisier positive line $\to 0.86$, the decreasing line $\to -0.72$, and the patternless cloud $\to 0.07$.

Problem 12: Rate my professor. Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching-related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and regression output is provided for predicting teaching evaluation score from beauty score.

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	4.010	0.0255	157.21	0.0000
beauty	□	0.0322	4.13	0.0000

(a) Given that the average standardized beauty score is $-0.0883$ and average teaching evaluation score is $3.9983$, calculate the slope. Alternatively, the slope may be computed using just the information provided in the model summary table.

(b) Do these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive? Explain your reasoning.

(c) List the conditions required for linear regression and check if each one is satisfied for this model based on the following diagnostic plots.

Problem 8.44 Solution

Step 1 — Calculate the slope (a): The regression line must pass through $(\bar{x}, \bar{y}) = (-0.0883,\ 3.9983)$, so $$\bar{y} = a + b \bar{x} \;\Longrightarrow\; 3.9983 = 4.010 + b(-0.0883).$$ Solve: $$b = \frac{3.9983 - 4.010}{-0.0883} = \frac{-0.0117}{-0.0883} \approx 0.133.$$ (Equivalently, the t value of $4.13$ with $SE = 0.0322$ gives $b = 4.13 \times 0.0322 \approx 0.133$, confirming the estimate.)

Step 2 — Positive slope test (b): Hypotheses: $H_0: \beta = 0$ vs. $H_A: \beta > 0$. The reported two-sided p-value is $0.0000$ and the observed slope is positive, so the one-sided p-value is essentially $0$. We reject $H_0$: yes, there is convincing evidence that higher beauty scores are associated with higher teaching evaluation scores.

Step 3 — Check conditions (c): Using the diagnostic plots:
Linearity: The residual plot should show no ∪-shape or other curved pattern.
Constant variability: The residuals should have roughly the same spread across all values of $x$.
Nearly normal residuals: With $n = 463 \ge 30$, the inference is robust to moderate departures from normality, so this condition is a concern only if the residuals show extreme skew or outliers.
Independence: Assume the data come from a reasonably representative sample of professors.

If the residual plot and Q-Q plot show random scatter with constant spread and roughly normal residuals, all four conditions are met.

Answer: (a) $b \approx 0.133$. (b) Reject $H_0$; evidence of a positive linear relationship. (c) All four conditions are reasonably satisfied, assuming diagnostic plots show no strong patterns.
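
The two routes to the missing slope in part (a) can be compared directly. This is a small check with illustrative variable names; all numbers are taken from the problem statement and the regression table.

```python
# Given in part (a) and in the regression table.
x_mean, y_mean = -0.0883, 3.9983  # mean beauty score, mean evaluation score
intercept = 4.010
t_value, std_error = 4.13, 0.0322  # beauty row

# Route 1: the least squares line passes through (x_mean, y_mean),
# so y_mean = intercept + slope * x_mean.
slope_from_means = (y_mean - intercept) / x_mean

# Route 2: t value = slope / SE, so slope = t value * SE.
slope_from_table = t_value * std_error

print(f"{slope_from_means:.3f}")  # 0.133
print(f"{slope_from_table:.3f}")  # 0.133
```

The two results differ slightly beyond the third decimal place because the table entries are rounded.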

Problem 13: Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the ground.

(a) Describe the relationship between volume and height of these trees.

(b) Describe the relationship between volume and diameter of these trees.

(c) Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.

Problem 8.45 Solution

Step 1 — Volume vs. height (a): The scatterplot shows a positive association between volume and height: taller trees tend to contain more timber. The relationship looks roughly linear but with substantial scatter — some tall trees have moderate volume and vice versa.

Step 2 — Volume vs. diameter (b): The scatterplot shows a strong, positive, and clearly curved association: volume increases sharply with diameter. Because volume scales roughly with $\text{diameter}^2$ (think of a cylinder: $V \propto d^2 h$), the relationship is not linear — it bends upward.

Step 3 — Which predictor to use (c): Diameter is the better predictor of timber volume because the diameter-volume relationship is much stronger than the height-volume relationship (higher $R^2$). However, the curvature in the diameter-volume plot means a simple linear regression may not be ideal; a transformation (e.g., predicting volume from $\text{diameter}^2$ or using $\log(\text{volume})$ vs. $\log(\text{diameter})$) would usually fit much better. If we insist on a single linear predictor, diameter still wins, but the residual plot should be checked to justify the linearity assumption.

Answer: (a) Moderate positive, roughly linear association. (b) Strong positive but clearly curved association. (c) Diameter — stronger relationship — though a transformation would improve the linearity.

Chapter 8

8.1 Line Fitting, Residuals, and Correlation

Learning Objectives

8.1.1 Fitting a Line to Data

8.1.2 Using Linear Regression to Predict Possum Head Lengths

8.1.3 Residuals

8.1.4 Describing Linear Relationships with Correlation

Section Summary

Problem Set

8.2 Fitting a Line by Least Squares Regression

Learning Objectives

8.2.1 An Objective Measure for Finding the Best Line

8.2.2 Finding the Least Squares Line

8.2.3 Interpreting the Coefficients of a Regression Line

8.2.4 Extrapolation Is Treacherous

8.2.5 Using R-squared to Describe the Strength of a Fit

8.2.6 Technology: Linear Correlation and Regression

Calculator Instructions

8.2.7 Types of Outliers in Linear Regression

8.2.8 Categorical Predictors with Two Levels (Special Topic)

Section Summary

Problem Set

8.3 Transformations for Skewed Data

Learning Objectives

8.3.1 Transformations to Reduce Skew

8.3.2 Transformations to Achieve Linearity

Section Summary

Problem Set

8.4 Inference for the Slope of a Regression Line

Learning Objectives

8.4.1 The Role of Inference for Regression Parameters

8.4.2 Conditions for the Least Squares Line

8.4.3 Constructing a Confidence Interval for the Slope of a Regression Line

8.4.4 Midterm Elections and Unemployment

8.4.5 Understanding Regression Output from Software

8.4.6 Technology: the \(t\)-test/Interval for the Slope

8.4.7 Which Inference Procedure to Use for Paired Data?

Section Summary

Problem Set — Section 8.4

Chapter Highlights

Problem Set — Chapter 8 Review