8.1 Line Fitting, Residuals, and Correlation
Student Enhanced Edition
In this section, we investigate bivariate data — data that pairs two numerical measurements for each subject. We examine criteria for identifying a linear model and introduce a new bivariate summary called correlation. We answer questions such as:
- How do we quantify the strength of the linear association between two numerical variables?
- What does it mean for two variables to have no association or to have a nonlinear association?
- Once we fit a model, how do we measure the error in the model's predictions?
Learning Objectives
- Distinguish between the data point \(y\) and the predicted value \(\hat{y}\) based on a model.
- Calculate a residual and draw a residual plot.
- Interpret the standard deviation of the residuals.
- Interpret the correlation coefficient and estimate it from a scatterplot.
- Know and apply the properties of the correlation coefficient.
8.1.1 Fitting a Line to Data
In everyday life, we use relationships between quantities all the time. A store charges a fixed price per item — that's a linear relationship. A mortgage calculator uses your loan amount to predict your monthly payment — another linear relationship. When data from the real world follows a roughly linear pattern, fitting a line gives us a compact, interpretable summary of that relationship. The line lets us predict new values and understand how much one variable changes when the other does.
Requests from twelve separate buyers were simultaneously placed with a trading company to purchase Target Corporation stock (ticker TGT, April 26th, 2012). We let \(x\) be the number of shares to purchase and \(y\) be the total cost. Because the cost is computed using a linear formula, the linear fit is perfect, and the equation for the line is:
$$ y = 5 + 57.49x $$
If we know the number of shares purchased, we can determine the cost based on this linear equation with no error. Each additional share of the stock cost \$57.49, and there was a \$5 fee for the transaction.
Figure 8.1: Total cost of a trade against number of shares purchased.
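Because the cost rule is an exact linear formula, it can be written as a one-line function. The following is a minimal Python sketch; the function name `trade_cost`, its parameter names, and the share counts in the loop are illustrative choices, not part of the original example.

```python
def trade_cost(shares, price_per_share=57.49, fee=5.00):
    """Total cost in dollars: a flat transaction fee plus a fixed price per share."""
    return fee + price_per_share * shares


# The share counts below are arbitrary illustrations, not the twelve actual orders.
for shares in (1, 10, 25):
    print(f"{shares:>3} shares -> ${trade_cost(shares):.2f}")
```

Every point produced this way lies exactly on the line, which is what "perfect linear fit" means here.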
Perfect linear relationships are unrealistic in almost any natural process. For example, if we took family income (\(x\)), this value would provide some useful information about how much financial support a college may offer a prospective student (\(y\)). However, the prediction would be far from perfect, since other factors play a role in financial support beyond a family's income.
It is rare for all of the data to fall perfectly on a straight line. Instead, it's more common for data to appear as a cloud of points, such as those shown in Figure 8.2. In each case, the data fall around a straight line, even if none of the observations fall exactly on the line. The first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between \(x\) and \(y\). The second plot shows an upward trend that, while evident, is not as strong as the first. The last plot shows a very weak downward trend in the data, so slight we can hardly notice it.
In each of these examples, we can consider how to draw a "best fit line". For instance, we might wonder: should we move the line up or down a little, or should we tilt it more or less? As we move forward in this chapter, we will learn different criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.
Figure 8.2: Three data sets where a linear model may be useful even though the data do not all fall exactly on the line.
Not every relationship is linear. Figure 8.3 shows data with a very strong relationship that curves — the kind you'd see with a pendulum's period versus its length, or population growth over time. Fitting a straight line to curved data is like trying to describe a circle with a ruler: you can draw the line, but the fit will be systematically wrong. Always plot your data first and inspect the shape before committing to a linear model.
We will also see examples in this chapter where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful. One such case is shown in Figure 8.3, where there is a very strong relationship between the variables even though the trend is not linear.
Figure 8.3: A linear model is not useful in this nonlinear case. These data are from an introductory physics experiment.
Source: [OpenIntro Statistics]
Looking at Figure 8.2, consider the third scatterplot (with the very weak downward trend). If you were forced to draw a best-fit line for that data, what would its slope look like — steeply negative, gently negative, or essentially flat? How does this connect to a correlation near zero?
Solution
The slope would be essentially flat — very close to zero — because the downward trend is so weak that it is barely detectable. A line with near-zero slope means that knowing \(x\) gives you almost no information about \(y\).
This connects directly to correlation: a best-fit line that is nearly flat relative to the vertical spread of the data corresponds to a correlation coefficient \(r\) close to zero. When \(r \approx 0\), the linear association between the two variables is negligible, even if a scatterplot shows a faint directional tendency.
8.1.2 Using Linear Regression to Predict Possum Head Lengths
When we build a linear model, we need to decide which variable drives the prediction and which is being predicted. The explanatory variable (also called the predictor or independent variable) is the one we control or observe first — it goes on the \(x\)-axis. The response variable (also called the outcome or dependent variable) is what we want to predict — it goes on the \(y\)-axis. Getting this distinction right matters: the regression line for predicting \(y\) from \(x\) is not the same as the line for predicting \(x\) from \(y\).
Brushtail possums are marsupials that live in Australia. Researchers captured 104 of these animals and took body measurements before releasing the animals back into the wild. We consider two of these measurements: the total length of each possum (from head to tail) and the length of each possum's head.
Figure 8.5 shows a scatterplot for the head length and total length of the 104 possums. Each point represents a single possum from the data.
Figure 8.4: The common brushtail possum of Australia.
Photo by Peter Firminger on Flickr: http://flic.kr/p/6aPTn CC BY 2.0 license.
Figure 8.5: A scatterplot showing head length against total length for 104 brushtail possums. A point representing a possum with head length 94.1 mm and total length 89 cm is highlighted.
The head and total length variables are associated: possums with an above-average total length also tend to have above-average head lengths. While the relationship is not perfectly linear, it could be helpful to partially explain the connection between these variables with a straight line.
We want to describe the relationship between the head length and total length variables in the possum data set using a line. In this example, we will use the total length, \(x\), to explain or predict a possum's head length, \(y\). When we use \(x\) to predict \(y\), we call \(x\) the explanatory variable or predictor variable, and we call \(y\) the response variable. We could fit the linear relationship by eye, as in Figure 8.6. The equation for this line is:
$$ \hat{y} = 41 + 0.59x $$
A "hat" on \(y\) is used to signify that this is a predicted value, not an observed value. We can use this line to discuss properties of possums. For instance, the equation predicts that a possum with a total length of 80 cm will have a head length of:
$$ \begin{aligned} \hat{y} &= 41 + 0.59(80) \\ &= 88.2 \end{aligned} $$
The value \(\hat{y}\) may be viewed as an average: the equation predicts that possums with a total length of 80 cm will have an average head length of 88.2 mm. The value \(\hat{y}\) is also a prediction: absent further information about an 80 cm possum, this is our best prediction for the head length of a single 80 cm possum.
The "hat" symbol over \(y\) is a universal signal in statistics that you are working with a model's prediction, not an actual data value. Think of it as the model's best guess. Real data points are \(y\); the model's output is \(\hat{y}\) (read "y-hat"). The difference between them — \(y - \hat{y}\) — is the residual, which measures how wrong the model was for that particular observation.
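The by-eye line \(\hat{y} = 41 + 0.59x\) can be turned into a small prediction function. This is only a sketch: the names `INTERCEPT_MM`, `SLOPE_MM_PER_CM`, and `predict_head_length` are hypothetical, chosen here to make the units explicit.

```python
# Coefficients of the by-eye line from the text: y-hat = 41 + 0.59x.
INTERCEPT_MM = 41.0        # value of y-hat (mm) when x = 0
SLOPE_MM_PER_CM = 0.59     # change in y-hat (mm) per extra cm of total length


def predict_head_length(total_length_cm):
    """Return y-hat, the predicted head length in mm, for a total length in cm."""
    return INTERCEPT_MM + SLOPE_MM_PER_CM * total_length_cm


print(f"{predict_head_length(80):.1f} mm")  # the 80 cm possum from the text: 88.2 mm
```

The function returns a model output \(\hat{y}\), never an observed \(y\); an actual 80 cm possum may have a head length above or below 88.2 mm.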
Source: [OpenIntro Statistics]
Using the possum regression equation \(\hat{y} = 41 + 0.59x\), predict the head length of a possum with a total length of 90 cm. Interpret your answer in context.
Solution
Step 1 — Plug in the given total length:
$$ \hat{y} = 41 + 0.59(90) $$Step 2 — Compute:
$$ \hat{y} = 41 + 53.1 = 94.1 \text{ mm} $$Answer: The model predicts that a possum with a total length of 90 cm will have a head length of approximately 94.1 mm. This is an average prediction: among all possums with total length 90 cm, the model expects the average head length to be about 94.1 mm.
8.1.3 Residuals
A model is never perfect. Every prediction has some error, and the residual is the name we give to that error for each individual data point. By studying residuals, we can: (1) judge whether our linear model is a good fit, (2) detect patterns the model is missing (like curvature), and (3) estimate how large our typical prediction error is. Residuals are the diagnostic lens that tells us whether our model is doing its job.
Residuals are the leftover variation in the response variable after fitting a model. Each observation will have a residual, and three of the residuals for the linear model we fit for the possum data are shown in Figure 8.6. If an observation is above the regression line, then its residual — the vertical distance from the observation to the line — is positive. Observations below the line have negative residuals. One goal in picking the right linear model is for these residuals to be as small as possible.
Figure 8.6: A reasonable linear model was fit to represent the relationship between head length and total length.
Let's look closer at the three residuals featured in Figure 8.6. The observation marked by an "\(\times\)" has a small, negative residual of about \(-1\); the observation marked by "+" has a large residual of about \(+7\); and the observation marked by "\(\triangle\)" has a moderate residual of about \(-4\). The size of a residual is usually discussed in terms of its absolute value. For example, the residual for "\(\triangle\)" is larger than that of "\(\times\)" because \(|-4|\) is larger than \(|-1|\).
Residual: Difference Between Observed and Expected
The residual for a particular observation \((x, y)\) is the difference between the observed response and the response we would predict based on the model:
$$ \text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y} $$
We typically identify \(\hat{y}\) by plugging \(x\) into the model.
Source: [OpenIntro Statistics]
The linear fit shown in Figure 8.6 is given as \(\hat{y} = 41 + 0.59x\). Based on this line, compute and interpret the residual of the observation \((77.0, 85.3)\). This observation is denoted by "\(\times\)" on the plot. Recall that \(x\) is the total length measured in cm and \(y\) is head length measured in mm.
Solution
Step 1 — Compute the predicted value based on the model:
$$ \begin{aligned} \hat{y} &= 41 + 0.59x \\ &= 41 + 0.59(77.0) \\ &= 86.4 \end{aligned} $$Step 2 — Compute the residual:
$$ \begin{aligned} \text{residual} &= y - \hat{y} \\ &= 85.3 - 86.4 \\ &= -1.1 \end{aligned} $$Answer: The residual for this point is \(-1.1\) mm, which is very close to the visual estimate of \(-1\) mm. For this particular possum with a total length of 77 cm, the model's prediction for its head length was 1.1 mm too high.
Source: [OpenIntro Statistics]
If a model underestimates an observation, will the residual be positive or negative? What about if it overestimates the observation?
Solution
Recall the formula: \(\text{residual} = y - \hat{y}\).
- Underestimate: The model predicts too small a value, so \(\hat{y} < y\). Then \(y - \hat{y} > 0\), so the residual is positive.
- Overestimate: The model predicts too large a value, so \(\hat{y} > y\). Then \(y - \hat{y} < 0\), so the residual is negative.
A helpful memory device: a positive residual means the actual value is above the regression line; a negative residual means it is below the line.
Source: [OpenIntro Statistics]
Compute the residual for the observation \((95.5, 94.0)\), denoted by "\(\triangle\)" in the figure, using the linear model: \(\hat{y} = 41 + 0.59x\).
Solution
Step 1 — Compute the predicted value:
$$ \hat{y} = 41 + 0.59(95.5) = 41 + 56.345 = 97.345 $$Step 2 — Compute the residual:
$$ \text{residual} = y - \hat{y} = 94.0 - 97.345 = -3.345 \approx -3.3 \text{ mm} $$Answer: The residual is approximately \(-3.3\) mm. The model overpredicted this possum's head length by about 3.3 mm. This is consistent with the visual estimate of about \(-4\) mm from Figure 8.6.
Residuals are helpful in evaluating how well a linear model fits a data set. We often display the residuals in a residual plot such as the one shown in Figure 8.7. Here, the residuals are calculated for each \(x\) value and plotted versus \(x\). For instance, the point \((85.0, 98.6)\) had a residual of 7.45, so in the residual plot it is placed at \((85.0, 7.45)\). Creating a residual plot is sort of like tipping the scatterplot over so the regression line is horizontal.
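The residuals for the three highlighted possums can be checked with a few lines of Python. This is a sketch under the assumption that the line is \(\hat{y} = 41 + 0.59x\); the helper names `predict_head_length` and `residual` are illustrative.

```python
def predict_head_length(total_length_cm):
    """Predicted head length (mm) from the by-eye line y-hat = 41 + 0.59x."""
    return 41.0 + 0.59 * total_length_cm


def residual(total_length_cm, head_length_mm):
    """Observed minus predicted: positive when the point lies above the line."""
    return head_length_mm - predict_head_length(total_length_cm)


# (total length in cm, head length in mm) for the three highlighted possums.
highlighted = [(77.0, 85.3), (85.0, 98.6), (95.5, 94.0)]
for x, y in highlighted:
    print(f"x = {x} cm, y = {y} mm, residual = {residual(x, y):+.2f} mm")
```

The first residual comes out as \(-1.13\) rather than \(-1.1\) because the worked solution rounds \(\hat{y}\) to 86.4 before subtracting. Plotting each residual against its \(x\) value gives the points of the residual plot.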
From the residual plot, we can better estimate the standard deviation of the residuals, often denoted by the letter \(s\). The standard deviation of the residuals tells us the typical size of the residuals. As such, it is a measure of the typical deviation between the \(y\) values and the model predictions — in other words, the typical prediction error using the model.
Source: [OpenIntro Statistics]
Estimate the standard deviation of the residuals for predicting head length from total length using the line \(\hat{y} = 41 + 0.59x\) using Figure 8.7. Also, interpret the quantity in context.
Figure 8.7: Left: Scatterplot of head length versus total length for 104 brushtail possums. Three particular points have been highlighted. Right: Residual plot for the model shown in left panel.
Solution
Step 1 — Use the 68-95 rule to estimate \(s\) from the residual plot:
To estimate graphically, we apply the approximate 68-95 rule for standard deviations. Approximately 2/3 of the points are within \(\pm 2.5\) and approximately 95% of the points are within \(\pm 5\).
Answer: A good estimate for the standard deviation of the residuals is \(s \approx 2.5\) mm. The typical error when predicting head length using this model is about 2.5 mm.
Standard Deviation of the Residuals
The standard deviation of the residuals, often denoted by the letter \(s\), tells us the typical error in the predictions using the regression model. It can be estimated from a residual plot.
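When the residuals themselves are available, \(s\) can be computed rather than estimated by eye. The sketch below makes two assumptions that are not in the text: the eight data points are invented purely for illustration, and the divisor \(n - 2\) is the usual convention for a fitted line with two coefficients.

```python
import math


def predict_head_length(total_length_cm):
    """Predicted head length (mm) from the by-eye line y-hat = 41 + 0.59x."""
    return 41.0 + 0.59 * total_length_cm


# Hypothetical (total length in cm, head length in mm) pairs, invented for illustration.
points = [(75, 87.0), (80, 86.0), (82, 91.5), (85, 90.0),
          (88, 96.0), (90, 92.5), (93, 97.0), (96, 95.0)]

residuals = [y - predict_head_length(x) for x, y in points]

# Assumed convention: divide by n - 2 because the line has two coefficients.
n = len(residuals)
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Fraction of residuals no larger than s in absolute value.
within_one_s = sum(abs(e) <= s for e in residuals) / n
print(f"s = {s:.2f} mm, share of residuals within one s: {within_one_s:.2f}")
```

The last line mirrors the graphical approach: a band of width \(\pm s\) around zero should contain roughly two thirds of the residuals.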
Source: [OpenIntro Statistics]
One purpose of residual plots is to identify characteristics or patterns still apparent in data after fitting a model. Figure 8.8 shows three scatterplots with linear models in the first row and residual plots in the second row. Can you identify any patterns remaining in the residuals?
Figure 8.8: Sample data with their best fitting lines (top row) and their corresponding residual plots (bottom row).
Solution
Dataset 1 (first column): The residuals show no obvious patterns. The residuals appear scattered randomly around the dashed line that represents 0. A linear model is appropriate here.
Dataset 2 (second column): There is a clear pattern in the residuals — they curve from positive to negative to positive (or vice versa). This reveals curvature in the scatterplot that a straight line cannot capture. We should not use a straight line to model these data; a more advanced technique (such as a polynomial or transformation) should be used.
Dataset 3 (third column): The residual plot shows no obvious patterns, and the upward trend in the original scatterplot is very slight. It is reasonable to try to fit a linear model to the data. However, it is unclear whether there is statistically significant evidence that the slope is different from zero — we will address this in Section 8.4.
Answer: Residual plot 2 reveals a nonlinear pattern, suggesting a linear model is inappropriate for that dataset. Residual plots 1 and 3 show no obvious patterns, consistent with linear models being reasonable.
Think of the residual plot as an X-ray for your linear model. If everything is healthy (the linear model fits well), the residuals will look like random scatter centered on zero: no curves, no fans, no systematic trends. If the residual plot shows a curve, the relationship in the data is curved and a straight line is missing systematic structure that a better model could capture. If the spread of residuals grows as \(x\) increases (a "fan" shape), the variability is not constant, which is another violation of linear model assumptions we'll study later.
Source: [OpenIntro Statistics]
Suppose you fit a linear model to a dataset and create a residual plot. The residuals form a U-shaped curve (positive, then negative, then positive again as \(x\) increases). What does this tell you about the appropriateness of your linear model?
Solution
A U-shaped pattern in the residuals is a strong signal that the true relationship between \(x\) and \(y\) is curved (nonlinear), not straight. The linear model is systematically underpredicting at low \(x\), overpredicting in the middle, and underpredicting again at high \(x\).
This means the linear model is inappropriate for these data. A better approach would be to transform one of the variables (e.g., use \(\log(x)\) or \(x^2\)) or use a nonlinear regression model. A random, patternless residual plot is what you want — any systematic shape indicates the linear model is missing something important.
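The U-shaped pattern is easy to reproduce. The sketch below fits a straight line to deliberately curved, made-up data (\(y = x^2\)) and prints the sign of each residual. It assumes the least squares criterion for choosing the line, which this section has not yet introduced; the function name `least_squares_line` is illustrative.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


xs = list(range(1, 10))
ys = [x ** 2 for x in xs]   # hypothetical curved data, invented for illustration

intercept, slope = least_squares_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# One character per observation, in order of increasing x.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)  # ++-----++
```

The run of positive, then negative, then positive residuals is exactly the systematic pattern a residual plot would display as a U.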
8.1.4 Describing Linear Relationships with Correlation
So far we have described linear relationships qualitatively: "strong," "moderate," "weak." To compare relationships precisely, we need a number. The correlation coefficient \(r\) is a standardized measure that captures both the direction of a linear relationship (the sign of \(r\)) and its strength (how close \(|r|\) is to 1). A single number between \(-1\) and \(1\) carries all of that information, and it works the same way regardless of the units of measurement.
When a linear relationship exists between two variables, we can quantify the strength and direction of the linear relation with the correlation coefficient, or just correlation for short. Figure 8.9 shows eight plots and their corresponding correlations.
Correlations recovered from the Figure 8.9 panels with a negative trend: \(r = -0.08\), \(r = -0.64\), \(r = -0.92\), \(r = -1.00\).
Figure 8.9: Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a low value in the other.
Only when the relationship is perfectly linear is the correlation either \(-1\) or \(1\). If the linear relationship is strong and positive, the correlation will be near \(+1\). If it is strong and negative, it will be near \(-1\). If there is no apparent linear relationship between the variables, then the correlation will be near zero.
Correlation Measures the Strength of a Linear Relationship
Correlation, which always takes values between \(-1\) and \(1\), describes the direction and strength of the linear relationship between two numerical variables. The strength can be strong, moderate, or weak.
We compute the correlation using a formula, just as we did with the sample mean and standard deviation. Formally, we can compute the correlation for observations \((x_1, y_1)\), \((x_2, y_2)\), ..., \((x_n, y_n)\) using the formula:
$$ r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) $$
where \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) are the sample means and standard deviations for each variable. This formula is rather complex, and we generally perform the calculations on a computer or calculator. The computation involves taking, for each point, the product of the Z-scores that correspond to the \(x\) and \(y\) values.
Notice that the formula subtracts the mean from each value and divides by the standard deviation, which is exactly how a Z-score is computed. By standardizing both variables before multiplying, we remove the effect of units. Whether height is measured in inches or centimeters, the Z-scores are identical, so the correlation is identical. Standardizing is also what keeps \(r\) between \(-1\) and \(1\): by the Cauchy-Schwarz inequality, the sum of the products of the Z-scores can never exceed \(n - 1\) in absolute value, so dividing by \(n - 1\) gives a result in that range.
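The Z-score form of the correlation formula translates almost word for word into code. The following sketch uses only Python's standard library; the function name `correlation` and the five data points are made up for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Average product of Z-scores, using the n - 1 divisor from the formula."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations (n - 1 divisor)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Small made-up data set, chosen so the arithmetic is easy to check by hand.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(correlation(xs, ys), 4))  # 0.7746
```

Points where both Z-scores share a sign contribute positive products and pull \(r\) toward \(+1\); points where the signs differ pull it toward \(-1\).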
Source: [OpenIntro Statistics]
Take a look at Figure 8.6. How would the correlation between head length and total body length of possums change if head length were measured in cm rather than mm? What if head length were measured in inches rather than mm?
Solution
Step 1 — Think about what a unit change does:
Changing the units of \(y\) corresponds to multiplying all the \(y\) values by a constant. For example, converting mm to cm multiplies every \(y\) by \(1/10\); converting mm to inches multiplies by approximately \(1/25.4\).
Step 2 — Check whether Z-scores change:
Multiplying all \(y\) values by a constant changes both the mean and the standard deviation by the same factor. The Z-score for each observation, \(\frac{y_i - \bar{y}}{s_y}\), remains unchanged because the constant cancels in the numerator and denominator.
Step 3 — Conclude about correlation:
Since the Z-scores are unchanged, the correlation formula gives the same result regardless of which unit system we use.
Answer: The correlation does not change. Whether head length is measured in mm, cm, or inches, the correlation between head length and total body length remains the same.
Changing Units of \(x\) and \(y\) Does Not Affect the Correlation
The correlation, \(r\), between two variables is not dependent upon the units in which the variables are recorded. Correlation itself has no units.
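A quick numerical check of unit invariance is sketched below. The measurements are hypothetical, not the actual possum data, and `correlation` is an illustrative helper that uses an algebraically equivalent form of the formula in which the \(n - 1\) factors cancel.

```python
import math


def correlation(xs, ys):
    """Correlation as Sxy / sqrt(Sxx * Syy), equivalent to the Z-score formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements, invented for illustration.
total_length_cm = [75, 80, 82, 85, 88, 90, 93, 96]
head_length_mm = [87.0, 86.0, 91.5, 90.0, 96.0, 92.5, 97.0, 95.0]

# The same head lengths expressed in other units.
head_length_cm = [y / 10 for y in head_length_mm]
head_length_in = [y / 25.4 for y in head_length_mm]

r_mm = correlation(total_length_cm, head_length_mm)
r_cm = correlation(total_length_cm, head_length_cm)
r_in = correlation(total_length_cm, head_length_in)
print(f"{r_mm:.6f} {r_cm:.6f} {r_in:.6f}")
```

All three printed values agree, up to floating-point rounding, because the conversion factor cancels inside every Z-score.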
Correlation is intended to quantify the strength of a linear trend. Nonlinear trends, even when strong, sometimes produce correlations that do not reflect the strength of the relationship; see three such examples in Figure 8.10.
Correlations shown in the Figure 8.10 panels, in order: \(r = -0.23\), \(r = 0.31\), \(r = 0.50\).
Figure 8.10: Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, the correlation is not very strong, and the relationship is not linear.
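An extreme version of this effect can be built by hand. In the sketch below, \(y\) is completely determined by \(x\) through \(y = x^2\), yet the correlation is zero because the relationship is symmetric rather than linear. The data are invented for illustration and are not the data in Figure 8.10; `correlation` is an illustrative helper.

```python
import math


def correlation(xs, ys):
    """Correlation as Sxy / sqrt(Sxx * Syy), equivalent to the Z-score formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]   # y is determined exactly by x, but not linearly

r = correlation(xs, ys)
print(f"r = {r:.3f}")
```

The positive products on the right half of the parabola cancel the negative products on the left half, so a correlation near zero says only that there is no linear trend, not that there is no relationship.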
Source: [OpenIntro Statistics]
It appears no straight line would fit any of the datasets represented in Figure 8.10. Try drawing nonlinear curves on each plot. Once you create a curve for each, describe what is important in your fit.
Solution
For each dataset in Figure 8.10, the key is to identify the shape of the actual relationship:
- First plot (\(r = -0.23\)): The relationship may follow a U-shape (parabola) or a decreasing curve that levels off. A curve that starts high, dips down, and possibly rises again would capture the pattern.
- Second plot (\(r = 0.31\)): The relationship might follow an increasing curve that levels off (like a square root or logarithmic shape).
- Third plot (\(r = 0.50\)): The relationship might follow a curved increasing trend, possibly exponential or polynomial.
What is important: The curve should pass through the center of the data cloud at each \(x\) value, capturing the systematic trend. Unlike a straight line, the curve can change direction or rate of change. The correlation coefficient failed to capture these relationships precisely because it only measures linear association — a curve that matches the data shape would show a much stronger fit than the weak \(r\) values suggest.
Source: [OpenIntro Statistics]
Consider the four scatterplots in Figure 8.11. In which scatterplot is the correlation between \(x\) and \(y\) the strongest?
Solution
All four data sets have the same correlation, \(r = 0.816\) to three decimal places, as well as essentially the same equation for the best-fit line! This group of four graphs, known as Anscombe's Quartet, reminds us that knowing the value of the correlation does not tell us what the corresponding scatterplot looks like. It is always important to first graph the data.
Answer: No single scatterplot has a stronger correlation — all four have identical correlations of \(r = 0.816\). The takeaway is that correlation alone is an incomplete summary: you must always visualize your data.
Figure 8.11: Four scatterplots from Desmos with best-fit line drawn in. Investigate Anscombe's Quartet: https://www.desmos.com/calculator/paknt6oneh.
Anscombe's Quartet is one of the most famous cautionary tales in statistics. Four very different datasets (one with a roughly linear trend and ordinary scatter, one with a curved relationship, one with a single outlier pulling the line away from an otherwise tight linear pattern, and one in which all but one point share the same \(x\) value) all produce the same correlation of 0.816 and essentially the same regression equation. This is why statisticians insist: never trust a summary statistic alone. A scatterplot takes seconds to create and can save you from badly misunderstanding your data.
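The shared correlation can be verified directly. The sketch below uses the data values as they are commonly published for Anscombe's Quartet; they were transcribed by hand for this example, so treat them as an assumption and check them against a trusted source before relying on them. The helper `correlation` is illustrative.

```python
import math


def correlation(xs, ys):
    """Correlation as Sxy / sqrt(Sxx * Syy), equivalent to the Z-score formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Datasets I to III share the same x values; dataset IV has its own.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: correlation(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

The numbers alone cannot distinguish the four datasets; only the scatterplots reveal how different they are.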
Source: [OpenIntro Statistics]
A researcher computes a correlation of \(r = -0.85\) between outdoor temperature (in °F) and hot chocolate sales at a café. She then re-records the temperature in °C. Will the correlation change? Also, describe what \(r = -0.85\) tells you about the relationship.
Solution
Part 1 — Unit change effect:
Converting °F to °C is a linear transformation: \(C = \frac{5}{9}(F - 32)\). Each temperature is shifted by a constant and then multiplied by a positive constant. The shift moves every value and the mean by the same amount, so it cancels in the numerator of the Z-score, \(x_i - \bar{x}\), and it leaves the standard deviation unchanged. The positive factor \(\frac{5}{9}\) scales both the numerator and the standard deviation, so it cancels in the ratio. The Z-scores are therefore unchanged, and the correlation does not change.
Part 2 — Interpreting \(r = -0.85\):
An \(r\) of \(-0.85\) indicates a strong, negative linear relationship between temperature and hot chocolate sales. As outdoor temperature increases, hot chocolate sales tend to decrease substantially. The negative sign tells us the direction (as one goes up, the other goes down), and the magnitude of 0.85 (close to 1) tells us the relationship is strong — points cluster tightly around a downward-sloping line.
Section Summary
- In Chapter 2 we introduced a bivariate display called a scatterplot, which shows the relationship between two numerical variables. When we use \(x\) to predict \(y\), we call \(x\) the explanatory variable or predictor variable, and we call \(y\) the response variable.
- A linear model for bivariate numerical data can be useful for prediction when the association between the variables follows a constant, linear trend. Linear models should not be used if the trend between the variables is curved.
- When we write a linear model, we use \(\hat{y}\) to indicate that it is the model's prediction. The value \(\hat{y}\) can be understood as a prediction for \(y\) based on a given \(x\), or as an average of the \(y\) values for a given \(x\).
- The residual is the error between the true value and the modeled value, computed as \(y - \hat{y}\). The order of the difference matters, and the sign of the residual tells us if the model overpredicted or underpredicted a particular data point.
- The symbol \(s\) in a linear model is used to denote the standard deviation of the residuals, which measures the typical prediction error of the model.
- A residual plot is a scatterplot with the residuals on the vertical axis. The residuals are often plotted against \(x\) on the horizontal axis, but they can also be plotted against \(y\), \(\hat{y}\), or other variables. Two important uses of a residual plot are:
  - Residual plots help us see patterns in the data that may not have been apparent in the scatterplot.
  - The standard deviation of the residuals is easier to estimate from a residual plot than from the original scatterplot.
- Correlation, denoted with the letter \(r\), measures the strength and direction of a linear relationship. Important facts:
  - The value of \(r\) is always between \(-1\) and \(1\), inclusive, with \(r = -1\) indicating a perfect negative relationship and \(r = 1\) indicating a perfect positive relationship.
  - \(r = 0\) indicates no linear association between the variables, though there may well be a quadratic or other type of association.
  - Just like Z-scores, the correlation has no units. Changing the units in which \(x\) or \(y\) are measured does not affect the correlation.
  - Correlation is sensitive to outliers. Adding or removing a single point can have a big effect on the correlation.
  - Correlation is not causation. Even a very strong correlation cannot prove causation; only a well-designed, controlled, randomized experiment can prove causation.
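The sensitivity of correlation to outliers can be seen with a small experiment. The data below are invented for illustration: eight points that follow a tight upward trend, plus one extra point far below that trend. The helper `correlation` is an illustrative implementation, not a library function.

```python
import math


def correlation(xs, ys):
    """Correlation as Sxy / sqrt(Sxx * Syy), equivalent to the Z-score formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a tight upward linear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 9.2]
r_clean = correlation(xs, ys)

# Add a single hypothetical outlier at (9, 0.5), far below the trend.
r_with_outlier = correlation(xs + [9], ys + [0.5])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_with_outlier:.2f}")
```

One point out of nine is enough to turn a very strong correlation into a weak one, which is another reason to look at the scatterplot before trusting \(r\).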
Problem Set 8.1
Source: [OpenIntro Statistics]
Problem 1: Visualize the residuals. The scatterplots shown below each have a superimposed regression line. If we were to construct a residual plot (residuals versus \(x\)) for each, describe what those plots would look like.
(a)
(b)
Problem 8.1 Solution
Step 1 — Recall what a residual plot shows: A residual plot places the \(x\)-values on the horizontal axis and the residuals \((y - \hat{y})\) on the vertical axis. The key diagnostic is whether residuals appear randomly scattered around zero or show a systematic pattern.
Step 2 — Interpret plot (a): Scatterplot (a) shows data with a clear linear trend and the regression line fitting reasonably well. For such data the residual plot would show random scatter around zero — points scattered above and below the horizontal reference line with no discernible pattern. This indicates the linear model is appropriate.
Step 3 — Interpret plot (b): Scatterplot (b) shows data with a curved (nonlinear) relationship even though a straight regression line is superimposed. When a line is fit to curved data the residuals are not random: positive at low \(x\), negative in the middle, then positive again at high \(x\) (or vice versa). The residual plot would show a curved pattern (U-shape or arc), indicating a linear model is not appropriate.
Answer:
- (a) Random scatter around zero, consistent with a linear model being appropriate.
- (b) A curved (nonlinear) pattern such as a U-shape, indicating the linear model is a poor fit.
Problem 2: Trends in the residuals. Shown below are two plots of residuals remaining after fitting a linear model to two different sets of data. Describe important features and determine if a linear model would be appropriate for these data. Explain your reasoning.
(a)
(b)
Problem 8.2 Solution
Step 1 — Review what to look for in a residual plot: A well-fitting linear model produces residuals scattered randomly around zero with no discernible pattern — no curves, no trends, no fan shapes.
Step 2 — Analyze residual plot (a): Residual plot (a) shows residuals that fan out (increase in spread) as \(x\) increases while remaining centered near zero. This "fan shape" (heteroscedasticity) means the variability of errors is not constant. There is no clear curved trend, so a linear model may describe the mean trend, but the non-constant variance violates standard regression assumptions.
Step 3 — Analyze residual plot (b): Residual plot (b) shows a clear curved pattern — residuals curve systematically above and below zero as \(x\) changes. This indicates the true relationship is nonlinear. A linear model is not appropriate for these data.
Answer:
- (a) Fan shape (non-constant variance), no curved trend. A linear model captures the mean but violates constant-variance assumptions.
- (b) Clear curved pattern indicates a nonlinear relationship; a linear model is not appropriate.
Problem 3: Identify relationships, Part I. For each of the six plots, identify the strength of the relationship (e.g. weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.
(a)
(b)
(c)
(f)
Problem 8.3 Solution
Step 1 — Establish the evaluation framework: For each plot assess: (1) how tightly points cluster around a line (weak/moderate/strong), (2) the direction (positive/negative), and (3) whether the pattern is linear or curved.
Step 2 — Evaluate each of the six plots:
(a) Strong, positive linear relationship — points cluster tightly around an upward-sloping line. Linear model: Reasonable.
(b) Moderate, positive linear relationship — clear upward trend with more scatter. Linear model: Reasonable.
(c) Weak relationship — barely visible trend with substantial scatter. Linear model: Marginally reasonable, low predictive value.
(d) Strong, curved (nonlinear) relationship — even though the association is strong the pattern is not straight. Linear model: Not reasonable.
(e) Weak to no relationship — scattered cloud with no directional trend. Linear model: Not useful.
(f) Moderate to strong negative linear relationship — clear downward trend with moderate scatter. Linear model: Reasonable.
Answer:
| Plot | Strength | Linear Model Appropriate? |
|---|---|---|
| (a) | Strong | Yes |
| (b) | Moderate | Yes |
| (c) | Weak | Marginally, with caution |
| (d) | Strong (nonlinear) | No |
| (e) | Weak/None | No |
| (f) | Moderate–Strong | Yes |
Problem 4: Identify relationships, Part II. For each of the six plots, identify the strength of the relationship (e.g. weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.
Problem 8.4 Solution
Step 1 — Apply the same evaluation framework as Problem 3: For each plot identify strength, direction, and linearity.
Step 2 — Evaluate each of the six plots:
(a) Moderate positive linear relationship — upward trend with noticeable scatter. Linear model: Reasonable.
(b) Strong negative linear relationship — points cluster tightly around a downward-sloping line. Linear model: Reasonable.
(c) Weak to no relationship — scattered cloud with no clear pattern. Linear model: Not useful.
(d) Moderate to strong positive relationship with a fan-shaped spread (variance increases with \(x\)). The mean trend appears linear, but non-constant variance is a concern. Linear model: Reasonable for the trend, with caution.
(e) Strong curved (nonlinear) relationship — data curve rather than follow a straight line. Linear model: Not appropriate.
(f) Weak to moderate positive linear relationship. Linear model: Marginally reasonable.
Answer:
| Plot | Strength | Linear Model Appropriate? |
|---|---|---|
| (a) | Moderate | Yes |
| (b) | Strong | Yes |
| (c) | Weak/None | No |
| (d) | Moderate–Strong | Yes (with caution) |
| (e) | Strong (nonlinear) | No |
| (f) | Weak–Moderate | Marginally yes |
Problem 5: Exams and grades. The two scatterplots below show the relationship between final and mid-semester exam grades recorded during several years for a Statistics course at a university.
a) Based on these graphs, which of the two exams has the stronger correlation with the final exam grade? Explain.
b) Can you think of a reason why the correlation between the exam you chose in part (a) and the final exam is higher?
Problem 8.5 Solution
Step 1 — Identify which scatterplot shows a stronger linear relationship: The plot with less scatter relative to the overall trend has a higher correlation. The second scatterplot (second mid-semester exam vs. final) typically shows points clustering more tightly around the linear trend.
Step 2 — Answer part (a): The second mid-semester exam has the stronger correlation with the final exam grade, because its scatterplot shows points hugging the linear trend more closely — less scatter relative to the range.
Step 3 — Answer part (b): The second mid-semester exam is taken later in the course, after students have developed more consistent study habits and stabilized their performance level. Performance at that point better reflects cumulative understanding — which is precisely what the final exam measures. The first exam may be affected by students still adjusting to the course.
Answer:
- (a) The second mid-semester exam has the stronger correlation with the final exam.
- (b) The second exam is taken closer in time to the final, covers overlapping material, and reflects more settled student performance patterns.
Problem 6: Spouses, Part I. The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married women in Britain, recording the ages (in years) and heights (converted here to inches) of the women and their spouses. The scatterplot on the left shows the spouse's age plotted against the woman's age, and the plot on the right shows the spouse's height plotted against the woman's height.
a) Describe the relationship between the ages of women in the sample and their spouses' ages.
b) Describe the relationship between the heights of women in the sample and their spouses' heights.
c) Which plot shows a stronger correlation? Explain your reasoning.
d) Data on heights were originally collected in centimeters, and then converted to inches. Does this conversion affect the correlation between heights of women in the sample and their spouses' heights?
Problem 8.6 Solution
Step 1 — Describe the age relationship (part a): The scatterplot of spouse's age versus woman's age shows a strong, positive linear relationship. As a woman's age increases her spouse's age tends to increase by a similar amount, with relatively little scatter. People tend to marry others close to their own age.
Step 2 — Describe the height relationship (part b): The scatterplot of spouse's height versus woman's height shows a moderate, positive linear relationship. Taller women tend to have taller spouses, but with considerably more scatter than the age plot.
Step 3 — Compare correlations (part c): The age plot shows a stronger correlation. Age matching in marriage is a tight pattern (most couples are within a few years of each other), while height has much more individual variability and is less deliberately matched between partners.
Step 4 — Effect of unit conversion on correlation (part d): Converting heights from centimeters to inches multiplies all height values by \(1/2.54\) — a linear transformation. This changes the mean and standard deviation by the same factor, leaving Z-scores unchanged. Therefore, the correlation is unchanged by this conversion.
Answer:
- (a) Strong positive linear relationship between wives' and spouses' ages.
- (b) Moderate positive linear relationship between wives' and spouses' heights, with more scatter.
- (c) The age plot shows a stronger correlation: age matching between partners is tighter than height matching.
- (d) No. Converting cm to inches is a linear transformation that does not affect \(r\).
Problem 7: Match the correlation, Part I. Match each correlation to the corresponding scatterplot.
a) \(r = -0.7\)
b) \(r = 0.45\)
c) \(r = 0.06\)
d) \(r = 0.92\)
(1)
(2)
(3)
Problem 8.7 Solution
Step 1 — Recall how to read correlation from a scatterplot:
- Sign of \(r\): positive = upward slope; negative = downward slope.
- Magnitude of \(|r|\): close to 1 means tight cluster; close to 0 means loose, scattered cloud.
Step 2 — Rank the correlations:
- \(r = 0.92\): strong positive (tight upward cluster)
- \(r = -0.7\): moderately strong negative (downward trend with moderate scatter)
- \(r = 0.45\): moderate positive (upward trend with noticeable scatter)
- \(r = 0.06\): near zero (no linear trend)
Step 3 — Match to the four scatterplots:
- Plot (1): Moderate positive upward trend → \(r = 0.45\)
- Plot (2): Strong positive, points tightly clustered → \(r = 0.92\)
- Plot (3): Scattered cloud, no clear direction → \(r = 0.06\)
- Plot (4): Moderately strong downward trend → \(r = -0.7\)
Answer:
- (1) → \(r = 0.45\)
- (2) → \(r = 0.92\)
- (3) → \(r = 0.06\)
- (4) → \(r = -0.7\)
Problem 8: Match the correlation, Part II. Match each correlation to the corresponding scatterplot.
a) \(r = 0.49\)
b) \(r = -0.48\)
c) \(r = -0.03\)
d) \(r = -0.85\)
(3)
(4)
Problem 8.8 Solution
Step 1 — Apply the matching strategy:
- Sign of \(r\): positive = upward; negative = downward.
- Magnitude: larger \(|r|\) = tighter cluster.
Step 2 — Rank the correlations:
- \(r = -0.85\): strong negative (tight downward)
- \(r = 0.49\): moderate positive
- \(r = -0.48\): moderate negative (similar magnitude to 0.49 but downward)
- \(r = -0.03\): near zero (no trend)
Step 3 — Match to the four scatterplots:
- Plot (1): Moderate positive upward trend → \(r = 0.49\)
- Plot (2): Scattered cloud, no direction → \(r = -0.03\)
- Plot (3): Moderate negative downward trend → \(r = -0.48\)
- Plot (4): Strong negative, points tightly clustered on downward line → \(r = -0.85\)
Answer:
- (1) → \(r = 0.49\)
- (2) → \(r = -0.03\)
- (3) → \(r = -0.48\)
- (4) → \(r = -0.85\)
Problem 9: Speed and height. A survey asked 1,302 UCLA students about their height, the fastest speed they have ever driven, and their gender. The scatterplot on the left displays the relationship between height and fastest speed, and the scatterplot on the right displays the breakdown by gender in this relationship.
a) Describe the relationship between height and fastest speed.
b) Why do you think these variables are positively associated?
c) What role does gender play in the relationship between height and fastest driving speed?
Problem 8.9 Solution
Step 1 — Describe the overall relationship (part a): The left scatterplot shows a weak to moderate positive relationship between height and fastest speed. Taller individuals tend to report higher maximum driving speeds on average, but with substantial scatter — the association is real but not tight.
Step 2 — Explain the positive association (part b): The positive association is largely driven by a lurking variable: gender. Men tend to be taller than women on average, and men also tend to report higher maximum driving speeds. Both height and fastest speed are correlated with gender, creating an apparent positive association between height and speed even though height does not directly cause faster driving.
Step 3 — Role of gender (part c): The right scatterplot separates male and female students. Within each gender group the relationship between height and fastest speed is considerably weaker than in the combined plot. The overall positive trend is largely a result of males clustering at higher heights and higher speeds while females cluster at lower heights and lower speeds. This is a classic example of a confounding variable.
Answer:
- (a) Weak to moderate positive relationship: taller students tend to report higher maximum speeds, with substantial scatter.
- (b) Gender drives both variables: men are taller and report higher speeds on average.
- (c) Gender is a confounding variable; within each gender group the height-speed relationship is much weaker.
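A small numerical sketch can show how a grouping variable produces an overall association that is absent within each group. The heights and speeds below are invented for illustration and are not the UCLA survey data; they are constructed so that height and speed are unrelated within each group, while one group is both taller and faster on average.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation; the n - 1 divisors cancel, so raw sums suffice."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: heights in inches, fastest speeds in mph
female_height = [62, 63, 64, 65, 66]
female_speed = [80, 85, 75, 85, 80]
male_height = [68, 69, 70, 71, 72]
male_speed = [100, 105, 95, 105, 100]

r_female = correlation(female_height, female_speed)
r_male = correlation(male_height, male_speed)
r_pooled = correlation(female_height + male_height, female_speed + male_speed)
print(r_female, r_male, round(r_pooled, 2))
```

In this constructed example the within-group correlations are zero, yet the pooled correlation is strongly positive, purely because the two groups differ in both height and speed.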
Problem 10: Guess the correlation. Eduardo and Rosie are both collecting data on number of rainy days in a year and the total rainfall for the year. Eduardo records rainfall in inches and Rosie in centimeters. How will their correlation coefficients compare?
Problem 8.10 Solution
Step 1 — Identify the unit conversion: Converting rainfall from inches to centimeters is a linear transformation: \(\text{cm} = 2.54 \times \text{inches}\).
Step 2 — Effect of linear transformations on correlation: Correlation is computed from Z-scores: $$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
Multiplying all values of a variable by a positive constant \(c\) scales both the mean and standard deviation by \(c\), leaving Z-scores unchanged: $$\frac{c \cdot x_i - c \cdot \bar{x}}{c \cdot s_x} = \frac{x_i - \bar{x}}{s_x}$$
Step 3 — Conclude: The constant cancels in the Z-score, so the correlation coefficients computed by Eduardo and Rosie are exactly equal.
Answer: Eduardo's and Rosie's correlation coefficients will be identical. Linear unit conversions do not affect the correlation coefficient.
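The cancellation can be checked numerically with the Z-score formula from Step 2. The sketch below uses made-up yearly records (not data from the exercise), and the helper names `z_scores` and `correlation` are hypothetical.

```python
from math import sqrt


def z_scores(values):
    """Standardize values using the sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def correlation(xs, ys):
    """r = (1 / (n - 1)) * sum of the products of paired Z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Made-up yearly records, for illustration only
rainy_days = [90, 110, 75, 130, 100, 120]
rain_inches = [30.1, 38.4, 25.0, 44.2, 33.5, 41.0]
rain_cm = [2.54 * v for v in rain_inches]  # Rosie's units

r_eduardo = correlation(rainy_days, rain_inches)
r_rosie = correlation(rainy_days, rain_cm)
print(r_eduardo, r_rosie)
```

The two printed values agree up to floating-point rounding, as the algebra predicts.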
Problem 11: The Coast Starlight, Part I. The Coast Starlight Amtrak train runs from Seattle to Los Angeles. The scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).
a) Describe the relationship between distance and travel time.
b) How would the relationship change if travel time was instead measured in hours, and distance was instead measured in kilometers?
c) The correlation between travel time (in minutes) and distance (in miles) is \(r = 0.636\). What is the correlation between travel time (in hours) and distance (in kilometers)?
Problem 8.11 Solution
Step 1 — Describe the relationship (part a): The scatterplot shows a positive relationship between distance and travel time: stops farther apart require more travel time. The trend is roughly linear with some scatter (due to speed variations, terrain, and station stops). The relationship is moderate to strong and positive, consistent with \(r = 0.636\).
Step 2 — Effect of unit changes on the relationship (part b): Converting travel time from minutes to hours divides all time values by 60; converting distance from miles to kilometers multiplies all distance values by approximately 1.609. Both are linear transformations by positive constants. The direction and shape of the scatterplot are unchanged — the positive linear trend remains, with only the axis scales differing.
Step 3 — Effect on correlation (part c): Linear transformations by positive constants cancel in the Z-score formula, so \(r\) is unchanged: $$r_{\text{km, hours}} = r_{\text{miles, minutes}} = 0.636$$
Answer:
- (a) Positive, roughly linear relationship: longer inter-stop distances take more time, with moderate scatter.
- (b) Shape and direction unchanged; only axis scales differ.
- (c) \(r = 0.636\), unchanged by linear unit conversions.
Problem 12: Crawling babies, Part I. A study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. Infants born during the study year were split into twelve groups, one for each birth month. We consider the average crawling age of babies in each group against the average temperature when the babies are six months old (that's when babies often begin trying to crawl). Temperature is measured in degrees Fahrenheit (°F) and age is measured in weeks.
a) Describe the relationship between temperature and crawling age.
b) How would the relationship change if temperature was measured in degrees Celsius (°C) and age was measured in months?
c) The correlation between temperature in °F and age in weeks was \(r = -0.70\). If we converted the temperature to °C and age to months, what would the correlation be?
Problem 8.12 Solution
Step 1 — Describe the relationship (part a): The scatterplot shows a moderate to strong negative relationship between temperature and average crawling age. Babies whose six-month mark falls in warmer months tend to begin crawling at a younger age (fewer weeks), supporting the hypothesis that restrictive cold-weather clothing delays crawling development. The relationship is roughly linear with \(r = -0.70\).
Step 2 — Effect of unit changes on the relationship (part b): Converting \(\text{°F}\) to \(\text{°C}\) uses \(C = \frac{5}{9}(F - 32)\) — a linear transformation. Converting weeks to months divides by approximately 4.345 — also a linear transformation. Both operations preserve the shape and direction of the relationship; only axis scales change.
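Step 3 — Effect on correlation (part c): Each conversion is a linear transformation with a positive scaling factor. The scaling factors cancel in the Z-score formula, and the shift of 32 in the temperature conversion cancels when the mean is subtracted, so \(r\) is unchanged: $$r_{\text{°C, months}} = r_{\text{°F, weeks}} = -0.70$$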
Answer:
- (a) Moderate to strong negative linear relationship: warmer temperatures are associated with earlier crawling onset.
- (b) Shape and direction unchanged; only axis scales differ.
- (c) \(r = -0.70\), unchanged.
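The reason a conversion with both a shift and a scale leaves \(r\) alone can be written out for a general conversion \(x' = a + bx\) with \(b > 0\). The symbols \(a\) and \(b\) are introduced here for illustration; for °F to °C, \(a = -160/9\) and \(b = 5/9\). The mean and standard deviation transform as
$$\bar{x}' = a + b\bar{x}, \qquad s_{x'} = b\,s_x,$$
so each Z-score is unchanged:
$$\frac{x_i' - \bar{x}'}{s_{x'}} = \frac{(a + b x_i) - (a + b\bar{x})}{b\,s_x} = \frac{x_i - \bar{x}}{s_x}$$
Because \(r\) is built entirely from Z-scores, it is the same before and after the conversion.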
Problem 13: Body measurements, Part I. Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender for 507 physically active individuals. The scatterplot below shows the relationship between height and shoulder girth (over deltoid muscles), both measured in centimeters.
a) Describe the relationship between shoulder girth and height.
b) How would the relationship change if shoulder girth was measured in inches while the units of height remained in centimeters?
Problem 8.13 Solution
Step 1 — Describe the relationship (part a): The scatterplot shows a moderate to strong positive linear relationship between shoulder girth and height. Individuals with larger shoulder girths tend to be taller, with a moderate amount of scatter. This is biologically expected — taller people generally have larger overall body dimensions.
Step 2 — Effect of converting shoulder girth to inches (part b): Converting from centimeters to inches divides all shoulder girth values by 2.54 — a linear transformation by a positive constant. Z-scores for shoulder girth are unchanged (the constant cancels in numerator and denominator), so the correlation is unchanged. The scatterplot retains the same positive linear pattern; only the horizontal axis scale changes.
Answer:
- (a) Moderate to strong positive linear relationship: larger shoulder girth is associated with greater height.
- (b) Converting to inches does not change the relationship or correlation; only the axis scale differs.
Problem 14: Body measurements, Part II. The scatterplot below shows the relationship between weight measured in kilograms and hip girth measured in centimeters from the data described in Problem 13.
a) Describe the relationship between hip girth and weight.
b) How would the relationship change if weight was measured in pounds while the units for hip girth remained in centimeters?
Problem 8.14 Solution
Step 1 — Describe the relationship (part a): The scatterplot shows a strong positive linear relationship between hip girth and weight. Individuals with larger hip girths tend to weigh considerably more. The relationship appears stronger than the shoulder-height relationship from Problem 13, likely because hip girth directly reflects body mass distribution.
Step 2 — Effect of converting weight to pounds (part b): Converting weight from kilograms to pounds multiplies all weight values by approximately 2.205 — a linear transformation by a positive constant. Z-scores for weight are unchanged (the constant cancels), so the correlation is unchanged. The scatterplot retains the same strong positive pattern; only the vertical axis scale changes.
Answer:
- (a) Strong positive linear relationship: larger hip girth is associated with greater weight.
- (b) Converting to pounds does not change the relationship or correlation; only the axis scale differs.
Problem 15: Correlation, Part I. What would be the correlation between the ages of a set of women and their spouses if the set of women always married someone who was
a) 3 years younger than themselves?
b) 2 years older than themselves?
c) half as old as themselves?
Problem 8.15 Solution
Step 1 — Set up the relationship: Let \(x\) = woman's age and \(y\) = spouse's age. We find \(r\) between \(x\) and \(y\) in each scenario.
Step 2 — Part (a): Spouse is always 3 years younger: $$y = x - 3$$ A perfect linear relationship with slope \(+1\) and intercept \(-3\). No scatter — every pair lies exactly on the line. Since the slope is positive and the fit is perfect: \(r = +1\).
Step 3 — Part (b): Spouse is always 2 years older: $$y = x + 2$$ A perfect linear relationship with slope \(+1\), no scatter: \(r = +1\).
Step 4 — Part (c): Spouse is always half as old: $$y = 0.5x$$ A perfect linear relationship with slope \(+0.5 > 0\), no scatter. The correlation measures direction and linearity, not slope magnitude: \(r = +1\).
Answer:
- (a) \(r = +1\)
- (b) \(r = +1\)
- (c) \(r = +1\)
All three produce perfect positive linear relationships, so \(r = +1\) in every case.
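These three results can be confirmed numerically. The sketch below uses an arbitrary, made-up list of women's ages and builds the spouse's age from each rule; any set of ages that are not all equal would serve.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation computed from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical ages, chosen only for illustration
womens_ages = [25, 32, 41, 48, 56, 63]
scenarios = {
    "3 years younger": [x - 3 for x in womens_ages],
    "2 years older": [x + 2 for x in womens_ages],
    "half as old": [0.5 * x for x in womens_ages],
}

results = {name: correlation(womens_ages, spouse) for name, spouse in scenarios.items()}
for name, r in results.items():
    print(name, r)
```

Each correlation equals 1 up to floating-point rounding, regardless of the slope or intercept of the rule.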
Problem 16: Correlation, Part II. What would be the correlation between the annual salaries of males and females at a company if for a certain type of position men always made
a) \$5,000 more than women?
b) 25% more than women?
c) 15% less than women?
Problem 8.16 Solution
Step 1 — Set up the relationship: Let \(x\) = female salary and \(y\) = male salary for each matched position. We find \(r\) between \(x\) and \(y\).
Step 2 — Part (a): Men always make \$5,000 more: $$y = x + 5{,}000$$ A perfect linear relationship with slope \(+1\), no scatter: \(r = +1\).
Step 3 — Part (b): Men always make 25% more: $$y = 1.25x$$ A perfect linear relationship with slope \(+1.25 > 0\), no scatter: \(r = +1\).
Step 4 — Part (c): Men always make 15% less: $$y = 0.85x$$ A perfect linear relationship with slope \(+0.85 > 0\), no scatter. Even though men earn less, the slope is still positive — as women's salary increases men's salary increases proportionally: \(r = +1\).
Answer:
- (a) \(r = +1\)
- (b) \(r = +1\)
- (c) \(r = +1\)
All three are perfect positive linear relationships, so \(r = +1\) in every case.
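The general fact behind these answers can be derived from the Z-score formula for \(r\). Suppose every pair satisfies \(y_i = a + b x_i\) exactly, with \(b \neq 0\) (the symbols \(a\) and \(b\) are introduced here for illustration). Then \(\bar{y} = a + b\bar{x}\) and \(s_y = |b|\,s_x\), so
$$\frac{y_i - \bar{y}}{s_y} = \frac{b\,(x_i - \bar{x})}{|b|\,s_x} = \frac{b}{|b|} \cdot \frac{x_i - \bar{x}}{s_x}$$
and therefore
$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{b}{|b|} \cdot \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)^2 = \frac{b}{|b|}$$
since \(\sum (x_i - \bar{x})^2 = (n-1)\,s_x^2\). The slopes in this problem (\(1\), \(1.25\), and \(0.85\)) are all positive, so \(r = +1\) in each part; a negative slope would give \(r = -1\).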
Appendix — Additional Examples and Practice
Redundant alternatives or material displaced by a clearer counterpart. Browse here for extra perspective; the main flow above is the primary path.
Source: [OpenIntro Statistics]
Use the table in Figure 8.20 to determine the p-value for the hypothesis test.
Solution
The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: 0.2961. That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections.
Source: [OpenIntro Statistics]
The intercept and slope estimates for the Elmhurst data are \(b_0 = 24{,}319\) and \(b_1 = -0.0431\). What do these numbers really mean?
Solution
Interpreting the slope \(b_1 = -0.0431\):
For each additional \$1,000 of family income, we would expect a student to receive a net difference of
$$ \$1{,}000 \times (-0.0431) = -\$43.10 $$ in aid on average, that is, \$43.10 less. Note that a higher family income corresponds to less aid because the coefficient of family income is negative in the model. We must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational.
Interpreting the intercept \(b_0 = 24{,}319\):
The estimated intercept \(b_0 = 24{,}319\) describes the average aid if a student's family had no income. The meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is \$0. In other applications, the intercept may have little or no practical value if there are no observations where \(x\) is near zero.
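The two interpretations can be reproduced by evaluating the fitted line directly. The sketch below assumes, as the arithmetic above implies, that both aid and family income are measured in dollars; the function name `predicted_aid` and the example incomes are hypothetical.

```python
B0 = 24_319    # intercept from the text, in dollars
B1 = -0.0431   # slope from the text: change in predicted aid per dollar of family income


def predicted_aid(family_income):
    """Predicted aid (dollars) from the fitted line, for a given family income (dollars)."""
    return B0 + B1 * family_income


# Intercept: predicted aid for a family with no income
print(predicted_aid(0))

# Slope: change in predicted aid for an extra $1,000 of income (example incomes are arbitrary)
change = predicted_aid(51_000) - predicted_aid(50_000)
print(round(change, 2))
```

The difference between any two predictions that are \$1,000 of income apart is the same, because the model is a straight line.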
Source: [OpenIntro Statistics]
We've seen plots with strong linear relationships and others with very weak linear relationships. It would be useful if we could quantify the strength of these linear relationships with a statistic. Should we have concerns about applying least squares regression to the Elmhurst data in Figure 8.11?
Figure 8.12: Four examples showing when the methods in this chapter are insufficient to apply to the data. First panel: linearity fails. Second panel: there are outliers, most especially one point that is very far away from the line. Third panel: the variability of the errors is related to the value of \(x\). Fourth panel: a time series data set is shown, where successive observations are highly correlated.
Solution
We should examine Figure 8.12 to assess potential concerns. The four conditions shown are:
1. Linearity fails: If the relationship is curved rather than linear, least squares regression is not appropriate.
2. Outliers: A single point far from the rest of the data can disproportionately pull the regression line.
3. Non-constant variability (heteroscedasticity): If the spread of residuals increases or decreases with \(x\), the standard regression model assumptions are violated.
4. Time series correlation: If observations are ordered in time and successive values are correlated, independence is violated.
For the Elmhurst data specifically, if the scatterplot does not show obvious curvature, extreme outliers, fan-shaped residuals, or a time structure, then least squares regression may be applied with reasonable confidence. Always check a residual plot as part of any regression analysis.