further practice and understand linear models
use them for predictions, and
investigate problems that arise from using summary data
To achieve these goals we will spend today’s class with data collected from the Old Faithful geyser.
2021-10-18
further practice and understand linear models
use them for predictions, and
investigate problems that arise from using summary data
To achieve these goals we will spend today’s class with data collected from the Old Faithful geyser.
For more practice with the predictive aspects of data analysis, we consider data about the Old Faithful geyser in the Yellowstone National Park.
Old Faithful was once called “Eternity’s Timepiece” because of the regularity of its eruptions, even though the eruptions do not happen in regular intervals.
A data frame about the Old Faithful geyser is preinstalled in R. You can load it into your environment by typing
data(faithful)
Do some preliminary data exploration.
How many observations and variables do the data contain and what do they represent?
As usual in R, you can get more information by typing ?faithful
.
str(faithful)
## 'data.frame': 272 obs. of 2 variables: ## $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ... ## $ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
The data contains observations of two variables:
Eruption durations in mins, and
Corresponding waiting times until the next eruption in mins.
Park rangers use data observations of this type to make daily predictions for tourists.
They use the duration of an eruption to predict the time of the next eruption.
Let’s explore the data analysis part of the ranger profession.
As usual, have a look at the data to come up with an idea of how predictions of eruption times could be done.
Would the mean or median be a good statistic for waiting time prediction?
Check if necessary conditions for the method you plan to use are met.
hist(faithful$waiting, breaks=seq(40,100,3), col="lightblue", main="Waiting times between eruptions", xlab="Waiting times in mins")
The distribution of waiting times appears to be bimodal.
The mean or median would be good statistics for average waiting time if the distribution was, say, close to a normal distribution. Here it would however be more accurate to say that waiting times between eruptions are either around 80 mins or around 55 mins.
What can we say beyond that?
Can the duration of a current eruption help to explain the time interval between that eruption and the next eruption?
Is it reasonable to use a linear model for inference of waiting times from an observed eruption duration?
plot(waiting ~ eruptions, data = faithful, main = "Waiting time vs. duration of eruption", xlab = "duration of eruption", ylab = "waiting time")
Does the ‘straight enough condition’ seem to be satisfied in the scatterplot? Are there outliers?
To see whether eruption duration \(x\) can be used to predict waiting times till the next eruption \(y\), fit a linear model.
Describe the linear equation for the model and the summary statistic \(R^2\).
geyserlm <- lm(waiting ~ eruptions, data = faithful) coef(geyserlm) # display regression coefficients
## (Intercept) eruptions ## 33.47440 10.72964
summary(geyserlm)$r.squared # display Rsquared
## [1] 0.8114608
We see from the output that the slope \(b_1\) is 10.73 and the intercept \(b_0\) is 33.47. The equation for the linear model is therefore \[ \widehat{y}=33.47 + 10.73 x \] where \(\hat{y}\) is the predicted waiting time till the next eruption after observing an eruption time \(x\).
In general, the waiting time seems to increase with eruption duration: The slope of 10.73 indicates that an additional minute in observed eruption time tends to be followed by a waiting time that is increased by 10.73 minutes.
While \(R^2\) of 0.81 says nothing about causality it indicates that the waiting time till the next eruption is much accounted for by the duration of the previous eruption. Here 81% of the variation in the waiting time is explained by the eruption duration.
Check whether the \(R^2\) value you just calculated is really the square of the correlation coefficient \(r\) between \(x\) and \(y\).
summary(geyserlm)$r.squared
## [1] 0.8114608
cor(faithful$eruptions, faithful$waiting)^2
## [1] 0.8114608
It is not surprising that \(R^2\) and \(r\) should be related:
\(r\) describes how strong two variables are correlated
\(R^2\) describes how well the data is explained for by the model: \[ R^2 = \frac{\textrm{variance in the data accounted for by the model}}{\textrm{total variance in the data}} \]
For your curiosity only (like math-box p.208 textbook): A bit arithmetic reveals that indeed
\[ R^2= \frac{\sum(\hat{y}-\overline{y})^2}{\sum({y}-\overline{y})^2} = \frac{b_1^2\sum(x-\overline{x})^2}{\sum({y}-\overline{y})^2} =b_1^2\frac{s_x^2}{s_y^2} = r^2 \]
Add the line of best fit to your scatterplot.
Do you think that the duration of eruption can be used to predict the waiting time to the next eruption via the linear model you just calculated?
Beyond the inspection of the model fit and of \(R^2\) just done, base your answer on assessing whether the conditions on the residuals are met.
Check the plot of the residuals against either the predictor \(x\) or the fitted values stored in your linear model.
Check whether the residuals have a Normal-distribution.
What plots should you use? What should you look out for in those plots?
plot(waiting ~ eruptions, data = faithful, main = "Waiting time vs. duration of eruption", xlab = "duration of eruption", ylab = "waiting time") abline(geyserlm, col = "blue", lwd = 3)
plot(residuals(geyserlm) ~ predict(geyserlm), xlab = "fitted values", ylab='Residuals', main = "Residuals vs. fitted values")
plot(residuals(geyserlm) ~ faithful$eruptions, xlab = "duration of eruption", ylab='Residuals', main = "Residuals vs. x")
May be hard to judge normality of residuals directly from the histogram:
hist(residuals(geyserlm), breaks=seq(-15,20,3), col="lightblue", main = "Histogram of the residuals", xlab = "Residuals")
We better also do a Q-Q plot:
qqnorm(residuals(geyserlm)) qqline(residuals(geyserlm), col="deeppink")
Linear model fits the relationship well. Large \(R^2\) suggests duration of eruption is highly associated with waiting time till next eruption.
Linearity conditions are also verified by inspection of the residual plot: It does not notably bend nor “thicken”, suggesting that the variation is constant throughout the range of \(x\).
The two residual scatterplots look identical since the fitted value is a linear transformation of the waiting time.
Neither the scatter plot nor the residual plot show notable outliers.
In the qq-plot of the residuals there appears to be some deviation from normality in the left tail.
It doesn’t look like any transformation of the data is needed.
In a linear regression we associate \(y\) (here waiting time) with \(x\) (here previous eruption duration).
Some reasons explaining longer or shorter waiting times are left in the residual.
If those reasons are not just random fluctuations, this would mean that some explanatory information was still left in the residual. It would mean there is something beyond the linear relationship that potentially could also be included in the model, e.g. another variable we didn’t consider, a term or transformation to model a curve, etc.
If, however, those other reasons are essentially random, then the residual plot should look like scatter with no structure.
Then isn’t it enough to just quickly check that the residuals and \(x\) have no correlation? (rather than inspecting scatterplots..)
Isn’t it enough to only quickly check that the residuals and \(x\) have no correlation?
cor(residuals(geyserlm), predict(geyserlm))
## [1] 1.067144e-16
But it’s not enough to check this!
After fitting the linear model, the residual and the fitted values are always completely uncorrelated.
This is because R has fitted the best possible linear model and thus no “one-variable-linear” explanatory component is left in the residuals.
Based on your linear model, what waiting times should the ranger anounce on the tourist boards after she observed eruption durations of 2.25, 4.1, or 4.72 minutes?
Create a data frame with one column containing these durations.
The column name should match the variable name of the predictor variable in your linear model
Use this data frame as second argument of the predict
function.
The linear model predictions for waiting times after eruption durations of 2.25, 4.1, and 4.72 minutes are:
measurements <- data.frame(eruptions = c(2.25,4.1,4.72)) predict(geyserlm, newdata = measurements)
## 1 2 3 ## 57.61609 77.46593 84.11830
Suppose you flipped the roles of the \(x\) and \(y\) variables you feed lm()
to get your linear model.
Which parts of the output would change, and which won’t?
Discuss first in your team and then try.
geyserlm2 <- lm(eruptions ~ waiting, data = faithful) coef(geyserlm2)
## (Intercept) waiting ## -1.87401599 0.07562795
summary(geyserlm2)$r.squared
## [1] 0.8114608
While the slope and intercept of the linear model change, \(R^2\) stays the same as before.
This illustrates how the order of variables matters in linear regression, but doesn’t in correlations.
Does this mean we can use the waiting time to predict the eruption duration?
Like telling tourists: “Just stay – even if the announced waiting time is long. It will be worth your while since you will see a longer eruption.” ?
Does this mean we can use the waiting time to predict the eruption duration?
Maybe… But not from the data that we just considered!
We need data about the
eruption times and the
corresponding waiting times before these respective eruptions.
Data of this type (in mins) are provided at https://qr.dasmithmaths.com/w10mon/faithful_AzzaliniBowman1990.csv for you to download:
faithful_pd <- read.csv("faithful_AzzaliniBowman1990.csv")
str(faithful_pd)
## 'data.frame': 299 obs. of 2 variables: ## $ waiting : int 80 71 57 80 75 77 60 86 77 56 ... ## $ duration: num 4.02 2.15 4 4 4 ...
faithful_pd
data about thewaiting time leading up to an eruption of a certain duration.
How much of the variance in eruption duration is explained by the waiting time between eruptions?
Does this linear model have more or less explanatory power than the one we just considered?
Perform the same assessment as for the previous linear model.
plot(duration ~ waiting, data = faithful_pd)
From the scatter plot we see that
if one observes an eruption lasting about 2 mins, then one has probably waited 80 to 90 minutes from the last eruption to the current one.
if one observes an eruption lasting around 4.5 mins, the waiting time could have been any time between around 50 and 90 mins.
a short waiting time is always followed by a long eruption, but a long waiting time is followed by short or long eruptions in roughly equal proportions.
Such conclusions cannot be made from a linear model.
What happens when we assess the linear model?
cor(faithful_pd$duration, faithful_pd$waiting)
## [1] -0.644623
The negative correlation indicates an inverse relationship between waiting time and eruption duration: the longer the eruption, the shorter the waiting time before the eruption.
\(R^2\) would now be \((-0.64)^2=41\%\), so this would be a less powerful model.
The equation for the linear model now would be
\[ \textrm{Eruption duration} = 7.31 - 0.05\cdot\textrm{(Waiting time)} \]
Let’s assess the fit.
geyserlm_pd <- lm(duration ~ waiting, data = faithful_pd) plot(duration ~ waiting, data = faithful_pd) abline(geyserlm_pd, col = "lightblue", lwd = 3)
plot(residuals(geyserlm_pd) ~ predict(geyserlm_pd), xlab = "Fitted values", ylab='Residuals', main = "Residuals vs. fitted values")
The correlation between residuals and predicted eruption durations is still zero!
cor(residuals(geyserlm_pd), predict(geyserlm_pd))
## [1] -5.406898e-17
This is just saying we fitted the best linear model and so no “linear” component is left in the residual.
But we see from the plots that even the best among the linear models here cannot perform that well.
hist(geyserlm_pd$residuals, breaks=seq(-3,3,0.5), col="lightblue", main = "Histogram of the residuals", xlab = "Residuals")
qqnorm(geyserlm_pd$residuals) qqline(geyserlm_pd$residuals, col="deeppink")
Have you noticed the high proportion of eruptions lasting exactly 2 and exactly 4 mins in the data? Some rangers just indicated L for long and S for short eruptions.
Suppose a ranger wants to summarize the data even further by considering mean waiting times for eruption durations that fall within 0.1 mins intervals.
First use the function round
to append a new column to faithful_pd
containing the eruption durations rounded to one decimal place. (Hint: ?round
)
Then use aggregate
to get the mean waiting times per duration bin.
Finally, fit a linear model to this summarized data.
From a quick assessment, what seems to be problematic about working with summary data?
faithful_pd$duration_rounded <- round(faithful_pd$duration, 1) geyser_summarized <- aggregate(waiting ~ duration_rounded, data = faithful_pd, FUN = mean) head(geyser_summarized)
## duration_rounded waiting ## 1 0.8 80.00000 ## 2 1.6 89.75000 ## 3 1.7 84.83333 ## 4 1.8 84.44444 ## 5 1.9 84.10526 ## 6 2.0 81.44444
coef(lm(geyser_summarized))
## (Intercept) waiting ## 9.92867902 -0.08909649
summary(lm(geyser_summarized))$r.squared
## [1] 0.7688285
\(R^2\) jumped up from 41% to 77%.
How about the visual fit of the regression line?
plot(geyser_summarized$duration_rounded ~ geyser_summarized$waiting, xlab = "", ylab='', main = "") abline(lm(geyser_summarized), col = "lightblue", lwd = 3)
Can this really have helped to be able to better explain the eruption duration from the waiting time?!
While we have seen earlier that a linear model does not do well in predicting eruption duration from waiting time, when using summarized data it erroneously looks as if a linear model can now do much better!
This illustrates some important general points:
To the extend possible, one should use raw data for analysis of relationships between variables.
Using summary data reduces the variance and in regression analysis gives a false impression of how well the linear model fits the data.
If sampling is uneven across the classes or bins in which the data is summarized, the estimated slope may well be wrong.
In today’s activity we:
Deepened our understanding of aspects of the linear model such as:
Used a linear model to make predictions for waiting times until eruptions for the Old-Faithful geyser.
Saw the problematic effects that can arise from using summary data. They made a linear model appear more powerful than it really was.
Recall from the video in the prep that the rangers quantify their amount of uncertainty by giving an interval of plus-minus a certain number of minutes along with their estimated waiting time.
This makes a lot of sense since we know that usually few observations lie precisely on the regression line.
Prediction intervals for new observations can be made based on the condition that the residiuals satisfy the normality condition we checked earlier. It is the most important reason for checking this condition.
Intuitively, this has to do with something we used a lot about the normal distribution: the 68-95-99 rule. (What did it say?)
It could be desireable to also provide some level of uncertainty for the slope of the linear model itself.
For instance: In 1929, E. Hubble observed a positive correlation between the distances of nebulae from earth, \(y\), and the velocity with which they move away from earth, \(x\). Scientists then wondered how this positive linear association could have arisen. The result was Big Bang theory. The theory proposes that the universe started with a Big Bang at a single point in space a long time ago, scattering material around the surface of an ever expanding sphere. If the theory is correct, then the linear relationship should be of the form \(y = b_1 x\), where the slope \(b_1\) is the age of the universe when the observations were made.
Wasn’t there actually something non-random left in our residual plot after all? Something we could have already noticed in the scatterplot of faithful
?
The scatterplot seemed to have two clusters.
One could imagine fitting a linear model to each cluster to come up with a piecewise linear model with a jump.
… in our case things are not quite so clear cut:
How exactly would we divide the two clusters? What is still a longer of the short eruptions and what is already a shorter of the long eruptions?
And are there really just two clusters to be considered? The ‘faithful_pd’ scatterplot seemed to suggest three clusters. Geologically it is only confirmed that Old-Faitful has at least two reservoirs.