Old Faithful – Activity

2021-10-18

Learning Goals

further practice and understand linear models
use them for predictions, and
investigate problems that arise from using summary data

To achieve these goals we will spend today’s class with data collected from the Old Faithful geyser.

Activity

For more practice with the predictive aspects of data analysis, we consider data about the Old Faithful geyser in the Yellowstone National Park.

Old Faithful was once called “Eternity’s Timepiece” because of the regularity of its eruptions, even though the eruptions do not happen in regular intervals.

1.1 Data exploration

A data frame about the Old Faithful geyser is preinstalled in R. You can load it into your environment by typing

data(faithful)

Do some preliminary data exploration.

How many observations and variables do the data contain and what do they represent?

As usual in R, you can get more information by typing ?faithful.

1.1 Data exploration

str(faithful)

## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

The data contains observations of two variables:

Eruption durations in mins, and
Corresponding waiting times until the next eruption in mins.

Park rangers use data observations of this type to make daily predictions for tourists.

They use the duration of an eruption to predict the time of the next eruption.

1.2 Ranger’s data analysis

Let’s explore the data analysis part of the ranger profession.

As usual, have a look at the data to come up with an idea of how predictions of eruption times could be done.
Would the mean or median be a good statistic for waiting time prediction?
Check if necessary conditions for the method you plan to use are met.

1.2 Ranger’s data analysis

hist(faithful$waiting, breaks=seq(40,100,3), col="lightblue", 
     main="Waiting times between eruptions", xlab="Waiting times in mins")

1.2 Ranger’s data analysis

The distribution of waiting times appears to be bimodal.
The mean or median would be good statistics for average waiting time if the distribution was, say, close to a normal distribution. Here it would however be more accurate to say that waiting times between eruptions are either around 80 mins or around 55 mins.
What can we say beyond that?
Can the duration of a current eruption help to explain the time interval between that eruption and the next eruption?
Is it reasonable to use a linear model for inference of waiting times from an observed eruption duration?

1.2 Ranger’s data analysis

plot(waiting ~ eruptions, data = faithful,  
     main = "Waiting time vs. duration of eruption",
     xlab = "duration of eruption", ylab = "waiting time")

1.3 Fit a linear model

Does the ‘straight enough condition’ seem to be satisfied in the scatterplot? Are there outliers?

To see whether eruption duration \(x\) can be used to predict waiting times till the next eruption \(y\), fit a linear model.

Describe the linear equation for the model and the summary statistic \(R^2\).

1.3 Fit a linear model

geyserlm <- lm(waiting ~ eruptions, data = faithful)
coef(geyserlm)                 # display regression coefficients

## (Intercept)   eruptions 
##    33.47440    10.72964

summary(geyserlm)$r.squared    # display Rsquared

## [1] 0.8114608

1.3 Fit a linear model

We see from the output that the slope \(b_1\) is 10.73 and the intercept \(b_0\) is 33.47. The equation for the linear model is therefore \[ \widehat{y}=33.47 + 10.73 x \] where \(\hat{y}\) is the predicted waiting time till the next eruption after observing an eruption time \(x\).

In general, the waiting time seems to increase with eruption duration: The slope of 10.73 indicates that an additional minute in observed eruption time tends to be followed by a waiting time that is increased by 10.73 minutes.

While \(R^2\) of 0.81 says nothing about causality it indicates that the waiting time till the next eruption is much accounted for by the duration of the previous eruption. Here 81% of the variation in the waiting time is explained by the eruption duration.

1.4 \(R^2\) versus \(r\)

Check whether the \(R^2\) value you just calculated is really the square of the correlation coefficient \(r\) between \(x\) and \(y\).

1.4 \(R^2\) versus \(r\)

summary(geyserlm)$r.squared

## [1] 0.8114608

cor(faithful$eruptions, faithful$waiting)^2

## [1] 0.8114608

\(R^2\) versus \(r\)

It is not surprising that \(R^2\) and \(r\) should be related:

\(r\) describes how strong two variables are correlated
\(R^2\) describes how well the data is explained for by the model: \[ R^2 = \frac{\textrm{variance in the data accounted for by the model}}{\textrm{total variance in the data}} \]

For your curiosity only (like math-box p.208 textbook): A bit arithmetic reveals that indeed

\[ R^2= \frac{\sum(\hat{y}-\overline{y})^2}{\sum({y}-\overline{y})^2} = \frac{b_1^2\sum(x-\overline{x})^2}{\sum({y}-\overline{y})^2} =b_1^2\frac{s_x^2}{s_y^2} = r^2 \]

1.5 Evaluate fit

Add the line of best fit to your scatterplot.

Do you think that the duration of eruption can be used to predict the waiting time to the next eruption via the linear model you just calculated?

Beyond the inspection of the model fit and of \(R^2\) just done, base your answer on assessing whether the conditions on the residuals are met.

Check the plot of the residuals against either the predictor \(x\) or the fitted values stored in your linear model.
Check whether the residuals have a Normal-distribution.
What plots should you use? What should you look out for in those plots?

1.5 Evaluate fit

plot(waiting ~ eruptions, data = faithful, 
     main = "Waiting time vs. duration of eruption", 
     xlab = "duration of eruption", ylab = "waiting time")
abline(geyserlm, col = "blue", lwd = 3)

1.5 Evaluate fit – Residual plot I

plot(residuals(geyserlm) ~ predict(geyserlm),  xlab = "fitted values", 
     ylab='Residuals', main = "Residuals vs. fitted values")

1.5 Evaluate fit – Residual plot II

plot(residuals(geyserlm) ~ faithful$eruptions, xlab = "duration of eruption",
     ylab='Residuals', main = "Residuals vs. x")

1.5 Evaluate fit – Normality of residuals

May be hard to judge normality of residuals directly from the histogram:

hist(residuals(geyserlm), breaks=seq(-15,20,3), col="lightblue",  
     main = "Histogram of the residuals", xlab = "Residuals")

1.5 Evaluate fit – Normality of residuals

We better also do a Q-Q plot:

qqnorm(residuals(geyserlm)) 
qqline(residuals(geyserlm), col="deeppink")

Evaluate fit – Summary

Linear model fits the relationship well. Large \(R^2\) suggests duration of eruption is highly associated with waiting time till next eruption.
Linearity conditions are also verified by inspection of the residual plot: It does not notably bend nor “thicken”, suggesting that the variation is constant throughout the range of \(x\).
The two residual scatterplots look identical since the fitted value is a linear transformation of the waiting time.
Neither the scatter plot nor the residual plot show notable outliers.
In the qq-plot of the residuals there appears to be some deviation from normality in the left tail.
It doesn’t look like any transformation of the data is needed.

Evaluate fit – Discussion

In a linear regression we associate \(y\) (here waiting time) with \(x\) (here previous eruption duration).
Some reasons explaining longer or shorter waiting times are left in the residual.
If those reasons are not just random fluctuations, this would mean that some explanatory information was still left in the residual. It would mean there is something beyond the linear relationship that potentially could also be included in the model, e.g. another variable we didn’t consider, a term or transformation to model a curve, etc.
If, however, those other reasons are essentially random, then the residual plot should look like scatter with no structure.
Then isn’t it enough to just quickly check that the residuals and \(x\) have no correlation? (rather than inspecting scatterplots..)

Evaluate fit – Discussion

Isn’t it enough to only quickly check that the residuals and \(x\) have no correlation?

cor(residuals(geyserlm), predict(geyserlm))

## [1] 1.067144e-16

But it’s not enough to check this!

After fitting the linear model, the residual and the fitted values are always completely uncorrelated.

This is because R has fitted the best possible linear model and thus no “one-variable-linear” explanatory component is left in the residuals.

1.6 A day at the rangers’ job

Based on your linear model, what waiting times should the ranger anounce on the tourist boards after she observed eruption durations of 2.25, 4.1, or 4.72 minutes?

Create a data frame with one column containing these durations.
The column name should match the variable name of the predictor variable in your linear model
Use this data frame as second argument of the predict function.

1.6 A day at the rangers’ job

The linear model predictions for waiting times after eruption durations of 2.25, 4.1, and 4.72 minutes are:

measurements <- data.frame(eruptions = c(2.25,4.1,4.72))
predict(geyserlm, newdata = measurements)

##        1        2        3 
## 57.61609 77.46593 84.11830

1.7 Reversing the variables

Suppose you flipped the roles of the \(x\) and \(y\) variables you feed lm() to get your linear model.

Which parts of the output would change, and which won’t?

Discuss first in your team and then try.

1.7 Reversing the variables

geyserlm2 <- lm(eruptions ~ waiting, data = faithful)
coef(geyserlm2)

## (Intercept)     waiting 
## -1.87401599  0.07562795

summary(geyserlm2)$r.squared

## [1] 0.8114608

While the slope and intercept of the linear model change, \(R^2\) stays the same as before.
This illustrates how the order of variables matters in linear regression, but doesn’t in correlations.

1.8 Predicting eruption duration

Does this mean we can use the waiting time to predict the eruption duration?

Like telling tourists: “Just stay – even if the announced waiting time is long. It will be worth your while since you will see a longer eruption.” ?

1.8 Predicting eruption duration

Does this mean we can use the waiting time to predict the eruption duration?

Maybe… But not from the data that we just considered!

We need data about the

eruption times and the
corresponding waiting times before these respective eruptions.

Data of this type (in mins) are provided at https://qr.dasmithmaths.com/w10mon/faithful_AzzaliniBowman1990.csv for you to download:

faithful_pd <- read.csv("faithful_AzzaliniBowman1990.csv")

1.9 Worth waiting?

str(faithful_pd)

## 'data.frame':    299 obs. of  2 variables:
##  $ waiting : int  80 71 57 80 75 77 60 86 77 56 ...
##  $ duration: num  4.02 2.15 4 4 4 ...

To investigate whether the waiting time can be used to predict the following eruption duration, use this faithful_pd data about the

waiting time leading up to an eruption of a certain duration.

How much of the variance in eruption duration is explained by the waiting time between eruptions?
Does this linear model have more or less explanatory power than the one we just considered?
Perform the same assessment as for the previous linear model.

1.9 Worth waiting?

plot(duration ~ waiting, data = faithful_pd)

1.9 Worth waiting?

From the scatter plot we see that

if one observes an eruption lasting about 2 mins, then one has probably waited 80 to 90 minutes from the last eruption to the current one.
if one observes an eruption lasting around 4.5 mins, the waiting time could have been any time between around 50 and 90 mins.
a short waiting time is always followed by a long eruption, but a long waiting time is followed by short or long eruptions in roughly equal proportions.

Such conclusions cannot be made from a linear model.

What happens when we assess the linear model?

1.9 Worth waiting?

cor(faithful_pd$duration, faithful_pd$waiting)

## [1] -0.644623

The negative correlation indicates an inverse relationship between waiting time and eruption duration: the longer the eruption, the shorter the waiting time before the eruption.

\(R^2\) would now be \((-0.64)^2=41\%\), so this would be a less powerful model.

The equation for the linear model now would be

\[ \textrm{Eruption duration} = 7.31 - 0.05\cdot\textrm{(Waiting time)} \]

Let’s assess the fit.

1.9 Worth waiting?

geyserlm_pd <- lm(duration ~ waiting, data = faithful_pd)
plot(duration ~ waiting, data = faithful_pd)
abline(geyserlm_pd, col = "lightblue", lwd = 3)

1.9 Worth waiting?

plot(residuals(geyserlm_pd) ~ predict(geyserlm_pd), xlab = "Fitted values",
     ylab='Residuals', main = "Residuals vs. fitted values")

1.9 Worth waiting?

The correlation between residuals and predicted eruption durations is still zero!

cor(residuals(geyserlm_pd), predict(geyserlm_pd))

## [1] -5.406898e-17

This is just saying we fitted the best linear model and so no “linear” component is left in the residual.

But we see from the plots that even the best among the linear models here cannot perform that well.

1.9 Worth waiting?

hist(geyserlm_pd$residuals, breaks=seq(-3,3,0.5), col="lightblue",  
     main = "Histogram of the residuals", xlab = "Residuals")

1.9 Worth waiting?

qqnorm(geyserlm_pd$residuals)
qqline(geyserlm_pd$residuals, col="deeppink")

1.10 Predict erupt duration by summary data

Have you noticed the high proportion of eruptions lasting exactly 2 and exactly 4 mins in the data? Some rangers just indicated L for long and S for short eruptions.

Suppose a ranger wants to summarize the data even further by considering mean waiting times for eruption durations that fall within 0.1 mins intervals.

First use the function round to append a new column to faithful_pd containing the eruption durations rounded to one decimal place. (Hint: ?round)
Then use aggregate to get the mean waiting times per duration bin.
Finally, fit a linear model to this summarized data.
From a quick assessment, what seems to be problematic about working with summary data?

1.10 Predict erupt duration by summary data

faithful_pd$duration_rounded <- round(faithful_pd$duration, 1) 
geyser_summarized <- aggregate(waiting ~ duration_rounded, 
                               data = faithful_pd, 
                               FUN = mean)
head(geyser_summarized)

##   duration_rounded  waiting
## 1              0.8 80.00000
## 2              1.6 89.75000
## 3              1.7 84.83333
## 4              1.8 84.44444
## 5              1.9 84.10526
## 6              2.0 81.44444

1.10 Predict erupt duration by summary data

coef(lm(geyser_summarized))

## (Intercept)     waiting 
##  9.92867902 -0.08909649

summary(lm(geyser_summarized))$r.squared

## [1] 0.7688285

\(R^2\) jumped up from 41% to 77%.

How about the visual fit of the regression line?

1.10 Predict erupt duration by summary data

plot(geyser_summarized$duration_rounded ~ geyser_summarized$waiting, xlab = "", 
     ylab='', main = "")
abline(lm(geyser_summarized), col = "lightblue", lwd = 3)

1.10 Predict erupt duration by summary data

Can this really have helped to be able to better explain the eruption duration from the waiting time?!

While we have seen earlier that a linear model does not do well in predicting eruption duration from waiting time, when using summarized data it erroneously looks as if a linear model can now do much better!

This illustrates some important general points:

To the extend possible, one should use raw data for analysis of relationships between variables.
Using summary data reduces the variance and in regression analysis gives a false impression of how well the linear model fits the data.
If sampling is uneven across the classes or bins in which the data is summarized, the estimated slope may well be wrong.

Recap

In today’s activity we:

Deepened our understanding of aspects of the linear model such as:
- the assessment of residual plots
- the effect of reversing variables or
- the connection between correlation and explanatory power of the model.
Used a linear model to make predictions for waiting times until eruptions for the Old-Faithful geyser.
Saw the problematic effects that can arise from using summary data. They made a linear model appear more powerful than it really was.

Further discussion and outlook beyond QR

Intervals rather than point estimates

Recall from the video in the prep that the rangers quantify their amount of uncertainty by giving an interval of plus-minus a certain number of minutes along with their estimated waiting time.
This makes a lot of sense since we know that usually few observations lie precisely on the regression line.
Prediction intervals for new observations can be made based on the condition that the residiuals satisfy the normality condition we checked earlier. It is the most important reason for checking this condition.
Intuitively, this has to do with something we used a lot about the normal distribution: the 68-95-99 rule. (What did it say?)

The slope of the linear model

It could be desireable to also provide some level of uncertainty for the slope of the linear model itself.
For instance: In 1929, E. Hubble observed a positive correlation between the distances of nebulae from earth, \(y\), and the velocity with which they move away from earth, \(x\). Scientists then wondered how this positive linear association could have arisen. The result was Big Bang theory. The theory proposes that the universe started with a Big Bang at a single point in space a long time ago, scattering material around the surface of an ever expanding sphere. If the theory is correct, then the linear relationship should be of the form \(y = b_1 x\), where the slope \(b_1\) is the age of the universe when the observations were made.

Beyond the linear model

Wasn’t there actually something non-random left in our residual plot after all? Something we could have already noticed in the scatterplot of faithful?
The scatterplot seemed to have two clusters.
One could imagine fitting a linear model to each cluster to come up with a piecewise linear model with a jump.

Beyond the linear model

While it is not hard to imagine cases that would beg for this approach, - even when there is a strong overall linear trend (somewhat similar to the Simpson paradox), …

Beyond the linear model

… in our case things are not quite so clear cut:

How exactly would we divide the two clusters? What is still a longer of the short eruptions and what is already a shorter of the long eruptions?
And are there really just two clusters to be considered? The ‘faithful_pd’ scatterplot seemed to suggest three clusters. Geologically it is only confirmed that Old-Faitful has at least two reservoirs.