Statistical significance

AY2021-2022, Semester 1, Week 13 Thursday

Introduction

Course context

Uncertainty quantification requires the concept of randomness, introduced recently.

Statistical significance is one method for uncertainty quantification.

We recently introduced the for-loop, a programming technique which allows for more efficient programming of simulations.

Today, we use simulations to determine statistical significance.

Permutation tests

Observed data exhibit a statistically significant feature if the probability to observe this feature is unlikely to be the result of randomness alone.

A permutation test is a method to investigate statistical significance.

If we are interested in the relation between two variables, we permute the values of one of the variables to break any association between the variables.

We compare the distribution of simulated data with observed data to determine statistical significance.

Learning outcomes

After today’s class, you should be able to:

discuss whether a feature has practical significance.
determine statistical significance using simulations.
use different statistics (e.g. mortality ratio, slope of regression line) as a response variable in a permutation test.

Activity 1: Were women more likely to survive the Titanic disaster than men?

Titanic

“Women and children first” is a code of conduct during maritime disasters if there are not enough lifeboats to rescue everybody. Is there evidence that women were more likely to survive the sinking of the Titanic than men?

The file titanic.csv contains data on people on the Titanic. Each row represents one person. The columns describe different attributes of the person.

titanic <- read.csv("titanic.csv") 
head(titanic)

##   fam_name                    given_name gender age class survived ticket pax_on_tckt pnd shl pnc
## 1   ABBING                       Anthony   Male  41   3rd    FALSE   5547           1   7  11   0
## 2   ABBOTT                   Ernest Owen   Male  21  Crew    FALSE   <NA>          NA  NA  NA  NA
## 3   ABBOTT                 Eugene Joseph   Male  13   3rd    FALSE CA2673           3  20   5   0
## 4   ABBOTT               Rossmore Edward   Male  16   3rd    FALSE CA2673           3  20   5   0
## 5   ABBOTT             Rhoda Mary 'Rosa' Female  39   3rd     TRUE CA2673           3  20   5   0
## 6 ABELSETH Kalle (Karen) Marie Kristiane Female  16   3rd     TRUE 348125           1   7  13   0

1.1 Survival and gender on the Titanic

Create a two-way table of survived and gender.
Visualise these data.
Compute the proportion of men and the proportion of women who died.

1.1 Survival and gender on the Titanic

survival_by_gender <- table(titanic$survived, titanic$gender)
survival_by_gender

##        
##         Female Male
##   FALSE    130 1366
##   TRUE     357  355

1.1 Survival and gender on the Titanic

barplot(survival_by_gender,
        main = "Passengers on the Titanic",
        legend.text = c("Death", "Survivor"),
        args.legend = list(x = "topleft"))

1.1 Survival and gender on the Titanic

men <- titanic[titanic$gender == "Male", ]
mean(!men$survived)

## [1] 0.7937246

women <- titanic[titanic$gender == "Female", ]
mean(!women$survived)

## [1] 0.2669405

1.1 Survival and gender on the Titanic

# Alternative
aggregate(!survived ~ gender, data = titanic, mean)

##   gender !survived
## 1 Female 0.2669405
## 2   Male 0.7937246

79% of men died versus 27% of women.

1.2 Practical and statistical significance

Is 79% of the men dying versus 27% of women practically significant?
How can we assess whether it is statistically significant that 79% of the men died versus 27% of women?

1.2 Practical and statistical significance

Men dying with a three times higher probability than women is a meaningful difference. High male mortality is linked to the deaths of 1366 men. Therefore, the observed mortality ratio is practically significant.
Whether this difference is statistically significant depends on how likely such a difference is when survival does not depend on gender. We can find an answer by performing a permutation test.

1.3 Mortality ratio as response variable

We want to study the statistical significance of changing two statistics: male mortality and female mortality. We define male (female) mortality as the number of male (female) deaths divided by the number of men (women) on board.

Comparing two statistics simultaneously is difficult because it requires thinking and plotting graphs in more than two dimensions. Instead, we combine the two statistics into a single response variable. One out of many possibilities is to take the “mortality ratio”: \[ \displaystyle\text{response}=\frac{\text{male mortality}}{\text{female mortality}}\ . \]

What is the range of possible values of this response variable?
What would have been the response if gender had played no role in the survival probability?

1.3 Mortality ratio as response variable

The mortality ratio can take any nonnegative value:
- It can be zero if all men survived and at least one woman died.
- It can have an infinitely large positive value if at least one man died and all women survived.
If all men and all women survived, the mortality ratio is undefined (zero divided by zero).
If men and women had died with the same probability, the response would have been approximately 1.

1.4 Permutation

Let us perform a simulation to check for statistical significance.

Calculate the mortality ratio for the observed data.
Append a column gender_p to titanic that is a permutation of the values in the column gender. Only perform one permutation for the time being (i.e. no for-loop yet).
What is the response variable for this permutation?

1.4 Permutation

male_mortality <- mean(!men$survived)
female_mortality <- mean(!women$survived)
mortality_ratio <- male_mortality / female_mortality
mortality_ratio

## [1] 2.973414

The observed mortality ratio is approximately 2.97.

1.4 Permutation

titanic$gender_p <- sample(titanic$gender)  # Permute gender
men_p <- titanic[titanic$gender_p == "Male", ]
women_p <- titanic[titanic$gender_p == "Female", ]
male_mortality_p <- mean(!men_p$survived)
female_mortality_p <- mean(!women_p$survived)
response <- male_mortality_p / female_mortality_p
response

## [1] 1.027646

Is it plausible that the response of the permuted data has a value close to 1?

Yes, the permutation broke the association between gender and survival; thus, the male and female mortalities are approximately equal.

1.5 Permutation test

We wrote code to generate a single response based on a random permutation. We will use this code to test whether a mortality ratio of 2.97 is statistically significant.

Perform 10,000 permutations and store the values of the response variable in a vector called responses. (Hint: You need a for-loop.)
Visualise the distribution of the response variable with a histogram.
Judging from the histogram, is a mortality ratio of 2.97 statistically significant?

1.5 Permutation test

n_trials <- 10000
responses <- numeric(n_trials)
for (trial in 1:n_trials) {
  titanic$gender_p <- sample(titanic$gender)  # Permute gender
  men_p <- titanic[titanic$gender_p == "Male", ]
  women_p <- titanic[titanic$gender_p == "Female", ]
  male_mortality_p <- mean(!men_p$survived)
  female_mortality_p <- mean(!women_p$survived)
  responses[trial] <- male_mortality_p / female_mortality_p
}

1.5 Permutation test

hist(responses)

The observed mortality ratio of 2.97 is statistically significant because all responses in the permutation test were smaller.

Summary

We performed a permutation test, which randomly permutes the gender of people on board and then calculates the response variable (the ratio of the male mortality to the female mortality).

By using many permutations, we obtained the distribution of the response variable.

We found that it is extremely unlikely that a mortality ratio of 2.97 could have arisen by chance.

We conclude that there is an association between gender and survival. Men had a statistically significantly higher mortality than women.

We cannot conclude that gender was the cause of the difference in mortality. For example, there might be a lurking variable: most crew members were male, and crew members were less likely to be rescued than passengers.

Activity 2: Were adults less likely to survive the Titanic disaster?

How do we compare mortality for adults and younger people on the Titanic?

In the previous activity, we found that women were more likely to survive than men. Was it also the case that younger people were more likely to be rescued than adults (defined as people \(\geq\) 18 years of age)?

We can use a similar response variable as before: \[ \displaystyle\text{response}=\frac{\text{adult mortality}}{\text{mortality of younger people}}\ . \]

If adults were less likely to survive, then “response > 1”.

2.1 Observed mortality ratio

Append a column is_adult to titanic with a value TRUE if the person was an adult and FALSE otherwise.
Calculate the mortality ratio of adults and non-adults for the observed data. (Hint: Remove rows with missing age information.)
Is the observed mortality ratio practically significant?

2.1 Observed mortality ratio

titanic$is_adult <- (titanic$age >= 18)
adults <- titanic[which(titanic$is_adult), ]  # Use which() to remove NAs
younger <- titanic[which(!titanic$is_adult), ]
adult_mortality <- mean(!adults$survived) 
younger_mortality <- mean(!younger$survived)
mortality_ratio <- adult_mortality / younger_mortality
mortality_ratio

## [1] 1.188829

It is a subjective judgement call whether to consider an adult mortality as practically significant if it is 19% higher than the mortality of younger people.

Compared to the effect of gender on survival, the effect of age is relatively small.

Still, the age-dependent difference had practical significance for those younger people whose lives were saved as a result.

2.2 Permutation test

Is the adult mortality significantly higher than the mortality of younger people?

Perform a permutation test.
Draw a histogram.
What is the proportion of permutations that resulted in a response that was greater than the observed mortality ratio?

2.2 Permutation test

n_trials <- 10000
responses <- numeric(n_trials)
for (trial in 1:n_trials) {
  titanic$is_adult_p <- sample(titanic$is_adult)
  adults_p <- titanic[which(titanic$is_adult_p), ]
  younger_p <- titanic[which(!titanic$is_adult_p), ]
  adult_mortality_p <- mean(!adults_p$survived)
  younger_mortality_p <- mean(!younger_p$survived)
  responses[trial] <- adult_mortality_p / younger_mortality_p
}

2.2 Permutation test

hist(responses)

# Insert vertical line at the observed value
abline(v = mortality_ratio, col = "blue", lwd = 3)

2.2 Permutation test

mean(responses > mortality_ratio)

## [1] 8e-04

The mortality ratio was more extreme than the observed value in only 0.08% of the permutations. We conclude that adults were statistically significantly more likely to die.

We cannot conclude that being an adult was the cause of a higher mortality. What might be a lurking variable?

Summary

The permutation test for this activity was similar to the permutation test in activity 1, but we randomly permuted the variable is_adult instead of gender.

A small percentage of trials (0.08% in my simulation; your value may differ) produced a response that was more extreme (i.e. larger) than the observed mortality ratio. This percentage is below the common threshold of 5%. Thus, we conclude that the result was statistically significant.

The 5%-threshold is arbitrary, but it is a common criterion in statistics. Thus, we generally adopt this threshold in this course.

Activity 3: Association between hurdles and javelin throw in heptathlon

Load and explore data

The file hept.csv contains data on the 2016 women’s Olympic heptathlon.

hept <- read.csv("hept.csv")

Each row represents one competing athlete. The columns give details on the athletes and their times or distances in each event. The columns are given in the order the events took place. Here are the first few rows and columns.

head(hept[, 1:12])

##   rank                   athlete country pts_total hurdles pts_hurdles   hj pts_hj    sp pts_sp run200 pts_run200
## 1    1          Nafissatou Thiam     BEL      6810   13.56        1041 1.98   1211 14.91    855  25.10        878
## 2    2        Jessica Ennis-Hill     GBR      6775   12.84        1149 1.89   1093 13.86    785  23.49       1030
## 3    3     Brianne Theisen Eaton     CAN      6653   13.18        1097 1.86   1054 13.45    757  24.18        963
## 4    4  Laura Ikauniece-Admidina     LAT      6617   13.33        1075 1.77    941 13.52    762  23.76       1004
## 5    5           Carolin Schafer     GER      6540   13.12        1106 1.83   1016 14.57    832  23.99        982
## 6    6 Katarina Johnson-Thompson     GBR      6523   13.48        1053 1.98   1211 11.68    640  23.26       1053

3.1 Linear model for hurdles and javelin throw

One might hypothesize that athletes who are better in hurdles are also better in javelin throw (jt).

Make a scatter plot of jt versus hurdles. Verify that the conditions for linear regression are satisfied.
Fit a linear model that predicts jt from hurdles.
Add the slope to the scatter plot.
Record the slope of the linear model and describe in words what this value means.
What is the model’s \(R^2\) value? How do you interpret it?
Is the slope practically significant?

3.1 Linear model for hurdles and javelin throw

plot(jt ~ hurdles,
     data = hept,
     main = "Olympic heptathlon 2016",
     xlab = "Hurdles (seconds)",
     ylab = "Javelin throw (metres)")

3.1 Linear model for hurdles and javelin throw

We satisfy the linear regression conditions:

Both variables are quantitative.
There is no noticeable curvature.
There are no outliers.
The plot does not thicken.

3.1 Linear model for hurdles and javelin throw

model <- lm(jt ~ hurdles, data = hept)
plot(jt ~ hurdles,
     data = hept,
     main = "Olympic heptathlon 2016",
     xlab = "Hurdles (seconds)",
     ylab = "Javelin throw (metres)")
abline(model, col = "blue")

3.1 Linear model for hurdles and javelin throw

slope <- coefficients(model)[2]
slope

##   hurdles 
## -0.238086

For every extra second an athlete needs in the hurdles event, the linear model predicts that the athlete throws the javelin 23.8 centimetres shorter.

summary(model)$r.squared

## [1] 0.0002317609

The model only predicts 0.02% of the variability of the javelin throw distances. However, a low \(R^2\) does not necessarily imply that the slope is insignificant (neither practically nor statistically).

3.1 Linear model for hurdles and javelin throw

Keep in mind that being better at hurdles is equivalent to having a smaller value in hurdles. However, being better in javelin throw translates into having a larger value in javelin. Thus, we might expect the slope of the linear model to be negative.

However, the observed effect is small. Even a one-second difference in hurdle time (which is a big difference in practice) only leads to a difference of a few centimetres in the predicted javelin throw distance (which has a standard deviation of 5.75 metres). Therefore, the observed slope is not practically significant.

3.2 Statistical significance of the slope

Is the observed slope statistically significant? Use a permutation test to answer this question.

Permute the athletes’ hurdles times.
Fit a linear model for the javelin throws as a function of the permuted hurdles times. The slope of the linear model is the response variable.
Store the responses for the permuted data.
Repeat for 10,000 trials.
Draw a histogram of the responses.
Is the observed slope statistically significant?

3.2 Statistical significance of the slope

n_trials <- 10000
responses <- numeric(n_trials)
for (trial in 1:n_trials) {
  hept$hurdles_p <- sample(hept$hurdles)  # Permute hurdles
  model_p <- lm(jt ~ hurdles_p, data = hept)
  responses[trial] <- coefficients(model_p)[2]
}

3.2 Statistical significance of the slope

hist(responses)
abline(v = slope, col = "blue", lwd = 3)

3.2 Statistical significance of the slope

What is the proportion of trials that produced a response that was even more strongly negative than the observed slope?

mean(responses < slope)

## [1] 0.4785

Almost half of the responses was more extreme than what we observed.

Therefore, there is no statistically significant evidence that better hurdle runners tend to be better javelin throwers.

Concluding remarks about Activity 3

The absence of evidence for an association does not imply that we would have evidence to rule out the existence of an association.

For example, it is conceivable that we might have found evidence for an association if we had included data from all Olympic games since 1984 (when heptathlon first appeared in the Olympics).

However, based on the data at hand, we cannot regard the association as statistically significant.

Closing points

Based on today’s class, you should be able to
- understand the purpose of permutation tests.
- conduct permutation tests.
- assess the statistical significance of an observation on the basis of permutation tests.
Practical significance is a judgement call based on the size of an observed effect.
Statistical significance is a mathematical concept based on randomness.
Practical significance does not imply statistical significance and vice versa.