AY2021-2022, Semester 1, Week 13 Thursday
Uncertainty quantification requires the concept of randomness, introduced recently.
Statistical significance is one method for uncertainty quantification.
We recently introduced the for
-loop, a programming technique which allows for more efficient programming of simulations.
Today, we use simulations to determine statistical significance.
Observed data exhibit a statistically significant feature if the probability to observe this feature is unlikely to be the result of randomness alone.
A permutation test is a method to investigate statistical significance.
If we are interested in the relation between two variables, we permute the values of one of the variables to break any association between the variables.
We compare the distribution of simulated data with observed data to determine statistical significance.
After today’s class, you should be able to:
“Women and children first” is a code of conduct during maritime disasters if there are not enough lifeboats to rescue everybody. Is there evidence that women were more likely to survive the sinking of the Titanic than men?
The file titanic.csv
contains data on people on the Titanic. Each row represents one person. The columns describe different attributes of the person.
titanic <- read.csv("titanic.csv") head(titanic)
## fam_name given_name gender age class survived ticket pax_on_tckt pnd shl pnc ## 1 ABBING Anthony Male 41 3rd FALSE 5547 1 7 11 0 ## 2 ABBOTT Ernest Owen Male 21 Crew FALSE <NA> NA NA NA NA ## 3 ABBOTT Eugene Joseph Male 13 3rd FALSE CA2673 3 20 5 0 ## 4 ABBOTT Rossmore Edward Male 16 3rd FALSE CA2673 3 20 5 0 ## 5 ABBOTT Rhoda Mary 'Rosa' Female 39 3rd TRUE CA2673 3 20 5 0 ## 6 ABELSETH Kalle (Karen) Marie Kristiane Female 16 3rd TRUE 348125 1 7 13 0
survived
and gender
.survival_by_gender <- table(titanic$survived, titanic$gender) survival_by_gender
## ## Female Male ## FALSE 130 1366 ## TRUE 357 355
barplot(survival_by_gender, main = "Passengers on the Titanic", legend.text = c("Death", "Survivor"), args.legend = list(x = "topleft"))
men <- titanic[titanic$gender == "Male", ] mean(!men$survived)
## [1] 0.7937246
women <- titanic[titanic$gender == "Female", ] mean(!women$survived)
## [1] 0.2669405
# Alternative aggregate(!survived ~ gender, data = titanic, mean)
## gender !survived ## 1 Female 0.2669405 ## 2 Male 0.7937246
79% of men died versus 27% of women.
We want to study the statistical significance of changing two statistics: male mortality and female mortality. We define male (female) mortality as the number of male (female) deaths divided by the number of men (women) on board.
Comparing two statistics simultaneously is difficult because it requires thinking and plotting graphs in more than two dimensions. Instead, we combine the two statistics into a single response variable. One out of many possibilities is to take the “mortality ratio”: \[ \displaystyle\text{response}=\frac{\text{male mortality}}{\text{female mortality}}\ . \]
What is the range of possible values of this response variable?
What would have been the response if gender had played no role in the survival probability?
The mortality ratio can take any nonnegative value:
If all men and all women survived, the mortality ratio is undefined (zero divided by zero).
If men and women had died with the same probability, the response would have been approximately 1.
Let us perform a simulation to check for statistical significance.
gender_p
to titanic
that is a permutation of the values in the column gender
. Only perform one permutation for the time being (i.e. no for
-loop yet).male_mortality <- mean(!men$survived) female_mortality <- mean(!women$survived) mortality_ratio <- male_mortality / female_mortality mortality_ratio
## [1] 2.973414
The observed mortality ratio is approximately 2.97.
titanic$gender_p <- sample(titanic$gender) # Permute gender men_p <- titanic[titanic$gender_p == "Male", ] women_p <- titanic[titanic$gender_p == "Female", ] male_mortality_p <- mean(!men_p$survived) female_mortality_p <- mean(!women_p$survived) response <- male_mortality_p / female_mortality_p response
## [1] 1.027646
Is it plausible that the response
of the permuted data has a value close to 1?
Yes, the permutation broke the association between gender and survival; thus, the male and female mortalities are approximately equal.
We wrote code to generate a single response based on a random permutation. We will use this code to test whether a mortality ratio of 2.97 is statistically significant.
responses
. (Hint: You need a for
-loop.)n_trials <- 10000 responses <- numeric(n_trials) for (trial in 1:n_trials) { titanic$gender_p <- sample(titanic$gender) # Permute gender men_p <- titanic[titanic$gender_p == "Male", ] women_p <- titanic[titanic$gender_p == "Female", ] male_mortality_p <- mean(!men_p$survived) female_mortality_p <- mean(!women_p$survived) responses[trial] <- male_mortality_p / female_mortality_p }
hist(responses)
The observed mortality ratio of 2.97 is statistically significant because all responses in the permutation test were smaller.
We performed a permutation test, which randomly permutes the gender of people on board and then calculates the response variable (the ratio of the male mortality to the female mortality).
By using many permutations, we obtained the distribution of the response variable.
We found that it is extremely unlikely that a mortality ratio of 2.97 could have arisen by chance.
We conclude that there is an association between gender and survival. Men had a statistically significantly higher mortality than women.
We cannot conclude that gender was the cause of the difference in mortality. For example, there might be a lurking variable: most crew members were male, and crew members were less likely to be rescued than passengers.
In the previous activity, we found that women were more likely to survive than men. Was it also the case that younger people were more likely to be rescued than adults (defined as people \(\geq\) 18 years of age)?
We can use a similar response variable as before: \[ \displaystyle\text{response}=\frac{\text{adult mortality}}{\text{mortality of younger people}}\ . \]
If adults were less likely to survive, then “response > 1”.
is_adult
to titanic
with a value TRUE
if the person was an adult and FALSE
otherwise.titanic$is_adult <- (titanic$age >= 18) adults <- titanic[which(titanic$is_adult), ] # Use which() to remove NAs younger <- titanic[which(!titanic$is_adult), ] adult_mortality <- mean(!adults$survived) younger_mortality <- mean(!younger$survived) mortality_ratio <- adult_mortality / younger_mortality mortality_ratio
## [1] 1.188829
It is a subjective judgement call whether to consider an adult mortality as practically significant if it is 19% higher than the mortality of younger people.
Compared to the effect of gender on survival, the effect of age is relatively small.
Still, the age-dependent difference had practical significance for those younger people whose lives were saved as a result.
Is the adult mortality significantly higher than the mortality of younger people?
n_trials <- 10000 responses <- numeric(n_trials) for (trial in 1:n_trials) { titanic$is_adult_p <- sample(titanic$is_adult) adults_p <- titanic[which(titanic$is_adult_p), ] younger_p <- titanic[which(!titanic$is_adult_p), ] adult_mortality_p <- mean(!adults_p$survived) younger_mortality_p <- mean(!younger_p$survived) responses[trial] <- adult_mortality_p / younger_mortality_p }
hist(responses) # Insert vertical line at the observed value abline(v = mortality_ratio, col = "blue", lwd = 3)
mean(responses > mortality_ratio)
## [1] 8e-04
The mortality ratio was more extreme than the observed value in only 0.08% of the permutations. We conclude that adults were statistically significantly more likely to die.
We cannot conclude that being an adult was the cause of a higher mortality. What might be a lurking variable?
The permutation test for this activity was similar to the permutation test in activity 1, but we randomly permuted the variable is_adult
instead of gender.
A small percentage of trials (0.08% in my simulation; your value may differ) produced a response that was more extreme (i.e. larger) than the observed mortality ratio. This percentage is below the common threshold of 5%. Thus, we conclude that the result was statistically significant.
The 5%-threshold is arbitrary, but it is a common criterion in statistics. Thus, we generally adopt this threshold in this course.
The file hept.csv
contains data on the 2016 women’s Olympic heptathlon.
hept <- read.csv("hept.csv")
Each row represents one competing athlete. The columns give details on the athletes and their times or distances in each event. The columns are given in the order the events took place. Here are the first few rows and columns.
head(hept[, 1:12])
## rank athlete country pts_total hurdles pts_hurdles hj pts_hj sp pts_sp run200 pts_run200 ## 1 1 Nafissatou Thiam BEL 6810 13.56 1041 1.98 1211 14.91 855 25.10 878 ## 2 2 Jessica Ennis-Hill GBR 6775 12.84 1149 1.89 1093 13.86 785 23.49 1030 ## 3 3 Brianne Theisen Eaton CAN 6653 13.18 1097 1.86 1054 13.45 757 24.18 963 ## 4 4 Laura Ikauniece-Admidina LAT 6617 13.33 1075 1.77 941 13.52 762 23.76 1004 ## 5 5 Carolin Schafer GER 6540 13.12 1106 1.83 1016 14.57 832 23.99 982 ## 6 6 Katarina Johnson-Thompson GBR 6523 13.48 1053 1.98 1211 11.68 640 23.26 1053
One might hypothesize that athletes who are better in hurdles
are also better in javelin throw (jt
).
jt
versus hurdles
. Verify that the conditions for linear regression are satisfied.jt
from hurdles
.plot(jt ~ hurdles, data = hept, main = "Olympic heptathlon 2016", xlab = "Hurdles (seconds)", ylab = "Javelin throw (metres)")
We satisfy the linear regression conditions:
model <- lm(jt ~ hurdles, data = hept) plot(jt ~ hurdles, data = hept, main = "Olympic heptathlon 2016", xlab = "Hurdles (seconds)", ylab = "Javelin throw (metres)") abline(model, col = "blue")
slope <- coefficients(model)[2] slope
## hurdles ## -0.238086
For every extra second an athlete needs in the hurdles event, the linear model predicts that the athlete throws the javelin 23.8 centimetres shorter.
summary(model)$r.squared
## [1] 0.0002317609
The model only predicts 0.02% of the variability of the javelin throw distances. However, a low \(R^2\) does not necessarily imply that the slope is insignificant (neither practically nor statistically).
Keep in mind that being better at hurdles is equivalent to having a smaller value in hurdles
. However, being better in javelin throw translates into having a larger value in javelin
. Thus, we might expect the slope of the linear model to be negative.
However, the observed effect is small. Even a one-second difference in hurdle time (which is a big difference in practice) only leads to a difference of a few centimetres in the predicted javelin throw distance (which has a standard deviation of 5.75 metres). Therefore, the observed slope is not practically significant.
Is the observed slope statistically significant? Use a permutation test to answer this question.
n_trials <- 10000 responses <- numeric(n_trials) for (trial in 1:n_trials) { hept$hurdles_p <- sample(hept$hurdles) # Permute hurdles model_p <- lm(jt ~ hurdles_p, data = hept) responses[trial] <- coefficients(model_p)[2] }
hist(responses) abline(v = slope, col = "blue", lwd = 3)
What is the proportion of trials that produced a response that was even more strongly negative than the observed slope?
mean(responses < slope)
## [1] 0.4785
Almost half of the responses was more extreme than what we observed.
Therefore, there is no statistically significant evidence that better hurdle runners tend to be better javelin throwers.
The absence of evidence for an association does not imply that we would have evidence to rule out the existence of an association.
For example, it is conceivable that we might have found evidence for an association if we had included data from all Olympic games since 1984 (when heptathlon first appeared in the Olympics).
However, based on the data at hand, we cannot regard the association as statistically significant.