2021-10-21
Focus so far: data description, visualization, and predictions.
Results from data also carry uncertainty. The quantification of uncertainty requires the concept of randomness.
Simulations provide hands-on experience with randomness. They generate a response variable under an assumed model.
Today we use the R function sample to implement a simulation for the investigation of a basketball phenomenon.
Until the early 20th century, the scientific view of the world was dominated by Newtonian mechanics and determinism: every cause produces a definite effect. The world was thought of as a complicated but, in principle, predictable machine.
Then Bohr and Heisenberg introduced causal non-determinism: at its most fundamental level, the behaviour of the world cannot be predicted with certainty. We can only make statements of the form “x is likely to occur”, not “x is certain to occur.”
Not everyone was ready to accept this. Einstein: “God does not play dice.”
Does it matter? Whether or not the world is inherently unpredictable, the fact that we never have complete information/knowledge about the world suggests that we might as well treat it as inherently unpredictable (predictive non-determinism).
A random process is an ongoing process in which the next state might depend on the previous state and some element of randomness.
In a simulation, we set the rules of a random process and then let the computer use (pseudo)random numbers to generate outcomes that adhere to those rules.
Simple example: One flip of a fair coin
outcomes <- c("heads", "tails")
sample(outcomes, size = 1, replace = TRUE)
## [1] "heads"
To simulate flipping a fair coin 100 times, you could either run the function sample 100 times, or adjust its size argument, which determines how many samples to draw.
Task: create a vector sim_fair_coin that simulates flipping a fair coin 100 times.
How can you make sim_fair_coin reproducible, such that the proportion of heads stays the same when repeatedly evaluating your code?
set.seed can be used to initialize R’s pseudorandom number generator.
set.seed(3)
sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)
table(sim_fair_coin)/length(sim_fair_coin)
## sim_fair_coin
## heads tails
##  0.44  0.56
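The effect of set.seed can be checked directly. The following is a minimal sketch; the names first_run, second_run and third_run are illustrative and not part of the original exercise. Resetting the same seed before each call should reproduce the same 100 flips, whereas a further call without reseeding continues the pseudorandom sequence and will usually give a different result.

```r
outcomes <- c("heads", "tails")

# Same seed, same call: the generator starts from the same state.
set.seed(3)
first_run <- sample(outcomes, size = 100, replace = TRUE)

set.seed(3)
second_run <- sample(outcomes, size = 100, replace = TRUE)

# Both runs contain exactly the same sequence of flips.
identical(first_run, second_run)

# Without resetting the seed, the generator simply continues,
# so this run is a new set of flips.
third_run <- sample(outcomes, size = 100, replace = TRUE)
mean(third_run == "heads")
```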
Repeat the simulation – but now with an unfairly weighted coin that we know lands heads only 20% of the time.
We can adjust for the unfairness by adding the argument prob and providing it with a vector of two probability weights.
sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(.2, .8))
table(sim_unfair_coin)/length(sim_unfair_coin)
## sim_unfair_coin
## heads tails
##  0.18  0.82
prob = c(.2, .8) indicates that for the two elements in the outcomes vector, the first, ‘heads’, is selected with probability .2 and the second, ‘tails’, with probability .8. The default for prob is that each outcome is equally likely.
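The proportion of heads varies less from one simulation to the next for this unfair coin than for a fair coin. This is because the variance of a single flip, p(1 - p), is largest when heads and tails are equally likely (p = 0.5), so uncertainty is maximized at exactly 50/50.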
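One way to see this difference in variability is to repeat each 100-flip simulation many times and compare the spread of the resulting proportions. The sketch below is illustrative only: the helper prop_heads, the seed and the number of repetitions (2000) are assumptions made for this example.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

outcomes <- c("heads", "tails")

# Proportion of heads in one simulated run of n_flips flips,
# where heads comes up with probability p_heads.
prop_heads <- function(p_heads, n_flips = 100) {
  flips <- sample(outcomes, size = n_flips, replace = TRUE,
                  prob = c(p_heads, 1 - p_heads))
  mean(flips == "heads")
}

n_runs <- 2000
props_fair   <- replicate(n_runs, prop_heads(0.5))
props_unfair <- replicate(n_runs, prop_heads(0.2))

# Empirical spread of the proportion across runs.
sd(props_fair)
sd(props_unfair)

# Theoretical standard deviations sqrt(p * (1 - p) / n) for comparison.
sqrt(0.5 * 0.5 / 100)  # 0.05
sqrt(0.2 * 0.8 / 100)  # 0.04
```

The empirical standard deviations should land close to the theoretical values, with the fair coin showing the larger spread.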
Edited from the Probability lab of OpenIntro Statistics (https://www.openintro.org/stat)
Basketball players who make several baskets in succession are described as having a “hot hand” or being “in the zone”.
Fans and players have long believed in the hot hand phenomenon, which contradicts the assumption that each shot is independent of the next. Instead they believe that previous success can change the psychological attitude and subsequent success rate of a player.
A 1985 paper by Gilovich et al. collected evidence to show that successive shots are independent events.
This paper started a big controversy as you can see by Googling “hot hand basketball”.
Esterman et al.: In the Zone or Zoning Out? Tracking Behavioral and Neural Fluctuations During Sustained Attention
We do not expect to ultimately resolve the hot-hands controversy today.
We will apply a simulation approach towards answering questions like this. The goals for this activity are to
Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers.
His performance against the Orlando Magic in the 2009 NBA finals earned him the title “Most Valuable Player” and many spectators commented on how he appeared to show a hot hand.
load(url("http://www.openintro.org/stat/data/kobe.RData"))
Note: This will also load a custom function we will be using.
Do some basic exploration of the data.
What does a row in this data frame represent?
Which part of the data is most relevant for our hot hand investigation?
head(kobe)
##    vs game quarter time                                             description
## 1 ORL    1       1 9:47                 Kobe Bryant makes 4-foot two point shot
## 2 ORL    1       1 9:07                               Kobe Bryant misses jumper
## 3 ORL    1       1 8:11                        Kobe Bryant misses 7-foot jumper
## 4 ORL    1       1 7:41 Kobe Bryant makes 16-foot jumper (Derek Fisher assists)
## 5 ORL    1       1 7:03                         Kobe Bryant makes driving layup
## 6 ORL    1       1 6:01                               Kobe Bryant misses jumper
##   basket
## 1      H
## 2      M
## 3      M
## 4      H
## 5      H
## 6      M
How could we use these data for our hot hand investigation?
Just looking at the string of hits and misses, it can be difficult to gauge whether there was a hot hand.
One way we can approach this is by considering that hot hand shooters tend to go on shooting streaks.
Let’s define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.
What does a streak length of 1 mean? How many hits and misses are in a streak of 1?
What about a streak length of 0?
What was the sequence of hits and misses Kobe had from his nine shots in the first quarter of Game 1?
How many streaks are contained in the first nine shots? What are their lengths?
A streak of length 1 means that you get 1 hit before a miss.
A streak of length 0 means you simply miss on the first attempt, i.e. no hits.
kobe$basket[1:9]
## [1] "H" "M" "M" "H" "H" "M" "M" "M" "M"
H M | M | H H M | M | M | M
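The streak definition above can be turned into a few lines of R. The sketch below uses a hypothetical helper, streak_lengths, written only for illustration. It is not the calc_streak function that is loaded with the data, which may treat the end of the sequence differently. Each miss closes one streak, and the streak length is the number of hits since the previous miss.

```r
# Hypothetical helper (not calc_streak): lengths of shooting streaks,
# where a streak is the run of hits that ends with a miss.
streak_lengths <- function(shots) {
  miss_pos <- which(shots == "M")   # positions of the misses
  bounds <- c(0, miss_pos)          # 0 marks the start of the sequence
  lens <- diff(bounds) - 1          # hits between consecutive misses
  # Hits after the last miss form a final, unfinished streak.
  tail_hits <- length(shots) - max(bounds)
  if (tail_hits > 0) lens <- c(lens, tail_hits)
  lens
}

# Kobe's nine shots in the first quarter of Game 1 (from the slides).
first_quarter <- c("H", "M", "M", "H", "H", "M", "M", "M", "M")
streak_lengths(first_quarter)  # 1 0 2 0 0 0
```

This reproduces the six streaks marked in the split sequence above: H M | M | H H M | M | M | M.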
The custom function calc_streak(), which was loaded with the data, may be used to calculate the lengths of all shooting streaks and to then look at their distribution.
Use it to visualize the distribution of shooting streaks by Kobe Bryant. What would be an appropriate visualization, given that the shooting streaks are a discrete variable?
Describe the distribution of Kobe’s streak lengths from the 2009 NBA finals. What was his typical streak length? How long was his longest streak?
kobe_streak <- calc_streak(kobe$basket)
barplot(table(kobe_streak), main = "Kobe Bryant", ylim = c(0, 40), col = "lightblue", xlab = "length of shooting streak")
Skewed to the right, with a mode at 0. His longest streak was 4.
Here, a bar plot from a table is preferable to a histogram since our variable is discrete - counts - instead of continuous.
Compute the relative frequencies of Kobe’s streak lengths.
Does this suggest a hot hand?
table(kobe_streak)/length(kobe_streak)
## kobe_streak
##          0          1          2          3          4
## 0.51315789 0.31578947 0.07894737 0.07894737 0.01315789
We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had hot hands?
What can we compare them to?
Think back to the activities from the beginning of today’s class.
Two events are independent if the outcome of one doesn’t affect the outcome of the second.
If each shot that a player takes is an independent event, having made or missed your first shot will not affect the probability that you will make or miss your second shot.
A shooter with a hot hand will have shots that are not independent of one another. Psychologically, she is “in the zone”. Specifically, if the shooter makes her first shot, the hot hand model says she will have a higher probability of making her second shot.
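To make the contrast with an independent shooter concrete, a hot hand shooter can be simulated by letting the hit probability depend on the previous shot. The sketch below is purely illustrative: the probabilities 0.6 (after a hit) and 0.4 (otherwise), the seed and the number of shots are made-up assumptions, not estimates from any data.

```r
set.seed(2)  # arbitrary seed, chosen only for reproducibility

# Hypothetical hot hand model: a hit raises the chance of the next hit.
p_after_hit <- 0.6   # assumed P(hit | previous shot was a hit)
p_otherwise <- 0.4   # assumed P(hit) for the first shot or after a miss

n_shots <- 10000
shots <- character(n_shots)
for (i in seq_len(n_shots)) {
  made_previous <- i > 1 && shots[i - 1] == "H"
  p_hit <- if (made_previous) p_after_hit else p_otherwise
  shots[i] <- sample(c("H", "M"), size = 1, prob = c(p_hit, 1 - p_hit))
}

# Compare hit rates conditional on the outcome of the previous shot.
previous <- shots[-n_shots]
current  <- shots[-1]
hit_rate_after_hit  <- mean(current[previous == "H"] == "H")
hit_rate_after_miss <- mean(current[previous == "M"] == "H")
hit_rate_after_hit
hit_rate_after_miss
```

For an independent shooter the two conditional hit rates would be roughly equal. Under this assumed hot hand model, the rate after a hit is noticeably higher.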
If there is no hot hands phenomenon, then Kobe Bryant should be an independent shooter.
We will generate shots of an independent shooter with a simulation.
Before you do any coding, discuss with your teammates the details of how the simulation should proceed and some of the aspects of a real basketball final series that our simulation will not model.
Be specific: What probabilities do you use in your simulation? How many shots do you include in the simulation?
An independent shooter is modeled by simulating 133 shots, each resulting in a hit or a miss.
We will record the streak lengths computed from the 133 hits and misses.
We will align the success rate of our independent shooter with Kobe’s.
For an independent shooter each shot has the same success probability. As an estimate of the success probability of an independent Kobe Bryant let’s use the proportion of hits out of the 133 shots in our data.
Our simulation does not precisely reflect a real final series, e.g.:
Simulating an independent shooter uses the same mechanism as simulating a coin flip.
We can simulate a single shot from an independent shooter with shooting success rate of 50% by:
outcomes <- c("H", "M")
sim_basket <- sample(outcomes, size = 1, replace = TRUE)
For a sensible comparison between Kobe and our simulated independent shooter, we need to align both their shooting success rate and the number of attempted shots.
Adjust the last line of the above code such that it adheres to Kobe’s shooting percentage and generates 133 shots.
# Outcomes are always hit or miss.
outcomes <- c("H", "M")
# We compute the hit rate of Kobe Bryant.
hit_prop <- mean(kobe$basket == "H") # hits / (hits + misses)
hit_prop
## [1] 0.4360902
sim_basket is based on the number of shots in the data frame and the hit rate:
sim_basket <- sample(outcomes, size = length(kobe$basket), replace = TRUE, prob = c(hit_prop, 1-hit_prop))
How do we tell if Kobe’s shooting streaks are long enough to indicate that he has hot hands?
We now have data to compare Kobe to a simulated independent shooter, who we know does not have hot hands.
We can look at Kobe’s data alongside our simulated data:
head(cbind(kobe$basket, sim_basket), 7)
##          sim_basket
## [1,] "H" "H"
## [2,] "M" "M"
## [3,] "M" "M"
## [4,] "H" "H"
## [5,] "H" "M"
## [6,] "M" "H"
## [7,] "M" "H"
Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 44% (hit_prop). It is hard to spot a pattern!
Let’s look instead at the streak lengths.
Reuse the code from when we answered this for kobe$basket:
sim_streak <- calc_streak(sim_basket)
table(sim_streak)/length(sim_streak)
## sim_streak
##          0          1          2          3          4          5
## 0.53030303 0.22727273 0.06060606 0.09090909 0.04545455 0.04545455
The typical, most common length is 0.
We can also get the maximum directly:
max(sim_streak)
## [1] 5
Let’s compare “our” shots with Kobe Bryant!
Use kobe_streak and sim_streak to compare our simulated independent shooter with Kobe Bryant with regard to their “hot handedness”.
par(mfrow = c(1, 2))
barplot(table(kobe_streak), main = "Kobe Bryant", col = "lightblue", ylim = c(0, 40))
barplot(table(sim_streak), main = "Independent shooter", col = "pink", ylim = c(0, 40))
par(mfrow = c(1,1))
max(calc_streak(sample(outcomes, size = length(kobe$basket), replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 5
max(calc_streak(sample(outcomes, size = length(kobe$basket), replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 6
max(calc_streak(sample(outcomes, size = length(kobe$basket), replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 7
max(calc_streak(sample(outcomes, size = length(kobe$basket), replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 6
max(calc_streak(sample(outcomes, size = length(kobe$basket), replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 5
max(kobe_streak)
## [1] 4
Kobe’s longest streak length is not longer than expected from an independent shooter, providing no evidence for the hot hand hypothesis.
Today we used a simulation approach to find the answer to a question. We
Additionally, we
set.seed makes them deterministic.
Repeating trials or simulations can get tedious. Programming languages can automate the repeats using for loops.
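As a rough sketch of what such automation could look like, the loop below repeats the independent shooter simulation many times and records the longest streak of each run. Several things here are assumptions made for illustration: the helper longest_streak (built on base R's rle, standing in for max(calc_streak(...))), the seed, and the number of repetitions (1000). The number of shots and the hit rate are the values reported earlier in these notes.

```r
set.seed(4)  # arbitrary seed, chosen only for reproducibility

outcomes <- c("H", "M")
n_shots  <- 133        # number of shots in the Kobe data
hit_prop <- 0.4360902  # Kobe's hit rate as printed above

# Hypothetical helper: longest run of consecutive hits in a shot sequence.
longest_streak <- function(shots) {
  runs <- rle(shots)                           # run-length encoding
  hit_runs <- runs$lengths[runs$values == "H"]
  if (length(hit_runs) == 0) 0 else max(hit_runs)
}

n_sims <- 1000
max_streaks <- numeric(n_sims)
for (i in seq_len(n_sims)) {
  sim <- sample(outcomes, size = n_shots, replace = TRUE,
                prob = c(hit_prop, 1 - hit_prop))
  max_streaks[i] <- longest_streak(sim)
}

mean(max_streaks)            # average longest streak across simulations
table(max_streaks) / n_sims  # distribution of the longest streak
mean(max_streaks >= 4)       # share of runs at least as long as Kobe's 4
```

The resulting distribution gives a reference against which Kobe's longest streak of 4 can be judged, instead of relying on five hand-repeated runs.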
We need to quantify uncertainty: What would be the average maximal streak length of an independent shooter? If Kobe Bryant’s maximal streak length in the 2009 NBA finals was different, how can we make a statement about whether this difference was significant or due to randomness?
Randomness also plays a crucial role in sampling as the process of obtaining a subset from a population.
The goal is for the subset to be representative of the entire population, which can be achieved by adding randomness to the sampling procedure.