2021-10-21

Introduction

Learning goals

  • Focus so far: data description, visualization, and predictions.

  • Results from data also carry uncertainty. The quantification of uncertainty requires the concept of randomness.

  • Simulations provide hands-on experience with randomness. They generate a response variable under an assumed model.

  • Today we use the R function sample to implement a simulation for the investigation of a basketball phenomenon.

What is uncertainty in data?

  • Until the early 20th century the scientific view of the world was dominated by Newtonian mechanics and determinism: every cause has a definite effect. The world was thought of as a complicated but in principle predictable machine.

  • Then Bohr and Heisenberg introduced causal non-determinism: at its most fundamental level the behaviour of the world cannot be predicted with certainty. We can only make statements of the form “x is likely to occur”, not “x is certain to occur.”

  • Not everyone was ready to accept this. Einstein: “God does not play dice.”

  • Does it matter? Whether or not the world is inherently unpredictable, the fact that we never have complete information/knowledge about the world suggests that we might as well treat it as inherently unpredictable (predictive non-determinism).

Activity 1: Simulation in R

Random process simulation

A random process is an ongoing process in which the next state might depend on the previous state and some element of randomness.

In a simulation, we set the rules of a random process and then let the computer use (pseudo)random numbers to generate outcomes that adhere to those rules.

Simple example: One flip of a fair coin

outcomes <- c("heads", "tails")
sample(outcomes, size = 1, replace = TRUE)
## [1] "heads"

1.1 Simulation in R

To simulate flipping a fair coin 100 times, you could either run the function sample 100 times, or adjust its size argument, which determines how many samples to draw.

  • Create a vector sim_fair_coin that simulates flipping a fair coin 100 times.
  • Compute the proportion of heads in your simulation of a fair coin and compare your result with your teammates’ results.
  • How can you make sim_fair_coin reproducible such that the proportion of heads stays the same when repeatedly evaluating your code?
  • Does the next flip in this random process depend on the previous flip or only on (pseudo)randomness?

1.1 Simulation in R – solution

  • The proportion of heads should be close to .5 but varies across teammates.
  • set.seed can be used to initialize R’s pseudorandom number generator.
set.seed(3)
sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)
table(sim_fair_coin)/length(sim_fair_coin)
## sim_fair_coin
## heads tails 
##  0.44  0.56
  • In this simulation each of the outcomes is independent of the previous outcomes.
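
One way to check this independence empirically is to compare the proportion of heads right after a head with the proportion right after a tail. The following is only a sketch: the 10,000 flips are our own choice (to reduce noise), and the names flips, prev and nxt are hypothetical. For independent flips both proportions should be close to .5.

```r
set.seed(3)
outcomes <- c("heads", "tails")
flips <- sample(outcomes, size = 10000, replace = TRUE)

prev <- flips[-length(flips)]  # flips 1, ..., n-1
nxt  <- flips[-1]              # flips 2, ..., n (the flip following each prev)

# Proportion of heads, split by the outcome of the previous flip
p_after_heads <- mean(nxt[prev == "heads"] == "heads")
p_after_tails <- mean(nxt[prev == "tails"] == "heads")
c(after_heads = p_after_heads, after_tails = p_after_tails)
```

If the next flip depended on the previous one, the two proportions would differ noticeably.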

1.2 An unfair coin

Repeat the simulation – but now with an unfairly weighted coin that we know lands heads only 20% of the time.

  • What proportion of heads do you obtain?
  • Does this proportion vary more or less between teammates than the proportion for the fair coin?

1.2 An unfair coin – solution

We can adjust for the unfairness by adding the argument prob and providing it with a vector of two probability weights.

sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(.2, .8))
table(sim_unfair_coin)/length(sim_unfair_coin)
## sim_unfair_coin
## heads tails 
##  0.18  0.82

prob=c(.2,.8) indicates that for the two elements in the outcomes vector, the first, ‘heads’, is selected with probability .2 and the second, ‘tails’, with probability .8. The default for prob is that each outcome is equally likely.

The proportion of heads for the unfair coin varies less between simulations than the proportion for the fair coin. This is because the variance of a single flip, p(1 − p), is largest when heads and tails are equally likely (p = .5).
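
The difference in variability can be made visible by repeating each simulation many times and looking at the spread of the resulting proportions. This is a sketch under our own assumptions: the helper prop_heads and the choice of 1000 repetitions are hypothetical, and replicate() simply reruns the expression the given number of times.

```r
set.seed(1)
outcomes <- c("heads", "tails")

# Proportion of heads in one simulation of n flips with P(heads) = p_heads
prop_heads <- function(p_heads, n = 100) {
  flips <- sample(outcomes, size = n, replace = TRUE,
                  prob = c(p_heads, 1 - p_heads))
  mean(flips == "heads")
}

props_fair   <- replicate(1000, prop_heads(0.5))
props_unfair <- replicate(1000, prop_heads(0.2))

# Empirical spread of the proportion across repeated simulations
c(fair = sd(props_fair), unfair = sd(props_unfair))

# Theoretical standard deviation sqrt(p * (1 - p) / n) for comparison
sqrt(c(fair = 0.5 * 0.5, unfair = 0.2 * 0.8) / 100)
```

The empirical standard deviations should land near the theoretical values, with the unfair coin showing the smaller spread.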

Activity 2: Hot Hands in Basketball

Background

  • Basketball players who make several baskets in succession are described as having a “hot hand” or being “in the zone”.

  • Fans and players have long believed in the hot hand phenomenon, which contradicts the assumption that each shot is independent of the previous ones. They believe instead that previous success can change a player’s psychological attitude and subsequent success rate.

  • A 1985 paper by Gilovich et al. presented evidence that successive shots behave like independent events.

  • This paper started a big controversy as you can see by Googling “hot hand basketball”.

  • Esterman et al.: In the Zone or Zoning Out? Tracking Behavioral and Neural Fluctuations During Sustained Attention

Learning goals

We do not expect to ultimately resolve the hot-hands controversy today.

We will apply a simulation approach towards answering questions like this. The goals for this activity are to

  • think about the effects of independent and dependent events,
  • learn how to simulate shooting streaks in R,
  • compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real, and to
  • think about the limitations of a simulation in its ability to model reality.

Analysis

  • Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers.

  • His performance against the Orlando Magic in the 2009 NBA finals earned him the title “Most Valuable Player” and many spectators commented on how he appeared to show a hot hand.

2.1 Examining the data

  • Download the data using the following line of code and inspect the data:
load(url("http://www.openintro.org/stat/data/kobe.RData"))

Note: This will also load a custom function we will be using.

  • Do some basic exploration of the data.

  • What does a row in this data frame represent?

  • Which part of the data is most relevant for our hot hand investigation?

2.1 Examining the data – solution

head(kobe)
##    vs game quarter time                                             description
## 1 ORL    1       1 9:47                 Kobe Bryant makes 4-foot two point shot
## 2 ORL    1       1 9:07                               Kobe Bryant misses jumper
## 3 ORL    1       1 8:11                        Kobe Bryant misses 7-foot jumper
## 4 ORL    1       1 7:41 Kobe Bryant makes 16-foot jumper (Derek Fisher assists)
## 5 ORL    1       1 7:03                         Kobe Bryant makes driving layup
## 6 ORL    1       1 6:01                               Kobe Bryant misses jumper
##   basket
## 1      H
## 2      M
## 3      M
## 4      H
## 5      H
## 6      M

2.1 Examining the data – solution

  • Every row records a shot taken by Kobe Bryant.
  • If he hit the shot (made a basket) an H is recorded in the column named basket. Otherwise an M, for miss, is recorded.

Shooting streaks

  • How could we use these data for our hot hand investigation?

  • Just looking at the string of hits and misses, it can be difficult to gauge whether there was a hot hand.

  • One way we can approach this is by considering that hot hand shooters tend to go on shooting streaks.

2.2 Shooting streaks

  • Let’s define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.

  • What does a streak length of 1 mean? How many hits and misses are in a streak of 1?

  • What about a streak length of 0?

  • What was the sequence of hits and misses Kobe had from his nine shots in the first quarter of Game 1?

  • How many streaks are contained in the first nine shots? What are their lengths?

2.2 Shooting streaks – solution

  • A streak of length 1 means that you get 1 hit before a miss.

  • A streak of length 0 means you simply miss on the first attempt, i.e. no hits.

kobe$basket[1:9]
## [1] "H" "M" "M" "H" "H" "M" "M" "M" "M"
  • Within the nine shot attempts, there are six streaks:

H M | M | H H M | M | M | M

  • Their lengths are 1, 0, 2, 0, 0, 0.
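
The counting rule above can be sketched in a few lines of R. The function streak_lengths below is a hypothetical illustration of the definition (hits counted until a miss occurs), not the calc_streak() function loaded with the data, which may treat the end of the sequence differently.

```r
# Streak lengths as defined above: the number of hits before each miss.
# Hits at the very end that are not followed by a miss are not counted here.
streak_lengths <- function(shots) {
  miss_pos <- which(shots == "M")
  # Gap between consecutive misses, minus the miss itself
  diff(c(0, miss_pos)) - 1
}

first_nine <- c("H", "M", "M", "H", "H", "M", "M", "M", "M")
streak_lengths(first_nine)
## [1] 1 0 2 0 0 0
```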

2.3 Distribution of shooting streaks

  • The custom function calc_streak(), which was loaded with the data, may be used to calculate the lengths of all shooting streaks and to then look at their distribution.

  • Use it to visualize the distribution of shooting streaks by Kobe Bryant. What would be an appropriate visualization, given that the shooting streaks are a discrete variable?

  • Describe the distribution of Kobe’s streak lengths from the 2009 NBA finals. What was his typical streak length? How long was his longest streak?

2.3 Distribution of shooting streaks – solution

kobe_streak <- calc_streak(kobe$basket)
barplot(table(kobe_streak), main="Kobe Bryant", 
        ylim=c(0,40), col="lightblue", xlab = "length of shooting streak")

2.3 Distribution of shooting streaks – solution

The distribution is skewed to the right, with a mode at 0, so his typical streak length was 0. His longest streak was 4.

Here, a bar plot from a table is preferable to a histogram since our variable is discrete - counts - instead of continuous.

2.4 Relative frequencies

Compute the relative frequencies of Kobe’s streak lengths.

2.4 Relative frequencies – solution

Compute the relative frequencies of Kobe’s streak lengths.

Does this suggest a hot hand?

table(kobe_streak)/length(kobe_streak)
## kobe_streak
##          0          1          2          3          4 
## 0.51315789 0.31578947 0.07894737 0.07894737 0.01315789

2.5 Comparison

We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had hot hands?

What can we compare them to?

Think back to the activities from the beginning of today’s class.

2.5 Independence: Not having hot hands

  • Two events are independent if the outcome of one doesn’t affect the outcome of the second.

  • If each shot that a player takes is an independent event, having made or missed your first shot will not affect the probability that you will make or miss your second shot.

  • A shooter with a hot hand will have shots that are not independent of one another. Psychologically, she is “in the zone”. Specifically, if the shooter makes her first shot, the hot hand model says she will have a higher probability of making her second shot.

  • If there is no hot hands phenomenon, then Kobe Bryant should be an independent shooter.
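
To make the contrast concrete, the sketch below simulates a hypothetical hot-hand shooter next to an independent one. The probabilities .60 (after a hit) and .40 (after a miss) are assumptions chosen for illustration only, as are the number of shots and all variable names. The for loop simply generates the shots one after another, because each shot depends on the previous one.

```r
set.seed(42)
n_shots <- 10000
outcomes <- c("H", "M")

# Hypothetical hot-hand shooter: the hit probability depends on the previous shot
p_after_hit  <- 0.60  # assumed value, for illustration only
p_after_miss <- 0.40  # assumed value, for illustration only

hot <- character(n_shots)
hot[1] <- sample(outcomes, size = 1)
for (i in 2:n_shots) {
  p_hit <- if (hot[i - 1] == "H") p_after_hit else p_after_miss
  hot[i] <- sample(outcomes, size = 1, prob = c(p_hit, 1 - p_hit))
}

# Independent shooter with the same overall hit rate
overall <- mean(hot == "H")
indep <- sample(outcomes, size = n_shots, replace = TRUE,
                prob = c(overall, 1 - overall))

# Proportion of hits among shots taken directly after a hit
hit_after_hit <- function(shots) {
  prev <- shots[-length(shots)]
  nxt  <- shots[-1]
  mean(nxt[prev == "H"] == "H")
}
c(hot_hand = hit_after_hit(hot), independent = hit_after_hit(indep))
```

For the independent shooter the proportion of hits after a hit stays near the overall hit rate, while for the hot-hand shooter it is clearly higher.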

2.6 How to simulate an independent shooter

We will generate shots of an independent shooter with a simulation.

Before you do any coding, discuss with your teammates the details of how the simulation should proceed and some of the aspects of a real basketball final series that our simulation will not model.

Be specific: What probabilities do you use in your simulation? How many shots do you include in the simulation?

2.6 How to simulate an independent shooter

  • An independent shooter is modeled by simulating 133 shots, each resulting in a hit or a miss.

  • We will record the streak lengths computed from the 133 hits and misses.

  • We will align the success rate of our independent shooter with Kobe’s.

  • For an independent shooter each shot has the same success probability. As an estimate of the success probability of an independent Kobe Bryant let’s use the proportion of hits out of the 133 shots in our data.

  • Our simulation does not precisely reflect a real final series, e.g.:

    • We will be counting shooting streaks across quarters and games.
    • If opponent players believe in the hot hands phenomenon they would likely guard a player more once they believe he starts a streak. This would influence the success probability.

2.7 Simulating an independent shooter

Simulating an independent shooter uses the same mechanism as simulating a coin flip.

We can simulate a single shot from an independent shooter with shooting success rate of 50% by:

outcomes <- c("H", "M")
sim_basket <- sample(outcomes, size = 1, replace = TRUE)

For a sensible comparison between Kobe and our simulated independent shooter, we need to align both their shooting success rate and the number of attempted shots.

Adjust the last line of the above code such that it adheres to Kobe’s shooting percentage and generates 133 shots.

2.7 Simulating an independent shooter - solution

# Outcomes are always hit or miss.
outcomes <- c("H", "M") 

# We compute the hit rate of Kobe Bryant.
hit_prop <- mean(kobe$basket == "H")  # hits / (hits + misses)
hit_prop
## [1] 0.4360902

sim_basket is based on the number of shots in the data frame and the hit rate:

sim_basket <- sample(outcomes, 
                     size = length(kobe$basket),
                     replace = TRUE, 
                     prob = c(hit_prop, 1-hit_prop))

Back to the controversy

How do we tell if Kobe’s shooting streaks are long enough to indicate that he has hot hands?

We now have data to compare Kobe to a simulated independent shooter, who we know does not have hot hands.

Back to the controversy

We can look at Kobe’s data alongside our simulated data:

head(cbind(kobe$basket, sim_basket), 7)
##          sim_basket
## [1,] "H" "H"       
## [2,] "M" "M"       
## [3,] "M" "M"       
## [4,] "H" "H"       
## [5,] "H" "M"       
## [6,] "M" "H"       
## [7,] "M" "H"
  • Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 44% (hit_prop). It is hard to spot a pattern!

  • Let’s look instead at the streak lengths.

2.8 Independent shooter

  • What is the distribution of streak lengths for the independent shooter?
  • What is the independent shooter’s typical streak length?
  • What is the independent shooter’s longest streak length?
  • Are your answers the same as your teammates’? Why?

2.8 Independent shooter – solution

Reuse the code from when we answered this for kobe$basket:

sim_streak <- calc_streak(sim_basket)
table(sim_streak)/length(sim_streak)
## sim_streak
##          0          1          2          3          4          5 
## 0.53030303 0.22727273 0.06060606 0.09090909 0.04545455 0.04545455

The typical, most common length is 0.

We can also get the maximum directly:

max(sim_streak)
## [1] 5
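
For reference, the relative frequencies we would expect from an independent shooter can also be worked out directly: a streak of length k consists of k hits followed by a miss, so its probability is hit_prop^k * (1 - hit_prop). The sketch below assumes the rounded hit rate .436 from above and ignores what happens at the very end of the shot sequence; the name expected_freq is hypothetical.

```r
hit_prop <- 0.436  # rounded hit rate from above (assumption: rounding is harmless)
k <- 0:5           # streak lengths to evaluate

# k independent hits followed by one miss
expected_freq <- hit_prop^k * (1 - hit_prop)
names(expected_freq) <- k
round(expected_freq, 3)
```

These values can be set against the simulated relative frequencies, which fluctuate around them from one simulation to the next.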

2.9 Comparison: Kobe vs independent shooter

Let’s compare “our” shots with Kobe Bryant!

  • Create a visualization of kobe_streak and sim_streak to compare our simulated independent shooter with Kobe Bryant with regard to their “hot handedness”.

2.9 Comparison: Kobe vs independent shooter – solution

par(mfrow = c(1,2))
barplot(table(kobe_streak), main = "Kobe Bryant", col="lightblue" , ylim = c(0, 40))
barplot(table(sim_streak), main = "Independent shooter", col="pink" ,ylim = c(0, 40))

par(mfrow = c(1,1))

2.10 Discussion

  • Do we have evidence in favor of Kobe Bryant having a hot hand?
  • Do we have evidence against it?
  • Can we generate more evidence one way or the other with the tools we have discussed?

2.10 Discussion – remarks

  • Our example does not show strong evidence of a hot hand: Both the simulated data and Kobe’s shots have similar streak lengths.
  • Evidence against any hot hand is difficult to establish because the hot hand effect could be tiny to the point of being indiscernible.
  • Repeated simulations yield different distributions of streak lengths. Maybe our independent shooter just got lucky. Repeating the simulation might give us a better sense of whether Kobe’s streak lengths are expected if we assume him to be an independent shooter.

2.11 Repeating the simulation

  • Repeat the simulation a couple of times.
  • Compute the maximum streak length for each.
  • Compare this with Kobe’s maximum streak length.

2.11 Repeating the simulation – solution

max(calc_streak(sample(outcomes, size = length(kobe$basket),
replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 5
max(calc_streak(sample(outcomes, size = length(kobe$basket),
replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 6
max(calc_streak(sample(outcomes, size = length(kobe$basket),
replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 7

2.11 Repeating the simulation – solution

max(calc_streak(sample(outcomes, size = length(kobe$basket),
replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 6
max(calc_streak(sample(outcomes, size = length(kobe$basket),
replace = TRUE, prob = c(hit_prop, 1-hit_prop))))
## [1] 5
max(kobe_streak)
## [1] 4

Kobe’s longest streak length is not longer than expected from an independent shooter, providing no evidence for the hot hand hypothesis.
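
Rather than copying the same line five times, the repetition can be automated. The sketch below is one possible way, not the course's method: it uses replicate() with 1000 repetitions (our choice), the rounded hit rate .436 and the 133 shots from above, and a hypothetical helper max_streak built on rle() as a stand-in for max(calc_streak(...)), so that it runs without the downloaded data.

```r
set.seed(2021)
outcomes <- c("H", "M")
hit_prop <- 0.436  # rounded hit rate from above
n_shots  <- 133    # number of shots in the kobe data

# Hypothetical stand-in for max(calc_streak(...)):
# the longest run of consecutive hits, found via run-length encoding
max_streak <- function(shots) {
  runs <- rle(shots)
  hit_runs <- runs$lengths[runs$values == "H"]
  if (length(hit_runs) == 0) 0 else max(hit_runs)
}

# Longest streak in each of 1000 simulated independent shooters
sim_max <- replicate(1000, max_streak(
  sample(outcomes, size = n_shots, replace = TRUE,
         prob = c(hit_prop, 1 - hit_prop))))

table(sim_max) / length(sim_max)  # distribution of the longest streak
mean(sim_max >= 4)                # share of simulations with a longest streak of at least 4
```

The last line gives a rough sense of how common a longest streak of 4 or more is for an independent shooter with the same hit rate.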

Conclusion

Learning goals

Today we used a simulation approach to find the answer to a question. We

  • thought about the effects of independent and dependent events,
  • learned how to do simple simulations in R,
  • and compared a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.

Additionally, we

  • confirmed that R generates pseudorandom numbers since set.seed makes them deterministic.
  • discussed how it can be hard to disprove something with data, for instance if the hot-hand effect exists but is indiscernible.
  • saw the need for multiple trials when considering whether Kobe Bryant had hot hands.

Recap

  • Simulations are a powerful tool to investigate data claims.
  • Randomness is key to simulations.
  • While previously we focused on the linear model, today we simulated data from different models.
  • Simulations give us a sense of uncertainty in data (such as the maximum streak length) if the simulated model were true.

Preview

  • Repeating trials or simulations can get tedious. Programming languages can automate the repeats using for loops.

  • We need to quantify uncertainty: What would be the average maximal streak length of an independent shooter? If Kobe Bryant’s maximal streak length in the 2009 NBA finals was different, how can we make a statement about whether this difference was significant or due to randomness?

  • Randomness also plays a crucial role in sampling as the process of obtaining a subset from a population.

  • The goal is for the subset to be representative of the entire population, which can be achieved by adding randomness to the sampling procedure.