Week 11, Lesson 1

Course context

  • In the previous class, we used randomness in a simulation to investigate the alleged hot-hand phenomenon in basketball. We found that repeating the simulation was useful because of the variability in its outcomes.

  • If you repeat a simulation with built-in randomness to reach a conclusion, then you are using the Monte Carlo method.

  • Randomness provides a way to make samples created in the simulation representative of the population. We can use those samples to draw conclusions about that population.

  • But how do we cope with the variability in our results that is due to randomness?

Activity I

Learning goals

  • To practice implementing a simple simulation and to start thinking about its properties.

  • To consider a situation for which the mathematics quickly becomes so complex that a simulation is a better way forward. (Much more complex than simulating two dice rolls.)

  • To observe the variability in our simulation results, and to understand what happens when we adjust different aspects of our simulations.

Birthday problem

In a group of 367 students, we can be certain that at least 2 students share the same birthday (pigeonhole principle).

In a class of 10 students, what is the chance that at least 2 share the same birthday?

How about for class sizes of 20 or 40 students?

Let’s start with a simplifying assumption: All 366 calendar days are equally likely to be a birthday (even February 29).

1.1 Simulating one class

For this exercise, our population is all classes of size 10.

Use R to randomly choose one class from this population. Then check whether at least 2 students in it share a birthday.

Hints:

  • Use sample to create a vector of length equal to the number of students in the class, in which each position represents a student, and randomly assign a birthday to each student.

  • Use table to check if at least two students share a birthday.

  • Try to make your code easily modifiable, so that you can reuse it for different numbers of students or different numbers of shared birthdays.

1.1 Simulating one class

nbr_students <- 10   # class size
nbr_same <- 2        # how many students must share a birthday
birthdays <- sample(1:366, size = nbr_students, replace = TRUE)
max(table(birthdays)) >= nbr_same
## [1] FALSE

We are not interested in the birthdays themselves, just whether at least two are the same. table(birthdays) records the number of times each calendar day was a birthday.

The code results in either FALSE (equivalently 0) if there isn’t a shared birthday, or TRUE (equivalently 1) if there is.
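As a small made-up illustration of how table exposes duplicates (the birthday values below are invented):

bdays <- c(42, 17, 203, 42, 310)   # 5 students; two share day 42
table(bdays)
## bdays
##  17  42 203 310
##   1   2   1   1
max(table(bdays)) >= 2
## [1] TRUE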

1.2 Simulating many classes

Using the previous code, implement a for-loop to count how many classes have at least 2 students who share a birthday if you repeat the process 10 times.

  • Use a variable numHits for counting during the for-loop. (Hint: Recall that max(table(birthdays)) >= nbr_same returns either 0 or 1.)

  • Calculate the relative frequency of classes with at least 2 students sharing a birthday in your randomly chosen 10 classes.

  • What happens to the relative frequency if you change the class size? Try 20 and 40.

  • What happens to the relative frequency if you change the number of classes you examine? Try 100 and 1000.

1.2 Simulating many classes

nbr_students <- 10
nbr_same <- 2
numTrials <- 1000
numHits <- 0

for (t in 1:numTrials) {
  birthdays <- sample(1:366, size = nbr_students, replace = TRUE)
  numHits <- numHits + (max(table(birthdays)) >= nbr_same)
}
numHits/numTrials
## [1] 0.125
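As an aside (not required for the exercise), a sketch of the same estimate without an explicit loop, using base R's replicate:

hits <- replicate(numTrials,
                  max(table(sample(1:366, size = nbr_students, replace = TRUE))) >= nbr_same)
mean(hits)   # relative frequency of classes with a shared birthday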

1.2 Comparison with actual probabilities

How do your simulation results compare to the actual probabilities for a shared birthday?

  • 11.66% for a group of 10,

  • 41.06% for a group of 20, and

  • 89.05% for a group of 40 students.

FYI: The probability that in a group of N students at least 2 share a birthday is: \[ 1 - \frac{366}{366}\cdot\frac{366 - 1}{366} \cdots \frac{366 - (N-1)}{366} \] No need to worry about this formula beyond this: It follows by counting possible events, and the \(1-\ldots\) structure comes from considering the complementary case, the one in which no students share a birthday.
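As a quick sanity check (a sketch, not part of the exercise), the exact value for \(N = 10\) can be computed directly in R:

N <- 10
1 - prod((366 - 0:(N - 1)) / 366)
## [1] 0.1166454

which matches the 11.66% quoted above.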

1.3 Advantages of simulation

It is much harder to find a formula for the probability that at least 3 students share a birthday; we would need to consider a variety of subcases.

On the other hand, we can readily use our simulation model:

  • Estimate the probabilities that in classes of 50 and of 100 students at least 2, at least 3, and at least 4 students share a birthday.

1.3 Advantages of simulation

Estimated probabilities (sample size 10000):

                     50 students   100 students
at least 2 shared         0.9708              1
at least 3 shared         0.1259          0.648
at least 4 shared         0.0043         0.0574
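For reference, one way such a table could be produced (a sketch; the helper function estimate_prob is our own naming, wrapping the earlier loop):

estimate_prob <- function(nbr_students, nbr_same, numTrials = 10000) {
  numHits <- 0
  for (t in 1:numTrials) {
    birthdays <- sample(1:366, size = nbr_students, replace = TRUE)
    numHits <- numHits + (max(table(birthdays)) >= nbr_same)
  }
  numHits / numTrials
}
# one column per class size, one row per threshold (2, 3, 4 shared)
sapply(c(50, 100), function(n) sapply(2:4, function(k) estimate_prob(n, k)))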

More advantages of simulation

[Figure: Most common birthdays in England and Wales, 1995–2014. Source: www.ons.gov.uk]

More advantages of simulation

Our simulation can easily be adapted to reflect the unequal distribution of birthdays in a particular country. The mathematics, on the other hand, would become enormously complex.

What would happen if we used realistic assumptions for the occurrences of birthdays on different calendar days? Would the results change a lot?

1.4 Bonus simulation

The file EnglandWalesBirthdays.csv contains the total number of birthdays on each calendar day of 2016 in England and Wales. Use it to adapt your simulation: supply a prob vector of length 366 to sample() to adjust the probability of each calendar day being a birthday.

1.4 Bonus simulation

EngWal <- read.csv("EnglandWalesBirthdays.csv")

nbr_students <- 50
nbr_same <- 4
numTrials <- 10000
numHits <- 0

for (t in 1:numTrials) {
  # weight each calendar day by its observed 2016 birthday count
  birthdays <- sample(1:366, size = nbr_students, replace = TRUE, prob = EngWal$X2016)
  numHits <- numHits + (max(table(birthdays)) >= nbr_same)
}
numHits/numTrials
## [1] 0.004

Bonus simulation conclusion

Our toy model simulation was already a pretty good reflection of reality, in which birthdays are not distributed equally but nevertheless fairly evenly.

While sometimes nuance is important, often simplifying assumptions are very reasonable.

Activity II

Introduction

  • In the birthday problem activity, we inferred information about a population (e.g., all classes of 10 students) by examining a sample of these classes. Note that the sample size was the number of classes considered (e.g., 10, 100, or 1000).

  • Each sample has to be random; otherwise there is no reason to expect it to have properties similar to the population’s. Still, the sample will differ from the population. That’s the nature of chance.

Introduction

  • Gathering every data point for an entire population (e.g., a census) is often extremely costly or even impossible.

  • We often take a small sample of the population, using it to draw inferences about the characteristics of the population.

  • In this activity, we have access to the entire population. This is rarely the case in real life, but it will help us to explore the relationships between samples and populations.

2.1 Data set

The data set is from the US National Centers for Environmental Information. It contains daily temperatures for 21 US cities from 1961 to 2015.

We’re interested in the population of daily temperatures across the 21 cities. A contrived population? Yes, but it’s important to think about the relationship between samples (and their statistics) and populations (and their parameters).

  • Download the data temperatures.csv from Canvas.

pop <- read.csv("temperatures.csv")

  • Visualize the temperature data and describe it. What is the overall mean of the temperatures? What is their standard deviation?

2.1 Data set

There are data for 21 cities.

unique(pop$CITY)
##  [1] "SEATTLE"       "SAN DIEGO"     "PHILADELPHIA"  "PHOENIX"       "LAS VEGAS"     "CHARLOTTE"     "DALLAS"       
##  [8] "BALTIMORE"     "SAN JUAN"      "LOS ANGELES"   "MIAMI"         "NEW ORLEANS"   "ALBUQUERQUE"   "PORTLAND"     
## [15] "BOSTON"        "SAN FRANCISCO" "TAMPA"         "NEW YORK"      "DETROIT"       "ST LOUIS"      "CHICAGO"

The data set is rather large.

length(pop$CITY)
## [1] 421848

For convenience, let’s make an assignment.

pop <- pop$TEMP

2.1 Data set

hist(pop, breaks = 40, col = "lightblue", 
     main="Histogram of entire population", xlab = "Temperature")

2.1 Data set

qqnorm(pop)
qqline(pop, col = "deeppink")

2.2 Estimating the population mean

Let’s start by calculating the actual population mean \(\mu\) and standard deviation \(\sigma\). These are:

mu <- mean(pop)
mu
## [1] 16.29877
sigma <- sd(pop)
sigma
## [1] 9.43757
  • Now, draw a simple random sample of 100 temperatures.

  • Plot your sample next to the population and calculate the mean of your sample.

  • Compare these outcomes with your teammates’.

2.2 Estimating the population mean

SRS <- sample(pop, size = 100, replace = FALSE)
par(mfrow=c(1,2))
hist(SRS, col="lightblue", xlim=range(pop), freq=FALSE)
hist(pop,col="lightblue", main="Histogram of temperatures", xlim=range(pop), freq=FALSE)
par(mfrow=c(1,1))

2.2 Estimating the population mean

  • Generally, the sample will have characteristics similar to the population’s, such as the skew to the left. But it really depends on your particular sample!

  • With a sample size of only 100 out of over 400,000 data points, the mean and standard deviation of the sample are nevertheless close to \(\mu\) and \(\sigma\).

mean(SRS)
## [1] 16.764
sd(SRS)
## [1] 8.783513

2.3 Estimating the population mean

Was that just luck or can we expect that?

We noted that everyone who took a sample of size 100 got a different sample mean. How much variability should we expect when estimating the population mean repeatedly from samples of size 100?

  • Draw 1000 samples of size 100. The mean of each sample will be different. Draw a histogram of all the sample means.

  • What do you observe from the histogram of those sample means? Around what value is it centred? What is its variability?

2.3 Estimating the population mean

We use a for-loop to draw 1000 random samples of size 100 and store the resulting means.

trials    <- 1000
estimates <- numeric(trials)

for (i in 1:trials) {
  SRS <- sample(pop, size = 100, replace = FALSE)
  estimates[i] <- mean(SRS)
}

2.3 Estimating the population mean

hist(estimates, col="lightblue", breaks=20, xlim=range(estimates),
     main="Histogram of sample means")
abline(v = mu, lwd = 2, col = "deeppink")

2.3 Estimating the population mean

qqnorm(estimates)
qqline(estimates, col="deeppink")

2.3 Estimating the population mean

The histogram of the sample means looks like a normal distribution, centred around the true population mean.

mean(estimates)
## [1] 16.26916
mu
## [1] 16.29877

2.4 Standard deviation of the sample means

The standard deviation of the sample means is

sd(estimates)
## [1] 0.9747276

Will doing more trials (i.e. collecting more sample means) change the standard deviation of the sample means?

  • What would you expect? Discuss in your team first.

  • Then try by increasing the number of trials to 2000.

2.4 Standard deviation of the sample means

trials <- 2000
estimates_more_trials <- numeric(trials)

for (i in 1:trials) {
  SRS <- sample(pop, size = 100, replace = FALSE)
  estimates_more_trials[i] <- mean(SRS)
}

2.4 Standard deviation of the sample means

par(mfrow=c(1,2))
hist(estimates_more_trials, breaks = 40, col = "lightblue",
     main="Sample means from more trials", xlim=range(estimates), freq = FALSE)
hist(estimates, breaks = 40, col = "lightblue",
     main="Sample means from fewer trials", xlim=range(estimates), freq = FALSE)
par(mfrow=c(1,1))

2.5 Standard deviation of the sample means

With more trials, the standard deviation of the sample means barely changed.

Will a larger sample size make the standard deviation of the sample means drop?

Go back to 1000 trials, but this time increase the sample size.

2.5 Standard deviation of the sample means

trials <- 1000
estimates_lss <- numeric(trials)   # lss: larger sample size

for (i in 1:trials) {
  SRS <- sample(pop, size = 500, replace = FALSE)
  estimates_lss[i] <- mean(SRS)
}

2.5 Standard deviation of the sample means

par(mfrow=c(1,2))
hist(estimates_lss, breaks=20, col="lightblue", xlim=range(estimates), freq=FALSE,
     main="Sample means, larger sample size")
hist(estimates, breaks=40, col="lightblue", xlim=range(estimates), freq=FALSE,
     main="Sample means, smaller sample size")
par(mfrow=c(1,1))

2.6 Sampling variability discussion

Even going from sample size 100 to sample size 500 notably decreases the standard deviation of the sample means, i.e. the spread of the histogram of the sample means.

  • Is having a narrower distribution of sample means useful? Discuss with your team.

2.6 Sampling variability discussion

  • If we repeatedly choose samples from the population, the sample mean will take different values in different samples.

  • A simple idea: if we repeat sampling many times and the mean does not change substantially (i.e. if we get similar answers each time), our estimates are fairly reliable.

  • A more advanced idea: The distribution of sample means tends towards a normal distribution, so we can use the 68-95-99.7 rule to quantify how reliable we expect our estimates to be; a quick check is sketched below. (We will return to this idea in later classes…)
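As a quick check (a sketch; it assumes the vector estimates and the mean mu from Section 2.3 are still in your workspace):

se <- sd(estimates)
mean(abs(estimates - mu) <= 2 * se)   # fraction of sample means within 2 sd of mu

The result should be close to 0.95, as the 68-95-99.7 rule predicts.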

2.6 Sampling variability discussion

This sounds great! But is it practical?

  • In order to find the standard deviation of the sampling distribution, we drew 1000 random samples of size 100.

  • This is not something we would do in real life. No one is going to conduct 1000 polls of 100 people each just to quantify the uncertainty of one poll!

  • We have used simulations to illustrate the relationship between samples and populations, but usually we are working with one sample.

2.7 Bonus: Standard error of the mean

So what should we expect for one sample?

We observed that the standard deviation of the sample means decreased as the sample size grew. In fact, it is smaller than the population’s standard deviation by a factor of roughly \(\sqrt{n}\), i.e. it is approximately \(\sigma/\sqrt{n}\).

The approximation \(\sigma/\sqrt{n}\) is called the standard error of the mean.

Although we can just assert this relationship for now, it can be shown analytically…or through simulation.
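A minimal simulation sketch of this relationship (assuming pop and sigma from Section 2.2 are loaded):

for (n in c(100, 500)) {
  means <- replicate(1000, mean(sample(pop, size = n, replace = FALSE)))
  cat("n =", n, "| sd of sample means:", round(sd(means), 3),
      "| sigma/sqrt(n):", round(sigma / sqrt(n), 3), "\n")
}

The two printed values should agree closely for both sample sizes.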

2.7 Bonus: Standard error of the mean

  • Check that the standard deviation of the estimates you got for sample size \(n=100\) differs from the population standard deviation by roughly a factor of \(\sqrt{100} = 10\).

2.7 Bonus: Standard error of the mean

sd(pop)
## [1] 9.43757

For samples of size \(n=100\) we found

sd(estimates)
## [1] 0.9747276

which is smaller by roughly a factor of \(\sqrt{100} = 10\).
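Indeed, the prediction \(\sigma/\sqrt{n}\) matches closely:

sigma / sqrt(100)
## [1] 0.943757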

The approximation \(\sigma/\sqrt{n}\) is called the standard error of the mean.

Summary

Here are some general principles we can derive from today’s explorations.

Assume that the population is large and that the sample size \(n\) is small compared to the population size. Then, for sufficiently large \(n\):

  • The sample means from a set of samples will be approximately normally distributed.

  • They will have mean close to the unknown mean \(\mu\) of the population.

  • The standard deviation of the sample means will be close to \(\sigma/\sqrt{n}\).

Moral: The larger \(n\) is, the smaller the standard deviation of the distribution of sample means, and the closer the sample mean is likely to be to the population mean.
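For example, with \(\sigma \approx 9.44\) as for our temperature population: \[ \frac{9.44}{\sqrt{100}} \approx 0.94 \qquad\text{and}\qquad \frac{9.44}{\sqrt{400}} \approx 0.47, \] so quadrupling the sample size halves the spread of the sample means (the \(n = 400\) case is purely illustrative; we did not simulate it).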

Recap and preview

Learning goals

In today’s activities we…

  • became more proficient with using the R function sample and with writing for-loops,

  • used a simulation to make inferences about a population and to answer “What if …?” questions,

  • saw how the distributions of the population, the sample, and the sample statistic (today we looked at the mean) differ,

  • saw how sample size affects sampling variability, and how the latter may be used to quantify the accuracy of our estimates.

Preview

Next class we will…

  • explore how the shape of the population distribution influences what can be considered a sufficiently large sample size,

  • check what happens when estimating statistics other than the mean,

  • discuss different sampling schemes beyond simple random sampling.