2021-10-28

Data exploration

x <- read.csv('pop2021.csv')

For today’s activities we consider the survey data you provided earlier in the semester. We have upscaled the population size to 100,000, since we want to repeatedly sample many times from the population.

Explore the data. We have only kept some information from it.

What is the mean height in this population?

Data exploration

str(x)
## 'data.frame':    100000 obs. of  4 variables:
##  $ height     : num  165 178 181 186 186 ...
##  $ youtube    : int  6951 3471035 83101 90 3 21580 43184 0 2534430 403859 ...
##  $ nationality: chr  "SG" "SG" "SG" "SG" ...
##  $ gender     : chr  "M" "M" "M" "M" ...

The mean height in this population is

trueM <- mean(x$height)
trueM
## [1] 167.8968

Data exploration

hist(x$height, col='lightblue', main="Histogram of heights")

Note: You see some extreme heights here. Why do you think this is?

Data exploration

hist(x$youtube, col='lightblue', main="Histogram of youtube views")

Populations and samples

We aim to study how sample statistics behave.

We will treat the data frame x as our population. In reality, it is unlikely you would have access to all the data from a population, so this should be considered a hypothetical exercise.

We will sample from the population to see how our statistics (from looking at a sample) behave relative to the true population parameters.

We will look for lessons in how to design sampling strategies so that our sample statistics are more likely to estimate well the corresponding population parameters.

Using for loops to sample repeatedly

We can take a single sample of size 10 from the population and calculate its mean.

my_first_sample_mean <- mean(sample(x$height, size = 10))

But we want to do this 500 times, so we use a for loop.

h10 <- numeric(500)
for (i in 1:500) { h10[i] <- mean(sample(x$height, size = 10)) }

It would be better if we could change the sample size and the number of samples easily, so we use variables.

n <- 10
reps <- 500
h10 <- numeric(reps)
for (i in 1:reps) { h10[i] <- mean(sample(x$height, size = n)) }

This makes the code more easily reusable, and helps avoid errors.

Sampling to estimate the mean

  • How does sample size affect the accuracy of the estimate of the mean for heights in the population?

  • Start with a small sample size and look at the distribution of the sample means if you repeat taking many samples of this size. What does the distribution of sample means look like?

  • Repeat the experiment with a larger sample size.

Sampling to estimate the mean

n <- 10
reps <- 5000
h10 <- numeric(reps)
for (i in 1:reps) { h10[i] <- mean(sample(x$height, size = n)) }
hist(h10, col='lightblue', main="Sample means for sample size 10", 
     freq=FALSE, breaks=c(seq(158,182,1)), ylim=c(0,0.22))
abline(v=trueM, col=2, lwd=2)

Sampling to estimate the mean

n <- 20
reps <- 5000
h20 <- numeric(reps)
for (i in 1:reps) { h20[i] <- mean(sample(x$height, size = n)) }
hist(h20, col='lightblue', main="Sample means for sample size 20", 
     freq=FALSE, breaks=c(seq(158,182,1)), ylim=c(0,0.22))
abline(v=trueM, col=2, lwd=2)

Sampling to estimate the mean: analysis

  • Already with a small sample size the distribution of the sample means looks quite Normal. This is because the distribution of heights in the population looks quite Normal, exhibiting just a bit of skewness.

  • With a larger sample size, the distribution of the sample means is more concentrated around the true population mean.

Sampling to estimate the mean: conclusion

  • Imagine we are taking only one sample and we want to use the sample mean of that sample to estimate the population mean. Our experiment suggests that the estimate is more likely to be better if we use a sample of size 20 than if we use a sample of size 10.

  • But it is not guaranteed to be better; there were samples of size 10 whose sample mean was within 1cm the population mean and some (albeit fewer) samples of size 20 whose mean was approx 5cm away from the population mean.

  • Conclusion: the sample mean is more likely to be closer to the true mean when the sample size is bigger.

  • Although we will not investigate it today, a smaller population standard deviation also improves how well sample means estimate the population mean. If you are designing a sampling strategy, is this fact relevant?

Spread of the sample means

Recall the quantile() function, which can calculate what values correspond to different percentiles.

  • Which percentiles would we be interested in if we wanted to identify the range in which we can find 95% of the possible sample mean values?

  • Find those percentiles for the samples of size 10 and the samples of size 20. Compare.

Spread of the sample means

quantile(h20, c(0.025, 0.975))
##     2.5%    97.5% 
## 163.9827 172.0204
quantile(h10, c(0.025, 0.975))
##     2.5%    97.5% 
## 162.4547 173.4893

The interval in which 95% of sample means fall is narrower for samples of size 20 than it is for samples of size 10.

Spread of the sample means: analysis

  • By increasing the sample size, the 2.5% to 97.5% range of sample means decreases, concentrating sample means closer to the population mean.

Spread of the sample means: conclusion

  • In practice, we would only be taking one sample so, even with a large sample size, we can’t guarantee the sample mean is close to the population mean. But …

  • For a larger sample size, the “interval within which the sample mean is highly likely (95% chance) to fall” is closer to the true mean.

Sample means for a skewed population

Recall that we found that for large enough sample size, the distribution of the sample mean is approximately Normal. But how large does the sample size have to be for this?

If the true population is Normal, then for any sample size the distribution of sample means is Normal. But when the true population is not normal, what is large enough will depend on the shape of the population distribution.

  • What happens when the population distribution is far from normal? Try with different sample sizes on the YouTube data, which are very skewed.

  • What happens now to the 2.5% to 97.5% interval that is likely to contain the true mean?

Sample means for a skewed population

n <- 10
N <- 5000
y010 <- numeric(N)
for (i in 1:N) {
  y010[i] <- mean(sample(x$youtube, size = n))
}
hist(y010, col="lightblue", main="Sample means for youtube, sample size  10", 
     ylim=c(0,900), breaks=c(seq(0,8*10^6,10^5)))

Sample means for a skewed population

n <- 30
N <- 5000
y030 <- numeric(N)
for (i in 1:N) {
  y030[i] <- mean(sample(x$youtube, size = n))
}
hist(y030, col="lightblue", main="Sample means for youtube, sample size  30", 
     ylim=c(0,900), breaks=c(seq(0,8*10^6,10^5)))

Sample means for a skewed population

n <- 150
N <- 5000
y150 <- numeric(N)
for (i in 1:N) {
  y150[i] <- mean(sample(x$youtube, size = n))
}
hist(y150, col="lightblue", main="Sample means for youtube, sample size 150", 
     ylim=c(0,900), breaks=c(seq(0,8*10^6,10^5)))

Sample means for a skewed population

quantile(y010, c(0.025, 0.975)) 
##       2.5%      97.5% 
##   29186.63 3562091.93
quantile(y030, c(0.025, 0.975)) 
##      2.5%     97.5% 
##  203377.9 2457546.9
quantile(y150, c(0.025, 0.975)) 
##      2.5%     97.5% 
##  564205.8 1610828.3

The 2.5% to 97.5% ranges are reducing.

Sample means for a skewed population: analysis

  • Even with this very skewed population distribution, the sample means better estimate the population mean when the sample size increases.

  • Sample means are distributed more Normally for larger samples.

  • But for a population distribution this skewed, perhaps it wasn’t a good idea to look at the mean to begin with! This was an fun experiment, but not much practical use.

  • What happens when we consider sample statistics other than the mean?

Estimating other statistics (Extra)

  • Suppose that you are not interested in estimating the population mean but another population parameter, e.g. the population minimum.

  • Then we could use the sample minimum to estimate the population minimum.

  • Does the distribution of the sample minima taken from samples of certain size look Normal if you repeat taking samples many times?

  • Use x$height as the population for the above.

  • Try also for estimating the median of the YouTube views. Does larger sample size improve the estimate?

Estimating other statistics (Extra)

n <- 150
reps <- 5000
h_min <- numeric(reps)
for (i in 1:reps) {
  h_min[i] <- min(sample(x$height, size = n))
}
hist(h_min, col='lightblue', main="Height sample minima for sample size 150")

Estimating other statistics (Extra)

qqnorm(h_min)
qqline(h_min, col=2, lwd=2)

The distribution of sample minima looks less Normal than the original data!

Estimating other statistics (Extra)

n <- 50
reps <- 5000
ymed050 <- numeric(reps)
for (i in 1:reps) { ymed050[i] <- median(sample(x$youtube,size = n)) }
hist(ymed050, col="lightblue", main="Sample medians for youtube, sample size  50", 
     ylim=c(0,3000), breaks=c(seq(0,250000,5000)))
abline(v=median(x$youtube),col=2,lwd=2)

Estimating other statistics (Extra)

n <- 150
reps <- 5000
ymed150 <- numeric(reps)
for (i in 1:reps) { ymed150[i] <- median(sample(x$youtube, size = n)) }
hist(ymed150, col="lightblue", main="Sample medians for youtube, sample size 150", 
     ylim=c(0,3000), breaks=c(seq(0,250000,5000)))
abline(v=median(x$youtube),col=2,lwd=2)

Estimating other statistics (Extra)

n <- 500
reps <- 5000
ymed500 <- numeric(reps)
for (i in 1:reps) { ymed500[i] <- median(sample(x$youtube, size = n)) }
hist(ymed500, col="lightblue", main="Sample medians for youtube, sample size 500", 
     ylim=c(0,3000), breaks=c(seq(0,250000,5000)))
abline(v=median(x$youtube),col=2,lwd=2)

Estimating other statistics (Extra)

quantile(ymed050, c(0.025, 0.975)) 
##     2.5%    97.5% 
##   279.70 70695.67
quantile(ymed150, c(0.025, 0.975)) 
##      2.5%     97.5% 
##  1086.913 25941.475
quantile(ymed500, c(0.025, 0.975)) 
##      2.5%     97.5% 
##  2414.912 13626.400

The 2.5% to 97.5% ranges are reducing.

The interval in which 95% of sample medians fall is narrower for larger samples than it is for smaller samples.

Estimating other statistics: analysis & conclusion

  • The distribution of sample medians and looks more symmetric with increasing sample size, but it is still skewed, and we had to increase the sample size greatly!

  • The distribution of sample minima looks less symmetric than the population.

  • The Normal distribution approximation applies to the distribution of the sample mean but not to other sample statistics.

Stratification by gender or nationality

Different subgroups

There is a difference in the height for different subgroups.

  • Try subsetting the population using the categorical variables.

  • Try also combinations of the categorical variables.

  • Where do you see the biggest difference?

Different subgroups

mean(x$height[x$gender == 'M']) 
## [1] 175.3888
mean(x$height[x$gender == 'F'])
## [1] 163.008
mean(x$height[x$nationality == 'SG'])
## [1] 167.3249
mean(x$height[x$nationality == 'NSG'])
## [1] 168.7546

Different subgroups

par(mfrow=c(1,2))
boxplot(height~gender, data=x,col="lightblue")
boxplot(height~nationality,data=x, col="lightblue")
par(mfrow=c(1,1))

Different subgroups

Let us look at both height and gender together.

mean(x$height[x$nationality == 'SG' & x$gender =='M'])
## [1] 174.849
mean(x$height[x$nationality == 'SG' & x$gender =='F'])
## [1] 161.9032
mean(x$height[x$nationality == 'NSG'& x$gender =='M' ])
## [1] 176.3334
mean(x$height[x$nationality == 'NSG'& x$gender =='F' ])
## [1] 164.5105

Different subgroups

par(mfrow=c(1,2))
boxplot(height ~nationality, data=x[x$gender=="F", ], ylim=c(130, 210),  
       main="Height of Females across nationalities", col="lightblue")
boxplot(height ~nationality, data=x[x$gender=="M", ], ylim=c(130, 210), 
       main="Height of Males across nationalities", col="lightblue")

par(mfrow=c(1,1))

Different subgroups

Can you see any subgroups in the data?

Although in this case there don’t appear to be additional subgroups, it is always a good idea to consider whether subgroups exist and sample accordingly.

Let us now proceed to stratify by gender and nationality.

Stratification by gender

Female and male don’t seem to be equally represented in the population.

table(x$gender)/length(x$gender)
## 
##       F       M 
## 0.60513 0.39487

So in every 20 people there are approximately 12 women and 8 men.

Randomly overselecting from one group could push us away from the mean.

  • Going back to estimating the mean height, stratify your sample such that each sample of size 20 contains 12 female and 8 male individuals.

  • How does the distribution of the sample means from the stratified samples compare to that from simple random samples of size 20?

Stratification by gender

N<-5000
samples_genderStrat <-numeric(N)
for (i in 1:N) {
  samples_genderStrat[i] <- mean(c(sample(x$height[x$gender == 'M'], size=8),
                                   sample(x$height[x$gender == 'F'], size=12)))
}

Stratification by gender

par(mfrow=c(1,2))
hist(samples_genderStrat, freq=FALSE, col="lightblue", xlim=range(h20))
hist(h20, freq=FALSE, col="lightblue", xlim=range(h20))

par(mfrow=c(1,1))

Stratification by gender

sd(samples_genderStrat)
## [1] 1.503444
sd(h20)
## [1] 2.029586

The distribution of sample means has smaller spread for the stratified sample.

Making sure that each sample has the same characteristics as the population sometimes matters. But random is still pretty good, and the right default.

Stratification by nationality

Singaporeans and Non-Singaporeans don’t seem to be equally represented in the population.

table(x$nationality)/length(x$nationality)
## 
## NSG  SG 
## 0.4 0.6

But remember: There was not such a big difference in mean height depending on nationality.

mean(x$height[x$nationality == 'SG'])
## [1] 167.3249
mean(x$height[x$nationality == 'NSG']) 
## [1] 168.7546

What effect will stratification by nationality have?

Stratify your sample such that each sample of size 20 contains 12 Singaporeans and 8 Non-Singaporeans. As before, compare to the variability from simple random samples of size 20.

Stratification by nationality

N<-5000
samples_nationalityStrat <- numeric(N)
for (i in 1:N) {
  samples_nationalityStrat[i] <- mean(c(sample(x$height[x$nationality == 'SG'], 12), 
                                        sample(x$height[x$nationality == 'NSG'], 8)))
}

We can use the original h20 vector of sample means for the (unstratified) simple random sample.

Stratification by nationality

par(mfrow=c(1,2))
hist(samples_nationalityStrat, freq= FALSE, xlim=range(h20), col="lightblue")
hist(h20, freq= FALSE, xlim=range(h20), col="lightblue")

par(mfrow=c(1,1))

Stratification by nationality

sd(samples_nationalityStrat)
## [1] 2.062259
sd(h20)
## [1] 2.029586

Stratification by nationality: analysis

  • Stratfication now does just about nothing.

  • You should have a good reason for stratifying, or else you might introduce biases into your sample.

  • When stratifying, you also need to have a good idea of the true proportion of each subgroup in the population (otherwise you might end up over/under-representing a group in your sample).

  • Other cases where more control over sampling would be important: When you care about separate estimates for subgroups which vary in size. Think back to our Malaysian election results for the “other” category.

Recap

Recap

  • When we randomly take a sample from a large population to estimate a population parameter via the sample statistic, each sample will give us a different answer (sampling variability).

  • The larger the sample size, the higher the chance that the estimate from our sample is close to the true population parameter.

  • We got an intuition concerning the interplay between sampling size, sampling variability, and shape of the population distribution:

    • Sample means are distributed quite symmetrically, even for very skewed populations, and the larger the sample, the more symmetric.
    • This does not necessarily work for other sample statistics.