2021-09-09

How “Normal” are QR Students?

Before Monday’s class, you were all asked to complete a short survey.

For today’s activity, we will use the cleaned and anonymized survey data to explore the distributions of some of the quantitative indicators that describe you (QR students).

Consistent with this week’s focus on the normal distribution, the overarching question in today’s activity is: How “Normal” are QR students?

The goals of this activity are as follows:

  • Develop intuition about which data we expect to be normally distributed
  • Develop intuition about which data we expect to be non-normally distributed
  • Learn how to assess whether the data are normally distributed (using R)

In doing so, we will use some of the R commands and techniques featured in the YouTube R-tutorials to visually assess whether or not a distribution is “normal” (e.g. hist, dnorm, curve, qqnorm, and qqline).
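
As a warm-up before working with the survey data, the sketch below applies these commands to simulated data. The vector `x_sim`, the seed, and the parameters (mean 170, sd 10) are assumptions made for illustration; they are not part of the survey.

```r
# Simulated data for illustration only (not the survey data)
set.seed(1)
x_sim <- rnorm(200, mean = 170, sd = 10)

# Histogram on the density scale so that a density curve can be overlaid
hist(x_sim, freq = FALSE, col = "gray", las = 1,
     main = "Simulated normal data", xlab = "Value")

# Normal curve using the sample mean and standard deviation
curve(dnorm(x, mean = mean(x_sim), sd = sd(x_sim)), add = TRUE, col = "red")

# Normal probability plot: points close to the line suggest normality
qqnorm(x_sim)
qqline(x_sim, col = "red")
```

Because `x_sim` really is drawn from a normal distribution, these plots give a useful reference for what "close to normal" looks like at this sample size.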

Let us start by importing survey.csv as a data frame survey into R. Let us then examine the data:

survey <- read.csv("survey.csv")
head(survey)
##   gender     nationality height phone facebook  youtube shoe postcode boxoffice
## 1   Male Non-Singaporean    190    43        5      267 29.4        6        NA
## 2 Female     Singaporean    168    27        0   386200 24.7        7 311605581
## 3 Female     Singaporean    154    40       NA     8000 23.9        5        NA
## 4   Male     Singaporean    180    48       NA  1200000 40.0        9  64337744
## 5 Female Non-Singaporean    163    35      493 58658325 25.4        6 258814783
## 6 Female     Singaporean    167    92        0        0 23.0        9     10000
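
Beyond head(), a few more commands help to check variable types and missing values. To keep the example self-contained, the sketch below rebuilds the six rows printed above as a hypothetical data frame `survey_head`; in class you would apply the same commands to `survey` directly.

```r
# The six rows printed by head(survey), typed in by hand for illustration
survey_head <- data.frame(
  gender      = c("Male", "Female", "Female", "Male", "Female", "Female"),
  nationality = c("Non-Singaporean", "Singaporean", "Singaporean",
                  "Singaporean", "Non-Singaporean", "Singaporean"),
  height      = c(190, 168, 154, 180, 163, 167),
  phone       = c(43, 27, 40, 48, 35, 92),
  facebook    = c(5, 0, NA, NA, 493, 0),
  youtube     = c(267, 386200, 8000, 1200000, 58658325, 0),
  shoe        = c(29.4, 24.7, 23.9, 40.0, 25.4, 23.0),
  postcode    = c(6, 7, 5, 9, 6, 9),
  boxoffice   = c(NA, 311605581, NA, 64337744, 258814783, 10000)
)

str(survey_head)                 # variable types
summary(survey_head$height)      # five-number summary plus mean
colSums(is.na(survey_head))      # number of missing values per variable
```

The missing values (NA) are the reason the solutions pass `na.rm=TRUE` to mean() and sd().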

Recall that you were asked to report the following information:

  • Categorical Indicators

    • gender: Your gender identity (Female / Male / Neither, both, it depends, something else, etc.)
    • nationality: Your nationality (Singaporean / Non-Singaporean / Other or prefer not to say)
  • Quantitative Indicators

    • height: Your height (measured in centimetres)
    • phone: The last two digits of your Singapore hand phone number
    • youtube: The number of views on the video on YouTube that you most recently watched

Challenge 1: How “Normal” are the heights of QR students?

Begin by guessing whether or not the heights of QR students are normally distributed. Once you have made your guess, work in groups to:

  • Plot the histogram of QR student heights
    • Set bin width equal to 5 (centimetres)
    • Add an appropriate label to the x-axis and an informative title
    • Add a normal curve to the histogram to help make a visual comparison (hint: try using curve and dnorm)
  • Plot the normal probability plot (QQ plot + QQ line)

What do you find based on the histogram and normal probability plot? Is it close to normal?

Solution 1 / 4 (Histogram)

hist(survey$height, main="Heights of Students", xlab="Heights", xlim=c(140,200),
     ylim=c(0, 0.045), las=1, col="gray", breaks=c(seq(140,200,5)), freq=FALSE)

Solution 2 / 4 (Mean and SD)

mean(survey$height, na.rm=TRUE)
## [1] 167.9957
sd(survey$height, na.rm=TRUE)
## [1] 9.23348
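
If heights followed a normal model with this mean and SD, about 95% of students would fall within two SDs of the mean. The sketch below works through that arithmetic using the values printed above; the names `m`, `s`, and `p_2sd` are chosen for illustration only.

```r
m <- 167.9957   # sample mean of height (from the output above)
s <- 9.23348    # sample standard deviation of height (from the output above)

# Interval covering roughly the central 95% under a normal model
c(lower = m - 2 * s, upper = m + 2 * s)

# Exact probability of falling within 2 SDs under a normal model
p_2sd <- pnorm(m + 2 * s, mean = m, sd = s) - pnorm(m - 2 * s, mean = m, sd = s)
p_2sd

# Proportion the model places inside the histogram range of 140 to 200 cm
pnorm(200, mean = m, sd = s) - pnorm(140, mean = m, sd = s)
```

Comparing these model-based proportions with the observed histogram is one informal way of judging how well the normal model fits.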

Solution 3 / 4 (Histogram + Curve)

hist(survey$height, main="Heights of Students", xlab="Heights", xlim=c(140,200),
     ylim= c(0, 0.045), las=1, col="gray", breaks=c(seq(140,200,5)), freq=FALSE)
curve(dnorm(x, mean=mean(survey$height, na.rm=TRUE), sd=sd(survey$height, na.rm=TRUE)),
      from=140, to=200, add=TRUE, col="red")

Solution 4 / 4 (QQ Plot + QQ line)

qqnorm(survey$height)
qqline(survey$height, col="red")
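
To see what these two commands compute, the sketch below reconstructs the plot by hand from simulated data. `y_sim` is a hypothetical vector, not survey data. The reconstruction relies on the documented defaults of qqnorm() (theoretical quantiles from `qnorm(ppoints(n))`) and qqline() (a line through the first and third quartiles).

```r
set.seed(2)
y_sim <- rnorm(50, mean = 168, sd = 9)   # simulated, for illustration only
n <- length(y_sim)

# Theoretical quantiles: standard normal quantiles at evenly spread probabilities
theo <- qnorm(ppoints(n))

# Sample quantiles: the data sorted from smallest to largest
samp <- sort(y_sim)

# Manual QQ plot; the result should match qqnorm(y_sim)
plot(theo, samp, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot (by hand)")

# QQ line: passes through the first and third quartiles on both axes
q_samp    <- quantile(y_sim, c(0.25, 0.75), names = FALSE)
q_theo    <- qnorm(c(0.25, 0.75))
slope     <- diff(q_samp) / diff(q_theo)
intercept <- q_samp[1] - slope * q_theo[1]
abline(a = intercept, b = slope, col = "red")
```

Note that the data appear on the vertical axis (sample quantiles) and the normal model on the horizontal axis (theoretical quantiles).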

Based on the Histogram + Curve and the QQ plot + QQ line, the distribution appears to deviate somewhat from the normal model.

  • In the normal probability plot, the points (sample quantiles) deviate noticeably from the QQ line towards the tails of the distribution

Why might this be? What do you think could be happening?

Sex affects height.

So maybe we have two overlapping normal distributions, one for male students and one for female students.

We don’t have data on sex, so we will use gender instead.

  • Subset the data by gender (Male and Female).
  • For each group, plot a histogram and normal model of participants’ heights. Display them side by side.
  • For each group, create a normal probability plot of participants’ heights. Display them side by side.

Solution 1 / 3 (Mean and SD of height by gender)

mean(survey$height[survey$gender=="Female"], na.rm=TRUE)
## [1] 162.8177
sd(survey$height[survey$gender=="Female"], na.rm=TRUE)
## [1] 6.483839
mean(survey$height[survey$gender=="Male"], na.rm=TRUE)
## [1] 175.3667
sd(survey$height[survey$gender=="Male"], na.rm=TRUE)
## [1] 7.222361
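
These group statistics can be used to sketch what a mixture of two normal distributions looks like. The share of female students is not reported here, so the weight `w` below is backed out from the overall mean of 167.9957, under the assumption that every respondent falls into one of the two groups. Treat it as a rough illustration rather than a property of the data.

```r
m_f <- 162.8177; s_f <- 6.483839   # female mean and SD (from the output above)
m_m <- 175.3667; s_m <- 7.222361   # male mean and SD (from the output above)
m_all <- 167.9957                  # overall mean of height reported earlier

# Weight w solving: m_all = w * m_f + (1 - w) * m_m  (assumes only two groups)
w <- (m_m - m_all) / (m_m - m_f)
w

# Density of the two-component mixture
mix_density <- function(x) {
  w * dnorm(x, mean = m_f, sd = s_f) + (1 - w) * dnorm(x, mean = m_m, sd = s_m)
}

# Mixture (solid black) and its two weighted components (dashed)
curve(mix_density(x), from = 140, to = 200, las = 1,
      xlab = "Height (cm)", ylab = "Density", main = "Mixture of two normals")
curve(w * dnorm(x, m_f, s_f), add = TRUE, col = "red", lty = 2)
curve((1 - w) * dnorm(x, m_m, s_m), add = TRUE, col = "blue", lty = 2)
```

The mixture is wider and flatter than either component, which is one way a combined sample can look non-normal even when each group is close to normal.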

Solution 2 / 3 (Histogram + Curve by gender)

par(mfrow=c(1,2)) # two plots side by side
# Female students: own normal model in solid red, male model in dashed blue for comparison
hist(survey$height[survey$gender=="Female"], main="Heights of Female Students", xlab="Heights",
     xlim=c(140,200), ylim=c(0,0.08), las=1, col="gray", breaks=c(seq(140,200,5)),freq = FALSE)
curve(dnorm(x, mean=162.8, sd=6.5),from=140, to=200, add=TRUE, col = "red")
curve(dnorm(x, mean=175.4, sd=7.2), from=140, to=200, add=TRUE, col="blue", lty=2)
# Male students: own normal model in solid red, female model in dashed blue for comparison
hist(survey$height[survey$gender=="Male"], main="Heights of Male Students", xlab="Heights",
     xlim=c(140,200), ylim=c(0,0.08), las=1, col="gray", breaks=c(seq(140,200,5)),freq = FALSE)
curve(dnorm(x, mean=162.8, sd=6.5), from=140, to=200, add=TRUE, col="blue", lty=2)
curve(dnorm(x, mean=175.4, sd=7.2), from=140, to=200, add=TRUE, col="red")
par(mfrow=c(1,1)) # reset to one plot per page

Solution 3 / 3 (QQ plot + QQ line by gender)

par(mfrow=c(1,2))
qqnorm(survey$height[survey$gender=="Female"], main="Normal Q-Q plot for Female")
qqline(survey$height[survey$gender=="Female"], col="red")
qqnorm(survey$height[survey$gender=="Male"], main="Normal Q-Q plot for Male")
qqline(survey$height[survey$gender=="Male"], col="red")
par(mfrow=c(1,1))

Challenge 2: How “Normal” are the last two digits of your hand phone numbers?

Begin by guessing whether or not the last two digits of your Singapore hand phone numbers are normally distributed. Once you have made your guess, work in groups to:

  • Plot the histogram of the last two digits of your Singapore hand phone numbers
    • Set bin width = 10 ranging from 0 to 100
    • Add an appropriate label to the x-axis and an informative title
  • Plot the normal probability plot (QQ plot + QQ line)

What do you find? Is it close to normal?

Solution 1 / 2 (Histogram)

hist(survey$phone, main="Last 2-digits of Phone Number", xlab="Last 2-digits",
     las=1, breaks=seq(0,100,10), right=FALSE, col="gray", freq=FALSE) # right=FALSE: bins are [0,10), [10,20), ...

Solution 2 / 2 (QQ plot + QQ line)

qqnorm(survey$phone)
qqline(survey$phone, col="red")
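
For comparison, here is what we would expect if the last two digits were equally likely to be any value from 00 to 99. This is a simulation under that assumption (the vector `digits_sim` is hypothetical), not a result from the survey.

```r
set.seed(3)
digits_sim <- sample(0:99, size = 200, replace = TRUE)   # simulated, not survey data

# Theoretical mean and SD of a discrete uniform distribution on 0, 1, ..., 99
mean_unif <- mean(0:99)                 # 49.5
sd_unif   <- sqrt((100^2 - 1) / 12)     # about 28.87

# Histogram: bars of roughly equal height, with a normal curve for contrast
hist(digits_sim, breaks = seq(0, 100, 10), right = FALSE, freq = FALSE,
     col = "gray", las = 1, main = "Simulated last two digits",
     xlab = "Last two digits")
curve(dnorm(x, mean = mean_unif, sd = sd_unif), add = TRUE, col = "red")

# QQ plot: a uniform sample bends away from the line at both ends (S-shape)
qqnorm(digits_sim)
qqline(digits_sim, col = "red")
```

A uniform distribution has shorter tails than a normal distribution with the same mean and SD, which is why the points flatten out at both ends of the QQ plot.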

Challenge 3: How “Normal” is the number of views of the video QR students most recently watched on YouTube?

How “Normal” is the number of YouTube views?

Begin by guessing whether or not the number of YouTube views is normally distributed. Once you have made your guess, work in groups to:

  • Plot the histogram of YouTube views (Histogram + Curve)
    • Add an appropriate label to the x-axis and an informative title
    • Apply any transformations you think appropriate to help visualise
  • Plot the normal probability plot (QQ plot + QQ line)

What do you find? Is it close to normal?

Solution 1 / 4 (Histogram)

hist(survey$youtube, main="YouTube Views",
     xlab="Views", las=1, col="gray", freq = FALSE)

Solution 2 / 4 (Log transformation, Mean and SD)

survey$log_youtube<-log10(survey$youtube+1) # Some videos have 0 views; add 1 so that log10() does not return -Inf
mean(survey$log_youtube, na.rm=TRUE)
## [1] 5.018722
sd(survey$log_youtube, na.rm=TRUE)
## [1] 1.742084
max(survey$log_youtube, na.rm=TRUE)
## [1] 9.152288
min(survey$log_youtube, na.rm=TRUE)
## [1] 0
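
These summary values are on the log10 scale, so it can help to convert them back to view counts. The sketch below undoes the `log10(x + 1)` transformation for the values printed above; the helper name `back_transform` is made up for this example.

```r
# Inverse of log10(x + 1)
back_transform <- function(z) 10^z - 1

back_transform(0)          # minimum: a video with 0 views
back_transform(9.152288)   # maximum: roughly 1.4 billion views
back_transform(5.018722)   # mean of the logs on the view scale
                           # (this is not the arithmetic mean of the views)

# Why add 1 before taking logs: log10(0) is -Inf, not a usable number
log10(0)
log10(0 + 1)
```

The gap between 0 views and more than a billion views is why the untransformed histogram is so hard to read.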

Solution 3 / 4 (Histogram + Curve)

hist(survey$log_youtube, main="YouTube Views", xlab = "Views (log10)",
     col="gray", las=1, breaks=c(seq(0,10,0.5)),freq = FALSE)
curve(dnorm(x,
            mean = mean(survey$log_youtube, na.rm=TRUE), # Rather than manually enter mean
            sd   = sd(survey$log_youtube, na.rm=TRUE)),  # Rather than manually enter sd
      from = 0, to = 10, add = TRUE,col = "red")

Solution 4 / 4 (QQ plot + QQ line)

qqnorm(survey$log_youtube)
qqline(survey$log_youtube, col="red")

Discussion and Conclusion

How “Normal” are QR Students?

  • Are the heights of QR students normally distributed?

        All Students: Not exactly normally distributed, but...
        Female: Closer to normally distributed with mean 162.8 and sd 6.5
        Male: Closer to normally distributed with mean 175.4 and sd 7.2
  • Are the last two digits of the Singapore hand phone numbers of QR students normally distributed?

        Not normally distributed at all, closer to a uniform distribution
  • Is the number of views on the videos that QR students most recently watched on YouTube normally distributed?

        Not normally distributed, strongly right-skewed.
        Even after a log10 transformation, still not normally distributed!

Discussion: Why not “Normal”?

The distribution of non-transformed YouTube views is clearly right skewed and not normal. But even after transforming the number of YouTube views, the data are not normally distributed and the distribution instead appears multi-modal.

  • What might have contributed to the odd shape of the distribution?

  • Are YNC students different from other YouTube users in terms of their viewing habits in some way that might help to explain?

    • Hint: Consider what you were doing before completing the survey

Yes, YNC students are different: Unlike most YouTube users, YNC students must take QR in Sem 1 of their first year.

  • As part of QR, students complete preparatory work before coming to class.
  • This usually involves viewing one or more of the excellent series of R Tutorial videos created by Prof. Michael Gastner.
  • We know that the last video assigned as part of their preparatory materials before the survey was R Tutorial 25: Z-scores with R.

Let us check if this can help us to explain the odd shape.

How many views are there on QR Tutorial 25?

QR Tutorial Video 25: Z-scores with R

Let’s calculate the log transformed number of views.

tutorial <- log10(3622 + 1) # transform the number of views on QR Tutorial Video 25
tutorial                    # print to console
## [1] 3.559068

Now let’s check if the resulting value is similar to any of the modes of the distribution by adding abline() to the histogram and QQ plots.
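
A minimal sketch of how these reference lines could be added follows. To stay self-contained it uses a simulated stand-in `log_views_sim` (an assumption for illustration, including its mean, SD, and cluster size); with the real data you would use `survey$log_youtube` instead.

```r
tutorial <- log10(3622 + 1)   # log10 views of the tutorial video, as computed above

# Simulated stand-in for survey$log_youtube: a broad spread of videos plus
# a cluster of students who all watched the same tutorial (assumption)
set.seed(4)
log_views_sim <- c(rnorm(150, mean = 5.5, sd = 1.5), rep(tutorial, 40))
log_views_sim <- pmin(pmax(log_views_sim, 0), 10)   # keep values within 0 to 10

# Histogram: a vertical line marks the tutorial's value on the x-axis
hist(log_views_sim, breaks = seq(0, 10, 0.5), freq = FALSE, col = "gray", las = 1,
     main = "YouTube Views", xlab = "Views (log10)")
abline(v = tutorial, col = "blue", lty = 2)

# QQ plot: sample values are on the y-axis, so the reference line is horizontal
qqnorm(log_views_sim)
qqline(log_views_sim, col = "red")
abline(h = tutorial, col = "blue", lty = 2)
```

In the QQ plot, many identical values show up as a flat run of points at the height of the horizontal line.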

The log transformed number of views appears to line up with one of the modes.

We might reasonably conclude the shape is at least partly due to how QR students differ from other users: Many probably watched the same video.

Recap

The purpose of today’s lesson was to become more familiar with the normal model. Through the activities, we have worked together to do the following:

  • Examine real demographic data and whether they follow a normal model
    • Diagnose, think about, and deal with outliers in real data
  • Develop intuition about which data we might expect to follow a normal model as well as which data might not follow a normal model
  • Learn how to assess whether or not data are normally distributed (using R)

Bonus Challenges: Guess the distribution

When you completed the survey and first examined the dataset, you may have noticed that it contained some additional information:

  • facebook: The number of reactions to the Facebook post to which you most recently reacted
  • shoe: Your shoe size in centimetres
  • postcode: The last digit of the postal code of your permanent residence
  • boxoffice: The total worldwide gross of your favorite film

For extra practice, you might try to answer the following questions:

  • How “normal” is the distribution of Facebook reacts?
  • How “normal” are the shoe sizes of QR students in centimetres?
  • How “normal” are the last digits of the postal code of QR students’ permanent residence?
  • How “normal” are the total worldwide grosses of QR students’ favorite films?

Do the following:

  • Guess the distribution of each variable.
  • Plot the histogram and QQ plot. Apply any transformations you consider suitable.
  • What do you find? Is it close to normal?

Facebook Reacts (Histogram)

hist(survey$facebook, main="Facebook Reacts", xlab = "Reacts",
     las=1, col="grey", freq = FALSE)

Facebook Reacts (Mean and SD)

survey$log_facebook<-log10(survey$facebook+1) # Many posts have 0 reacts; add 1 so that log10() does not return -Inf
mean(survey$log_facebook, na.rm=TRUE)
## [1] 1.225786
sd(survey$log_facebook, na.rm=TRUE)
## [1] 1.222481
max(survey$log_facebook, na.rm=TRUE)
## [1] 4.81292
min(survey$log_facebook, na.rm=TRUE)
## [1] 0

Facebook Reacts (Histogram, log10)

hist(survey$log_facebook, main="Facebook Reacts", xlab = "Reacts (log10)",
     las=1, col="grey", breaks=c(seq(0,6,1)), freq = FALSE)

Facebook Reacts (QQ plot + QQ line)

qqnorm(survey$log_facebook)
qqline(survey$log_facebook, col="red")

Shoe Size (Histogram)

hist(survey$shoe, main="Shoe Size", xlab="Size",
     xlim=c(20,40), las=1, col="grey", breaks=c(seq(20,40,1)), freq = FALSE)

Shoe Size (Mean and SD)

mean(survey$shoe, na.rm=TRUE)
## [1] 27.74524
sd(survey$shoe, na.rm=TRUE)
## [1] 4.368099

Shoe Size (Histogram + Curve)

hist(survey$shoe, main="Shoe Size", xlab="Size",
     xlim=c(20,40), las=1, col="grey", breaks=c(seq(20,40,1)), freq = FALSE) # density scale, so the normal curve is comparable
curve(dnorm(x,
            mean = mean(survey$shoe, na.rm=TRUE),
            sd = sd(survey$shoe, na.rm=TRUE)),
      from = 20, to = 40, add = TRUE, col = "red")

Shoe Size (QQ Plot + QQ line)

qqnorm(survey$shoe)
qqline(survey$shoe, col="red")

Shoe Size (Mean and SD by gender)

mean(survey$shoe[survey$gender=="Female"], na.rm=TRUE)
## [1] 26.82661
sd(survey$shoe[survey$gender=="Female"], na.rm=TRUE)
## [1] 4.723942
mean(survey$shoe[survey$gender=="Male"], na.rm=TRUE)
## [1] 29.07531
sd(survey$shoe[survey$gender=="Male"], na.rm=TRUE)
## [1] 3.510396

Shoe Size (By gender)

par(mfrow=c(1,2))
hist(survey$shoe[survey$gender=="Female"], main="Shoe Size of Female Students", xlab="Size",
     xlim=c(20,40), ylim=c(0,0.45), las=1, col="gray", breaks=c(seq(20,40,1)), freq = FALSE)
curve(dnorm(x, mean = 26.8, sd = 4.7), from = 20, to = 40, add = TRUE, col = "red")

hist(survey$shoe[survey$gender=="Male"], main="Shoe Size of Male Students", xlab="Size",
     xlim=c(20,40), ylim=c(0,0.45), las=1, col="gray", breaks=c(seq(20,40,1)), freq = FALSE)
curve(dnorm(x, mean = 29.1, sd = 3.5), from = 20, to = 40, add = TRUE, col = "red")
par(mfrow=c(1,1))

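Neither group looks clearly normal. Perhaps two different scales are being used for shoe size: some respondents may have reported a size on another scale (the value 40.0 in the data shown earlier could be one such case) rather than a length in centimetres.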

The last digit of postcode (Histogram)

hist(survey$postcode, main="Last digit of Postcode", xlab="Last digit",
     las=1, breaks=c(seq(-0.5, 9.5, 1)), freq = FALSE)

The last digit of postcode (QQ Plot + QQ Line)

qqnorm(survey$postcode)
qqline(survey$postcode, col="red")

Boxoffice gross (Histogram)

hist(survey$boxoffice, main="Boxoffice Gross", xlab="Gross",
     las=1, freq = FALSE)

Boxoffice gross (Mean and SD)

survey$log_boxoffice<-log10(survey$boxoffice+1) # Add 1 so that any gross of 0 would not produce -Inf
mean(survey$log_boxoffice, na.rm=TRUE)
## [1] 8.089664
sd(survey$log_boxoffice, na.rm=TRUE)
## [1] 0.8525104
max(survey$log_boxoffice, na.rm=TRUE)
## [1] 9.454425
min(survey$log_boxoffice, na.rm=TRUE)
## [1] 4.000043

Boxoffice gross (Histogram + Curve)

hist(survey$log_boxoffice, main="Boxoffice Gross", xlab="Gross (log10)",
     las=1, col="grey", breaks=c(seq(0,10,1)), freq = FALSE)
curve(dnorm(x,
            mean = mean(survey$log_boxoffice, na.rm=TRUE),
            sd = sd(survey$log_boxoffice, na.rm=TRUE)),
      from = 0, to = 10, add = TRUE,col = "red")

Boxoffice gross (QQ plot + QQ line)

qqnorm(survey$log_boxoffice)
qqline(survey$log_boxoffice, col="red")