2021-10-04

Learning Goals

The learning goals for today’s class are as follows:

  • Understand and practice using scatter plots
  • Calculate and interpret correlation coefficients
  • Understand that correlation is not causality

To achieve these goals, we will spend today’s class working on two activities:

  • Activity 1: Mystery and Nobel Laureates
  • Activity 2: Exploring the Properties of Correlation

Activity 1: Mystery and Nobel Laureates

Mystery and Nobel Laureates

Imagine that you are at the airport waiting to take a flight. A stranger approaches you and asks you to take a thumbnail drive to their friend who will meet you at your destination.

Note: Never accept packages from strangers in airports. Unless they tell you it is related to QR. Then it is totally OK.

You accept the thumbnail drive and board your flight. Because you have brought your laptop with you and are curious about what information the drive contains, you open it to find that it contains the file Nobel.csv.

You decide to open the file and analyze it in R Studio to see what it contains.

Mystery and Nobel Laureates

Let’s import Nobel.csv as a data frame Nobel into R, and then inspect:

Nobel <- read.csv("Nobel.csv")
str(Nobel)
## 'data.frame':    22 obs. of  3 variables:
##  $ Country: chr  "China" "Japan" "Portugal" "Greece" ...
##  $ X      : num  0.758 1.794 1.979 2.515 2.867 ...
##  $ Y      : num  0.117 1.348 2.169 1.583 0.059 ...

Mystery and Nobel Laureates

In addition to Country which clearly contains the names of countries, the data frame Nobel also contains two numeric variables. Etched on the outside of the thumbnail drive is note that informs us as to the nature of Y but not X:

  • Y: Nobel Laureates per 10 Million Population
  • X: Mystery

Again, your curiousity gets the better of you and you want to know whether there is an association between the variables.

Since you have successfully completed at least half a semester of QR, you know that the first thing you should do is to…make a picture.

Challenge 1

Working together with your peers, do the following in R:

  • Use plot() to generate a scatter plot of the data (Y by X)

  • Give the plot appropriate title and labels:

    • Give the plot an informative title
    • Label both the x and y-axis

Bonus: If you find the above tasks easy, then try to label the points by adding the country names to the scatter plot (hint: try using text())

Challenge 1 (Solution)

plot(Y ~ X, data = Nobel,
       main = "Nobel Laureates / 10 MM Pop by Mystery",
       xlab = "Mystery", ylab = "Nobel Laureates / 10MM Pop")
text(Y ~ X, labels=Country, data=Nobel, cex=0.8, font=2, pos=2)

Challenge 1 (Solution)

It seems that countries with higher levels of our Mystery variable (X) seem to have more Nobel Laureates per 10 Million population (Y).

But is there an association? Let’s calculate the correlation…

Challenge 2

Working together with your peers, do the following in R:

  • Compute the correlation coefficient r using cor().

  • Discuss with your peers how you would answer the following questions:

    • How should we interpret the correlation coefficient?
    • What does the correlation coefficient mean?

Challenge 2 (Solution)

r <- cor(Nobel$X, Nobel$Y)
r
## [1] 0.7878342

Based upon the correlation coefficient, it seems like there is a positive association between our Mystery variable (X) and Nobel Laureates per 10 Million population (Y).

But it is a bit difficult to think about what this value represents given the absence of information about our mystery variable X …the Z-scores come to rescue.

Challenge 3

Working together with your peers, do the following in R:

  • Calculate Z-scores for X and Y

  • Briefly discuss whether you expect the correlation between the Z-scores to be different from the correlation between X and Y, and why.

  • Manually calculate the correlation of the Z-scores for X and Y

    • \(r=\frac{\sum z_x z_y }{n-1}\)
    • What do you find? Was it what you expected?
    • Why or why not?

Challenge 3 (Solution)

First, we calculate the Z-scores for X and Y:

meanX <- mean(Nobel$X)           #calculate mean of X-values
sdX <- sd(Nobel$X)               #calculate sd of X-values
Nobel$Zx <- (Nobel$X-meanX)/sdX  #calculate Zx in a new column

meanY <- mean(Nobel$Y)           #calculate mean of Y-values
sdY <- sd(Nobel$Y)               #calculate sd of Y-values
Nobel$Zy <- (Nobel$Y-meanY)/sdY  #calculate Zy in a new column

Challenge 3 (Solution)

Second, we manually calculate the correlation coefficient for the Z-scores. We can also use the command cor() to confirm the result:

ManualCorCoef <- sum(Nobel$Zx*Nobel$Zy)/(nrow(Nobel)-1)
ManualCorCoef
## [1] 0.7878342
CorCoef <- cor(Nobel$Zx, Nobel$Zy)
CorCoef
## [1] 0.7878342

The correlation coefficient of the Z-scores is the same as for X and Y

Discussion: What is the Mystery variable?

Our analysis thus far indicates there is a positive association between X and Nobel Laureates per 10 million population (Y) for the sample of countries. Let’s take a minute to step back from the analysis, however, and use our intuition.

What might the mystery variable X measure?

  • Take a moment to think about what you think X might measure

  • With a group, share and discuss what you think X might measure

  • How might what you think it is account for the association?

    • E.g. What is one possible mechanism that connects X and Y?

Be ready to share your answer with the class

Discussion: What is the Mystery variable?

It’s chocolate!

Per capita chocolate consumption is found to be strongly correlated with the number of Nobel Laureates per 10 million population. In fact, a research note detailing this finding was published in the New England Journal of Medicine (NEJM) in 2012

  • Cocoa rich in flavanols and is contained in chocolate
  • Flavanols seem effective in slowing or even reversing reductions of cognitive performance that occur with aging
  • Implies a relationship between chocolate consumption and cognitive performance (measured here by Nobel Laureates)

Chocolate make you smart! Or does it…what else could explain?

Conclusion

Correlation \(\neq\) Causation

  • Observing a correlation does not imply causation, even when we have a plausible or convincing mechanism / story to explain the correlation

  • Some possible alternatives:

    • Lurking variables
    • Reverse causality
    • Mutual causality
    • Pure randomness
  • Anything else you can think of?

Extra

In response to the note in NEJM, another note published in the Journal of Nutrition (JN) in 2013 used the same dataset to show the correlation between IKEA stores and Nobel Laureates per 10 Million is even stronger (r=0.82):

Want more interesting correlations? check these out: https://www.tylervigen.com/spurious-correlations

Activity 2 - Exploring the Properties of Correlation

Exploring the Properties of Correlation

In preparation for today’s class, you learnt how to manually calculate the correlation coefficient as well as calculate it using the cor() command in R.

  • For this activity, you will work with teammates to work out how various transformations of variables that we might want to conduct will effect the correlation coefficient.
  • To do so, you will be working with survey.csv which you should already have downloaded as part the pre-class prep materials.

Exploring the Properties of Correlation

Let’s import survey.csv as a data frame x and inspect.

x <- read.csv("survey.csv")
str(x)
## 'data.frame':    214 obs. of  9 variables:
##  $ gender     : chr  "Male" "Female" "Female" "Male" ...
##  $ nationality: chr  "Non-Singaporean" "Singaporean" "Singaporean" "Singaporean" ...
##  $ height     : num  190 168 154 180 163 167 169 175 164 171 ...
##  $ phone      : int  43 27 40 48 35 92 40 NA 25 43 ...
##  $ facebook   : int  5 0 NA NA 493 0 0 NA 18 0 ...
##  $ youtube    : num  267 386200 8000 1200000 58658325 ...
##  $ shoe       : num  29.4 24.7 23.9 40 25.4 23 27.4 27.4 24.7 23.9 ...
##  $ postcode   : int  6 7 5 9 6 9 5 0 3 4 ...
##  $ boxoffice  : num  NA 3.12e+08 NA 6.43e+07 2.59e+08 ...

It appears that there are occasional missing values. Keep this in mind.

Challenge 1

Working together with your peers, do the following in R:

  • Use cor() to compute the correlation between:

    • height and shoe
    • shoe and height

What do you find? Does it align with your intuition? Why or why not?

Challenge 1 (Solution)

The correlation between height and shoe is the same as the correlation between shoe and height.

r <- cor(x$height, x$shoe, use = "complete.obs") 
r
## [1] 0.1548563
r_reverse <- cor(x$shoe, x$height, use = "complete.obs") 
r_reverse
## [1] 0.1548563

So there is a weak correlation between height and shoe size, and the correlation is the same in either direction.

Challenge 2

Let’s focus on the association between height and shoe as it seems weaker than our intuition might suggest. In order to get a better sense of what is going on, let’s…make a picture.

Working together with your peers, do the following in R:

  • Create a scatter plot for height and shoe.
    • Do you find any obvious outliers?
    • If you do, then can you describe it / explain it?

Challenge 2 (Solution I)

plot(height ~ shoe, data = x,
     main = "Height and Shoe Size",
     xlab = "Shoe Size", ylab = "Height")

Why would the scatter plot have two groups?

If you recall from the survey, we asked to report shoe size in centimetres (not EU)

Challenge 2 (continued)

So it looks like the outliers might have come from respondents who did not read the directions closely when they completed the survey. The outliers are thus presumably errors.

Let’s see what happens when we exclude the outliers.

Working together with your peers, do the following in R:

  • Generate a new scatter plot for height and shoe
    • Make sure it excludes the outliers (i.e. shoe > 32)
  • With the outliers excluded, recalculate the correlation

What do you find? Now does it align with your intuition?

Challenge 2 (continued) (Solution II)

r_outlier <- cor(x$height[x$shoe<32], x$shoe[x$shoe<32], use = "complete.obs")
r_outlier
## [1] 0.7514418
plot(height[x$shoe<32] ~ shoe[x$shoe<32], data = x,
     main = "Height and Shoe Size", xlab = "Shoe Size", ylab = "Height")

Challenge 2 (continued) (Solution II)

r_outlier
## [1] 0.7514418

Let’s compare this correlation against r that we calculated in Challenge 1:

r
## [1] 0.1548563

Removing the outliers increased the correlation coefficient.

Note: While presence of outliers (in this example) reduced the correlation between the two variables, outliers can erroneously also increase correlations. Scatter plots can help us identify whether, and how, outliers influence the association between two variables.

Challenge 3

What happens to the correlation when we transform the variables, for example, by multiplying or dividing height and shoe by a constant? Let’s see by converting height and shoe (measured in centimeters) into inches.

Create two new variables height_inches and shoe_inches for the values stored in height and shoe (measured in centimeters), but converted into inches.

  • Do so by dividing height and shoe by 2.54.
  • Append them to the data frame.
x$height_inches <-x$height / 2.54
x$shoe_inches <-x$shoe   / 2.54

Challenge 3

Now working together with your peers, do the following in R:

  • Excluding outliers, create scatter plots as follows using par():
    • height by shoe
    • height_inches by shoe_inches
    • height by shoe_inches
    • height_inches by shoe
  • Calculate the correlation for each pair of variables.

What do you find?

Challenge 3 (Solution I)

par(mfrow=c(1,2))
plot(height[x$shoe<32] ~ shoe[x$shoe<32], data = x, xlab = "Shoe Size (cm)",
     ylab = "Height (cm)", main = "Height and Shoe Size (cm)")
plot(height_inches[x$shoe<32] ~ shoe_inches[x$shoe<32], data = x, xlab = "Shoe Size (in)",
     ylab = "Height (in)", main = "Height and Shoe Size (in)")

par(mfrow=c(1,1))

Challenge 3 (Solution II)

Correlation coefficient is the same when measured in centimetres or inches.

r_outlier_in_in <- cor(x$height[x$shoe<32], x$shoe[x$shoe<32],  use="complete.obs")  
r_outlier_in_in
## [1] 0.7514418

What about height in centimetres and shoe size in inches?

r_outlier_cm_in <- cor(x$height[x$shoe<32], x$shoe_inches[x$shoe<32], use="complete.obs")  
r_outlier_cm_in
## [1] 0.7514418

Correlation coefficient still does not change when we use different units of measurement. Multiplication or division does not change the correlation.

Challenge 3 (Extra)

What happens to the correlation when we transform the variables, for example, by adding or subtracting a numeric value to height and shoe? Let’s see.

Create a pair of new variables height_subX / height_plusX and shoe_subY / shoe_plusY that take the following values. Append them to the data frame. Choose your own values of X and Y.

Now working together with your peers, do the following in R:

  • Create a scatter plot for new variables for height and shoe.
    • Make sure to exclude the outliers (i.e. shoe > 32)
  • With the outliers excluded, recalculate the correlation for the new variables for height and shoe.

What do you find?

Challenge 3 (Extra) (Solution I)

x$height_sub5 <- x$height-5
x$shoe_plus10 <- x$shoe+10
plot(height_sub5[x$shoe<32] ~ shoe_plus10[x$shoe<32], data = x,
     main = "Height and Shoe Size", xlab = "Shoe Size", ylab = "Height")

Challenge 3 (Extra) (Solution II)

r_outlier_plus <- cor(x$height_sub5[x$shoe<32], x$shoe_plus10[x$shoe<32], 
                      use = "complete.obs")
r_outlier_plus
## [1] 0.7514418

Let’s check the correlation against r_outlier that we calculated in Challenge 2:

r_outlier
## [1] 0.7514418

The correlation coefficient does not change when you transform the variables by adding or subtracting a constant.

Challenge 4

What happens when we transform the variables, for example, by taking the decadal logarithm? Let’s see.

First, lets log transform height and shoe and append to the data frame x.

x$shoe_log   <- log(x$shoe, 10)     # take the decadal log
x$height_log <- log(x$height, 10)   # take the decadal log

Now working together with your peers, do the following in R:

  • Excluding outliers, create two scatter plots using par():
    • height and shoe
    • height_log and shoe_log
  • Calculate the correlation for both pairs of variables.

What do you find?

Challenge 4 (Solution I)

par(mfrow=c(1,2))
plot(height[x$shoe<32] ~ shoe[x$shoe<32], data = x,xlab = "Shoe Size",
      ylab = "Height", main = "Height and Shoe Size")
plot(height_log[x$shoe<32] ~ shoe_log[x$shoe<32], data = x, xlab = "(log) Shoe Size",
      ylab = "(log) Height", main = "Height and Shoe Size (logged)")

par(mfrow=c(1,1))

Challenge 4 (Solution II)

r_outlier_log_log <- cor(x$height_log[x$shoe<32], x$shoe_log[x$shoe<32], 
                         use="complete.obs")  
r_outlier_log_log
## [1] 0.749936

Let’s check the r_outlier_log_log against r_outlier to see if it is the same:

r_outlier
## [1] 0.7514418

r_outlier_log_log is different from r_outlier. Performing a non-linear transformation such as by taking the logarithm can change the correlation coefficient. Though it decreased in this case, it could also hypothetically increase.

Challenge 4 (Extra)

Now find the correlation of the log transformed values but including the outliers. Compare your result with the original r from Challenge 1.

Challenge 4 (Extra) (Solution)

The correlation of log transformed values, but include the outliers:

r_log_log <- cor(x$height_log, x$shoe_log, use="complete.obs")  
r_log_log
## [1] 0.1988249

Let’s check the original r from Challenge 1 that includes the outliers:

r
## [1] 0.1548563

It appears the log transformation also reduces the influence of the outliers. Note: Although log transformation can be useful in reducing the influence of outliers, it is not appropriate for the current example because the outliers are errors and therefore should be excluded from the analysis.

Challenge 5

Last, we might also wonder if correlation is transitive: For example, when z is correlated with w and z is also correlated with y, then is w correlated with y? Let’s see.

First, let’s generate three variables w, y, and z using rnorm():

  • Generate a variable, w, with size 100 obs from rnorm().
  • Generate a variable, y, with size 100 obs from rnorm().
  • Generate a variable z = w + y
w <- rnorm(100, mean=0, sd=1)
y <- rnorm(100, mean=0, sd=1)
z <- w+y

Challenge 5

Now working together with your peers, do the following in R:

  • Calculate the correlation coefficients between:
    • z and w
    • z and y
    • w and y

What do you find?

Challenge 5 (Solution)

cor_wz <- cor(w,z)
cor_wz                         # Good positive correlation
## [1] 0.7716394
cor_zy <- cor(z,y)
cor_zy                         # Good positive correlation
## [1] 0.7130094
cor_wy <- cor(w,y)
cor_wy                         # Very little correlation
## [1] 0.1042097

Conclusion

In conclusion, we have learned the following about correlation in this activity:

  • The correlation between height and shoe equals to the correlation between shoe and height.
  • Addition and subtraction do not change the correlation.
  • Multiplication and division do not change the correlation.
  • Non-linear transformations (such as taking the log) changes the correlation.

The core intuition is ultimately rather simple: Correlation is related to the z-score. As long as z-score does not change (i.e. the relative position of observations from the mean of your respective group), the correlation does not change.

Additionally we have also learned that correlation is not necessarily transitive.

Recap

Recap

To briefly recap, the learning goals for today’s class were:

  • Understand and use scatter plots
    • We practiced using and interpreting scatter plots in both activities
  • Calculate and interpret the correlation coefficients
    • We practiced computing and manually calculating correlation coefficients in R
    • We learned how a variety of transformations we might perform on data either change or do not change the correlation coefficient
  • Understand that correlation is not causality
    • Using the example of the relationship between our Mystery variable and Nobel Laureates, we discussed the dangers of over-interpreting correlations