AY 2021/2

Public support and scepticism about vaccines

Many people worldwide are hoping that COVID-19 vaccines will bring the current pandemic under control. However, fights against other diseases have often been hampered by sceptics who refuse to get vaccinated.

In 2018, the Wellcome Trust asked more than 140,000 people in more than 140 countries whether they agree or disagree with the following two statements (Wellcome Global Monitor 2018, https://wellcome.org/sites/default/files/wellcome-global-monitor-2018.pdf):

  • Vaccines are safe.
  • Vaccines are effective.

You can find the results of the survey in vaccine_perception.csv on Canvas.

In this activity, we want to answer the question:

If a country has a higher percentage of people who believe vaccines are unsafe, is there also a higher percentage of people who believe vaccines are ineffective?

Intended learning outcomes

  • We review the conditions under which correlation coefficients and linear models are meaningful summaries of the association between two variables.
  • We make several types of plots: scatter plot, residual plot, Q-Q plot, histogram.
  • We re-express quantitative variables.
  • We explain the result of a linear model.
  • We make predictions on the basis of a linear model.
  • We interpret the distribution of residuals.

Codebook

Here is the structure of the data frame.

vacc <- read.csv("vaccine_perception.csv")
str(vacc)
## 'data.frame':    144 obs. of  3 variables:
##  $ country        : chr  "Afghanistan" "Albania" "Algeria" "Argentina" ...
##  $ unsafe_pct     : num  4.47 15.85 11.28 4.85 20.57 ...
##  $ ineffective_pct: num  1.83 8.97 7.7 2.95 12.47 ...

The columns are:

  • country:
    country name
  • unsafe_pct:
    percentage of respondents who disagree that vaccines are safe
  • ineffective_pct:
    percentage of respondents who disagree that vaccines are effective

1.1 Scatter plot and correlation

  1. Make a scatter plot in which each point represents a country.
    • Plot the percentage that believes vaccines are ineffective (y) versus the percentage that believes vaccines are unsafe (x).
    • Add title and axis labels.
  2. Looking at the plot, is the correlation coefficient a suitable quantity to summarise the association between x-values and y-values? (Check the textbook for the criteria.)
  3. If yes, calculate the correlation coefficient. Otherwise, is it possible to re-express the variables to satisfy the textbook criteria?

plot(ineffective_pct ~ unsafe_pct,
     data = vacc,
     main = "Perceptions of Vaccines",
     xlab = "% disagree that vaccines are safe",
     ylab = "% disagree that vaccines are effective")
grid()

Criteria from page 174 in the textbook:

  • Quantitative Variables Condition:
    Satisfied.
  • Straight Enough Condition:
    There is no curvature in the data; thus, the condition is satisfied.
  • No Outliers Condition:
    There is one outlier with large x-value (and high leverage) and another outlier with large y-value (and moderately high leverage).
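Leverage can also be quantified rather than judged by eye. The following is a minimal sketch on a small made-up data set (not the survey data); it uses R's hatvalues() to show that a point whose x-value lies far from the others receives the highest leverage. The names toy, toy_model and highest_leverage are chosen for illustration only.

```r
# Toy data (made up for illustration; NOT the survey data).
# The last point has an x-value far from the others.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(1.1, 1.9, 3.2, 3.8, 5.1, 9.0))
toy_model <- lm(y ~ x, data = toy)

# Leverage of each point: the diagonal of the hat matrix
lev <- hatvalues(toy_model)

# Row index of the point with the highest leverage
highest_leverage <- which.max(lev)
```

The leverages of a model with an intercept and one slope always add up to 2, so a single value close to 1 signals a point that largely determines the fitted line on its own.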

Taking the logarithm of both coordinates makes both outliers less conspicuous.

vacc$log10_unsafe <- log10(vacc$unsafe_pct)
vacc$log10_ineffective <- log10(vacc$ineffective_pct)
plot(log10_ineffective ~ log10_unsafe,
     data = vacc,
     main = "Perceptions of Vaccines",
     xlab = "log10(% disagree that vaccines are safe)",
     ylab = "log10(% disagree that vaccines are effective)")
grid()

cor(vacc$log10_unsafe, vacc$log10_ineffective)
## [1] 0.6649563

For the log-transformed variables, there is a moderately strong positive correlation.
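The choice of base 10 is a matter of convenience. Changing the base of the logarithm only multiplies every value by a constant, and the correlation coefficient is unaffected by such rescaling. A minimal sketch with made-up percentages (not the survey data):

```r
# Made-up percentages for illustration (NOT the survey data)
unsafe      <- c(2.1, 4.5, 7.8, 11.3, 15.9, 33.0)
ineffective <- c(1.5, 2.0, 6.1, 5.2, 9.7, 18.4)

r_log10 <- cor(log10(unsafe), log10(ineffective))
r_ln    <- cor(log(unsafe), log(ineffective))  # natural logarithm

# Same correlation, because log(x) = log10(x) * log(10)
isTRUE(all.equal(r_log10, r_ln))
```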

1.2 Conditions for linear regression

Suppose we want to develop a model that predicts the percentage of people who believe vaccines are ineffective as a function of the percentage who believe vaccines are unsafe.

Do the re-expressed data satisfy the conditions for linear regression? (Check the textbook for the criteria.)

We have already checked the first three conditions on page 213:

  • Quantitative Variable Condition
  • Straight Enough Condition
  • Outlier Condition

The fourth condition is the “Does the Plot Thicken? Condition”. Looking at the scatter plot of the re-expressed data, the vertical spread of the points appears to be roughly the same at all x-values. We conclude that the re-expressed data meet all conditions for linear regression.
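A rough numerical complement to the visual check is to compare the spread of the residuals for small and for large x-values. The sketch below is only one possible approach and uses simulated data with constant spread by construction (not the survey data); all variable names are hypothetical.

```r
set.seed(1)  # simulated data for illustration (NOT the survey data)
x <- runif(200, min = 0, max = 2)
y <- 0.1 + 0.65 * x + rnorm(200, sd = 0.2)  # constant spread by construction
sim_model <- lm(y ~ x)
sim_res <- residuals(sim_model)

# Standard deviation of the residuals for small and for large x-values
lower_half <- x <= median(x)
sd_lower <- sd(sim_res[lower_half])
sd_upper <- sd(sim_res[!lower_half])

# A ratio close to 1 suggests that the plot does not thicken
spread_ratio <- sd_upper / sd_lower
```

A ratio far from 1 would suggest that the spread changes with x, in which case the residual plot deserves a closer look.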

1.3 Linear regression

  1. Fit a linear model to the re-expressed data.
  2. Add the regression line to the scatter plot.
  3. What is the equation of the least-squares regression line?
  4. How do you interpret the coefficients in this equation?
  5. What is the \(R^2\)-value of the model? How do you interpret it?

model <- lm(log10_ineffective ~ log10_unsafe, data = vacc)
plot(log10_ineffective ~ log10_unsafe,
     data = vacc,
     main = "Perceptions of Vaccines",
     xlab = "log10(% disagree that vaccines are safe)",
     ylab = "log10(% disagree that vaccines are effective)")
grid()
abline(model, col = "blue", lwd = 2)

coef(model)
##  (Intercept) log10_unsafe 
##    0.1063266    0.6512057

The equation for the regression line is:

\(\log_{10}\)(% disagree that vaccines are effective) =
\(0.106\) + \(0.651\cdot\log_{10}\)(% disagree that vaccines are safe).

We conclude that, if the percentage that believes vaccines are unsafe increases by a factor of \(10\), then the percentage that believes vaccines are ineffective tends to increase by a factor of \(10^{0.6512} \approx 4.479\).

If 1% of people believe vaccines are unsafe, then \(\log_{10}(1) = 0\), so the linear model predicts that \(10^{0.1063} \approx 1.277\) percent believe vaccines are ineffective.
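Undoing the logarithm on both sides turns the fitted line into a power law, \(y = 10^{a} x^{b}\), where \(a\) is the intercept and \(b\) the slope. The sketch below copies the two coefficients from the coef(model) output; the helper name predict_ineffective is hypothetical and not part of the original analysis.

```r
# Coefficients copied from the coef(model) output
a <- 0.1063266  # intercept
b <- 0.6512057  # slope

# Back-transformed model: y = 10^a * x^b (x and y in percent).
# predict_ineffective is a hypothetical helper name.
predict_ineffective <- function(unsafe_pct) {
  10^a * unsafe_pct^b
}

predict_ineffective(1)                            # equals 10^a
predict_ineffective(10) / predict_ineffective(1)  # equals 10^b
```

Because the slope is less than 1, the predicted percentage grows more slowly than the percentage that believes vaccines are unsafe.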

summary(model)$r.squared
## [1] 0.4421669

The \(R^2\)-value is 0.442, equal to the square of the correlation coefficient 0.665, which we calculated earlier.

We conclude that the linear model accounts for 44.2% of the variance in \(\log_{10}\)(% disagree that vaccines are effective).
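The agreement between \(R^2\) and the squared correlation coefficient is not a coincidence: it holds for every least-squares line with a single predictor. A minimal sketch with made-up data (not the survey data):

```r
# Made-up data for illustration (NOT the survey data)
x <- c(0.3, 0.6, 0.9, 1.0, 1.2, 1.5)
y <- c(0.2, 0.3, 0.8, 0.7, 0.9, 1.3)

toy_model <- lm(y ~ x)
r_squared <- summary(toy_model)$r.squared
r <- cor(x, y)

# For a model with a single predictor, R^2 is the squared correlation
isTRUE(all.equal(r_squared, r^2))
```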

1.4 Prediction based on the linear model

The Wellcome Trust conducted the survey in most but not all countries in the world.

  1. Confirm that Brunei is not in the column vacc$country.
  2. Let us speculate that we know that 4% of Bruneians believe vaccines are unsafe. There is no good reason for this speculation other than practising our R skills. What is the linear model’s prediction for the percentage of Bruneians who believe vaccines are ineffective?

any(vacc$country == "Brunei")
## [1] FALSE

Because the output is FALSE, we conclude that Brunei is not in vacc$country.
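An equivalent membership test uses the %in% operator. The sketch below runs on a short made-up vector (the first four country names from the str() output, not the full column).

```r
# Short vector for illustration (NOT the full survey column)
countries <- c("Afghanistan", "Albania", "Algeria", "Argentina")

"Brunei" %in% countries   # FALSE: not in the vector
"Albania" %in% countries  # TRUE: in the vector
```

One difference in behaviour: if the vector contained a missing value and no match, any(x == "Brunei") would return NA, whereas %in% would still return FALSE.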

p <- predict(model, newdata = data.frame(log10_unsafe = log10(4)))
10^p
##        1 
## 3.150588

The model works on the log scale, so we pass the x-value log10(4) and back-transform the prediction with 10^p. We predict that 3.2% of Bruneians believe vaccines are ineffective.
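The same number can be reproduced step by step from the coefficients, which makes the two transformations explicit. This is a sketch that copies the values from the coef(model) output; the variable names are chosen for illustration.

```r
a <- 0.1063266  # intercept from coef(model)
b <- 0.6512057  # slope from coef(model)

log10_pred <- a + b * log10(4)  # prediction on the log10 scale
pct_pred <- 10^log10_pred       # back-transformed to a percentage
round(pct_pred, 1)
```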

1.5 Investigating the residuals

  1. Generate the residual plot for our linear model.
  2. Which country has the largest residual?
  3. What might explain why this country has an unusually large residual? Try to find information on page 112 of https://wellcome.org/sites/default/files/wellcome-global-monitor-2018.pdf.
  4. Some techniques for statistical inference, which are beyond the scope of this course, require residuals to be normally distributed. Do our data satisfy this condition? Support your answer with visualisations.

plot(model,
     which = 1,  # Make residual plot
     main = "Perceptions of Vaccines",
     caption = "Residuals of log10(% disagree that vaccines are effective)",
     sub.caption = "log10(% disagree that vaccines are effective)")
grid()

We can see from the residual plot that the country with the largest positive residual is in row 72 of the data frame vacc.

vacc$country[72]
## [1] "Liberia"

Alternatively, we can find the country with the largest positive residual as follows.

res <- residuals(model)
vacc$country[which.max(res)]
## [1] "Liberia"
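Note that which.max() finds the largest positive residual only. If the question is which point lies farthest from the line in either direction, the absolute value is needed. A minimal sketch with made-up residuals (not taken from the survey model):

```r
# Made-up residuals for illustration (NOT from the survey model)
toy_res <- c(A = 0.12, B = -0.95, C = 0.40, D = -0.05)

names(toy_res)[which.max(toy_res)]       # largest positive residual
names(toy_res)[which.max(abs(toy_res))]  # largest residual in absolute value
names(toy_res)[which.min(toy_res)]       # most negative residual
```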

Why does Liberia have an unusually large residual? The Wellcome Global Monitor 2018 gives a plausible explanation.

In Liberia, where 28% of people disagree that vaccines are effective (the highest in the world), just 3% of people disagree that they are safe… Liberia continues to grapple with infectious diseases such as yellow fever and tetanus, despite vaccination programmes. In countries where weak health supply and infrastructure systems exist, and there are difficulties with access to vaccines (in terms of distance to nearest clinics, for example), it is harder to achieve the vaccination rates necessary for herd immunity, and the persistence of infectious diseases may lead some people to conclude that the vaccines themselves are not working.

qqnorm(res)
qqline(res, col = "blue", lwd = 2)
grid()

The Q-Q plot indicates some deviation from a normal distribution. A histogram shows that the distribution of the residuals is more concentrated in the centre than a normal distribution.

hist(res,
     breaks = seq(-1.0, 1.2, 0.1),
     freq = FALSE,
     main = "Distribution of Residuals",
     xlab = "Residual")
curve(dnorm(x, mean = 0, sd = sd(res)),
      add = TRUE,
      col = "blue",
      lwd = 2)

Summary

After re-expressing both variables with logarithms, there is a moderately strong positive correlation between the percentage of people who believe vaccines are unsafe and the percentage of people who believe vaccines are ineffective.

Correlation does not imply causation. Scepticism about the safety of vaccines may cause doubts about their effectiveness. However, there are many plausible lurking variables (e.g. suspicion of modern medicine, pharmaceutical companies and governments).

Outliers always require careful investigation. Sometimes, outliers are errors (e.g. when translation into another language changed the meaning of a survey question). In our example, the most extreme outlier revealed interesting details about the public health situation in Liberia.