2021-10-14

Learning Goals

The learning goals for today’s class are as follows:

  • Use R-squared to assess predictive power of linear models
  • Examine the residuals from linear models
  • Think about outliers, leverage, and influence

To achieve these goals, we will spend today’s class analyzing the relationship between voting and ethnicity in recent general elections in Malaysia:

  • 2013 General Election
  • 2018 General Election

Part 1: 2013 General Election

2013 General Election

Import ge2013.csv as a data frame ge13 and inspect:

ge13 <- read.csv(file = "ge2013.csv", row.names=1)
str(ge13)
## 'data.frame':    166 obs. of  10 variables:
##  $ DISTNO  : chr  "P001" "P002" "P003" "P004" ...
##  $ STATE   : chr  "Perlis" "Perlis" "Perlis" "Kedah" ...
##  $ DISTNAME: chr  "PADANG BESAR" "KANGAR" "ARAU" "LANGKAWI" ...
##  $ VOTESBN : int  21473 23343 19376 21407 24161 33334 20654 32263 25491 37923 ...
##  $ VOTESPR : int  14047 19306 18005 9546 20891 22890 16212 36198 27364 42870 ...
##  $ MALAY   : num  83.5 79.3 86.7 91 90.2 ...
##  $ CHINESE : num  11.5 17.7 8.8 6.7 8.2 ...
##  $ INDIAN  : num  1.09 1.84 1.58 2.12 0.06 1.27 0.2 2.28 4.16 1.26 ...
##  $ OTHER   : num  4 1.15 2.92 0.22 1.51 0.43 5.72 0.76 0.33 0.27 ...
##  $ pctBN   : num  60.5 54.7 51.8 69.2 53.6 ...

The data contain 166 constituency-level observations from the 2013 Malaysian general election. All observations come from the states of peninsular Malaysia; constituencies in the East Malaysian states are excluded.

2013 General Election

A codebook with descriptions of the variables was circulated as part of the pre-class materials. The critical variables for the first half of today:

  • DISTNO and DISTNAME: The constituency number and name.
  • MALAY: Percent of voters in the constituency classified as ethnic Malay.
  • CHINESE: Percent of voters in the constituency classified as ethnic Chinese.
  • INDIAN: Percent of voters in the constituency classified as ethnic Indian.
  • OTHER: Percent of voters in the constituency classified as “other” ethnicity (e.g. Eurasian, Thai).
  • pctBN: Percent of the votes cast for ruling Barisan Nasional (BN) coalition out of the total number of valid votes cast in the constituency. The BN was in power from independence in 1957 until 2018.
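As a quick sanity check on the codebook, the four ethnicity percentages should sum to roughly 100 in each constituency. The sketch below uses only the first five rows as displayed (and rounded) in the str() output above; the name ge13_head is made up for illustration.

```r
# First five constituencies, values copied (rounded) from the str() output above.
# ge13_head is an illustrative name, not an object from the class materials.
ge13_head <- data.frame(
  MALAY   = c(83.5, 79.3, 86.7, 91.0, 90.2),
  CHINESE = c(11.5, 17.7,  8.8,  6.7,  8.2),
  INDIAN  = c(1.09, 1.84, 1.58, 2.12, 0.06),
  OTHER   = c(4.00, 1.15, 2.92, 0.22, 1.51)
)

# Row totals should be close to 100 (small gaps reflect rounding)
eth_total <- rowSums(ge13_head)
eth_total
```

With the full data frame loaded, the same check would be rowSums(ge13[, c("MALAY", "CHINESE", "INDIAN", "OTHER")]).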

Challenge 1

Let’s begin our analysis of voting and ethnicity in Malaysia by examining BN vote share and whether it is related to ethnicity.

Working together with your peers, do the following in R:

  • Make a histogram of pctBN
  • BEFORE making any more plots, discuss the following based on context:
    • What patterns do you expect between pctBN and MALAY? CHINESE?
    • How much of the variation in pctBN would you guess that ethnicity might be able to account for?
    • Why do you expect this to be the case?

Once you are done, we can briefly discuss your answers together.

Challenge 1 (Solution)

Create a histogram. It appears to show a lot of variation in BN vote share across constituencies.

hist(ge13$pctBN, freq=FALSE, col="lightblue",
     main = "Histogram of BN Vote Share",
     xlab = "BN Vote Share (%)")

What do you expect to be the relationship between BN vote share and ethnicity? Why?

Challenges

With the patterns you expect to see between BN vote share and ethnicity in mind, let’s now use the ethnic composition of constituencies to predict BN vote shares.

Let’s start with the percentages of ethnic Malay voters and ethnic Chinese voters. If you finish early, try to do the same with the percentages of ethnic Indian and “Other” voters.

Challenge 2

Working together with your peers, do the following in R:

  • Make scatter plots of pctBN by MALAY and CHINESE
    • Assess which conditions appear to be met for a linear model
  • Estimate, interpret, and compare linear models
    • What are the intercepts and slopes? How would you interpret them?
    • What are the R-squares? How would you interpret them?
    • What about the residual standard errors?
  • After you estimate the models, make residual plots:
    • What do you find?
    • Can you explain what it means vis-a-vis the conditions for a linear model?

Challenge 2 (Solution)

Create scatter plots. Do they match your intuition?

par(mfrow=c(1,2))
plot(pctBN ~ MALAY, data=ge13, col="lightblue", pch=19,
     xlab="Malay Voters (%)", ylab="BN Vote Share (%)",
     main="BN Vote Share and Malay Voters")
plot(pctBN ~ CHINESE, data=ge13, col="lightgreen", pch=19,
     xlab="Chinese Voters (%)", ylab="BN Vote Share (%)",
     main="BN Vote Share and Chinese Voters")

Challenge 2 (Solution)

Estimate linear models:

lm13.BN.MALAY   <- lm(pctBN ~ MALAY, data=ge13)
lm13.BN.CHINESE <- lm(pctBN ~ CHINESE, data=ge13)

Now interpret the intercepts and slopes:

coef(lm13.BN.MALAY)   # display intercept & slope
## (Intercept)       MALAY 
##  26.9520696   0.3534643
coef(lm13.BN.CHINESE) # display intercept & slope
## (Intercept)     CHINESE 
##   60.873960   -0.412503
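One way to make the intercepts and slopes concrete is to plug a value into each fitted line. The sketch below copies the coefficients printed above and computes the predicted BN vote share for a hypothetical constituency with 50% Malay voters and another with 50% Chinese voters. The 50% figure and the helper name predict_line are illustrative choices, not part of the class materials.

```r
# Coefficients copied from the coef() output above
b_malay   <- c(intercept = 26.9520696, slope =  0.3534643)
b_chinese <- c(intercept = 60.873960,  slope = -0.412503)

# Illustrative helper: predicted value = intercept + slope * x
predict_line <- function(b, x) unname(b["intercept"] + b["slope"] * x)

pred_malay_50   <- predict_line(b_malay, 50)    # about 44.6
pred_chinese_50 <- predict_line(b_chinese, 50)  # about 40.2
```

Read this way, each additional percentage point of Malay voters is associated with about 0.35 points more BN vote share, and each additional point of Chinese voters with about 0.41 points less. The intercepts are the predicted BN vote shares for a constituency with 0% Malay (or 0% Chinese) voters. With the fitted models available, predict(lm13.BN.MALAY, newdata = data.frame(MALAY = 50)) should give the same number.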

Challenge 2 (Solution)

Now interpret the R-squares:

summary(lm13.BN.MALAY)$r.squared   # displays r-squared value
## [1] 0.4398053
summary(lm13.BN.CHINESE)$r.squared # displays r-squared value
## [1] 0.455778

Which model best accounts for the variation in BN vote share? Why?
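As a reminder of what these numbers measure, R-squared is the share of the variation in the response that the fitted line accounts for: 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on a small made-up data set (the x and y values are invented, not taken from ge2013.csv) and shows that, with a single predictor, it equals the squared correlation.

```r
# Toy data, invented for illustration only
toy <- data.frame(x = c(10, 25, 40, 55, 70, 85),
                  y = c(35, 38, 47, 45, 56, 58))
fit <- lm(y ~ x, data = toy)

ss_res <- sum(residuals(fit)^2)          # variation left over after the fit
ss_tot <- sum((toy$y - mean(toy$y))^2)   # total variation in y
r2_by_hand <- 1 - ss_res / ss_tot

r2_summary <- summary(fit)$r.squared     # the quantity reported on the slides
r2_cor     <- cor(toy$x, toy$y)^2        # squared correlation
```

Applied to the output above, an R-squared of 0.456 means that roughly 46% of the variation in pctBN across constituencies is accounted for by its linear relationship with CHINESE.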

Challenge 2 (Solution)

What about the residual standard errors?

sd(lm13.BN.MALAY$residuals)   # SD of residuals (approximates the residual standard error)
## [1] 9.812739
sd(lm13.BN.CHINESE$residuals) # SD of residuals (approximates the residual standard error)
## [1] 9.671834

Which model best accounts for the variation in BN vote share? Why?
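The sd() of the residuals is a close stand-in for the residual standard error, but the two are not identical: sd() divides by n - 1, while the residual standard error that summary() reports divides by n - 2 (one degree of freedom for each estimated coefficient). A minimal sketch on made-up data, not the election data:

```r
# Toy data, invented for illustration only
toy <- data.frame(x = c(10, 25, 40, 55, 70, 85),
                  y = c(35, 38, 47, 45, 56, 58))
fit <- lm(y ~ x, data = toy)
res <- residuals(fit)
n   <- nrow(toy)

sd_res      <- sd(res)                      # sqrt(sum(res^2) / (n - 1))
rse         <- summary(fit)$sigma           # residual standard error
rse_by_hand <- sqrt(sum(res^2) / (n - 2))   # same quantity, computed directly
```

With 166 constituencies the two denominators differ very little, so the comparison between the MALAY and CHINESE models is unaffected.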

Challenge 2 (Solution)

Create residual plots. What do they mean vis-a-vis the conditions for a linear model?

par(mfrow=c(1,2))
plot(lm13.BN.MALAY, which=1,
     main = "BN Vote Share by Malay Voters (%)",
     sub.caption = "BN Vote Share (%)")
plot(lm13.BN.CHINESE, which=1,
     main = "BN Vote Share by Chinese Voters (%)",
     sub.caption = "BN Vote Share (%)")

Challenge 2 (Solution)

The violation of the straight enough condition is also clear if we add the line of best fit.
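The plot for this slide is not reproduced here. As a rough sketch of the idea, the code below fits a line to made-up curved data (not the election data) and overlays it on the scatter plot. When the relationship bends, the points sit above the line at both ends and below it in the middle, which is the same pattern a residual plot reveals.

```r
# Toy curved relationship, invented for illustration only
toy <- data.frame(x = 0:10)
toy$y <- toy$x^2

fit <- lm(y ~ x, data = toy)

plot(y ~ x, data = toy, col = "lightblue", pch = 19,
     main = "Curved data with a straight line of best fit")
abline(fit, col = "red")   # add the fitted line to the scatter plot

# Signs of the residuals: positive at the ends, negative in the middle
res <- residuals(fit)
sign(round(res, 8))
```

For the election models, the equivalent would be plot(pctBN ~ MALAY, data = ge13) followed by abline(lm13.BN.MALAY).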

Discussion

Let’s take a moment to pause, think, and talk about the models so far:

  • What have we found?
    • Positive relationship between MALAY and pctBN
    • Negative relationship between CHINESE and pctBN
    • CHINESE slightly more predictive of BN vote share than MALAY (R-squared of 0.456 vs. 0.440)
    • Residuals suggest relationship not strictly linear
  • Things to think about:
    • Did you expect CHINESE to be more predictive than MALAY?
    • What might explain the curve in the relationship?
      • What further might we do to find out?

Part 2: 2018 General Election

2018 General Election

For the second half of today’s activity, we will be using the dataset ge2018.csv to practice fitting and assessing linear models.

To provide some background on the 2018 general election:

  • Fought largely between the Barisan Nasional (BN) coalition and the opposition Pakatan Harapan (PH) coalition
    • BN led by PM Najib Razak
    • PH led by former PM Mahathir Mohamad
  • Election played out against the backdrop of a major corruption scandal over the embezzlement of funds from Malaysia’s 1MDB sovereign wealth fund
    • PM Najib accused of taking USD 700 million - is still on trial

2018 General Election

Despite extensive pre-electoral manipulation and against the expectations of observers, however…

  • The PH coalition soundly defeated the BN
    • First non-BN/UMNO-led government since Malaysia’s independence
    • Transition raised hopes of a Malaysia Baru (a new Malaysia) and that the country could potentially move beyond the politics of race/ethnicity that had dominated since independence
  • But do the results from 2018 suggest that ethnicity is no longer an issue?

2018 General Election

Import ge2018.csv as a data frame ge18 and inspect:

ge18 <- read.csv("ge2018.csv", row.names=1)
str(ge18)
## 'data.frame':    222 obs. of  12 variables:
##  $ DISTNO  : chr  "P001" "P002" "P003" "P004" ...
##  $ STATE   : chr  "Perlis" "Perlis" "Perlis" "Kedah" ...
##  $ DISTNAME: chr  "PADANG BESAR" "KANGAR" "ARAU" "LANGKAWI" ...
##  $ VOTESBN : int  15032 15306 16547 10061 12413 16975 16384 18390 14181 19400 ...
##  $ VOTESPH : int  13594 20909 11691 18954 18695 29984 7254 28959 32475 36624 ...
##  $ VOTERS  : int  46096 55938 48187 42697 54132 73881 46644 86892 80272 97753 ...
##  $ TURNOUT : int  37432 45703 40433 35250 44822 61452 39932 71910 65096 80555 ...
##  $ AREA    : int  450 141 222 469 316 634 1357 322 111 233 ...
##  $ BUMI    : num  87.6 81.9 87.8 90.7 91.3 86.9 92.2 81.6 63.7 78.3 ...
##  $ CHINESE : num  8.5 15.2 7.7 6.5 6.9 8.5 1.3 15.5 31.7 20.3 ...
##  $ INDIAN  : num  0.9 1.7 1.7 2.4 0.1 3.5 0.2 2.5 4.2 1.1 ...
##  $ OTHER   : num  3.1 1.2 2.9 0.4 1.7 1.2 6.4 0.5 0.4 0.3 ...

The data include 222 observations, one for each constituency, covering peninsular Malaysia as well as the East Malaysian states of Sabah and Sarawak.

Challenge 1

Let’s start by replicating the main analysis for 2013 using the 2018 data.

Working together with your peers, do the following in R:

  • Make a histogram of BN Vote Share
    • Notice that the dataset does not contain a variable pctBN
    • So create pctBN using VOTESBN and TURNOUT
  • Make a scatter plot of BN Vote Share by the proportion of voters classified as Bumiputera (BUMI)
    • Assess which conditions appear to be met for a linear model
  • Estimate linear models for BN Vote Share and all ethnicities
    • Which model has the highest predictive power?
    • Is it the same as in the 2013 general election?

Challenge 1 (Solution)

ge18$pctBN <- 100 * (ge18$VOTESBN/ge18$TURNOUT) # compute pctBN and append
hist(ge18$pctBN, freq=FALSE, col="lightblue", breaks=seq(0, 100, 10),
     main = "Histogram of BN Vote Share",
     xlab = "BN Vote Share (%)")

Challenge 1 (Solution)

Create a scatter plot of BN Vote Share versus Bumiputera.

plot(pctBN ~ BUMI, data=ge18, col="blue", pch=19, main="BN Vote Share by Proportion Bumiputera Voters",
     xlab="Bumiputera Voters (%)", ylab="BN Vote Share (%)")

Challenge 1 (Solution)

Estimate linear models for BN vote share and each ethnicity:

lm18.BN.BUMI    <- lm(pctBN ~ BUMI,    data=ge18)
lm18.BN.CHINESE <- lm(pctBN ~ CHINESE, data=ge18)
lm18.BN.INDIAN  <- lm(pctBN ~ INDIAN,  data=ge18)
lm18.BN.OTHER   <- lm(pctBN ~ OTHER,   data=ge18)

Check the model output for CHINESE in both 2018 and 2013 to see if the slopes are similar:

coef(lm18.BN.CHINESE)
## (Intercept)     CHINESE 
##  48.4817813  -0.4326668
coef(lm13.BN.CHINESE)
## (Intercept)     CHINESE 
##   60.873960   -0.412503

Challenge 1 (Solution)

Check which model has the highest predictive power. BUMI has the highest R-squared (0.44), followed by CHINESE (0.38).

summary(lm18.BN.BUMI)$r.squared
## [1] 0.4411699
summary(lm18.BN.CHINESE)$r.squared
## [1] 0.3768821
summary(lm18.BN.INDIAN)$r.squared
## [1] 0.2143789
summary(lm18.BN.OTHER)$r.squared
## [1] 0.009243599
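When comparing several models, it can be convenient to collect the R-squared values in one named vector rather than printing them one at a time. The sketch below does this on a ten-row excerpt copied from the str() output above, so its numbers will not match the full-data results. The names ge18_head, predictors, and r2 are illustrative.

```r
# First ten constituencies, copied from the str() output above.
# Results on this excerpt will differ from those on all 222 rows.
ge18_head <- data.frame(
  VOTESBN = c(15032, 15306, 16547, 10061, 12413, 16975, 16384, 18390, 14181, 19400),
  TURNOUT = c(37432, 45703, 40433, 35250, 44822, 61452, 39932, 71910, 65096, 80555),
  BUMI    = c(87.6, 81.9, 87.8, 90.7, 91.3, 86.9, 92.2, 81.6, 63.7, 78.3),
  CHINESE = c(8.5, 15.2, 7.7, 6.5, 6.9, 8.5, 1.3, 15.5, 31.7, 20.3),
  INDIAN  = c(0.9, 1.7, 1.7, 2.4, 0.1, 3.5, 0.2, 2.5, 4.2, 1.1),
  OTHER   = c(3.1, 1.2, 2.9, 0.4, 1.7, 1.2, 6.4, 0.5, 0.4, 0.3)
)
ge18_head$pctBN <- 100 * ge18_head$VOTESBN / ge18_head$TURNOUT

# Fit one simple regression per ethnicity variable and extract R-squared
predictors <- c("BUMI", "CHINESE", "INDIAN", "OTHER")
r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "pctBN"), data = ge18_head)
  summary(fit)$r.squared
})

sort(r2, decreasing = TRUE)   # highest predictive power first
```

Swapping ge18_head for the full ge18 data frame should reproduce the four values printed above.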

Challenge 2

What about the relationship between the opposition Pakatan Harapan (PH) coalition’s vote share and ethnicity in 2018?

Working together with your peers, do the following in R:

  • Make a histogram of PH vote share
    • Create pctPH using VOTESPH and TURNOUT
    • Combine with a histogram of BN vote share to compare (e.g. par())
  • Make a scatter plot of pctPH by BUMI
    • Assess what conditions appear to be met for a linear model
  • Estimate linear models for PH Vote Share and all ethnicities
    • Which model has the highest predictive power?
    • How do they compare to your results from models of BN vote share?
    • What does this tell us about voting and ethnicity in 2018?

Challenge 2 (Solution)

Create pctPH and make histograms of PH and BN vote shares:

ge18$pctPH <- 100 * (ge18$VOTESPH / ge18$TURNOUT) # create pctPH and append
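Because pctBN and pctPH share the same denominator (TURNOUT), their sum should never exceed 100 if the data are consistent. The sketch below checks this on the first five rows copied from the str() output above; ge18_rows is an illustrative name.

```r
# First five constituencies, copied from the str() output above
ge18_rows <- data.frame(
  VOTESBN = c(15032, 15306, 16547, 10061, 12413),
  VOTESPH = c(13594, 20909, 11691, 18954, 18695),
  TURNOUT = c(37432, 45703, 40433, 35250, 44822)
)
ge18_rows$pctBN <- 100 * ge18_rows$VOTESBN / ge18_rows$TURNOUT
ge18_rows$pctPH <- 100 * ge18_rows$VOTESPH / ge18_rows$TURNOUT

# Combined share of the two coalitions; the remainder went elsewhere
combined <- ge18_rows$pctBN + ge18_rows$pctPH
round(combined, 1)
```

On the full data, any(ge18$pctBN + ge18$pctPH > 100, na.rm = TRUE) returning TRUE would point to a data-entry problem worth investigating before modelling.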

Create histograms of BN and PH vote shares:

par(mfrow=c(1,2))

hist(ge18$pctBN, freq=FALSE, col="lightblue", breaks=seq(0, 100, 10),
  main = "Histogram of BN Vote Shares",
  xlab = "BN Vote Share (%)",
  xlim = c(0,100), ylim = c(0,0.03))

hist(ge18$pctPH, freq=FALSE, col="pink", breaks=seq(0, 100, 10),
  main = "Histogram of PH Vote Shares",
  xlab = "PH Vote Share (%)",
  xlim = c(0,100), ylim = c(0,0.03))

par(mfrow=c(1,1))

Challenge 2 (Solution)

The histograms indicate that there is more variation across constituencies in the share of votes received by PH than in the share received by BN. We can confirm this by calculating the standard deviation of each vote share.

sd(ge18$pctBN, na.rm = TRUE)
## [1] 15.18926
sd(ge18$pctPH, na.rm = TRUE)
## [1] 20.82529

Challenge 2 (Solution)

Create a scatter plot of PH vote share by the percentage of voters classified as Bumiputera:

plot(pctPH ~ BUMI, data=ge18, col="red", pch=19,
     xlab="Bumiputera Voters (%)", ylab="PH Vote Share (%)",
     main="PH Vote Share by Bumiputera Voters")

Challenge 2 (Solution)

Estimate linear models for PH Vote Share and all ethnicities:

lm18.PH.BUMI    <- lm(pctPH ~ BUMI,    data=ge18)
lm18.PH.CHINESE <- lm(pctPH ~ CHINESE, data=ge18)
lm18.PH.INDIAN  <- lm(pctPH ~ INDIAN,  data=ge18)
lm18.PH.OTHER   <- lm(pctPH ~ OTHER,   data=ge18)

Check the model output for BUMI in 2018:

coef(lm18.PH.BUMI)
## (Intercept)        BUMI 
##  93.3133200  -0.7146756

Challenge 2 (Solution)

We find that BUMI also had the highest predictive power for pctPH.

summary(lm18.PH.BUMI)$r.squared
## [1] 0.7293721
summary(lm18.PH.CHINESE)$r.squared
## [1] 0.6984725
summary(lm18.PH.INDIAN)$r.squared
## [1] 0.3682286
summary(lm18.PH.OTHER)$r.squared
## [1] 0.009369411

Challenge 3

It should be clear that there is a strong relationship between voting and ethnicity. But the following plot includes an interesting outlier (highlighted below in red), with 77% voting for PH in a district with 98% bumiputera.

Investigate the outlier: Can you explain why it seems so different from most other bumiputera-dominant constituencies?

Challenge 3 (Solution)

A quick subset will allow us to identify and investigate:

ge18[which(ge18$pctPH > 70 & ge18$BUMI > 90),]
##     DISTNO STATE DISTNAME VOTESBN VOTESPH VOTERS TURNOUT AREA BUMI CHINESE
## 189   P189 Sabah SEMPORNA    6135   26809  48248   34613 1177 98.1     1.6
##     INDIAN OTHER    pctBN    pctPH
## 189     NA   0.3 17.72455 77.45356
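The which() wrapper in the subset above matters when a column contains missing values: a logical condition that evaluates to NA would otherwise return a row filled with NAs. A small sketch with invented values (toy is not part of the class data):

```r
# Toy data with one missing vote share, invented for illustration only
toy <- data.frame(pctPH = c(77.5, NA, 30),
                  BUMI  = c(98.1, 95, 60))

cond <- toy$pctPH > 70 & toy$BUMI > 90   # TRUE, NA, FALSE

with_na    <- toy[cond, ]          # keeps an extra all-NA row
without_na <- toy[which(cond), ]   # which() drops the NA position
```

Here with_na has two rows (one of them all NA), while without_na contains only the row that actually meets both conditions.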

The constituency is named Semporna and is located in Sabah in East Malaysia.

For those who are familiar with Sabah, what makes Sabah different from other states in Malaysia? Alternatively, what happened in the state during the 2018 election that could help to explain why voters behaved differently than the model would predict?
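To see how far Semporna sits from the fitted line, one can plug its BUMI value into the coefficients reported earlier for lm18.PH.BUMI. The sketch below copies the numbers from the outputs above; the variable names are illustrative.

```r
# Coefficients copied from the coef(lm18.PH.BUMI) output above
intercept <- 93.3133200
slope     <- -0.7146756

# Semporna's values from the subset output above
bumi_semporna  <- 98.1
pctPH_observed <- 77.45356

pctPH_predicted   <- intercept + slope * bumi_semporna   # about 23.2
residual_semporna <- pctPH_observed - pctPH_predicted    # about 54.2
```

The model predicts a PH vote share of roughly 23% for a constituency that is 98.1% Bumiputera, so Semporna's observed share is more than 50 percentage points above the line.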

Challenge 3 - A “Sabah Effect”?

First and foremost, ethnic identity is different and more complicated in Sabah.

  • Much of Sabah’s population is legally classified as Bumiputera
  • But this does not mean that people in Sabah fit into the MCIO (Malay, Chinese, Indian, Other) model of race/ethnicity that shapes identity and that you have talked about in CSI.
    • Sabah only became part of Malaysia in 1963.
    • Prior to this, British colonial authorities did not use MCIO to divide and rule in Sabah as they had in Singapore and Malaya.
  • The bumiputera population of Sabah is more diverse in terms of religion, language, and traditions than the bumiputera population in Malaya:
    • Not everyone who is Bumiputera in Sabah speaks the Malay language, or is Muslim.
    • Over 42 distinct ethno-linguistic groups and 200 sub-groups.

Challenge 3 - A “Sabah Effect”?

Second and as a result, the relationship between ethnic identity and politics is more complicated in Sabah than it is in peninsular Malaysia.

Other forms of localised identity also matter for national politics:

  • Identity as a Sabahan (one who was born in or identifies with Sabah) is arguably more important than ethnic identity in many contexts.
    • Whether legally recognized as Bumiputera, Chinese, or Indian, you can (arguably) be a Sabahan.
  • If you identify as a Sabahan, then it might also affect your politics, how you vote, and which parties or coalitions you support…

Challenge 3 - A “Sabah Effect”?

To this end, the 2018 General Election also saw the emergence of a new, multi-ethnic party in Sabah: The Sabah Heritage Party (Parti Warisan Sabah or Warisan).

  • Warisan positioned itself in favor of greater regional autonomy for Sabah.
  • It also allied itself with the opposition Pakatan Harapan (PH) coalition.
  • It proved to be quite popular across the state among bumiputera and non-bumiputera voters…

Challenge 4 - A “Sabah Effect”? (Extra)

Let us check whether this “Sabah effect” also applies to other constituencies in Sabah.

  • Recreate the scatter plot of PH vote share versus Bumiputera voters
    • Use different colors for the constituencies in Sabah.
  • Does it seem like constituencies in Sabah behave differently with respect to the relationship between support for the PH and ethnicity?

Challenge 4 - A “Sabah Effect”? (Solution)

The solid red points are from Sabah while the pink ones are not.

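The observed PH vote shares in Sabah are shifted upwards from what the model would predict, i.e. these constituencies have large positive residuals. Constituencies with BUMI values far from the average, such as Semporna, also have relatively high leverage. Things are indeed different in Sabah. But does this influence the slope?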

Challenge 4 - A “Sabah Effect”? (Solution)

Let’s assess any influence by re-estimating the relationship between PH vote share and bumiputera voters after removing the observations from Sabah:

lm18.PH.BUMI_noSBH <- lm(pctPH ~ BUMI, data = ge18[ge18$STATE != "Sabah",])

Now let’s re-plot the scatter plot and add the two lines of best fit. We will color the original line of best fit red and the line with Sabah removed blue:

plot(pctPH ~ BUMI, data = ge18[ge18$STATE != "Sabah",], col="pink", pch=19,
     xlab="Bumiputera Voters (%)", ylab="PH Vote Share (%)",
     main="PH Vote Share by Bumiputera Voters")
points(pctPH ~ BUMI, data = ge18[ge18$STATE == "Sabah",], col = "red", pch=19) # overlay Sabah constituencies
abline(lm18.PH.BUMI, col="red")        # line of best fit including Sabah
abline(lm18.PH.BUMI_noSBH, col="blue") # line of best fit excluding Sabah

Challenge 4 - A “Sabah Effect”? (Solution)

Comparing the two lines shows that excluding Sabah changes the line of best fit very little. Sabah constituencies are still different, but they are not influential with regard to the slope between PH vote share and bumiputera voters.
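The same comparison can be made numerically. The sketch below uses made-up data (not ge2018.csv) in which a small "Sabah-like" group is shifted upwards across the range of BUMI. It compares the coefficients with and without that group and computes two per-observation diagnostics from base R, hatvalues() for leverage and cooks.distance() for influence. Neither function is used in the slides; they are shown here as one possible way to quantify the ideas.

```r
# Toy data, invented for illustration: ten "Other" constituencies on a
# downward trend plus three "Sabah-like" ones shifted up by 30 points
toy <- data.frame(
  STATE = c(rep("Other", 10), rep("Sabah", 3)),
  BUMI  = c(seq(10, 100, by = 10), 30, 60, 90)
)
noise <- c(rep(c(1, -1), 5), 0, 0, 0)
toy$pctPH <- 90 - 0.7 * toy$BUMI + noise + ifelse(toy$STATE == "Sabah", 30, 0)

fit_all  <- lm(pctPH ~ BUMI, data = toy)
fit_drop <- lm(pctPH ~ BUMI, data = toy[toy$STATE != "Sabah", ])

# Influence on the line: compare coefficients with and without the group
rbind(all = coef(fit_all), without_sabah = coef(fit_drop))

# Per-observation diagnostics
lev  <- hatvalues(fit_all)        # leverage: how unusual each BUMI value is
cook <- cooks.distance(fit_all)   # influence: how much the fit moves if dropped
round(cbind(leverage = lev, cooks_d = cook), 3)
```

In this toy example the intercept moves by several points but the slope barely changes, mirroring the comparison of the red and blue lines. With the real models, rbind(coef(lm18.PH.BUMI), coef(lm18.PH.BUMI_noSBH)) gives the corresponding table.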

Recap

To recap, the learning goals for today’s class were as follows:

  • Use R-squared to assess predictive power of linear models
    • We used R-squared to compare the predictive power of models
  • Examine the residuals from linear models
    • We used residual plots and the shape of the scatterplots to assess whether the relationship met the conditions for a linear model
    • We compared residual standard errors from a variety of models to discern which models accounted for more of the variability
  • Think about outliers, leverage, and influence
    • We identified outliers from Sabah in the 2018 General election
    • We then showed that outliers may have high leverage, but they are not necessarily influential (i.e. they do not influence the slope by much)