2021-10-11

Learning Goals

The learning goals for today’s class are as follows:

  • Understand regression to the mean
  • Think about variability in large versus small samples
  • Practice fitting linear models

To achieve these goals, we will spend today’s class working on two activities:

  • Activity 1: Cancer rates in US counties
  • Activity 2: Predicting stock prices

Activity 1: Cancer rates in US counties

Cancer rates in US counties

For today’s first activity, we will be using the dataset UScancer.csv to think about variability in small samples and to practice fitting linear models.

As a hint of what is to come, think back to before the mid-semester break when we learned about the normal model:

  • Try to recall the activity “Gender on a Normal Curve?”
    • We assessed whether the female proportion of subzone populations in Singapore followed a normal model
    • We then identified a number of outliers (i.e. subzones) that caused the distribution of female proportion to deviate from a normal model
  • Can you remember what was distinct about the outliers? What did they have in common?

Cancer rates in US counties

Import UScancer.csv as a data frame cancer and inspect:

cancer <- read.csv("UScancer.csv")
str(cancer)
## 'data.frame':    3142 obs. of  7 variables:
##  $ STATE : chr  "South Carolina" "Louisiana" "Virginia" "Idaho" ...
##  $ COUNTY: chr  "Abbeville County" "Acadia Parish" "Accomack County" "Ada County" ...
##  $ POP   : int  25417 61773 33164 392365 7682 18656 25607 22683 441603 3976 ...
##  $ CASES : int  27 57 41 220 8 17 17 19 196 4 ...
##  $ PCI   : int  18134 19910 22703 27452 25564 17371 19258 15116 24195 21555 ...
##  $ MHI   : int  35947 37587 39328 55210 47892 32524 34733 32556 56270 35434 ...
##  $ OVER65: int  4203 7886 6336 41048 1644 2859 3210 2934 36862 828 ...

Challenge 1.1

Working together with your peers, do the following in R:

  • Make a histogram of the population of the counties.
    • What can you do to make this histogram more informative?
  • Explore the relationship between population and cancer
    • Use the cor function to see if the number of lung cancer cases increases with population.
    • Plot CASES as a function of POP.
    • Fit a linear model and add the line to your plot.
  • Investigate the outlier.
    • Which county is it?
    • Does it have more or fewer cases than you would predict from the model?

Challenge 1.1 (Solution)

par(mfrow=c(1,2))
hist(cancer$POP, freq=FALSE, col="lightblue", main = "Histogram of Population", xlab ="Population")
hist(log(cancer$POP), freq=FALSE, col="lightblue", main = "Histogram of Population (ln)", xlab = "(ln) Population")
par(mfrow=c(1,1))

Challenge 1.1 (Solution)

Calculate the correlation coefficient:

cor(cancer$CASES, cancer$POP, use="complete.obs")
## [1] 0.9405287

Estimate a linear model:

lm.cancer <- lm(CASES ~ POP, data=cancer)
lm.cancer
## 
## Call:
## lm(formula = CASES ~ POP, data = cancer)
## 
## Coefficients:
## (Intercept)          POP  
##   2.092e+01    5.085e-04

That intercept seems strange: the model predicts about 21 lung cancer cases in a county with zero population.
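
As an optional follow-up (a sketch, not part of the original solution; lm.cancer.origin is a name introduced here), we could force the regression through the origin, since a county with zero population should have zero cases, and compare the slope:

lm.cancer.origin <- lm(CASES ~ 0 + POP, data=cancer)   # intercept forced to zero
lm.cancer.origin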

Challenge 1.1 (Solution)

Plot and add line of best fit:

plot(CASES ~ POP, data=cancer, main = "Lung Cancer Cases by Population", xlab = "Population",
     ylab="Lung Cancer Cases")
abline(lm.cancer, col="red")

Challenge 1.1 (Solution)

Recall that residuals represent the vertical distance from each point to the regression line. Extending the y-axis so that the fitted line spans the full range of populations makes Los Angeles County’s distance below the line easier to see.

# extend the y-axis to roughly the line's value at POP = 10 million (slope * 1e7)
plot(CASES ~ POP, data=cancer, main = "Lung Cancer Cases by Population", xlab = "Population",
     ylab="Lung Cancer Cases", ylim = c(0, 5.085e-04 * 1e+07))
abline(lm.cancer, col="red")

Challenge 1.1 (Solution)

Identify the outlier

cancer[cancer$POP == max(cancer$POP),]
##           STATE             COUNTY     POP CASES   PCI   MHI  OVER65
## 1714 California Los Angeles County 9818605  3661 27749 55909 1065699

The outlier is Los Angeles County, which contains the City of Los Angeles as well as cities such as Long Beach, Glendale, Pasadena, Hollywood (where you think the movie studios are located), and Burbank (where the movie studios are actually located).

Based on an examination of the scatter plot with the line of best fit, it would appear that Los Angeles County has fewer lung cancer cases than the model predicts.
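
To check this numerically (a quick sketch using the lm.cancer object fitted above; la is a name introduced here for the example), compare the observed count with the model’s prediction:

la <- cancer[cancer$COUNTY == "Los Angeles County", ]
predict(lm.cancer, newdata = la)              # cases predicted from LA County's population
la$CASES                                      # observed cases
la$CASES - predict(lm.cancer, newdata = la)   # negative => fewer cases than the model predicts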

Challenge 1.2

Since the counties vary so widely in population, it’s better for us to look at the rates rather than the number of cases.

Working together with your peers, do the following in R:

  • Add a column called RATE that gives the number of cases per 100,000 residents.
  • Look at the distribution of the variable RATE.
  • Make two histograms:
    • A histogram of counties with RATE > 200
    • A histogram of counties with RATE <= 200

Challenge 1.2 (Solution)

Calculate the rate per 100,000 population:

cancer$RATE <- (cancer$CASES / cancer$POP) * 100000

Examine the distribution of RATE:

hist(cancer$RATE, freq=FALSE, col="pink", main = "Histogram of Lung Cancer Rate (Cases per 100,000 Population)",
     xlab = "Lung Cancer Cases per 100,000 Population")

Challenge 1.2 (Solution)

par(mfrow=c(1,2))
hist(cancer$RATE[cancer$RATE <= 200], freq=FALSE, col="pink",
     main = "Histogram of Cancer Rate (<=200)", xlab = "Cancer Cases per 100,000 Population")
hist(cancer$RATE[cancer$RATE > 200], freq=FALSE, col="pink",
     main = "Histogram of Cancer Rate (>200)", xlab = "Cancer Cases per 100,000 Population")
par(mfrow=c(1,1))

Challenge 1.3

Let’s further focus on counties where the rate is extremely high or extremely low.

Working together with your peers, do the following in R:

  • Create a new data frame consisting of counties with RATE > 500.
    • How many are there?
    • Looking at this data frame, is there anything that distinguishes these counties?
  • What about for counties with very low rates (i.e. RATE < 20)?

Challenge 1.3 (Solution)

Get counties with high cancer rates (i.e. RATE > 500)

high <- cancer[cancer$RATE > 500 & !is.na(cancer$RATE),]

How many counties are there?

nrow(high)
## [1] 15

Challenge 1.3 (Solution)

Get counties with low cancer rates (i.e. RATE < 20)

low <- cancer[cancer$RATE < 20 & !is.na(cancer$RATE),]

How many counties are there?

nrow(low)
## [1] 27

Challenge 1.3 (Solution)

Let’s compare the mean POP of all counties with those of high / low rate counties

mean(cancer$POP, na.rm=TRUE) # population of all counties
## [1] 98262.04
mean(high$POP, na.rm=TRUE) # population of high rate counties
## [1] 6449.867
mean(low$POP, na.rm=TRUE) # population of low rate counties
## [1] 140282.3

Interesting: it appears that high-rate counties have much smaller populations on average (about 6,500) than all counties or low-rate counties (both of which average around 100,000 or more). What could explain the low cancer rates in large counties (cities)?

Challenge 1.3 (Solution)

Let’s also compare the mean OVER65 of all counties with those of high / low rate counties. But let’s do so as a proportion of population:

mean(cancer$OVER65 / cancer$POP, na.rm=TRUE) # proportion of population OVER65 in all counties
## [1] 0.2011706
mean(high$OVER65 / high$POP, na.rm=TRUE) # proportion of population OVER65 in high rate counties
## [1] 0.2149073
mean(low$OVER65 / low$POP, na.rm=TRUE) # proportion of population OVER65 in low rate counties
## [1] 0.1067833

Interesting: the low-rate counties also have a much lower proportion of persons over 65 (about 11%, versus roughly 20% across all counties).

Challenge 1.4

We noticed that the counties with high lung cancer rates generally had small populations, while counties with average lung cancer rates tended to have larger populations. This is presumably reflective of regression to the mean.

But the relationship isn’t as simple for the counties with low rates. It seems that at least some of the counties with low rates are very populous but also have a younger population. Let’s investigate further.

  • How does the average per capita income of low-rate counties compare to the average of all counties?
  • How does the average median household income of low-rate counties compare to the average of all counties?
  • Make plots and linear models for RATE as a function of PCI and of MHI.
    • If there are outliers, then how should they be dealt with?

Challenge 1.4 (Solution)

mean(low$PCI, na.rm=TRUE) # average PCI of low rate counties
## [1] 25574.04
mean(cancer$PCI, na.rm=TRUE) # average PCI of all counties
## [1] 23616.34
mean(low$MHI, na.rm=TRUE) # average MHI of low rate counties
## [1] 50848.59
mean(cancer$MHI, na.rm=TRUE) # average MHI of all counties
## [1] 45925.73

Challenge 1.4 (Solution)

par(mfrow=c(1,2))
plot(RATE ~ PCI, data=cancer, main = "Cancer Rate by Per Capita Income",
     xlab = "Per Capita Income", ylab = "Cancer Cases per 100k")
plot(RATE ~ MHI, data=cancer, main = "Cancer Rate by Median Household Income",
     xlab = "Median Household Income", ylab = "Cancer Cases per 100k")
par(mfrow=c(1,1))

Challenge 1.4 (Solution)

Focus on counties with RATE <= 200.

par(mfrow=c(1,2))
plot(RATE ~ PCI, data=cancer[cancer$RATE <= 200,], main = "Cancer Rate (<=200) by Per Capita Income",
     xlab = "Per Capita Income", ylab = "Cancer Cases per 100k")
abline(lm(RATE ~ PCI, data=cancer[cancer$RATE <= 200,]), col="red")
plot(RATE ~ MHI, data=cancer[cancer$RATE <= 200,], main = "Cancer Rate (<=200) by Median Household Income",
     xlab = "Median Household Income", ylab = "Cancer Cases per 100k")
abline(lm(RATE ~ MHI, data=cancer[cancer$RATE <= 200,]), col="red")
par(mfrow=c(1,1))

Challenge 1.5 (Extra)

From an activity we did earlier this semester, we have reason to think that there is a relationship between age and rates of cancer.

  • Figure out how to use the OVER65 column to explore this relationship using linear regression (one possible starting point is sketched after this list).
    • What difficulties do you encounter?
    • What results do you find?
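
One possible starting point (a sketch, not an official solution; the P65 and lm.age names are introduced here) is to convert OVER65 to a proportion of the population and regress RATE on it. Note that the extreme rates in very small counties may dominate such a fit, which is one of the difficulties the challenge hints at:

cancer$P65 <- cancer$OVER65 / cancer$POP   # proportion of residents over 65
lm.age <- lm(RATE ~ P65, data = cancer)    # simple linear model of rate on age structure
summary(lm.age)
plot(RATE ~ P65, data = cancer, main = "Cancer Rate by Proportion over 65",
     xlab = "Proportion of Population over 65", ylab = "Cancer Cases per 100k")
abline(lm.age, col = "red")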

Learning Points (I)

Taking a step back, what have we learned?

  • Small samples have higher variability than larger samples.
  • All else equal, we would expect the small samples (e.g. counties with low populations) to be over-represented at the high end and the low end.
  • Hidden variable in this data set: young residents in high population counties.
  • Over the long run, though, individual outliers will revert toward the mean.

Learning Points (II)

If we are not careful, then variability in small samples and regression to the mean can have real-life consequences for important decisions!

Activity 2: Predicting stock prices

Predicting stock prices

Imagine that you have graduated from Yale-NUS College, you have successfully sold your soul to the highest bidder, and you are now working in your 80-hour-a-week dream job… as a securities analyst on Wall Street!

Your manager sends you an email with the subject line PICK ME SOME WINNERS, OR ELSE…

Attached to the email is a file entitled Stocks18Week1_no_split.csv

Predicting stock prices

Let’s import Stocks18Week1_no_split.csv as a data frame week into R, and then inspect:

week <- read.csv("Stocks18Week1_no_split.csv")
str(week)
## 'data.frame':    2308 obs. of  6 variables:
##  $ Symbol   : chr  "A" "AA" "AAC" "AAN" ...
##  $ X1.Jan.18: num  67 53.9 9 39.9 99.7 ...
##  $ X2.Jan.18: num  67.6 55.17 9.27 39.47 106.09 ...
##  $ X3.Jan.18: num  69.32 54.5 9.25 39.67 107.05 ...
##  $ X4.Jan.18: num  68.8 54.7 9.32 39.83 111 ...
##  $ X5.Jan.18: num  69.9 54.09 9.12 39.94 112.18 ...

The file contains data on the end-of-day stock prices for companies listed on the New York Stock Exchange (NYSE) for the first week of 2018.

Predicting stock prices

## 'data.frame':    2308 obs. of  6 variables:
##  $ Symbol   : chr  "A" "AA" "AAC" "AAN" ...
##  $ X1.Jan.18: num  67 53.9 9 39.9 99.7 ...
##  $ X2.Jan.18: num  67.6 55.17 9.27 39.47 106.09 ...
##  $ X3.Jan.18: num  69.32 54.5 9.25 39.67 107.05 ...
##  $ X4.Jan.18: num  68.8 54.7 9.32 39.83 111 ...
##  $ X5.Jan.18: num  69.9 54.09 9.12 39.94 112.18 ...

The columns contain the following data:

  • Symbol is the stock symbol: an abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market.
  • X1.Jan.18 is the closing price on day 1 of the week.
  • X2.Jan.18 is the closing price on day 2 of the week.

etc.

Challenge 2.1

In groups, take a minute to discuss how you might use the data in the file to choose stocks to buy.

Working together with your peers, then do the following in R:

  • Add a new variable / column to week
    • Call this variable change
    • Calculate percentage change in the stock price over the first week
    • Store this in change
  • Explore this new variable change
    • Calculate summary statistics (e.g. summary())
    • Visualize the data (e.g. hist())

Challenge 2.1 (Solution)

Calculate the percentage change in stock price over the first week. Save as change and append to week.

week$change <- 100 * (week$X5.Jan.18 - week$X1.Jan.18) / week$X1.Jan.18

Calculate summary statistics for change to get a sense of the weekly variation:

summary(week$change)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -17.6507  -0.2891   1.2652   1.9400   3.5369  32.8588

Based on the summary statistics, it looks like there are some big winners (e.g. +32.9%), but also some big losers (e.g. -17.7%).

Challenge 2.1 (Solution)

Visualising the data shows that there are many outliers:

par(mfrow=c(1,2))
hist(week$change, freq=FALSE, col = "lightgreen", main = "Histogram of Change in Stock Price",
     xlab = "Change in Stock Price (%) ")
boxplot(week$change, col="lightgreen", main = "Boxplot of Change in Stock Price",
        ylab = "Change in Stock Price (%)")
par(mfrow=c(1,1))

Challenge 2.2

So there is plenty of variation in the performance of stocks. Why not just identify the best performers and pick one?

Working together with your peers, do the following in R:

  • Select the stocks that improved the most and declined the most in the first week.
    • Select those that gained at least 25% in the first week.
    • Select those that lost at least 10% in the first week.
    • Take a minute to find out a little bit about those companies.

Once you have done so, it is your turn to pick a winner in order to save your dream job and keep living the dream!

Choose from among the best performers and let the instructor know your choice. The instructor gets to invest in the top performer.

Challenge 2.2 (Solution)

best <- week[week$change >= 25,] # subset to get best performers
best
##      Symbol X1.Jan.18 X2.Jan.18 X3.Jan.18 X4.Jan.18 X5.Jan.18   change
## 392     CGA      1.24      1.25      1.25      1.44      1.60 29.03226
## 1178    KEG     11.79     12.36     13.63     14.89     14.79 25.44529
## 1594    OMF     25.99     26.30     26.64     34.40     34.53 32.85879
## 1769   RENN     10.39     12.43     18.32     15.19     12.99 25.02406
## 2272    WTI      3.31      3.72      4.16      4.24      4.36 31.72205
worst <- week[week$change <= -10,] # subset to get worst performers
worst
##      Symbol X1.Jan.18 X2.Jan.18 X3.Jan.18 X4.Jan.18 X5.Jan.18    change
## 353    CATO     15.92     16.23     15.27     13.29     13.11 -17.65075
## 355     CBB     20.85     21.15     20.10     19.95     18.20 -12.70983
## 381    CEIX     39.51     38.44     37.01     35.71     35.00 -11.41483
## 1232     LB     60.22     59.89     58.16     51.00     50.36 -16.37330
## 1875   SFUN      5.58      5.33      5.17      4.89      4.81 -13.79928
## 2192    VOC      5.92      5.80      5.68      5.40      5.28 -10.81081

Challenge 2.3

Now, take a look at the data for the entire year. Did you pick a big winner?

year <- read.csv("Stocks18_no_split.csv")

  • Explore the structure of the data set to understand the column titles.
  • Add a new variable / column to year.
    • Call this variable rest.of.year.change.
    • Calculate percentage change in the stock price over the rest of the year.
    • Store this in rest.of.year.change.
  • How did your company do? Did you pick a big winner?
  • Make a plot showing how the price changed over the year (hint: use trans_Stocks18_no_split.csv).
  • On reflection, was choosing the best performing stocks from the first week a good strategy?

Challenge 2.3 (Solution)

Using colnames(year) or str(year), we find that the closing prices for later dates in the year appear in columns with names like X8.Jan.18 and X23.Nov.18.

Why are some dates missing? The NYSE is closed on weekends and market holidays, so only trading days appear in the data.

The first date of week 2 and the last date of the year are:

colnames(year)[7]; colnames(year)[ncol(year)]
## [1] "X8.Jan.18"
## [1] "X31.Dec.18"

Note: You may be wondering about the X prefix on what is otherwise a standard way of writing dates. Names that begin with a number are not syntactically valid in R, so even if the column names in the CSV file do not start with an X, read.csv adds one when it creates the data frame.
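
If you want to confirm this behaviour yourself (a sketch; year_raw is a name introduced here), you can suppress the renaming with check.names = FALSE, although column names that are not syntactically valid then need backticks to access, so the default is usually more convenient:

year_raw <- read.csv("Stocks18_no_split.csv", check.names = FALSE)   # keep headers exactly as in the file
colnames(year_raw)[1:7]   # compare with colnames(year): no X prefix is added here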

Challenge 2.3 (Solution)

Let’s calculate rest.of.year.change:

year$rest.of.year.change <- 100 * (year$X31.Dec.18 - year$X8.Jan.18) / year$X8.Jan.18

Now let’s check to see how big of a winner we picked!

year$rest.of.year.change[year$Symbol == "OMF"]
## [1] -28.39033

You did not pick a big winner but instead have lost your proverbial shirt on a bad bet. Looks like the invisible hand of the market has crushed your dreams.

Challenge 2.3 (Solution)

Let’s see what happened to our big winner over the year by importing and inspecting a transposed version of the data frame year:

trans_year <- read.csv(file = "trans_Stocks18_no_split.csv", header=TRUE, row.names = 1)

Inspect the structure. Notice DOY which gives the day of the year (in trading terms):

str(trans_year)
## 'data.frame':    261 obs. of  7 variables:
##  $ DATE: chr  "X1.Jan.18" "X2.Jan.18" "X3.Jan.18" "X4.Jan.18" ...
##  $ DOY : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CGA : num  1.24 1.25 1.25 1.44 1.6 1.51 1.43 1.48 1.53 1.5 ...
##  $ KEG : num  11.8 12.4 13.6 14.9 14.8 ...
##  $ OMF : num  26 26.3 26.6 34.4 34.5 ...
##  $ RENN: num  10.4 12.4 18.3 15.2 13 ...
##  $ WTI : num  3.31 3.72 4.16 4.24 4.36 4.46 4.31 4.36 4.8 4.88 ...

Find your chosen company’s ticker symbol among the column names, and create a line graph by DOY.

Challenge 2.3 (Solution)

Create a line graph of the change in the company’s stock price by the day of year (DOY):

trans_year$OMF.index <- 100 * (trans_year$OMF / trans_year$OMF[1]) - 100
plot(OMF.index ~ DOY, data = trans_year, type ="l", col="red",
     main = "Percentage Change in Stock Price relative to January 1 by Day of the Year",
     xlab = "Day of the Year", ylab = "Percentage Change (%)")
abline(h=0, lty=3)

Challenge 2.4

Finally, let’s consider whether short term performance more generally predicts longer term performance.

Working together with your peers, do the following in R:

  • Plot rest of year change as a function of first week change.
  • Compute the correlation between these two variables.
  • Fit a linear model and add the line to your plot.
  • What do you think of the strategy of investing based on short term performance?

Challenge 2.4 (Solution)

Append change in the first week to year and then plot:

year$first.week.change <- 100 * (year$X5.Jan.18 - year$X1.Jan.18) / year$X1.Jan.18
plot(rest.of.year.change ~ first.week.change, data=year,
     main = "Rest of Year Change by First Week Change",
     ylab = "Rest of Year Change", xlab = "First Week Change")

Challenge 2.4 (Solution)

Now, let’s check the correlation and whether investing based upon short term performance is a good idea more generally.

cor(year$rest.of.year.change, year$first.week.change, use="complete.obs")
## [1] -0.1667653
lm.first.week <- lm(rest.of.year.change ~ first.week.change, data = year)
lm.first.week
## 
## Call:
## lm(formula = rest.of.year.change ~ first.week.change, data = year)
## 
## Coefficients:
##       (Intercept)  first.week.change  
##           -14.407             -1.141

Challenge 2.4 (Solution)

plot(rest.of.year.change ~ first.week.change, data = year,
     main = "Rest of Year Change by First Week Change",
     ylab = "Rest of Year Change", xlab = "First Week Change")
abline(lm.first.week, col="red")

Challenge 2.4 (Solution)

Zooming in on the bulk of the data, we can see from the plot that there is only a weak, slightly negative relationship.

plot(rest.of.year.change ~ first.week.change, data = year, ylim = c(-70,70),
     main = "Rest of Year Change by First Week Change", ylab = "Rest of Year Change", xlab = "First Week Change")
abline(lm.first.week, col="red")

Challenge 2.5 (Extra)

We found a slight negative correlation between the first week change and the change in the rest of the year, and the scatter plot looks largely random.

cor(year$rest.of.year.change, year$first.week.change, use="complete.obs")
## [1] -0.1667653

It might be that this slight negative correlation represents the change from the first week regressing to the mean. If that were the case, we would expect there to be (almost) zero correlation between the first week change and the change over the whole year including the first week.

Investigate whether there is correlation between the first week change and the change over the whole year.

Challenge 2.5

We know the last date of the year is X31.Dec.18. The first date of week 1 is:

colnames(year)[2]
## [1] "X1.Jan.18"
year$annual.change <- 100 * (year$X31.Dec.18 - year$X1.Jan.18) / year$X1.Jan.18
cor(year$annual.change, year$first.week.change, use="complete.obs")
## [1] -0.06025432

Almost zero correlation. Over the year, the short term performance has regressed to the mean.

Challenge 2.5

plot(annual.change ~ first.week.change, data = year, ylim = c(-70,70),
     main = "Annual Change by First Week Change", ylab = "Annual Change", xlab = "First Week Change")
abline(lm(annual.change ~ first.week.change, data = year), col="red")

Learning Points

Short-term performance is largely a matter of luck and is not strongly predictive of future performance. Moreover, “big” winners in the short term tend to revert toward their long-term performance.

  • Early leaders often end up falling behind as things revert toward the “fundamentals.”
  • Be careful when extrapolating linearly in time series!
  • Lacking some explanatory theory, past performance is not necessarily a good predictor of future performance.
    • On average, if a thousand people each flip a fair coin 10 times, we would expect about one of them to get heads on all 10 flips purely by chance (a quick simulation sketch follows this list).
    • Without a reason to think that there is some underlying factor influencing the flips, we would not expect that person to continue getting heads in future flips.
  • Fund managers rarely consistently outperform the market. Invest wisely!
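
A minimal simulation sketch of the coin-flip point above (illustrative only; the seed and object names are chosen for this example):

set.seed(1)                                               # arbitrary seed for reproducibility
flips <- matrix(rbinom(1000 * 10, size = 1, prob = 0.5),  # 1000 people x 10 flips (1 = heads)
                nrow = 1000)
sum(rowSums(flips) == 10)                                 # how many flipped heads all 10 times?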

Recap

Recap

To recap, the learning goals for today’s class were as follows:

  • Understand regression to the mean
    • We learned about this by examining whether the short term performance of stocks predicts their long term performance
    • We found that it does not; top performers in the short term eventually fell behind as performance tends to revert to their “fundamentals”
  • Think about variability in large versus small samples
    • We examined cancer rates across US counties and found that the highest cancer rates are in the smallest counties
    • Think back to the activity “Gender on a Normal Curve” and how a small deviation in the number of women (+10 or -10) has a much bigger impact on the female proportion in a small subzone than in a large one
  • Practice fitting and assessing linear models