2021-09-09
The Urban Redevelopment Authority divides Singapore into 231 subzones for purposes of statistics and urban planning.
The objective of this activity is to determine whether we can apply the normal model to the proportion of women in the subzones.
Import the file sg_pop.csv on Canvas (under Files → Week05_Lesson1) as data frame sg_pop. It contains the number of female and male residents in each subzone during the year 2015.
Append a column fem_prop that contains the proportion of residents that are female.
Calculate the sample mean and standard deviation of fem_prop.
Plot a histogram of fem_prop. Does the distribution pass the criteria for modelling as a normal distribution?
sg_pop <- read.csv("sg_pop.csv")
# Append column containing the proportion of females.
sg_pop$fem_prop <- sg_pop$female / (sg_pop$female + sg_pop$male)
# Calculate the sample mean and standard deviation.
y_bar <- mean(sg_pop$fem_prop)
y_bar
## [1] 0.5118451
s <- sd(sg_pop$fem_prop)
s
## [1] 0.02826376
hist(sg_pop$fem_prop)
It is certainly unimodal and looks approximately symmetric. So we can try to use a normal model.
Let’s assume that fem_prop follows a normal model \(N(\mu, \sigma)\) with \(\mu = \bar{y}\) and \(\sigma = s\). That is, we assume that fem_prop follows \(N(0.5118, 0.0283)\).
Use R to answer the following questions.
# What percentage of a normal distribution N(y_bar, s) is below 0.45?
pnorm(0.45, mean = y_bar, sd = s) * 100
## [1] 1.432965
# What percentage of the subzones are below 0.45?
sum(sg_pop$fem_prop < 0.45) / nrow(sg_pop) * 100
## [1] 2.164502
# What percentage of a normal distribution N(y_bar, s) is above 0.55?
(1 - pnorm(0.55, mean = y_bar, sd = s)) * 100
## [1] 8.851473
# What percentage of the subzones are above 0.55?
sum(sg_pop$fem_prop > 0.55) / nrow(sg_pop) * 100
## [1] 2.597403
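The comparison above can also be read in terms of z-scores and expected counts. The following is a minimal sketch that hardcodes the sample mean and standard deviation printed earlier instead of reading sg_pop.csv; the names n_subzones, z_low, z_high and the expected_* variables are introduced here for illustration only.

```r
# Values copied from the output above (not recomputed from sg_pop.csv).
y_bar <- 0.5118451
s <- 0.02826376
n_subzones <- 231

# Standardise the two cut-offs.
z_low <- (0.45 - y_bar) / s
z_high <- (0.55 - y_bar) / s

# Tail percentages under the standard normal model N(0, 1).
pct_below <- pnorm(z_low) * 100
pct_above <- pnorm(z_high, lower.tail = FALSE) * 100

# Expected number of subzones in each tail if the model were exact.
expected_below <- n_subzones * pct_below / 100
expected_above <- n_subzones * pct_above / 100

round(c(z_low = z_low, z_high = z_high), 2)
round(c(expected_below = expected_below, expected_above = expected_above), 1)
```

The observed percentages correspond to 5 and 6 of the 231 subzones, whereas the model expects roughly 3 and 20. The model therefore slightly underestimates the lower tail at this cut-off and considerably overestimates the upper one.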
Make a histogram of fem_prop and overlay a normal distribution with the same mean and standard deviation. Vary the bin width to explore the data.
# Make a histogram and overlay a normal distribution N(y_bar, s).
hist(sg_pop$fem_prop,
     breaks = seq(0.24, 0.76, 0.01),
     freq = FALSE,
     col = "lightgreen",
     main = "Proportion of women in Singapore's subzones",
     xlab = "Proportion")
curve(dnorm(x, mean = y_bar, sd = s),
      from = min(sg_pop$fem_prop),
      to = max(sg_pop$fem_prop),
      add = TRUE,
      col = "red")
qqnorm(sg_pop$fem_prop)
The data are fatter in the tails than a normal model. We may want to remove the far outliers before approximating the data by the normal model.
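To see how a normal quantile plot reveals fat tails, it may help to build one by hand. This is a sketch on simulated data, not on sg_pop: the heavy-tailed sample x, its scale and the seed are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the simulated sample is reproducible

# Simulated heavy-tailed sample (Student's t with 3 degrees of freedom),
# rescaled to resemble proportions. These are NOT the sg_pop data.
x <- 0.5 + 0.02 * rt(231, df = 3)

# qqnorm() pairs the sorted data with standard normal quantiles.
n <- length(x)
theoretical <- qnorm(ppoints(n))
observed <- sort(x)

# Reproduce the plot by hand and add a reference line through the quartiles.
plot(theoretical, observed,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x, col = "red")
```

Points that fall below the line at the left end and above it at the right end indicate tails that are fatter than those of a normal model.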
# Information about the extremes of the distribution.
sg_pop[sg_pop$fem_prop < 0.4, ]
##         subzone female male  fem_prop
## 42  Clarke Quay     50   80 0.3846154
## 88 Kampong Glam     60  120 0.3333333
sg_pop[sg_pop$fem_prop > 0.6, ]
##                           subzone female male  fem_prop
## 65                      Gali Batu     80   50 0.6153846
## 129 National University of S'pore    260  100 0.7222222
# Remove outliers.
sg_trim <- sg_pop[sg_pop$fem_prop >= 0.4 & sg_pop$fem_prop <= 0.6, ]
# We must compute the mean and standard deviation of the trimmed
# distribution.
y_bar_trim <- mean(sg_trim$fem_prop)
s_trim <- sd(sg_trim$fem_prop)
hist(sg_trim$fem_prop,
     breaks = seq(0.38, 0.62, 0.01),
     freq = FALSE,
     col = "lightgreen",
     main = "Proportion of women in Singapore's subzones",
     xlab = "Proportion")
curve(dnorm(x, mean = y_bar_trim, sd = s_trim),
      from = min(sg_trim$fem_prop),
      to = max(sg_trim$fem_prop),
      add = TRUE,
      col = "red")
qqnorm(sg_trim$fem_prop)
Removing four outliers improves the fit, but the trimmed distribution is more left-skewed than a normal model. That does not make the normal model useless, only imperfect.
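Skewness can also be checked numerically. Base R has no skewness function, so the sketch below defines a hypothetical helper, sample_skewness(), using the moment-based formula \(m_3 / m_2^{3/2}\). The two short vectors are made up for illustration and are not taken from sg_trim.

```r
# Hypothetical helper: moment-based sample skewness m3 / m2^(3/2).
sample_skewness <- function(x) {
  m <- mean(x)
  m2 <- mean((x - m)^2)  # second central moment
  m3 <- mean((x - m)^3)  # third central moment
  m3 / m2^1.5
}

# Made-up proportions: one symmetric set, one with a long left tail.
symmetric <- c(0.46, 0.48, 0.50, 0.52, 0.54)
left_skewed <- c(0.40, 0.49, 0.51, 0.52, 0.53)

sample_skewness(symmetric)    # essentially zero
sample_skewness(left_skewed)  # negative, i.e. left-skewed
```

Applied to sg_trim$fem_prop, a clearly negative value would agree with the visual impression of left skew.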
Before removing outliers, we should have checked that it was a reasonable thing to do. The outliers might be errors, or they might be the most important features of the data set!
All the outliers had a very small population. Perhaps they are the only regions with small populations.
Append a column total, the number of residents in each subzone, to the data frame. Does this justify removing the outliers?
sg_pop$total <- sg_pop$female + sg_pop$male
hist(sg_pop$total, breaks = seq(0, 140000, 1000))
Many regions have a low population, and many of them have fewer than 1000 residents.
We can view sg_pop in RStudio’s built-in viewer. We can then sort subzones by their population from smallest to largest by clicking on total.
The four outliers are not even the least populated subzones, though all of them are among the 20 least populated. Small population alone is therefore a poor justification for removing only these four subzones.
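The same sorting can be done in code rather than by clicking in the viewer. The sketch below uses a toy data frame: its first four rows are the outliers printed earlier, while the rows named "Hypothetical A" to "Hypothetical C" are made-up placeholders, not real subzones. The columns pop_rank and outlier are introduced here for illustration.

```r
# Toy data frame: the first four rows are the outliers printed earlier;
# the remaining rows are made-up placeholders, not real subzones.
toy <- data.frame(
  subzone = c("Clarke Quay", "Kampong Glam", "Gali Batu",
              "National University of S'pore",
              "Hypothetical A", "Hypothetical B", "Hypothetical C"),
  female = c(50, 60, 80, 260, 20, 40, 5200),
  male   = c(80, 120, 50, 100, 20, 50, 4800)
)
toy$total <- toy$female + toy$male
toy$fem_prop <- toy$female / toy$total

# Sort from smallest to largest population.
toy_sorted <- toy[order(toy$total), ]

# Rank of each subzone by population (1 = smallest).
toy_sorted$pop_rank <- rank(toy_sorted$total, ties.method = "min")

# Flag the rows that fall outside the 0.4 to 0.6 band used for trimming.
toy_sorted$outlier <- toy_sorted$fem_prop < 0.4 | toy_sorted$fem_prop > 0.6
toy_sorted[, c("subzone", "total", "fem_prop", "pop_rank", "outlier")]
```

On the real data, replacing toy with sg_pop shows where the four flagged subzones sit among the least populated ones.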
National University of Singapore is not a place people tend to live permanently. Perhaps each of the outliers is a special zone with unusual land use.
Does this justify removing the outliers?
The available information suggests that our outliers are not residential areas where many people live but are used mainly for other purposes.
Since these are not residential areas, this seems like a good reason for removing these subzones.
But… there are other regions that should also be removed by the same reasoning.
We should always be transparent and consistent when applying rules we use to exclude observations from our analyses.
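One way to keep an exclusion rule transparent and consistent is to state it as code and apply it to every row, then report what was dropped. This is only a sketch: the toy data, the threshold min_pop and the choice of a population-based rule are assumptions for illustration, not the rule used in this activity.

```r
# Toy data; names and counts are made up for illustration.
toy <- data.frame(
  subzone = c("A", "B", "C", "D", "E"),
  female  = c(50, 2600, 40, 5200, 260),
  male    = c(80, 2400, 40, 4800, 100)
)
toy$total <- toy$female + toy$male
toy$fem_prop <- toy$female / toy$total

# Hypothetical rule: exclude every subzone with fewer than 500 residents,
# regardless of its proportion of women.
min_pop <- 500
excluded <- toy[toy$total < min_pop, ]
kept <- toy[toy$total >= min_pop, ]

# Report what was dropped so that readers can audit the decision.
cat("Excluded", nrow(excluded), "of", nrow(toy),
    "subzones with total <", min_pop, "\n")
excluded$subzone
```

Subzone C is excluded even though its proportion of women is exactly 0.5, because the rule depends only on population and not on whether the value looks extreme.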
When removing outliers,
Properly cleaning data takes a lot of work, but it is worth doing!