Gender on a normal curve?

2021-09-09

Objective

The proportion of women in Singapore’s subzones

The Urban Redevelopment Authority divides Singapore into 231 subzones for purposes of statistics and urban planning.

The objective of this activity is to determine whether we can apply the normal model to the proportion of women in the subzones.

Step 1: Proportion of female residents

Import the file sg_pop.csv from Canvas (under Files → Week05_Lesson1) as the data frame sg_pop. It contains the number of female and male residents in each subzone in 2015.
Append a column fem_prop that contains the proportion of residents that are female.
Calculate the mean \(\bar{y}\) and standard deviation \(s\) of fem_prop.
Plot a histogram of fem_prop. Does the distribution pass the criteria for modelling as a normal distribution?

Solution 1 / 2

sg_pop <- read.csv("sg_pop.csv")

# Append column containing the proportion of females.
sg_pop$fem_prop <- sg_pop$female / (sg_pop$female + sg_pop$male)

# Calculate the sample mean and standard deviation.
y_bar <- mean(sg_pop$fem_prop)
y_bar

## [1] 0.5118451

s <- sd(sg_pop$fem_prop)
s

## [1] 0.02826376

Solution 2 / 2

hist(sg_pop$fem_prop)

It is certainly unimodal and looks approximately symmetric. So we can try to use a normal model.

Step 2: Compare normal model with data

Let’s assume that fem_prop follows a normal model \(N(\mu, \sigma)\) with \(\mu = \bar{y}\) and \(\sigma = s\). That is, we assume that

the mean of the normal model matches the observed mean and
the standard deviation of the normal model matches the observed standard deviation.
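One consequence of this assumption is the 68-95-99.7 rule: the model places roughly 68%, 95% and 99.7% of subzones within one, two and three standard deviations of the mean. The sketch below works out those intervals from the values of \(\bar{y}\) and \(s\) printed in Step 1; the variable names k and intervals are chosen here for illustration only.

```r
# Sample mean and standard deviation as printed in Step 1.
y_bar <- 0.5118451
s <- 0.02826376

# Intervals y_bar +/- k * s for k = 1, 2, 3, together with the
# percentage of a normal distribution that falls inside each one.
k <- 1:3
intervals <- data.frame(
  k = k,
  lower = y_bar - k * s,
  upper = y_bar + k * s,
  model_pct = (pnorm(k) - pnorm(-k)) * 100
)
intervals
```

If the normal model is a good description, about 95% of the observed values of fem_prop should lie between the lower and upper limits in the row for k = 2.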

Use R to answer the following questions.

What percentage of subzones has a proportion of women \(<0.45\)
1. according to the normal model,
2. according to the data?
What are the predicted and observed percentages of subzones with a female proportion \(>0.55\)?

Solution 1 / 2

# What percentage of a normal distribution N(y_bar, s) is below 0.45?
pnorm(0.45, mean = y_bar, sd = s) * 100

## [1] 1.432965

# What percentage of the subzones are below 0.45?
sum(sg_pop$fem_prop < 0.45) / nrow(sg_pop) * 100

## [1] 2.164502

Solution 2 / 2

# What percentage of a normal distribution N(y_bar, s) is above 0.55?
(1 - pnorm(0.55, mean = y_bar, sd = s)) * 100

## [1] 8.851473

# What percentage of the subzones are above 0.55?
sum(sg_pop$fem_prop > 0.55) / nrow(sg_pop) * 100

## [1] 2.597403
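The model percentages can also be obtained by standardising the cut-offs first, which shows how far each one lies from the mean in units of standard deviations. The following is a minimal sketch that reuses the values of \(\bar{y}\) and \(s\) printed in Step 1; the names z_low, z_high, pct_below and pct_above are introduced here for illustration.

```r
# Sample mean and standard deviation as printed in Step 1.
y_bar <- 0.5118451
s <- 0.02826376

# Standardise the two cut-offs: z = (y - mean) / sd.
z_low <- (0.45 - y_bar) / s
z_high <- (0.55 - y_bar) / s

# Tail areas under the standard normal N(0, 1), in percent.
pct_below <- pnorm(z_low) * 100
pct_above <- pnorm(z_high, lower.tail = FALSE) * 100

round(c(z_low = z_low, z_high = z_high), 2)
round(c(pct_below = pct_below, pct_above = pct_above), 2)
```

The cut-off 0.45 lies about 2.19 standard deviations below the mean and 0.55 about 1.35 above it, so the model expects far more subzones above 0.55 than below 0.45. The observed percentages are almost equal, which is a first hint that the model does not describe the tails well.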

Step 3: Is the normal model suitable?

Make a histogram of fem_prop and overlay a normal distribution with the same mean and standard deviation. Vary the bin width to explore the data.
Make a “Normal Probability Plot”.
Judging from the plots, is the normal model suitable for the data? Why or why not?

Solution 1 / 3

# Make a histogram and overlay a normal distribution N(y_bar, s).
hist(sg_pop$fem_prop,
     breaks = seq(0.24, 0.76, 0.01),
     freq = FALSE,
     col = "lightgreen",
     main = "Proportion of women in Singapore's subzones",
     xlab = "Proportion")
curve(dnorm(x, mean = y_bar, sd = s),
      from = min(sg_pop$fem_prop),
      to = max(sg_pop$fem_prop),
      add = TRUE,
      col = "red")

Solution 2 / 3

Solution 3 / 3

qqnorm(sg_pop$fem_prop)

The data have fatter tails than a normal model predicts. We may want to remove the far outliers before approximating the data by the normal model.
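To see what fat tails look like in a normal probability plot, it can help to compare a sample that really is normal with one that is not. The sketch below uses simulated data only (a normal sample and a t-distributed sample with 3 degrees of freedom, both of size 231); it is not the subzone data, and the added reference line from qqline() is an optional extra.

```r
# Simulated data only: these are NOT the subzone proportions.
set.seed(1)
normal_sample <- rnorm(231)
heavy_sample <- rt(231, df = 3)  # t distribution with 3 d.f. has fat tails

# Draw the two normal probability plots side by side.
old_par <- par(mfrow = c(1, 2))
qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample, col = "red")
qqnorm(heavy_sample, main = "Heavy-tailed sample")
qqline(heavy_sample, col = "red")
par(old_par)
```

Points from a normal sample stay close to the reference line. In a fat-tailed sample the leftmost points typically fall below the line and the rightmost points above it.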

Step 4: Remove outliers

Which subzones have a female proportion below 0.4? Which subzones have a female proportion above 0.6?
Remove these subzones from the data. Does the normal model fit the remaining data better?

Solution 1 / 5

# Information about the extremes of the distribution.
sg_pop[sg_pop$fem_prop < 0.4, ]

##         subzone female male  fem_prop
## 42  Clarke Quay     50   80 0.3846154
## 88 Kampong Glam     60  120 0.3333333

sg_pop[sg_pop$fem_prop > 0.6, ]

##                           subzone female male  fem_prop
## 65                      Gali Batu     80   50 0.6153846
## 129 National University of S'pore    260  100 0.7222222

Solution 2 / 5

# Remove outliers.
sg_trim <- sg_pop[sg_pop$fem_prop >= 0.4 & sg_pop$fem_prop <= 0.6, ]

# We must compute the mean and standard deviation of the trimmed
# distribution.
y_bar_trim <- mean(sg_trim$fem_prop)
s_trim <- sd(sg_trim$fem_prop)
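To repeat the Step 2 comparison on the trimmed data, one option is to wrap the two calculations in a small function. The sketch below is one way to do that; the function name tail_below() and the five-value vector are hypothetical and only illustrate the calculation.

```r
# Hypothetical helper: percentage of values below `cutoff` according to
# (a) a normal model with the sample's own mean and sd, and (b) the data.
tail_below <- function(x, cutoff) {
  c(model = pnorm(cutoff, mean = mean(x), sd = sd(x)) * 100,
    observed = mean(x < cutoff) * 100)
}

# Made-up proportions, NOT taken from sg_pop.csv.
x <- c(0.40, 0.45, 0.50, 0.55, 0.60)
result <- tail_below(x, cutoff = 0.45)
result
```

With the real data, the call would be tail_below(sg_trim$fem_prop, 0.45).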

Solution 3 / 5

hist(sg_trim$fem_prop,
     breaks = seq(0.38, 0.62, 0.01),
     freq = FALSE,
     col = "lightgreen",
     main = "Proportion of women in Singapore's subzones",
     xlab = "Proportion")
curve(dnorm(x, mean = y_bar_trim, sd = s_trim),
      from = min(sg_trim$fem_prop),
      to = max(sg_trim$fem_prop),
      add = TRUE,
      col = "red")

Solution 4 / 5

Solution 5 / 5

qqnorm(sg_trim$fem_prop)

Conclusion

Removing the four outliers improves the fit, but the trimmed distribution is more left-skewed than a normal model. That does not make the normal model useless, only imperfect.

Step 5: Why remove outliers?

Before removing outliers, we should have checked that this was a reasonable thing to do. The outliers might be errors, or they might be the most important features of the data set!

All the outliers had a very small population. Perhaps they are the only regions with small populations.

Add a column total to the data frame
Use a histogram to investigate whether our four subzones are the only ones with small populations
Use RStudio’s built-in viewer and its ability to sort rows to further investigate

Does this justify removing the outliers?

Solution 1 / 2

sg_pop$total <- sg_pop$female + sg_pop$male
hist(sg_pop$total, breaks = seq(0, 140000, 1000))

There are many subzones with a low population, and many of them even have fewer than 1000 residents.

Solution 2 / 2

We can view sg_pop in RStudio’s built-in viewer. We can then sort subzones by their population from smallest to largest by clicking on total.

These four subzones are not even the least populated, although all of them are among the 20 smallest. Small population alone is therefore a poor justification for removing only these outliers.
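The same sorting can be done in code instead of the viewer. The sketch below uses a toy data frame: the first four rows are copied from the Step 4 output, while "Hypothetical A" and "Hypothetical B" are made-up rows added only so that there is something to rank against. The names toy, toy_sorted and pop_rank are illustrative.

```r
# Four rows copied from the Step 4 output; the last two rows are
# hypothetical and exist only to give the ranking some context.
toy <- data.frame(
  subzone = c("Clarke Quay", "Kampong Glam", "Gali Batu",
              "National University of S'pore",
              "Hypothetical A", "Hypothetical B"),
  female = c(50, 60, 80, 260, 20, 5000),
  male = c(80, 120, 50, 100, 10, 4800)
)
toy$total <- toy$female + toy$male

# Sort from smallest to largest population.
toy_sorted <- toy[order(toy$total), ]
toy_sorted

# Rank of each subzone by population (1 = smallest; ties share a rank).
toy$pop_rank <- rank(toy$total, ties.method = "min")
toy[, c("subzone", "total", "pop_rank")]
```

Applied to the full sg_pop data frame, the pop_rank column would show directly where each outlier sits among all subzones.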

Step 5 second attempt: Why remove outliers?

National University of Singapore is not a place people tend to live permanently. Perhaps each of the outliers is a special zone with unusual land use.

Use https://www.citypopulation.de/en/singapore/admin/ to investigate the land use in each of these regions. Our data appear to match the mid-2015 estimate given there.

Does this justify removing the outliers?

Solution 1 / 2

The information suggests that our outliers are not residential areas but are mainly used for other purposes.

Gali Batu consists of a train depot, a disused quarry, a small industrial estate and wild land. Very few people live there.
NUS is a student zone. No one lives there permanently.
Clarke Quay is a tourist and nightlife location, not a residential area.
Kampong Glam is also a tourist area, not a residential one.

Since these are not residential areas, this seems like a good reason for removing these subzones.

Solution 2 / 2

But… there are other regions that should also be removed by the same reasoning:

One North is an industrial estate
Fort Canning is a park
Western Water Catchment contains a nature reserve, a university (NTU), cemeteries, aquaculture, reservoirs and a military base
Port is a port!

We should always be transparent and consistent when applying rules we use to exclude observations from our analyses.
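One way to keep an exclusion rule transparent is to write it down once as an explicit list and apply it to every row, while recording what was removed. The sketch below is only an illustration: exclude_subzones() is a hypothetical helper, the toy data frame holds the four rows from the Step 4 output plus two made-up residential rows, and the spelling of the other subzone names in sg_pop.csv is an assumption.

```r
# Subzones named in Step 5 as not primarily residential.
# The exact spelling used in sg_pop.csv is an assumption.
non_residential <- c("Gali Batu", "National University of S'pore",
                     "Clarke Quay", "Kampong Glam",
                     "One North", "Fort Canning",
                     "Western Water Catchment", "Port")

# Hypothetical helper: apply one rule to all rows and report the outcome.
exclude_subzones <- function(df, excluded) {
  keep <- !(df$subzone %in% excluded)
  list(kept = df[keep, ],
       removed = df$subzone[!keep],
       not_found = setdiff(excluded, df$subzone))
}

# Toy data: four rows from the Step 4 output, two made-up rows.
toy <- data.frame(
  subzone = c("Clarke Quay", "Kampong Glam", "Gali Batu",
              "National University of S'pore",
              "Hypothetical Estate 1", "Hypothetical Estate 2"),
  female = c(50, 60, 80, 260, 5000, 7000),
  male = c(80, 120, 50, 100, 4800, 6900)
)

result <- exclude_subzones(toy, non_residential)
result$removed    # rows dropped by the rule
result$not_found  # listed names that did not match any row
```

Checking not_found guards against a rule that silently fails because of a spelling mismatch, and removed gives a record that can be reported alongside the analysis.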

Conclusion

When removing outliers,

justify the removal on the basis of the data, not on trying to get a good fit to a preferred model,
consider that it might be appropriate to remove other data points for the same reasons as the outliers,
be prepared to seek contextual information beyond your data set to make an informed decision about which data points to discard.

Properly cleaning data takes a lot of work, but it is worth doing!