2021-08-16
Do you remember the practice RAT from last week?
Let’s use data and R to find out which map is correct.
In the QR folder that you made when following the video tutorials, start a new R project called population_by_continent.
Download country_info.csv from Canvas (under Files → Week02_Lesson1). The data are from the United Nations Population Division.
Move the file to the population_by_continent project folder.
Create a new R script called population.R.
What are the dimensions of country_info?
What are the column names of country_info?
country_info <- read.csv("country_info.csv")
dim(country_info)
## [1] 193 4
There are 193 rows and 4 columns.
names(country_info)
## [1] "country" "continent" "pop" "electr_pct"
head(country_info)
##    country continent       pop electr_pct
## 1  Burundi    Africa  11890784       7.59
## 2  Comoros    Africa    869601      77.80
## 3 Djibouti    Africa    988000      51.80
## 4  Eritrea    Africa   3546421      46.70
## 5 Ethiopia    Africa 114963588      42.90
## 6    Kenya    Africa  53771296      56.00
unique(country_info$continent)
## [1] "Africa" "Asia" "Americas" "Oceania" "Europe"
Let’s first break down the challenge into small steps. Afterwards we’ll show you how to do it in one giant leap.
We create a subset of country_info that contains only European countries.
europe_info <- country_info[country_info$continent == "Europe", ]
Let’s check what’s inside europe_info.
nrow(europe_info)
## [1] 43
From the 193 rows in country_info we have only kept 43, namely one row for each European country.
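To see why this kind of subsetting works, it can help to look at the intermediate logical vector. The sketch below uses a small made-up data frame; the name toy_info and all of its values are assumptions for illustration, not the UN data.

```r
# Toy data frame; names and values are made up for illustration.
toy_info <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("Europe", "Asia", "Europe", "Africa"),
  pop       = c(10, 20, 30, 40)
)

# The comparison returns one TRUE/FALSE value per row.
is_europe <- toy_info$continent == "Europe"
is_europe

# Using the logical vector as the row index keeps only the TRUE rows.
# The empty slot after the comma means "keep all columns".
toy_europe <- toy_info[is_europe, ]
toy_europe
nrow(toy_europe)
```

The line that creates europe_info does the same thing in one step, without storing the logical vector under its own name.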
What’s in the first few rows of europe_info?
head(europe_info)
##                 country continent      pop electr_pct
## 149             Belarus    Europe  9449323        100
## 150            Bulgaria    Europe  6948445        100
## 151             Czechia    Europe 10708981        100
## 152             Hungary    Europe  9660351        100
## 153              Poland    Europe 37846611        100
## 154 Republic of Moldova    Europe  4033963        100
We find Europe’s total population by summing over the column named pop.
sum(europe_info$pop)
## [1] 747293775
Hence, there are approximately 747 million Europeans.
Find the population in
You can, of course, follow the same pattern as before and replace "Europe" by "Africa", "Asia" etc. It’s a perfectly fine solution.
For the sake of variety, here’s a shorter alternative.
sum(country_info$pop[country_info$continent == "Africa"])
## [1] 1337666440
sum(country_info$pop[country_info$continent == "Asia"])
## [1] 4609091684
sum(country_info$pop[country_info$continent == "Americas"])
## [1] 1018121141
And so on for the other continents.
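If typing one line per continent feels repetitive, the same pattern can be applied to every continent in turn. The following is only a sketch on made-up data (toy_info and its values are assumptions), using sapply() from base R to repeat the calculation for each continent name.

```r
# Toy data frame; names and values are made up for illustration.
toy_info <- data.frame(
  continent = c("Europe", "Asia", "Europe", "Africa", "Asia"),
  pop       = c(10, 20, 30, 40, 50)
)

# All distinct continent names, in order of first appearance.
continents <- unique(toy_info$continent)

# For each continent name, sum the pop values of the matching rows.
# sapply() returns a vector named after the continents.
pop_totals <- sapply(continents, function(cont) {
  sum(toy_info$pop[toy_info$continent == cont])
})
pop_totals
```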
R can do the entire calculation in a single line with aggregate(). We haven’t seen aggregate() yet, but here is a flavour of things to come.
pop_by_cont <- aggregate(pop ~ continent, data = country_info, FUN = sum)
pop_by_cont
##   continent        pop
## 1    Africa 1337666440
## 2  Americas 1018121141
## 3      Asia 4609091684
## 4    Europe  747293775
## 5   Oceania   41798096
The first argument of aggregate(), the formula pop ~ continent, treats pop as a function of continent. The second argument specifies that pop and continent are columns in the data frame country_info. The third argument tells aggregate() to apply the function sum() to the populations within each continent.
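As a rough check that aggregate() really does the same job as the step-by-step approach, here is a minimal sketch on made-up data. The data frame toy_info and its values are assumptions for illustration only.

```r
# Toy data frame; names and values are made up for illustration.
toy_info <- data.frame(
  continent = c("Europe", "Asia", "Europe", "Africa", "Asia"),
  pop       = c(10, 20, 30, 40, 50)
)

# One-line version: sum pop within each continent.
toy_by_cont <- aggregate(pop ~ continent, data = toy_info, FUN = sum)
toy_by_cont

# Step-by-step version for one continent, as a cross-check.
europe_total <- sum(toy_info$pop[toy_info$continent == "Europe"])

# TRUE if both approaches agree for Europe.
europe_total == toy_by_cont$pop[toy_by_cont$continent == "Europe"]
```

The result of aggregate() is itself a data frame with one row per continent, which is why its columns can be passed straight to other functions.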
barplot(pop_by_cont$pop, names.arg = pop_by_cont$continent)
Let’s return to the RAT question.
There are roughly seven billion people in the world today. Which map shows where people live? (Each figure represents 1 billion people.)
Americas 1, Europe 1, Africa 1, Asia 4
Looking at the R output above, the correct answer is A.
Let’s use the information in country_info.csv to answer another question from the practice RAT.
What percentage of the world population has some access to electricity?
Start a new R project called access_to_electricity.
Copy country_info.csv to the new project directory.
Create a new R script called access_to_electricity.R.
The relevant columns are:
pop: population,
electr_pct: the percentage of the country’s population with access to electricity. Data from the World Bank.
Hint: you may find it helpful to add a column to the data frame that contains the number of people in each country with access to electricity.
# Import the spreadsheet.
country_info <- read.csv("country_info.csv")
# Add a column called pop_with_electricity.
country_info$pop_with_electricity <- country_info$pop * country_info$electr_pct / 100
# percentage of population with access to electricity =
#   100 * total_population_with_access_to_electricity / total_population
total_pop <- sum(country_info$pop)
total_pop_with_electricity <- sum(country_info$pop_with_electricity)
# Here's the percentage we're looking for.
100 * total_pop_with_electricity / total_pop
## [1] 86.93472
# Here's a common mistake:
sum(country_info$electr_pct) / nrow(country_info)
## [1] 82.29984
This calculation takes a simple average of electr_pct, which implicitly assumes that all countries have the same population. Instead, the average should be weighted by each country’s population, as in our solution.
\(\tiny{\%\ pop\ have\ access\ to\ electricity =\dfrac{pop\ have\ access\ to\ electricity}{total\ world\ pop}}\)
\(\tiny{=\dfrac{pop\ in\ country\ A\ have\ access+pop\ in\ country\ B\ have\ access+...}{total\ world\ pop}}\)
\(\tiny{=\dfrac{\%\ pop\ in\ A\ have\ access \times pop\ in\ A+\%\ pop\ in\ B\ have\ access \times pop\ in\ B+...}{total\ world\ pop}}\)
\(\tiny{=\%\ pop\ in\ A\ have\ access \times \dfrac{pop\ in\ A}{total\ world\ pop}+\%\ pop\ in\ B\ have\ access \times \dfrac{pop\ in\ B}{total\ world\ pop}+...}\)
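The difference between the two averages is easiest to see with two countries of very different size. The sketch below uses made-up numbers (toy_info and its values are assumptions) and also shows weighted.mean() from base R, which computes the same population-weighted average as the formula above.

```r
# Toy data frame; names and values are made up for illustration.
toy_info <- data.frame(
  country    = c("A", "B"),
  pop        = c(1000, 9000),
  electr_pct = c(20, 100)
)

# Unweighted mean of the percentages (the common mistake): 60
simple_avg <- mean(toy_info$electr_pct)
simple_avg

# Population-weighted mean, written out as in the formula above: 92
weighted_by_hand <- sum(toy_info$electr_pct * toy_info$pop) / sum(toy_info$pop)
weighted_by_hand

# The same quantity using weighted.mean().
weighted_builtin <- weighted.mean(toy_info$electr_pct, w = toy_info$pop)
weighted_builtin
```

Country B holds 90% of the toy population, so its 100% access rate dominates the weighted result.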
Here’s another question from the practice RAT.
In low-income countries across the world, how many girls complete primary school?
Let’s find the answer with R.
Start a new R project called primary_education.
Download primary_education.csv from Canvas (under Files → Week02_Lesson1). The data (female primary school completion rate and classification of countries by income) are from the World Bank.
Move the file to the primary_education project folder.
Create a new R script called primary_education.R.
Hint: you may find it helpful to create a subset that only contains low-income countries.
primary_education <- read.csv("primary_education.csv")
dim(primary_education)
## [1] 117 4
There are fewer rows (i.e. fewer countries) in primary_education than in our earlier data frame country_info because, for many countries, the World Bank has no information about primary school enrolment. We have removed these countries from primary_education to simplify this activity.
names(primary_education)
## [1] "country" "income" "completion_f_pct" "f_age_last_grade"
head(primary_education)
##      country         income completion_f_pct f_age_last_grade
## 1    Albania   upper middle        104.64122          18102.0
## 2  Argentina not classified        102.76964         349201.0
## 3    Armenia   lower middle         91.29588          16380.6
## 4    Austria           high         99.59905          40829.6
## 5 Azerbaijan   upper middle        109.72021          59122.8
## 6    Burundi            low         74.52029         127731.8
You may be wondering why the output above shows a completion rate of >100% in some countries. Here is a quote from https://data.worldbank.org/indicator/SE.PRM.CMPT.FE.ZS:
“There are many reasons why the primary completion rate can exceed 100 percent. The numerator may include late entrants and overage children who have repeated one or more grades of primary education as well as children who entered school early, while the denominator is the number of children at the entrance age for the last grade of primary education.”
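A simplified numerical sketch may make the quote more concrete. All numbers and variable names below are made up for a single hypothetical country; the point is only that the numerator can count children who are not in the denominator.

```r
# Made-up numbers for one hypothetical country.
# Denominator: girls at the official entrance age for the last grade.
girls_at_last_grade_age <- 1000

# Numerator, part 1: girls of the official age who enter the last grade.
entrants_official_age <- 900

# Numerator, part 2: girls who are older or younger than the official age
# (late entrants, repeaters, early entrants) entering the same grade.
entrants_other_ages <- 150

# The rate divides all entrants by the official-age population only,
# so it exceeds 100 here: 105
completion_rate <- 100 * (entrants_official_age + entrants_other_ages) /
  girls_at_last_grade_age
completion_rate
```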
Let’s create a subset that only contains low-income countries.
low_income <- primary_education[primary_education$income == "low", ]
nrow(low_income)
## [1] 16
The output shows that we have 16 low-income countries in our data frame.
str(low_income)
## 'data.frame':    16 obs. of  4 variables:
##  $ country         : chr  "Burundi" "Benin" "Burkina Faso" "Central African Republic" ...
##  $ income          : chr  "low" "low" "low" "low" ...
##  $ completion_f_pct: num  74.5 76 64.1 33.5 52.4 ...
##  $ f_age_last_grade: num  127732 132165 236609 62497 154588 ...
We add a column females_completing to the data frame. The new column contains the number of females who complete primary school.
low_income$females_completing <- low_income$f_age_last_grade * low_income$completion_f_pct / 100
# percentage of females completing =
#   100 * total_females_completing / total_females_at_age_of_last_grade
total_females_completing <- sum(low_income$females_completing)
total_females_age_last_grade <- sum(low_income$f_age_last_grade)
# Here's the percentage we're looking for.
100 * total_females_completing / total_females_age_last_grade
## [1] 64.50803