2021-08-16

Live demo

Activity: basic R operations

Do you remember the practice RAT from last week?

  • There are roughly seven billion people in the world today. Which map shows where people live? (Each figure represents 1 billion people.)

Let’s use data and R to find out which map is correct.

Live demo objective: find Europe’s population

  • In your QR folder that you made when following the video tutorials, start a new R project called population_by_continent.
  • Download the file country_info.csv from Canvas (under Files → Week02_Lesson1). The data are from the United Nations Population Division.
  • Move the file to the population_by_continent project folder.
  • Start a new R script population.R.

R commands for exploring data 1/3

  • Import the data.
  • How many rows and columns are in country_info?
  • What are the column names?
  • Print the first few rows to the console.
  • Which continents appear in country_info?

R commands for exploring data 2/3

country_info <- read.csv("country_info.csv")
dim(country_info)
## [1] 193   4

There are 193 rows and 4 columns.

names(country_info)
## [1] "country"    "continent"  "pop"        "electr_pct"

R commands for exploring data 3/3

head(country_info)
##    country continent       pop electr_pct
## 1  Burundi    Africa  11890784       7.59
## 2  Comoros    Africa    869601      77.80
## 3 Djibouti    Africa    988000      51.80
## 4  Eritrea    Africa   3546421      46.70
## 5 Ethiopia    Africa 114963588      42.90
## 6    Kenya    Africa  53771296      56.00
unique(country_info$continent)
## [1] "Africa"   "Asia"     "Americas" "Oceania"  "Europe"

Find Europe’s population 1/3

Let’s first break down the challenge into small steps. Afterwards we’ll show you how to do it in one giant leap.

We create a subset of country_info that contains only European countries.

europe_info <- country_info[country_info$continent == "Europe", ]

Let’s check what’s inside europe_info.

nrow(europe_info)
## [1] 43

From the 193 rows in country_info we have only kept 43, namely one row for each European country.
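
To see why this subsetting works, it can help to look at the comparison on its own. The sketch below uses a tiny made-up data frame called toy (not the UN data) and a hypothetical helper vector is_europe; it only illustrates the mechanism.

```r
# Toy data frame with made-up values, for illustration only.
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("Europe", "Asia", "Europe", "Africa"),
  pop       = c(10, 20, 30, 40)
)

# The comparison returns one TRUE/FALSE value per row.
is_europe <- toy$continent == "Europe"
is_europe

# Rows where the value is TRUE are kept. Leaving the slot after the
# comma empty means "keep all columns".
toy_europe <- toy[is_europe, ]
nrow(toy_europe)
```

Here is_europe is TRUE for rows 1 and 3, so toy_europe has two rows. The one-line version above does the same thing without storing the intermediate vector.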

Find Europe’s population 2/3

What’s in the first few rows of europe_info?

head(europe_info)
##                 country continent      pop electr_pct
## 149             Belarus    Europe  9449323        100
## 150            Bulgaria    Europe  6948445        100
## 151             Czechia    Europe 10708981        100
## 152             Hungary    Europe  9660351        100
## 153              Poland    Europe 37846611        100
## 154 Republic of Moldova    Europe  4033963        100

Find Europe’s population 3/3

We find Europe’s total population by summing the column named pop.

sum(europe_info$pop)
## [1] 747293775

Hence, there are approximately 747 million Europeans.

Activity 1: Find population on other continents

Activity 1

Find the population in

  • Africa (Team 1),
  • Asia (Team 2),
  • the Americas (Team 3).

Solution

You can, of course, follow the same pattern as before and replace "Europe" by "Africa", "Asia" etc. It’s a perfectly fine solution.

For the sake of variety, here’s a shorter alternative.

sum(country_info$pop[country_info$continent == "Africa"])
## [1] 1337666440
sum(country_info$pop[country_info$continent == "Asia"])
## [1] 4609091684
sum(country_info$pop[country_info$continent == "Americas"])
## [1] 1018121141

And so on for the other continents.
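
If typing one line per continent feels repetitive, one possible way to avoid it is to let R loop over the continent names. The sketch below uses sapply(), which has not been introduced in this lesson, and a made-up data frame toy; treat it as an optional illustration rather than the expected solution.

```r
# Toy data frame with made-up values, for illustration only.
toy <- data.frame(
  continent = c("Europe", "Asia", "Europe", "Africa"),
  pop       = c(10, 20, 30, 40)
)

# All distinct continent names, in order of first appearance.
continents <- unique(toy$continent)

# For each continent name, sum the pop values of the matching rows.
pop_totals <- sapply(continents, function(cont) {
  sum(toy$pop[toy$continent == cont])
})
pop_totals
```

The result is a named vector with one total per continent, so pop_totals["Europe"] gives the toy total for Europe.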

Bonus material

R can do the entire calculation in a single line with aggregate(). We haven’t seen aggregate() yet, but here is a flavour of things to come.

pop_by_cont <- aggregate(pop ~ continent, data = country_info, FUN = sum)
pop_by_cont
##   continent        pop
## 1    Africa 1337666440
## 2  Americas 1018121141
## 3      Asia 4609091684
## 4    Europe  747293775
## 5   Oceania   41798096

The first argument, the formula pop ~ continent, tells aggregate() to treat pop as a function of continent, that is, to group the pop values by continent. The second argument specifies that pop and continent are columns in the data frame country_info. The third argument tells aggregate() to apply the function sum() to each continent’s population values.
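
A small made-up example may make the three arguments easier to follow. The data frame toy and its values are invented for illustration.

```r
# Toy data frame with made-up values, for illustration only.
toy <- data.frame(
  continent = c("Europe", "Asia", "Europe", "Africa"),
  pop       = c(10, 20, 30, 40)
)

# Group the pop values by continent and sum within each group.
toy_by_cont <- aggregate(pop ~ continent, data = toy, FUN = sum)
toy_by_cont

# The Europe row should agree with the step-by-step approach.
sum(toy$pop[toy$continent == "Europe"])
```

The two Europe rows (10 and 30) are combined into a single row with pop equal to 40, and the groups are returned in alphabetical order.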

Population barplot

barplot(pop_by_cont$pop, names.arg = pop_by_cont$continent) 
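
As an optional variation, the bars can be sorted from largest to smallest and the axis shown in billions, which makes the plot easier to compare with the maps. This sketch re-enters the totals from the aggregate() output above so that it runs on its own; pop_sorted is a hypothetical name.

```r
# Totals copied from the aggregate() output above.
pop_by_cont <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  pop       = c(1337666440, 1018121141, 4609091684, 747293775, 41798096)
)

# Reorder the rows from the largest to the smallest population.
pop_sorted <- pop_by_cont[order(pop_by_cont$pop, decreasing = TRUE), ]

# Divide by 1e9 so that the vertical axis is in billions of people.
barplot(pop_sorted$pop / 1e9,
        names.arg = pop_sorted$continent,
        ylab = "Population (billions)")
```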

Population map

Let’s return to the RAT question.

There are roughly seven billion people in the world today. Which map shows where people live? (Each figure represents 1 billion people.)

Americas 1, Europe 1, Africa 1, Asia 4

Looking at the R output above, the correct answer is A.

Activity 2:
How many people in the world have some access to electricity?

Activity 2

Let’s use the information in country_info.csv to answer another question from the practice RAT.

What percentage of the world population has some access to electricity?

  • 20%
  • 50%
  • 80%

Before you start coding …

  • In your QR folder, start a new project called access_to_electricity.
  • Copy country_info.csv to the new project directory.
  • Start an R script called access_to_electricity.R.
  • To answer the question, you need to use the information in the columns
    • pop: population,
    • electr_pct: the percentage of the country’s population with access to electricity. Data from the World Bank.

Hint: you may find it helpful to add a column to the data frame that contains the number of people in each country with access to electricity.

Solution 1/2

# Import the spreadsheet.
country_info <- read.csv("country_info.csv")

# Add a column called pop_with_electricity (number of people with access).
country_info$pop_with_electricity <-
  country_info$pop * country_info$electr_pct / 100

# percentage of population with access to electricity =
# 100 * total_population_with_access_to_electricity / total_population
total_pop <- sum(country_info$pop)
total_pop_with_electricity <- sum(country_info$pop_with_electricity)

Solution 2/2

# Here's the percentage we're looking for.
100 * total_pop_with_electricity / total_pop
## [1] 86.93472
# Here's a common mistake:
sum(country_info$electr_pct) / nrow(country_info)
## [1] 82.29984

This calculation takes a simple average of electr_pct, which implicitly assumes that all countries have the same population. Instead, the average should be weighted by each country’s population, as in our solution.
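
A made-up example with only two countries shows how far apart the two numbers can be. The data frame toy and its values are invented for illustration.

```r
# Two made-up countries: a large one with low access to electricity
# and a small one with full access.
toy <- data.frame(
  country    = c("Large", "Small"),
  pop        = c(900, 100),
  electr_pct = c(50, 100)
)

# Common mistake: the simple average treats both countries equally.
simple_avg <- mean(toy$electr_pct)
simple_avg    # 75

# Weighted approach: count the people with access, then divide by
# the total population.
pop_with_electricity <- toy$pop * toy$electr_pct / 100
weighted_pct <- 100 * sum(pop_with_electricity) / sum(toy$pop)
weighted_pct  # 55
```

Only 550 of the 1000 people have access, so 55% is the right answer for this toy world; the simple average of 75% overstates it because the small country counts as much as the large one.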

Weighted Average by population

\(\tiny{\%\ pop\ with\ access\ to\ electricity =\dfrac{pop\ with\ access\ to\ electricity}{total\ world\ pop}}\)

\(\tiny{=\dfrac{pop\ in\ country\ A\ with\ access+pop\ in\ country\ B\ with\ access+...}{total\ world\ pop}}\)

\(\tiny{=\dfrac{\%\ pop\ in\ A\ with\ access \times pop\ in\ A+\%\ pop\ in\ B\ with\ access \times pop\ in\ B+...}{total\ world\ pop}}\)

  • This is what our R code does.

\(\tiny{=\%\ pop\ in\ A\ with\ access \times \dfrac{pop\ in\ A}{total\ world\ pop}+\%\ pop\ in\ B\ with\ access \times \dfrac{pop\ in\ B}{total\ world\ pop}+...}\)

  • This is equivalent to an average of each country’s percentage with access, weighted by that country’s share of the world population.
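
The equivalence can be checked numerically. The sketch below uses made-up numbers and compares the calculation from our solution, the sum of percentages times population shares, and R’s built-in weighted.mean(), which has not been covered in this lesson.

```r
# Toy data with made-up values, for illustration only.
toy <- data.frame(
  pop        = c(900, 100, 500),
  electr_pct = c(50, 100, 80)
)

# Form 1: people with access divided by total population.
by_hand <- 100 * sum(toy$pop * toy$electr_pct / 100) / sum(toy$pop)

# Form 2: each percentage multiplied by the country's population share.
pop_share <- toy$pop / sum(toy$pop)
by_shares <- sum(toy$electr_pct * pop_share)

# Form 3: the built-in weighted mean, with population as the weight.
built_in <- weighted.mean(toy$electr_pct, w = toy$pop)

# all.equal() compares numbers while tolerating tiny rounding errors.
all.equal(by_hand, by_shares)
all.equal(by_hand, built_in)
```

Both comparisons return TRUE, so the three forms give the same weighted average.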

Activity 3:
How many girls in low-income countries complete primary school?

Activity 3

Here’s another question from the practice RAT.

In low-income countries across the world, how many girls complete primary school?

  • 20%
  • 40%
  • 60%

Let’s find the answer with R.

Before you start coding …

  • In your QR folder, start a new R project called primary_education.
  • Download the file primary_education.csv from Canvas (under Files → Week02_Lesson1). The data (female primary school completion rate and classification of countries by income) are from the World Bank.
  • Move the file to the primary_education project folder.
  • Start a new R script called primary_education.R.

Hint: you may find it helpful to create a subset that only contains low-income countries.

Solution 1/6

primary_education <- read.csv("primary_education.csv")
dim(primary_education)
## [1] 117   4

There are fewer rows (i.e. fewer countries) in primary_education than in our earlier data frame country_info because, for many countries, the World Bank has no information about primary school enrolment. We have removed these countries from primary_education to simplify this activity.

names(primary_education)
## [1] "country"          "income"           "completion_f_pct" "f_age_last_grade"
  • f_age_last_grade: School Age Female Population (Last Grade Of Primary Education)

Solution 2/6

head(primary_education)
##      country         income completion_f_pct f_age_last_grade
## 1    Albania   upper middle        104.64122          18102.0
## 2  Argentina not classified        102.76964         349201.0
## 3    Armenia   lower middle         91.29588          16380.6
## 4    Austria           high         99.59905          40829.6
## 5 Azerbaijan   upper middle        109.72021          59122.8
## 6    Burundi            low         74.52029         127731.8

Solution 3/6

You may be wondering why the output above shows a completion rate of >100% in some countries. Here is a quote from https://data.worldbank.org/indicator/SE.PRM.CMPT.FE.ZS:

“There are many reasons why the primary completion rate can exceed 100 percent. The numerator may include late entrants and overage children who have repeated one or more grades of primary education as well as children who entered school early, while the denominator is the number of children at the entrance age for the last grade of primary education.”
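
A made-up calculation may make the quote more concrete. The numbers and variable names below are invented and the numerator is simplified; the point is only that the numerator can count children who are not part of the denominator.

```r
# Made-up numbers for illustration only (not World Bank data).
# Denominator: girls at the entrance age for the last grade.
girls_at_entrance_age <- 1000

# Numerator: girls counted in the last grade, whatever their age.
girls_on_time <- 900
girls_overage <- 150   # late entrants and girls who repeated a grade

completion_pct <-
  100 * (girls_on_time + girls_overage) / girls_at_entrance_age
completion_pct  # 105
```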

Solution 4/6

Let’s create a subset that only contains low-income countries.

low_income <- primary_education[primary_education$income == "low", ]
nrow(low_income)
## [1] 16

The output shows that we have 16 low-income countries in our data frame.
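
Before or after subsetting, it can be reassuring to count how many rows fall into each income group. The sketch below uses table(), which has not been covered in this lesson, on a made-up data frame toy.

```r
# Toy data frame with made-up values, for illustration only.
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  income  = c("low", "high", "low", "upper middle", "low")
)

# Count the rows in each income group.
income_counts <- table(toy$income)
income_counts

# The count for "low" should match the number of rows in the subset.
toy_low <- toy[toy$income == "low", ]
nrow(toy_low)
```

If the two counts disagree, the comparison string (for example its spelling or capitalisation) is a likely place to look.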

str(low_income)
## 'data.frame':    16 obs. of  4 variables:
##  $ country         : chr  "Burundi" "Benin" "Burkina Faso" "Central African Republic" ...
##  $ income          : chr  "low" "low" "low" "low" ...
##  $ completion_f_pct: num  74.5 76 64.1 33.5 52.4 ...
##  $ f_age_last_grade: num  127732 132165 236609 62497 154588 ...

Solution 5/6

We add a column females_completing to the data frame. The new column contains the number of females who complete primary school.

low_income$females_completing <-
  low_income$f_age_last_grade * low_income$completion_f_pct / 100

# percentage of females completing =
# 100 * total_females_completing / total_females_at_age_of_last_grade
total_females_completing <- sum(low_income$females_completing)
total_females_age_last_grade <- sum(low_income$f_age_last_grade)

Solution 6/6

# Here's the percentage we're looking for.
100 * total_females_completing / total_females_age_last_grade
## [1] 64.50803
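
Activities 2 and 3 follow the same pattern: convert percentages to counts, add up the counts, and divide by the total group size. One possible way to capture this is a small helper function. The name weighted_pct and the toy numbers below are hypothetical and not part of the course material.

```r
# Hypothetical helper: overall percentage from per-country percentages
# (pct) and per-country group sizes (size).
weighted_pct <- function(pct, size) {
  100 * sum(size * pct / 100) / sum(size)
}

# Toy version of Activity 2 (made-up values).
toy_electricity <- data.frame(pop = c(900, 100),
                              electr_pct = c(50, 100))
weighted_pct(toy_electricity$electr_pct, toy_electricity$pop)

# Toy version of Activity 3 (made-up values).
toy_school <- data.frame(f_age_last_grade = c(200, 800),
                         completion_f_pct = c(90, 40))
weighted_pct(toy_school$completion_f_pct, toy_school$f_age_last_grade)
```

With the real data, the same call with low_income$completion_f_pct and low_income$f_age_last_grade should reproduce the percentage above.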

Takeaway

  • We practice commands for understanding our data better: dim(), names(), head(), unique(), nrow(), str().
  • We practice basic R skills: importing data, subsetting data, creating new variables, and arithmetic (summing values, division, etc.).
  • We work with different types of data: identifiers (country names), categorical data (continents), and quantitative data (e.g. population).
  • We learn that a simple average of a characteristic across countries can be misleading because it treats countries with large populations the same as countries with small populations.
  • For our purpose, to get an unbiased view, the average should be weighted by each country’s population.