Displaying & Summarizing Data: Part II

2021-08-26

Read in your data

Read in the UESI2019_with_indicators.csv dataset as the data frame uesi and check its structure (dimensions, column names, and column types).
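
A minimal sketch of what these checks might look like. To stay runnable without the course file, it writes a tiny made-up CSV to a temporary file; every value in it is invented, and the name toy is a placeholder. In class you would read UESI2019_with_indicators.csv into uesi instead.

```r
# Stand-in for the course file: a tiny invented CSV in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,continent,popdens,PM25_mean",
             "oslo,Europe,1400,7.5",
             "seoul,Asia,16000,25.1",
             "quito,South America,5400,"),
           csv_path)

# In class this would be: uesi <- read.csv("UESI2019_with_indicators.csv")
toy <- read.csv(csv_path, stringsAsFactors = FALSE)

str(toy)      # column names and types
dim(toy)      # number of rows (cities) and columns
head(toy)     # first few rows
summary(toy)  # ranges per column, including counts of NAs
```

The empty last field in the quito row is read as NA, which is the kind of thing summary() makes visible early.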

UESI: Urban Environment & Social Inclusion Index

Preliminary Data Exploration

What do the values indicate?

For some indicators, higher values are worse:

Particulate air pollution PM25_mean
Trace gas air pollution NO2_mean
Urban Heat Island UHI_mean

For others, higher values are better:

Tree cover per capita TREECAP_mean
Access to public transit TRANSCOV_mean

What’s the difference between the _mean and .UESI values?

The .UESI values are all rescaled to run from 0 to 100.
For every .UESI variable, 100 is the best score, even when a higher raw _mean value is worse.

Why might this be useful?
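
One way to see the benefit is to rescale two indicators that point in opposite directions. The sketch below uses simple min-max rescaling on invented numbers; it is an illustration only and is not necessarily how the UESI scores are actually computed. The function name rescale_0_100 is hypothetical.

```r
# Invented raw values: PM2.5 (higher is worse) and tree cover (higher is better).
pm25    <- c(5, 20, 35)
treecap <- c(10, 40, 100)

# Simple min-max rescaling to 0-100; an assumption, not the UESI method.
rescale_0_100 <- function(x, higher_is_better = TRUE) {
  rng   <- range(x, na.rm = TRUE)
  score <- 100 * (x - rng[1]) / (rng[2] - rng[1])
  # Flip the scale when a high raw value is bad, so 100 always means "best".
  if (higher_is_better) score else 100 - score
}

pm25_score    <- rescale_0_100(pm25, higher_is_better = FALSE)
treecap_score <- rescale_0_100(treecap)

pm25_score     # 100 50 0: the cleanest city gets 100
treecap_score  # the city with the most tree cover gets 100
```

Once both are on the same 0 to 100 scale with the same direction, a condition like "score above 85" means the same thing for every indicator.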

1 Dealing with missing data points

Pick one indicator and create a table to explore how many missing values there are for that indicator.

Recall that str (structure) is one way to get a list of the column names.

table(is.na(uesi$PUBTRANS_mean))

## 
## FALSE  TRUE 
##   162     2

uesi$city[is.na(uesi$PUBTRANS_mean)]

## [1] "evansville" "reykjavik"
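
To check every column at once rather than one indicator at a time, something like the following could work. The data frame toy and its values are invented for illustration.

```r
# Invented data with a few missing values.
toy <- data.frame(
  city          = c("a", "b", "c", "d"),
  PUBTRANS_mean = c(0.4, NA, 0.9, NA),
  TREECAP_mean  = c(12, 30, NA, 8)
)

# is.na() on a data frame gives a logical matrix; colSums() counts TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts

# complete.cases() flags rows with no missing values in any column.
n_complete <- sum(complete.cases(toy))
n_complete
```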

2 Exploring the data with summary statistics

Answer these questions:

What is the mean population density (popdens) of cities that score 100 on Tree Cover per capita?
What is the median population density of cities that score 100 on Tree Cover per capita?
What is the IQR of all cities’ population density?
What is the standard deviation of all cities’ population density?

1. What is the mean population density of cities that score 100 on Tree Cover per capita?

mean(uesi$popdens[uesi$TREECAP.UESI == 100], na.rm=TRUE)

## [1] 2051.575

2. Median?

median(uesi$popdens[uesi$TREECAP.UESI == 100], na.rm=TRUE)

## [1] 1363.773

3. IQR

IQR(uesi$popdens, na.rm=TRUE)

## [1] 4796.313

EXTRA: you can also use quantile()

q1 <- quantile(uesi$popdens, 0.25, na.rm=TRUE)
q3 <- quantile(uesi$popdens, 0.75, na.rm=TRUE)
q3-q1

##      75% 
## 4796.313

4. What is the standard deviation of all cities’ population density?

sd(uesi$popdens, na.rm=TRUE)

## [1] 4830.647
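
As a check on what sd() computes, the sample standard deviation can be reproduced by hand: the square root of the sum of squared deviations from the mean, divided by n - 1. The vector x below is invented.

```r
# Invented values, no missing data.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample standard deviation: note the n - 1 in the denominator.
by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))

by_hand
sd(x)
all.equal(by_hand, sd(x))  # TRUE
```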

3 More practice with logical operators

Review of preparatory work.

What is the total population of cities NOT in Asia?
How many cities have more than 100 neighborhoods?
Which cities have scores above 85 on both PUBTRANS.UESI and TREECAP.UESI?

1. What is the total population of cities NOT in Asia?

sum(uesi$population_total[uesi$continent != "Asia"])

## [1] 255690791

2. How many cities have more than 100 neighborhoods?

length(uesi$city[uesi$nbhd_num > 100])

## [1] 42

3. Which cities have scores above 85 on both PUBTRANS.UESI and TREECAP.UESI?

# Why use which( ) in this code? Try omitting it and spot the difference!
uesi$city[which(uesi$PUBTRANS.UESI > 85 & uesi$TREECAP.UESI > 85)]

##  [1] "alexandria"      "alger"           "amsterdam"       "asuncion"       
##  [5] "atlanta"         "baltimore"       "berlin"          "boston"         
##  [9] "bratislava"      "bridgeport"      "brisbane"        "brussels"       
## [13] "bucharest"       "budapest"        "chelyabinsk"     "chicago"        
## [17] "cleveland"       "copenhagen"      "denver"          "detroit"        
## [21] "dublin"          "edinburgh"       "fargo"           "hamburg"        
## [25] "houston"         "kampala"         "kiev"            "lome"           
## [29] "london"          "louisville"      "lyons"           "managua"        
## [33] "maputo"          "melbourne"       "milan"           "milwaukee"      
## [37] "minneapolis"     "monrovia"        "montreal"        "moscow"         
## [41] "munich"          "nashville"       "newyork"         "nizhny"         
## [45] "novosibirsk"     "omaha"           "oslo"            "paterson"       
## [49] "philadelphia"    "portland"        "quito"           "riodejaneiro"   
## [53] "saintpetersburg" "saltlakecity"    "sanjose"         "seattle"        
## [57] "seoul"           "singapore"       "stlouis"         "stockholm"      
## [61] "sydney"          "toronto"         "tulsa"           "vancouver"      
## [65] "vienna"          "warsaw"          "wellington"      "wichita"        
## [69] "yangon"          "zagreb"          "zurich"
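
The difference which( ) makes shows up as soon as one of the columns contains an NA. A small sketch on invented data:

```r
# Invented scores; city "b" is missing its PUBTRANS.UESI value.
toy <- data.frame(
  city          = c("a", "b", "c"),
  PUBTRANS.UESI = c(90, NA, 95),
  TREECAP.UESI  = c(88, 92, 40)
)

keep <- toy$PUBTRANS.UESI > 85 & toy$TREECAP.UESI > 85
keep  # TRUE NA FALSE

# Indexing with a logical vector that contains NA returns an NA element.
without_which <- toy$city[keep]
without_which  # "a" NA

# which() returns only the positions that are TRUE, so the NA is dropped.
with_which <- toy$city[which(keep)]
with_which  # "a"
```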

4 Perfect scores

How many cities get perfect scores of 100 on all three variables PM25.UESI, UHI.UESI, and TREECAP.UESI?

table(uesi$PM25.UESI==100 & uesi$UHI.UESI==100 & uesi$TREECAP.UESI==100)

## 
## FALSE  TRUE 
##   162     2

# Only 2 cities meet this condition. Summing a logical vector counts its TRUEs,
# so sum() gives the same count directly.
sum(uesi$PM25.UESI==100 & uesi$UHI.UESI==100 & uesi$TREECAP.UESI==100)

## [1] 2

5 Which cities got perfect scores?

Which two cities are they? For this subsetting you may want to wrap your logical expression in the function which( ). See R Tutorial 19 for a reminder of why.

uesi$city[which(uesi$PM25.UESI==100 & uesi$UHI.UESI==100 &
                  uesi$TREECAP.UESI==100)]

## [1] "anchorage" "oslo"

6 Comparison with a value in the data set

How many cities are better than Singapore with respect to tree cover per capita (TREECAP)?
How many cities are better than Singapore with respect to BOTH tree cover per capita (TREECAP) and PM2.5 (PM25)?
How many cities are better than Singapore with respect to EITHER tree cover per capita (TREECAP) OR PM2.5 (PM25)?

1. How many cities are better than Singapore with respect to tree cover per capita (TREECAP)?

sum(uesi$TREECAP.UESI > uesi$TREECAP.UESI[uesi$city == "singapore"])

## [1] 88
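
It can help to pull the reference value into its own variable and confirm that exactly one row matched before comparing against it. A sketch on invented data (the names toy, ref, and n_better are placeholders):

```r
# Invented scores for four cities.
toy <- data.frame(
  city         = c("singapore", "oslo", "cairo", "lima"),
  TREECAP.UESI = c(40, 100, 10, 75)
)

# Reference value: Singapore's score.
ref <- toy$TREECAP.UESI[toy$city == "singapore"]

# If the city name were misspelled, ref would have length 0 and the
# comparison below would silently count 0 cities, so check first.
stopifnot(length(ref) == 1)

n_better <- sum(toy$TREECAP.UESI > ref, na.rm = TRUE)
n_better  # 2
```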

2. How many cities are better than Singapore with respect to BOTH tree cover (TREECAP) and PM2.5 (PM25)?

sum((uesi$TREECAP.UESI > uesi$TREECAP.UESI[uesi$city == "singapore"]) &
      (uesi$PM25.UESI > uesi$PM25.UESI[uesi$city == "singapore"]) )

## [1] 82

3. How many cities are better than Singapore with respect to EITHER tree cover per capita (TREECAP) OR PM2.5 (PM25)?

sum((uesi$TREECAP.UESI > uesi$TREECAP.UESI[uesi$city == "singapore"]) |
      (uesi$PM25.UESI > uesi$PM25.UESI[uesi$city == "singapore"]) )

## [1] 145
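
The AND and OR counts are linked: the number of cities meeting either condition equals the number meeting the first, plus the number meeting the second, minus the number meeting both. A quick check on invented logical vectors with no missing values:

```r
# Invented flags for five cities: better than the reference on each indicator?
better_tree <- c(TRUE, TRUE,  FALSE, FALSE, TRUE)
better_pm   <- c(TRUE, FALSE, TRUE,  FALSE, TRUE)

n_both   <- sum(better_tree & better_pm)  # 2
n_either <- sum(better_tree | better_pm)  # 4

# Inclusion-exclusion: either = first + second - both
n_either == sum(better_tree) + sum(better_pm) - n_both  # TRUE
```

This identity is a handy sanity check when an OR count looks surprisingly large.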

7 Distribution of PM2.5 performances

In Challenge 6.3, a lot of cities (145) were better than Singapore on at least one of the two indicators. Singapore’s PM2.5 rating is not so great; uesi$PM25.UESI[uesi$city == "singapore"] returns 15.1. Draw a histogram of the PM2.5 scores (PM25.UESI) for all cities.

hist(uesi$PM25.UESI, xlab="Performance on PM25", col="lightblue")
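
To see where one city sits in the distribution, a reference line can be added on top of the histogram. This sketch uses randomly generated stand-in scores, not the UESI data; only the value 15.1 comes from the slide above.

```r
# Stand-in scores between 0 and 100 (random, for illustration only).
set.seed(1)
scores <- round(runif(50, min = 0, max = 100), 1)
ref    <- 15.1  # Singapore's PM25.UESI value quoted above

# Fixed-width bins of 10 points make the bars easy to read.
bins <- seq(0, 100, by = 10)
hist(scores, breaks = bins, xlab = "Performance on PM25",
     main = "Toy PM2.5 scores", col = "lightblue")

# Dashed vertical line at the reference city's score.
abline(v = ref, col = "red", lwd = 2, lty = 2)

# The same bins without drawing: counts per bin.
h <- hist(scores, breaks = bins, plot = FALSE)
h$counts
```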

8 PM2.5 distributions by continent

How do cities across continents compare on the PM2.5 UESI indicator? Save your output as a PDF using ‘Export > Save as PDF’

# Continents are on the x-axis; the PM2.5 score is on the y-axis.
boxplot(uesi$PM25.UESI ~ uesi$continent, col="lightblue",
        xlab="Continent", ylab="PM2.5 score",
        main="City performance on Air Quality PM2.5 by Continent")
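
As an alternative to the Export menu, a plot can be written to a PDF from code by opening a pdf() device, drawing, and closing the device. The data and the file name below are invented, and the file goes to a temporary directory so the sketch runs anywhere.

```r
# Invented scores for three continents.
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", "Africa", "Africa"),
  PM25.UESI = c(10, 20, 80, 90, 40, 60)
)

# Hypothetical output path; replace with a path in your project folder.
out_file <- file.path(tempdir(), "pm25_by_continent.pdf")

pdf(out_file, width = 7, height = 5)  # open the PDF device (size in inches)
boxplot(PM25.UESI ~ continent, data = toy, col = "lightblue",
        xlab = "Continent", ylab = "PM2.5 score",
        main = "Toy PM2.5 scores by continent")
dev.off()                             # close the device so the file is written

file.exists(out_file)
```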

9 Does city population vary by continent?

Asia is not doing well on this comparison. But we might expect larger cities to have worse PM2.5. Check the distribution of city total populations by continent.

# The y-axis is total population here, not a score.
boxplot(uesi$population_total ~ uesi$continent,
        main="City populations by continent", col="pink",
        xlab="Continent", ylab="Total population")

Is that enough to explain the discrepancy? We will have to wait for a later class to examine this question.
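
A numeric companion to the boxplot is the median population within each continent, which tapply() can compute. The numbers below are invented; this sketch only summarizes group centers and does not answer the question above.

```r
# Invented populations for two continents.
toy <- data.frame(
  continent        = c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"),
  population_total = c(2e6, 9e6, 2.5e7, 5e5, 1.5e6, 3e6)
)

# Apply median() to population_total separately within each continent.
med_pop <- tapply(toy$population_total, toy$continent, median, na.rm = TRUE)
med_pop
```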

Takeaway

Summary statistics describe a dataset’s ‘center’ (mean, median) and ‘spread’ (IQR, standard deviation).
When data are skewed, the median is often a better measure of the center; when data are roughly symmetric, the mean works well.
In either case, visualizing your data first with a histogram or boxplot shows the shape of the distribution and helps you choose.
Logical and relational operators in R let you subset the data in a variety of ways for exploration.
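
A tiny invented example of the point about skew: one very large value pulls the mean and standard deviation far more than the median and IQR.

```r
# Invented, strongly right-skewed values.
x <- c(1, 2, 3, 4, 100)

mean(x)    # 22: pulled up by the single large value
median(x)  # 3: unaffected by how large the largest value is
IQR(x)     # 2: based on the middle half of the data only
sd(x)      # large, because it uses every squared deviation
```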