2021-09-02

Goals for the activity

  • Practice with summary statistics and the commands that produce them (e.g., sd, mean, IQR).

  • Produce plots of data over time.

  • Compare and interpret such plots.

Today’s data and source

The PM2.5 scale

What do different levels of PM2.5 mean?

Air quality in four cities in India

This data set contains hourly PM2.5 measurements for the month of June, 2016, in five cities in India. Here’s an overview:

x <- read.csv("India_AirQuality.csv")
dim(x)
## [1] 720   8
head(x)
##      DateTime Chennai Delhi Hyderabad Kolkata Mumbai Time           Part
## 1 1/6/16 1:00      20    34        32      41     16  100 1.EarlyMorning
## 2 1/6/16 2:00      32    43        40      33     13  200 1.EarlyMorning
## 3 1/6/16 3:00      36    74        39      28      9  300 1.EarlyMorning
## 4 1/6/16 4:00      27    52        33      18      8  400 1.EarlyMorning
## 5 1/6/16 5:00      31    46        35      22     16  500 1.EarlyMorning
## 6 1/6/16 6:00      33    38        35      23     14  600 1.EarlyMorning

Challenge 1

Pick a city: Chennai, Hyderabad, or Kolkata.

For your city, calculate the following: the minimum, maximum, mean, median, standard deviation, interquartile range, and the \(5^{th}\) and \(95^{th}\) percentiles (or quantiles).

Count the missing values using the functions sum and is.na.
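
As a minimal sketch of the commands involved (not the activity's recommended solution), the block below computes each statistic on a short made-up vector named pm. The values and the name pm are assumptions for illustration; in the activity you would use a column of x, such as x$Chennai. Because the vector contains a missing value, each function is given na.rm = TRUE.

```r
# Illustrative values only; in the activity, use a column such as x$Chennai.
pm <- c(20, 32, 36, NA, 27, 31, 33, 41, 28, 18)

min(pm, na.rm = TRUE)
max(pm, na.rm = TRUE)
mean(pm, na.rm = TRUE)
median(pm, na.rm = TRUE)
sd(pm, na.rm = TRUE)
IQR(pm, na.rm = TRUE)
quantile(pm, probs = c(0.05, 0.95), na.rm = TRUE)

# is.na() marks each missing value as TRUE; sum() counts the TRUEs.
n_missing <- sum(is.na(pm))
n_missing
```

Without na.rm = TRUE, most of these functions return NA (or stop with an error, in the case of quantile) as soon as one value is missing.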

Challenge 1, recommended solution

Challenge 1

Boxplots of the results:

boxplot(x[, c("Chennai", "Delhi", "Hyderabad", "Kolkata")], main="PM2.5 by city")

Challenge 2 - Pollution Cycle Over a Day

We would like to find the daily pollution cycle for the city you chose. The data set has a variable, Time, which gives the hour of the day at which each measurement was taken.

For your city, calculate the mean of the PM2.5 for each hour. Hint: use the aggregate function.

Use something like plot(Delhi_hour$Delhi) to explore the time series of measurements for your city. Use options such as

  • type="l" to change from the default (points) to lines
  • main="A great title" to provide a meaningful title
  • xlab and ylab to produce human-friendly axis labels
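
The steps above can be sketched as follows, using a small made-up data frame called toy in place of the real file (the name toy and its values are assumptions for illustration). The sketch relies on the formula interface of aggregate, which by default drops rows where the measurement is missing.

```r
# Toy data standing in for the real file: 3 days x 4 hourly readings (made-up values).
toy <- data.frame(
  Time  = rep(c(100, 200, 300, 400), times = 3),
  Delhi = c(34, 43, 74, 52,  46, 38, NA, 50,  40, 45, 60, 48)
)

# Mean PM2.5 for each value of Time. With the formula interface, rows with a
# missing Delhi value are dropped by default (na.action = na.omit).
Delhi_hour <- aggregate(Delhi ~ Time, data = toy, FUN = mean)
Delhi_hour

plot(Delhi_hour$Time, Delhi_hour$Delhi, type = "l",
     main = "Mean PM2.5 by hour (toy data)",
     xlab = "Time of day (hhmm)", ylab = "Mean PM2.5")
```

Passing Delhi_hour$Time as the first argument puts the hour on the horizontal axis; plot(Delhi_hour$Delhi) alone would use the row index instead.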

Challenge 2 - recommended solution

Comparing graphs

What is the problem with these graphs if you want to compare across cities?

Comparing graphs

  • Cities can differ greatly in how much their pollution level varies.
  • Figures without a similar scale are hard to compare visually.
  • Next let’s try to fix the graphs by using a similar scale.
  • We should begin by finding the overall maximum and minimum values of the hourly average of PM2.5 over all cities.

Max and Min of hourly averages

range(Delhi_hour$Delhi, na.rm=TRUE)
## [1] 45.83333 54.93333
range(Chennai_hour$Chennai, na.rm=TRUE)
## [1] 28.51724 55.64286
range(Hyderabad_hour$Hyderabad, na.rm=TRUE)
## [1] 26.66667 46.77778
range(Kolkata_hour$Kolkata, na.rm=TRUE)
## [1] 23.30000 35.23333
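
One way to combine these four results into a single overall range is sketched below. The eight numbers are copied from the output above; the names hourly_ranges, overall, and y_limits are chosen here for illustration.

```r
# The eight values below are copied from the range() output above.
hourly_ranges <- c(45.83333, 54.93333,   # Delhi
                   28.51724, 55.64286,   # Chennai
                   26.66667, 46.77778,   # Hyderabad
                   23.30000, 35.23333)   # Kolkata

# Overall minimum and maximum across the four cities.
overall <- range(hourly_ranges)
overall

# Round outward to get whole-number axis limits.
y_limits <- c(floor(overall[1]), ceiling(overall[2]))
y_limits
```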

Challenge 3

We already collected the descriptive statistics for all the cities and found that the overall maximum and minimum were approximately 55.6 and 23.3.

  • We should use a scale on the vertical axis that goes from 23 to 56.

  • Use these values to customize the y-axis of the time series plot for Delhi.

  • To do this, add the optional argument ylim to the plot call and give it a vector holding these two values, built with c().
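
A minimal sketch of the ylim argument, assuming a made-up vector of hourly means named delhi_means (in the activity you would plot Delhi_hour$Delhi instead):

```r
# Made-up hourly means for illustration; in the activity, use Delhi_hour$Delhi.
delhi_means <- c(46, 48, 51, 55, 53, 49, 47)

# ylim takes a vector of two numbers: the lower and upper limits of the y-axis.
y_limits <- c(23, 56)
plot(delhi_means, type = "l", ylim = y_limits,
     main = "Delhi: mean PM2.5 by hour (toy data)",
     xlab = "Hour index", ylab = "Mean PM2.5")
```

The line for Delhi now occupies only the upper part of the plot, which is the point: the empty space below shows where the less polluted cities would sit on the same scale.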

Challenge 3 (recommended solution)

Challenge 4 (Let’s compare)

Pick any two cities – your team’s city plus one other (not Delhi).

On the lines immediately after the par command given below, copy and paste the code for the two time series plots (with the standardized y-axis limits that you set above). Then run the block of code. You should obtain two plots on the same graphical display, one above the other.

par(mfrow=c(2,1))
# YOUR WORK HERE for Challenge 4:
# plot() for the first city here...
# plot() for the second city here... resulting in two plots in the same graphic!
par(mfrow=c(1,1))

Reminder: par(mfrow=c(2,1)) lays the plots out in 2 rows and 1 column, just like a data frame with 2 rows and 1 column, so the two graphs are stacked one on top of the other. By the same logic, par(mfrow=c(1,2)) puts two graphs side by side. When you finish, always set the layout back with par(mfrow=c(1,1)); otherwise your next graph will not have the right size.
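
Another way to restore the layout, sketched here with made-up values, is to save what par() returns. When par() changes a setting it returns the previous value, so the saved object (named old_par here for illustration) can be passed back to par() afterwards.

```r
# par() returns the previous values of the settings it changes,
# so they can be saved and restored afterwards.
old_par <- par(mfrow = c(2, 1))   # 2 rows, 1 column: plots stacked vertically

plot(c(46, 48, 51, 55), type = "l", main = "First city (toy data)")
plot(c(29, 35, 44, 40), type = "l", main = "Second city (toy data)")

par(old_par)                      # back to the previous layout
```

This restores whatever layout was active before, rather than assuming it was c(1,1).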

Challenge 4, recommended solution (code)

Challenge 4, recommended solution (plot)

Challenge 5

In the previous challenge, we picked two cities and drew them as two separate plots in one display. However, the result was difficult to read because of the scale. To improve this, we can instead put both cities in a single graph. To do this, first plot one city as you did in Challenge 4, then immediately afterwards use the lines function to add the other city. You will have to assign each city a different color to tell the two apart.
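
The idea can be sketched as follows with made-up hourly means for two cities (all values and variable names here are assumptions for illustration, not the activity's data). The legend call is an optional extra that labels the two colors.

```r
# Made-up hourly means for two cities (illustration only).
hours   <- 1:6
delhi   <- c(46, 48, 51, 55, 53, 49)
chennai <- c(29, 33, 40, 52, 45, 36)

# Shared y-axis limits so that neither line is cut off.
y_limits <- range(c(delhi, chennai))

plot(hours, delhi, type = "l", col = "red", ylim = y_limits,
     main = "Mean PM2.5 by hour (toy data)",
     xlab = "Hour", ylab = "Mean PM2.5")
lines(hours, chennai, col = "blue")   # adds to the existing plot
legend("topleft", legend = c("Delhi", "Chennai"),
       col = c("red", "blue"), lty = 1)
```

Because lines only adds to an existing plot, the axis limits are fixed by the first plot call; setting ylim there keeps the second city from running off the plot.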

Challenge 5, recommended solution (plot I)

Challenge 5, recommended solution (plot II)

Challenge 6

Next, let’s use boxplots to gauge the distribution of PM2.5 for each hour.

What is the pattern for the city?

What are the most important data presented in this plot?
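
One way to draw a box for each hour is the formula interface of boxplot, sketched below on a small made-up data frame (the name toy and its values are assumptions for illustration; in the activity you would pass data = x).

```r
# Toy data: 5 days x 3 hourly readings (made-up values).
toy <- data.frame(
  Time  = rep(c(100, 200, 300), times = 5),
  Delhi = c(34, 43, 74,  46, 38, 61,  40, 45, 60,  52, 41, 90,  38, 44, 58)
)

# One box per distinct value of Time; Delhi ~ Time reads "Delhi grouped by Time".
bp <- boxplot(Delhi ~ Time, data = toy,
              main = "PM2.5 by hour (toy data)",
              xlab = "Time of day (hhmm)", ylab = "PM2.5")

# boxplot() also returns the numbers it drew; row 3 of bp$stats is the medians.
bp$stats[3, ]
```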

Challenge 6, recommended solution

Challenge 7: How often is the PM2.5 index at an unhealthy level?

Explain how the following code can be used to answer the question, and adapted to answer related questions.

Level <- 100
100*c(sum(x$Chennai >= Level, na.rm = TRUE)/sum(!is.na(x$Chennai)),
  sum(x$Delhi >= Level, na.rm = TRUE)/sum(!is.na(x$Delhi)),
  sum(x$Hyderabad >= Level, na.rm = TRUE)/sum(!is.na(x$Hyderabad)),
  sum(x$Kolkata >= Level, na.rm = TRUE)/sum(!is.na(x$Kolkata))
)
## [1] 5.8655222 1.2500000 0.1547988 0.1412429

How often is the PM2.5 index at an unhealthy level?

The variable Level is set to 100, the PM2.5 value at which air quality becomes “unhealthy for sensitive groups” or worse. The code returns a vector of values. Each value is the percentage of non-missing measurements for which the city’s PM2.5 is at least Level. The first value corresponds to Chennai, the second to Delhi, the third to Hyderabad, and the fourth to Kolkata.

We see that Chennai has PM2.5 unhealthy for sensitive groups or worse about 6% of the time, while Delhi has it only about 1% of the time. The other two cities reach such PM2.5 levels much more rarely, less than 0.2% of the time.

By changing the variable Level to 50 or 150, we could find out which cities have PM2.5 “moderate” or worse, or “unhealthy” or worse.
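
To make that kind of adaptation easier, the calculation could be wrapped in a small function. The sketch below is one possible version, not part of the activity's code: the helper name pct_at_or_above and the toy data are hypothetical. It uses the fact that the mean of a logical vector, with missing values removed, equals the count of TRUE values divided by the count of non-missing values.

```r
# Toy data with made-up readings; NA marks a missing measurement.
toy <- data.frame(
  Chennai = c(40, 120, 95, NA, 160),
  Delhi   = c(55, 60, 101, 48, 52)
)

# Percentage of non-missing readings at or above a threshold, per column.
# Hypothetical helper: the name is ours, not part of the activity's code.
pct_at_or_above <- function(data, level) {
  100 * colMeans(data >= level, na.rm = TRUE)
}

pct_at_or_above(toy, 50)    # "moderate" or worse
pct_at_or_above(toy, 100)   # "unhealthy for sensitive groups" or worse
pct_at_or_above(toy, 150)   # "unhealthy" or worse
```

On the real data, the call would be something like pct_at_or_above(x[, c("Chennai", "Delhi", "Hyderabad", "Kolkata")], 100).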

Specific conclusions

  1. Chennai has the highest pollution index among the 5 cities, sometimes even reaching hazardous levels.
  2. However, when we look at the average air quality, Chennai is not doing worse than Delhi.
  3. But sometimes the value can go below 0. These negative values make the index less trustworthy.
  4. On average, Delhi seems to have the highest pollution during the peak hour. Except for a few hours of the day, Delhi has the highest mean pollution compared with the other cities.
  5. When looking at the boxplots, the median changes little across all hours, but the maxima show that pollution occasionally reaches high levels.
  6. It seems that Delhi tends to have worse pollution than Chennai, but Chennai’s pollution is more often unhealthy for sensitive groups than Delhi’s.

General conclusions

  • Today you practiced calculating, plotting, and reasoning about summary statistics.
  • Plots over time can reveal important trends. We saw that air quality varied over both time and location.
  • The important story is sometimes in the spread (variance), not the central tendencies (means or medians).