Practice with summary statistics and the commands that produce them (e.g., sd, mean, IQR).
Produce plots of data over time.
Compare and interpret such plots.
2021-09-02
PM2.5 is one of the pollutants behind the air quality index (AQI): http://airnow.gov/index.cfm?action=aqibasics.aqi.
PM2.5 represents the amount of very small (“fine”) particulate matter in the air, one of several common measurements used in studies of air quality.
The PM2.5 data that we will use today can be found here: https://www.airnow.gov/
What do different levels of PM2.5 mean?
This data set contains hourly PM2.5 measurements for the month of June, 2016, in five cities in India. Here’s an overview:
x <- read.csv("India_AirQuality.csv")
dim(x)
## [1] 720 8
head(x)
##      DateTime Chennai Delhi Hyderabad Kolkata Mumbai Time           Part
## 1 1/6/16 1:00      20    34        32      41     16  100 1.EarlyMorning
## 2 1/6/16 2:00      32    43        40      33     13  200 1.EarlyMorning
## 3 1/6/16 3:00      36    74        39      28      9  300 1.EarlyMorning
## 4 1/6/16 4:00      27    52        33      18      8  400 1.EarlyMorning
## 5 1/6/16 5:00      31    46        35      22     16  500 1.EarlyMorning
## 6 1/6/16 6:00      33    38        35      23     14  600 1.EarlyMorning
Pick a city: Chennai, Hyderabad, Kolkata.
For your city, calculate the following: the minimum, maximum, mean, median, standard deviation, interquartile range, and the \(5^{th}\) and \(95^{th}\) percentiles (or quantiles).
Calculate the number of missing values by using the functions sum and is.na.
min(x$Delhi)
## [1] 6
max(x$Delhi)
## [1] 129
mean(x$Delhi)
## [1] 49.06528
median(x$Delhi)
## [1] 48
sd(x$Delhi)
## [1] 19.08148
IQR(x$Delhi)
## [1] 23
quantile(x$Delhi, c(0.05, 0.95)) # Both at once!
##    5%   95%
## 21.95 84.00
sum(is.na(x$Delhi)) # No missing values for Delhi
## [1] 0
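Delhi has no missing values, but your city may. In that case most of the summary functions above return NA (or stop with an error) unless you pass na.rm=TRUE. The following is a minimal sketch on a made-up vector; the name pm and its values are hypothetical and do not come from the data set.

```r
# Hypothetical toy vector with two missing values (not from the data set)
pm <- c(20, 32, NA, 27, 31, NA, 45)

# is.na() gives TRUE/FALSE for each element; sum() counts the TRUEs
n_missing <- sum(is.na(pm))

# Missing values propagate: without na.rm the mean is NA
mean_default <- mean(pm)

# With na.rm = TRUE the mean uses only the 5 observed values
mean_clean <- mean(pm, na.rm = TRUE)

# quantile() (and therefore IQR()) stops with an error on NA unless na.rm = TRUE
q <- quantile(pm, c(0.05, 0.95), na.rm = TRUE)

print(n_missing)
print(mean_default)
print(mean_clean)
print(q)
```

The same na.rm=TRUE argument works for min, max, median, sd, and IQR.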
Boxplots of the results:
boxplot(x[, c("Chennai", "Delhi", "Hyderabad", "Kolkata")], main="PM2.5 by city")
We would like to find out what the pollution cycle over a day is for the city you chose. In the dataset, there is a variable, Time, which indicates the hour of the day at which the measurement was taken.
For your city, calculate the mean of the PM2.5 for each hour. Hint: use the aggregate function.
Use something like plot(Delhi_hour$Delhi) to explore the time series of measurements for your city. Use options such as:
type="l" to change from the default (points) to lines
main="A great title" to provide a meaningful title
xlab and ylab to produce human-friendly axis labels
Delhi_hour <- aggregate(Delhi~Time, data= x, FUN = mean)
Chennai_hour <- aggregate(Chennai~Time, data= x, FUN = mean)
Hyderabad_hour <- aggregate(Hyderabad~Time, data= x, FUN = mean)
Kolkata_hour <- aggregate(Kolkata~Time, data= x, FUN = mean)
plot(Delhi_hour$Delhi, type="l", main="PM2.5 in Delhi", xlab="Hour", ylab="PM2.5")
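To see what aggregate does with a formula such as Delhi~Time, it can help to try it on a data frame small enough to check by hand. This is only a sketch: toy, CityA, and the numbers in it are made up for illustration.

```r
# Hypothetical toy data: two days of measurements at three hours
toy <- data.frame(
  Time  = c(100, 200, 300, 100, 200, 300),
  CityA = c(10, 20, 30, 30, NA, 50)
)

# Mean of CityA within each distinct value of Time.
# With the formula interface, rows containing NA are dropped
# by default (na.action = na.omit) before FUN is applied.
toy_hour <- aggregate(CityA ~ Time, data = toy, FUN = mean)
print(toy_hour)
```

The result is a data frame with one row per hour and two columns, Time and CityA, which is why the plots here refer to columns such as Delhi_hour$Delhi.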
What is the problem with these graphs if you want to compare across cities?
range(Delhi_hour$Delhi, na.rm=TRUE)
## [1] 45.83333 54.93333
range(Chennai_hour$Chennai, na.rm=TRUE)
## [1] 28.51724 55.64286
range(Hyderabad_hour$Hyderabad, na.rm=TRUE)
## [1] 26.66667 46.77778
range(Kolkata_hour$Kolkata, na.rm=TRUE)
## [1] 23.30000 35.23333
We already computed the range of the hourly means for each city, and found that the overall maximum and minimum were approximately 55.6 and 23.3.
We should therefore use a scale on the vertical axis that goes from 23 to 56, so that every city fits on the same axis.
Use these values to customize the y-axis of the time series plot for Delhi.
To do this, add an optional argument (named ylim) to the plot call and assign it a vector of these two values, created using c().
plot(Delhi_hour$Delhi, type="l", main="PM2.5 in Delhi", xlab="Hour", ylab="PM2.5", ylim=c(23, 56))
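Reading the limits off the printed ranges works, but the same limits could also be computed. The sketch below pools two series, takes the overall range, and rounds outward to whole numbers. The vectors a_hour and b_hour are hypothetical stand-ins for columns such as Delhi_hour$Delhi, with made-up values.

```r
# Hypothetical hourly means for two cities (made-up numbers)
a_hour <- c(45.8, 50.2, 54.9)
b_hour <- c(28.5, 40.1, 55.6)

# Pool both series and take the overall minimum and maximum
lims <- range(c(a_hour, b_hour), na.rm = TRUE)

# Round outward so no point sits exactly on the edge of the plot
ylim_common <- c(floor(lims[1]), ceiling(lims[2]))
print(ylim_common)

plot(a_hour, type = "l", main = "City A", xlab = "Hour", ylab = "PM2.5",
     ylim = ylim_common)
```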
Pick any two cities – your team’s city plus one other (not Delhi).
On the lines immediately after the par command that you are given below, copy/paste the code for the two time series plots (with the standardized y-axis limits that you set above). Then run the block of code. You should obtain two plots on the same graphical display, one above the other.
par(mfrow=c(2,1))
# YOUR WORK HERE for Challenge 4:
# plot() for the first city here...
# plot() for the second city here... resulting in two plots in the same graphic!
par(mfrow=c(1,1))
Reminder: This code combines two graphs, stacked one on top of the other. par(mfrow=c(2,1)) lays the graphs out in 2 rows and 1 column, just like the dimensions of a data frame. Similarly, par(mfrow=c(1,2)) puts two graphs side by side. When you finish, you should always reset with par(mfrow=c(1,1)); otherwise your next graph will be drawn in the multi-panel layout and will not have the right size.
par(mfrow=c(2,1))
plot(Delhi_hour$Delhi, type="l", main="PM2.5 in Delhi", xlab="Hour", ylab="PM2.5", ylim=c(23, 56))
plot(Chennai_hour$Chennai, type="l", main="PM2.5 in Chennai", xlab="Hour", ylab="PM2.5", ylim=c(23, 56))
par(mfrow=c(1,1))
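An alternative to resetting with par(mfrow=c(1,1)) by hand is to save the old settings and restore them afterwards, since par returns the previous values of whatever it changes. This is a sketch of that idiom with placeholder plots; op is just a variable name chosen here.

```r
# par() invisibly returns the old values of the settings it changes
op <- par(mfrow = c(1, 2))   # 1 row, 2 columns: side by side

plot(1:10, type = "l", main = "Left panel")
plot(10:1, type = "l", main = "Right panel")

# Restore the layout that was in effect before
par(op)
```

This restores whatever layout was active before, even if it was not the default single panel.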
In the previous challenge, we picked two cities and plotted them in a multi-panel display. However, the result was difficult to read because of the scale. To improve this, we can instead put both cities in a single graph. To do this, first plot one city as you did in Challenge 4, then immediately afterwards use the lines function to add another city. You will have to assign a different color to distinguish the two cities.
plot(Delhi_hour$Delhi, type="l", xlab="Hour", ylab="PM2.5", ylim=c(23, 56), main="PM2.5 in Delhi (black) and Chennai (red)")
lines(Chennai_hour$Chennai, col="red")
plot(Delhi_hour$Delhi, type="l", xlab="Hour", ylab="PM2.5", ylim=c(23, 56), main="PM2.5 in Delhi (black), Chennai (red), Hyderabad (blue), Kolkata (orange)")
lines(Chennai_hour$Chennai, col="red")
lines(Hyderabad_hour$Hyderabad, col="blue")
lines(Kolkata_hour$Kolkata, col="orange")
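Listing the colors in the title works, but with several cities a legend may be easier to read. The sketch below uses the base R legend function on two made-up series; a_hour, b_hour, and the city names are hypothetical.

```r
# Hypothetical hourly means for two cities (made-up numbers)
a_hour <- c(46, 50, 55, 52)
b_hour <- c(29, 41, 56, 38)

# First series sets up the axes; lines() adds the second series on top
plot(a_hour, type = "l", xlab = "Hour", ylab = "PM2.5",
     ylim = c(23, 56), main = "PM2.5 by hour")
lines(b_hour, col = "red")

# One legend entry per series; col must be in the same order as legend
legend("topleft", legend = c("City A", "City B"),
       col = c("black", "red"), lty = 1)
```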
Next let’s use the boxplot to gauge the distribution for each hour.
What is the pattern for the city?
What are the most important data presented in this plot?
boxplot(x$Delhi~x$Time, main="PM2.5 in Delhi", xlab="Hour", ylab="PM2.5")
grid()
The median does not vary much from hour to hour, but the range does. The most important data are the upper outliers, as they represent more dangerous pollution.
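An answer like this, read off the boxplot, could also be checked numerically by computing the median and the spread within each hour. The sketch below does this with tapply on a made-up data frame; toy and CityA are hypothetical, and the numbers are chosen only so that the medians match while the spreads differ.

```r
# Hypothetical toy data: three measurements at each of two hours
toy <- data.frame(
  Time  = c(100, 100, 100, 200, 200, 200),
  CityA = c(40, 50, 60, 20, 50, 110)
)

# Median within each hour
med <- tapply(toy$CityA, toy$Time, median, na.rm = TRUE)

# Spread (max minus min) within each hour
spread <- tapply(toy$CityA, toy$Time,
                 function(v) diff(range(v, na.rm = TRUE)))

print(med)
print(spread)
```

Here both hours have the same median, but the second hour has a much larger spread, which is the pattern the boxplot answer describes.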
Explain how the following code can be used to answer the question, and adapted to answer related questions.
Level <- 100
100*c(sum(x$Chennai >= Level, na.rm = TRUE)/sum(!is.na(x$Chennai)),
      sum(x$Delhi >= Level, na.rm = TRUE)/sum(!is.na(x$Delhi)),
      sum(x$Hyderabad >= Level, na.rm = TRUE)/sum(!is.na(x$Hyderabad)),
      sum(x$Kolkata >= Level, na.rm = TRUE)/sum(!is.na(x$Kolkata))
)
## [1] 5.8655222 1.2500000 0.1547988 0.1412429
The variable Level is assigned 100, the PM2.5 value at which air quality is “unhealthy for sensitive groups” or worse. The code returns a vector of values. Each value is the percentage of the city's non-missing measurements with PM2.5 of at least Level. The first value corresponds to Chennai, the second to Delhi, the third to Hyderabad, and the fourth to Kolkata.
We see that Chennai has PM2.5 unhealthy for sensitive groups or worse 6% of the time, while Delhi has it only 1% of the time. The other two cities have such PM2.5 levels much more rarely.
By changing the variable Level to 50 or 150, we could find out which cities have PM2.5 “moderate” or worse, or “unhealthy” or worse.
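One possible way to adapt the code to several thresholds is to wrap the calculation in a small function and apply it to each city column. This is a sketch, not part of the original solution: pct_at_least, toy, and the city names are hypothetical, and the numbers are made up.

```r
# Hypothetical toy data frame standing in for x (made-up numbers)
toy <- data.frame(
  CityA = c(120, 80, NA, 101, 40),
  CityB = c(30, 60, 99, 100, 20)
)

# Percentage of non-missing values that are at least `level`.
# mean() of a logical vector is the proportion of TRUEs, so this is
# equivalent to sum(v >= level, na.rm = TRUE) / sum(!is.na(v)).
pct_at_least <- function(v, level) {
  100 * mean(v >= level, na.rm = TRUE)
}

# One named percentage per city, for each threshold in turn
for (level in c(50, 100, 150)) {
  cat("Level", level, "\n")
  print(sapply(toy, pct_at_least, level = level))
}

res_100 <- sapply(toy, pct_at_least, level = 100)
```

Because sapply keeps the column names, each percentage is labelled with its city, so there is no need to remember the order of the values.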