Read in the titanic.csv dataset and perform the requisite checks on the data structure
2021-08-23 Week 3 Mon
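As a minimal sketch of what such checks might look like, the block below writes a tiny stand-in CSV to a temporary file, reads it with read.csv, and runs a few base R inspection functions. The rows and the column names (name, age, survived, class) are made up for illustration and are not taken from titanic.csv. With the real file, the same calls would be applied to the data frame returned by read.csv("titanic.csv").

```r
# Hypothetical stand-in for titanic.csv so the sketch runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,survived,class",
             "A,29,TRUE,1st",
             "B,NA,FALSE,3rd",
             "C,2,TRUE,2nd"), csv_path)

toy <- read.csv(csv_path)

str(toy)             # column names and types
dim(toy)             # number of rows and columns
head(toy)            # first few records
summary(toy)         # per-column summaries, including NA counts for numeric columns
colSums(is.na(toy))  # missing values per column
```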
Histograms are a great way of displaying the underlying distribution of a dataset. A histogram represents counts of observations as bars and plots them against a set of ‘bin values’, which R sets automatically but which can be adjusted. One question we might ask of the data is:
What was the distribution of passengers’ ages aboard the Titanic?
We can answer this with a histogram of the passengers’ ages.
Make a histogram of the Titanic passengers’ ages.
hist(titanic$age)
How can we compare whether there were differences in the age distributions of passengers who survived and perished?
hist(titanic$age[titanic$survived])
hist(titanic$age[!titanic$survived])
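Two separate histograms are easier to compare when they share the same bins and sit next to each other. The sketch below is one way to do that, assuming survived is a logical column. The toy data frame is illustrative only, and the shared breaks and par(mfrow = ...) layout are additions, not part of the original exercise.

```r
# Toy data standing in for titanic$age and titanic$survived (illustrative values only)
toy <- data.frame(
  age = c(2, 8, 19, 24, 31, 36, 45, 52, 61, NA),
  survived = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

bins <- seq(0, 70, 10)            # same bins for both groups
old_par <- par(mfrow = c(1, 2))   # two plots side by side

# hist() drops NA ages and returns the bin counts invisibly
h_yes <- hist(toy$age[toy$survived], breaks = bins, main = "Survived", xlab = "Age")
h_no <- hist(toy$age[!toy$survived], breaks = bins, main = "Perished", xlab = "Age")

par(old_par)                      # restore the previous layout
```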
What can we learn from these histograms?
To choose sensible bins, we can first look at the range of ages:
range(titanic$age)
## [1] NA NA
This returns NA because the age column contains missing (NA) values.
Why might there be NA values in the dataset?
Adding na.rm=TRUE returns the correct range by removing the NA values that tripped us up:
range(titanic$age, na.rm=TRUE)
## [1] 0 74
We can also count the records in which we know the person’s age:
nrow(titanic[!is.na(titanic$age),])
## [1] 2205
That is reassuring; we know there are still plenty of records!
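The NA handling above can be tried on a small vector. The ages here are illustrative values, not taken from the dataset.

```r
# Illustrative ages, not values from the Titanic data
ages <- c(22, NA, 35, 0.5, 74)

range(ages)                # NA NA: a single missing value makes the result NA
range(ages, na.rm = TRUE)  # 0.5 74.0: missing values are dropped first
sum(is.na(ages))           # 1 missing age
sum(!is.na(ages))          # 4 known ages
mean(ages, na.rm = TRUE)   # mean() accepts na.rm in the same way
```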
hist(titanic$age, breaks=c(seq(0,80,10)))
Here, we used the range to decide on bin widths of 10 years.
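As a sketch of how the range can drive the choice of bins, the upper break can be computed by rounding the maximum up to the next multiple of the bin width. The ages are illustrative, and the names upper and bins are introduced here only for the example.

```r
# Illustrative ages with the same range as reported above (0 to 74)
ages <- c(0.5, 4, 18, 22, 29, 35, 41, 58, 74, NA)

age_range <- range(ages, na.rm = TRUE)
upper <- ceiling(age_range[2] / 10) * 10   # round the maximum up to a multiple of 10
bins <- seq(0, upper, by = 10)             # 0, 10, ..., 80

h <- hist(ages, breaks = bins)             # NA values are ignored by hist()
h$counts                                   # observations per 10-year bin
```

The breaks must cover every observed value, otherwise hist() stops with an error, which is why the upper break is rounded up rather than down.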
Add a title, change the horizontal axis label and color of the histogram.
hist(titanic$age, breaks=c(seq(0,80,10)), col="darkgreen", main="Ages of Titanic Passengers", xlab="age")
You can also set labels=TRUE to label each bar with the exact count in its bin.
hist(titanic$age, breaks=c(seq(0,80,10)), col="darkgreen", main="Ages of Titanic Passengers", labels=TRUE, xlab="age")
Boxplots are a compact way of displaying the distribution of a dataset based on its minimum, first quartile (Q1), median, third quartile (Q3), and maximum. They can tell you about outliers, how tightly the data are grouped, and whether the distribution is symmetric or skewed.
They are also a good way to compare distributions between groups of data, and R makes them easy to draw.
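The numbers behind a boxplot can be inspected directly with boxplot.stats. A small sketch with made-up prices follows. Note that by default R draws the whiskers to the most extreme points within 1.5 times the box length and plots anything beyond them individually, so the whisker ends are not always the minimum and maximum.

```r
# Made-up prices with one extreme value
prices <- c(7, 8, 8, 10, 13, 15, 26, 30, 52, 512)

summary(prices)                 # min, quartiles, mean, max

stats <- boxplot.stats(prices)
stats$stats                     # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                       # points drawn individually as outliers

boxplot(prices, ylab = "Price") # the same numbers as a picture
```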
You could ask “Does survival depend on ticket price?”
Begin by adding a column total_price to the data frame for the total ticket price, as demonstrated in the videos.
titanic$total_price <- titanic$pnd + titanic$shl/20 + titanic$pnc/240
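The divisors presumably come from pre-decimal British currency, with pnd, shl and pnc holding pounds, shillings and pence: 20 shillings to a pound and 12 pence to a shilling, so 240 pence to a pound. A worked example on a small, made-up fares data frame:

```r
# Hypothetical fares: 7 pounds 5 shillings, 26 pounds 11 shillings, and 6 pence
fares <- data.frame(pnd = c(7, 26, 0),
                    shl = c(5, 11, 0),
                    pnc = c(0, 0, 6))

# Convert everything to decimal pounds
fares$total_price <- fares$pnd + fares$shl / 20 + fares$pnc / 240
fares$total_price   # 7.250 26.550 0.025
```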
Use a boxplot to compare the ticket prices of the survivors with those of the deceased.
boxplot(titanic$total_price ~ titanic$survived, ylab="Ticket price", xlab="Survival Status")
The outliers make this boxplot hard to read. One way to deal with this problem is to transform the data. Let’s add a new column with the square root of the values:
titanic$total_price_sqrt <- sqrt(titanic$total_price)
boxplot(titanic$total_price_sqrt ~ titanic$survived, ylab="Sqrt(Ticket price)", xlab="Survival Status")
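To see why the square root helps, compare how far the largest value sits from the median before and after the transform. The prices below are made up for illustration.

```r
# Hypothetical prices with one extreme value
prices <- c(4, 9, 16, 400)

max(prices) / median(prices)              # 32: the outlier dwarfs the typical value
max(sqrt(prices)) / median(sqrt(prices))  # about 5.7 after the transform
```

The transform keeps the ordering of the values but compresses the large ones, so the box is no longer squashed against the axis.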
What is the distribution of ages of those who survived, comparing men and women?
titanic_survivors <- titanic[titanic$survived=="TRUE",]
boxplot(titanic_survivors$age ~ titanic_survivors$gender, ylab="Age of survivor", xlab="Sex", main="Ages of male and female survivors")
What can we learn from this boxplot?
Why might the distribution of men’s ages be less spread? How could you find out?
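One possible way to investigate is to look at how many records each group has and to put a number on the spread. The sketch below uses a made-up data frame. The gender labels and ages are assumptions for illustration, and with the real data the same calls would be applied to titanic_survivors.

```r
# Made-up survivors; labels and ages are illustrative only
toy <- data.frame(
  age = c(22, 30, 35, 41, 28, 33, 4, 19, 27, 45, 63, NA),
  gender = c(rep("Male", 6), rep("Female", 6))
)

table(toy$gender)                                # records per group

# Known (non-missing) ages per group
known <- tapply(toy$age, toy$gender, function(x) sum(!is.na(x)))
known

# Spread per group: interquartile range, ignoring missing ages
tapply(toy$age, toy$gender, IQR, na.rm = TRUE)
```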
What is the distribution of the survivors’ ticket prices by class?
boxplot(titanic_survivors$total_price_sqrt ~ titanic_survivors$class, ylab="Sqrt(Ticket price)", xlab="Class", main="Survivors' ticket prices by class")
What can we learn from this boxplot?
Try checking what fraction of the crew have a ticket price. Is it right to be analyzing the crew’s ticket prices?
Perhaps we should hide the crew from the last box plot, or relabel them as “Paying Crew”.
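A sketch of both steps on a made-up data frame follows. It assumes crew members are identified by a class value of "crew" and that a missing ticket price is stored as NA. Both are assumptions, so check the actual values in the class column before adapting it.

```r
# Made-up records; the "crew" label and NA prices are assumptions
toy <- data.frame(
  class = c("1st", "3rd", "crew", "crew", "crew", "2nd"),
  total_price = c(30, 7.25, NA, NA, 8, 13)
)

is_crew <- toy$class == "crew"

# Fraction of crew records that have a ticket price
crew_with_price <- mean(!is.na(toy$total_price[is_crew]))
crew_with_price

# Drop the crew before plotting prices by class
passengers <- toy[!is_crew, ]
boxplot(sqrt(total_price) ~ class, data = passengers,
        ylab = "Sqrt(Ticket price)", xlab = "Class")
```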
You may have noticed that some of the ticket prices seem to be much higher than others. Why might this be the case? Take a look at the pax_on_tckt column in the data.
The pax_on_tckt column is the total number of people on the ticket, so the ticket price per person can be calculated by dividing the total price by this column.
Create a new column price_per to hold the actual price per person, then repeat the analysis above.
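A minimal sketch of the division on made-up values follows. The numbers are illustrative, and only the column names total_price, pax_on_tckt, survived and price_per come from the exercise.

```r
# Made-up tickets: the third one covers four people
toy <- data.frame(
  total_price = c(26.55, 7.25, 512.33, 13),
  pax_on_tckt = c(1, 1, 4, 2),
  survived = c(TRUE, FALSE, TRUE, FALSE)
)

# Price per person on each ticket
toy$price_per <- toy$total_price / toy$pax_on_tckt
toy$price_per   # 26.5500 7.2500 128.0825 6.5000

# Same comparison as before, now per person
boxplot(sqrt(price_per) ~ survived, data = toy,
        ylab = "Sqrt(Price per person)", xlab = "Survival Status")
```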