2021-08-23 Week 3 Mon

Read in the titanic.csv dataset and perform requisite checks on data structure
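As a rough sketch, the read-and-check step might look like the following. The three rows written to a temporary file are invented so that the example runs on its own; only the column names age, survived and class come from these notes. With the real data you would call read.csv("titanic.csv") directly.

```r
# Hypothetical stand-in for titanic.csv (three invented rows) so the sketch is runnable.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,survived,class",
             "29,TRUE,1st",
             "NA,FALSE,3rd",
             "2,TRUE,2nd"), csv_path)

titanic <- read.csv(csv_path)   # with the real file: read.csv("titanic.csv")

str(titanic)              # column names, types, and first few values
dim(titanic)              # number of rows and columns
head(titanic)             # first rows of the data frame
colSums(is.na(titanic))   # count of missing values per column
```

str() is usually the quickest way to spot a column that was read with an unexpected type, for example numbers read as text.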

Histograms

Histograms are a great way of displaying the underlying distribution of a dataset. A histogram groups the observations into intervals (‘bins’) and draws the count in each bin as a bar. R sets the bins automatically, but they can be adjusted. One question we might ask of the data is:

What was the distribution of passengers’ ages aboard the Titanic?

We can make a histogram of the distribution of passengers’ ages.

1 Histogram of Titanic passenger ages

Make a histogram of the Titanic passengers’ ages.

hist(titanic$age)
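To see exactly which bins R chose and how many observations fell in each, one option is to ask hist() for its computed values without drawing anything. The ages below are invented for illustration and are not the Titanic data.

```r
# Invented ages, for illustration only.
ages <- c(2, 19, 22, 24, 27, 31, 35, 38, 44, 61)

h <- hist(ages, plot = FALSE)  # compute the bins without drawing the plot
h$breaks   # bin edges chosen by R
h$counts   # number of observations falling in each bin
```

There is always one more break than there are counts, because each bar sits between two neighbouring breaks.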

2 Histograms of Titanic ages by survival

How can we compare whether there were differences in the age distributions of passengers who survived and perished?

hist(titanic$age[titanic$survived])

hist(titanic$age[!titanic$survived])
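The two histograms are easier to compare when they share the same bins and sit next to each other. The sketch below does this with par(mfrow=...). The data frame is an invented stand-in, and the breaks of 0 to 80 in steps of 10 are an assumed choice for illustration.

```r
# Invented stand-in data; `age` and `survived` mirror the column names used above.
toy <- data.frame(
  age      = c(1, 4, 22, 25, 30, 34, 41, 47, 58, 71),
  survived = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

shared_breaks <- seq(0, 80, 10)    # same bins for both groups
old_par <- par(mfrow = c(1, 2))    # two plots side by side
hist(toy$age[toy$survived],  breaks = shared_breaks, main = "Survived", xlab = "Age")
hist(toy$age[!toy$survived], breaks = shared_breaks, main = "Perished", xlab = "Age")
par(old_par)                       # restore the previous plot layout
```

With different bins in each plot, bars at the same position would cover different age ranges, which makes differences between the groups hard to judge.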

3 Observations from histograms

What can we learn from these histograms?

hist(titanic$age)

hist(titanic$age[titanic$survived])

hist(titanic$age[!titanic$survived])

  • Most passengers were aged between 15 and 50.
  • Infants were more likely to survive.
  • The elderly passengers all perished.

Bin widths: range

  • R chooses the histogram bin width (i.e., data interval) automatically, using Sturges' rule by default; for these data that gives a width of 5 years. You can adjust this parameter.
  • To choose an appropriate bin width, it helps to look at the range of the data first:

range(titanic$age)
## [1] NA NA

We get NA because the age column contains missing (NA) values, and range() returns NA if any of its input is missing.

Why might there be NA values in the dataset?

Range & NA

Adding na.rm=TRUE drops the NA values that tripped us up and returns the correct range.

range(titanic$age, na.rm=TRUE)
## [1]  0 74
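The same behaviour can be reproduced on a small vector. The values here are invented; the point is only that a single NA is enough to make the result NA unless na.rm=TRUE is given.

```r
x <- c(12, NA, 40, 7)    # invented values with one missing entry

range(x)                 # NA NA: the missing value propagates
range(x, na.rm = TRUE)   # 7 40: the NA is dropped before computing
mean(x)                  # NA, for the same reason
mean(x, na.rm = TRUE)    # mean of the three known values
```

Most of R's summary functions (mean, median, sd, sum, min, max) accept na.rm in the same way.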

We can also count the records in which we know the person’s age

nrow(titanic[!is.na(titanic$age),])
## [1] 2205

That is reassuring; we know there are still plenty of records!
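An equivalent way to count, shown here on an invented vector, is to sum the logical vector returned by is.na(), since TRUE counts as 1 and FALSE as 0.

```r
age <- c(29, NA, 2, 35, NA, 61)   # invented values, two of them missing

sum(!is.na(age))   # number of known ages: 4
sum(is.na(age))    # number of missing ages: 2
mean(is.na(age))   # fraction of ages that are missing
```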

Histogram of Titanic passenger ages with adjusted bin widths

hist(titanic$age, breaks=c(seq(0,80,10)))

Here, we used the range to decide on bin widths of 10 years.
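Rather than typing the end points by hand, the breaks could be derived from the range. This is one possible sketch on invented ages spanning 0 to 74; the variable names width and breaks are chosen here for illustration.

```r
ages  <- c(0, 3, 18, 27, 33, 45, 52, 74, NA)   # invented values spanning 0 to 74
width <- 10                                     # chosen bin width in years

r <- range(ages, na.rm = TRUE)
breaks <- seq(floor(r[1] / width) * width,      # round the minimum down to a multiple of width
              ceiling(r[2] / width) * width,    # round the maximum up to a multiple of width
              by = width)
breaks                                          # 0 10 20 30 40 50 60 70 80
hist(ages, breaks = breaks)
```

Rounding outwards matters because hist() stops with an error if any value falls outside the supplied breaks.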

4 Histogram of Titanic passenger ages with axis labels

Add a title, change the horizontal axis label and color of the histogram.

hist(titanic$age, breaks=c(seq(0,80,10)), col="darkgreen",
     main="Ages of Titanic Passengers", xlab="age")

Histogram of Titanic passenger ages with axes and data labels

You can also set labels=TRUE to label the exact number in each bin.

hist(titanic$age, breaks=c(seq(0,80,10)), col="darkgreen",
     main="Ages of Titanic Passengers", labels=TRUE, xlab="age")
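If you want the bin counts as numbers rather than as labels on a plot, one approach is to bin the values with cut() and tabulate them. The ages below are invented for illustration.

```r
ages <- c(0, 3, 18, 27, 33, 45, 52, 74)   # invented values

# include.lowest = TRUE matches hist(), which places a value equal to the
# first break in the first bin.
bins <- cut(ages, breaks = seq(0, 80, 10), include.lowest = TRUE)
table(bins)
```

The counts agree with what hist() reports for the same breaks, so the table can be used to check a plot or to quote exact figures.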

Boxplots

Boxplots are a nifty way of displaying the distribution of a dataset based on its five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. They can tell you

  • about outliers in the dataset
  • if your data are symmetrical or skewed
  • how tightly your data are grouped

They are also a good way to compare distributions between groups of data. R makes it easy to make boxplots.
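The numbers behind a boxplot can be inspected directly. By default, R's boxplot() draws the whiskers to the most extreme points lying within 1.5 times the box length of the box and plots anything further out individually, so the whisker ends equal the minimum and maximum only when there are no such points. The values below are invented to show one.

```r
x <- c(2, 4, 5, 6, 7, 8, 9, 11, 40)   # invented values with one extreme point

fivenum(x)                  # minimum, lower hinge, median, upper hinge, maximum
stats <- boxplot.stats(x)
stats$stats                 # whisker ends, hinges, and median as drawn by boxplot()
stats$out                   # points beyond the whiskers, drawn individually
```

Here the maximum is 40, but the upper whisker stops at 11 and 40 is reported as an outlier.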

You could ask “Does survival depend on ticket price?”

5 Ticket price

Begin by adding a column total_price to the data frame for the total ticket price, as demonstrated in the videos.

# Convert pounds, shillings, and pence to decimal pounds (20 shillings or 240 pence per pound)
titanic$total_price <- titanic$pnd + titanic$shl/20 + titanic$pnc/240
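The conversion rests on the pre-decimal British currency: 1 pound = 20 shillings, and 1 shilling = 12 pence, hence 240 pence per pound. A small worked check, using a helper named to_pounds that is hypothetical and not part of the course material:

```r
# Hypothetical helper: pre-decimal currency to decimal pounds.
# 1 pound = 20 shillings = 240 pence.
to_pounds <- function(pnd, shl, pnc) pnd + shl / 20 + pnc / 240

to_pounds(7, 5, 0)     # 7 pounds 5 shillings   -> 7.25
to_pounds(0, 0, 240)   # 240 pence              -> 1
to_pounds(26, 11, 0)   # 26 pounds 11 shillings -> 26.55
```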

6 Boxplot of ticket prices by survival

Use a boxplot to compare the ticket prices of the survivors with those of the deceased.

boxplot(titanic$total_price ~ titanic$survived, ylab="Ticket price", xlab="Survival Status")

Boxplot with sqrt scaling of vertical axis

The outliers make the boxplot hard to read. One way to deal with this problem is to transform the data. Let’s add a new column with the square root of the values:

titanic$total_price_sqrt <- sqrt(titanic$total_price)
boxplot(titanic$total_price_sqrt ~ titanic$survived,
        ylab="Sqrt(Ticket price)", xlab="Survival Status")
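To see why the square root helps, compare how far the largest value sits from the median before and after the transform. The prices here are invented.

```r
prices <- c(4, 9, 16, 25, 400)   # invented prices with one extreme value

max(prices) / median(prices)               # 25: the extreme value dwarfs the rest
max(sqrt(prices)) / median(sqrt(prices))   # 5: far less dominant after the transform
```

The order of the values is unchanged, so medians and quartiles still correspond to the same passengers, but the vertical axis is now in square-root units and should be labelled as such.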

7 Boxplot of ages of survivors by gender

What is the distribution of ages of those who survived, comparing men and women?

titanic_survivors <- titanic[titanic$survived, ]  # survived is logical, so it can index rows directly
boxplot(titanic_survivors$age ~ titanic_survivors$gender,
        ylab="Age of survivor", xlab="Sex",
        main="Ages of male and female survivors")

8 Boxplot of ages of survivors by gender

What can we learn from this boxplot?

  • Median ages of men and women similar: ~29.
  • Both distributions approximately symmetric.
  • Distribution of men’s ages less spread out.

Why might the distribution of men’s ages be less spread? How could you find out?
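One way to start investigating is to look at how many survivors are in each group and to put a number on the spread. The data frame below is an invented stand-in for titanic_survivors, and the gender labels "male" and "female" are assumptions about how the column is coded.

```r
# Invented stand-in for titanic_survivors; only the column names come from the notes.
toy <- data.frame(
  age    = c(22, 30, 28, 35, 4, 45, 29, 60, NA, 18),
  gender = c("male", "male", "male", "male", "female",
             "female", "female", "female", "female", "female")
)

table(toy$gender)                                # how many survivors in each group
tapply(toy$age, toy$gender, IQR, na.rm = TRUE)   # spread of the middle half of each group
tapply(toy$age, toy$gender, sd,  na.rm = TRUE)   # standard deviation per group
```

Group sizes are worth checking first, since a much smaller group can show a narrower range simply because it contains fewer observations.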

9 Boxplot of survivors’ ticket prices by class

What is the distribution of ticket prices for those survivors by class?

boxplot(titanic_survivors$total_price_sqrt ~ titanic_survivors$class,
        ylab="Sqrt(Ticket price)", xlab="Class",
        main="Survivors' ticket prices by class")

10 Boxplot of survivors’ ticket prices by class

What can we learn from this boxplot?

  • Among passengers, 1st class survivors paid the most, followed by 2nd, then 3rd.
  • Ticket prices of 1st class survivors varied the most.
  • Many 3rd class passengers paid the same for their tickets: the median.
  • (Some) crew paid for their tickets.

BONUS 1

Try checking what fraction of the crew have a ticket price. Is it right to be analyzing the crew’s ticket prices?

Perhaps we should hide the crew from the last box plot, or relabel them as “Paying Crew”.
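A possible sketch of both steps follows. It rests on assumptions that should be checked against the real data: that crew rows carry the label "crew" in the class column, and that an unrecorded price is stored as NA rather than 0. The rows are invented.

```r
# Invented stand-in; the class labels ("1st", "3rd", "crew") are assumptions about the coding.
toy <- data.frame(
  class       = c("1st", "3rd", "crew", "crew", "crew", "crew"),
  total_price = c(30, 7.25, NA, NA, 2, NA)
)

is_crew <- toy$class == "crew"
mean(!is.na(toy$total_price[is_crew]))   # fraction of crew with a recorded price: 0.25

passengers <- toy[!is_crew, ]            # drop crew before plotting prices by class
boxplot(passengers$total_price ~ factor(passengers$class),
        ylab = "Ticket price", xlab = "Class")
```

Running table(titanic$class) on the real data first would show the actual labels to filter on.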

BONUS 2

You may have noticed that some of the ticket prices seem to be much higher than others. Why might this be the case? Take a look at the pax_on_tckt column in the data.

The pax_on_tckt column is the total number of people travelling on the ticket, so the price per person can be calculated by dividing the ticket price by this column.

Create a new column price_per holding the price per person, then repeat the boxplots above using it.
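A minimal sketch of the calculation on invented rows, using the column names total_price and pax_on_tckt from these notes:

```r
# Invented stand-in rows; only the column names come from the notes.
toy <- data.frame(total_price = c(7.25, 26.55, 150, 8.05),
                  pax_on_tckt = c(1, 2, 5, 1))

toy$price_per <- toy$total_price / toy$pax_on_tckt   # price per person on the ticket
toy$price_per                                        # 7.250 13.275 30.000 8.050
```

The division is element-wise, so each row's price is divided by that row's passenger count.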

Takeaway

  • Summary statistics are a helpful first step in any data exploration.
  • Before doing any fancy models or analysis on the data, it’s absolutely critical to first explore the data and make sure that it agrees with your instincts and any a priori knowledge about the dataset.
  • There are several ways of doing this exploration, including examining the distribution of the data through histograms, boxplots, and summary statistics, like mean, median, standard deviation and interquartile range.
  • Initial exploration of data can help you identify features of the data that might be worth further investigation.
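The summary statistics named above are all available as base R functions. A short sketch on invented values with one missing entry:

```r
x <- c(22, 35, NA, 4, 61, 29)   # invented values with one missing entry

summary(x)                # minimum, quartiles, mean, maximum, and the NA count
mean(x, na.rm = TRUE)     # 30.2
median(x, na.rm = TRUE)   # 29
sd(x, na.rm = TRUE)       # standard deviation of the known values
IQR(x, na.rm = TRUE)      # 13: distance between the first and third quartiles
```

summary() handles the missing value by itself and reports how many there were; the other functions need na.rm=TRUE.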