2021-08-30

Introduction

Prep reading & today’s activity 1 / 2

The textbook shows examples where splitting data into groups gives more insight. Today’s activity goes one step further: what if within each group there are subgroups that affect the distribution?

Prep reading & today’s activity 2 / 2

Specifically, we will explore the following questions.

  • How informative is a direct comparison of the overall lung cancer incidence rates of two countries?
  • Their populations are likely to differ by factors (specifically: age) that affect incidence rates.
  • How can we take these differences into account in our comparison between countries?

Load the data as a data frame

In RStudio, start a new project in a directory that contains the data set lung_cancer.csv. Also open a new R script file.

Load the data.

lung_cancer <- read.csv("lung_cancer.csv")

To ease exchange of code, let’s all use the same variable name lung_cancer for the data frame.
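A quick check right after loading can catch a wrong working directory or unexpected column names early. The sketch below writes a tiny, made-up CSV file to a temporary location so that it runs anywhere. The age class labels and the numbers are hypothetical; the column names are the ones used by the code later in this activity.

```r
# Columns that the later slides rely on.
expected_columns <- c("Country", "AgeClass", "Cases", "Population")

# Hypothetical stand-in for lung_cancer.csv (made-up values).
toy_file <- tempfile(fileext = ".csv")
writeLines(c("Country,AgeClass,Cases,Population",
             "UK,0-14,1,1000",
             "UK,15-39,20,2000"),
           toy_file)

toy <- read.csv(toy_file)

# Inspect the structure and confirm that no expected column is missing.
str(toy)
missing_columns <- setdiff(expected_columns, names(toy))
stopifnot(length(missing_columns) == 0)
```

With the real data, the same two checks can be applied to lung_cancer instead of toy.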

Objective of this activity

Globally, lung cancer is one of the most common types of cancer: an estimated 1.8 million new cases, or about 12.9% of all new cancer cases, occurred in 2012 alone. The rates of lung cancer incidence may differ among countries because of factors such as levels of air pollution and smoking.

In this activity we will try to figure out if someone from, say, Viet Nam has a higher probability of getting lung cancer than someone from the UK. We will compare the following eight countries: Viet Nam, Singapore, the UK, Ethiopia, Austria, China, Georgia and the Philippines.

“Crude” incidence rate

As part of your preparation for this activity, you made this bar chart.

Do you find the results shown by the bar chart surprising?
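For reference, here is one possible way to compute the crude rates behind such a bar chart. It is a sketch only and not necessarily how the prep work did it. The data frame toy and its values are hypothetical; the column names follow those used for lung_cancer in this activity.

```r
# Hypothetical toy data; with the real data, use lung_cancer instead of toy.
toy <- data.frame(
  Country    = rep(c("A", "B"), each = 2),
  AgeClass   = rep(c("young", "old"), times = 2),
  Cases      = c(10, 90, 5, 300),
  Population = c(80000, 20000, 50000, 50000)
)

# Sum cases and population over all age groups within each country.
totals <- aggregate(cbind(Cases, Population) ~ Country, data = toy, FUN = sum)

# Crude incidence rate per 100,000: total cases divided by total population.
totals$Crude <- 100000 * totals$Cases / totals$Population
print(totals)
```

The column totals$Crude can then be passed to barplot() with names.arg = totals$Country.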

Activity 1: Incidence rate by country and age group

Incidence rate by country and age group

  • Calculate the incidence rates by age group for at least two of the countries: UK, Viet Nam, Austria, China, Georgia, Philippines, Singapore, and Ethiopia
  • Plot the data in a bar chart. Add a title and a \(y\)-axis label to the plot. Add the corresponding age group as a label below each bar.

Incidence rate by country and age group 1 / 5

Let’s take the UK as an example.

uk <- lung_cancer[lung_cancer$Country == "UK", ]

uk$Incidence <- 100000 * uk$Cases / uk$Population

barplot(uk$Incidence,
        names.arg = uk$AgeClass,
        main = "UK Lung Cancer Incidence by Age Group",
        ylab = "Incidence rate per 100,000",
        col="violetred3",
        las = 2)
grid()

Incidence rate by country and age group 2 / 5

Here is the result of the code on the previous slide.

Incidence rate by country and age group 3 / 5

country <- "Singapore"
country_lc <- lung_cancer[lung_cancer$Country == country, ]
country_lc$Incidence <- 100000 * country_lc$Cases / country_lc$Population
barplot(country_lc$Incidence, names.arg = country_lc$AgeClass,
        main = paste(country,"Lung Cancer Incidence by Age Group"),
        ylab = "Incidence rate per 100,000", las = 2)

Incidence rate by country and age group 4 / 5

Here are all the bar plots combined into one figure so that we can compare them more easily.
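The code that produced the combined figure is not shown on the slides. A minimal sketch of one way to draw such a figure follows, assuming a multi-panel layout set with par(mfrow = ...). The toy data are hypothetical and contain only two countries; for the eight countries in lung_cancer, a layout such as c(2, 4) would fit.

```r
# Hypothetical toy data; with the real data, use lung_cancer instead of toy.
toy <- data.frame(
  Country    = rep(c("A", "B"), each = 2),
  AgeClass   = rep(c("young", "old"), times = 2),
  Cases      = c(10, 90, 5, 300),
  Population = c(80000, 20000, 50000, 50000)
)
toy$Incidence <- 100000 * toy$Cases / toy$Population

countries <- unique(toy$Country)

# One panel per country, side by side; remember the old settings.
old_par <- par(mfrow = c(1, length(countries)))
for (country in countries) {
  country_lc <- toy[toy$Country == country, ]
  barplot(country_lc$Incidence, names.arg = country_lc$AgeClass,
          main = country, ylab = "Incidence rate per 100,000",
          ylim = c(0, max(toy$Incidence)),  # same y-axis in every panel
          las = 2)
}
par(old_par)  # restore the previous graphics settings
```

A shared ylim makes the panels directly comparable. A for loop is used here rather than lapply() so that no list of return values is printed.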

Incidence rate by country and age group 5 / 5

The bar plots show that

  • incidence varies strongly between countries,
  • the incidence rate tends to increase with age.

Can you think of reasons why this might be the case? Hans Rosling gave some hints in his video tutorial (https://www.youtube.com/watch?v=QBht72_PA-4).

Activity 2: Age structure

Age structure

In our prep work for this lesson, we combined all age groups to find a country-wide incidence rate. But this approach is very crude: the result of activity 1 shows that the incidence rate depends strongly on age.

  • Work with the data of at least two of the countries: UK, Viet Nam, Austria, China, Georgia, Philippines, Singapore, and Ethiopia
  • Calculate what proportion of each country is in each age group
  • Plot the data in a bar chart.

Age structure solution 1 / 4

Again, let’s take the UK as an example. We assume that we have already done the subsetting as in the solution to our previous activity. In particular, we assume that the data frame uk is already in our environment.

uk_total_population <- sum(uk$Population)

barplot(uk$Population / uk_total_population,
        names.arg = uk$AgeClass,
        main = "UK Age Distribution",
        ylab = "Relative population",
        col="darkorchid3",
        las = 2)
grid()

Age structure solution 2 / 4

Age structure solution 3 / 4

country <- "Singapore"
country_lc <- lung_cancer[lung_cancer$Country == country, ]
country_total_population <- sum(country_lc$Population)
barplot(country_lc$Population / country_total_population,
        names.arg = country_lc$AgeClass, main = paste(country,"Age Distribution"),
        ylab = "Relative population", las = 2)
grid()

Age structure solution 4 / 4

Here are all the bar plots combined into one figure so that we can compare them more easily.
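Instead of subsetting one country at a time, the age proportions for all countries can be computed in one step. The sketch below uses base R's ave() for this; the toy data are hypothetical, and the column names follow lung_cancer.

```r
# Hypothetical toy data; with the real data, use lung_cancer instead of toy.
toy <- data.frame(
  Country    = rep(c("A", "B"), each = 2),
  AgeClass   = rep(c("young", "old"), times = 2),
  Population = c(80000, 20000, 50000, 50000)
)

# ave() applies the function within each country and returns a vector
# aligned with the rows of toy.
toy$Proportion <- ave(toy$Population, toy$Country,
                      FUN = function(p) p / sum(p))

# Sanity check: within each country the proportions should add up to 1.
proportion_sums <- tapply(toy$Proportion, toy$Country, sum)
print(toy)
```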

Class discussion: How should we measure incidence?

Class discussion

  • Consider the implications of the results of activities 1 and 2. What could be an important factor in determining how the total incidence rates vary among these eight countries? Discuss.

  • Given the foregoing discussion, would you amend the analyses in Challenge 1? How?

Implications of different age structures

Age structure differs among countries, with a high percentage of older people in high income countries such as the UK and a very young population in countries like Ethiopia or even the Philippines.

You can compare the age structure of all countries in the world at https://www.populationpyramid.net/world/2012. In many countries, the population is aging while birth rates are falling. So it might be that some of the difference in total lung cancer incidence rates across countries is due to their different underlying demographic structure.

Ideally, we would compute a “corrected” or “age-standardized” incidence rate that accounts for differences in the countries’ age structure. We’re going to use the UK as the “standard” population because it has the highest crude lung cancer incidence rate, at least among the eight countries considered. Alternatively, if we had the data, we could use the 2012 world population for standardization.

Comparing countries with different age structure

Let’s compare two hypothetical countries \(A\) and \(B\).

Suppose there were only two age groups and

  • the incidence rates were
    • low for the young group,
    • high for the old group, but
  • identical for both age groups in country \(A\) and country \(B\).

If \(A\) has a higher proportion of elderly, \(A\) will have a higher overall incidence rate.

But is it a “fair” comparison between \(A\) and \(B\)?
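A small numerical illustration of this thought experiment follows. All numbers are made up for this sketch: both countries have exactly the same age-specific rates, yet the overall rates differ because \(A\) has a larger share of old people.

```r
# Age-specific incidence per 100,000, identical in A and B (hypothetical).
rate_per_100k <- c(young = 5, old = 400)

# Hypothetical populations: A is older than B.
pop_A <- c(young = 600000, old = 400000)
pop_B <- c(young = 900000, old = 100000)

# Number of cases implied by the rates and the populations.
cases_A <- rate_per_100k * pop_A / 100000
cases_B <- rate_per_100k * pop_B / 100000

# Overall (crude) incidence per 100,000.
overall_A <- 100000 * sum(cases_A) / sum(pop_A)
overall_B <- 100000 * sum(cases_B) / sum(pop_B)
print(c(A = overall_A, B = overall_B))
```

Here the overall rate is 163 per 100,000 in \(A\) and 44.5 per 100,000 in \(B\), although a person of a given age has the same risk in both countries.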

Age-adjusted incidence 1 / 2

What would be the incidence rate in country \(A\) if, in all age groups, it had the same fraction of the population as country \(B\)? Let’s introduce some notation.

  • \(c_{A, \text{age}}\): the number of cancer cases in \(A\) in a given age group.
  • \(p_{B, \text{age}}\): the population in \(B\) in the same age group.

The number of cases in this age group that would occur in country \(A\) if it had the population of country \(B\) is

\[ \tilde{c}_{A, \text{age}} = \frac{p_{B, \text{age}}} {p_{A, \text{age}}} \times c_{A, \text{age}}\ . \qquad \qquad \qquad (\text{I}) \]

In other words, if \(p_{B, \text{age}} = 2 \times p_{A, \text{age}}\), the number of cases in this age group that would occur in country \(A\) if it had the population of country \(B\) is twice the observed number.

Age-adjusted incidence 2 / 2

The “age-adjusted” incidence rate in country \(A\) is

\[ \frac{\sum_{\text{age}} \tilde{c}_{A, \text{age}}} {\sum_\text{age} p_{B, \text{age}}} \times 100\,000\ . \qquad \qquad \qquad (\text{II}) \]

The unadjusted incidence rate would have

  • \(c_{A, \text{age}}\) in the numerator instead of \(\tilde{c}_{A, \text{age}}\),
  • \(p_{A, \text{age}}\) in the denominator instead of \(p_{B, \text{age}}\).
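Equations \((\text{I})\) and \((\text{II})\) can be tried out on a toy example. The numbers below are hypothetical and chosen so that \(A\) and \(B\) have identical age-specific rates; the variable names are for illustration only.

```r
# Hypothetical observed cases and populations in two age groups.
cases_A <- c(young = 30, old = 1600)
pop_A   <- c(young = 600000, old = 400000)
cases_B <- c(young = 45, old = 400)
pop_B   <- c(young = 900000, old = 100000)

# Equation (I): cases in A if each age group had the population of B.
cases_A_if_B <- pop_B / pop_A * cases_A

# Equation (II): age-adjusted incidence rate of A per 100,000.
adjusted_A <- 100000 * sum(cases_A_if_B) / sum(pop_B)

# Unadjusted (crude) rates for comparison.
unadjusted_A <- 100000 * sum(cases_A) / sum(pop_A)
unadjusted_B <- 100000 * sum(cases_B) / sum(pop_B)
print(c(adjusted_A = adjusted_A,
        unadjusted_A = unadjusted_A,
        unadjusted_B = unadjusted_B))
```

The crude rate of \(A\) (163) is much higher than that of \(B\) (44.5), but after adjustment to the population of \(B\), the rate of \(A\) is also 44.5. This is as it should be when the age-specific rates are identical.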

Activity 3: Age adjusted lung cancer incidence

Age adjusted lung cancer incidence: preparation

Append a column Population_UK to the lung_cancer data frame with

uk_population <- lung_cancer$Population[lung_cancer$Country == "UK"]

lung_cancer$Population_UK <- uk_population

This column contains the value of \(p_{B, \text{age}}\) in Equation \((\text{I})\).

In the second line, we’re taking advantage of R’s recycling rule: the shorter vector uk_population is repeated as often as necessary to fill the entire length of the column lung_cancer$Population_UK. This gives the correct values only if every country lists the same age groups in the same order as the UK.
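The assignment above repeats uk_population down the rows, so it relies on the order of the rows. The slides do not state how lung_cancer.csv is ordered. If in doubt, the UK population can be looked up by age class instead. The sketch below uses match() on hypothetical toy data in which the second country lists its age classes in a different order.

```r
# Hypothetical toy data; country X lists its age classes in reverse order.
toy <- data.frame(
  Country    = c("UK", "UK", "X", "X"),
  AgeClass   = c("young", "old", "old", "young"),
  Population = c(40000, 60000, 20000, 80000)
)

uk <- toy[toy$Country == "UK", ]

# For each row, find the UK row with the same age class.
toy$Population_UK <- uk$Population[match(toy$AgeClass, uk$AgeClass)]

# What plain recycling would have produced (wrong for country X here).
recycled <- rep(uk$Population, length.out = nrow(toy))
print(cbind(toy, recycled))
```

When the rows are ordered consistently, both approaches give the same column.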

Age adjusted lung cancer incidence

  1. Append another column Cases_if_UK that contains the hypothetical number of cases if the country in the corresponding row had the population of the UK in this age group.

    This column contains the values of \(\tilde{c}_{A, \text{age}}\) in Equation \((\text{I})\).

  2. Use the aggregate() function to calculate the total number of cases that the country would have if it had the population and age structure of the UK.

    The resulting column contains the numerator of Equation \((\text{II})\).

Age adjusted lung cancer incidence: solutions 0-2

# Preparation: Append population of UK in corresponding age group.
uk_population <- lung_cancer$Population[lung_cancer$Country == "UK"]
lung_cancer$Population_UK <- uk_population

# 1: How many cases would there be if the population had the age
#    structure of the UK?
lung_cancer$Cases_if_UK <-
  (lung_cancer$Population_UK / lung_cancer$Population) * lung_cancer$Cases

# 2: Total number of cases in each country if it had the population and
#    age structure of the UK.
adjusted <- aggregate(Cases_if_UK ~ Country, data = lung_cancer, sum)

Age adjusted lung cancer incidence

  3. Calculate the age-adjusted incidence rate defined by Equation \((\text{II})\).

  4. Make a bar chart of the age-adjusted incidence rate (one bar for each country).

  5. Compare this plot with the bar chart for the overall (i.e. unadjusted) incidence rate that you made during the prep work for today. Are there any noteworthy changes?

Age adjusted lung cancer incidence: solution 3

# Preparation: Append population of UK in corresponding age group.
uk_population <- lung_cancer$Population[lung_cancer$Country == "UK"]
lung_cancer$Population_UK <- uk_population

# 1: How many cases would there be if the population had the age
#    structure of the UK?
lung_cancer$Cases_if_UK <-
  (lung_cancer$Population_UK / lung_cancer$Population) * lung_cancer$Cases

# 2: Total number of cases in each country if it had the population and
#    age structure of the UK.
adjusted <- aggregate(Cases_if_UK ~ Country, data = lung_cancer, sum)

# 3: Append a column with the age adjusted incidence rate.
adjusted$Incidence <- adjusted$Cases_if_UK / sum(uk_population) * 100000

Age adjusted lung cancer incidence: solution 4

barplot(adjusted$Incidence, names.arg = adjusted$Country,
        main = "Age-adjusted Lung Cancer Incidence", ylab = "Incidence rate per 100,000",
        col="tan2", las = 2)
grid()

Age adjusted lung cancer incidence: solution 5 (graph)

Age adjusted lung cancer incidence: solution 5 (Discussion)

Because most countries have a younger population than the UK, age-adjustment leads to higher incidence rates in most cases.

The age-adjusted incidence rate of China is strikingly high. China has a very high incidence in the oldest age group, which is (still) a relatively small fraction of China’s population.
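To compare the two charts more systematically, crude and age-adjusted rates can be put side by side in one table. The following is a sketch on hypothetical toy data; the country X, all numbers, and the column names Crude, Adjusted and Ratio are made up for illustration.

```r
# Hypothetical toy data: the UK and a younger country X.
toy <- data.frame(
  Country    = rep(c("UK", "X"), each = 2),
  AgeClass   = rep(c("young", "old"), times = 2),
  Cases      = c(5, 300, 10, 90),
  Population = c(50000, 50000, 80000, 20000)
)

# Hypothetical cases if each age group had the UK population. The recycling
# assumes that the age classes are in the same order in every country.
uk_population <- toy$Population[toy$Country == "UK"]
toy$Population_UK <- uk_population
toy$Cases_if_UK <- toy$Population_UK / toy$Population * toy$Cases

# Crude rate per country.
crude <- aggregate(cbind(Cases, Population) ~ Country, data = toy, FUN = sum)
crude$Crude <- 100000 * crude$Cases / crude$Population

# Age-adjusted rate per country.
adjusted <- aggregate(Cases_if_UK ~ Country, data = toy, FUN = sum)
adjusted$Adjusted <- 100000 * adjusted$Cases_if_UK / sum(uk_population)

# Side-by-side table; a ratio above 1 means that adjustment raised the rate.
comparison <- merge(crude[, c("Country", "Crude")],
                    adjusted[, c("Country", "Adjusted")], by = "Country")
comparison$Ratio <- comparison$Adjusted / comparison$Crude
print(comparison)
```

In this toy example, the rate of the UK is unchanged because the UK is the standard population, while the rate of the younger country X rises from 100 to 231.25 per 100,000.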

Recap

Recap

Why do the corrected incidence rates provide a better answer to the main question of this activity?

“Does someone from one country (UK) have a higher chance of getting lung cancer than someone from another country (Viet Nam)?”

Answer

The corrected or adjusted incidence rates can be compared among countries because the effect of differing age structures has been taken out. If a country’s adjusted rate is very different from its crude (i.e. overall) rate, its position in the ranking of countries may change substantially.

For example, in Viet Nam, the proportion of very young people, with almost no cases of lung cancer, is very high. By contrast, Singapore has more old people, a group with a much higher incidence rate of lung cancer. So the high crude rate in Singapore is strongly driven by the larger weight of the older age groups.

Therefore, epidemiologists generally use age-adjusted rates to compare countries or other groups. The adjusted rate is a weighted average of the age-specific incidence rates, with weights given by the proportions of persons in the corresponding age groups of a standard population. The potential confounding effect of age is reduced when comparing age-adjusted rates computed using the same standard population.
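The weighted-average description and Equation \((\text{II})\) are two ways of writing the same quantity. Here is a short derivation, in which the age-specific rate \(r_{A, \text{age}}\) and the weight \(w_{\text{age}}\) are shorthand introduced only for this purpose:

\[ r_{A, \text{age}} = \frac{c_{A, \text{age}}} {p_{A, \text{age}}}\ , \qquad w_{\text{age}} = \frac{p_{B, \text{age}}} {\sum_{\text{age}'} p_{B, \text{age}'}}\ . \]

Inserting Equation \((\text{I})\), that is \(\tilde{c}_{A, \text{age}} = p_{B, \text{age}} \times r_{A, \text{age}}\), into Equation \((\text{II})\) gives

\[ \frac{\sum_{\text{age}} \tilde{c}_{A, \text{age}}} {\sum_\text{age} p_{B, \text{age}}} \times 100\,000 = \frac{\sum_{\text{age}} p_{B, \text{age}} \times r_{A, \text{age}}} {\sum_\text{age} p_{B, \text{age}}} \times 100\,000 = \sum_{\text{age}} w_{\text{age}} \times r_{A, \text{age}} \times 100\,000\ . \]

The weights \(w_{\text{age}}\) depend only on the standard population \(B\), which in this activity is the UK.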