2021-08-30
The textbook shows examples where splitting data into groups gives more insight. Today’s activity goes one step further: what if within each group there are subgroups that affect the distribution?
Specifically, we will explore the following questions.
In RStudio, start a new project in a directory that contains the data set lung_cancer.csv. Also open a new R script file.
Load the data.
lung_cancer <- read.csv("lung_cancer.csv")
To make it easier to exchange code, let’s all use the same variable name, lung_cancer, for the data frame.
Globally, lung cancer is one of the most common types of cancer with estimates of about 1.8 million cases or some 12.9% of all new cases of cancer in 2012 alone. The rates of lung cancer incidence may differ among countries because of factors such as levels of air pollution and smoking.
In this activity we will try to figure out if someone from, say, Viet Nam has a higher probability of getting lung cancer than someone from the UK. We will compare the following eight countries: Viet Nam, Singapore, the UK, Ethiopia, Austria, China, Georgia and the Philippines.
As part of your preparation for this activity, you made this bar chart.
Do you find the results shown by the bar chart surprising?
Let’s take the UK as an example.
uk <- lung_cancer[lung_cancer$Country == "UK", ]
uk$Incidence <- 100000 * uk$Cases / uk$Population
barplot(uk$Incidence,
        names.arg = uk$AgeClass,
        main = "UK Lung Cancer Incidence by Age Group",
        ylab = "Incidence rate per 100,000",
        col = "violetred3",
        las = 2)
grid()
Here is the result of the code on the previous slide.
country <- "Singapore"
country_lc <- lung_cancer[lung_cancer$Country == country, ]
country_lc$Incidence <- 100000 * country_lc$Cases / country_lc$Population
barplot(country_lc$Incidence,
        names.arg = country_lc$AgeClass,
        main = paste(country, "Lung Cancer Incidence by Age Group"),
        ylab = "Incidence rate per 100,000",
        las = 2)
Here are all the bar plots combined into one figure so that we can compare them more easily.
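One way such a combined figure could be drawn is with par(mfrow = ...) and a for loop over the countries. The sketch below is only an illustration: it uses a small made-up data frame called toy (a hypothetical name, with invented numbers) in place of lung_cancer, and assumes the same column names as above.

```r
# Hypothetical toy data with the same column names as lung_cancer.
# All numbers are invented purely for illustration.
toy <- data.frame(
  Country    = rep(c("A", "B"), each = 3),
  AgeClass   = rep(c("0-39", "40-64", "65+"), times = 2),
  Cases      = c(10, 200, 900, 5, 150, 400),
  Population = c(5e6, 3e6, 2e6, 8e6, 3e6, 1e6)
)
toy$Incidence <- 100000 * toy$Cases / toy$Population

countries <- unique(toy$Country)

# One panel per country, arranged in a single row.
old_par <- par(mfrow = c(1, length(countries)))
for (country in countries) {
  country_lc <- toy[toy$Country == country, ]
  barplot(country_lc$Incidence,
          names.arg = country_lc$AgeClass,
          main = paste(country, "Incidence by Age Group"),
          ylab = "Incidence rate per 100,000",
          ylim = c(0, max(toy$Incidence)),  # common y-axis eases comparison
          las = 2)
}
par(old_par)  # restore the previous plot layout
```

For the eight countries in this activity, a layout such as par(mfrow = c(2, 4)) would give one panel per country.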
The bar plots show that
Can you think of reasons why this might be the case? Hans Rosling gave some hints in his video tutorial (https://www.youtube.com/watch?v=QBht72_PA-4).
In our prep work for this lesson, we combined all age groups to find a country-wide incidence rate. But this approach is very crude. The result of activity 1 shows that the incidence rate depends strongly on the age.
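As a reminder of what "combining all age groups" means, here is a minimal sketch of a crude (country-wide) incidence rate. It is not the prep-work solution itself: the data frame toy and its numbers are invented for illustration, and only the column names are assumed to match lung_cancer.

```r
# Hypothetical toy data: two countries, two age groups.
# All numbers are invented purely for illustration.
toy <- data.frame(
  Country    = rep(c("A", "B"), each = 2),
  AgeClass   = rep(c("young", "old"), times = 2),
  Cases      = c(10, 400, 30, 200),
  Population = c(1e6, 1e6, 3e6, 5e5)
)

# Sum cases and population over all age groups within each country.
totals <- aggregate(cbind(Cases, Population) ~ Country, data = toy, sum)

# Crude incidence rate per 100,000: total cases / total population.
totals$Crude <- 100000 * totals$Cases / totals$Population
print(totals)
```

The crude rate is a single number per country, so any information about how the cases are spread over the age groups is lost.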
Again, let’s take the UK as an example. We assume that the subsetting has already been done as in the solution to the previous activity. In particular, we assume that the data frame uk is already in our environment.
uk_total_population <- sum(uk$Population)
barplot(uk$Population / uk_total_population,
        names.arg = uk$AgeClass,
        main = "UK Age Distribution",
        ylab = "Relative population",
        col = "darkorchid3",
        las = 2)
grid()
country <- "Singapore"
country_lc <- lung_cancer[lung_cancer$Country == country, ]
country_total_population <- sum(country_lc$Population)
country_lc$Incidence <- 100000 * country_lc$Cases / country_lc$Population
barplot(country_lc$Population / country_total_population,
        names.arg = country_lc$AgeClass,
        main = paste(country, "Age Distribution"),
        ylab = "Relative population",
        las = 2)
grid()
Here are all the bar plots combined into one figure so that we can compare them more easily.
Consider the implications of the results of activities 1 and 2. What could be an important factor in determining how the total incidence rates vary among these eight countries? Discuss.
Given the foregoing discussion, would you amend the analyses in Challenge 1? How?
Age structure differs among countries, with a high percentage of older people in high income countries such as the UK and a very young population in countries like Ethiopia or even the Philippines.
You can compare the age structure of all countries in the world at https://www.populationpyramid.net/world/2012. In many countries, the population is aging while birth rates are falling. So it might be that some of the difference in total lung cancer incidence rates across countries is due to their different underlying demographic structure.
Ideally, we would compute a “corrected” or “age-standardized” incidence rate that accounts for the differences in the countries’ demographic structure. We’re going to use the UK as the “standard” population because it has the highest crude lung cancer incidence rate among the eight countries considered. Alternatively, if we had the data, we could use the 2012 world population for standardization.
Let’s compare two hypothetical countries \(A\) and \(B\).
Suppose there were only two age groups and
If \(A\) has a higher proportion of elderly, \(A\) will have a higher overall incidence rate.
But is it a “fair” comparison between \(A\) and \(B\)?
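What would be the incidence rate in country \(A\) if, in all age groups, it had the same fraction of the population as country \(B\)? Let’s introduce some notation. For a given age group, let \(c_{A, \text{age}}\) be the observed number of cases in country \(A\), and let \(p_{A, \text{age}}\) and \(p_{B, \text{age}}\) be the populations of countries \(A\) and \(B\) in that age group.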
The number of cases in this age group that would occur in country \(A\) if it had the population of country \(B\) is
\[ \tilde{c}_{A, \text{age}} = \frac{p_{B, \text{age}}} {p_{A, \text{age}}} \times c_{A, \text{age}}\ . \qquad \qquad \qquad (\text{I}) \]
In other words, if \(p_{B, \text{age}} = 2 \times p_{A, \text{age}}\), the number of cases in this age group that would occur in country \(A\) if it had the population of country \(B\) is twice the observed number.
The “age-adjusted” incidence rate in country \(A\) is
\[ \frac{\sum_{\text{age}} \tilde{c}_{A, \text{age}}} {\sum_\text{age} p_{B, \text{age}}} \times 100\,000\ . \qquad \qquad \qquad (\text{II}) \]
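The unadjusted incidence rate would have \(c_{A, \text{age}}\) instead of \(\tilde{c}_{A, \text{age}}\) in the numerator and \(p_{A, \text{age}}\) instead of \(p_{B, \text{age}}\) in the denominator.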
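To make Equations \((\text{I})\) and \((\text{II})\) concrete, here is a small worked sketch with two hypothetical countries and two age groups. All numbers are invented. As an assumption for this illustration, both countries have exactly the same age-specific incidence rates, and \(A\) simply has a larger share of older people.

```r
# Hypothetical example: two age groups ("young", "old"), invented numbers.
# Assumption: both countries share the same age-specific rates,
# 10 per 100,000 (young) and 400 per 100,000 (old).
pop_A   <- c(young = 1e6, old = 1e6)   # A: half of the population is old
pop_B   <- c(young = 3e6, old = 5e5)   # B: mostly young
cases_A <- c(young = 100, old = 4000)
cases_B <- c(young = 300, old = 2000)

# Crude (unadjusted) rates per 100,000.
crude_A <- 100000 * sum(cases_A) / sum(pop_A)
crude_B <- 100000 * sum(cases_B) / sum(pop_B)

# Equation (I): cases A would have with B's population in each age group.
cases_A_if_B <- pop_B / pop_A * cases_A

# Equation (II): age-adjusted rate of A, using B as the standard population.
adjusted_A <- 100000 * sum(cases_A_if_B) / sum(pop_B)

print(c(crude_A = crude_A, crude_B = crude_B, adjusted_A = adjusted_A))
```

In this toy setting the crude rate of \(A\) is much higher than that of \(B\), but after adjustment the rate of \(A\) equals the crude rate of \(B\): the whole difference was due to age structure.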
Append a column Population_UK to the lung_cancer data frame with
uk_population <- lung_cancer$Population[lung_cancer$Country == "UK"]
lung_cancer$Population_UK <- uk_population
This column contains the value of \(p_{B, \text{age}}\) in Equation \((\text{I})\).
In the second line, we’re taking advantage of R’s vector recycling: the shorter vector uk_population is repeated as often as necessary to fill the entire length of the column lung_cancer$Population_UK. This relies on every country listing the same age classes in the same row order.
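The following sketch illustrates this behaviour on a tiny made-up example. The names toy, standard_population and Population_std are hypothetical and the numbers are invented. It also shows a simple sanity check, since the repetition is purely positional.

```r
# Hypothetical toy example: 2 age groups, 3 countries (6 rows in total).
standard_population <- c(100, 200)   # one value per age group
toy <- data.frame(
  Country  = rep(c("A", "B", "C"), each = 2),
  AgeClass = rep(c("young", "old"), times = 3)
)

# The length-2 vector is repeated three times to fill all 6 rows.
toy$Population_std <- standard_population
print(toy)

# The repetition is positional, so the values only line up if every
# country lists its age groups in the same order. A quick check:
stopifnot(all(toy$Population_std[toy$AgeClass == "young"] == 100))
stopifnot(all(toy$Population_std[toy$AgeClass == "old"] == 200))
```

If the row order could not be trusted, joining the standard population onto the data by age class, for example with merge(), would be a more robust alternative.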
1. Append another column Cases_if_UK that contains the hypothetical number of cases if the country in the corresponding row had the population of the UK in this age group.
This column contains the values of \(\tilde{c}_{A, \text{age}}\) in equation \((\text{I})\).
2. Use the aggregate() function to calculate the total number of cases that each country would have if it had the population and age structure of the UK.
The resulting column Cases_if_UK contains the numerator of Equation \((\text{II})\).
# Preparation: Append population of UK in corresponding age group.
uk_population <- lung_cancer$Population[lung_cancer$Country == "UK"]
lung_cancer$Population_UK <- uk_population

# 1: How many cases would there be if the population had the age
# structure of the UK?
lung_cancer$Cases_if_UK <-
  (lung_cancer$Population_UK / lung_cancer$Population) * lung_cancer$Cases

# 2: Total number of cases in each country if it had the population and
# age structure of the UK.
adjusted <- aggregate(Cases_if_UK ~ Country, data = lung_cancer, sum)
3. Calculate the age-adjusted incidence rate defined by Equation \((\text{II})\).
4. Make a bar chart of the age-adjusted incidence rate (one bar for each country).
5. Compare this plot with the bar chart for the overall (i.e. unadjusted) incidence rate that you made during the prep work for today. Are there any noteworthy changes?
# Preparation: Append population of UK in corresponding age group.
uk_population <- lung_cancer$Population[lung_cancer$Country == "UK"]
lung_cancer$Population_UK <- uk_population

# 1: How many cases would there be if the population had the age
# structure of the UK?
lung_cancer$Cases_if_UK <-
  (lung_cancer$Population_UK / lung_cancer$Population) * lung_cancer$Cases

# 2: Total number of cases in each country if it had the population and
# age structure of the UK.
adjusted <- aggregate(Cases_if_UK ~ Country, data = lung_cancer, sum)

# 3: Append a column with the age-adjusted incidence rate.
adjusted$Incidence <- adjusted$Cases_if_UK / sum(uk_population) * 100000
barplot(adjusted$Incidence,
        names.arg = adjusted$Country,
        main = "Age-adjusted Lung Cancer Incidence",
        ylab = "Incidence rate per 100,000",
        col = "tan2",
        las = 2)
grid()
Because most countries have a younger population than the UK, age-adjustment leads to higher incidence rates in most cases.
The age-adjusted incidence rate of China is strikingly high. China has a very high incidence in the oldest age group, which is (still) a relatively small fraction of China’s population.
Why do the corrected incidence rates provide a better answer to the main question of this activity?
“Does someone from one country (UK) have a higher chance of getting lung cancer than someone from another country (Viet Nam)?”
The corrected or adjusted incidence rates can be compared among countries because the effect of age structure has been taken out. If a country’s adjusted rate is very different from its crude (i.e. overall) rate, its rank among the countries may change substantially.
For example, in Viet Nam, the proportion of very young people, who have almost no cases of lung cancer, is very high. By contrast, Singapore has a larger share of older people, a group with a much higher incidence rate of lung cancer. So the high crude rate in Singapore is strongly driven by the larger weight of the older age groups.
Therefore, in epidemiology, generally an age-adjusted rate is used to compare among countries or other groups. In general, the weighted average of the age-specific incidence rates is calculated based on the proportions of persons in the corresponding age groups of a standard population. The potential confounding effect of age is reduced when comparing age-adjusted rates computed using the same standard population.
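The weighted-average description and Equation \((\text{II})\) are two ways of writing the same quantity. The sketch below checks this numerically on made-up numbers. The vectors cases_A, pop_A and pop_S (the standard population) are hypothetical and only serve as an illustration.

```r
# Hypothetical numbers for one country (A) and a standard population (S),
# three age groups. All values are invented purely for illustration.
cases_A <- c(5, 120, 600)
pop_A   <- c(6e6, 3e6, 1e6)
pop_S   <- c(4e6, 4e6, 2e6)

# Age-specific incidence rates of A (cases per person).
rate_A <- cases_A / pop_A

# Weights: share of each age group in the standard population.
weights <- pop_S / sum(pop_S)

# Weighted average of the age-specific rates, per 100,000.
weighted_avg <- 100000 * sum(weights * rate_A)

# The same quantity written as in Equations (I) and (II).
equation_II <- 100000 * sum(pop_S / pop_A * cases_A) / sum(pop_S)

print(c(weighted_avg = weighted_avg, equation_II = equation_II))
```

Both expressions agree because dividing each hypothetical case count by the total standard population is the same as weighting each age-specific rate by that age group’s share of the standard population.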