Code
<- c("coronavirus", "deSolve", "lubridate", "scales", "PrettyCols")
packages # Package vector names
Today we will have one more lab and will explore basic aspects of models for Spread of Disease or Compartmental models that are used to simplify the mathematical modeling of infectious diseases, more specifically, we will explore the basic
We can get information about herd immunity in an amazing paper published in the American Journal of Preventive Medicine. The paper entitles Vaccine Efficacy Needed for a COVID-19 Coronavirus Vaccine to Prevent or Stop an Epidemic as the Sole Intervention and was written by Bartsch et al. (2020). Another nice paper to understand the herd immunity concept was published in Current Biology by Ashby & Best (2021), you can get access to that conceptual paper by clicking HERE.
To do this laboratory you will need to have a set of R packages. Install the following packages:
<- c("coronavirus", "deSolve", "lubridate", "scales", "PrettyCols")
packages # Package vector names
You can use the function install.packages() to install the packages.
If you don’t want to install the packages one by one, you can use the next command.
# Install packages not yet installed
<- packages %in% rownames(installed.packages())
installed_packages
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages], dependencies = TRUE)
}
This command, will, first, check if you already the packages installed, then if a package is not installed in your computer, will install it.
Load installed packages:
library(tidyverse)
library(lubridate)
library(coronavirus)
library(PrettyCols)
library(scales)
library(deSolve)
Double-check your working directory.
You can use the function getwd() to get the current working directory.
First we will explore the data for the COVID-19 for different countries. To do that we will get data at global scale using the amazing R package {coronavirus} that provides a daily summary of COVID-19 cases by country obtained from the Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE) Coronavirus.
data("coronavirus")
glimpse(coronavirus)
We are using tidyverse for almost everything in the lab sessions. If you want to know what we are really doing, you can install and load the R package named {tidylog}. This package will print in your console the different operations within the pipes.
To to get the most updated dataset we can use the function refresh_coronavirus_jhu() and store the UPDATED data in our Environment as a new object named corona.
<- refresh_coronavirus_jhu()
corona
%>%
corona head(10)
If you received an error while uploading or updating the coronavirus dataset, use the next code. If not, please skip this lines.
<- "https://github.com/jesusNPL/BiodiversityScience/raw/master/Spring2023/Lab_3/Data/corona.RData"
githubURL
load(url(githubURL))
glimpse(corona)
Let’s start exploring the total numbers of confirmed cases by country.
# Get top confirmed cases by country
%>%
corona filter(data_type == "cases_new") %>%
group_by(location) %>%
summarise(total = sum(value)) %>%
arrange(-total) %>%
head(20) %>%
ggplot(aes(y = as.factor(location), x = total)) +
geom_bar(stat = "identity")
Informative but not good looking… let’s play a bit more!
# Aux function for visualization
<- function() {
theme_nice theme_bw() + #base_family = "Noto Sans") +
theme(panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA),
#plot.title = element_text(face = "bold"),
#strip.text = element_text(face = "bold"),
strip.background = element_rect(fill = "grey80", color = NA),
legend.title = element_text(face = "bold", size = 15),
legend.text = element_text(size = 12))
}
# Get top confirmed cases by country
%>%
corona filter(data_type == "cases_new") %>%
group_by(location) %>%
summarise(total = sum(value)) %>%
arrange(-total) %>%
head(20) %>%
ggplot(aes(y = as.factor(location), x = total)) +
geom_bar(stat = "identity", alpha = 0.7) +
scale_x_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
labs(x = "Number of cases", y = "Countries with more reported cases") +
theme_nice()
Much, much better!
What about the 20 countries with less number of reported cases?
Which countries reported fewer cases of COVID-19?
What about the number of reported recovered cases?
How do feel about that?
By quickly exploring the coronavirus dataset we can see that the United States is the country with more reported cases, not new information is added to our current knowledge, however, we can learn more about the COVID-19 dynamics in the US by looking at its changes over time–you can select any other country if you wish.
<- corona %>%
corona_us filter(location == "US") %>%
arrange(desc(date)) # sort the data according dates
If we look at the US coronavirus data it contains information of the number of recovers (R), the number of new cases (I) and the the deaths. Let’s isolate those data and inspect if there are some trends.
<- corona %>%
infected_us filter(location == "US") %>%
arrange(desc(date)) %>%
filter(data_type == "cases_new")
<- corona %>%
deaths_us filter(location == "US") %>%
arrange(desc(date)) %>%
filter(data_type == "deaths_new")
<- corona %>%
recovered_us filter(location == "US") %>%
arrange(desc(date)) %>%
filter(data_type == "recovered_new")
head(infected_us)
let’s plot the new cases.
%>%
corona_us filter(data_type == "cases_new") %>%
ggplot(aes(x = as.POSIXct(date), y = value)) +
geom_point(color = "gray") +
labs(x = NULL, y = "COVID-19 new cases") +
geom_line(mapping = aes(x = as.POSIXct(date)), linewidth = 2,
colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 2) +
scale_y_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
scale_x_datetime(date_breaks = "2 month", labels = scales::label_date_short()) +
theme_nice()
We can see a trend in the number of confirmed Infected subjects since the first case was registered in the US. Happily, the number of reported cases has declined steadily since mid-September 2022. This is really great news! If you want to explore with more detail these dynamics in the United Sates the website covid19.Explorer developed by Revell (2021) can help you. This website is accompanied with a nice paper in which Liam explains the methodology he used to arrange and visualize the data.
How do you feel about that?
Please explain the trend temporally; in other words, do the spikes in the number of cases match the US holidays?
What about the number of deaths and the number of recovered people?
The first step to build our first SIR model is to get data of a disease, well, we already have data of coronavirus for each country in the world, however, as this dataset is at country level we probably will be missing some relevant information that occurs only at local scale, i.e., by States. In this sense, we will use data from the Centers for Disease Control and Prevention that reports aggregate counts of COVID-19 at State level within the United States, how good is that? Okay, let’s go for it!
# https://covid.cdc.gov/covid-data-tracker/#trends_weeklycases_select_27
# Get data from the CDC
<- "https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD"
url_data
<- read.csv(url_data)
covid_us
glimpse(covid_us)
If the previous lines return an error, please use these ones.
<- "https://github.com/jesusNPL/BiodiversityScience/raw/master/Spring2023/Lab_3/Data/covidUS.RData"
githubURL
load(url(githubURL))
glimpse(covid_us)
Let’s explore the data and clean it little bit.
%>%
covid_us arrange(state) %>%
drop_na(tot_cases) %>%
ggplot(aes(y = state, x = tot_cases)) +
geom_bar(stat = "identity") +
labs(x = "Number of cases", y = "Reported cases by State") +
scale_x_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
theme_nice()
Let’s reformat the dates and order the data in an ascending order.
<- covid_us %>%
covid_us mutate(Date = submission_date) %>%
mutate(Date = mdy(Date)) %>%
select(Date, everything()) %>%
separate(submission_date, sep = "/", into = c("month", "day", "year"))
# Sort the data in an increasing order
<- covid_us[order(covid_us$Date), ]
covid_us
head(covid_us)
As we can see there are a lot of NA values in the dataset, these NA values are because there were no cases reported in the US in the first two months of the 2020.
The data is still very broad, let’s select data for the State of Minnesota.
<- covid_us %>%
covid_mn filter(state == "MN")
head(covid_mn)
Now we can explore the data in a similar way we did for the entire United States.
%>%
covid_mn ggplot(aes(x = as.POSIXct(Date), y = new_case)) +
geom_point(color = "gray") +
labs(x = NULL, y = "COVID-19 new cases") +
geom_line(mapping = aes(x = as.POSIXct(Date)), linewidth = 2,
colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 2) +
scale_y_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
scale_x_datetime(date_breaks = "2 month", labels = label_date_short()) +
theme_nice()
We can see that there are some days that no cases were reported, however, it looks like follows the same pattern as the entire United States.
Let’s see what happens if we plot the cumulative number of cases.
%>%
covid_mn ggplot(aes(x = as.POSIXct(Date), y = tot_cases)) +
geom_point(color = "gray") +
labs(x = NULL, y = "Total COVID-19 cases") +
geom_line(mapping = aes(x = as.POSIXct(Date)), linewidth = 2,
colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 1) +
scale_y_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
scale_x_datetime(date_breaks = "2 month", labels = label_date_short()) +
theme_nice()
It look like the after three years the cumulative number of cases is reaching a plateau.
We can also see the number of deaths caused by COVID-19 in the state of Minnesota.
%>%
covid_mn ggplot(aes(x = as.POSIXct(Date), y = abs(new_death))) +
geom_point(color = "gray") +
labs(x = NULL, y = "Number of Deaths") +
geom_line(mapping = aes(x = as.POSIXct(Date)), linewidth = 2,
colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 1) +
scale_y_continuous(labels = scales::label_number(suffix = " K", scale = 1e-6)) +
scale_x_datetime(date_breaks = "2 month", labels = label_date_short()) +
theme_nice()
Now using the data from the US we will fit the SIR model to predict the changes in the number of Infected cases. The basic idea of the SIR model is, in fact, quite simple. First, we can see that the SIR model is composed of a set of three groups of people or compartments:
Then, these compartments in turn are used to model the dynamics of an outbreak. However, first of all we need to understand a little bit the Math behind the scenes. Dr. Elizabeth Borer provided a nice introduction to the math in the last lecture, but if you want to dig a little bit more you can watch this amazing video by Trevor Bazett The MATH of Epidemics | Intro to the SIR Model on YouTube. But, if you want to learn more details on this topic, the Book Epidemics: models and data using R by Bjørnstad (2018) can help you.
Anyhow, let’s recap a little bit how the math works. To do that, we need three differential equations, one for the change in each group or compartment, where
The first equation indicates that the number of susceptible subjects (
The second equation indicates that the number of infected subjects (
Finally, the third equation indicates that the number of recovered subjects
Okay, we are almost there. But first let’s see how an epidemic develops.
As we now understand how the
<- function(time, state, parameters) {
SIR <- as.list(c(state, parameters))
par with(par, {
<- -Beta * I * S / N # Equation one
dS <- Beta * I * S / N - Gamma * I # Equation two
dI <- Gamma * I # Equation three
dR list(c(dS, dI, dR))
}) }
Now to fit the
More specifically, in order to solve the equations we need is to minimize the sum of the squared differences (
Before start solving the
<- function(parameters) {
RSS names(parameters) <- c("Beta", "Gamma")
<- ode(y = init, times = Days, func = SIR, parms = parameters)
out # the out object includes the SIR function we wrote above
<- out[, 3]
fit sum((Infected - fit)^2)
}
Okay, we now have almost all information needed to fit our first
By observing the trend in the number of cases for MN, we can see that the first record of infected subject was in March 06 2020 (you can double check to confirm that date), so, let’s set that day as a starting point for our
%>%
covid_mn filter(Date == "2020-03-06")
Moreover, as more than two years have passed since the pandemic started, the variability in the number of cases is huge, so let’s select the first 60 days of cases to fit our model and then use the fitted model to predict the dynamics of COVID-19 for the following year.
<- 5686649 # Total population for the State of Minnesota for the 2020
N
<- "2020-03-06"
start_date <- "2020-05-10"
end_date
# isolating the infected subjects in the state of Minnesota since the start date
<- covid_mn %>%
Infections filter(Date >= ymd(start_date) & Date <= ymd(end_date))
#Infected2 <- subset(covid_mn, Date >= ymd(start_date) & Date <= ymd(end_date))$new_case
Let’s plot this data and see how it looks.
%>%
Infections ggplot(aes(x = as.POSIXct(Date), y = log(new_case))) +
geom_point(color = "gray") +
#geom_line(mapping = aes(x = as.POSIXct(Date)), linewidth = 2,
# colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 1) +
labs(x = NULL, y = "log - COVID-19 cases") +
theme_nice()
We can see that the confirmed cases in Minnesota were increasing rapidly since the first case of COVID-19 was reported.
As we now know that the number of
# ODE does not like tidy format, so we are going to isolate the number of cases in R base.
<- Infections$new_case
Infected
<- 1:length(Infected) # Number of days since the first case
Days
# Initial values for our SIR model
<- c(
init S = N - Infected[1], # Susceptible group
I = Infected[1], # Infected group
R = 0 # Recovered group.
)
Now we can combine all information and run our model. Using the information provided above, we can find the values of
You can play with the initial values by changing the c(0.5, 0.5) to lower or higher values (e.g., 0.1 or 0.7), but remember, the estimates of
<- optim(c(0.5, 0.5),
Opt
RSS,method = "L-BFGS-B",
lower = c(0, 0),
upper = c(1, 1)
)
Check if the model converged.
# optimize with some sensible conditions
$message
Opt
# [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Yay! our model show convergence.
From the optimized model we can obtain the
<- setNames(Opt$par, c("Beta", "Gamma"))
Opt_par
Opt_par
Now using the fitted
# get the fitted values from our SIR model
<- data.frame(ode(
fit_incidence y = init, times = Days,
func = SIR, parms = Opt_par
))
head(fit_incidence)
tail(fit_incidence)
Let’s plot the estimated
matplot(fit_incidence$time, fit_incidence$I,
type = "l", log = "y",
xlab = "Days", ylab = "Number of infected subjects",
lwd = 2, lty = 1)
points(Days, Infected)
Hmmm, looks like the fitted model does recover the number of infected cases in the Minnesota. However, although seems that the model tend to underestimate the number of reported cases, we can see an steady increment on the number of cases since the first reported case.
How do you feel about that?
Do you see any other interested pattern? If so, please explain
Although our SIR model does not fit the observed data accurately, we can still use it to estimate the basic reproduction number
It arises from the concept of “force of infection” which is the rate that susceptibles get infected. This is broadly a function of host mortality and the infectiousness of a pathogen. There was a lot of work on the math of this in the 1950s-1980s (that I have not spent the time to fully follow through), but the relationships that emerge are relatively simple. One key one that e.g., Anderson and May use a lot comes from Muench (1950s math) and Dietz (1993) attached. In this case, force of infection has been described as
Life expectancy can be estimated as
There are a ton of assumptions in each of these steps (pathogen is increasing, complete mixing, no host variation, etc), but this is all basically derived from the original SIR models via relationships describing force of infection.
<- setNames(Opt_par["Beta"] / Opt_par["Gamma"], "R0")
R0
round(R0, 3)
The estimated
Another reason could be that the number of reported cases of COVID-19 in US are not true, in other words, the reported cases are lower than the true cases (Wu et al., 2020). This is critical when we are constructing any kind of model, as potentially we would be facing the GIGO metaphor–garbage-in, garbage-out–that indicates that if the data used in a model are not reliable, then the results are not useful. In other words, you can be using the best mathematical model ever constructed, but if your data is not reliable or coherent, then your results are not useful (are garbage).
To fit our model we used data of the first two months since the first reported case, now let’s explore what would it happen if no public intervention (i.e., quarantine) was applied. Here, using the fitted model we will extrapolate up to 350 days.
<- 1:350 # time in days
times
<- data.frame(ode(
fit_350 y = init, times = times,
func = SIR, parms = Opt_par))
head(fit_350)
tail(fit_350)
To better explore the data, let’s make some figures that allow us to see what would happened the first 350 days of the pandemic in the state of Minnesota in a hypothetical case of no intervention.
<- 1:3 # colors: black = susceptible, red = infected and green = recovered cols
matplot(fit_350$time, fit_350[, 2:4], type = "l",
xlab = "Days", ylab = "Number of subjects",
lwd = 2, lty = 1, col = cols)
legend("left", c("Susceptible", "Infected", "Recovered"),
lty = 1, lwd = 2, col = cols, inset = 0.05)
Same figure but in log scale and adding the observed cases
matplot(fit_350$time, fit_350[ , 2:4], type = "l",
xlab = "Days", ylab = "Number of subjects",
lwd = 2, lty = 1, col = cols, log = "y")
## Warning in xy.coords(x, y, xlabel, ylabel, log = log): 1 y value <= 0
## omitted from logarithmic plot
points(Days, Infected)
legend("bottomright", c("Susceptible", "Infected", "Recovered"),
lty = 1, lwd = 2, col = cols, inset = 0.05)
title("SIR model 2019-nCoV United States", outer = TRUE, line = -2)
How do you feel about that?
Please explain the trend of each compartment in a temporal way?
Using the fitted model we can also estimate some additional statistics that could help us to better understand the dynamics of a pandemic:
# Peak of the pandemic for the first 60 days
$I == max(fit_incidence$I), c("time", "I")] fit_incidence[fit_incidence
max(fit_incidence$I) * 0.02 # Assuming 2% of fatality rate
The challenge for this assignment is to repeat the process by selecting other state in the United States and discuss if the