To do this laboratory you will need a set of R packages. Install the following packages (note that {tidyverse} is included here because we load it below):

# Package vector names
packages <- c("tidyverse", "coronavirus", "deSolve", "lubridate", "scales", "PrettyCols")
If you don’t want to install the packages one by one, you can use the next command.
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages], dependencies = TRUE)
}
This command will first check whether you already have the packages installed; then, any package not yet installed on your computer will be installed.
Load installed packages:
library(tidyverse)
library(lubridate)
library(coronavirus)
library(PrettyCols)
library(scales)
library(deSolve)
Double-check your working directory.
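If you are not sure where you are working, you can print the current directory and change it if needed (the path below is a hypothetical placeholder):

getwd() # print the current working directory
# setwd("path/to/your/lab_folder") # hypothetical path -- edit and uncomment to change it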
First, we will explore COVID-19 data for different countries. To do that, we will get data at the global scale using the amazing R package {coronavirus}, which provides a daily summary of COVID-19 cases by country obtained from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).
data("coronavirus")
glimpse(coronavirus)
To get the most up-to-date dataset we can use the function refresh_coronavirus_jhu() and store the updated data in our environment as a new object named corona.
corona <- refresh_coronavirus_jhu()
corona %>%
  head(10)
If you received an error while loading or updating the coronavirus dataset, use the following code. If not, please skip these lines.
<- "https://github.com/jesusNPL/BiodiversityScience/raw/master/Spring2023/Lab_3/Data/corona.RData"
githubURL
load(url(githubURL))
glimpse(corona)
Let’s start exploring the total number of confirmed cases by country.
# Get top confirmed cases by country
corona %>%
  filter(data_type == "cases_new") %>%
group_by(location) %>%
summarise(total = sum(value)) %>%
arrange(-total) %>%
head(20) %>%
ggplot(aes(y = as.factor(location), x = total)) +
geom_bar(stat = "identity")
Informative but not good looking… let’s play a bit more!
# Aux function for visualization
theme_nice <- function() {
  theme_bw() + # base_family = "Noto Sans"
theme(panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA),
#plot.title = element_text(face = "bold"),
#strip.text = element_text(face = "bold"),
strip.background = element_rect(fill = "grey80", color = NA),
legend.title = element_text(face = "bold", size = 15),
legend.text = element_text(size = 12))
}
# Get top confirmed cases by country
corona %>%
  filter(data_type == "cases_new") %>%
group_by(location) %>%
summarise(total = sum(value)) %>%
arrange(-total) %>%
head(20) %>%
ggplot(aes(y = as.factor(location), x = total)) +
geom_bar(stat = "identity", alpha = 0.7) +
scale_x_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
labs(x = "Number of cases", y = "Countries with more reported cases") +
theme_nice()
Much, much better!
What about the 20 countries with the fewest reported cases? (See the sketch below for a starting point.)
Which countries reported the fewest COVID-19 cases?
What about the number of reported recovered cases?
How do you feel about that?
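As a hint for the first two questions, one possible approach (a sketch, not the only answer) is to reuse the pipeline above but sort in ascending order; for the recovered cases, filtering on data_type == "recovered_new" works the same way.

# Sketch: the 20 countries with the fewest reported cases
corona %>%
  filter(data_type == "cases_new") %>%
  group_by(location) %>%
  summarise(total = sum(value)) %>%
  arrange(total) %>% # ascending instead of descending order
  head(20) %>%
  ggplot(aes(y = as.factor(location), x = total)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  labs(x = "Number of cases", y = "Countries with fewer reported cases") +
  theme_nice()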
By quickly exploring the coronavirus dataset we can see that the United States is the country with the most reported cases. That adds no new information to our current knowledge; however, we can learn more about COVID-19 dynamics in the US by looking at its changes over time (you can select any other country if you wish).
corona_us <- corona %>%
  filter(location == "US") %>%
  arrange(desc(date)) # sort the data according to dates
If we look at the US coronavirus data, it contains information on the number of recovered subjects (R), the number of new cases (I), and the number of deaths. Let’s isolate those data and inspect whether there are any trends.
infected_us <- corona %>%
  filter(location == "US") %>%
  arrange(desc(date)) %>%
  filter(data_type == "cases_new")

deaths_us <- corona %>%
  filter(location == "US") %>%
  arrange(desc(date)) %>%
  filter(data_type == "deaths_new")

recovered_us <- corona %>%
  filter(location == "US") %>%
  arrange(desc(date)) %>%
  filter(data_type == "recovered_new")
head(infected_us)
Let’s plot the new cases.
corona_us %>%
  filter(data_type == "cases_new") %>%
ggplot(aes(x = as.POSIXct(date), y = value)) +
geom_point(color = "gray") +
labs(x = NULL, y = "COVID-19 new cases") +
geom_line(mapping = aes(x = as.POSIXct(date)), linewidth = 2,
colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 2) +
scale_y_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
scale_x_datetime(date_breaks = "2 month", labels = scales::label_date_short()) +
theme_nice()
We can see a trend in the number of confirmed infected subjects since the first case was registered in the US. Happily, the number of reported cases has declined steadily since mid-September 2022. This is really great news! If you want to explore these dynamics in the United States in more detail, the website covid19.Explorer developed by Revell (2021) can help you. This website is accompanied by a nice paper in which Liam explains the methodology he used to arrange and visualize the data.
How do you feel about that?
Please explain the trend temporally; in other words, do the spikes in the number of cases match the US holidays?
What about the number of deaths and the number of recovered people?
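As a starting point, you can reuse the plotting code above with the deaths_us and recovered_us objects we created earlier (a sketch for the deaths; recovered_us works the same way):

# Sketch: new deaths over time in the US
deaths_us %>%
  ggplot(aes(x = as.POSIXct(date), y = value)) +
  geom_point(color = "gray") +
  labs(x = NULL, y = "COVID-19 new deaths") +
  geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 2) +
  scale_x_datetime(date_breaks = "2 month", labels = scales::label_date_short()) +
  theme_nice()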
The first step to build our first SIR model is to get data on a disease. Well, we already have coronavirus data for each country in the world; however, as this dataset is at the country level, we would probably miss relevant information that occurs only at the local scale, i.e., by state. For that reason, we will use data from the Centers for Disease Control and Prevention, which reports aggregate counts of COVID-19 at the state level within the United States. How good is that? Okay, let’s go for it!
# https://covid.cdc.gov/covid-data-tracker/#trends_weeklycases_select_27
# Get data from the CDC
<- "https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD"
url_data
<- read.csv(url_data)
covid_us
glimpse(covid_us)
If the previous lines return an error, please use these instead.
<- "https://github.com/jesusNPL/BiodiversityScience/raw/master/Spring2023/Lab_3/Data/covidUS.RData"
githubURL
load(url(githubURL))
glimpse(covid_us)
Let’s explore the data and clean it a little bit.
covid_us %>%
  arrange(state) %>%
drop_na(tot_cases) %>%
ggplot(aes(y = state, x = tot_cases)) +
geom_bar(stat = "identity") +
labs(x = "Number of cases", y = "Reported cases by State") +
scale_x_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
theme_nice()
Let’s reformat the dates and order the data in an ascending order.
covid_us <- covid_us %>%
  mutate(Date = submission_date) %>%
  mutate(Date = mdy(Date)) %>%
  select(Date, everything()) %>%
  separate(submission_date, sep = "/", into = c("month", "day", "year"))

# Sort the data in an increasing order
covid_us <- covid_us[order(covid_us$Date), ]
head(covid_us)
As we can see, there are a lot of NA values in the dataset; these NA values appear because no cases were reported in the US in the first two months of 2020.
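You can quickly verify this by counting the missing values in each column (a minimal sketch):

# Count the NA values per column
colSums(is.na(covid_us))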
The data is still very broad, let’s select data for the State of Minnesota.
covid_mn <- covid_us %>%
  filter(state == "MN")
head(covid_mn)
Now we can explore the data in a similar way we did for the entire United States.
covid_mn %>%
  ggplot(aes(x = as.POSIXct(Date), y = new_case)) +
geom_point(color = "gray") +
labs(x = NULL, y = "COVID-19 new cases") +
geom_line(mapping = aes(x = as.POSIXct(Date)), linewidth = 2,
colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 2) +
scale_y_continuous(labels = scales::label_number(suffix = " K", scale = 1e-3)) +
scale_x_datetime(date_breaks = "2 month", labels = label_date_short()) +
theme_nice()
We can see that there are some days on which no cases were reported; however, it looks like Minnesota follows the same pattern as the entire United States.
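You can confirm how many days reported zero new cases with a quick count (a sketch):

# Number of days in Minnesota with zero reported new cases
covid_mn %>%
  filter(new_case == 0) %>%
  nrow()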
Let’s see what happens if we plot the cumulative number of cases.
covid_mn %>%
  ggplot(aes(x = as.POSIXct(Date), y = tot_cases)) +
geom_point(color = "gray") +
labs(x = NULL, y = "Total COVID-19 cases") +
geom_line(mapping = aes(x = as.POSIXct(Date)), linewidth = 2,
colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 1) +
scale_y_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) +
scale_x_datetime(date_breaks = "2 month", labels = label_date_short()) +
theme_nice()
It looks like, after three years, the cumulative number of cases is reaching a plateau.
We can also see the number of deaths caused by COVID-19 in the state of Minnesota.
covid_mn %>%
  ggplot(aes(x = as.POSIXct(Date), y = abs(new_death))) +
geom_point(color = "gray") +
labs(x = NULL, y = "Number of Deaths") +
geom_line(mapping = aes(x = as.POSIXct(Date)), linewidth = 2,
colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 1) +
scale_y_continuous(labels = scales::label_number(suffix = " K", scale = 1e-3)) +
scale_x_datetime(date_breaks = "2 month", labels = label_date_short()) +
theme_nice()
Now, using the data from the US, we will fit the SIR model to predict the changes in the number of infected cases. The basic idea of the SIR model is, in fact, quite simple. First, the SIR model is composed of a set of three groups of people, or compartments:
Susceptible (\(S\)): subjects who can contract the disease.
Infected (\(I\)): subjects who currently have the disease and can transmit it.
Recovered (\(R\)): subjects who have recovered from (or died of) the disease and are removed from the susceptible pool.
Then, these compartments in turn are used to model the dynamics of an outbreak. First, however, we need to understand a bit of the math behind the scenes. Dr. Elizabeth Borer provided a nice introduction to the math in the last lecture, but if you want to dig a little deeper you can watch this amazing video by Trevor Bazett, The MATH of Epidemics | Intro to the SIR Model, on YouTube. And if you want to learn more details on this topic, the book Epidemics: models and data using R by Bjørnstad (2018) can help you.
Anyhow, let’s recap a little how the math works. To do that, we need three differential equations, one for the change in each group or compartment, where \(\beta\) (beta) is the parameter that controls the transition between \(S\) and \(I\) and represents the infection rate, and \(\gamma\) (gamma) controls the transition between \(I\) and \(R\) and represents the removal or recovery rate.
The first equation indicates that the number of susceptible subjects (\(S\)) decreases with the number of newly infected subjects. In other words, every new infected subject is the result of the infection rate (\(\beta\)) times the number of susceptible individuals (\(S\)) who had a contact with an infected subject (\(I\)).
\[\frac{dS}{dt} = - \frac{\beta IS}{N}\]
The second equation indicates that the number of infected subjects (\(I\)) increases as new infected individuals are added to the pool (\(\beta IS\)) minus the recovered subjects (\(\gamma I\)), where, \(\gamma I\) is the removal rate \(\gamma\) times the infected subjects \(I\).
\[\frac{dI}{dt} = \frac{\beta IS}{N} - \gamma I\]
Finally, the third equation indicates that the number of recovered subjects \(R\) increases with the number of subjects who either recovered or died (\(\gamma I\)).
\[\frac{dR}{dt} = \gamma I\]
Okay, we are almost there. But first let’s see how an epidemic develops.
As we now understand how the \(SIR\) model works, we can build an R function that allows us to model the dynamics of a disease (a toy simulation follows right after the function).
SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -Beta * I * S / N # Equation one
    dI <- Beta * I * S / N - Gamma * I # Equation two
    dR <- Gamma * I # Equation three
    list(c(dS, dI, dR))
  })
}
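Before fitting anything to real data, we can run this function once with made-up values to see how a toy epidemic develops (a sketch; the population size and the Beta and Gamma values below are arbitrary and chosen purely for illustration):

# Toy epidemic: one infected subject in a hypothetical population of 10,000
toy_init <- c(S = 9999, I = 1, R = 0)
toy_pars <- c(Beta = 0.5, Gamma = 0.1, N = 10000) # arbitrary illustrative rates

toy_out <- data.frame(ode(y = toy_init, times = 1:100, func = SIR, parms = toy_pars))

matplot(toy_out$time, toy_out[, 2:4], type = "l",
        xlab = "Days", ylab = "Number of subjects", lwd = 2, lty = 1)
legend("right", c("Susceptible", "Infected", "Recovered"),
       lty = 1, lwd = 2, col = 1:3)

Note how the number of infected subjects rises, peaks, and declines as the susceptible pool is depleted: the classic epidemic curve.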
Now, to fit the \(SIR\) model to the data (the Minnesota dataset), we need a solver and an optimizer. Wait, what is a solver? Well, the \(SIR\) model is a system of Ordinary Differential Equations (ODEs); thus, a solver function is needed to help us solve the equations. We will use the function “ode” of the package {deSolve} and the “optim” function of base R as an optimizer.
More specifically, in order to fit the model we need to minimize the residual sum of squares (\(RSS\)), i.e., the sum of the squared differences between the observed number of infected subjects at time \(t\), \(I(t)\), and the number of cases predicted by our model, \(\hat{I}(t)\):
\[RSS(\beta, \gamma) = \sum_t \big(I(t) - \hat{I}(t) \big)^2\]
Before we start solving the \(SIR\) model, let’s write another function that performs the \(RSS\) estimation.
RSS <- function(parameters) {
  names(parameters) <- c("Beta", "Gamma")
  # solve the ODE using the SIR function we wrote above
  out <- ode(y = init, times = Days, func = SIR, parms = parameters)
  fit <- out[, 3] # column 3 holds the predicted number of infected subjects
  sum((Infected - fit)^2)
}
Okay, we now have almost all the information needed to fit our first \(SIR\) model. The last piece we need is a set of initial values for \(N\) (population size), \(S\), \(I\), and \(R\).
By observing the trend in the number of cases for MN, we can see that the first record of an infected subject was on March 06, 2020 (you can double-check to confirm that date), so let’s set that day as the starting point for our \(SIR\) model.
covid_mn %>%
  filter(Date == "2020-03-06")
Moreover, as more than two years have passed since the pandemic started and the variability in the number of cases is huge, let’s select roughly the first 60 days of cases to fit our model, and then use the fitted model to predict the dynamics of COVID-19 for the following year.
N <- 5686649 # Total population of the State of Minnesota in 2020

start_date <- "2020-03-06"
end_date <- "2020-05-10"

# isolating the infected subjects in the state of Minnesota since the start date
Infections <- covid_mn %>%
  filter(Date >= ymd(start_date) & Date <= ymd(end_date))

#Infected2 <- subset(covid_mn, Date >= ymd(start_date) & Date <= ymd(end_date))$new_case
Let’s plot this data and see how it looks.
Infections %>%
  ggplot(aes(x = as.POSIXct(Date), y = log(new_case))) +
geom_point(color = "gray") +
#geom_line(mapping = aes(x = as.POSIXct(Date)), linewidth = 2,
# colour = alpha(PrettyCols::prettycols("Purples")[4], 0.3)) +
geom_smooth(colour = PrettyCols::prettycols("Purples")[1], linewidth = 1) +
labs(x = NULL, y = "log - COVID-19 cases") +
theme_nice()
We can see that the confirmed cases in Minnesota increased rapidly after the first case of COVID-19 was reported.
As we now know that the number of \(I\) was increasing since the first days of the pandemic, let’s model the dynamics of COVID-19 during that period of time. Why did we select that period of time and not all the data? Well, the main reason is that this kind of model is used to explore the first stages of a disease, in order to identify whether the disease has the potential to become a pandemic and to approximate its reproductive ratio \(R_0\).
# ode() does not like the tidy format, so we are going to isolate the number of cases in base R.
Infected <- Infections$new_case

Days <- 1:length(Infected) # Number of days since the first case

# Initial values for our SIR model
init <- c(
  S = N - Infected[1], # Susceptible group
  I = Infected[1], # Infected group
  R = 0 # Recovered group
)
Now we can combine all the information and run our model. Using the information provided above, we can find the values of \(\beta\) and \(\gamma\) that give the smallest \(RSS\), which represents the best fit to the data. Let’s start exploring with values of 0.5 for each parameter and constrain those values to the interval from 0 to 1.
You can play with the initial values by changing the c(0.5, 0.5) to lower or higher values (e.g., 0.1 or 0.7), but remember, the estimates of \(\beta\) and \(\gamma\) will change. A nice post that explains in more detail the impact of changing the initial values can be found HERE.
Opt <- optim(c(0.5, 0.5),
             RSS,
             method = "L-BFGS-B",
             lower = c(0, 0),
             upper = c(1, 1)
)
Check if the model converged.
# check the convergence message of the optimizer
Opt$message
# [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Yay! Our model shows convergence.
From the optimized model we can obtain the \(\beta\) and \(\gamma\) values; remember, these values control the transitions between \(S\) (susceptible) and \(I\) (infectious), and between \(I\) and \(R\) (recovered), respectively.
Opt_par <- setNames(Opt$par, c("Beta", "Gamma"))

Opt_par
Now, using the fitted \(\beta\) and \(\gamma\) values, we can see if our model recovers the observed trend in the state of Minnesota over the first two months since the first positive case was reported.
# get the fitted values from our SIR model
fit_incidence <- data.frame(ode(
  y = init, times = Days,
  func = SIR, parms = Opt_par
))
head(fit_incidence)
tail(fit_incidence)
Let’s plot the estimated \(I\) over the first two months. We will use a kind of plot called a semi-log plot; this plot allows us to read the difference between the expected (fitted) and observed numbers of confirmed cases over time.
matplot(fit_incidence$time, fit_incidence$I,
type = "l", log = "y",
xlab = "Days", ylab = "Number of infected subjects",
lwd = 2, lty = 1)
points(Days, Infected)
Hmmm, it looks like the fitted model does recover the number of infected cases in Minnesota. However, although the model seems to underestimate the number of reported cases, we can see a steady increase in the number of cases since the first reported case.
How do you feel about that?
Do you see any other interesting pattern? If so, please explain.
Although our SIR model does not fit the observed data accurately, we can still use it to estimate the basic reproduction number \(R_0\) (AKA the reproductive ratio). The \(R_0\) gives us an approximate estimate of how many susceptible subjects (\(S\)) get infected by a sick subject (\(I\)) on average (Ashby & Best, 2021). Importantly, \(R_0\) is not a fixed value and can vary over time and between populations. To calculate \(R_0\) we just need to take the ratio between \(\beta\) and \(\gamma\):
\[R_0 = \frac{\beta}{\gamma}\]
R0 <- setNames(Opt_par["Beta"] / Opt_par["Gamma"], "R0")

round(R0, 3)
The estimated \(R_0\) is 1.232, which suggests that for every sick subject about 1.2 subjects can get infected with COVID-19. This value is low compared to other diseases, and even lower than the estimated \(R_0\) for COVID-19 in the literature, which ranges between 2.5 and 3. One potential explanation for this low \(R_0\) is that, during the first 60 days, the number of observed cases was very low compared with the large population size (\(N\)) of the state of Minnesota.
Another reason could be that the numbers of reported cases of COVID-19 in the US are not accurate; in other words, the reported cases are lower than the true cases (Wu et al., 2020). This is critical when we are constructing any kind of model, as we would potentially be facing the GIGO metaphor (garbage in, garbage out), which indicates that if the data used in a model are not reliable, then the results are not useful. In other words, you can be using the best mathematical model ever constructed, but if your data are not reliable or coherent, then your results are not useful (they are garbage).
To fit our model we used data from the first two months since the first reported case; now let’s explore what would happen if no public intervention (i.e., quarantine) were applied. Here, using the fitted model, we will extrapolate up to 350 days.
times <- 1:350 # time in days

fit_350 <- data.frame(ode(
  y = init, times = times,
  func = SIR, parms = Opt_par))
head(fit_350)
tail(fit_350)
To better explore the data, let’s make some figures that allow us to see what would have happened during the first 350 days of the pandemic in the state of Minnesota in a hypothetical case of no intervention.
cols <- 1:3 # colors: black = susceptible, red = infected, and green = recovered
matplot(fit_350$time, fit_350[, 2:4], type = "l",
xlab = "Days", ylab = "Number of subjects",
lwd = 2, lty = 1, col = cols)
legend("left", c("Susceptible", "Infected", "Recovered"),
lty = 1, lwd = 2, col = cols, inset = 0.05)
Same figure, but on a log scale and with the observed cases added.
matplot(fit_350$time, fit_350[ , 2:4], type = "l",
xlab = "Days", ylab = "Number of subjects",
lwd = 2, lty = 1, col = cols, log = "y")
## Warning in xy.coords(x, y, xlabel, ylabel, log = log): 1 y value <= 0
## omitted from logarithmic plot
points(Days, Infected)
legend("bottomright", c("Susceptible", "Infected", "Recovered"),
lty = 1, lwd = 2, col = cols, inset = 0.05)
title("SIR model 2019-nCoV United States", outer = TRUE, line = -2)
How do you feel about that?
Please explain the trend of each compartment over time.
Using the fitted model we can also estimate some additional statistics that could help us to better understand the dynamics of a pandemic:
# Peak of the pandemic for the first 60 days
fit_incidence[fit_incidence$I == max(fit_incidence$I), c("time", "I")]

max(fit_incidence$I) * 0.02 # Assuming a 2% fatality rate
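Another quantity we can derive from the fitted parameters is the classic herd immunity threshold, \(1 - 1/R_0\), i.e., the fraction of the population that needs to be immune for the epidemic to stop growing (a sketch using the R0 object computed above):

# Herd immunity threshold implied by the fitted R0
1 - 1 / R0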
The challenge for this assignment is to repeat the process selecting another state in the United States and to discuss whether the \(R_0\) you obtain differs from the one obtained for the state of Minnesota.
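As a starting point, you mainly need to change the state filter and the population size \(N\) earlier in the workflow (a sketch, assuming you pick Wisconsin; the population value below is the approximate 2020 census figure, so double-check it):

# Sketch: repeat the workflow for another state, e.g., Wisconsin
covid_wi <- covid_us %>%
  filter(state == "WI")

N <- 5893718 # approximate 2020 population of Wisconsin -- double-check this value
# ... then repeat the filtering, fitting, and R0 estimation steps above using covid_wi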