Ecological niche models and species distributions

Showing some neat features of R!
Published: April 5, 2023

Note

In this lab, we will explore some correlative models using data at the species level (occurrence data points) and environmental data at large spatial scales. We will download occurrence data for oak species distributed in the Americas from GBIF and environmental (climatic) data from WorldClim.

Part of the explanation of the different algorithms used in this lab was extracted from the vignette of the {dismo} R package.

1 Set up your data and your working directory

To do this laboratory you will need to install the following packages:

Code
packages <- c("raster", "sp", "rgeos", "rworldmap", "dismo", "rgdal", "maptools", 
              "kernlab", "rgbif", "spThin", "sdmvspecies", "mmap", "sf", "maps")
# Package vector names
Function install.packages()

You can use the function install.packages() to install the packages.

If you don’t want to install the packages one by one, you can use the next command.

Code
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())

if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages], dependencies = TRUE)
}

The package {scrubr} is not available on CRAN, so we will install it from GitHub (this requires the {remotes} package).

Code
if ( ! ("scrubr" %in% installed.packages())) {remotes::install_github("ropensci/scrubr")}

Although we installed several packages in the previous lines, one package is still missing: the R package {sdm}. This package is the main wrapper we will use for modeling environmental niches (the ghosts) or Grinnellian niches. We will install this package separately because it has several dependencies.

Code
if ( ! ("sdm" %in% installed.packages())) {install.packages("sdm", dependencies = TRUE)}

A last step for setting up the R package {sdm} is to run the command installAll(); this will install any missing dependencies and guarantee the complete functionality of the package. Yes, this is annoying, but that is how the package was designed.

Code
sdm::installAll()
Code
# Load all the required packages at once
sapply(packages, require, character.only = TRUE)

library(sdm)
library(scrubr)
library(tidyverse)

Double-check your working directory.

Function getwd()

You can use the function getwd() to get the current working directory.
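For example, a minimal check (and, if needed, a change) of the working directory looks like this; the path in the second line is just a placeholder:

Code
getwd() # print the current working directory
# setwd("path/to/your/lab/folder") # placeholder path; edit and uncomment if you need to change it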

2 Prepare data

2.1 Download species occurrences from GBIF

Get some occurrence data for our species from GBIF, directly within R. This may take some time, given the number of occurrences for the selected species.

NOTE: we need to have an internet connection.

Set the vector with some species scientific names to download occurrence data from GBIF.

Code
spp <- c("Quercus virginiana", "Quercus minima", "Quercus alba", "Quercus fusiformis")

Download data using the vector of species names

Code
gbif_data <- occ_data(scientificName = spp, hasCoordinate = TRUE, limit = 2000)  

#decrease the 'limit' if you just want to see how many records there are without waiting all the time that it will take to download the whole dataset.

Save the raw data

Code
dir.create("Data")
dir.create("Data/OCC")

# save the downloaded data in the working directory
save(gbif_data, file = "Data/OCC/oaks_raw_occ.RDATA")

Inspect the downloaded data. Note that we have downloaded data for four species but for simplicity let’s pick one species for this lab. I will select Quercus minima but you can select any other species.

The returned object is a list of 4 elements (i.e., the length of our species vector); thus, to select the focal species, just use [[position of the species in the vector]]. For example, I selected Quercus minima, so to access this species I just need to type gbif_data[[2]]. Let’s see:

Code
#gbif_data
# if, for any species, "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations

# check how the data are organized:
names(gbif_data)
names(gbif_data[[2]]$meta)
names(gbif_data[[2]]$data)

Wow, there are a bunch of columns; let’s select the ones that are relevant for our purposes.

Code
# get the columns that matter for mapping and cleaning the occurrence data:
occ_quemin <- gbif_data[[2]]$data[, c("decimalLongitude", "decimalLatitude", 
                                      "scientificName", "occurrenceStatus", 
                                      "coordinateUncertaintyInMeters",
                                      "institutionCode", "references")]

head(occ_quemin)

Let’s look at the data.

Code
glimpse(occ_quemin)

# It looks like the data include two species; let's clean the data first
occ_quemin <- occ_quemin %>% 
  filter(scientificName == "Quercus minima (Sarg.) Small")

glimpse(occ_quemin)

Now let’s clean the data a little bit. First, remove all records with NA coordinates.

Code
occ_quemin <- occ_quemin %>% 
  filter(!is.na(decimalLongitude) & !is.na(decimalLatitude))

# Base R alternative:
# occ_quemin <- subset(occ_quemin, !is.na(decimalLongitude) & !is.na(decimalLatitude))

occ_quemin

Let’s do some further data cleaning with functions of the {scrubr} package (but note this cleaning is not exhaustive!). This will remove some improbable coordinates from the data.

Code
occ_quemin <- coord_incomplete(coord_imprecise(coord_impossible(coord_unlikely(occ_quemin))))

occ_quemin
Code
glimpse(occ_quemin)
## 537    7

Now let’s transform our data into a spatial object; this will help us visualize the data better.

Code
quemin_spt <- st_as_sf(occ_quemin, 
                       coords = c("decimalLongitude", "decimalLatitude"), 
                       crs = 4326) #%>%
  #st_cast("MULTIPOINT")

# explore the data
glimpse(quemin_spt)

Let’s plot the distribution of Quercus minima occurrences.

Code
sf_use_s2(FALSE)

# world map
worldMap <- rnaturalearth::ne_countries(scale = "medium", 
                                        type = "countries", 
                                        returnclass = "sf")

# country subset
USpoly <- worldMap %>% 
  #filter(region_wb == "North America")
  filter(admin == "United States of America")

Plot the occurrence records

Code
map_US <- map_data('world')[map_data('world')$region == "USA", ]

ggplot() + 
  geom_polygon(data = map_data("world"), 
               aes(x = long, y = lat, group = group), 
               color = "#f1f2f3", fill = '#f3f3f3') + 
  geom_polygon(data = map_US, 
               aes(x = long, y = lat, group = group), 
               color = 'lightgray', fill = 'lightgray') + 
  geom_point(data = occ_quemin, 
             aes(x = decimalLongitude, y = decimalLatitude), 
             color = "darkgray", alpha = 0.5) + 
  coord_map() + 
  coord_fixed(1.3, 
              xlim = c(-65, -125), 
              ylim = c(50, 25)
              ) + 
  theme(panel.background = element_rect(fill = "lightblue"))

Small joke 😬🫣.

Now the real map…

Code
ggplot() + 
  geom_polygon(data = map_data("world"), 
               aes(x = long, y = lat, group = group), 
               color = "#f1f2f3", fill = '#f3f3f3') + 
  geom_sf(data = USpoly) + 
  geom_sf(data = quemin_spt, color = "darkgray", alpha = 0.5) + 
  coord_sf(
     xlim = c(-125, -65), 
     ylim = c(25, 50)
  ) + 
  theme(
    plot.background = element_rect(fill = "#f1f2f3"),
    panel.background = element_rect(fill = "lightblue"),
    panel.grid = element_blank(),
    line = element_blank(),
    rect = element_blank()
  )

The data seem correct, but spatial aggregation of occurrence records can cause some trouble in model predictions. To address this issue we will apply spatial thinning to the species occurrences using the package {spThin}. Given that our species has a broad distribution, let’s set a distance threshold of 2 km between occurrence points.

Code
quemin_thinned <- thin(
      loc.data =  occ_quemin, 
      verbose = FALSE, 
      long.col = "decimalLongitude", 
      lat.col = "decimalLatitude",
      spec.col = "scientificName",
      thin.par = 2, # points have at least a minimum distance of 2 km from each other
      reps = 1, 
      locs.thinned.list.return = TRUE, 
      write.files = FALSE, 
      out.dir = "Data/OCC/")
    
quemin_thinned <- as.data.frame(quemin_thinned)
quemin_thinned$Species <- "Quercus_minima"

Explore the results of the thinning process. We can see that the number of occurrences of Quercus minima decreased from 537 to 346. We will use the thinned data for further analyses.

Code
glimpse(quemin_thinned)

Transform the thinned occurrences to a spatial object

Code
quemin_thinned_sf <- st_as_sf(quemin_thinned, 
                       coords = c("Longitude", "Latitude"), 
                       crs = 4326) 

# The code below uses base R ({sp}); we will use this object for the modelling part
quemin_thinned_spt <- SpatialPointsDataFrame(coords = quemin_thinned[, 1:2], 
                                             data = quemin_thinned, 
                                             proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

Now visualize both spatial objects.

Code
ggplot() + 
  geom_polygon(data = map_data("world"), 
               aes(x = long, y = lat, group = group), 
               color = "#f1f2f3", fill = '#f3f3f3') + 
  geom_sf(data = USpoly) + 
  geom_sf(data = quemin_spt, color = "blue", alpha = 0.5) + 
  geom_sf(data = quemin_thinned_sf, color = "red", alpha = 0.3) + 
  coord_sf(
     xlim = c(-125, -65), 
     ylim = c(25, 50)
  ) +
  theme(
    plot.background = element_rect(fill = "#f1f2f3"),
    panel.background = element_rect(fill = "lightblue"),
    panel.grid = element_blank(),
    line = element_blank(),
    rect = element_blank()
  )

Now let’s save the processed data and keep quemin_thinned_spt for further analyses.

Code
save(occ_quemin, quemin_spt, quemin_thinned_sf, quemin_thinned, 
     file = "Data/OCC/quemin_OCC_processed.RData")

rm(gbif_data, occ_quemin, quemin_spt)

So far we have downloaded and cleaned the species occurrence data; now we will prepare the environmental data that will be used as predictors for constructing ENMs/SDMs for our species.

2.2 Prepare environmental data

First we will obtain the bioclimatic variables from WorldClim using the function getData() from the package {raster}; then we will select the variables that are most relevant for explaining the distribution of our species.

Code
dir.create("Data/Envi")

bios <- raster::getData("worldclim", var = "bio", res = 10, 
                download = TRUE, path = "Data/Envi")

You can see that by using the argument path within the function getData() the bioclimatic variables were also downloaded directly to your hard drive, so you don’t need to download the data again.
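If you need to reload the layers in a later session, getData() should find the cached files instead of downloading them again. As a quick, optional check you can list the cached files; this assumes the default wc10 subfolder created by the download:

Code
# The 10-arc-minute WorldClim layers should now be cached on disk
list.files("Data/Envi/wc10", pattern = "bil$")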

Download the climatic data manually

If you have issues downloading WorldClim data, you can download the data HERE and then store the climatic data within the subfolder Envi.

2.2.1 Load bioclimatic data stored in your computer

If you downloaded the bioclimatic data from Dropbox, please use the next lines of code.

Code
biosNames <- list.files("Data/Envi/wc10/", pattern = "bil$")

# Please double check your folder path. You may need to change "wc10" to "bio_10m_bil"

bios <- stack(paste0("Data/Envi/wc10/", biosNames))

Let’s explore some details of the bioclimatic data

Code
names(bios)

str(bios[[1]])

Now let’s plot some bioclimatic variables, say BIO1 and BIO12, which are mean annual temperature and annual precipitation, respectively.

Code
plot(stack(bios$bio1, bios$bio12))

Okay, these data are at the global level, but our species only occurs in North America. To get a better visualization of the data, let’s crop the bioclimatic data to the extent of North America (United States and Canada).

Code
USpoly_spt <- subset(countriesCoarse, ADMIN == "United States of America") # used as extent

plot(USpoly_spt)

Now crop the bioclimatic data to this extent.

Code
bios_US <- crop(bios, USpoly_spt)

Let’s visualize the results of cropping the bioclimatic data and overlap the thinned occurrences of Quercus minima.

Code
plot(bios_US[[1]]) # mean annual temperature
plot(quemin_thinned_spt, col = "red", pch = 16, add = TRUE) # add occurrence records
plot(countriesCoarse, lwd = 2, lty = 2, add = TRUE) # add country borders

Cool!

3 Accessible area

Before any further analysis we need to define the accessible area for our species in geographical space; remember, our species is distributed in the United States. To define the accessible area, let’s use a bounding box around the known species occurrences plus ~300 km beyond that bound (in the code below we approximate this with 3 degrees, since one degree of latitude is roughly 111 km). This gives us an approximation of the area the species can reach by dispersal within the geographical domain, i.e., North America. This bounding box represents the M component of the BAM diagram.

Code
### Species specific accessible area
bb <- bbox(quemin_thinned_spt) # bounding box

e <- extent(c(bb[1]-3, bb[3]+3, bb[2]-3, bb[4]+3)) # bounding box + 3 degrees (~300 km)

p <- as(e, 'SpatialPolygons') # transform to polygon
crs(p) <- crs(bios_US) # use the geographical reference of the bioclimatic variables

crs(USpoly_spt) <- crs(bios_US)

out <- gIntersection(USpoly_spt, p, byid = FALSE) # intersect with the country polygon to remove areas over the sea

Now let’s plot the accessible area.

Code
plot(bios_US[[1]])
plot(p, add = TRUE, lty = 2)
plot(out, add = TRUE, lwd = 2)
#enviSPP <- raster::crop(envi, out)

Nice, we can see that the accessible area for our species is the southeast United States.

Now crop the bioclimatic data to the extent of the accessible area.

Code
bios_spp <- raster::crop(bios_US, out)
bios_spp <- raster::mask(bios_spp, out)

# plot the results
plot(bios_spp[[1]])
plot(quemin_thinned_spt, add = TRUE, col = "red", pch = 16)
plot(USpoly_spt, add = TRUE, lty = 2)

Note that we will use this accessible area as the extent for all subsequent analyses.

4 Pseudoabsences

To model the environmental niches (ghosts) and to project the species distributions (shadows) we need data on species presences and absences; however, we only have presence data. What should we do if we don’t have absences? That’s the question! There is no straightforward answer, but the most common procedure to obtain absence data is to generate random points (AKA pseudoabsences) within the geographical domain (i.e., the accessible area or M). There is a lot of discussion on how many pseudoabsences to use for ENM/SDM; here we will use a conservative number, i.e., twice the number of presences.

Code
set.seed(12345) # Random Number Generation to obtain the same result

# Generate the data
absence <- randomPoints(mask = bios_spp[[1]], 
                        n = round(nrow(quemin_thinned)*2, 0), # number of pseudoabsences
                        p = quemin_thinned_spt, # presence object
                        ext = extent(bios_spp) # extent
                        )

Now let’s combine the presence and pseudoabsence data

Code
# Presences
presence <- data.frame(quemin_thinned) # presence data

# Pseudoabsences
absence <- data.frame(absence) # pseudoabsence data
absence$Species <- "Quercus_minima"
names(absence) <- names(presence)

presence$Occurrence <- 1 # presence data
absence$Occurrence <- 0 # pseudoabsence data

quemin <- rbind(presence, absence) # combine both information

Finally transform the presence-pseudoabsence data to a spatial object and visualize it!

Code
coordinates(quemin) <- ~ Longitude + Latitude
crs(quemin) <- crs(bios_spp)

quemin

Let’s visualize the results

Code
plot(bios_spp[[1]])
plot(quemin[quemin$Occurrence == 1, ], col = "blue", add = TRUE, pch = 16)
points(quemin[quemin$Occurrence == 0, ], col = "red", pch = 16)

We can save the processed data and clean the environment a little bit.

Code
save(presence, absence, quemin, file = "Data/OCC/quemin_PresAbs.RData")

save(bb, e, USpoly_spt, out, p, file = "Data/Envi/accessible_area_quemin.RData")

rm(absence, presence, bios, e, out, p, bb)

Now, we need to decide which variables are necessary to model the niche (Grinnellian niche, or fundamental niche, or the ghost) for our oak species. Here we can base the selection of variables on expert knowledge, on statistical tools, or both.

5 Variable selection

Let’s take the statistical route. However, we could also ask Jeannine which variables are most important for the distribution of oaks.

Code
quemin_bios <- data.frame(raster::extract(bios_spp, quemin))

quemin_bios <- cbind(data.frame(quemin), quemin_bios)

quemin_bios <- quemin_bios[complete.cases(quemin_bios), ]
quemin_bios <- na.omit(quemin_bios)

glimpse(quemin_bios)

One way to select variables is to explore the correlation among the predictors. We can use a threshold of |r| < 0.7, which means that variables with correlations below this threshold can be considered not problematic (Dormann et al., 2013); however, you can use a more conservative threshold, such as 0.5.

To do that, we first estimate a correlation matrix from the predictors. We use Spearman rank correlation coefficient, as we do not know whether all variables are normally distributed.

Code
cor_mat <- cor(quemin_bios[, c(6:24)], method = 'spearman')

Let’s visualize the correlation matrix.

Code
corrplot::corrplot.mixed(cor_mat, 
                         tl.pos = "lt", 
                         tl.cex = 0.5, 
                         number.cex = 0.5, 
                         addCoefasPercent = TRUE, 
                         mar = c(0, 0, 1, 0))

It looks like several variables are highly correlated; let’s identify the ones that are highly correlated with each other and exclude them from further analysis. However, inspecting the predictors one by one can take forever. What should we do? Well, we have two options:

  1. Talk with the specialist about which predictors should we use, or

  2. Let the computer decide

The simple solution is to let the computer decide. To do that, we will use an amazing function called select07() from the package {mecofun}, which selects the predictors below the correlation threshold.

Let’s install the package from its repository, as it is not currently published on CRAN. Important: a message may appear in your console asking if you would like to update existing packages; type 3 (None) and press Enter. This tells R to install just {mecofun} and avoids errors during the installation.

Code
# Install the package from source
remotes::install_git("https://gitup.uni-potsdam.de/macroecology/mecofun.git")

The function will return a list of three objects:

  1. AIC values for each model

  2. The correlation matrix

  3. A vector of the selected variables.

Code
# Run select07()
 
covar_sel <- mecofun::select07(X = quemin_bios[, -c(1:5)], # only predictors data 
                               y = quemin_bios$Occurrence, # presence-absence data 
                               threshold = 0.7) # here you can change the threshold for one

# Check out the structure of the resulting object:
str(covar_sel)

Let’s see the results…

Code
covar_sel$AIC
covar_sel$cor_mat
covar_sel$pred_sel

According to the select07() function eight climatic variables (predictors) best fit the data and also have low correlation. Assuming that this is correct, we will use these variables for further analyses. The selected variables are: “bio18”, “bio8”, “bio9”, “bio10”, “bio5”, “bio12”, “bio2”, “bio14”. Store these variable names as a vector.

Code
preds <- covar_sel$pred_sel
preds

If you have issues installing the package {mecofun}, please use the next lines of code. The function select07_v2() is the very same function from {mecofun}; I just extracted it from the package to avoid installation issues.

Code
#source("R-Functions/select07_mod.R")
source("https://raw.githubusercontent.com/jesusNPL/BiodiversityScience/master/Spring2021/R-Functions/select07_mod.R")

# Run select07()
covar_sel <- select07_v2(X = quemin_bios[, -c(1:5)], # only predictors data 
                         y = quemin_bios$Occurrence, # presence-absence data 
                         threshold = 0.7) # here you can change the threshold for one

This extracted function will return the same outputs.

Code
preds <- covar_sel$pred_sel
preds

Finally, let’s select the bioclimatic variables and plot them

Code
bios_quemin_sel <- bios_spp[[preds]]

plot(bios_quemin_sel)

Okay, we’re ready to build the models for our species.

6 Building Environmental Niche Models

To model the environmental niches of Quercus minima we will use the R package {sdm}. This package is very useful; first we need to create a data object, which will save us a lot of time later on.

Code
quemin_DATA <- sdmData(formula = Occurrence ~ bio2 + bio5 + bio8 + bio9 + bio10 + bio12 + bio14 + bio18, 
                      train = quemin, # presence-pseudoabsence data
                      predictors = bios_quemin_sel, # selected covariables
                      crs = crs(bios_quemin_sel)) 

Let’s explore a little bit the sdm object.

Code
quemin_DATA

We can see that our object has 1 species, 8 features (predictors), the type of data (presence-absence; well, presence-pseudoabsence), and the number of records.

Now let’s model the species-environment relationship, or Grinnellian niche. With {sdm} we can model the species-environment relationships using several algorithms at once, but this depends completely on the power of your computer, so use it at your own risk. To get the complete list of algorithms available in the {sdm} package, just type getmethodNames('sdm') in your console.

Code
getmethodNames('sdm')

Here we will explore the predictions of six algorithms, i.e., Bioclim, Gower (Domain), GLM, GAM, SVM, and Random Forest.

6.1 Fit models

6.1.1 Bioclim model

The BIOCLIM algorithm computes the similarity of a location by comparing the values of environmental variables at any location to a percentile distribution of the values at known locations of occurrence (‘training sites’).
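To make this concrete, here is a minimal sketch of fitting Bioclim directly with {dismo}, assuming the bios_quemin_sel stack and the quemin_thinned data frame created above (the {sdm} wrapper we use below does this for us):

Code
# Bioclim only needs presence coordinates
presence_xy <- quemin_thinned[, c("Longitude", "Latitude")]

bc <- dismo::bioclim(bios_quemin_sel, presence_xy) # percentile envelopes per variable
bc_pred <- dismo::predict(bc, bios_quemin_sel)     # suitability values between 0 and 1
plot(bc_pred)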

6.1.2 Gower model

The Domain algorithm computes the Gower distance between environmental variables at any location and those at any of the known locations of occurrence (‘training sites’). For each variable the minimum distance between a site and any of the training points is taken.
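A comparable sketch with {dismo}’s domain() function, reusing the presence_xy object from the Bioclim sketch above (note that Domain predictions can be slow over large rasters):

Code
dm <- dismo::domain(bios_quemin_sel, presence_xy) # Gower distance to the closest training site
dm_pred <- dismo::predict(dm, bios_quemin_sel)
plot(dm_pred)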

6.1.3 Generalized Linear Models

A generalized linear model (GLM) is a generalization of ordinary least squares regression. Models are fit using maximum likelihood and by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Depending on how a GLM is specified it can be equivalent to (multiple) linear regression, logistic regression or Poisson regression.
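As an illustration, here is a minimal logistic GLM fitted by hand on the presence/pseudoabsence table quemin_bios built in Section 5; bio5 and bio12 are just two of the selected predictors, used as an example:

Code
glm_fit <- glm(Occurrence ~ bio5 + bio12,         # presence/absence as a function of two predictors
               data = quemin_bios, 
               family = binomial(link = "logit")) # logistic link
summary(glm_fit)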

6.1.4 Random forest

Random Forest (Breiman, 2001) is an extension of the classification and regression trees (CART; Breiman (1998)).
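A minimal Random Forest sketch on the same table, assuming the {randomForest} package is available (sdm::installAll() typically installs it as a dependency of the "rf" method):

Code
library(randomForest)

rf_fit <- randomForest(factor(Occurrence) ~ bio5 + bio12, # classification forest on two example predictors
                       data = quemin_bios, 
                       ntree = 500, importance = TRUE)
rf_fit
importance(rf_fit) # variable importance across the ensemble of trees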

6.1.5 Support Vector Machines or SVM model

Support Vector Machines are an excellent tool for classification, novelty detection, and regression.

Support Vector Machines (SVMs; Vapnik (1998)) apply a simple linear method to the data but in a high-dimensional feature space non-linearly related to the input space, but in practice, it does not involve any computations in that high-dimensional space. This simplicity combined with state of the art performance on many learning problems (classification, regression, and novelty detection) has contributed to the popularity of the SVM (Karatzoglou et al., 2006).
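And a minimal SVM sketch using ksvm() from {kernlab} (already in our package list); the radial basis kernel "rbfdot" used here is the default and serves only as an illustration:

Code
library(kernlab)

svm_fit <- ksvm(factor(Occurrence) ~ bio5 + bio12, # SVM classification on two example predictors
                data = quemin_bios, 
                kernel = "rbfdot", prob.model = TRUE)
svm_fit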

If you have a powerful computer you can set more cores, which will speed up the computation. Here we are going to fit 6 models and evaluate them through 2 runs of subsampling, each drawing 30 percent of the training data as a test dataset:

Code
# This takes some time (~3 minutes), please be patient!

quemin_SDM <- sdm(Occurrence~., # the presence-absence
                   data = quemin_DATA, # formula and data 
                   methods = c("bioclim", "domain.dismo", "glm", "gam", "rf", "svm"), # algorithms 
                   replication = "sub", test.percent = 30, n = 2, # training-testing subsampling 
                   parallelSettings = list(ncore = 2, method = "parallel")) # parallel computation

Now let’s see how well each algorithm fitted the data.

Code
quemin_SDM

We can see that all algorithms worked relatively well. The evaluation table shows four model evaluation metrics, i.e., AUC, COR, TSS, and Deviance. Each metric has its own properties, but I personally like the true skill statistic (TSS) because it evaluates the matches and mismatches between observations and predictions. Looking at the TSS column, we can see that the algorithms that best fitted the occurrence data are random forest, gam, svm, and glm. Interestingly, these models also have the highest area under the receiver operating characteristic curve (AUC) values.

Function getEvaluation()

You can use the function getEvaluation() to pull out the metrics and save them as a data.frame. Remember that we ran two replicates, so the metrics data.frame will consist of twelve rows (6 algorithms × 2 runs).
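For example, a sketch of pulling the evaluation table out; the stat and opt arguments are assumed from the {sdm} documentation (opt = 2 corresponds to the max(se + sp) threshold criterion):

Code
eval_df <- getEvaluation(quemin_SDM, 
                         stat = c("AUC", "COR", "TSS", "Deviance"), # metrics to extract
                         opt = 2)                                   # threshold criterion
eval_df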

Another way to evaluate the model performance is to plot the AUCs.

Code
roc(quemin_SDM, smooth = TRUE)
Function getVarImp()

The function getVarImp() will show the importance of the environmental variables in predicting the niches of Quercus minima. See the next example.

Code
# Get the variable importance
getVarImp(quemin_SDM)

# Get the variable importance for a specific algorithm
getVarImp(quemin_SDM, method = "rf") # VI for Random Forest algorithm

How do you feel about that?

Are the important variables the same across algorithms?

So far we have downloaded and processed occurrence data and fitted and evaluated six different algorithms. Now let’s predict the Grinnellian niche of Quercus minima in the accessible area. We could do that for all six algorithms at once, but it would take a very long time, so let’s go algorithm by algorithm. In addition, you can use the fitted object (i.e., quemin_SDM) to make predictions for other regions (e.g., areas with similar environmental conditions or invadable areas) or to project to other periods of time (i.e., the future or the past).

Code
quemin_ENM_bioclim <- predict(quemin_SDM, #sdm object
                              newdata = bios_quemin_sel, # environmental data 
                              method = "bioclim", # algorithm
                              mean = TRUE)
quemin_ENM_bioclim

Now plot the resulting prediction from the bioclim algorithm. This map shows the probability of occurrence of Quercus minima under the bioclim algorithm, where values close to 1 indicate a high probability of occurrence and values near 0 a low probability. Before visualizing the prediction, let’s re-scale it to make it more interpretable.

Code
quemin_ENM_bioclim <- sdmvspecies::rescale(quemin_ENM_bioclim$sp_1.m_bioclim.re_subs)

plot(quemin_ENM_bioclim)

Let’s repeat the process for the other algorithms.

Code
quemin_ENM_domain <- predict(quemin_SDM, newdata = bios_quemin_sel, 
                              method = "domain.dismo", mean = TRUE)

quemin_ENM_glm <- predict(quemin_SDM, newdata = bios_quemin_sel, 
                              method = "glm", mean = TRUE)

quemin_ENM_gam <- predict(quemin_SDM, newdata = bios_quemin_sel, 
                              method = "gam", mean = TRUE)

quemin_ENM_rf <- predict(quemin_SDM, newdata = bios_quemin_sel, 
                              method = "rf", mean = TRUE)

quemin_ENM_svm <- predict(quemin_SDM, newdata = bios_quemin_sel, 
                              method = "svm", mean = TRUE)

# Please, do not pay attention to the warnings...
Code
quemin_ENM_domain <- sdmvspecies::rescale(quemin_ENM_domain$sp_1.m_domain.dismo.re_subs)
quemin_ENM_glm <- sdmvspecies::rescale(quemin_ENM_glm$sp_1.m_glm.re_subs)
quemin_ENM_gam <- sdmvspecies::rescale(quemin_ENM_gam$sp_1.m_gam.re_subs)
quemin_ENM_rf <- sdmvspecies::rescale(quemin_ENM_rf$sp_1.m_rf.re_subs)
quemin_ENM_svm <- sdmvspecies::rescale(quemin_ENM_svm$sp_1.m_svm.re_subs)

Now plot all six predictions

Code
quemin_ENM_all <- raster::stack(quemin_ENM_bioclim, quemin_ENM_domain, 
                                quemin_ENM_glm, quemin_ENM_gam, 
                                quemin_ENM_svm, quemin_ENM_rf)

names(quemin_ENM_all) <- c("Bioclim", "Domain", "GLM", "GAM", "SVM", "RF")
Code
plot(quemin_ENM_all)

As you can see, the algorithms returned different predictions; for example, Domain returned a very broad prediction while Bioclim returned a narrow one. From this we can conclude that there is no silver bullet (i.e., no perfect algorithm) for predicting Grinnellian niches. If you are curious about this issue, this paper led by Huijie Qiao can help you.

One potential solution to this issue is to create an ensemble model. Basically, an ensemble returns a prediction by combining the results of different modeling algorithms (Araujo & New, 2007). These ensemble predictions are also known as consensus predictions. Here, rather than using all six models, we will use the best four (i.e., rf, gam, svm, and glm) and will use the TSS metric as weights.

Code
### Ensemble prediction - ensemble based on TSS statistics
quemin_ENM_ensemble <- ensemble(quemin_SDM, # sdm object
                                newdata = bios_quemin_sel, # covariables 
                                method = c("rf", "gam", "svm", "glm"),# algorithms
                                setting = list(method = "weighted", stat = "TSS"), # metric 
                                parallelSettings = list(ncore = 2, method = "parallel"))

Don’t forget to re-scale the model prediction and visualize it.

Code
quemin_ENM_ensemble <- sdmvspecies::rescale(quemin_ENM_ensemble)

Plot the ensemble prediction

Code
plot(quemin_ENM_ensemble)

How do you feel about that?

Does the ensemble prediction outperform single algorithm predictions?

Let’s try making a nicer map.

Code
library(rasterVis)
library(RColorBrewer)

mapTheme <- rasterTheme(region = rev(brewer.pal(11, "Spectral")),
  layout.widths = list(right.padding = 10),
  axis.line = list(col = "transparent"),
  tick = list(col = 'transparent'))

levelplot(quemin_ENM_ensemble,
  maxpixels = 1e10,
  margin = FALSE,
  par.settings = mapTheme,
  scales = list(x = list(draw = FALSE),
                y = list(draw = FALSE)),
  zlim = c(0, 1))

You might also want to save the predictions.

Code
dir.create("Results")
dir.create("Results/ENMs")

# save the predictions
writeRaster(quemin_ENM_all, 
            filename = "Results/ENMs/quemin_ENM", 
            suffix = names(quemin_ENM_all), 
            format = "GTiff", 
            bylayer = TRUE, 
            overwrite = TRUE)

# save the ensemble prediction
writeRaster(quemin_ENM_ensemble, 
            filename = "Results/ENMs/quemin_ENM_Ensemble", 
            format = "GTiff", 
            bylayer = TRUE, 
            overwrite = TRUE)

Finally, let’s produce “shadows” from the “ghosts”.

7 Building Species Distributions

7.1 Species geographical distributions

First we need to set a binarization threshold for the model predictions; the threshold (cut-off) is used to transform model predictions (probabilities, distances, or similar values) into a binary score (presence or absence).

Here we will find the threshold for the ensemble prediction, but you can do this for each algorithm separately. The threshold we will use is intended to maximize the sensitivity (the ability of a test to correctly identify presences, or true positive rate) plus the specificity (the ability of a test to correctly identify absences, or true negative rate) of the predictions.

Code
dt <- data.frame(as.data.frame(quemin_DATA), coordinates(quemin_DATA))

glimpse(dt)

prediction <- raster::extract(quemin_ENM_ensemble, dt[, c("Longitude", "Latitude")])

evaluation <- sdm::evaluates(dt$Occurrence, prediction) # observed versus expected

threshold_sel <- evaluation@threshold_based$threshold[2]

round(threshold_sel, 2)

The threshold that maximizes the sensitivity plus specificity is 0.34; we will use that value to produce “shadows” from the “ghosts”.

Code
quemin_SDM_ensemble <- quemin_ENM_ensemble

# reclassify our ensemble prediction
quemin_SDM_ensemble[] <- ifelse(quemin_SDM_ensemble[] >= threshold_sel, 1, 0)

Plot the distribution of Quercus minima

Code
plot(quemin_SDM_ensemble)
plot(countriesCoarse, add = TRUE)

That’s it, we modeled environmental niches (ghosts) to produce geographical distributions (shadows).

You might also want to save this final raster file in your hard drive.

Code
dir.create("Results/SDMs")

writeRaster(quemin_SDM_ensemble, 
            filename = "Results/SDMs/quemin_SDM_Ensemble", 
            format = "GTiff", 
            bylayer = TRUE, 
            overwrite = TRUE)

8 The challenge

Please respond to each question based on the practice.

  1. Summarize the data: how many records are there, how many have coordinates, how many records without coordinates have a textual georeference (locality description)?

  2. Do you think the observations are a reasonable representation of the distribution (and ecological niche) of the species?

  3. Is there a best model? Explain your answer.

  4. Can we use ENM/SDM to aid the conservation of Quercus minima? Explain your answer.

  5. Based on the lecture and the practice, what can we conclude?

  6. Model the niche and predict the distribution of a different species (you can use one of the species listed at the beginning of the class).

  7. Please explain the algorithm used for modeling the niche and predict the distribution of your species and why you decided to use it.

The end, for now 😉!

References

Araujo, M., & New, M. (2007). Ensemble forecasting of species distributions. Trends in Ecology & Evolution, 22(1), 42–47. https://doi.org/10.1016/j.tree.2006.09.010
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L. (Ed.). (1998). Classification and regression trees (1. CRC Press repr). Chapman & Hall/CRC.
Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x
Karatzoglou, A., Meyer, D., & Hornik, K. (2006). Support vector machines in R. Journal of Statistical Software, 15(9). https://doi.org/10.18637/jss.v015.i09
Vapnik, V. N. (1998). Statistical learning theory. Wiley.