To do this laboratory you will need to install the following packages:
<- c("raster", "sp", "rgeos", "rworldmap", "dismo", "rgdal", "maptools",
packages "kernlab", "rgbif", "spThin", "sdmvspecies", "mmap", "sf", "maps")
# Package vector names
If you don’t want to install the packages one by one, you can use the next command.
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages], dependencies = TRUE)
}
The package {scrubr} is not available on CRAN, so we will install it from GitHub.
if ( ! ("scrubr" %in% installed.packages())) {remotes::install_github("ropensci/scrubr")}
Although we installed several packages in the previous lines, one package is still missing: the R package {sdm}. This package is the main wrapper for modeling environmental niches (ghosts) or Grinnellian niches. We will install this package separately because it has several dependencies.
if ( ! ("sdm" %in% installed.packages())) {install.packages("sdm", dependencies = TRUE)}
A last step in setting up the R package {sdm} is to run the command installAll(); this will install any missing packages and guarantee the complete functionality of the package. Yes, this is annoying, but that is how the package was designed.
sdm::installAll()
sapply(packages, require, character.only = TRUE)
library(sdm)
library(scrubr)
library(tidyverse)
Double-check your working directory.
Get some occurrence data for our species from GBIF, directly within R. This may take some time, given the number of occurrences for the selected species.
NOTE: we need to have an internet connection.
Set the vector with some species scientific names to download occurrence data from GBIF.
<- c("Quercus virginiana", "Quercus minima", "Quercus alba", "Quercus fusiformis") spp
Download data using the vector of species names
gbif_data <- occ_data(scientificName = spp, hasCoordinate = TRUE, limit = 2000)
#decrease the 'limit' if you just want to see how many records there are without waiting all the time that it will take to download the whole dataset.
Save the raw data
dir.create("Data")
dir.create("Data/OCC")
# save the downloaded data in the working directory
save(gbif_data, file = "Data/OCC/oaks_raw_occ.RDATA")
Inspect the downloaded data. Note that we have downloaded data for four species but for simplicity let’s pick one species for this lab. I will select Quercus minima but you can select any other species.
The returned object is a list of 4 elements (i.e., the length of our species vector); thus, to select the focal species just use [[position of the species in the vector]] to access that species. For example, I selected Quercus minima, so to access this species I just need to type gbif_data[[2]]. Let’s see:
#gbif_data
# if, for any species, "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations
# check how the data are organized:
names(gbif_data)
names(gbif_data[[2]]$meta)
names(gbif_data[[2]]$data)
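Related to the note above about "Records found" versus "Records returned", here is a quick sanity check. This is a minimal sketch, assuming occ_data() returned one list element per species (as above) with the total count stored in each element's meta:
# Sketch: compare records found on GBIF vs. records actually returned per species
sapply(gbif_data, function(x) c(found = x$meta$count, returned = nrow(x$data)))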
Wow, there are a bunch of columns; let’s select the ones that are relevant for our purposes.
# get the columns that matter for mapping and cleaning the occurrence data:
<- gbif_data[[2]]$data[, c("decimalLongitude", "decimalLatitude",
occ_quemin "scientificName", "occurrenceStatus",
"coordinateUncertaintyInMeters",
"institutionCode", "references")]
head(occ_quemin)
Let’s look at the data.
glimpse(occ_quemin)
# It looks like the data include two species, let's clean the data first
occ_quemin <- occ_quemin %>%
  filter(scientificName == "Quercus minima (Sarg.) Small")
glimpse(occ_quemin)
Now let’s clean the data a little. First, remove all records with missing coordinates.
occ_quemin <- occ_quemin %>%
  filter(!is.na(decimalLongitude) & !is.na(decimalLatitude))
#occ_quealb <- subset(occ_quealb, !is.na(decimalLongitude) & !is.na(decimalLatitude))
occ_quemin
Let’s do some further data cleaning with functions of the {scrubr} package (but note this cleaning is not exhaustive!). This will remove some improbable coordinates from the data.
occ_quemin <- coord_incomplete(coord_imprecise(coord_impossible(coord_unlikely(occ_quemin))))
occ_quemin
glimpse(occ_quemin)
## Rows: 537 Columns: 7
Now transform our data into a spatial object; this will help us visualize the data better.
quemin_spt <- st_as_sf(occ_quemin,
                       coords = c("decimalLongitude", "decimalLatitude"),
                       crs = 4326) #%>%
  #st_cast("MULTIPOINT")
# explore the data
glimpse(quemin_spt)
Let’s plot the distribution of Quercus minima occurrences.
sf_use_s2(FALSE)
# world map
worldMap <- rnaturalearth::ne_countries(scale = "medium",
                                        type = "countries",
                                        returnclass = "sf")
# country subset
USpoly <- worldMap %>%
  #filter(region_wb == "North America")
  filter(admin == "United States of America")
Plot the occurrence records
map_US <- map_data('world')[map_data('world')$region == "USA", ]
ggplot() +
geom_polygon(data = map_data("world"),
aes(x = long, y = lat, group = group),
color = "#f1f2f3", fill = '#f3f3f3') +
geom_polygon(data = map_US,
aes(x = long, y = lat, group = group),
color = 'lightgray', fill = 'lightgray') +
geom_point(data = occ_quemin,
aes(x = decimalLongitude, y = decimalLatitude),
color = "darkgray", alpha = 0.5) +
  coord_fixed(1.3,
              xlim = c(-65, -125),
              ylim = c(50, 25)) +
  theme(panel.background = element_rect(fill = "lightblue"))
A small joke 😬🫣 (note the flipped axis limits).
Now the real map…
ggplot() +
geom_polygon(data = map_data("world"),
aes(x = long, y = lat, group = group),
color = "#f1f2f3", fill = '#f3f3f3') +
geom_sf(data = USpoly) +
geom_sf(data = quemin_spt, color = "darkgray", alpha = 0.5) +
  coord_sf(
    xlim = c(-125, -65),
    ylim = c(25, 50)
  ) +
  theme(
plot.background = element_rect(fill = "#f1f2f3"),
panel.background = element_rect(fill = "lightblue"),
panel.grid = element_blank(),
line = element_blank(),
rect = element_blank()
)
The data seem correct, but spatial aggregation of records can cause trouble in model predictions. To address this we will apply spatial thinning to the species occurrences using the package {spThin}. Given that our species has a broad distribution, let’s set a distance threshold of 2 km between occurrence points.
quemin_thinned <- thin(
  loc.data = occ_quemin,
verbose = FALSE,
long.col = "decimalLongitude",
lat.col = "decimalLatitude",
spec.col = "scientificName",
thin.par = 2, # points have at least a minimum distance of 2 km from each other
reps = 1,
locs.thinned.list.return = TRUE,
write.files = FALSE,
out.dir = "Data/OCC/")
quemin_thinned <- as.data.frame(quemin_thinned)
quemin_thinned$Species <- "Quercus_minima"
Explore the results of the thinning process. We can see that the number of occurrences of Quercus minima decreased from 537 occurrences to 346 occurrences. We will use the thinned data for further analyses.
glimpse(quemin_thinned)
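A quick count check; these are the numbers quoted above:
# Record counts before and after thinning
nrow(occ_quemin)     # 537 occurrences before thinning
nrow(quemin_thinned) # 346 occurrences after thinning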
Transform the thinned occurrences to a spatial object
quemin_thinned_sf <- st_as_sf(quemin_thinned,
                              coords = c("Longitude", "Latitude"),
                              crs = 4326)
# The code below uses base R with {sp}. We will use this object for the modelling part
quemin_thinned_spt <- SpatialPointsDataFrame(coords = quemin_thinned[, 1:2],
                                             data = quemin_thinned,
                                             proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))
Now visualize both spatial objects.
ggplot() +
geom_polygon(data = map_data("world"),
aes(x = long, y = lat, group = group),
color = "#f1f2f3", fill = '#f3f3f3') +
geom_sf(data = USpoly) +
geom_sf(data = quemin_spt, color = "blue", alpha = 0.5) +
geom_sf(data = quemin_thinned_sf, color = "red", alpha = 0.3) +
  coord_sf(
    xlim = c(-125, -65),
    ylim = c(25, 50)
  ) +
  theme(
plot.background = element_rect(fill = "#f1f2f3"),
panel.background = element_rect(fill = "lightblue"),
panel.grid = element_blank(),
line = element_blank(),
rect = element_blank()
)
Now let’s save the processed data and keep quemin_thinned_spt for further analyses.
save(occ_quemin, quemin_spt, quemin_thinned_sf, quemin_thinned,
file = "Data/OCC/quemin_OCC_processed.RData")
rm(gbif_data, occ_quemin, quemin_spt)
Up to here we have downloaded and cleaned species occurrence data; now we will prepare the environmental data that will be used as predictors for constructing the ENM/SDM for our species.
First we will obtain bioclimatic variables from WorldClim using the function getData() from the package {raster}; then we will select which variables are most relevant for explaining the distribution of our species.
dir.create("Data/Envi")
bios <- raster::getData("worldclim", var = "bio", res = 10,
                        download = TRUE, path = "Data/Envi")
You can see that by using the argument path within the function getData() the bioclimatic variables were also downloaded directly to your hard drive, so you don’t need to download the data again.
If you downloaded the bioclimatic data from Dropbox, please use the next lines of code.
biosNames <- list.files("Data/Envi/wc10/", pattern = "bil$")

# Please double-check your folder path. You may need to change "wc10" to "bio_10m_bil"
bios <- stack(paste0("Data/Envi/wc10/", biosNames))
Let’s explore some details of the bioclimatic data
names(bios)
str(bios[[1]])
Now let’s plot some bioclimatic variables, say BIO1 and BIO12, which are mean annual temperature and annual precipitation, respectively.
plot(stack(bios$bio1, bios$bio12))
Okay, these data are at the global level but our species only occurs in North America. To get a better visualization, let’s crop the bioclimatic data to the extent of the United States.
USpoly_spt <- subset(countriesCoarse, ADMIN == "United States of America") # used as extent
plot(USpoly_spt)
Now crop the bioclimatic data to the extent of the United States.
bios_US <- crop(bios, USpoly_spt)
Let’s visualize the results of cropping the bioclimatic data and overlay the thinned occurrences of Quercus minima.
plot(bios_US[[1]]) # mean annual temperature
plot(quemin_thinned_spt, col = "red", pch = 16, add = TRUE) # add occurrence records
plot(countriesCoarse, lwd = 2, lty = 2, add = TRUE) # add country borders
Cool!
Before any further analysis we need to define the accessible area for our species in geographical space; remember, our species is distributed in the United States. To define the accessible area, let’s use a bounding box around the known species occurrences plus ~300 km beyond that bound (3 degrees; 1 degree of latitude is roughly 111 km). This gives us an approximation of the species’ dispersal within the geographical domain, i.e., North America. This bounding box represents the M component in the BAM diagram.
### Species specific accessible area
bb <- bbox(quemin_thinned_spt) # bounding box

e <- extent(c(bb[1] - 3, bb[3] + 3, bb[2] - 3, bb[4] + 3)) # bounding box + 300 km

p <- as(e, 'SpatialPolygons') # transform to polygon
crs(p) <- crs(bios_US) # use the geographical reference of the bioclimatic variables
crs(USpoly_spt) <- crs(bios_US)

out <- gIntersection(USpoly_spt, p, byid = FALSE) # use NAs to eliminate areas on the sea
Now let’s plot the accessible area.
plot(bios_US[[1]])
plot(p, add = TRUE, lty = 2)
plot(out, add = TRUE, lwd = 2)
#enviSPP <- raster::crop(envi, out)
Nice, we can see that the accessible area for our species is the southeastern United States.
Now crop the bioclimatic data to the extent of the accessible area.
bios_spp <- raster::crop(bios_US, out)
bios_spp <- raster::mask(bios_spp, out)
# plot the results
plot(bios_spp[[1]])
plot(quemin_thinned_spt, add = TRUE, col = "red", pch = 16)
plot(USpoly_spt, add = TRUE, lty = 2)
Note that we will use this accessible area as the extent for all subsequent analyses.
To model the environmental niches (ghosts) and to project the species distributions (shadows) we need data on species presences and absences; however, we only have presence data. What should we do if we don’t have absences? That’s the question! There is no straightforward answer, but the most common procedure to get absence data is to generate random points (AKA pseudoabsences) across the geographical domain (i.e., the accessible area or M). There is a lot of discussion about how many pseudoabsences should be used for ENM/SDM; here we will use a conservative number, i.e., twice the number of presences.
set.seed(12345) # Random Number Generation to obtain the same result
# Generate the data
absence <- randomPoints(mask = bios_spp[[1]],
                        n = round(nrow(quemin_thinned)*2, 0), # number of pseudoabsences
                        p = quemin_thinned_spt, # presence object
                        ext = extent(bios_spp)) # extent
Now let’s combine the presence and pseudoabsence data
# Presences
presence <- data.frame(quemin_thinned) # presence data

# Pseudoabsences
absence <- data.frame(absence) # pseudoabsence data
absence$Species <- "Quercus_minima"
names(absence) <- names(presence)

presence$Occurrence <- 1 # presence data
absence$Occurrence <- 0 # pseudoabsence data

quemin <- rbind(presence, absence) # combine both datasets
Finally transform the presence-pseudoabsence data to a spatial object and visualize it!
coordinates(quemin) <- ~ Longitude + Latitude
crs(quemin) <- crs(bios_spp)
quemin
Let’s visualize the results
plot(bios_spp[[1]])
plot(quemin[quemin$Occurrence == 1, ], col = "blue", add = TRUE, pch = 16)
points(quemin[quemin$Occurrence == 0, ], col = "red", pch = 16)
We can save the processed data and clean the environment a little bit.
save(presence, absence, quemin, file = "Data/OCC/quemin_PresAbs.RData")
save(bb, e, USpoly_spt, out, p, file = "Data/Envi/accessible_area_quemin.RData")
rm(absence, presence, bios, e, out, p, bb)
Now we need to decide which variables are necessary to model the niche (Grinnellian niche, or fundamental niche, or the ghost) for our oak species. We can base the selection of variables on expert knowledge, on statistical tools, or both.
Let’s use the statistical way. However, we can also ask Jeannine which variables are most important for the distribution of oaks.
quemin_bios <- data.frame(raster::extract(bios_spp, quemin))

quemin_bios <- cbind(data.frame(quemin), quemin_bios)

quemin_bios <- quemin_bios[complete.cases(quemin_bios), ]
quemin_bios <- na.omit(quemin_bios)
glimpse(quemin_bios)
One way to select variables is to explore the correlation among the predictors. We can use a threshold of |r| < 0.7, which means that variables correlated below this threshold can be considered non-problematic (Dormann et al., 2013); however, you can use a more conservative threshold, such as < 0.5.
To do that, we first estimate a correlation matrix from the predictors. We use the Spearman rank correlation coefficient, as we do not know whether all variables are normally distributed.
cor_mat <- cor(quemin_bios[, c(6:24)], method = 'spearman')
Let’s visualize the correlation matrix.
corrplot::corrplot.mixed(cor_mat,
                         tl.pos = "lt",
                         tl.cex = 0.5,
                         number.cex = 0.5,
                         addCoefasPercent = TRUE,
                         mar = c(0, 0, 1, 0))
It looks like several variables are highly correlated; let’s identify the ones that are highly correlated with each other and exclude them from further analysis. However, inspecting predictor by predictor can take forever. What should we do? Well, we have two options:
Talk with the specialist about which predictors should we use, or
Let the computer decide
The simple solution is letting the computer decide. To do that, we will use an amazing function called select07() from the package {mecofun} that selects the predictors below the correlation threshold.
Let’s install the package from its repository, as it is not currently published on CRAN. Important: a message may appear in your console asking whether you would like to update existing packages; please type 3 and press Enter. This tells R to install just {mecofun} and avoids errors during installation.
# Install the package from source
remotes::install_git("https://gitup.uni-potsdam.de/macroecology/mecofun.git")
The function will return a list of three objects:
AIC values for each model
The correlation matrix
A vector of the selected variables.
# Run select07()
covar_sel <- mecofun::select07(X = quemin_bios[, -c(1:5)], # predictors only
                               y = quemin_bios$Occurrence, # presence-absence data
                               threshold = 0.7) # here you can change the correlation threshold
# Check out the structure of the resulting object:
str(covar_sel)
Let’s see the results…
covar_sel$AIC
covar_sel$cor_mat
covar_sel$pred_sel
According to the select07() function eight climatic variables (predictors) best fit the data and also have low correlation. Assuming that this is correct, we will use these variables for further analyses. The selected variables are: “bio18”, “bio8”, “bio9”, “bio10”, “bio5”, “bio12”, “bio2”, “bio14”. Store these variable names as a vector.
preds <- covar_sel$pred_sel
preds
If you have issues installing the package {mecofun}, please use the next lines of code. The function select07_v2() is the very same function from the package {mecofun}; I just extracted it from the package to avoid installation issues.
#source("R-Functions/select07_mod.R")
source("https://raw.githubusercontent.com/jesusNPL/BiodiversityScience/master/Spring2021/R-Functions/select07_mod.R")
# Run select07()
covar_sel <- select07_v2(X = quemin_bios[, -c(1:5)], # predictors only
                         y = quemin_bios$Occurrence, # presence-absence data
                         threshold = 0.7) # here you can change the correlation threshold
This extracted function will return the same outputs.
preds <- covar_sel$pred_sel
preds
Finally, let’s select the bioclimatic variables and plot them
bios_quemin_sel <- bios_spp[[preds]]
plot(bios_quemin_sel)
Okay, we’re ready to build the models for our species.
To model the environmental niches of Quercus minima we will use the R package {sdm}. First we need to create a data object, which will save us a lot of time later.
quemin_DATA <- sdmData(formula = Occurrence ~ bio2 + bio5 + bio8 + bio9 + bio10 + bio12 + bio14 + bio18,
                       train = quemin, # presence-pseudoabsence data
                       predictors = bios_quemin_sel, # selected covariables
                       crs = crs(bios_quemin_sel))
Let’s explore a little bit the sdm object.
quemin_DATA
We can see that our object has 1 species, 8 features (predictors), a presence-absence data type (well, presence-pseudoabsence), and the number of records.
Now let’s model the species-environment relationship, i.e., the Grinnellian niche. With {sdm} we can model the species-environment relationships using several algorithms at once, but this depends completely on the power of your computer, so use it at your own risk. To get the complete list of algorithms available in the {sdm} package just type getmethodNames('sdm') in your console.
getmethodNames('sdm')
Here we will explore the predictions of six algorithms: Bioclim, Domain, GLM, GAM, SVM, and Random Forest.
The BIOCLIM algorithm computes the similarity of a location by comparing the values of environmental variables at any location to a percentile distribution of the values at known locations of occurrence (‘training sites’).
The Domain algorithm computes the Gower distance between environmental variables at any location and those at any of the known locations of occurrence (‘training sites’). For each variable the minimum distance between a site and any of the training points is taken.
A generalized linear model (GLM) is a generalization of ordinary least squares regression. Models are fit using maximum likelihood and by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Depending on how a GLM is specified it can be equivalent to (multiple) linear regression, logistic regression or Poisson regression.
Random Forest (Breiman, 2001) is an extension of the classification and regression trees (CART; Breiman (1998)).
Support Vector Machines (SVMs; Vapnik (1998)) are an excellent tool for classification, novelty detection, and regression. They apply a simple linear method to the data in a high-dimensional feature space that is non-linearly related to the input space, but in practice this does not involve any computations in that high-dimensional space. This simplicity, combined with state-of-the-art performance on many learning problems (classification, regression, and novelty detection), has contributed to the popularity of the SVM (Karatzoglou et al., 2006).
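Before fitting everything at once with {sdm}, it may help to see what a single algorithm does on its own. Below is a minimal sketch fitting BIOCLIM directly with {dismo}; this is an illustration, not part of the original workflow, and it assumes the objects quemin_thinned_spt and bios_quemin_sel created in the steps above.
# Minimal sketch: fit BIOCLIM directly with {dismo}
pres_xy <- coordinates(quemin_thinned_spt) # presence coordinates (matrix)
bc <- dismo::bioclim(bios_quemin_sel, pres_xy) # percentile envelope per predictor
pred_bc <- predict(bc, bios_quemin_sel) # suitability from position within the envelope
plot(pred_bc)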
If you have a powerful computer you can set more cores, which will speed up computation. Here we are going to fit 6 models and evaluate them through 2 runs of subsampling, each drawing 30 percent of the training data as a test dataset:
# This takes some time (~3 minutes), please be patient!
quemin_SDM <- sdm(Occurrence ~ ., # the presence-absence formula
                  data = quemin_DATA, # sdm data object
                  methods = c("bioclim", "domain.dismo", "glm", "gam", "rf", "svm"), # algorithms
                  replication = "sub", test.percent = 30, n = 2, # training-testing subsampling
                  parallelSettings = list(ncore = 2, method = "parallel")) # parallel computation
Now let’s see how well each algorithm fitted the data.
quemin_SDM
We can see that all algorithms worked relatively well. The evaluation table shows four model-evaluation metrics, i.e., AUC, COR, TSS, and Deviance. Each metric has its own properties, but I personally like the true skill statistic (TSS) because it evaluates the matches and mismatches between observations and predictions. Looking at the TSS column, the algorithms that best fitted the occurrence data are random forest, gam, svm, and glm. Interestingly, these models also have the highest values of the area under the receiver operating characteristic curve (AUC).
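As a reminder of what TSS measures, here is a small sketch computing it from a 2x2 confusion matrix; the counts are hypothetical, purely for illustration.
# TSS = sensitivity + specificity - 1 (hypothetical confusion-matrix counts)
tp <- 120; fn <- 30 # presences correctly predicted / missed
tn <- 500; fp <- 80 # absences correctly predicted / missed
sens <- tp / (tp + fn) # sensitivity (true positive rate) = 0.8
spec <- tn / (tn + fp) # specificity (true negative rate) ~ 0.86
sens + spec - 1 # TSS ~ 0.66; 1 is perfect, 0 is no better than random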
Another way to evaluate the model performance is to plot the AUCs.
roc(quemin_SDM, smooth = TRUE)
# Get the variable importance
getVarImp(quemin_SDM)
# Get the variable importance for a specific algorithm
getVarImp(quemin_SDM, method = "rf") # VI for Random Forest algorithm
How do you feel about that?
Are the important variables the same across algorithms?
Up to here we have downloaded and processed occurrence data and fitted and evaluated six different algorithms. Now let’s predict the Grinnellian niche of Quercus minima in the accessible area. We could do this for all six algorithms at once, but it would take a very long time, so let’s go algorithm by algorithm. In addition, you can use the fitted object (i.e., quemin_SDM) to make predictions for other regions (e.g., areas with similar environmental conditions or invadable areas) or to project to other time periods (i.e., the future or the past).
quemin_ENM_bioclim <- predict(quemin_SDM, # sdm object
                              newdata = bios_quemin_sel, # environmental data
                              method = "bioclim", # algorithm
                              mean = TRUE)
quemin_ENM_bioclim
Now plot the resulting prediction from the bioclim algorithm. This map shows the probability of occurrence of Quercus minima under bioclim, where values close to 1 indicate high probability and values near 0 indicate low probability of occurrence. Before visualizing the prediction, let’s rescale it to make it more interpretable.
quemin_ENM_bioclim <- sdmvspecies::rescale(quemin_ENM_bioclim$sp_1.m_bioclim.re_subs)
plot(quemin_ENM_bioclim)
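For reference, sdmvspecies::rescale() appears to perform a linear min-max stretch of the raster values to the 0-1 range; here is a manual sketch of an equivalent operation, under that assumption:
# Manual min-max stretch to [0, 1] (sketch, assumed equivalent to sdmvspecies::rescale)
rescale01 <- function(r) {
  (r - raster::cellStats(r, "min")) /
    (raster::cellStats(r, "max") - raster::cellStats(r, "min"))
}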
Let’s repeat the process for the other algorithms.
quemin_ENM_domain <- predict(quemin_SDM, newdata = bios_quemin_sel,
                             method = "domain.dismo", mean = TRUE)

quemin_ENM_glm <- predict(quemin_SDM, newdata = bios_quemin_sel,
                          method = "glm", mean = TRUE)

quemin_ENM_gam <- predict(quemin_SDM, newdata = bios_quemin_sel,
                          method = "gam", mean = TRUE)

quemin_ENM_rf <- predict(quemin_SDM, newdata = bios_quemin_sel,
                         method = "rf", mean = TRUE)

quemin_ENM_svm <- predict(quemin_SDM, newdata = bios_quemin_sel,
                          method = "svm", mean = TRUE)

# Please, do not pay attention to the warnings...
quemin_ENM_domain <- sdmvspecies::rescale(quemin_ENM_domain$sp_1.m_domain.dismo.re_subs)
quemin_ENM_glm <- sdmvspecies::rescale(quemin_ENM_glm$sp_1.m_glm.re_subs)
quemin_ENM_gam <- sdmvspecies::rescale(quemin_ENM_gam$sp_1.m_gam.re_subs)
quemin_ENM_rf <- sdmvspecies::rescale(quemin_ENM_rf$sp_1.m_rf.re_subs)
quemin_ENM_svm <- sdmvspecies::rescale(quemin_ENM_svm$sp_1.m_svm.re_subs)
Now plot all six predictions
quemin_ENM_all <- raster::stack(quemin_ENM_bioclim, quemin_ENM_domain,
                                quemin_ENM_glm, quemin_ENM_gam,
                                quemin_ENM_svm, quemin_ENM_rf)
names(quemin_ENM_all) <- c("Bioclim", "Domain", "GLM", "GAM", "SVM", "RF")
plot(quemin_ENM_all)
As you can see, the algorithms returned different predictions; for example, Domain returns a very broad prediction while Bioclim returns a narrow one. From this we can conclude that there is no silver bullet (i.e., no perfect algorithm) for predicting Grinnellian niches. If you are curious about this issue, this paper led by Huijie Qiao can help you.
One potential solution is to create an ensemble model. Basically, an ensemble returns a prediction by combining the results of different modeling algorithms (Araújo & New, 2007); these ensemble predictions are also known as consensus predictions. Here, rather than using all six models, we will use the best four (i.e., rf, gam, svm, and glm) and weight them by the TSS metric (a manual sketch of the idea follows the ensemble() call below).
### Ensemble prediction - ensemble based on TSS statistics
quemin_ENM_ensemble <- ensemble(quemin_SDM, # sdm object
                                newdata = bios_quemin_sel, # covariables
                                method = c("rf", "gam", "svm", "glm"), # algorithms
                                setting = list(method = "weighted", stat = "TSS"), # metric
                                parallelSettings = list(ncore = 2, method = "parallel"))
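For intuition, a TSS-weighted ensemble is essentially a weighted average of the single-model predictions. Here is a manual sketch with made-up weights; {sdm} derives the real weights from the TSS values in the evaluation table, so this is illustrative only:
# Manual sketch of a weighted ensemble (hypothetical TSS weights, for illustration)
w <- c(rf = 0.80, gam = 0.75, svm = 0.72, glm = 0.70)
manual_ensemble <- (quemin_ENM_rf * w["rf"] + quemin_ENM_gam * w["gam"] +
                    quemin_ENM_svm * w["svm"] + quemin_ENM_glm * w["glm"]) / sum(w)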
Don’t forget to re-scale the model prediction and visualize it.
quemin_ENM_ensemble <- sdmvspecies::rescale(quemin_ENM_ensemble)
Plot the ensemble prediction
plot(quemin_ENM_ensemble)
How do you feel about that?
Does the ensemble prediction outperform single algorithm predictions?
Let’s try making a nicer map.
library(rasterVis)
library(RColorBrewer)
mapTheme <- rasterTheme(region = rev(brewer.pal(11, "Spectral")),
                        layout.widths = list(right.padding = 10),
                        axis.line = list(col = "transparent"),
                        tick = list(col = 'transparent'))
levelplot(quemin_ENM_ensemble,
maxpixels = 1e10,
margin = FALSE,
par.settings = mapTheme,
scales = list(x = list(draw = FALSE),
y = list(draw = FALSE)),
zlim = c(0, 1))
You might also want to save the predictions.
dir.create("Results")
dir.create("Results/ENMs")
# save the predictions
writeRaster(quemin_ENM_all,
filename = "Results/ENMs/quemin_ENM",
suffix = names(quemin_ENM_all),
format = "GTiff",
bylayer = TRUE,
overwrite = TRUE)
# save the ensemble prediction
writeRaster(quemin_ENM_ensemble,
filename = "Results/ENMs/quemin_ENM_Ensemble",
format = "GTiff",
bylayer = TRUE,
overwrite = TRUE)
Finally, let’s produce “shadows” from the “ghosts”.
First we need to set a binarization threshold for the model predictions; the threshold (cut-off) is used to transform model predictions (probabilities, distances, or similar values) into a binary score (presence or absence).
Here we will find the threshold for the ensemble prediction, but you can do the same for each algorithm separately. The threshold we will use is intended to maximize the sensitivity (the ability of a test to correctly identify presences, or true positive rate) plus the specificity (the ability of a test to correctly identify absences, or true negative rate) of the predictions; this is the criterion extracted with threshold[2] below.
dt <- data.frame(as.data.frame(quemin_DATA), coordinates(quemin_DATA))

glimpse(dt)

prediction <- raster::extract(quemin_ENM_ensemble, dt[, c("Longitude", "Latitude")])

evaluation <- sdm::evaluates(dt$Occurrence, prediction) # observed versus expected

threshold_sel <- evaluation@threshold_based$threshold[2]
round(threshold_sel, 2)
The threshold that maximizes sensitivity plus specificity is 0.34; we will use that value to produce “shadows” from the “ghosts”.
quemin_SDM_ensemble <- quemin_ENM_ensemble

# reclassify our ensemble prediction
quemin_SDM_ensemble[] <- ifelse(quemin_SDM_ensemble[] >= threshold_sel, 1, 0)
Plot the distribution of Quercus minima
plot(quemin_SDM_ensemble)
plot(countriesCoarse, add = TRUE)
That’s it, we modeled environmental niches (ghosts) to produce geographical distributions (shadows).
You might also want to save this final raster file in your hard drive.
dir.create("Results/SDMs")
writeRaster(quemin_SDM_ensemble,
filename = "Results/SDMs/quemin_SDM_Ensemble",
format = "GTiff",
bylayer = TRUE,
overwrite = TRUE)
Please respond each question based on the practice.
Summarize the data: how many records are there, how many have coordinates, how many records without coordinates have a textual georeference (locality description)?
Do you think the observations are a reasonable representation of the distribution (and ecological niche) of the species?
Is there a best model? Explain your answer.
Can we use ENM/SDM to aid the conservation of Quercus minima? Explain your answer.
Based on the lecture and the practice, what can we conclude?
Model the niche and predict the distribution of a different species (you can use one of the species listed at the beginning of the class).
Please explain the algorithm used for modeling the niche and predict the distribution of your species and why you decided to use it.
The end, for now 😉!