Code
<- c("picante", "Taxonstand", "neonUtilities", "car",
packages "phytools", "vegan", "devtools", "magrittr", "corrplot")
# Package vector names
For today’s practice we will use data from the National Ecological Observatory Network (NEON). We strongly recommend to look at the NEON website to get a deeper understanding about NEON and how it works. Also you can read these papers: A continental strategy for the National Ecological Observatory Network and The plant diversity sampling design for The National Ecological Observatory Network.
To do this laboratory you will need to install the following packages:
<- c("picante", "Taxonstand", "neonUtilities", "car",
packages "phytools", "vegan", "devtools", "magrittr", "corrplot")
# Package vector names
If you don’t want to install the packages one by one, you can use the next command.
# Install packages not yet installed
<- packages %in% rownames(installed.packages())
installed_packages
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages], dependencies = TRUE)
}
We will also need the R package {V.PhyloMaker} to obtain a phylogenetic tree for our species. Note that this package is not located in CRAN, so we will install the package directly from the author’s github.
if ( ! ("V.PhyloMaker" %in% installed.packages())) {remotes::install_github("jinyizju/V.PhyloMaker")}
This command, will, first, check if you already the packages installed, then if a package is not installed in your computer, will install it.
Load installed packages:
sapply(packages, require, character.only = TRUE)
require(V.PhyloMaker)
library(tidyverse)
Double-check your working directory.
We will download data of plant communities directly from NEON and to do that we will use the R package {neonUtilities} and the function ‘loadByProduct’. Before proceed let’s take a look at the core information required in this function. To do that you just can type ?loadByProduct in your console and the documentation for this function will appear in the Help window of RStudio.
The core information we need to inform in the ‘loadByProduct’ function are:
dpID = The identifier of the NEON data product to pull, in the form DPL.PRNUM.REV, e.g. DP1.10023.001.
site = Either the string ‘all’, meaning all available sites, or a character vector of 4-letter NEON site codes, e.g. c(‘ONAQ’,‘RMNP’). Defaults to all.
package = Either ‘basic’ or ‘expanded’, indicating which data package to download. Defaults to basic.
As you can see, the dpID and site correspond to the kind of data we want to download and the site to the location where the data were collected, respectively. We can open the NEON website to find the information required for downloading the data for plant presence and percent cover. Also, you can look at the map of NEON sites to see the distribution of NEON sites across The United States.
For this practice we will use the next information: dpID = “DP1.10058.001” and site = c(“HARV”, “CPER”,“ABBY”) that correspond to plant presence and percent cover and three sites: HARV (Harvard Forest & Quabbin Watershed NEON, MA), CPER (Central Plains Experimental Range NEON, CO) and ABBY (Abby Road NEON, WA). Note that you can download data only for one site but for the sake of getting more practice in data management in R we will download the full data of plant community data for three sites (you can download data for more sites if you prefer).
# Set global option to NOT convert all character variables to factors
options(stringsAsFactors = FALSE)
<- loadByProduct(dpID = "DP1.10058.001",
NEON_data site = c("HARV", "CPER", "ABBY"),
package = "expanded", check.size = TRUE)
# type "y" (with no quotes) in your console to start downloading the data from NEON
Let’s inspect the downloaded data.
names(NEON_data)
View(NEON_data$div_10m2Data100m2Data)
View(NEON_data$div_1m2Data)
Save the raw data in your hard drive - this is a common practice that allow reproducibility and also save you a lot of time.
dir.create("Data") # You can skip this line if you already have this folder
dir.create("Data/NEON")
save(NEON_data, file = "Data/NEON/RawData_NEON_lab4.RData")
The data of abundance of plants correspond to the object “div_1m2Data”, thus let’s isolate that data from the rawdata.
<- NEON_data$div_1m2Data %>%
sel select(namedLocation, domainID, siteID, plotType, plotID, subplotID, endDate,
taxonID, taxonRank, family, scientificName, nativeStatusCode,
percentCover, heightPlantSpecies)
unique(sel$namedLocation)
unique(sel$siteID)
unique(sel$endDate)
The isolated data is a data.frame of 88,189 rows and 14 columns. Some information is not required so let’s clean the data a little bit and select information for only one site and a single period of time.
<- sel %>%
sel drop_na(scientificName) %>% # Removing NAs in the column of species
mutate(Date = endDate) %>%
separate(endDate, sep = "-", into = c("Year", "Month", "Day"))
unique(sel$Year)
unique(sel$siteID)
Select the site HARV and the year 2018.
<- sel %>%
HARV filter(siteID == "HARV" & Year == 2018)
unique(HARV$Year)
unique(HARV$siteID)
head(HARV)
The data corresponding to NEON site HARV for the 2018 contains 2161 rows and 17 columns. If we look at the data, specifically to the column scientificName we can see that the taxonomy used in NEON correspond to the taxonomy used by the USDA, however this taxonomy is not necessarily used by scientist (🤔😵🫣) that rely for example on the taxonomy of the APG Angiosperm Phylogeny Group for Angiosperms and WCSP World Checklist of Conifers for Gymnosperms.
glimpse(HARV)
In any case, we need to standardize the species names in order to proceed with the calculation of metrics.
To do that, we will use the package {Taxonstand} that allow matching the taxonomy among different sources using the working list of all known plant species produced by the botanical community or The Plant List (TPL).
Note - this process will take some time (~60 seconds) and return several warning messages but please do not pay attention to those messages.
<- unique(HARV$scientificName) # vector with scientific names
spp
# Perform taxonomic standardization on plant names (TPL table)
<- TPL(spp, infra = TRUE, corr = TRUE) # it will return a lot of warnings, please do not pay attention to that.
spp_check
head(spp_check)
Check the updated taxonomy.
View(spp_check)
Select the necessary information and combine it with the HARV data.
<- spp_check %>%
taxonomy drop_na(New.Genus, New.Species) %>%
select(Taxon, Family, New.Genus, New.Species, New.Taxonomic.status)
Now let’s combine the new taxonomic information with our NEON data.
<- full_join(HARV, taxonomy, by = c("scientificName" = "Taxon")) HARV_data
We can select some specific columns that we will use from now on.
<- HARV_data %>%
HARV_data mutate(sciName = paste0(New.Genus, "_", New.Species)) %>%
select(siteID, plotID, subplotID, New.Taxonomic.status, family, sciName, percentCover) #%>%
#filter(New.Taxonomic.status == "Species")
head(HARV_data)
We can save the cleaned data and remove the unnecessary information from the environment.
save(HARV, HARV_data, taxonomy, file = "Data/NEON/CleanData_NEON_lab4.RData")
# remove the information that we don't need anymore (for now)
rm(HARV, NEON_data, sel, spp_check, taxonomy, installed_packages, packages, spp)
As a last step the cleaned data will be transformed from the long format to a wide format, in other words, we will transform the NEON data to a matrix in which the species are the columns and the plots (or communities) are the rows.
<- HARV_data %>%
HARV_CDM select(plotID, percentCover, sciName) %>%
sample2matrix(.)
#HARV_mat <- sample2matrix(HARV_data[, c(2, 7, 6)])
nrow(HARV_CDM)
ncol(HARV_CDM)
Alright, until here we have downloaded, cleaned and prepared plant community data for the NEON site HARV. The next step is to prepare the phylogeny for those communities or a community level phylogeny. To do that we will use the most up to date phylogeny of vascular plants Constructing a broadly inclusive seed plant phylogeny and the R package {V.PhyloMaker}.
<- HARV_data %>%
sppPhylo select(family, sciName) %>% # select columns
distinct(family, sciName) # identify unique values
#sppPhylo <- HARV_data[, c(5, 6)]
# Prepare the taxonomy data to extract the phylogeny
<- sppPhylo %>%
sppPhylo mutate(species = gsub("_" , " ", sciName)) %>%
separate(sciName, sep = "_", into = c("genus", "ephitet")) %>%
select(species, genus, family)
head(sppPhylo)
Prepare the phylogeny and plot it.
<- V.PhyloMaker::phylo.maker(sppPhylo, scenarios = "S3") # this will take some time.
result
<- multi2di(result$scenario.3)
phylo
# Check if our phylogeny is ultrametric
is.ultrametric(phylo)
# Check is our phylogeny is bifurcated
is.binary.phylo(phylo)
plot(phylo, show.tip.label = FALSE)
Before continuing with the lab let’s check again if our data (phylogeny and community) match each other. To do this we will use the amazing function match.phylo.com() from the package {picante}.
<- picante::match.phylo.comm(phy = phylo, comm = HARV_CDM)
matched
$phy
matched
$comm matched
Alright, we have all data necessary for calculating different phylogenetic diversity metrics, hooray!
Save the phylogenetic data and clean the R environment.
save(phylo, sppPhylo, result, file = "Data/NEON/Phylo_NEON_lab4.RData")
rm(sppPhylo, result)
Ok, let’s inspect the data that were stored in the object matched.
$comm[1:10, 1:10] matched
Now the phylogeny
plot(matched$phy, show.tip.label = FALSE, type = "fan")
Awesome, we are now ready to explore the jungle of metrics for the evaluation of taxonomic, phenotypic and phylogenetic structure of communities (Pausas & Verdú, 2010). Let’s start with metrics for the taxonomic dimension.
Let’s explore the most common metrics of taxonomic diversity. For this exercise, we are going to use the R package {vegan}.
Simply the number of species recorded in a plot or community.
<- data.frame(S = specnumber(matched$comm)) HARV_tax
Metric that characterizes species diversity in a sample. Assumes that all species are represented in the sample and that individuals within species were sampled randomly.
<- HARV_tax %>%
HARV_tax mutate(H = diversity(matched$comm,
index = "shannon")
)
Metric that characterizes species diversity in a sample. Contrary to Shannon’s H, Simpson’s D captures the variance of the species abundance distribution.
<- HARV_tax %>%
HARV_tax mutate(D = diversity(matched$comm,
index = "simpson")
)
HARV_tax
Metric that count the individuals of each species in a sample, rather than the number of species. A J value ranges from 0 (no evenness) to 1 (complete evenness).
<- HARV_tax %>%
HARV_tax mutate(J = H/log(S)
)
HARV_tax
We just embedded {vegan} functions into {dplyr} to do the taxonomic metric calculations in a tidyverse framework 😬…
Let’s calculate some metrics manually and then using the package {picante} we will calculate the same metrics but for all communities at once.
Phylogenetic diversity is just the sum of the total branch lengths in the community. In this case we are calculating PD using all species in the phylogeny, in other words, assuming that a single community contain the same number of species as the phylogeny.
sum(matched$phy$edge.length) # sum of the total branch lengths in the community
We can calculate PD for individual plots or communities, for example the plot HARV_001
# Select species that are only present in the plot HARV_001
<- matched$comm %>%
HARV_001 filter(rownames(.) == "HARV_001") %>%
t() %>%
data.frame() %>%
filter(HARV_001 > 0)
# Drop species in the phylogeny that are not present in the plot HARV_001
<- drop.tip(matched$phy, setdiff(matched$phy$tip.label, rownames(HARV_001)))
HARV_001_phy
sum(HARV_001_phy$edge.length)
We can confirm the result by calculating PD using the package {picante}. However, instead of calculating plot by plot {picante} will calculate PD for all of the plots.
<- pd(samp = matched$comm,
HARV_PD tree = matched$phy,
include.root = FALSE) # Faith's PD
head(HARV_PD)
We NOW can confirm the PD value estimated by hand is equal to the PD estimated using {picante}.
You can see both metrics SR and PD are highly correlated, that is because PD is highly sensitive to the sample size, i.e., as more species are added to the community more branch lengths are added to the phylogeny and consequently the expected relationship is high.
%>%
HARV_PD ggplot(aes(x = SR, y = PD)) +
geom_point(size = 3, color = "darkgray")
Now see correlation between the two metrics
%$%
HARV_PD cor.test(SR, PD, use = "complete.obs")
Let’s plot the association again…
# Aux function for visualization
<- function() {
theme_nice theme_bw() + #base_family = "Noto Sans") +
theme(panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA),
#plot.title = element_text(face = "bold"),
#strip.text = element_text(face = "bold"),
strip.background = element_rect(fill = "grey80", color = NA),
legend.title = element_text(face = "bold", size = 15),
legend.text = element_text(size = 12))
}
# plot
%>%
HARV_PD ggplot(aes(x = SR, y = PD)) +
geom_point(size = 3, color = "darkgray") +
labs(x = "Species richness", y = "PD (millions of years)") +
theme_nice()
Other common metrics are MPD (mean pairwise distance) and MNTD (mean nearest taxon distance). As in PD, let’s calculate MPD and MNTD manually.
# MPD
<- cophenetic(HARV_001_phy)
dist.trMB
<- dist.trMB[lower.tri(dist.trMB, diag = FALSE)]
dist.trMB
mean(dist.trMB)
# MNTD
<- cophenetic(HARV_001_phy)
dist.trMB2 diag(dist.trMB2) <- NA
apply(dist.trMB2, 2, min, na.rm = TRUE)
mean(apply(dist.trMB2, 2, min, na.rm = TRUE))
And now using the package picante
# MPD
<- mpd(matched$comm, cophenetic(matched$phy))
HARV_MPD head(HARV_MPD)
# MNTD
<- mntd(matched$comm, cophenetic(matched$phy))
HARV_MNTD head(HARV_MNTD)
Until here we just played with the data and also confirmed that we can calculate different metrics by hand. Now let’s do a more formal evaluation of the biodiversity in Harvard Forest & Quabbin Watershed NEON, MA.
The analyses of community phylogenetic started making inferences about the mechanisms structuring the local communities through the evaluation of phylogenetic arrangements in local communities (see Cavender-Bares et al. (2009) for an initial criticism). However, new methods are now available, such that more complex balance between ecological and historical processes at local and regional scales can be incorporated into the analyses (Pigot & Etienne, 2015; Pinto-Ledezma et al., 2019; Pinto‐Ledezma et al., 2020).
Let’s calculate some of the most common metrics.
Note - we will use the object HARV_CDM to store all the results.
PD or phylogenetic diversity is the sum of the total phylogenetic branch length for one or multiple samples.
<- ses.pd(samp = matched$comm,
HARV_CDM tree = matched$phy,
runs = 999) # this will take some time
<- HARV_CDM %>%
HARV_CDM select(ntaxa, pd.obs, pd.obs.z)
#HARV_CDM <- HARV_CDM[, c(1, 2, 6, 7)]
head(HARV_CDM)
Rao’s quadratic entropy (Rao 1982) is a measure of diversity in ecological communities that can optionally take species differences (e.g. phylogenetic dissimilarity) into account.
<- HARV_CDM %>%
HARV_CDM mutate(RaoD = raoD(matched$comm, force.ultrametric(matched$phy))$Dkk)
head(HARV_CDM)
# SES-MPD
<- ses.mpd(samp = matched$comm,
HARVsesmpd dis = cophenetic(matched$phy),
null.model = "taxa.labels",
abundance.weighted = TRUE,
runs = 999)
<- HARVsesmpd %>%
HARVsesmpd select(mpd.obs, mpd.obs.z)
<- bind_cols(HARV_CDM, HARVsesmpd) HARV_CDM
# SES-MNTD
<- ses.mntd(samp = matched$comm,
HARVsesmntd dis = cophenetic(matched$phy),
null.model = "taxa.labels",
abundance.weighted = TRUE,
runs = 999)
<- HARVsesmntd %>%
HARVsesmntd select(mntd.obs, mntd.obs.z)
<- bind_cols(HARV_CDM, HARVsesmntd) HARV_CDM
Phylogenetic species variability quantifies how phylogenetic relatedness decreases the variance of a hypothetical unselected/neutral trait shared by all species in a community.
# PSV or phylogenetic species variability
<- HARV_CDM %>%
HARV_CDM mutate(PSV = psv(samp = matched$comm,
tree = matched$phy,
compute.var = TRUE)$PSVs)
HARV_CDM
Phylogenetic species richness is the number of species in a sample multiplied by PSV.
# PSR or phylogenetic species richness
<- HARV_CDM %>%
HARV_CDM mutate(PSR = psr(samp = matched$comm,
tree = matched$phy,
compute.var = TRUE)$PSR)
HARV_CDM
Phylogenetic species evenness is the metric PSV modified to incorporate relative species abundances.
# PSE or phylogenetic species evenness
<- HARV_CDM %>%
HARV_CDM mutate(PSE = pse(samp = matched$comm,
tree = matched$phy)$PSEs)
HARV_CDM
qD(p) is a metric that measure the variation in species’ divergences within communities. This metric is a modification of the Hill index, weighting a species’ proportional abundance by its relative share of phylogenetic information.
# Scheiner 2012 qD(p)
source("https://raw.githubusercontent.com/jesusNPL/BiodiversityScience/master/Spring2021/R-Functions/qDp.R")
<- HARV_CDM %>%
HARV_CDM mutate(qDP = qDp(matched$phy, matched$comm, q = 2))
HARV_CDM
We have calculated several metrics that describe the phylogenetic structure of communities at Harvard Forest & Quabbin Watershed NEON (HARV) site.
scatterplotMatrix(HARV_CDM)
Hmmm, Kinda ugly…
Let’s use the package {corrplot} to get a more informative figure.
<- cor(HARV_CDM, method = 'spearman')
cor_mat
corrplot.mixed(cor_mat,
tl.pos = "lt",
tl.cex = 0.7,
number.cex = 0.7,
addCoefasPercent = TRUE,
mar = c(0, 0, 1, 0))
Explore the correlation among metrics.
%>%
HARV_CDM cor.table()
You can also plot the relationship between metrics.
%>%
HARV_CDM ggplot(aes(x = RaoD, y = PSE)) +
geom_point(size = 3, color = "darkgray") +
labs(x = "RaoD", y = "PSE") +
theme_nice()
RaoD and PSE are highly correlated; explain why
<- HARV_CDM %>%
HARV_mds select(-c(pd.obs.z, mpd.obs.z, mntd.obs.z)) %>% # MNDS does not like negative values
na.omit() %>%
metaMDS(.)
ordiplot(HARV_mds, type = "t", display = "species")
Let’s use {ggplot2} to make a better figure.
## Metrics scores
<- data.frame(HARV_mds$species) %>%
HARV_mds_scores mutate(Metric = rownames(.))
%>%
HARV_mds_scores ggplot(aes(x = MDS1, y = MDS2, color = Metric)) +
geom_point(size = 3, alpha = 0.5) +
theme_nice()
What do you think?
Which metric would you use for your paper?
The challenge for this assignment is:
Repeat all process but select two NEON sites across the United States, compare the results of both sites and discuss the difference in the results if there any. For example, you can use the other two sites we have downloaded at the beginning of the lab, i.e., “CPER” and “ABBY”.
Another option is to calculate the phylogenetic diversity metrics for each year on a NEON site. For example, you can use the data from HARV but instead of doing the analysis only for the 2018 (as we did here) you can repeat the process for all years from 2013 to 2022 and discuss if the taxonomic (species richness) and phylogenetic diversities changed.
The end! for now…