Measuring phylogenetic diversity within communities

Showing some neat features of R!
Author
Affiliation
Published

March 27, 2023

Note

The main goal of this practice is to present basic understanding about measuring phylogenetic diversity within communities or best known as the analysis of community phylogenetics. The community phylogenetics integrates ecological and evolutionary concepts and explores the mechanisms (e.g., biotic interactions or environmental filters) governing the assembly of ecological communities.

There are different sources of information and web pages with a lot of information about this field. The most common and useful are the web pages of the books: Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology and Phylogenies in Ecology. Among the most influential papers in this field are Phylogenies and Community Ecology by Webb et al. (2002) and The merging of community ecology and phylogenetic biology by Cavender-Bares et al. (2009).

1 Set up your data and your working directory

For today’s practice we will use data from the National Ecological Observatory Network (NEON). We strongly recommend to look at the NEON website to get a deeper understanding about NEON and how it works. Also you can read these papers: A continental strategy for the National Ecological Observatory Network and The plant diversity sampling design for The National Ecological Observatory Network.

To do this laboratory you will need to install the following packages:

Code
packages <- c("picante", "Taxonstand", "neonUtilities", "car", 
              "phytools", "vegan", "devtools", "magrittr", "corrplot") 
# Package vector names
Function install.packages()

You can use the function install.packages() to install the packages.

If you don’t want to install the packages one by one, you can use the next command.

Code
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())

if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages], dependencies = TRUE)
}

We will also need the R package {V.PhyloMaker} to obtain a phylogenetic tree for our species. Note that this package is not located in CRAN, so we will install the package directly from the author’s github.

Code
if ( ! ("V.PhyloMaker" %in% installed.packages())) {remotes::install_github("jinyizju/V.PhyloMaker")}

This command, will, first, check if you already the packages installed, then if a package is not installed in your computer, will install it.

Load installed packages:

Code
sapply(packages, require, character.only = TRUE)
require(V.PhyloMaker)
library(tidyverse)

Double-check your working directory.

Function getwd()

You can use the function getwd() to get the current working directory.

2 Prepare data

2.1 Download community data from NEON

We will download data of plant communities directly from NEON and to do that we will use the R package {neonUtilities} and the function ‘loadByProduct’. Before proceed let’s take a look at the core information required in this function. To do that you just can type ?loadByProduct in your console and the documentation for this function will appear in the Help window of RStudio.

The core information we need to inform in the ‘loadByProduct’ function are:

  1. dpID = The identifier of the NEON data product to pull, in the form DPL.PRNUM.REV, e.g. DP1.10023.001.

  2. site = Either the string ‘all’, meaning all available sites, or a character vector of 4-letter NEON site codes, e.g. c(‘ONAQ’,‘RMNP’). Defaults to all.

  3. package = Either ‘basic’ or ‘expanded’, indicating which data package to download. Defaults to basic.

As you can see, the dpID and site correspond to the kind of data we want to download and the site to the location where the data were collected, respectively. We can open the NEON website to find the information required for downloading the data for plant presence and percent cover. Also, you can look at the map of NEON sites to see the distribution of NEON sites across The United States.

For this practice we will use the next information: dpID = “DP1.10058.001” and site = c(“HARV”, “CPER”,“ABBY”) that correspond to plant presence and percent cover and three sites: HARV (Harvard Forest & Quabbin Watershed NEON, MA), CPER (Central Plains Experimental Range NEON, CO) and ABBY (Abby Road NEON, WA). Note that you can download data only for one site but for the sake of getting more practice in data management in R we will download the full data of plant community data for three sites (you can download data for more sites if you prefer).

Code
# Set global option to NOT convert all character variables to factors
options(stringsAsFactors = FALSE)

NEON_data <- loadByProduct(dpID = "DP1.10058.001", 
                        site = c("HARV", "CPER", "ABBY"), 
                        package = "expanded", check.size = TRUE)

# type "y" (with no quotes) in your console to start downloading the data from NEON

Let’s inspect the downloaded data.

Code
names(NEON_data)

View(NEON_data$div_10m2Data100m2Data)

View(NEON_data$div_1m2Data)

Save the raw data in your hard drive - this is a common practice that allow reproducibility and also save you a lot of time.

Code
dir.create("Data") # You can skip this line if you already have this folder
dir.create("Data/NEON")

save(NEON_data, file = "Data/NEON/RawData_NEON_lab4.RData")

The data of abundance of plants correspond to the object “div_1m2Data”, thus let’s isolate that data from the rawdata.

Code
sel <- NEON_data$div_1m2Data %>% 
  select(namedLocation, domainID, siteID, plotType, plotID, subplotID, endDate, 
         taxonID, taxonRank, family, scientificName, nativeStatusCode, 
         percentCover, heightPlantSpecies)

unique(sel$namedLocation)
unique(sel$siteID)
unique(sel$endDate)

The isolated data is a data.frame of 88,189 rows and 14 columns. Some information is not required so let’s clean the data a little bit and select information for only one site and a single period of time.

Code
sel <- sel %>% 
  drop_na(scientificName) %>% # Removing NAs in the column of species
  mutate(Date = endDate) %>% 
  separate(endDate, sep = "-", into = c("Year", "Month", "Day"))

unique(sel$Year)
unique(sel$siteID)

Select the site HARV and the year 2018.

Code
HARV <- sel %>% 
  filter(siteID == "HARV" & Year == 2018)

unique(HARV$Year)
unique(HARV$siteID)

head(HARV)

The data corresponding to NEON site HARV for the 2018 contains 2161 rows and 17 columns. If we look at the data, specifically to the column scientificName we can see that the taxonomy used in NEON correspond to the taxonomy used by the USDA, however this taxonomy is not necessarily used by scientist (🤔😵🫣) that rely for example on the taxonomy of the APG Angiosperm Phylogeny Group for Angiosperms and WCSP World Checklist of Conifers for Gymnosperms.

Code
glimpse(HARV)

In any case, we need to standardize the species names in order to proceed with the calculation of metrics.

To do that, we will use the package {Taxonstand} that allow matching the taxonomy among different sources using the working list of all known plant species produced by the botanical community or The Plant List (TPL).

Note - this process will take some time (~60 seconds) and return several warning messages but please do not pay attention to those messages.

Code
spp <- unique(HARV$scientificName) # vector with scientific names
  
# Perform taxonomic standardization on plant names (TPL table)
spp_check <- TPL(spp, infra = TRUE, corr = TRUE) # it will return a lot of warnings, please do not pay attention to that.

head(spp_check)

Check the updated taxonomy.

Code
View(spp_check)

Select the necessary information and combine it with the HARV data.

Code
taxonomy <- spp_check %>% 
  drop_na(New.Genus, New.Species) %>% 
  select(Taxon, Family, New.Genus, New.Species, New.Taxonomic.status)

Now let’s combine the new taxonomic information with our NEON data.

Code
HARV_data <- full_join(HARV, taxonomy, by = c("scientificName" = "Taxon"))

We can select some specific columns that we will use from now on.

Code
HARV_data <- HARV_data %>% 
  mutate(sciName = paste0(New.Genus, "_", New.Species)) %>% 
  select(siteID, plotID, subplotID, New.Taxonomic.status, family, sciName, percentCover) #%>% 
  #filter(New.Taxonomic.status == "Species")

head(HARV_data)

We can save the cleaned data and remove the unnecessary information from the environment.

Code
save(HARV, HARV_data, taxonomy, file = "Data/NEON/CleanData_NEON_lab4.RData")

# remove the information that we don't need anymore (for now)
rm(HARV, NEON_data, sel, spp_check, taxonomy, installed_packages, packages, spp)

As a last step the cleaned data will be transformed from the long format to a wide format, in other words, we will transform the NEON data to a matrix in which the species are the columns and the plots (or communities) are the rows.

Code
HARV_CDM <- HARV_data %>% 
  select(plotID, percentCover, sciName) %>% 
  sample2matrix(.)

#HARV_mat <- sample2matrix(HARV_data[, c(2, 7, 6)])
nrow(HARV_CDM)
ncol(HARV_CDM)

Alright, until here we have downloaded, cleaned and prepared plant community data for the NEON site HARV. The next step is to prepare the phylogeny for those communities or a community level phylogeny. To do that we will use the most up to date phylogeny of vascular plants Constructing a broadly inclusive seed plant phylogeny and the R package {V.PhyloMaker}.

Code
sppPhylo <- HARV_data %>% 
  select(family, sciName) %>% # select columns
  distinct(family, sciName) # identify unique values

#sppPhylo <- HARV_data[, c(5, 6)] 

# Prepare the taxonomy data to extract the phylogeny
sppPhylo <- sppPhylo %>% 
  mutate(species = gsub("_" , " ", sciName)) %>% 
  separate(sciName, sep = "_", into = c("genus", "ephitet")) %>% 
  select(species, genus, family)
  
head(sppPhylo)

Prepare the phylogeny and plot it.

Code
result <- V.PhyloMaker::phylo.maker(sppPhylo, scenarios = "S3") # this will take some time.

phylo <- multi2di(result$scenario.3)

# Check if our phylogeny is ultrametric
is.ultrametric(phylo)

# Check is our phylogeny is bifurcated 
is.binary.phylo(phylo)

plot(phylo, show.tip.label = FALSE)

Before continuing with the lab let’s check again if our data (phylogeny and community) match each other. To do this we will use the amazing function match.phylo.com() from the package {picante}.

Code
matched <- picante::match.phylo.comm(phy = phylo, comm = HARV_CDM)

matched$phy

matched$comm

Alright, we have all data necessary for calculating different phylogenetic diversity metrics, hooray!

Save the phylogenetic data and clean the R environment.

Code
save(phylo, sppPhylo, result, file = "Data/NEON/Phylo_NEON_lab4.RData")

rm(sppPhylo, result)

Ok, let’s inspect the data that were stored in the object matched.

Code
matched$comm[1:10, 1:10]

Now the phylogeny

Code
plot(matched$phy, show.tip.label = FALSE, type = "fan")

Awesome, we are now ready to explore the jungle of metrics for the evaluation of taxonomic, phenotypic and phylogenetic structure of communities (Pausas & Verdú, 2010). Let’s start with metrics for the taxonomic dimension.

3 Taxonomic diversity metrics

Let’s explore the most common metrics of taxonomic diversity. For this exercise, we are going to use the R package {vegan}.

3.0.1 Species richness (S)

Simply the number of species recorded in a plot or community.

Code
HARV_tax <- data.frame(S = specnumber(matched$comm))

3.0.2 Shannon Index (H)

Metric that characterizes species diversity in a sample. Assumes that all species are represented in the sample and that individuals within species were sampled randomly.

Code
HARV_tax <- HARV_tax %>% 
  mutate(H = diversity(matched$comm, 
                       index = "shannon")
         )

3.0.3 Simpson Index (D)

Metric that characterizes species diversity in a sample. Contrary to Shannon’s H, Simpson’s D captures the variance of the species abundance distribution.

Code
HARV_tax <- HARV_tax %>% 
  mutate(D = diversity(matched$comm, 
                       index = "simpson")
         )

HARV_tax

3.0.4 Pielou’s evenness (J)

Metric that count the individuals of each species in a sample, rather than the number of species. A J value ranges from 0 (no evenness) to 1 (complete evenness).

Code
HARV_tax <- HARV_tax %>% 
  mutate(J = H/log(S)
         )
  
HARV_tax

We just embedded {vegan} functions into {dplyr} to do the taxonomic metric calculations in a tidyverse framework 😬…

4 Phylogenetic diversity metrics

4.1 Explore diversity metrics

Let’s calculate some metrics manually and then using the package {picante} we will calculate the same metrics but for all communities at once.

4.1.1 Phylogenetic diversity

Phylogenetic diversity is just the sum of the total branch lengths in the community. In this case we are calculating PD using all species in the phylogeny, in other words, assuming that a single community contain the same number of species as the phylogeny.

Code
sum(matched$phy$edge.length) #  sum of the total branch lengths in the community

We can calculate PD for individual plots or communities, for example the plot HARV_001

Code
# Select species that are only present in the plot HARV_001
HARV_001 <- matched$comm %>% 
  filter(rownames(.) == "HARV_001") %>% 
  t() %>% 
  data.frame() %>% 
  filter(HARV_001 > 0)

# Drop species in the phylogeny that are not present in the plot HARV_001
HARV_001_phy <- drop.tip(matched$phy, setdiff(matched$phy$tip.label, rownames(HARV_001)))

sum(HARV_001_phy$edge.length)

We can confirm the result by calculating PD using the package {picante}. However, instead of calculating plot by plot {picante} will calculate PD for all of the plots.

Code
HARV_PD <- pd(samp = matched$comm, 
              tree = matched$phy, 
              include.root = FALSE) # Faith's PD

head(HARV_PD)

We NOW can confirm the PD value estimated by hand is equal to the PD estimated using {picante}.

Argument include.root = TRUE

Using the argument “include.root = TRUE” in the function pd() of {picante} will return a slightly different PD value as the root of the phylogeny is considered in the calculation.

You can see both metrics SR and PD are highly correlated, that is because PD is highly sensitive to the sample size, i.e., as more species are added to the community more branch lengths are added to the phylogeny and consequently the expected relationship is high.

Code
HARV_PD %>% 
  ggplot(aes(x = SR, y = PD)) + 
  geom_point(size = 3, color = "darkgray") 

Now see correlation between the two metrics

Code
HARV_PD %$% 
  cor.test(SR, PD, use = "complete.obs")

Let’s plot the association again…

Code
# Aux function for visualization
theme_nice <- function() {
  theme_bw() + #base_family = "Noto Sans") +
    theme(panel.grid.minor = element_blank(),
          plot.background = element_rect(fill = "white", color = NA),
          #plot.title = element_text(face = "bold"),
          #strip.text = element_text(face = "bold"),
          strip.background = element_rect(fill = "grey80", color = NA),
          legend.title = element_text(face = "bold", size = 15), 
          legend.text = element_text(size = 12))
}

# plot
HARV_PD %>% 
  ggplot(aes(x = SR, y = PD)) + 
  geom_point(size = 3, color = "darkgray") + 
  labs(x = "Species richness", y = "PD (millions of years)") + 
  theme_nice()

4.1.2 Mean pairwise distance (MPD) and mean nearest-pairwise distance (MNTD)

Other common metrics are MPD (mean pairwise distance) and MNTD (mean nearest taxon distance). As in PD, let’s calculate MPD and MNTD manually.

Code
# MPD
dist.trMB <- cophenetic(HARV_001_phy)

dist.trMB <- dist.trMB[lower.tri(dist.trMB, diag = FALSE)]

mean(dist.trMB)
Code
# MNTD
dist.trMB2 <- cophenetic(HARV_001_phy)
diag(dist.trMB2) <- NA
apply(dist.trMB2, 2, min, na.rm = TRUE)

mean(apply(dist.trMB2, 2, min, na.rm = TRUE))

And now using the package picante

Code
# MPD
HARV_MPD <- mpd(matched$comm, cophenetic(matched$phy)) 
head(HARV_MPD)
Code
# MNTD
HARV_MNTD <- mntd(matched$comm, cophenetic(matched$phy)) 
head(HARV_MNTD)

Until here we just played with the data and also confirmed that we can calculate different metrics by hand. Now let’s do a more formal evaluation of the biodiversity in Harvard Forest & Quabbin Watershed NEON, MA.

5 Community phylogenetic diversity metrics

The analyses of community phylogenetic started making inferences about the mechanisms structuring the local communities through the evaluation of phylogenetic arrangements in local communities (see Cavender-Bares et al. (2009) for an initial criticism). However, new methods are now available, such that more complex balance between ecological and historical processes at local and regional scales can be incorporated into the analyses (Pigot & Etienne, 2015; Pinto-Ledezma et al., 2019; Pinto‐Ledezma et al., 2020).

Let’s calculate some of the most common metrics.

Note - we will use the object HARV_CDM to store all the results.

5.0.1 Phylogenetic diversity in a community - PD

PD or phylogenetic diversity is the sum of the total phylogenetic branch length for one or multiple samples.

Code
HARV_CDM <- ses.pd(samp = matched$comm, 
                   tree = matched$phy, 
                   runs = 999) # this will take some time

HARV_CDM <- HARV_CDM %>% 
  select(ntaxa, pd.obs, pd.obs.z)

#HARV_CDM <- HARV_CDM[, c(1, 2, 6, 7)]

head(HARV_CDM)

5.0.2 Phylogenetic Rao’s quadratic entropy - RaoD

Rao’s quadratic entropy (Rao 1982) is a measure of diversity in ecological communities that can optionally take species differences (e.g. phylogenetic dissimilarity) into account.

Code
HARV_CDM <- HARV_CDM %>% 
  mutate(RaoD = raoD(matched$comm, force.ultrametric(matched$phy))$Dkk)

head(HARV_CDM)

5.0.3 Mean pairwise distance separating taxa in a community - MPD

Code
# SES-MPD
HARVsesmpd <- ses.mpd(samp = matched$comm, 
                      dis = cophenetic(matched$phy), 
                      null.model = "taxa.labels", 
                      abundance.weighted = TRUE, 
                      runs = 999)

HARVsesmpd <- HARVsesmpd %>% 
  select(mpd.obs, mpd.obs.z)

HARV_CDM <- bind_cols(HARV_CDM, HARVsesmpd)

5.0.4 Mean nearest taxon distance for taxa in a community - MNTD

Code
# SES-MNTD
HARVsesmntd <- ses.mntd(samp = matched$comm, 
                        dis = cophenetic(matched$phy), 
                        null.model = "taxa.labels", 
                        abundance.weighted = TRUE,
                        runs = 999)

HARVsesmntd <- HARVsesmntd %>% 
  select(mntd.obs, mntd.obs.z)

HARV_CDM <- bind_cols(HARV_CDM, HARVsesmntd)

5.0.5 Phylogenetic species variability - PSV

Phylogenetic species variability quantifies how phylogenetic relatedness decreases the variance of a hypothetical unselected/neutral trait shared by all species in a community.

Code
# PSV or phylogenetic species variability
HARV_CDM <- HARV_CDM %>% 
  mutate(PSV = psv(samp = matched$comm, 
               tree = matched$phy, 
               compute.var = TRUE)$PSVs)

HARV_CDM

5.0.6 Phylogenetic species richness - PSR

Phylogenetic species richness is the number of species in a sample multiplied by PSV.

Code
# PSR or phylogenetic species richness
HARV_CDM <- HARV_CDM %>% 
  mutate(PSR = psr(samp = matched$comm, 
               tree = matched$phy, 
               compute.var = TRUE)$PSR)

HARV_CDM

5.0.7 Phylogenetic species evenness - PSE

Phylogenetic species evenness is the metric PSV modified to incorporate relative species abundances.

Code
# PSE or phylogenetic species evenness
HARV_CDM <- HARV_CDM %>% 
  mutate(PSE = pse(samp = matched$comm, 
               tree = matched$phy)$PSEs)

HARV_CDM

5.0.8 qDp

qD(p) is a metric that measure the variation in species’ divergences within communities. This metric is a modification of the Hill index, weighting a species’ proportional abundance by its relative share of phylogenetic information.

Code
# Scheiner 2012 qD(p)
source("https://raw.githubusercontent.com/jesusNPL/BiodiversityScience/master/Spring2021/R-Functions/qDp.R")

HARV_CDM <- HARV_CDM %>% 
  mutate(qDP = qDp(matched$phy, matched$comm, q = 2))

HARV_CDM

We have calculated several metrics that describe the phylogenetic structure of communities at Harvard Forest & Quabbin Watershed NEON (HARV) site.

6 Compare the metrics

Code
scatterplotMatrix(HARV_CDM)

Hmmm, Kinda ugly…

Let’s use the package {corrplot} to get a more informative figure.

Code
cor_mat <- cor(HARV_CDM, method = 'spearman')

corrplot.mixed(cor_mat, 
               tl.pos = "lt", 
               tl.cex = 0.7, 
               number.cex = 0.7, 
               addCoefasPercent = TRUE, 
               mar = c(0, 0, 1, 0))

Explore the correlation among metrics.

Code
HARV_CDM %>% 
  cor.table()

You can also plot the relationship between metrics.

Code
HARV_CDM %>% 
  ggplot(aes(x = RaoD, y = PSE)) + 
  geom_point(size = 3, color = "darkgray") + 
  labs(x = "RaoD", y = "PSE") + 
  theme_nice()

RaoD and PSE are highly correlated; explain why

Code
HARV_mds <- HARV_CDM %>% 
  select(-c(pd.obs.z, mpd.obs.z, mntd.obs.z)) %>% # MNDS does not like negative values
  na.omit() %>% 
  metaMDS(.)


ordiplot(HARV_mds, type = "t", display = "species")

Let’s use {ggplot2} to make a better figure.

Code
## Metrics scores
HARV_mds_scores <- data.frame(HARV_mds$species) %>% 
  mutate(Metric = rownames(.))

HARV_mds_scores %>% 
  ggplot(aes(x = MDS1, y = MDS2, color = Metric)) + 
  geom_point(size = 3, alpha = 0.5) + 
  theme_nice()

What do you think?

Which metric would you use for your paper?

7 The challenge

The challenge for this assignment is:

  • Option 1

Repeat all process but select two NEON sites across the United States, compare the results of both sites and discuss the difference in the results if there any. For example, you can use the other two sites we have downloaded at the beginning of the lab, i.e., “CPER” and “ABBY”.

  • Option 2

Another option is to calculate the phylogenetic diversity metrics for each year on a NEON site. For example, you can use the data from HARV but instead of doing the analysis only for the 2018 (as we did here) you can repeat the process for all years from 2013 to 2022 and discuss if the taxonomic (species richness) and phylogenetic diversities changed.

The end! for now…

References

Cavender-Bares, J., Kozak, K. H., Fine, P. V. A., & Kembel, S. W. (2009). The merging of community ecology and phylogenetic biology. Ecology Letters, 12(7), 693–715. https://doi.org/10.1111/j.1461-0248.2009.01314.x
Pausas, J. G., & Verdú, M. (2010). The jungle of methods for evaluating phenotypic and phylogenetic structure of communities. BioScience, 60(8), 614–625. https://doi.org/10.1525/bio.2010.60.8.7
Pigot, A. L., & Etienne, R. S. (2015). A new dynamic null model for phylogenetic community structure. Ecology Letters, 18(2), 153–163. https://doi.org/10.1111/ele.12395
Pinto-Ledezma, J. N., Jahn, A. E., Cueto, V. R., Diniz-Filho, J. A. F., & Villalobos, F. (2019). Drivers of phylogenetic assemblage structure of the furnariides, a widespread clade of lowland neotropical birds. The American Naturalist, 193(2), E41–E56. https://doi.org/10.1086/700696
Pinto‐Ledezma, J. N., Villalobos, F., Reich, P. B., Catford, J. A., Larkin, D. J., & Cavender‐Bares, J. (2020). Testing darwin’s naturalization conundrum based on taxonomic, phylogenetic, and functional dimensions of vascular plants. Ecological Monographs, 90(4). https://doi.org/10.1002/ecm.1420
Webb, C. O., Ackerly, D. D., McPeek, M. A., & Donoghue, M. J. (2002). Phylogenies and community ecology. Annual Review of Ecology and Systematics, 33(1), 475–505. https://doi.org/10.1146/annurev.ecolsys.33.010802.150448