The main goal of this tutorial is to present the basics of R so that anyone can get past the initial fear and start using it for data analysis. Learning is more effective when theory is combined with practice, so we strongly recommend that you run the commands on your computer as you follow this short tutorial, rather than just reading them passively.
1 Why R?
R is a language and an environment for statistical computing and graphics. It is often described as an object-oriented language, which means that using R involves creating and manipulating objects, and the user has to say exactly what they want to do rather than simply press a button (the “black box” problem). So, the main advantage of R is that the user has control over what is happening and needs a full understanding of what they want before performing any analysis.
With R, it is possible to manipulate and analyze data, make graphics, and write anything from small commands to entire programs. Basically, R is an open-source implementation of the S language, which was developed at Bell Labs in the 1970s. The S language has been popular in many areas of science and is the base of the commercial product S-PLUS. Thus, compared with commercial statistical software such as SPSS, Stata, and SAS, another advantage of R is that it is open source and free!
There are many web pages with information about R. Most of them are super useful, for example DataCamp, CRAN, and R Tutorial.
Also, when we present our results in a report, scientific paper, or any other kind of document, we need to cite the software we used. The easiest way to cite R is with the built-in function citation().
Code
citation()
To cite R in publications use:
R Core Team (2022). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
URL https://www.R-project.org/.
A BibTeX entry for LaTeX users is
@Manual{,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2022},
url = {https://www.R-project.org/},
}
We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also 'citation("pkgname")' for
citing R packages.
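The citation can also be stored as an object and reused. The following is a minimal sketch, assuming a standard R installation; the object names `r_cit` and `stats_cit` are ours and only serve as an illustration.

```r
# Store the citation object instead of only printing it
r_cit <- citation()
class(r_cit)

# Print only the BibTeX entry, ready to paste into a .bib file
print(toBibtex(r_cit))

# Packages have their own citation; "stats" ships with every R installation
stats_cit <- citation("stats")
stats_cit
```

Replace "stats" with the name of any installed package that you used in your analysis.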
2 First steps
First of all, we need to know WHERE we are working, in other words, our working directory. To get that information we just need to type getwd() in the script or the console.
Code
getwd()
[1] "/Users/jesusnpl/Dropbox/My Mac (Jesús’s MacBook Pro)/Documents/GitHub/BiodiversityScience/Spring2023/Lab_0"
If the working directory is not the correct one, we just need to tell R to SET the correct path.
Code
setwd("Your path or directory")
There is an R package called {here} that is super convenient when the path is really long: it builds file paths relative to the root of your project, so you rarely need setwd() at all. You didn’t hear that from me ;p!
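A small sketch of how getwd() and setwd() work together may help here. It assumes nothing beyond base R; `tempdir()` is used only as a stand-in for a folder that is guaranteed to exist, and the object names are ours.

```r
# setwd() invisibly returns the previous working directory,
# so we can store it and go back later
old_wd <- setwd(tempdir())   # tempdir() is a scratch folder that always exists
getwd()

# file.path() builds a path from its parts with the correct separator
example_path <- file.path("Data", "lab_0", "Sample.txt")
example_path

# Return to where we started
setwd(old_wd)
getwd()
```

Storing the old directory before changing it is a simple way to avoid getting lost between folders.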
Ok, we are now in the correct place, so we can continue with the practice.
2.1 Directory structure
For training purposes, we will create a directory structure where the main folder is our current working directory, and inside it a series of subfolders where we store the data, the scripts, and whatever else we want. To do that we will use the function dir.create(). Let’s practice!
In every class you will need to check your working directory in order to follow the labs without issues.
Code
?dir.create
Code
dir.create("BioSci") # this can be your main folder and you can change the name
dir.create("Data") # folder that stores the data
dir.create("R-scripts") # folder that stores the scripts used in the course
dir.create("Figures") # folder that stores the figures created in the course
dir.create("Results") # The results
dir.create("Temp")
To check that the subfolders were created within the main folder, just use the function dir(). This simple function prints in the console the names of the files and folders that are currently in your working directory.
We can SET our working directory to one of the subfolders that we just created using the function setwd().
Code
setwd("Results")
However, for practicality it is super-ultra-mega recommended to work in the MAIN FOLDER. To go back to the previous (main) folder, use setwd() again, but instead of a folder name use simply two dots, yes, two dots: “..”. This simple operation returns to the main folder.
Code
setwd("..")
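The whole sequence of this section can be tried safely in a temporary folder. This is a minimal sketch, not the course setup: the main folder name `BioSci_demo` is hypothetical and `tempdir()` is used so that nothing is created among your real files.

```r
# Build the folder structure inside a temporary main folder
main_dir <- file.path(tempdir(), "BioSci_demo")  # hypothetical folder name
dir.create(main_dir, showWarnings = FALSE)

# showWarnings = FALSE keeps R quiet if a folder already exists
dir.create(file.path(main_dir, "Data"), showWarnings = FALSE)
dir.create(file.path(main_dir, "R-scripts"), showWarnings = FALSE)
dir.create(file.path(main_dir, "Figures"), showWarnings = FALSE)
dir.create(file.path(main_dir, "Results"), showWarnings = FALSE)
dir.create(file.path(main_dir, "Temp"), showWarnings = FALSE)

# List what is inside the main folder
dir(main_dir)

# Move into a subfolder and come back with ".."
old_wd <- setwd(file.path(main_dir, "Results"))
setwd("..")
basename(getwd())  # name of the folder we are in now
setwd(old_wd)      # restore the original working directory
```

Using file.path(main_dir, ...) guarantees that the subfolders are created inside the main folder, whatever the current working directory is.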
2.2 The importance of the question mark “?” or the help function
Maybe the most important function in R (at least for Jesús) is help() or ?. Using help() or the question mark, we can ask R about almost anything (sadly we can’t order pizza, yet)… so, let’s practice!
Code
help("log")
Code
?log
Code
??log
Other important and useful functions in R are: head(), tail(), dim(), str(), summary(), names(), class(), rm(), save.image(), saveRDS() and readRDS(), load(), and source(). These simple functions will help us to inspect, manage, save, and reload our data and scripts.
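To see several of these functions in action, here is a minimal sketch. The data frame `plots` and its values are made up for illustration, and the file is written to a temporary location.

```r
# A small data frame to explore (made-up values)
plots <- data.frame(
  plot_id  = 1:8,
  habitat  = c("forest", "savanna", "forest", "wetland",
               "savanna", "forest", "wetland", "forest"),
  richness = c(12, 7, 15, 9, 6, 14, 10, 11)
)

head(plots, 3)   # first three rows
tail(plots, 3)   # last three rows
dim(plots)       # number of rows and columns
names(plots)     # column names
class(plots)     # type of object
str(plots)       # compact description of the structure
summary(plots)   # summary statistics for each column

# Save a single object to disk and read it back
rds_file <- tempfile(fileext = ".rds")
saveRDS(plots, rds_file)
plots_copy <- readRDS(rds_file)
identical(plots, plots_copy)

# Remove an object we no longer need
rm(plots_copy)
```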
3 Objects: creation and manipulation
In R you can create and manipulate different kinds of data, from a simple numeric vector to complex spatial and/or phylogenetic data frames. The six main kinds of objects that you can create and manipulate in R are: vector, factor, matrix, data frame, list, and function.
So, let’s start with the first object, the Vector.
3.1 Vector
Vectors are the basic objects in R and contain elements of the same type (e.g., numbers, characters). There are three main types of vector: numeric, character, and logical.
3.1.0.1 Numeric vector
IMPORTANT: R is case sensitive, so you need to pay attention when you name objects.
Code
a <- 10 # numeric value
b <- c(1, 2, 3, 4, 5) # numeric vector
class(b) # ask R which type of object b is
[1] "numeric"
Code
seq_test <- seq(from = 1, to = 20, by = 2) # Here is a sequence of numbers from 1 to 20, every two numbers
x <- seq(10, 30) # This is a sequence from 10 to 30. What is the difference with the previous numeric vector?
sample(seq_test, 2, replace = TRUE) # Draw two numbers at random from the object seq_test
[1] 13 11
Code
rep_test <- rep(1:2, c(10, 3)) # Repeat the number one, ten times and the number 2 three times
ex <- c(1:10) # Create a sequence of 1 to 10
length(ex) # Length of the object ex
[1] 10
Code
aa <- length(ex) # What are we doing here?
str(seq_test) # Look at the structure of the data
num [1:10] 1 3 5 7 9 11 13 15 17 19
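The functions seq(), rep(), and sample() accept a few more arguments that are worth trying. This is a small sketch with object names of our own choosing; the seed value 123 is arbitrary.

```r
# by = fixes the step; length.out = fixes how many values we get
by_two   <- seq(from = 1, to = 20, by = 2)
five_pts <- seq(from = 1, to = 20, length.out = 5)
by_two
five_pts

# times = repeats the whole vector; each = repeats every element in turn
rep(1:2, times = 3)
rep(1:2, each = 3)

# sample() draws at random, so set a seed to get the same draw every time
set.seed(123)
draw1 <- sample(by_two, 2, replace = TRUE)
set.seed(123)
draw2 <- sample(by_two, 2, replace = TRUE)
identical(draw1, draw2)
```

Without set.seed(), every call to sample() can return different numbers, which is why your result may differ from the one printed in this tutorial.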
3.1.0.2 Character vector
We can also create a vector of characters, which means that instead of storing numbers we store text.
Code
research_groups <- c(Jeannine = "Plants", Jesus = "Birds and plants", Laura = "Plants")
research_groups
Jeannine Jesus Laura
"Plants" "Birds and plants" "Plants"
Explore the character vector using the function str()
Code
str(research_groups)
Named chr [1:3] "Plants" "Birds and plants" "Plants"
- attr(*, "names")= chr [1:3] "Jeannine" "Jesus" "Laura"
You can try to create a different character vector, for example, using the names of your peers.
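Once a character vector has names, its elements can be selected by name as well as by position. A minimal sketch using the same vector as above:

```r
research_groups <- c(Jeannine = "Plants", Jesus = "Birds and plants", Laura = "Plants")

names(research_groups)      # the names attached to each element
research_groups["Jesus"]    # select an element by name
research_groups[2]          # or by position
nchar(research_groups)      # number of characters in each element
toupper(research_groups)    # convert to upper case

# Who works on plants only?
plants_only <- names(research_groups)[research_groups == "Plants"]
plants_only
```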
3.1.0.3 Logical vector
This kind of vector is super useful when the purpose is to build functions. The elements of a logical vector are TRUE, FALSE, and NA (not available).
Code
is.factor(ex) # Is it a factor? (FALSE)
[1] FALSE
Code
is.matrix(ex) # Is it a matrix? (FALSE)
[1] FALSE
Code
is.vector(ex) # Is it a vector? (TRUE)
[1] TRUE
Code
a < 1 # is 'a' lower than 1? (FALSE)
[1] FALSE
Code
a == 1 # is 'a' equal to 1? (FALSE)
[1] FALSE
Code
a >= 1 # is 'a' higher than or equal to 1? (TRUE)
[1] TRUE
Code
a != 2 # is 'a' different from 2? (TRUE) (!= means "not equal to")
[1] TRUE
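Comparisons also work element by element on a whole vector, and the resulting logical vector can be used to select values. A minimal sketch; the objects `big` and `with_na` are ours.

```r
ex <- 1:10

big <- ex > 6        # one TRUE or FALSE per element
big
ex[big]              # keep only the elements where the test is TRUE
sum(big)             # TRUE counts as 1, so this counts the matches
any(ex > 10)         # is at least one element greater than 10?
all(ex >= 1)         # are all elements at least 1?

# NA means "not available"; comparisons with NA return NA
with_na <- c(4, NA, 9)
with_na > 5
is.na(with_na)                 # test for missing values explicitly
mean(with_na, na.rm = TRUE)    # many functions can skip NA on request
```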
3.2 Factor
A factor is useful for creating categorical variables, which are very common in statistical analyses such as ANOVA.
Code
data <- factor(c("small", "medium", "large"))
Code
is.factor(data) # Check if the object is correct.
[1] TRUE
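By default R sorts the levels of a factor alphabetically, which is rarely the order we want for sizes. The sketch below, with made-up values, shows how to set the order of the levels explicitly.

```r
sizes <- c("small", "large", "medium", "small", "large")

# Default: levels are sorted alphabetically
f_default <- factor(sizes)
levels(f_default)

# Set the order of the levels ourselves
f_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_ordered)

table(f_ordered)        # counts per category, in level order
as.integer(f_ordered)   # the integer codes stored behind the labels
```

The order of the levels controls, for example, the order of the categories in tables and plots.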
3.3 Matrix
A matrix is a two-dimensional arrangement of vectors, where all the vectors need to be of the same type, that is, two or more numeric vectors, or two or more character vectors.
Code
matx <- matrix(1:45, nrow = 15)
rownames(matx) <- LETTERS[1:15] # names of the rows
colnames(matx) <- c("Sample01", "Sample02", "Sample03") # names of the columns or headers
Code
matx # Inspect the matrix
Sample01 Sample02 Sample03
A 1 16 31
B 2 17 32
C 3 18 33
D 4 19 34
E 5 20 35
F 6 21 36
G 7 22 37
H 8 23 38
I 9 24 39
J 10 25 40
K 11 26 41
L 12 27 42
M 13 28 43
N 14 29 44
O 15 30 45
Code
class(matx) # Ask which kind of object it is
[1] "matrix" "array"
Code
matx[, 1] # We can use brackets to select a specific column
A B C D E F G H I J K L M N O
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Code
matx[1, ] # We can use brackets to select a specific row
Sample01 Sample02 Sample03
1 16 31
Code
head(matx)
Sample01 Sample02 Sample03
A 1 16 31
B 2 17 32
C 3 18 33
D 4 19 34
E 5 20 35
F 6 21 36
Code
tail(matx)
Sample01 Sample02 Sample03
J 10 25 40
K 11 26 41
L 12 27 42
M 13 28 43
N 14 29 44
O 15 30 45
Code
summary(matx) # summary statistics of the data in the matrix
Sample01 Sample02 Sample03
Min. : 1.0 Min. :16.0 Min. :31.0
1st Qu.: 4.5 1st Qu.:19.5 1st Qu.:34.5
Median : 8.0 Median :23.0 Median :38.0
Mean : 8.0 Mean :23.0 Mean :38.0
3rd Qu.:11.5 3rd Qu.:26.5 3rd Qu.:41.5
Max. :15.0 Max. :30.0 Max. :45.0
In general, when we explore our data with head(), the function returns only the first 6 rows of our matrix. However, we can add another argument to the function. For example, in head(matx, 10) we just add the number 10 after the comma and we can see the first 10 rows. This simple operation is especially useful when our matrix is large (>100 rows).
Function tail
You can use the function tail() to check the last rows of your data.
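The brackets accept row and column names as well as positions. The following sketch rebuilds the same matrix so that it can be run on its own.

```r
matx <- matrix(1:45, nrow = 15)
rownames(matx) <- LETTERS[1:15]
colnames(matx) <- c("Sample01", "Sample02", "Sample03")

dim(matx)                  # rows, then columns
matx["C", "Sample02"]      # one cell, selected by row and column name
matx[1:3, c(1, 3)]         # rows 1 to 3 of the first and third columns
head(matx, 10)             # first 10 rows instead of the default 6
colMeans(matx)             # mean of each column
t(matx)[, 1:4]             # transpose: samples become rows
```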
3.4 Data frame
The difference between a matrix and a data frame is that a data frame can hold vectors of different types. You can learn more about data frames by asking R ?data.frame. Let’s create a data frame and explore its properties.
Sample01 Sample02 Sample03
A 1 16 31
B 2 17 32
C 3 18 33
D 4 19 34
E 5 20 35
F 6 21 36
G 7 22 37
H 8 23 38
I 9 24 39
J 10 25 40
K 11 26 41
L 12 27 42
M 13 28 43
N 14 29 44
O 15 30 45
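The code that produced the table above is not shown. The following is a minimal sketch, assuming the data frame was obtained by converting the matrix matx; the extra columns `Site` and `Sampled` are hypothetical and only illustrate that a data frame can mix types.

```r
matx <- matrix(1:45, nrow = 15,
               dimnames = list(LETTERS[1:15],
                               c("Sample01", "Sample02", "Sample03")))

df <- as.data.frame(matx)   # assumption: one way to get a table like the one above
class(df)

# Unlike a matrix, a data frame can hold columns of different types
df$Site <- rep(c("north", "south", "east"), each = 5)  # hypothetical character column
df$Sampled <- df$Sample01 > 5                          # logical column
str(df)

df$Sample02            # select a column with $
df[df$Sampled, ]       # keep only the rows where Sampled is TRUE
```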
At this point, we have explored the most common objects in R. Understanding the structure of each class of objects (from vectors to lists) is maybe the most critical step in learning R.
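Lists and functions were named among the six main kinds of objects, so a short illustration may be useful. This is only a sketch: the list `lab0` and the function `range_width` are made up for the example.

```r
# A list can hold objects of different types and sizes
lab0 <- list(
  numbers = 1:5,
  groups  = c("Plants", "Birds"),
  mat     = matrix(1:4, nrow = 2)
)
names(lab0)
lab0$numbers        # select an element by name
lab0[["groups"]]    # the same idea with double brackets
length(lab0)        # number of elements in the list

# A function is also an object; this one computes the range width of a vector
range_width <- function(x) {
  max(x) - min(x)
}
range_width(lab0$numbers)
class(range_width)
```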
4 Install and load packages
Although R comes with many built-in functions, it is also possible to use auxiliary packages that are free to download and install on our computers. Installing new packages in R is easy: it just needs a simple function, install.packages(), and, of course, an Internet connection. For more information on how to install new packages, just ask R using ?install.packages.
Code
install.packages("PACKAGE NAME")
The reverse function is remove.packages().
Most of the time, we do not remember whether a package is already installed on our computer. If we are tired and do not want to go to our R package folder to check, we can use the following command.
Code
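# installed.packages() returns a matrix with one row per installed package;
# the row names are the package names
if (!("PACKAGE NAME" %in% rownames(installed.packages()))) {
  install.packages("PACKAGE NAME", dependencies = TRUE)
}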
To load an installed package you can just type library() or require().
Code
library("PACKAGE NAME")
require("PACKAGE NAME")
Sometimes we need to install a lot of packages, and installing them one by one requires time and patience, which, most of the time, we don’t have Lol. To solve that issue, we can create a vector with the names of the packages and write a simple function that helps us to install them with just one click!
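One possible version of such a function is sketched below. The function name `install_missing` is hypothetical, and the example call uses packages that ship with R, so nothing is downloaded when you run it.

```r
# Install every package in `pkgs` that is not installed yet
install_missing <- function(pkgs) {
  installed  <- rownames(installed.packages())
  to_install <- setdiff(pkgs, installed)
  if (length(to_install) > 0) {
    install.packages(to_install, dependencies = TRUE)
  }
  invisible(to_install)   # return the names that had to be installed
}

# Base packages are always installed, so this call downloads nothing
needed <- install_missing(c("stats", "utils"))
length(needed)
```

For real use, replace the vector with the packages you need, for example c("ape", "here"); this requires an Internet connection.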
As indicated before, with R you can handle different kinds of information (from vectors to data frames), and most of our data is usually stored in an Excel spreadsheet or in files with the extension .csv (comma-separated values) or .txt (a plain text file that contains unformatted text).
Most of these files are imported into R as data frames but, as we have been practicing, we now have the tools to handle the information or transform it into different objects.
The functions to import data into R are simply read.table() and read.csv(). Using these simple functions, you can import the data and transform it into other kinds of objects. So, let’s practice!
Code
dat <- read.table("Data/lab_0/Sample.txt")
dat2 <- read.table("Data/lab_0/Sample.txt", row.names = 1, header = TRUE)
dat3 <- read.csv("Data/lab_0/Sample.csv")
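The effect of the arguments header and row.names is easier to see with a tiny file. This is a self-contained sketch: the file content is made up and written to a temporary location, so it does not depend on the course data.

```r
# Write a tiny tab-delimited file to a temporary location (made-up values)
txt_file <- tempfile(fileext = ".txt")
writeLines(c("Species\tSample01\tSample02",
             "sp_a\t1\t16",
             "sp_b\t2\t17",
             "sp_c\t3\t18"),
           txt_file)

# Without header = TRUE the first line is read as data
raw <- read.table(txt_file, sep = "\t")
dim(raw)

# header = TRUE uses the first line as column names;
# row.names = 1 uses the first column as row names
tab <- read.table(txt_file, sep = "\t", header = TRUE, row.names = 1)
dim(tab)
rownames(tab)
str(tab)
```

Comparing dim(raw) with dim(tab) shows that the header line and the species column stopped being counted as data.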
You can also import your data using the same functions without specifying the path, by choosing the file interactively. Notice that we do not recommend this procedure, as you can’t control the directory structure, but it is useful when you are just exploring data.
Code
dat5 <- na.omit(read.csv(file.choose())) # file.choose() opens a dialog to pick the file; na.omit() drops rows with missing values
You can also save your data from R using the function write.table() or write.csv(). Let’s save the dat3 object. Notice that we always need to specify the correct path; in our case we will save the data in the subfolder Data.
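The code for this step is not shown, so the following is only a sketch of how it could look. The data frame is a made-up stand-in for dat3, the file name `Sample_saved.csv` is hypothetical, and a temporary "Data" folder is used so that your real files are not touched.

```r
# Stand-in for the imported data (made-up values)
dat3 <- data.frame(Species  = c("sp_a", "sp_b", "sp_c"),
                   Sample01 = c(1, 2, 3),
                   Sample02 = c(16, 17, 18))

# A temporary "Data" folder so the example does not touch your real files
data_dir <- file.path(tempdir(), "Data")
dir.create(data_dir, showWarnings = FALSE)
csv_file <- file.path(data_dir, "Sample_saved.csv")  # hypothetical file name

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(dat3, file = csv_file, row.names = FALSE)

# Read the file back and compare with the original
dat3_back <- read.csv(csv_file)
dim(dat3_back)
all.equal(dat3, dat3_back)
```

Reading the file back right after writing it is a quick way to confirm that it was saved where you expected.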
To study biodiversity it is important to first understand the data we are using, and one common kind of data used now is phylogenetic data, or phylogenetic trees, which describe the evolutionary relationships between and among lineages. From here until the end of this short tutorial we will try to explain the basics of how to import, export, and handle phylogenetic information. You can find extra information in the second chapter of the MPCM Book.
7.1 Formats
The two most common formats in which phylogenies are stored are Newick and Nexus (Maddison et al., 1997).
Code
"((A:10,B:9)D:5,C:15)F;"
[1] "((A:10,B:9)D:5,C:15)F;"
Using this notation, the parentheses link the lineages to a specific node of the tree, and the comma “,” separates the lineages that descend from that node. The colon “:” is placed after the name of a tip or node, and the numeric value that follows it represents the branch length. Finally, the semicolon “;” indicates the end of the phylogenetic tree.
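To make the notation concrete, the string can be taken apart with base R alone. This is a rough sketch for this small example only, not a general Newick parser; the regular expression assumes simple labels made of letters, digits, and underscores.

```r
newick <- "((A:10,B:9)D:5,C:15)F;"

# Every "label:number" pair is one branch with its length
pairs <- regmatches(newick, gregexpr("[A-Za-z0-9_]+:[0-9.]+", newick))[[1]]
parts <- strsplit(pairs, ":", fixed = TRUE)

branches <- data.frame(
  label  = vapply(parts, function(p) p[1], character(1)),
  length = as.numeric(vapply(parts, function(p) p[2], character(1)))
)
branches

# Quick sanity checks on the string itself
n_open  <- lengths(regmatches(newick, gregexpr("(", newick, fixed = TRUE)))
n_close <- lengths(regmatches(newick, gregexpr(")", newick, fixed = TRUE)))
n_open == n_close           # parentheses must be balanced
endsWith(newick, ";")       # the tree must end with a semicolon
```

The root label F does not appear in the table because it has no colon and therefore no branch length.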
Now we can see how this format works, but first, let’s check that we have the R package for this purpose. Here we will use the R package Analyses of Phylogenetics and Evolution, AKA ape.
Code
# Install ape only if it is not installed yet
if (!("ape" %in% rownames(installed.packages()))) {
  install.packages("ape", dependencies = TRUE)
}
Code
require(ape)
Loading required package: ape
Now we can read the phylogenetic tree we just created above in Newick format.
Code
## Here we will create a phylogenetic tree in Newick format
newick_tree <- "((A:10,B:9)D:5,C:15)F;"
## Read the tree
newick_tree <- read.tree(text = newick_tree)
And now we can plot the phylogenetic tree.
Code
plot(newick_tree, show.node.label =TRUE)
The other format is Nexus and, after some time using it, we can say that the Nexus format is more flexible to work with. An example of the Nexus format is as follows:
Code
"#NEXUS
BEGIN TAXA;
DIMENSIONS NTAXA=3;
TaxLabels A B C;
END;
BEGIN TREES;
TREE=((A:10,B:9)D:5,C:15)F;
END;"
[1] "#NEXUS\nBEGIN TAXA;\nDIMENSIONS NTAXA=3;\nTaxLabels A B C;\nEND;\nBEGIN TREES;\nTREE=((A:10,B:9)D:5,C:15)F;\nEND;"
We can create and save a Nexus file from scratch using the following code.
Code
## First create a Nexus file in the working directory
cat(
"#NEXUS
BEGIN TAXA;
DIMENSIONS NTAXA=3;
TaxLabels A B C;
END;
BEGIN TREES;
TREE=((A:10,B:9)D:5,C:15)F;
END;",
file = "../Data/Lab_0/Nexus_tree.nex")
Now, using the function read.nexus() we can read the nexus file.
Code
## Now read the phylogenetic tree, but notice that instead of using read.tree we are using read.nexus
nexus_tree <- read.nexus("../Data/Lab_0/Nexus_tree.nex")
And also plot the imported nexus file.
Code
## let's plot the example
plot(nexus_tree, show.node.label = TRUE)
Now, let’s inspect our phylogenetic trees.
Code
str(nexus_tree)
List of 5
$ edge : int [1:4, 1:2] 4 5 5 4 5 1 2 3
$ edge.length: num [1:4] 5 10 9 15
$ Nnode : int 2
$ node.label : chr [1:2] "F" "D"
$ tip.label : chr [1:3] "A" "B" "C"
- attr(*, "class")= chr "phylo"
- attr(*, "order")= chr "cladewise"
Code
nexus_tree$tip.label
[1] "A" "B" "C"
If we want to know the branch lengths of the tree we just need to select edge.length.
Code
nexus_tree$edge.length
[1] 5 10 9 15
An important component of a phylo object is the matrix called edge. In this matrix, each row represents a branch in the tree: the first column shows the index of the ancestral node of the branch and the second column shows the index of the descendant node of that branch. Let’s inspect!
Code
nexus_tree$edge
[,1] [,2]
[1,] 4 5
[2,] 5 1
[3,] 5 2
[4,] 4 3
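The edge matrix can be read with a few lines of base R. The sketch below types in the matrix, branch lengths, and tip labels printed in this section by hand, so it runs without ape; the column names `ancestor` and `descendant` are ours.

```r
# The edge matrix printed above, typed in by hand
edge <- matrix(c(4, 5,
                 5, 1,
                 5, 2,
                 4, 3),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("ancestor", "descendant")))
edge_length <- c(5, 10, 9, 15)   # same order as the rows of edge
tip_label   <- c("A", "B", "C")

# The root is the only node that is never a descendant
root <- setdiff(edge[, "ancestor"], edge[, "descendant"])
root

# Tips are nodes that are never an ancestor; here they are numbered 1 to 3
tips <- sort(setdiff(edge[, "descendant"], edge[, "ancestor"]))

# Descendants of node 5, with their labels
kids <- edge[edge[, "ancestor"] == 5, "descendant"]
tip_label[kids]

# Length of the branch that leads to each tip
tip_rows <- match(tips, edge[, "descendant"])
setNames(edge_length[tip_rows], tip_label[tips])
```

In this example node 4 is the root (labelled F), node 5 is the internal node D, and nodes 1 to 3 are the tips A, B, and C.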
We know it is a little hard to follow, even with small trees like this example, but if we plot the phylogenetic tree, the information within it is easier to understand.
Code
# Let's plot the tree
plot(nexus_tree, show.tip.label = FALSE)
# Add the internal nodes
nodelabels()
# Add the tips or lineages
tiplabels()
Finally, phylogenies can also be imported as a list. In phylogenetic comparative methods this list of phylogenies is called a multiPhylo, and we can import and export multiPhylo objects in the two formats.
Code
# Simulate 10 phylogenies, each one with 5 species
multitree <- replicate(10, rcoal(5), simplify = FALSE)
# Store the list of trees as a multiPhylo object
class(multitree) <- "multiPhylo"
Code
# Plot a single tree from the 10
plot(multitree[[10]])
Code
# Exporting the phylogenies as a single Newick file.
write.tree(phy = multitree, file = "../Data/Lab_0/multitree_example_newick.txt")
multitree_example_newick <- read.tree("../Data/Lab_0/multitree_example_newick.txt")
multitree_example_newick
10 phylogenetic trees
Code
# Exporting the phylogenies as a single Nexus file.
write.nexus(phy = multitree, file = "Data/Lab_0/multitree_example_nexus.nex")
multitree_example_nexus <- read.nexus("Data/Lab_0/multitree_example_nexus.nex")
multitree_example_nexus
10 phylogenetic trees
The :: operator
If you know exactly which package contains the function you want to use you can reference it directly using the :: operator. Simply place the {package name} before the operator and the name of the function after the operator to retrieve it.
In simple words, if you just want to use a specific function of an R package without loading the entire package with library(), the :: operator can do it for you.
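A short sketch of the :: operator in use. It relies only on the stats and utils packages, which come with every R installation; the vector `x` is made up.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Call a function by its full name: package::function
stats::sd(x)
stats::median(x)
utils::head(letters, 3)

# The package::function form points to the same function
# that we get by typing only its name after the package is loaded
identical(stats::sd, sd)
```

The same pattern works for any installed package, for example ape::read.tree() once ape is installed.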
In programming, one of the most important tools is the loop, AKA for. Basically, a loop runs a previously defined statement for n number of steps.
The basic syntax structure of a loop is:
Code
for (variable in vector) {
  execute defined statements
}
When we write a piece of code it is common to use the loop variable i to keep track of the steps. Why not another letter? Well, i is the first letter of the word iteration, duh! Anyway, you can use any letter or word as a loop variable.
So, let’s take a look.
Code
for (i in 1:10) {
  cat(i, sep = '')
}
12345678910
Notice that the number of steps is determined by the vector, the second element of the for loop, which in this example is a sequence from 1 to 10. The loop variable i takes each of those values in turn.
You can modify the previous statement to obtain different results, for example:
Code
for (i in 1:10) {
  cat(i, sep = '\n')
}
1
2
3
4
5
6
7
8
9
10
Or using a previous object:
Code
for (i in 5:length(ex)) {
  cat(i, sep = '\n')
}
5
6
7
8
9
10
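Or to make calculations:
Code
for (i in 5:length(ex)) {
  # note that the loop variable i is not used inside the loop,
  # so every step recomputes the same values from b
  b2 <- b^2
  b3 <- b * 2
  b4 <- b + 10
}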
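A loop becomes more useful when the loop variable is used as an index. The following is a minimal sketch with object names of our own (`squares`, `running`, `total`); it also shows that many calculations in R need no loop at all.

```r
ex <- 1:10

# Create the result vector first, then fill it one position at a time
squares <- numeric(length(ex))
for (i in seq_along(ex)) {
  squares[i] <- ex[i]^2     # i is the position, ex[i] is the value
}
squares

# Many operations in R are vectorized and need no loop at all
identical(squares, ex^2)

# A running total, where each step depends on the previous one
running <- numeric(length(ex))
total <- 0
for (i in seq_along(ex)) {
  total <- total + ex[i]
  running[i] <- total
}
running
```

seq_along(ex) is a safe way to write 1:length(ex), because it also behaves correctly when the vector is empty.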
To finish this short tutorial, we will welcome all of the members of the Biodiversity Science cohort 2023.
Code
for (i in 1:length(BioSciNames)) {
  print(paste0("Hi ", BioSciNames[i], ", welcome to the first practice of Biodiversity Science 2023!"))
  Sys.sleep(2) # wait two seconds before the next iteration or name
}
[1] "Hi Landon Aufderhar, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Sara Berger, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Jeannine Cavender-Bares, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Jaron Cook, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Betsy Custis (she/her), welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Ashley Darst, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Tiana De Grande, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Megan DeCook (she/her), welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Victoria Deitschman, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Sally Donovan, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Shelby Erickson, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Gwynneth Foley, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Ashley Halverson, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Mackenna Kaufer (she/her), welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Faith Kelly (she/her), welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Emma Klubberud (she/her), welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Laura Ostrowsky, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Abha Panda, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Nguyen Thanh Vy Phan, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Nathaniel Pierce, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Jesus Pinto Ledezma, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Leah Ray, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Ayden Reed, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Henry Rosato, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Lisa Russell (she/her), welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Ethan Schindler, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Nathan Schneider, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Erja Smith, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Yiyang Wang, welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Cathy Wiegand (she/her), welcome to the first practice of Biodiversity Science 2023!"
[1] "Hi Eliana Wilson (she/her), welcome to the first practice of Biodiversity Science 2023!"
We have covered the basic aspects of R, from exploring and managing objects to importing and exporting data and the basics of loops. We hope that this short tutorial is helpful not only for the Biodiversity Science course but also for your own projects. Remember: practice, practice, practice!
References
Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). Nexus: An Extensible File Format for Systematic Information. Systematic Biology, 46(4), 590–621. https://doi.org/10.1093/sysbio/46.4.590