Part 2: tidyverse (follow along in RStudio)
What you will learn in this session:
tidyverse package for data frames
select() and rename columns (variables)
filter() rows (observations)
mutate() (define columns (variables); overwrite old or create new)
|> (the pipe operator)
»
R commands consist of the command’s name followed by a pair of parentheses: command()
Inside the () we can define one or more arguments for the command.
Arguments in a command usually have names such as x= or data=.
R does not require you to use an argument’s name, but without names the order of the arguments matters.
Many R commands have several arguments, most of which have default values.
We can nest commands. However, nesting too deeply makes code difficult to read.
»
Most R commands have the following structure: \[ \underbrace{DataNew}_{\text{R object storing the result}}= \underbrace{Command}_{\text{Name of the command}} \underbrace{(\overbrace{Data}^{\text{1st argument: data to process}}, \overbrace{Arg2, Arg3, \dots, ArgN}^{\text{More arguments}})}_{\text{Arguments inside () and separated by commas}} \]
The data argument is often the first argument in a command. It is usually named data= or x=.
»
Result = mean(x = VecTest, trim = 0, na.rm = FALSE)
cat("The mean of the values in vector VecTest is:", Result)
The mean of the values in vector VecTest is: 2
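The call above names every argument. As a minimal sketch, the same mean can be computed in three ways: with all argument names, with no names (so the order matters), and with the default values only. The definition of VecTest is not shown in this text, so the vector below is an assumption chosen to give a mean of 2.

```r
# Assumption: VecTest is not defined in the slides; c(1, 2, 3) is a stand-in.
VecTest = c(1, 2, 3)

# 1. All arguments are named
Result1 = mean(x = VecTest, trim = 0, na.rm = FALSE)

# 2. No argument names: the values are matched by their position
Result2 = mean(VecTest, 0, FALSE)

# 3. Only the data argument: trim and na.rm fall back to their default values
Result3 = mean(VecTest)

cat("Results:", Result1, Result2, Result3)
```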
All three examples are equivalent
Try ?mean in the RStudio console to see the default values.
»
tidyverse/dplyr Package
The dplyr package is part of the tidyverse (meta) package.
library(tidyverse) loads the tidyverse and its packages.
select() selects columns (variables) from a data frame.
filter() filters rows (observations) for specific criteria.
mutate() calculates new or overwrites existing columns (variables) based on other columns (just like Excel).
»
The DataTitanic data frame (first six rows):
  Survived Pclass Name                        Sex    Age SiblingsSpousesAboard
1        0      3 Mr. Owen Harris Braund      male    22                     1
2        1      1 Mrs. John Bradley Cumings   female  38                     1
3        1      3 Miss. Laina Heikkinen       female  26                     0
4        1      1 Mrs. Jacques Heath Futrelle female  35                     1
5        0      3 Mr. William Henry Allen     male    35                     0
6        0      3 Mr. James Moran             male    27                     0
  ParentsChildrenAboard FareInPounds
1                     0       7.2500
2                     0      71.2833
3                     0       7.9250
4                     0      53.1000
5                     0       8.0500
6                     0       8.4583
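To follow along without the original data file, a small stand-in for DataTitanic can be typed in by hand. The sketch below is an assumption, not the way the slides load the data: it rebuilds only the six rows printed above, so later results will contain fewer rows than the slides show.

```r
library(tidyverse)  # loads dplyr and the other tidyverse packages

# Stand-in for DataTitanic: only the six rows printed above (assumption)
DataTitanic = data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass = c(3, 1, 3, 1, 3, 3),
  Name = c("Mr. Owen Harris Braund", "Mrs. John Bradley Cumings",
           "Miss. Laina Heikkinen", "Mrs. Jacques Heath Futrelle",
           "Mr. William Henry Allen", "Mr. James Moran"),
  Sex = c("male", "female", "female", "female", "male", "male"),
  Age = c(22, 38, 26, 35, 35, 27),
  SiblingsSpousesAboard = c(1, 1, 0, 1, 0, 0),
  ParentsChildrenAboard = c(0, 0, 0, 0, 0, 0),
  FareInPounds = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# Print the first rows of the data frame
head(DataTitanic)
```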
»
select() Command
select(DataMine, Var1, Var2) selects the columns (variables) Var1 and Var2 from a data frame DataMine. The first argument is the data frame (dplyr names this argument .data), followed by the names of the selected variables.
select(DataMine, -Var1, -Var2) selects all columns (variables) except Var1 and Var2 from the data frame DataMine.
Here is an example using the DataTitanic data frame from the previous slide:
  Survived Name                        Sex    Age
1        0 Mr. Owen Harris Braund      male    22
2        1 Mrs. John Bradley Cumings   female  38
3        1 Miss. Laina Heikkinen       female  26
4        1 Mrs. Jacques Heath Futrelle female  35
5        0 Mr. William Henry Allen     male    35
6        0 Mr. James Moran             male    27
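The code that produced this output is not preserved in this text. A minimal sketch of how it could look is given below, assuming a hand-typed stand-in for DataTitanic with only five of its columns. The names DataTitanicNoClass and DataTitanicRenamed, and the new column name AgeInYears, are hypothetical and only illustrate negative selection and rename().

```r
library(tidyverse)

# Stand-in for DataTitanic: six rows, five columns (assumption)
DataTitanic = data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass = c(3, 1, 3, 1, 3, 3),
  Name = c("Mr. Owen Harris Braund", "Mrs. John Bradley Cumings",
           "Miss. Laina Heikkinen", "Mrs. Jacques Heath Futrelle",
           "Mr. William Henry Allen", "Mr. James Moran"),
  Sex = c("male", "female", "female", "female", "male", "male"),
  Age = c(22, 38, 26, 35, 35, 27)
)

# Keep only the listed columns
DataTitanicSelVar = select(DataTitanic, Survived, Name, Sex, Age)

# Same columns, expressed by dropping Pclass (hypothetical object name)
DataTitanicNoClass = select(DataTitanic, -Pclass)

# rename() uses the pattern NewName = OldName (hypothetical names)
DataTitanicRenamed = rename(DataTitanicSelVar, AgeInYears = Age)

head(DataTitanicSelVar)
```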
»
filter() Command
The filter() command filters rows (observations) of a data frame for specific criteria. The first argument is the data frame (dplyr names this argument .data), followed by the filter criteria.
E.g., filter for female passengers using DataTitanicSelVar, which we created on the previous slide. Note that a criterion must use == (comparison) instead of = (assignment):
  Survived Name                               Sex    Age
1        1 Mrs. John Bradley Cumings          female  38
2        1 Miss. Laina Heikkinen              female  26
3        1 Mrs. Jacques Heath Futrelle        female  35
4        1 Mrs. Oscar W Johnson               female  27
5        1 Mrs. Nicholas (Adele Achem) Nasser female  14
6        1 Miss. Marguerite Rut Sandstrom     female   4
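The filter call itself is not preserved in this text. The sketch below shows one way to write it, assuming a hand-typed stand-in for DataTitanicSelVar with only six rows, so it returns three women instead of the rows printed above. The object DataTitanicFemOver30 is a hypothetical extra that shows how several criteria are combined.

```r
library(tidyverse)

# Stand-in for DataTitanicSelVar: six rows only (assumption)
DataTitanicSelVar = data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Name = c("Mr. Owen Harris Braund", "Mrs. John Bradley Cumings",
           "Miss. Laina Heikkinen", "Mrs. Jacques Heath Futrelle",
           "Mr. William Henry Allen", "Mr. James Moran"),
  Sex = c("male", "female", "female", "female", "male", "male"),
  Age = c(22, 38, 26, 35, 35, 27)
)

# Keep only the rows where Sex equals "female" (== compares, = would assign)
DataTitanicSelVarFem = filter(DataTitanicSelVar, Sex == "female")

# Criteria separated by commas must all be true (hypothetical example)
DataTitanicFemOver30 = filter(DataTitanicSelVar, Sex == "female", Age > 30)

head(DataTitanicSelVarFem)
```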
»
mutate() Command 🤓
mutate() creates or overwrites columns (variables) based on other columns (just like Excel). The first argument is the data frame (dplyr names this argument .data), followed by the instructions on how to create the new variable.
E.g., mutate() calculates the new column Born from each passenger's Age in the year of the Titanic disaster (1912), i.e. Born = 1912 - Age. It uses DataTitanicSelVarFem from the previous slide:
  Survived Name                               Sex    Age Born
1        1 Mrs. John Bradley Cumings          female  38 1874
2        1 Miss. Laina Heikkinen              female  26 1886
3        1 Mrs. Jacques Heath Futrelle        female  35 1877
4        1 Mrs. Oscar W Johnson               female  27 1885
5        1 Mrs. Nicholas (Adele Achem) Nasser female  14 1898
6        1 Miss. Marguerite Rut Sandstrom     female   4 1908
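The Born values in the output are consistent with subtracting Age from 1912 (for example 1912 - 38 = 1874). A minimal sketch of such a mutate() call follows, assuming a hand-typed stand-in for DataTitanicSelVarFem that contains only the first three women.

```r
library(tidyverse)

# Stand-in for DataTitanicSelVarFem: first three rows only (assumption)
DataTitanicSelVarFem = data.frame(
  Survived = c(1, 1, 1),
  Name = c("Mrs. John Bradley Cumings", "Miss. Laina Heikkinen",
           "Mrs. Jacques Heath Futrelle"),
  Sex = c("female", "female", "female"),
  Age = c(38, 26, 35)
)

# Create the new column Born from the existing column Age
DataTitanicSelVarFemBirthYear = mutate(DataTitanicSelVarFem, Born = 1912 - Age)

head(DataTitanicSelVarFemBirthYear)
```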
»
DataTitanicSelVar
DataTitanicSelVarFem
DataTitanicSelVarFemBirthYear
Could this be done more easily?
Note: overwriting data frames such as DataTitanic is usually a bad idea!
»
  Survived Name                               Sex    Age Born
1        1 Mrs. John Bradley Cumings          female  38 1874
2        1 Miss. Laina Heikkinen              female  26 1886
3        1 Mrs. Jacques Heath Futrelle        female  35 1877
4        1 Mrs. Oscar W Johnson               female  27 1885
5        1 Mrs. Nicholas (Adele Achem) Nasser female  14 1898
6        1 Miss. Marguerite Rut Sandstrom     female   4 1908
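The code for this step is not preserved in this text. One possible reading is that a single object is overwritten at every step instead of creating three differently named data frames. The sketch below illustrates that idea on a copy, so the original stays untouched; DataWork is a hypothetical name and the six-row DataTitanic is a hand-typed stand-in.

```r
library(tidyverse)

# Stand-in for DataTitanic: six rows, five columns (assumption)
DataTitanic = data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass = c(3, 1, 3, 1, 3, 3),
  Name = c("Mr. Owen Harris Braund", "Mrs. John Bradley Cumings",
           "Miss. Laina Heikkinen", "Mrs. Jacques Heath Futrelle",
           "Mr. William Henry Allen", "Mr. James Moran"),
  Sex = c("male", "female", "female", "female", "male", "male"),
  Age = c(22, 38, 26, 35, 35, 27)
)

# Overwrite a working copy (hypothetical name) at every step.
# Overwriting DataTitanic itself would destroy the original data.
DataWork = DataTitanic
DataWork = select(DataWork, Survived, Name, Sex, Age)
DataWork = filter(DataWork, Sex == "female")
DataWork = mutate(DataWork, Born = 1912 - Age)

head(DataWork)
```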
»
Piping Schema
  Survived Name                               Sex    Age Born
1        1 Mrs. John Bradley Cumings          female  38 1874
2        1 Miss. Laina Heikkinen              female  26 1886
3        1 Mrs. Jacques Heath Futrelle        female  35 1877
4        1 Mrs. Oscar W Johnson               female  27 1885
5        1 Mrs. Nicholas (Adele Achem) Nasser female  14 1898
6        1 Miss. Marguerite Rut Sandstrom     female   4 1908
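The piping code is not preserved in this text. As a sketch of the idea: the pipe operator |> hands the result on its left to the next command as that command's first argument, so select(), filter() and mutate() can be chained without intermediate data frames. The six-row DataTitanic below is a hand-typed stand-in (assumption), and the native pipe |> requires R 4.1 or newer.

```r
library(tidyverse)

# Stand-in for DataTitanic: six rows, five columns (assumption)
DataTitanic = data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass = c(3, 1, 3, 1, 3, 3),
  Name = c("Mr. Owen Harris Braund", "Mrs. John Bradley Cumings",
           "Miss. Laina Heikkinen", "Mrs. Jacques Heath Futrelle",
           "Mr. William Henry Allen", "Mr. James Moran"),
  Sex = c("male", "female", "female", "female", "male", "male"),
  Age = c(22, 38, 26, 35, 35, 27)
)

# Each command receives the result of the previous line as its first argument
DataTitanicSelVarFemBirthYear = DataTitanic |>
  select(Survived, Name, Sex, Age) |>
  filter(Sex == "female") |>
  mutate(Born = 1912 - Age)

head(DataTitanicSelVarFemBirthYear)
```

The original DataTitanic is never overwritten here, and only one new object is created.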
»