Part 1: Setup
Part 2: How R Stores the Data
Part 3: The tidyverse Package
tidyverse package for data frames
- select() and rename columns (variables)
- filter() rows (observations)
- mutate() (define columns (variables); overwrite old or create new)
- arrange() sort observations in a data frame
- piping with |>
A typical setup to work with R consists of two components:
the R Console which executes R code and
an integrated development environment (IDE) such as RStudio or Positron.
You can download R here: Download R
You can download Positron here: Download Positron
Note
- Install R before Positron
- If an older R version exists, uninstall it before installing the newer version
RStudio Window
Video for first steps to set up R and RStudio: Click here
(However, it is recommended to work with Positron rather than RStudio.)
Positron Window
First steps to set up R and Positron can be found in this video: coming soon
Click Gear Icon -> Choose: Settings -> Search for: Keybindings ->
Toggle-On RStudio Keybindings
Always open or create a folder first
The folder is the one where all your R (.r), Quarto (.qmd), and data (e.g., .csv) files are stored.
R or Quarto file: use CTRL SHIFT P (Windows) or ⌘ SHIFT P (Mac) to open the Command Palette.
R file (.r): MyFile.r contains only R code.
Print “Hello world” using the print() command.
Assign values \(3\) and \(4\) to the legs a and b of a right-angled triangle.
Calculate the hypotenuse \(c\): \[c^2=a^2+b^2 \Longleftrightarrow c=\sqrt{a^2+b^2}\]
Print the result using the cat() command.
Assign the number \(2\) to the variable (“R” object) a and run the cat() command again.
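The steps above could look like the following R script. This is only a sketch: the wording of the output sentence is an assumption. Note that the second cat() call still prints 5, because c was calculated before a was changed and is not updated automatically.

```r
# Print a greeting
print("Hello world")

# Legs of the right-angled triangle
a = 3
b = 4

# Hypotenuse from the Pythagorean theorem
c = sqrt(a^2 + b^2)
cat("The hypotenuse is", c, "\n")

# Change a and run cat() again: c keeps its old value until it is recalculated
a = 2
cat("The hypotenuse is", c, "\n")
```

To get the hypotenuse for the new value of a, the line that calculates c has to be run again.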
Quarto file (.qmd): MyFile.qmd contains text as well as R code:
```{r}
# MyCode goes here
```
Instructions (.qmd):
Print “Hello world” using the print() command.
Assign values \(3\) and \(4\) to the legs a and b of a right-angled triangle.
Calculate the hypotenuse \(c\): \[c^2=a^2+b^2 \Longleftrightarrow c=\sqrt{a^2+b^2}\]
Print the result using the cat() command.
Note, now we want everything nicely commented!
Print “Hello world” using the print() command.
Assign values \(3\) and \(4\) to the legs a and b of a right-angled triangle.
Calculate the hypotenuse \(c\): \[c^2=a^2+b^2 \Longleftrightarrow c=\sqrt{a^2+b^2}\]
Print the result using the cat() command.
Assign the number \(2\) to the variable (“R” object) a and run the cat() command again.
R Packages extend R’s functionality. They have to be installed only once:
For example, to install the tidyverse package, type in the console window:
install.packages("tidyverse")
This needs to be done only once!
After installation, packages need to be loaded in every new R script or Quarto file with library().
Packages frequently used in this course (please install soon):
- tidyverse: supports easy data processing.
- rio: allows loading various data resources with one import() command from the user’s hard drive or the Internet.
- janitor: provides functionality to clean data and rename variables to avoid spaces and special characters.
rio and the tidyverse Package
Example: How to install the tidyverse package: Click here
Video about the rio package: Click here
Example: rio and tidyverse package (assuming they are installed already)
import() would not work if the rio package were not loaded.
select() would not work if the tidyverse package were not loaded.
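A minimal sketch of how the two packages work together, assuming rio and tidyverse are installed. The data, the file, and the object names (TempFile, DataPeople, DataSel) are hypothetical; a temporary CSV file is written first so that the example does not depend on a download.

```r
library(rio)        # provides import()
library(tidyverse)  # provides select() (through the dplyr package)

# Hypothetical example data, written to a temporary CSV file
TempFile = tempfile(fileext = ".csv")
write.csv(
  data.frame(Name = c("Ann", "Bob"), Age = c(31, 45), City = c("Rome", "Oslo")),
  TempFile,
  row.names = FALSE
)

# import() chooses the reader based on the file extension
DataPeople = import(TempFile)

# select() keeps only the listed columns
DataSel = select(DataPeople, Name, Age)
print(DataSel)
```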
Data types:
- num (such as: 0.1, 2.3, 3.14157)
- int (such as: 1, 2, 7)
- chr (such as: “Hello”, “Hi”, “World”)
- factor (such as: “Female”, “Male” or: “small”, “medium”, “large”)
- logic (TRUE, FALSE)
Object structures:
- single entry
- vector
- data frame or tibble
Numerical Data Type (num and int): Numerical values (e.g., 1, 523, 7 or 3.45, 0.1, 8.0) are used for calculations. In contrast, ZIP codes are not a numerical data type.
Character Data Type (chr): A sequence of characters, numbers, and/or symbols that forms a word or even a sentence is stored as a character data type (e.g., first or last names, street addresses, or ZIP codes).
Categorical Data Type (factor): A factor is an R data type that stores categorical data in an efficient way. Factor data types are also required by many classification models in R.
Logic Data Type (logic): A data type that stores the logic states TRUE and FALSE is called a logic object (sometimes called Boolean).
Character Data Type (chr) a.k.a Labels:
Note that what is called a character in R is often called a string in other programming languages.
Character values must be surrounded by quotes:
Character variables can be concatenated with the cat() command:
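A short sketch of both points, with hypothetical variable names. cat() separates its arguments with a space by default.

```r
# Character values are written in quotes
FirstWord = "Hello"
SecondWord = "world!"

# cat() concatenates its arguments and prints them
cat(FirstWord, SecondWord, "\n")
```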
Numerical Data Type (num): Numerical values are used for calculations (therefore ZIP codes are not numerical). The numerical data type is num and in some cases int (whole number). In most cases you do not have to differentiate between int and num.
Categorical values are stored in R as a factor data type. A factor object signals to certain R commands that the variable is categorical rather than character.
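The difference between num and int can be seen with str(). The variable names and values below are made up for illustration; the L suffix is how R marks a number as an integer.

```r
Price = 3.45      # stored as num
Count = 7L        # the L suffix creates an int
ZipCode = "92656" # a ZIP code is a label, so it is stored as chr

str(Price)
str(Count)
str(ZipCode)

# num and int can be mixed in calculations
Total = Price * Count
cat("Total:", Total, "\n")
```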
Note, some R commands recognize categorical variables automatically and a conversion from character to factor is not needed.
Sex is a character variable in the dataset People
Transforming the variable \(Sex\) to a factor and looking at its structure (str()) again:
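A sketch of the conversion, using a small made-up stand-in for the People dataset (the names and values are assumptions, not the course data):

```r
# Hypothetical stand-in for the People dataset
People = data.frame(
  Name = c("Ann", "Bob", "Cleo"),
  Sex = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE   # keep Sex as character at first
)
str(People)   # Sex is listed as chr

# Convert the character variable to a factor
People$Sex = as.factor(People$Sex)
str(People)   # Sex is now listed as a Factor with 2 levels
```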
Logic variables: Store TRUE and FALSE. They can be combined with and (&) or with or (|). Internally, TRUE is stored as \(1\) and FALSE is stored as \(0\).
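A brief illustration with hypothetical variable names. Because TRUE counts as 1 and FALSE as 0, sum() counts the TRUE entries.

```r
IsSunny = TRUE
IsWarm = FALSE

BothTrue = IsSunny & IsWarm     # and: TRUE only if both are TRUE
EitherTrue = IsSunny | IsWarm   # or: TRUE if at least one is TRUE
print(BothTrue)
print(EitherTrue)

# TRUE is counted as 1 and FALSE as 0
NumberTrue = sum(c(TRUE, FALSE, TRUE))
print(NumberTrue)
```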
Print Hello world! by using variable A:
cat()
A rectangular lot has a width of 200 feet (Width) and a length (Length) of 300 feet. Calculate the area (Area) and create a full sentence output.
cat() command and Single Value Objects with Different Data Types
Assign your own first and last name, your ZIP code, and your age to three character variables (first name, last name, ZIP code) and one numerical variable (age). Use var1, var2, var3, var4. Afterward, use cat() to output a sentence like Carsten Lange is 55 years old and lives in ZIP code 92656 using the variables you created.
Objects just store a single value:
A vector object stores a list of values (numerical, character, factor, or logic; mixing of data types is not allowed)
Example: Weather during the last three days in Stattown:
Vector objects can be used as arguments for an R command to calculate statistics such as the mean() or the number of entries in the vector (length()):
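For instance, with made-up temperatures for the three days (the values and the name VecTemp are only an illustration):

```r
# Hypothetical temperatures for the last three days
VecTemp = c(71, 68, 75)

MeanTemp = mean(VecTemp)     # average of the entries
NumberDays = length(VecTemp) # number of entries in the vector

cat("Mean temperature:", MeanTemp, "over", NumberDays, "days\n")
```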
A data frame is similar to an Excel table. A data frame stores R vectors as variables in its columns.
Note that the c() command combines values into a vector.
Below we show how the values from the four vectors VecDay, VecTemp, VecWindSpeed, and VecIsSunny are stored in the data frame DataWeather.
The columns hold the values from the four vectors and the rows (with the exception of the first row), hold the observations for the various days. The first row contains the variable names:
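One way to build such a data frame is sketched below. The weather values and the column names (Day, Temp, WindSpeed, IsSunny) are assumptions; only the vector and data frame names come from the text.

```r
# Four vectors, one per variable (values are made up)
VecDay = c("Mon", "Tue", "Wed")
VecTemp = c(71, 68, 75)
VecWindSpeed = c(5, 12, 8)
VecIsSunny = c(TRUE, FALSE, TRUE)

# Each vector becomes one column of the data frame
DataWeather = data.frame(
  Day = VecDay,
  Temp = VecTemp,
  WindSpeed = VecWindSpeed,
  IsSunny = VecIsSunny
)

print(DataWeather)  # first row shows the variable names
str(DataWeather)    # lists the data type of each column
```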
Most of the time, we do not build a data frame from its vectors (columns). Instead, we load the data frame from a file (for example, a csv file).
Below we load the Titanic dataset. Note, only the first six observations are shown.
We can see the structure of the data frame by using the str() command. This includes the type of all variables/vectors:
Since the columns of a data frame are made up of vectors, we can extract these vectors, and use the values for data analysis (remember: observations are in the rows, variables are in the columns).
We can use the notation DataFrameName$VectorName to extract the vectors:
If we like, we can change a vector inside a data frame:
We can use the logical vector Survived (remember, TRUE=1, FALSE=0) to calculate the survival rate:
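The three steps (extract a vector, change a vector, calculate the survival rate) can be sketched with a tiny made-up data frame. The values below are not the real Titanic data, and storing Survived as 0/1 before the conversion is an assumption.

```r
# Hypothetical mini version of the Titanic data (values are made up)
DataTitanic = data.frame(
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Survived = c(1, 0, 1, 1),
  Age = c(29, 40, 2, 35)
)

# Extract a column as a vector with DataFrameName$VectorName
VecAge = DataTitanic$Age
MeanAge = mean(VecAge)

# Change a vector inside the data frame: store Survived as TRUE/FALSE
DataTitanic$Survived = as.logical(DataTitanic$Survived)

# TRUE counts as 1 and FALSE as 0, so the mean is the survival rate
SurvivalRate = mean(DataTitanic$Survived)
cat("Mean age:", MeanAge, " Survival rate:", SurvivalRate, "\n")
```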
Data in General
- Labels/Char: last names, ZIP codes, ID numbers
- Categorical/Factors (limited number of variations): Sex (M/F), Grades (A, B, C, D, F), Gender (M, F, NonBin)
- Numerical Data: counts (# of accidents), measures (e.g., Weight=160.8)
  - Discrete/Int: counts (e.g., # of accidents)
  - Continuous/Num: measures (e.g., Weight=160.8)
Data Type and Object Structure
tidyverse and Piping
R commands consist of the command’s name followed by a pair of parentheses: command()
Inside the () we can define one or more arguments for the command.
Arguments in a command usually have names such as x= or data=
R does not require the argument’s name, but without names the order of the arguments matters.
R commands have many arguments. Most have default values.
We can nest commands. However, nesting too deeply makes code difficult to read.
Most R commands have the following structure: \[\begin{equation} \underbrace{DataNew}_{\text{R object storing the result}}= \underbrace{Command}_{\text{Name of the command}} \underbrace{(\overbrace{Data}^{\text{1. Argument: Data to process}}, \overbrace{Arg2, Arg3, \dots, ArgN}^{\text{More Arguments}})}_{\text{Arguments inside () and separated by comma}} \end{equation}\]
Often the data argument is the first argument in a command. It is usually named data= or x=.
All three examples are equivalent
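One possible set of three equivalent calls, using mean() and a made-up vector with a missing value. This is an illustration and not necessarily the same three examples as on the original slide. The first call names every argument, the second relies on argument order only, and the third names just the argument that differs from its default.

```r
VecTemp = c(71, 68, NA, 75)   # hypothetical values with one missing entry

# 1. All arguments named
Mean1 = mean(x = VecTemp, trim = 0, na.rm = TRUE)

# 2. No names: arguments are matched by their order (x, trim, na.rm)
Mean2 = mean(VecTemp, 0, TRUE)

# 3. Data argument by position, trim left at its default, na.rm named
Mean3 = mean(VecTemp, na.rm = TRUE)

print(c(Mean1, Mean2, Mean3))
```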
Use ?mean or help(mean) in the RStudio console to see the default values.
You can also mark/highlight the command and then press F1.
Try it for the mean() command.
tidyverse/dplyr Package
- The dplyr package is part of the tidyverse (meta) package
- library(tidyverse) loads the tidyverse and its core packages
- select() selects columns (variables) from a data frame
- filter() filters rows (observations) for specific criteria
- mutate() calculates new or overwrites existing columns (variables) based on other columns (just like Excel)
- arrange() sorts a data frame according to one or more columns in ascending order (wrap a column in desc() for descending order)
tidyverse for Data Analysis
Goal: Create a data frame with a few selected variables that contains only female observations and the fare in current U.S. dollars.
select() Command
select(DataMine, Var1, Var2) selects columns (variables) Var1 and Var2 from a data frame DataMine. The first argument is the data argument (named .data= in dplyr), followed by the names of the selected variables.
select(DataMine, -Var1, -Var2) selects all columns (variables) except Var1 and Var2 from a data frame DataMine.
Here is an example using the DataTitanic data frame with a few selected variables:
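A sketch of both forms of select(), assuming the tidyverse is installed. The small DataTitanic data frame below is made up (including the extra Name column) so that the example runs without the real dataset.

```r
library(tidyverse)

# Hypothetical mini version of the Titanic data (values are made up)
DataTitanic = data.frame(
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Survived = c(1, 0, 1, 0),
  PasClass = c(1, 3, 2, 3),
  Sex = c("female", "male", "female", "male"),
  Age = c(29, 40, 35, 22),
  FareInPounds = c(80, 8, 20, 7)
)

# Keep the listed columns
DataTitanicSelVar = select(DataTitanic, Survived, PasClass, Sex, Age, FareInPounds)

# Same result here: keep everything except Name
DataTitanicNoName = select(DataTitanic, -Name)

print(DataTitanicSelVar)
```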
filter() Command
The filter() command filters rows (observations) of a data frame for specific criteria. The first argument is the data argument (named .data= in dplyr), followed by the filter criteria.
E.g., filter for female passengers:
We use DataTitanicSelVar, which we created on the previous slide, as the starting data frame and save the result in DataTitanicSelVarFem.
Note, we have to use == instead of = for the criteria:
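A minimal sketch of the filter step, assuming the tidyverse is installed. DataTitanicSelVar is rebuilt here with made-up values so that the block runs on its own.

```r
library(tidyverse)

# Hypothetical stand-in for DataTitanicSelVar (values are made up)
DataTitanicSelVar = data.frame(
  Survived = c(1, 0, 1, 0),
  PasClass = c(1, 3, 2, 3),
  Sex = c("female", "male", "female", "male"),
  Age = c(29, 40, 35, 22),
  FareInPounds = c(80, 8, 20, 7)
)

# Keep only the rows where Sex equals "female" (== compares, = would assign)
DataTitanicSelVarFem = filter(DataTitanicSelVar, Sex == "female")
print(DataTitanicSelVarFem)
```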
mutate() Command
mutate() creates or overwrites columns (variables) based on other columns (just like Excel). The first argument is the data argument (named .data= in dplyr), followed by the instructions on how to create the new variable.
E.g., mutate calculates the FareIn2023Dollars by multiplying FareInPounds by \(108.5\). The command uses DataTitanicSelVarFem from the previous slide:*
*) The purchasing power of a British pound from the year 1912 was about 90.4 British pounds at the beginning of 2023 (see Bank of England (2022)).
Multiplying this with the exchange rate for the British pound at the beginning of 2023 ($1.2/Brit. pound) (see Federal Reserve Bank of St. Louis (2023))
gives us the multiplier of 108.5.
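The arithmetic behind the multiplier and the mutate step can be sketched as follows, assuming the tidyverse is installed. The fares in the small data frame are made up; the factors 90.4 and 1.2 are the ones quoted above.

```r
library(tidyverse)

# 90.4 (purchasing power) times 1.2 (exchange rate) = 108.48, rounded to 108.5
Multiplier = round(90.4 * 1.2, 1)

# Hypothetical stand-in for DataTitanicSelVarFem (values are made up)
DataTitanicSelVarFem = data.frame(
  Survived = c(1, 1),
  PasClass = c(1, 2),
  Sex = c("female", "female"),
  Age = c(29, 35),
  FareInPounds = c(80, 20)
)

# Create the new column from an existing one
DataTitanicSelVarFemDolFare = mutate(
  DataTitanicSelVarFem,
  FareIn2023Dollars = FareInPounds * Multiplier
)
print(DataTitanicSelVarFemDolFare)
```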
We now have a data frame with only women and columns \(Survived\), \(PasClass\), \(Sex\), \(Age\), and \(FareIn2023Dollars\).
How did we get there?
- DataTitanicSelVar
- DataTitanicSelVarFem
- DataTitanicSelVarFemDolFare
Could this be done easier?
Note, overwriting data frames such as DataTitanic is usually a bad idea! Nesting the command is possible but very difficult to read.
Piping Schema
Shortcut for |>: CTRL SHIFT M (Windows) or ⌘ SHIFT M (Mac).
The pipe operator |> is for most practical purposes equivalent to %>%.
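With the pipe, the select, filter, and mutate steps can be chained without intermediate data frames. The sketch below assumes the tidyverse is installed and R version 4.1 or newer (needed for |>); the data values are made up.

```r
library(tidyverse)

# Hypothetical mini version of the Titanic data (values are made up)
DataTitanic = data.frame(
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Survived = c(1, 0, 1, 0),
  PasClass = c(1, 3, 2, 3),
  Sex = c("female", "male", "female", "male"),
  Age = c(29, 40, 35, 22),
  FareInPounds = c(80, 8, 20, 7)
)

# Each step passes its result on as the first argument of the next command
DataTitanicSelVarFemDolFare = DataTitanic |>
  select(Survived, PasClass, Sex, Age, FareInPounds) |>
  filter(Sex == "female") |>
  mutate(FareIn2023Dollars = FareInPounds * 108.5)

print(DataTitanicSelVarFemDolFare)
```

The original DataTitanic stays unchanged, and no nested commands are needed.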
Let us compare code for the same tasks between R and Python:
Download the Titanic dataset
Select the variables Sex, FareInPounds, Survived (renamed to: Surv)
Calculate a new column FareInDollars by multiplying FareInPounds by \(108.5\)
Filter for Sex being female
Calculate the mean of FareInDollars
[1] 4826.06
[1] 4826.06
Note, polars is currently not supported in WASM
To answer the question, we develop a male and a female data frame and compare the survival rates.
In each data frame we would need only the variables Sex and Survived but we add also PasClass for additional analysis.
We select Sex, Survived, and PasClass=Pclass and filter for male:
We select Sex, Survived, and PasClass=Pclass and filter for female:
Hint: You could calculate the female survival rate either as sum(DataFemale$Survived)/nrow(DataFemale) or as mean(DataFemale$Survived).
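A sketch of the comparison, assuming the tidyverse is installed and R 4.1 or newer. The eight rows below are made up, so the resulting rates are not the real Titanic survival rates. The raw column name Pclass is taken from the renaming PasClass=Pclass in the text.

```r
library(tidyverse)

# Hypothetical mini version of the Titanic data (values are made up)
DataTitanic = data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 1, 0),
  Pclass = c(1, 2, 3, 1, 3, 3, 1, 2),
  Sex = c("female", "female", "female", "female", "male", "male", "male", "male")
)

# select() can rename while selecting: NewName = OldName
DataMale = DataTitanic |>
  select(Sex, Survived, PasClass = Pclass) |>
  filter(Sex == "male")

DataFemale = DataTitanic |>
  select(Sex, Survived, PasClass = Pclass) |>
  filter(Sex == "female")

# Two equivalent ways to calculate a survival rate
RateFemale = sum(DataFemale$Survived) / nrow(DataFemale)
RateMale = mean(DataMale$Survived)

cat("Female:", RateFemale, " Male:", RateMale, "\n")
```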
PasClass is a Confounder
The third class was deep in the hull of the Titanic with low survival chances, and more men were traveling in that class. This makes PasClass a confounder. Therefore, we have to analyze male and female survival by class: we have to filter for Sex and PasClass.
For each of the three passenger classes:
- Select Survived, Sex, PasClass=Pclass
- Filter for PasClass and Sex (female and male)
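For one class, the analysis might look like the sketch below (tidyverse installed, R 4.1 or newer). The data are made up, so the rates are illustrative only. Two criteria separated by a comma in filter() must both be fulfilled.

```r
library(tidyverse)

# Hypothetical mini version of the Titanic data (values are made up)
DataTitanic = data.frame(
  Survived = c(1, 1, 1, 0, 1, 0, 0, 0),
  Pclass = c(1, 1, 3, 3, 1, 1, 3, 3),
  Sex = c("female", "female", "female", "female", "male", "male", "male", "male")
)

# First class, female passengers
DataFemaleClass1 = DataTitanic |>
  select(Survived, Sex, PasClass = Pclass) |>
  filter(PasClass == 1, Sex == "female")

# First class, male passengers
DataMaleClass1 = DataTitanic |>
  select(Survived, Sex, PasClass = Pclass) |>
  filter(PasClass == 1, Sex == "male")

RateFemaleClass1 = mean(DataFemaleClass1$Survived)
RateMaleClass1 = mean(DataMaleClass1$Survived)

cat("Class 1 - Female:", RateFemaleClass1, " Male:", RateMaleClass1, "\n")
```

Changing PasClass == 1 to PasClass == 2 or PasClass == 3 repeats the comparison for the other classes.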