Introduction to R and RStudio

Part 1: Basics (follow along in RStudio)

Learning Outcomes

What you will learn in this session:

  • How to install R and RStudio
  • How the RStudio window is laid out
  • How to set up RStudio
  • How to create a project (folder) in RStudio
  • How to use major functionalities of RStudio
  • How to extend R’s functionality with R-packages
  • Which packages you should install for this book
  • Data types and data objects in R
  • How very big and very small numbers can be displayed

Install and Set Up R and RStudio

A typical setup to work with R consists of two components:

  • the R Console which executes R code and

  • an integrated development environment (IDE) such as RStudio.

You can download R here: Download R

You can download RStudio here: Download RStudio

Detailed installation guides are provided in the Book and the Online Resources sections of this chapter in the book.

RStudio — Integrated Development Environment (IDE) 🤓

RStudio Window

R Packages

R Packages extend R’s functionality. They have to be installed only once:

Tools -> Install Packages ...

After installation they need to be loaded in every new R script with library().

Packages frequently used in this course (please install soon):

  • tidyverse: supports easy data processing.
  • rio: allows loading various data resources with one import() command from the user’s hard drive or the Internet.
  • janitor: provides functionality to clean data and rename variables to avoid spaces and special characters.
  • tidymodels: streamlines data engineering and machine learning tasks.
  • kableExtra: supports rendering tables in HTML.
  • shiny: needed together with the learnr package for the interactive exercises in the book.
  • learnr: needed together with the shiny package for the interactive exercises in the book.
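
As a minimal sketch of the install-once, load-every-time workflow described above, the code below checks which of the listed packages are still missing. The helper name MissingPackages is hypothetical (it is not from the book), and the install.packages() call is commented out so that nothing is installed by accident.

```r
# Packages listed above for this course
CoursePackages=c("tidyverse", "rio", "janitor", "tidymodels",
                 "kableExtra", "shiny", "learnr")

# Hypothetical helper: returns the names of packages that are not installed yet
MissingPackages=function(Packages) {
  IsInstalled=vapply(Packages, requireNamespace, logical(1), quietly=TRUE)
  Packages[!IsInstalled]
}

ToInstall=MissingPackages(CoursePackages)
print(ToInstall)

# Install once (uncomment to run; same effect as Tools -> Install Packages ...):
# install.packages(ToInstall)

# Load in every new R script, for example:
# library(tidyverse)
```

If print(ToInstall) shows character(0), all listed packages are already installed.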

Example: the rio and the tidyverse Package

This example assumes that the rio and tidyverse packages are already installed.

library(rio)        # provides import()
library(tidyverse)  # loads dplyr, which provides select()
DataHousing =
  import("https://ai.lange-analytics.com/data/HousingData.csv") |> 
  select(Price=price, Sqft=sqft_living, Bedrooms=bedrooms, Waterfront=waterfront)
print(DataHousing[1:3,])
   Price Sqft Bedrooms Waterfront
1 221900 1180        3         no
2 538000 2570        3         no
3 180000  770        2         no

import() would not work if the rio package were not loaded.

select() would not work if the tidyverse package were not loaded.
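
A related sketch, using only a package that ships with R: a function from an installed package can also be called without library() by writing the package name and :: in front of it. The function file_ext() from the tools package is used purely as an illustration here; it is not part of the housing example.

```r
# tools is installed together with R but is usually not loaded with library()
# PackageName::FunctionName() calls one function without loading the whole package
FileType=tools::file_ext("HousingData.csv")
print(FileType)

# The same idea would apply to the example above (assuming rio is installed):
# DataHousing=rio::import("https://ai.lange-analytics.com/data/HousingData.csv")
```

Loading a package with library() is usually more convenient when many of its functions are used in the same script.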

Data Types & Data Objects

  • Data Types: What can R store?

    • numerical (num)
    • character (chr)
    • factor
    • logic
  • Data Objects: What are the containers R uses to store data?

    • single entry: single value object
    • list of entries: vector
    • table: data frame and tibble
    • advanced objects, e.g., for plots, models, prediction results

Data Types 🤓

Numerical Data Type (num): Numerical values (e.g., 1, 523, 3.45) are used for calculations. In contrast, ZIP codes are not a numerical data type.

Character Data Type (chr): A character data type stores a sequence of letters, digits, and/or symbols that form a word or even a sentence (e.g., first or last names, street addresses, or ZIP codes).

Factor Data Type (factor): A factor is an R data type that stores categorical data efficiently. Factors are also required by many classification models in R.

Logic Data Type (logic): A data type that stores the logic states TRUE and FALSE is called a logic object (sometimes called Boolean).

Numerical Data Type (num): Numerical values are used for calculations (therefore, ZIP codes are not numerical). Numerical data can be discrete (integer) or continuous (double).

A=as.integer(2)
B=as.integer(3)
str(A) # str() returns structure of a variable
 int 2
C=1.23
str(C) # str() returns structure of a variable
 num 1.23
print(A*C)
[1] 2.46
A^B
[1] 8
A/B # Returns num type
[1] 0.6666667
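
As a small sketch building on the variables above, class() can be used to check which numerical type a calculation returns. The variable D and the L suffix are additions for illustration and do not appear in the original example.

```r
A=as.integer(2)
B=as.integer(3)
C=1.23

# class() reports the data type of an object
print(class(A))    # "integer"
print(class(C))    # "numeric"

# Mixing integer and double values gives a double (num) result
print(class(A*C))  # "numeric"

# Dividing two integers also returns a double
print(class(A/B))  # "numeric"

# The suffix L is a shortcut for creating an integer
D=2L
print(identical(A, D))  # TRUE
```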

Character Data Type (chr):

Note that what is called a character in R is often called a string in other programming languages.

Character values must be surrounded by quotes:

MyText="Hello world!"
print(MyText)
[1] "Hello world!"

Character variables can be concatenated and printed with the cat() command:

FirstName="Carsten"
LastName="Lange"
cat(FirstName, LastName) # R adds a space automatically
Carsten Lange
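
Note that cat() prints the combined text but does not store it. As a sketch, paste() and paste0() can be used when the result should be kept in a variable. The variable names FullName, ListName, and UserName are chosen only for this illustration.

```r
FirstName="Carsten"
LastName="Lange"

# paste() returns the combined text as a new character value
FullName=paste(FirstName, LastName)   # the separator defaults to a space
print(FullName)

# sep= changes the separator; paste0() uses no separator at all
ListName=paste(LastName, FirstName, sep=", ")
UserName=paste0(FirstName, LastName)
print(ListName)
print(UserName)
```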

Factor Data Type (factor):

A factor is an R data type that stores categorical data efficiently. Categorical data are character-type data that fall into a few categories, such as hair color (blonde, brown, red, black). They can be coded with numbers (e.g., 1 to 4 for the four hair colors) and thus use less memory. Another example is sex (male, female).

People=tribble(~Name,~Sex,
                  "John", "male",
                  "Jane", "female",
                  "Mia", "female",
                  "Brid", "female",
                  "Greg", "male")
print(People)
# A tibble: 5 × 2
  Name  Sex   
  <chr> <chr> 
1 John  male  
2 Jane  female
3 Mia   female
4 Brid  female
5 Greg  male  

Sex is a character variable in the dataset People:

str(People$Sex)
 chr [1:5] "male" "female" "female" "female" "male"

Transforming the variable \(Sex\) to a factor and looking at its structure (str()) again:

str(as.factor(People$Sex))
 Factor w/ 2 levels "female","male": 2 1 1 1 2
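
The str() output above shows two parts of a factor: the levels and the integer codes. The following sketch looks at both parts separately using base R only. The vector is typed in by hand so that the example does not depend on the tidyverse package.

```r
VecSex=c("male", "female", "female", "female", "male")
FctSex=as.factor(VecSex)

# The categories (levels) are stored once, in alphabetical order
print(levels(FctSex))      # "female" "male"

# Each entry is stored as an integer code that points to a level
print(as.integer(FctSex))  # 2 1 1 1 2

# table() counts how often each category occurs
print(table(FctSex))
```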

Logic Data Type (logic): Logic variables store TRUE and FALSE. They can be combined with & (and) and | (or). Internally, TRUE is stored as \(1\) and FALSE is stored as \(0\), so logic values can be used in calculations.

IsConcertGood=FALSE 
IsCompanyGood=TRUE
cat("Is the concert good?", IsConcertGood, "Is the company good?", IsCompanyGood)
Is the concert good? FALSE Is the company good? TRUE
IsEveningAmazing=IsConcertGood & IsCompanyGood # & stands for "and"
cat("Is the evening amazing?", IsEveningAmazing)
Is the evening amazing? FALSE
IsEveningGood=IsConcertGood | IsCompanyGood # | stands for "or"
cat("Is the evening good?", IsEveningGood)
Is the evening good? TRUE
IsConcertGood+IsCompanyGood+17
[1] 18

Truth Table for AND and OR

A and B are R objects; A&B means "A and B", A|B means "A or B".

A      B      A&B    A|B
TRUE   TRUE   TRUE   TRUE
TRUE   FALSE  FALSE  TRUE
FALSE  TRUE   FALSE  TRUE
FALSE  FALSE  FALSE  FALSE
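
As a sketch, the truth table can be reproduced in R. It uses c() to store all four combinations at once (vectors are introduced later in this session); the column names AandB and AorB are chosen only for this illustration.

```r
A=c(TRUE, TRUE, FALSE, FALSE)
B=c(TRUE, FALSE, TRUE, FALSE)

# & and | work entry by entry on logic vectors
TruthTable=data.frame(A=A, B=B, AandB=A & B, AorB=A | B)
print(TruthTable)

# ! negates a logic value ("not")
print(!A)
```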

Data Objects

  • Single Value Object
  • Vector Object
  • Data Frame (Tibble) Object
  • List Object (not covered in this course)
  • Advanced Objects such as plots, models, recipes

Single Value Object

A single value object stores just one value:

A=123.768
B=3
C="Hello World"
IsLifeGood=TRUE

Vector-Objects

A vector object stores a sequence of values of the same data type (numerical, character, factor, or logic).

Example: Weather forecast for the next three days in Stattown:

VecTemp=c(70, 68, 55)
VecWindSpeed=c("low","low","high")
VecIsSunny=c(TRUE,TRUE,FALSE)

Vector objects can be passed as arguments to R functions to perform calculations:

MeanForecTemp=mean(VecTemp)
cat("The average forecasted temperature is", MeanForecTemp)
The average forecasted temperature is 64.33333
ForecDays=length(VecTemp)
cat("The forecast is for", ForecDays, "days.")
The forecast is for 3 days.
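
Beyond mean() and length(), a few other common vector operations can be sketched with the same three vectors. The conversion to Celsius assumes that the temperatures are given in degrees Fahrenheit, which the example does not state explicitly; VecTempCelsius is a name chosen only for this illustration.

```r
VecTemp=c(70, 68, 55)
VecWindSpeed=c("low","low","high")
VecIsSunny=c(TRUE,TRUE,FALSE)

# Square brackets pick entries by position
print(VecTemp[1])            # 70

# A logic vector can be used to pick entries: temperatures on sunny days
print(VecTemp[VecIsSunny])   # 70 68

# Calculations are applied to every entry (assumption: values are in Fahrenheit)
VecTempCelsius=(VecTemp-32)*5/9
print(round(VecTempCelsius, 1))

# Because TRUE counts as 1, sum() counts the sunny days
print(sum(VecIsSunny))       # 2
```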

Data Frames (tibbles)

A data frame is similar to an Excel table (note: not all columns of the Titanic data frame are shown).

Survived Pclass Sex    Age FareInPounds
0        3      male   22  7.2500
1        1      female 38  71.2833
1        3      female 26  7.9250
1        1      female 35  53.1000
0        3      male   35  8.0500
0        3      male   27  8.4583
0        1      male   54  51.8625
0        3      male   2   21.0750
1        3      female 27  11.1333
1        2      female 14  30.0708
1        3      female 4   16.7000
1        1      female 58  26.5500

A data frame consists of vectors that make up its columns. These are the variables for the data analysis (remember: observations are in the rows, variables are in the columns).

DataTitanic=import("https://ai.lange-analytics.com/data/Titanic.csv")
str(DataTitanic)
'data.frame':   887 obs. of  8 variables:
 $ Survived             : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass               : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name                 : chr  "Mr. Owen Harris Braund" "Mrs. John Bradley Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath Futrelle" ...
 $ Sex                  : chr  "male" "female" "female" "female" ...
 $ Age                  : num  22 38 26 35 35 27 54 2 27 14 ...
 $ SiblingsSpousesAboard: int  1 1 0 1 0 0 0 3 0 1 ...
 $ ParentsChildrenAboard: int  0 0 0 0 0 0 0 1 2 0 ...
 $ FareInPounds         : num  7.25 71.28 7.92 53.1 8.05 ...
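
To illustrate that a data frame is built from column vectors, the following sketch assembles a tiny data frame by hand from the first three rows of the table shown above. The name DataSmall is chosen only for this illustration; no download is needed.

```r
# First three rows of the Titanic table shown above, typed in by hand
Survived=c(0, 1, 1)
Sex=c("male", "female", "female")
Age=c(22, 38, 26)

# Each vector becomes one column (variable); each position becomes one row (observation)
DataSmall=data.frame(Survived, Sex, Age)
print(DataSmall)

# $ extracts a column as a vector again
print(DataSmall$Age)

# nrow() and ncol() return the number of observations and variables
cat("Rows:", nrow(DataSmall), "Columns:", ncol(DataSmall))
```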

Extracting Vectors and Performing Calculations (Numerical Vectors)

VecFareInPounds=DataTitanic$FareInPounds
AvgFare=mean(VecFareInPounds)
cat("The average fare of Titanic passengers was:", AvgFare, "British Pounds")
The average fare of Titanic passengers was: 32.30542 British Pounds

Extracting Vectors and Performing Calculations (Logical Vectors)

DataTitanic$Survived=as.logical(DataTitanic$Survived)
str(DataTitanic)
'data.frame':   887 obs. of  8 variables:
 $ Survived             : logi  FALSE TRUE TRUE TRUE FALSE FALSE ...
 $ Pclass               : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name                 : chr  "Mr. Owen Harris Braund" "Mrs. John Bradley Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath Futrelle" ...
 $ Sex                  : chr  "male" "female" "female" "female" ...
 $ Age                  : num  22 38 26 35 35 27 54 2 27 14 ...
 $ SiblingsSpousesAboard: int  1 1 0 1 0 0 0 3 0 1 ...
 $ ParentsChildrenAboard: int  0 0 0 0 0 0 0 1 2 0 ...
 $ FareInPounds         : num  7.25 71.28 7.92 53.1 8.05 ...
SurvRate=mean(DataTitanic$Survived)
cat("The average survival rate of Titanic passengers was:", SurvRate)
The average survival rate of Titanic passengers was: 0.3855693
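
Why does mean() of a logic vector return a rate? The sketch below works through this with only the first ten values of Survived, copied from the str() output above. The result therefore differs from the survival rate for all 887 passengers.

```r
# First ten values of Survived, copied from the str() output above
VecSurvived=as.logical(c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1))

# TRUE counts as 1 and FALSE as 0, so sum() counts the survivors ...
NumSurvived=sum(VecSurvived)

# ... and mean() equals the number of survivors divided by the number of passengers
RateByHand=NumSurvived/length(VecSurvived)
RateByMean=mean(VecSurvived)
cat("Survivors:", NumSurvived, "Rate by hand:", RateByHand, "Rate with mean():", RateByMean)
```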

Data Frames vs. Tibbles 🤓

A tibble is a more advanced sub-type of a data frame. If needed, a regular data frame can be coerced into a tibble with the as_tibble() command.

A few of the differences between data frames and tibbles:

  1. A data frame outputs all its rows and columns by default. A tibble outputs only the first 10 rows and the variables that fit on the screen but provides information about omitted variables and rows.

  2. A data frame can have row names, while a tibble cannot.

  3. In R versions before 4.0.0, data.frame() converted all character values to factor type by default. This conversion was often confusing and annoying. In contrast, a tibble coerces character values into factors only on demand. Since R version 4.0.0, regular data frames behave the same as tibbles in this respect.
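
The third difference can be sketched in base R, assuming a current R version is installed. The argument stringsAsFactors of data.frame() switches the automatic conversion on; the names DataDefault and DataOld are chosen only for this illustration.

```r
# Default behavior in current R versions: the column stays character
DataDefault=data.frame(Sex=c("male", "female", "female"))
str(DataDefault$Sex)

# stringsAsFactors=TRUE reproduces the old default: the column becomes a factor
DataOld=data.frame(Sex=c("male", "female", "female"), stringsAsFactors=TRUE)
str(DataOld$Sex)

# On demand, a character column can still be converted explicitly
DataDefault$Sex=as.factor(DataDefault$Sex)
str(DataDefault$Sex)
```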

Summary Data Types and Objects

How are Very Big Numbers Presented

The GDP of the US in 2021 was $22,996,086,000,000 (rounded to millions).

\[\begin{eqnarray*} GDP&=&2.2996086 \cdot 10000000000000\\ &\Longleftrightarrow&\\ GDP&=&2.2996086 \cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10 \\ &\Longleftrightarrow&\\ GDP&=&2.2996086 \cdot 10^{13} \end{eqnarray*}\]

Let us see what R does:

GDPUS=22996086000000
print(GDPUS)
[1] 2.299609e+13
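
The output 2.299609e+13 is R's way of writing \(2.299609 \cdot 10^{13}\); only the display is rounded, not the stored value. As a sketch, the same notation can be used to enter a number, and format() can display all digits when needed. The name GDPText is chosen only for this illustration.

```r
GDPUS=22996086000000

# The e-notation can also be used for input: 2.2996086e13 means 2.2996086 * 10^13
print(GDPUS==2.2996086e13)   # TRUE

# format() returns the number as text in fixed notation with thousands separators
GDPText=format(GDPUS, scientific=FALSE, big.mark=",")
print(GDPText)
```

Note that format() returns a character value, so GDPText is meant for display and not for further calculations.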

How are Very Small Numbers Presented 🤓

The probability of getting struck by lightning in the US is about \(0.00000000365\) on any randomly chosen day.

\[\begin{eqnarray*} ProbLight&=&\frac{3.65}{1000000000} \\ &\Longleftrightarrow&\\ ProbLight&=&\frac{3.65}{10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10\cdot 10} \\ &\Longleftrightarrow&\\ ProbLight&=&\frac{3.65}{10^{9}}\\ &\Longleftrightarrow&\\ ProbLight&=&3.65 \cdot 10^{-9} \end{eqnarray*}\]

In the U.S., a person has a 1:10,000 lifetime risk of being struck by lightning. Assuming a life span of 75 years and 365.25 days per year, the probability per day is:

\[\frac{1}{10,000 \cdot 365.25\cdot 75}\approx 0.00000000365\]

Let us see what R does:

ProbStruck=0.00000000365
cat("Probability of getting struck by lightning on an avg. day:", ProbStruck)
Probability of getting struck by lightning on an avg. day: 3.65e-09
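
A negative exponent after the e means that the decimal point moves to the left. The following sketch shows how to switch between the two notations with format(). The value 2.5e-4 is a hypothetical example chosen only for illustration; it is not related to the lightning data.

```r
# Hypothetical example value, chosen only for illustration
SmallValue=2.5e-4   # means 2.5 * 10^-4

# Both spellings denote the same number
print(SmallValue==0.00025)   # TRUE

# format() converts between the two notations (the result is text)
FixedText=format(SmallValue, scientific=FALSE)
SciText=format(0.00025, scientific=TRUE)
print(FixedText)
print(SciText)
```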

Questions