Key Machine Learning Concepts

Explained with Linear Regression

Loading Required Libraries

library(tidymodels)
library(rio)
library(kableExtra)
library(janitor)
DataMockup=import("https://ai.lange-analytics.com/data/DataStudyTimeMockup.rds")

What Will You Learn

  • Reviewing the basic idea behind linear regression

  • Learning how to measure predictive quality with the Mean Squared Error (\(MSE\))

  • Understanding the role of parameters in a machine learning model in general and in linear regression in particular

  • Calculating optimal regression parameters using OLS

  • Finding optimal regression parameters by trial and error

  • Distinguishing between unfitted and fitted models

  • Using the tidymodels package to split observations from a dataset randomly into a training and testing dataset.

  • Understanding how categorical data such as the sex of a person (female/male) can be transformed into a numerical dummy variable.

  • Being able to distinguish between dummy encoding and one-hot encoding

  • Using tidymodels including model design and data pre-processing (recipes) to analyze housing prices.

Jumping Right Into It

Univariate OLS with a Real World Dataset

Data Description:

  • King County House Sale dataset (Kaggle 2015). House sales prices from May 2014 to May 2015 for King County in Washington State.

  • Several predictor variables; for now we use only \(Sqft\) as the predictor.

  • We will only use 100 randomly chosen observations from the total of 21,613 observations.

Loading the Data and Assigning Training and Testing Data (manually)

Code
DataHouses=
  import("https://ai.lange-analytics.com/data/HousingData.csv") |>
  clean_names("upper_camel") |>
  select(Price, Sqft=SqftLiving) 

# Manually generating DataTrain and DataTest
# (sampled independently for simplicity; a few observations may appear in both)
set.seed(7771)
DataTrain= sample_n(DataHouses, 100)
DataTest= sample_n(DataHouses, 50)
head(DataTrain)
   Price Sqft
1 517000 1180
2 236000 1300
3 490000 2800
4 129000 1150
5 257000 1400
6 312500  870

How much is a House Worth in King County?

A house with average properties should be predicted with an average price!

Code
MeanSqft=mean(DataTrain$Sqft)
cat("The mean square footage of a house in King County is:", MeanSqft)
The mean square footage of a house in King County is: 1956.7
Code
MeanPrice=mean(DataTrain$Price)
cat("The mean price of a house in King County is:", MeanPrice)
The mean price of a house in King County is: 521294.2

Predicting the Price of an Average Sized House as the Average of all House Prices

An Interactive Graph That Explains it All

https://econ.lange-analytics.com/calcat/linregrmeans

From Unfitted to Fitted Model

What Does the Unfitted Model Look Like?

\[ \underbrace{\widehat{Price}}_\widehat{y}=\underbrace{\beta_1}_m \underbrace{Sqft}_x + \underbrace{\beta_0}_b \]

Fitting the Model with Tidymodels

Code
RecipeHouses= recipe(Price~Sqft, data=DataTrain)

ModelDesignOLS= linear_reg() %>% 
                set_engine("lm") %>% 
                set_mode("regression")

WFModelHouses = workflow() %>%  
  add_recipe(RecipeHouses) %>% 
  add_model(ModelDesignOLS) %>% 
  fit(DataTrain)

tidy(WFModelHouses)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   52509.   64183.      0.818 4.15e- 1
2 Sqft            240.      30.6     7.84  5.67e-12

Unfitted Model vs Fitted Workflow Model

Unfitted Model: \[ \underbrace{\widehat{Price}}_\widehat{y}=\underbrace{\beta_1}_m \underbrace{Sqft}_x + \underbrace{\beta_0}_b \]

Code
tidy(WFModelHouses)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   52509.   64183.      0.818 4.15e- 1
2 Sqft            240.      30.6     7.84  5.67e-12

Fitted Model: \[ \underbrace{\widehat{Price}}_\widehat{y}=\underbrace{240}_m \cdot\underbrace{Sqft}_x + \underbrace{52509}_b \]
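The fitted model can be used for a prediction by hand. For a hypothetical 2,000 sqft house (an illustrative value, not taken from the dataset):

\[ \widehat{Price}=240\cdot 2000+52509=532{,}509 \]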

Interpretation and Significance

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   52509.   64183.      0.818 4.15e- 1
2 Sqft            240.      30.6     7.84  5.67e-12

\[ \begin{align} \widehat{Price}&=240 \cdot Sqft + 52509\\ (+240)&=240\cdot (+1) + (+0)\\ (+480)&=240\cdot (+2) + (+0)\\ (+720)&=240\cdot (+3) + (+0) \end{align} \] For each extra \(Sqft\) the predicted price increases by $240

The variable \(Sqft\) is statistically significant. I.e., the probability of observing a coefficient this large if the true \(\beta_1\) were zero (the p-value) is extremely small.

How Does the Fitted Model That Considers \(Sqft\) Improve the Prediction Compared to a Simple Average?

Look at the simulation again and choose 240 for \(\beta_1\):

https://econ.lange-analytics.com/calcat/linregrmeans

Evaluating Predictive Quality with the Testing Dataset

Code
DataTestWithPred=augment(WFModelHouses, new_data=DataTest)
metrics(DataTestWithPred, truth=Price, estimate=.pred)
# A tibble: 3 × 3
  .metric .estimator  .estimate
  <chr>   <chr>           <dbl>
1 rmse    standard   163476.   
2 rsq     standard        0.626
3 mae     standard   132050.   
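The rmse and mae reported by metrics() can also be computed directly from the augmented predictions. A minimal sketch, assuming DataTestWithPred from the chunk above is available:

```r
# Root mean squared error: square the errors, average them, take the square root
RMSEManual=sqrt(mean((DataTestWithPred$.pred - DataTestWithPred$Price)^2))

# Mean absolute error: average the absolute errors
MAEManual=mean(abs(DataTestWithPred$.pred - DataTestWithPred$Price))

cat("RMSE:", RMSEManual, " MAE:", MAEManual)
```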

Univariate Linear Regression - Data Table and Goal

The Regression:

\[ \widehat{y}_{i} = \beta_{1}x_{i}+\beta_{2} \]

The Goal

Find values for \(\beta_1\) and \(\beta_2\) that minimize the mean of the squared prediction errors \((\widehat{y}_{i}-y_i)^2\)

The Data Table

Mockup Training Dataset

 i  Grade (y)  StudyTime (x)
 1         65              2
 2         82              3
 3         93              7
 4         93              8
 5         83              4

Univariate Linear Regression - Data Diagram and Goal

The Regression:

\[ \widehat{y}_{i} = \beta_{1}x_{i}+\beta_{2} \]

The Goal

Find values for \(\beta_1\) and \(\beta_2\) that minimize the mean of the squared prediction errors \((\widehat{y}_{i}-y_i)^2\)

The Data Diagram

Code
Model123=lm(Grade~StudyTime, data=DataMockup)
PredGrade=predict(Model123, DataMockup)
ggplot(DataMockup, aes(x=StudyTime, y=Grade)) +
  geom_line(aes(y=PredGrade), color="red", linewidth=2.7) +
  geom_point(size=5, color="blue") +
  geom_point(aes(y=PredGrade), color="black", size=2.7) +
  geom_segment(aes(x=StudyTime, y=PredGrade,
                   xend=StudyTime, yend=Grade), linewidth=1.2) +
  scale_x_continuous("Study Time", breaks=seq(1,8)) +
  scale_y_continuous(limits=c(65,110), breaks=seq(60,100,5))

How to Measure Prediction Quality

\[\begin{eqnarray*} MSE & = & \frac{1}{N} \sum_{i=1}^{N}(\widehat{y}_{i}-y_{i})^{2} \\ & \Longleftrightarrow& \\ MSE & = & \frac{1}{N} \sum_{i=1}^{N}(\underbrace{\overbrace{\beta_{1}x_{i}+\beta_2}^{\mbox{Prediction $i$}}-y_i}_{\mbox{Error $i$}})^2 \end{eqnarray*}\]

Note, when the data are given (i.e., \(x_i\) and \(y_i\) are given), the \(MSE\) depends only on the choice of \(\beta_1\) and \(\beta_2\).

How to Measure Prediction Quality with the MSE

\[\begin{eqnarray} MSE & = & \frac{(\beta_1x_{1}+\beta_2-y_1)^2 +(\beta_1x_{2}+\beta_2-y_2)^2 + \cdots+ (\beta_1x_{5}+\beta_2-y_{5})^2}{5} \nonumber \\ & \Longleftrightarrow& \nonumber \\ MSE & = & \frac{1}{5}\left[ (\underbrace{\overbrace{\beta_1\cdot 2+\beta_2}^{\mbox{Prediction $1$}}-65}_{\mbox{Error $1$}})^2 +(\underbrace{\overbrace{\beta_1\cdot 3+\beta_2}^{\mbox{Prediction $2$}}-82}_{\mbox{Error $2$}})^2\right.\nonumber \\ & &\nonumber \\ && + (\underbrace{\overbrace{\beta_1\cdot 7+\beta_2}^{\mbox{Prediction $3$}}-93}_{\mbox{Error $3$}})^2 +(\underbrace{\overbrace{\beta_1\cdot 8+\beta_2}^{\mbox{Prediction $4$}}-93}_{\mbox{Error $4$}})^2\nonumber \\ & &\nonumber \\ && +\left. (\underbrace{\overbrace{\beta_1\cdot 4+\beta_2}^{\mbox{Prediction $5$}}-83}_{\mbox{Error $5$}})^2\right] \end{eqnarray}\]

Custom R Function to Calculate MSE

Function Call:

Code
VecBetaTest=c(4,61)
ResultMSE=FctMSE(VecBetaTest, DataMockup)
print(ResultMSE)
[1] 29.8

Function Definition:

Code
FctMSE=function(VecBeta, data){
  Beta1=VecBeta[1]
  Beta2=VecBeta[2]
  data=data |>
    rename(Y=1, X=2) |>
    mutate(YPred=Beta1*X+Beta2) |>
    mutate(Error=YPred-Y) |>
    mutate(ErrorSq=Error^2)

  MSE=mean(data$ErrorSq)

  return(MSE)}
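As a usage example, calling FctMSE() with the OLS-optimal parameters derived below (\(\beta_1\approx 3.96\), \(\beta_2\approx 64.18\)) returns a smaller \(MSE\) than the trial values c(4, 61) above:

```r
VecBetaOpt=c(3.96, 64.18)                  # OLS-optimal values (rounded)
ResultMSEOpt=FctMSE(VecBetaOpt, DataMockup)
print(ResultMSEOpt)                         # smaller than the 29.8 from c(4, 61)
```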

How to Find Optimal Values for \(\beta_1\) and \(\beta_2\)

Method 1:

Calculate optimal values for the parameters (the \(\beta\)s) based on Ordinary Least Squares (OLS) using two formulas. (Note, this method works only for linear regression.)

Method 2:

We can use a systematic trial and error process.

Method 1: Calculate Optimal Parameters (only for OLS!)

\[\begin{eqnarray*} \beta_{1,opt}&=& \frac {N \sum_{i=1}^N y_i x_i- \sum_{i=1}^N y_i \sum_{i=1}^N x_i} {N \sum_{i=1}^N x_i^2 - \left (\sum_{i=1}^N x_i \right )^2}=3.96\\ && \nonumber \\ \beta_{2,opt.}&=& \frac{ \sum_{i=1}^N y_i - \beta_1 \sum_{i=1}^N x_i} {N} = 64.18 \end{eqnarray*}\]
Code
DataTable=DataMockup |>
  mutate(GradeXStudyTime=Grade*StudyTime) |>
  mutate(StudyTimeSquared=StudyTime^2) 


kbl(DataTable |> mutate(i=1:5) |> select(i,everything()), 
    caption="Mockup Training Dataset ")|>
  add_header_above(c(" ", "y", "x", "y x","x x"), escape=F) |> 
  kable_styling(bootstrap_options=c("striped","hover"), full_width = F, position="center")
Mockup Training Dataset

 i  Grade (y)  StudyTime (x)  GradeXStudyTime (y·x)  StudyTimeSquared (x²)
 1         65              2                    130                      4
 2         82              3                    246                      9
 3         93              7                    651                     49
 4         93              8                    744                     64
 5         83              4                    332                     16

Column Sums

 Grade  StudyTime  GradeXStudyTime  StudyTimeSquared
   416         24             2103               142
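Plugging the column sums (with \(N=5\)) into the formulas above reproduces the reported values:

\[ \beta_{1,opt}=\frac{5\cdot 2103-416\cdot 24}{5\cdot 142-24^2}=\frac{531}{134}\approx 3.96, \qquad \beta_{2,opt}=\frac{416-3.963\cdot 24}{5}\approx 64.18 \]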

Method 2: Use a Systematic Trial and Error Process 🤓

  • Grid Search (aka Brute Force):

    1. For a given range of \(\beta_1\) and \(\beta_2\) values, build a table with pairs of all combinations of these \(\beta s\).
    2. Then use our custom FctMSE() command to calculate a \(MSE\) for each \(\beta\) pair.
    3. Find the \(\beta\) pair with the lowest \(MSE\)
  • Optimizer: Use R's built-in optimizer. Pass the start values for \(\beta_1\) and \(\beta_2\) together with the data to the optimizer as arguments. The rest is done by the optimizer.

  • See the R script in the footnote to see both algorithms in action.
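Both approaches can be sketched with the custom FctMSE() function from above; the ranges and variable names here are illustrative:

```r
# Method 2a: grid search over a range of beta pairs
GridBetas=expand.grid(Beta1=seq(3, 5, 0.01), Beta2=seq(60, 70, 0.1))
GridBetas$MSE=apply(GridBetas, 1,
                    function(row) FctMSE(c(row[1], row[2]), DataMockup))
GridBetas[which.min(GridBetas$MSE), ]   # beta pair with the lowest MSE

# Method 2b: R's built-in optimizer (Nelder-Mead by default);
# extra arguments such as data= are passed through to FctMSE()
OptResult=optim(par=c(1, 1), fn=FctMSE, data=DataMockup)
OptResult$par                           # approx. 3.96 and 64.18
```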

Multivariate OLS with a Real World Dataset

Data

Code
library(rio)
DataHousing =
  import("https://ai.lange-analytics.com/data/HousingDataSmall.csv")

  • King County House Sale dataset (Kaggle 2015). House sales prices from May 2014 to May 2015 for King County in Washington State.
  • Several predictor variables.
  • We will use all 21,613 observations.

Multivariate Analysis — Three Predictor Variables

Sqft: Living square footage of the house

Grade: Indicates the condition of the house (1 (worst) to 13 (best))

Waterfront: Is the house located at the waterfront (yes or no)

Code
library(tidyverse);library(rio);library(janitor);library(tidymodels)
DataHousing =
  import("https://ai.lange-analytics.com/data/HousingData.csv")|>
  clean_names("upper_camel") |>
  select(Price, Sqft=SqftLiving, Grade, Waterfront)

Unfitted Model: \[ \widehat{Price}=\beta_1 Sqft+\beta_2 Grade+\beta_3 Waterfront_{yes} +\beta_4 \]

Multivariate Real World Dataset — Splitting

Code
set.seed(777)
Split7030=initial_split(DataHousing, prop=0.7, strata=Price, breaks=5)
DataTrain=training(Split7030)
DataTest=testing(Split7030)

DataTrain

   Price Sqft Grade Waterfront
1 221900 1180     7         no
2 180000  770     6         no
3 189000 1200     7         no
4 230000 1250     7         no
5 252700 1070     7         no
6 240000 1220     7         no

DataTest

    Price Sqft Grade Waterfront
1 1230000 5420    11         no
2  257500 1715     7         no
3  291850 1060     7         no
4  229500 1780     7         no
5  530000 1810     7         no
6  650000 2950     9         no

Dummy and One-Hot Encoding

One-Hot Encoding

Code
OneHotTable=tibble(Waterfront_yes=c(0,0,1,0),Waterfront_no=c(1,1,0,1))
print(OneHotTable)
# A tibble: 4 × 2
  Waterfront_yes Waterfront_no
           <dbl>         <dbl>
1              0             1
2              0             1
3              1             0
4              0             1

One-hot encoding is easier to interpret but causes problems in OLS (the dummy variable trap) because one variable is redundant. We can calculate one variable from the other (perfect multicollinearity):

\[Waterfront_{yes}=1-Waterfront_{no}\]

Dummy and One-Hot Encoding

Dummy Coding

We use one fewer variable than we have categories. Waterfront has two categories; therefore, we use one variable (e.g., Waterfront_yes):

Dummy Encoding Example

Code
DummyTable=tibble(Waterfront_yes=c(0,0,1,0))
print(DummyTable)
# A tibble: 4 × 1
  Waterfront_yes
           <dbl>
1              0
2              0
3              1
4              0

Note, dummy encoding can be done with step_dummy() in a tidymodels recipe.
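step_dummy() produces dummy encoding by default; one-hot encoding is available via its one_hot argument. A minimal sketch (RecipeDummy and RecipeOneHot are illustrative names):

```r
# Dummy encoding (default): one column fewer than categories -> Waterfront_yes
RecipeDummy=recipe(Price ~ ., data=DataTrain) |>
  step_dummy(Waterfront)

# One-hot encoding: one column per category -> Waterfront_no and Waterfront_yes
RecipeOneHot=recipe(Price ~ ., data=DataTrain) |>
  step_dummy(Waterfront, one_hot=TRUE)
```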

Multivariate Analysis — Building the Recipe

Code
RecipeHouses=recipe(Price ~ ., data=DataTrain) |> 
                    step_dummy(Waterfront)

Here is how the recipe later on (in the workflow) transforms the data:

Code
juice(RecipeHouses |> prep()) |> head()
# A tibble: 6 × 4
   Sqft Grade  Price Waterfront_yes
  <int> <int>  <dbl>          <dbl>
1  1180     7 221900              0
2   770     6 180000              0
3  1200     7 189000              0
4  1250     7 230000              0
5  1070     7 252700              0
6  1220     7 240000              0

Multivariate Analysis — Building the Model Design

Unfitted Model:

ModelDesignHouses=linear_reg() |> 
  set_engine("lm") |> 
  set_mode("regression")
print(ModelDesignHouses)
Linear Regression Model Specification (regression)

Computational engine: lm 

Multivariate Analysis — Creating Workflow & Fitting to the Training Data

Code
WFModelHouses = workflow() |>  
      add_recipe(RecipeHouses) |> 
      add_model(ModelDesignHouses) |> 
      fit(DataTrain)
tidy(WFModelHouses)
# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    -570056.  15133.       -37.7 6.63e-297
2 Sqft               180.      3.25      55.2 0        
3 Grade            95214.   2548.        37.4 1.65e-292
4 Waterfront_yes  868338.  22200.        39.1 7.12e-319
Code
glance(WFModelHouses)
# A tibble: 1 × 12
  r.squared adj.r.squared   sigma statistic p.value    df   logLik    AIC    BIC
      <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>    <dbl>  <dbl>  <dbl>
1     0.581         0.581 238574.     7002.       0     3 -208785. 4.18e5 4.18e5
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
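Written out, the fitted model is \(\widehat{Price}=180\,Sqft+95214\,Grade+868338\,Waterfront_{yes}-570056\). For a hypothetical 2,000 sqft, grade-8 house not at the waterfront (illustrative values, not from the dataset):

\[ \widehat{Price}=180\cdot 2000+95214\cdot 8+868338\cdot 0-570056=551{,}656 \]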

Multivariate Analysis — Predicting Testing Data and Metrics

Code
DataTestWithPredictions = augment(WFModelHouses, new_data=DataTest)
metrics(DataTestWithPredictions, truth=Price, estimate=.pred)
# A tibble: 3 × 3
  .metric .estimator  .estimate
  <chr>   <chr>           <dbl>
1 rmse    standard   244656.   
2 rsq     standard        0.549
3 mae     standard   163358.   

Exercise

Run the Analysis

https://ai.lange-analytics.com/exc/?file=05-LinRegrExerc100.Rmd