Types, Tasks, Terminology
Categorizing AI, Machine Learning, and Deep Learning
First 3 Observations (records) of the Housing Dataset (to predict house prices)
Obs  Price   Sqft  Bedrooms  Waterfront
1    221900  1180  3         no
2    538000  2570  3         no
3    180000  770   2         no
Tidy data:
Outcome Variable: The variable that is the outcome of the prediction (\(Price\))
Predictor Variables: The variables that predict an outcome (\(Sqft\), \(Bedrooms\), \(Waterfront\))
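The split into outcome and predictor variables can be sketched in a few lines of plain Python. This is only an illustration and not the course's own workflow; the names `housing`, `OUTCOME`, `outcome`, and `predictors` are assumptions introduced here, and the records are the three observations from the table.

```python
# The three observations from the housing table, one dict per record.
housing = [
    {"Price": 221900, "Sqft": 1180, "Bedrooms": 3, "Waterfront": "no"},
    {"Price": 538000, "Sqft": 2570, "Bedrooms": 3, "Waterfront": "no"},
    {"Price": 180000, "Sqft": 770, "Bedrooms": 2, "Waterfront": "no"},
]

OUTCOME = "Price"  # the variable we want to predict (name chosen for this sketch)

# Outcome variable: one value per observation.
outcome = [row[OUTCOME] for row in housing]

# Predictor variables: every column except the outcome.
predictors = [
    {name: value for name, value in row.items() if name != OUTCOME}
    for row in housing
]

print(outcome)
print(predictors[0])
```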
Example linear regression: \[Price=\beta_1 \cdot Sqft+ \beta_2 \cdot Bedrooms +\beta_3 \cdot Waterfront+\beta_4\]
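Synonyms for Outcome Variable: dependent variable, response variable, target, label
Synonyms for Predictor Variables: independent variables, explanatory variables, features, regressors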
Predicting means that we use the values of one or more known variables to estimate an outcome. A prediction can be a forecast of a future value or an estimate for the same time period as the known variables.
Variables that are based on a prediction are marked with a hat (e.g., \(\widehat{Price_i}\)).
A model is what we use for predicting an outcome variable based on values of predictor variables — given certain assumptions.
\[\widehat{Price_i}=\beta_1 Sqft_i + \beta_2\]
Can we use the model from the previous slide to predict the price of a house if we know the value of the house's predictor variable (e.g., \(Sqft=1000\))?
Only if we know the values of the parameters (the \(\beta s\))!
Suppose OLS based on data determines that \(\beta_1=300\) and \(\beta_2=500,000\):
\[\widehat{Price_i}=300 Sqft_i + 500000\]
A model whose parameters (the \(\beta s\)) have been determined by a machine learning algorithm is called a fitted model.
A fitted model can be used for predictions. E.g., a house with 1,000 sqft is predicted to cost $800,000.
In our case:
\[\widehat{Price_i}=300 \cdot 1,000 + 500,000= 800,000\]
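The same calculation can be written as a small function. This is a minimal sketch in plain Python; the function name `predict_price` is an assumption made for illustration, and the default parameter values are the ones supposed in the text (\(\beta_1=300\), \(\beta_2=500,000\)).

```python
def predict_price(sqft, beta_1=300, beta_2=500_000):
    """Predicted price for a house with the given square footage,
    using the fitted model Price-hat = beta_1 * Sqft + beta_2."""
    return beta_1 * sqft + beta_2


print(predict_price(1000))  # 300 * 1000 + 500000 = 800000
```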
The \(\beta s\) of a model are the parameters. The parameters are determined by the optimizer of a machine learning algorithm.
Machine learning can be (over)simplified to the following steps:
1. Define the model, including its parameters (the \(\beta s\)).
2. Use machine learning to determine the values of the \(\beta s\), which creates a fitted model.
3. Use the fitted model to predict based on predictor variables.
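The three steps can be sketched as follows, assuming a model with one predictor (\(Sqft\)) and the closed-form OLS solution for a single predictor as the "optimizer". The function names `model` and `fit_ols` are assumptions made for illustration. Because only the three observations from the table are used, the resulting \(\beta s\) will differ from the values supposed earlier.

```python
# Step 1: define the model. Here: Price-hat = beta_1 * Sqft + beta_2.
def model(sqft, beta_1, beta_2):
    return beta_1 * sqft + beta_2


# Step 2: determine the betas from data (closed-form OLS, one predictor).
def fit_ols(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta_1 = s_xy / s_xx               # slope
    beta_2 = mean_y - beta_1 * mean_x  # intercept
    return beta_1, beta_2


# The three observations from the housing table.
sqft = [1180, 2570, 770]
price = [221900, 538000, 180000]
beta_1, beta_2 = fit_ols(sqft, price)

# Step 3: use the fitted model to predict the price of a 1,000 sqft house.
predicted = model(1000, beta_1, beta_2)
print(beta_1, beta_2, predicted)
```

Three observations are far too few for a useful model; the sketch only shows how the steps connect.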
Training Dataset
When data are used to calibrate the parameters by minimizing some type of prediction error (training the model), most but not all of the observations are used.
Only about 60% – 90% of the total observations are usually used to calibrate the parameters (the \(\beta s\) of the model). These observations are randomly chosen and the resulting dataset is called the training dataset.
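A random split of this kind might look like the sketch below. It is written in plain Python for illustration; the function name `train_test_split`, the 80% share (one value inside the 60% to 90% range), the seed, and the stand-in observation IDs are all assumptions. The observations left over after the random draw are set aside and not used for calibration.

```python
import random


def train_test_split(observations, train_share=0.8, seed=42):
    """Randomly assign observations to a training part and a left-over part."""
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = list(observations)  # copy so the input is not modified
    rng.shuffle(shuffled)
    n_train = round(train_share * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


observations = list(range(1, 101))  # stand-in for 100 observation IDs
training, testing = train_test_split(observations)
print(len(training), len(testing))
```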
Testing Dataset
Observations that were not randomly chosen for training make up the testing dataset. Testing data are never used to optimize model performance in any way! Instead, they form a hold-out dataset used to assess the predictive quality of a model.
Using the training dataset for this purpose is not an option, because we would then measure how well the model approximates the training data rather than its predictive quality on new data, that is, data the model has never seen before.
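Assessing predictive quality on hold-out data can be sketched as below. The root mean squared error (RMSE) is used here as one common type of prediction error; the text does not prescribe a specific metric. The function names `rmse` and `fitted_model` are assumptions, the fitted model is the one supposed earlier, and treating the three table rows as testing data is purely for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fitted_model(sqft):
    # Fitted model supposed in the text: Price-hat = 300 * Sqft + 500000
    return 300 * sqft + 500_000


# Assumption for illustration: the three table rows act as the testing data.
test_sqft = [1180, 2570, 770]
test_price = [221900, 538000, 180000]

test_predictions = [fitted_model(s) for s in test_sqft]
print(rmse(test_price, test_predictions))
```

A large error on the testing data signals that the fitted model predicts poorly for houses it has not seen, even if it fits the training data well.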
Machine Learning Software
R (tidyverse and tidymodels packages)