Recurrent Neural Networks (e.g. Long Short Term Memory recurrent networks)
Generative Adversarial Networks
AutoEncoders
Transformers
Multi-Layer Perceptron (MLP) Neural Network
Input Layer: with one or more input neurons.
Hidden Layer(s): one or more hidden layers with one or more hidden neurons.
Output Layer: with one or more output neurons.
Fully connected: each neuron in each of the layers is connected to all neurons of the following layer.
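This layer structure can be written as a forward pass. Below is a minimal, illustrative Python sketch (the course itself uses R); the 1-based \(\beta\) numbering is an assumption chosen to match the later slides, where \(\beta_7\) to \(\beta_9\) belong to the output neuron:

```python
import math

def logistic(x):
    """Logistic (sigmoid) activation: maps any effective input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_predict(x1, x2, b):
    """Forward pass of a fully connected 2-input, 2-hidden-neuron, 1-output MLP.

    b is 1-indexed (b[0] unused): b[1..3] and b[4..6] are the bias and
    weights of the two hidden neurons, b[7..9] those of the output neuron.
    """
    # Each hidden neuron receives all inputs (fully connected)
    act1 = logistic(b[1] + b[2] * x1 + b[3] * x2)
    act2 = logistic(b[4] + b[5] * x1 + b[6] * x2)
    # The output neuron combines the hidden activities linearly
    return b[7] + b[8] * act1 + b[9] * act2
```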
Example for an MLP Neural Network with One Hidden Layer
The Data
We will estimate diamond prices based on their physical properties and use the well-known diamonds dataset automatically loaded together with tidymodels:
Domain Knowledge: The Four Cs to Appraise a Diamond
Cut: Refers to the facets, symmetry, and reflective qualities of a diamond. The cut of a diamond is directly related to its overall sparkle and beauty.
Color: Refers to the natural color or lack of color visible within a diamond. The closer a diamond is to “colorless,” the higher its value.
Clarity: Is the visibility of natural microscopic inclusions and imperfections within a diamond. Diamonds with little to no inclusions are considered particularly rare and highly valued.
Carat: Is the unit of measurement used to describe the weight of a diamond. It is often the most visually apparent factor when comparing diamonds.
Data Engineering
We start with a very basic model with 2 predictors for \(Price\):
\(Carat\) (the weight of the diamond; one metric carat equals \(0.2\) grams),
\(Clarity\) (eight categories with \(8\) being the best).
To later increase training speed, we use only 10,000 observations.
Use a Trained Neural Network (\(\beta s\) are known) to Predict
Effective Inputs to Hidden Neurons:
Use a Trained Neural Network (\(\beta s\) are known) to Predict
Calculate Activity in Hidden Neurons with Logistic Function
Use a Trained Neural Network (\(\beta s\) are known) to Predict
Calculate Prediction from Activities in Hidden Neurons:
Prediction of the Neural Network
\[\widehat P =\beta_7 + \beta_8 A^{ct}_1 + \beta_9 A^{ct}_2\]
A neural network can be transformed into a prediction equation that depends only on the \(\beta s\) and the values of the predictor variables!
We will show this in more detail on the following slides.
Transformation From Neural Network to Prediction Equation
\[\widehat P =\beta_7 + \beta_8 A^{ct}_1 + \beta_9 A^{ct}_2\]
\(A^{ct}_1\) and \(A^{ct}_2\) depend on \(I^{np\ eff}_1\) and \(I^{np\ eff}_2\) (and the \(\beta s\))
\(I^{np\ eff}_1\) and \(I^{np\ eff}_2\) depend on the values of predictor variables \(Carat\) and \(Clarity\) (and the \(\beta s\))
Consequently, the prediction depends only on the values of the predictor variables and the \(\beta s\)!
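This substitution can be checked numerically: stepping through the network and evaluating the fully written-out equation give identical predictions. An illustrative Python sketch (the course itself uses R); the \(\beta\) values are arbitrary placeholders, not trained values:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary placeholder betas, 1-indexed to match the slides (b[0] unused)
b = [None, 0.4, 2.1, 0.3, -1.0, 1.5, 0.2, 100.0, 8000.0, 5000.0]

def predict_stepwise(carat, clarity):
    """Move left to right through the network, step by step."""
    inp_eff_1 = b[1] + b[2] * carat + b[3] * clarity
    inp_eff_2 = b[4] + b[5] * carat + b[6] * clarity
    a1, a2 = logistic(inp_eff_1), logistic(inp_eff_2)
    return b[7] + b[8] * a1 + b[9] * a2

def predict_equation(carat, clarity):
    """The same prediction written as one closed-form equation in the betas."""
    return (b[7]
            + b[8] / (1 + math.exp(-(b[1] + b[2] * carat + b[3] * clarity)))
            + b[9] / (1 + math.exp(-(b[4] + b[5] * carat + b[6] * clarity))))
```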
Transformation From Neural Network to Prediction Equation
To show the transformation, we move backwards from right to left through the neural network.
\[\widehat P =\beta_7 + \beta_8 A^{ct}_1 + \beta_9 A^{ct}_2\]
Transformation From Neural Network to Prediction Equation
Inside the Hidden Neurons:
Transformation From Neural Network to Prediction Equation
Inside the Hidden Neurons
\[\widehat P =\beta_7 + \beta_8 A^{ct}_1 + \beta_9 A^{ct}_2\]
Optimizer adjusts \(\beta s\) incrementally (iteration by iteration; the iterations are called epochs)
Each epoch:
Find out whether each individual \(\beta_i\) needs to be increased or decreased:
Increase \(\beta_i\) and see if \(MSE\) increases or not.
Decrease \(\beta_i\) and see if \(MSE\) increases or not.
Reset \(\beta_i\) and note if \(\beta_i\) needs to be increased or decreased.
Repeat for all \(\beta s\)
Increase or decrease each \(\beta_i\) proportionally to the change of \(MSE\) it caused; multiply by a learning rate (e.g., 0.01) to keep the change small.
Run the process for several hundred or several thousand epochs.
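The epoch loop described above can be sketched as finite-difference gradient descent. To keep the sketch short, it fits a straight line rather than a full network, and the learning rate and epoch count are hypothetical settings, not values from the course:

```python
def predict(betas, x):
    # Toy model: a straight line, enough to illustrate the update rule
    return betas[0] + betas[1] * x

def mse(betas, data):
    return sum((y - predict(betas, x)) ** 2 for x, y in data) / len(data)

def train(data, betas, lr=0.01, epochs=500, h=1e-4):
    for _ in range(epochs):                  # one pass over all betas = one epoch
        grads = []
        for i in range(len(betas)):
            up = betas.copy(); up[i] += h    # increase beta_i and observe MSE ...
            down = betas.copy(); down[i] -= h  # ... then decrease it and observe
            # Note direction and size of the MSE change (beta_i itself is reset,
            # because 'up' and 'down' are copies)
            grads.append((mse(up, data) - mse(down, data)) / (2 * h))
        # Adjust each beta against its gradient, scaled by the learning rate
        betas = [b - lr * g for b, g in zip(betas, grads)]
    return betas
```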
Example: Approximation Properties of Neural Networks
Let us run an example to see how well a Neural Network can approximate.
In the example we will z-normalize the predictors.
Are you interested in why?
Then use the down-arrow to proceed with the slides.
Otherwise, use the left-arrow.
Why is Scaling of Predictors Needed?
Logistic Activation Function
If inputs are not scaled and lead to very large effective inputs, the slope of the activation function is very close to 0, and different effective inputs become indistinguishable.
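The saturation effect is easy to verify numerically. An illustrative Python sketch (the course itself uses R) with made-up unscaled values: large effective inputs all map to activations of essentially 1 with slope 0, while z-normalized inputs stay in the steep region of the logistic function:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_slope(x):
    s = logistic(x)
    return s * (1.0 - s)   # derivative of the logistic function

# Unscaled inputs saturate the function: both activations are ~1.0
# and the slope is ~0, so learning stalls and inputs are indistinguishable
print(logistic(500.0), logistic(900.0), logistic_slope(500.0))

# Z-normalization keeps inputs near 0, where the logistic is steep
values = [500.0, 900.0, 700.0, 300.0]
mean = sum(values) / len(values)
sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
z = [(v - mean) / sd for v in values]
print(logistic(z[0]), logistic(z[1]))  # clearly different activations
```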
Example: Approximation Properties of Neural Networks
To run the R-script with an example to see how well a Neural Network can approximate:
Click the link in the footer of this slide.
Theorem: Approximation Properties of Neural Networks
“Feedforward networks are capable of arbitrarily accurate approximation to any real-valued continuous function over a compact set.”
I.e.: Single hidden layer feedforward networks can approximate any measurable function arbitrarily well.
The app linked in the footer of this slide provides intuition for the Hornik, Stinchcombe, White proof.
Real World Example to Estimate Diamond Prices
You will use all four C variables: \(Carat\), \(Clarity\), \(Cut\), and \(Color\). \(Cut\) describes the quality of the cut of the diamond, rated from 1 (lowest) to 6 (highest), and \(Color\) rates the color of a diamond from 1 (highest) to 7 (lowest).
Instead of the nnet package, you will use the more advanced brulee package, which is based on PyTorch, a deep learning library originally developed by Facebook.
We will tune the hyper-parameters of the neural network (e.g., the number of hidden units) using cross validation.
Major Differences: nnet and brulee/PyTorch
brulee internally uses early stopping:
The epochs setting refers to the maximum number of epochs.
A validation set is held back from the training data.
Training is stopped when the validation error stops decreasing for 5 epochs.
brulee allows the use of ReLU activation functions.
ReLU Activation Function
\[Act_i=\max\left(0, I_i^{eff}\right)\]
Two ReLU functions can be combined into one step function similar to sigmoid functions.
See the link in the footer for a demo.
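Independently of the linked demo, the combination can be sketched in a few lines. An illustrative Python sketch: the difference of two scaled, shifted ReLUs rises from 0 to 1 over a narrow interval, much like a steep sigmoid (the width 1/k is a free parameter chosen here for illustration):

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def soft_step(x, k=100.0):
    """Difference of two scaled, shifted ReLUs: a step-like function at x = 0.

    For x < 0 the result is 0; it rises linearly to 1 over [0, 1/k]
    and stays at 1 afterwards. Larger k makes the step sharper.
    """
    return k * relu(x) - k * relu(x - 1.0 / k)
```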
Logistic Activation Function: Problem of Vanishing Gradient
Logistic Activation Function
Even when the activation is determined somewhere in the middle of the activation function, the slope is smaller than one. With multiple layers, this can propagate to a gradient that is practically zero, because the slopes from multiple layers are multiplied (chain rule).
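The multiplication effect can be quantified. An illustrative Python sketch: the logistic function's slope is at most 0.25 (at input 0), so even in the best case the chain rule shrinks the gradient by a factor of at least 4 per layer:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_slope(x):
    s = logistic(x)
    return s * (1.0 - s)   # derivative; maximum value 0.25 at x = 0

# Chain rule across layers multiplies the per-layer slopes together.
# Even with every layer at its steepest point, 10 layers give 0.25**10:
gradient = 1.0
for _ in range(10):
    gradient *= logistic_slope(0.0)
print(gradient)   # effectively zero
```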
ReLU Activation Function: No Problem of Vanishing Gradient
\[Act_i=\max\left(0, I_i^{eff}\right)\]
ReLU has a slope of one for all positive inputs, so multiplying slopes across layers does not shrink the gradient.
Now It’s Time to Run the Real-World Analysis
Go to the AI Book and find the analysis at the end of the Neural Network chapter: