2 A simple neural network in R
Let’s think about what we need for a neural network.
2.1 What’s in a network?
Our toy network will perform a simple regression task, and to explain its building blocks, we start by explaining what it has in common with standard linear regression (ordinary least squares, OLS).
2.1.1 Gradient descent
If we had to, we could do linear regression
\[\begin{equation*} \mathbf{X} \boldsymbol{\beta} = \mathbf{y} \tag{2.1} \end{equation*}\]from scratch in R. Not necessarily using the normal equations (as those imply invertibility of the covariance matrix):
\[\begin{equation*} \hat{\boldsymbol{\beta}} = \mathbf{{(X^t X)}^{-1} X^t y} \tag{2.2} \end{equation*}\]but iteratively, doing gradient descent. We start with a guess of the weight vector \(\boldsymbol{\beta}\), \(\hat{\boldsymbol{\beta}}\). Then in each iteration, we compute how far we are from our objective, that is, from correctly solving equation (2.1). Most often in regression, for this we’d calculate the sum of squared errors:
\[\begin{equation*} \mathcal{L} = \sum {(\mathbf{X} \hat{\boldsymbol{\beta}} - \mathbf{y})}^2 \tag{2.3} \end{equation*}\]This is our loss function. For that loss to go down, we need to update \(\hat{\boldsymbol{\beta}}\) in the right direction. How the loss changes with \(\hat{\boldsymbol{\beta}}\) is given by its gradient with respect to the same:
\[\begin{equation*} \nabla_{\hat{\boldsymbol{\beta}}} \mathcal{L} = 2 \mathbf{X}^t (\mathbf{X} \hat{\boldsymbol{\beta}} - \mathbf{y}) \tag{2.4} \end{equation*}\]Subtracting a fraction of that gradient from \(\hat{\boldsymbol{\beta}}\) – “descending” the gradient of the loss – will make it go down. This can be seen by looking at the first-order Taylor approximation of a function \(f\) (choosing a single-variable function for simplicity):
\[\begin{equation*} f(x + \delta x) \approx f(x) + f'(x) \delta x \tag{2.5} \end{equation*}\]If we set \(\delta x\) to a negative multiple of the derivative of \(f\) at \(x\), \(\delta x = -\eta f'(x)\) with \(\eta > 0\), we get
\[\begin{equation*} f(x - \eta f'(x)) \approx f(x) - \eta f'(x) f'(x) \end{equation*}\] \[\begin{equation*} f(x - \eta f'(x)) \approx f(x) - \eta (f'(x))^2 \tag{2.6} \end{equation*}\]This new value, \(f(x - \eta f'(x))\), is smaller than \(f(x)\) because on the right side, a positive value, \(\eta (f'(x))^2\), is subtracted.
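For a concrete instance, take \(f(x) = x^2\) at \(x = 3\), with \(\eta = 0.1\): the derivative there is \(f'(3) = 6\), so the update yields \(x = 3 - 0.1 \cdot 6 = 2.4\), and indeed \(f(2.4) = 5.76\) is smaller than \(f(3) = 9\).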
Ported to our task of loss minimization, where loss \(\mathcal{L}\) depends on a vector parameter \(\hat{\boldsymbol{\beta}}\), this would be
\[\begin{equation*} \mathcal{L}(\hat{\boldsymbol{\beta}} + \Delta \hat{\boldsymbol{\beta}}) \approx \mathcal{L}(\hat{\boldsymbol{\beta}}) + {\Delta \hat{\boldsymbol{\beta}}}^t \nabla_{\hat{\boldsymbol{\beta}}} \mathcal{L} \tag{2.7} \end{equation*}\]Now, again subtracting a fraction of the gradient, \(\Delta \hat{\boldsymbol{\beta}} = - \eta \nabla_{\hat{\boldsymbol{\beta}}} \mathcal{L}\), we have
\[\begin{equation*} \mathcal{L}(\hat{\boldsymbol{\beta}} - \eta \nabla_{\hat{\boldsymbol{\beta}}} \mathcal{L}) \approx \mathcal{L}(\hat{\boldsymbol{\beta}}) - \eta {(\nabla_{\hat{\boldsymbol{\beta}}} \mathcal{L})}^t \nabla_{\hat{\boldsymbol{\beta}}} \mathcal{L} \end{equation*}\]where again the new loss value is lower than the old one.
Iterating this process, we successively approach better estimates of \(\hat{\boldsymbol{\beta}}\). The scale parameter, \(\eta\), used to multiply the gradient is called the learning rate.
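To make the procedure concrete, here is a minimal sketch of gradient descent for linear regression in base R. The data, initial guess, learning rate, and number of iterations are all made up for illustration:

```r
set.seed(1)

# simulated data: design matrix with intercept column, true weights 1, 2, -1
X <- cbind(1, matrix(rnorm(100 * 2), ncol = 2))
y <- X %*% c(1, 2, -1) + rnorm(100)

beta_hat <- rep(0, 3)  # initial guess for the weight vector
eta <- 1e-3            # learning rate

for (i in 1:500) {
  grad <- 2 * t(X) %*% (X %*% beta_hat - y)  # gradient, as in equation (2.4)
  beta_hat <- beta_hat - eta * grad          # descend the gradient
}

t(beta_hat)  # should be close to 1, 2, -1
```

For comparison, the closed-form solution from equation (2.2) would be `solve(t(X) %*% X, t(X) %*% y)`.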
The process is analogous if we have a simple network. The main difference is that instead of one weight vector \(\hat{\boldsymbol{\beta}}\), we have several layers, each with their own weights that have to be updated.
Before going there, a quick summary of the concepts and building blocks we’ve now seen:
- Better weights are determined iteratively.
- On each iteration, with the current weight estimates, we calculate a new prediction, the current loss, and the gradient of the loss with respect to the weights.
- We update the weights, subtracting `learning_rate` times the gradient, and proceed to the next iteration.
The program below will follow this blueprint. We’ll fill out the sections soon:
```r
for (t in 1:200) {

  ### -------- Forward pass --------
  # here we'll compute the prediction

  ### -------- compute loss --------
  # here we'll compute the sum of squared errors

  ### -------- Backpropagation --------
  # here we'll pass through the network, calculating the required gradients

  ### -------- Update weights --------
  # here we'll update the weights, subtracting a portion of the gradients

}
```
2.1.2 From linear regression to a simple network
Let’s see how our simple network will be different from that process.
2.1.2.1 Implications of having multiple layers
The simple network will have two layers, the output layer (corresponding to the predictions above) and an intermediate (hidden) layer. Both layers have their corresponding weight matrices, `w1` and `w2`, and intermediate values computed at the hidden layer – called activations – are passed to the output layer for multiplication with its weight matrix.
Mirroring that multi-step forward pass, losses have to be propagated back through the network, such that both weight matrices may be updated. Backpropagation, understood in a conceptual way, means that gradients are computed via the chain rule of calculus; for example, the gradient of the loss with respect to `w1` in our example will be (spelled out as an equation right after this list)

- the gradient of the loss w.r.t. the predictions; times
- the gradient of the output layer activation w.r.t. the hidden layer activation (`w2`); times
- the gradient of the hidden layer activation w.r.t. `w1` (`X`, the matrix of input data).
2.1.2.2 Activation functions
In the above paragraph, we simplified slightly, making it look as though layer weights were applied with no action “in between.” In fact, usually a layer’s output, before being passed to the next layer, is transformed by an activation function, operating pointwise. Different activation functions exist; they all have in common that they introduce non-linearity into the computation.
Our example will use ReLU (“Rectified Linear Unit”) activation for the intermediate layer. ReLU sets negative input to 0 while leaving positive input as is. Activation functions add a further step to the backward pass, as well.
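As a quick illustration, here are ReLU and its role in the backward pass, written in base R; these helper functions are ours, just for exposition:

```r
# ReLU: set negative inputs to 0, pass positive inputs through unchanged
relu <- function(x) pmax(x, 0)

# backward pass: upstream gradients flow through only where the
# original input was positive
relu_grad <- function(x, upstream) ifelse(x > 0, upstream, 0)

relu(c(-1, 0.5, 2))                    # 0.0 0.5 2.0
relu_grad(c(-1, 0.5, 2), c(1, 1, 1))   # 0 1 1
```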
2.1.2.3 Weights and biases
In the linear regression example, we had a weight vector – a vector, with one element for each predictor.
In neural networks, layers normally consist of several “neurons” (or units), the exception sometimes being the output layer – namely, when there is a single prediction per observation.
Apart from that exception though, instead of weight vectors here we have weight matrices, connecting multiple “source” units to multiple “target” units.
Moreover, every unit has a so-called bias that is added to the output of the multiplication of inputs and weights. Thus, the biases are in fact vectors.
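To see how the shapes fit together, here is what a single layer computes, sketched in base R with sizes matching the example below (the names are ours):

```r
n_obs <- 100; n_in <- 3; n_hidden <- 32

inputs <- matrix(rnorm(n_obs * n_in), n_obs, n_in)   # 100 x 3
W <- matrix(rnorm(n_in * n_hidden), n_in, n_hidden)  # weight matrix, 3 x 32
b <- rep(0, n_hidden)                                # bias vector, one element per unit

# multiply inputs and weights, then add the bias vector to every row
h <- sweep(inputs %*% W, 2, b, "+")
dim(h)  # 100 32
```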
We now have all building blocks we need to define a training loop.
2.2 A simple network
Our blueprint for a simple network does not employ any deep learning libraries; however, for speed, predictability and intuitiveness (in the sense of comparability to Python’s NumPy) we make use of rray to manipulate array data.
Before getting to the network proper, we simulate some data for a typical regression problem.
2.2.1 Simulate data
Our data has three input columns and a single target column.
```r
library(rray)
library(dplyr)

# input dimensionality (number of input features)
d_in <- 3
# output dimensionality (number of predicted features)
d_out <- 1
# number of observations in training set
n <- 100

# create random data
x <- rray(rnorm(n * d_in), dim = c(n, d_in))
y <- x[ , 1] * 0.2 - x[ , 2] * 1.3 - x[ , 3] * 0.5 + rnorm(n)
```
With `x` and `y` being instances of rray-provided classes,

```r
class(x)
```

we can use operations like `rray_dot`, `rray_add` or `rray_transpose` on them. If you’ve used Python’s NumPy before, these will look familiar – there is one point of caution though: Although rray explicitly provides broadcasting, it lines up array dimensions from the left, not from the right side, in line with R’s column-major storage format.

Also reflecting column-major layout, rray prints array dimension data differently from base R – e.g., for two-dimensional arrays, the number of columns goes first:

```r
first_ten_rows <- x[1:10, ]
first_ten_rows
```
We also need the weight matrices \(w1\) and \(w2\), as well as the biases \(b1\) and \(b2\).
2.2.2 Initialize weights and biases
Again, we use rray, initializing the weights from a standard normal distribution, and the biases to \(0\):

```r
### initialize weights ---------------------------------------------------------

# dimensionality of hidden layer
d_hidden <- 32

# weights connecting input to hidden layer
w1 <- rray(rnorm(d_in * d_hidden), dim = c(d_in, d_hidden))
# weights connecting hidden to output layer
w2 <- rray(rnorm(d_hidden * d_out), dim = c(d_hidden, d_out))

# hidden layer bias
b1 <- rray(rep(0, d_hidden), dim = c(1, d_hidden))
# output layer bias
b2 <- rray(rep(0, d_out), dim = c(d_out, 1))
```
2.2.3 Training loop
Now for the training loop proper. The training loop here is the network.
The forward pass computes intermediate activations (also applying ReLU activation), actual predictions, and the loss.
The backward pass starts from the output and, making use of the chain rule, calculates the gradients of the loss with respect to \(w2\), \(b2\), \(w1\) and \(b1\). It then uses the gradients to update the parameters.
If you’re just starting out with neural networks, don’t worry too much about the details of matrix shapes and operations – all this will become a lot easier when we use full-fledged torch. Just try to develop an understanding of what this code does overall.
```r
learning_rate <- 1e-4

### training loop --------------------------------------------------------------

for (t in 1:200) {

  ### -------- Forward pass --------

  # compute pre-activations of hidden layer (dim: 100 x 32)
  h <- rray_dot(x, w1) + b1
  # apply activation function (dim: 100 x 32)
  h_relu <- rray_maximum(h, 0)
  # compute output (dim: 100 x 1)
  y_pred <- rray_dot(h_relu, w2) + b2

  ### -------- compute loss --------

  loss <- rray_pow(y_pred - y, 2) %>% rray_sum()
  if (t %% 10 == 0) cat("Epoch:", t, ", loss:", loss, "\n")

  ### -------- Backpropagation --------

  # gradient of loss w.r.t. prediction (dim: 100 x 1)
  grad_y_pred <- 2 * (y_pred - y)
  # gradient of loss w.r.t. w2 (dim: 32 x 1)
  grad_w2 <- rray_transpose(h_relu) %>% rray_dot(grad_y_pred)
  # gradient of loss w.r.t. hidden activation (dim: 100 x 32)
  grad_h_relu <- rray_dot(grad_y_pred, rray_transpose(w2))
  # gradient of loss w.r.t. hidden pre-activation (dim: 100 x 32)
  grad_h <- rray_if_else(h > 0, grad_h_relu, 0)
  # gradient of loss w.r.t. b2 (dim: 1 x 1)
  grad_b2 <- rray_sum(grad_y_pred)
  # gradient of loss w.r.t. w1 (dim: 3 x 32)
  grad_w1 <- rray_transpose(x) %>% rray_dot(grad_h)
  # gradient of loss w.r.t. b1 (dim: 1 x 32)
  grad_b1 <- rray_sum(grad_h, axes = 1)

  ### -------- Update weights --------

  w2 <- w2 - learning_rate * grad_w2
  b2 <- b2 - learning_rate * grad_b2
  w1 <- w1 - learning_rate * grad_w1
  b1 <- b1 - learning_rate * grad_b1

}
```
In the next chapter, we start introducing torch. Optimization will still be performed manually, but instead of rray we are going to use torch tensors.
2.2.4 Complete code
```r
library(rray)
library(dplyr)

### generate training data -----------------------------------------------------

# input dimensionality (number of input features)
d_in <- 3
# output dimensionality (number of predicted features)
d_out <- 1
# number of observations in training set
n <- 100

# create random data
x <- rray(rnorm(n * d_in), dim = c(n, d_in))
y <- x[ , 1] * 0.2 - x[ , 2] * 1.3 - x[ , 3] * 0.5 + rnorm(n)
# lm(as.matrix(y) ~ as.matrix(x)) %>% summary()

### initialize weights ---------------------------------------------------------

# dimensionality of hidden layer
d_hidden <- 32

# weights connecting input to hidden layer
w1 <- rray(rnorm(d_in * d_hidden), dim = c(d_in, d_hidden))
# weights connecting hidden to output layer
w2 <- rray(rnorm(d_hidden * d_out), dim = c(d_hidden, d_out))

# hidden layer bias
b1 <- rray(rep(0, d_hidden), dim = c(1, d_hidden))
# output layer bias
b2 <- rray(rep(0, d_out), dim = c(d_out, 1))

### network parameters ---------------------------------------------------------

learning_rate <- 1e-4

### training loop --------------------------------------------------------------

for (t in 1:200) {

  ### -------- Forward pass --------

  # compute pre-activations of hidden layer (dim: 100 x 32)
  h <- rray_dot(x, w1) + b1
  # apply activation function (dim: 100 x 32)
  h_relu <- rray_maximum(h, 0)
  # compute output (dim: 100 x 1)
  y_pred <- rray_dot(h_relu, w2) + b2

  ### -------- compute loss --------

  loss <- rray_pow(y_pred - y, 2) %>% rray_sum()
  if (t %% 10 == 0) cat("Epoch:", t, ", loss:", loss, "\n")

  ### -------- Backpropagation --------

  # gradient of loss w.r.t. prediction (dim: 100 x 1)
  grad_y_pred <- 2 * (y_pred - y)
  # gradient of loss w.r.t. w2 (dim: 32 x 1)
  grad_w2 <- rray_transpose(h_relu) %>% rray_dot(grad_y_pred)
  # gradient of loss w.r.t. hidden activation (dim: 100 x 32)
  grad_h_relu <- rray_dot(grad_y_pred, rray_transpose(w2))
  # gradient of loss w.r.t. hidden pre-activation (dim: 100 x 32)
  grad_h <- rray_if_else(h > 0, grad_h_relu, 0)
  # gradient of loss w.r.t. b2 (dim: 1 x 1)
  grad_b2 <- rray_sum(grad_y_pred)
  # gradient of loss w.r.t. w1 (dim: 3 x 32)
  grad_w1 <- rray_transpose(x) %>% rray_dot(grad_h)
  # gradient of loss w.r.t. b1 (dim: 1 x 32)
  grad_b1 <- rray_sum(grad_h, axes = 1)

  ### -------- Update weights --------

  w2 <- w2 - learning_rate * grad_w2
  b2 <- b2 - learning_rate * grad_b2
  w1 <- w1 - learning_rate * grad_w1
  b1 <- b1 - learning_rate * grad_b1

}
```