Implicit bias by large learning rate: Noise can be helpful for gradient descent
I will demonstrate how large learning rates can lead to implicit biases in a simple regression task. Code to reproduce the results can be found in the spoilers. We will use PyTorch.
The regression task
We generate normally distributed data points in $\mathbb{R}^{30}$, with the target being the square of the first coordinate, $y = x_1^2$. We use $n = 200$ points for training and another $200$ for testing.
Code to generate data
import torch as th
th.manual_seed(0)
d = 30 # Input dimension
n = 200 # Training examples / testing points
X = th.randn(n*2, d) # Generate random points
y = X[:,0]**2 # Ground truth
X_train, y_train = X[:n,:], y[:n] # Train/test split
X_test, y_test = X[n:,:], y[n:]
The neural network
We will use a single hidden layer neural network with a quadratic activation function. The hidden layer will have $m = 100$ nodes, so the network computes $f_\theta(x) = \sum_{j=1}^{m} b_j (a_j^\top x)^2$, where the columns $a_j$ of $A \in \mathbb{R}^{d \times m}$ and the output weights $b \in \mathbb{R}^m$ are the parameters $\theta = (A, b)$.
This neural network should be well suited to solve the regression task, since it is straightforward to find values for the parameters that solve the task exactly: for example, we may pick $a_1 = e_1$ (the first standard basis vector), $b_1 = 1$, and set all other parameters to zero; a quick check of this choice is included after the code block below.
We initialize $A$ with Xavier initialization and $b$ with zeros.
Code to make the neural network
def makeModel(seed = 0):
    m = 100 # Number of hidden nodes
    A = th.zeros(d, m, requires_grad=True)
    b = th.zeros(m, requires_grad=True)
    th.manual_seed(seed)
    th.nn.init.xavier_normal_(A)
    parameters = [A, b]
    predict = lambda X : (X@A)**2 @ b
    return parameters, predict
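As a quick check of the exact solution mentioned above, the hand-picked parameters reproduce the target perfectly. This snippet is my own addition (with m = 100 matching the value used in makeModel).

Code to check the exact solution
# Sanity check: a_1 = e_1, b_1 = 1 and everything else zero gives y = x_1^2 exactly
m = 100
A_exact = th.zeros(d, m)
A_exact[0, 0] = 1.0   # first column = first standard basis vector
b_exact = th.zeros(m)
b_exact[0] = 1.0
pred = (X_test @ A_exact)**2 @ b_exact
print('Exact-solution MSE = %.2f' % th.mean((pred - y_test)**2))  # prints 0.00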
To fit the model we run gradient descent on the Mean Square Error (MSE) over the training data. We train until the MSE on the training data is below $10^{-4}$.
Code to train the neural network
def trainModel(parameters, predict, lr):
    loss = 1e100
    while loss > 1e-4:
        loss = th.mean((predict(X_train)-y_train)**2)
        loss.backward()
        with th.no_grad():
            for param in parameters:
                param -= lr * param.grad # Gradient descent
                param.grad[:] = 0
    return th.mean((predict(X_test)-y_test)**2) # Return test MSE
print('MSE = %.2f'%trainModel(*makeModel(), 0.01))
We choose the learning rate 0.01 and train the neural network. It scores MSE = 0.40 on the test set. This is quite bad, as we clearly did not solve the task. As a reference for the scale of MSE, the constant model always predicting 1 (the expected value of $y$) gets MSE = 1.97.
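If you want to verify the baseline number, it is a one-line computation on the test data (this snippet is my own addition).

Code to compute the baseline MSE
# Constant baseline: always predict 1, the expected value of y = x_1^2
print('Baseline MSE = %.2f' % th.mean((1 - y_test)**2))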
Increasing learning rate
In deep learning, performance is often very dependent on hyperparameters. Let’s look at the effect of the learning rate.
Code to compare different learning rates
lr_list = th.arange(0.001,0.15,0.005)
mse_list = []
for lr in lr_list:
    mse = trainModel(*makeModel(), lr)
    mse_list.append(mse.item()) # Collect test MSEs to plot against lr_list
    print("MSE = %.2f"%mse)
For learning rates above 0.096 the optimization diverges, so we can’t go higher. We see that all learning rates up to 0.03 give the same bad MSE, but beyond that, larger learning rates improve performance. Interestingly, we can actually solve the task, but only if we choose learning rates right at the edge of divergence.
It should be noted that while the above plot paints a deceptively simple picture, it is not true in general that higher learning rate is better. The experiment seems robust against random seeds for initialization and data generation, but is quite fragile against changes in other hyperparameters, such as the number of hidden neurons and scale of initialization.
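If you want to probe the robustness claim yourself, a minimal sketch along the following lines repeats part of the sweep for a few initialization seeds. This is my own addition; checking robustness to the data seed would additionally require regenerating X and y.

Code to check robustness against initialization seeds
# Repeat a few learning rates for several initialization seeds
for seed in range(3):
    for lr in [0.01, 0.05, 0.09]:
        mse = trainModel(*makeModel(seed), lr)
        print('seed %d, lr %.2f: MSE = %.2f' % (seed, lr, mse))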
Label noise
We can achieve a similar effect without huge learning rates by adding “label noise”. Label noise is a form of data augmentation, where we generate more training data by modifying the original data. In each iteration of gradient descent, we replace the training targets by the original targets plus some noise. We choose standard normal noise.
Code to train with label noise
def trainModelWithLabelNoise(parameters, predict, lr, steps):
    for _ in range(steps):
        y = y_train + th.randn(y_train.shape) # Add fresh noise to the labels each step
        loss = th.mean((predict(X_train) - y)**2)
        loss.backward()
        with th.no_grad():
            for param in parameters:
                param -= lr * param.grad # Gradient descent
                param.grad[:] = 0
    return th.mean((predict(X_test)-y_test)**2) # Return test MSE
print('MSE = %.2f'%trainModelWithLabelNoise(*makeModel(), lr = 0.03, steps = 100000))
Training with learning rate 0.03 (which previously gave MSE = 0.40) now gives MSE = 0.01! However, notice that it required 100000 gradient descent steps to achieve this. Without label noise we only needed 1000 steps to converge (at which point the gradient is practically zero, so the parameters stop changing and running longer doesn’t change anything). Lower learning rates require even more steps.
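The step counts above can be measured with a small variant of trainModel that also counts iterations. The helper trainModelCountSteps is my own addition, not part of the original code.

Code to count gradient descent steps
def trainModelCountSteps(parameters, predict, lr):
    steps = 0
    loss = 1e100
    while loss > 1e-4:
        loss = th.mean((predict(X_train) - y_train)**2)
        loss.backward()
        with th.no_grad():
            for param in parameters:
                param -= lr * param.grad # Gradient descent
                param.grad[:] = 0        # Reset gradients for the next step
        steps += 1
    return steps

print('Steps to converge: %d' % trainModelCountSteps(*makeModel(), 0.03))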
The explanation
We will derive an explicit regularizer which approximates the implicit regularization performed by large-step gradient descent.
Let’s think of gradient descent as an approximation to gradient flow (think of gradient descent with infinitesimal step length). We have some loss function $L(\theta)$, and the gradient flow is the solution of $\dot\theta(t) = -\nabla L(\theta(t))$.
Consider a single step of gradient descent with learning rate $\eta$: we start the step with parameters $\theta_0$ and end at $\theta_0 - \eta\nabla L(\theta_0)$. Running the gradient flow from $\theta_0$ for time $\eta$ instead gives, by Taylor expansion,
$$\theta(\eta) = \theta_0 - \eta \nabla L(\theta_0) + \frac{\eta^2}{2}\nabla^2 L(\theta_0)\nabla L(\theta_0) + O(\eta^3).$$
We see that the gradient descent step only captures the first two terms, giving a truncation error of $O(\eta^2)$.
Let’s change the loss function to
$$\tilde L(\theta) = L(\theta) + \frac{\eta}{4}\,\|\nabla L(\theta)\|^2$$
and look at the new gradient flow
$$\dot\theta = -\nabla \tilde L(\theta) = -\nabla L(\theta) - \frac{\eta}{2}\,\nabla^2 L(\theta)\nabla L(\theta).$$
Taylor expanding this flow for time $\eta$, the extra term in the gradient cancels the $\frac{\eta^2}{2}\nabla^2 L\,\nabla L$ term of the expansion, leaving $\theta(\eta) = \theta_0 - \eta\nabla L(\theta_0) + O(\eta^3)$.
The gradient descent step with respect to the original loss thus approximates the gradient flow with respect to the new loss with a truncation error of only $O(\eta^3)$.
In this view, it is more accurate to say that gradient descent is optimizing the regularized loss $\tilde L = L + \frac{\eta}{4}\|\nabla L\|^2$ rather than $L$ itself.
I found the neat derivation presented above in Implicit Gradient Regularization, where you can find more details.
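To get a feel for why the modified loss is the better description of a gradient descent step, here is a small numerical check on a toy one-dimensional loss. This is my own illustrative example with $L(\theta) = \theta^4$; it is not taken from the paper or used in the experiments above.

Code to compare a gradient descent step with the two gradient flows
# One gradient descent step on L is closer to the gradient flow of the modified
# loss L_tilde = L + (lr/4) * (L')**2 than to the gradient flow of L itself.
L_grad = lambda t: 4 * t**3      # gradient of the toy loss L(t) = t**4
L_hess = lambda t: 12 * t**2     # second derivative of L
Lt_grad = lambda t, lr: L_grad(t) + (lr / 2) * L_hess(t) * L_grad(t)  # gradient of L_tilde

def flow(grad_fn, t0, T, substeps=100000):
    # Integrate the gradient flow d(theta)/dt = -grad_fn(theta) with tiny Euler steps
    dt = T / substeps
    t = t0
    for _ in range(substeps):
        t = t - dt * grad_fn(t)
    return t

t0 = 1.0
for lr in [0.02, 0.01, 0.005]:
    gd = t0 - lr * L_grad(t0)  # one gradient descent step
    err_plain = abs(gd - flow(L_grad, t0, lr))                    # shrinks roughly like lr^2
    err_tilde = abs(gd - flow(lambda t: Lt_grad(t, lr), t0, lr))  # shrinks roughly like lr^3
    print('lr = %.3f: error vs flow of L = %.1e, vs flow of L_tilde = %.1e'
          % (lr, err_plain, err_tilde))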
Label noise
But wait, optimizing $\tilde L$ instead of $L$ should not make a difference here: any parameters that fit the training data exactly have $\nabla L(\theta) = 0$, so the regularization term $\frac{\eta}{4}\|\nabla L(\theta)\|^2$ vanishes at every global minimum. This is where label noise comes in.
With label noise we have the per-step loss
$$L_\epsilon(\theta) = \frac{1}{n}\sum_{i=1}^{n} \big(f_\theta(x_i) - y_i - \epsilon_i\big)^2, \qquad \epsilon_i \sim \mathcal{N}(0,1).$$
The noise does not disappear from the regularizer: at parameters fitting the training data, $f_\theta(x_i) = y_i$, the gradient is $\nabla L_\epsilon(\theta) = -\frac{2}{n}\sum_i \epsilon_i \nabla_\theta f_\theta(x_i)$, and in expectation over the noise the implicit regularizer becomes
$$\frac{\eta}{4}\,\mathbb{E}\big[\|\nabla L_\epsilon(\theta)\|^2\big] = \frac{\eta}{n^2}\sum_{i=1}^{n} \|\nabla_\theta f_\theta(x_i)\|^2.$$
One way to interpret this is that among the parameter configurations fitting the training data, we choose the one minimizing the gradient of the neural network output with respect to the parameters. We may further write out and analyze this quantity for our particular network: since $f_\theta(x) = \sum_j b_j (a_j^\top x)^2$, we have
$$\|\nabla_\theta f_\theta(x)\|^2 = \sum_{j=1}^{m}\Big((a_j^\top x)^4 + 4\,b_j^2\,(a_j^\top x)^2\,\|x\|^2\Big).$$
One of the main insights of this implicit regularizer is the multiplicative factor $\eta$: the strength of the regularization is proportional to the learning rate, which is why only the learning rates near the edge of divergence were able to solve the task.
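If you want to probe this interpretation numerically, something along the following lines could be used to measure $\frac{1}{n}\sum_i \|\nabla_\theta f_\theta(x_i)\|^2$ for a trained model. The helper outputGradientPenalty is my own illustrative addition; one would expect a model trained with label noise to end up with a smaller value than one trained with plain gradient descent.

Code to measure the implicit regularizer
# Measure R(theta) = (1/n) * sum_i ||grad_theta f(x_i)||^2 for a trained model
def outputGradientPenalty(parameters, predict, X):
    total = 0.0
    for i in range(X.shape[0]):
        out = predict(X[i:i+1, :]).sum()            # scalar network output f(x_i)
        grads = th.autograd.grad(out, parameters)   # gradient with respect to A and b
        total += sum((g**2).sum() for g in grads)
    return total / X.shape[0]

# Example: compare with a model trained with label noise instead of plain gradient descent
params, predict = makeModel()
trainModel(params, predict, 0.01)
print('Penalty, plain GD: %.2f' % outputGradientPenalty(params, predict, X_train))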
Comparing with small initialization
If you read the previous blog post, you might wonder if we can apply the implicit bias given by small initialization to solve the regression task given here. Indeed you can! By simply scaling down the outputs of the neural network by a factor of 100, we get MSE = 0.00 with learning rate 0.01.
Code to test small initialization
def makeModelSmallInit(seed = 0):
    m = 100 # Number of hidden nodes
    A = th.zeros(d, m, requires_grad=True)
    b = th.zeros(m, requires_grad=True)
    th.manual_seed(seed)
    th.nn.init.xavier_normal_(A)
    parameters = [A, b]
    predict = lambda X : (X@A)**2 @ b / 100
    return parameters, predict
print('MSE = %.2f'%trainModel(*makeModelSmallInit(), 0.01))
When writing the previous blog entry I tried the reverse: applying regularization by large step sizes and label noise to the classification task. However, this implicit bias was not useful for solving that task.
If you are interested in this kind of research, I would be happy to talk on Discord, email (johanswi@ uio.no), or in the comments below.