The Good Minima

How the RWKV language model works

2023-03-23T00:00:00+00:00

In this post, I will explain the details of how RWKV generates text. For a high-level overview of what RWKV is and what is so special about it, check out the other post about RWKV.

To explain exactly how RWKV works, I think it is easiest to look at a simple implementation of it. The following ~100 lines of code (based on RWKV in 150 lines) are a minimal implementation of a relatively small (430m parameter) RWKV model which generates text.

Minimal RWKV code

import numpy as np
from torch import load as torch_load  # Only for loading the model weights
from tokenizers import Tokenizer

layer_norm = lambda x, w, b : (x - np.mean(x)) / np.std(x) * w + b
exp = np.exp
sigmoid = lambda x : 1/(1 + exp(-x))

def time_mixing(x, last_x, last_num, last_den, decay, bonus, mix_k, mix_v, mix_r, Wk, Wv, Wr, Wout):
    k = Wk @ ( x * mix_k + last_x * (1 - mix_k) )
    v = Wv @ ( x * mix_v + last_x * (1 - mix_v) )
    r = Wr @ ( x * mix_r + last_x * (1 - mix_r) )

    wkv = (last_num + exp(bonus + k) * v) /      \
          (last_den + exp(bonus + k))
    rwkv = sigmoid(r) * wkv

    num = exp(-exp(decay)) * last_num + exp(k) * v
    den = exp(-exp(decay)) * last_den + exp(k)

    return Wout @ rwkv, (x,num,den)


def channel_mixing(x, last_x, mix_k, mix_r, Wk, Wr, Wv):
    k = Wk @ ( x * mix_k + last_x * (1 - mix_k) )
    r = Wr @ ( x * mix_r + last_x * (1 - mix_r) )
    vk = Wv @ np.maximum(k, 0)**2
    return sigmoid(r) * vk, x


def RWKV(model, token, state):
    params = lambda prefix : [model[key] for key in model.keys() if key.startswith(prefix)]

    x = params('emb')[0][token]
    x = layer_norm(x, *params('blocks.0.ln0'))

    for i in range(N_LAYER):
        x_ = layer_norm(x, *params(f'blocks.{i}.ln1'))
        dx, state[i][:3] = time_mixing(x_, *state[i][:3], *params(f'blocks.{i}.att'))
        x = x + dx

        x_ = layer_norm(x, *params(f'blocks.{i}.ln2'))
        dx, state[i][3] = channel_mixing(x_, state[i][3], *params(f'blocks.{i}.ffn'))
        x = x + dx

    x = layer_norm(x, *params('ln_out'))
    x = params('head')[0] @ x

    e_x = exp(x-np.max(x))
    probs = e_x / e_x.sum() # Softmax of x

    return probs, state

##########################################################################################################

def sample_probs(probs, temperature=1.0, top_p=0.85):
    sorted_probs = np.sort(probs)[::-1]
    cumulative_probs = np.cumsum(sorted_probs)
    cutoff = sorted_probs[np.argmax(cumulative_probs > top_p)]
    probs[probs < cutoff] = 0
    probs = probs**(1/temperature)
    return np.random.choice(a=len(probs), p=probs/np.sum(probs))


# Available at https://huggingface.co/BlinkDL/rwkv-4-pile-430m/resolve/main/RWKV-4-Pile-430M-20220808-8066.pth
MODEL_FILE = '/data/rwkv/RWKV-4-Pile-430M-20220808-8066.pth'
N_LAYER = 24
N_EMBD = 1024

print(f'\nLoading {MODEL_FILE}')
weights = torch_load(MODEL_FILE, map_location='cpu')
for k in weights.keys():
    if '.time_' in k: weights[k] = weights[k].squeeze()
    weights[k] = weights[k].float().numpy() # convert to f32 type


# Available at https://github.com/BlinkDL/ChatRWKV/blob/main/20B_tokenizer.json
tokenizer = Tokenizer.from_file("/data/rwkv/20B_tokenizer.json")

print(f'\nPreprocessing context')

context = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."

state = np.zeros((N_LAYER, 4, N_EMBD), dtype=np.float32)
for token in tokenizer.encode(context).ids:
    probs, state = RWKV(weights, token, state)

print(context, end="")
for i in range(100):
    token = sample_probs(probs)
    print(tokenizer.decode([token]), end="", flush=True)
    probs, state = RWKV(weights, token, state)

To avoid hiding complexity, the model computation itself is written entirely in Python, with numpy for matrix / vector operations. However, I needed to use torch.load to load the model weights from a file, and tokenizers.Tokenizer to convert the text into tokens the model can work with.

Text generation with RWKV

The code uses RWKV to continue the following text:

“In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese.”

We first need to convert this text into a series of tokens (numbers from 0 to 50276 representing words/symbols/tokens in our vocabulary). That is not the focus of this blog post, so I just do it with an external library tokenizer.encode(context).ids.

Next, we need to process this sequence of tokens into an RWKV state. Essentially, RWKV represents a function which takes a token and a state, and outputs a probability distribution over the next token, and a new state. Of course, the function also depends on the RWKV model parameters, but since we use a trained model (downloaded from here), we view those parameters as fixed. To convert the text to a state, we just initialize the state to all zeros, and feed the tokens through the RWKV function one by one.

state = np.zeros((N_LAYER, 4, N_EMBD), dtype=np.float32)
for token in tokenizer.encode(context).ids:
    probs, state = RWKV(weights, token, state)

Now the variable state contains a state representation of our input text, and the variable probs contains the probability distribution the model predicts for the next token.

We can now simply sample the probability distribution (in practice, we avoid low probability tokens in sample_probs()) and add another token to the text. Then we feed the new token into RWKV and repeat.

for i in range(100):
    token = sample_probs(probs)
    print(tokenizer.decode([token]), end="", flush=True)
    probs, state = RWKV(weights, token, state)

A typical, generated continuation is:

“They’re just like us. They use Tibetan for communication, and for a different reason – they use a language that they’re afraid to use. To protect their secret, they prefer to speak a different language to the local public.”

Of course, larger models will perform better than this relatively small 430m RWKV.
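To make the filtering in sample_probs() concrete, here is a sketch of the same top-p logic applied to a made-up five-token distribution (the numbers are purely illustrative). Unlike sample_probs(), this hypothetical helper returns the filtered distribution instead of sampling from it, so the effect is easy to inspect.

```python
import numpy as np

def top_p_filter(probs, temperature=1.0, top_p=0.85):
    probs = probs.copy()  # avoid modifying the caller's array
    sorted_probs = np.sort(probs)[::-1]
    cumulative_probs = np.cumsum(sorted_probs)
    # Probability of the first token at which the cumulative mass exceeds top_p
    cutoff = sorted_probs[np.argmax(cumulative_probs > top_p)]
    probs[probs < cutoff] = 0          # drop everything less likely than that
    probs = probs ** (1 / temperature)
    return probs / probs.sum()         # renormalize to a distribution

toy_probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])  # made-up distribution
filtered = top_p_filter(toy_probs)
print(filtered)
```

Here the two least likely tokens end up with probability zero, and the remaining three are renormalized to sum to 1.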

What goes on inside RWKV()

The first thing RWKV does is look up the embedding vector of the input token, i.e. x = params('emb')[0][token]. Here params('emb')[0] is simply a $50277 \times 1024$ matrix, and we extract a row.

The next line x = layer_norm(x, *params('blocks.0.ln0')) requires me to explain what a Layer Normalization is. The easiest way is to just show the definition:

layer_norm = lambda x, w, b : (x - np.mean(x)) / np.std(x) * w + b.

The intuition is that it normalizes a vector x to zero mean and unit variance, and then scales and offsets that. Note that the scale w and offset b are 1024-dimensional vectors, which are learned model parameters.
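As a quick sanity check of that intuition, here is a small sketch. The random input and the particular scale and offset values are made up for illustration; they are not taken from the trained model.

```python
import numpy as np

layer_norm = lambda x, w, b: (x - np.mean(x)) / np.std(x) * w + b

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=1024)  # arbitrary example input

# With w = 1 and b = 0, only the normalization remains.
normalized = layer_norm(x, np.ones(1024), np.zeros(1024))
print(normalized.mean(), normalized.std())  # close to 0 and 1

# A scale of 2 and an offset of 0.5 (made-up values) shift the statistics accordingly.
scaled = layer_norm(x, np.full(1024, 2.0), np.full(1024, 0.5))
print(scaled.mean(), scaled.std())  # close to 0.5 and 2
```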

Now we get to the main part of the model, which is split into 24 layers, applied sequentially.

for i in range(N_LAYER):
    x_ = layer_norm(x, *params(f'blocks.{i}.ln1'))
    dx, state[i][:3] = time_mixing(x_, *state[i][:3], *params(f'blocks.{i}.att'))
    x = x + dx

    x_ = layer_norm(x, *params(f'blocks.{i}.ln2'))
    dx, state[i][3] = channel_mixing(x_, state[i][3], *params(f'blocks.{i}.ffn'))
    x = x + dx

Note that we only ever add updates to x, as in x = x + dx. This is called using “residual connections”. Each time, we make a copy of x and feed it through a layer normalization before mixing it. Each layer has two mixing functions: a “time mixing” part and a “channel mixing” part. In a typical transformer, the “time mixing” would be done by multi head attention, and the “channel mixing” would be done by a simple feed forward network. RWKV does something a bit different, which we’ll explain in the next sections.

Channel mixing

I’ll start with channel mixing, since it’s the simpler one of the two mixing functions.

def channel_mixing(x, last_x, mix_k, mix_r, Wk, Wr, Wv):
    k = Wk @ ( x * mix_k + last_x * (1 - mix_k) )
    r = Wr @ ( x * mix_r + last_x * (1 - mix_r) )
    vk = Wv @ np.maximum(k, 0)**2
    return sigmoid(r) * vk, x

The channel mixing layer takes an input x corresponding to this token, and the x corresponding to the previous token, which we call last_x. last_x was stored in this RWKV layer’s state. The rest of the inputs are learned RWKV parameters.

First, we linearly interpolate x and last_x, using learned weights. We run this interpolated x as input to a 2 layer feed forward network with squared relu activation, and finally multiply with the sigmoid activations of another feed forward network (in classical RNN terms, this would be called gating).

Note that in terms of memory usage, the matrices Wk,Wr,Wv hold almost all the parameters (the smallest of them is a $1024\times 1024$ matrix, while the other variables are just 1024-dimensional vectors). And the matrix multiplications (@ in python) contribute the vast majority of required computations.
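To see the interpolation with last_x in isolation, here is a sketch that runs channel_mixing on random toy weights. The toy sizes (an 8-dimensional x with a 32-dimensional hidden layer) and the random parameters are assumptions for illustration, not the trained model's values.

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def channel_mixing(x, last_x, mix_k, mix_r, Wk, Wr, Wv):
    k = Wk @ (x * mix_k + last_x * (1 - mix_k))
    r = Wr @ (x * mix_r + last_x * (1 - mix_r))
    vk = Wv @ np.maximum(k, 0) ** 2
    return sigmoid(r) * vk, x

rng = np.random.default_rng(0)
n_embd, n_hidden = 8, 32  # toy sizes, chosen only for this example
Wk = rng.normal(size=(n_hidden, n_embd))
Wr = rng.normal(size=(n_embd, n_embd))
Wv = rng.normal(size=(n_embd, n_hidden))
x, last_x = rng.normal(size=n_embd), rng.normal(size=n_embd)

# With mixing weights of 1, the previous token is ignored entirely.
ones = np.ones(n_embd)
out_a, new_last_x = channel_mixing(x, last_x, ones, ones, Wk, Wr, Wv)
out_b, _ = channel_mixing(x, np.zeros(n_embd), ones, ones, Wk, Wr, Wv)
print(np.allclose(out_a, out_b))  # True

# With mixing weights of 0.5, the previous token influences the output.
half = np.full(n_embd, 0.5)
out_c, _ = channel_mixing(x, last_x, half, half, Wk, Wr, Wv)
print(np.allclose(out_a, out_c))  # False
```

The second return value is simply the current x, which becomes last_x when the next token is processed.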

Time mixing

def time_mixing(x, last_x, last_num, last_den, decay, bonus, mix_k, mix_v, mix_r, Wk, Wv, Wr, Wout):
    k = Wk @ ( x * mix_k + last_x * (1 - mix_k) )
    v = Wv @ ( x * mix_v + last_x * (1 - mix_v) )
    r = Wr @ ( x * mix_r + last_x * (1 - mix_r) )

    wkv = (last_num + exp(bonus + k) * v) /      \
          (last_den + exp(bonus + k))
    rwkv = sigmoid(r) * wkv

    num = exp(-exp(decay)) * last_num + exp(k) * v
    den = exp(-exp(decay)) * last_den + exp(k)

    return Wout @ rwkv, (x,num,den)

The time mixing starts similarly to the channel mixing, by interpolating this token’s x with the last token’s x. We then apply learned $1024\times 1024$ matrices to get “key”, “value” and “receptance” vectors.

The next part is where the magic happens.

The “RWKV attention”

Before getting to the core of the mechanism, we will make the observation that while the variables going into the attention mechanism are all 1024-dimensional (we say they have 1024 channels), all channels are computed independently of each other. We will therefore just look at what happens to a single channel, treating the variables as scalars.

Now, let us look at the variable num. To make math notations cleaner, let’s rename num and den to $\alpha$ and $\beta$. Both $\alpha$ and $\beta$ are stored in the RWKV state. For each new token, $\alpha$ is calculated as $\alpha_i = e^{-w} \alpha_{i-1} +e^{k_i} v_i$, where $i$ is the index of the current token. We define w = exp(decay); note that w is always positive.

By induction, we have $\alpha_i = \sum_{j=1}^i e^{-(i-j)w+k_j} v_j$. Similarly, $\beta_i = \sum_{j=1}^i e^{-(i-j)w+k_j}$. Note that $\alpha_i$ looks like a weighted sum of the $v_j$, while $\beta_i$ is just the sum of weights. So $\frac{\alpha_i}{\beta_i}$ becomes a weighted average of $v_j$.
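The closed form is easy to check numerically. The sketch below uses random made-up values of $k_j$, $v_j$ and $w$ for a single channel, runs the recurrence, and compares the result against the explicit sums.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 10
k = rng.normal(size=n_tokens)   # made-up keys for one channel
v = rng.normal(size=n_tokens)   # made-up values for one channel
w = np.exp(rng.normal())        # w = exp(decay) is always positive

# Recurrent form: what the state update computes token by token
alpha = beta = 0.0
for i in range(n_tokens):
    alpha = np.exp(-w) * alpha + np.exp(k[i]) * v[i]
    beta = np.exp(-w) * beta + np.exp(k[i])

# Closed form: explicit sum over all tokens seen so far (0-indexed here)
i = n_tokens - 1
weights = np.exp(-(i - np.arange(n_tokens)) * w + k)
alpha_closed = np.sum(weights * v)
beta_closed = np.sum(weights)

print(np.isclose(alpha, alpha_closed), np.isclose(beta, beta_closed))  # True True
```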

Plugging in the formulas for $\alpha_{i-1}$ and $\beta_{i-1}$ into the definition of wkv, and denoting bonus by $u$, we get

\[\text{wkv}_i = \frac{ \sum_{j=1}^{i-1} e^{-(i-1-j)w+k_j} v_j + e^{u+k_i} v_i }{\sum_{j=1}^{i-1} e^{-(i-1-j)w+k_j} + e^{u+k_i}}.\]

So $\text{wkv}$ is a weighted average of $v$ with weights according to $k$, but also the current $v_i$ is given a bonus ($u$) additional weight, and previous $v_j$ are given geometrically smaller weights the further away they are.
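To illustrate, the following sketch computes the normalized weights that wkv assigns to each position, for a single channel. All keys are set to 0 so that only the decay and the bonus matter; the values w = 0.5 and u = 1 are made up for the example, and wkv_weights is a hypothetical helper name.

```python
import numpy as np

def wkv_weights(k, w, u):
    """Normalized weights given to each v_j by wkv at the last position, for one channel."""
    i = len(k) - 1                                       # 0-indexed current position
    past = np.exp(-(i - 1 - np.arange(i)) * w + k[:i])   # decayed weights of earlier tokens
    current = np.exp(u + k[i])                           # the current token gets the bonus u
    weights = np.append(past, current)
    return weights / weights.sum()

k = np.zeros(5)                                          # equal keys, to isolate decay and bonus
weights = wkv_weights(k, w=0.5, u=1.0)
print(weights)
```

With equal keys, each step further back in time shrinks the weight by a factor $e^{-w}$, and the current token gets $e^{u}$ times the weight of the token right before it.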

For reference, standard transformer attention takes “query”, “key” and “value” vectors $q,k,v$ and outputs

\[\frac{\sum_{j=1}^i e^{q_i^\top k_j} v_j}{\sum_{j=1}^i e^{q_i^\top k_j}}.\]

After calculating wkv, the time mixing multiplies by the “receptance” sigmoid(r). It does a final linear transformation before returning the result.

Converting to output probabilities

After going through the 24 layers of time mixing and channel mixing, we need to convert the final output to predicted probabilities for the next token.

x = layer_norm(x, *params('ln_out'))
x = params('head')[0] @ x

e_x = exp(x-np.max(x))
probs = e_x / e_x.sum() # Softmax of x

First, we do a layer normalization. Then, we multiply by a $50277 \times 1024$ matrix params('head')[0] given by the RWKV parameters, giving us a 50277-dimensional vector. To get a probability distribution over tokens (i.e. a 50277-dimensional, non-negative vector which sums to 1), we run our x through a “softmax” function. The softmax of x is just exp(x)/sum(exp(x)). However, calculating exp(x) can cause numerical overflows, so we calculate the equivalent function exp(x-max(x))/sum(exp(x-max(x))).
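The overflow is easy to reproduce. In this sketch, the two input vectors are made up: one with small entries where both versions agree, and one with entries large enough that exp overflows in float32.

```python
import numpy as np

def softmax_naive(x):
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_stable(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

small = np.array([1.0, 2.0, 3.0], dtype=np.float32)
large = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)  # made-up large inputs

print(np.allclose(softmax_naive(small), softmax_stable(small)))  # True: same function

with np.errstate(over='ignore', invalid='ignore'):
    print(softmax_naive(large))   # exp overflows to inf, and inf / inf gives nan
print(softmax_stable(large))      # same as for `small`, since only differences matter
```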

That’s it! Now you know exactly how RWKV works for generating text.

Practical considerations

In practice, there are some issues which I ignored in my simplified code. Most importantly, in practice, we care a lot about the performance / run-time of the code. This leads us to run RWKV in parallel on GPUs, use specialized GPU code written in CUDA, use 16-bit floating point numbers, and more.

Numerical issues

The largest number a 16-bit floating point number (float16) can represent is 65 504. Anything above that overflows, which is bad. Most of the code has no problems with this, partially because the Layer Normalizations keep values in a reasonable range. However, the RWKV attention contains exponentially large numbers (exp(bonus + k)). In practice, the RWKV attention is implemented in a way where we factor out an exponential factor from num and den to keep everything within float16 range. See for example the time_mixing function in RWKV in 150 lines.
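One way to factor out the exponential is to store num and den divided by a running factor exp(p), and to keep the exponent p as an extra state variable. The following single-channel sketch illustrates that idea in float32 rather than float16. The function names, the state layout and the test values are chosen for illustration only, so refer to RWKV in 150 lines for an actual implementation.

```python
import numpy as np

def wkv_naive(k, v, w, u):
    """Direct implementation; exp(k) overflows for large k."""
    num = den = np.float32(0)
    for k_i, v_i in zip(k, v):
        wkv = (num + np.exp(u + k_i) * v_i) / (den + np.exp(u + k_i))
        num = np.exp(-w) * num + np.exp(k_i) * v_i
        den = np.exp(-w) * den + np.exp(k_i)
    return wkv

def wkv_stable(k, v, w, u):
    """Same computation, but the state holds num * exp(-p), den * exp(-p) and p."""
    num = den = np.float32(0)
    p = np.float32(-1e30)              # exponent factored out of num and den
    for k_i, v_i in zip(k, v):
        # Output: rescale the state and the current term by a shared factor exp(-q)
        q = max(p, u + k_i)
        e1, e2 = np.exp(p - q), np.exp(u + k_i - q)
        wkv = (e1 * num + e2 * v_i) / (e1 * den + e2)
        # State update: the decay lowers the stored exponent by w
        q = max(p - w, k_i)
        e1, e2 = np.exp(p - w - q), np.exp(k_i - q)
        num, den, p = e1 * num + e2 * v_i, e1 * den + e2, q
    return wkv

rng = np.random.default_rng(0)
v = rng.normal(size=5).astype(np.float32)
k = np.array([100, 101, 99, 102, 100], dtype=np.float32)  # made-up large keys
w, u = np.float32(0.5), np.float32(1.0)

with np.errstate(over='ignore', invalid='ignore'):
    print(wkv_naive(k, v, w, u))       # nan: exp(100) overflows float32
print(wkv_stable(k, v, w, u))          # finite
# Reference: the naive formula in float64, where exp(100) still fits
print(wkv_naive(k.astype(np.float64), v.astype(np.float64), float(w), float(u)))
```

Every exponential in wkv_stable has a non-positive argument, so nothing can overflow, and the shared factor cancels in the ratio.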

Training

We simply loaded a pretrained model in our example. To train the model, one would calculate the cross entropy loss of the predicted probabilities on a long text (our example model was trained on the Pile). Next, calculate the gradient of that loss with respect to all the RWKV parameters. That gradient is used to improve the parameters using a variant of Gradient Descent called Adam. Repeat for a long time, and you get a trained RWKV model.
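As a small illustration of the loss, the sketch below computes the cross entropy for made-up predicted probabilities over a toy vocabulary of 4 tokens. The probabilities and targets are invented for the example; in real training they would come from the model and the training text, and the gradient would be computed by automatic differentiation.

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-probability assigned to the tokens that actually came next."""
    picked = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked))

# One row of predicted probabilities per position (toy vocabulary of 4 tokens)
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.10, 0.10, 0.10, 0.70]])
targets = np.array([0, 2, 3])  # the tokens that actually came next

loss = cross_entropy(probs, targets)
print(loss)
print(np.log(4))  # loss of a uniform guess over 4 tokens, for comparison
```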

GPT-mode

My simplified code processes the tokens one by one, which is much slower than processing them in parallel, especially when running on GPUs. For inference, there is no way around this, as we need to sample a token before we can use it to calculate the next one. However, for training, all the text is already available. This lets us parallelize across tokens. Most of the code is fairly straightforward to parallelize like this, as there is little dependence through time. For example, all the expensive matrix multiplications work on each token independently, leading to good performance.

However, the RWKV attention is inherently sequential. Fortunately, it has very little computation (on the order of 1024 times less than the matrix multiplications), so it should be fast. Sadly, pytorch does not have a good way of handling this sequential task, so the attention part becomes slow (even compared to the matrix multiplications). Therefore, I wrote optimized CUDA kernels for computing the RWKV attention, which has been my main contribution to the RWKV project.

JAX has jax.lax.scan and jax.lax.associative_scan, which allow a pure JAX implementation to perform better than pure pytorch. However, I still estimate that JAX would lead to about 40% slower training compared to CUDA (that estimate may be outdated, as it was made for training a relatively small 1.5B model).
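The reason an associative scan can apply here is that each step of a recurrence like $\alpha_i = e^{-w}\alpha_{i-1} + e^{k_i}v_i$ is an affine map of the state, and composing affine maps is associative. The sketch below shows this in plain numpy for a single channel, with made-up inputs; combine is a hypothetical name for the kind of operator one could hand to an associative scan, not code from the RWKV project.

```python
import numpy as np
from functools import reduce

def combine(left, right):
    """Compose two steps 'state -> a * state + b', applying left first, then right."""
    a_l, b_l = left
    a_r, b_r = right
    return a_l * a_r, a_r * b_l + b_r

rng = np.random.default_rng(0)
n_tokens = 8
k, v = rng.normal(size=n_tokens), rng.normal(size=n_tokens)  # made-up inputs
w = 0.5                                                      # made-up decay

# Each token contributes the step: state -> exp(-w) * state + exp(k_i) * v_i
steps = [(np.exp(-w), np.exp(k_i) * v_i) for k_i, v_i in zip(k, v)]

# Sequential evaluation, as in the RNN formulation
alpha = 0.0
for a, b in steps:
    alpha = a * alpha + b

# Because combine is associative, the steps can be grouped freely,
# for example as two halves that could be processed in parallel.
first_half = reduce(combine, steps[:4])
second_half = reduce(combine, steps[4:])
a_total, b_total = combine(first_half, second_half)

print(np.isclose(alpha, b_total))  # True (the initial state is 0)
```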

Contribute

RWKV is an open source community project. Join the Discord and contribute! Or just ask questions or lurk.

The RWKV language model: An RNN with the advantages of a transformer

2023-03-23T00:00:00+00:00

For a while, I’ve been following and contributing to the RWKV language model, an open source large language model with great potential. As ChatGPT and large language models in general have gotten a lot of attention recently, I think it’s a good time to write about RWKV. In this post, I will try to explain what is so special about RWKV compared to most language models (transformers). The other RWKV post is more technical, showing in detail how RWKV actually works (with a ~100 line minimal implementation).

At a high level, the RWKV model is a clever RNN architecture that enables it to be trained like a transformer. So to explain RWKV, I need to explain RNNs and transformers first.

RNNs

Classically, the neural networks used for sequence (such as text) processing were RNNs (like LSTMs). An RNN takes two inputs: a state vector and a token¹. It goes through the input sequence one token at a time, with each token updating the state. We may for example use an RNN to process a text into a single state vector. This can then be used to classify the text into “positive” or “negative”. Or we may use the final state to predict the next token, which is how RNNs are used to generate text.

Transformers

Because of the sequential nature of RNNs, they are hard to massively parallelize across many GPUs. This motivated using an “attention” mechanism instead of sequential processing, resulting in an architecture called a transformer. A transformer processes all tokens at the same time, comparing each token to all previous tokens in parallel. Specifically, the attention calculates “key”, “value” and “query” vectors for each token, then contributions between all pairs of tokens are computed using those.

In addition to being able to speed up training through massive parallelization, large transformers generally score better than RNNs on benchmarks.

However, the attention mechanism scales quadratically with the length of the sequence to be processed. This effectively limits the model’s input size (or “context length”). Additionally, because of the attention mechanism, when generating text, we need to keep attention vectors for all previous tokens in memory. This requires much more memory than an RNN which only stores a single state.
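As a rough sketch of the difference, the following compares the number of stored values under two assumptions: a standard transformer that caches one key and one value vector per layer for every previous token, and an RNN-style state of fixed size. The layer count, width and 4 state vectors per layer are those of the 430m RWKV model from the other post; applying the same sizes to a transformer is an assumption made only for comparison, and both function names are hypothetical.

```python
def transformer_cache_size(n_layer, n_embd, n_tokens):
    """Values stored when caching a key and a value vector per layer per token."""
    return 2 * n_layer * n_embd * n_tokens

def rnn_state_size(n_layer, n_embd, vectors_per_layer=4):
    """Values stored in a fixed-size state, independent of the number of tokens."""
    return n_layer * vectors_per_layer * n_embd

n_layer, n_embd = 24, 1024
for n_tokens in (1, 1024, 8192):
    print(n_tokens,
          transformer_cache_size(n_layer, n_embd, n_tokens),
          rnn_state_size(n_layer, n_embd))
```

Under these assumptions the cache grows linearly with the number of tokens, while the state stays at the same size no matter how long the text is.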

RWKV

RWKV combines the best features of RNNs and transformers. During training, we use the transformer type formulation of the architecture, which allows massive parallelization (with a sort of attention which scales linearly with the number of tokens). For inference, we use an equivalent formulation which works like an RNN with a state. This allows us to get the best of both worlds.

So we basically have a model which trains like a transformer, except that long context length is not expensive. And during inference, we need substantially less memory and can implicitly handle “infinite” context length (though in practice, the model might have a hard time generalizing to much longer context lengths than it saw during training).

OK, but what about the performance? Since RWKV is an RNN, it is natural to think that it can’t perform as well as a transformer on benchmarks. Also, this just sounds like linear attention. None of the many previous linear time attention transformer architectures (like “Linformer”, “Nystromformer”, “Longformer”, “Performer”) seemed to take off.

Benchmarks

Well, RWKV seems to scale as well as SOTA transformers. At least up to 14 billion parameters.

Contribute

RWKV is an open source community project. Join the Discord and contribute (or ask questions or whatever).

Cost estimates for Large Language Models

When looking at RWKV 14B (14 billion parameters), it is easy to ask what happens when we scale to 175B like GPT-3. However, training a 175B model is expensive. Calculating the approximate training cost of a transformer-like architecture is actually straightforward.

The bottleneck for training is essentially multiplying by all the parameters, and then adding that together, for each input token (2 FLOPs per parameter per token). With automatic differentiation, we can calculate the gradient with about another 2x that, for a total of 6 FLOPs per parameter per token. So a 14B model trained on 300 billion tokens takes about $14B \times 300B \times 6 = 2.5 \times 10^{22}$ FLOPs. We use A100 GPUs for training. Using 16-bit floating point numbers, an A100 can theoretically do up to 312 TFLOPS, or about $1.1\times 10^{18}$ FLOPs per hour. So we theoretically need at least 22 436 hours of A100 time to train. In practice, RWKV 14B was trained on 64 A100s in parallel, sacrificing a bit of performance for various reasons. RWKV 14B took about 3 months $\approx 140\ 160$ A100 hours to train, thus achieving about 16% theoretical efficiency (since it took about 6x longer than the theoretical minimum). Recent versions can train RWKV 14B at around 50% theoretical efficiency.

As a rough price estimate, at the time of writing, the cheapest A100 cost at cloud-gpus.com was $0.79/h. Training the original 14B RWKV there would hence cost around $100k, but with the recent training code improvements we could reduce this to $40k. In practice, there are other considerations like ease of use, timeouts, multi-gpu communication speed, etc. Thus, one might want more high-end options like AWS at $4.096/h. RWKV was trained on compute donated by Stability and EleutherAI.
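The arithmetic above can be reproduced in a few lines. This sketch only uses the figures quoted in the text; "3 months" is taken to be a quarter of a 365-day year.

```python
params = 14e9                 # 14B parameters
tokens = 300e9                # 300 billion training tokens
flops_per_param_token = 6     # forward pass plus gradient computation

total_flops = params * tokens * flops_per_param_token
a100_flops_per_hour = 312e12 * 3600      # 312 TFLOPS peak with 16-bit floats
min_hours = total_flops / a100_flops_per_hour

actual_hours = 64 * 24 * 365 / 4         # 64 A100s for about 3 months
efficiency = min_hours / actual_hours

price_per_hour = 0.79                    # cheapest A100 price quoted above, in USD
cost = actual_hours * price_per_hour

print(f"{total_flops:.2e} FLOPs, at least {min_hours:.0f} A100 hours")
print(f"{actual_hours:.0f} A100 hours used, efficiency {efficiency:.0%}, cost ${cost:,.0f}")
```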

Now you can imagine that training 10x more parameters and 10x more data will cost 100x more, making it prohibitively expensive.

Footnote

Before using a language model on a text, we tokenize the text into tokens. Intuitively speaking, a token is basically a word. In practice, a tokenizer might split words into multiple tokens, has to handle special characters and punctuation, and employs some tricks like adding a token for “end of text”. The 14B RWKV model uses 50277 different tokens. ↩

94% on CIFAR-10 in 94 lines and 94 seconds

2022-12-28T00:00:00+00:00

The following code scores 94.02% test accuracy (mean of 40 runs) in 94 lines and less than 94 seconds training time (loading / validation / etc. not included). My tests were performed on NVIDIA A10 GPUs, with pytorch version 1.12.1+cu116 and torchvision version 0.13.1+cu116.

Code

import torch, torchvision, sys, time, numpy
from torch import nn
import torch.nn.functional as F

device = torch.device("cuda")
dtype = torch.float16

EPOCHS = 24
BATCH_SIZE = 512
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4*BATCH_SIZE
lr_knots = [0, EPOCHS/5, EPOCHS]
lr_vals  = [0.1/BATCH_SIZE, 0.6/BATCH_SIZE, 0]
W,H = 32,32
CUTSIZE = 8
CROPSIZE = 4

class CNN(nn.Module):
  def __init__(self):
    super().__init__()

    dims = [3,64,128,128,128,256,512,512,512]
    seq = []
    for i in range(len(dims)-1):
      c_in,c_out = dims[i],dims[i+1]
      seq.append( nn.Conv2d(in_channels=c_in, out_channels=c_out, kernel_size=(3,3), stride=(1,1), padding=(1,1), bias=False) )
      if c_out == c_in * 2:
        seq.append( nn.MaxPool2d(2) )
      seq.append( nn.BatchNorm2d(c_out) )
      seq.append( nn.CELU(alpha=0.075) )
    self.seq = nn.Sequential(*seq, nn.MaxPool2d(4), nn.Flatten(), nn.Linear(dims[-1], 10, bias=False))

  def forward(self, x, y):
    x = self.seq(x) / 8
    return F.cross_entropy(x, y, reduction='none', label_smoothing=0.2), (x.argmax(dim=1) == y)*100

def loadCIFAR10(device):
  train = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
  test  = torchvision.datasets.CIFAR10(root="./data", train=False, download=True)
  ret = [torch.tensor(i, device=device) for i in (train.data, train.targets, test.data, test.targets)]
  std, mean = torch.std_mean(ret[0].float(),dim=(0,1,2),unbiased=True,keepdim=True)
  for i in [0,2]: ret[i] = ((ret[i]-mean)/std).to(dtype).permute(0,3,1,2)
  return ret

def getBatches(X,y, istrain):
  if istrain:
    perm = torch.randperm(len(X), device=device)
    X,y = X[perm],y[perm]

    Crop = ([(y0,x0) for x0 in range(CROPSIZE+1) for y0 in range(CROPSIZE+1)], 
        lambda img, y0, x0 : nn.ReflectionPad2d(CROPSIZE)(img)[..., y0:y0+H, x0:x0+W])
    FlipLR = ([(True,),(False,)], 
        lambda img, choice : torch.flip(img,[-1]) if choice else img)
    def cutout(img,y0,x0):
      img[..., y0:y0+CUTSIZE, x0:x0+CUTSIZE] = 0
      return img
    Cutout = ([(y0,x0) for x0 in range(W+1-CUTSIZE) for y0 in range(H+1-CUTSIZE)], cutout)

    for options, transform in (Crop, FlipLR, Cutout):
      optioni = torch.randint(len(options),(len(X),), device=device)
      for i in range(len(options)):
        X[optioni==i] = transform(X[optioni==i], *options[i])

  return ((X[i:i+BATCH_SIZE], y[i:i+BATCH_SIZE]) for i in range(0,len(X) - istrain*(len(X)%BATCH_SIZE),BATCH_SIZE))


X_train, y_train, X_test, y_test = loadCIFAR10(device)

model = CNN().to(dtype).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0, momentum=MOMENTUM, weight_decay=WEIGHT_DECAY, nesterov=True)

training_time = stepi = 0
for epoch in range(EPOCHS):
  start_time = time.perf_counter()
  train_losses, train_accs = [], []
  model.train()
  for X,y in getBatches(X_train,y_train, True):
    stepi += 1
    opt.param_groups[0]['lr'] = numpy.interp([stepi/(len(X_train)//BATCH_SIZE)], lr_knots, lr_vals)[0]

    loss,acc = model(X,y)
    model.zero_grad()
    loss.sum().backward()
    opt.step()
    train_losses.append(loss.detach())
    train_accs.append(acc.detach())
  training_time += time.perf_counter()-start_time

  model.eval()
  with torch.no_grad():
    test_accs = [model(X,y)[1].detach() for X,y in getBatches(X_test,y_test, False)]

  summary = lambda l : torch.mean(torch.cat(l).float())
  print(f'epoch % 2d  train loss %.3f  train acc %.2f  test acc %.2f  training time %.2f'%(epoch+1, summary(train_losses), summary(train_accs), summary(test_accs), training_time))

Why 94%? Because that’s reportedly human level accuracy on CIFAR-10. Additionally, it made for a reasonable balance between accuracy, code complexity and training time.

The code and idea are based on the final code from the fantastic blog series How to Train Your ResNet from myrtle.ai. The blog series worked on optimizing the training time to reach 94% on CIFAR-10, and was able to get it down to 26 seconds!

The code from myrtle.ai ended up being more than 500 lines long and containing a large number of clever tricks and optimizations.

Motivation

My main interest is in studying why deep learning works so well, specifically the implicit biases of deep learning. Ideally, I would like to study and do experiments on realistic and representative models and tasks. However, it’s not actually clear what is realistic and representative of deep learning. The best criterion seems to be “it works much better than all approaches not based on deep learning”.

However, the best performing deep learning models are also the largest (taking days to train on hardware I don’t have access to) and full of tricks which make them impossible to analyze mathematically and hard to even reproduce. I am therefore interested in finding the simplest cases where deep learning works significantly better than alternative approaches. Image classification on CIFAR-10 is among the best I’ve found for these criteria.

myrtle.ai’s model was a good starting point, as it reduced training time from hours to seconds, when compared to most alternatives. However, many of the tricks they used would complicate analysis while only providing a small gain in performance. I removed many of the less significant optimizations in favor of simplicity. Things like “frozen batch norm scales” and “exponential moving averages” were removed.

You might think that reducing training time to seconds is overkill, since taking some minutes to train a model sounds fine. However, it is very useful to have some wiggle room in training time. As an example, the final accuracy varies from run to run because of randomness in the training procedure (data augmentation, initialization), so myrtle.ai typically did 50 runs just to reduce this variance.

The important tricks

While trying to simplify the code, I found the following tricks to be critical for performance.

Batch normalization

Basically, everything diverges if you remove normalization layers in modern deep learning models. Without normalization layers, things like initialization size, learning rate and layer scaling become fragile hyperparameters. In general, it seems that even when all these (new) hyperparameters are tuned well, it’s too hard to compete with batch norm. The myrtle.ai blog came to a similar conclusion.

Data augmentation

Without data augmentations, accuracies drop to around 90%, which is not significantly better than non-deep learning approaches.

I believe the current best non-deep learning methods for CIFAR-10 are convolutional kernel methods, which are typically based on linearizing neural network architectures. These can reach around 90% accuracy on CIFAR-10 (89%, 90%), but struggle for large training sets. They need an $N \times N$ Gram matrix, where $N$ is the number of training samples. This means they cannot use data augmentation to the same degree as modern deep learning.
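To get a feeling for why the Gram matrix limits data augmentation, here is a back-of-the-envelope sketch. It assumes the matrix is stored densely with 4 bytes per entry, and the numbers of augmented copies per image are hypothetical.

```python
def gram_matrix_gib(n_samples, bytes_per_entry=4):
    """Memory for a dense n_samples x n_samples Gram matrix, in GiB."""
    return n_samples ** 2 * bytes_per_entry / 2 ** 30

n_train = 50_000  # size of the CIFAR-10 training set
for copies in (1, 2, 10):  # hypothetical number of augmented copies per image
    print(copies, round(gram_matrix_gib(n_train * copies), 1))
```

Since the memory grows quadratically in $N$, even a modest number of augmented copies per image quickly becomes impractical under these assumptions.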

Learning rate schedule / warm-up

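I used a piecewise linear learning rate schedule, defined by lr_knots and lr_vals in the code: the learning rate warms up linearly from 0.1/BATCH_SIZE to 0.6/BATCH_SIZE over the first fifth of training (EPOCHS/5 epochs), and then decays linearly to 0 at the end of the last epoch.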
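The schedule can be reproduced from the constants in the code above. This sketch evaluates it at a few points with numpy.interp, the same function the training loop uses; learning_rate is a hypothetical helper name.

```python
import numpy as np

EPOCHS = 24
BATCH_SIZE = 512
lr_knots = [0, EPOCHS / 5, EPOCHS]
lr_vals = [0.1 / BATCH_SIZE, 0.6 / BATCH_SIZE, 0]

def learning_rate(epoch):
    """Piecewise linear schedule; `epoch` may be fractional."""
    return np.interp(epoch, lr_knots, lr_vals)

for epoch in (0, EPOCHS / 5, EPOCHS / 2, EPOCHS):
    print(f"epoch {epoch:5.1f}  lr {learning_rate(epoch):.6f}")
```

In the training loop, the fractional epoch is stepi/(len(X_train)//BATCH_SIZE), so the learning rate is updated at every step rather than once per epoch.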

Interesting observations

Residual connections were not necessary

I expected residual connections to be crucial for the network architecture. However, they were not necessary, so I removed them.

Output scale is important

The code from myrtle.ai scales down the output of their model by a factor of 8. This is very interesting to me, since this is a trick to increase the effect of what I call “implicit bias by small initialization”. I used the trick myself in previous blogs like 1 and 2.

Implicit bias in single epoch SGD

2022-10-25T00:00:00+00:00

Large transformer models are typically trained using only a single epoch, i.e. the model only sees each data point once. Theoretically, this training regime circumvents problems related to overfitting and generalization by converting them to a question of how fast we can converge given noisy gradients.

My recent experiments indicate that the implicit bias by small initialization is important in this regime, also in quite realistic settings. This makes me hopeful that the single epoch training SGD regime is a good setting to study several important phenomena observed in deep learning. In this blog I will showcase single epoch training on the following matrix sensing task.

Matrix sensing task

Our task is to estimate a $d\times d$ ground truth matrix $T$. The effect we are studying in this blog is more apparent at larger scales, so we pick the fairly large dimension $d \in \{256,1024\}$. We generate $T$ as a random matrix of rank $r \in \{2,5\}$.

The $i$th training example consists of a $d$-dimensional input vector $a_i$, a $d$-dimensional vector $b_i$ of output weights, and a scalar target value $y_i = a_i^\top T b_i$.

We let $a_i$ and $b_i$ be random Gaussians $a_i, b_i \sim \mathcal{N}(0,I)$. To make $T$ of rank $r$, we generate it as the product of two random Gaussian matrices of sizes $d\times r$ and $r \times d$.
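As a sketch, the data generation described above could look like this in numpy. The seed and the helper name sample_example are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 2

# The product of d x r and r x d Gaussian matrices has rank r (with probability 1)
T = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

def sample_example():
    """One training example (a_i, b_i, y_i) with y_i = a_i^T T b_i."""
    a, b = rng.normal(size=d), rng.normal(size=d)
    return a, b, a @ T @ b

a, b, y = sample_example()
print(T.shape, np.linalg.matrix_rank(T), y)
```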

Simple setting

In this section, I will describe one of the simplest settings I know where the implicit bias is apparent. Then, in later sections I will show more realistic settings.

Our simple model will be a 2 layer linear network (as in this previous post). The parameters are two $d\times d$ matrices $W_1$ and $W_2$. The model outputs $\mathrm{predict}(a_i,b_i) = a_i^\top W_1 W_2 b_i$. We use the squared sample loss $L_i = \frac{1}{2}(y_i-\mathrm{predict}(a_i,b_i))^2$. The parameters are initialized using Xavier normal initialization (random Gaussians with scale $\frac{1}{\sqrt d}$).

We train the model by SGD with constant learning rate $\eta$, for a single epoch. This means we loop through the samples one by one, for each taking a step of length $\eta$ along the negative gradient of $L_i$. Note that we need a large enough learning rate $\eta$ to be able to converge after only one epoch. A too small learning rate will keep the weights too close to the initialization, making learning impossible. However, a too large learning rate will make the model diverge.
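For this model, the gradients can be written out by hand: with residual $y_i - a_i^\top W_1 W_2 b_i$, the gradient of $L_i$ with respect to $W_1$ is minus the residual times $a_i (W_2 b_i)^\top$, and with respect to $W_2$ it is minus the residual times $(W_1^\top a_i) b_i^\top$. The sketch below performs one such SGD step in numpy. The small dimension, the seed and the learning rate are made up for illustration and differ from the experiment.

```python
import numpy as np

def sgd_step(W1, W2, a, b, y, lr):
    """One SGD step on L = 0.5 * (y - a^T W1 W2 b)^2; returns new weights and the loss."""
    residual = y - a @ W1 @ W2 @ b
    grad_W1 = -residual * np.outer(a, W2 @ b)      # dL/dW1
    grad_W2 = -residual * np.outer(W1.T @ a, b)    # dL/dW2
    return W1 - lr * grad_W1, W2 - lr * grad_W2, 0.5 * residual ** 2

rng = np.random.default_rng(0)
d = 16                                             # small toy dimension
W1 = rng.normal(size=(d, d)) / np.sqrt(d)          # Gaussian init with scale 1/sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)
T = rng.normal(size=(d, 2)) @ rng.normal(size=(2, d))
a, b = rng.normal(size=d), rng.normal(size=d)
y = a @ T @ b

W1_new, W2_new, loss_before = sgd_step(W1, W2, a, b, y, lr=1e-4)
loss_after = 0.5 * (y - a @ W1_new @ W2_new @ b) ** 2
print(loss_before, loss_after)  # a small enough step lowers the loss on this sample
```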

We will be exploiting implicit bias by small initialization. Actually, instead of scaling down the initialization, we will scale up the target values, which is equivalent (for homogeneous models like ours). Therefore, we add a parameter $\gamma$ which sets the scale of the ground truth matrix $\gamma T$.

We pick $d = 256$, $r = 2$, $\eta = 2\times 10^{-8}$ and $\gamma = 100$.

Code for simple setting

import torch as th
import matplotlib.pyplot as plt
th.manual_seed(0)

d = 256      # Input and output dimension
r = 2        # Rank of ground truth matrix
lr = 2e-8    # Learning rate
yscale = 100 # Scale factor for ground truth

T = th.randn(d,r)@th.randn(r,d) * yscale

W1 = th.randn(d,d, requires_grad=True)
W2 = th.randn(d,d, requires_grad=True)
th.nn.init.xavier_normal_(W1)
th.nn.init.xavier_normal_(W2)

log = []
for i in range(d*d//2):
  a, b = th.randn(d), th.randn(d)
  loss = 0.5*(a@T@b-a@W1@W2@b)**2
  loss.backward()
  with th.no_grad():
    for param in [W1,W2]:
      param -= lr * param.grad # Gradient descent
      param.grad[:] = 0
  if i%100 == 0:
    error = (W1@W2-T).norm()/T.norm() # Relative error
    log.append((i,error))

plt.plot(*zip(*log))
plt.show()

Without exploiting the rank constraint, and using only $\frac{1}{2}d^2$ samples, we would expect a relative reconstruction error (Frobenius norm of $W_1W_2-T$ over Frobenius norm of $T$) of around $0.5$. However, we only ran the optimization long enough to see $\frac{1}{2}d^2$ samples, and as seen in the plot above we already converged and reconstructed the correct matrix!

More realistic setting

What interests me about the setting above, i.e single epoch SGD with low rank linear ground truth, is that we can observe the same phenomena in much more realistic settings. We can add more layers, non-linearities, even normalization layers, and we still see the same phenomena!

The first “realistic” setup is what I call “the CNN setup”. Typical CNNs (like ResNet and PyramidNet) use ReLU(-like) activation functions, batch normalization, and are optimized by SGD with momentum. Our model consists of 10 layers with ReLU non-linearities and Batch Normalization, optimized by SGD with momentum.

The second “realistic” setup, I call “the transformer setup”. Transformers typically use ReLU-like activations, layer normalization, and are optimized by the Adam optimizer. We stack 10 layers with ReLU activations and layer normalization, and optimize by Adam with default parameters.

We also scale up the experiment a bit to $d = 1024$, $r = 5$ and use mini-batches of size $256$ to speed up the computations.

Code for more realistic setting

import matplotlib.pyplot as plt
from tqdm import tqdm # Optional loading bar
import torch as th
nn = th.nn
F = nn.functional
th.manual_seed(0)

device = 'cuda'
#device = 'cpu'

d = 1024      # Input and output dimension
r = 5         # Rank of ground truth matrix
B = 256       # Batch size
L = 10        # Number of layers
yscale = 4e-2 # Scale factor for ground truth

truth = (th.randn(d,r)@th.randn(r,d)).to(device) * yscale

cnn = nn.Sequential(*[x for i in range(L) for x in [nn.Linear(d,d,bias=False), nn.BatchNorm1d(d), nn.ReLU()]][:-2]).to(device)
sgd = th.optim.SGD(cnn.parameters(), lr=2.5e-8, momentum=0.9)

transformer = nn.Sequential(*[x for i in range(L) for x in [nn.Linear(d,d,bias=False), nn.ReLU(), nn.LayerNorm(d)]][:-2]).to(device)
adam = th.optim.Adam(transformer.parameters(), lr=6e-4)

for net,opt,label in [(cnn,sgd,'CNN'), (transformer,adam,'Transformer')]:
  log = []
  for bi in tqdm(range(0, d**2//2, B)):
    a = th.randn(B, d, device=device)
    b = th.randn(B, d, device=device)
    y    = th.einsum('bi,bi->b', a@truth, b)
    pred = th.einsum('bi,bi->b', net(a), b)
    loss = th.sum((y-pred)**2)

    log.append((bi, (loss/th.sum(y**2)).item() ))

    opt.zero_grad()
    loss.backward()
    opt.step()

  plt.plot(*zip(*log), label=label)
plt.legend()
plt.show()

Interesting phenomena

While performing experiments like the ones above, I recognized numerous phenomena seen elsewhere in deep learning. This makes me hopeful that this simple setting is a good place to study and understand phenomena which are otherwise hard to specify and analyze.

Small initialization / large targets

Implicit bias by small initialization seems crucial to the experiments. If the initialization is too large compared to the ground truth (so $\gamma$ too small), all the networks seem unable to train quickly and generalize. Interestingly, while the simple settings require very large $\gamma$, the more realistic settings (more layers, normalization layers) work best with smaller values of $\gamma$. When doing classification instead of regression, the standard classification losses (say cross entropy with label smoothing) have a slightly larger target scale by default. That could mean we don’t need any explicit target scaling. We saw this effect in a previous blog post.

Only works well for large instances

Deep learning is very data hungry, with simpler methods beating it when data is scarce. The single epoch training regime might shed some light on why that is, as it also only works for large instances.

Learning rate warm-up

In practical deep learning, “learning rate warm-up” is an important trick where we gradually increase the learning rate at the start of training. By using this trick, I was able to use larger learning rates later without diverging. For example, we can add opt.param_groups[0]['lr'] = 1e-7*min(1,0.1+bi/(d**2/4)) before opt.step() in the CNN setting.
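To make the schedule easier to read, here is the same expression wrapped in a small function. This is only a sketch: the name `warmup_lr` and its keyword arguments are my own, not part of the experiment code.

```python
def warmup_lr(bi, d, peak_lr=1e-7, start_fraction=0.1):
    """Linear warm-up: start at start_fraction*peak_lr, then grow linearly
    in the number of samples seen (bi) until peak_lr is reached."""
    return peak_lr * min(1.0, start_fraction + bi / (d**2 / 4))

d = 1024
for bi in [0, d**2 // 8, d**2 // 4, d**2 // 2]:
    print(bi, warmup_lr(bi, d))
```

With these constants the learning rate starts at a tenth of its peak and reaches the peak after $0.9 \cdot d^2/4$ samples, which is 45% of the $d^2/2$ samples in the epoch.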

Normalization layers

Normalization layers seem to help train deeper networks. However, it is not well understood why. Maybe analyzing the effect of normalization layers in this simple setting can give some insights.

Theory

The theory of online optimization is well suited to describe single epoch SGD. As a brief introduction to online optimization, I will give a simple proof of how single epoch SGD optimizes convex functions. The proof is mostly taken from “Convex Optimization: Algorithms and Complexity” by Sébastien Bubeck.

We want to optimize the differentiable (or we could work with subgradients), convex function $\tilde f \colon \mathbb{R}^d \to \mathbb{R}$. To apply SGD, we sample from some family of differentiable functions $f \colon \mathbb{R}^d \to \mathbb{R}$ satisfying $\mathbb{E}\|\nabla f\|_2^2 \le B^2$ and $\mathbb{E}\nabla f = \nabla \tilde f$. Next, we assume there exists some minimizer $x^* \in \mathbb{R}^d$ of $\tilde f$, and we have some initial guess $x^1$. The SGD update rule is

\[x^{k+1} = x^k-\eta\nabla f^k(x^k),\]

where $f^k$ is randomly sampled.

By convexity $\tilde f(x^k) - \tilde f(x^*) \le \nabla \tilde f(x^k)^\top(x^k-x^*)$. So

\[\begin{align*} \mathbb{E}\min_{k \in \{1,\dots,K\}} \tilde f(x^k) - \tilde f(x^*) \le \frac{1}{K} \mathbb{E}\sum_{k=1}^K \tilde f(x^k) - \tilde f(x^*)\\ \le \frac{1}{K} \mathbb{E}\sum_{k=1}^K \nabla \tilde f(x^k)^\top (x^k-x^*) = \frac{1}{K} \mathbb{E}\sum_{k=1}^K \nabla f^k(x^k)^\top (x^k-x^*) \end{align*}\]

Using $x^{k+1}-x^k = -\eta\nabla f^k(x^k)$ and $2u^\top v = \|u\|_2^2+\|v\|_2^2-\|u-v\|_2^2$, we calculate

\[\begin{align*} \nabla f^k(x^k)^\top (x^k-x^*) &= \frac{1}{\eta}(x^k-x^{k+1})^\top (x^k-x^*)\\ &= \frac{1}{2\eta}(\|x^k-x^{k+1}\|_2^2+\|x^k-x^*\|_2^2-\|x^*-x^{k+1}\|_2^2)\\ &= \frac{1}{2\eta}(\|x^k-x^*\|_2^2-\|x^{k+1}-x^*\|_2^2)+\frac{\eta}{2}\|\nabla f^k(x^k)\|_2^2 \end{align*}\]

Inserting into the previous expression, we get a telescoping sum. We pick $\eta = \frac{\|x^1-x^*\|_2}{B\sqrt K}$ to minimize the final bound. Hence

\[\begin{align*} &\frac{1}{K} \mathbb{E}\sum_{k=1}^K \nabla f^k(x^k)^\top (x^k-x^*)\\ = &\frac{1}{2K\eta} \mathbb{E}\left(\|x^1-x^*\|_2^2-\|x^{K+1}-x^*\|_2^2\right) + \frac{\eta}{2K}\mathbb{E}\sum_{k=1}^K \|\nabla f^k(x^k)\|_2^2\\ \le &\frac{1}{2K\eta} \|x^1-x^*\|_2^2 + \frac{\eta B^2}{2} = \frac{\|x^1-x^*\|_2B}{\sqrt K}. \end{align*}\]

In summary, $\mathbb{E}\min_{k \in \{1,\dots,K\}} \tilde f(x^k) - \tilde f(x^*) \le \frac{\|x^1-x^*\|_2B}{\sqrt K}$. This shows that in expectation, we successfully optimize the function with rate $\frac{1}{\sqrt K}$.
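As a sanity check of the bound, here is a toy example of my own choosing (not from Bubeck's text). We take $f(x) = |x - z|$ in one dimension with $z$ uniform on $\{-1,+1\}$, so $\tilde f(x) = \frac12|x-1| + \frac12|x+1|$. The subgradients have magnitude at most $1$, so $B = 1$, and $x^* = 0$ is one of the minimizers. The function name `sgd_min_gap` is hypothetical.

```python
import math
import random

def f_tilde(x):
    """Expected loss E|x - z| for z uniform on {-1, +1}."""
    return 0.5 * abs(x - 1) + 0.5 * abs(x + 1)

def sgd_min_gap(x1, K, seed=0):
    """Run K steps of SGD and return (best optimality gap, theoretical bound)."""
    rng = random.Random(seed)
    B, x_star = 1.0, 0.0
    eta = abs(x1 - x_star) / (B * math.sqrt(K))  # Step size from the proof
    x, best = x1, float('inf')
    for _ in range(K):
        best = min(best, f_tilde(x))             # Track min over x^1, ..., x^K
        z = rng.choice([-1.0, 1.0])
        subgrad = (x > z) - (x < z)              # Subgradient of |x - z|
        x -= eta * subgrad
    return best - f_tilde(x_star), abs(x1 - x_star) * B / math.sqrt(K)

for K in [100, 10000]:
    gap, bound = sgd_min_gap(5.0, K)
    print(K, gap, bound)
```

This toy problem is much easier than the worst case, so the observed gap sits far below the bound. The bound is a guarantee in expectation, not a prediction of typical behavior.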

Start here: Why I care about implicit biases

2022-08-29T00:00:00+00:00

My research, and this blog, are centered around implicit biases in deep learning. But what even are “implicit biases”, and why do I care about them? In this post, I try to explain my motivations, why I think understanding implicit biases is the key to unlocking the potential of deep learning.

Why I care about deep learning

It works really well for some problems. Just look at the following image generated using deep learning (by Midjourney):

Making a program to automatically draw a beautiful image from a text prompt would be practically impossible, before deep learning came along and did it. Other problems where deep learning is miles ahead of the competition include image classification, playing games from pixels and lossless text compression. I believe deep learning will continue to deliver breakthroughs in other areas, and I’m excited to see what those are.

The problem with deep learning

Ok, so deep learning has amazing potential, what are the problems we need to overcome to achieve that potential? In my opinion, the main problem with modern deep learning is the huge amount of engineering effort it currently requires.

To get the best practical performance from deep learning, you need to add a bunch of small (but hugely important) tricks. These tricks take time to learn, and take time to apply and adjust to new problems. An example of such a trick is using data augmentation when training an image classifier. Concretely, a simple data augmentation would be adding horizontally flipped images to your dataset, effectively doubling the size of your dataset.

Another problem is that modern deep learning requires enormous computations. More compute gives better results, so naturally you put in as much compute as you can afford. In practice, this means waiting hours or days for a model to train. This drastically increases the time required to test new code, and in general slows down development.

Tuning the hyperparameters is also a time-consuming part of modern deep learning. The performance of a deep learning model critically depends on numerous tuning parameters, which need to be carefully chosen when applying the model to a new problem. Here is a list of some common hyperparameters:

Hyperparameter	Typical value	Affects model expressivity
Learning rate	3e-4	No
Momentum	0.9	No
Learning rate schedule	Cosine	No
Optimizer	Adam	No
Batch size	32	No
Number of training epochs	300	No
Weight decay	0	No
Activation function	ReLU	Yes
Weight initialization	Xavier	No
Feature dimension	512	Yes
Label smoothing	0.1	No
Dropout probability	0.2	No

In practice, these hyperparameters are chosen using a combination of the engineer’s experience, and repeatedly testing the model with different hyperparameter configurations. Note that changing one parameter might change the effects of other parameters, making it exponentially harder. Recall that testing the model is computationally expensive and slow. The combination makes for a painful engineering experience. The “correct” solution is to run an automated search through many hyperparameter combinations, and pick the best. However, that is computationally expensive. So you would generally rather train a more computationally expensive model giving better results.

I believe the applicability of deep learning is severely limited by the huge amount of engineering effort it requires. So what is the solution?

The need for theory

Let’s compare with classical machine learning algorithms, things like linear regression (on possibly nonlinear, hand-engineered features). Applying those algorithms can often be reduced to minimizing some convex loss function $L$ plus some convex regularizer $R$ times some scalar weight $\lambda$,

\[\min_\theta L(\theta)+\lambda R(\theta).\]

We can then use some numerical optimization algorithm to find the unique minimum at parameter configuration $\theta^*$, and use those parameters to produce predictions. There are several hyperparameters in the optimization algorithm, but those only affect the time to convergence, so they can be left to reasonable default values. Mathematicians proved correctness of the optimization algorithms, so the right $\theta^*$ is found every time. This mature theory means practitioners can focus most of their effort on making good models, since then applying the models is relatively straightforward.

But modern deep learning is also optimizing a loss function? We have optimizers which can be guaranteed to find (local) minima. What’s the problem? The problems start when we realize that modern deep learning methods can have way more parameters than the number of data points they are trying to fit. Even for small datasets like CIFAR with 50 000 training images, deep learning models use millions of parameters. The models are overparameterized. Deep learning models can fit random training labels.

Because of overparameterization, there are many different parameter configurations giving 0 training loss (or arbitrarily small loss in the case of cross entropy classification loss). To make matters worse, deep learning models don’t seem to require explicit regularizers (specifically, weight decay is optional). As a result, in deep learning, our training algorithm and hyperparameters do affect what model we end up with. I call it the implicit bias which determines the final model, among the many models minimizing the loss function. The long list of hyperparameters in the table above can (and empirically does!) affect the implicit bias and performance of the final model.

Classically, the main important aspect of a model is what kind of functions it can express, its expressivity. However, in deep learning, most of the important choices don’t even affect the expressivity, they only affect the implicit bias (see the table above). Sadly, we have no better description of this implicit bias than “run exactly this algorithm with these hyperparameters, that should give you the implicit bias baked into your final model”.

As a concrete example of why this is a problem, say you made a fantastic new second order optimization algorithm that optimizes the loss function 10x faster than standard first order methods! The current deep learning models were developed and compared under the implicit bias given by current optimizers. Chances are that your second order optimizer significantly changes that implicit bias, giving worse performance, since the model was not built for the new implicit bias. The algorithm couldn’t be used.

Let’s say we found a way to better characterize the implicit bias of modern deep learning models. Maybe we could improve it. Maybe we could train models faster, without losing out on performance. Maybe we could get rid of annoying hyperparameters. I want to find out.

Technical: Deep Linear Networks with label noise minimize the nuclear norm

2022-08-10T00:00:00+00:00

The implicit biases in (stochastic) gradient descent with large learning rates often reduce to regularizers preferring “flat minima”. See for example my blog with label noise. The regularizers depend on the network architecture, so I find it interesting to characterize them for tractable cases such as Deep Linear Networks.

This blog is more technical, mathy and less standalone than previous blogs. However, to my knowledge only the two layer case has been analyzed at the time of writing, so I thought I would put this more general result out there.

Deep Linear Networks (DLNs)

We already encountered DLNs in a previous blog post. However, we will introduce notation useful for this post. Let’s consider an $L$-layer DLN with dimensions $d_0,d_1,\dots,d_L$. I.e we are parameterizing a $d_0 \times d_L$ matrix $\tilde W$ as

\[\tilde W = W_1 W_2 \cdots W_L\]

where $W_i$ is a $d_{i-1}\times d_i$ matrix. We will assume $L \ge 2$ and $d_i \ge \min\{d_0,d_L\}$, so the model is overparameterized.

The regularizer we will analyze is inspired by the following slight generalization of matrix completion. Given a $d_0$-dimensional vector $x$ and a $d_L$ dimensional vector $y$ our model predicts $P(x,y) = x^\top \tilde W y$. For a set of $n$ data points $(x_i,y_i)$ with targets $t_i$ we can then do regression with the loss $\sum_{i=1}^n (P(x_i,y_i)-t_i)^2$. If $x_i$ and $y_i$ are standard unit vectors, we get a formulation of matrix completion.

We can then use the same arguments as in the first blog post on implicit regularization by large learning rate to find the regularizer introduced by label noise and large learning rate, namely

\[R(W_1,\dots,W_L) = \sum_{i=1}^n \sum_{l=1}^L \|\nabla_{W_l} P(x_i,y_i)\|_F^2\]

where $\|\cdot\|_F$ is the Frobenius norm. To summarize roughly: among all the solutions $W_1,\dots,W_L$ minimizing the loss, gradient descent with label noise will pick the one minimizing $R(W_1,\dots,W_L)$.

The rest of the blog post will try to analyze the regularizer $R$. We will eventually see that it is similar to the nuclear norm regularizer $\|\tilde W\|_*$, which is well known in matrix completion and is known to encourage low rank.

As a first step we differentiate $P$ and split the resulting Frobenius norm to get

\[\sum_{i=1}^n \sum_{l=1}^L \|\nabla_{W_l} P(x_i,y_i)\|_F^2 = \sum_{i=1}^n \sum_{l=1}^L \|x_i^\top W_1\cdots W_{l-1}\|_2^2 \|W_{l+1}\cdots W_L y_i\|_2^2.\]
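The factorization of the gradient norm is easy to check numerically for a single data point. The following NumPy sketch uses random matrices with made-up dimensions. It relies on $P$ being linear in each $W_l$, so the partial derivative with respect to entry $(j,k)$ of $W_l$ is just $P$ evaluated with $W_l$ replaced by the unit matrix $E_{jk}$. The helper names are my own.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
dims = [3, 5, 4, 6]  # d_0, ..., d_L with L = 3 (arbitrary toy sizes)
Ws = [rng.standard_normal((dims[i], dims[i + 1])) for i in range(len(dims) - 1)]
x, y = rng.standard_normal(dims[0]), rng.standard_normal(dims[-1])

def chain(mats, size):
    """Product of the matrices in mats; identity of the given size if mats is empty."""
    return reduce(np.matmul, mats, np.eye(size))

def grad_sq_norm(l):
    """Squared Frobenius norm of the gradient of P w.r.t. Ws[l] (0-based index)."""
    rows, cols = Ws[l].shape
    grad = np.zeros((rows, cols))
    for j in range(rows):
        for k in range(cols):
            E = np.zeros((rows, cols))
            E[j, k] = 1.0
            grad[j, k] = x @ chain(Ws[:l] + [E] + Ws[l + 1:], dims[0]) @ y
    return np.sum(grad**2)

def factorized(l):
    """Product of the two squared vector norms on the right-hand side."""
    left = x @ chain(Ws[:l], dims[0])           # x^T W_1 ... W_{l-1}
    right = chain(Ws[l + 1:], dims[l + 1]) @ y  # W_{l+1} ... W_L y
    return np.sum(left**2) * np.sum(right**2)

for l in range(len(Ws)):
    print(l + 1, grad_sq_norm(l), factorized(l))
```

The two columns agree because the gradient with respect to $W_l$ is the outer product of the two vectors, and the squared Frobenius norm of an outer product is the product of the squared norms.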

Assumption

To get a nice expression for the regularizer, we will make the following assumption: There exist matrices $X$ and $Y$ such that

\[\sum_{i=1}^n\ (x_i x_i^\top) \otimes (y_i y_i^\top) = (X^\top X) \otimes (Y Y^\top)\]

where $\otimes$ is the Kronecker product. We will furthermore assume that $X$ and $Y$ are square, invertible matrices. I expect the results in this blog to also hold for singular $X$ and $Y$, but that introduces some technical difficulties.

In my intuition, $X^\top X$ and $Y Y^\top$ are basically the covariances of the vectors $\{x_i\}_i$ and the vectors $\{y_i\}_i$. The assumption is an independence assumption between $\{x_i\}_i$ and $\{y_i\}_i$, it is pretty much saying $\mathbb{E}\big((x x^\top) \otimes (y y^\top)\big) = \mathbb{E}(x x^\top)\otimes \mathbb{E}(y y^\top)$.

While this assumption is often not exactly satisfied, I hope it is approximately satisfied when $x_i$ and $y_i$ are sampled independently, such as usually done in matrix completion. My attempts at $L > 2$ without the assumption ended in ugly tensor norms which were hard to work with and interpret.

Using the assumption, we may simplify

\[\sum_{i=1}^n \sum_{l=1}^L \|x_i^\top W_1\cdots W_{l-1}\|_2^2 \|W_{l+1}\cdots W_L y_i\|_2^2 = \sum_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2.\]
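One case where the assumption holds exactly is a data set containing every combination of a fixed set of $x$ vectors with a fixed set of $y$ vectors. The sketch below checks this with NumPy, and also checks the simplified expression for one pair of arbitrary matrices standing in for $W_1\cdots W_{l-1}$ and $W_{l+1}\cdots W_L$. The dimensions and the names `A` and `B` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, dL = 3, 4
X = rng.standard_normal((d0, d0))  # Rows are the distinct x vectors
Y = rng.standard_normal((dL, dL))  # Columns are the distinct y vectors

# Data set: every x paired with every y
pairs = [(X[p], Y[:, q]) for p in range(d0) for q in range(dL)]

# The assumption: sum of Kronecker products equals one Kronecker product
lhs = sum(np.kron(np.outer(x, x), np.outer(y, y)) for x, y in pairs)
rhs = np.kron(X.T @ X, Y @ Y.T)
print(np.allclose(lhs, rhs))

# The simplification, with A and B standing in for the partial products
A = rng.standard_normal((d0, 5))
B = rng.standard_normal((5, dL))
per_sample = sum(np.sum((x @ A)**2) * np.sum((B @ y)**2) for x, y in pairs)
frobenius = np.linalg.norm(X @ A)**2 * np.linalg.norm(B @ Y)**2
print(np.isclose(per_sample, frobenius))
```

For randomly sampled pairs instead of a full grid, the two sides would only agree approximately.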

Lower bound

In this section we will use some inequalities to find a tight lower bound on $R$. Using the AM–GM inequality we have

\[\sum_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2 \ge L \left(\prod_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2\right)^{\frac{1}{L}}.\]

Rearrange the product

\[\begin{align} &L \left(\prod_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2\right)^{\frac{1}{L}}\\ =\ &L \left(\|X\|_F\|Y\|_F\prod_{l=1}^{L-1} \|X\ W_1\cdots W_l\|_F \|W_{l+1}\cdots W_L Y\|_F\right)^{\frac{2}{L}}. \end{align}\]

Use $\|A\|_F\|B\|_F \ge \|AB\|_*$ where $\|\cdot\|_*$ is the nuclear norm

\[\begin{align} &L \left(\|X\|_F\|Y\|_F\prod_{l=1}^{L-1} \|X\ W_1\cdots W_l\|_F \|W_{l+1}\cdots W_L Y\|_F\right)^{\frac{2}{L}}\\ \ge\ &L \left(\|X\|_F\|Y\|_F\prod_{l=1}^{L-1} \|X\ W_1\cdots W_L Y\|_*\right)^{\frac{2}{L}}\\ =\ &L\left(\|X\|_F\|Y\|_F\right)^{\frac{2}{L}}\|X\tilde W Y\|_*^{2-\frac{2}{L}}. \end{align}\]

Construction achieving the bound

In this section we will give a construction which achieves the lower bound, showing that it is tight. Let’s say we are given some $\tilde W$ and need to find $W_1 \cdots W_L = \tilde W$ such that $R(W_1,\dots,W_L)$ is minimized.

Let’s take the (compact) singular value decomposition of $X \tilde W Y = U \Sigma V^\top$. Here $U$ and $V$ are semi-orthogonal matrices with dimensions $d_0 \times r$ and $d_L \times r$ and $\Sigma$ is a diagonal $r \times r$ matrix with positive diagonal, where $r$ is the rank of $\tilde W$ (which is also the rank of $X \tilde W Y$ since $X$ and $Y$ are invertible). We can now construct

\[\begin{align} W_1 &= \alpha \begin{pmatrix}X^{-1}U\Sigma^{\frac{1}{2}} & \bf 0\end{pmatrix}\\ W_i &= \beta \begin{pmatrix}I_r & \bf 0 \\ \bf 0 & \bf 0\end{pmatrix} \hspace{2.5cm} i = 2,\dots,L-1\\ W_L &= \gamma \begin{pmatrix}\Sigma^{\frac{1}{2}}V^\top Y^{-1} \\ \bf 0\end{pmatrix} \end{align}\]

where $\beta = \left(\frac{\|X\tilde W Y\|_*}{\|X\|_F\|Y\|_F}\right)^{\frac{1}{L}}$, $\alpha = \sqrt{\frac{\|X\|_F}{\|Y\|_F}}\beta^{1-\frac{L}{2}}$ and $\gamma = \sqrt{\frac{\|Y\|_F}{\|X\|_F}}\beta^{1-\frac{L}{2}}$. $\bf 0$ are zero matrices of appropriate dimensions to match the dimensions of the left-hand side. $I_r$ is the $r\times r$ identity matrix.

It can be verified that $W_1 \cdots W_L = \tilde W$ and

\[\sum_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2 = L\left(\|X\|_F\|Y\|_F\right)^{\frac{2}{L}}\|X\tilde W Y\|_*^{2-\frac{2}{L}}.\]
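For readers who prefer a numerical check over doing the algebra, here is a NumPy sketch of the construction for $L = 3$ with small, arbitrarily chosen dimensions and random invertible $X$ and $Y$. The helpers `pad` and `prod` are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [4, 5, 6, 3]  # d_0, d_1, d_2, d_3 (toy sizes), so L = 3
L, r = len(dims) - 1, 2
W_tilde = rng.standard_normal((dims[0], r)) @ rng.standard_normal((r, dims[-1]))
X = rng.standard_normal((dims[0], dims[0]))
Y = rng.standard_normal((dims[-1], dims[-1]))
fro = np.linalg.norm  # Frobenius norm for matrices

U, s, Vt = np.linalg.svd(X @ W_tilde @ Y)
U, s, Vt = U[:, :r], s[:r], Vt[:r]  # Compact SVD, keeping the r nonzero singular values
nuc = s.sum()                       # Nuclear norm of X W~ Y

beta = (nuc / (fro(X) * fro(Y))) ** (1 / L)
alpha = np.sqrt(fro(X) / fro(Y)) * beta ** (1 - L / 2)
gamma = np.sqrt(fro(Y) / fro(X)) * beta ** (1 - L / 2)

def pad(M, shape):
    """Embed M in the top-left corner of a zero matrix of the given shape."""
    out = np.zeros(shape)
    out[:M.shape[0], :M.shape[1]] = M
    return out

sqrt_S = np.diag(np.sqrt(s))
Ws = [alpha * pad(np.linalg.inv(X) @ U @ sqrt_S, (dims[0], dims[1]))]
Ws += [beta * pad(np.eye(r), (dims[i], dims[i + 1])) for i in range(1, L - 1)]
Ws += [gamma * pad(sqrt_S @ Vt @ np.linalg.inv(Y), (dims[L - 1], dims[L]))]

def prod(mats, size):
    """Product of the matrices in mats; identity of the given size if mats is empty."""
    out = np.eye(size)
    for M in mats:
        out = out @ M
    return out

R = sum(fro(X @ prod(Ws[:l], dims[0]))**2 * fro(prod(Ws[l + 1:], dims[l + 1]) @ Y)**2
        for l in range(L))
bound = L * (fro(X) * fro(Y)) ** (2 / L) * nuc ** (2 - 2 / L)
print(np.allclose(prod(Ws, dims[0]), W_tilde), R, bound)
```

The check works because $\alpha\beta^{L-2}\gamma = 1$, so the scalars cancel in the product, while each of the $L$ terms of the sum comes out equal, which is exactly the equality case of AM–GM.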

Conclusion

We showed that the minimizing assignment of $W_1,\dots,W_L$ gives the regularizer

\[R(W_1,\dots,W_L) = L\left(\|X\|_F\|Y\|_F\right)^{\frac{2}{L}}\|X\tilde W Y\|_*^{2-\frac{2}{L}}.\]

Since multiplication by constants and positive powers don’t change the minimum, we are basically minimizing $\|X\tilde W Y\|_*$. For isotropic data we have $X \propto I$ and $Y \propto I$, so we get the usual nuclear norm regularizer $\|\tilde W\|_*$. The regularizer $\|X\tilde W Y\|_*$ can be interpreted as first normalizing the distributions of $x_i$ and $y_i$ to be isotropic, and then using the usual nuclear norm regularizer on the processed data. Specifically, let $\tilde W' = X\tilde W Y$ and consider a loss function $f$ depending only on the predictions $P(x_i,y_i) = x_i^\top \tilde W y_i$. Then the change of variables to the loss function and regularizer yields

\[f(\{x_i^\top\tilde W y_i\}_i) + \lambda \|X\tilde W Y\|_* = f(\{(x_i^\top X^{-1})\tilde W' (Y^{-1}y_i)\}_i) + \lambda \|\tilde W'\|_*.\]

Implicit bias by large learning rate: Noise can be helpful for gradient descent

2022-07-22T00:00:00+00:00

I will demonstrate how large learning rates can lead to implicit biases in a simple regression task. Code to reproduce the results can be found in spoilers. We will use pytorch.

The regression task

We generate normally distributed data points in $d = 30$ dimensions. The regression target is the square of the first feature dimension. The other 29 dimensions are only there to make the task harder. We generate $n = 200$ data points for training and another $200$ data points for testing.

Code to generate data

import torch as th
th.manual_seed(0)

d = 30  # Input dimension
n = 200 # Training examples / testing points

X = th.randn(n*2, d) # Generate random points
y = X[:,0]**2        # Ground truth
X_train, y_train = X[:n,:], y[:n] # Train/test split
X_test,  y_test  = X[n:,:], y[n:]

The neural network

We will use a single hidden layer neural network with a quadratic activation function. The hidden layer will have $m = 100$ nodes. The parameters of this network are a $d \times m$ matrix $A$ and an $m$-dimensional vector $b$. For a $d$-dimensional data point $X$, the model predicts

\[\text{predict}(X) = \sum_{i=1}^m b_i (A_i \cdot X)^2.\]

This neural network should be well suited to solve the regression task since it is straightforward to find values for the parameters that solve the task exactly, for example we may pick $A_{11} = b_1 = 1$ and set everything else to zero.

We initialize $A$ by Xavier/Glorot normal initialization and $b$ to zero, i.e $A_{ij} \sim \mathcal{N}(0,\frac{2}{d+m})$ and $b_i = 0$.

Code to make the neural network

def makeModel(seed = 0):
  m = 100 # Number of hidden nodes
  A = th.zeros(d,m, requires_grad=True)
  b = th.zeros(m, requires_grad=True)
  th.manual_seed(seed)
  th.nn.init.xavier_normal_(A)
  parameters = [A,b]
  predict = lambda X : (X@A)**2 @ b
  return parameters, predict

To fit the model we run gradient descent on the Mean Square Error (MSE) over the training data. We train until the MSE on the training data is below $10^{-4}$.

Code to train the neural network

def trainModel(parameters, predict, lr):
  loss = 1e100
  while loss > 1e-4:
    loss = th.mean((predict(X_train)-y_train)**2)
    loss.backward()
    with th.no_grad():
      for param in parameters:
        param -= lr * param.grad # Gradient descent
        param.grad[:] = 0
  return th.mean((predict(X_test)-y_test)**2) # Return test MSE

print('MSE = %.2f'%trainModel(*makeModel(), 0.01))

We choose the learning rate 0.01 and train the neural network. It scores MSE = 0.40 on the test set. This is quite bad, as we clearly did not solve the task. As a reference for the scale of MSE, the model always predicting 1 gets MSE = 1.97 .

Increasing learning rate

In deep learning, performance is often very dependent on hyperparameters. Let’s look at the effect of the learning rate.

Code to compare different learning rates

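import matplotlib.pyplot as plt
lr_list = th.arange(0.001,0.15,0.005)
mse_list = []
for lr in lr_list:
  mse = trainModel(*makeModel(), lr)
  print("MSE = %.2f"%mse)
  mse_list.append(float(mse)) # Record the test MSE for plotting
plt.plot(lr_list.tolist(), mse_list) # Test MSE as a function of learning rate
plt.show()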

For learning rates above 0.096 the optimization diverges, so we can’t go higher. We see that all learning rates up to 0.03 give the same bad MSE, but after that larger learning rates improve performance. Interestingly, we can actually solve the task, but only if we choose learning rates right at the edge of diverging.

It should be noted that while the above plot paints a deceptively simple picture, it is not true in general that higher learning rate is better. The experiment seems robust against random seeds for initialization and data generation, but is quite fragile against changes in other hyperparameters, such as the number of hidden neurons and scale of initialization.

Label noise

We can achieve a similar effect without huge learning rates by adding “label noise”. Label noise is a form of data augmentation, where we generate more training data by modifying the original data. In each iteration of gradient descent, we will replace the training targets by the original targets plus some noise. We choose standard normal noise.

Code to train with label noise

def trainModelWithLabelNoise(parameters, predict, lr, steps):
  for _ in range(steps):
    y = y_train + th.randn(y_train.shape)
    loss = th.mean((predict(X_train) - y)**2)
    loss.backward()
    with th.no_grad():
      for param in parameters:
        param -= lr * param.grad # Gradient descent
        param.grad[:] = 0
  return th.mean((predict(X_test)-y_test)**2) # Return test MSE

print('MSE = %.2f'%trainModelWithLabelNoise(*makeModel(), lr = 0.03, steps = 100000))

Training with learning rate 0.03 (which previously gave MSE = 0.40) now gives MSE = 0.01! However, notice it required 100000 gradient descent steps to achieve this. Without label noise we only needed 1000 steps to converge (then the gradient is practically zero, meaning that parameters stop changing and running longer doesn’t change anything). Lower learning rates require even more steps.

The explanation

We will derive an explicit regularizer which approximates the implicit regularization by large step gradient descent.

Let’s think of gradient descent as an approximation to gradient flow (think of gradient descent with infinitesimal step length). We have some loss function $L$ which we optimize over some parameters $\theta$. In our case $L = \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i)^2$, where $P(X)$ is shorthand for $\text{predict}(X)$, and we can stack the parameters into a vector $\theta = (A,b)$.

Consider a single step of gradient descent. We start the step with parameters $\theta(0)$ and go to $\theta(0)-\eta\nabla L$, where $\eta$ is the learning rate. The gradient flow solution satisfies $\dot{\theta} = -\nabla L$. Differentiating again we get $\ddot{\theta} = -\nabla^2 L\dot{\theta} = \nabla^2 L \nabla L$. We can therefore Taylor expand

\[\theta(\eta) = \theta(0) + \eta\dot{\theta} + \frac{\eta^2}{2}\ddot{\theta} + O(\eta^3) = \theta(0) - \eta\nabla L + \frac{\eta^2}{2}\nabla^2 L\nabla L + O(\eta^3).\]

We see that the gradient descent step only captures the first two terms, giving a truncation error of $O(\eta^2)$.

Let’s change the loss function to

\[\tilde{L} = L + \frac{\eta}{4}\|\nabla L\|_2^2\]

and look at the new gradient flow $\dot{\theta} = -\nabla \tilde{L} = -\nabla L - \frac{\eta}{4}\nabla \|\nabla L\|_2^2 = - \nabla L - \frac{\eta}{2}\nabla^2 L \nabla L$. Differentiating again we have $\ddot{\theta} = \nabla^2 L \nabla L + O(\eta)$. The Taylor expansion is now

\[\theta(\eta) = \theta(0) + \eta(- \nabla L - \frac{\eta}{2}\nabla^2 L \nabla L) + \frac{\eta^2}{2}\nabla^2 L\nabla L + O(\eta^3) = \theta(0) - \eta\nabla L + O(\eta^3).\]

The gradient descent step with respect to the original loss approximates the gradient flow with respect to the new loss with truncation error only $O(\eta^3)$!
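We can watch the two truncation errors on a one-dimensional toy loss. The sketch below is my own illustration, not taken from the paper: it uses $L(\theta) = \theta^4/4$, integrates both gradient flows for time $\eta$ with a standard RK4 integrator, and compares the end points with a single gradient descent step.

```python
def rk4_flow(vector_field, theta, time, substeps=1000):
    """Integrate d(theta)/dt = vector_field(theta) over the given time with RK4."""
    h = time / substeps
    for _ in range(substeps):
        k1 = vector_field(theta)
        k2 = vector_field(theta + 0.5 * h * k1)
        k3 = vector_field(theta + 0.5 * h * k2)
        k4 = vector_field(theta + h * k3)
        theta += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return theta

def grad(t):
    return t**3      # Gradient of L(theta) = theta^4 / 4

def hess(t):
    return 3 * t**2  # Second derivative of L

def errors(eta, theta0=1.0):
    """Distance from one GD step to the flow of L and to the flow of the modified loss."""
    gd_step = theta0 - eta * grad(theta0)
    flow = rk4_flow(lambda t: -grad(t), theta0, eta)
    modified = rk4_flow(lambda t: -grad(t) - 0.5 * eta * hess(t) * grad(t), theta0, eta)
    return abs(gd_step - flow), abs(gd_step - modified)

for eta in [0.1, 0.05, 0.025]:
    print(eta, *errors(eta))
```

If the derivation is right, halving $\eta$ should shrink the first error by roughly a factor 4 and the second by roughly a factor 8, at least once $\eta$ is small enough.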

In this view, it is hence more accurate to say that gradient descent is optimizing $\tilde{L} = L + \frac{\eta}{4}\|\nabla L\|_2^2$ than $L$. So we implicitly added the regularizer $\frac{\eta}{4}\|\nabla L\|_2^2$. When the learning rate $\eta$ is large, this regularizer is not negligible.

I found the neat derivation presented above in Implicit Gradient Regularization, there you can find more details.

Label noise

But wait, optimizing $L$ is the same as optimizing $\tilde{L}$, right? A minimizer of $L$ has gradient zero, so it is also a minimizer of $\tilde{L}$. Well, we are in the overparameterized setting, so the optimization path might change which minimizer we end up at. When we add label noise it becomes clearer.

With label noise we have $L = E_{z_i \sim \mathcal{N}(0,1)}\frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)^2$ and $\tilde{L} = L + \frac{\eta}{4} E_{z_i \sim \mathcal{N}(0,1)}\|\nabla \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)^2\|_2^2$. After a bit of calculation, we may simplify this to

\[\tilde{L} = \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i)^2 + \eta\frac{1}{n^2}\sum_{i=1}^n \|\nabla P(X_i)\|_2^2 + \eta\Big\|\frac{1}{n}\sum_{i=1}^n (P(X_i)-y_i)\cdot \nabla P(X_i)\Big\|_2^2 + 1.\]

The $+1$ doesn’t matter for optimization, so we remove it. At convergence $P(X_i) \approx y_i$, so we have

\[\tilde{L} \approx \eta\frac{1}{n^2}\sum_{i=1}^n \|\nabla P(X_i)\|_2^2\]

One way to interpret this is that among the parameter configurations fitting the training data, we choose the one minimizing the gradient of the neural network output with respect to the parameters. We may further write out and analyze $\|\nabla P(X_i)\|_2^2$ for the case of our neural network, and that will likely show why this regularizer is useful for solving our regression task. I will not do that here.

One of the main insights of this implicit regularizer is the multiplicative factor $\eta$. It implies the strength of the regularizer is proportional to the learning rate. Consequently, while we can often fit the training data in roughly $\frac{1}{\eta}$ iterations of gradient descent, it might take $\frac{1}{\eta^2}$ iterations to optimize the regularizer.

Comparing with small initialization

If you read the previous blog post, you might wonder if we can apply the implicit bias given by small initialization to solve the regression task given here. Indeed you can! Simply scaling down the outputs of the neural network by a factor of 100, we get MSE = 0.00 with learning rate 0.01.

Code to test small initialization

def makeModelSmallInit(seed = 0):
  m = 100 # Number of hidden nodes
  A = th.zeros(d,m, requires_grad=True)
  b = th.zeros(m, requires_grad=True)
  th.manual_seed(seed)
  th.nn.init.xavier_normal_(A)
  parameters = [A,b]
  predict = lambda X : (X@A)**2 @ b / 100
  return parameters, predict

print('MSE = %.2f'%trainModel(*makeModelSmallInit(), 0.01))

When writing the previous blog entry I tried the reverse: applying regularization by large step sizes and label noise to the classification task. However, this implicit bias was not useful for solving that task.

Implicit bias by small initialization: A simple classification task where deeper is better

2022-07-06T00:00:00+00:00

In this blog post I will show an example of implicit bias on a synthetic classification task. I will put code to reproduce the results in spoilers. We will implement everything using pytorch.

Code to import packages

import torch as th
import torch.nn.functional as F
th.manual_seed(0)
import numpy as np

The classification task

Let’s classify $d = 10$ dimensional input features into $k = 10$ classes using $n = 100$ training examples, and generate $n = 100$ data points for testing.

Code to set constants

d = 10  # Input dimension
k = 10  # Classes
n = 100 # Training examples / testing points

We generate independently normally distributed data points in d = 10 dimensions. The classes are determined by the direction of the first 2 dimensions. Note that the other 8 dimensions are only there to confuse the classifier. In the plots below: on the left we see that the classes are separated into 10 pizza slices by angle of the first 2 feature dimensions, on the right we see other dimensions which are useless for classification.

Code to generate data

X = th.randn(n*2, d) # Generate random points
y = ((th.atan2(X[:,0],X[:,1])/np.pi+1)/2*k).long() # Classify points
X_train, y_train = X[:n,:], y[:n] # Train/test split
X_test,  y_test  = X[n:,:], y[n:]
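To make the labelling rule concrete, here is a small worked example of the angle-to-class formula on three hand-picked points. It is a sketch that uses NumPy's `arctan2` in place of `th.atan2`; the helper name `angle_class` is only for illustration.

```python
import numpy as np

k = 10  # number of classes, as in the post

def angle_class(x0, x1):
    # Same formula as in the data generation code, applied to a single point
    return int((np.arctan2(x0, x1) / np.pi + 1) / 2 * k)

print(angle_class(0.0, 1.0))   # angle 0     -> class 5
print(angle_class(1.0, 0.0))   # angle pi/2  -> class 7
print(angle_class(-1.0, 0.0))  # angle -pi/2 -> class 2
```

Each class covers an angular slice of width $2\pi/k$, and only the first two coordinates enter the formula.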

The models

First, let us consider the linear classifier given by a $d \times k$ matrix $W$, where we predict the class of the data point $X$ as the class whose column in $W$ has the maximum inner product with $X$, i.e.

\[\text{predict}(X) = \text{argmax}_{i \in \{1,\dots,k\}} X \cdot W_i.\]

We optimize the model by gradient descent until we have mean cross entropy loss at most 0.01 on the training data. Note that 0.01 cross entropy is quite low: in practice it means we perfectly fit the training data (100% accuracy), which is only possible since our models are in the overparameterized regime.

Code to train model

def trainModel(parameters, predict, lr):
  loss = 1e100
  while loss > 0.01: # Optimize until mean cross entropy loss is <= 0.01
    loss = F.cross_entropy(predict(X_train), y_train)
    loss.backward()
    with th.no_grad():
      for param in parameters:
        param -= lr * param.grad # Gradient descent
        param.grad[:] = 0
  # Return the number of correct test predictions (equals accuracy in % since n = 100)
  return th.sum(th.argmax(predict(X_test), dim=1) == y_test).item()

We can now train the simple linear model.

Code to test simple linear model

W = th.zeros(d,k, requires_grad=True)
parameters = [W]
predict = lambda X : X@W
print("Test accuracy:", trainModel(parameters, predict, lr=10), '%')

We get an accuracy of 53%. The hyperparameters for learning rate (lr) and stopping threshold (0.01) don’t seem to matter much if they are small enough (but making them smaller takes more time to optimize).

Let us now apply the rule of thumb from deep learning that “deeper is better” and add another layer. We parameterize $W = AB$ for a $d \times k$ matrix $A$ and $k\times k$ matrix $B$.

\[\text{predict}(X) = \text{argmax}_{i \in \{1,\dots,k\}} X \cdot [AB]_i.\]

For the previous parameterization, the optimization problem was convex, but this time it is not. This means initialization matters: in particular, if we initialize $A = B = 0$, the gradient is 0 and we do not get anywhere. Let’s therefore use a standard initialization from deep learning called Xavier/Glorot normal initialization. Let us also run 10 times to average out the random initialization.
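The claim about zero initialization can be checked directly. The sketch below writes out the gradients of the mean cross entropy for logits $XAB$ by hand in NumPy (rather than using pytorch autograd) and evaluates them at $A = B = 0$ and at a Xavier-style random initialization. The labels are random placeholders, and the helper name `gradients` is an assumption made for illustration.

```python
import numpy as np

def gradients(X, y, A, B):
    """Gradients of the mean cross entropy of logits X @ A @ B w.r.t. A and B."""
    n, k = X.shape[0], B.shape[1]
    logits = X @ A @ B
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)            # softmax probabilities
    G = (p - np.eye(k)[y]) / n                   # dLoss/dlogits
    return X.T @ G @ B.T, (X @ A).T @ G          # dLoss/dA, dLoss/dB

rng = np.random.default_rng(0)
n, d, k = 100, 10, 10
X = rng.standard_normal((n, d))
y = rng.integers(0, k, size=n)                   # placeholder labels

gA, gB = gradients(X, y, np.zeros((d, k)), np.zeros((k, k)))
print(np.abs(gA).max(), np.abs(gB).max())        # both are exactly zero

A = rng.standard_normal((d, k)) * (2 / (d + k))**0.5  # Xavier-style scale
B = rng.standard_normal((k, k)) * (2 / (k + k))**0.5
gA, gB = gradients(X, y, A, B)
print(np.abs(gA).max() > 0, np.abs(gB).max() > 0)
```

The gradient with respect to $A$ carries a factor $B^\top$ and the gradient with respect to $B$ carries a factor $A^\top$, which is why both vanish when $A = B = 0$.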

Code to test 2 layer model

scores = []
for _ in range(10):
  A = th.zeros(d,k, requires_grad=True)
  B = th.zeros(k,k, requires_grad=True)
  th.nn.init.xavier_normal_(A) # Equivalent to A = th.randn(d,k)*(2/(d+k))**.5
  th.nn.init.xavier_normal_(B) # Equivalent to B = th.randn(k,k)*(2/(k+k))**.5
  parameters = [A,B]
  predict = lambda X : X@A@B
  scores.append(trainModel(parameters, predict, lr=1))
print("Test accuracy: %.1f ± %.1f %%"%(np.mean(scores), np.std(scores)/len(scores)**.5))

We get an accuracy of 60.3% with standard error 0.6%. The accuracy improved! How can this happen? Classical intuition tells us that parameterizing a matrix as a product of matrices is useless, as we can only represent exactly the same set of classifiers. The key is that we are in the overparameterized regime, where gradient descent has many solutions to choose from. The parameterization changes the implicit bias of the model when it is trained by gradient descent, causing it to choose a different solution.

Let’s go deeper!

Clearly more layers give better accuracy.

Code to test models of various depths

for L in [1,2,3,4,5,6]:
  scores = []
  for _ in range(10):
    layers = []
    for l in range(L):
      layers.append( th.zeros(d if l==0 else k, k, requires_grad=True) )
      th.nn.init.xavier_normal_(layers[l])

    def predict(X):
      product = X
      for layer in layers: product = product @ layer
      return product

    if L == 1: lr = 10
    elif L == 2: lr = 1
    else: lr = 3e-2
    scores.append(trainModel(layers, predict, lr))
  print("Depth %d: test accuracy: %.1f ± %.1f %%"%(L, np.mean(scores), np.std(scores)/len(scores)**.5))

The explanation

To understand why depth improves generalization accuracy, we need to note that our classification problem has “low rank” in a certain sense. Consider the optimal classifier matrix $W$ with respect to

\[\text{predict}(X) = \text{argmax}_{i \in \{1,\dots,k\}} X \cdot W_i.\]

The $i$-th column (so the direction of class $i-1$ in the 0-indexed code) of this matrix is given by $W_i = \left(\sin\left(\frac{\pi(2i-1-k)}{k}\right), \cos\left(\frac{\pi(2i-1-k)}{k}\right), 0, \dots, 0\right)^\top$. Note that since only the first two rows of $W$ are non-zero, we have $\text{rank}(W) = 2$. Intuitively, there are far fewer matrices with rank 2 than general matrices, so if we can somehow implicitly bias our model towards low rank matrices (or preferably rank 2 matrices), we will likely get better classification accuracy.
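As a sanity check, we can build this matrix and confirm both that it has rank 2 and that it classifies freshly generated points correctly. This is a NumPy sketch that reuses the labelling formula from the data generation code; the sample size of 1000 is an arbitrary choice.

```python
import numpy as np

d, k = 10, 10
# Centre angle of the slice for 0-indexed class c (this is i = c + 1 in the formula)
centers = np.pi * (2 * np.arange(k) + 1 - k) / k
W = np.zeros((d, k))
W[0] = np.sin(centers)  # the first feature pairs with sin
W[1] = np.cos(centers)  # the second feature pairs with cos

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, d))
y = ((np.arctan2(X[:, 0], X[:, 1]) / np.pi + 1) / 2 * k).astype(int)

print(np.linalg.matrix_rank(W))                # rank of the ideal classifier
print(np.mean(np.argmax(X @ W, axis=1) == y))  # fraction classified correctly
```

The inner product $X \cdot W_i$ is proportional to the cosine of the angle between the point and the centre of slice $i$, so the argmax picks the slice the point lies in.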

To demonstrate, let’s explicitly force $W$ to be rank 2 by factoring it into a $d \times 2$ matrix $A$ and a $2 \times k$ matrix $B$ so $W = AB$, and then repeat our experiment.

Code to test rank 2 model

scores = []
for _ in range(10):
  A = th.zeros(d,2, requires_grad=True)
  B = th.zeros(2,k, requires_grad=True)
  th.nn.init.xavier_normal_(A)
  th.nn.init.xavier_normal_(B)
  parameters = [A,B]
  predict = lambda X : X@A@B
  scores.append(trainModel(parameters, predict, lr=1))
print("Test accuracy: %.1f ± %.1f %%"%(np.mean(scores), np.std(scores)/len(scores)**.5))

85% accuracy! That’s the best we’ve seen. Ok, but our other deep factorizations didn’t force the matrix to be low rank, so how did they exploit the low rank property?

That is most easily seen if we scale down the outputs of a three layer factorization $W = \frac{1}{10^4}ABC$.

Code for near-zero initialized three layer factorization

scores = []
for _ in range(10):
  A = th.zeros(d,k, requires_grad=True)
  B = th.zeros(k,k, requires_grad=True)
  C = th.zeros(k,k, requires_grad=True)
  th.nn.init.xavier_normal_(A)
  th.nn.init.xavier_normal_(B)
  th.nn.init.xavier_normal_(C)
  parameters = [A,B,C]
  predict = lambda X : X@A@B@C / 1e4
  scores.append(trainModel(parameters, predict, lr=100))
print("Test accuracy: %.1f ± %.1f %%"%(np.mean(scores), np.std(scores)/len(scores)**.5))

We get an accuracy of about 84%. Plotting the singular values of the product matrix $\frac{1}{10^4}ABC$ during training, we see what is going on.

We see that the singular values show up one by one. When only the first singular value is non-negligible, we are effectively searching through rank 1 matrices. After a while, gradient descent introduces the next singular value to search through rank 2 matrices. Then the optimization terminates because it fits the training data. The remaining singular values are left around the initialization magnitude $\frac{1}{10^4}$.
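For reference, the singular values of the product matrix can be computed as in the following sketch. It uses NumPy with Xavier-style random matrices standing in for the pytorch parameters, and it only shows the state at initialization; recording the same quantity every few gradient steps would give curves like the ones described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 10
# Same scale as Xavier/Glorot normal initialization
xavier = lambda rows, cols: rng.standard_normal((rows, cols)) * (2 / (rows + cols))**0.5

A, B, C = xavier(d, k), xavier(k, k), xavier(k, k)
W = A @ B @ C / 1e4                                   # scaled-down three layer product

singular_values = np.linalg.svd(W, compute_uv=False)  # sorted in descending order
print(singular_values)
```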

In contrast to scaling down the outputs, if we switch $\frac{1}{10^4}$ to $10^3$ so $W = 10^3 ABC$ (and adapt the learning rate to lr = 0.001), we get accuracy 57.5% with standard error 1%. So we are back down to accuracies around the one layer case, since we removed most of the implicit bias.

Now you might wonder how we got any benefit from depth in our original setup, where we seemingly didn’t have small initialization. The clue is that it is really the relative size of initialization to final size which determines whether we are in the “small initialization regime”. Since our overfitting of the cross entropy loss results in the final matrix $W = ABC$ having singular values around size 100, the initialization of size around 1 becomes small in comparison. You can see this in the following plot of the evolution of singular values for the original 3 layer model.

The implicit bias demonstrated in this blog post has many names given by different researchers, such as near-zero / small initialization regime, following “saddle-to-saddle” dynamics, anti-NTK regime and rich regime / rich limit. The experiment itself was motivated by Arora’s paper on Implicit Regularization in Deep Matrix Factorization. In that paper, they describe the differential equations for the singular values and what causes them to show up one by one to give the low rank implicit bias. Simply put, increasing the depth causes the effect to strengthen. The experiment in this blog post is an adaptation of their experiment from matrix completion to classification.

Kernel methods are basically overparameterized linear regression

2022-06-29T00:00:00+00:00

In this blog post I explain kernel methods and the intuition that they’re “basically just overparameterized linear regression”. First, I will explain what I mean by kernel methods and overparameterized linear regression. I will view them as two methods for solving the following problem:

We are given $n$ data points $x_i$ where each data point is a d-dimensional vector. Each data point has a scalar target value $y_i$. For example, the data points could describe houses by size, location and year of construction and the target value could be the price of the house. Next, we are given a new data point $z$ (size, location and year of construction of a new house) and want to predict its target value (price of the new house).

Informal description of kernel methods for regression

We may apply a kernel method to this problem as follows:

Pick your favorite kernel function k(x,y). Intuitively, this is a function measuring the similarity between x and y, giving higher values to more similar inputs. A popular one is
\[k(x,y) = e^{-\|x-y\|_2^2}.\]
Build the $n \times n$ “kernel matrix” by
\[K_{ij} = k(x_i,x_j).\]
Solve for the coefficient vector $\alpha$
\[K \alpha = y.\]
Calculate similarities to the new data point $z$ and use this to produce a prediction
\[prediction = \sum_{i=1}^n \alpha_i k(x_i, z).\]

Overparameterized linear regression

Let us now instead apply linear regression to the problem. Since the features $x_i$ might be related to the target values $y_i$ in a nonlinear way, it is often useful to preprocess the features before fitting a linear model. For example, if $x_{i1}$ is the size of house $i$ and $x_{i2}$ its year of construction, we might want to add the feature $x_{i1}\cdot x_{i2}$ to account for nonlinear interactions between $x_{i1}$ and $x_{i2}$. Maybe we also want to add something like $\sin(x_{i2})$ or $\exp(x_{i1}+x_{i2})$ or the constant value $1$ (often called intercept). Or we just add all monomials ($x_{i1}^{p_1}x_{i2}^{p_2}\dots$) of the input features up to degree $10$. The possibilities are endless. Let’s assume we processed our old d-dimensional features $x_i$ into new m-dimensional features $\tilde{x}_i = \varphi(x_i)$, where $\varphi$ is our feature transform. Let’s also stack all these new features into a new $n \times m$ matrix $\tilde X$. Construct $\tilde{z} = \varphi(z)$ in the same way.

What if we end up with more features $m$ than data points $n$? This is what we call an overparameterized model. Usually this means we can perfectly fit all the targets $y_i$. Moreover, we can perfectly fit the targets in many different ways. It makes sense to pick some linear coefficients $\beta$ such that $\tilde{X}\beta = y$, i.e. we fit the data. And let’s pick the one among those with minimal 2-norm $\|\beta\|_2$. It turns out that we can calculate this by

\[\beta = \tilde{X}^\top(\tilde{X}\tilde{X}^\top)^{-1}y.\]

Then we can make the prediction

\[prediction = \tilde{z} \beta.\]
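The closed form for $\beta$ can be checked numerically. In the sketch below, a random matrix stands in for the feature matrix $\tilde{X}$ (an assumption for illustration), and the result is compared against NumPy's pseudoinverse, which returns the minimum-norm least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 10                                 # more features than data points
X_ = rng.standard_normal((n, m))             # stand-in for the feature matrix
y = rng.standard_normal(n)

beta = X_.T @ np.linalg.solve(X_ @ X_.T, y)  # the closed form for beta
beta_pinv = np.linalg.pinv(X_) @ y           # minimum-norm least-squares solution

print(np.allclose(X_ @ beta, y))             # beta fits the data
print(np.allclose(beta, beta_pinv))          # and agrees with the pseudoinverse

# Adding a null-space direction still fits the data but increases the norm
null_dir = rng.standard_normal(m)
null_dir -= np.linalg.pinv(X_) @ (X_ @ null_dir)
other = beta + null_dir
print(np.allclose(X_ @ other, y), np.linalg.norm(other) > np.linalg.norm(beta))
```

The last check illustrates why the solution is the minimal one: $\beta$ lies in the row space of $\tilde{X}$, so any other interpolating solution differs from it by an orthogonal null-space component.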

Numerical example

To illustrate the two methods described above, I used them to interpolate the function $\sin(2\pi x)$ in the interval $0 \le x \le 1$ from 5 evenly spaced points. In this case the features $x_i$ are simply the positions of the data points. For the overparameterized linear regression I used the 10 polynomial features $\left(\frac{x_i}{10}\right)^p$ for $p = 0,\dots,9$.

Here’s the python code:

import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(0,1,5)   # Training data
y = np.sin(2*np.pi*X)    # Training targets
Z = np.linspace(0,1,100) # Data points to predict

# Kernel method
k = lambda x,y : np.exp(-(x.reshape(-1,1)-y.reshape(1,-1))**2)
K = k(X,X)
alpha = np.linalg.solve(K, y)
kernel_prediction = alpha @ k(X,Z)

# Overparameterized linear regression
features = lambda x : (x.reshape(-1,1) / 10) ** np.arange(10).reshape(1,-1)
X_ = features(X)
beta = X_.transpose() @ np.linalg.solve(X_@X_.transpose(), y)
linear_prediction = features(Z) @ beta

plt.plot(X, y, 'o', label = 'Data points')
plt.plot(Z, kernel_prediction, '--', label = 'Kernel method')
plt.plot(Z, linear_prediction, '-.', label = 'Linear regression')
plt.plot(Z, np.sin(2*np.pi*Z), label = 'Original function')
plt.legend()
plt.show()

Now that I’ve described the two methods, we can get to the point.

Kernel methods are basically overparameterized linear regression

Recall $\varphi$ as the feature transform taking the old features to the new features in overparameterized linear regression. Now define the kernel function $k(x,y) = \varphi(x)^\top \varphi(y)$. Then the kernel matrix becomes $K = \tilde{X}\tilde{X}^\top$ and $\alpha = K^{-1}y = (\tilde{X}\tilde{X}^\top)^{-1}y$. The final prediction becomes

\[prediction = \tilde{z}\tilde{X}^\top(\tilde{X}\tilde{X}^\top)^{-1}y.\]

Does this look familiar? That’s because it is exactly the same prediction that we would get from doing overparameterized linear regression! So if we have a feature transform $\varphi$, we can construct a corresponding kernel function $k(x,y) = \varphi(x)^\top \varphi(y)$. What about the converse?
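Here is a quick numerical confirmation of this equivalence on the interpolation problem from the numerical example. As an assumption made for better numerical conditioning, this sketch uses the unscaled monomials $x^p$ for $p = 0,\dots,9$ as the feature transform rather than the scaled ones used earlier.

```python
import numpy as np

# Assumed feature map: monomials x^0, ..., x^9
phi = lambda x: x.reshape(-1, 1) ** np.arange(10).reshape(1, -1)
kernel = lambda x, y: phi(x) @ phi(y).T  # k(x,y) = phi(x)^T phi(y)

X = np.linspace(0, 1, 5)                 # training data
y = np.sin(2 * np.pi * X)                # training targets
Z = np.linspace(0, 1, 100)               # data points to predict

# Kernel method with the induced kernel
alpha = np.linalg.solve(kernel(X, X), y)
kernel_prediction = kernel(Z, X) @ alpha

# Overparameterized linear regression on the same features
beta = phi(X).T @ np.linalg.solve(phi(X) @ phi(X).T, y)
linear_prediction = phi(Z) @ beta

print(np.max(np.abs(kernel_prediction - linear_prediction)))
```

The printed difference should be at the level of floating point rounding error, since the two computations only differ in the order of the matrix multiplications.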

It turns out that under some technical assumptions (look up Mercer’s theorem if you’re interested) on the kernel function $k(x,y)$, we can find feature transforms $\varphi$ such that $k(x,y) \approx \varphi(x)^\top \varphi(y)$ to arbitrary accuracy. And importantly $\varphi$ can be chosen without knowing the data points $x_i$ or targets $y_i$. The trade-off is that, to get an accurate approximation of $k(x,y)$, the feature transform $\varphi$ might have to output a lot of features.
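One well-known construction of this kind for the kernel $k(x,y) = e^{-\|x-y\|_2^2}$ is random Fourier features. It is a randomized construction rather than the one given by Mercer’s theorem, and I only use it here as a sketch to illustrate the point: the features are drawn without looking at the data, and the approximation improves as the number of features $m$ grows. The dimensions and the value of $m$ below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 20000                               # input dimension, number of features

# Drawn once, without looking at any data points or targets
w = rng.standard_normal((d, m)) * np.sqrt(2)  # random frequencies ~ N(0, 2 I)
b = rng.uniform(0, 2 * np.pi, size=m)         # random phases

phi = lambda X: np.sqrt(2 / m) * np.cos(X @ w + b)

X = rng.standard_normal((5, d)) * 0.5         # some test points
exact = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=2))
approx = phi(X) @ phi(X).T

print(np.max(np.abs(exact - approx)))         # shrinks roughly like 1/sqrt(m)
```

The slow $1/\sqrt{m}$ decay of the error is one way to see the trade-off: high accuracy requires a very large number of features.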

We may pick some high accuracy $10^{-100}$, much higher than what we usually compute with, and find some $\varphi$ such that $\lvert k(x,y)-\varphi(x)^\top \varphi(y)\rvert < 10^{-100}$ for every relevant x and y. This might require $10^{10^{10}}$ features, but that’s fine. The point is that overparameterized linear regression on these features is for all practical purposes the same as the kernel method. Note again that the feature transform is independent of $x_i$ and $y_i$. Kernel methods hence inherit many properties and limitations of linear regression.

Some applications:

Linear regression (in view of predictions) only depends on the inner products between data points. For example, duplicating a feature is the same as scaling it up by $\sqrt 2$, and having features $x_{i1}$ and $x_{i2}$ is the same as having features $\frac{x_{i1}+x_{i2}}{\sqrt 2}$ and $\frac{x_{i1}-x_{i2}}{\sqrt 2}$.
In terms of notation, I often find it easier to work with (and think about) a single feature vector for each data point, than considering pair-wise similarities through a kernel function. I feel like it makes it easier to apply linear algebra notation and operations.
This one is a bit vague: Kernel methods can’t perform “feature learning”, i.e. they are stuck with a fixed set of features (given by the kernel function) in a way where they can’t disproportionately focus on the important features. This is in contrast to, for example, neural networks, which we hope will learn good representations of the data useful for transfer learning, ignoring irrelevant features, etc.
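The first application above (predictions only depend on inner products between data points) is easy to verify numerically. The sketch below checks on random data that the two feature transformations mentioned there leave all pairwise inner products unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))  # 6 data points with features x_1 and x_2
gram = lambda F: F @ F.T         # all pairwise inner products

# Duplicating a feature vs. scaling it up by sqrt(2)
duplicated = np.column_stack([X[:, 0], X[:, 0], X[:, 1]])
scaled = np.column_stack([np.sqrt(2) * X[:, 0], X[:, 1]])
print(np.allclose(gram(duplicated), gram(scaled)))

# Original features vs. normalized sum and difference
rotated = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 1]]) / np.sqrt(2)
print(np.allclose(gram(X), gram(rotated)))
```

Since the kernel matrix is exactly this matrix of inner products, the corresponding predictions are identical as well.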