Kernel methods are basically overparameterized linear regression
In this blog post I explain kernel methods and the intuition that they’re “basically just overparameterized linear regression”. First, I will explain what I mean by kernel methods and overparameterized linear regression. I will view them as two methods for solving the following problem:
We are given training data $(x_1, y_1), \ldots, (x_n, y_n)$, where the $x_i$ are inputs and the $y_i$ are real-valued targets, and we want to predict the target value at a new input $z$.
Informal description of kernel methods for regression
We may apply a kernel method to this problem as follows:
- Pick your favorite kernel function $k(x, y)$. Intuitively, this is a function measuring the similarity between $x$ and $y$, giving higher values to more similar inputs. A popular one is the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2)$.
- Build the "kernel matrix" $K$ by setting $K_{ij} = k(x_i, x_j)$.
- Solve $K \alpha = y$ for the coefficient vector $\alpha$.
- Calculate similarities to the new data point $z$, i.e. the vector $k(z)$ with entries $k(z)_i = k(x_i, z)$, and use this to produce a prediction $\hat{y}(z) = \sum_{i=1}^{n} \alpha_i\, k(x_i, z)$.
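In matrix notation (which will be convenient later), these steps combine into a single formula for the prediction at $z$:

$$
\hat{y}(z) \;=\; k(z)^\top \alpha \;=\; k(z)^\top K^{-1} y .
$$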
Overparameterized linear regression
Let us now instead apply linear regression to the problem. Since the inputs $x_i$ themselves may not be linearly related to the targets, we first apply a feature transform $\phi$, mapping each input $x_i$ to a feature vector $\phi(x_i) \in \mathbb{R}^d$, and collect these as the rows of an $n \times d$ matrix $\Phi$. We then look for a coefficient vector $\beta$ with $\Phi \beta = y$.
What if we end up with more features than data points, i.e. $d > n$? Then there are in general infinitely many solutions $\beta$, and a natural choice is the one with minimum norm, $\beta = \Phi^\top (\Phi \Phi^\top)^{-1} y$.
Then we can make the prediction $\hat{y}(z) = \phi(z)^\top \beta = \phi(z)^\top \Phi^\top (\Phi \Phi^\top)^{-1} y$ for a new input $z$.
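As a quick justification of the minimum-norm formula (a standard argument, added here for completeness): any component of $\beta$ orthogonal to the rows of $\Phi$ leaves $\Phi \beta$ unchanged while increasing $\|\beta\|$, so the minimum-norm solution has the form $\beta = \Phi^\top \alpha$. Substituting into $\Phi \beta = y$ gives

$$
\Phi \Phi^\top \alpha = y
\quad\Longrightarrow\quad
\beta = \Phi^\top (\Phi \Phi^\top)^{-1} y ,
$$

which is exactly what the line beta = X_.transpose() @ np.linalg.solve(X_@X_.transpose(), y) in the code below computes.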
Numerical example
To illustrate the two methods described above, I used them to interpolate the function $\sin(2\pi x)$ from five equally spaced sample points on $[0, 1]$.
Here's the Python code:
import numpy as np
import matplotlib.pyplot as plt
X = np.linspace(0,1,5) # Training data
y = np.sin(2*np.pi*X) # Training targets
Z = np.linspace(0,1,100) # Data points to predict
# Kernel method
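# Gaussian kernel k(x,y) = exp(-(x-y)^2), evaluated pairwise via broadcasting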
k = lambda x,y : np.exp(-(x.reshape(-1,1)-y.reshape(1,-1))**2)
K = k(X,X)
alpha = np.linalg.solve(K, y)
kernel_prediction = alpha @ k(X,Z)
# Overparameterized linear regression
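# Feature transform: 10 polynomial features of x/10, so d = 10 features > n = 5 data points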
features = lambda x : (x.reshape(-1,1) / 10) ** np.arange(10).reshape(1,-1)
X_ = features(X)
beta = X_.transpose() @ np.linalg.solve(X_@X_.transpose(), y)
linear_prediction = features(Z) @ beta
plt.plot(X, y, 'o', label = 'Data points')
plt.plot(Z, kernel_prediction, '--', label = 'Kernel method')
plt.plot(Z, linear_prediction, '-.', label = 'Linear regression')
plt.plot(Z, np.sin(2*np.pi*Z), label = 'Original function')
plt.legend()
plt.show()
Now that I’ve described the two methods, we can get to the point.
Kernel methods are basically overparameterized linear regression
Recall the prediction produced by the kernel method: $\hat{y}(z) = k(z)^\top K^{-1} y$, where $K_{ij} = k(x_i, x_j)$ and $k(z)_i = k(x_i, z)$.
Does this look familiar? That's because it is exactly the same prediction that we would get from doing overparameterized linear regression! So if we have a feature transform $\phi$ with $k(x, y) = \phi(x)^\top \phi(y)$, then $K = \Phi \Phi^\top$ and $k(z) = \Phi \phi(z)$, and the kernel prediction $k(z)^\top K^{-1} y$ is exactly the linear regression prediction $\phi(z)^\top \Phi^\top (\Phi \Phi^\top)^{-1} y$.
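To make this concrete, here is a small sanity check of my own (not part of the original example; the cosine feature transform below is just a hypothetical choice): define the kernel as the inner product of explicit features and verify numerically that the two methods give identical predictions.
import numpy as np
X = np.linspace(0,1,5)     # same training data as above
y = np.sin(2*np.pi*X)      # training targets
Z = np.linspace(0,1,100)   # data points to predict
# Hypothetical feature transform: 10 cosine features per input
phi = lambda x : np.cos(np.pi * x.reshape(-1,1) * np.arange(10).reshape(1,-1))
k = lambda x,y : phi(x) @ phi(y).T  # kernel induced by the features
# Kernel method
alpha = np.linalg.solve(k(X,X), y)
kernel_prediction = alpha @ k(X,Z)
# Overparameterized linear regression (minimum-norm solution)
Phi = phi(X)
beta = Phi.transpose() @ np.linalg.solve(Phi@Phi.transpose(), y)
linear_prediction = phi(Z) @ beta
print(np.allclose(kernel_prediction, linear_prediction)) # prints True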
It turns out that under some technical assumptions (look up Mercer's theorem if you're interested) on the kernel function $k$, it can always be written as such an inner product, $k(x, y) = \phi(x)^\top \phi(y)$, for some feature transform $\phi$, possibly with infinitely many features.
We may pick some high accuracy, keep only finitely many of these features, and obtain a finite feature transform $\phi$ for which $\phi(x)^\top \phi(y)$ approximates $k(x, y)$ to that accuracy. In this sense, kernel methods are basically overparameterized linear regression with a particular (typically very large) set of features.
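For reference, Mercer's theorem (under its technical assumptions) expresses the kernel through nonnegative eigenvalues $\lambda_i$ and eigenfunctions $\psi_i$, which also suggests the truncated feature transform:

$$
k(x, y) \;=\; \sum_{i=1}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i(y),
\qquad
\phi(x) \;=\; \big(\sqrt{\lambda_1}\,\psi_1(x),\, \ldots,\, \sqrt{\lambda_D}\,\psi_D(x)\big),
$$

so that $\phi(x)^\top \phi(y)$ approximates $k(x, y)$ when the truncation level $D$ is large.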
Some applications:
- Linear regression (in view of predictions) only depends on the inner products between data points. For example, duplicating a feature is the same as scaling it up by $\sqrt{2}$, and having features $f$ and $g$ is the same as having features $(f+g)/\sqrt{2}$ and $(f-g)/\sqrt{2}$, since both pairs give the same inner products (see the short check after this list).
- In terms of notation, I often find it easier to work with (and think about) a single feature vector for each data point than to consider pair-wise similarities through a kernel function. I feel like it makes it easier to apply linear algebra notation and operations.
- This one is a bit vague: kernel methods can't perform "feature learning", i.e. they are stuck with a fixed set of features (given by the kernel function) in a way where they can't disproportionately focus on the important features. This is in contrast to, for example, neural networks, which we hope will learn good representations of the data useful for transfer learning, ignoring irrelevant features, etc.
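To illustrate the first point, here is a tiny check of my own (the feature matrix F is random, purely for illustration): duplicating a feature column produces exactly the same matrix of pairwise inner products as scaling that column by $\sqrt{2}$, so the resulting predictions are identical.
import numpy as np
rng = np.random.default_rng(0)
F = rng.standard_normal((5, 3))    # 5 data points with 3 features each
F_dup = np.hstack([F, F[:, [0]]])  # duplicate the first feature
F_scaled = F.copy()
F_scaled[:, 0] *= np.sqrt(2)       # scale the first feature by sqrt(2)
# Both changes add the same rank-one term to the matrix of inner products
print(np.allclose(F_dup @ F_dup.T, F_scaled @ F_scaled.T)) # prints True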
If you are interested in this kind of research, I would be happy to talk on Discord, by email (johanswi@uio.no), or in the comments below.