Technical: Deep Linear Networks with label noise minimize the nuclear norm
The implicit biases of (stochastic) gradient descent with large learning rates often reduce to regularizers preferring “flat minima”; see for example my earlier blog post on label noise. The regularizers depend on the network architecture, so I find it interesting to characterize them for tractable cases such as Deep Linear Networks.
This post is more technical and mathy, and less standalone, than previous posts. However, to my knowledge only the two-layer case has been analyzed at the time of writing, so I thought I would put this more general result out there.
Deep Linear Networks (DLNs)
We already encountered DLNs in a previous blog post, but we will introduce some notation useful for this post. Consider an \(L\)-layer DLN with dimensions \(d_0,d_1,\dots,d_L\). That is, we parameterize a \(d_0 \times d_L\) matrix \(\tilde W\) as
\[\tilde W = W_1 W_2 \cdots W_L\]where \(W_i\) is a \(d_{i-1}\times d_i\) matrix. We will assume \(L \ge 2\) and \(d_i \ge \min\{d_0,d_L\}\), so the model is overparameterized.
The regularizer we will analyze is inspired by the following slight generalization of matrix completion. Given a \(d_0\)-dimensional vector \(x\) and a \(d_L\)-dimensional vector \(y\), our model predicts \(P(x,y) = x^\top \tilde W y\). For a set of \(n\) data points \((x_i,y_i)\) with targets \(t_i\) we can then do regression with the loss \(\sum_{i=1}^n (P(x_i,y_i)-t_i)^2\). If the \(x_i\) and \(y_i\) are standard basis vectors, we recover a formulation of matrix completion.
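To make the setup concrete, here is a minimal numpy sketch of the parameterization, the bilinear prediction and the regression loss. The dimensions, data and variable names are made up for illustration, and the matrix-completion special case is checked at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for an L = 3 layer DLN: d_0, d_1, d_2, d_3.
dims = [4, 3, 3, 3]
L = len(dims) - 1
Ws = [rng.standard_normal((dims[i], dims[i + 1])) for i in range(L)]

def W_tilde(Ws):
    """End-to-end matrix W_1 W_2 ... W_L."""
    W = Ws[0]
    for Wl in Ws[1:]:
        W = W @ Wl
    return W

def predict(Ws, x, y):
    """Bilinear prediction P(x, y) = x^T W_tilde y."""
    return x @ W_tilde(Ws) @ y

def loss(Ws, xs, ys, ts):
    """Squared regression loss over the data set."""
    return sum((predict(Ws, x, y) - t) ** 2 for x, y, t in zip(xs, ys, ts))

# Matrix completion as a special case: if x and y are standard basis vectors,
# P(e_a, e_b) simply reads off entry (a, b) of W_tilde.
a, b = 1, 2
assert np.isclose(predict(Ws, np.eye(dims[0])[a], np.eye(dims[-1])[b]),
                  W_tilde(Ws)[a, b])
```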
We can then use the same arguments as in the first blog post on implicit regularization by large learning rates to find the regularizer introduced by label noise, namely
\[R(W_1,\dots,W_L) = \sum_{i=1}^n \sum_{l=1}^L \|\nabla_{W_l} P(x_i,y_i)\|_F^2\]where \(\|\cdot\|_F\) is the Frobenius norm. To summarize roughly: among all the solutions \(W_1,\dots,W_L\) minimizing the loss, gradient descent with label noise will pick the one minimizing \(R(W_1,\dots,W_L)\).
The rest of the blog post will analyze the regularizer \(R\). We will eventually see that it is similar to the nuclear norm regularizer \(\|\tilde W\|_*\), which is well known in matrix completion and encourages low-rank solutions.
As a first step we differentiate \(P\) with respect to each \(W_l\). The gradient \(\nabla_{W_l} P(x_i,y_i) = (W_1\cdots W_{l-1})^\top x_i\,(W_{l+1}\cdots W_L\, y_i)^\top\) is a rank-one matrix, so its Frobenius norm splits into a product of vector norms:
\[\sum_{i=1}^n \sum_{l=1}^L \|\nabla_{W_l} P(x_i,y_i)\|_F^2 = \sum_{i=1}^n \sum_{l=1}^L \|x_i^\top W_1\cdots W_{l-1}\|_2^2 \|W_{l+1}\cdots W_L y_i\|_2^2.\]
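As a quick sanity check, here is a small numpy sketch, reusing `rng`, `dims`, `Ws` and `predict` from the sketch above (the test data is again made up). It compares the closed-form rank-one gradient against a finite difference and verifies the norm factorization.

```python
x = rng.standard_normal(dims[0])
y = rng.standard_normal(dims[-1])

def grad_P(Ws, x, y, l):
    """Closed-form gradient of P(x, y) with respect to the (0-indexed) layer l:
    the rank-one matrix (x^T W_1...W_{l-1})^T (W_{l+1}...W_L y)^T."""
    left = x
    for Wk in Ws[:l]:
        left = left @ Wk          # row vector x^T W_1 ... W_{l-1}
    right = y
    for Wk in reversed(Ws[l + 1:]):
        right = Wk @ right        # column vector W_{l+1} ... W_L y
    return np.outer(left, right), left, right

for l in range(len(Ws)):
    G, left, right = grad_P(Ws, x, y, l)

    # Finite-difference check of one gradient entry (P is linear in each W_l).
    eps = 1e-6
    Ws_pert = [W.copy() for W in Ws]
    Ws_pert[l][0, 0] += eps
    fd = (predict(Ws_pert, x, y) - predict(Ws, x, y)) / eps
    assert np.isclose(fd, G[0, 0], atol=1e-4)

    # The squared Frobenius norm of a rank-one matrix factors into vector norms.
    assert np.isclose(np.linalg.norm(G, "fro") ** 2,
                      np.linalg.norm(left) ** 2 * np.linalg.norm(right) ** 2)
```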
Assumption
To get a nice expression for the regularizer, we will make the following assumption: there exist matrices \(X\) and \(Y\) such that
\[\sum_{i=1}^n\ (x_i x_i^\top) \otimes (y_i y_i^\top) = (X^\top X) \otimes (Y Y^\top)\]where \(\otimes\) is the Kronecker product. We will furthermore assume that \(X\) and \(Y\) are square, invertible matrices. I expect the results in this blog to also hold for singular \(X\) and \(Y\), but that introduces some technical difficulties.
In my intuition, \(X^\top X\) and \(Y Y^\top\) are basically the covariances of the vectors \(\{x_i\}_i\) and the vectors \(\{y_i\}_i\). The assumption is an independence assumption between \(\{x_i\}_i\) and \(\{y_i\}_i\); it is pretty much saying that \(\mathbb{E}\big((x x^\top) \otimes (y y^\top)\big) = \mathbb{E}(x x^\top)\otimes \mathbb{E}(y y^\top)\).
While this assumption is often not exactly satisfied, I hope it is approximately satisfied when \(x_i\) and \(y_i\) are sampled independently, as is usually done in matrix completion. My attempts at \(L > 2\) without the assumption ended in ugly tensor norms that were hard to work with and interpret.
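One illustrative case where the assumption holds exactly (my own toy example, not needed for the derivation) is when the data set is a full product grid \(\{(x_j, y_k)\}_{j,k}\): the sum then factors into a Kronecker product of the two second-moment matrices, and \(X\) and \(Y\) can be read off from Cholesky factors. A small numpy sketch, reusing `rng` and `dims` from the earlier sketches:

```python
d0, dL = dims[0], dims[-1]
xs = rng.standard_normal((6, d0))    # six "row" vectors x_j
ys = rng.standard_normal((5, dL))    # five "column" vectors y_k

# Full product grid: every pair (x_j, y_k) is one data point (x_i, y_i).
lhs = sum(np.kron(np.outer(x, x), np.outer(y, y)) for x in xs for y in ys)

XtX = xs.T @ xs                      # plays the role of X^T X
YYt = ys.T @ ys                      # plays the role of Y Y^T
assert np.allclose(lhs, np.kron(XtX, YYt))

# Concrete square, invertible factors via Cholesky decompositions.
X = np.linalg.cholesky(XtX).T        # X^T X = XtX
Y = np.linalg.cholesky(YYt)          # Y Y^T = YYt
assert np.allclose(X.T @ X, XtX) and np.allclose(Y @ Y.T, YYt)
```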
Using the assumption, we may simplify
\[\sum_{i=1}^n \sum_{l=1}^L \|x_i^\top W_1\cdots W_{l-1}\|_2^2 \|W_{l+1}\cdots W_L y_i\|_2^2 = \sum_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2.\]
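Here is a short numpy check of this identity on the product-grid data from the previous sketch, where the assumption holds exactly. It reuses `xs`, `ys`, `X`, `Y`, `Ws` and `L` from the earlier sketches; the helper names `prefix` and `suffix` are mine.

```python
def prefix(Ws, l):
    """W_1 ... W_l as a matrix (identity for l = 0)."""
    M = np.eye(Ws[0].shape[0])
    for Wk in Ws[:l]:
        M = M @ Wk
    return M

def suffix(Ws, l):
    """W_{l+1} ... W_L as a matrix (identity for l = L)."""
    M = np.eye(Ws[-1].shape[1])
    for Wk in reversed(Ws[l:]):
        M = Wk @ M
    return M

# Left-hand side: sum over the data points and the layers.
per_sample = sum(
    np.linalg.norm(x @ prefix(Ws, l - 1)) ** 2
    * np.linalg.norm(suffix(Ws, l) @ y) ** 2
    for x in xs for y in ys
    for l in range(1, L + 1)
)

# Right-hand side: sum over the layers only, with X and Y absorbed.
with_XY = sum(
    np.linalg.norm(X @ prefix(Ws, l - 1), "fro") ** 2
    * np.linalg.norm(suffix(Ws, l) @ Y, "fro") ** 2
    for l in range(1, L + 1)
)
assert np.isclose(per_sample, with_XY)
```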
Lower bound
In this section we will use some inequalities to find a tight lower bound on \(R\). Using the AM–GM inequality we have
\[\sum_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2 \ge L \left(\prod_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2\right)^{\frac{1}{L}}.\]Rearranging the product gives
\[\begin{align} &L \left(\prod_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2\right)^{\frac{1}{L}}\\ =\ &L \left(\|X\|_F\|Y\|_F\prod_{l=1}^{L-1} \|X\ W_1\cdots W_l\|_F \|W_{l+1}\cdots W_L Y\|_F\right)^{\frac{2}{L}}. \end{align}\]Now use \(\|A\|_F\|B\|_F \ge \|AB\|_*\), where \(\|\cdot\|_*\) is the nuclear norm:
\[\begin{align} &L \left(\|X\|_F\|Y\|_F\prod_{l=1}^{L-1} \|X\ W_1\cdots W_l\|_F \|W_{l+1}\cdots W_L Y\|_F\right)^{\frac{2}{L}}\\ \ge\ &L \left(\|X\|_F\|Y\|_F\prod_{l=1}^{L-1} \|X\ W_1\cdots W_L Y\|_*\right)^{\frac{2}{L}}\\ =\ &L\left(\|X\|_F\|Y\|_F\right)^{\frac{2}{L}}\|X\tilde W Y\|_*^{2-\frac{2}{L}}. \end{align}\]
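A quick numerical sanity check of the final lower bound, reusing `Ws`, `X`, `Y`, `L`, `prefix` and `suffix` from the sketches above (the nuclear norm is computed with numpy's `"nuc"` matrix norm):

```python
W_end_to_end = prefix(Ws, L)                            # W_1 ... W_L
nuc = np.linalg.norm(X @ W_end_to_end @ Y, "nuc")       # ||X W~ Y||_*

reg = sum(
    np.linalg.norm(X @ prefix(Ws, l - 1), "fro") ** 2
    * np.linalg.norm(suffix(Ws, l) @ Y, "fro") ** 2
    for l in range(1, L + 1)
)
nX, nY = np.linalg.norm(X, "fro"), np.linalg.norm(Y, "fro")
bound = L * (nX * nY) ** (2 / L) * nuc ** (2 - 2 / L)
assert reg >= bound - 1e-9      # random weights typically sit strictly above the bound
```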
Construction achieving the bound
In this section we will give a construction which achieves the lower bound, showing that it is tight. Let's say we are given some \(\tilde W\) and need to find factors \(W_1,\dots,W_L\) with \(W_1 \cdots W_L = \tilde W\) such that \(R(W_1,\dots,W_L)\) is minimized.
Let’s take the (compact) singular value decomposition of \(X \tilde W Y = U \Sigma V^\top\). Here \(U\) and \(V\) are semi-orthogonal matrices with dimensions \(d_0 \times r\) and \(d_L \times r\) and \(\Sigma\) is a diagonal \(r \times r\) matrix with positive diagonal, where \(r\) is the rank of \(\tilde W\) (which is also the rank of \(X \tilde W Y\) since \(X\) and \(Y\) are invertible). We can now construct
\[\begin{align} W_1 &= \alpha \begin{pmatrix}X^{-1}U\Sigma^{\frac{1}{2}} & \bf 0\end{pmatrix}\\ W_i &= \beta \begin{pmatrix}I_r & \bf 0 \\ \bf 0 & \bf 0\end{pmatrix} \hspace{2.5cm} i = 2,\dots,L-1\\ W_L &= \gamma \begin{pmatrix}\Sigma^{\frac{1}{2}}V^\top Y^{-1} \\ \bf 0\end{pmatrix} \end{align}\]where \(\beta = \left(\frac{\|X\tilde W Y\|_*}{\|X\|_F\|Y\|_F}\right)^{\frac{1}{L}}\), \(\alpha = \sqrt{\frac{\|X\|_F}{\|Y\|_F}}\beta^{1-\frac{L}{2}}\) and \(\gamma = \sqrt{\frac{\|Y\|_F}{\|X\|_F}}\beta^{1-\frac{L}{2}}\). \(\bf 0\) are zero matrices of appropriate dimensions to match the dimensions of the left-hand side. \(I_r\) is the \(r\times r\) identity matrix.
It can be verified that \(W_1 \cdots W_L = \tilde W\) and
\[\sum_{l=1}^L \|X\ W_1\cdots W_{l-1}\|_F^2 \|W_{l+1}\cdots W_L Y\|_F^2 = L\left(\|X\|_F\|Y\|_F\right)^{\frac{2}{L}}\|X\tilde W Y\|_*^{2-\frac{2}{L}}.\]
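Here is a numpy sketch of this verification for the construction above, again reusing `rng`, `dims`, `L`, `X`, `Y`, `nX`, `nY`, `prefix` and `suffix` from the earlier sketches. The padding helper `pad` and the target matrix are my own illustrative choices.

```python
W_target = rng.standard_normal((dims[0], dims[-1]))     # some given W~ to factor

U, s, Vt = np.linalg.svd(X @ W_target @ Y)
r = int(np.sum(s > 1e-12))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]                   # compact SVD
nuc = s.sum()                                            # ||X W~ Y||_*

beta = (nuc / (nX * nY)) ** (1 / L)
alpha = np.sqrt(nX / nY) * beta ** (1 - L / 2)
gamma = np.sqrt(nY / nX) * beta ** (1 - L / 2)

def pad(M, shape):
    """Embed M in the top-left corner of a zero matrix of the given shape."""
    out = np.zeros(shape)
    out[: M.shape[0], : M.shape[1]] = M
    return out

sqrtS = np.diag(np.sqrt(s))
Ws_opt = (
    [alpha * pad(np.linalg.solve(X, U) @ sqrtS, (dims[0], dims[1]))]
    + [beta * pad(np.eye(r), (dims[i], dims[i + 1])) for i in range(1, L - 1)]
    + [gamma * pad(sqrtS @ Vt @ np.linalg.inv(Y), (dims[L - 1], dims[L]))]
)

# The factors multiply back to W~ and attain the lower bound exactly.
assert np.allclose(prefix(Ws_opt, L), W_target)
reg_opt = sum(
    np.linalg.norm(X @ prefix(Ws_opt, l - 1), "fro") ** 2
    * np.linalg.norm(suffix(Ws_opt, l) @ Y, "fro") ** 2
    for l in range(1, L + 1)
)
assert np.isclose(reg_opt, L * (nX * nY) ** (2 / L) * nuc ** (2 - 2 / L))
```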
Conclusion
We showed that the minimizing assignment of \(W_1,\dots,W_L\) gives the regularizer
\[R(W_1,\dots,W_L) = L\left(\|X\|_F\|Y\|_F\right)^{\frac{2}{L}}\|X\tilde W Y\|_*^{2-\frac{2}{L}}.\]Since multiplying by constants and raising to positive powers do not change the minimizer, we are effectively minimizing \(\|X\tilde W Y\|_*\). For isotropic data we have \(X \propto I\) and \(Y \propto I\), so we recover the usual nuclear norm regularizer \(\|\tilde W\|_*\). In general, the regularizer \(\|X\tilde W Y\|_*\) can be interpreted as first normalizing the distributions of the \(x_i\) and \(y_i\) to be isotropic, and then using the usual nuclear norm regularizer on the processed data. Specifically, let \(\tilde W' = X\tilde W Y\) and consider a loss function \(f\) depending only on the predictions \(P(x_i,y_i) = x_i^\top \tilde W y_i\). Then applying the change of variables to the loss and the regularizer yields
\[f(\{x_i^\top\tilde W y_i\}_i) + \lambda \|X\tilde W Y\|_* = f(\{(x_i^\top X^{-1})\tilde W' (Y^{-1}y_i)\}_i) + \lambda \|\tilde W'\|_*.\]If you are interested in this kind of research, I would be happy to talk on Discord, by email (johanswi@uio.no), or in the comments below.