Sparsity in Deep Learning

Note: Experiments with projected gradient descent for ridge regression on a simplex and on a subspace can be seen here.

Sparse coding decomposes an input vector into a sparse vector of weights with respect to a dictionary, such that the weighted combination of the dictionary atoms approximates the original vector in L2-norm while keeping the L1-norm of the weights low. The goal is to find the coefficients that best project the query \(x\) onto the subspace spanned by the dictionary / set of keys \(D\). Formally, the problem is

\(\displaystyle \text{sparseCoding}_{D,\lambda}(x) = \arg\min_y ||x - Dy||_2^2 + \lambda || y ||_1\)

Although there is no closed-form solution to this problem, there are well-known iterative algorithms for solving this convex optimization problem. Note that a closely related problem is ridge regression, which ultimately reduces to a simple linear operator applied to \(x\).

\(\displaystyle \text{ridgeRegression}_{D,\lambda}(x) = \arg\min_y ||x - Dy||_2^2 + \lambda || y ||^2_2 = (D^TD + \lambda I)^{-1}D^Tx\)
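
As a quick illustration, the closed-form ridge solution above can be computed directly; the following is a minimal NumPy sketch with a random dictionary and query (the sizes, names, and regularization value are illustrative):

```python
import numpy as np

def ridge_regression(D, x, lam):
    """Closed form: (D^T D + lam * I)^{-1} D^T x."""
    d = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(d), D.T @ x)

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 16))   # dictionary: 16 atoms in R^64
x = rng.standard_normal(64)         # query vector
y = ridge_regression(D, x, lam=0.1)
print(y.shape)                      # (16,) -- dense in general, unlike a sparse code
```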

Coming back to the sparse coding problem, there is a well-known iterative algorithm for sparse coding called ISTA, whose homotopy-continuation version can be implemented as a sequence of feed-forward layers as follows:

\(\displaystyle \begin{align} y_{1} &= S_{\lambda_1}(D^T x) \\ &\;\;\vdots \\ y_{l + 1} &= S_{\lambda_l}\big(y_l + D^T(x - Dy_l)\big) \text{ where } \lambda_l = \lambda_{\max} \left(\frac{\lambda_{*}}{\lambda_{\max}}\right)^{l/L} \text{ and } S_\lambda(x) = \text{sign}(x)\,\text{ReLU}(|x| - \lambda) \end{align}\)
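
A minimal NumPy sketch of these unrolled layers, assuming \(D\) is scaled so that the implicit unit step size is stable (\(\|D\|_2 \le 1\)); the choice \(\lambda_{\max} = \|D^T x\|_\infty\) (above which the all-zero code is optimal) and the layer count are illustrative:

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lambda(x) = sign(x) * ReLU(|x| - lambda)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista_layers(D, x, lam_star, L=20):
    """Unrolled ISTA layers with a geometric (homotopy) lambda schedule; D is fixed."""
    lam_max = np.max(np.abs(D.T @ x))
    lams = [lam_max * (lam_star / lam_max) ** (l / L) for l in range(1, L + 1)]
    y = soft_threshold(D.T @ x, lams[0])
    for lam_l in lams[1:]:
        y = soft_threshold(y + D.T @ (x - D @ y), lam_l)
    return y
```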

The function \(S_{\lambda}(x)\) can be approximated by a different function such as the SigShrink function \(\displaystyle \frac{x}{1 + \exp(-\tau(|x| - \lambda))}\). Of course, it's important to note that this is not the first or only attempt to merge deep learning and sparsity; e.g., the paper Learning Fast Approximations of Sparse Coding by Gregor and LeCun proposed a knowledge-distillation approach to train neural networks to act like sparse encoders, so their sparse-encoding layers had trainable parameters, unlike the layers above where D is fixed and the network is just a fixed computational circuit.
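
For intuition, here is a tiny sketch comparing the soft-threshold \(S_\lambda\) with the SigShrink surrogate (the sharpness parameter \(\tau\) and the input grid are illustrative):

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sigshrink(x, lam, tau=10.0):
    # Smooth, differentiable gate that suppresses entries with |x| below lam.
    return x / (1.0 + np.exp(-tau * (np.abs(x) - lam)))

z = np.linspace(-2.0, 2.0, 9)
print(soft_threshold(z, 0.5))
print(sigshrink(z, 0.5))
```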

 

Another approach that brings in sparsity is sparsemax, a sparse variant of the softmax operation. The softmax and other probability-generating operators try to find the best distribution that represents \(D^Tx\) subject to some constraints. That is, many attention operations compute \(\displaystyle \arg\max_{\mu \in \text{dom}(\Omega)} \langle D^Tx, \mu \rangle - \Omega(\mu)\), and there are closed-form solutions for \(\Omega(\mu) = -\text{entropy}(\mu) + \mathbb{I}_{\Delta}(\mu)\) (softmax) and \(\Omega(\mu) = \frac{1}{2} || \mu ||_2^2 + \mathbb{I}_{\Delta}(\mu)\) (sparsemax). The sparsemax operation is computed as follows:

\(\text{sparsemax}_i(z) = [z_i - \tau(z)]_{+}\), where \(\tau(z)\) is the constant chosen so that the output sums to 1. The Jacobian of sparsemax has a very simple form in terms of the output: \(\text{diag}(s) - ss^T / |S(z)|\), where \(S(z)\) is the support (the set of nonzero output coordinates) and \(s\) is its indicator vector. See the papers Learning with Fenchel-Young Losses by Blondel, Martins, and Niculae and From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification by Martins and Astudillo for reference.
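
A minimal NumPy sketch of sparsemax and its Jacobian-vector product in the form above; the sorting-based computation of \(\tau(z)\) is the standard simplex-projection recipe, and the example input is illustrative:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = z_sorted - (cssv - 1) / k > 0      # which sorted coordinates stay positive
    tau = (cssv[support][-1] - 1) / k[support][-1]
    return np.maximum(z - tau, 0.0)

def sparsemax_jvp(z, v):
    """Jacobian-vector product using diag(s) - s s^T / |S(z)|."""
    s = (sparsemax(z) > 0).astype(float)         # indicator of the support S(z)
    return s * v - s * (s @ v) / s.sum()

z = np.array([1.5, 0.2, -0.3, 1.4])
print(sparsemax(z))                              # [0.55, 0.0, 0.0, 0.45]
```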

 

Note that the standard attention operations are scale-dependent: if we rescale D and x, the scale of \(D^Tx\) changes as well. This motivates the idea of scale-free attention, or distance-minimizing attention, defined as \(\displaystyle \textbf{scale-free-attention}_{\lambda}(x, D) = \arg\min_{y \in \Delta} ||x - Dy||_2^2 \color{red}{+ \lambda || y ||_1}\). The term in red is not necessary and can be dropped, since \(||y||_1 = 1\) for any \(y\) in the simplex. Note that this doesn't increase the number of parameters, because the intermediate iterations don't need any additional parameters.
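
A minimal sketch of computing this scale-free attention with projected gradient descent, reusing the same Euclidean simplex projection that sparsemax performs (the step size and iteration count are illustrative assumptions):

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = z_sorted - (cssv - 1) / k > 0
    tau = (cssv[support][-1] - 1) / k[support][-1]
    return np.maximum(z - tau, 0.0)

def scale_free_attention(x, D, num_steps=200):
    """Minimize ||x - D y||^2 over y in the simplex by projected gradient descent."""
    n = D.shape[1]
    y = np.full(n, 1.0 / n)                           # start from the uniform distribution
    eta = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)     # 1/L for the smooth objective
    for _ in range(num_steps):
        grad = 2.0 * D.T @ (D @ y - x)
        y = project_simplex(y - eta * grad)
    return y
```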

 

The basics of convex optimization with/without constraints for differentiable objectives with convex regularization -- Building up to ISTA / FISTA

The most important thing to realize is that an arbitrary member of the subdifferential, i.e. an arbitrary subgradient at a point, may not be a descent direction for the function \(f\). Only the subgradient with the smallest norm is guaranteed to give the direction of steepest descent (as long as the point at which we take the subgradient lies in the interior of the domain of \(f\)). Therefore, if we want to perform something like gradient descent, we need to find the subgradient with the smallest norm and then, for unconstrained problems, take a step in the opposite direction; for constrained problems, we take the step and then project the new parameters back onto the feasible set.

Steepest Descent: If the function we want to optimize can be written as \(f(x) = g(x) + ||x||_1\), where \(g\) is differentiable, then the subdifferential is

\(\displaystyle \partial f(x) = \nabla g(x) + v, \quad v_i \in \begin{cases} \{1\} & \text{if } x_i > 0 \\ \{-1\} & \text{if } x_i < 0 \\ [-1, 1] & \text{if } x_i = 0 \end{cases} \quad \forall\, 1 \le i \le d\)

and for the coordinates where \(x_i = 0\), the minimum-norm element is simply 0 if the corresponding gradient coordinate lies in \([-1, 1]\); otherwise we reduce its magnitude by 1. So once a parameter hits zero, its gradient coordinate has to exceed 1 in magnitude to move it away from 0, and even then the magnitude is reduced so that the parameter doesn't move too far from 0.
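
A small sketch of this minimum-norm subgradient computation for \(f(x) = g(x) + \lambda||x||_1\) (the text above uses \(\lambda = 1\); the generic \(\lambda\) and the helper name are mine):

```python
import numpy as np

def min_norm_subgradient(grad_g, x, lam=1.0):
    """Minimum-norm element of the subdifferential of g(x) + lam * ||x||_1.

    grad_g is the gradient of the smooth part g evaluated at x.
    """
    sub = grad_g + lam * np.sign(x)        # exact on coordinates where x_i != 0
    zero = (x == 0)
    # Where x_i == 0, the L1 term contributes anything in [-lam, lam]; the
    # minimum-norm choice shrinks the gradient coordinate toward zero by lam.
    sub[zero] = np.sign(grad_g[zero]) * np.maximum(np.abs(grad_g[zero]) - lam, 0.0)
    return sub
```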

ISTA improves upon steepest descent by viewing the steepest-descent procedure from a proximal-gradient point of view. Whenever we update a parameter as \(x_{k+1} = x_k - \eta \nabla f(x_k)\), we are solving the following local/proximal problem:

\(\displaystyle x_{k+1} = \arg\min_{x} f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2\eta} || x - x_k||_2^2\)

The ISTA method instead solves the following sub-problem, which is more meaningful when \(f(x) = g(x) + \lambda ||x||_1\):

\(\displaystyle x_{k+1} = \arg\min_{x} g(x_k) + \nabla g(x_k)^T (x - x_k) + \frac{1}{2\eta} || x - x_k||_2^2 + \lambda ||x||_1\)

The solution of this problem can be derived from the optimality condition that 0 must lie in the subdifferential of the above objective at \(x_{k+1}\). This results in the following update rule:

\(\displaystyle x_{k+1} = S_{\eta \lambda} (x_k - \eta \nabla g(x_k))\)

and effectively this means that if the gradient step would land a parameter within \(\eta\lambda\) of zero, we set it to exactly zero; otherwise we shift the updated value toward the origin by \(\eta\lambda\). Even more importantly, if the step size is small enough (\(\eta \le 1/L\) for \(L\)-smooth \(g\)), we get monotonic convergence.
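
A minimal sketch of this update rule for a lasso objective \(g(x) = \frac{1}{2}||b - Ax||_2^2\) (the matrix \(A\), vector \(b\), step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, num_iters=500):
    """Proximal gradient (ISTA) for 0.5 * ||b - A x||^2 + lam * ||x||_1."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)            # gradient of the smooth part g
        x = soft_threshold(x - eta * grad, eta * lam)
    return x
```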

FISTA: FISTA improves upon ISTA by adding the idea of momentum! One way of thinking about ISTA is that we are applying the prox operator to the gradient-descent step, i.e. the update is \(x_{k+1} = \text{prox}_{\eta h}\big(x_k - \eta \nabla g(x_k)\big)\) where \(\text{prox}_h(v) = \arg\min_y \frac{1}{2}|| y - v ||^2_2 + h(y)\) and \(h\) is the regularization function, e.g. the L1 penalty \(\lambda ||\cdot||_1\). In FISTA we apply the prox to the gradient update plus a momentum term:

\(\displaystyle x_{k+1} = \text{prox}_{\eta h} \Big( v_k - \eta \nabla g(v_k)\Big) \quad\text{where } v_k = x_k + \frac{k-1}{k+2} (x_k - x_{k-1})\)

and according to slide 30 of this lecture note by Pradeep Ravikumar, Nesterov acceleration really helps.
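
A minimal FISTA sketch for the same lasso objective as the ISTA snippet above, using the \(\frac{k-1}{k+2}\) momentum coefficient from the update rule (the data A, b and iteration count are illustrative assumptions):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(A, b, lam, num_iters=500):
    """Accelerated proximal gradient (FISTA) for 0.5 * ||b - A x||^2 + lam * ||x||_1."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x_prev = x = np.zeros(A.shape[1])
    for k in range(1, num_iters + 1):
        v = x + (k - 1) / (k + 2) * (x - x_prev)   # Nesterov momentum extrapolation
        grad = A.T @ (A @ v - b)
        x_prev, x = x, soft_threshold(v - eta * grad, eta * lam)
    return x

# Usage sketch: recover a sparse code and count its nonzeros.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
x_true = rng.standard_normal(50) * (rng.random(50) < 0.1)
b = A @ x_true + 0.01 * rng.standard_normal(100)
print(np.count_nonzero(fista(A, b, lam=0.1)))
```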