A Theory for Deep Learning

If you happen to type "neural networks" into the Google search bar, chances are you will find "explained" as the suggested next word. The fact is there are many, many works trying to explain why neural networks work. I have not seen one that provides an explicit answer: one that identifies a mathematical quantity and shows that it matches what the neural network actually learns. In that sense, I do think we are making progress here.

We did not present much of a theory on this page, nor did I wish to use these simple demos to prove anything. I only hope to use these experiments to explain the main ideas, and to show you what the theory would suggest in a few specific scenarios. Hopefully this will make some of you interested enough to read the paper. In our paper, we put out a more complete picture. It starts by saying that we need a new way to measure information itself. After designing a new information metric, we can use it to describe the information flow in any data processing procedure: what information is kept, what is discarded, and whether such choices are good ones. We then show that finding the singular vectors of $$ \tilde{B} $$, as we did here, is a good choice: it captures the strongest modes of dependence between $$ X $$ and $$ Y $$, as we claimed; it captures the most significant elements of the "common information" between $$ X $$ and $$ Y $$; and, perhaps more importantly, the selected features are the optimal choices in what we call the "universal feature selection problem".

In classical statistics, if we need to select features from data for the purpose of solving an inference problem, the answer is always to select the sufficient statistic. We argue that in most learning problems, as we process high-dimensional data samples, we do not know exactly which attributes of the data we will be interested in detecting. For example, when we observe a customer's behavioral record, it is not clear which attribute(s) will matter when we try to recommend a product to her later. Thus, for these new problems, we cannot use the notion of a sufficient statistic for a particular inference task. Instead, we need to select a set of features such that, regardless of which attribute we might be interested in detecting later, decisions based on these features are "universally" good. For this new problem, we show that the choice of the singular vectors is indeed optimal. That is, for any feature selection procedure, as long as the goal is not a single problem with a known statistical model, the sensible choice is such universal features. And neural networks, as one such tool, had better give these solutions. This is the basis of the above experiments.

Particularly in this area, I would like to avoid overstating the importance of our results, such as claiming that the "black box" has been opened. The best way to avoid that is to show the mathematical results. Since we are not doing that here, I think it is important to make clear what we can and cannot do with this theory. Here it goes.

A Performance Metric for Neural Networks
We get a performance metric for neural networks. If we keep the notation that $$ \underline{S}(x) = [S_1(x), \ldots, S_k(x)] $$ is the output of the last hidden layer with $$ k $$ nodes, and $$ \underline{v}(y) = [v_1(y), \ldots, v_k(y)] $$ is the weights on the output layer, our theory says the goal is to make them as highly correlated as possible, i.e. to maximize $$ {\mathbb E}[ \langle\underline{S}(x), \underline{v}(y)\rangle] $$. The optimal solution is to choose the $$ S(x) $$'s and $$ v(y) $$'s from the SVD solutions. However, due to the limited expressive power of the selected network structure, we usually cannot achieve this optimum. The gap is then a good metric of how effective our network is.

The H-Scores
For that purpose, we can define a metric we call the "H-Score". Consider a given network with both $$ S(x) $$ and $$ v(y) $$ computed (keep in mind that a neural network might not put the answers in our favorite coordinate system, but can apply rotation, scaling, and shift to the solutions). We first write $$ \tilde{S}(x), \tilde{v}(y) \in {\mathbb R}^k $$ for the centralized versions of the features and the weights, with the means subtracted, and then define $$ \Phi = [\phi_1, \ldots, \phi_k], \Psi= [\psi_1, \ldots, \psi_k] $$, with

$$ \begin{align} \phi_i (x) &= \sqrt{P_X(x)} \cdot \tilde{S}_i(x) \\ \psi_i (y) &= \sqrt{P_Y(y)} \cdot \tilde{v}_i(y) \end{align} $$

Then the "H-Score" is defined as

$$ \begin{align} H(S, v) &:= \|\tilde{B}\|^2 - \| \tilde{B} - \Psi \Phi^{\mathrm{T}} \|^2\\ &= 2 {\mathbb E}[ \underline{S}^{\rm T}(X) \underline{v}(Y) ] - {\rm trace} ({\rm cov}(\underline{S}(X)) {\rm cov}(\underline{v}(Y))) \end{align} $$
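As a concrete illustration, here is a minimal empirical sketch of the two-sided H-Score, with the expectations and covariances replaced by sample averages over paired samples. The function name `h_score` and the toy data are my own, used only for illustration.

```python
import numpy as np

def h_score(f_x, g_y):
    """Empirical two-sided H-Score H(S, v).

    f_x: (n, k) array of features S(x_i); g_y: (n, k) array of
    weights v(y_i) for the paired labels.  Expectations and
    covariances are replaced by sample averages.
    """
    f = f_x - f_x.mean(axis=0)                    # centralized features S~
    g = g_y - g_y.mean(axis=0)                    # centralized weights v~
    cross = 2.0 * np.mean(np.sum(f * g, axis=1))  # 2 E[S~(X)^T v~(Y)]
    cov_f = f.T @ f / len(f)                      # cov(S(X))
    cov_g = g.T @ g / len(g)                      # cov(v(Y))
    return cross - np.trace(cov_f @ cov_g)

# A feature correlated with the target scores higher than an independent one:
rng = np.random.default_rng(0)
y = rng.standard_normal((2000, 1))
f_good = y + 0.1 * rng.standard_normal((2000, 1))
f_bad = rng.standard_normal((2000, 1))
print(h_score(f_good, y) > h_score(f_bad, y))   # True
```

Note that the second term penalizes large feature and weight variances, so simply scaling the features up does not inflate the score.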

If we are interested only in evaluating how good the selected feature $$ S(x) $$ is, we can plug in the theoretically optimal weights to get the one-sided H-Score as

$$ \begin{align} H(S) := H(S, v^*) &= \left\| \tilde{B} \Phi (\Phi^{\mathrm{T}} \Phi) ^{-\frac{1}{2}}\right\|^2 \\ &= {\mathbb E}_{P_Y} \left[ \ {\mathbb E} [\underline{S}(X) |Y=y]^{\rm T} \cdot {\rm cov}(\underline{S}(X))^{-1} \cdot {\mathbb E} [\underline{S}(X) |Y=y]\ \right] \end{align} $$

In practice, all of the expectations and covariances above can be replaced by empirical averages. The H-Scores can thus be computed for all the features and weights in the network, including the features at the outputs of intermediate hidden layers, to evaluate how good they are.
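Following that remark, here is a minimal sketch of the one-sided H-Score computed from samples with discrete labels, using the centralized features. The function name `one_sided_h_score` is my own, and all expectations are sample averages.

```python
import numpy as np

def one_sided_h_score(f_x, y):
    """Empirical one-sided H-Score H(S) for discrete labels y.

    f_x: (n, k) array of features S(x_i); y: (n,) array of labels.
    Computes E_{P_Y}[ E[S~(X)|Y=y]^T cov(S(X))^{-1} E[S~(X)|Y=y] ]
    with sample averages in place of expectations.
    """
    n, k = f_x.shape
    f = f_x - f_x.mean(axis=0)            # centralized features S~
    cov = f.T @ f / n                     # cov(S(X))
    cov_inv = np.linalg.pinv(cov)         # pseudo-inverse, for safety
    score = 0.0
    for label in np.unique(y):
        mask = (y == label)
        mu = f[mask].mean(axis=0)         # E[S~(X) | Y = label]
        score += mask.mean() * mu @ cov_inv @ mu
    return score

# A perfectly informative one-dimensional feature for a balanced
# binary label attains the maximum H(S) = k = 1:
y = np.repeat([0, 1], 500)
f = (2 * y - 1).reshape(-1, 1).astype(float)
print(one_sided_h_score(f, y))   # ≈ 1.0
```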



To demonstrate the use of the H-Score, we ran the following classification experiment on the data shown. These are the first 2 dimensions of a 6-dimensional dataset, with color-coded classes. The data is generated from a Gaussian mixture model and is rather complicated. We need a multi-layer neural network for this task, and we plotted the H-Scores for the outputs at each layer as follows:



What we can observe is that as we run the network for more and more iterations, the H-Score improves, and the H-Scores of the outputs at the later layers are higher. This shows that more iterations and more layers in the network are indeed picking better and better features of the data. Finally, the classification accuracy is also included in the figure to show the correspondence with the H-Scores: higher H-Scores generally mean better performance, as expected.

Interested readers might notice from the plot that layer 3 is not doing its job too well, and try to do something about it. The code for this experiment can be downloaded from here.

For a more realistic test, we also evaluated the features selected by different algorithms on the ImageNet dataset. We list the H-Scores and the accuracies in the following table to show the correspondence between the H-Score and the accuracy, for both the top-1 and the top-5 candidates.



Why Not Log-Loss?
Of course, the more commonly used metric in such experiments is the log-loss, also called the cross entropy, or the K-L divergence. Here is how the H-Score is related to them, and why it is (somewhat) better.

Log-loss, or cross entropy, computes $$ {\rm LL}(S, v)= -{\mathbb E}_{P_{XY}} \left[ \log Q_{Y|X}^{(S, v)}(Y|X) \right] $$, where $$ P $$ is the empirical distribution and $$ Q^{(S, v)} $$ is the output of the SoftMax unit interpreted as a conditional distribution. The entire network chooses $$ S $$ and $$ v $$ to minimize this. As the objective of the entire optimization, it can of course be used as a measure of how good the result is. In fact, currently the most important job of the engineer who designed the network, while it runs, is to watch the log-loss value decline, slow down, and converge.

One issue with the log-loss is that it is not clear a priori what value we should expect. Should it be on the order of a few hundred, or $$ 10^6 $$? The actual result depends on the specific dataset and task. This makes it hard to use the log-loss as an indicator of the goodness of the network in an absolute sense.

An alternative is the K-L divergence, which is also called the relative entropy. It computes $$ D(P_{XY} || P_X \cdot Q_{Y|X}^{(S, v)}) = {\mathbb E}_{P_{XY}} \left[ \log \frac{P_{Y|X}(Y|X)}{Q_{Y|X}^{(S, v)}(Y|X)}\right] $$

and tries to minimize it. In terms of the optimization, this is equivalent to the log-loss, since the only difference is the term $$ {\mathbb E} [\log P_{Y|X}(Y|X)] $$, which does not depend on $$ (S, v) $$ at all.
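To make this equivalence concrete, here is a small numerical check on a made-up joint distribution: for any two models $$ Q_1, Q_2 $$, the log-loss and the K-L divergence differ by the same constant, the conditional entropy $$ -{\mathbb E}[\log P_{Y|X}] $$, so minimizing one minimizes the other. All numbers below are illustrative.

```python
import numpy as np

# A made-up joint distribution P_XY over X in {0, 1}, Y in {0, 1, 2}
P_xy = np.array([[0.2, 0.1, 0.1],
                 [0.1, 0.3, 0.2]])
P_y_given_x = P_xy / P_xy.sum(axis=1, keepdims=True)

def log_loss(Q):     # LL(S, v) = -E_P[log Q(Y|X)]
    return -(P_xy * np.log(Q)).sum()

def kl_div(Q):       # D(P_XY || P_X Q) = E_P[log P(Y|X) / Q(Y|X)]
    return (P_xy * np.log(P_y_given_x / Q)).sum()

# Two arbitrary conditional models (each row sums to one)
Q1 = np.array([[0.5, 0.3, 0.2],
               [0.2, 0.4, 0.4]])
Q2 = np.array([[0.4, 0.4, 0.2],
               [0.1, 0.5, 0.4]])

# The gap LL - D is the conditional entropy H(Y|X), independent of Q:
h_cond = -(P_xy * np.log(P_y_given_x)).sum()
print(np.isclose(log_loss(Q1) - kl_div(Q1), h_cond),
      np.isclose(log_loss(Q2) - kl_div(Q2), h_cond))   # True True
```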

A slight advantage of the K-L divergence is that it has an operational meaning: it is the distance between two distributions, the actual empirical distribution $$ P $$ and the one generated by the network, $$ Q $$. This gives the entire network a nice interpretation: it finds the best approximation to the true joint empirical distribution $$ P_{XY} $$. Operationally, the K-L divergence also has an absolute range: the divergence above always satisfies $$ 0 \leq D(P||Q) \leq \log |{\mathcal Y}|$$. This is a good thing. For example, if for a dataset with $$ |{\mathcal Y}| =8 $$ you get a K-L divergence of 0.02, it says that you have a pretty good approximation; if instead your K-L divergence is 1.2, then your network is pretty much shooting randomly.

The problem with the K-L divergence is that we often cannot compute it from data, because we do not know $$ P_{Y|X} $$. In fact, the reason we want a good approximation to the true $$ P_{Y|X} $$ is often precisely to get a decent guess of what it looks like. This is how one should understand the difficulty of having a good metric for neural networks; it is like a game of Marco Polo: we are chasing a wildly moving target, and it is hard to report where it is and how close we are to it with a single number.

Another issue with the K-L divergence is revealed when we write it in our vector form, with

$$ D(P_{XY} || P_X \cdot Q_{Y|X}^{(S, v)}) \approx \frac{1}{2} \| \tilde{B} - \Psi \Phi^{\mathrm{T}} \|^2 = \frac{1}{2} \| \tilde{B} - \sum_{i=1}^k \psi_i \cdot \phi_i^{\mathrm{T}} \|^2 $$

We are trying to use a rank-$$k$$ matrix to approximate $$ \tilde{B} $$, but the distance we use here is the norm over the entire matrix. That is, even if we use the optimal rank-$$k$$ matrix computed from the SVD, the remaining singular values $$ \sigma_{k+1}, \sigma_{k+2}, \ldots $$ still contribute to the resulting distance. In general, the K-L divergence is the sum of the contribution from these singular values, which we can never reach, and the actual distance between our rank-$$k$$ solution and the optimal one; and we have no way to separate the two. This is bad, particularly when $$ |{\mathcal Y}| $$ is large.

This is the point at which we can understand why the H-Score is nice:

$$ H(S, v) := \|\tilde{B}\|^2 - \| \tilde{B} - \Psi \Phi^{\mathrm{T}} \|^2 $$

That is, we removed the term $$ \|\tilde{B}\|^2 $$, the part of the K-L divergence that is not computable from the data samples, and flipped the sign so that larger is better. With that, we also dropped the contributions of the extra singular values. To see that, suppose $$ \Psi \Phi^{\mathrm{T}} $$ is precisely the rank-$$k$$ approximation of $$ \tilde{B} $$ from the SVD; the resulting H-Score is then the sum of the top-$$k$$ squared singular values, with all other singular values cancelled.
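This cancellation is easy to check numerically: with a stand-in matrix for $$ \tilde{B} $$, and $$ \Psi\Phi^{\mathrm{T}} $$ taken to be its optimal rank-$$k$$ SVD truncation, the quantity $$ \|\tilde{B}\|^2 - \|\tilde{B} - \Psi\Phi^{\mathrm{T}}\|^2 $$ equals $$ \sum_{i=1}^k \sigma_i^2 $$, with the remaining singular values dropping out. The matrix below is random, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 6))            # stand-in for B-tilde
U, s, Vt = np.linalg.svd(B, full_matrices=False)

k = 2
B_k = (U[:, :k] * s[:k]) @ Vt[:k]          # optimal rank-k approximation Psi Phi^T

h = np.linalg.norm(B)**2 - np.linalg.norm(B - B_k)**2
print(np.isclose(h, (s[:k]**2).sum()))     # True: only the top-k sigma_i^2 remain
```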

The following sequence of inequalities can be very useful:

$$ H(S, v) \leq H(S) \leq \sum_{i=1}^k \sigma_i^2 \leq k $$

where $$ k $$ is the dimensionality of the selected feature function $$ S(x) $$.
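The outer parts of this chain can be checked numerically on random data. The sketch below recomputes both H-Scores from sample averages (the variable names and the toy data are my own) and verifies $$ H(S, v) \leq H(S) \leq k $$ for an arbitrary choice of weights; the middle term $$ \sum_i \sigma_i^2 $$ is skipped, since it requires the unknown $$ \tilde{B} $$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, ny = 5000, 3, 4
y = rng.integers(0, ny, size=n)                  # discrete labels
f = rng.standard_normal((n, k)) + 0.5 * np.eye(ny)[y][:, :k]  # features S(x)
v_table = rng.standard_normal((ny, k))           # arbitrary weights v(y)

fc = f - f.mean(axis=0)                          # centralized features
cov_f = fc.T @ fc / n

# Two-sided H-Score with the arbitrary weights
g = v_table[y]
gc = g - g.mean(axis=0)
h_sv = 2 * np.mean(np.sum(fc * gc, axis=1)) - np.trace(cov_f @ (gc.T @ gc / n))

# One-sided H-Score (optimal weights plugged in)
h_s = sum((y == c).mean() * mu @ np.linalg.inv(cov_f) @ mu
          for c in range(ny)
          for mu in [fc[y == c].mean(axis=0)])

print(h_sv <= h_s <= k)   # True
```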

By definition, the first gap, between $$ H(S, v) $$ and $$ H(S) $$, indicates how far the weights $$ v(y) $$ are from the optimal $$ v^*(y) $$. For an intermediate layer of a deep neural network, we can imagine taking the output of this layer and immediately attaching a fully connected SoftMax output layer; $$ v^* $$ is then the weights on this layer. Thus, this gap is often a good measure of how well the network computation has converged to its limit.

The second gap measures how far the selected features $$ S(x) $$ are from the optimal SVD solutions. If the network has converged well, the only reason we cannot select these optimal features is the limited expressive power of the network. Thus, this gap is a good measure of how good our network structure is.

The third gap measures how good our dataset is, that is, how much of the dependence between $$ X $$ and $$ Y $$ a $$k$$-dimensional approximation can capture. The result is conveniently normalized to have an absolute upper bound of $$ k $$. For example, in a good solution of the MNIST problem, $$ H(S) $$ can be about 8.1, where $$ k = 9 $$, corresponding to the 9 dimensions needed to distinguish 10 hand-written digits. From this we can conclude that the dataset is actually really good, the selected features are very close to optimal, and hence the classification results can be expected to be very precise.

The Limitations
This is what we can do with the H-Score. Now here is what we cannot. Basically, there are cases where the H-Score is not an accurate measure of performance: a design with a lower H-Score can actually do better. This can happen for multiple reasons.

First, as stated, the H-Score measures how good the selected features are in the universal sense, that is, how useful the features are averaged over all possible queries. This universal choice often cannot be as good as a custom-made solution for a specific task. We argue that if we do not know the task, then the universal choice is the only sensible choice. In practice, however, it can be the case that we are actually interested in just one specific problem, or the problem space is actually quite narrow, and we are allowed to do some trial-and-error experiments. What such trial and error can give us is a choice of feature functions that are actually tuned to this specific task, even though such tuning happens in a rather implicit way. Although we would like to claim that our universal solution is a more noble one, we cannot deny that the trial-and-error approach does have its value in practice.

The second problem can be seen from our ImageNet table. Sometimes a solution with a higher H-Score can have worse performance, such as VGG16 and VGG19. One reason is that these solutions are significantly more complex than the others, with many more parameters. All of our results are derived under the assumption that the statistics are quite precisely and stably reflected in the learned models. The generalization error from overfitting is not captured.

The third problem of the H-Score, perhaps the most important one, is that it is based on a local approximation. This is explained in much more detail in our paper. The short statement is that the approximation allows us to focus on a particular local optimum. If we insist on doing the math without this approximation, the solutions would be much more complicated, less clean, harder to generalize, and sometimes numerically less stable. The lack of stability is one severe problem of deep neural networks, a nightmare for all engineers working in this area. Unfortunately, our theory does not help with that. The local assumption helped us avoid this entire issue from the outset. This is a deliberate choice, and perhaps the main reason that we could have a clean story like this. In optimization, brave people do look for better ways to find the global optima of non-convex problems, but it is hard to imagine that the results can be as general and as simple as those on local optima, which is what we are trying to find for the neural network problem.

Interpretability
Interpretability is an important issue for neural networks. Exactly what is learned by this complex structure? Those selected features, what are they? We should expect our results to shed some light on this. We have a mathematical expression for the output of each hidden node and the weight on each edge. We know what the network tries to compute and where it stores the results. That must be a good thing.

One obvious advantage is that we now have guidance for transfer learning and multi-task learning. Instead of using a large deep neural network as a black box, we can now name the intermediate results, which helps us use them in a different task. We have some practice with this, and will report it on another page.

Now, does this mean that we can interpret the meanings of the selected features? Not really. Perhaps even worse, it gives a reason why interpretability is something we probably should not expect. What our theory says is that, in general, the features selected by neural networks are the solutions to a number of optimization problems: they carry the most information, they are the most distinctive features, they are universally good for unknown inference tasks, etc. But they are not necessarily interpretable, meaning they might not correspond to simple and compact terms that humans are familiar with.

Think of a close friend of yours. The first thing that comes to your mind is probably a general impression of this person as a whole, not the age, gender, ID number, or anything else you can find on a customs form. People invent tags to categorize items in the world. The question of why these particular tags are chosen is far beyond the scope of engineering. All we can say is that they may not be aligned with the features we choose from data. Often, the features that best distinguish data samples are combinations of many human-chosen tags in specific ways. In principle, one can learn these combinations if there is a dataset labeled with the tags, and thus translate the data features into terms that we are more familiar with. We view this as a learning problem separate from that of selecting features from the data.

Bob finally found his soulmate since her ξ value is really high, and ξ = 0.0122 × haircolor + 0.0031 × athletic + 0.0041 × cooking-skills + 0.0201 × laugh-on-old-jokes + 0.0311 × prefer-python-to-matlab + ...

Yeah! You guessed it. Bob is a software engineer.

Download Links
Here is the code used on this page, available for download.

Experiment 1: The SoftMax Output Units

Experiment 2: Discrete Valued Inputs

Experiment 3: Limited Expressive Power

Experiment 4: H-Scores

This page is made with the help of Xiangxiang Xu. He looks like this.

Other languages: 中文