Understanding the Power of Neural Networks

Why Are Neural Networks So Powerful?
Well, that is a difficult question that has haunted us for about 30 years. The more successes we see from the applications of deep learning, the more puzzled we are. Why is it that we can plug in this “thing” to pretty much any problem, classification or  prediction, and with just some limited amount of tuning, we almost always can get good results. For curious minds, having such a powerful tool without knowing why it works is very unsettling: what if there is something even more powerful right around the corner?

As one works more and more with neural networks, this feeling of unsettling deepens. There are many design parameters, the number of hidden layers, number of nodes, the type of activation functions, etc, often have to be chosen only by the trial-and-error approach. Even more puzzling is the fact that neural networks are often considered hungry in training samples. At the same time, the back propagation training procedure is perhaps the most mysterious element of neural networks. Without understanding this procedure, it is hard to imagine how to improve its efficiency.

All of these problems essentially point at the same question that we have not answered: What exactly is a neural network doing? Where does its power come from?

Now here is an answer! Ready?
In short, Neural Networks extract from the data the most relevant part of the  information that describes the statistical dependence between the features and the labels. In other words, the size of a Neural Networks specifies a  data structure  that we can compute and store, and the result of training the network is the best approximation of the statistical relationship between the features and the labels that can be represented by this data structure.

I know you have two questions right away: '''REALLY? WHY?'''

The “why” part is a bit involved. We have a new paper that covers this. Briefly, we need to first define a metric that quantifies how valuable a piece of partial information is for a specific inference task, and then we can show that Neural Networks actually draw the most valuable part of information from the data. As a bonus, the same argument can also be used in understanding and comparing other learning algorithms,  PCA,  compressed sensing, etc. So everything ends up in the same picture. Pretty cool, huh? (Be aware, it is a loooong paper.)

In this page, we try to answer the “Really?” question. One way to do that is I can write a mathematical proof, which is included in the paper. Here, we will only try to numerically demonstrate some of the results, and maybe add just a little bit of mathematical and intuitive explanations. We will at the end of these experiments introduce a new metric, which can be used to measure how effective the network is. This is particularly useful when we are designing a complex network, and are having difficulties in choosing the network structure and the hyperparameters.

There are several goals of this page:


 * 1) For beginners of neural networks, we provide some links to the commonly used packages and comments in the codes. The examples we use are all simple and small. Hopefully this can be a good way to help you to start programming;
 * 2) For engineers with extensive experience in using neural networks, we use the code as a common language to explain the main ideas of our research, and hope it can help you to design your next project more flexibly and more efficiently;
 * 3) For statisticians that are more interested in the theory, this page is only a supplimentary material, not a replacement, for the mathematical proofs in our paper. We hope that you find the experiments to be an effective tool to explain the ideas behind.

To follow the experiments:
You can just read the code and comments on this page, and trust me with the results; or you can run it yourself. To do that, you will need a standard Python environment, including Numpy, Matplotlib, etc. Also, you will need a standard neural network package. For that, I use Keras, and run it with TensorFlow. You can follow this link to get them installed. I recommend using Anaconda to install them, which takes, if you do it right, less than 10 minutes. Trust me, it’s a worthy effort. These packages are really well made and powerful. Of course, you are also welcome to just sit back, relax and enjoy the journey, as I will show you everything that you are supposed to see from these experiments.

You need to have the following lines to initialize.

If you receive no error message after these, then congratulation, you have installed your packages right. And now we are ready for the ride!

The Experiments
We will have three experiments, one conceptually more complex than the other. In each experiemnt, we will generate some samples $$ (x_i, y_i), i=1, \ldots, N $$. We will use $$ x $$ as the inputs and $$ y $$ as the labels to train a neural network. Our goal is to understand what exactly happened inside the network during this training process: what does the network learn from the training data, and where does it store the learning results. To answer such questions, we will in some cases dig out the network parameters at the end of the training session and compare them with theoretic results; and in some other cases we will skip the training process and use the theoretic results to directly set the values of the network parameters and check the performance of the resulting network.

Here is a rough idea of the three experiments:


 * 1) We will start with the simpliest case where the neural network has only a single layer of SoftMax output units. This helps us to establish the first mathematical formula of the weights computed in neural networks;
 * 2) We will then work on a network with one layer of hidden nodes, but with a dataset where both $$ x $$ and $$ y $$ are discrete valued. This is in fact a case where the neural network has sufficient expressive power, in which case we can check the "ideal" feature selected by the network, and compare that with the theoretically optimal choice;
 * 3) We will then study a multi-layer network, to understand the approximation to the "ideal" choice in the previous case when the network is not expressive enough. From this, we will define a natural metric of the quality of this approximation, which is a way to measure the effectiveness of a given network structure.

Generate Data Samples
We will generate some data samples $$(\underline{x}, y )$$ pairs. $$\underline{x}$$ is the real-valued feature vector, and $$y $$ is the label. We will use these data samples to train a simple neural network so it can be used to do classification of the feature vectors.

In the following, we will generate $$ \underline{x} $$'s from the Dirichlet distribution, with parameters chosen randomly, but depends on the value of $$ y $$. Why Dirichlet? We of course can use a mixture Gaussian model, which might be conceptually more straightforward. However, we want to avoid the concern that our result only applies to the Gaussian cases, and implicitly we are stating somehow that Dirchlet is "sufficiently non-Gaussian". Clearly we are not making a formal statement here. Readers are welcome to try other distributions, the more the better.

Here is the code to generate the data:

Here are how the generated data samples look like. I had to add in the colors, as otherwise it is really hard to see. Not a very clear clustering problem, is it? The strange triangle shape comes from the fact that Dirichlet distribution has the support over a simplex. We generate samples on a 3-D simplex and project them down to the 2-D space.



Use Neural Networks for Classification
Now let’s make a neural network to nail these data. The network we would like to make has a single layer of nodes, not exactly deep learning, but a good start point to focus on the SoftMax unit. It looks like this:



What the network does is to train some weight parameters $$(\underline{v}_j, b_j )$$, to form some linear transforms of the feature vectors, denoted as $$Z_j = \underline{v}_j^T \cdot \underline{x} + b_j$$, one for each node, or neuron. The SoftMax output unit computes a distribution based on these weights,

$$Q^{(v,b)}_{Y|\underline{X}}(j | \underline{x}) = \frac{e^{Z_l}}{\sum_i e^{Z_l}}$$

and maximizes the likelihood with the given collection of samples

$$\max_{v, b} \sum_{i=1}^N \log Q_{Y|\underline{X}}^{(v,b)}(y_i| \underline{x}_i) $$

The number of nodes, $$ K $$, should be chosen as the number of possible values of the labels, $$Cy = |{\mathcal Y}| $$ in the code. The codes using Keras to implement this network is as simple as the follows:

The code should be self-explainary. model.add specifies the structure of the network, and model.fit starts the training to get the weights. The resulting weights can be accessed by calling model.get_weights, where we get the following results, in comparison with the centers of each class:

The question that has puzzled us for several decades is "what are those weights?"

What Are Those Weights?
Now we will compute the empirical distribution of $$ Y $$ and the empirical average $$ {\mathbb E} [\underline{X}|Y=y] $$ for each class as follows:

The resulting M is a Cy by Dx matrix. For each j in range(C_y=8), M[j,:] is a Dx=2 dimensional vector. Each column M[:, i] can be viewed as a function of $$ y $$. Note that the weights computed by the neural network, which we label as $$ v_j $$'s in the figure, is also a Cy by Dx matrix, one edge to connect each input dimension and each output value. We would like to compare the two. Before that, we need to use the following regulation routine to make sure that all functions of $$ y $$ have zero-mean and unit-variance with respect to the empirical distribution PPy.

Now we are ready to make some plots:

and the results are something like this:

Interesting, huh? So the training procedure of a neural network is actually computing something really simple: the empirical conditional expectations $${\mathbb E} [\underline{X}|Y=y] $$ for different values of $$y$$'s.

So why are these conditional expectations worth computing? This goes back all the way to Alfréd Rényi, and the notion of HGR correlation. In our paper, we can show that this conditional expectation as a function of is in fact the function that is the most relevant to the classification problem.

Well, if the HGR thing is too abstract here, you can think there is a “correct” function that statisticians would like to compute. This is somewhat different from the conventional pictures where learning means to learn the statistic model that governs $$(\underline{X}, Y)$$. Since the input features are often very high dimensional vectors or have other complex forms, learning this complete model can be difficult. The structure of a neural network gives a constraint on the number of parameters that we can learn, which is often much smaller than the number of parameters needed to specify the full statistical model of $$(\underline{X}, Y)$$. Thus, we can only hope to learn a good approximate model that can be represented by these parameters. What the HGR business and our paper says is there is a theoretical way to identify what is the best approximation to the full model with only this number of parameters; and what we are demonstrating here is that a neural network is computing exactly that, and the results are stored on the weights of its edges!

A Neural Net with NO Training
At this point, I guess we should be a bit more aggressive. How about make a neural network, and without feeding data to it to train, we simply compute and tell the net what the weights should be!

The code we use should be something like this:

The significant thing here is that there is no model.fit. Instead, we will use a function makeup_1layer to compute all the weights, and use model.set_weights to directly assign those weights. This way we are really using the neural network as a data structure, and replacing the backprop/training procedure by something we have full control with. After that, we use model.predict_classes to check how well it works.

Our weight calculating function is as follows:

In a more clear form, this routine computes

$$ \underline{v}_j = \frac{{\mathbb E}[\underline{X} | Y=j]}{\sigma^2}; \qquad b_j= \log P_Y(j) - \frac{\|{\mathbb E}[\underline{X} | Y=j]\|^2}{2\sigma^2} $$

where $$ {\mathbb E}[\cdot] $$ is the empirical average, and $$ \sigma^2 $$ is the average empirical variance of the entries in $$ \underline{x}$$.

Needless to say, these weights work very well. The performance matches that based on the keras training. In other words, the magic of neural networks lies right in these two equations!

Experiment 2: Discrete-Valued Inputs
The above results is by no means satisfactory when we are interested in DEEP learning. While it shows us what the SoftMax output nodes are doing, we know that real networks get their power from those multiple hidden layers of nodes. So what are all these layers doing?

Let's consider the following slightly modified network.



For the moment, we can think the red dashed line as representing many layers of hidden nodes. What they do is they form a selected feature function $$ S(x) $$, as a function of the inputs $$ x $$, and feed that to the SoftMax output layer. What function does it choose? Intuitively, we should expect that $$ S(x) $$ as an "informative" feature, from which the inference is possible. But what is "informative"?

To answer this question, we need to first look at a simple case, where $$ x $$ is discrete-valued, from $$ \{ 1, \ldots, m\} $$. The input to the network is the indicators, or one-hot representation, of $$ x $$, and there is a single layer of hidden nodes whose output is $$ S(x) $$. This is a simple case particularly because any desired function $$ S(x) $$ can be generated by selecting the appropriate weights $$ w_1, \ldots, w_m $$, in the form

$$ S(x) = \sigma\left( \sum_{i=j}^m w_j \cdot \mathbb{I}_{x=j}\right) $$

where $$ \sigma(\cdot) $$ is the activation functions, which we choose as the sigmoid function. In other words, the network with a single hidden layer is expressive enough. Our question is when we give the network the freedom to choose feature functions, what $$ S(x) $$ would it choose? and why?

Generate Data Samples
We randomly choose the joint distribution $$ p_{XY} $$. This is not necessary. You can certainly pick whichever joint distribution you like. We make this a random choice more or less to get a mental impression that we didn't cook up a special example which could be the only case that our results work. Well, like every other non-mathematical maneuver, it is not clear what exactly is achieved by doing that. Nevertheless, here is how it is done.

One important thing here is that we returned the distributions $$ p_{XY}, p_X, p_Y $$ after generating the data samples. This is partly cheating. A more strict approach should only return the data samples, and use a separate routine to compute the empirical joint distirbution, and the empirical marginal distributions. Here we saved that effort and have some faith on the law of large numbers.

At this point, we can already compute two functions as the theoretical results. Like the name suggests, these two things, S_theory and v_theory, will be used as the theoretical results to be compared with the training results from the neural networks. I will reveal what these computations are very soon. I mean, what kind of wikipage this would be if there weren't a little suspense!

Before using these samples, they have to be turned into one-hot form. This can be found as a standard function of [sklearn].

Run The Network
Now we are ready to make a network to learn from these data. This is fairly standard using Keras.

The 'weights' read out on the last line is a list. Two elements are arrays corresponding to the weights $$ w, v $$ in our figure. We take these weights out and have simple processing as follows, among the steps we used the 'regulate' to force the results to have zero mean and unit variance as before. After that, we plot the results in comparison with the theoretical results.

We need to point out that the processing of $$ S $$ and $$ v $$ are slightly different. For $$ v $$ we just take the corresponding weights, but for $$ S $$, we used the 'expit' function. This is because we used sigmoid function sigmoid as the activation function. One needs a little thinking to be convinced that the resulting $$ S $$ is array with elements of $$ S(x), x \in \{1, \ldots, m\} $$, where $$ S(x) $$ is precisely the feature value in the figure, the output of the hidden layer.

A typical plot is given here: What a surprise! They look the same! I hope now you are curious about how the theoretical values are computed, or, what is in that 'WhatTheorySays' function.



What Theory Says?
First of all, we give the true joint distribution $$ p_{XY} $$ to 'WhatTheorySays'. This is actually cheating. In principle, we should give the training sample set instead to make sure it sees the same thing that a neural network should see. One of course can compute the empirical joint distribution and use that in the place of $$ p_{XY} $$, and because of the Law of Large Numbers the result should be the same. The situation is actually slightly more subtle than this. The number of samples we use might not be sufficient to allow precise learning of the joint distribution, since both $$ {\mathcal X} $$ and $$ {\mathcal Y} $$ can be very large alphabets. In order that a neural network and our algorithm to work, we only need to make sure some relevant parts of the joint distribution is precisely estimated. This is a very important fact in terms of controling the sample complexity of the solutions. In our example, both alphabets are quite small. This allows us to avoid this entire discussion on the sample complexity. Interested readers can experiment by increasing the cardinalities, 'xCard, yCard', in the program, and try to see which one, the neural network or our approach fail first.

With that in mind, here the code:

Surpisingly short, isn't it? And yes, this is what the neural network computes! In mathematical form, what the code does is the following: first it form a matrix $$ B $$, of size $$ |{\mathcal X} | \times |{\mathcal Y}| $$, each entry of the matrix corresponds to a particular pair of values $$ (x,y) $$, where

$$ B(x,y) = \frac{p_{XY}(x, y)}{\sqrt{p_X(x)} \sqrt{p_Y(y)}} $$

After that we do find the singular value decomposition (SVD) of this B-matrix. We pick $$ \underline{u }_1 \in {\mathbb R}^{|{\mathcal X}|}$$ and $$ \underline{r}_1 \in {\mathbb R}^{|{\mathcal Y}|} $$ as the pair of left and right singular vectors corresponding to the second largest singular value. We treat $$ \underline{u}_1, \underline{r}_1 $$ as real-valued functions over $$ {\mathcal X, Y} $$, respectively, and do a simple normalization

$$ S(x) = \frac{u_1(x)}{\sqrt{p_X(x)}}, \quad v(y) = \frac{r_1(y)}{\sqrt{p_Y(y)}} $$

The resulting $$S, v $$ are our theoretical results. They automatically have zero-mean and unit-variance, so we didn't have to use the 'regulate' routine on them. And that is what you see in the above plots.

A bit dizzy, huh? Don't worry. Let me help you with this.

The B -Matrix
The first thing that you might notice and find confusing is that there is a scaling that seems particularly strange. We define a correspondence between a function of the data and a vector we operate on. In the X-space, we have a function $$ S : {\mathcal X} \mapsto {\mathbb R} $$ represented in a vector form $$ \underline{u} \in {\mathbb R}^{|{\mathcal X}|} $$, with $$ S(x) = u(x)/\sqrt{p_X(x)} $$. In the Y-space, the scaling is $$ v(y) = r(y) /\sqrt{p_Y(y)} $$.This defines the correspondence between functions and vectors $$ \underline{u} \leftrightarrow S, \underline{r} \leftrightarrow v $$. For the moment, please just be patient and take these as given.

One question you should have at this point is: "Isn't this $$ v(y) $$ function already computed in the previous page? Didn't it say that $$ v(y) = {\mathbb E} [ S(X)|Y=y] $$ ?"

Well, that is at least what I hope you to ask.

It turns out these two answers are exactly the same. To see this, we need to take a closer look of the matrix $$ B $$ we defined above. It has some really elegant properties, and here is the first one:

Let's consider the $$ B $$ matrix multiplying a vector. Consider $$ \underline{r} = B \cdot \underline{u} $$, where $$ \underline{u}, \underline{r}  $$ correspond to a pair of functions $$ S(x) $$ and $$ v(y) $$ as defined above. Now we simply write this out:

$$ \begin{align} v(y) &= \frac{1}{\sqrt{p_Y(y)}} r(y) = \frac{1}{\sqrt{p_Y(y)}} \sum_x B(x, y) u(x) \\ & = \frac{1}{\sqrt{p_Y(y)}} \sum_x B(x, y) \sqrt{p_X(x)} \cdot S(x) \\ &= \frac{1}{\sqrt{p_Y(y)}} \sum_x \left( \frac{p_{XY}(x,y) } {\sqrt{p_X(x)} \sqrt{p_Y(y)}} \right) \sqrt{p_X(x)} \cdot S(x)\\ &= {\mathbb E} [S(X)|Y=y] \end{align} $$

In words, this says that the matrix multipliation $$ \underline{r} = B \cdot \underline{u} $$ defines a mapping. It maps a feature function on $$ S(\cdot) $$ on $$ {\mathcal X} $$ to a corresponding feature function $$ v(\cdot) $$ on $$ {\mathcal Y} $$, with $$ v(y)={\mathbb E} [S(X)|Y=y] $$. And as we have seen, this mapping is clearly relevant to the neural network operation.

It is not hard to check that a conjugate operation $$ \underline{u} = B^T \cdot \underline{r} $$ defines a mapping in the reverse direction, it maps a function of $$ y $$ to a function of $$ x $$ as $$ S(x) = {\mathbb E}[v(Y)|X=x] $$.

Feature Projection
Now here is how the neural network operates. It initializes by randomly choosing the weights, and hence gets some random choices of the feature function $$ S(x) $$, and random output layer weights $$ v(y) $$. In the back propagation process, we first fix $$ S(x) $$ and update $$ v(y) $$. This can be viewed as finding a feature of $$y,  v(y) $$ that is best "aligned" with $$ S(x) $$. The solution is $$ v(y) = {\mathbb E} [S(X)|Y=y] $$. Then in the next step, we fix $$ v(y) $$ and look for a best aligned feature of $$ x $$, which gives $$ S(x) = {\mathbb E}[v(Y)|X=x] $$. After a number of iterations the process reaches a fixed point, where $$ S(x), v(y) $$ are well aligned to each other.

In this procedure, we repeated answer the question "what feature of one variable is best aligned with a given feature of another variable". We call this the "feature projection": we find the projection of a feature, say $$ S(x) $$， in the space of features that can be computed from $$ y $$, and the solution is always the conditional expectiation $$ v(y) = {\mathbb E} [S(X)|Y=y] $$.

This story brings out the SVD solution: suppose $$ \underline{u}, \underline{r} $$ are a pair of singular vectors of $$ B $$, with singular value $$ \sigma $$, we have $$ \sigma \cdot \underline{r} = B \cdot \underline{u} $$ and $$ \sigma \cdot \underline{u} = B^T \cdot \underline{v} $$. The scaling by $$ \sigma $$ is not important at this point. It is taken out by the "regulate" function anyway. When we compute the SVD, one numerical method is by power iteration. Namely, we start with a random choice of $$ \underline{u} $$ and repeated left-multiply $$ B^T B $$ to it. One can see now that the feature projections are doing exactly that.

In our paper, we show that the resulting $$ \underline{u}, \underline{r} $$ pair are the pair of singular vectors corresponding to the largest singular value of $$ B $$. It captures the strongest mode of dependence between $$ X $$ and $$ Y $$. This is why the corresponding feature functions are useful in making predictions from $$ X $$ to $$ Y $$. The story can easily be generalized if we are looking for not one but $$ k $$ pairs of singular vectors, and thus capture the strongest $$ k $$ modes of dependence. This corresponds to the neural network with $$ k $$ hidden nodes before the output layer.

One leftover thing that you may wonder is why did we take the singular vectors for the second largest singular value, not the first. It turns out that the first pair of singular vectors are always the same:

$$ \underline{u}_0 = [\sqrt{P_X(x)}, x \in {\mathcal X}], \underline{r}_0 = [\sqrt{P_Y(y)}, y \in {\mathcal Y}] $$,

and the singular value is $$\sigma_0= 1 $$. To see this, we write $$ \begin{align} B \underline{u}_0 = \sum_x B(x,y) \sqrt{P_X(x)} = \sum_x\left( \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}}\right)\cdot \sqrt{P_X(x)} = \sqrt{P_Y(y)} \end{align} $$

All other singular vectors must be orthogonal to this

$$ \begin{align} 0 = \langle \underline{u}_0, \underline{u} \rangle = \sum_x \sqrt{P_X(x)} \cdot \sqrt{P_X(x)} S(x) = {\mathbb E}[S(X)] \end{align} $$

which means the feature function $$ S(x) $$ corresponding to $$ \underline{u} $$ must be zero-mean with respect to $$ P_X $$. We take this as a given assumption without loss of generality when we choose feature functions. It turns out that all other singular values of $$ B $$ must be less than or equal to 1. This is a simple consequence of the data processing inequality. Thus, the second pair of singular vectors are in fact the first pair of non-trivial singular vectors.

In our paper, we actually use the following definition of $$ \tilde{B} $$ instead of $$ B $$, for which this cumbersome largest singular mode is taken out. The good thing is that for $$ \tilde{B} $$ it is easier to get convinced that it addresses the dependence between $$ X $$ and $$ Y $$ $$ \begin{align} \tilde{B} (x,y)= \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} - \sqrt{P_X(x)}\sqrt{P_Y(y)} = \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} \end{align} $$

Experiment 3: Limited Expressive Power
So far we have demonstrated that in a neural network used to predict the value of $$ Y $$ from input samples of $$ X $$, the network uses the training samples to learn two feature functions, $$ S(x) $$ and $$ v(y) $$, and store these two functions as weights of some edges. The two feature functions are the fixed-point solution of

$$ \begin{cases} S(x) = {\mathbb E}[v(Y)|X=x]\\ v(y) = {\mathbb E}[S(X) |Y=y] \end{cases} $$

Moreover, they correspond to a pair of left and right singular vectors of a $$ \tilde{B} $$ matrix, and thus have the interpretation of the strongest "mode" of dependence between $$ X $$ and $$ Y $$.

In other words, there is an explicit goal that a neural network tries to compute, and we explain the reason why that is a good thing to compute.

This goal is actually achieved when we give the network enough expressive power, by choosing the input $$ x $$ samples to be discrete valued. In this experiment, we will worry about the case when the previous layers, no matter how many there are, is not expressive enough. This happens in most cases where neural networks are useful, in NLP, in vision, etc. Intuitively, what the network should do is to find the second best thing, a good approximation to these ideal solutions from the SVD of the $$ \tilde{B} $$ matrix, which can be generated from whatever network structure that is given. We would like to understand the nature of this approximation a bit better than this intuitive level.

Here is the network we plan to work with:



Discrete Data, Continous Data
The main differene between this and the network we worked with previously is that the inputs to the network, $$ T_1, \ldots, T_m $$ are assumed to be continuously valued, where in the previous experiment the inputs are assumed to be the one-hot encoding of a discrete random variable $$ X $$.

Why this is a big deal? Well, if you check the solutions we presented there, we argue that the network tries to compute some functions of the joint distribution $$ P_{XY} $$. If the input $$ T $$ is continuous valued, it's not that we cannot assume a joint distribution $$ P_{TY} $$, but notions like $$ {\mathbb E}[v(Y)|T=t] $$ no longer makes much sense. Think about it, the conditional expectation is a function of $$ t $$, and for two close by values $$ t $$ and $$ t+ \delta $$, we should expect the conditional expectaions to be close, but nothing in the data can guarantee that. One solution to this is to use a smoother, but that cannot give very clean results.

So here is the plan. We will not feed the neural network with just any continuous-valued inputs. Instead, we assume there is a discrete valued $$ X $$ behind $$ T $$. That is, in the data acquisition process, we cannot observe the true discrete valued $$ X $$, but instead observe some results from "pre-processing", which are the continuous-valued $$ T $$. We cannot change this preprocessing. And the question is "what would the network do with these processed data samples?"

One thing convenient is that with this setup, we can understand deep networks. In a multi-layer setup, at each layer, we can think of the outputs of the previous layers as a particular kind of preprocessing. We ask the question "how would the neural network assign weights on the current layer given this preprocessing?" This is one step of the back propagation; and is exactly the same problem in our setup.

This leads to a somewhat philosophical question: can we always think of any kind of continuous-valued data as the preprocessing result of some collection of discrete-valued factors? Well, our answer is "pretty much!" Particularly if you allow the alphabet size of $$ X $$ to be large. Let's take this not as a mathematical fact, but as an assumption.

This assumption does cause a conceptual difference when we talk about the "expressive power" of a neural network. The commonly used notion tries to ask whether a given network can generate an arbitrary function of the input $$ T $$. For that purpose, the complexity of the network we wish for is unlimited. That is, we always want more nodes and more layers in order to generate more and more complex functions of $$ T $$. If we now take the assumption that there is a discrete-valued ground truth $$ X $$ behind $$ T $$, then the network we need is limited. For example in the figure, if we have $$ m \geq L $$, i.e. the dimensionality of the processing results higher than the cardinality of $$ X $$, then the network can assign weights $$ w $$'s to revert the preprocessing so we get back to the discrete-valued problem again. In reality, the cardinality of $$ X $$ is really really large, particularly when noise is involved in the preprocessing, so even this idea of "limited sized network" is actually quite large. If you are asking your boss for another GPU, don't take that back yet.

OK! Enough discussions. Let's do some experiments.

Run the Network
Here is how we generate the data:

The function "Generate2DSamples" is the same as before. It randomly pick a joint distribution $$ P_{XY} $$ and generates data samples from that. "MakeLabels" is the one-hot encoding as before. The only added step is a randomly chosen "PreProcessing" matrix, and we get some samples $$ T(x_i) $$ for each $$ x_i $$. Note that we make sure xCard > tDim. This corresponds to the notations in the figure as $$ L > m $$, since otherwise the problem becomes trivial.

Now we feed the data, $$ (T(x_i), y_i) $$ pairs to the  neural network.

We train the network, and then get the weights. Now all that's left is to check the weights?

Check the Weights
We have already established that if we feed the discrete ground truth $$ X $$ to the network, the traning results should be

$$ \begin{cases} v^*(y) = {\mathbb E} [S^*(X)|Y=y]\\ S^*(x) = {\mathbb E} [v^*(Y)|X=x] \end{cases} $$

which has an elegant interpretation as the singular vectors of the $$ \tilde{B} $$ matrix.

Now we replaced $$ X $$ with the preprocssing results $$ T(x) $$. The first relation should remain the same:

$$ v(y) = {\mathbb E} [S(T(X)) | Y=y ] $$

The second relation is no longer the same. Because now we cannot compute an arbitrary function $$ S^*(x) $$. The network structure restricted us to be only abel to compute functions of a particular form. Namely,

$$ S(x) = \sigma\left( \sum_{j=1}^m w_j \cdot T_j(x) + d\right) $$

where $$ \sigma(\cdot) $$ is the sigmoid function, and $$ T_j(\cdot) $$'s are given. Intuitively, all we can do is to find one choice of the weights $$ (w_1, \ldots, w_m, d) $$ such that the resulting $$ S(x) $$ is close to the desired $$ S^*(x) $$ in some sense.

Recall that we had a name to the conditional expectation step: feature projection. For a given feature $$ v(y) $$, we try to find a feature that can be computed from $$ x $$ that is best aligned with $$ v(y) $$. The result is the conditional expectation $$ S(x) = {\mathbb E} [v(Y) |X=x] $$. Now our problem is only slightly changed. For a given feature $$ v(y) $$, we try to find a feature that can be computed from the $$ T_j(x) $$'s in the specific form as above that is best aligned with $$ v(y) $$. This is conceptually only a slight generalization. Thus we call this process "Feature Projection with Restrictions".

Numerically, it is actually quite a bit harder to verify such an intuition, since the restriction on $$ S(x) $$ is non-linear, also maybe less obvious, the bias terms can be rather annoying. Let's handle them one by one.

The Annoying Bias
The bias term, $$ b_1, \ldots, b_k $$ in the figure, turns out to be very tediuous and useless.

What actually happens is that when neural networks is understood as feature projections, i.e. operators that maps one feature function to another, it is quite clear that a constant shift on the feature functions does nothing. Thus we should always think of such operations as mapping between zero-mean feature functions: zero-mean with respect to the marginal distribution of the variable that the feature function is defined, $$ P_X $$ for $$ S(x) $$, and $$ P_Y $$ for $$ v(y) $$.

We can therefore define notations as

$$ \begin{align} \tilde{S}(x) &= S(x) - {\mathbb E}[S(X)], \\ \tilde{v}(y) &= v(y) - {\mathbb E}[v(Y)] \end{align} $$

and a better way to write the feature projections should be

$$ \begin{cases} \tilde{S}(x) = {\mathbb E}[\tilde{v}(Y)|X=x]\\ \tilde{v}(y) = {\mathbb E}[\tilde{S}(X)|Y=y] \end{cases} $$

Unfortunately, neural networks do not appreciate the beauty of this simplest form. (And you are still worrying about whether one day AI can replace humankind.) Instead, it controls the mean values of these functions so that the nodes work on some specific operating points on the activation functions, to utilize the non-linearity around those operating points. As a result, the output generated by  neural networks are almost never zero-mean. For example, the output of the sigmoid function is always positive, and cannot have zero mean.

This is why whenever we need to compare the network outputs with the theoretical results, we use the "regulate" function. Inside the networks, the bias node and the weights on the edge connected to it are used to compensate for these non-zero means. For example, the weights $$ b_1, \ldots, b_k $$, written in function form as $$ b(y) $$ in fact satisfies

$$ b(y) = \log P_Y(y) - v(y) \cdot {\mathbb E} [S(X)] $$

where the second term is used to compensate for the mean value of $$ S(x) $$. It helps to make sure that the output nodes have $$ v(y) \cdot S(x) + b(y) = v(y) \cdot \tilde{S}(x) + \log P_Y(y) $$

That is, $$ S(x) $$ is effectively zero-mean from the output nodes point-of-view.

This fact can be easily verified with the following code piece: with "weights" as the one taken from the network by "model.get_weights"



And the output looks like this.

I hope you didn't catch it, on both "b" and "b_theory", we subtracted another mean! This time the mean w.r.t. $$ P_Y $$. Because the output layer softmax function is another non-linear function, and the network plays with this mean again! Annoying, isn't it?

Careful readers might also find that in the Softmax page when we make up our own weights to plug in to the network, we used a different formula $$ b(y) = \log P_Y(y) - \frac{({\mathbb E}[S|Y=y])^2}{2 \cdot {\rm var}[S]} $$

Instead of showing you why they are actually equivalent, which is quite tedious, let's just say the specific form depends on the way that these weights are regulated. And don't we all wish that there wasn't any bias in any form?

 "I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character." --- Dr. Martin Luther King

The Content of Their Character
Now that we have put the bias issue aside, let see how the network makes the feature projection. Recall the ideal output of the hidden node is

$$ S^*(x) = {\mathbb E} [ \tilde{v}(Y)| X=x] = {\mathbb E}[v(Y)|X=x] - {\mathbb E}[v(Y)] $$

Our conjecture is that the network generates an approximation to this as $$ \tilde{S}(x) = S(x) - {\mathbb E}[S(X)] $$. Knowing that whatever mean it creats will be taken out at a later layer, the network actually makes one that is not zero mean:

$$ S^{(w, d)}(x) = \sigma\left( \sum_j w_j T_j(x) + d \right) $$

by choosing the weights $$ w_1, \ldots, w_m, d $$.

Let's make a bold guess that this notion of approximation is in the mean-square error sense, now we can write the optimization in a single line

$$ \min_{w, d} \sum_x P_X(x) \cdot ( S^{(w, d)}(x) - {\mathbb E}[S^{(w, d)}(X)] - S^*(x))^2 $$

or using the $$ \tilde{S} $$ notation

$$ \min_{w, d} \sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x))^2 $$

Now we can take some derivatives:

$$ \begin{align} 0 &= \frac{\partial}{\partial w_j } \sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x))^2\\ &= \sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x)) \cdot\left( \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x)\right) \end{align} $$

This now can be viewed as two functions, $$ (\tilde{S}^{(w, d)} (x) - S^*(x)) $$ and $$ \left( \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x)\right)$$ being orthogonal, in the inner producted weighted by $$ P_X $$. The fact that the inner product is 0 is harder to check numerically since both functions might be scaled rather arbitrarily, so it is hard to tell what numerical value is close enough to 0. What we would actally do is to check the angle between the two functions by taking the "arccos", and see if it is close to $$ \pi/2 $$. Well, we actually changed the unit and compared the angle with $$ 90 ^\circ $$.

The only thing that one needs to be careful about is the second function with the derivative now. I know may people, including myself, memoerize the derivative for the sigmoid function: $$ \sigma'(z) = \sigma(z) \cdot (1- \sigma(z)) $$. However, here we need the derivative of the sigmoid function subtracting its mean:

$$ \begin{align} \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x) &= \frac{\partial}{\partial w_j} \left[ S^{(w, d)} (x) - \sum_{x'} P_X(x') S^{(w, d)}(x') \right] \end{align} $$

This in our notation is simply $$ \widetilde{\frac{\partial }{\partial w_j} S^{(w, d)}}(x) $$. That is, the derivative taken away the mean. Finally, filling in the fact that

$$ \frac{\partial}{\partial w_j} S^{(w, d)} (x) = S^{(w, d)}(x) (1-S^{(w, d)}(x)) \cdot T_j(x) $$

and we are ready to program again.

And the results? Here is a typical set of values:

Convinced?

A Theory for Deep Learning
If you happened to type "neural networks" on your Google bar, chances are you will find "explained" as the suggested next word. The fact is there are many many works trying to give an explaination to why neural network works. I have not seen one work that provides an explict answer that identifies a mathematical quantity and shows that matches with the neural networks learning results. In that sense, I do think we are making progress here.

We didn't present much of a theory in this page, nor did I wish to use these simple demos to prove anything. I only hope to use these experiments to explain the main ideas, to show you what would the theory suggest in a few specific scenarios. Hopefully this would make some of you interested enough to read the paper. In our paper, we put out a more complete picture. It starts by saying that we need a new way to measure information itself. After designing a new information metric, we can use that to describe the information flow in any data processing procedure, what information is kept, what is discarded, and whether such choices are good ones. We then show that finding the singular vectors of the $$ \tilde{B} $$ like what we showed here is a good choice: it captures the strongest modes of dependence between $$ X $$ and $$ Y $$ like we claimed; it captures the most signifcant elements of the "common information" between $$ X $$ and $$ Y $$; and probably more importantly, the selected features are the optimal choices of what we called the "universal feature selection problem".

In classic statistics, if we need to select some features from data for the purpose of solving some inference problems, the answer is always to select the sufficient statistic. We argue that in most learning problems, as we process high dimensional data samples, we do not know exactly what attributes of the data we would be interested in detecting. For example, when we observe a customer's behavioral record it is not clear what attribute(s) matters when we try to recommend a product to her later. Thus, for these new problems, we cannot use a notion of sufficient statisitc for a particular inference task. Instead, we need to select a set of features, such that regardless of what attribute we might be interested in detect later, the decision based on these features are "universally" good. For this new problem, we show that the choice from the singular vectors are indeed optimal. That is, any feature selection procedure, as long as the goal is not a single problem with known statistical model, the sensible choice should be such universal features. And neural networks, as one of such tools, had better give these solutions. This is the basis of the above experiments.

Particularly in this area, I would like to avoid overstating the importance of our results, such as saying "blackbox opened". The best way of doing that is to show the mathematical results. Since we are not doing it here, I think it is important to make it clear what we can and cannot do with this theory. Here it goes.

A Performance Metric for Neural Networks
We get a performance metric of neural networks. If we keep the notation that $$ \underline{S}(x) = [S_1(x), \ldots, S_k(x)] $$ is the output of the last hidden layer with $$ k $$ nodes, and $$ \underline{v}(y) = [v_1(y), \ldots, v_k(y)] $$ is the weights on the output layer, our theory says the goal is to make them as highly correlated as possible, i.e. to make $$ {\mathbb E}[ \langle\underline{S}(x), \underline{v}(y)\rangle] $$. The optimal solution is to choose $$ S(x) $$'s and $$ v(y) $$'s from the SVD solutions. However, due to the limited expressive power of the selected network structure, we usually cannot achieve such optimum. The gap is then a good metric of how effecitve our network is.

The H-Scores
For that purpose, we can define a metric we called the "H-Score". For a given network with both $$ S(x) $$ and $$ v(y) $$ computed, keep in mind that a neural network might not put the answers in out favorate coordinate system, but can have rotation, scaling and shift to the solutions, we first write $$ \tilde{S}(x), \tilde{v}(y) \in {\mathbb R}^k $$ as the centralized versions of the features and the weights, with the means subtracted, and then define $$ \Phi = [\phi_1, \ldots, \phi_k], \Psi= [\psi_1, \ldots, \psi_k] $$, with

$$ \begin{align} \phi_i (x) &= \sqrt{P_X(x)} \cdot \tilde{S}(x) \\ \psi_i (y) &= \sqrt{P_Y(y)} \cdot \tilde{v}(y) \end{align} $$

Then the "H-Score" is defined as

$$ \begin{align} H(S, v) &:= \|\tilde{B}\|^2 - \| \tilde{B} - \Psi \Phi^T \|^2\\ &= 2 {\mathbb E}[ \underline{S}^T(X) \underline{v}(Y) ] - {\rm trace} ({\rm cov}(\underline{S}(X)) {\rm cov}(\underline{v}(Y))) \end{align} $$

If we are interested in only to evaluate how good the selected feature $$ S(x) $$ is, we can plug-in the theoretical optimal weights to get the one-sided H-score as

$$ \begin{align} H(S) := H(S, v^*) &= \| \tilde{B} \Phi (\Phi \Phi^T) ^{-\frac{1}{2}}\|^2 \\ &= {\mathbb E}_{P_Y} [ \ {\mathbb E} [\underline{S}(X) |Y=y]^{T} \cdot {\rm cov}(\underline{S}(x))^{-1} \cdot {\mathbb E} [\underline{S}(X) |Y=y]\ ] \end{align} $$

In practice, all of the expectations and covariances above can be replaced by the empirical averages. So the H-Scores can be computed for all the features and weights computed in the network, including the features at the output of intermediate hidden layers, to evaluate how good they are.



To demonstrate the use of the H-Score, we have the following classification experiment for the data as shown. These are the first 2 dimensions of a 6-dimensional dataset, with color-coded classes. The data is generated with a mixture-Gaussian model, and is rather complicated. We need a multi-layer neural network for this task, and we plotted the H-Scores for the outputs at each layer as follows:



What we can observe is that as we run the network with more and more iterations, the H-Score improves; and the H-Scores of the outputs at the later layers is higher. This shows that the more iterations and more layers in the network are actually picking better and better features of the data. Finally, the classification accuracy is also included in the figure to show the correspondence with the H-Scores: higher H-Scores generally mean better performance as expected.

Interested readers might find from the plot that layer-3 might not be doing her job too well, and try to do something about it. The code for this experiment can be downloaded from here.

For a more realistic test, we also evaluated the features selected from different algorithms on the ImageNet dataset, where we listed the H-Scores and the accuracy in the following table to show the correspondence.

Why Not Log-Loss?
Of course the more commonly used metric in such experiments is the log-loss, which is also called the cross entropy, or the K-L divergence. Here is how the H-Score is related to them and why it is (somewhat) better.

Log-Loss, or cross entropy computes $$ {\rm LL}(S, v)= {\mathbb E}_{P_{XY}} \left[ \log Q_{Y|X}^{(S, v)}(Y|X) \right] $$, where $$ P $$ is the empirical distribution and $$ Q^{(S, v)} $$ is the output of the SoftMax unit interpretted as a conditional distribution. The entire network chooses $$ S $$ and $$ v $$ to maximize this. As the objective of the entire optimization, we of course can use that as a measure of how good the result is. In fact, currently the most important job of the engineer who designed the neural network when it runs is to watch the log-loss value grows, slows down, and converges.

One issue of the Log-Loss is that it is not clear a priori what value we should expect. Should it be positive, negative, in the order of a few hundred, or 10 to power of 6? The actual result depends on the specific dataset and task. This makes it hard to use the log-loss as an indicator of the goodness of the network in an absolute sense.

An alternative is the K-L divergence, which is also called the relative entropy. It computes $$ D(P_{XY} || P_X \cdot Q_{Y|X}^{(S, v)}) = {\mathbb E}_{P_{XY}} \left[ \log \frac{P_{Y|X}(Y|X)}{Q_{Y|X}^{(S, v)}(Y|X)}\right] $$

and tries to minimize it. In terms of the optimization, this is equivalent as the Log-Loss, since the only difference is a term $$ {\mathbb E} [\log P_{Y|X}] $$, which does not depend on $$ (S, v) $$ at all.

A slight advantage of the K-L divergence is that it has an operational meaning: the distance between the two distribution: the actual empirical distribution $$ P $$, and the one generated by the network $$ Q $$. This gives the entire network a nice interpretation: it finds the best approximation to the true joint empirical distribiton $$ P_{XY} $$. Operationally, K-L divergence does have an absolute range: the divergence above always satisfies $$ 0 \leq D(P||Q) \leq \log |{\mathcal Y}|$$. This is a good thing. For example you can claim that for a dataset with $$ |{\mathcal Y}| =8 $$ you can get a K-L divergence of 0.02, it sort of says that you have a pretty good approximation; instead if your K-L divergence is 1.2, then your network is pretty much shooting randomly.

The problem of the K-L divergence is that we often cannot compute it from the data, because we don't know $$ P_{Y|X} $$. In fact, the reason we want to have a good approximation to the true $$ P_{Y|X} $$ is often to have a decent guess of what it looks like. This is where one should understand this difficulty to have a good metric for neural networks, which is like a Marco Polo game: we are chasing this wildly moving target, and it is hard to report where it is and how close we are from it with a single number.

Another issue of the K-L divergence is revealed when we write it in our vector form, with

$$ D(P_{XY} || P_X \cdot Q_{Y|X}^{(S, v)}) \approx \frac{1}{2} \| \tilde{B} - \Psi \Phi^T \|^2 = \frac{1}{2} \| \tilde{B} - \sum_{i=1}^k \psi_i \cdot \phi_i^T \|^2 $$

We are trying to use a rank-k matrix to approximate $$ \tilde{B} $$, but the distance we used here is actually the norm of the entire matrix. That is, even if we have used the optimal rank-k matrix computed from the SVD, the remaining singular values $$ \sigma_{k+1}, \sigma_{k+2}, \ldots $$ are still there in the resulting distance. In general, the K-L divergence is a sum of the contribution from these singular values that we can never reach to and the actual distance between our rank-k solution and the optimal one; and we have no way to separate the two. This is bad particularly when $$ |{\mathcal Y}| $$ is large.

This is the point that we can understand why H-Score is nice:

$$ H(S, v) := \|\tilde{B}\|^2 - \| \tilde{B} - \Psi \Phi^T \|^2 $$

That is, we subtracted from the K-L divergence $$ \|\tilde{B}\|^2 $$, which is the part of the K-L divergence that is not computable from the data samples. With that, we also dropped the contributions of the extra singular values. To see that, suppose $$ \Psi \Phi^T $$ is precisely the rank-k approximation of $$ \tilde{B} $$ from the SVD, the resulting H-Score is then the sum of the squared top-k singular values, with all other singular values cancelled.

The following sequence of inequalities can be very useful:

The Limitations
This is what we can do with the H-Score. Now here is what we cannot. Basically there are ways where the H-Scores are not accurate measure of the performance. A design with a lower H-Score can actually do better. This can happen for multiple reasons.

First, as stated, H-Score measure how good the selected features are in the universal sense. That is, how useful the features are averaged over all possible queries.

one can actually have a network with a lower $$ \rho $$ value yet works very well in practice, outperforming another network with a higher $$ \rho $$ value. The $$ \rho $$ value is a measure of the "universal" performance. More precisely, when we don't know what decision/classificiation/estimation problem we need to solve, we process the data to get $$ S(x) $$ that is useful to all these problems, on average. This often cannot be as good as the custom-made solution for a specific task. We argue that if we don't know the task, then the universal choice is the only senisble choice. In practice, however, it can be the case that we are actually interested in just one specific problem, or the problem space is actually quite narrow; and we are allowed to do some trial-and-error experiments. What such trial and error can gives us is the choice of feature functions that are actually tuned to this specific task, even though such tunning happens in a rather implicit way. Although we would like to claim that our universal solution is a more noble solution, we cannot deny that this trial and error approach does have its value in practice.

That coin of yours, it's HEAD! No? Then it is a TAIL! See! 100% correct! --- Whether you call this cheating is actually a philosophical issue.

The second problem of our metric is that it is based on a local approaximation. This is explained in much more details in our paper. The short statement is that the approximation allows us to focus on local optima. If we insist to do the math without this approximation, the solutions would be much more complicated, less clean, harder to generalize, and sometimes numerically less stable. In optimization, brave people do look for better ways to find the global optima for non-convex problems, but it is hard to imagine the result can be as general and as simple as those on local optima, which is what we are trying to find for the neural network problem.

This page is made with the help of Xiangxiang Xu. He looks like this.