Understanding the Power of Neural Networks

Why Are Neural Networks So Powerful?
Well, that is a difficult question that has haunted us for about 30 years. The more successes we see from the applications of deep learning, the more puzzled we are. Why is it that we can plug this “thing” into pretty much any problem, classification or prediction, and with just a limited amount of tuning almost always get good results? For curious minds, having such a powerful tool without knowing why it works is very unsettling: what if there is something even more powerful right around the corner?

As one works more and more with neural networks, this unsettling feeling deepens. Many design parameters, such as the number of hidden layers, the number of nodes, and the type of activation functions, often have to be chosen by trial and error. Even more puzzling is the fact that neural networks are often considered hungry for training samples. At the same time, the backpropagation training procedure is perhaps the most mysterious element of neural networks. Without understanding this procedure, it is hard to imagine how to improve its efficiency.

All of these problems essentially point at the same question that we have not answered: What exactly is a neural network doing? Where does its power come from?

Now here is an answer! Ready?
In short, neural networks extract from the data the most relevant part of the information that describes the statistical dependence between the features and the labels. In other words, the size of a neural network specifies a data structure that we can compute and store, and the result of training the network is the best approximation of the statistical relationship between the features and the labels that can be represented by this data structure.

I know you have two questions right away: '''REALLY? WHY?'''

The “why” part is a bit involved. We have a new paper that covers this. Briefly, we first need to define a metric that quantifies how valuable a piece of partial information is for a specific inference task, and then we can show that neural networks actually draw the most valuable part of the information from the data. As a bonus, the same argument can also be used to understand and compare other learning algorithms, such as PCA and compressed sensing. So everything ends up in the same picture. Pretty cool, huh? (Be aware, it is a loooong paper.)

On this page, we try to answer the “Really?” question. One way to do that would be to write out a mathematical proof, which is included in the paper. Here, we will only try to numerically demonstrate some of the results, and maybe add just a little bit of mathematical and intuitive explanation. At the end of these experiments we will introduce a new metric that can be used to measure how effective a network is. This is particularly useful when we are designing a complex network and having difficulty choosing the network structure and the hyperparameters.

There are several goals of this page:


 * 1) For beginners of neural networks, we provide some links to the commonly used packages and comments in the code. The examples we use are all simple and small. Hopefully this is a good way to help you start programming;
 * 2) For engineers with extensive experience in using neural networks, we use the code as a common language to explain the main ideas of our research, and hope it can help you to design your next project more flexibly and more efficiently;
 * 3) For statisticians who are more interested in the theory, this page is only supplementary material, not a replacement, for the mathematical proofs in our paper. We hope that you find the experiments to be an effective tool to explain the ideas behind them.

To follow the experiments:
You can just read the code and comments on this page, and trust me with the results; or you can run it yourself. To do that, you will need a standard Python environment, including Numpy, Matplotlib, etc. Also, you will need a standard neural network package. For that, I use Keras, and run it with TensorFlow. You can follow this link to get them installed. I recommend using Anaconda to install them, which takes, if you do it right, less than 10 minutes. Trust me, it’s a worthy effort. These packages are really well made and powerful. Of course, you are also welcome to just sit back, relax and enjoy the journey, as I will show you everything that you are supposed to see from these experiments.

You need the following lines to initialize.
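A minimal set of initialization lines might look like the following; depending on your Keras version the exact import paths may differ slightly.

```python
# Minimal initialization, assuming a standard Keras + TensorFlow setup.
import numpy as np
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
```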

If you receive no error message after these, then congratulations, your packages are installed correctly. And now we are ready for the ride!

The Experiments
We will have three experiments, each conceptually more complex than the previous one. In each experiment, we will generate some samples $$ (x_i, y_i), i=1, \ldots, N $$. We will use $$ x $$ as the inputs and $$ y $$ as the labels to train a neural network. Our goal is to understand what exactly happens inside the network during this training process: what does the network learn from the training data, and where does it store the learning results? To answer such questions, we will in some cases dig out the network parameters at the end of the training session and compare them with the theoretical results; in other cases we will skip the training process, use the theoretical results to directly set the values of the network parameters, and check the performance of the resulting network.

Here is a rough idea of the three experiments:


 * 1) We will start with the simplest case where the neural network has only a single layer of SoftMax output units. This helps us establish the first mathematical formula for the weights computed in neural networks;
 * 2) We will then work on a network with one layer of hidden nodes, but with a dataset where both $$ x $$ and $$ y $$ are discrete valued. This is in fact a case where the neural network has sufficient expressive power, so we can check the "ideal" feature selected by the network and compare it with the theoretically optimal choice;
 * 3) We will then study a multi-layer network, to understand the approximation to the "ideal" choice in the previous case when the network is not expressive enough. From this, we will define a natural metric of the quality of this approximation, which is a way to measure the effectiveness of a given network structure.

Generate Data Samples
We will generate some data samples as $$(\underline{x}, y )$$ pairs, where $$\underline{x}$$ is the real-valued feature vector and $$y $$ is the label. We will use these data samples to train a simple neural network so that it can classify the feature vectors.

In the following, we will generate the $$ \underline{x} $$'s from the Dirichlet distribution, with parameters chosen randomly but depending on the value of $$ y $$. Why Dirichlet? We could of course use a Gaussian mixture model, which might be conceptually more straightforward. However, we want to avoid the concern that our result only applies to Gaussian cases, and we are implicitly suggesting that the Dirichlet is "sufficiently non-Gaussian". Clearly we are not making a formal statement here. Readers are welcome to try other distributions, the more the better.

Here is the code to generate the data:
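The following is a sketch of such a generator. The class count Cy = 8, feature dimension Dx = 2, sample size N, and the range of the Dirichlet parameters are assumptions chosen to match the description of the plots below.

```python
# A sketch of the data generation, assuming Cy = 8 classes and N = 5000 samples.
# For each class y we draw a random Dirichlet parameter vector, then sample
# 3-D points from that Dirichlet and keep the first two coordinates (a
# projection of the 3-D simplex onto the 2-D plane).
Cy = 8          # number of classes, |Y|
Dx = 2          # dimension of the feature vector x
N = 5000        # number of samples

np.random.seed(0)
alphas = np.random.uniform(0.5, 5.0, size=(Cy, 3))   # one parameter vector per class

y = np.random.randint(Cy, size=N)                    # labels, uniform over the classes
x = np.zeros((N, Dx))
for i in range(N):
    sample = np.random.dirichlet(alphas[y[i]])       # a point on the 3-D simplex
    x[i] = sample[:Dx]                                # project down to 2-D
```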

Here is how the generated data samples look. I had to add in the colors, as otherwise it is really hard to see. Not a very clear clustering problem, is it? The strange triangle shape comes from the fact that the Dirichlet distribution has its support over a simplex. We generate samples on a 3-D simplex and project them down to the 2-D space.



Use Neural Networks for Classification
Now let’s make a neural network to nail these data. The network we would like to make has a single layer of nodes, not exactly deep learning, but a good starting point to focus on the SoftMax unit. It looks like this:



What the network does is to train some weight parameters $$(\underline{v}_j, b_j )$$, to form some linear transforms of the feature vectors, denoted as $$Z_j = \underline{v}_j^T \cdot \underline{x} + b_j$$, one for each node, or neuron. The SoftMax output unit computes a distribution based on these weights,

$$Q^{(v,b)}_{Y|\underline{X}}(j | \underline{x}) = \frac{e^{Z_j}}{\sum_{l} e^{Z_l}}$$

and maximizes the likelihood with the given collection of samples

$$\max_{v, b} \sum_{i=1}^N \log Q_{Y|\underline{X}}^{(v,b)}(y_i| \underline{x}_i) $$

The number of nodes, $$ K $$, should be chosen as the number of possible values of the labels, $$Cy = |{\mathcal Y}| $$ in the code. The code using Keras to implement this network is as simple as the following:
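A minimal Keras version of this model might look like the sketch below; the optimizer, number of epochs, and batch size are assumptions rather than the exact settings used for the plots.

```python
# A minimal single-layer softmax classifier in Keras (a sketch; the training
# hyperparameters are assumptions).
y_onehot = to_categorical(y, num_classes=Cy)

model = Sequential()
model.add(Dense(Cy, input_dim=Dx, activation='softmax'))  # one softmax node per label value
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x, y_onehot, epochs=50, batch_size=32, verbose=0)

# Keras stores the kernel as a Dx-by-Cy array; its transpose is the
# Cy-by-Dx layout discussed in the text.
weights, biases = model.get_weights()
```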

The code should be self-explanatory. model.add specifies the structure of the network, and model.fit starts the training to get the weights. The resulting weights can be accessed by calling model.get_weights, where we get the following results, in comparison with the centers of each class:

The question that has puzzled us for several decades is "what are those weights?"

What Are Those Weights?
Now we will compute the empirical distribution of $$ Y $$ and the empirical average $$ {\mathbb E} [\underline{X}|Y=y] $$ for each class as follows:
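A sketch of this computation, using the variable names PPy and M that the rest of the text refers to:

```python
# Empirical distribution of Y and the conditional means E[X | Y = y],
# one row per class (a sketch; variable names follow the text).
PPy = np.array([np.mean(y == j) for j in range(Cy)])          # empirical P_Y
M = np.array([x[y == j].mean(axis=0) for j in range(Cy)])     # M[j, :] = E[X | Y = j], Cy-by-Dx
```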

The resulting M is a Cy by Dx matrix. For each j in range(C_y=8), M[j,:] is a Dx=2 dimensional vector. Each column M[:, i] can be viewed as a function of $$ y $$. Note that the weights computed by the neural network, which we label as the $$ v_j $$'s in the figure, also form a Cy by Dx matrix, with one edge connecting each input dimension to each output value. We would like to compare the two. Before that, we need to use the following regulation routine to make sure that all functions of $$ y $$ have zero mean and unit variance with respect to the empirical distribution PPy.
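A possible implementation of this 'regulate' routine is sketched below; it treats each column of its input as a function of $$ y $$ and normalizes it with respect to the given distribution.

```python
def regulate(F, Py):
    """Center and normalize each column of F, viewed as a function of y,
    so that it has zero mean and unit variance under the distribution Py."""
    F = np.asarray(F, dtype=float)
    mean = Py @ F                    # E[F(Y)] for each column
    F = F - mean
    var = Py @ (F ** 2)              # Var[F(Y)] for each column
    return F / np.sqrt(var)
```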

Now we are ready to make some plots:
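The plotting code could be along these lines (a sketch; it reuses the 'weights' from the trained model and the 'regulate' routine above):

```python
# Compare the normalized network weights with the normalized conditional means.
# weights has shape (Dx, Cy); transpose to the Cy-by-Dx layout used for M.
V = regulate(weights.T, PPy)      # network weights, one row per class
Mn = regulate(M, PPy)             # conditional means E[X | Y = y]

for i in range(Dx):
    plt.subplot(1, Dx, i + 1)
    plt.plot(range(Cy), V[:, i], 'o-', label='network weights')
    plt.plot(range(Cy), Mn[:, i], 's--', label='E[X|Y=y]')
    plt.xlabel('y')
    plt.legend()
plt.show()
```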

and the results are something like this:

Interesting, huh? So the training procedure of a neural network is actually computing something really simple: the empirical conditional expectations $${\mathbb E} [\underline{X}|Y=y] $$ for different values of $$y$$'s.

So why are these conditional expectations worth computing? This goes back all the way to Alfréd Rényi and the notion of HGR correlation. In our paper, we show that this conditional expectation, as a function of $$ y $$, is in fact the function that is the most relevant to the classification problem.

Well, if the HGR thing is too abstract here, you can think of it as there being a “correct” function that statisticians would like to compute. This is somewhat different from the conventional picture where learning means learning the statistical model that governs $$(\underline{X}, Y)$$. Since the input features are often very high dimensional vectors or have other complex forms, learning this complete model can be difficult. The structure of a neural network puts a constraint on the number of parameters that we can learn, which is often much smaller than the number of parameters needed to specify the full statistical model of $$(\underline{X}, Y)$$. Thus, we can only hope to learn a good approximate model that can be represented by these parameters. What the HGR business and our paper say is that there is a theoretical way to identify the best approximation to the full model with only this number of parameters; and what we are demonstrating here is that a neural network computes exactly that, and the results are stored in the weights of its edges!

A Neural Net with NO Training
At this point, I guess we should be a bit more aggressive. How about making a neural network and, without feeding it any data to train on, simply computing and telling the network what the weights should be!

The code we use should be something like this:
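A sketch of what this could look like is given below; makeup_1layer is the function defined right after, and taking argmax over model.predict stands in for predict_classes on newer Keras versions.

```python
# A sketch of the "no training" network: build the same model, but assign the
# weights directly instead of calling model.fit.
model2 = Sequential()
model2.add(Dense(Cy, input_dim=Dx, activation='softmax'))
model2.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

W, b = makeup_1layer(x, y, Cy)          # computed from the data, defined below
model2.set_weights([W, b])              # no model.fit anywhere

y_hat = np.argmax(model2.predict(x), axis=-1)   # predict_classes in older Keras
print('accuracy:', np.mean(y_hat == y))
```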

The significant thing here is that there is no model.fit. Instead, we will use a function makeup_1layer to compute all the weights, and use model.set_weights to directly assign those weights. This way we are really using the neural network as a data structure, and replacing the backprop/training procedure with something we have full control over. After that, we use model.predict_classes to check how well it works.

Our weight calculating function is as follows:
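One possible implementation is the following sketch; it computes exactly the two quantities written out in the equation below.

```python
def makeup_1layer(x, y, Cy):
    """Compute softmax weights directly from the data (a sketch).

    Returns W of shape (Dx, Cy) and b of shape (Cy,), ready for
    model.set_weights, implementing the two equations given below."""
    Py = np.array([np.mean(y == j) for j in range(Cy)])        # empirical P_Y
    M = np.array([x[y == j].mean(axis=0) for j in range(Cy)])  # E[X | Y = j], Cy-by-Dx
    sigma2 = np.mean(x.var(axis=0))                            # average empirical variance
    W = (M / sigma2).T                                         # Dx-by-Cy, Keras layout
    b = np.log(Py) - np.sum(M ** 2, axis=1) / (2 * sigma2)
    return W, b
```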

In a clearer form, this routine computes

$$ \underline{v}_j = \frac{{\mathbb E}[\underline{X} | Y=j]}{\sigma^2}; \qquad b_j= \log P_Y(j) - \frac{\|{\mathbb E}[\underline{X} | Y=j]\|^2}{2\sigma^2} $$

where $$ {\mathbb E}[\cdot] $$ is the empirical average, and $$ \sigma^2 $$ is the average empirical variance of the entries in $$ \underline{x}$$.

Needless to say, these weights work very well. The performance matches that based on the Keras training. In other words, the magic of neural networks lies right in these two equations!

Experiment 2: Discrete-Valued Inputs
The above result is by no means satisfactory when we are interested in DEEP learning. While it shows us what the SoftMax output nodes are doing, we know that real networks get their power from those multiple hidden layers of nodes. So what are all these layers doing?

Let's consider the following slightly modified network.



For the moment, we can think of the red dashed line as representing many layers of hidden nodes. What they do is form a selected feature function $$ S(x) $$, as a function of the inputs $$ x $$, and feed that to the SoftMax output layer. What function does the network choose? Intuitively, we should expect $$ S(x) $$ to be an "informative" feature, from which the inference is possible. But what is "informative"?

To answer this question, we need to first look at a simple case, where $$ x $$ is discrete-valued, from $$ \{ 1, \ldots, m\} $$. The input to the network is the indicators, or one-hot representation, of $$ x $$, and there is a single layer of hidden nodes whose output is $$ S(x) $$. This is a simple case particularly because any desired function $$ S(x) $$ can be generated by selecting the appropriate weights $$ w_1, \ldots, w_m $$, in the form

$$ S(x) = \sigma\left( \sum_{j=1}^m w_j \cdot \mathbb{I}_{x=j}\right) $$

where $$ \sigma(\cdot) $$ is the activation function, which we choose to be the sigmoid function. In other words, a network with a single hidden layer is expressive enough. Our question is: when we give the network the freedom to choose feature functions, what $$ S(x) $$ would it choose, and why?

Generate Data Samples
We randomly choose the joint distribution $$ p_{XY} $$. This is not necessary; you can certainly pick whichever joint distribution you like. We make this a random choice more or less to give the impression that we didn't cook up a special example that happens to be the only case where our results work. Well, like every other non-mathematical maneuver, it is not clear what exactly is achieved by doing that. Nevertheless, here is how it is done.
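Here is a sketch of such a generator; the alphabet sizes xCard, yCard and the sample count N are assumptions, and the function name is mine rather than the one used in the original code.

```python
# A sketch of the discrete data generator: pick a random joint distribution
# p_XY, sample N pairs from it, and also return the distributions themselves
# (see the remark below about this being partly cheating).
def generate_discrete_samples(xCard, yCard, N):
    pXY = np.random.uniform(size=(xCard, yCard))
    pXY /= pXY.sum()                                 # a random joint distribution
    pX = pXY.sum(axis=1)
    pY = pXY.sum(axis=0)

    flat = np.random.choice(xCard * yCard, size=N, p=pXY.ravel())
    X, Y = np.unravel_index(flat, (xCard, yCard))    # samples of (x, y)
    return X, Y, pXY, pX, pY

xCard, yCard, N = 8, 6, 10000        # alphabet sizes and sample count are assumptions
X, Y, pXY, pX, pY = generate_discrete_samples(xCard, yCard, N)
```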

One important thing here is that we return the distributions $$ p_{XY}, p_X, p_Y $$ after generating the data samples. This is partly cheating. A stricter approach should only return the data samples, and use a separate routine to compute the empirical joint distribution and the empirical marginal distributions. Here we save that effort and have some faith in the law of large numbers.

At this point, we can already compute two functions as the theoretical results. As the names suggest, these two things, f_theory and g_theory, will be used as the theoretical results to be compared with the training results from the neural network. I will reveal what these computations are very soon. I mean, what kind of wiki page would this be if there weren't a little suspense!

Before using these samples, they have to be turned into one-hot form. This can be found as a standard function of [sklearn].
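For instance, a sketch using scikit-learn's LabelBinarizer (keras.utils.to_categorical would work just as well):

```python
# One-hot encode the discrete samples (a sketch).
from sklearn.preprocessing import LabelBinarizer

X_onehot = LabelBinarizer().fit_transform(X)      # shape (N, xCard)
Y_onehot = LabelBinarizer().fit_transform(Y)      # shape (N, yCard)
```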

Run The Network
Now we are ready to make a network to learn from these data. This is fairly standard using Keras.
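A sketch of this network, assuming a single sigmoid hidden node computing $$ S(x) $$ followed by the softmax output layer; the training hyperparameters are assumptions.

```python
# One sigmoid hidden node computing S(x) from the one-hot input, followed by
# a softmax output layer (a sketch).
model = Sequential()
model.add(Dense(1, input_dim=xCard, activation='sigmoid'))   # hidden node: S(x)
model.add(Dense(yCard, activation='softmax'))                # output weights v
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X_onehot, Y_onehot, epochs=100, batch_size=64, verbose=0)

weights = model.get_weights()    # [w, hidden bias, v, output bias]
```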

The 'weights' read out on the last line is a list. Two of its elements are arrays corresponding to the weights $$ w, v $$ in our figure. We take these weights out and do some simple processing as follows; among the steps, we use 'regulate' to force the results to have zero mean and unit variance. The code for 'regulate' can be found on the Softmax page. After that, we plot the results in comparison with the theoretical results.
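A sketch of that processing; since the inputs are one-hot, evaluating the hidden node on the $$ m $$ possible inputs amounts to applying the sigmoid (expit) to each hidden weight plus the hidden bias.

```python
# Extract S(x) and v(y) from the trained weights and normalize them
# (a sketch; 'regulate' is the routine from the Softmax page).
from scipy.special import expit            # the sigmoid function

w, b_hidden, v, b_out = weights
S = expit(w[:, 0] + b_hidden[0])           # S(x) for x = 1, ..., m (one-hot inputs)
S = regulate(S.reshape(-1, 1), pX)[:, 0]   # zero mean, unit variance w.r.t. p_X
v = regulate(v[0].reshape(-1, 1), pY)[:, 0]
```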

We need to point out that the processing of $$ S $$ and $$ v $$ is slightly different. For $$ v $$ we just take the corresponding weights, but for $$ S $$ we use the 'expit' function. This is because we used the sigmoid as the activation function. One needs a little thinking to be convinced that the resulting $$ S $$ is an array with elements $$ S(x), x \in \{1, \ldots, m\} $$, where $$ S(x) $$ is precisely the feature value in the figure, the output of the hidden layer.

A typical plot is given here: What a surprise! They look the same! I hope now you are curious about how the theoretical values are computed, or, what is in that 'WhatTheorySays' function.



What Theory Says?
First of all, we give the true joint distribution $$ p_{XY} $$ to 'WhatTheorySays'. This is actually cheating. In principle, we should give it the training sample set instead, to make sure it sees the same thing that a neural network sees. One can of course compute the empirical joint distribution and use it in the place of $$ p_{XY} $$, and because of the law of large numbers the result should be the same. The situation is actually slightly more subtle than this. The number of samples we use might not be sufficient to allow precise learning of the joint distribution, since both $$ {\mathcal X} $$ and $$ {\mathcal Y} $$ can be very large alphabets. For a neural network and our algorithm to work, we only need to make sure that some relevant parts of the joint distribution are precisely estimated. This is a very important fact in terms of controlling the sample complexity of the solutions. In our example, both alphabets are quite small. This allows us to avoid the entire discussion on sample complexity. Interested readers can experiment by increasing the cardinalities, 'xCard, yCard', in the program, and see which one, the neural network or our approach, fails first.

With that in mind, here is the code:
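The following sketch shows what such a 'WhatTheorySays' function could look like, assuming the B-matrix and SVD computation described next; note that singular vectors are only determined up to a sign.

```python
# A sketch of WhatTheorySays: build the B matrix, take its SVD, and keep the
# singular vectors of the second largest singular value.
def WhatTheorySays(pXY, pX, pY):
    B = pXY / np.sqrt(np.outer(pX, pY))              # B(x, y), shape (xCard, yCard)
    U, s, Vt = np.linalg.svd(B)                      # singular values in descending order
    u1, r1 = U[:, 1], Vt[1, :]                       # second largest singular value
    S_theory = u1 / np.sqrt(pX)                      # feature of x
    v_theory = r1 / np.sqrt(pY)                      # feature of y
    return S_theory, v_theory

f_theory, g_theory = WhatTheorySays(pXY, pX, pY)
```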

Surprisingly short, isn't it? And yes, this is what the neural network computes! In mathematical form, what the code does is the following: first it forms a matrix $$ B $$, of size $$ |{\mathcal X} | \times |{\mathcal Y}| $$, where each entry of the matrix corresponds to a particular pair of values $$ (x,y) $$,

$$ B(x,y) = \frac{p_{XY}(x, y)}{\sqrt{p_X(x)} \sqrt{p_Y(y)}} $$

After that we find the singular value decomposition (SVD) of this B-matrix. We pick $$ \underline{u }_1 \in {\mathbb R}^{|{\mathcal X}|}$$ and $$ \underline{r}_1 \in {\mathbb R}^{|{\mathcal Y}|} $$ as the pair of left and right singular vectors corresponding to the second largest singular value. We treat $$ \underline{u}_1, \underline{r}_1 $$ as real-valued functions over $$ {\mathcal X} $$ and $$ {\mathcal Y} $$, respectively, and do a simple normalization

$$ S(x) = \frac{u_1(x)}{\sqrt{p_X(x)}}, \quad v(y) = \frac{r_1(y)}{\sqrt{p_Y(y)}} $$

The resulting $$S, v $$ are our theoretical results. They automatically have zero-mean and unit-variance, so we didn't have to use the 'regulate' routine on them. And that is what you see in the above plots.

A bit dizzy, huh? Don't worry. Let me help you with this.

The B-Matrix
The first thing that you might notice and find confusing is that there is a scaling that seems particularly strange. We define a correspondence between a function of the data and a vector we operate on. In the X-space, we have a function $$ S : {\mathcal X} \mapsto {\mathbb R} $$ represented in a vector form $$ \underline{u} \in {\mathbb R}^{|{\mathcal X}|} $$, with $$ S(x) = u(x)/\sqrt{p_X(x)} $$. In the Y-space, the scaling is $$ v(y) = r(y) /\sqrt{p_Y(y)} $$. This defines the correspondence between functions and vectors: $$ \underline{u} \leftrightarrow S, \underline{r} \leftrightarrow v $$. For the moment, please just be patient and take these as given.

One question you should have at this point is: "Isn't this $$ v(y) $$ function already computed in the previous page? Didn't it say that $$ v(y) = {\mathbb E} [ S(X)|Y=y] $$ ?"

Well, that is at least what I hope you to ask.

It turns out these two answers are exactly the same. To see this, we need to take a closer look at the matrix $$ B $$ we defined above. It has some really elegant properties, and here is the first one:

Let's consider the $$ B $$ matrix multiplying a vector. Consider $$ \underline{r} = B^T \cdot \underline{u} $$, where $$ \underline{u}, \underline{r} $$ correspond to a pair of functions $$ S(x) $$ and $$ v(y) $$ as defined above. Now we simply write this out:

$$ \begin{align} v(y) &= \frac{1}{\sqrt{p_Y(y)}} r(y) = \frac{1}{\sqrt{p_Y(y)}} \sum_x B(x, y) u(x) \\ & = \frac{1}{\sqrt{p_Y(y)}} \sum_x B(x, y) \sqrt{p_X(x)} \cdot S(x) \\ &= \frac{1}{\sqrt{p_Y(y)}} \sum_x \left( \frac{p_{XY}(x,y) } {\sqrt{p_X(x)} \sqrt{p_Y(y)}} \right) \sqrt{p_X(x)} \cdot S(x)\\ &= {\mathbb E} [S(X)|Y=y] \end{align} $$

In words, this says that the matrix multiplication $$ \underline{r} = B^T \cdot \underline{u} $$ defines a mapping. It maps a feature function $$ S(\cdot) $$ on $$ {\mathcal X} $$ to a corresponding feature function $$ v(\cdot) $$ on $$ {\mathcal Y} $$, with $$ v(y)={\mathbb E} [S(X)|Y=y] $$. And as we have seen, this mapping is clearly relevant to the neural network operation.

It is not hard to check that the conjugate operation $$ \underline{u} = B \cdot \underline{r} $$ defines a mapping in the reverse direction: it maps a function of $$ y $$ to a function of $$ x $$ as $$ S(x) = {\mathbb E}[v(Y)|X=x] $$.

Feature Projection
Now here is how the neural network operates. It initializes by randomly choosing the weights, and hence gets some random choice of the feature function $$ S(x) $$ and random output layer weights $$ v(y) $$. In the backpropagation process, we first fix $$ S(x) $$ and update $$ v(y) $$. This can be viewed as finding a feature $$ v(y) $$ of $$ y $$ that is best "aligned" with $$ S(x) $$. The solution is $$ v(y) = {\mathbb E} [S(X)|Y=y] $$. Then in the next step, we fix $$ v(y) $$ and look for a best aligned feature of $$ x $$, which gives $$ S(x) = {\mathbb E}[v(Y)|X=x] $$. After a number of iterations the process reaches a fixed point, where $$ S(x) $$ and $$ v(y) $$ are well aligned with each other.

In this procedure, we repeatedly answer the question "what feature of one variable is best aligned with a given feature of the other variable". We call this "feature projection": we find the projection of a feature, say $$ S(x) $$, onto the space of features that can be computed from $$ y $$, and the solution is always the conditional expectation $$ v(y) = {\mathbb E} [S(X)|Y=y] $$.

This story brings out the SVD solution: suppose $$ \underline{u}, \underline{r} $$ are a pair of singular vectors of $$ B $$ with singular value $$ \sigma $$; then we have $$ \sigma \cdot \underline{r} = B^T \cdot \underline{u} $$ and $$ \sigma \cdot \underline{u} = B \cdot \underline{r} $$. The scaling by $$ \sigma $$ is not important at this point; it is taken out by the "regulate" function anyway. When we compute the SVD, one numerical method is power iteration. Namely, we start with a random choice of $$ \underline{u} $$ and repeatedly multiply it by $$ B B^T $$. One can see now that the feature projections are doing exactly that.
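A minimal power-iteration sketch of this idea (the number of iterations is arbitrary):

```python
# Each iteration is a pair of feature projections; the iteration converges to
# the singular vectors of the largest singular value, which, as discussed
# next, turns out to be a trivial pair.
B = pXY / np.sqrt(np.outer(pX, pY))
u = np.random.randn(len(pX))
for _ in range(100):
    r = B.T @ u                    # v(y) = E[S(X) | Y = y], up to scaling
    u = B @ r                      # S(x) = E[v(Y) | X = x], up to scaling
    u /= np.linalg.norm(u)         # renormalize so the iteration stays bounded
```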

In our paper, we show that the resulting $$ \underline{u}, \underline{r} $$ pair are the pair of singular vectors corresponding to the largest singular value of $$ B $$. It captures the strongest mode of dependence between $$ X $$ and $$ Y $$. This is why the corresponding feature functions are useful in making predictions from $$ X $$ to $$ Y $$. The story can easily be generalized if we are looking for not one but $$ k $$ pairs of singular vectors, and thus capture the strongest $$ k $$ modes of dependence. This corresponds to the neural network with $$ k $$ hidden nodes before the output layer.

One leftover thing that you may wonder about is why we took the singular vectors for the second largest singular value, not the first. It turns out that the first pair of singular vectors is always the same:

$$ \underline{u}_0 = [\sqrt{P_X(x)}, x \in {\mathcal X}], \underline{r}_0 = [\sqrt{P_Y(y)}, y \in {\mathcal Y}] $$,

and the singular value is $$\sigma_0= 1 $$. To see this, we write, for each $$ y $$, $$ \begin{align} (B^T \underline{u}_0)(y) = \sum_x B(x,y) \sqrt{P_X(x)} = \sum_x\left( \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}}\right)\cdot \sqrt{P_X(x)} = \sqrt{P_Y(y)} \end{align} $$

All other left singular vectors must be orthogonal to this:

$$ \begin{align} 0 = \langle \underline{u}_0, \underline{u} \rangle = \sum_x \sqrt{P_X(x)} \cdot \sqrt{P_X(x)} S(x) = {\mathbb E}[S(X)] \end{align} $$

which means the feature function $$ S(x) $$ corresponding to $$ \underline{u} $$ must be zero-mean with respect to $$ P_X $$. We take this as a given assumption, without loss of generality, when we choose feature functions. It turns out that all other singular values of $$ B $$ must be less than or equal to 1. This is a simple consequence of the data processing inequality. Thus, the second pair of singular vectors is in fact the first pair of non-trivial singular vectors.

In our paper, we actually use the following definition of $$ \tilde{B} $$ instead of $$ B $$, for which this cumbersome largest singular mode is taken out. The good thing is that for $$ \tilde{B} $$ it is easier to be convinced that it captures the dependence between $$ X $$ and $$ Y $$: $$ \begin{align} \tilde{B} (x,y)= \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} - \sqrt{P_X(x)}\sqrt{P_Y(y)} = \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} \end{align} $$

What's Left?
The main problem with this experiment is that the input $$ X $$ is discrete. This means the selected feature function $$ S(x) $$ can in fact be almost any function of $$ x $$; in other words, a network with a single hidden layer is expressive enough. This is not always the case. A real deep neural network often needs many layers of nodes to enlarge the space of feature functions that can be generated, and even this is often insufficient. That is, the feature function $$ S(x) $$ might be limited to a subset that can be generated by the given network structure. In such cases, is the chosen feature function still, in some sense, the feature projection solution? For that, we will have another page on Feature Projection with Restrictions.

This page is made with the help of Xiangxiang Xu. He looks like this.