Softmax

Why Are Neural Networks So Powerful?
Well, that is a difficult question that has haunted us for about 30 years. The more successes we see from the applications of deep learning, the more puzzled we are. Why is it that we can plug this “thing” into pretty much any problem, classification or prediction, and with just a limited amount of tuning almost always get good results? While some of us are amazed by this power, particularly its universal applicability, others find it hard to be convinced without a deeper understanding.

Now here is an answer! Ready?
In short, neural networks extract from the data the most relevant part of the information that describes the statistical dependence between the features and the labels. In other words, the size of a neural network specifies a data structure that we can compute and store, and the result of training the network is the best approximation of the statistical relationship between the features and the labels that can be represented by this data structure.

I know you have two questions right away: '''REALLY? WHY?'''

The “why” part is a bit involved. We have a new paper that covers this. Briefly, we first need to define a metric that quantifies how valuable a piece of partial information is for a specific inference task, and then we can show that neural networks actually draw the most valuable part of the information from the data. As a bonus, the same argument can also be used to understand and compare other learning algorithms, such as PCA and compressed sensing. So everything ends up in the same picture. Pretty cool, huh? (Be aware, it is a loooong paper.) Strictly speaking, this power of information extraction comes from the SoftMax output unit of a neural network. We will explain what exactly a SoftMax does later in this page, and show why it is a good choice for the information extraction task. Other parts of a neural network, such as the hidden layers and the non-linear activation functions, are powerful only in the context of such information extraction processes. We will discuss them in a separate page.

In this page, we try to answer the “Really?” question. One way to do that is to write a mathematical proof, which is included in the paper. Here, let’s just do some experiments.

Here is the Plan:
We will generate some data samples, that is, $$(\underline{x}, y )$$ pairs, where $$\underline{x}$$ is the real-valued feature vector and $$y $$ is the label. We will use these data samples to train a simple neural network so that it can be used to classify the feature vectors. After the training, we will take out some of the weights on the edges of the network and show that these weights are actually the empirical conditional expectations $${\mathbb E} [\underline{X}|Y=y] $$ for the different values of $$y$$.

So why are these conditional expectations worth computing? This goes all the way back to Alfréd Rényi and the notion of HGR correlation. In our paper, we show that this conditional expectation, as a function of $$y$$, is in fact the function that is the most relevant to the classification problem.

Well, if the HGR thing is too abstract here, you can think of it this way: there is a “correct” function that statisticians would like to compute. This is somewhat different from the conventional picture, where learning means learning the statistical model that governs $$(\underline{X}, Y)$$. Since the features are often very high dimensional vectors or have other complex forms, learning this complete model can be difficult. The structure of a neural network puts a constraint on the number of parameters that we can learn, which is often much smaller than the number of parameters needed to specify the full statistical model of $$(\underline{X}, Y)$$. Thus, we can only hope to learn a good approximate model that can be represented by these parameters. What the HGR business and our paper say is that there is a theoretical way to identify the best approximation to the full model with only this number of parameters; and what we are demonstrating here is that, at least for these specific examples, a neural network is computing exactly that!

Amazing, right? Imagine how extensions of this can help you use neural networks in your own problems!

To follow the experiments:
You can just read the code and comments on this page, and trust me with the results; or you can run it yourself. To do that, you will need a standard Python environment, including NumPy, Matplotlib, etc. You will also need a standard neural network package. For that, I use Keras, and run it with TensorFlow. You can follow this link to get them installed. I recommend using Anaconda to install them, which takes, if you do it right, less than 10 minutes. Trust me, it’s worth the effort. These packages are really well made and powerful. Of course, you are also welcome to just sit back, relax and enjoy the journey, as I will show you everything that you are supposed to see from these experiments.

To start
You need to have the following lines to initialize.

If you receive no error message after these, then congratulations, you have installed your packages correctly.

Generate Data Samples
Now let’s generate some data. To start with, let’s generate the feature vectors and labels $$ (\underline{x}_i, y_i), i=1, \ldots, N $$, from a mixture Gaussian model

$$ p_{\underline{X}|Y}(\underline{x} | j) = {\mathcal N}(\underline{\mu}_j, \sigma^2I) $$

This is almost cheating as we know right away that $${\mathbb E}[\underline{X}|Y=j] = \underline{\mu}_j $$, so we only need to look for these  $$\underline{\mu}_j$$  values after the model is trained, and hope that they would magically show up as the weights in the network somewhere. Simple!

To make the story a little bit more interesting, we will pick the $$\underline{\mu}_j$$ ’s randomly. We will pick the probability $$p_Y$$ randomly too. Here is the code.
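As a rough NumPy sketch of this step (not the original listing), the data generation could look like the following. The sample size, dimension, number of classes, noise level, seed, and the way the prior is randomized are all assumptions made for illustration; only the name `Cy` comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used here only for reproducibility

N = 5000       # number of samples (assumed value)
Dx = 2         # feature dimension (assumed value)
Cy = 4         # number of classes, |Y| (assumed value)
sigma = 1.0    # shared noise standard deviation (assumed value)

# Random class prior p_Y: draw positive numbers and normalize them to sum to 1.
# The 0.5 offset is an assumption that keeps every class reasonably populated.
p_Y = rng.random(Cy) + 0.5
p_Y /= p_Y.sum()

# Random class centers mu_j, one row per class.
mu = rng.normal(scale=3.0, size=(Cy, Dx))

# Draw the labels from p_Y, then the features from N(mu_y, sigma^2 I).
y = rng.choice(Cy, size=N, p=p_Y)
x = mu[y] + sigma * rng.normal(size=(N, Dx))

# Empirical conditional means E[X | Y = j]; they should sit close to mu.
# This assumes every class received at least one sample.
emp_mean = np.stack([x[y == j].mean(axis=0) for j in range(Cy)])
```

With these settings the rows of `emp_mean` should be close to the rows of `mu`, which is the quantity we will later look for among the trained weights.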

We can plot the data as something like this:



One can pretty much eyeball the different classes. Some classes might be easier to separate, and some might be too close to separate well.

Use Neural Networks for Classification
Now let’s make a neural network to nail these data. The network we would like to make has a single layer of nodes: not exactly deep learning, but a good starting point to focus on the SoftMax unit. It looks like this:



What the network does is to train some weight parameters $$(\underline{v}_j, b_j )$$, to form some linear transforms of the feature vectors, denoted as $$Z_j = \underline{v}_j^T \cdot \underline{x} + b_j$$, one for each node, or neuron. The SoftMax output unit computes a distribution based on these weights,

$$Q^{(v,b)}_{Y|\underline{X}}(j | \underline{x}) = \frac{e^{Z_j}}{\sum_{j'} e^{Z_{j'}}}$$

and maximizes the likelihood with the given collection of samples

$$\max_{v, b} \sum_{i=1}^N \log Q_{Y|\underline{X}}^{(v,b)}(y_i| \underline{x}_i) $$
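To make the two formulas above concrete, here is a minimal NumPy sketch of the SoftMax distribution and the log-likelihood objective, together with a plain gradient-ascent loop. The function names, step count, and learning rate are hypothetical, and the loop is only a stand-in for what a neural network package does during training, not the procedure used in the original experiments.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(v, b, x, y):
    """Sum over samples of log Q(y_i | x_i), with weights v (K x d) and biases b (K,)."""
    q = softmax(x @ v.T + b)
    return np.log(q[np.arange(len(y)), y]).sum()

def fit_softmax(x, y, K, steps=500, lr=0.1):
    """Gradient ascent on the average log-likelihood, starting from zero weights."""
    n, d = x.shape
    v, b = np.zeros((K, d)), np.zeros(K)
    onehot = np.eye(K)[y]
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to Z is (one-hot label - Q).
        err = onehot - softmax(x @ v.T + b)
        v += lr * err.T @ x / n
        b += lr * err.mean(axis=0)
    return v, b
```

Calling `fit_softmax(x, y, K)` on labelled samples returns one weight vector and one bias per class; `(x @ v.T + b).argmax(axis=1)` then gives the predicted labels.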

The number of nodes, $$ K $$, should be chosen as the number of possible values of the labels, $$ |{\mathcal Y}| $$, which is called Cy in the code. The code using Keras to implement this network is as simple as follows:

The code should be self-explanatory. model.add specifies the structure of the network, and model.fit starts the training to get the weights. The resulting weights can be accessed by calling model.get_weights, which gives the following results, shown in comparison with the centers of each class:

They do not look all that similar, right? The trick is that we need to normalize: each row vector above, viewed as a function of $$y$$, should have zero mean and unit variance with respect to $$p_Y$$. To do that, we make the following normalizing function:
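A minimal sketch of such a function, assuming the values are stored as an array with one column per label value (the name `normalize_rows` and this layout are assumptions, not the original code):

```python
import numpy as np

def normalize_rows(f, p_Y):
    """Give each row of f, read as a function of y, zero mean and unit variance under p_Y.

    f has shape (d, K): one row per feature dimension, one column per label value.
    p_Y has shape (K,) and sums to 1. Rows that are constant in y are not handled.
    """
    f = np.asarray(f, dtype=float)
    centered = f - (f * p_Y).sum(axis=1, keepdims=True)
    std = np.sqrt((centered ** 2 * p_Y).sum(axis=1, keepdims=True))
    return centered / std
```

The normalization removes any shift and positive rescaling of a row, so two arrays that differ only in that way (for example the class centers and a shifted, rescaled copy of them) map to the same normalized result and can be compared directly.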

Finally, we can make the plots

and here are the results, comparing the empirical conditional expectation and the weights from the neural network:

A Non-Gaussian Example
Well, this is not so terribly surprising for the mixture Gaussian case. The MAP classifier would compute the distance from a sample to each conditional mean, and put a weight on it according to the prior. With all the scaling and shifting, it is not unthinkable that the procedure becomes an inner product with the conditional mean. In fact, I wish someone would make an animation of this, to show how the decision regions vary with the parameters; it could be a good demo for teaching classical decision theory. (I didn’t say “classical” in any condescending way.)

So how about we try a non-Gaussian case, say, samples with a Dirichlet distribution? Why Dirichlet? Because Python generates it, and I cannot remember the mean of this distribution.

Here is what the generated data samples look like. I had to add in the colors, as otherwise it is really hard to see. Not a very clear clustering problem, is it? The strange triangle shape comes from the fact that the Dirichlet distribution has its support over a simplex. We generate samples on a 3-D simplex and project them down to the 2-D space.
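For readers who want to reproduce something similar, here is a hedged NumPy sketch. The per-class Dirichlet parameters, the uniform labels, and the projection (simply keeping the first two coordinates) are assumptions; the original projection is not specified here. For reference, the mean of a Dirichlet distribution with parameter vector $$\underline{\alpha}$$ is $$\underline{\alpha} / \sum_k \alpha_k$$, which gives a sanity check for the empirical conditional means.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used here only for reproducibility

Cy, N = 3, 6000                                 # number of classes and samples (assumed)
alphas = rng.uniform(1.0, 6.0, size=(Cy, 3))    # one Dirichlet parameter vector per class (assumed)
y = rng.integers(0, Cy, size=N)                 # labels, drawn uniformly here for simplicity

# Draw each class's samples from its own Dirichlet distribution on the simplex.
s = np.empty((N, 3))
for j in range(Cy):
    idx = y == j
    s[idx] = rng.dirichlet(alphas[j], size=idx.sum())

# Project to 2-D by keeping the first two coordinates (an assumed projection).
x = s[:, :2]

# Empirical conditional means versus the known Dirichlet means alpha / sum(alpha).
emp_mean = np.stack([x[y == j].mean(axis=0) for j in range(Cy)])
true_mean = (alphas / alphas.sum(axis=1, keepdims=True))[:, :2]
```

The rows of `emp_mean` should land close to the rows of `true_mean`; these are the values to compare against the normalized network weights.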



and here is the comparison between the conditional expectation and the weights

 "I figured, since I'd gone this far, I might as well just keep on goin'" -- Forrest Gump

The Ultimate Test: Neural Net with NO Training
At this point, I guess we should be a bit more aggressive. How about we make a neural network and, without feeding it any data for training, simply compute the weights ourselves and tell the net what they should be?

The code we use should be something like this:

The significant thing here is that there is no model.fit. Instead, we use a function makeup_1layer to compute all the weights, and use model.set_weights to assign those weights directly. This way we are really using the neural network as a data structure, and replacing the backprop/training procedure with something we have full control over. After that, we use model.predict_classes to check how well it works.

Our weight calculating function is as follows:

In a clearer form, this routine computes

$$ \underline{v}_j = \frac{{\mathbb E}[\underline{X} | Y=j]}{\sigma^2}; \qquad b_j= \log P_Y(j) - \frac{\|{\mathbb E}[\underline{X} | Y=j]\|^2}{2\sigma^2} $$

where $$ {\mathbb E}[\cdot] $$ is the empirical average, and $$ \sigma^2 $$ is the average empirical variance of the entries in $$ \underline{x}$$.
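The two formulas translate almost line by line into NumPy. The sketch below is an illustration under stated assumptions, not the original makeup_1layer: the function name `closed_form_weights` is hypothetical, labels are assumed to be integers in 0..K-1, and every class is assumed to appear at least once in the data.

```python
import numpy as np

def closed_form_weights(x, y, K):
    """Compute (v, b) directly from empirical statistics, with no training.

    x: (n, d) feature array; y: (n,) integer labels in 0..K-1.
    Returns v of shape (K, d) and b of shape (K,).
    """
    n, d = x.shape
    p_Y = np.bincount(y, minlength=K) / n                            # empirical prior P_Y(j)
    cond_mean = np.stack([x[y == j].mean(axis=0) for j in range(K)]) # E[X | Y = j]
    sigma2 = x.var(axis=0).mean()                                    # average empirical variance of the entries of x
    v = cond_mean / sigma2
    b = np.log(p_Y) - (cond_mean ** 2).sum(axis=1) / (2.0 * sigma2)
    return v, b
```

Predictions follow from `(x @ v.T + b).argmax(axis=1)`. To load these values into a Keras Dense layer, note that its kernel is stored with shape (input_dim, units), so the transpose `v.T` would be the matrix to pass along with `b`.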

Needless to say, these weights work very well. The performance matches that of the Keras training. In other words, the magic of neural networks lies right in these two equations!

Want to Know What's Going On?
Well, that is part of the "why" question. Rather than giving you the whole paper, let me just give a brief spoiler here.

Local Approximation
First, a disclaimer. The formulas for $$ \underline{v}_j $$ and $$ b_j $$ above are derived in our paper, based on a local approximation that makes the analysis tractable. This approximation comes at a cost: we lose some precision in the result. If we replace the empirical variance $$ \sigma^2 $$ by the conditional covariance matrices, that is, let $$ \underline{\mu}_j = {\mathbb E}[\underline{X}|Y=j] $$ and $$\Sigma_j = {\mathbb E} [(\underline{X}-\underline{\mu}_j)(\underline{X}-\underline{\mu}_j)^T |Y=j]$$, and set

$$ \underline{v}_j = \Sigma_j^{-1} \cdot \underline{\mu}_j; \quad b_j = \log P_Y(j) - \frac{1}{2} \cdot \underline{\mu}_j^T \cdot \Sigma_j^{-1} \cdot \underline{\mu}_j, $$

this choice gives slightly better performance in the experiments.

Don't worry about the difference between the variance and the covariance matrix. We did our analysis for the case where $$ X $$ is a scalar, which can easily be generalized to the vector case. The real difference is between using the variance of $$ X $$ and using the conditional variance given $$ Y=j $$. In our analysis, the local assumption says that the conditional variances for different $$ j $$ are all close to each other, which takes away our ability to distinguish between them. Thus, we can't tell from the analysis which one of the two choices would work better, while in our experiments the latter seems to be slightly better in most cases.

Optimality Under the Mixture Gaussian Model
In the literature, it is well understood that the SoftMax is the optimal classifier for the mixture Gaussian model. Recall that

$$ p_{\underline{X}|Y}(\underline{x} | j ) = {\mathcal N} (\underline{\mu}_j, \Sigma_j), \forall j \in {\mathcal Y} $$.

From Bayes rule, when the covariance matrices $$ \Sigma_j $$ are the same for all $$ j $$ (so that the term quadratic in $$ \underline{x} $$ and the normalizing constants cancel between the numerator and the denominator), we have $$ P_{Y|\underline{X}}(j|\underline{x}) = \frac{P_Y(j) \cdot \exp \left( \underline{x}^T \cdot \Sigma_j^{-1} \cdot \underline{\mu}_j - \frac{1}{2} \underline{\mu}_j^T \cdot \Sigma_j^{-1} \underline{\mu}_j \right) }{\sum_{j' \in {\mathcal Y}} P_Y(j') \cdot \exp \left( \underline{x}^T \cdot \Sigma_{j'}^{-1} \cdot \underline{\mu}_{j'} - \frac{1}{2} \underline{\mu}_{j'}^T \cdot \Sigma_{j'}^{-1} \underline{\mu}_{j'} \right) } $$

Comparing this with the SoftMax function, we see that our second choice of $$ (\underline{v}_j, b_j )$$, the one with the conditional variances, is indeed the optimal choice: the conditional distribution $$Q_{Y|\underline{X}}^{(v,b)} $$ generated by the SoftMax is a precise match to the true distribution $$ P_{Y|\underline{X}} $$.
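This match can be checked numerically. The sketch below assumes a single covariance matrix shared by all classes, which is the case where the quadratic term in $$\underline{x}$$ cancels exactly; the prior, the means, the covariance, and the test points are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, d = 3, 2
p_Y = np.array([0.5, 0.3, 0.2])     # class prior (illustrative)
mu = rng.normal(size=(K, d))        # class means, one row per class
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)         # one covariance shared by all classes (assumption)
Sigma_inv = np.linalg.inv(Sigma)
x = rng.normal(size=(5, d))         # arbitrary test points

# Posterior straight from Bayes rule: P(j | x) is proportional to P_Y(j) * N(x; mu_j, Sigma).
# The Gaussian normalizing constant is the same for every class, so it cancels.
diff = x[:, None, :] - mu[None, :, :]                                   # shape (5, K, d)
log_gauss = -0.5 * np.einsum('nkd,de,nke->nk', diff, Sigma_inv, diff)
posterior = softmax(np.log(p_Y) + log_gauss)

# SoftMax with v_j = Sigma^{-1} mu_j and b_j = log P_Y(j) - mu_j^T Sigma^{-1} mu_j / 2.
v = mu @ Sigma_inv                  # row j equals (Sigma^{-1} mu_j)^T because Sigma_inv is symmetric
b = np.log(p_Y) - 0.5 * np.einsum('kd,de,ke->k', mu, Sigma_inv, mu)
q = softmax(x @ v.T + b)

match = np.allclose(posterior, q)
print(match)
```

Under this shared-covariance assumption, the two distributions should agree up to floating-point error at every test point.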

Now the Message is Clear
Our analysis, at the cost of blurring the difference between the conditional and the unconditional variances, reveals an important fact: the SoftMax basically treats any model of $$ (\underline{X}, Y) $$ as if it were the mixture Gaussian model. It just computes the weights from the conditional expectations and the variances, estimated empirically from the data, and ignores all other differences.

Could it be this simple?

Is that all a neural network does?

Then how can it be so powerful?

It turns out that, with this understanding, we can now really talk about where the power of neural networks comes from, and discuss why multiple hidden layers and non-linear activation functions can turn such a simplistic information extraction unit into a full and general solution. For that discussion, you need to go to my next wiki page.