Feature Projection

What Is This Page About?
This page is a continuation of our previous wiki page on Softmax. It tries to answer the question "Why are neural networks so powerful?" We will show by numerical examples that the internal computation of neural networks, using the stochastic gradient descent algorithm in backpropagation, is in fact computing mathematical quantities with clear definitions and operational meanings. By understanding this fact, we hope you can gain a better understanding of the internal operations of neural networks, and thus use them more efficiently and more flexibly.

No matter how convincing the numerical examples are, they cannot replace a rigorous mathematical proof, which can be found in our recent paper. On this page, however, our goal is to make the case using several programs with minimal mathematical derivation. This process should be helpful for readers who wish to start their own numerical experiments with neural networks. We encourage you to take pieces of our code and make changes to form your own experiments. And please leave comments so we can improve our story.

What Was Shown in the Wiki Page on Softmax?


In Softmax, we used the following simple network for a classification problem. We generated data samples in the form of $$(\underline{x}, y)$$ pairs, where $$\underline{x}$$ is the real-valued data and $$y$$ is the label. We generated these samples from a rather arbitrarily chosen joint distribution $$p_{\underline{X}, Y}$$. We then fed the samples to the network, implemented using Keras over TensorFlow. After the training procedure converges, we showed that the weights computed by the network, the $$v_i$$'s, are in fact a simple mathematical quantity:

$$ v_i = \mathbb{E}[\underline{X} \mid Y=i] $$

We also showed that the weights on the bias terms, the $$b_i$$'s, can be computed with a different formula. On this page, we are less interested in those; the focus is on the weights connecting the variable nodes.

This result is actually quite interesting. It says that the network learns some particular statistics of the data and stores them in the weights of certain edges; these values are the entire knowledge that the network remembers. If you are not convinced of how important this is, just imagine that one day I could read these values from the neurons in your brain and interpret them...

One obvious problem of this result is that the network is really simple. We are all working with DEEP learning, aren't we? So a real network should have many hidden layers of nodes to get its power. So what are all these layers doing? This is the question of the current page.

First Experiment: Discrete $$ X $$ and $$ Y $$


So here is the plan: let's generate some sample $$ (x, y) $$ pairs from a joint distribution $$ p_{XY} $$. Both $$ x $$ and $$ y $$ are discrete valued, with $$ x \in \{1, 2, \ldots, m\} $$ and $$ y \in \{1, \ldots, K\} $$. The distribution $$ p_{XY} $$ is randomly chosen, and we use the following network to do classification, i.e. to predict the value of $$ Y $$ from the value of $$ X $$.

The right half of the network, the output layer, is exactly the same as before, so we know that the weights $$ v_i = \mathbb{E}[S|Y=i] $$. Our question is what is $$ S $$? Well, $$ S $$ is a function of the input data $$ X $$. The question is what function of $$ X $$ would a neural network choose?

Here, because the input $$ X $$ is discrete, what we feed into the network is actually the indicators, or one-hot representation, of $$ x $$. The network chooses the weights $$ w_1, \ldots, w_m $$ to form a function

$$ S(x) = \sigma\left( \sum_{j=1}^m w_j \cdot \mathbb{I}_{x=j}\right) $$

where $$ \sigma(\cdot) $$ is the activation function, which we choose to be the sigmoid function. Note that since $$ x $$ is discrete valued, by changing the $$ w_j $$'s we can form pretty much any function of $$ x $$. The question is: which function would a neural network choose? And why?

To Start
To start the experiment, we first need to make sure you have all the software packages installed. The easiest way is to run the following script.
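Something along the following lines will do; it simply imports the packages this page relies on and prints their versions (the exact set of imports here reflects what we use below, nothing more).

```python
# Quick sanity check that the packages used on this page are importable.
import numpy as np
import scipy
import sklearn
import matplotlib
import tensorflow as tf

for name, mod in [("numpy", np), ("scipy", scipy), ("sklearn", sklearn),
                  ("matplotlib", matplotlib), ("tensorflow", tf)]:
    print(f"{name:12s}{mod.__version__}")
```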

If you are missing any package, please use the links in the Softmax page to get it installed.

Generate Data Samples
We randomly choose the joint distribution $$ p_{XY} $$. This is not necessary; you can certainly pick whichever joint distribution you like. We make this a random choice more or less to leave the impression that we didn't cook up a special example that happens to be the only case where our results work. Well, like every other non-mathematical maneuver, it is not clear what exactly is achieved by doing that. Nevertheless, here is how it is done.
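Here is a minimal sketch of such a routine. The function name gen_data, the default sizes m, K, n, and the 0-based indexing for $$ x $$ and $$ y $$ (more convenient in Python) are our own choices, nothing special about them.

```python
import numpy as np

def gen_data(m=8, K=4, n=10000, seed=0):
    """Draw n samples (x, y) from a randomly chosen joint distribution p_XY.
    x takes values in {0, ..., m-1} and y in {0, ..., K-1}."""
    rng = np.random.RandomState(seed)

    # Randomly choose the joint distribution p_XY: an m-by-K table summing to 1.
    p_xy = rng.rand(m, K)
    p_xy /= p_xy.sum()

    # Marginal distributions p_X and p_Y.
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    # Sample n (x, y) pairs from the joint distribution.
    flat_idx = rng.choice(m * K, size=n, p=p_xy.ravel())
    x, y = np.unravel_index(flat_idx, (m, K))

    # Returning the true distributions alongside the samples is the
    # "partly cheating" shortcut discussed below.
    return x, y, p_xy, p_x, p_y

x, y, p_xy, p_x, p_y = gen_data()
m, K = p_xy.shape
```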

One important thing here is that we return the distributions $$ p_{XY}, p_X, p_Y $$ after generating the data samples. This is partly cheating. A stricter approach would return only the data samples, and use a separate routine to compute the empirical joint distribution and the empirical marginal distributions. Here we save that effort.
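If you prefer the stricter approach, a small routine like the following (the name empirical_joint is ours) recovers the empirical distributions from the samples alone.

```python
import numpy as np

def empirical_joint(x, y, m, K):
    # Histogram the (x, y) pairs and normalize to get the empirical joint
    # distribution; the marginals follow by summing over rows / columns.
    counts = np.zeros((m, K))
    np.add.at(counts, (x, y), 1)
    p_xy_hat = counts / counts.sum()
    return p_xy_hat, p_xy_hat.sum(axis=1), p_xy_hat.sum(axis=0)
```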

 "We hold these truths to be self-evident: that all men are created equal; and the law of large numbers always works." -- Thomas Jeggerson

At this point, we can already compute two functions as the theoretical results. As the names suggest, these two quantities, f_theory and g_theory, will be used as the theoretical results to compare against the training results from the neural network. I will reveal what these computations are very soon. I mean, what kind of wiki page would this be if there weren't a little suspense!

Before using these samples, they have to be turned into one-hot form. This can be done with a standard function from [sklearn].
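For example, sklearn's OneHotEncoder does the job; here we assume x, y, m, K carry over from the data-generation sketch above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Turn the integer-valued x and y into one-hot (indicator) vectors.
enc_x = OneHotEncoder(categories=[np.arange(m)])
enc_y = OneHotEncoder(categories=[np.arange(K)])
X_onehot = enc_x.fit_transform(x.reshape(-1, 1)).toarray()
Y_onehot = enc_y.fit_transform(y.reshape(-1, 1)).toarray()
```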

Run The Network
Now we are ready to make a network to learn from these data. This is fairly standard using Keras.
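A sketch of the model, matching the figure: a single hidden node with sigmoid activation, whose output is the feature $$ S $$, followed by a softmax layer over the $$ K $$ classes. The optimizer settings and epoch count below are illustrative choices, not necessarily the exact ones used to produce the plots.

```python
from tensorflow import keras

model = keras.Sequential([
    # Hidden layer: one sigmoid node computing S(x) from the one-hot input.
    keras.layers.Dense(1, activation='sigmoid', input_shape=(m,)),
    # Output layer: softmax over the K classes, with weights v.
    keras.layers.Dense(K, activation='softmax'),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.5),
              loss='categorical_crossentropy')
model.fit(X_onehot, Y_onehot, epochs=100, batch_size=100, verbose=0)

weights = model.get_weights()   # [w, hidden bias, v, output bias]
```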

The weights read out on the last line form a list. Two of its elements are arrays corresponding to the weights $$ w, v $$ in our figure. We take these weights out and apply some simple processing as follows; among the steps, we use 'regulate' to force the results to have zero mean and unit variance. The code for 'regulate' can be found in the Softmax page. After that, we plot the results in comparison with the theoretical results.
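Concretely, the processing might look like the sketch below. The small 'regulate' helper here is a stand-in for the one defined in the Softmax page, including the hidden-layer bias inside the sigmoid is our reading of how $$ S(x) $$ should be reconstructed, and f_theory, g_theory are the quantities computed earlier.

```python
import matplotlib.pyplot as plt
from scipy.special import expit  # the sigmoid function

def regulate(a):
    # Stand-in for the 'regulate' helper from the Softmax page:
    # shift and scale to zero mean and unit variance.
    a = a - a.mean()
    return a / a.std()

w, b_hidden, v, b_out = weights

# f: the learned feature S(x) for each x, obtained by pushing the input
# weights (plus the hidden bias) through the sigmoid.
f = regulate(expit(w.flatten() + b_hidden))

# g: the output-layer weights on the feature node, one value per class.
g = regulate(v.flatten())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(f, 'o-', label='f from network')
ax1.plot(f_theory, 'x--', label='f_theory')
ax1.set_xlabel('x'); ax1.legend()
ax2.plot(g, 'o-', label='g from network')
ax2.plot(g_theory, 'x--', label='g_theory')
ax2.set_xlabel('y'); ax2.legend()
plt.show()
```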

We need to point out that the processing of $$ f $$ and $$ g $$ is slightly different. For $$ g $$ we just take the corresponding weights, but for $$ f $$ we apply the 'expit' function. This is because we used the sigmoid as the activation function. One needs a little thinking to be convinced that the resulting $$ f $$ is an array with elements $$ S(x), x \in \{1, \ldots, m\} $$, where $$ S(x) $$ is precisely the feature value in the figure, the output of the hidden layer.

A typical plot is given here:

What Does the Theory Say?
In principle, we should not have given $$ p_{XY} $$ to this routine; instead, we should give it the training sample set, from which the empirical joint distribution can be computed and used.