Feature Projection

What Is This Page About?
This page is a continuation of our previous wiki page on Softmax. It tries to answer the question "Why are Neural Networks so powerful?" We will show by numerical examples that the internal computation of Neural Networks, using the stochastic gradient descent algorithm in back propagation, is in fact computing mathematical quantities with clear definitions and operational meanings. By understanding this fact, we hope you can gain a better understanding of the internal operations of neural networks, and thus use them more efficiently and more flexibly.

No matter how convincing the numerical examples are, they cannot replace a rigorous mathematical proof, which can be found in our recent paper. On this page, however, our goal is to make the point using several programs with minimal mathematical derivation. This process should be helpful for readers who wish to start their own numerical experiments with neural networks. We encourage you to take pieces of our code and make changes to form your own experiments. And please leave comments so we can improve our story.

What Was Shown in the Wiki Page on Softmax?


In Softmax, we used the following simple network for a classification problem. We generated some data samples in the form of $$(\underline{x}, y)$$ pairs, where $$\underline{x}$$ is the real-valued data and $$y$$ is the label. We generated these samples from a rather arbitrarily chosen joint distribution $$p_{\underline{X}, Y}$$. We then fed the samples to the network, implemented using Keras over TensorFlow. After the training procedure converged, we showed that the weights computed by the network, the $$v_i$$'s, are in fact a simple mathematical quantity:

$$ v_i = \mathbb{E}[\underline{X} \mid Y=i ] $$

We also showed that the weights on the bias terms, the $$b_i$$'s, can be computed with a different formula. On this page, we are less interested in that; the focus is on the weights connecting the variable nodes.

This result is actually quite interesting. It says that the network learns some particular statistics of the data and stores them in the weights of certain edges; and these values are the entire knowledge that the network remembers. If you are not convinced of how important this is, just imagine that one day I could read these values from the neurons in your brain, and interpret them...

One obvious problem with this result is that the network is really simple. We are all working with DEEP learning, aren't we? So a real network should have many hidden layers of nodes to get its power. So what are all these layers doing? That is the question of the current page.

The Experiment


So here is the plan: let's generate some sample $$(x, y)$$ pairs from a joint distribution $$p_{XY}$$. Both $$x$$ and $$y$$ are discrete valued, with $$x \in \{1, 2, \ldots, m\}$$ and $$y \in \{1, \ldots, K\}$$. The distribution $$p_{XY}$$ is randomly chosen, and we use the following network to do classification, i.e., to predict the value of $$Y$$ from the value of $$X$$.

The right half of the network, the output layer, is exactly the same as before, so we know that the weights are $$ v_i = \mathbb{E}[S|Y=i] $$. Our question is: what is $$S$$? Well, $$S$$ is a function of the input data $$X$$. The question is what function of $$X$$ a neural network would choose.

Here, because the input $$X$$ is discrete, what we feed into the network is actually the indicator, or one-hot, representation of $$x$$. The network chooses the weights $$ w_1, \ldots, w_m $$ to form a function

$$ S(x) = \sigma\left( \sum_{j=1}^m w_j \cdot \mathbb{I}_{x=j}\right) $$

where $$ \sigma(\cdot) $$ is the activation function, which we choose to be the sigmoid function. Note that since $$x$$ is discrete valued, by changing the $$w_j$$'s we can form pretty much any function of $$x$$. The question is: which function would a neural network choose? And why?

To Start
To start the experiment, we first need to make sure you have all the software packages installed. The easiest thing is to run the following script.
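The original script is not reproduced here, but a minimal version of such a check might look as follows. The package list is our assumption of what this page needs (NumPy, SciPy, scikit-learn, TensorFlow with Keras, and Matplotlib); adjust it to your setup.

```python
# Check that the packages used on this page can be imported.
import importlib

# assumed package list; edit to match your environment
for pkg in ["numpy", "scipy", "sklearn", "tensorflow", "matplotlib"]:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg:12s} OK (version {getattr(mod, '__version__', 'unknown')})")
    except ImportError:
        print(f"{pkg:12s} MISSING -- please install it")
```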

If you are missing any package, please use the links on the Softmax page to get it installed.

Generate Data Samples
We randomly choose the joint distribution $$ p_{XY} $$. This is not necessary: you can certainly pick whichever joint distribution you like. We make this a random choice more or less to give the mental impression that we didn't cook up a special example that happens to be the only case where our results work. Well, like every other non-mathematical maneuver, it is not clear what exactly is achieved by doing that. Nevertheless, here is how it is done.
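The data-generation code itself is missing here, so below is a minimal sketch of how it might be done. The function name `generate_samples` and its signature are our invention; `xCard` and `yCard` follow the variable names mentioned later on this page.

```python
import numpy as np

def generate_samples(xCard, yCard, n, seed=0):
    """Randomly pick a joint distribution p_XY, then draw n (x, y) pairs.

    Returning p_XY, p_X, p_Y alongside the samples is the shortcut the
    text describes; a stricter version would estimate them empirically.
    """
    rng = np.random.default_rng(seed)
    pXY = rng.random((xCard, yCard))
    pXY /= pXY.sum()                # normalize into a joint distribution
    pX = pXY.sum(axis=1)            # marginal of X
    pY = pXY.sum(axis=0)            # marginal of Y
    # draw n i.i.d. pairs from the joint by sampling flattened indices
    flat = rng.choice(xCard * yCard, size=n, p=pXY.ravel())
    x, y = np.unravel_index(flat, (xCard, yCard))
    return x, y, pXY, pX, pY
```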

One important thing here is that we return the distributions $$ p_{XY}, p_X, p_Y $$ after generating the data samples. This is partly cheating. A stricter approach would return only the data samples, and use a separate routine to compute the empirical joint distribution and the empirical marginal distributions. Here we saved that effort.

 "We hold these truths to be self-evident: that all men are created equal; and the law of large numbers always works." -- Thomas Jefferson

At this point, we can already compute two functions as the theoretical results. As the names suggest, these two things, f_theory and g_theory, will be used as the theoretical results to be compared with the training results from the neural network. I will reveal what these computations are very soon. I mean, what kind of wiki page would this be if there weren't a little suspense!

Before using these samples, they have to be turned into one-hot form. This can be done with a standard function of sklearn.
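For instance, with sklearn's OneHotEncoder (shown here as a sketch; the original page may use a different helper):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x = np.array([0, 2, 1, 2])          # sample values of X, with xCard = 3
# each sample becomes an indicator (one-hot) vector of length xCard
enc = OneHotEncoder(categories=[list(range(3))])
x_onehot = enc.fit_transform(x.reshape(-1, 1)).toarray()
print(x_onehot)
```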

Run The Network
Now we are ready to make a network to learn from these data. This is fairly standard using Keras.
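The code listing is missing here; the following is a hedged sketch of the network described in the text: a one-hot input of size `xCard`, a single sigmoid hidden unit computing $$S(x)$$, and a softmax output layer over the $$K$$ labels. The optimizer, learning rate, and loss are our assumptions, not taken from the original page.

```python
from tensorflow import keras

xCard, yCard = 8, 4   # example cardinalities, 'xCard, yCard' in the text

model = keras.Sequential([
    keras.Input(shape=(xCard,)),
    # one hidden node: S(x) = sigmoid(sum_j w_j * indicator(x = j))
    keras.layers.Dense(1, activation="sigmoid"),
    # output layer: softmax over the K labels, with weights v
    keras.layers.Dense(yCard, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.5),
              loss="categorical_crossentropy")
# model.fit(x_onehot, y_onehot, epochs=100, verbose=0)  # train on your samples
weights = model.get_weights()   # [w, hidden bias, v, output bias]
```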

The 'weights' read out on the last line is a list. Two of its elements are arrays corresponding to the weights $$w, v$$ in our figure. We take these weights out and apply some simple processing as follows; among the steps, we use 'regulate' to force the results to have zero mean and unit variance. The code for 'regulate' can be found in the Softmax page. After that, we plot the results in comparison with the theoretical results.

We need to point out that the processing of $$S$$ and $$v$$ is slightly different. For $$v$$ we just take the corresponding weights, but for $$S$$, we use the 'expit' function. This is because we used the sigmoid as the activation function. One needs a little thinking to be convinced that the resulting $$S$$ is an array with elements $$S(x), x \in \{1, \ldots, m\}$$, where $$S(x)$$ is precisely the feature value in the figure, the output of the hidden layer.
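As a concrete sketch of this processing, with a guessed reimplementation of 'regulate' (whose actual code lives on the Softmax page) and hypothetical weight values standing in for a trained network:

```python
import numpy as np
from scipy.special import expit          # the logistic sigmoid, 1/(1+e^-t)

def regulate(f, p):
    """Normalize f to zero mean and unit variance under distribution p.
    (Our guess at the 'regulate' routine from the Softmax page.)"""
    mean = np.dot(p, f)
    var = np.dot(p, (f - mean) ** 2)
    return (f - mean) / np.sqrt(var)

# 'weights' from model.get_weights() would be [w, b_hid, v, b_out];
# the numbers below are hypothetical stand-ins for a trained network
w, b_hid = np.array([[0.3], [-1.2], [0.8]]), np.array([0.1])
v = np.array([[0.5, -0.2, 0.7]])
pX = np.array([0.2, 0.5, 0.3])
pY = np.array([0.4, 0.3, 0.3])

# S(x) is the hidden-layer output for each one-hot input x
S = regulate(expit(w[:, 0] + b_hid[0]), pX)
v_feat = regulate(v[0], pY)
```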

A typical plot is given here. What a surprise: they look the same! I hope now you are curious about how the theoretical values are computed, or, what is in that 'WhatTheorySays' function.



What Theory Says?
First of all, we give the true joint distribution $$ p_{XY} $$ to 'WhatTheorySays'. This is actually cheating. In principle, we should give it the training sample set instead, to make sure it sees the same thing that a neural network sees. One can of course compute the empirical joint distribution and use that in place of $$ p_{XY} $$, and because of the Law of Large Numbers the result should be the same. The situation is actually slightly more subtle than this. The number of samples we use might not be sufficient to allow precise learning of the joint distribution, since both $$ {\mathcal X} $$ and $$ {\mathcal Y} $$ can be very large alphabets. In order for a neural network and our algorithm to work, we only need to make sure that some relevant parts of the joint distribution are precisely estimated. This is a very important fact in terms of controlling the sample complexity of the solutions. In our example, both alphabets are quite small, which allows us to avoid the entire discussion on sample complexity. Interested readers can experiment by increasing the cardinalities, 'xCard, yCard', in the program, and try to see which one, the neural network or our approach, fails first.

With that in mind, here is the code:
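The actual listing is missing here, so the following is a reconstruction from the description given below: build the B-matrix, take its SVD, keep the singular vector pair for the second largest singular value, and rescale by the square-root marginals.

```python
import numpy as np

def WhatTheorySays(pXY, pX, pY):
    """Theoretical feature functions S(x) and v(y) from the joint distribution."""
    # B(x, y) = p(x, y) / (sqrt(pX(x)) * sqrt(pY(y)))
    B = pXY / np.sqrt(np.outer(pX, pY))
    U, sigma, Vt = np.linalg.svd(B)
    u1, r1 = U[:, 1], Vt[1, :]     # pair for the second largest singular value
    S = u1 / np.sqrt(pX)           # feature function on X
    v = r1 / np.sqrt(pY)           # feature function on Y
    return S, v
```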

Surprisingly short, isn't it? And yes, this is what the neural network computes! In mathematical form, what the code does is the following: first it forms a matrix $$B$$ of size $$|{\mathcal X}| \times |{\mathcal Y}|$$, where each entry of the matrix corresponds to a particular pair of values $$(x,y)$$:

$$ B(x,y) = \frac{p_{XY}(x, y)}{\sqrt{p_X(x)} \sqrt{p_Y(y)}} $$

After that, we find the singular value decomposition (SVD) of this B-matrix. We pick $$ \underline{u}_1 \in {\mathbb R}^{|{\mathcal X}|}$$ and $$ \underline{r}_1 \in {\mathbb R}^{|{\mathcal Y}|} $$ as the pair of left and right singular vectors corresponding to the second largest singular value. We treat $$ \underline{u}_1, \underline{r}_1 $$ as real-valued functions over $$ {\mathcal X} $$ and $$ {\mathcal Y} $$, respectively, and do a simple normalization

$$ S(x) = \frac{u_1(x)}{\sqrt{p_X(x)}}, \quad v(y) = \frac{r_1(y)}{\sqrt{p_Y(y)}} $$

The resulting $$S, v$$ are our theoretical results. They automatically have zero mean and unit variance, so we didn't have to use the 'regulate' routine on them. And that is what you see in the above plots.

A bit dizzy, huh? Don't worry. Let me help you with this.

The B -Matrix
The first thing that you might notice and find confusing is the scaling, which seems particularly strange. We define a correspondence between a function of the data and a vector we operate on. In the X-space, we have a function $$ S : {\mathcal X} \mapsto {\mathbb R} $$ represented in vector form as $$ \underline{u} \in {\mathbb R}^{|{\mathcal X}|} $$, with $$ S(x) = u(x)/\sqrt{p_X(x)} $$. In the Y-space, the scaling is $$ v(y) = r(y)/\sqrt{p_Y(y)} $$. This defines the correspondence between functions and vectors: $$ \underline{u} \leftrightarrow S, \underline{r} \leftrightarrow v $$. For the moment, please just be patient and take these as given.

One question you should have at this point is: "Isn't this $$ v(y) $$ function already computed in the previous page? Didn't it say that $$ v(y) = {\mathbb E} [ S(X)|Y=y] $$ ?"

Well, that is at least what I hope you would ask.

It turns out these two answers are exactly the same. To see this, we need to take a closer look at the matrix $$B$$ we defined above. It has some really elegant properties, and here is the first one:

Let's consider the $$ B $$ matrix multiplying a vector. Consider $$ \underline{r} = B \cdot \underline{u} $$, where $$ \underline{u}, \underline{r}  $$ correspond to a pair of functions $$ S(x) $$ and $$ v(y) $$ as defined above. Now we simply write this out:

$$ \begin{align} v(y) &= \frac{1}{\sqrt{p_Y(y)}} r(y) = \frac{1}{\sqrt{p_Y(y)}} \sum_x B(x, y) u(x) \\ & = \frac{1}{\sqrt{p_Y(y)}} \sum_x B(x, y) \sqrt{p_X(x)} \cdot S(x) \\ &= \frac{1}{\sqrt{p_Y(y)}} \sum_x \left( \frac{p_{XY}(x,y) } {\sqrt{p_X(x)} \sqrt{p_Y(y)}} \right) \sqrt{p_X(x)} \cdot S(x)\\ &= {\mathbb E} [S(X)|Y=y] \end{align} $$

In words, this says that the matrix multiplication $$ \underline{r} = B \cdot \underline{u} $$ defines a mapping. It maps a feature function $$ S(\cdot) $$ on $$ {\mathcal X} $$ to a corresponding feature function $$ v(\cdot) $$ on $$ {\mathcal Y} $$, with $$ v(y)={\mathbb E} [S(X)|Y=y] $$. And as we have seen, this mapping is clearly relevant to the neural network operation.

It is not hard to check that the conjugate operation $$ \underline{u} = B^T \cdot \underline{r} $$ defines a mapping in the reverse direction: it maps a function of $$y$$ to a function of $$x$$ as $$ S(x) = {\mathbb E}[v(Y)|X=x] $$.
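These two identities are easy to verify numerically. In the sketch below, $$B$$ is stored as a $$|{\mathcal X}| \times |{\mathcal Y}|$$ array, so the map written $$\underline{r} = B \cdot \underline{u}$$ above (a sum over $$x$$) becomes `B.T @ u` in code.

```python
import numpy as np

rng = np.random.default_rng(1)
pXY = rng.random((4, 3)); pXY /= pXY.sum()      # a random joint distribution
pX, pY = pXY.sum(axis=1), pXY.sum(axis=0)
B = pXY / np.sqrt(np.outer(pX, pY))

S = rng.standard_normal(4)                      # an arbitrary feature of X
u = np.sqrt(pX) * S                             # its vector representation

# map to a feature of Y through the B-matrix, then undo the scaling
v_map = (B.T @ u) / np.sqrt(pY)
# the conditional expectation E[S(X) | Y = y], computed directly
v_cond = (pXY / pY).T @ S                       # columns of pXY/pY are p(x|y)
assert np.allclose(v_map, v_cond)

# and the reverse direction: E[v(Y) | X = x]
S_back = (pXY / pX[:, None]) @ v_cond           # rows of pXY/pX are p(y|x)
S_back_map = (B @ (np.sqrt(pY) * v_cond)) / np.sqrt(pX)
assert np.allclose(S_back, S_back_map)
```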

Feature Projection
Now here is how the neural network operates. It initializes by randomly choosing the weights, and hence starts with some random choice of the feature function $$S(x)$$ and random output layer weights $$v(y)$$. In the back propagation process, we first fix $$S(x)$$ and update $$v(y)$$. This can be viewed as finding a feature of $$y$$, namely $$v(y)$$, that is best "aligned" with $$S(x)$$. The solution is $$ v(y) = {\mathbb E} [S(X)|Y=y] $$. Then in the next step, we fix $$v(y)$$ and look for a best aligned feature of $$x$$, which gives $$ S(x) = {\mathbb E}[v(Y)|X=x] $$. After a number of iterations, the process reaches a fixed point, where $$S(x)$$ and $$v(y)$$ are well aligned with each other.

In this procedure, we repeatedly answer the question "what feature of one variable is best aligned with a given feature of another variable?" We call this "feature projection": we find the projection of a feature, say $$S(x)$$, onto the space of features that can be computed from $$y$$, and the solution is always the conditional expectation $$ v(y) = {\mathbb E} [S(X)|Y=y] $$.

This story brings out the SVD solution: suppose $$ \underline{u}, \underline{r} $$ are a pair of singular vectors of $$B$$ with singular value $$\sigma$$; then we have $$ \sigma \cdot \underline{r} = B \cdot \underline{u} $$ and $$ \sigma \cdot \underline{u} = B^T \cdot \underline{r} $$. The scaling by $$\sigma$$ is not important at this point; it is taken out by the "regulate" function anyway. When we compute the SVD, one numerical method is power iteration. Namely, we start with a random choice of $$ \underline{u} $$ and repeatedly left-multiply it by $$ B^T B $$. One can see now that the feature projections are doing exactly that.
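Here is a small sketch of this view: alternating the two conditional-expectation maps, with a renormalization standing in for the $$\sigma$$ scaling, converges to the same non-trivial feature that the SVD gives, up to a sign flip. (The iteration count and the mean-subtraction step, which keeps the iterate away from the trivial mode discussed below, are our choices.)

```python
import numpy as np

rng = np.random.default_rng(2)
pXY = rng.random((5, 4)); pXY /= pXY.sum()
pX, pY = pXY.sum(axis=1), pXY.sum(axis=0)

S = rng.standard_normal(5)                # random initial feature of X
for _ in range(200):
    S -= np.dot(pX, S)                    # stay orthogonal to the trivial mode
    v = (pXY / pY).T @ S                  # v(y) = E[S(X) | Y = y]
    S = (pXY / pX[:, None]) @ v           # S(x) = E[v(Y) | X = x]
    S /= np.sqrt(np.dot(pX, S**2))        # renormalize (absorbs sigma)

# compare with the SVD answer: second left singular vector, rescaled
B = pXY / np.sqrt(np.outer(pX, pY))
S_svd = np.linalg.svd(B)[0][:, 1] / np.sqrt(pX)
# S and S_svd now agree up to sign
```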

In our paper, we show that the resulting $$ \underline{u}, \underline{r} $$ pair are the pair of singular vectors corresponding to the largest singular value of $$ B $$. It captures the strongest mode of dependence between $$ X $$ and $$ Y $$. This is why the corresponding feature functions are useful in making predictions from $$ X $$ to $$ Y $$. The story can easily be generalized if we are looking for not one but $$ k $$ pairs of singular vectors, and thus capture the strongest $$ k $$ modes of dependence. This corresponds to the neural network with $$ k $$ hidden nodes before the output layer.

One leftover thing that you may wonder about is why we took the singular vectors for the second largest singular value, not the first. It turns out that the first pair of singular vectors is always the same:

$$ \underline{u}_0 = [\sqrt{P_X(x)}, x \in {\mathcal X}], \underline{r}_0 = [\sqrt{P_Y(y)}, y \in {\mathcal Y}] $$,

and the singular value is $$\sigma_0 = 1$$. To see this, we write

$$ \begin{align} (B \underline{u}_0)(y) &= \sum_x B(x,y) \sqrt{P_X(x)} = \sum_x \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} \cdot \sqrt{P_X(x)} \\ &= \frac{1}{\sqrt{P_Y(y)}} \sum_x P_{XY}(x,y) = \sqrt{P_Y(y)} \end{align} $$

All other left singular vectors must be orthogonal to $$ \underline{u}_0 $$:

$$ \begin{align} 0 = \langle \underline{u}_0, \underline{u} \rangle = \sum_x \sqrt{P_X(x)} \cdot \sqrt{P_X(x)} S(x) = {\mathbb E}[S(X)] \end{align} $$

which means the feature function $$S(x)$$ corresponding to $$\underline{u}$$ must be zero-mean with respect to $$P_X$$. We take this as a given assumption, without loss of generality, when we choose feature functions. It turns out that all other singular values of $$B$$ must be less than or equal to 1; this is a simple consequence of the data processing inequality. Thus, the second pair of singular vectors is in fact the first pair of non-trivial singular vectors.

In our paper, we actually use the following definition of $$ \tilde{B} $$ instead of $$ B $$, for which this cumbersome largest singular mode is taken out. The good thing is that for $$ \tilde{B} $$ it is easier to get convinced that it addresses the dependence between $$ X $$ and $$ Y $$:

$$ \begin{align} \tilde{B}(x,y) = \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} - \sqrt{P_X(x)}\sqrt{P_Y(y)} = \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} \end{align} $$

What's Left?
The main problem with this experiment is that the input $$X$$ is discrete. This means the selected feature function $$S(x)$$ can in fact be almost any function of $$x$$; in other words, a network with a single hidden layer is already expressive enough. This is not the case in general. A real deep neural network often needs many layers of nodes to generate a large enough space of feature functions, and even that is not always sufficient. That is, the feature function $$S(x)$$ might be limited to a subset that can be generated by the given network structure. In such cases, is the chosen feature function still, in some sense, the feature projection solution? For that, we will have another page on Feature Projection with Restrictions.