Feature Projection with Restrictions

Part 3
To tell the truth, I have always hoped to write an article that starts with "Part 3".

Well, Parts 1 and 2 of this long story are on two separate pages, one on Softmax and one on Feature Projection. Together, these three pages are supposed to answer a profound question: "What is going on inside neural networks?"

I used those two pages to demonstrate that a neural network trained to predict the value of $$ Y $$ from input samples of $$ X $$ uses the training samples to learn two feature functions, $$ S(x) $$ and $$ v(y) $$, and stores them as the weights of some edges. The two feature functions are the fixed-point solution of

$$ \begin{cases} S(x) = {\mathbb E}[v(Y)|X=x]\\ v(y) = {\mathbb E}[S(X) |Y=y] \end{cases} $$

Moreover, they correspond to a pair of left and right singular vectors of a $$ \tilde{B} $$ matrix, and thus have the interpretation of the strongest "mode" of dependence between $$ X $$ and $$ Y $$.
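To make this fixed point concrete, here is a small toy sketch (my own illustration, not the code from the earlier pages): it alternates the two conditional expectations on a randomly chosen joint distribution, removing the mean of $$ S $$ at each step so the iteration does not collapse onto the trivial constant feature; what it settles on is the strongest non-constant mode mentioned above.

```python
import numpy as np

# A toy illustration (not the code from the earlier pages): alternate the two
# conditional expectations on a random joint distribution P_{XY}. Removing the
# mean of S at each step keeps us away from the trivial constant feature, and
# normalizing v keeps the iteration stable.
rng = np.random.default_rng(0)
Pxy = rng.random((4, 3)); Pxy /= Pxy.sum()        # joint distribution, |X| = 4, |Y| = 3
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)          # marginals

v = rng.standard_normal(3)                         # start from a random feature v(y)
for _ in range(200):
    S = (Pxy / Px[:, None]) @ v                    # S(x) = E[v(Y) | X = x]
    S -= Px @ S                                    # subtract E[S(X)]
    v = (Pxy / Py[None, :]).T @ S                  # v(y) = E[S(X) | Y = y]
    v /= np.linalg.norm(v)                         # keep the scale fixed across iterations
print(S, v)                                        # the strongest mode of dependence
```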

In other words, there is an explicit quantity that a neural network tries to compute, and we explained why that is a good thing to compute. I hope this is enough motivation to go read those two pages if you haven't. After all, if I could make that case clear in these few short paragraphs, I wouldn't have written those two long pages, would I?



Discrete Data, Continuous Data
The main difference between this network and the one we worked with previously is that the inputs, $$ T_1, \ldots, T_m $$, are assumed to be continuous-valued, whereas on the previous page the inputs were assumed to be the one-hot encoding of a discrete random variable $$ X $$.

Why is this a big deal? Well, if you check the solutions we presented there, we argued that the network tries to compute some functions of the joint distribution $$ P_{XY} $$. If the input $$ T $$ is continuous-valued, it's not that we cannot assume a joint distribution $$ P_{TY} $$, but notions like $$ {\mathbb E}[v(Y)|T=t] $$ no longer make much sense. Think about it: the conditional expectation is a function of $$ t $$, and for two nearby values $$ t $$ and $$ t+\delta $$, we should expect the conditional expectations to be close, but nothing in the data can guarantee that. One solution is to use a smoother, but that cannot give very clean results.

So here is the plan. We will not feed the neural network just any continuous-valued inputs. Instead, we assume there is a discrete-valued $$ X $$ behind $$ T $$. That is, in the data acquisition process we cannot observe the true discrete-valued $$ X $$; we instead observe the results of some "pre-processing", which are the continuous-valued $$ T $$. We cannot change this preprocessing. The question is: what would the network do with these processed data samples?

One convenient thing about this setup is that it lets us reason about deep networks. In a multi-layer network, at each layer we can think of the outputs of the previous layers as a particular kind of preprocessing, and ask: how would the network assign weights to the current layer given this preprocessing? This is one step of backpropagation, and it is exactly the same problem as in our setup.

This leads to a somewhat philosophical question: can we always think of continuous-valued data as the preprocessing result of some collection of discrete-valued factors? Well, our answer is "pretty much", particularly if you allow the alphabet size of $$ X $$ to be large. Let's take this not as a mathematical fact, but as an assumption.

This assumption does cause a conceptual difference when we talk about the "expressive power" of a neural network. The commonly used notion asks whether a given network can generate an arbitrary function of the input $$ T $$. For that purpose, the complexity of the network we wish for is unlimited: we always want more nodes and more layers in order to generate ever more complex functions of $$ T $$. If we instead take the assumption that there is a discrete-valued ground truth $$ X $$ behind $$ T $$, then the network we need is limited. For example, in the figure, if we have $$ m \geq L $$, i.e. the dimensionality of the preprocessing results is no smaller than the cardinality of $$ X $$, then the network can assign weights $$ w $$ to revert the preprocessing, and we are back to the discrete-valued problem again. In reality, the cardinality of $$ X $$ is very large, particularly when noise is involved in the preprocessing, so even this "limited-size network" is actually quite large. If you are asking your boss for another GPU, don't take that back yet.
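Here is a tiny numerical illustration of that reverting argument (my own sketch; it ignores the activation non-linearity and any noise): when the preprocessing is a tall, full-rank linear map and $$ m \geq L $$, a single layer of weights can undo it exactly.

```python
import numpy as np

# Toy check of the m >= L argument, ignoring the non-linearity and noise:
# if T = A e_x for a one-hot e_x and a full-rank (m x L) matrix A with m >= L,
# then the weights W = pinv(A) recover e_x from T.
rng = np.random.default_rng(1)
L, m = 5, 8                                   # |X| = L, dim(T) = m >= L
A = rng.standard_normal((m, L))               # a hypothetical preprocessing matrix
W = np.linalg.pinv(A)                         # weights that revert the preprocessing

e_x = np.eye(L)[2]                            # one-hot encoding of x = 2
T = A @ e_x                                   # what the network actually sees
print(np.round(W @ T, 6))                     # recovers e_x: back to the discrete problem
```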

OK! Enough discussion. Let's do some experiments.

The Experiment
Here is how we generate the data:

The function "Generate2DSamples" can be found in the previous page. It randomly pick a joint distribution $$ P_{XY} $$ and generates data samples from that. "MakeLabels" is the one-hot encoding as before. The only added step is a randomly chosen "PreProcessing" matrix, and we get some samples $$ T(x_i) $$ for each $$ x_i $$. Note that we make sure xCard > tDim. This corresponds to the notations in the figure as $$ L > m $$, since otherwise the problem becomes trivial.

Now we feed the data, the $$ (T(x_i), y_i) $$ pairs, to the neural network.

We train the network and then extract the weights. Now all that's left is to figure out what these weights are doing.
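For concreteness, here is a minimal Keras sketch of this step. The layer sizes follow the figure (one sigmoid hidden node computing $$ S $$, and a softmax output layer computing $$ v(y) \cdot S + b(y) $$); the optimizer, epoch count, and batch size are my own choices, and the data comes from the sketch above.

```python
import tensorflow as tf

# One sigmoid hidden node S(x) = sigma(w . T(x) + d), then softmax(v(y) S(x) + b(y)).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(tDim,)),
    tf.keras.layers.Dense(yCard, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(T, Y, epochs=50, batch_size=256, verbose=0)

w, d, v, b = model.get_weights()   # w: (tDim, 1), d: (1,), v: (1, yCard), b: (yCard,)
```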

What Are These Weights Doing?
On the previous pages, we demonstrated that if we feed the discrete ground truth $$ X $$ to the network, the training results should be

$$ \begin{cases} v^*(y) = {\mathbb E} [S^*(X)|Y=y]\\ S^*(x) = {\mathbb E} [v^*(Y)|X=x] \end{cases} $$

which has an elegant interpretation as the singular vectors of the $$ \tilde{B} $$ matrix.

Now we have replaced $$ X $$ with the preprocessing results $$ T(x) $$. The first relation remains the same:

$$ v(y) = {\mathbb E} [S(T(X)) | Y=y ] $$

This was demonstrated on the Softmax page: the output-layer weights compute a conditional expectation regardless of what is fed into this layer.
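A quick sanity check of this relation, using the trained weights from the sketch above (and keeping in mind the centering and scaling issues discussed under "The Annoying Bias" below):

```python
import numpy as np

# Compare the output-layer weights v(y) with the empirical conditional mean of
# the hidden-node output S given Y = y. The correlation coefficient ignores the
# overall shift and scale, which is exactly the part the biases play with.
S_samples = 1.0 / (1.0 + np.exp(-(T @ w + d).ravel()))            # S(T(x_i)) per sample
v_emp = np.array([S_samples[y == label].mean() for label in range(yCard)])
print(np.corrcoef(v.ravel(), v_emp)[0, 1])                        # close to +/-1 if the relation holds
```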

The second relation is no longer the same, because now we cannot compute an arbitrary function $$ S^*(x) $$. The network structure restricts us to computing only functions of a particular form, namely

$$ S(x) = \sigma\left( \sum_{j=1}^m w_j \cdot T_j(x) + d\right) $$

where $$ \sigma(\cdot) $$ is the sigmoid function and the $$ T_j(\cdot) $$'s are given. Intuitively, all we can do is find a choice of weights $$ (w_1, \ldots, w_m, d) $$ such that the resulting $$ S(x) $$ is close to the desired $$ S^*(x) $$ in some sense.
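Written as code, the restricted family looks like this (a small sketch: since each symbol $$ x $$ maps to the row $$ T(x) $$ of the PreProcessing matrix from the data sketch, we can tabulate $$ S^{(w,d)}(x) $$ over the whole alphabet).

```python
import numpy as np

# The family of features the hidden node can realize, tabulated per symbol x.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def S_restricted(w, d, PreProcessing):
    # Row x of PreProcessing is (T_1(x), ..., T_m(x)), so this returns S(x) for every x.
    return sigmoid(PreProcessing @ w.ravel() + d)

S_net = S_restricted(w, d, PreProcessing)   # the particular S the trained network computes
```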

Recall that we had a name for the conditional expectation step: feature projection. For a given feature $$ v(y) $$, we try to find a feature that can be computed from $$ x $$ and is best aligned with $$ v(y) $$; the result is the conditional expectation $$ S(x) = {\mathbb E} [v(Y) |X=x] $$. Now the problem is only slightly changed: for a given feature $$ v(y) $$, we try to find a feature that can be computed from the $$ T_j(x) $$'s, in the specific form above, and is best aligned with $$ v(y) $$. Conceptually this is only a slight generalization, which is why we call the process "Feature Projection with Restrictions".

Numerically, it is actually quite a bit harder to verify this intuition, since the restriction on $$ S(x) $$ is non-linear and, perhaps less obviously, the bias terms can be rather annoying. Let's handle these issues one by one.

The Annoying Bias
The bias terms, $$ b_1, \ldots, b_k $$ in the figure, turn out to be very tedious and, in the end, useless.

What actually happens is that when a neural network is understood as a feature projection, i.e. an operator that maps one feature function to another, it is quite clear that a constant shift of a feature function does nothing. Thus we should always think of such operations as mappings between zero-mean feature functions: zero-mean with respect to the marginal distribution of the variable the feature function is defined on, $$ P_X $$ for $$ S(x) $$ and $$ P_Y $$ for $$ v(y) $$.

We can therefore introduce the notation

$$ \begin{align} \tilde{S}(x) &= S(x) - {\mathbb E}[S(X)], \\ \tilde{v}(y) &= v(y) - {\mathbb E}[v(Y)] \end{align} $$

and a better way to write the feature projections should be

$$ \begin{cases} \tilde{S}(x) = {\mathbb E}[\tilde{v}(Y)|X=x]\\ \tilde{v}(y) = {\mathbb E}[\tilde{S}(X)|Y=y] \end{cases} $$
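In code, the "tilde" operation is just a mean removal under the appropriate marginal; a small helper (used in the checks below) might look like this.

```python
import numpy as np

# Center a feature function (one value per symbol) under the given marginal.
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)   # marginals of the joint used to generate the data

def tilde(f, p):
    return f - np.dot(p, f)                 # f(x) - E[f], with the expectation under p
```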

Unfortunately, neural networks do not appreciate the beauty of this simplest form. (And you are still worrying about whether one day AI can replace humankind.) Instead, a network controls the mean values of these functions so that the nodes work at specific operating points of the activation functions, in order to utilize the non-linearity around those points. As a result, the outputs generated by neural networks are almost never zero-mean. For example, the output of the sigmoid function is always positive, so it cannot have zero mean.

This is why, whenever we need to compare the network outputs with the theoretical results, we use the "regulate" function. Inside the network, the bias node and the weights on the edges connected to it are used to compensate for these non-zero means. For example, the weights $$ b_1, \ldots, b_k $$, written in function form as $$ b(y) $$, in fact satisfy

$$ b(y) = \log P_Y(y) - v(y) \cdot {\mathbb E} [S(X)] $$

where the second term compensates for the mean value of $$ S(x) $$. It ensures that the output nodes compute $$ v(y) \cdot S(x) + b(y) = v(y) \cdot \tilde{S}(x) + \log P_Y(y) $$. That is, $$ S(x) $$ is effectively zero-mean from the output nodes' point of view.

This fact can easily be verified with a short code piece, using the weights taken from the network by "model.get_weights".
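A sketch of what that verification can look like, reusing the variables from the sketches above (so the names differ from the original snippet):

```python
import numpy as np

# b_theory(y) = log P_Y(y) - v(y) * E[S(X)], compared against the learned bias b(y).
E_S = np.dot(Px, S_net)                              # E[S(X)] under the marginal of X
b_theory = np.log(Py) - v.ravel() * E_S

b_reg = b - np.dot(Py, b)                            # subtract the mean w.r.t. P_Y ...
b_theory_reg = b_theory - np.dot(Py, b_theory)       # ... from both, as noted below
print(np.round(b_reg, 4))
print(np.round(b_theory_reg, 4))
```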



And the output looks like this.

In case you didn't catch it: for both "b" and "b_theory", we subtracted another mean, this time the mean w.r.t. $$ P_Y $$. That is because the output-layer softmax is another non-linear function, and the network plays with this mean again! Annoying, isn't it?

Careful readers might also notice that on the Softmax page, when we made up our own weights to plug into the network, we used a different formula: $$ b(y) = \log P_Y(y) - \frac{({\mathbb E}[S|Y=y])^2}{2 \cdot {\rm var}[S]} $$

Instead of showing you why they are actually equivalent, which is quite tedious, let's just say the specific form depends on the way that these weights are regulated. And don't we all wish that there weren't any bias in any form?

 "I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character." --- Dr. Martin Luther King

The Content of Their Character
Now that we have put the bias issue aside, let's see how the network makes the feature projection. Recall that the ideal output of the hidden node is

$$ S^*(x) = {\mathbb E} [ \tilde{v}(Y)| X=x] = {\mathbb E}[v(Y)|X=x] - {\mathbb E}[v(Y)] $$

Our conjecture is that the network generates an approximation to this as $$ \tilde{S}(x) = S(x) - {\mathbb E}[S(X)] $$. Knowing that whatever mean it creates will be taken out at a later layer, the network actually produces one that is not zero-mean:

$$ S^{(w, d)}(x) = \sigma\left( \sum_j w_j T_j(x) + d \right) $$

by choosing the weights $$ w_1, \ldots, w_m, d $$.

Let's make a bold guess that this notion of approximation is in the mean-squared-error sense; then we can write the optimization in a single line:

$$ \min_{w, d} \sum_x P_X(x) \cdot ( S^{(w, d)}(x) - {\mathbb E}[S^{(w, d)}(X)] - S^*(x))^2 $$

or using the $$ \tilde{S} $$ notation

$$ \min_{w, d} \sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x))^2 $$
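Numerically, with the variables from the sketches above, the objective reads as follows (the target $$ S^*(x) $$ is built from the network's own $$ v(y) $$, centered under $$ P_Y $$).

```python
import numpy as np

# The weighted mean-square criterion between the centered network feature and
# the ideal projection S*(x) = E[v~(Y) | X = x].
v_tilde = tilde(v.ravel(), Py)                 # v(y) with its P_Y-mean removed
S_star = (Pxy / Px[:, None]) @ v_tilde         # S*(x) = E[v~(Y) | X = x]

S_tilde = tilde(S_net, Px)                     # centered network feature S~(x)
mse = np.dot(Px, (S_tilde - S_star) ** 2)      # the objective the trained weights should minimize
print(mse)
```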

Now we can take some derivatives:

$$ \begin{align} 0 &= \frac{\partial}{\partial w_j } \sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x))^2\\ &= 2\sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x)) \cdot\left( \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x)\right) \end{align} $$

This can now be viewed as two functions, $$ (\tilde{S}^{(w, d)} (x) - S^*(x)) $$ and $$ \left( \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x)\right) $$, being orthogonal under the inner product weighted by $$ P_X $$. The fact that the inner product is 0 is hard to check numerically, since both functions might be scaled rather arbitrarily, so it is hard to tell what numerical value is close enough to 0. What we actually do is check the angle between the two functions, by taking the "arccos" of their normalized inner product, and see whether it is close to $$ \pi/2 $$. Well, we actually changed the units and compared the angle with $$ 90 ^\circ $$.

The only thing one needs to be careful about is the second function, the one with the derivative. I know many people, including myself, have memorized the derivative of the sigmoid function: $$ \sigma'(z) = \sigma(z) \cdot (1- \sigma(z)) $$. However, here we need the derivative of the sigmoid function with its mean subtracted:

$$ \begin{align} \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x) &= \frac{\partial}{\partial w_j} \left[ S^{(w, d)} (x) - \sum_{x'} P_X(x') S^{(w, d)}(x') \right] \end{align} $$

In our notation this is simply $$ \widetilde{\frac{\partial }{\partial w_j} S^{(w, d)}}(x) $$, i.e. the derivative with its mean taken away. Finally, filling in the fact that

$$ \frac{\partial}{\partial w_j} S^{(w, d)} (x) = S^{(w, d)}(x) (1-S^{(w, d)}(x)) \cdot T_j(x) $$

and we are ready to program again.
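Putting the pieces together, the angle check can be sketched as follows (again reusing the variables defined above; one angle per weight $$ w_j $$).

```python
import numpy as np

# For each w_j: the residual (S~ - S*) should be orthogonal, under the P_X-weighted
# inner product, to the centered derivative dS/dw_j = S(1 - S) T_j(x).
residual = S_tilde - S_star
for j in range(tDim):
    dS = S_net * (1 - S_net) * PreProcessing[:, j]     # derivative of S w.r.t. w_j, per symbol x
    dS_t = tilde(dS, Px)                               # ... with its P_X-mean removed
    inner = np.dot(Px * residual, dS_t)                # P_X-weighted inner product
    cos = inner / np.sqrt(np.dot(Px * residual, residual) * np.dot(Px * dS_t, dS_t))
    print(f"w_{j}: {np.degrees(np.arccos(cos)):.2f} degrees")   # should be close to 90
```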

And the results? Here is a typical set of values:

Convinced?

A Theory of Neural Networks?
At the end of these three pages, I feel it is worthwhile to discuss what exactly we are presenting here. If you happen to type "neural networks" into your Google search bar, chances are you will find "explained" as the suggested next word. The fact is, there are many, many works trying to explain why neural networks work. I have not seen one that provides an explicit answer: one that identifies a mathematical quantity and shows that it matches what the neural network actually learns. In that sense, I do think we are making progress here.

But is this a general theory for neural networks?

The short answer is -- No. Well, this is mainly because I haven't been presenting much theory in these pages, have I?

In our paper, we put forward a more complete picture. It starts by saying that we need a new way to measure information itself. After designing a new information metric, we can use it to describe the information flow in any data-processing procedure: what information is kept, what is discarded, and whether those are good choices. We then show that finding the singular vectors of $$ \tilde{B} $$, as we did here, is a good choice: it captures the strongest modes of dependence between $$ X $$ and $$ Y $$, as we claimed; it captures the most significant elements of the "common information" between $$ X $$ and $$ Y $$; and, perhaps more importantly, the selected features are the optimal choices in what we call the "universal feature selection" problem.

In classical statistics, if we need to select features from data in order to solve some inference problem, the answer is always to select a sufficient statistic. We argue that in most learning problems, as we process high-dimensional data samples, we do not know exactly what attributes of the data we will later be interested in detecting. For example, when we observe a customer's behavioral record, it is not clear which attributes will matter when we try to recommend a product to her later. Thus, for these new problems, we cannot use the notion of a sufficient statistic for a particular inference task. Instead, we need to select a set of features such that, regardless of which attribute we might be interested in detecting later, decisions based on these features are "universally" good. For this new problem, we show that the choice given by the singular vectors is indeed optimal. That is, for any feature selection procedure whose goal is not a single problem with a known statistical model, the sensible choice is such universal features. And neural networks, being one such tool, had better produce these solutions.

History has taught us that if you work very hard to find the theory of one thing, you usually get a theory of many other things.

Now, does a theory like this offer much more powerful ways to design and use neural networks?

The short answer is -- No! The gap between a good theory and good designs is actually a lot bigger than most people think.