Softmax

Why Are Neural Networks So Powerful?
Well, that is a difficult question that has haunted us for about 30 years. The more successes we see from applications of deep learning, the more puzzled we are. Why is it that we can plug this “thing” into pretty much any problem, classification or prediction, and with just a limited amount of tuning almost always get good results? While some of us are amazed by this power, particularly the universal applicability, others find it hard to be convinced without a deeper understanding.

Now here is an answer! Ready?
In short, Neural Networks extract from the data the most relevant part of the information that describes the statistical dependence between the features and the labels. In other words, the size of a Neural Network specifies a data structure that we can compute and store, and the result of training the network is the best approximation, representable by this data structure, of the statistical relationship between the features and the labels.

I know you have two questions right away: '''REALLY? WHY?'''

The “why” part is a bit involved. We have a new paper that covers this. Briefly, we first define a metric that quantifies how valuable a piece of partial information is for a specific inference task, and then we show that Neural Networks actually draw the most valuable part of the information from the data. As a bonus, the same argument can be used to understand and compare other learning algorithms: PCA, compressed sensing, etc. So everything ends up in the same picture. Pretty cool, huh? (Be aware, it is a loooong paper.)

On this page, we try to answer the “Really?” question. One way to do that would be to write a mathematical proof, which is included in the paper. Here, let’s just do some experiments.

Here is the Plan:
We will generate some data samples, i.e., $$(\underline{x}, y)$$ pairs, where $$\underline{x}$$ is a real-valued feature vector and $$y$$ is the label. We will use these samples to train a simple neural network to classify the feature vectors. After training, we will take out some of the weights on the edges of the network and show that these weights are in fact the empirical conditional expectations $${\mathbb E} [\underline{X}|Y=y] $$ for the different values of $$y$$.
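As a concrete sketch of this plan, here is a minimal numpy experiment. Everything in it, the data model (two Gaussian classes), the network (a linear-softmax classifier, the simplest case), and the hyperparameters, is an illustrative choice of ours, not the setup from the paper. It trains the classifier by gradient descent and then checks that each class's weight vector points in roughly the same direction as the empirical conditional mean $${\mathbb E} [\underline{X}|Y=y]$$:

```python
# Hypothetical mini-version of the experiment: two Gaussian classes,
# a linear-softmax classifier, and a comparison of the learned weights
# against the empirical conditional means E[X | Y = y].
import numpy as np

rng = np.random.default_rng(0)

# Two classes in R^2 with different means and identity covariance.
n = 2000
y = rng.integers(0, 2, size=n)
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
X = means[y] + rng.standard_normal((n, 2))

# One-hot labels; one weight vector (row of W) per class.
Y = np.eye(2)[y]
W = np.zeros((2, 2))
b = np.zeros(2)

# Plain gradient descent on the cross-entropy loss.
for _ in range(500):
    logits = X @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
    G = P - Y                                     # gradient of the loss w.r.t. logits
    W -= 0.1 * (G.T @ X) / n
    b -= 0.1 * G.mean(axis=0)

# Empirical conditional expectations E[X | Y = k].
cond_means = np.array([X[y == k].mean(axis=0) for k in range(2)])

# Each class's weight vector should align with that class's conditional mean.
for k in range(2):
    cos = W[k] @ cond_means[k] / (np.linalg.norm(W[k]) * np.linalg.norm(cond_means[k]))
    print(f"class {k}: cosine(weight, E[X|Y={k}]) = {cos:.3f}")
```

Note that the match is up to a scale factor (and an offset shared across classes, since shifting all logits by the same amount leaves the softmax unchanged), which is why the comparison is by direction rather than by value.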

So why are these conditional expectations worth computing? This goes back all the way to Alfréd Rényi and the notion of the HGR maximal correlation. In our paper, we show that this conditional expectation, as a function of $$y$$, is in fact the function most relevant to the classification problem.
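For reference, one standard way to state the HGR (Hirschfeld–Gebelein–Rényi) maximal correlation of two random variables $$X$$ and $$Y$$ is

$$\rho(X;Y) = \sup_{f, g} \; {\mathbb E}[f(X)\, g(Y)],$$

where the supremum is over real-valued functions satisfying $${\mathbb E}[f(X)] = {\mathbb E}[g(Y)] = 0$$ and $${\mathbb E}[f^2(X)] = {\mathbb E}[g^2(Y)] = 1$$. The optimizing pair satisfies a fixed-point relation: $$g^*(y)$$ is proportional to $${\mathbb E}[f^*(X)|Y=y]$$, and $$f^*(x)$$ to $${\mathbb E}[g^*(Y)|X=x]$$. This is where conditional expectations of the form $${\mathbb E}[\,\cdot\,|Y=y]$$ enter the picture as the "most correlated", hence most relevant, functions of the data.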