Understanding Neural Networks

Why are neural networks so powerful?
This question has puzzled us for almost thirty years, and the widespread success of deep learning has only deepened the confusion. Why is it that, with just a little parameter tuning, we can apply them to almost any classification or prediction problem and usually get decent results? For the curious-minded, using such a powerful yet hard-to-explain tool is deeply unsettling: could there be a better alternative?

The unease only grows as we keep using them. A neural network has a large number of design choices that are typically made by trial and error: the number of hidden layers, the number of neurons, the type of activation functions, and so on. The back-propagation algorithm used for training is perhaps the most mysterious part of all. Without understanding these processes, it is hard to imagine how one would make a neural network more efficient.

All of these difficulties point to the same underlying question: what is a neural network actually doing? What makes it seem to work on everything?

Here is the answer. Ready?
In short, a neural network extracts from the data the key information describing the statistical dependence between the features and the labels. In other words, the specific form of the network determines a data structure that we can store and compute with, and the result of training is the best description of the statistical dependence between features and labels that this data structure can represent.

At this point you surely have two questions: Really? And why?

Answering the "why" takes some work, and we wrote a paper to discuss it. Summarized in one sentence: we first measure the value of information for a given inference task, and then find that a neural network is in fact extracting the most valuable information from the data. The same analysis can be used to understand or compare other learning algorithms, as well as methods such as principal component analysis and compressed sensing.

This page tries to answer the "really?" question. I could simply point to the mathematical proofs in the paper, but here we instead try to explain the results numerically, with a little bit of math and intuition. After a sequence of experiments, we introduce a new metric for the "efficiency" of a neural network; this metric is particularly useful for choosing network structures and hyperparameters when designing complex networks.

We hope this page can:


 * 1) provide reference examples for those new to neural networks. Hopefully these basic examples help when you start programming;
 * 2) explain our research to engineers with hands-on neural network experience. We explain the core ideas through code, hoping it makes your next project more flexible and efficient;
 * 3) serve as supplementary material for statisticians interested in the theory. This page cannot replace the mathematical proofs in the paper, but it does convey the core ideas effectively.

To follow the experiments:
You can simply read the code and the comments and take the results on faith, or you can run the code yourself. Running it requires a standard Python environment with packages such as NumPy and Matplotlib. You will also need a neural network package; here I use Keras with the TensorFlow backend, and the installation is described at this link. I recommend using Anaconda to install and manage all of these, which should only take a few minutes. Of course, all the experimental results are shown on this page, so sitting back and enjoying them is a perfectly good option too.

You will need a bit of code for initialization:
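A minimal initialization along these lines should work; the exact import list of the original notebook is not shown, so treat this as my own sketch (assuming a TensorFlow-backed Keras install):

```python
# Minimal initialization -- my own sketch; the original notebook's imports may differ.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit                  # the sigmoid, used later to recover S(x)
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

np.random.seed(0)                                # make the runs repeatable
```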

If this runs without errors, congratulations: at least your environment is set up correctly. Now the journey officially begins!

Experiments
We will run three experiments of increasing difficulty. In each one, we generate a set of samples $$ (x_i, y_i), i=1, \ldots, N $$, use $$ x $$ and $$ y $$ as the network inputs and the corresponding labels, and train the network. Our goal is to understand what actually happens during training: what the network "learns" from the training data, and how it stores what it has learned. To answer these questions, we will extract the network parameters at the end of training and compare them with their theoretical values; in some cases we will skip training altogether, plug in the theoretical parameter values directly, and check how the network performs.

The three experiments are, roughly:
 * 1) We start with the simplest possible structure: a single-layer network with only a SoftMax output layer. This gives us the first mathematical formula for the weights a neural network computes;
 * 2) Next, we consider a network with one hidden layer, but restrict the dataset so that $$ x $$ and $$ y $$ only take finitely many discrete values. This corresponds to a network with sufficient expressive power, so we can read off the "ideal" features it extracts and compare them with the theoretically optimal ones;
 * 3) Finally, we study multi-layer networks to understand how a network with limited expressive power approximates the "ideal" features. This naturally leads to a metric for the quality of the approximation, which measures how effectively different network structures express features.

Generating the Data Samples
We will generate a collection of samples $$(\underline{x}, y )$$, where $$\underline{x}$$ is the feature vector and $$y $$ is the corresponding label. With these samples, we can train a simple neural network to classify the features.

Concretely, we draw the $$ \underline{x} $$'s from Dirichlet distributions whose parameters are chosen at random but depend on the value of $$ y $$. Why a Dirichlet distribution? We could of course use the more familiar Gaussian mixture model, but that might leave the false impression that the results depend on the particular data distribution, or hold only in the Gaussian case. In that sense, the Dirichlet distribution is clearly "non-Gaussian" enough. This is not a rigorous argument, so you are welcome to try other, stranger distributions; the more the better.

Here is the code I used to generate the data:
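The original routine is not reproduced here. A sketch consistent with the description (Dirichlet-distributed $$ \underline{x} $$ on a 3-dimensional simplex, parameters chosen at random per label, projected to 2 dimensions) could look like this; the name GenerateDirichletSamples and the parameter ranges are my own choices:

```python
def GenerateDirichletSamples(N=5000, Cy=8, dim=3):
    """For each label y, draw x from a Dirichlet distribution whose parameters
    were picked at random (one parameter vector per label)."""
    alphas = np.random.uniform(0.5, 3.0, size=(Cy, dim))   # label-dependent parameters
    y = np.random.randint(Cy, size=N)                      # roughly uniform labels
    x = np.vstack([np.random.dirichlet(alphas[label]) for label in y])
    return x[:, :2], y     # project the simplex onto the first two coordinates

X, Y = GenerateDirichletSamples()
```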

The generated data look roughly like this. I had to color the points by their labels, otherwise they would be almost indistinguishable. The classification problem does not look easy, does it? The slightly odd triangular shape in the plot comes from the fact that the support of a Dirichlet distribution is a simplex. Here the data are generated on a simplex in 3-dimensional space and then projected down to two dimensions, which is why they appear as a triangle.



A Neural Network for the Classification Problem
Now we build a neural network to process these data. The network here has only one layer, so it hardly counts as deep learning; but the simple structure makes it easier to understand, and it already illustrates some properties of the SoftMax unit. The network structure is as follows:



What the neural network does is train the parameters $$(\underline{v}_j, b_j )$$, which apply a linear transformation to the features, $$Z_j = \underline{v}_j^T \cdot \underline{x} + b_j$$, one for each output-layer neuron $$j$$. The output layer then passes these $$Z_j$$ through the SoftMax function to form a probability distribution

$$Q^{(v,b)}_{Y|\underline{X}}(j | \underline{x}) = \frac{e^{Z_j}}{\sum_i e^{Z_i}}$$

and tries to maximize the likelihood of the samples:

$$\max_{v, b} \sum_{i=1}^N \log Q_{Y|\underline{X}}^{(v,b)}(y_i| \underline{x}_i) $$

The number of output nodes $$ K $$ should equal the number of possible label values, which is $$\text{Cy} = |{\mathcal Y}| $$ in the code. Implementing this network in Keras is straightforward:
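Something along the following lines; the optimizer and the number of epochs are my guesses, not the original settings:

```python
Dx, Cy = 2, 8
Y_onehot = LabelBinarizer().fit_transform(Y)      # one-hot targets for the SoftMax layer

model = Sequential()
model.add(Dense(Cy, input_dim=Dx, activation='softmax'))   # a single SoftMax output layer
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(X, Y_onehot, epochs=50, batch_size=32, verbose=0)

weights = model.get_weights()   # [v, b]; the kernel has shape (Dx, Cy) in Keras' convention
```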

The variable and function names make it clear what we are doing: model.add specifies the network structure, and model.fit then trains the weights. The trained weights can be retrieved by calling model.get_weights, which gives us:

A question that has puzzled us for decades: what do these weights actually represent?

What Are These Weights?
Let us now compute the empirical distribution of $$ Y $$ and the empirical mean of each class, $$ {\mathbb E} [\underline{X}|Y=y] $$:
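A minimal version of that computation (my own sketch):

```python
# Empirical label distribution P_Y and per-class empirical means E[X | Y=j]
Py = np.array([np.mean(Y == j) for j in range(Cy)])
M  = np.vstack([X[Y == j].mean(axis=0) for j in range(Cy)])   # shape (Cy, Dx)
```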

The resulting M is a matrix with Cy rows and Dx columns. For each j in range(Cy=8), M[j,:] is a Dx=2 dimensional vector, and each column M[:, i] can be viewed as a function of $$ y $$. Notice that the weights computed by the neural network, the $$ v_j $$'s drawn in the figure, also form a Cy by Dx matrix, with each entry corresponding to an edge from an input neuron to an output neuron. We would like to compare these two sets of values. Before doing so, we normalize all functions of $$ y $$ with the following routine, so that they have zero mean and unit variance under the empirical distribution $$ P_Y $$.
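The regulate routine is referred to throughout this page. Here is a minimal version consistent with how it is described, i.e. zero mean and unit variance under a given distribution (the original implementation may differ):

```python
def regulate(f, p):
    """Normalize a function of y (a vector indexed by y) to zero mean and unit
    variance under the distribution p."""
    f = np.asarray(f, dtype=float)
    mean = np.dot(p, f)
    var = np.dot(p, (f - mean) ** 2)
    return (f - mean) / np.sqrt(var)
```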

Now we are ready to plot:
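A plotting sketch of the comparison (again my own reconstruction):

```python
v = weights[0].T        # transpose the (Dx, Cy) Keras kernel so that rows are indexed by y
for i in range(Dx):
    plt.plot(regulate(M[:, i], Py), 'o-', label=f'E[X_{i} | Y=y], normalized')
    plt.plot(regulate(v[:, i], Py), 's--', label=f'trained weight v_{i}(y), normalized')
plt.xlabel('y')
plt.legend()
plt.show()
```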

The result looks something like this:

Interesting, isn't it? So what the training process actually computes is quite simple: the empirical conditional expectations $${\mathbb E} [\underline{X}|Y=y] $$ for the different values of $$y$$.

So why would a neural network go through all this trouble just to compute a conditional expectation? This goes back to Alfréd Rényi and the notion of the HGR maximal correlation. In the paper, we prove that the conditional expectation is, among all functions of $$y$$, the one most relevant to the classification problem.

Fine, if HGR sounds too abstract, you can simply think of it as a "good" function that statisticians would like to compute. This function may differ from the classical notion of a sufficient statistic for $$(\underline{X}, Y)$$. Here, because the input features are high-dimensional and have complex structure, it is difficult to learn something as complete as a sufficient statistic. The neural network instead restricts how many free parameters the learned function can have, and this number is far smaller than the number of parameters needed to describe the full statistical model of $$(\underline{X}, Y)$$. We therefore hope to approximate the full statistical model effectively with relatively few parameters. What the HGR-related concepts and our paper tell us is that we can analyze, theoretically, which functions give the best such approximation; our result shows that the parameters a neural network produces are exactly this theoretically best approximation, stored in the network as weights!

A Neural Network Without Training
At this point I think we can try something bolder: build a neural network, skip the training entirely, plug the theoretical weight values straight in, and see what happens!

The code we use looks like this:
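A sketch of this "no training" experiment; makeup_1layer is defined just below, and I use an argmax over model.predict in place of the predict_classes call mentioned in the text:

```python
model = Sequential()
model.add(Dense(Cy, input_dim=Dx, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

v_made, b_made = makeup_1layer(X, Y, Cy)     # our own weights, no back propagation (see below)
model.set_weights([v_made, b_made])          # plug them straight into the network

y_hat = np.argmax(model.predict(X), axis=1)  # same role as model.predict_classes
print('accuracy of the untrained network:', np.mean(y_hat == Y))
```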

The essential change is that model.fit never appears. Instead, we compute all the weights with the function makeup_1layer and set them directly with model.set_weights. In effect, we are using the neural network purely as a data structure, and replacing back propagation / training with a function of our own making. Afterwards, we can use model.predict_classes to check the performance of this new network.

The function that computes the weights is as follows:
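A version consistent with the formulas that follow (the exact code is not shown on this page):

```python
def makeup_1layer(X, Y, Cy):
    """Compute the output-layer weights from the formulas below instead of training."""
    sigma2 = np.mean(np.var(X, axis=0))        # averaged per-component empirical variance
    Py = np.array([np.mean(Y == j) for j in range(Cy)])
    M  = np.vstack([X[Y == j].mean(axis=0) for j in range(Cy)])   # E[X | Y=j], shape (Cy, Dx)
    v  = (M / sigma2).T                        # Keras kernel shape (Dx, Cy)
    b  = np.log(Py) - np.sum(M ** 2, axis=1) / (2 * sigma2)
    return v.astype('float32'), b.astype('float32')
```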

More explicitly, this code computes

$$ \underline{v}_j = \frac{{\mathbb E}[\underline{X} | Y=j]}{\sigma^2}; \qquad b_j= \log P_Y(j) - \frac{\|{\mathbb E}[\underline{X} | Y=j]\|^2}{2\sigma^2} $$

where $$ {\mathbb E}[\cdot] $$ denotes the empirical average and $$ \sigma^2 $$ is the average of the empirical variances of the components of $$ \underline{x}$$.

Needless to say, these computed weights work well, comparable to the results trained by Keras. In other words, all the secrets of this neural network are hidden in these two equations!

Experiment 2: Discrete Inputs
The previous experiment is far from satisfying our curiosity about deep learning, even though it explains the role of the SoftMax output layer. For real deep networks, the hundreds of hidden layers in the middle seem to matter much more for performance. So what are those hidden layers doing?

We modify the previous single-layer network slightly and study the following network.



For now, think of the red dashed box as many layers of the network: they construct a feature $$ S(x) $$, a function of the input $$ x $$, and feed it to the SoftMax output layer. What function $$ S(\cdot) $$ will the network choose? Intuition says $$ S(x) $$ should be an "informative" feature that helps us infer $$ y $$. But what makes a feature "informative"?

To answer this, we first look at a simple case. Suppose $$ x $$ only takes values in the finite set $$ \{ 1, \ldots, m\} $$, and the network input is the indicator functions, i.e. the one-hot encoding of $$ x $$; after one hidden layer we obtain $$ S(x) $$. The nice thing about this simple example is that any $$ S(x) $$ whatsoever can be generated by choosing suitable weights $$ w_1, \ldots, w_m $$:

$$ S(x) = \sigma\left( \sum_{j=1}^m w_j \cdot \mathbb{I}_{x=j}\right) $$

where $$ \sigma(\cdot) $$ is the activation function, here the sigmoid. In other words, this single-hidden-layer network has sufficient expressive power. Our question is: when we give the network the freedom to choose its feature, what $$ S(x) $$ will it choose, and why?

Generating the Data Samples
We generate the joint distribution $$ p_{XY} $$ at random. This is not essential; you could pick any joint distribution you like. The randomness here is only to avoid the impression that the experimental conditions were carefully hand-picked. Well, like all non-mathematical statements, this still does not pin down exactly what is generated. In any case, here is how we do it:
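The original routine is not shown; a sketch that matches the description (and the later use of its name and of xCard, yCard) might be:

```python
def Generate2DSamples(N=10000, xCard=16, yCard=8):
    """Pick a joint distribution p_XY at random, draw N i.i.d. samples from it,
    and return the samples together with p_XY and its marginals."""
    p_xy = np.random.exponential(size=(xCard, yCard))
    p_xy /= p_xy.sum()                                   # a random joint distribution
    flat = np.random.choice(xCard * yCard, size=N, p=p_xy.ravel())
    x, y = np.unravel_index(flat, (xCard, yCard))        # the samples of (X, Y)
    return x, y, p_xy, p_xy.sum(axis=1), p_xy.sum(axis=0)

x, y, p_xy, p_x, p_y = Generate2DSamples()
```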

It is worth noting that, after generating the data, this function returns the distributions $$ p_{XY}, p_X, p_Y $$. This might look like cheating: a stricter procedure would return only the samples, and then call another function to compute the empirical joint and marginal distributions from them. Trusting the law of large numbers, we skip that step here.

At this point we can already compute two functions as the theoretical results:

As the variable names suggest, S_theory and v_theory are the theoretical results that we will compare against what the neural network trains. Don't worry, I will explain how they are computed shortly. A wiki page without a little suspense would be so boring!

Before using the data samples, we need to convert them into one-hot encodings. This can be done with standard functions from sklearn:
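For example (MakeLabels is the name used later on this page; the wrapper below is my guess at its contents):

```python
def MakeLabels(z):
    """One-hot encode a vector of integer values, via sklearn."""
    return LabelBinarizer().fit_transform(z)

X_onehot = MakeLabels(x)     # network inputs: the indicator functions of x
Y_onehot = MakeLabels(y)     # targets for the SoftMax layer
```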

Running the Network
Now we are ready to build a neural network to learn from these data. This is a fairly standard Keras workflow:
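A sketch of that workflow, with a single hidden node computing $$ S(x) $$ as in the figure (the optimizer and epoch count are my choices):

```python
xCard, yCard = X_onehot.shape[1], Y_onehot.shape[1]

model = Sequential()
model.add(Dense(1, input_dim=xCard, activation='sigmoid'))    # the hidden layer: S(x)
model.add(Dense(yCard, activation='softmax'))                 # the SoftMax output layer
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_onehot, Y_onehot, epochs=100, batch_size=64, verbose=0)

weights = model.get_weights()
```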

The weights obtained on the last line is a list, in which two entries correspond to the weights $$ w, v $$ in our figure. We pull them out and process them as follows, where the regulate function normalizes the results. After that, we can plot them and compare with the theoretical values.
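A sketch of that post-processing (with biases enabled, the two kernels sit at positions 0 and 2 of the weights list):

```python
w, v = weights[0], weights[2]                  # hidden-layer and output-layer kernels
S_trained = expit(w.ravel() + weights[1])      # S(x) = sigmoid(w_x + bias), for each x
S_trained = regulate(S_trained, p_x)           # normalize as a function of x
v_trained = regulate(v.ravel(), p_y)           # normalize the output weights as a function of y
```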

Note that we use the 'expit' function to compute $$ S $$, since the activation function is the sigmoid. The resulting $$ S(x) $$ corresponds to the hidden-layer output in our figure, i.e. the feature extracted by the network.

Here is a typical result. Surprised or not, the trained values and the theoretical values look identical! Are you now curious how the theoretical values are computed, i.e. what is inside the WhatTheorySays function?



What Theory Says?
First of all, we give the true joint distribution $$ p_{XY} $$ to 'WhatTheorySays'. This is actually cheating. In principle, we should give it the training sample set instead, to make sure it sees the same thing that a neural network sees. One can of course compute the empirical joint distribution and use that in place of $$ p_{XY} $$, and because of the Law of Large Numbers the result should be the same. The situation is actually slightly more subtle than this. The number of samples we use might not be sufficient for a precise estimate of the joint distribution, since both $$ {\mathcal X} $$ and $$ {\mathcal Y} $$ can be very large alphabets. For a neural network and our algorithm to work, we only need the relevant parts of the joint distribution to be precisely estimated. This is a very important fact in terms of controlling the sample complexity of the solutions. In our example, both alphabets are quite small, which allows us to avoid this entire discussion of sample complexity. Interested readers can experiment by increasing the cardinalities 'xCard, yCard' in the program, and see which one, the neural network or our approach, fails first.

With that in mind, here is the code:
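The code below is my reconstruction from the description that follows; the original may differ in details such as sign conventions:

```python
def WhatTheorySays(p_xy, p_x, p_y):
    """Form the B matrix, take its SVD, and return the normalized feature
    functions from the second pair of singular vectors."""
    B = p_xy.T / np.sqrt(np.outer(p_y, p_x))    # B(y, x) = p_XY(x,y) / sqrt(p_X(x) p_Y(y))
    U, s, Vt = np.linalg.svd(B)
    r1, u1 = U[:, 1], Vt[1, :]                  # second-largest singular value
    S_theory = u1 / np.sqrt(p_x)                # feature function on X
    v_theory = r1 / np.sqrt(p_y)                # feature function on Y
    return S_theory, v_theory

S_theory, v_theory = WhatTheorySays(p_xy, p_x, p_y)
```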

Surprisingly short, isn't it? And yes, this is what the neural network computes! In mathematical form, what the code does is the following: first it forms a matrix $$ B $$ of size $$ |{\mathcal Y} | \times |{\mathcal X}| $$, where each entry corresponds to a particular pair of values $$ (x,y) $$:

$$ B(y, x) = \frac{p_{XY}(x, y)}{\sqrt{p_X(x)} \sqrt{p_Y(y)}} $$

After that we find the singular value decomposition (SVD) of this B-matrix. We pick $$ \underline{u }_1 \in {\mathbb R}^{|{\mathcal X}|}$$ and $$ \underline{r}_1 \in {\mathbb R}^{|{\mathcal Y}|} $$ as the pair of right and left singular vectors corresponding to the second largest singular value. We treat $$ \underline{u}_1, \underline{r}_1 $$ as real-valued functions over $$ {\mathcal X}, {\mathcal Y} $$, respectively, and do a simple normalization

$$ S(x) = \frac{u_1(x)}{\sqrt{p_X(x)}}, \quad v(y) = \frac{r_1(y)}{\sqrt{p_Y(y)}} $$

The resulting $$S, v $$ are our theoretical results. They automatically have zero mean and unit variance, so we did not have to use the 'regulate' routine on them. And that is what you see in the plots above.

A bit dizzy, huh? Don't worry. Let me help you with this.

The B-Matrix
The first thing that you might notice, and find confusing, is a scaling that seems particularly strange. We define a correspondence between a function of the data and a vector we operate on. In the X-space, we have a function $$ S : {\mathcal X} \mapsto {\mathbb R} $$ represented in vector form by $$ \underline{u} \in {\mathbb R}^{|{\mathcal X}|} $$, with $$ S(x) = u(x)/\sqrt{p_X(x)} $$. In the Y-space, the scaling is $$ v(y) = r(y) /\sqrt{p_Y(y)} $$. This defines the correspondence between functions and vectors, $$ \underline{u} \leftrightarrow S, \underline{r} \leftrightarrow v $$. For the moment, please be patient and take these as given.

One question you should have at this point is: "Isn't this $$ v(y) $$ function already computed earlier? Didn't we say that $$ v(y) = {\mathbb E} [ S(X)|Y=y] $$?"

Well, that is at least what I hope you would ask.

It turns out these two answers are exactly the same. To see this, we need to take a closer look at the matrix $$ B $$ we defined above. It has some really elegant properties, and here is the first one:

Let's consider the $$ B $$ matrix multiplying a vector. Consider $$ \underline{r} = B \cdot \underline{u} $$, where $$ \underline{u}, \underline{r}  $$ correspond to a pair of functions $$ S(x) $$ and $$ v(y) $$ as defined above. Now we simply write this out:

$$ \begin{align} v(y) &= \frac{1}{\sqrt{p_Y(y)}} r(y) = \frac{1}{\sqrt{p_Y(y)}} \sum_x B(y, x) u(x) \\ & = \frac{1}{\sqrt{p_Y(y)}} \sum_x B(y, x) \sqrt{p_X(x)} \cdot S(x) \\ &= \frac{1}{\sqrt{p_Y(y)}} \sum_x \left( \frac{p_{XY}(x,y) } {\sqrt{p_X(x)} \sqrt{p_Y(y)}} \right) \sqrt{p_X(x)} \cdot S(x)\\ &= {\mathbb E} [S(X)|Y=y] \end{align} $$

In words, this says that the matrix multiplication $$ \underline{r} = B \cdot \underline{u} $$ defines a mapping: it maps a feature function $$ S(\cdot) $$ on $$ {\mathcal X} $$ to a corresponding feature function $$ v(\cdot) $$ on $$ {\mathcal Y} $$, with $$ v(y)={\mathbb E} [S(X)|Y=y] $$. And as we have seen, this mapping is clearly relevant to what the neural network computes.

It is not hard to check that the conjugate operation $$ \underline{u} = B^T \cdot \underline{r} $$ defines a mapping in the reverse direction: it maps a function of $$ y $$ to a function of $$ x $$ via $$ S(x) = {\mathbb E}[v(Y)|X=x] $$.

Feature Projection
Now here is how the neural network operates. It initializes by randomly choosing the weights, which gives a random feature function $$ S(x) $$ and random output-layer weights $$ v(y) $$. In the back propagation process, we first fix $$ S(x) $$ and update $$ v(y) $$. This can be viewed as finding a feature $$ v(y) $$ of $$ y $$ that is best "aligned" with $$ S(x) $$; the solution is $$ v(y) = {\mathbb E} [S(X)|Y=y] $$. In the next step, we fix $$ v(y) $$ and look for the best aligned feature of $$ x $$, which gives $$ S(x) = {\mathbb E}[v(Y)|X=x] $$. After a number of iterations the process reaches a fixed point, where $$ S(x) $$ and $$ v(y) $$ are well aligned with each other.

In this procedure, we repeatedly answer the question "what feature of one variable is best aligned with a given feature of the other variable?" We call this "feature projection": we find the projection of a feature, say $$ S(x) $$, onto the space of features that can be computed from $$ y $$, and the solution is always the conditional expectation $$ v(y) = {\mathbb E} [S(X)|Y=y] $$.

This story brings out the SVD solution: suppose $$ \underline{u}, \underline{r} $$ are a pair of singular vectors of $$ B $$ with singular value $$ \sigma $$; then $$ \sigma \cdot \underline{r} = B \cdot \underline{u} $$ and $$ \sigma \cdot \underline{u} = B^T \cdot \underline{r} $$. The scaling by $$ \sigma $$ is not important at this point; it is taken out by the "regulate" function anyway. When we compute the SVD, one numerical method is power iteration: we start with a random choice of $$ \underline{u} $$ and repeatedly left-multiply it by $$ B^T B $$. One can see now that the feature projections are doing exactly that.

In our paper, we show that the resulting $$ \underline{u}, \underline{r} $$ pair are the singular vectors corresponding to the largest singular value of $$ B $$, capturing the strongest mode of dependence between $$ X $$ and $$ Y $$. This is why the corresponding feature functions are useful in making predictions of $$ Y $$ from $$ X $$. The story generalizes easily if we look for not one but $$ k $$ pairs of singular vectors, thus capturing the strongest $$ k $$ modes of dependence. This corresponds to a neural network with $$ k $$ hidden nodes before the output layer.

One leftover thing you may wonder about is why we took the singular vectors for the second largest singular value, not the first. It turns out that the first pair of singular vectors is always the same:

$$ \underline{u}_0 = [\sqrt{P_X(x)}, x \in {\mathcal X}], \underline{r}_0 = [\sqrt{P_Y(y)}, y \in {\mathcal Y}] $$,

and the singular value is $$\sigma_0= 1 $$. To see this, we write $$ \begin{align} (B \underline{u}_0)(y) = \sum_x B(y,x) \sqrt{P_X(x)} = \sum_x\left( \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}}\right)\cdot \sqrt{P_X(x)} = \sqrt{P_Y(y)} \end{align} $$

All other singular vectors must be orthogonal to this

$$ \begin{align} 0 = \langle \underline{u}_0, \underline{u} \rangle = \sum_x \sqrt{P_X(x)} \cdot \sqrt{P_X(x)} S(x) = {\mathbb E}[S(X)] \end{align} $$

which means the feature function $$ S(x) $$ corresponding to $$ \underline{u} $$ must be zero-mean with respect to $$ P_X $$. We take this as a given assumption, without loss of generality, when we choose feature functions. It also turns out that all other singular values of $$ B $$ must be less than or equal to 1; this is a simple consequence of the data processing inequality. Thus, the second pair of singular vectors is in fact the first pair of non-trivial singular vectors.
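Putting the pieces together, here is a tiny power-iteration sketch of the above. It is my own illustration, not code from the original page; the trivial direction $$ \sqrt{P_X} $$ is projected out, which is exactly the zero-mean constraint just discussed:

```python
B  = p_xy.T / np.sqrt(np.outer(p_y, p_x))
u0 = np.sqrt(p_x)                              # the trivial first singular vector
u  = np.random.randn(p_x.size)                 # a random initial feature of X, in vector form
for _ in range(200):
    u -= (u @ u0) * u0                         # keep S(x) zero-mean under P_X
    r  = B @ u                                 # v(y) ~ E[S(X) | Y=y], up to scaling
    u  = B.T @ r                               # S(x) ~ E[v(Y) | X=x], up to scaling
    u /= np.linalg.norm(u)
S_power = u / np.sqrt(p_x)                     # matches S_theory up to sign
```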

In our paper, we actually use the following definition of $$ \tilde{B} $$ instead of $$ B $$, in which this cumbersome largest singular mode is taken out. The nice thing is that for $$ \tilde{B} $$ it is easier to see that it captures the dependence between $$ X $$ and $$ Y $$: $$ \begin{align} \tilde{B} (y,x)= \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} - \sqrt{P_X(x)}\sqrt{P_Y(y)} = \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} \end{align} $$

Experiment 3: Limited Expressive Power
So far we have demonstrated that in a neural network used to predict the value of $$ Y $$ from input samples of $$ X $$, the network uses the training samples to learn two feature functions, $$ S(x) $$ and $$ v(y) $$, and stores these two functions as the weights on certain edges. The two feature functions are the fixed-point solution of

$$ \begin{cases} S(x) = {\mathbb E}[v(Y)|X=x]\\ v(y) = {\mathbb E}[S(X) |Y=y] \end{cases} $$

Moreover, they correspond to a pair of left and right singular vectors of a $$ \tilde{B} $$ matrix, and thus have the interpretation of the strongest "mode" of dependence between $$ X $$ and $$ Y $$.

In other words, there is an explicit goal that a neural network tries to compute, and we explain the reason why that is a good thing to compute.

This goal is actually achieved when we give the network enough expressive power, by choosing the input $$ x $$ samples to be discrete valued. In this experiment, we worry about the case where the previous layers, no matter how many there are, are not expressive enough. This happens in most cases where neural networks are useful: in NLP, in vision, etc. Intuitively, the network should then do the second best thing: find a good approximation to the ideal solutions from the SVD of the $$ \tilde{B} $$ matrix, within whatever network structure it is given. We would like to understand the nature of this approximation a bit better than this intuitive level.

Here is the network we plan to work with:



Discrete Data, Continuous Data
The main difference between this and the network we worked with previously is that the inputs to the network, $$ T_1, \ldots, T_m $$, are assumed to be continuous valued, whereas in the previous experiment the inputs were the one-hot encoding of a discrete random variable $$ X $$.

Why is this a big deal? Well, if you check the solutions we presented there, we argued that the network tries to compute certain functions of the joint distribution $$ P_{XY} $$. If the input $$ T $$ is continuous valued, it is not that we cannot assume a joint distribution $$ P_{TY} $$, but notions like $$ {\mathbb E}[v(Y)|T=t] $$ no longer make much sense. Think about it: the conditional expectation is a function of $$ t $$, and for two nearby values $$ t $$ and $$ t+ \delta $$ we would expect the conditional expectations to be close, but nothing in the data can guarantee that. One solution is to use a smoother, but that cannot give very clean results.

So here is the plan. We will not feed the neural network just any continuous-valued inputs. Instead, we assume there is a discrete-valued $$ X $$ behind $$ T $$. That is, in the data acquisition process, we cannot observe the true discrete-valued $$ X $$, but instead observe the results of some "pre-processing", which are the continuous-valued $$ T $$. We cannot change this preprocessing. The question is: what will the network do with these processed data samples?

One convenient thing is that with this setup we can understand deep networks. In a multi-layer setup, at each layer, we can think of the outputs of the previous layers as a particular kind of preprocessing, and ask "how would the neural network assign weights on the current layer given this preprocessing?" This is one step of back propagation, and it is exactly the problem in our setup.

This leads to a somewhat philosophical question: can we always think of any kind of continuous-valued data as the preprocessing result of some collection of discrete-valued factors? Well, our answer is "pretty much!", particularly if you allow the alphabet size of $$ X $$ to be large. Let's take this not as a mathematical fact, but as an assumption.

This assumption does cause a conceptual difference when we talk about the "expressive power" of a neural network. The commonly used notion asks whether a given network can generate an arbitrary function of the input $$ T $$. For that purpose, the complexity of the network we wish for is unlimited: we always want more nodes and more layers in order to generate more and more complex functions of $$ T $$. If we now take the assumption that there is a discrete-valued ground truth $$ X $$ behind $$ T $$, then the network we need is limited. For example, in the figure, if we have $$ m \geq L $$, i.e. the dimensionality of the processed results is higher than the cardinality of $$ X $$, then the network can assign weights $$ w $$ to invert the preprocessing, and we are back to the discrete-valued problem. In reality, the cardinality of $$ X $$ is really, really large, particularly when noise is involved in the preprocessing, so even this "limited sized network" is actually quite large. If you were asking your boss for another GPU, don't take that back yet.

OK! Enough discussions. Let's do some experiments.

Run the Network
Here is how we generate the data:
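Which might look like the following; the dimensions and the Gaussian pre-processing matrix are my own choices:

```python
xCard, yCard, tDim = 16, 8, 6                   # make sure xCard > tDim (L > m in the figure)
x, y, p_xy, p_x, p_y = Generate2DSamples(N=20000, xCard=xCard, yCard=yCard)

PreProcessing = np.random.randn(xCard, tDim)    # a fixed, randomly chosen pre-processing
T = PreProcessing[x, :]                         # T(x_i): the continuous-valued inputs
Y_onehot = MakeLabels(y)
```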

The function "Generate2DSamples" is the same as before. It randomly pick a joint distribution $$ P_{XY} $$ and generates data samples from that. "MakeLabels" is the one-hot encoding as before. The only added step is a randomly chosen "PreProcessing" matrix, and we get some samples $$ T(x_i) $$ for each $$ x_i $$. Note that we make sure xCard > tDim. This corresponds to the notations in the figure as $$ L > m $$, since otherwise the problem becomes trivial.

Now we feed the data, the $$ (T(x_i), y_i) $$ pairs, to the neural network:
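A sketch of that network, again with a single hidden feature $$ S(T(x)) $$ feeding the SoftMax layer:

```python
model = Sequential()
model.add(Dense(1, input_dim=tDim, activation='sigmoid'))    # S(T(x)): the restricted feature
model.add(Dense(yCard, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(T, Y_onehot, epochs=100, batch_size=64, verbose=0)

weights = model.get_weights()        # [w, d, v, b] in the notation of the figure
```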

We train the network and then get the weights. Now all that's left is to check the weights.

Check the Weights
We have already established that if we feed the discrete ground truth $$ X $$ to the network, the training results should be

$$ \begin{cases} v^*(y) = {\mathbb E} [S^*(X)|Y=y]\\ S^*(x) = {\mathbb E} [v^*(Y)|X=x] \end{cases} $$

which has an elegant interpretation as the singular vectors of the $$ \tilde{B} $$ matrix.

Now we replace $$ X $$ with the preprocessing results $$ T(x) $$. The first relation should remain the same:

$$ v(y) = {\mathbb E} [S(T(X)) | Y=y ] $$

The second relation is no longer the same, because now we cannot compute an arbitrary function $$ S^*(x) $$. The network structure restricts us to functions of a particular form, namely

$$ S(x) = \sigma\left( \sum_{j=1}^m w_j \cdot T_j(x) + d\right) $$

where $$ \sigma(\cdot) $$ is the sigmoid function and the $$ T_j(\cdot) $$'s are given. Intuitively, all we can do is find a choice of the weights $$ (w_1, \ldots, w_m, d) $$ such that the resulting $$ S(x) $$ is close to the desired $$ S^*(x) $$ in some sense.

Recall that we had a name for the conditional expectation step: feature projection. For a given feature $$ v(y) $$, we try to find a feature computable from $$ x $$ that is best aligned with $$ v(y) $$; the result is the conditional expectation $$ S(x) = {\mathbb E} [v(Y) |X=x] $$. Now our problem changes only slightly: for a given feature $$ v(y) $$, we try to find a feature computable from the $$ T_j(x) $$'s, in the specific form above, that is best aligned with $$ v(y) $$. Conceptually this is only a slight generalization, so we call this process "feature projection with restrictions".

Numerically, it is actually quite a bit harder to verify such an intuition, since the restriction on $$ S(x) $$ is non-linear; also, perhaps less obviously, the bias terms can be rather annoying. Let's handle these issues one by one.

The Annoying Bias
The bias terms, $$ b_1, \ldots, b_k $$ in the figure, turn out to be very tedious and not very useful.

What actually happens is that when a neural network is understood as performing feature projections, i.e. as an operator that maps one feature function to another, it is quite clear that a constant shift of a feature function does nothing. Thus we should always think of such operations as mappings between zero-mean feature functions: zero-mean with respect to the marginal distribution of the variable on which the feature function is defined, $$ P_X $$ for $$ S(x) $$ and $$ P_Y $$ for $$ v(y) $$.

We can therefore define notations as

$$ \begin{align} \tilde{S}(x) &= S(x) - {\mathbb E}[S(X)], \\ \tilde{v}(y) &= v(y) - {\mathbb E}[v(Y)] \end{align} $$

and a better way to write the feature projections should be

$$ \begin{cases} \tilde{S}(x) = {\mathbb E}[\tilde{v}(Y)|X=x]\\ \tilde{v}(y) = {\mathbb E}[\tilde{S}(X)|Y=y] \end{cases} $$

Unfortunately, neural networks do not appreciate the beauty of this simplest form. (And you are still worrying about whether AI will one day replace humankind.) Instead, a network controls the mean values of these functions so that the nodes work at specific operating points of the activation functions, to exploit the non-linearity around those points. As a result, the outputs generated by a neural network are almost never zero-mean. For example, the output of the sigmoid function is always positive and cannot have zero mean.

This is why, whenever we need to compare the network outputs with the theoretical results, we use the "regulate" function. Inside the network, the bias node and the weights on the edges connected to it are used to compensate for these non-zero means. For example, the weights $$ b_1, \ldots, b_k $$, written in function form as $$ b(y) $$, in fact satisfy

$$ b(y) = \log P_Y(y) - v(y) \cdot {\mathbb E} [S(X)] $$

where the second term compensates for the mean value of $$ S(x) $$. It ensures that at the output nodes $$ v(y) \cdot S(x) + b(y) = v(y) \cdot \tilde{S}(x) + \log P_Y(y) $$.

That is, $$ S(x) $$ is effectively zero-mean from the output nodes' point of view.

This fact can be easily verified with the following code, where "weights" is the list returned by "model.get_weights":
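A sketch of that check (the variable names follow the figure; the code actually used on the page is not reproduced here):

```python
w, d, v, b = weights
S = expit(T @ w + d).ravel()              # hidden-layer outputs S(T(x_i)) on the samples
v = v.ravel()

b_theory = np.log(p_y) - v * S.mean()     # log P_Y(y) - v(y) * E[S(X)]
# subtract one more mean (w.r.t. P_Y) from both, since the SoftMax is itself shift-invariant
print(b - np.dot(p_y, b))
print(b_theory - np.dot(p_y, b_theory))
```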



And the output looks like this.

You may not have caught it: for both "b" and "b_theory", we subtracted yet another mean, this time with respect to $$ P_Y $$! This is because the output-layer softmax is yet another non-linear function, and the network plays with this mean again. Annoying, isn't it?

Careful readers might also notice that in the first experiment, when we made up our own weights to plug into the network, we used a different formula: $$ b(y) = \log P_Y(y) - \frac{({\mathbb E}[S|Y=y])^2}{2 \cdot {\rm var}[S]} $$

Instead of showing why the two are actually equivalent, which is quite tedious, let's just say the specific form depends on how these weights are regulated. And don't we all wish there wasn't any bias, in any form?

 "I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character." --- Dr. Martin Luther King

The Content of Their Character
Now that we have put the bias issue aside, let's see how the network does the feature projection. Recall that the ideal output of the hidden node is

$$ S^*(x) = {\mathbb E} [ \tilde{v}(Y)| X=x] = {\mathbb E}[v(Y)|X=x] - {\mathbb E}[v(Y)] $$

Our conjecture is that the network generates an approximation to this as $$ \tilde{S}(x) = S(x) - {\mathbb E}[S(X)] $$. Knowing that whatever mean it creates will be taken out at a later layer, the network actually produces one that is not zero-mean:

$$ S^{(w, d)}(x) = \sigma\left( \sum_j w_j T_j(x) + d \right) $$

by choosing the weights $$ w_1, \ldots, w_m, d $$.

Let's make a bold guess that this notion of approximation is in the mean-squared-error sense; then we can write the optimization in a single line

$$ \min_{w, d} \sum_x P_X(x) \cdot ( S^{(w, d)}(x) - {\mathbb E}[S^{(w, d)}(X)] - S^*(x))^2 $$

or using the $$ \tilde{S} $$ notation

$$ \min_{w, d} \sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x))^2 $$

Now we can take some derivatives:

$$ \begin{align} 0 &= \frac{\partial}{\partial w_j } \sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x))^2\\ &= \sum_x P_X(x) \cdot (\tilde{S}^{(w, d)} (x) - S^*(x)) \cdot\left( \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x)\right) \end{align} $$

This can be viewed as saying that the two functions, $$ (\tilde{S}^{(w, d)} (x) - S^*(x)) $$ and $$ \left( \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x)\right)$$, are orthogonal in the inner product weighted by $$ P_X $$. The fact that the inner product is 0 is hard to check numerically, since both functions might be scaled rather arbitrarily, so it is hard to tell what numerical value is close enough to 0. What we actually do is check the angle between the two functions by taking the "arccos" and seeing whether it is close to $$ \pi/2 $$. Well, we actually changed the unit and compared the angle with $$ 90 ^\circ $$.

The only thing one needs to be careful about is the second function, the one with the derivative. Many people, including myself, have memorized the derivative of the sigmoid function: $$ \sigma'(z) = \sigma(z) \cdot (1- \sigma(z)) $$. However, here we need the derivative of the sigmoid function minus its mean:

$$ \begin{align} \frac{\partial}{\partial w_j} \tilde{S}^{(w, d)}(x) &= \frac{\partial}{\partial w_j} \left[ S^{(w, d)} (x) - \sum_{x'} P_X(x') S^{(w, d)}(x') \right] \end{align} $$

In our notation this is simply $$ \widetilde{\frac{\partial }{\partial w_j} S^{(w, d)}}(x) $$, i.e. the derivative with its mean taken out. Finally, plugging in the fact that

$$ \frac{\partial}{\partial w_j} S^{(w, d)} (x) = S^{(w, d)}(x) (1-S^{(w, d)}(x)) \cdot T_j(x) $$

and we are ready to program again.
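A sketch of the orthogonality check described above (my own reconstruction; it reuses w, d, v and the PreProcessing matrix from the previous snippets):

```python
S_x   = expit(PreProcessing @ w.ravel() + d)        # S^{(w,d)}(x) for every symbol x
S_cen = S_x - np.dot(p_x, S_x)                      # tilde S: subtract the P_X-mean
v_cen = v - np.dot(p_y, v)                          # tilde v
S_star = (p_xy / p_x[:, None]) @ v_cen              # S*(x) = E[ tilde v(Y) | X=x ]

residual = S_cen - S_star
for j in range(tDim):
    deriv = S_x * (1 - S_x) * PreProcessing[:, j]   # d S / d w_j, before centering
    deriv -= np.dot(p_x, deriv)                     # ... and with its mean taken out
    cosang = np.dot(p_x * residual, deriv) / np.sqrt(
        np.dot(p_x, residual ** 2) * np.dot(p_x, deriv ** 2))
    print(f'w_{j}: angle = {np.degrees(np.arccos(cosang)):.1f} degrees')
```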

And the results? Here is a typical set of values:

Convinced?

A Theory for Deep Learning
If you happen to type "neural networks" into the Google search bar, chances are you will find "explained" as a suggested next word. The fact is there are many, many works trying to explain why neural networks work. I have not seen one that provides an explicit answer, identifying a mathematical quantity and showing that it matches what the neural network learns. In that sense, I do think we are making progress here.

We didn't present much of a theory on this page, nor did I wish to use these simple demos to prove anything. I only hope to use these experiments to explain the main ideas, and to show what the theory would suggest in a few specific scenarios. Hopefully this makes some of you interested enough to read the paper. In the paper, we put forward a more complete picture. It starts by saying that we need a new way to measure information itself. After designing a new information metric, we can use it to describe the information flow in any data processing procedure: what information is kept, what is discarded, and whether such choices are good ones. We then show that finding the singular vectors of $$ \tilde{B} $$, as we did here, is a good choice: it captures the strongest modes of dependence between $$ X $$ and $$ Y $$ as we claimed; it captures the most significant elements of the "common information" between $$ X $$ and $$ Y $$; and, perhaps more importantly, the selected features are the optimal choices in what we call the "universal feature selection" problem.

In classical statistics, if we need to select features from data for the purpose of solving some inference problem, the answer is always to select a sufficient statistic. We argue that in most learning problems, as we process high-dimensional data samples, we do not know exactly which attributes of the data we will be interested in detecting. For example, when we observe a customer's behavioral record, it is not clear which attributes will matter when we try to recommend a product to her later. Thus, for these new problems, we cannot use the notion of a sufficient statistic for a particular inference task. Instead, we need to select a set of features such that, regardless of which attribute we might later be interested in detecting, decisions based on these features are "universally" good. For this new problem, we show that the choice given by the singular vectors is indeed optimal. That is, for any feature selection procedure, as long as the goal is not a single problem with a known statistical model, the sensible choice is these universal features. And neural networks, as one such tool, had better give these solutions. This is the basis of the above experiments.

Particularly in this area, I would like to avoid overstating the importance of our results, such as declaring "blackbox opened". The best way to do that is to show the mathematical results. Since we are not doing that here, I think it is important to make clear what we can and cannot do with this theory. Here it goes.

A Performance Metric for Neural Networks
We get a performance metric for neural networks. Keeping the notation that $$ \underline{S}(x) = [S_1(x), \ldots, S_k(x)] $$ is the output of the last hidden layer with $$ k $$ nodes and $$ \underline{v}(y) = [v_1(y), \ldots, v_k(y)] $$ is the weights on the output layer, our theory says the goal is to make them as highly correlated as possible, i.e. to maximize $$ {\mathbb E}[ \langle\underline{S}(X), \underline{v}(Y)\rangle] $$. The optimal solution is to choose the $$ S(x) $$'s and $$ v(y) $$'s from the SVD solutions. However, due to the limited expressive power of the chosen network structure, we usually cannot achieve this optimum. The gap is then a good measure of how effective our network is.

The H-Scores
For that purpose, we define a metric we call the "H-Score". For a given network with both $$ S(x) $$ and $$ v(y) $$ computed (keeping in mind that a neural network might not put the answers in our favorite coordinate system, but may rotate, scale, and shift the solutions), we first write $$ \tilde{S}(x), \tilde{v}(y) \in {\mathbb R}^k $$ for the centered versions of the features and the weights, with the means subtracted, and then define $$ \Phi = [\phi_1, \ldots, \phi_k], \Psi= [\psi_1, \ldots, \psi_k] $$, with

$$ \begin{align} \phi_i (x) &= \sqrt{P_X(x)} \cdot \tilde{S}_i(x) \\ \psi_i (y) &= \sqrt{P_Y(y)} \cdot \tilde{v}_i(y) \end{align} $$

Then the "H-Score" is defined as

$$ \begin{align} H(S, v) &:= \|\tilde{B}\|^2 - \| \tilde{B} - \Psi \Phi^{\mathrm{T}} \|^2\\ &= 2 {\mathbb E}[ \underline{S}^T(X) \underline{v}(Y) ] - {\rm trace} ({\rm cov}(\underline{S}(X)) {\rm cov}(\underline{v}(Y))) \end{align} $$

If we are interested only in evaluating how good the selected feature $$ S(x) $$ is, we can plug in the theoretically optimal weights to get the one-sided H-score

$$ \begin{align} H(S) := H(S, v^*) &= \| \tilde{B} \Phi (\Phi^{\mathrm{T}} \Phi) ^{-\frac{1}{2}}\|^2 \\ &= {\mathbb E}_{P_Y} [ \ {\mathbb E} [\underline{S}(X) |Y=y]^{T} \cdot {\rm cov}(\underline{S}(x))^{-1} \cdot {\mathbb E} [\underline{S}(X) |Y=y]\ ] \end{align} $$

In practice, all of the expectations and covariances above can be replaced by empirical averages. So the H-Scores can be computed for all the features and weights in the network, including the features at the outputs of intermediate hidden layers, to evaluate how good they are.
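A minimal empirical version, assuming we have gathered the per-sample feature values $$ S(x_i) $$ and the corresponding output weights $$ v(y_i) $$ (my own sketch, not the code behind the figures below):

```python
def h_score(S, V):
    """Empirical H-score: S[i] = S(x_i) and V[i] = v(y_i), each k-dimensional.
    The definition uses the centered features, so the means are subtracted here."""
    S = np.reshape(S, (len(S), -1))
    V = np.reshape(V, (len(V), -1))
    S = S - S.mean(axis=0)
    V = V - V.mean(axis=0)
    cross = 2.0 * np.mean(np.sum(S * V, axis=1))
    covS = np.atleast_2d(np.cov(S, rowvar=False))
    covV = np.atleast_2d(np.cov(V, rowvar=False))
    return cross - np.trace(covS @ covV)

# e.g. for the single-feature network of Experiment 3:
# print(h_score(expit(T @ w + d), v[y]))
```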



To demonstrate the use of the H-Score, we ran the following classification experiment on the data shown here. These are the first 2 dimensions of a 6-dimensional dataset, with color-coded classes. The data are generated from a Gaussian mixture model and are rather complicated. We need a multi-layer neural network for this task, and we plot the H-Scores for the outputs at each layer as follows:



What we observe is that as we run the network for more and more iterations, the H-Score improves; and the H-Scores of the outputs at later layers are higher. This shows that more iterations and more layers are indeed picking better and better features of the data. Finally, the classification accuracy is also included in the figure to show the correspondence with the H-Scores: a higher H-Score generally means better performance, as expected.

Interested readers might notice from the plot that layer 3 might not be doing her job too well, and try to do something about it. The code for this experiment can be downloaded from here.

For a more realistic test, we also evaluated the features selected by different algorithms on the ImageNet dataset. The following table lists the H-Scores and the accuracies, for both the top-1 and top-5 candidates, to show the correspondence between the two.



Why Not Log-Loss?
Of course, the more commonly used metric in such experiments is the log-loss, also called the cross entropy, which is closely related to the K-L divergence. Here is how the H-Score relates to them and why it is (somewhat) better.

The log-loss, or cross entropy, computes $$ {\rm LL}(S, v)= {\mathbb E}_{P_{XY}} \left[ \log Q_{Y|X}^{(S, v)}(Y|X) \right] $$, where $$ P $$ is the empirical distribution and $$ Q^{(S, v)} $$ is the output of the SoftMax unit interpreted as a conditional distribution. The entire network chooses $$ S $$ and $$ v $$ to maximize this. As the objective of the entire optimization, it can of course be used as a measure of how good the result is. In fact, the most important job of the engineer who designed the network, while it runs, is to watch the log-loss value grow, slow down, and converge.

One issue with the log-loss is that it is not clear a priori what value to expect. Should it be positive, negative, on the order of a few hundred, or 10 to the power of 6? The actual value depends on the specific dataset and task. This makes it hard to use the log-loss as an indicator of the goodness of the network in an absolute sense.

An alternative is the K-L divergence, which is also called the relative entropy. It computes $$ D(P_{XY} || P_X \cdot Q_{Y|X}^{(S, v)}) = {\mathbb E}_{P_{XY}} \left[ \log \frac{P_{Y|X}(Y|X)}{Q_{Y|X}^{(S, v)}(Y|X)}\right] $$

and tries to minimize it. In terms of the optimization, this is equivalent to the log-loss, since the only difference is the term $$ {\mathbb E} [\log P_{Y|X}] $$, which does not depend on $$ (S, v) $$ at all.

A slight advantage of the K-L divergence is that it has an operational meaning: the distance between two distributions, the actual empirical distribution $$ P $$ and the one generated by the network, $$ Q $$. This gives the entire network a nice interpretation: it finds the best approximation to the true joint empirical distribution $$ P_{XY} $$. Operationally, the K-L divergence also has an absolute range: the divergence above always satisfies $$ 0 \leq D(P||Q) \leq \log |{\mathcal Y}|$$. This is a good thing. For example, if for a dataset with $$ |{\mathcal Y}| =8 $$ you get a K-L divergence of 0.02, that sort of says you have a pretty good approximation; if instead your K-L divergence is 1.2, then your network is pretty much shooting randomly.

The problem with the K-L divergence is that we often cannot compute it from the data, because we do not know $$ P_{Y|X} $$. In fact, the reason we want a good approximation to the true $$ P_{Y|X} $$ is often precisely to get a decent guess of what it looks like. This is where one sees the difficulty of having a good metric for neural networks; it is like a game of Marco Polo: we are chasing a wildly moving target, and it is hard to report, with a single number, where it is and how close we are to it.

Another issue with the K-L divergence is revealed when we write it in our vector form:

$$ D(P_{XY} || P_X \cdot Q_{Y|X}^{(S, v)}) \approx \frac{1}{2} \| \tilde{B} - \Psi \Phi^T \|^2 = \frac{1}{2} \| \tilde{B} - \sum_{i=1}^k \psi_i \cdot \phi_i^T \|^2 $$

We are trying to approximate $$ \tilde{B} $$ with a rank-k matrix, but the distance used here is the norm over the entire matrix. That is, even if we use the optimal rank-k matrix computed from the SVD, the remaining singular values $$ \sigma_{k+1}, \sigma_{k+2}, \ldots $$ are still present in the resulting distance. In general, the K-L divergence is the sum of the contribution from these singular values that we can never reach and the actual distance between our rank-k solution and the optimal one, and we have no way to separate the two. This is bad, particularly when $$ |{\mathcal Y}| $$ is large.

This is the point where we can see why the H-Score is nice:

$$ H(S, v) := \|\tilde{B}\|^2 - \| \tilde{B} - \Psi \Phi^T \|^2 $$

That is, we subtracted the K-L divergence from $$ \|\tilde{B}\|^2 $$, and in doing so dropped the contributions of the extra singular values that no rank-k solution can reach. To see this, suppose $$ \Psi \Phi^T $$ is precisely the rank-k approximation of $$ \tilde{B} $$ from the SVD; the resulting H-Score is then the sum of the squared top-k singular values, with all the other singular values cancelled.

The following sequence of inequalities can be very useful:

$$ H(S, v) \leq H(S) \leq \sum_{i=1}^k \sigma_i^2 \leq k $$

where k is the dimensionality of the selected feature function $$ S(x) $$.

By definition, the first gap, between $$ H(S, v) $$ and $$ H(S) $$, indicates how far the weights $$ v(y) $$ are from the optimal $$ v^*(y) $$. For an intermediate layer of a deep neural network, we can imagine taking the output of this layer and attaching a fully connected SoftMax output layer immediately after it; $$ v^* $$ is then the weights on this layer. Thus, this gap is often a good measure of how well the network computation has converged to its limit.

The second gap measures how far the selected features $$ S(x) $$ are from the optimal SVD solutions. If the network has converged well, the only reason we cannot select this optimal feature is the limited expressive power of the network. Thus, this gap is a good measure of how good our network structure is.

The third gap measures how good our dataset is, that is, how much of the dependence between $$ X $$ and $$ Y $$ a k-dimensional approximation can capture. The result is conveniently normalized to have an absolute upper bound of k. For example, in a good solution of the MNIST problem, $$ H(S) $$ can be about 8.1, with $$ k =9 $$ corresponding to the 9 dimensions needed to distinguish 10 hand-written digits. From this we can conclude that the dataset is actually really good and the selected features are very close to optimal, so the classification results can be expected to be very precise.

The Limitations
This is what we can do with the H-Score. Now here is what we cannot. Basically, there are cases where the H-Score is not an accurate measure of performance: a design with a lower H-Score can actually do better. This can happen for multiple reasons.

First, as stated, the H-Score measures how good the selected features are in the universal sense, that is, how useful the features are averaged over all possible queries. This universal choice often cannot be as good as a custom-made solution for a specific task. We argue that if we do not know the task, then the universal choice is the only sensible one. In practice, however, it can be the case that we are actually interested in just one specific problem, or the problem space is quite narrow, and we are allowed to do some trial-and-error experiments. What such trial and error can give us is a choice of feature functions that is tuned to this specific task, even though the tuning happens in a rather implicit way. Although we would like to claim that our universal solution is the nobler one, we cannot deny that this trial-and-error approach does have its value in practice.

The second problem can be seen from our ImageNet table. Sometimes a solution with a higher H-Score has worse performance, such as VGG16 and VGG19. One reason is that these solutions are significantly more complex than the others, with many more parameters. All of our results are derived under the assumption that the statistical models are precisely and stably estimated from the samples; the generalization error from overfitting is not captured.

The third problem with the H-Score, perhaps the most important one, is that it is based on a local approximation. This is explained in much more detail in our paper. The short version is that the approximation allows us to focus on a particular local optimum. If we insisted on doing the math without this approximation, the solutions would be much more complicated, less clean, harder to generalize, and sometimes numerically less stable. The lack of stability is one severe problem of deep neural networks, and a nightmare for all engineers working in this area. Unfortunately, our theory does not help with that; the local assumption let us avoid this entire issue from the outset. This is a deliberate choice, and perhaps the main reason we could have a story this clean. In optimization, brave people do look for better ways to find the global optima of non-convex problems, but it is hard to imagine the results being as general and as simple as those for local optima, which is what we are after for the neural network problem.

Interpretability
Interpretability is an important issue for neural networks. Exactly what is learned by this complex structure? Those selected features, what are they? We should expect our results to shed some light on this. We have a mathematical expression for the output of each hidden node and the weight on each edge. We know what the network tries to compute and where it stores the results. That must be a good thing.

One obvious advantage is that we now have guidance for transfer learning and multi-task learning. Instead of using a large deep neural network as a black box, we can now name the intermediate results, which helps us reuse them in a different task. We have some practice with this and will report it on another page.

Now, does this mean that we can interpret the meanings of the selected features? Not really. Perhaps even worse, it gives a reason why interpretability is something we probably should not expect. What our theory says is that, in general, the features selected by neural networks are the solutions to a number of optimization problems: they carry the most information, they are the most distinctive features, they are universally good for unknown inference tasks, and so on. But they are not necessarily interpretable, meaning they might not correspond to a simple, compact term that humans are familiar with.

Think of a close friend of yours. The first thing that comes to your mind is probably a general impression of this person as a whole, not the age, gender, ID number, or anything else you can find on a customs form. People invent tags to categorize items in the world. Why these particular tags were chosen is a question far beyond the scope of engineering; all we can say is that they may not be aligned with the features we choose from data. Often the features that best distinguish data samples are combinations of many human-chosen tags in specific ways. In principle, one can learn these combinations if there is a dataset labeled with some tags, and thus translate the data features into terms we are more familiar with. We view this as a learning problem separate from that of selecting features from the data.

Bob finally found his soulmate since her $$\xi$$ value is really high, and $$\xi$$ = 0.0122 × haircolor + 0.0031 × athletic + 0.0041 × cooking-skills + 0.0201 × laugh-on-old-jokes + 0.0311 × prefer-python-to-matlab + ...

Yeah! You guessed it. Bob is a software engineer.

This page is made with the help of Xiangxiang Xu.
