Akaike Information Criterion

Introduction
Akaike Information Criterion (AIC) was first published in this paper, at one of the earliest International Symposia on Information Theory (ISIT). It is one of the most celebrated results on model selection. There is a well-written Wikipedia page on the topic. In the link provided above, there is also a wonderful commentary by Jan de Leeuw. The goal of this page is to provide a hopefully more intuitive and brief way to understand the result, using a geometric view of the problem.

The Model Selection Problem
The problem addressed is a common one in statistical learning. Suppose we have samples $$ x_1, x_2, \ldots, x_n $$ i.i.d. from an unknown distribution $$ P $$ over a finite alphabet $$ {\mathcal X} $$. Our goal is to find the best fit using the maximum likelihood estimate. That is, for a given parameterized family of models $$ {\mathcal P} = \{ P_\theta \} $$, we choose $$ P_{\theta^*} $$ as an approximation to $$ P $$, with

$$ \theta^* = \arg \min_\theta D(P||P_\theta) = \arg\max_\theta {\mathbb E}_{P} [\log P_\theta(X)] $$

where $$ D(P||Q) $$ is the Kullback–Leibler divergence.

The model selection problem is concerned with the question of how big a family of models we should choose the fit from. Intuitively, the larger the family of candidate models is, the better fit we can find, with a smaller [[wikipedia:Kullback–Leibler divergence|K-L divergence]]. However, since we do not have access to the true model $$ P $$, in reality we use the empirical distribution $$ \hat{P} $$ of the observed $$ n $$ samples in place of $$ P $$, and solve

$$ \hat{\theta} = \arg \min_\theta D(\hat{P}||P_\theta) = \arg\max_\theta \frac{1}{n} \sum_{i=1}^n \log P_\theta(x_i) $$

This potentially causes overfitting, especially when the number of samples $$ n $$ is not large enough and the model family is too big.
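
As a concrete toy example of this fit (entirely my own, not from the original text), the sketch below uses a three-letter alphabet and a one-parameter family $$ P_\theta = \mathrm{Binomial}(2, \theta) $$; maximizing the empirical log-likelihood (here by a simple grid search) is the same as minimizing $$ D(\hat{P} || P_\theta) $$.

```python
import numpy as np

# Alphabet {0, 1, 2}; one-parameter family P_theta = Binomial(2, theta).
def p_theta(theta):
    return np.array([(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2])

rng = np.random.default_rng(0)
p_true = np.array([0.5, 0.3, 0.2])             # the "true" P, unknown to the fit
n = 200
samples = rng.choice(3, size=n, p=p_true)      # n i.i.d. samples
p_hat = np.bincount(samples, minlength=3) / n  # empirical distribution P-hat

# Maximizing (1/n) sum_i log P_theta(x_i) == minimizing D(P_hat || P_theta);
# a grid search over theta is enough for this toy example.
thetas = np.linspace(1e-3, 1 - 1e-3, 999)
log_liks = np.array([p_hat @ np.log(p_theta(t)) for t in thetas])
theta_hat = thetas[np.argmax(log_liks)]

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print("theta_hat =", theta_hat)
print("D(P_hat || P_theta_hat) =", kl(p_hat, p_theta(theta_hat)))
```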

Mathematically, we can consider a sequence of families $$ {\mathcal P}_1 \subset {\mathcal P}_2 \subset \ldots \subset {\mathcal P}_K $$, where the $$ k^{th} $$ family

$$ {\mathcal P}_k = \{ P_{\theta^k} : \theta^k \in {\mathbb R}^k \} $$

has a $$ k $$-dimensional parameter. Let $$ \hat{\theta}^k $$ be the maximum likelihood estimate of $$ \theta^k $$, so that $$ P_{\hat{\theta}^k} $$ is the fitted model in $$ {\mathcal P}_k $$. The question is which value of $$ k $$ one should use.

There are many solutions to this problem, among which the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) are the most commonly used. In AIC, we compute

$$ \min_k \left[ \min_{\theta^k \in {\mathbb R}^k} D(\hat{P} || P_{\theta^k}) + \frac{k}{n} \right] $$

That is, on top of the fitting error $$ D(\hat{P} || P_{\hat{\theta}^k}) $$, we add $$ k/n $$ to penalize model families with larger $$ k $$.
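
To make the rule concrete, here is a hedged sketch with a made-up sequence of nested families (not the setting Akaike analyzed): over an alphabet of size 8, the $$ k $$-th family keeps the first $$ k $$ symbol probabilities free and forces the remaining mass to be spread uniformly, so it has $$ k $$ free parameters and a closed-form maximum likelihood fit. The AIC score is then the fitting error plus $$ k/n $$.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8
p_true = np.array([0.35, 0.25, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02])
n = 100
p_hat = rng.multinomial(n, p_true) / n   # empirical distribution from n samples

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Nested families: the k-th family keeps symbols 0..k-1 free and forces a
# uniform conditional distribution on the remaining tail (k free parameters).
def fit_family_k(p, k):
    q = p.copy()
    q[k:] = p[k:].sum() / (m - k)        # MLE: spread the tail mass uniformly
    return q

scores = {}
for k in range(1, m):
    q_hat = fit_family_k(p_hat, k)
    scores[k] = kl(p_hat, q_hat) + k / n  # fitting error + AIC penalty
print({k: round(v, 4) for k, v in scores.items()})
print("AIC chooses k =", min(scores, key=scores.get))
```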

Akaike justified this by showing that when we replace the true distribution $$ P $$ with the empirical distribution $$ \hat{P} $$ in the divergence minimization, the resulting $$ \hat{\theta}^k $$ is biased by the particular realization of $$ \hat{P} $$, much like the familiar bias when estimating a variance from samples. As a result, we need a correction term to remove this bias, which happens to take the surprisingly simple form $$ k/n $$.

Our goal is to show you this fact with a simpler geometric approach. Well, simpler if you are willing to take a local geometric assumption.

The Local Geometry
Local geometry is one of my favorite ways to do analysis. I probably should write a separate page to summarize the nice properties it offers. But here, let's just set up the minimum we need for this problem.

In short, we would like to represent each probability distribution $$ P $$ over $$ {\mathcal X} $$ as a vector. Because the alphabet is finite, we can already write $$ P $$ itself as a vector. However, we add the assumption that all the distributions we are interested in lie in a small neighborhood of a reference distribution $$ P_0 $$, and represent $$ P $$ by a vector $$ \phi $$ with entries

$$ \phi(x) = \frac{P(x)- P_0(x)}{\sqrt{P_0(x)}}, \forall x \in {\mathcal X}$$

We denote this representation as $$ P \leftrightarrow \phi $$. This might look strange at first, but it offers the following nice properties.


 * If two distributions $$ P, Q $$ are represented by $$ \phi, \psi $$, respectively, then the K-L divergence $$ D(P||Q) \approx \| \phi - \psi \|^2$$. That is, the squared distance between the two vectors is a good approximation to the K-L divergence.


 * Locally, each smooth parameterized family of distributions $$ {\mathcal P} $$ is a linear subspace. Here we assume that the given sequence of model families satisfies $$ P_0 \in {\mathcal P}_1 \subset {\mathcal P}_2 \subset \ldots $$


 * As a result, when we fit $$ P $$ to a particular family $$ {\mathcal P} $$, we look for a $$ Q \in {\mathcal P} $$ with the minimum divergence; in the vector representation, we are finding the $$ \psi $$ in a linear subspace with the minimum Euclidean distance to $$ \phi $$, which is just a projection (see the short sketch after this list). Let's denote the projection (fitting) of $$ P \leftrightarrow \phi $$ onto $$ {\mathcal P}_i $$ as $$ \pi_i ( \phi) $$.


 * If we have $$ n $$ i.i.d. samples from a distribution $$ P \leftrightarrow \phi $$, the empirical distribution $$ \hat{P} \leftrightarrow \hat{\phi} $$ is random. With the local approximation, we can approximately think of $$ \hat{\phi} $$ as the sum of $$ \phi $$ and a white Gaussian noise vector with variance $$ 1/n $$ in each dimension. This is, mathematically, a rather illegal way to state the central limit theorem, but a quite convenient way to develop intuition.
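
To make this machinery concrete, here is a minimal NumPy sketch (the helper names `to_phi` and `project` are mine, and the 2-dimensional "family" is just a random subspace): it builds the vector representation, fits by least-squares projection, and draws one empirical distribution to show the scatter of $$ \hat{\phi} $$ around $$ \phi $$.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_phi(p, p0):
    """Vector representation P <-> phi around the reference distribution P0."""
    return (p - p0) / np.sqrt(p0)

def project(phi, basis):
    """pi(phi): least-squares (orthogonal) projection onto span of the basis columns."""
    coeffs, *_ = np.linalg.lstsq(basis, phi, rcond=None)
    return basis @ coeffs

m = 6
p0 = np.full(m, 1 / m)                              # reference distribution P_0
p = np.array([0.20, 0.18, 0.17, 0.16, 0.15, 0.14])  # a nearby true distribution P
phi = to_phi(p, p0)

# A toy 2-dimensional model family, represented locally as a linear subspace.
basis = rng.standard_normal((m, 2))
print("fitting error ||phi - pi(phi)||^2 =", np.sum((phi - project(phi, basis)) ** 2))

# One empirical distribution from n samples: phi_hat scatters around phi,
# with fluctuations on the order of 1/sqrt(n) in each coordinate.
n = 500
phi_hat = to_phi(rng.multinomial(n, p) / n, p0)
print("||phi_hat - phi||^2 =", np.sum((phi_hat - phi) ** 2), " (order |X|/n =", m / n, ")")
```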

With this little machinery, let's now look at the model selection problem.

The Geometry of AIC


This is the picture. $$ \phi $$ is the vector representation of the true distribution $$ P $$, and $$ \hat{\phi} $$ that of the empirical distribution we observe. The difference is the black vector, which, as we stated before, can be thought of as white Gaussian noise with variance $$1/n$$ per dimension. Only two model families are shown here as linear subspaces, $$ {\mathcal P}_1 $$ with dimension 1 and $${\mathcal P}_2 $$ with dimension 2, together with the projections of $$ \phi $$ and $$ \hat{\phi}$$ onto these subspaces.

If we were able to fit $$ \phi $$ directly, then it is clear that the higher dimensional the family, the better: in the figure we see that $$ \pi_2(\phi)$$ is closer to $$ \phi $$ than $$ \pi_1(\phi)$$ is. There is no overfitting issue.

Unfortunately, we do not have access to $$ \phi $$ and can only work with $$ \hat{\phi}$$ and the corresponding projections, shown as the darker colored points in the figure. We can compute $$ \pi_1(\hat{\phi}) $$ and $$ \pi_2(\hat{\phi}) $$. We can also compute the K-L divergence, i.e. the squared distance in the vector space,

$$ \| \pi_i(\hat{\phi}) - \hat{\phi} \|^2, i=1, 2 $$

to quantify how good the fits are. However, we do not really care whether we get a good fit to the empirical distribution $$ \hat{\phi} $$; what we are interested in is whether we have a good fit to the true distribution $$ \phi$$. That is, we would like to compare

$$ \|\pi_1(\hat{\phi}) - \phi \|^2 \gtrless \|\pi_2(\hat{\phi}) - \phi \|^2 $$

For the moment, let's accept that $$ \| \pi_i(\hat{\phi}) - \hat{\phi} \|^2 $$ is a pretty good estimate of $$ \| \pi_i({\phi}) - {\phi} \|^2 $$ (this actually needs a rather careful analysis). From the picture, we can still see a difference between $$ \|\pi_i(\hat{\phi}) - \phi \|^2 $$ and $$ \|\pi_i(\hat{\phi}) - \hat{\phi} \|^2 $$, namely the squared norm of the vector from $$ \pi_i(\phi)$$ to $$ \pi_i(\hat{\phi}) $$.

This is the error in the maximum likelihood estimate of the parameter due to the randomness of the empirical distribution. It is what Akaike pointed out as the bias of the maximum likelihood parameter estimation. From the figure, we can see that this error is the projection of the noise vector (the black vector) onto the model subspace $$ {\mathcal P}_i $$. The squared norm of this error is, on average, larger when we use a higher dimensional model family. Intuitively, with a higher dimensional model family, we are accepting more of the black noise vector, i.e. more of the randomness of the empirical distribution for this specific sample set, into the fitted model. This is overfitting.
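
Here is a tiny simulation of this point (my own, taking the white-noise model above at face value): draw the black noise vector with variance $$1/n$$ per coordinate, project it onto a $$ k $$-dimensional subspace, and observe that the squared norm of the projected noise averages to about $$ k/n $$, growing with the model dimension.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n, trials = 10, 200, 20000

for k in (1, 2, 5):
    # Orthonormal basis of a random k-dimensional model subspace.
    basis = np.linalg.qr(rng.standard_normal((dim, k)))[0]
    # The "black vector": white Gaussian noise with variance 1/n per coordinate.
    noise = rng.standard_normal((trials, dim)) / np.sqrt(n)
    # Squared norm of the projection of the noise onto the subspace.
    sq_norm = np.sum((noise @ basis) ** 2, axis=1)
    print(f"k={k}: avg ||projected noise||^2 = {sq_norm.mean():.5f}   (k/n = {k/n:.5f})")
```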

Can we get rid of this? No, because we only observe $$\hat{\phi}$$ and do not have access to the black noise vector or its components. But because we know that the noise has an average power of $$1/n $$ per dimension, it makes sense to at least account for this average in our metric. That is, we write

$$ \begin{align} \| \pi_i (\hat{\phi}) - \phi \|^2 &= \| \pi_i ({\phi}) - \phi \|^2 + \|\pi_i(\hat{\phi})- \pi_i(\phi)\|^2 \\ &\approx \| \pi_i (\hat{\phi}) - \hat{\phi} \|^2 + \|\pi_i(\hat{\phi})- \pi_i(\phi)\|^2 \end{align} $$

where the first equality is the Pythagorean theorem: $$ \phi - \pi_i(\phi) $$ is orthogonal to the subspace $$ {\mathcal P}_i $$, while $$ \pi_i(\hat{\phi}) - \pi_i(\phi) $$ lies inside it.