Bayesian Information Criterion

Introduction
An alternative to the Akaike Information Criterion (AIC) is the Bayesian Information Criterion (BIC), which uses a slightly different setup to penalize complex model families. We refer readers to the well-written Wikipedia pages on BIC and AIC, and the references therein, for descriptions of these formulations. The goal of my page here is to provide a local geometric view of the problem, which hopefully is easier to understand. If you have not read my page on AIC, please do that first. In particular, the discussion of the basic operations using local geometry will not be repeated here, but will be used in our development. In other words, this page is not a stand-alone page. It is built on the AIC page so that we can focus on the difference between the two ways of addressing the same problem.

Formulation of BIC
For convenience, we set up the model selection problem again, to introduce notation, present the result, and clarify the difference.

To start, suppose we have samples $$ x_1, x_2, \ldots, x_n $$ i.i.d. from an unknown distribution $$ P $$ over a finite alphabet $$ {\mathcal X} $$. Given a parameterized family of models $$ {\mathcal P} = \{ P_\theta \} $$, we choose $$ P_{\theta^*} $$ as an approximation to $$ \widehat{P} $$, the empirical distribution of the observed samples $$ x_1, x_2, \ldots, x_n $$, with

$$ \theta^* = \arg \min_\theta D(\widehat{P}||P_\theta) = \arg\max_\theta {\mathbb E}_{\widehat{P}} [\log P_\theta(X)] $$

and hope that the resulting $$ P_{\theta^*} $$ is a good approximation to the true model $$ P $$. In the above, $$ D(P||Q) $$ is the Kullback–Leibler divergence. This procedure is also known as maximum likelihood estimation.
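
As a quick sanity check (not part of the derivation), here is a small Python sketch with a hypothetical one-parameter family over a three-letter alphabet: a grid search confirms that the $$ \theta $$ minimizing $$ D(\widehat{P}||P_\theta) $$ is the same $$ \theta $$ that maximizes the average log-likelihood of the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples over a finite alphabet {0, 1, 2} from some "unknown" distribution P.
x = rng.choice(3, size=200, p=[0.5, 0.3, 0.2])
p_hat = np.bincount(x, minlength=3) / len(x)            # empirical distribution

# A hypothetical one-parameter family P_theta = (theta, (1-theta)/2, (1-theta)/2).
def p_theta(theta):
    return np.array([theta, (1 - theta) / 2, (1 - theta) / 2])

def kl(p, q):                                           # D(p || q), with 0 log 0 = 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

thetas = np.linspace(0.01, 0.99, 981)
divs = [kl(p_hat, p_theta(t)) for t in thetas]
logliks = [np.mean(np.log(p_theta(t)[x])) for t in thetas]

# The theta minimizing the K-L divergence to the empirical distribution is
# the same theta that maximizes the average log-likelihood of the samples.
print(thetas[np.argmin(divs)], thetas[np.argmax(logliks)])
```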

The model selection problem is concerned with the question of how large a family of models we should use to choose the fit from. For that, we assume there is a sequence of nested model families $$ {\mathcal P}_1 \subset {\mathcal P}_2 \subset \ldots \subset {\mathcal P}_K $$, where the $$ k^{th} $$ family

$$ {\mathcal P}_k = \{ P_{\theta^k} : \theta^k \in {\mathbb R}^k \} $$

has a $$ k $$-dimensional parameter.

Clearly, the larger $$ k $$ is, the better fit we can find. That is, as $$ k $$ increases, the resulting minimal K-L divergence $$ D(\hat{P} || P_{\theta^*}) $$ decreases, since the optimization is taken over a larger and larger space. This, however, can result in overfitting: using a large model family ends up fitting the random realization of the empirical distribution $$ \hat{P} $$, not the true model $$ P $$, which means a larger generalization error when we work with a different set of samples.
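
To see this effect concretely, here is a toy Python sketch (my own construction, with arbitrary numbers, not essential to the argument) using a hypothetical nested family over a 10-letter alphabet: $$ {\cal P}_k $$ leaves the first $$ k $$ symbol probabilities free and spreads the remaining mass uniformly over the other symbols. The in-sample K-L divergence to $$ \hat{P} $$ can only decrease with $$ k $$, while the divergence from the true model to the fitted model eventually gets worse as the larger families chase sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, trials = 10, 200, 100
# A true model that actually lies in the k = 2 family:
# two "free" symbols, uniform tail over the remaining 8 symbols.
p_true = np.array([0.3, 0.2] + [0.5 / 8] * 8)

def kl(p, q):                       # D(p || q), with 0 log 0 = 0
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def fit(p_hat, k):
    # MLE in family k: first k probabilities free, remaining mass uniform.
    q = np.empty(d)
    q[:k] = p_hat[:k]
    q[k:] = (1.0 - p_hat[:k].sum()) / (d - k)
    return q

ks = list(range(1, d))
train = np.zeros(len(ks))   # D(p_hat || q_k): the in-sample fitting loss
true_kl = np.zeros(len(ks)) # D(p_true || q_k): the generalization loss
for _ in range(trials):
    x = rng.choice(d, size=n, p=p_true)
    p_hat = np.bincount(x, minlength=d) / n
    for i, k in enumerate(ks):
        q = fit(p_hat, k)
        train[i] += kl(p_hat, q) / trials
        true_kl[i] += kl(p_true, q) / trials

for i, k in enumerate(ks):
    print(f"k={k}  train KL={train[i]:.4f}  true KL={true_kl[i]:.4f}")
# train KL decreases monotonically in k, while true KL bottoms out around
# k = 2 and then creeps back up: the larger families overfit.
```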

In the story of AIC, we used a local geometric analysis to show that a better choice is to solve

$$ \min_k \left[ \min_{\theta^k \in {\mathbb R}^k} D(\hat{P} || P_{\theta^k}) + \frac{k}{n} \right] $$

where, on top of the K-L divergence, we add a penalty term $$ \frac{k}{n} $$ to penalize the use of large model families, especially when the number of samples $$ n $$ is limited.

A similar but slightly different criterion commonly used is the Bayesian Information Criterion, which optimizes

$$ \min_k \left[ \min_{\theta^k \in {\mathbb R}^k} D(\hat{P} || P_{\theta^k}) + \frac{1}{2}\frac{k \log n}{n} \right] $$

where a different penalty term $$ \frac{1}{2} \frac{k \log n }{n} $$ is used. Our goal here is to derive this criterion, again with the local geometric analysis, which should make clear how the formulations of the problem differ, and perhaps offer insight on which criterion to choose when we face a real-life problem.
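
Before the derivation, it may help to see how the two penalties compare numerically; the small sketch below is just arithmetic on the two formulas above.

```python
import numpy as np

# Compare the AIC penalty k/n with the BIC penalty (1/2) * k * log(n) / n.
# Their ratio is log(n)/2, independent of k, so BIC penalizes complexity
# more heavily whenever log(n) > 2, i.e. for n >= 8.
k = 3
for n in [10, 100, 1000, 10000, 100000]:
    aic_pen = k / n
    bic_pen = 0.5 * k * np.log(n) / n
    print(f"n={n:>6}  AIC penalty={aic_pen:.6f}  "
          f"BIC penalty={bic_pen:.6f}  ratio={bic_pen / aic_pen:.2f}")
```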

Now here comes the difference between the formulations of AIC and BIC. In both cases, we try to minimize the fitting error $$ D(P || P_{\theta^*}) $$, but we do not have access to the true model $$ P $$. In the AIC story, what we essentially did was to replace $$ D(P || P_{\theta^*}) $$ by $$ \mathbb{E}[D(P || P_{\theta^*})] $$, where the expectation is taken over the posterior distribution of $$ P $$ given the empirical distribution $$ \widehat{P} $$, which is approximated as a normal distribution. This was done in the two labeled equations (1) and (2) of that derivation.

In BIC, we view the model selection problem as a hypothesis testing problem, where hypothesis $$ H_k $$ says that the true model $$ P \in {\cal P}_k $$, for $$ k = 1, 2, \ldots, K $$. The likelihoods can be written as

$$ \begin{align} {\mathbb P} ( \widehat{P} | H_k ) &= {\mathbb E} \left[ {\mathbb P} ( \widehat{P} | P) \,\middle|\, H_k \right] = \int f( P | P \in {\cal P}_k) \cdot {\mathbb P} ( \widehat{P} | P) \, dP \end{align} $$

which, in words, says that we need to assume a distribution of the true model, given that it lies in $$ {\cal P}_k $$, which we denote as $$ f( P | P \in {\cal P}_k)\, dP $$; and we need to average the likelihood of observing the given empirical distribution $$ \widehat{P} $$ over $$ P $$ according to this measure. Of course, we would then maximize over $$ k $$ after computing all these likelihoods. It turns out that the precise assumption on the measure $$ f $$ is not important in the analysis, as we will see using the local geometric analysis.
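
As an illustration of this averaging (with made-up counts, a hypothetical one-parameter family, and a uniform prior standing in for $$ f $$), the marginal likelihood can be estimated by Monte Carlo: draw $$ P $$ from the assumed prior and average the probability of the observed counts under $$ P $$.

```python
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(0)

# Made-up observed counts over a three-letter alphabet (p_hat = counts / n).
counts = np.array([52, 30, 18])
n = counts.sum()

# Hypothesis H_1: the true model lies in the hypothetical one-parameter family
# P_theta = (theta, (1-theta)/2, (1-theta)/2), with a uniform prior on theta.
def p_theta(theta):
    return np.array([theta, (1 - theta) / 2, (1 - theta) / 2])

# Monte Carlo estimate of  P(p_hat | H_1) = E_theta[ P(counts | P_theta) ].
thetas = rng.uniform(0, 1, size=10000)
lik = np.array([multinomial.pmf(counts, n, p_theta(t)) for t in thetas])
print("estimated marginal likelihood under H_1:", lik.mean())
```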

The Local View
Before we proceed, we first translate everything into the local notation. As in the AIC case, we assume that every distribution involved in the problem lives in the same neighborhood. We write the empirical distribution in vector form as $$ \widehat{P} \leftrightarrow \widehat{\phi} $$. For each $$ {\cal P}_k $$, we write $$ \pi_k ( \widehat{\phi} )$$ for the projection of $$ \widehat{\phi} $$ onto the linearized neighborhood of $$ {\cal P}_k $$; $$ \pi_k ( \widehat{\phi} )$$ corresponds to $$ P_{\theta^*} $$ when the optimization is over the $$ k $$-dimensional parameter space. $$ \widehat{\phi}, \pi_1 ( \widehat{\phi}), \pi_2 ( \widehat{\phi}), \ldots, \pi_K ( \widehat{\phi}) $$ are all quantities we observe. In the figure, the two cases $$ k= 1, 2 $$ are shown. Unlike in the AIC, we do not explicitly include the true model $$ P \leftrightarrow \phi $$ in our picture.



Now, conditioned on a hypothesis $$ H_k: P \in {\cal P}_k $$, write the difference as $$ \xi := \pi_k(\widehat{\phi}) - \phi $$. The only thing we know is that $$ \xi $$ lies in the linearized neighborhood of $${\cal P}_k $$, which is a $$ k $$-dimensional subspace. Suppose that $$ \xi $$ follows a distribution $$ f_\xi $$, and write

$$ \begin{align} {\mathbb P} [\widehat{P} | H_k ] & = \int f_{\xi} (\xi) \cdot {\mathbb P} ( \widehat{\phi} | \phi = \pi_k(\widehat{\phi}) - \xi) d\xi\\ &= \int f_{\xi}(\xi) \cdot \exp[ - \frac{n}{2} \|\widehat{\phi} - \pi_k(\widehat{\phi}) + \xi \|^2 ] d\xi \\ &= \int f_{\xi}(\xi) \cdot \exp[ - \frac{n}{2} \|\widehat{\phi} - \pi_k(\widehat{\phi})\|^2] \cdot \exp[- \frac{n}{2} \| \xi \|^2 ] d\xi \end{align} $$

where in the last step we used the fact that $$ \xi \perp (\widehat{\phi} - \pi_k(\widehat{\phi})) $$: $$ \xi $$ is the difference between the true model $$ \phi $$ and the projection $$ \pi_k(\widehat{\phi}) $$, both of which lie in the linearized $$ \mathcal P_k$$, while $$\widehat{\phi} - \pi_k(\widehat{\phi}) $$ is the projection error, which is orthogonal to $$ \mathcal P_k$$.
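
This orthogonality, and the resulting split of the squared distance inside the exponent, can be checked numerically; in the sketch below an arbitrary random subspace stands in for the linearized $$ {\cal P}_k $$.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 8, 3
# A random k-dimensional subspace standing in for the linearized P_k.
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal columns

phi_hat = rng.standard_normal(d)                       # "observed" point
proj = basis @ (basis.T @ phi_hat)                     # pi_k(phi_hat)
phi = basis @ rng.standard_normal(k)                   # a "true model" inside P_k
xi = proj - phi                                        # lies in the subspace

resid = phi_hat - proj                                 # projection error, orthogonal to P_k
print(np.dot(resid, xi))                               # ~ 0: orthogonality
# Pythagorean split used inside the exponent:
print(np.linalg.norm(phi_hat - phi) ** 2,
      np.linalg.norm(resid) ** 2 + np.linalg.norm(xi) ** 2)
```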

This is a good place to point out the main difference between AIC and BIC: the randomness in the true model $$ P $$ enters the problem differently. In AIC, we first compute the average of $$ \| \phi - \widehat{\phi} \|^2 $$ and then use this average value as a correction term to the observed fitting loss $$ \| \widehat{\phi} - \pi_k(\widehat{\phi})\|^2 $$; in the notation of the derivation above, we would first evaluate $$ {\mathbb E}[\| \xi \|^2] = k/n $$ by the asymptotic normality of $$ \widehat{\phi} $$, and then simply add this value to $$ \|\widehat{\phi} - \pi_k(\widehat{\phi})\|^2 $$. In BIC, however, we use each possible value of $$ \phi - \widehat{\phi} $$, and the resulting value of $$ \xi $$, to correct $$ \| \widehat{\phi} - \pi_k(\widehat{\phi})\|^2 $$ inside the exponent, and only then take the average over $$\xi $$.

Now we use a further approximation: because $$ n $$ is large, the only values of $$ \xi $$ that matter in the integral are those with $$ \|\xi\|^2 \leq O(1/n) $$. We draw the range of such values as the pink areas in the figure. Note that the dimensionality of this range of unknown values of $$ \xi $$ is different for different $$ k $$.

Consequently, the precise model of $$ f_\xi $$ is not important, since the only thing that matters is $$ f_\xi (0) $$, and this assumed value does not depend on $$ n $$. Now we write

$$ \begin{align} {\mathbb P} [\widehat{P} | H_k ] & \approx \int f_\xi (0) \cdot \exp[ - \frac{n}{2} \|\widehat{\phi} - \pi_k(\widehat{\phi})\|^2] \cdot \exp[- \frac{n}{2} \| \xi \|^2 ] d\xi \\ &= f_\xi(0) \cdot \left(\sqrt{2 \pi \frac{1}{n}}\right)^k \cdot \exp[ - \frac{n}{2} \|\widehat{\phi} - \pi_k(\widehat{\phi})\|^2] \end{align} $$

and

$$ \begin{align} - \frac{1}{n} \log {\mathbb P} [\widehat{P} | H_k ] &= \frac{1}{2} \|\widehat{\phi} - \pi_k(\widehat{\phi})\|^2 + \frac{1}{2}\frac{k \log n}{n} + O\left(\frac{1}{n}\right) \leftrightarrow D( \widehat{P} || P_{\theta^*} ) + \frac{1}{2}\frac{k \log n}{n} \end{align} $$

which is what we optimize according to BIC.
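
As a quick numerical check (with arbitrary made-up values for $$ \|\widehat{\phi} - \pi_k(\widehat{\phi})\|^2 $$ and $$ f_\xi(0) $$): the Gaussian integral in the previous step indeed evaluates to $$ (2\pi/n)^{k/2} $$, and the exact $$ -\frac{1}{n}\log $$ of the approximate likelihood differs from $$ D + \frac{1}{2}\frac{k \log n}{n} $$ only by an $$ O(1/n) $$ term.

```python
import numpy as np
from scipy.integrate import quad

k = 3

# The Gaussian factor: each coordinate of xi integrates to sqrt(2*pi/n), so the
# k-dimensional integral of exp(-n/2 * ||xi||^2) equals (2*pi/n)^(k/2).
n = 200
one_dim, _ = quad(lambda t: np.exp(-0.5 * n * t * t), -1, 1)  # tails beyond [-1, 1] are negligible
print(one_dim ** k, (2 * np.pi / n) ** (k / 2))               # the two agree

# Arbitrary made-up values: the observed fitting loss ||phi_hat - pi_k(phi_hat)||^2
# and the prior density f_xi(0).
D2, f0 = 0.004, 0.7
for n in [100, 1000, 10000, 100000]:
    exact = -(np.log(f0) + (k / 2) * np.log(2 * np.pi / n) - (n / 2) * D2) / n
    bic = D2 / 2 + 0.5 * k * np.log(n) / n
    print(f"n={n:>6}  exact={exact:.6f}  BIC form={bic:.6f}  n*(gap)={n * (exact - bic):.3f}")
# n*(gap) stays bounded, so the two expressions differ by O(1/n) as claimed.
```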

In most other derivations of BIC you can find, chances are that Laplace's method is used and the calculation involves the Fisher information. That is partly the point of this page: the essential difference between the two criteria seems to be just a change in the order of taking the expectation.