Conjugate Priors

Introduction
The conjugate prior is a relatively well understood concept. Usually it comes from some kind of table that tells you which parameterized prior family to use for a given observation model. For example, if you have a Bernoulli or Binomial observation with the probability of success as the unknown parameter, you should use a Beta distribution to model that parameter; if you have a Poisson distributed observation with an unknown rate, you should use a Gamma distribution for the rate; a normal observation with an unknown mean should have the mean modeled as normal too, while a normal observation with a known mean and unknown covariance should have the covariance modeled by a Wishart or inverse Wishart distribution, etc.
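As a quick sanity check on the first row of this table, here is a minimal numerical sketch (the hyper-parameter values and the sample are made up for illustration, and it assumes numpy and scipy are available): the brute-force posterior computed on a grid coincides with the Beta density with updated hyper-parameters.

```python
# A minimal numerical check of Beta/Bernoulli conjugacy: the brute-force
# posterior over a grid matches the Beta pdf with updated hyper-parameters.
# (Illustrative sketch; hyper-parameter values and the sample are made up.)
import numpy as np
from scipy import stats

alpha, beta = 2.0, 3.0             # Beta prior hyper-parameters
y = np.array([1, 0, 1, 1, 0, 1])   # Bernoulli observations
k, n = y.sum(), len(y)

x = np.linspace(1e-6, 1 - 1e-6, 2001)   # grid over the unknown success probability
prior = stats.beta.pdf(x, alpha, beta)
likelihood = x**k * (1 - x)**(n - k)    # i.i.d. Bernoulli likelihood
posterior = prior * likelihood
posterior /= posterior.sum() * (x[1] - x[0])   # normalize numerically

closed_form = stats.beta.pdf(x, alpha + k, beta + n - k)  # the conjugate update
print(np.abs(posterior - closed_form).max())   # small: same distribution
```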

Having such a table by no means implies that these choices are some kind of empirical facts. On the contrary, the choice of conjugate priors is a well understood theoretical matter, with systematic solutions. The best paper I know on this topic is by Diaconis and Ylvisaker.

The reason I would like to write about this is as follows. In some teachings of the topic, one encounters the intuitive statement that the conjugate prior family is analogous to the "eigenvectors" of the observation model, and there are mathematical statements that clearly suggest such an analogy. This is, however, not always clearly explained. I believe there is a local geometric story to be told here, in which the idea of an "eigenvector" arises in a clean and simple way; and when you compare that with the general, not-so-local story, it becomes clear where the difficulty comes from. In other words, I am not really writing about conjugate priors here; rather, this is yet another example where having the local geometric view is a good thing.

The Setup
By an observation model we mean a conditional distribution $$ p_{\sf y|x} $$. A conjugate prior family for a given observation model is a parameterized family of distributions $$ {\cal P} = \{ p_\theta \} $$ on $$ {\cal X} $$, such that if the prior distribution $$ p_{\sf x} \in {\cal P} $$, then for any collection of observations $$ {\sf y}_1^n = y_1^n $$, the posterior distribution

$$ p_{\sf x|y_1^n}(x |y_1^n) = \frac{p_{\sf x}(x) \cdot \prod_{i=1}^n p_{\sf y|x}(y_i|x)}{p_{\sf y_1^n}(y_1^n)} \quad x \in {\cal X} $$,

also lies in the family. The parameter $$ \theta $$ is called the hyper-parameter. The practical significance of a conjugate prior family is that the update of belief, i.e. the update of the distribution of the unknown random variable $$ {\sf x} $$ as more and more samples $$ {\sf y}_1, \ldots, {\sf y}_n, \ldots $$ are observed, can be carried out by simply updating the hyper-parameter. The hyper-parameter can thus be viewed as a sufficient statistic for estimating $$ {\sf x} $$ from the observed $$ {\sf y} $$ samples. An important property is that this sufficient statistic keeps the same form regardless of how many $$ {\sf y} $$ samples we observe; this is a property enjoyed only when the observation models $$ p_{\sf y|x}(\cdot | x) $$, for all $$ x $$, lie in an exponential family. This nice fact, however, is not the topic of this page.
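To see this hyper-parameter bookkeeping concretely, take the Poisson-Gamma pair from the table above: with a $$ \text{Gamma}(\alpha, \beta) $$ prior on the unknown rate $$ x $$ and $$ n $$ Poisson observations,

$$ p_{\sf x|y_1^n}(x | y_1^n) \propto x^{\alpha - 1} e^{-\beta x} \cdot \prod_{i=1}^n \frac{x^{y_i} e^{-x}}{y_i!} \propto x^{\alpha + \sum_i y_i - 1} \cdot e^{-(\beta + n) x}, $$

which is again a Gamma density, with the update $$ (\alpha, \beta) \mapsto (\alpha + \sum_i y_i, \beta + n) $$; no matter how many samples arrive, the data enters the posterior only through this pair.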

The Local Setup
Even though there is nothing inherently local about this problem, we will nevertheless make a local argument as an approximation. Suppose the observation model $$ r_{\sf y|x} $$ is given. Let the prior family be $$ {\cal P} = \{ p_{\sf x}(\cdot; \theta)\} $$, where $$ \theta $$ is the hyper-parameter, which for the purpose of showing the main idea we assume to be a scalar, though it can of course be higher dimensional. We do assume that the family is smooth, in the sense that the derivative w.r.t. $$ \theta $$ always exists.

We pick a reference x-distribution $$ r_{\sf x} (x) = p_{\sf x} (x; \theta_0) $$, and the corresponding $$ r_{\sf y}(y) = \sum_x r_{\sf x}(x) \cdot r_{\sf y|x}(y|x) $$. We use the product distribution $$ r_{\sf x} \cdot r_{\sf y} $$ as the reference distribution. We nevertheless write $$ r_{\sf xy} = r_{\sf x} \cdot r_{\sf y|x} = r_{\sf y} \cdot r_{\sf x|y} $$, with the letter $$ r $$ instead of $$ p $$, for the joint distribution with the prior at $$ \theta_0 $$ and no observations. We use $$ p $$ for distributions with a different prior and possibly some observations, while following the same observation model. The point is to keep track of the difference. The local assumption here is that $$ p_{\sf xy} $$ is close to the reference $$ r_{\sf x} \cdot r_{\sf y} $$.

In particular, we can define

$$ \epsilon \cdot B(x, y) \stackrel{\Delta}{=} \frac{r_{\sf x}(x) \cdot r_{\sf y|x}(y|x) - r_{\sf x}(x)\cdot r_{\sf y}(y)}{\sqrt{r_{\sf x}(x)\cdot r_{\sf y}(y)}} $$

which is what we called the canonical dependence matrix. It is often used in equivalent forms:

$$ \begin{align*} r_{\sf y|x}(y|x) &= r_{\sf y}(y) + \epsilon \cdot \sqrt{r_{\sf y}(y)} \cdot \left( \frac{B(x,y)}{\sqrt{r_{\sf x}(x)}} \right)\\ r_{\sf x|y}(x|y) &= r_{\sf x}(x) + \epsilon \cdot \sqrt{r_{\sf x}(x)} \cdot \left( \frac{B(x,y)}{\sqrt{r_{\sf y}(y)}} \right) \end{align*} $$
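To make $$ B $$ concrete, here is a toy numerical sketch in Python (my own construction, with made-up names such as D for the perturbation direction): build a weak discrete channel around a reference pair $$ (r_{\sf x}, r_{\sf y}) $$, form $$ B $$ from the definition, and check the first expansion above, along with the fact that $$ B $$ is orthogonal to $$ \sqrt{r_{\sf x}} $$ and $$ \sqrt{r_{\sf y}} $$ (which follows from $$ r_{\sf y|x}(\cdot|x) $$ and $$ r_{\sf xy} $$ being properly normalized).

```python
# Toy discrete check of the canonical dependence matrix B (my own construction).
# Build a weak channel r_{y|x} = r_y + eps*D around a reference, compute B
# from its definition, and verify the expansion and orthogonality properties.
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 4, 5

r_x = rng.dirichlet(10 * np.ones(nx))   # reference prior r_x
r_y = rng.dirichlet(10 * np.ones(ny))   # reference marginal r_y

# Perturbation D with rows summing to 0 (so channel rows sum to 1) and
# r_x-weighted columns summing to 0 (so r_y is the true marginal under r_x).
D = rng.standard_normal((nx, ny))
D -= D.mean(axis=1, keepdims=True)
D -= r_x @ D
D /= np.abs(D).max()
eps = 0.5 * r_y.min()                   # small enough to keep the channel positive
r_y_given_x = r_y + eps * D

joint = r_x[:, None] * r_y_given_x      # r_xy = r_x * r_{y|x}
prod = np.outer(r_x, r_y)               # reference product r_x * r_y
B = (joint - prod) / (eps * np.sqrt(prod))   # definition of B

print(np.abs(np.sqrt(r_x) @ B).max())   # ~0: B orthogonal to sqrt(r_x)
print(np.abs(B @ np.sqrt(r_y)).max())   # ~0: B orthogonal to sqrt(r_y)

# First equivalent form: r_{y|x} = r_y + eps*sqrt(r_y)*B/sqrt(r_x)
recon = r_y + eps * np.sqrt(r_y) * B / np.sqrt(r_x)[:, None]
print(np.abs(recon - r_y_given_x).max())     # ~0: exact by construction
```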

Now suppose a different prior corresponds to some different $$ \theta $$, which we write as

$$ p_{\sf x}(x) = r_{\sf x}(x) + \epsilon \cdot \sqrt{r_{\sf x}(x)} \cdot \phi_0(x) $$

where $$ \phi_0 $$ is what we called an information vector. Note that $$ \sum_x \sqrt{r_{\sf x}(x)} \cdot \phi_0(x) = 0 $$, since both $$ p_{\sf x} $$ and $$ r_{\sf x} $$ sum to one; we will use this fact below.

Update Belief with Observations
Now suppose we observe a single sample $$ {\sf y} = y $$, and let's write out the posterior distribution of $$ {\sf x} $$ in the local notation:

$$ \begin{align*} p_{\sf x|y}(x|y) &= \frac{p_{\sf x}(x) \cdot r_{\sf y|x}(y|x)}{p_{\sf y}(y)} \\ &= \frac{\left( r_{\sf x}(x) + \epsilon \cdot \sqrt{r_{\sf x}(x)} \cdot \phi_0(x)\right)\cdot \left(r_{\sf y}(y) + \epsilon \cdot \sqrt{r_{\sf y}(y)} \cdot \left( \frac{B(x,y)}{\sqrt{r_{\sf x}(x)}} \right)\right)}{r_{\sf y}(y) \cdot (1 + O(\epsilon^2))}\\ &= r_{\sf x}(x) + \epsilon \cdot \sqrt{r_{\sf x}(x)} \cdot \left( \phi_0(x) + \left( \frac{B(x,y)}{\sqrt{r_{\sf y}(y)}}\right)\right) + O(\epsilon^2) \end{align*} $$

where we used the fact that in the denominator, $$ p_{\sf y}(y) = r_{\sf y}(y) (1 + O(\epsilon^2)) $$. This is intuitive, since $$ p_{\sf x} - r_{\sf x} $$ is a small perturbation of order $$ \epsilon $$, and the observation model $$ r_{\sf y|x} $$ is also weak, $$ O(\epsilon) $$-close to independent. Concretely, writing $$ p_{\sf y}(y) = \sum_x p_{\sf x}(x) \cdot r_{\sf y|x}(y|x) $$, both first-order terms vanish, because $$ \sum_x \sqrt{r_{\sf x}(x)} \cdot \phi_0(x) = 0 $$ and $$ \sum_x \sqrt{r_{\sf x}(x)} \cdot B(x,y) = 0 $$. If this is not convincing enough, I have included the detailed calculation in this.
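For what it is worth, the same toy construction as above (again with my own made-up variable names) lets one check this expansion numerically: the exact Bayes posterior and the first-order expression differ by a residual that shrinks roughly fourfold when $$ \epsilon $$ is halved, consistent with the $$ O(\epsilon^2) $$ claim.

```python
# Numerical check of the first-order posterior expansion: exact Bayes with a
# perturbed prior vs r_x + eps*sqrt(r_x)*(phi0 + B(.,y)/sqrt(r_y)).
# Halving eps should roughly quarter the residual (it is O(eps^2)).
import numpy as np

rng = np.random.default_rng(1)
nx, ny = 4, 5
r_x = rng.dirichlet(10 * np.ones(nx))
r_y = rng.dirichlet(10 * np.ones(ny))

# Weak channel direction D, as in the previous sketch.
D = rng.standard_normal((nx, ny))
D -= D.mean(axis=1, keepdims=True)
D -= r_x @ D
D /= np.abs(D).max()

# Information vector phi0: any direction orthogonal to sqrt(r_x),
# so that p_x = r_x + eps*sqrt(r_x)*phi0 still sums to 1.
phi0 = rng.standard_normal(nx)
phi0 -= (phi0 @ np.sqrt(r_x)) * np.sqrt(r_x)
phi0 /= np.abs(phi0).max()

y = 2                                   # condition on a single observation y
for eps in (0.02, 0.01):
    r_y_given_x = r_y + eps * D
    B = (r_x[:, None] * r_y_given_x - np.outer(r_x, r_y)) \
        / (eps * np.sqrt(np.outer(r_x, r_y)))
    p_x = r_x + eps * np.sqrt(r_x) * phi0          # perturbed prior
    exact = p_x * r_y_given_x[:, y]
    exact /= exact.sum()                           # exact posterior p_{x|y}
    approx = r_x + eps * np.sqrt(r_x) * (phi0 + B[:, y] / np.sqrt(r_y[y]))
    print(eps, np.abs(exact - approx).max())       # residual ~ O(eps^2)
```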