- PredApp

# Approximate the posterior not the model

Many people like to use Gaussian process (GP) models. Even the most expert deep learning researchers will use a GP for regression problems with small numbers of data points (Bengio 2016) or in the context of optimising hyper-parameters of deep learning models and algorithms (Snoek et al. 2012). For larger datasets, researchers have spent years trying to make reasonable approximations so that they can use GPs in those settings too: sacrificing some accuracy for computability. A question I often get asked is: "there are so many GP approximations, which should we use !?

I normally use variational approximation. This post explains my reasons and proposes a shift in how sparse GPs might be perceived: from model approximations to posterior approximations.

### Sparseness, pseudo-inputs and inducing points

The research of scaling GPs to larger datasets is often covered by the catch-all term ‘sparse GPs’. The term stems from some of the original work in this area, where methods used a subset of the dataset to approximate the kernel matrix (e.g. Williams and Seeger 2000). Sparseness occurred through selection of the datapoints. Given a dataset of inputs \(\mathbf X = [\mathbf x_i]_{i=1}^N\), \(\mathbf x_i \in \mathcal X\), and responses (targets) \(\mathbf y = [y_i]_{i=1}^N\), the task became to select a smaller set \(\mathbf Z \subset \mathbf X\) and \(\tilde{\mathbf y} \subset \mathbf y\) of the training data, and to compute the Gaussian process on that subset instead.

Then, Ed Snelson and Zoubin Ghahramani (2006) had a great idea, which was to generalise the set of selected points so that they did not have to be a subset of the training dataset. They called these new points ‘pseudo inputs’, since \(\mathbf z_i \in \mathcal X\), but not necessarily \(\mathbf z_i \in \mathbf X\). A neat thing about this approach was that the locations of these pseudo-inputs could be found by (unconstrained) gradient-based optimisation of the marginal likelihood. This idea stuck: the method was renamed ‘FITC’ (fully independent training conditional) by (Quiñonero-Candela and Rasmussen 2005), and has been the workhorse of sparse Gaussian processes ever since.

### Model Approximations

Quiñonero-Candela and Rasmussen's paper nicely provided an overview of all the sparse (or pseudo-input) GP methods in use at the time, and it showed that they could be interpreted as alternative models. If we consider the Gaussian process model as a prior on functions, then many sparse GPs can be considered as alternative priors on functions.

But this view conflates the model and the approximation and makes it hard to know whether to attribute performance (or problems) to one or the other. For example, we sometimes see that FITC *outperforms* standard GP regression. This might sound like a bonus, but it's actually rather problematic. It causes problems for users because we are no longer able to criticise the model.

Let me elaborate: suppose we fit a model using the FITC approximation, and it performs worse than desired. Should we modify the model (switch to a different kernel) or should we modify the approximation (increase the number of inducing points)? Equally, suppose we fit a model using FITC and it performs really well. Is that because we have made an excellent kernel choice, or because pathologies in the FITC approximation are hiding mis-specifications in the model? Let's look at some concrete examples:

In replicating the ‘Generalised FITC’ paper (Naish-Guzman and Holden 2007), Ricardo Andrade (now at UCSF) found issues replicating the results of the paper. The model seemed to overfit. It turned out that stopping the optimisation of the inducing inputs after only 100 iterations (as specified in Naish-Guzman’s article) made the method work. The FITC approximation conflated with early stopping, which made it hard to know what was wrong.

The FITC approximation gives rise to a heteroscedastic effect (see Ed Snelson’s thesis (2006) for an in-depth discussion). This seems to help on some datasets. But how are we to know if a dataset needs a heteroscedastic model? If I want such a model, I want to specify it, understand it and subject it to model comparison and criticism.

There is an accumulation of evidence that the FITC method overfits. Alex Matthews has a tidy case study in his __thesis__ for the regression case. The FITC approximation will give us the real posterior *if the inducing points are placed at the data points*, but optimising the locations of the inducing points will not necessarily help. In fact, Alex demonstrated that even when initialised at the perfect solution \(\mathbf Z = \mathbf X\), the FITC objective encourages \(\mathbf Z\) to move away from this solution. Quiñonero-Candela and Rasmussen's explanation of FITC as an alternative model (prior) explains why: the inducing points are parameters of a very large model, and optimising those parameters can lead to overfitting.

### Posterior Approximations

In 2009, Michalis Titsias published a paper that proposed a different approach: “Variational Learning of Inducing Variables in Sparse Gaussian Processes” (Titsias 2009). This method does not quite fit into the unifying view proposed by Quiñonero-Candela. The key idea is to *construct a variational approximation to the posterior process*, and learn the pseudo-points \(\mathbf Z\) alongside the kernel parameters by maximising the evidence lower bound (ELBO), i.e. a lower bound on the log-marginal likelihood. There was quite a bit of prior art on variational inference in Gaussian processes (e.g. Csató and Opper 2002; Seeger 2003): Titsias’ important contribution was to treat \(\mathbf Z\) as parameters of the variational approximation, rather than model parameters.

Now we understand that the variational method proposed by Titsias (and other related methods, e.g. Opper and Archambeau 2009) minimises the Kullback-Leibler (KL) divergence between the posterior GP approximation and the true posterior GP, where the KL is defined *across the whole process *(Matthews et al. 2016, Matthews 2017). This gives us a solid foundation on which to build approximations to the posterior of many Gaussian process models, including GP classification, GP state-space models, and more.

The construction is as follows. If the GP prior is

$$p( f(\cdot) ) = \mathcal{GP}\big(0,\, k(\cdot, \cdot)\big)$$

with the likelihood

$$p(\mathbf y \,|\, f(\cdot), \mathbf X ) = \prod_i p(y_i \,|\, f(\mathbf x_i) )\,$$

then we can give an approximation to the posterior process as

$$q(f(\cdot)) = \mathcal{GP}\big( \mu(\cdot),\, \sigma(\cdot, \cdot)\big)\,,$$

where \(\mu(\cdot)\) and \(\sigma(\cdot,\cdot)\) are functions that depend on the pseudo-points and maybe other parameters, too. Then, we ‘just’ do variational Bayes. The beauty of this is that it is totally clear that improving the ELBO will improve the approximation: we are (almost) completely free to do whatever we want with \(\mu(\cdot)\) and \(\sigma(\cdot,\cdot)\). However, to be able to *compute* the ELBO, we do need to pick particular forms for \(\mu(\cdot)\) and \(\sigma(\cdot,\cdot)\). The most straightforward way to do this is $$ \mu(\cdot) = k(., \mathbf Z) k(\mathbf Z, \mathbf Z)^{-1} \mathbf m\\ \sigma(\cdot, \cdot) = k(\cdot, \cdot) - k(\cdot, \mathbf Z)[k(\mathbf Z, \mathbf Z)^{-1} - k(\mathbf Z, \mathbf Z)^{-1}\Sigma k(\mathbf Z, \mathbf Z)^{-1}]k(\mathbf Z, \cdot) $$ where \(\mathbf m\), \(\mathbf \Sigma\) and \(\mathbf Z\) are variational parameters to be adjusted in order to maximise the ELBO, and thus reduce the Kullback-Leibler divergence \(\mathcal{KL}\big[q(f(\cdot))\,||\,p(f(\cdot)\,|\,\mathbf y, \mathbf X)\big]\).

### Advantages

These expressions appear superficially similar to the approximate-prior approach, but the underpinning idea is where most of the value lies. What we are doing is variational Bayes with the whole Gaussian process. Some immediate advantages include:

The approximation is nonparametric. This means that behaviour of the predictions away from the data takes the same form as the true posterior (usually increasing predictive variance away from the data).

We can always add more inducing points: since the solution with \(M\) pseudo points could always be captured by the one with \(M+1\) pseudo points, the quality of the approximation increases monotonically with the number of pseudo points.

The pseudo-inputs \(\mathbf Z\) (and their number!) are parameters of the approximation, not parameters of the model. This means that we can apply whatever strategy we want to the optimisation of \(\mathbf Z\): if the ELBO increases, then the approximation must be better (in the KL sense). So \(\mathbf Z\) are protected from overfitting.

It is clear what to do at predict time. The whole process is included in the approximation, so we simply have to evaluate the approximate posterior GP to make a prediction. This has been unclear in the past, when some authors have made the distinction of ‘extra approximations’ being required for prediction.

### Outlook

Variational Bayes gives a solid foundation for research in approximate inference for Gaussian processes. VB is making a serious comeback at the moment: stochastic variational inference lets us apply VB to large datasets; recognition models let us make efficient representations of approximate posteriors in latent variable models, and normalising flows (Rezende and Mohamed 2015) let us make the approximating posteriors very flexible. In my view, these technologies are all possible because VB genuinely turns inference into optimisation: a better ELBO (evidence lower bound) is always better. I am looking forward to seeing more GP models being solved by VB as new ideas filter into the GP community. Equally, I am looking forward to GPs becoming ‘the standard’ building block for modelling functions within models as the variational approximation fits within wider inference schemes.