Training with the marginal likelihood
Data augmentation aims to make the predictions of a classification system insensitive to modifications of the input that do not change its identity.
When classifying handwritten digits, for example, we know that small rotations or scale changes do not change what digit is represented. We want our machine learning method to use this same knowledge.
Currently, data augmentation works by feeding a machine learning method many examples of hand-chosen transformations, with the right transformations found by trial and error. Instead, we want to learn automatically which transformations the model should be insensitive to.
We show that we can do this by maximising the marginal likelihood: a quantity that can be computed for certain probabilistic models, and that correlates better with generalisation performance than the usual training loss.
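For GP regression with Gaussian noise, the log marginal likelihood has a well-known closed form, which is what makes it usable as a training criterion. A minimal NumPy sketch (illustrative only, not the paper's code; the function name is ours):

```python
import numpy as np

def log_marginal_likelihood(K, y, noise=0.1):
    """Closed-form GP log marginal likelihood:
    log p(y) = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - n/2 log(2 pi),
    with Ky = K + noise^2 I, computed stably via a Cholesky factor."""
    n = len(y)
    Ky = K + noise**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))    # -1/2 log|Ky|
            - 0.5 * n * np.log(2 * np.pi))
```

Maximising this quantity over kernel (and, in our case, invariance) parameters is what "training with the marginal likelihood" means.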
How does it work?
We used a Gaussian process as our probabilistic model because 1) Gaussian processes have properties that allow good approximations of the marginal likelihood to be found, and 2) there was existing work on how to incorporate invariances into them. We extend the existing work in the following ways:
We show that the strict invariances studied previously result in models that are too inflexible. So, we introduce the notion of "insensitivity", a less strict form of invariance. Insensitivities are mathematically easier to specify, and they are better suited to machine learning tasks.
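One way to picture insensitivity: instead of summing a base kernel over a group's full orbit (strict invariance), average it over a distribution of small transformations. A hedged sketch, assuming an RBF base kernel and small random 2-D rotations as the transformation distribution (all names are illustrative, not the paper's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential base kernel between two sets of points.
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def small_rotation(X, max_angle=0.3):
    # Rotate all 2-D inputs by one freshly sampled small angle.
    theta = rng.uniform(-max_angle, max_angle)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return X @ R.T

def insensitive_kernel(A, B, n_samples=50):
    # Monte Carlo estimate of the doubly averaged kernel
    #   k_ins(x, x') = E_{t, t'}[ k(t(x), t'(x')) ],
    # which makes a GP insensitive (not strictly invariant)
    # to the sampled transformations.
    return np.mean([rbf(small_rotation(A), small_rotation(B))
                    for _ in range(n_samples)], axis=0)
```

Because the averaging is over a distribution rather than an exact orbit, the resulting function values are only approximately equal across transformed inputs, which is exactly the extra flexibility that strict invariance lacks.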
The properties of Gaussian processes are determined by their kernel, so we create invariant Gaussian processes by constructing an appropriate kernel. Previous methods required the kernel to be computed exactly, which is not possible for our new invariant kernels. The main methodological contribution of our paper is a method for using Gaussian processes whose kernels can only be computed approximately.
Finally, having constructed kernels that give Gaussian processes invariance properties, we show how the approximate kernel computation can be used to estimate the marginal likelihood, which we then use as a training criterion for the invariances.
Why do we use the marginal likelihood? In the example task above, we are trying to learn a function that is symmetric along the diagonal (it is invariant to having its arguments flipped). If we choose the model that simply fits the data best (i.e. the lowest training RMSE), as is standard in non-Bayesian learning, we end up with the non-invariant model. If we instead use the marginal likelihood to select the model, we choose the invariant model, which generalises best.
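To make the argument-flip case concrete: any base kernel can be symmetrised by averaging over the two-element group {identity, swap}, and a GP with the resulting kernel is symmetric by construction, so its posterior mean agrees at (a, b) and (b, a). A sketch under these assumptions (not the paper's code):

```python
import numpy as np

def rbf(A, B):
    # Squared-exponential base kernel (unit lengthscale).
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)

def swap(X):
    # Flip the two arguments: (x1, x2) -> (x2, x1).
    return X[:, ::-1]

def symmetric_kernel(A, B):
    # Average the base kernel over {identity, swap} on both inputs;
    # every sample from a GP with this kernel satisfies f(x1, x2) = f(x2, x1).
    return 0.25 * (rbf(A, B) + rbf(swap(A), B)
                   + rbf(A, swap(B)) + rbf(swap(A), swap(B)))
```

The symmetric model effectively sees every observation twice (once for each argument order), which is why the marginal likelihood can prefer it even when the unconstrained model achieves a lower training error.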
What’s the impact of this work?
A machine learning method can often be made to generalise much better by incorporating the correct invariances. We demonstrate a method that can find these invariances automatically.
This work highlights a second benefit of Bayesian probabilistic modelling: the automatic selection of model properties. Currently, neural network models are designed through manual modification combined with cross-validation (i.e. trial and error). We show that a probabilistic approach can automate this otherwise manual process.