How much computation do Gaussian Processes really need?
At Predapp we are not afraid to tackle the problems other common machine learning methods (like deep learning) find difficult.
We investigate problems where the data leaves significant amounts of uncertainty in the resulting predictions. To solve these problems, we develop tools based on Gaussian processes, a class of machine learning models that accurately represent any uncertainty left in the predictions.
While Gaussian process models are elegant and provide many advantages, they can be difficult to use in practice due to the amount of computation needed to train them. We know that to get the exact solution, the time t taken by a computer grows with the cube of the number of data points N, roughly summarised by this formula:
t = const · N^3
As this equation demonstrates, small datasets can be handled quickly, but as the dataset grows, the time taken to complete the computation grows rapidly - and many datasets consequently take too long to solve.
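To see where the cubic cost comes from, here is a minimal sketch of exact GP regression (all names and settings are illustrative, not Predapp's actual implementation): the dominant step is factorising the N x N kernel matrix, which takes time proportional to N^3.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def exact_gp_predict(X, y, X_new, noise_var=0.1):
    """Exact GP regression posterior mean.

    The Cholesky factorisation of the N x N kernel matrix is the
    expensive step: it costs O(N^3) time, which dominates for large N.
    """
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                            # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # O(N^2)
    return rbf_kernel(X_new, X) @ alpha                  # O(N) per test point

# Toy example: noisy observations of a sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, 50)
y = np.sin(X) + 0.05 * rng.standard_normal(50)
mean = exact_gp_predict(X, y, np.array([2.5]))
```

Doubling N makes the Cholesky step roughly eight times slower, which is exactly the cubic growth described above.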
In recent years, approximations have been developed for Gaussian processes that require far less computation, while still performing very well on the problems we apply them to. These methods work by summarising knowledge into a small number M of “inducing points”. These methods run in a time roughly summarised by this formula:
t = const1 · N·M^2 + const2 · M^3
We see that the computation time t now only grows linearly with the number of data points N. However, the computational cost still grows quickly with the number of inducing points M.
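The structure of this cost can be seen in a simple inducing-point sketch. This uses the "subset of regressors" style of approximation for brevity rather than the full variational method, and all names are illustrative: only N x M and M x M kernel matrices are ever formed, so no step costs more than O(N·M^2 + M^3).

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sparse_gp_predict(X, y, Z, X_new, noise_var=0.1):
    """Inducing-point GP posterior mean (subset-of-regressors sketch).

    Z holds M inducing points. The N x N kernel matrix is never built:
    the largest matrix product costs O(N M^2) and the solve O(M^3).
    """
    Kzz = rbf_kernel(Z, Z)                     # M x M
    Kxz = rbf_kernel(X, Z)                     # N x M
    A = Kxz.T @ Kxz + noise_var * Kzz          # O(N M^2)
    mu = np.linalg.solve(A, Kxz.T @ y)         # O(M^3)
    return rbf_kernel(X_new, Z) @ mu

# Toy example: 200 noisy sine observations, summarised by M = 10 inducing points.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, 200)
y = np.sin(X) + 0.05 * rng.standard_normal(200)
Z = np.linspace(0, 5, 10)
mean = sparse_gp_predict(X, y, Z, np.array([2.5]))
```

Because N only ever appears in the N x M matrix, adding more data points increases the cost linearly, as the formula above says.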
In practice, we and many others have already successfully applied this approximation to a wide range of problems. However, one question remained: how big does M need to be for the approximation to be good? So far, the approximation seemed to work, but we did not have a guarantee that it would remain useful as the datasets get ever larger.
If M grew linearly with N, then the benefits of this approximation would be limited to speeding up computations by a constant factor. This would only moderately increase our capabilities since the growth of the computation time would still be cubic in N.
If, however, we could show that M could grow slowly with N, then we would know that the bigger our dataset, the larger the speed-up!
How many inducing points we need
We show that the number of inducing points only needs to grow slowly with the size of the dataset, while still guaranteeing a very accurate approximation.
For a very common case, we can even show that M only needs to grow with the logarithm of N, resulting in an overall time that grows as:
t = const · N·(log N)^(2D)
where D is the dimension of the input data.
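To get a feel for how slowly a logarithm grows, here is a back-of-envelope illustration (taking D = 1 and natural logarithms for simplicity; the constants in the real bound are glossed over):

```python
import math

# How many inducing points M ~ log(N) would we need as N grows?
for N in [10**3, 10**6, 10**9]:
    M = math.log(N)
    print(f"N = {N:>13,}  ->  M ~ {M:.1f}")
```

While the dataset grows a million-fold, from a thousand points to a billion, the required number of inducing points only roughly triples. This is why the speed-up over the exact cubic-time solution keeps getting larger as datasets grow.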
Our proof shows that Gaussian processes will continue to scale favourably even for very large datasets.
What does this result mean in practice?
While the practical utility of Gaussian processes for many tasks has been established, the lack of a guarantee left open the possibility that Gaussian processes might not remain feasible for very large datasets.
Our result proves that the cost of Gaussian processes for large datasets grows manageably, which supports our belief that Gaussian processes can be practical tools for many more problems.
However, there is still a significant amount of work that needs to be done. While our proof shows that there are no fundamental theoretical obstacles to large scale Gaussian process models, our practical implementations can still benefit from many speed improvements. And we are working on it!