A Note on Bayesian Optimisation

History and background

Sometime during the 1740s, Reverend Thomas Bayes made an ingenious statistical discovery, an idea, a formula that had changed the world and provided solutions to a lot of problems. At that time people were very serious about religion and the idea of God. David Hume, one of the English philosophers believed that the certainty of anything should not be inherited from the traditional belief, habitual relationships, or cause and effect, rather it should rely on the experience over time. Because God was regarded as the first cause of everything, Hume’s skepticism was quite unsettling.

With Hume’s doubt of cause and effect, Bayes wanted to challenge this notion brought up by Hume mathematically. So Bayes decided to learn the approximate probability of a future event he knew nothing about. His idea was to start a process with a random number and update the number based upon the new information that he gathered. This process will not give the exact solution but an approximate solution. Bayes discovered that with each new information his initial random number would find itself more and more constraint with each gathering information.

We can sum Bayes’ idea into a statement where “prior times likelihood is proportional to the posterior”.

Although Bayes’ idea was discussed in Royal Society circles he himself did not like the answer and hence it was never published. It was Richard Price, a young friend of Bayes who was asked by Bayes’ family to examine his mathematical paper after his death.

Price found that Bayes’ paper was an answer attack to Hume’s attack on causation so devoted all his hard work and labor, he added missing references and citation. In moving mathematics from observation of the natural world to its ultimate cause, the theorem aimed to show that “the world must be the effect of the wisdom and power of an intelligent cause; and thus to confirm… from final causes... the existence of the Deity”.

“Bayes combined judgments based on prior hunches with probabilities based on repeatable experiments. He introduced the signature features of Bayesian methods: an initial belief modified by objective new information. He could move from observations of the world to abstractions about their probable cause. And he discovered the long-sought grail of probability, what future mathematicians would call the probability of causes, the principle of inverse probability, Bayesian statistics, or simply Bayes’ rule.” Excerpt From: Sharon Bertsch McGrayne. “The Theory That Would Not Die”.

Bayesian Mathematics in Machine Learning

Machine Learning models can turn out to be quite complicated and finding the right parameters and tuning them can be a very tedious task. When a machine learning model becomes more and more complicated hence a deep learning model, the number of combinations can be massive, and finding a correct combination can take a lot of time even if we have enough computing power. Bayesian mathematics or Bayesian approach to find correct parameters can be a deep learning model in a position where it finds a set of parameters that are approximately good not exactly good but approximately.

The reason for this is that we don’t wanna spend too much time finding a set of parameters from finite possibilities, it would take forever. We want to find good enough parameters that output approximated distribution as the observations.

But, how does it work?

The Bayesian optimization uses random search from a distribution. To define a random search we need to define a marginal distribution for each hyperparameter, for example, a Bernoulli or multinodular distribution for a discrete hyperparameter, or a uniform distribution on a log-scale for positive real-valued hyperparameters.

The goal of Bayesian optimization is to find optimal parameters of a system with a limited number of experimental trials. They attempt to find the global optimum in a minimum number of steps. These methods employ a probabilistic surrogate model to make predictions about possible outcomes. To search for optimal parameters, we define an acquisition function f that uses the surrogate model to assign each configuration with an approximation. Bayesian optimization incorporates prior belief about f and updates the prior with samples drawn from f to get a posterior that better approximates f. Configurations with the better approximation are tested on the system, and the process repeats. The performance of a Bayesian optimization algorithm is therefore determined by three components: the surrogate model, the acquisition function, and the methods that numerically optimize the acquisition function like gradient descent.

Bayesian Optimisation in generative models

By definition generative modeling is an unsupervised learning machine learning task that involves automatically discovering and learning the representations or patterns in input data in such a way that the model can be used to generate new examples. All these models represent probability distributions over multiple variables in some manner. The distributions that the generative model generates are high-dimensional. For instance the classical deep learning methodology like classification and regression we model a one-dimensional output whereas in generative modeling we model high-dimensional output.

Essentially, we simply model the joint distribution of the data and we don’t have any labels.

Some of the uses of generative models are:

  • Density estimation and outlier detection

  • Data compression

  • Mapping from one domain to another

  • Language translation, text-to-speech

  • Planning in model-based reinforcement learning

  • Representation learning

  • Understanding the data

For simplicity, we will take the example of Variational Autoencoder.

In a variational generative model, we approximate the exact posterior p(z|x) with a variational posterior q (z|x), where z is the latent variable and x is the observation. are the variational parameters which we will optimize over to fit the variational posterior to the exact posterior. The variational posterior is sampled from the standard normal distribution of mean and standard deviation is 0 and 1 respectively. The sample is randomly picked thus making the model intractable.

A Variational Autoencoders, or VAE [Kingma, 2013; Rezende et al., 2014], is a generative model that generates continuous latent variables -- variables that are random and unobserved that serves as the basic building block to model a distribution -- that uses learned approximate inference [Ian Goodfellow, Deep learning]. It is a modified version of the autoencoder.

The limitations with autoencoders is that it learns to generate compact representations and reconstruct their inputs well, but aside from a few applications like denoising autoencoders, they are fairly limited.

The fundamental problem with autoencoders, for generation, is that the latent space they convert their inputs to and where their encoded vectors lie, may not be continuous, or allow easy interpolation.

Variational Autoencoders on the other hand have one fundamentally unique property that separates them from vanilla autoencoders, and it is this property that makes them so useful for generative modeling: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation.

It achieves this by making its encoder not output random latent variables, rather, outputting two vectors: a vector of means, μ, and another vector of standard deviations, σ (which will be the optimizing parameter of the model). Since we're assuming that our prior follows a normal distribution, we'll output two vectors describing the mean and variance of the latent state distributions.

Intuitively, the mean vector controls where the encoding of an input should be centered around, while the standard deviation controls the “area”, how much from the mean the encoding can vary. By constructing our encoder model to output a range of possible values from which we'll randomly sample to feed into our decoder model, we're essentially enforcing a continuous, smooth latent space representation. For any sampling of the latent distributions, we're expecting our decoder model to be able to accurately reconstruct the input. Thus, values that are nearby to one another in latent space should correspond with very similar reconstructions.

What we ideally want are encodings, all of which are as close as possible to each other while still being distinct, allowing smooth interpolation, and enabling the construction of new samples.

However, this sampling process requires some extra attention. When training the model, we need to be able to calculate the relationship of each parameter in the network with respect to the final output loss using backpropagation. We cannot do this for a random sampling process. The way VAE achieves this is by reparameterization trick which simply suggests that we randomly sample ε from a unit Gaussian, and then shift the randomly sampled ε by the latent distribution's mean μ and scale it by the latent distribution's variance σ. Since and tractable we use the backpropagation technique to optimize the parameters thus finding the exact posterior.

With this technique, we can optimize the parameter over the exact distribution while still maintaining the ability to randomly sample from that distribution.

The variational autoencoder achieves some great results in generating samples and is among the state-of-the-art approaches to generative modeling. Its main drawback is that the samples generated from the model tend to be blurry at times. The most obvious conclusion to cause is an intrinsic effect of the maximum likelihood, which minimizes DKL(pdata || pmodel). This essentially means that the model will assign a high probability to the points that occur in the training set and also to the other points which may be the cause of blurry image [Ian Goodfellow, Deep learning].

To learn more about generative models and representation learning read this article.