Bayesian Optimisation vs “Classical” Design of Experiments?
A hot topic of discussion in the world of structured learning about anything
What is Bayesian Optimisation and “Classical DoE”?
A quick explanation of what we are talking about:
Bayesian Optimisation (BO) is a method that interpolates a response function with a Gaussian Process: observations are collected at certain settings of several influencing factors for a process, and the areas in between are then described by probable functional relationships coming out of the Gaussian Process. BO then determines the next observation(s) to be collected, either by targeting the most uncertain locations in the area of interest – called Exploration – or by targeting the most likely optimal settings for the factors – called Exploitation. The process starts with a small number of initial observations and then adds new experimental runs, or tests, or trials (name them however makes sense to you) in a sequential fashion; a minimal sketch of this loop follows below.
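To make the loop concrete, here is a minimal, self-contained Python sketch. The one-dimensional objective is made up and merely stands in for a real experiment, and the upper-confidence-bound rule stands in for whatever acquisition function a real BO tool would use:

```python
import numpy as np

# Made-up objective standing in for an expensive experiment (illustrative only).
def run_experiment(x):
    return -(x - 0.6) ** 2 + 0.05 * np.sin(15 * x)

def rbf_kernel(a, b, length_scale=0.15):
    # Squared-exponential kernel between two sets of 1-D points.
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_grid, jitter=1e-4):
    # Standard Gaussian Process regression equations (zero prior mean).
    K = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_grid)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    v = np.linalg.solve(K, K_s)
    var = 1.0 - np.sum(K_s * v, axis=0)   # prior variance of the RBF kernel is 1
    return mean, np.sqrt(np.maximum(var, 0))

x_grid = np.linspace(0, 1, 200)
x_obs = np.array([0.1, 0.5, 0.9])          # small set of starting observations
y_obs = run_experiment(x_obs)

for step in range(8):
    mean, std = gp_posterior(x_obs, y_obs, x_grid)
    # Upper confidence bound: the mean term exploits, the std term explores;
    # kappa sets the trade-off between the two.
    kappa = 2.0
    ucb = mean + kappa * std
    x_next = x_grid[np.argmax(ucb)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, run_experiment(x_next))

print("Best setting found:", x_obs[np.argmax(y_obs)])
```

A larger kappa leans towards exploration (chasing uncertainty), a smaller one towards exploitation (refining the current best region).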
“Classical” Design of Experiments looks for an optimal set of experimental runs, i.e. a minimal number of runs collecting maximum information, to estimate a regression function that answers specific questions. In classical Screening, the (often more than 8) factors are typically set to a Low and a High setting, and the design consists of all, or a carefully chosen fraction of, the combinations of these. The assumption is that not all of those factors play a role, and the goal is to reduce their number. Newer designs, e.g. Definitive Screening, also allow quadratic terms to be included in the selection, to capture curvature in the area of interest. And then you can determine Optimal Designs for all sorts of particular situations you are interested in, including Optimisation. The main point for our discussion is that for all these designs you have to run the full design before you can analyse it. A small sketch of the screening idea follows below.
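To make the screening idea tangible, here is a small Python sketch of a two-level design in coded units; the four factors and the particular half-fraction defining relation are purely illustrative:

```python
import itertools
import numpy as np

# Full two-level factorial in coded units (-1 = Low, +1 = High) for 4 factors.
factors = ["A", "B", "C", "D"]
full = np.array(list(itertools.product([-1, 1], repeat=len(factors))))
print("Full factorial runs:", len(full))          # 2**4 = 16

# Half fraction: keep the runs where the product of all columns is +1
# (defining relation I = ABCD), giving 8 runs instead of 16.
half = full[full.prod(axis=1) == 1]
print("Half-fraction runs:", len(half))           # 8
```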
What is the debate about?
The discussion I have seen mainly revolves around the question of which method gets you to an optimum with less effort. Some people have interpreted it in a broader sense, asking whether all experimentation, trialling, and testing should be done using BO.
Let’s first be very clear: Bayesian Optimisation is not a replacement for screening efforts, as it is not actually meant to do dimension reduction. On the contrary, it is even more prone to the curse of dimensionality: the more dimensions are considered, the more likely it is to get lost in a suboptimal area of the space. The other thing classical DoE is better at is the case where there is a well understood first-principles model that either has a closed form or can be locally expanded in a Taylor series; here a classical design will always be superior in generating the information necessary to estimate the parameters of that model and then find an optimum. The little calculation below illustrates how quickly dimensionality bites.
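As a rough illustration (just the standard combinatorics, not taken from any particular study): both the number of corner points of the factor space and the number of terms in a full quadratic model grow quickly with the number of factors.

```python
from math import comb

# Corner points of the factor space (2**k) and terms in a full quadratic model
# (intercept + main effects + pure quadratics + two-factor interactions).
for k in (2, 4, 6, 8, 10):
    corner_runs = 2 ** k
    quadratic_terms = 1 + k + k + comb(k, 2)
    print(f"k={k:2d}  corner points={corner_runs:5d}  quadratic-model terms={quadratic_terms:3d}")
```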
Okay, now that we have put these out of the way, let’s look at some of the other situations where BO does make sense and has advantages.
The first one is optimising hyperparameters for machine learning models. Here BO is perfectly suited, as it allows you to move faster towards good parameters across the often complex structure of the model-performance response. A hedged sketch of what this can look like is shown below.
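This sketch assumes the scikit-optimize package and uses an illustrative random-forest search space; it is not a recommended setup, just the shape of the workflow:

```python
from skopt import gp_minimize
from skopt.space import Integer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def objective(params):
    n_estimators, max_depth, min_samples_leaf = params
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=0,
    )
    # gp_minimize minimises, so return the negative cross-validated accuracy.
    return -cross_val_score(model, X, y, cv=3).mean()

space = [Integer(50, 300), Integer(2, 15), Integer(1, 10)]
result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("Best hyperparameters:", result.x, "accuracy:", -result.fun)
```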
If you have a complex simulator with a reasonably small number of inputs and want to build a simpler surrogate model to find optimal settings quickly, again BO is a perfect choice, as it allows you to find those sweet spots without a lot of expensive runs of the actual simulator. This is particularly true for subsurface (reservoir) models, which are often very complex and expensive to execute. A minimal sketch of the surrogate idea follows below.
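In this sketch a cheap stand-in function plays the role of the expensive simulator; the point is only that the surrogate is fitted once on a handful of runs and then searched intensively instead of the simulator itself:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Stand-in for an expensive simulator with two inputs (purely illustrative).
def expensive_simulator(x):
    return np.sin(3 * x[0]) * np.cos(2 * x[1]) + 0.5 * x[0]

# A handful of simulator runs at roughly space-filling input points.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(15, 2))
y_train = np.array([expensive_simulator(x) for x in X_train])

# Fit the cheap surrogate once ...
surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), normalize_y=True)
surrogate.fit(X_train, y_train)

# ... then search the surrogate instead of re-running the simulator.
def neg_surrogate_mean(x):
    return -surrogate.predict(x.reshape(1, -1))[0]

best = minimize(neg_surrogate_mean, x0=np.array([0.5, 0.5]), bounds=[(0, 1), (0, 1)])
print("Promising input settings according to the surrogate:", best.x)
```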
Finally, let’s look at a case where it actually makes sense to compare BO and DoE with each other: you have to find optimal settings for 4 influencing factors maximising one response. A 20-run space-filling design has been run to determine the most interesting area in the range of the possible settings and to understand the complexity of the response function. This gave a clear recommendation to restrict further testing to a quarter of the space. As a result, 5 of the original observations can be reused for the determination of the optimum. In standard DoE these 5 observations would be augmented with 10 further runs, the minimum needed to estimate a complete response surface model with all two-factor interactions and quadratic terms (a small sketch of such a fit follows below).
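This sketch uses made-up responses in coded units, purely to show where the run count of 15 comes from; note that 15 runs for 15 model terms is a saturated fit, with no degrees of freedom left for error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative only: 15 runs in coded units for 4 factors (5 kept from the
# space-filling stage plus 10 augmentation runs), with made-up responses.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(15, 4))
y = 3 + X[:, 0] - 2 * X[:, 1] ** 2 + 0.5 * X[:, 0] * X[:, 2] + rng.normal(0, 0.1, 15)

# A full second-order response surface model for 4 factors has
# 1 + 4 + 6 + 4 = 15 terms (intercept, mains, two-factor interactions, quadratics),
# which is why roughly 15 runs is the minimum needed to estimate it.
design = PolynomialFeatures(degree=2, include_bias=False)
model = LinearRegression().fit(design.fit_transform(X), y)
print("Number of model terms:", 1 + design.n_output_features_)
```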
If we now use BO, there is a high likelihood that only a few exploration steps will be required before stepping over to exploitation. So with the right trade-off settings it will find the optimum with about the same number of new observations, or fewer. Under advantageous circumstances BO is likely to find a better or more precise optimum than the linear regression model, because it determines an interpolation of the observations, not an approximation.
However, the real world throws a lot of spanners into the wheels! To start with, measurement uncertainty can be larger than the chosen stopping criterion in BO. The method might then bounce around the neighbourhood of the optimum, giving the impression of a complex response while actually just overfitting to the noise. (I am sure there are ways to avoid this – I am not an expert in BO! One common option is sketched below.)
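One common way to guard against this – sketched here with made-up data, and not claimed to be what any specific BO tool does – is to give the surrogate an explicit noise term, so it does not have to interpolate every noisy observation exactly:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Noisy observations of a simple true signal (numbers made up for illustration).
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=(25, 1))
y = -(x[:, 0] - 0.6) ** 2 + rng.normal(0, 0.05, 25)

# The WhiteKernel lets the GP estimate a measurement-noise variance instead of
# bending the response surface to pass through every noisy point.
kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x, y)
print("Fitted kernel (including estimated noise level):", gp.kernel_)
```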
Then there is the main criticism of any sequential or adaptive experimentation: how do you prevent time trends from biasing the outcome? In DoE this is avoided by randomising appropriately, so that the potential bias can be assumed to be captured by the random model term. In BO (or any other adaptive method) this can be tricky to overcome, and the easiest remedy is at the same time costly, namely introducing repeats! A similar problem is posed by blocking factors, like the people doing the test, the instruments used, and the experimental units. A small sketch of the classical countermeasure is shown below.
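For contrast, here is a tiny sketch of that classical countermeasure: randomise the run order and record the blocking factor explicitly, so a time trend or block effect does not get confounded with the factor effects (the operator names and the duplicated two-factor design are purely illustrative):

```python
import numpy as np

# A duplicated 2**2 design in coded units, a randomised execution order,
# and an explicit block column (here: which operator runs each test).
rng = np.random.default_rng(3)
design = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]] * 2)
run_order = rng.permutation(len(design))
block = np.repeat(["operator_A", "operator_B"], len(design) // 2)

for i in run_order:
    print(f"run {i:2d}  factors={design[i]}  block={block[i]}")
```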