I am using scikit learn for Gaussian process regression (GPR) operation to predict data. My training data are as follows:
x_train = np.array([[0,0],[2,2],[3,3]]) #2-D cartesian coordinate points
y_train = np.array([[200,250, 155],[321,345,210],[417,445,851]]) #observed output from three different datasources at respective input data points (x_train)
The test points (2-D) where mean and variance/standard deviation need to be predicted are:
xvalues = np.array([0,1,2,3])
yvalues = np.array([0,1,2,3])
x,y = np.meshgrid(xvalues,yvalues) #Total 16 locations (2-D)
positions = np.vstack([x.ravel(), y.ravel()])
x_test = (np.array(positions)).T
Now, after running the GPR (GausianProcessRegressor
) fit (Here, the product of ConstantKernel and RBF is used as Kernel in GaussianProcessRegressor
), mean and variance/standard deviation can be predicted by following the line of code:
y_pred_test, sigma = gp.predict(x_test, return_std =True)
While printing the predicted mean (y_pred_test
) and variance (sigma
), I get following output printed in the console:
In the predicted values (mean), the 'nested array' with three objects inside the inner array is printed. It can be presumed that the inner arrays are the predicted mean values of each data source at each 2-D test point locations. However, the printed variance contains only a single array with 16 objects (perhaps for 16 test location points). I know that the variance provides an indication of the uncertainty of the estimation. Hence, I was expecting the predicted variance for each data source at each test point. Is my expectation wrong? How can I get the predicted variance for each data source at each test points? Is it due to wrong code?
Best Answer
Well, you have inadvertently hit on an iceberg indeed...
As a prelude, let's make clear that the concepts of variance & standard deviation are defined only for scalar variables; for vector variables (like your own 3d output here), the concept of variance is no longer meaningful, and the covariance matrix is used instead (Wikipedia, Wolfram).
Continuing on the prelude, the shape of your
sigma
is indeed as expected according to the scikit-learn docs on thepredict
method (i.e. there is no coding error in your case):Combined with my previous remark about the covariance matrix, the first choice would be to try the
predict
function with the argumentreturn_cov=True
instead (since asking for the variance of a vector variable is meaningless); but again, this will lead to a 16x16 matrix, instead of a 3x3 one (the expected shape of a covariance matrix for 3 output variables)...Having clarified these details, let's proceed to the essence of the issue.
At the heart of your issue lies something rarely mentioned (or even hinted at) in practice and in relevant tutorials: Gaussian Process regression with multiple outputs is highly non-trivial and still a field of active research. Arguably, scikit-learn cannot really handle the case, despite the fact that it will superficially appear to do so, without issuing at least some relevant warning.
Let's look for some corroboration of this claim in the recent scientific literature:
Gaussian process regression with multiple response variables (2015) - quoting (emphasis mine):
Remarks on multi-output Gaussian process regression (2018) - quoting (emphasis in the original):
Physics-Based Covariance Models for Gaussian Processes with Multiple Outputs (2013) - quoting:
Hence, my understanding, as I said, is that sckit-learn is not really capable of handling such cases, despite the fact that something like that is not mentioned or hinted at in the documentation (it may be interesting to open a relevant issue at the project page). This seems to be the conclusion in this relevant SO thread, too, as well as in this CrossValidated thread regarding the GPML (Matlab) toolbox.
Having said that, and apart from reverting to the choice of simply modeling each output separately (not an invalid choice, as long as you keep in mind that you may be throwing away useful information from the correlation between your 3-D output elements), there is at least one Python toolbox which seems capable of modeling multiple-output GPs, namely the runlmc (paper, code, documentation).