In this appendix we derive the online learning rules for a restricted GP in which only the diagonal elements of the matrix $C$ parametrising the posterior kernel are nonzero, similarly to the parametrisation of the covariances in kernel spaces proposed by [88].
We perform the simplification by including the constraint in the learning rule: we project onto the subspace of GPs whose kernel is specified using only the diagonal elements, i.e.
\[
  K_{post}(x,x') \;=\; K_0(x,x') \;+\; \sum_{i=1}^{t} K_0(x,x_i)\, C_{ii}\, K_0(x_i,x') ,
\]
and if we use matrix notation and the design matrix $\Phi_t = [\phi_1,\ldots,\phi_t]$, then we can write the posterior covariance matrix in the feature space specified by the design matrix $\Phi_t$ and the diagonal matrix $C$ as
\[
  \Sigma \;=\; I \;+\; \Phi_t\, C\, \Phi_t^T .
\]
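For concreteness, the short sketch below evaluates this restricted parametrisation on a toy one-dimensional data set. It is only an illustration: the RBF prior kernel, the numerical values and all names (`rbf`, `posterior_kernel_diag`) are choices made for the sketch, not notation fixed by the derivation.

```python
# Restricted posterior kernel with a diagonal C, assuming an RBF prior kernel K_0.
import numpy as np

def rbf(A, B, length_scale=1.0):
    """Prior kernel K_0 evaluated between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale**2)

def posterior_kernel_diag(Xnew, X, c_diag, length_scale=1.0):
    """K_post(x,x') = K_0(x,x') + sum_i K_0(x,x_i) C_ii K_0(x_i,x'),
    keeping only the diagonal elements C_ii of the parameter matrix."""
    K_nn = rbf(Xnew, Xnew, length_scale)
    K_nx = rbf(Xnew, X, length_scale)
    return K_nn + K_nx @ np.diag(c_diag) @ K_nx.T

# Toy usage: three training inputs, negative C_ii (reduction of the prior variance).
X = np.array([[0.0], [1.0], [2.0]])
c_diag = np.array([-0.20, -0.10, -0.15])
Xnew = np.linspace(-1.0, 3.0, 5)[:, None]
print(np.diag(posterior_kernel_diag(Xnew, X, c_diag)))  # restricted posterior variances
```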
In the online learning setup the KL-divergence between the true posterior and the projected one is minimised. Differentiating the KL-divergence from eq. (74) with respect to a diagonal element $C_{ii}$ leads to the expression
\begin{align*}
 0 \;&=\; \phi_i^T\,\widehat\Sigma_{t+1}^{-1}\bigl(\widehat\Sigma_{t+1}-\Sigma_{post}\bigr)\widehat\Sigma_{t+1}^{-1}\phi_i \\
   \;&=\; \phi_i^T\,\widehat\Sigma_{t+1}^{-1}\Bigl(\Phi_{t+1} C_{t+1}\Phi_{t+1}^T \;-\; \Phi_t C_t\Phi_t^T \;-\; r^{(t+1)}\,\Sigma_t\,\phi_{t+1}\phi_{t+1}^T\,\Sigma_t\Bigr)\widehat\Sigma_{t+1}^{-1}\phi_i ,
   \qquad i = 1,\ldots,t+1 ,
\end{align*}
where $\Phi_{t+1} = [\phi_1,\ldots,\phi_{t+1}]$, $\widehat\Sigma_{t+1} = I + \Phi_{t+1} C_{t+1}\Phi_{t+1}^T$ is the covariance of the projected GP with diagonal $C_{t+1}$, $\Sigma_t = I + \Phi_t C_t\Phi_t^T$ is the posterior covariance from the previous step, and $\Sigma_{post} = \Sigma_t + r^{(t+1)}\Sigma_t\phi_{t+1}\phi_{t+1}^T\Sigma_t$ is the updated true posterior covariance, with $r^{(t+1)}$ the scalar coefficient obtained using the online learning rule and $\phi_{t+1}$ the feature vector corresponding to the new datum.
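As a toy numerical check of this stationarity condition we can compare the analytic expression with a finite-difference derivative of the covariance part of the KL-divergence in an explicit, finite-dimensional feature space. The sketch below does this under our own assumptions (random features standing in for $\phi_i$, zero means, an arbitrary candidate diagonal, and illustrative variable names); it is not part of the derivation.

```python
# Finite-difference check of d KL / d C_ii against the analytic stationarity expression,
# using random finite-dimensional features as a stand-in for the feature space.
import numpy as np

rng = np.random.default_rng(0)
d, t = 12, 4                          # feature-space dimension, number of previous data
Phi_t = rng.normal(size=(d, t))       # previous feature vectors as columns
phi_new = rng.normal(size=(d, 1))     # feature vector of the new datum
Phi = np.hstack([Phi_t, phi_new])     # design matrix Phi_{t+1}

C_t = np.diag(rng.uniform(-0.01, 0.0, size=t))   # previous (diagonal) parameter matrix
r_new = -0.01                                    # scalar coefficient of the online update

Sigma_t = np.eye(d) + Phi_t @ C_t @ Phi_t.T      # previous posterior covariance
v = Sigma_t @ phi_new
Sigma_post = Sigma_t + r_new * (v @ v.T)         # updated true posterior covariance

def kl_cov(c):
    """Covariance part of 2 * KL(true posterior || diagonal approximation)."""
    S = np.eye(d) + Phi @ np.diag(c) @ Phi.T
    return (np.trace(np.linalg.solve(S, Sigma_post)) - d
            + np.linalg.slogdet(S)[1] - np.linalg.slogdet(Sigma_post)[1])

c = rng.uniform(-0.01, 0.0, size=t + 1)          # candidate diagonal of C_{t+1}
S = np.eye(d) + Phi @ np.diag(c) @ Phi.T
S_inv = np.linalg.inv(S)

# analytic: phi_i^T S^{-1} (S - Sigma_post) S^{-1} phi_i for each column phi_i of Phi
analytic = np.einsum('di,dk,ki->i', Phi, S_inv @ (S - Sigma_post) @ S_inv, Phi)

eps = 1e-6
numeric = np.array([
    (kl_cov(c + eps * np.eye(t + 1)[i]) - kl_cov(c - eps * np.eye(t + 1)[i])) / (2 * eps)
    for i in range(t + 1)])
print(np.max(np.abs(analytic - numeric)))        # should be small (finite-difference error)
```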
We have $t+1$ equations for the $t+1$ unknown diagonal elements, but the system is not linear. Substituting the forms of the covariances, using the matrix inversion lemma, and using $C_{post}$ from eq. (219) to write the true posterior covariance as $\Sigma_{post} = I + \Phi_{t+1} C_{post}\Phi_{t+1}^T$, the system of equations is written compactly as
\[
 \mathrm{diag}\Bigl[\,(I + K_{t+1} C_{t+1})^{-1}\,K_{t+1}\,\bigl(C_{t+1}-C_{post}\bigr)\,K_{t+1}\,(I + C_{t+1} K_{t+1})^{-1}\Bigr] \;=\; 0 ,
\]
where $K_{t+1} = \Phi_{t+1}^T\Phi_{t+1}$ is the kernel matrix of the $t+1$ inputs. We see that the projection is taken with respect to the unknown matrix $C_{t+1}$ itself, giving no analytic solution.
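One way to make the implicit nature of this equation concrete is a fixed-point iteration for the diagonal of $C_{t+1}$, as sketched below. The update scheme, the starting point and all names are our own choices and convergence is not guaranteed; the sketch only illustrates that every sweep still requires the inversion of a full $(t+1)\times(t+1)$ matrix.

```python
# One possible fixed-point iteration for a diagonal C satisfying
# diag[(I + K C)^{-1} K (C - C_post) K (I + C K)^{-1}] = 0.
import numpy as np

def diagonal_C_fixed_point(K, C_post, n_iter=50, tol=1e-10):
    """Iterate on the diagonal of C; each sweep needs a full matrix solve."""
    n = K.shape[0]
    c = np.diag(C_post).copy()                               # start from the diagonal of C_post
    for _ in range(n_iter):
        A = np.linalg.solve(np.eye(n) + K @ np.diag(c), K)   # A = (I + K C)^{-1} K  (full inversion)
        b = np.diag(A @ C_post @ A.T)                        # right-hand side diag[A C_post A^T]
        c_new = np.linalg.solve(A * A, b)                    # solve sum_j A_ij^2 c_j = b_i for all i
        if np.max(np.abs(c_new - c)) < tol:
            c = c_new
            break
        c = c_new
    return np.diag(c)

# Toy usage with an RBF Gram matrix and a small negative-definite C_post.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
S = 0.05 * rng.normal(size=(6, 6))
C_post = -S @ S.T
print(np.diag(diagonal_C_fixed_point(K, C_post)))
```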
Thus, to obtain the ``simplified'' solution we still need to invert full matrices, and the posterior covariance is not diagonal either. As a consequence we would be required to perform iterative approximations, as also considered in [88]. From this we conclude that the diagonalisation of the parameter matrix $C$ is not feasible, as it does not introduce any computational benefit, and we believe that keeping the $\mathcal{BV}$ set at a reasonable size is a better alternative than diagonalising a possibly larger $\mathcal{BV}$ set.