In this appendix we derive the online learning rules for a restricted GP in which only the diagonal elements of the matrix $\boldsymbol{C}$ parametrising the posterior kernel are nonzero, similar to the parametrisation of the covariances in the kernel spaces proposed by [88].
We obtain the simplification by including the constraint in the learning rule: we project onto the subspace of GPs whose kernel is specified using only diagonal elements, i.e.
$$K_{post}(\boldsymbol{x}, \boldsymbol{x}') = K_0(\boldsymbol{x}, \boldsymbol{x}') + \sum_{i} K_0(\boldsymbol{x}, \boldsymbol{x}_i)\, C_{ii}\, K_0(\boldsymbol{x}_i, \boldsymbol{x}')$$
and if we use matrix notation and the design matrix
$\boldsymbol{\Phi} = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_{t+1}]$ then we can write the posterior
covariance matrix in the feature space specified by the design matrix
and the diagonal matrix $\boldsymbol{C}$ as
$${\boldsymbol{\Sigma}}_{t+1} = \boldsymbol{I} + \boldsymbol{\Phi}\, \boldsymbol{C}_{t+1}\, \boldsymbol{\Phi}^T$$
In the online learning setup the KL-divergence between the true posterior
and the projected one is minimised. Differentiating the KL-divergence
from eq. (74) with respect to a diagonal element $C_{ii}$
leads to the expression
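$$0 = \boldsymbol{\phi}_i^T\, {\boldsymbol{\Sigma}}_{t+1}^{-1} \left[ -{\boldsymbol{\Sigma}}_{t+1} + {\boldsymbol{\Sigma}}_{post} \right] {\boldsymbol{\Sigma}}_{t+1}^{-1}\, \boldsymbol{\phi}_i$$
$${\boldsymbol{\Sigma}}_{post} = {\boldsymbol{\Sigma}}_{t} - {\boldsymbol{\Sigma}}_{t}\, \boldsymbol{\phi}_{t+1}\, r^{(t+1)}\, \boldsymbol{\phi}_{t+1}^T\, {\boldsymbol{\Sigma}}_{t}$$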
with $r^{(t+1)}$ the scalar coefficient obtained using the online
learning rule and $\boldsymbol{\phi}_{t+1}$ the feature vector corresponding to the
new datum.
We have $t+1$ equations for $t+1$ variables, but the system is not
linear. Substituting the forms for the covariances $\boldsymbol{\Sigma}$ and using
the matrix inversion lemma leads to the system of equations:
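The substitution can be sketched as follows. This is an illustrative derivation rather than the original display: it assumes that both covariances have the form $\boldsymbol{I} + \boldsymbol{\Phi}\boldsymbol{C}\boldsymbol{\Phi}^T$, that the Gram matrix $\boldsymbol{K}_B = \boldsymbol{\Phi}^T\boldsymbol{\Phi}$ is invertible, and it stacks the $t+1$ stationarity conditions into the diagonal of a single matrix.

```latex
% Assumptions (not taken from the original display):
%   \Sigma_{t+1} = I + \Phi C_{t+1} \Phi^T,  \Sigma_{post} = I + \Phi C_{post} \Phi^T,
%   K_B = \Phi^T \Phi invertible, C_{t+1} diagonal, C_{post} symmetric.
\begin{align*}
\boldsymbol{\Phi}^T {\boldsymbol{\Sigma}}_{t+1}^{-1}
  &= \left(\boldsymbol{I} + \boldsymbol{K}_B \boldsymbol{C}_{t+1}\right)^{-1} \boldsymbol{\Phi}^T
  && \text{(matrix inversion lemma, push-through form)} \\
{\boldsymbol{\Sigma}}_{post} - {\boldsymbol{\Sigma}}_{t+1}
  &= \boldsymbol{\Phi} \left(\boldsymbol{C}_{post} - \boldsymbol{C}_{t+1}\right) \boldsymbol{\Phi}^T \\
\boldsymbol{\Phi}^T {\boldsymbol{\Sigma}}_{t+1}^{-1}
  \left[{\boldsymbol{\Sigma}}_{post} - {\boldsymbol{\Sigma}}_{t+1}\right]
  {\boldsymbol{\Sigma}}_{t+1}^{-1} \boldsymbol{\Phi}
  &= \left(\boldsymbol{K}_B^{-1} + \boldsymbol{C}_{t+1}\right)^{-1}
     \left(\boldsymbol{C}_{post} - \boldsymbol{C}_{t+1}\right)
     \left(\boldsymbol{K}_B^{-1} + \boldsymbol{C}_{t+1}\right)^{-1} \\
\Longrightarrow\quad
0 &= \mathrm{diag}\left[
     \left(\boldsymbol{K}_B^{-1} + \boldsymbol{C}_{t+1}\right)^{-1}
     \left(\boldsymbol{C}_{post} - \boldsymbol{C}_{t+1}\right)
     \left(\boldsymbol{K}_B^{-1} + \boldsymbol{C}_{t+1}\right)^{-1}
     \right]
\end{align*}
```

In this form the unknown diagonal matrix $\boldsymbol{C}_{t+1}$ appears both in the difference $\boldsymbol{C}_{post} - \boldsymbol{C}_{t+1}$ and in the two inverse factors that surround it, which is why the conditions are nonlinear in the diagonal elements.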
We see that we have a projection with respect to the unknown matrix $\boldsymbol{C}_{t+1}$, giving no analytic solution. Using $\boldsymbol{C}_{post}$ from eq. (219) the solution is written
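$$\mathrm{diag}\left[ \left( \boldsymbol{K}_B^{-1} + \boldsymbol{C}_{t+1} \right)^{-1} \right] = \mathrm{diag}\left[ \left( \boldsymbol{K}_B^{-1} + \boldsymbol{C}_{post} \right)^{-1} \right]$$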
We see that, to obtain the ``simplified'' solution, we still need to
invert full matrices. The posterior covariance is not diagonal either.
As a consequence we would be required to perform iterative
approximations, as also considered in [88]. From this we
conclude that the diagonalisation of the parameter matrix $\boldsymbol{C}$ is not
feasible, as it does not introduce any computational benefit, and
we believe that keeping the size of the $\mathcal{BV}$ set at a reasonable
level is a better alternative than a diagonalisation of a possibly
larger $\mathcal{BV}$ set.
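The cost argument can be checked numerically with a small NumPy sketch. It is an illustration under stated assumptions, not the procedure of this appendix: the covariances are taken as $\boldsymbol{I} + \boldsymbol{\Phi}\boldsymbol{C}\boldsymbol{\Phi}^T$, the Gram matrix is $\boldsymbol{K} = \boldsymbol{\Phi}^T\boldsymbol{\Phi}$, the full posterior parameter matrix is the hypothetical regression-like choice $\boldsymbol{C}_{post} = -(\boldsymbol{K} + \sigma^2 \boldsymbol{I})^{-1}$, and the function names are made up for this example. The stationarity condition for a diagonal $\boldsymbol{C}$ is evaluated once in the feature space and once in kernel form, $\mathrm{diag}[(\boldsymbol{K}^{-1} + \boldsymbol{C})^{-1}(\boldsymbol{C}_{post} - \boldsymbol{C})(\boldsymbol{K}^{-1} + \boldsymbol{C})^{-1}]$.

```python
import numpy as np


def stationarity_residual(K, C_post, c_diag):
    """Kernel form: diag[A^{-1} (C_post - C) A^{-1}], A = K^{-1} + C, C = diag(c_diag)."""
    C = np.diag(c_diag)
    # Even though C is diagonal, both inversions below act on full n x n matrices.
    A_inv = np.linalg.inv(np.linalg.inv(K) + C)
    return np.diag(A_inv @ (C_post - C) @ A_inv)


def feature_space_residual(Phi, C_post, c_diag):
    """Feature-space form: phi_i^T S^{-1} (S_post - S) S^{-1} phi_i for every column phi_i."""
    d = Phi.shape[0]
    Sigma = np.eye(d) + Phi @ np.diag(c_diag) @ Phi.T
    Sigma_post = np.eye(d) + Phi @ C_post @ Phi.T
    Sigma_inv = np.linalg.inv(Sigma)
    return np.diag(Phi.T @ Sigma_inv @ (Sigma_post - Sigma) @ Sigma_inv @ Phi)


rng = np.random.default_rng(0)
d, n = 8, 4                                  # feature dimension, number of inputs
Phi = rng.standard_normal((d, n))
Phi /= np.linalg.norm(Phi, axis=0)           # unit-norm columns, so eig(K) <= n
K = Phi.T @ Phi                              # Gram matrix (assumed invertible)

# Hypothetical full posterior parameters; noise level 4.0 keeps all matrices
# positive definite for unit-norm features with n = 4.
C_post = -np.linalg.inv(K + 4.0 * np.eye(n))

# Naive diagonalisation: keep only the diagonal of C_post.
c_naive = np.diag(C_post).copy()

res_kernel = stationarity_residual(K, C_post, c_naive)
res_feature = feature_space_residual(Phi, C_post, c_naive)

print("kernel form  :", res_kernel)
print("feature form :", res_feature)
```

Both forms agree, and the residual at the naive choice `c_naive` is not zero, so simply truncating $\boldsymbol{C}_{post}$ to its diagonal does not satisfy the stationarity condition. Any root-finding scheme for the diagonal elements would call `stationarity_residual` repeatedly, and each call inverts full matrices.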