Unsupervised Kernel Regression (UKR)

In general, the domain of manifold learning addresses the problem of finding a lower-dimensional (latent) representation $ \mathbf{X} = (\mathbf{x}_1,\mathbf{x}_2,\dots, \mathbf{x}_N) \in \mathbb{R}^{q \times N} $ of a set of observed data $ \mathbf{Y} = (\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N) \in \mathbb{R}^{d \times N} $ and the corresponding functional relationship $ \mathbf{y} = f(\mathbf{x}) $.

Unsupervised Kernel Regression (UKR) is a recent approach to learning non-linear continuous manifold representations. It has been introduced as an unsupervised formulation of the Nadaraya-Watson kernel regression estimator by Meinicke, Klanke et al. and further developed by Klanke. It uses the Nadaraya-Watson estimator as the mapping $ \mathbf{y} = f(\mathbf{x}) $ between latent and observed space:

\[  \mathbf{y} = f(\mathbf{x}) = \sum_{i=1}^N{\mathbf{y}_i \frac{\mathbf{K}_{\mathbf{H}}(\mathbf{x} - \mathbf{x}_i)} {\sum_j \mathbf{K}_{\mathbf{H}}(\mathbf{x} - \mathbf{x}_j)}}  \] (1)

This estimator realises a smooth, continuous generalisation of the functional relationship between two random variables $ \mathbf{x} $ and $ \mathbf{y} $ described by given data samples $ (\mathbf{x}_i; \mathbf{y}_i) $. Here, $ \mathbf{K}(\cdot) $ is a density kernel (e.g. a Gaussian or quartic kernel) and $ \mathbf{H} $ contains the corresponding bandwidths. Whereas the choice of the kernel is of relatively low importance, the bandwidth plays a much more critical role in the original estimator.
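For concreteness, the following Python sketch shows how the estimator (1) can be evaluated; it assumes a Gaussian kernel with a hand-picked scalar bandwidth, and all names and data are purely illustrative rather than taken from the original work:

```python
# Illustrative sketch (not the authors' code): Nadaraya-Watson regression
# as in Eq. (1), with a Gaussian kernel and a hand-picked bandwidth h.
import numpy as np

def nadaraya_watson(x_query, X, Y, h=0.3):
    """Estimate y at x_query from samples (X[:, i], Y[:, i]) via Eq. (1).

    X: (q, N) inputs, Y: (d, N) outputs, x_query: (q,) query point.
    """
    diff = X - x_query[:, None]                        # (q, N) differences x - x_i
    k = np.exp(-0.5 * np.sum(diff**2, axis=0) / h**2)  # Gaussian kernel values
    w = k / np.sum(k)                                  # normalised weights
    return Y @ w                                       # weighted average of the y_i

# toy 1D example: noisy sine
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(1, 100))
Y = np.sin(X) + 0.1 * rng.standard_normal((1, 100))
print(nadaraya_watson(np.array([np.pi / 2]), X, Y))    # close to 1
```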

UKR now treats (1) as a mapping from the latent space of the represented manifold to the original data space in which the manifold is embedded and from which the observed data samples $ \mathbf{Y} = \{\mathbf{y}_i\},\, i=1,\dots,N $ are taken. The associated set $ \mathbf{X} = \{\mathbf{x}_i\},\, i=1,\dots,N $ now plays the role of the input data to the regression function (1). Here, they are treated as \textit{latent parameters} corresponding to $ \mathbf{Y} $. As the scaling and positioning of the $ \mathbf{x}_i $ are free, the formerly crucial bandwidth parameter $ \mathbf{H} $ becomes irrelevant and we can use unit bandwidths.

The UKR regression function $ f(\cdot) $ thus can be denoted as

\begin{align*}  b_i(\mathbf{x}; \mathbf{X}) &= \frac{\mathbf{K}(\mathbf{x} - \mathbf{x}_i)} {\sum_j \mathbf{K}(\mathbf{x} - \mathbf{x}_j)}\\ \mathbf{y} = f(\mathbf{x}; \mathbf{X}) &= \sum_{i=1}^N{\mathbf{y}_i\, b_i(\mathbf{x}; \mathbf{X})} = \mathbf{Y}\mathbf{b}(\mathbf{x}; \mathbf{X}) \end{align*} (2)

where $ \mathbf{b}(\mathbf{x};\mathbf{X}) = (b_1(\mathbf{x}; \mathbf{X}), b_2(\mathbf{x}; \mathbf{X}), \dots, b_N(\mathbf{x}; \mathbf{X}))^T \in \mathbb{R}^N $ is a vector of basis functions representing the effects of the kernels parametrised by the latent parameters.
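A minimal Python sketch of the forward mapping (2), assuming a Gaussian kernel with unit bandwidth (the function names are illustrative, not taken from the original implementation):

```python
# Sketch of the UKR forward mapping of Eq. (2) with a unit-bandwidth Gaussian kernel.
import numpy as np

def ukr_basis(x, X_latent):
    """b(x; X): normalised kernel responsibilities of all latent points."""
    diff = X_latent - x[:, None]                 # (q, N) differences x - x_i
    k = np.exp(-0.5 * np.sum(diff**2, axis=0))   # unit-bandwidth Gaussian kernel
    return k / np.sum(k)                         # (N,) basis vector, sums to 1

def ukr_f(x, X_latent, Y):
    """f(x; X) = Y b(x; X): map a latent point into data space."""
    return Y @ ukr_basis(x, X_latent)
```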

The training of the UKR manifold, i.e. the adaptation of $ \mathbf{X} $, can then be realised by gradient-based minimisation of the reconstruction error

\[  R(\mathbf{X}) = \frac{1}{N}\sum_{i} \parallel \mathbf{y}_i - f(\mathbf{x}_i; \mathbf{X}) \parallel^2 = \frac{1}{N}\parallel\mathbf{Y} - \mathbf{YB}(\mathbf{X})\parallel^{2}_{F}  \] (3)

where $ \mathbf{B}(\mathbf{X}) = (\mathbf{b}(\mathbf{x}_1;\mathbf{X}), \mathbf{b}(\mathbf{x}_2;\mathbf{X}), \dots, \mathbf{b}(\mathbf{x}_N;\mathbf{X})) $.
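In code, $ \mathbf{B}(\mathbf{X}) $ and $ R(\mathbf{X}) $ might be computed as follows; this is again a sketch under the unit-bandwidth Gaussian assumption, with illustrative names:

```python
# Sketch of the reconstruction error of Eq. (3): B(X) collects the basis
# vectors b(x_i; X) as columns, and R(X) is the mean squared residual.
import numpy as np

def ukr_B(X_latent):
    """B(X) with B_ij = b_i(x_j; X) for a unit-bandwidth Gaussian kernel."""
    D2 = np.sum((X_latent[:, :, None] - X_latent[:, None, :])**2, axis=0)  # (N, N) squared distances
    K = np.exp(-0.5 * D2)                        # K_ij = K(x_i - x_j)
    return K / np.sum(K, axis=0, keepdims=True)  # normalise each column sum to 1

def ukr_R(X_latent, Y):
    """R(X) = (1/N) ||Y - Y B(X)||_F^2."""
    N = Y.shape[1]
    return np.sum((Y - Y @ ukr_B(X_latent))**2) / N
```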

To avoid the trivial minimisation solution $ R(\mathbf{X}) = 0 $, which results from moving the $ \mathbf{x}_i $ infinitely far apart, several regularisation methods exist. The most notable is UKR's ability to perform leave-one-out cross-validation very efficiently, that is, to reconstruct each $ \mathbf{y}_i $ without using $ \mathbf{y}_i $ itself. To this end, the only additional step in the calculation of $ \mathbf{B}(\mathbf{X}) $ (cf. (3)) is to zero its diagonal before normalising the column sums to 1, e.g. as in the sketch below.
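```python
# Sketch of the leave-one-out variant: zero the diagonal of the kernel
# matrix before the column normalisation, so y_i is reconstructed without itself.
import numpy as np

def ukr_B_loo(X_latent):
    D2 = np.sum((X_latent[:, :, None] - X_latent[:, None, :])**2, axis=0)
    K = np.exp(-0.5 * D2)
    np.fill_diagonal(K, 0.0)                     # exclude y_i from its own reconstruction
    return K / np.sum(K, axis=0, keepdims=True)  # re-normalise column sums to 1

def ukr_R_cv(X_latent, Y):
    """Leave-one-out reconstruction error R_cv(X)."""
    return np.sum((Y - Y @ ukr_B_loo(X_latent))**2) / Y.shape[1]
```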



Fig. 1: Progress of UKR training of a "noisy S" after n = 0, 5, 10, 20, 100 gradient steps. Depicted are the reproduction of the one-dimensional UKR latent space (black line) and the corresponding training data points, whose colours visualise the value of the 1D latent parameter.

Note that, for a preselected density kernel, the reconstruction error (3) depends only on the set of latent parameters $ \mathbf{X} $. Thus, minimising $ R(\mathbf{X}) $ w.r.t. $ \mathbf{X} $ optimises both the positioning of the latent parameters $ \mathbf{x}_i $ and the shape of the manifold at the same time. In this respect, UKR differs from many other manifold representations, which usually require alternating projection and adaptation steps.
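A deliberately naive training loop might look as follows. This is a sketch only: it reuses ukr_R_cv from the sketch above and approximates the gradient by central finite differences, whereas a practical implementation would use the analytic gradient of (3):

```python
# Naive gradient descent on the leave-one-out error R_cv (sketch only).
import numpy as np

def train_ukr(X_latent, Y, steps=100, lr=0.1, eps=1e-5):
    X = X_latent.copy()
    for _ in range(steps):
        grad = np.zeros_like(X)
        for idx in np.ndindex(*X.shape):           # finite-difference gradient of R_cv
            Xp, Xm = X.copy(), X.copy()
            Xp[idx] += eps
            Xm[idx] -= eps
            grad[idx] = (ukr_R_cv(Xp, Y) - ukr_R_cv(Xm, Y)) / (2 * eps)
        X -= lr * grad                             # gradient step on the latent parameters
    return X
```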

As gradient descent often suffers from getting stuck in poor local minima, an appropriate initialisation is important. Here, depending on the problem, e.g. PCA, Isomap or Locally Linear Embedding (LLE) are usually good choices. These methods are themselves quite powerful in uncovering low-dimensional structures in data sets. In contrast to UKR, however, PCA is restricted to linear structures, and neither Isomap nor LLE provides a smooth mapping into the data space.
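A possible PCA initialisation, as one of the choices mentioned above (the helper name is illustrative):

```python
# Sketch of a PCA initialisation: project the centred data onto its
# leading q principal directions to obtain initial latent parameters.
import numpy as np

def pca_init(Y, q=1):
    Yc = Y - Y.mean(axis=1, keepdims=True)           # centre the data
    U, S, _ = np.linalg.svd(Yc, full_matrices=False) # principal directions in U
    return U[:, :q].T @ Yc                           # (q, N) initial latent coordinates
```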

An inverse mapping $ \mathbf{x} = f^{-1}(\mathbf{y}; \mathbf{X}) $ from data space to latent space is not directly supported in UKR. Instead, one may use an orthogonal projection to define a mapping $ \mathbf{x}^\star = g(\mathbf{y};\mathbf{X}) = \arg\min_{\mathbf{x}} \parallel \mathbf{y} - f(\mathbf{x};\mathbf{X}) \parallel^2 $ which approximates $ f^{-1}(\cdot) $.
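Such a projection can be approximated numerically, for instance by starting from the latent parameter of the closest training sample and refining it with a general-purpose optimiser. The sketch below reuses ukr_f from above; scipy.optimize.minimize is one possible choice, not necessarily the one used in the original work:

```python
# Sketch of the approximate inverse mapping g(y; X) via orthogonal projection.
import numpy as np
from scipy.optimize import minimize

def ukr_project(y, X_latent, Y):
    errs = np.sum((Y - y[:, None])**2, axis=0)   # distance of y to each training sample
    x0 = X_latent[:, np.argmin(errs)]            # start from the closest latent point
    res = minimize(lambda x: np.sum((y - ukr_f(x, X_latent, Y))**2), x0)
    return res.x                                 # x* = argmin_x ||y - f(x; X)||^2
```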

UKR is presented in more detail in: Meinicke, Klanke, Memisevic & Ritter (2005) [MeinickeKlankeMemisevicRitter2005-PSF], Klanke & Ritter (2006) [KlankeRitter2006-ALK], Klanke & Ritter (2007) [KlankeRitter2007-VUK], and Klanke (2007) [Klanke2007-LMW].