Universität Bielefeld › Technische Fakultät › NI
A generative model must be able to integrate bottom-up and top-down signals, both to improve recognition rates and, in particular, to learn new objects from few training views. Furthermore, this process must be able to adapt to unseen objects and to extrapolate to unseen views of known objects by predicting (or generating) their appearance, based on the estimated intrinsic variables and other contextual information. The basic idea is as follows:
In hard vision problems, a pure feed-forward detection might not be sufficient, because the feature detectors may not be able to provide reliable results. In the picture above, the features are just small image patches, but that is only for simplicity; one could equally use other features, e.g., Gabor filter responses. Here, occlusion leads to large differences when the image is compared (thin black arrows) to the stored features in the discriminative processing. However, there might be just enough information to generate a list of possible candidates to feed into a model of the scene.

From a representation like this, it becomes feasible to generate expectations for the different hypotheses, which can then be compared to the image or to the early features in the forward processing (red dotted arrows). Of course, this requires a comparison at the level of individual features. One can imagine that the bonsai plant has been detected, and that this detection is used to explain the green pixels in features that should otherwise match the duck or the banana. Clearly, for hard scenarios an iterative approach is needed.

Basically, the question can be posed as: How can the process of learning and recognition be shaped such that the dependence on the infinitely many possible appearances (feature vectors) is minimal, and the higher-layer representation captures more of the abstract essence of the objects and of the typically applied transformations? In other words, we aim for plausible explanation instead of pure pattern matching.
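The interplay described above — a feed-forward pass that proposes candidates, followed by a top-down pass that generates expectations and discounts features already explained by other scene elements — can be sketched in a toy example. Everything below is a hypothetical illustration: the object names, the four-element "feature vectors", and the occlusion mask are invented for this sketch and are not part of any actual system described in the text.

```python
# Toy analysis-by-synthesis sketch (all templates and values are invented).
# Stored feature templates for known objects, here tiny 1-D "feature vectors".
TEMPLATES = {
    "duck":   [1.0, 0.9, 0.2, 0.1],
    "banana": [0.1, 0.9, 0.9, 0.1],
    "bonsai": [0.2, 0.3, 0.8, 1.0],
}

def bottom_up_candidates(observed, k=2):
    """Feed-forward pass: rank objects by raw squared distance to the
    observed features, ignoring any scene knowledge (may be misled by
    occlusion)."""
    def dist(name):
        return sum((o - t) ** 2 for o, t in zip(observed, TEMPLATES[name]))
    return sorted(TEMPLATES, key=dist)[:k]

def top_down_score(observed, hypothesis, explained):
    """Top-down pass: generate the expected appearance for a hypothesis
    and compare it to the image, but only on features NOT already
    explained by another scene element (e.g. the detected occluder)."""
    expectation = TEMPLATES[hypothesis]
    diffs = [(o - e) ** 2
             for o, e, done in zip(observed, expectation, explained)
             if not done]
    return sum(diffs) / max(len(diffs), 1)

# The duck is partly occluded: feature 0 shows the occluder, not the duck.
observed = [0.1, 0.9, 0.2, 0.1]
# The scene model has detected the occluder and explains feature 0 with it.
explained = [True, False, False, False]

candidates = bottom_up_candidates(observed)   # misled: banana ranks first
best = min(candidates, key=lambda h: top_down_score(observed, h, explained))
print(candidates, "->", best)                 # top-down recovers the duck
```

The point of the sketch is the division of labour: the feed-forward distance ranks the banana first because the occluded feature corrupts the duck's match, while the generative comparison, restricted to features not yet explained by the scene model, reverses that decision — a single step of the iterative hypothesize-and-verify loop the text argues for.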