[PREVIOUS] [NEXT] [UP]

Adaptive Units

This lesson will explain how Neo/NST supports the training of adaptive units. You will learn
 
  • About NST adapt mode
  • How to make the prog_unit adaptive
  • Adapt mode and operators
  • How to use the som_layer unit
  • How to use the mlp_net unit
  • How to recover when learning has diverged
  • How to separate training and testing
  • About online and epoch learning
  • How to adjust the learning rate automatically
  • How to use the llm_layer unit
  • How to initialize layer type units
  • About further adaptive NST units
    About NST adapt mode

    Adaptive units can "learn" while processing their inputs. In simple cases, their adaptation could be part of the normal execution of the unit. However, there are several reasons to separate the adaptation step of an adaptive unit from its "normal" execution step:

    1. It is often desirable to restrict any adaptation to a training phase, with no further adaptation steps afterwards, when the ready-trained unit is used.
    2. If several adaptive units work in succession, the adaptation of each unit in the chain may depend on results computed by its successors and can only be carried out after the successors have been executed.
    3. Typically, the necessary information for adapting a unit in a chain propagates from the "output end" of the chain backwards to its "input end", i.e., opposite to the execution order of the units.

    Therefore, NST allows the adaptation step of an adaptive unit to be separated from its execution step by permitting each unit to define an adapt method that is entirely separate from its exec method.

    In its normal mode, Neo will ignore any adapt methods, since for non-adaptive units the exec methods will implement everything that the unit does. When there are any adaptive units present, Neo can be put into adapt mode (use the Prefs command and toggle the button in the second line that corresponds to your current circuit).

    In adapt mode (indicated by a "+a" in the title bar of the circuit window), both the Step and Run commands will first execute all exec methods in the familiar left-to-right order. Then, Neo will execute all adapt methods, but starting with the rightmost unit (which was executed last) and then working backwards, in right-to-left order, invoking the adapt method of the first-executed unit last. Most normal (non-adaptive) units will not react to this second adapt pass, but it gives all adaptive units the chance to carry out their adaptation step(s) (implemented in their adapt methods), with the possibility to use any previously computed result values from the exec pass.
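    The two-pass sequencing can be pictured with a small sketch in plain Python (this is purely illustrative, NOT NST code; the Unit class and the unit names are assumptions made for the sketch):

```python
# Illustrative sketch (plain Python, NOT NST code) of adapt-mode
# sequencing: exec methods run left to right, then adapt methods run
# right to left over the same chain of units.

class Unit:
    def __init__(self, name):
        self.name = name

    def exec_(self, trace):          # stands in for a unit's exec method
        trace.append("exec " + self.name)

    def adapt(self, trace):          # stands in for a unit's adapt method
        trace.append("adapt " + self.name)

def step(units, adapt_mode, trace):
    for u in units:                  # exec pass: left to right
        u.exec_(trace)
    if adapt_mode:                   # adapt pass: right to left
        for u in reversed(units):
            u.adapt(trace)

trace = []
step([Unit("A"), Unit("B"), Unit("C")], adapt_mode=True, trace=trace)
print(trace)
# -> exec A, exec B, exec C, adapt C, adapt B, adapt A
```

    With adapt_mode=False, only the three exec calls would appear, which mirrors Neo's normal mode.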

    The ctrl_op unit can "convert" an exec call into an adapt call that is sent to its operand(s) (specify EXEC as trigger and ADAPT as response).

    How to make the prog_unit adaptive

    Like any other non-adaptive unit, the prog_unit will ignore any adapt calls. However, if the special jump label ADAPT: occurs in the prog_unit source code (this label may occur only once in each prog_unit source text), any adapt call will jump to the corresponding source location and cause execution of the program from there (there is an analogous jump label EXEC: to choose a non-standard entry point for the exec call; if it is not present, the exec call will always start after the last definition of a subroutine).

    Example#1 uses this feature to make a few prog_unit instances report their execution order by printing a corresponding message in response to exec and adapt calls. "Step" this example with adapt mode disabled and enabled (via Prefs) and compare! (For an explanation of what the for_op loop does, read on below.)

    Note: the prog_unit's exec_opnds() call can only be used to execute, but not to adapt, its operands.

    Adapt mode and operators

    For non-repetitive operators (such as the if_op unit), adapt mode will work backwards through those operands that were executed previously (e.g., if the condition of an if_op was not met during the previous exec pass, its operands will also be skipped for the subsequent adapt pass; this keeps exec and adapt calls always cleanly "paired" for each unit). In particular, operands of a no_op unit will also be protected from any adapt calls, even though these approach the operands from the right side.

    For repetitive operators (such as the for_op unit) one usually wishes the backward adapt pass to follow immediately after each forward exec iteration through the operands. This behavior is achieved by configuring the unit (e.g., the for_op) in the mode "exec+adapt" (usually offered as a button in the creation dialog window).

    Reconfigure the for_op of example#1 to respond in this way! Note that this triggers the desired adapt steps already during the exec pass; therefore, if configured in this way, the operator unit will not do anything extra during a subsequent adapt pass (and, at least for the purposes of a so-configured operator, Neo need not even be in adapt mode).

    For the sake of completeness we mention two additional, more rarely used modes for operator units, in which the adaptations are kept separate from the exec pass and require Neo to be in adapt mode to be carried out:

    1. Mode "adapt1":  all operands are "adapt"ed the same number of times as they have seen forward exec-calls, but their "adapts" occur in backward order. This is only seldom useful, since the iterated adapt steps can not be easily matched to the corresponding "exec" steps from the forward iterations.

    2. Mode "adapt2": as adapt1, but additionally inserts before each operand's adapt call an extra exec call to overcome the matching problem of adapt1. (Try them out on the for_op in example#1!).

    How to use the som_layer unit

    Many adaptive units are neural network units. Example#2 shows a little demonstration with the som_layer unit. A SOM layer is an unsupervised network where each node carries a "prototype" or "weight" vector. The prototype vectors are gradually adapted to approximate the distribution of "stimulus" vectors that are provided to the network during training. A characteristic feature of the SOM is that prototype vectors of neighboring (in the node lattice) nodes will try to learn similar values.

    In the case of example#2, the nodes form a 20x20 mesh and the prototype vectors (as well as the stimulus vectors) are 2-dimensional locations in the plane. The stimulus vectors are drawn from the uniform distribution in the unit square, and the SOM network is expected to unfold from an initial, disordered random state into an ordered state covering the unit square.
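    The per-stimulus adaptation rule that drives this unfolding can be sketched in plain Python (a toy re-implementation under simplifying assumptions, NOT the som_layer code; a 5x5 mesh is used to keep it small):

```python
# Illustrative sketch (plain Python, NOT NST code) of one SOM adaptation
# step: find the best-matching node, then pull every node's prototype
# toward the stimulus, weighted by a neighborhood function of its
# lattice distance to the winner.

import math, random

def som_step(proto, stimulus, eps, sigma):
    """proto maps lattice coords (i, j) -> 2-D prototype [x, y]."""
    win = min(proto, key=lambda n: (proto[n][0] - stimulus[0]) ** 2
                                   + (proto[n][1] - stimulus[1]) ** 2)
    for n, w in proto.items():
        d2 = (n[0] - win[0]) ** 2 + (n[1] - win[1]) ** 2  # lattice distance
        h = math.exp(-d2 / (2 * sigma ** 2))              # neighborhood weight
        w[0] += eps * h * (stimulus[0] - w[0])            # pull toward stimulus
        w[1] += eps * h * (stimulus[1] - w[1])

random.seed(0)
proto = {(i, j): [random.random(), random.random()]      # disordered start
         for i in range(5) for j in range(5)}
for t in range(2000):   # uniform stimuli in the unit square, shrinking sigma
    som_step(proto, [random.random(), random.random()],
             eps=0.1, sigma=2.0 * 0.999 ** t)
```

    Shrinking sigma over time plays the role of the decaying learning parameters that the prog_unit of example#2 computes per iteration.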

    The adaptive som_layer is trained inside the loop of a for_op configured in "exec+adapt" mode, as explained above. For proper operation, the som_layer unit requires, besides the "stimulus vector" (at its input 0), two learning parameters at its input 3. For each iteration step, both the stimulus vector and the learning parameters are computed with a prog_unit. In addition, the prog unit contains a public method (accessible as a "method subunit") "init" to initialize the training parameters and the som_layer unit for a new run (note that the initialization of the som_layer is made via a ctrl_opnd() call from the public method "init" kept in the prog_unit).

    The contents of the container unit "show_som" draw the som prototype vectors as a two-dimensional mesh in the unit square, using a plot_xy window named "plot" as the canvas for a draw_mesh unit. Note that the som_layer has a weight array in which each d-dimensional (here: d=2) weight vector is augmented by an additional element in position 0. This element ("radius") can be adapted to a local distance scaling value (see man page), but we don't use it here and ignore it in the visualization by specifying in the draw_mesh unit a mask "#xyM", which ignores the first element in each weight vector and uses the remaining two as x- and y-node coordinates. "Step" the example to see how the mesh unfolds (when you disable adapt mode in the for_op, the mesh will "freeze" in its initial, random configuration). To change the number of steps or the learning parameters, edit the prog_unit text accordingly.

    Example#3 shows a slightly more elaborate SOM example that finds an approximate solution to the travelling salesman problem in two dimensions. This time, the distribution of stimulus vectors is given by the to-be-visited city positions, and the topology of the net is that of a closed ring (i.e., a mesh with one side of length 1; the ring is made by requesting [with a negative dimension value] periodic boundary conditions along the longer side of the array). To add the edge that closes the mesh into a ring, we also must extract the first and the last point of the mesh array (using an extract_subvec unit) and feed them to a suitably configured draw_sym unit. Note that this example already uses variables ($i) for some parameters to make the number of cities easily changeable with the right mouse button (this is explained more fully in Sec. XX).

    How to use the mlp_net unit

    Another important type of network is the multilayer perceptron. The mlp_net unit provides an implementation of a general n-h1-h2-..hk-m multilayer perceptron, where n and m are the number of inputs and outputs, and h1..hk are the numbers of hidden nodes.

    Example#4 illustrates the use of the mlp_net to learn the prediction of a time series. In this example, the "Run" command is used to carry out the necessary iterations of the circuit. At each iteration, the wav_gen unit generates the next data point of a simple time series. The delay_unit delays the time series by 10 time steps to feed the mlp_net input with the three oldest values. The task of the mlp_net is to predict the current, new value from these past values. The multilayer perceptron is trained in a supervised fashion, i.e., each adaptation step requires feeding, at its error input (the second input connector), the difference between the desired, correct target value and its own output. During the exec pass, the mlp_net will produce its own output, and the dif_vec unit will subtract it from the desired, correct output. The error difference is fed back to the error input of the mlp_net, where it will be used during the subsequent adapt pass to carry out an adaptation step improving the weights. The plot_series unit provides a running plot of the superimposed target curve and the prediction error. The statistics unit computes the normalized root mean square error of the mlp_net output (accumulated over the entire learning history; you must reset the statistics unit to start a new evaluation), with the target value as a reference. You may wish to try out different mlp_net sizes (different numbers of hidden nodes).
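    The training scheme of this example can be sketched in plain Python (a toy n-h-1 perceptron under simplifying assumptions, NOT the mlp_net code): the forward pass plays the role of the exec call, and the fed-back error difference drives the gradient step of the adapt call:

```python
# Illustrative sketch (plain Python, NOT NST code): a tiny 3-8-1 tanh
# perceptron predicts the next value of a sine series from three past
# values; one gradient step per sample uses the error difference.

import math, random

random.seed(1)
n, h, eps = 3, 8, 0.05                 # inputs, hidden nodes, step size
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(h)]
b1 = [0.0] * h
W2 = [random.uniform(-0.5, 0.5) for _ in range(h)]
b2 = 0.0

series = [math.sin(0.3 * t) for t in range(600)]   # the "time series"
abs_err = []
for t in range(n, len(series)):
    x, target = series[t - n:t], series[t]
    # "exec" pass: forward through one tanh hidden layer
    hid = [math.tanh(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
           for j in range(h)]
    y = sum(w * a for w, a in zip(W2, hid)) + b2
    err = target - y                   # the fed-back error difference
    abs_err.append(abs(err))
    # "adapt" pass: one gradient step on the squared error
    deltas = [err * W2[j] * (1 - hid[j] ** 2) for j in range(h)]
    for j in range(h):
        for i in range(n):
            W1[j][i] += eps * deltas[j] * x[i]
        b1[j] += eps * deltas[j]
        W2[j] += eps * err * hid[j]
    b2 += eps * err
```

    The running absolute error shrinks over the run, which is the same trend the statistics unit makes visible in the circuit.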

    How to recover when learning has diverged

    If you experiment with the learning step size (in the '#' dialog of the mlp_net) you will discover that the learning will diverge if the learning step size is chosen too large. After this has happened, it may be difficult to make the circuit work again, since the units have become "polluted" by NaN values. A simple way out in this case is to remake the entire unit. Remaking a circuit is a special case of what will be explained in Sec.XX, but here is the recipe:

    (1) Visit the top level of the circuit and select the Define command.
    (2) Enter some text in the big blank field of the dialog window that appears (the text does not matter, as long as it does not contain special control directives; write, e.g., the question "Remake?").
    (3) Accept with OK.

    You have now defined a command offering to remake the entire circuit from scratch whenever you press the right mouse button at the top level of your circuit (try it out). This will help whenever a circuit has become spoiled by NaN values during some divergent computation.

    How to separate training and testing

    The previous example was meant as a simple illustration of some basic issues, but it is by no means a good example of how a learning (or any other more complicated) task should be structured in Neo/NST. Usually, you will wish to separate training and testing cleanly, using a separate data set for each. In addition, you may wish to make the network unit easily exchangeable (none of these demands is met in example#4).

    The way to achieve this is to provide training and test data in two separate named "data units" and the network in a third. The training itself is implemented in a separate "training container", which accesses the training data unit and the network unit by means of use_method units. In a similar fashion, testing is implemented in a separate "testing container". Further containers may implement additional methods, such as visualizing the current state of the network.

    Example#5 illustrates this scheme. This time, the mlp_net shall learn a binary classification task. Class 1 (yellow) consists of three Gaussian point clouds centered at every second corner of a hexagon in the 2D plane. Class 2 (blue) consists of three Gaussian point clouds centered at the remaining three hexagon corners (this is a slight generalization of the noisy XOR task, which would result if we replaced the hexagon by a square).

    To train the mlp_net, execute the unit do_train. To display the classification rate, execute do_test. Finally, executing show will visualize the data and the decision boundary of the net.

    Units train_data and test_data implement (in the form of two prog_units) two data generators that produce samples of the described class distribution (output 0 provides a (x,y) point from the class distribution, output 1 provides (1, 0) if (x,y) is from class 1, and (0, 1) otherwise).

    Note that each data generator is implemented as a class container with its main method, the generation of a data sample, carried out by executing the class container itself. Usually, if the data set is finite, e.g., from a file, the requested data sample will be identified by an index, expected to be given at the input of the class container (here, each data vector is computed with a random generator, and the index value is ignored).

    An additional method subunit (named "num") with a single output pin provides the number of data samples in the data set. This allows any requesting circuit (such as the train and test circuits) to inquire about the index range of the data samples that can be requested.

    The contents of the do_train container largely follow the already familiar scheme for training an mlp_net, using a for_op in exec+adapt mode. The only new point is the use of a first use_method unit to get the index range of available data samples. This range is used by a for_op to do a full pass through the data (one "epoch"), fetching each sample with a second use_method unit (this is done in a simple index-sequential fashion here; when the data are not already produced in random order, a further "random index shuffling" unit might be necessary!).

    The contents of the do_test container are analogous, but the for_op is now only in "exec" mode (i.e., no adaptation occurs), and the index range and the accessed data now come from the test data set. The prog_unit evaluates the error statistics by comparing the mlp_net output with the target values in the test data set. An output_window unit displays the result.

    The show container visualizes the test data distribution and the class decision boundary learnt by the mlp_net. The first part of the circuit traverses the data set, using the prog_unit to augment each data point into a 5-tuple (x, y, color, symboltype, size) (the last two elements are constant), and feeds the tuples to a plot_xy unit for display.

    The subunit iso contains a circuit to display the class boundary. To this end, the circuit evaluates the mlp_net on a grid of data points covering the data distribution. At each grid point, the sign of the difference z between the two mlp output pins indicates whether the mlp_net favors class 1 (z>0) or class 2 (z<0). A switch_output unit collects all the difference values in a single array, and the iso_contour unit computes from this array the isoline z=0, which represents the class boundary.

    About online and epoch learning

    The previous two examples adapted the mlp_net on each example individually. This is called online learning.

    In some cases it may be desirable to first make a pass through all examples (summing up each example's contribution to the gradient of the mlp error function) and only then to do a learning step for the entire "epoch". This procedure is called epoch learning.

    The mlp_net can follow both schemes. When its control connector (the bottommost input connector) is set to 1, the mlp_net will perform online learning. When its control connector is instead set to a value N>1, the mlp_net will memorize N consecutive training samples as an epoch and, when the epoch is complete, do a single epoch adaptation step. Note that during epoch learning, too, each sample must be provided by an exec call followed by an adapt call (the only difference is that every N-th adapt call does some extra work to finish the current epoch).
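    The difference between the two schemes can be sketched on a single linear weight in plain Python (purely illustrative, NOT NST code):

```python
# Illustrative sketch (plain Python, NOT NST code) of online vs. epoch
# learning on one linear weight w with model y = w*x and targets t = 2*x.

def online(w, samples, eps):
    for x, t in samples:
        w += eps * (t - w * x) * x                # one step after every sample
    return w

def epoch(w, samples, eps):
    g = sum((t - w * x) * x for x, t in samples)  # accumulate over the epoch
    return w + eps * g                            # single step per epoch

samples = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0)]    # three samples of t = 2*x
print(online(0.0, samples, 0.1), epoch(0.0, samples, 0.1))
```

    Iterated over many passes, both schemes converge to w = 2; epoch learning simply takes one accumulated step per pass instead of one step per sample.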

    How to adjust the learning rate automatically

    The previous two examples used the mlp_net with a "hand-adjusted" learning rate. Usually, it is desirable to automate learning rate control, which may include allowing an individually optimized learning rate for each neuron in the mlp_net.

    One of the best algorithms to achieve this is the RPROP learning rate control algorithm of Riedmiller and Braun. For the mlp_net, this algorithm is enabled by specifying the option %R in its creation dialog (the learning rates given in the parameter window will then be ignored).

    Using the RPROP learning rate control requires epoch learning, i.e., during adaptation you must feed the epoch size to the control input of the mlp_net (the bp_layer works analogously).
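    The core of the RPROP rule can be sketched for a single weight in plain Python (purely illustrative, NOT NST code; the parameter values are the common defaults from the literature, not necessarily NST's):

```python
# Illustrative sketch (plain Python, NOT NST code) of RPROP for one
# weight: per-weight step sizes grow while the epoch gradient keeps its
# sign and shrink when it flips; only the gradient's sign is used.

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    if grad * prev_grad > 0:            # gradient kept its sign: accelerate
        step = min(step * eta_plus, step_max)
    elif grad * prev_grad < 0:          # sign flip: shrink step, skip update
        step = max(step * eta_minus, step_min)
        grad = 0.0
    if grad > 0:                        # move only by the gradient's sign
        w -= step
    elif grad < 0:
        w += step
    return w, grad, step

# minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w, prev_g, step = 0.0, 0.0, 0.1
for _ in range(100):                    # one "epoch" per iteration
    g = 2 * (w - 3)                     # accumulated epoch gradient
    w, prev_g, step = rprop_step(w, g, prev_g, step)
print(w)                                # close to 3
```

    This also shows why epoch learning is required: the sign comparison only makes sense for a gradient accumulated over a whole epoch, not for noisy per-sample gradients.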

    Example#6 illustrates the necessary changes (it is almost identical to example#5, only the mlp_net unit has RPROP enabled, and the do_train circuit contains the additional wire to feed the epoch size to the mlp_net call).

    How to use the llm_layer unit

    As a final example (#7) of an adaptive unit, we illustrate the use of the llm_layer unit. It implements an LLM network (a Local Linear Map network). Like a multilayer perceptron, it must be trained in a supervised fashion. However, instead of using "sigmoid neurons", an LLM network represents its input-output transformation by tessellating its input space into as many domains ("Voronoi cells") as it has nodes. In each domain, the input-output mapping is then approximated by a linear mapping. Optionally, the linear mappings constructed for all cells may be "blended" to obtain an input-output transformation that is smooth also across the tessellation cell borders.

    The entire learning process is, therefore, governed by three learning rates:

    1. Eps1 for creating the input tessellation (obtained by approximating the input data distribution with a set of prototype vectors that are adjusted in a SOM-like manner, but without neighborhood interaction ("learning vector quantization"); each prototype then defines a tessellation cell as the set of input points that are closer to it than to any other prototype vector).

    2. Eps2 for adjusting, in each tessellation cell, the constant term of its linear mapping. This adjustment is made towards the average value of all data points falling into the considered tessellation cell.

    3. Eps3 for adjusting the linear part (matrix) of each cell's local mapping. This uses a standard linear perceptron learning rule with Eps3 as its learning parameter.

    A fourth learning parameter, eps_r, allows the local length scale of each tessellation cell to be adjusted (cf. the manual page for details).

    The formation of the tessellation, the adjustment of the constant terms in the linear mappings, and the adjustment of the linear parts can be done simultaneously or in successive phases of the training (with suitably chosen learning rates, e.g., with eps1 dominating in phase 1, eps2 in phase 2, and eps3 in phase 3).
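    The three adaptation rules can be sketched for a one-dimensional input and output in plain Python (a toy re-implementation under simplifying assumptions, NOT the llm_layer code; eps_r and blending are omitted):

```python
# Illustrative sketch (plain Python, NOT NST code) of the three LLM
# adaptation rules, trained here on the toy mapping t = x^2.

import random

def llm_adapt(centers, offsets, slopes, x, t, eps1, eps2, eps3):
    i = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
    dx = x - centers[i]
    err = t - (offsets[i] + slopes[i] * dx)  # winner's local linear output
    offsets[i] += eps2 * err                 # 2. constant term -> local mean
    slopes[i]  += eps3 * err * dx            # 3. linear part, delta rule
    centers[i] += eps1 * dx                  # 1. prototype, VQ step (no neighborhood)
    return err

random.seed(2)
centers, offsets, slopes = [0.1, 0.4, 0.6, 0.9], [0.0] * 4, [0.0] * 4
for _ in range(20000):                       # single combined training phase
    x = random.random()
    llm_adapt(centers, offsets, slopes, x, x * x, 0.02, 0.1, 0.2)
```

    After training, each of the four cells holds a linear piece of the approximated parabola, which is the piecewise-linear character also visible in the decision boundaries of example#7.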

    In the present example#7, these phases have been collapsed for simplicity into a single training phase, carried out by the do_train unit. Note also that the llm_layer unit, unlike the mlp_net unit, expects for its adaptation step the target value itself (not the error difference, as for the mlp_net). Therefore, the dif_vec unit from the mlp_net example is now absent. When you invoke the show unit afterwards, you will see that the resulting decision boundaries are composed of linear pieces, as would be expected from the way the LLM net approximates a mapping. Try out the effect of choosing different numbers of nodes in the llm_layer unit!

    With blending, the classification boundaries can be smoothed. However, blending requires re-adapting the output mappings to take account of their mixing. It is important to know that this re-adaptation must be done with eps1=0, since otherwise blending may totally derange any previously optimized input space tessellation. The strength of blending is controlled by a further parameter beta>0. For beta=0, blending is absent. For small values, 0<beta<<1, blending is very strong, i.e., close to simply averaging over all local mappings. With larger values of beta, each local mapping increasingly dominates in the vicinity of its tessellation cell. In the limit of very large values of beta, the original, unblended local linear maps are regained. For mathematical details, see the manual page of the llm_layer unit.
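    The qualitative effect of beta can be sketched in plain Python. The weighting function exp(-beta*d^2) used here is an illustrative assumption on our part; NST's exact blending formula is documented on the llm_layer manual page:

```python
# Hedged sketch (plain Python, NOT NST code) of blended LLM output: a
# soft distance-based mixture of all local linear maps, for scalar x.

import math

def llm_blend(centers, offsets, slopes, x, beta):
    if beta == 0.0:                     # blending disabled: winner-take-all
        i = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
        return offsets[i] + slopes[i] * (x - centers[i])
    w = [math.exp(-beta * (x - c) ** 2) for c in centers]
    return sum(wk * (o + m * (x - c))   # blend all local linear maps
               for wk, o, m, c in zip(w, offsets, slopes, centers)) / sum(w)
```

    With beta near 0 the result approaches the plain average of all local maps (very strong blending); with large beta the winner's local map dominates, recovering the unblended behavior.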

    How to initialize layer type units

    The som_layer and llm_layer (as well as the vq, rbf, and delta layers described below) allow their weight prototype vectors to be initialized to a given set of values. To this end, they must be invoked with a NST_INIT control call. This puts them into a special initialization mode in which the next exec calls are special: instead of the usual execution, each call stores the current input vector (input and target vector in the case of the llm_layer) as the prototype vector (input and output prototype vector in the case of the llm_layer) of the next node, until each node has been initialized in this way. Note that this may lead to surprises if one of these units happens to get into the scope of a NST_INIT call inadvertently. If NST_I_RND is used instead, initialization will be to random weight values (this also works for the multilayer type units). For more details, consult the manual pages.

    About further adaptive NST units

    Besides the above illustrated units we mention a few further units which you may explore via their 'e'-examples:

    The bp_layer and bp_unit: they provide a single layer of backpropagation units, or even a single unit. They are useful for building networks with special topologies, such as layer-skipping connections. They also allow hand-specified learning rates given individually for each layer or unit (the mlp_net enforces a global learning rate, or you must use RPROP instead). If you, e.g., concatenate several bp_layers, you must always connect the error difference output (out_1) of the successor unit to the error difference input (inp_1) of its predecessor to ensure proper error backpropagation (in fact, the mlp_net unit is internally composed of a number of bp_layer instances hooked up in this way).

    The som_op (in a separate folder): it allows self-organizing maps of various topologies. In addition, the SOM step is decomposed into the calls of several operand units, so that very special distance and adaptation rules can be implemented (in the som_layer, there are fixed, standard choices for these).

    The llm_net: a more recent re-implementation of the LLM network, using the class container concept. This allows the LLM to be trained with a batch of data rather than incrementally and facilitates the choice of values for some LLM learning parameters.

    The vq_layer: this offers learning vector quantization in the Cartesian product of input and output space. Convenient, e.g., for storing vector data accessed via a nearest-neighbor search.

    The rbf_layer: the input layer of a radial basis function network. Must be concatenated with a delta_layer to obtain a fully functional radial basis function network.

    The delta_layer: a layer of linear perceptron units, trained with the linear error correction rule.

    The simpler adaptive units (among them the hebb_unit and the delta_unit) are probably obsolete now that the prog_unit is available, since it allows them to be implemented with a few lines of code that are more explicit and more easily modified than the fixed units.

