USAGE:


   #import qlearn

SYNOPSIS:

Implements the Q-learning algorithm on a rectangular state grid or a continuous state space. States may be specified either by an integer index or through a continuous state vector. In the latter case, the methods automatically quantize a rectangular state-space region into rectangular cells whose number and extent along each axis can be specified with the set_range() methods.

VARIABLES:

float eps:
learning step size parameter (cf. the update formula under ALGORITHM below)
float gamma:
discount factor (should be slightly below 1)
float Q[]:
iStates x iActions dimensional vector holding the elements of the Q matrix. For each action, there is a block of iStates consecutive Q-values, one for each state. For continuous Q-learning, the matrix Q[] will be 0-dimensional, and the Q-values will instead be represented by an externally supplied function float Q(float action,float *state) (currently, the action space is limited to one-dimensional or discrete).
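
For illustration, a minimal sketch of how a single Q-value would be looked up under this layout; the helper name qvalue() is hypothetical and not part of the class:

   /* hypothetical helper illustrating the layout of Q[]: for each
      action there is a contiguous block of iStates Q-values */
   float qvalue(const float *Q, int iStates, int state, int action)
   {
      return Q[action*iStates + state];
   }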

METHODS:

qlearn(int *states, int numactions)
constructor for a state grid of states[0]*states[1]*...*states[dim-1] points and numactions actions. State grid points can be referenced either with a linear index or with a float vector.
qlearn(int n0, n1, ... nk, int numactions)
constructor for a state grid of n0*n1*...*nk points and numactions actions. State grid points can be referenced either with a linear index or with a float vector (cf. below). Initially, the state grid covers the rectangular parallelepiped [0,n0[ x [0,n1[ x ... x [0,nk[.
qlearn(int dim, float (*Q)(float,float*), int numactions)
constructor for continuous Q-learning with a continuous input space of dimension dim and a discrete output space of numactions actions. Q(a,s) is the Q-function to be used.
qlearn(int dim, float (*Q)(float,float*), float a1=-1,a2=1,tol=0.01)
constructor for continuous Q-learning with a continuous input space of dimension dim and a one-dimensional continuous output space represented by the closed interval [a1,a2]. tol specifies the accuracy with which output actions need to be computed.
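
A short hedged usage sketch of the constructor variants listed above; the numeric values and the name myQ are made-up examples:

   float myQ(float a, float *s);          /* user-supplied Q function     */

   qlearn Qgrid(20, 20, 4);               /* 20x20 state grid, 4 actions  */
   qlearn Qcont(2, myQ, 4);               /* continuous 2d input space,
                                             4 discrete actions           */
   qlearn Qreg(2, myQ, -1.0, 1.0, 0.01);  /* continuous action in [-1,1]  */
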
set_range(int a, float limit1, float limit2)
define the range for the na state grid cells along axis a. The state grid cells will be chosen equidistant such that they cover the half-open interval [limit1,limit2[ (if limit2<limit1, the two will be swapped).
set_range(int a, float *values)
similar, but allows specifying non-uniform state cells: the vector values should provide the na+1 border values (sorted in ascending order) that delimit the na state cells along axis a.
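
The cell lookup implied by set_range() might be realized as follows; this is only a sketch, and the function name find_cell is hypothetical:

   /* hypothetical sketch: return the index 0..na-1 of the half-open cell
      [values[i],values[i+1][ containing x, or -1 if x is out of range   */
   int find_cell(const float *values, int na, float x)
   {
      if (x < values[0] || x >= values[na]) return -1;
      int lo = 0, hi = na;                 /* binary search over borders */
      while (hi - lo > 1) {
         int mid = (lo + hi)/2;
         if (x < values[mid]) hi = mid; else lo = mid;
      }
      return lo;
   }
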
int exit
true if the current state is outside the chosen range. The value is reset to false by calling start_trial().
void start_trial(void)
initialize everything for a new trial.
set_state(int state)
set the state cell index to the value state. For cell (i0,i1,...,ik), the linear index state is given by i0+... (a sketch of this index computation follows below).
set_state(float *state)
as before, but now the state cell is defined to be the cell that contains the state vector state (the cell boundaries must have been defined with set_range() before).
When set_state() is called with a state parameter (either a linear index or a vector) that is not within the allowed range, exit will be set to true, otherwise to false.
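
A sketch of the index computation mentioned under set_state(int state); the full formula index = i0 + n0*(i1 + n1*(i2 + ...)) is an assumption based on the leading term i0 above, and the helper name cell_index is hypothetical:

   /* hypothetical sketch, assuming index = i0 + n0*(i1 + n1*(i2 + ...)) */
   int cell_index(const int *i, const int *n, int dim)
   {
      int index = 0;
      for (int a = dim-1; a >= 0; a--)     /* Horner scheme over the axes */
         index = index*n[a] + i[a];
      return index;
   }
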
int action(float fBeta)
propose a new action (from the set {0,1,...,numactions-1}), based on the Q matrix learnt so far. fBeta is a parameter that governs the degree of exploration:
If fBeta >= 0: fBeta = 0 ignores the Q matrix, fBeta = Infty chooses the best action, and all other values interpolate in between. If 0 <= -fBeta <= 1: choose the best action with probability -fBeta and a random action with probability 1+fBeta (epsilon-greedy strategy with epsilon = 1-abs(fBeta)).
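
A hedged sketch of the two exploration regimes; the softmax (Boltzmann) rule with inverse temperature fBeta is only one plausible way to realize the interpolation for fBeta >= 0, and choose_action is a hypothetical helper, not the class method itself:

   #include <math.h>
   #include <stdlib.h>

   /* hypothetical helper: q[] holds the numactions Q-values of the
      current state, fBeta controls exploration as described above      */
   int choose_action(const float *q, int numactions, float fBeta)
   {
      if (fBeta < 0.0) {                        /* epsilon-greedy branch */
         if ((float)rand()/RAND_MAX < -fBeta) { /* prob -fBeta: greedy   */
            int best = 0;
            for (int a = 1; a < numactions; a++)
               if (q[a] > q[best]) best = a;
            return best;
         }
         return rand() % numactions;            /* prob 1+fBeta: random  */
      }
      /* fBeta >= 0: assumed softmax; uniform for fBeta=0, greedy in the
         limit fBeta -> Infty                                            */
      float qmax = q[0];
      for (int a = 1; a < numactions; a++) if (q[a] > qmax) qmax = q[a];
      double sum = 0.0;
      for (int a = 0; a < numactions; a++) sum += exp(fBeta*(q[a]-qmax));
      double u = sum*rand()/RAND_MAX;
      for (int a = 0; a < numactions; a++) {
         u -= exp(fBeta*(q[a]-qmax));
         if (u <= 0.0) return a;
      }
      return numactions - 1;
   }
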
void adapt(float reward, int action)
perform a learning step. This requires that the previous state was legal; the current state, however, need not be legal (i.e., exit may be true). In that case, there will be no expected-future-reward contribution to reward. If action is omitted, the action that was proposed by the most recent action() call will be assumed.
int depth()
return reward depth.
int depth(int k)
set reward depth and return previous value. k>0 is for seeking a positively rewarded goal, k<0 for avoiding a negatively rewarded obstacle.
int depth(int k, int mode)
set reward depth and reward mode explicitly. Reward mode=1 is for seeking a goal, mode=-1 for avoiding an obstacle. Mode=0 is for both cases.
Let the discounted reward estimate q(t+1) after step t be

            q(t+1) = r(t,t+1) + gamma*E_s(t+1)

and let p_i=(s(t-i),a(t-i)), i=0..k-1, be the state-action pairs taken at the most recent k steps of the current trajectory. Then not only Q_sa will be updated, but also the k-1 previous Q_p_i, i=1..k-1. The update of Q_p_i uses the discounted reward estimate w_i := q(t+1)*gamma^i (i=1..k-1) and occurs only if w_i > Q_p_i, i.e., if the stored reward estimate is too pessimistic. One might envisage such a restriction also for i=0 (i.e., the current estimate); the problem is that noise could then make the reward estimates more and more optimistic.

State transitions have to be provided by the user, based on a world model f() that is not part of the class. Likewise, the reward has to be specified by the user.
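
A hedged sketch of the depth-k backup described above, for the goal-seeking case (k > 0). It assumes that the most recent k state-action pairs of the trial are available as Q-matrix indices p[0..k-1] (newest first) and that the earlier pairs are updated with the same step size eps; both are assumptions, and depth_update is a hypothetical helper:

   /* hypothetical sketch: q_new is the discounted reward estimate q(t+1) */
   void depth_update(float *Q, const int *p, int k,
                     float q_new, float gamma, float eps)
   {
      Q[p[0]] += eps*(q_new - Q[p[0]]);    /* ordinary update for i=0    */
      float w = q_new;
      for (int i = 1; i < k; i++) {
         w *= gamma;                       /* w_i = q(t+1)*gamma^i       */
         if (w > Q[p[i]])                  /* only if the stored value   */
            Q[p[i]] += eps*(w - Q[p[i]]);  /* is too pessimistic         */
      }
   }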

ALGORITHM:

update:   Q_sa(t+1) = Q_sa(t) + eps*(r(t,t+1) + gamma*E_s(t+1) - Q_sa(t))

             a(t)     = arg max_a Q_sa(t)

             E_s      = max_a Q_sa

             s(t+1)   = f(s(t),a(t))      (world model)
             r(t,t+1) = g(s(t+1))         (reward model)
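
For reference, a compact stand-alone sketch of one update step following the formulas above; the layout of Q[] is the one described under VARIABLES, and the function name q_step is hypothetical:

   /* one Q-learning step: state s, action a, successor state s1, reward r */
   void q_step(float *Q, int iStates, int numactions,
               int s, int a, int s1, float r, float eps, float gamma)
   {
      float E = Q[s1];                          /* E_s(t+1) = max_b Q_s1,b */
      for (int b = 1; b < numactions; b++)
         if (Q[b*iStates + s1] > E) E = Q[b*iStates + s1];
      Q[a*iStates + s] += eps*(r + gamma*E - Q[a*iStates + s]);
   }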

CODE SKELETON:


   class qlearn Q;

   if (!Q) { .. create .. }
   
   for (i=0; i<iNumTrials; i++) {
      ... prepare trial
      Q.start_trial();
      do {
         a = Q.action(fBeta);
         .. apply action a and compute:
            new state vector .state
            reward .fReward
         Q.set_state(state);
         Q.adapt(fReward);
      } while (!Q.exit);
   }

     

AUXILIARY CLASS:

statemap - map a feature vector to a discrete state
statemap(int d)
create a statemap for a d-dimensional space.

METHODS:

range(int i, float min, max, int n=0)
define the limits of the range along the i-th dimension. For n>0, optionally redefine the number of state-space cells along that direction.
range(int i, float *vec)
define (possibly non-uniform) cell limits along the i-th direction. The number of entries of vec determines the number of cells plus one.
int map(float *vec)
find linear index of state space cell containing continuous state vector vec.
float *map(int i)
return center of state space cell with linear index i.
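
A short hedged usage sketch of statemap, assuming the interface listed above (the grid sizes and limits are made-up example values):

   statemap S(2);                  /* 2-dimensional feature space         */
   S.range(0, -1.0, 1.0, 10);      /* 10 cells covering [-1,1[ on axis 0  */
   S.range(1,  0.0, 5.0, 20);      /* 20 cells covering [0,5[  on axis 1  */

   float vec[2] = {0.3, 2.7};
   int cell = S.map(vec);          /* linear index of the containing cell */
   float *center = S.map(cell);    /* center coordinates of that cell     */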

REGRESSION Q-LEARNING:

Calling the constructor with a single integer argument

    Q = qlearn(dim)

indicates that Q-learning with a continuous state space and a continuous 1d action space is desired. In this case, there will be no Q matrix (the Q element will have dimension 0). Instead, any method that needs a Q-value must be passed a function pointer to a Q-function with the signature float Q(float a,float *s). The methods will then only use this Q function; its adaptation is left to a different class. The above code skeleton becomes:

    float Qfun(float a, float *s); // Q function to use

    for (i=0; i<iNumTrials; i++) {
       ... prepare trial
       Q.start_trial();
       do {
          a = Q.action(fEps,Qfun); // use the Q function!
          .. apply action a and compute:
             new state vector .state
             reward .fReward
          Q.set_state(state); // state may be a continuous vector!
          // Q.adapt(fReward); is now replaced by:
          Q.record(fReward,Qfun);
       } while (!Q.exit);
    }

Actions must be scalar and continuous to allow a fast maximum search. The action() method contains a 1d line search to locate the best action value (perhaps this might be extended to 2 or 3 dimensions). The set_state() method just sets the state vector. The record() method uses fReward to add a tuple

    (s, a, s', reward, predicted_reward)

to an internal state memory. MORE CONVENIENT: store the tuple as (s, a, predicted_reward, reward, s'). Before the tuple is added, the method uses the Q function to evaluate the delta d:

    d = (reward + gamma*max_a' Q(s',a')) - Q(s,a)

which is the difference between the a-priori and a-posteriori predicted discounted reward. When the internal state memory is not yet full, the new tuple will be inserted unconditionally. Otherwise, it will only be inserted when d > dmin, where dmin is the smallest delta of the tuples stored so far. In this case, the new tuple will replace the so-far best-predicted tuple (which records an experience for which the prediction is already good). In this way, the state memory will preferably collect those experiences for which adaptation is most badly needed.

The state memory can be accessed at any time by other methods in order to retrain the current Q-function. To this end, the state memory is organized as a linear array of concatenated rows, each row consisting of a tuple (s, a, predicted_reward, reward, s') of dimensionality 2d+3 (still assuming scalar action values). A training algorithm can use these tuples for retraining a Q function, using the first dim+1 columns as inputs and the next column as target values. For best results, the training algorithm might recompute the target values while Q changes during training. The state memory is kept in a separate array memory[]. The currently best-predicted event will always be held as the first row of this array (to facilitate comparisons). The current number of rows in the state memory is held in int memrows.
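
A hedged sketch of this insertion policy. The row length 2*dim+3, the arrays memory[] and aux[] (cf. the CHANGE note below), and the counter memrows follow the text; maxrows and the helper name insert_tuple are hypothetical:

   /* hypothetical sketch: aux[i] holds the delta d of memory row i;
      row 0 is kept as the best-predicted (smallest-d) tuple            */
   void insert_tuple(float *memory, float *aux, int &memrows, int maxrows,
                     int rowlen, const float *tuple, float d)
   {
      int row;
      if (memrows < maxrows) row = memrows++;  /* not full: append       */
      else if (d > aux[0])   row = 0;          /* replace best-predicted */
      else                   return;           /* prediction good enough */
      for (int j = 0; j < rowlen; j++)
         memory[row*rowlen + j] = tuple[j];
      aux[row] = d;
      int imin = 0;                            /* restore the invariant  */
      for (int i = 1; i < memrows; i++)        /* that row 0 holds the   */
         if (aux[i] < aux[imin]) imin = i;     /* smallest delta         */
      if (imin != 0) {
         for (int j = 0; j < rowlen; j++) {
            float t = memory[j];
            memory[j] = memory[imin*rowlen + j];
            memory[imin*rowlen + j] = t;
         }
         float t = aux[0]; aux[0] = aux[imin]; aux[imin] = t;
      }
   }
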
float action_range(float amin, amax)
specify the search range for the action optimization routine(s).
float action(float fEps, float Q(float,float*), float tol=0.01)
propose an action, using the Q-function Q. The range [amin,amax] set with action_range() limits the interval in which maximization of the Q-function is attempted. This function uses the epsilon-greedy strategy: with probability fEps, a random action drawn uniformly from the interval [amin,amax] is used; otherwise, the action leading to the maximal value of Q is used. tol is an optional parameter that limits the accuracy of the action determination (relative to the length of the interval amax-amin).
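
A hedged sketch of the maximum search inside action(); a simple iterated grid refinement over [amin,amax] is assumed here (the actual routine may use a different 1d line search), and best_action is a hypothetical helper name:

   #include <stdlib.h>

   /* hypothetical sketch: epsilon-greedy continuous action selection by
      repeated grid refinement of Q(a,s) over [amin,amax] down to tol    */
   float best_action(float (*Q)(float,float*), float *s,
                     float amin, float amax, float fEps, float tol)
   {
      if ((float)rand()/RAND_MAX < fEps)           /* explore            */
         return amin + (amax-amin)*rand()/RAND_MAX;
      float lo = amin, hi = amax;
      while (hi - lo > tol*(amax - amin)) {        /* refine around max  */
         const int N = 10;                         /* samples per pass   */
         float abest = lo, qbest = Q(lo, s);
         for (int i = 1; i <= N; i++) {
            float a = lo + (hi - lo)*i/N;
            float q = Q(a, s);
            if (q > qbest) { qbest = q; abest = a; }
         }
         float step = (hi - lo)/N;
         lo = abest - step; if (lo < amin) lo = amin;
         hi = abest + step; if (hi > amax) hi = amax;
      }
      return 0.5*(lo + hi);                        /* exploit: best a    */
   }
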
float expected_reward(float r, float Q(float,float*))
find the expected reward for the current action. The search range for the Q function will be the same as specified with the most recent action() call.
float record(float r, float Q(float,float*))
float record(float r, float a, float Q(float,float*))
compute and return the prediction difference d for the most recent action and add the tuple (s, a, rpred, r, s') to the state memory if the insertion criterion described above is met.
void memlimit(int num)
re-allocate the maximal size of the state memory to num entries. Stored entries will be kept to the extent possible.
CHANGE: store proper target values instead of the d values, and keep d for internal use in an auxiliary array (this facilitates training of Q).

FILE

qlearn.c