USAGE:
#import qlearn
SYNOPSIS:
Implements Q-learning algorithm on a rectangular state grid or
a continuous state space.
States may be specified either by an integer index or through
a continuous state vector. In the latter case, the methods
will automatically quantize a rectangular state space region
into rectangular cells whose number and extent along each axis
can be specified with the set_range() methods.
VARIABLES:
- float eps:
- learning step size parameter (cf. above)
- float gamma:
- discount factor (should be slightly below 1)
- float Q[]:
- iStates x iActions dimensional vector holding
elements of Q matrix. For each action, there is a block of
iStates consecutive Q-values, one for each state.
For continuous Q-learning, the matrix Q[] will be 0-dimensional,
and the Q-values will instead be represented by an externally
supplied function float Q(float action,float *state)
(currently, the action space is limited to one-dimensional
or discrete).
METHODS:
- qlearn(int *states, int numactions)
- constructor
for a state grid of states[0]*states[1]*..*states[dim-1] points and numactions actions.
State grid points can be referenced either with a linear index,
or with a float vector.
- qlearn(int n0, n1, ... nk, int numactions)
- constructor
for state grid of n0*n1..*nk points and numactions actions.
State grid points can be referenced either with a linear index,
or with a float vector (cf. below). Initially, the state grid
will cover the rectangular parallelepiped [0..n0[ x [0..n1[ x .. x [0..nk[.
- qlearn(int dim, float (*Q)(float,float*), int numactions)
- constructor
for continuous Q-learning with continuous input space of dimension
dim and discrete output space of numactions. Q(a,s) is the
Q-function to be used.
- qlearn(int dim, float (*Q)(float,float*), float a1=-1,a2=1,tol=0.01)
- constructor
for continuous Q-learning with continuous input space of dimension
dim and one-dimensional continuous output space represented by the closed
interval [a1,a2]. tol specifies the accuracy with which output
actions need to be computed.
- set_range(int a, float limit1, float limit2)
- define range
for the na state grid cells along axis a. The state grid
cells will be chosen equidistant such that they cover the
half-open interval [limit1,limit2[ (if limit2<limit1, the two
will be swapped).
- set_range(int a, float *values)
- similar, but allows specifying
non-uniform state cells: vector values should provide the na+1
border values (sorted in ascending order) that delimit the na
state cells along axis a.
- int exit
- true, if current state is outside chosen range.
The value is reset to false by calling start_trial().
- void start_trial(void)
- initialize everything for a new trial.
- set_state(int state)
- set new state cell index to value state.
For cell (i0,i1..ik), the linear index state is given by
i0+...
- set_state(float *state)
- as before, but now the state cell is
defined to be the cell that contains the state vector state
(the cell boundaries have been defined with set_range() before).
When set_state() is called with a state parameter (either a linear
index or a vector) that is not within the allowed range, exit will
be set true, otherwise false.
- int action(float fBeta)
- propose a new action (from the
set {0,1,...,numactions-1}), based on
the Q matrix learnt so far. fBeta is a
parameter that governs the degree of exploration:
If fBeta >= 0:
fBeta = 0 ignores the Q matrix, fBeta = Infty
chooses the best action, and all other values interpolate
in between.
If -1 <= fBeta < 0: choose the best action with probability -fBeta,
and a random action with probability epsilon = 1+fBeta (epsilon-greedy
strategy with epsilon = 1-abs(fBeta)).
- void adapt(float reward, int action)
- perform a learning step.
Requires that the previous state was legal; the current state,
however, need not be legal (i.e., exit may be true). In that case,
there will be no expected future reward contribution.
If action is omitted, the action that was proposed by the
most recent action() call will be assumed.
- int depth()
- return reward depth.
- int depth(int k)
- set reward depth and return previous value.
k>0 is for seeking a positively rewarded goal, k<0 for
avoiding a negatively rewarded obstacle.
- int depth(int k, int mode)
- set reward depth and reward mode
explicitly. Reward mode=1 is for seeking a goal, mode=-1 for
avoiding an obstacle. Mode=0 is for both cases.
Let the discounted reward estimate q(t+1) after step t be
q(t+1) = r(t,t+1) + gamma E_s(t+1)
and let p_i=(s(t-i),a(t-i)), i=0..k-1, be the state-action pairs taken
at the most recent k steps of the current trajectory. Then we will
not only update Q_sa, but also the k-1 previous Q_p_i, i=1..k-1.
The update of Q_p_i will use the discounted reward estimate
w_i:=q(t+1)*gamma^i (i=1..k-1), and it will occur only if
w_i > Q_p_i, i.e., if the current reward estimate is too pessimistic.
We might envisage such a restriction also for i=0 (i.e., the current
estimate). The problem is that noise might then make our
reward estimates more and more optimistic.
State transitions have to be provided by the user, based on
a world model f() that is not part of the class.
Likewise, reward has to be specified by the user.
ALGORITHM:
update: Q_sa(t+1) = Q_sa(t) + eps*(r(t,t+1) + gamma*E_s(t+1) - Q_sa(t))
a(t) = arg max_a Q_sa(t)
E_s = max_a Q_sa
s(t+1) = f(s(t),a(t)) world model
r(t,t+1) = g(s(t+1)) reward model
CODE SKELETON:
class qlearn Q;
if (!Q) { .. create .. }
for (i=0; i<iNumTrials; i++) {
  ... prepare trial
  Q.start_trial();
  do {
    a = Q.action(fBeta);
    .. apply action a and compute:
       new state vector .state
       reward .fReward
    Q.set_state(state);
    Q.adapt(fReward);
  } while (!Q.exit);
}
AUXILIARY CLASS:
statemap - map feature vector to discrete state
- statemap(int d)
- create statemap for d-dimensional
space.
METHODS:
- range(int i, float min, max, int n=0)
- define limits of the range along the
i-th dimension. For n>0, optionally redefine the number of state space
cells along this direction.
- range(int i, float *vec)
- define (possibly non-uniform) cell limits
along the i-th direction. The dimensionality of vec determines the
number of cells plus one.
- int map(float *vec)
- find linear index of state space cell
containing continuous state vector vec.
- float *map(int i)
- return center of state space cell with
linear index i.
REGRESSION Q-LEARNING:
Calling the constructor with a single integer argument
Q = qlearn(dim)
indicates that Q-learning with a continuous state space and a
continuous 1d action space is desired. In this case, there will be
no Q-matrix (the Q element will have dimension 0). Instead, any
method that needs a Q-value must be passed a function pointer
to a Q-function of signature float Q(float a,float *s).
The method will then only use the Q function, its adaptation
is left to a different class. The above code skeleton becomes:
float Qfn(float a, float *s); // Q function to use (named to avoid
                              // clashing with the qlearn object Q)
for (i=0; i<iNumTrials; i++) {
  ... prepare trial
  Q.start_trial();
  do {
    a = Q.action(fEps,Qfn); // use Q function!
    .. apply action a and compute:
       new state vector .state
       reward .fReward
    Q.set_state(state); // state may be continuous vector!
    // Q.adapt(fReward); this is now replaced by:
    Q.record(fReward,Qfn);
  } while (!Q.exit);
}
Actions must be scalar continuous to allow fast maximum search.
The action() method contains a 1d line search to locate the best
action value (perhaps this might be extended to 2 or 3 dimensions).
The set_state method just sets the state vector.
The Q.record method uses fReward to add a tuple
(s, a, s', reward, predicted_reward) to an internal state memory
(MORE CONVENIENT: the ordering (s, a, predicted_reward, reward, s')).
Before the tuple is added, the method uses the Q function to
evaluate the delta d:
d = (reward + gamma*max_a' Q(s',a')) - Q(s,a)
which is the difference between a-priori and a-posteriori
predicted discounted reward. When the internal state memory
is not yet full, the new tuple will be inserted unconditionally.
Otherwise, it will only be inserted when d>dmin, where dmin
is the min delta of the tuples stored so far. If this is the
case, the new tuple will replace the so-far best tuple
(which records an experience for which the prediction is
already better). In this way, the state memory will preferably
collect those experiences for which adaptation is most
badly needed.
The state memory can be accessed any time by other methods in
order to retrain the current Q-function. To this end, the state
memory is organized as a linear array of concatenated rows,
each row consisting of a tuple (s,s',a,reward,predicted_reward) of
dimensionality 2d+3 (still assuming scalar action values).
A training algorithm can use these tuples for retraining
a Q function, using the first dim+1 entries of a row as input and the
next entry as target value. For best results, the training
algorithm might recompute the target values while Q changes
during training.
The state memory is kept in a separate array memory[].
The currently best predicted event will always be held
as the first row of this array (to facilitate comparisons).
The current number of rows in the state memory is held in
int memrows.
- float action_range(float amin, amax):
- specify search range
for action optimization routine(s).
- float action(float fEps, float Q(float,float*), float tol=0.01)
-
propose an action, using Q-function Q. The limits amin,
amax (cf. action_range()) bound the range in which maximization of the
Q-function is attempted. This function uses the epsilon-greedy strategy:
with probability fEps, a random action from the uniform distribution
on the interval [amin,amax] is used. Otherwise, the action
leading to the maximal value of Q is used. tol is an optional
parameter to limit the accuracy of the action determination
(relative to the length of the interval amax-amin).
- float expected_reward(float r, float Q(float,float*))
-
find expected reward from current action. Search range for Q function
will be the same as specified with the most recent action call.
- float record(float r, float Q(float,float*))
- as below, but
using the action proposed by the most recent action() call
- float record(float r, float a, float Q(float,float*))
- compute and
return the prediction difference d of action a and
add the tuple (s,a,rpred,r,s') to the state memory if the criterion is met.
- void memlimit(int num)
- re-allocate maximal size of memory to num
entries. Entries stored will be kept to the extent possible.
CHANGE: store proper target values instead of the d values,
and keep d for internal use in an auxiliary array (this facilitates
training of Q).
FILE
qlearn.c