A Reinforcement Learning Tutorial for
Game Strategy Acquisition
Thomas Philip Runarsson
March 17, 2005
T.P. Runarsson (tpr@hi.is)
version 0.0.3 March 17, 2005
Introduction
This tutorial is based on:
• the book by Richard S. Sutton and Andrew G. Barto,
Reinforcement Learning: An Introduction. MIT Press, 1998.
Also available on-line here:
http://www.cs.ualberta.ca/~sutton/book/the-book.html
• the computational intelligence course taught at the University
of Iceland.
The n-armed Bandit Problem
You must choose between n different options or actions. Having
performed an action a you receive a reward r. The reward is
sampled from a stationary probability distribution that depends
on the action taken.
Let Q*(a) denote the mean (expected) reward received when
action a is selected; this is also known as the value of action a.
Q_t(a) is the estimated value at the t-th play, that is

    Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a ≈ E[r]

where by the t-th play action a has been chosen k_a times prior
to t, yielding rewards r_1, r_2, ..., r_{k_a}.

[Figure: average reward and % optimal action versus plays
(0-1000) for ε = 0, ε = 0.01 and ε = 0.1]

With ε = 0.1 the optimal action is never selected more than
about 91% of the time, but it is found earlier than with
ε = 0.01. However, in the end ε = 0.01 will perform better;
it may in this case be a good idea to start with a large ε and
decrease it over time. The greedy method (ε = 0) performs
significantly worse and often gets stuck performing suboptimal
actions. On the other hand, if the reward variance were zero the
greedy method would know the true value of each action after
trying it once and would work better. The noisier the reward the
more exploration is needed.
Exploration versus exploitation
• greedy action: is the action whose estimated value Q(a) at
any time is the greatest. In this case you are exploiting your
current knowledge.
• non-greedy action: when choosing a non-greedy action you
improve your estimate of the value of non-greedy actions and
we say you are exploring.
Consider the following experiment, repeated 2000 times, for 1000
plays of the “n-armed bandit”:
n = 10; % number of arms
epsilon = 0.01; % probability of selecting a purely random action
for i = 1:2000, % number of experiments
Qstar = randn(1,n); % mean reward
[dummy,astar] = max(Qstar); % optimal action
Q = zeros(1,n); % initial estimated return
ka = zeros(1,n); % number of times action taken
for j = 1:1000, % number of games played
if (rand(1) < epsilon), % epsilon greedy policy
a = ceil(rand(1)*n); % purely random move
else,
Qm = Q./max(ka,ones(1,n)); % mean return (avoid divide by zero)
K = find(Qm == max(Qm)); % find actions that give maximum return
k = ceil(rand(1)*length(K)); % break ties randomly
a = K(k); % greedy move
end
r = Qstar(a) + randn(1); % reward for this action
Q(a) = Q(a) + r; % add to rewards obtained for this action
ka(a) = ka(a) + 1; % increment counter for this action
R(i,j) = r; % keep track of reward obtained for this game
O(i,j) = (astar == a); % was this an optimal move?
end
end
Incremental Implementation
The average of k+1 rewards can be computed by

    Q_{k+1} = 1/(k+1) sum_{i=1}^{k+1} r_i
            = 1/(k+1) ( r_{k+1} + sum_{i=1}^{k} r_i )
            = 1/(k+1) ( r_{k+1} + k Q_k + Q_k − Q_k )
            = 1/(k+1) ( r_{k+1} + (k+1) Q_k − Q_k )
            = Q_k + 1/(k+1) ( r_{k+1} − Q_k )

This update rule is commonly used in reinforcement learning. The
general form is:

    NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ]

Experiment using a fixed step size
The previous experiment is repeated using the incremental update
rule with α = 0.1 and ε = 0.1.

alpha = 0.1; % fixed step size
n = 10; % number of arms
epsilon = 0.1; % probability of selecting a purely random action
for i = 1:2000, % number of experiments
  Qstar = randn(1,n); % mean reward
  [dummy,astar] = max(Qstar); % optimal action
  Q = zeros(1,n); % initial estimated return
  ka = zeros(1,n); % number of times action taken
  for j = 1:1000, % number of games played
    if (rand(1) < epsilon), % epsilon greedy policy
      a = ceil(rand(1)*n); % purely random move
    else,
      K = find(Q == max(Q)); % find actions that give maximum return
      k = ceil(rand(1)*length(K)); % break ties randomly
      a = K(k); % greedy move
    end
    ka(a) = ka(a) + 1; % increment counter for this action
    r = Qstar(a) + randn(1); % reward for this action
    % alpha = 1/ka(a); % uncomment this line for sample average
    Q(a) = Q(a) + alpha*(r - Q(a)); % incremental implementation
    R(i,j) = r; % keep track of reward obtained for this game
    O(i,j) = (astar == a); % was this an optimal move?
  end
end
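As a quick sanity check (not part of the original tutorial), the incremental rule with step size 1/(k+1) can be verified against the batch average in a few lines of Python; the variable names here are my own:

```python
import random

random.seed(0)
rewards = [random.gauss(0.0, 1.0) for _ in range(1000)]

# NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate),
# with StepSize = 1/(k+1), reproduces the sample average exactly
Q = 0.0
for k, r in enumerate(rewards):
    Q += (r - Q) / (k + 1)

batch_mean = sum(rewards) / len(rewards)
assert abs(Q - batch_mean) < 1e-9
```

Because the step size shrinks as 1/(k+1), early and late rewards end up with equal weight, which is exactly the sample average.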
Tracking a Non-stationary Problem
When the incremental update rule is modified to

    Q_{k+1} = Q_k + α ( r_{k+1} − Q_k )

where the step size α, 0 < α ≤ 1, is constant, the result is a
weighted average of the past rewards and the initial estimate
Q_0, that is:

    Q_k = Q_{k−1} + α ( r_k − Q_{k−1} )
        = α r_k + (1−α) Q_{k−1}
        = α r_k + (1−α) α r_{k−1} + (1−α)^2 Q_{k−2}
        = α r_k + (1−α) α r_{k−1} + (1−α)^2 α r_{k−2} + ...
          + (1−α)^{k−1} α r_1 + (1−α)^k Q_0
        = (1−α)^k Q_0 + sum_{i=1}^{k} α (1−α)^{k−i} r_i

This is an exponential recency-weighted average, since
(1−α)^k + sum_{i=1}^{k} α (1−α)^{k−i} = 1. Sometimes it may be
convenient to vary the step size α_k(a) after the k-th selection
of action a. The special case α_k(a) = 1/k results in the sample-
average method. Two conditions must hold:
sum_{k=1}^{∞} α_k(a) = ∞ (so the steps are large enough to
overcome initial conditions or random fluctuations) and
sum_{k=1}^{∞} α_k(a)^2 < ∞ (to assure convergence).

[Figure: the red lines show the results using the incremental
update with fixed step size α, for ε = 0, 0.01 and 0.1]
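The exponential recency-weighted average above is easy to verify numerically; the following Python sketch (my own, for illustration) checks that the weights sum to one and that the closed form matches the recursion:

```python
import random

random.seed(1)
alpha, k = 0.1, 50
rewards = [random.random() for _ in range(k)]
Q0 = 0.5  # initial estimate Q_0

# weight on Q_0 and weights on each reward r_i
w0 = (1 - alpha) ** k
wr = [alpha * (1 - alpha) ** (k - i) for i in range(1, k + 1)]
assert abs(w0 + sum(wr) - 1.0) < 1e-12  # weights sum to one
assert wr[-1] > wr[0]                   # recent rewards weigh more

# recursion Q_k = Q_{k-1} + alpha (r_k - Q_{k-1}) ...
Q = Q0
for r in rewards:
    Q += alpha * (r - Q)

# ... matches the closed form (1-a)^k Q_0 + sum_i a (1-a)^(k-i) r_i
closed = w0 * Q0 + sum(w * r for w, r in zip(wr, rewards))
assert abs(Q - closed) < 1e-12
```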
Agent - Environment Interface
Reinforcement learning (RL) is learning what to do – mapping
situations to actions – so as to maximize a numerical reward
signal. The learner is not instructed which actions to take, but
must discover which actions yield the most reward by trying
them. Actions may not only aﬀect the immediate rewards but
also the next situation and consequently all subsequent rewards.
The elements of RL are:
- Agent – the learner and decision maker.
- Environment – everything outside the agent (the thing it
interacts with).
- Rewards – special numerical values r_t ∈ R that the
environment gives rise to and that the agent tries to maximize
over time.
- State – a description of the situation, a representation of the
environment's state s_t.
- Action – the feasible set of actions a_t ∈ A(s_t), available in
state s_t.
- Policy – an agent maps states to probabilities of selecting the
possible actions; that is, π_t(s,a) is the probability that a_t = a
if s_t = s. An alternative notation is a_t = π_t(s_t) when
action a_t is selected given state s_t and policy π_t.
Agent - Environment Interaction

[Figure: the agent observes state s_t and reward r_t and selects
action a_t = π(s_t); the environment responds with reward r_{t+1}
and next state s_{t+1}]
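The interaction loop can be sketched in a few lines of Python; the policy and one-step dynamics below are made-up placeholders, not part of the tutorial:

```python
import random

random.seed(0)

def policy(state):
    # a hypothetical policy pi: choose an action given the state
    return random.choice([0, 1])

def environment(state, action):
    # hypothetical one-step dynamics: return next state and reward
    next_state = (state + action) % 5
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

# the agent-environment loop: observe s_t, act a_t = pi(s_t),
# receive r_{t+1} and s_{t+1}, repeat
s = 0
rewards = []
for t in range(100):
    a = policy(s)
    s, r = environment(s, a)
    rewards.append(r)

assert len(rewards) == 100
assert all(r in (0.0, 1.0) for r in rewards)
```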
Rewards deﬁne the goal of learning
In reinforcement learning, the purpose or goal of the agent is
formalized in terms of a special reward signal passing from the
environment to the agent. The agent’s goal is to maximize
the cumulative reward in the long run, not just the immediate
rewards.
Examples of how reward signals are used to formalize the idea of
a goal:
• Robot learning to walk: a reward is given proportional to the
forward motion of the robot.
• Robot escapes a maze: give no reward until it escapes. To
ﬁnd the shortest path out of the maze give a negative reward
at each discrete time step.
• Robot cleaning up a room: reward given for each piece of
rubbish picked up.
• Playing chess or another game: give a reward of +1 for winning,
−1 for losing, and 0 for a draw and for all nonterminal positions.
One could give rewards for achieving subgoals, but then there
is the danger of the agent ﬁnding ways of achieving these
subgoals without achieving the real goal – winning.
Returns
The agent’s goal is to maximize the reward it receives in the long
run, that is the return
    r_{t+1} + r_{t+2} + r_{t+3} + ...

More precisely, we aim to maximize the mean or expected return
E[R_t], where R_t is some function of the reward sequence above.
In the simplest case the return is

    R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T
where T is the final time step. This makes sense for tasks
like games that have terminal states. A sequence of agent-
environment interactions of this type is known as an episode,
and such tasks are known as episodic tasks.
Continuous tasks go on continually without limit. In this case
the return may also have no limit, for example if T =∞ and
the reward is +1 in each time step. To overcome this problem
we introduce discounting, that is a discounted return:
    R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = sum_{k=0}^{∞} γ^k r_{t+k+1}

where the parameter 0 ≤ γ ≤ 1 is called the discount rate.
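For a constant reward of +1 per step the discounted return stays bounded by 1/(1−γ); a small Python check of my own, for illustration:

```python
gamma = 0.9
rewards = [1.0] * 50  # r_{t+1}, r_{t+2}, ..., r_{t+50}

# R_t = sum_k gamma^k r_{t+k+1}
R = sum(gamma ** k * r for k, r in enumerate(rewards))

# geometric series: the truncated sum is (1 - gamma^50) / (1 - gamma)
expected = (1 - gamma ** 50) / (1 - gamma)
assert abs(R - expected) < 1e-12

# the infinite-horizon return would be 1/(1 - gamma)
assert R < 1.0 / (1.0 - gamma)
```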
Markov Decision Processes
A reinforcement learning task that satisfies the Markov property²
is called a Markov decision process (MDP). If the state and
action spaces are finite, it is called a finite MDP; finite MDPs
are all you need to understand 90% of modern reinforcement
learning.

A finite MDP is defined by its state and action sets and by the
one-step dynamics of the environment. Given any state s and
action a, the probability of each possible next state, s', is

    P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }

also known as the transition probabilities. Similarly, the
expected value of the next reward is

    R^a_{ss'} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]

The quantities P^a_{ss'} and R^a_{ss'} completely specify the
most important aspects of the dynamics of a finite MDP.

² The Markov property means that the present state alone carries
all the available information about the future behaviour of the
process.

Action-Value Function for Policy π
Deciding which action to take using the value function V^π
requires a one-step-ahead search to yield the long-term optimal
action. This is not always possible (for example, a robot cannot
take the action of jumping off a cliff just to see what the value
of the resulting state is), and so action-values must be used.

It is also possible to define the value of taking action a in state s
under a policy π as the expected return starting from s, taking
action a, and thereafter following policy π; formally

    Q^π(s,a) = E_π[ R_t | s_t = s, a_t = a ]
             = E_π[ sum_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]
State-Value Function for Policy π
Recall that a policy, π, is a mapping from each state s ∈ S
and action a ∈ A(s) to the probability π(s,a) of taking
action a when in state s. The value of a state s under a policy
π is the expected return when starting in s and following π
thereafter, or formally for MDPs,

    V^π(s) = E_π[ R_t | s_t = s ] = E_π[ sum_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

where E_π denotes the expected value given that the agent follows
policy π. Note that the value of a terminal state is zero, since in
this case we set 0 = r_{T+1} = r_{T+2} = ..., and we can then
also set γ = 1.

Afterstate-Value Function for Policy π
Afterstates are states immediately after an agent takes some
action. Values for these afterstates are called afterstate values.
Afterstates are useful when we have knowledge of an initial part
of the environment's dynamics but not the full dynamics. For
example, in chess we know what the resulting position will be
for each possible move, but not how the opponent will reply.
Learning afterstate values for games should reduce learning time
because there may be many state-action (position-move) pairs
that produce the same resulting state (position). An afterstate-
value function is therefore neither a state-value function nor an
action-value function.
Optimal Policy and Optimal Value Function
A policy π is defined to be better than or equal to a policy π' if
its expected return is greater than or equal to that of π' for all
states; in other words, V^π(s) ≥ V^{π'}(s), ∀s ∈ S.

There is also at least one policy that is better than or equal
to all other policies. This is an optimal policy, π*. Optimal
policies share the same state-value function, called the optimal
state-value function and denoted V*, and also share the same
optimal action-value function, denoted Q*.

Any policy that is greedy with respect to the optimal value
function V* or Q* is an optimal policy. Note that

    V*(s) = max_{a ∈ A(s)} Q*(s,a)
Iterative Policy Evaluation
If R^a_{ss'} and P^a_{ss'} are known, then the value function
can be found using policy evaluation as described in the dynamic
programming literature. That is,

    V^π(s) = E_π[ R_t | s_t = s ]
           = E_π[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s_t = s ]
           = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
    V^π(s) = sum_a π(s,a) sum_{s'} P^a_{ss'} ( R^a_{ss'} + γ V^π(s') )

By initializing all values V(s), ∀s ∈ S, to some value, say zero,
and then looping repeatedly (until convergence) through all states
and updating V(s) using the equation above, it is possible to
estimate V^π for a given policy π.

However, we are not interested in just some policy, we want our
policy to improve with time.
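To make the sweep concrete, here is iterative policy evaluation on a tiny two-state MDP of my own invention (a Python sketch; the tutorial's own experiments use MATLAB/Octave):

```python
# A made-up finite MDP: two states, two actions, deterministic
# transitions P[s][a] -> s', rewards R[s][a], and a uniform random
# policy pi(s,a) = 0.5 for both actions.
gamma = 0.9
states = [0, 1]
actions = [0, 1]
P = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
pi = 0.5  # probability of selecting each action

V = {s: 0.0 for s in states}
for sweep in range(1000):
    delta = 0.0
    for s in states:
        # V(s) <- sum_a pi(s,a) [ R(s,a) + gamma V(s') ]
        v = sum(pi * (R[s][a] + gamma * V[P[s][a]]) for a in actions)
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-10:
        break

# at convergence the estimate satisfies the Bellman equation for pi
for s in states:
    v = sum(pi * (R[s][a] + gamma * V[P[s][a]]) for a in actions)
    assert abs(v - V[s]) < 1e-9
```

The loop sweeps the states in place, which still converges because each backup is a contraction in γ.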
Policy Improvement
The policy improvement theorem states: let π and π' be any
pair of deterministic policies such that, for all s ∈ S,

    Q^π(s, π'(s)) ≥ V^π(s).

Then the policy π' must be as good as, or better than, π. That
is, it must obtain greater or equal expected return from all states
s ∈ S:

    V^{π'}(s) ≥ V^π(s).

The process of making a new policy that improves on an original
policy, by making it greedy or nearly greedy with respect to the
value function of the original policy, is called policy improvement.
This is a direct result of the Bellman optimality criterion, that is

    V^π(s) = sum_a π(s,a) sum_{s'} P^a_{ss'} ( R^a_{ss'} + γ V^π(s') )
    V*(s)  = max_a sum_{s'} P^a_{ss'} ( R^a_{ss'} + γ V*(s') )

Temporal Difference Learning
Recall the iterative policy evaluation above, that is

    V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
    V^π(s) = sum_a π(s,a) sum_{s'} P^a_{ss'} ( R^a_{ss'} + γ V^π(s') )

In dynamic programming a complete model of the environment is
given; that is, P^a_{ss'} and R^a_{ss'} are known. In temporal
difference (TD) learning the model is unknown. For this reason it
is only possible to use the first equation. Using the incremental
implementation and the current estimate V (instead of the true
V^π, which would require the environment to be known), the
update becomes

    V(s_t) ← E[ r_{t+1} + γ V(s_{t+1}) ]
    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

using the incremental update formula. If α is constant this is the
recency-weighted average, which is beneficial for a non-stationary
problem.
Generalized Policy Iteration
The term generalized policy iteration (GPI) is used to refer
to the general idea of interacting policy evaluation and policy
improvement processes. Almost all reinforcement learning
methods can be described as GPI – all have identifiable policies
and value functions, with the policy being improved with respect
to the value function and the value function being driven toward
the value function for the policy.

[Figure: GPI – starting from an initial π and V, evaluation drives
V toward V^π while π = greedy(V) drives the policy toward the
greedy policy; the two processes converge to π* and V*]
Tic-Tac-Toe Example
Assign the following numbers to a tic-tac-toe board:

    8 1 6
    3 5 7
    4 9 2

This is a magic square, where the numbers add to 15 along any
row, column, or diagonal. The first player begins the game by
putting an X in one of the squares on the board. This is the
same as choosing a number from 1 to 9. The second player now
puts an O in one of the remaining squares. This is the same as
choosing one of the remaining numbers. Play proceeds like this
until one player has marked all the squares in a row, column, or
diagonal. This is the same as one player having three numbers
which add to 15.

Although a simple problem, a classical “minimax” solution from
game theory is not correct here because it assumes a particular
way of playing by the opponent (for example, it will never reach
a game state from which it could lose, even though it might
always win from that state because of imperfect play by the
opponent).

Dynamic programming can compute an optimal solution for any
opponent, but requires a complete specification of that opponent
(P^a_{ss'} and R^a_{ss'}).

Temporal Difference Learning and Tic-Tac-Toe
When applying TD learning to this task, there are a number of
issues that must first be considered:

• State representation – clearly tic-tac-toe has many symmetries
that could be considered: up-down, left-right, and along the
diagonals. If all positions are reflected to a standard form,
then the number of values to be learned can be reduced by a
factor of eight. This reduces the memory requirements and the
number of games needed. However, it may be counterproductive
when playing against an imperfect opponent (one that does not
play symmetrically).

• Value function – it is possible to learn using two tables, one
that uses the standard form and another that learns all distinct
states. This would require more memory and time per move,
but it would speed up learning in terms of the number of games
played.

• Exploration and learning – if the value function is updated
only when a greedy move is performed, what is learned is the
expected return from each position when playing optimally
from then on. But this is actually not the case, since
exploratory moves are made occasionally. When updating
continuously, the player learns to take these occasional
explorations into account – it learns the expected return given
that exploratory moves are taken. The best moves given
exploration are different from the best moves without
exploration. If we continue to explore, we expect the player that
continuously updates its values to win more games.

• Opponent – could be human, a program found on the
Internet, a player that selects only random legal moves, or some
combination of these. Alternatively, we could use self-play,
where the TD learning algorithm plays against itself or other
TD learners.

Self Play
• Self-play will result in play that is optimal against itself.
Random exploratory moves will cause all positions to be
encountered (at least occasionally).

• There is a notion that a self-play player will play in a
universally ‘optimal’ way.

• Playing “optimally” really only has meaning with respect to a
particular opponent.

• A self-play player may learn to play better against many
opponents.

• A self-play player will play a position which forces a win
(unless making an exploratory move).

• A self-play player will prefer moves from which it is unlikely
to lose, even when it occasionally makes random exploratory
moves.

• A reinforcement player said to play against itself is not strictly
doing so. Rather it plays against a copy of itself, adjusted to
learn after first-player moves, learning separately about each
individual position.
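The magic-square encoding of tic-tac-toe can be verified directly; this Python snippet (my own check, not from the tutorial) confirms that the eight board lines are exactly the eight 3-number subsets of 1..9 that sum to 15:

```python
from itertools import combinations

board = [[8, 1, 6],
         [3, 5, 7],
         [4, 9, 2]]

# rows, columns, and both diagonals
lines = [list(row) for row in board]
lines += [[board[r][c] for r in range(3)] for c in range(3)]
lines += [[board[i][i] for i in range(3)],
          [board[i][2 - i] for i in range(3)]]
assert all(sum(line) == 15 for line in lines)  # magic square property

# every 3-subset of 1..9 summing to 15 is a line of the square, so
# "three of my numbers add to 15" is exactly "three marks in a row"
triples15 = {frozenset(t) for t in combinations(range(1, 10), 3)
             if sum(t) == 15}
assert triples15 == {frozenset(line) for line in lines}
```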
MATLAB/OCTAVE code for TD tic-tac-toe
We illustrate how afterstate values can be learned using self-play.
The figure below shows the states experienced by the player using
crosses (s_1, s_3, and s_T) and the player using naughts (s_2,
s_4, and s_T). The value V(s_0) in the code below is updated by
both players, but its value has no meaning since s_0 is not an
afterstate. When a terminal afterstate is reached its value is
known and set to its correct value. If it is a draw, then both
players' final afterstate value is known to be 0 (since the last
move is forced). Example:

    s_0 → s_1 → s_2 → s_3 → s_4 → ... → s_T
swin = [1 1 1 2 2 2 2 2 2; % these are all possible winning states
2 2 2 1 1 1 2 2 2; % for tic-tac-toe (where the ones are)
2 2 2 2 2 2 1 1 1;
1 2 2 2 1 2 2 2 1;
2 2 1 2 1 2 1 2 2;
1 2 2 1 2 2 1 2 2;
2 1 2 2 1 2 2 1 2;
2 2 1 2 2 1 2 2 1]’;
epsilon = 0.1; % used by our self-improving epsilon-greedy policy
alpha = 0.1; % step size used by incremental update
triadic = 3.^(0:8); % use this to find quickly an index to table
V = sparse(triadic*ones(9,1)+1,1); % initialize the value table to zero
for episode = 1:10000 % play the game for a number of episodes
  s = zeros(9,1); % initial state is an empty board
  index = [1 1]; % both players use this state as their initial state
  winner = 0; p = 1; % player 1 uses crosses, default outcome is a draw
  for plays = 1:9 % the game terminates in 9 plays or sooner
    A = find(0 == s); % find all legal moves
    if (rand(1) < epsilon) % use epsilon greedy policy
      a = A(ceil(rand(1)*length(A))); % purely random move selected
    else
      v = zeros(length(A),1); % clear temporary value variable v
      for k = 1:length(A) % do a greedy afterstate search
        sd = s; % copy current state
        sd(A(k)) = p; % do move and observe state sd
        ind = abs(triadic*sd)+1; % find index to table
        v(k) = p*V(ind); % store this value, note
      end % that this is a minimax search
      K = find(v == max(v)); % find moves that give maximum return
      k = K(ceil(rand(1)*length(K))); % break ties randomly
      a = A(k); % the greedy move breaking ties
    end
    s(a) = p; % perform actual move and observe new state
    ind = abs(triadic*s)+1; % current state index to table
    pind = index((p+1)/2+1); % previous state index to table
    if any(sum((s*ones(1,8) == p*swin),1) == 3) % is this a winning state?
      V(ind) = p; % the known afterstate value
      V(index) = V(index)+alpha*(p-V(index)); % incremental update
      winner = p; % record the winner (see check for draw below)
      break; % state 'ind' is terminal - break!
    else
      V(pind) = V(pind)+alpha*(V(ind)-V(pind)); % incremental TD update
    end
    index((p+1)/2+1) = ind; % remember this new state index
    p = -p; % swap players
  end
  if (0 == winner) % it's a draw!
    V(index) = 0; % known to be zero, because the final move is forced
  end
end

[Figure: afterstate value for the initial moves versus number of
self-plays (×10^5); top panel: TD with α fixed at 0.1 (ε = 0.1);
bottom panel: TD with α = 0.1 reduced over time (ε = 0.1)]