### reinforcement learning

#####################################
####   reinforcement learning    ####  CS7642 notes
#####################################

(review lectures on MDP, value/policy iteration from ML course)  http://kenics.net/gatech/ml/

###
###  Bellman Equation
###

note: in ML lecture, we wrote Bellman equation as the following

U(s) = R(s) + γ max Σ T(s,a,s') * U(s')       # U(s) = V(s)   utility = value      # just a diff notation for the same thing
a  s'                        # R(s) = reward as a function of state s
# R(s,a) = reward of taking action a in state s
# R(s,a,s') = reward of taking a in s, and ending up in s'
# mathematically they are the same for our purpose, but diff representation

but in this lecture, we will use the following notation & representation

V(s) = max( R(s,a) + γΣ[T(s,a,s') * V(s')] )    # notice the recursive nature.
a             s'

===> just to re-write this, let's denote the inside of max() as Q(s,a) function
a
V(s) = max Q(s,a)    # then it's just a recursively defined function
a            # so we can re-write the whole equation in terms of Q(s,a)

Q(s,a) = R(s,a) + γΣ[T(s,a,s') max Q(s',a')]
s'           a'

note: the reason we do this is because in RL context, Q value (its expectation) is easy to compute from experience <s,a,s',r> while computing V() requires knowing R() and T() specifically.

note: similarly, we can rewrite V(s,a)  as below

V(s,a) = max(R(s,a) + C(s,a))       # C as in continuation
a

C(s,a) = γΣ[T(s,a,s') max ( R(s',a') + C(s',a') ) ]
s'           a'

====> so we can express each of V(), Q(), C() in terms of the others         #  (quiz) https://www.youtube.com/watch?v=c3LrCrBw0hY

e.g.

Q(s,a) = R(s,a) + γΣ[T(s,a,s') V(s')]      # writing Q() in terms of V()
s'

Q(s,a) = R(s,a) + C(s,a)             # now appreciate the idea that if you know C() & R(), then you can express Q()

C(s,a) = γΣ[T(s,a,s') V(s')]           # writing C() in terms of V()
s'

C(s,a) = γΣ[T(s,a,s') max Q(s',a')]   # if you know Q() and T(), then you can get C()
s'           a'

===> so it all comes down to what we want to know, and what we can easily know.
in RL context, we want to know V(), and Q() happens to be easy to know (easy to improve Q value algorithmically).

############################
###   (3)   RL basics    ###
############################

##
##  RL versus other ML methods
##

(1) sequential decision making
- balancing short term VS long term reward maximization
-- the two may compete. it's not always easy.

(2) evaluating feedback
- you get a sequence of experience <s,a,s',r>
- but so what? is there a bigger reward you haven't gotten to? or not?
- evaluating feedback is hard.

(3) sampled feedback for generalizing π(s) = a for unseen s
- aka Deep RL
- much like supervised learning
- we need Deep RL when state,action space is too large to explore.

###
###   Behavior Structures
###

in RL, an agent interacts with an environment, and tries to learn behavioral structure that maximizes reward.

- a few well known forms/notions to represent "behavior"

(1) plan  : fixed sequence of actions   #  two cons (a) cannot have a complete plan during training
#           (b) cannot handle stochasticity well.
(2) conditional plan : plan that includes "if-then-else" statement
(3) stationary policy (aka universal plan) : mapping from state to action      # MDP paradigm!
- handles stochasticity well
- "if" conditional plan for every state, which is a very large exploration space..
- optimal stationary policy "always" exists for any MDPs. (not to say we can always find them, but representation wise, stationary policy is all we need)

###
###  evaluating a policy
###

(1) state transitions to immediate rewards    # basically follow policy (from an initial state) to subsequent state to collect rewards
(2) truncate according to horizon T           # e.g. T = 5
(3) summarize sequence :
T
return = Σ [γ^i + r_i]     where  0 < γ ≤ 1    #  γ = discount factor
i=1                                    #  r_i = reward at step i

(4) summarize over sequences
- average, i.e. expectation

note: following a stationary_policy(s) = a, subsequent state s' you end up in may not be deterministic.
so you may have different sequences of states/rewards even following one same policy.
e.g.
following some policy(),
sequence_A (probability 60%):  0, +6, -0.2, +0.3, 0, 0,,,      # from step (1)(2)(3), you know how to calc the return on this sequence.
sequence_B (probability 25%): -5, +0.1, 0, 0, +0.7,,,          # so on and so forth.
...                                                            # then you can weight-average the return over sequences, to compute the overall value of a policy.

###
###  evaluating a learner (who outputs a policy)
###
- as good as its policy:  value of a learner == value of returns of the policy
- computational time complexity    # how much time it takes for a learner to learn (in terms of input data dimension size, etc)
- sample complexity                # how much experience a learner requires to converge/learn optimal policy
===> certain obvious trade-off exists where with less time,sample, your learner output policy is not as optimal.

so if two learners produce the same policy, if one is more efficient in terms of time/sample complexity, then we prefer the efficient learner.

note: while we can talk abotu space complexity, usually it's not important in RL context, as other complexity often becomes the bottleneck.

################################################
####   (4)  Temporal Difference Learning    ####
################################################

introduced by Sutton (a 1988 paper, then a book)

##
##  RL context
##

experience                         policy
<s,a,r,s'>* ---> [RL learner] --->  π(s)        #  star * denotes there could be many

===> specifically 3 families.

(1)  model-based RL

<s,a,r,s'>*  ---> [model_learner] ---> T(),R() ---> [MDP_solver] ---> Q* ---> [argmax] ---> π(s)
^                 |
\----------------/

(2)  value-function-based   # aka "model free"

<s,a,r,s'>*  --->  [value_update] ---> Q ---> [argmax] ---> π(s)
^             |
\------------/

(3)  policy search

<s,a,r,s'>  --->  [policy_update] --->  π(s)
^               |
\--------------/

(1) (2) (3)
more supervised <-----------> more direct learning

note: which one is better?  it really depends on the problem.
but it's often hard to model T() & R() from experience when the world is complex.
similarly it's often hard to learn policy directly from experience.
and the major focus of research has been on (2), so we will focus on (2)

NOTE: model of environment i.e. T(), R()  ==  MDP

- planning: agent is given the model of the world, i.e. complete knowledge of T(), R(), then plans a policy(s)=a
- model-based RL: agent learns the model AND the policy by interacting with (i.e. experiencing) the world/env.
- model-free RL: agent doesn't bother learn the model, but learns the policy based on experience <s,a,s',r> of the world/env, by learning some state/action value functions.

###
###   TD Lambda : example of value-based RL
###

learning to predict stuff (like expected sum of discounted rewards) over time.

e.g.

learn V(s) = 0                if s = s_end       #  V(s) = a value function, value of s
= E[ r + γV(s')]   otherwise          #  γ = discount rate, r = immediate reward (could be pos/neg/zero)
#  s' = next state
e.g.

here is an example MDP graph representation.
(notice s3 stochastically transitions into s4 or s5, let's say 30% & 70% probability, respectively)

r3_1      r5
r1      /----> s5 ----\
s1 ----> s3                ------> s_end
^  \----> s4 ----/
s2 -----/    r3_2      r4
r2

what is V(s3) ?   # well just plug in to the recursive equation as defined.

V(s3) = (r3_1 * 0.7 + r3_2 * 0.3) + γV(s4)    # recall the above 30%, 70% setup
V(s4) = r4 + γV(s_end)

note: it's nice to be able to compute V(s) like above. but sometimes all we have is a set of experience.

e.g.
+1      +0      +3
exp_1: s1 ---> s3 ---> s5 ---> s_end     thus, V(s1) = 4   based on this experience alone

+1      +7      +1
exp_2: s1 ---> s3 ---> s4 ---> s_end     thus, V(s1) = 9   based on this experience alone

exp_i: .....

===> as you get more & more experience, you can simply take the mean of V(s1) to (hopefully) approximate accurate V(s1)
in principle, with an infinite amount of examples, we get true V(s1)

###
###   computing estimates incrementally
###

let's formalize the exercise of computing V(s) from each experience/example.

we denote each experience 1,2,3,,,,t

given Vt-1(s) which is the expected sum of all discounted rewards based on t-1 examples, the next t'th example will give Rt(s)   # this is just a setup
so you can compute Vt(s) as follows:

(t-1)Vt-1(s1) + Rt(s1)
Vt(s1) =  ------------------------
t
(t-1)               1
=  ----- Vt-1(s1)  +  --- Rt(s1)
t                 t

=  Vt-1(s1) +  αt(Rt(s1) - Vt-1(s1))    where αt = 1/t
_________________
\
this part here can be seen as error

note: an interesting question to ponder on is "if we change αt definition, like αt = sqrt(t), does it change the algo behavior or still converge to the same answer eventually?"

###
###  Properties of Learning Rates: αt
###

lim    Vt(s) = V(s)       # V(s) is the true value of s, average of all Rt(s)
T -> ∞

While we don't go into proof, here are two conditions for a valid learning rate.

(1)  Σ αt = ∞
t

(2)  Σ αt^2 < ∞
t

so what bounds can we state about αt for it to be valid ?

∞             1     1     1     1
(recall) Harmonic Series:  Σ 1/n =  1 + --- + --- + --- + --- + ...  =  ∞      # called a divergent infinite series
n=1            2     3     4     5

(recall) hyper harmonic series  # aka p-series

∞
Σ  1/(n^p)   -->  converges  if p > 1     # then it is called over-harmonic series
n=1                diverges   if p <= 1
π^2
note: the sum of hyper hamonic series = -----  if p=2    # known as Basel Problem
6

(ref) https://en.wikipedia.org/wiki/Harmonic_series_(mathematics)
https://en.wikipedia.org/wiki/Harmonic_series_(mathematics)#p-series

###
###   TD(1) rule
###

- it's just a rule defined as below.
+r2   +r3   +r4
Episode T    # note: an episode = an experience, like s1--->s2--->s3--->sf
(1) ∀S,  e(S) = 0, VT(S) = VT-1(S)  at start of an episode
(2) After St-1  ---(rt)--->  St  (i.e. t'th step),  e(St-1) = e(St-1) + 1   # you got reward rt by transitioning from St-1 to St
(3) ∀S,  VT(S) = VT(S) + αT(rt + γ*VT-1(St) - VT-1(St-1)) * e(S)            # this step is just a temporal diff
e(S) = γ*e(S)                                                     # this step is just decalying eligibility

===> step (1) is only once, then you do (2),(3) at every step of the episode.

note:  e(S) denotes eligibility of state S
e(S) = 0  means S is ineligible
note:  an extremely confusing part of the above TD(1) definition is the distinction between big T and small t.
a big T denotes an episode (an experience), and a small t refers to a step within an episode.
so you update VT(S) while you deal with St,rt at each step.  St ∈ S
note:  reward rt can be negative.

==> CLAIM:  TD(1) is the same as outcome-based updates (IF no repeated states)

==> problem: when TD(1) updates VT(St-1), it doesn't account for all the info observed on potential VT(St).
for example, suppose you get lots of examples like

episode_1: s1,s3,s5,s_end
episode_2: s1,s3,s4,s_end
episode_3: s1,s3,s5,s_end
episode_4: s2,s3,s4,s_end   # TD(1) updates VT(s2) without using knowledge of s3,s4,s_end gained from episode 1,2,3
# MLE method, on the other hand, takes such knowledge into account.
r3_1      r5
r1      /----> s5 ----\
s1 ----> s3                ------> s_end
^  \----> s4 ----/
s2 -----/    r3_2      r4
r2

###
###   TD(0) rule
###

- finds MLE (if finite data repeated infinitely often)

Episode T
(1) ∀S,  VT(S) = VT-1(S)  at start of an episode
(2) ∀St-1, VT(St-1) = VT(St-1) + αt(rt + γ*VT(St) - VT(St-1))     # St-1 = state we just left,  St = next state (probabilistically there can be more than one)
= E[r + γ*VT(St)]                             # the idea is, with after enough episodes, we see St that happens more often more often.
St                                          # in fact, they become the mean. (again, if finite data repeated infinitely often)
# note, in TD(1), repeating finite data infinitely doesn't help, recall VT(s2) example
# if infinite data, then both TD(1) and TD(0) find the true V(S)

note: you repeat step (2) at each step within an episode.
note: as you can see, when we update VT(St-1), we take into account our knowledge of VT(St) from prev episodes.
note: step (2) can be re-written as below.

(2) ∀ S=St-1,  VT(S) = VT(S) + αt(rt + γ*VT-1(St) - VT-1(St-1))

###
###   TD(λ) rule
###

both TD(0) and TD(1) have updates based on differences between temporally successive predictions.
TD(λ) covers both, defined as below.

Episode T
(1) ∀S,  e(S) = 0, VT(S) = VT-1(S)  at start of an episode
(2) After St-1  ---(rt)--->  St  (i.e. t'th step),  e(St-1) = e(St-1) + 1
(3) ∀S,  VT(S) = VT(S) + αt(rt + γ*VT-1(St) - VT-1(St-1)) * e(S)
e(S) = λ*γ*e(S)                                           # the only diff from TD(1) is you have λ here

====> if λ=1, then you get exactly TD(1) rule, and if λ=0, then you get TD(0) rule effectively.
====> the idea is as you set λ = [0,1], you get properties of both.
optimal λ value depends on the problem, but empirically often optimal λ = [0.3, 0.7] range

###
###  K-step Estimators
###

E1:  V(St) = V(St) + αt(rt+1 + γ*V(St+1) - V(St))                             # this is TD(0), aka one-step estimator
E2:  V(St) = V(St) + αt(rt+1 + γrt+2 + γ^2 * V(St+2) - V(St))
E3:  V(St) = V(St) + αt(rt+1 + γrt+2 + γ^2 * rt+3 + γ^3 * V(St+3) - V(St))
EK:  V(St) = V(St) + αt(rt+1 + ... + γ^(k-1) * rt+k + γ^k * V(St+k) - V(St))
E∞:  V(St) = V(St) + αt(rt+1 + ... + γ^(k-1) * rt+k + ...  - V(St))           # this is TD(1), aka outcome-based estimator

each of E[1,∞] gives an estimate of V(S), and we want to weight each one in a particular way.
for this "weighting", we use lambda λ

weight
E1: 1-λ
E2: λ(1-λ)
E3: λ^2 * (1-λ)
..
EK: λ^(k-1) * (1-λ)
E∞: λ^∞

∞
Σ λ^(k-1) * (1-λ) = 1     # summing all the weights should amount to 1
k=1

so, putting this all together.

∞
TD(λ) =  Σ λ^(k-1) * (1-λ) * Ek     # called TD(λ) estimator of an MDP
k=1                         # it's a weighted combination of k-step estimators Ek for k >= 1

note: if we use k-step estimator for the below MDP, what happens after k=3 ?
well, think of this MDP as continuous in that s_end state has self transition probability 1 with reward 0
as such we expect V(s_end) = 0

episode_1: s1,s3,s5,s_end
episode_2: s1,s3,s4,s_end
episode_3: s1,s3,s5,s_end
episode_4: s2,s3,s4,s_end

r3_1      r5
r1      /----> s5 ----\
s1 ----> s3                ------> s_end
^  \----> s4 ----/
s2 -----/    r3_2      r4
r2

##
##  optimal λ value,  empirically
##

|
error    |              ..
in V()   |            ..
after    | ..       ..
finite   |    ......
data     |__________________
0   0.3  0.7    1

λ value

###
###  off policy  VS  on policy
###

RL algo can be categorized as either "on" or "off" policy learning algo.

what is a policy ?
- deterministic policy : π(s) = a
- stochastic policy : Pr(a|s) = π(s,a)
Σ π(s,a) = 1
a ∈ As

read the below refs. but intuitively, on policy means your algo keeps updating the policy by using the policy itself (aka target/estimation policy). (updating/learning on itself)
whereas off policy means your algo may learn some other policy/value function of a sort and use that some other (such as greedy + exploratory) policy/value func (aka behavioral policy) to derive the policy. (updating/learning the target/estimation policy via behavioral policy)

e.g.
on-policy: TD(λ), SARSA,
off-policy: Q-Learning, R-Learning

SARSA:  Q(s,a) = Q(s,a) + α(r + γ*Q(s',a')] - Q(s,a))         # see how it is using itself to decide future discounted reward

Q-Learning:  Q(s,a) = Q(s,a) + α(r + γ*max[Q(s',a')] - Q(s,a))     # see how it is using "greedy" policy (as behavioral policy) to decide future discounted reward
a'                         # while estimation policy may not be the same greedy policy

what is the estimation policy ?
- well it is just whatever current policy you got, assuming it is not an optimal policy you want to improve it.
- likely it is not a simplistic greedy policy because that would mean it never explores.
- note if it is a simplistic greedy policy then q-learning of it becomes on-policy learning.

##
##  comparison of properties
##

on-policy
- it may not explore as widely as off policy. but because it explores based on the existing policy of itself, so if the exploration cost/penality can be huge, then on policy provides a sane learning. e.g. you don't want your robot to explore and jump off a cliff.
- can get stuck on local minima

off-policy
- more expoloration, but can get messy if you use fancy behavioral policy
- may be slower

(ref) https://courses.engr.illinois.edu/cs440/fa2009/lectures/Lect24.pdf
(ref) http://www.inf.ed.ac.uk/teaching/courses/rl/slides15/rl05.pdf
(ref) https://stats.stackexchange.com/questions/184657/what-is-the-difference-between-off-policy-and-on-policy-learning
(ref) http://artint.info/html/ArtInt_268.html

###############################
####   (5)  Convergence    ####
###############################

TD with control of action. (to learn the value of actions, to figure out the best way to behave)

###
###  Bellman Equations with NO action
###

V(s) = R(s) + γ*Σ T(s,s')V(s')     # value of a state = immediate reward + discounted expected sum of reward of subsequent sequence of states
s'                 # notice we deliberately removed action out of the usual equation

given an experience tuple <St-1,Rt,St>

Vt(St-1) = Vt-1(St-1) + αt(Rt + γ*Vt-1(St) - Vt-1(St-1))     #  notice this is TD(0)
Vt(S) = Vt-1(S) otherwise

###
###  Bellman Equations with action
###

Q(s,a) = R(s,a) + γ*Σ T(s,a,s') max Q(s',a')       # Q-form of bellman equation
s'           a'

given <St-1, At-1, Rt, St>,  we can write TD(0)-like Q-learning update rule. (nothing fancy, just TD(0) again)

Qt(St-1,At-1) = Qt-1(St-1,At-1) + αt(Rt + γ*max Qt-1(St,A') - Qt-1(St-1,At-1))    # i did s/a/A/g, to distinguish from α
A'
Qt(s,a) = Qt-1(s,a) otherwise    # a==A, s==S

note: the above update rule is taking care of two kinds of approximation.

(1) if we knew the model (transition and reward), then we can synchronously update as follows:
Qt(s,a) = R(s,a) + γ*Σ T(s,a,s') max Qt-1 (s',a')           # TD(0)
s'           a'
(2) if we knew Q*, sampling asynchronously update
Qt(St-1,At-1) = Qt-1(St-1,At-1) + αt(Rt + γ*max Q*(St,A') - Qt-1(St-1,At-1))       # Q* is true value, which we only use to one-step look-ahead
A'                                    # to update Q value estimate of the state we just left

note: recall we need two properties to converge:

∞
  Σαt = ∞
t=0

∞
  Σ(αt)^2 < ∞
t=0

note: the notation is getting messy, so we will simplify by introducing bellman operator.

###
###  Bellman Operator
###

- let B be an operator (or mapping) from Q/value functions to Q/value functions.

[BQ](s,a) = R(s,a) + γ*Σ T(s,a,s') * max Q(s',a')        # BQ denotes B applied to Q
s'             a'

e.g.

Q* = BQ*      #  Bellman Equation (it's an equation that defines value function that gives optimal policy)
Qt = BQt-1    #  Value Iteration   # to be precise, you need to initialize Qt = some_init_value for t=0

###
###  Contraction Mappings
###

B is an operator.

if, for all F,G and some 0 ≤ γ < 1,      # assume F,G are value functions
||BF - BG||_∞   ≤   γ||F - G||_∞
then B is a contraction mapping.         # contraction = closer together, tighter together

e.g.

|| Q||_∞  =  max| Q(s,a)|      # infinity norm of Q value function of s,a
s,a               # i.e. the max value of current Q

so, || F-G||_∞  just denotes infinity norm, the max distance (i.e. the biggest diff) btwn (the Q values of) F and G

to make it simple, assume an MDP problem with a single state & a single action.
so your Q value function only gives a single number.
which of the following B operators are contraction mappings?

  Bx = x/2
  Bx = x + 1
  Bx = x - 1
  Bx = (x + 100) * 0.9

==>  and  because adding a constant never changes the distance. so , are eliminated.
then, think about  as below.

|Bx-By| = |x/2 - y/2|
= 1/2 * |x-y|   so for 1/2 ≤ γ < 1, B is a contraction mapping.

then  follows the same argument.

##
##  Contraction Mapping - properties
##
if B is a contraction mapping,
  F* = BF* has a unique solution   # aka "fixed point" of B.
  Ft = BFt-1  then  Ft -> F*       # i.e. Value Iteration converges

note: here is the reasoning. if B is a contraction mapping, then the following is true by definition.

|BFt-1 - BF*||_∞   ≤   γ|| Ft-1 - F*||_∞       #  eqt.1

then, based on  & , we can rewrite || BFt-1 - BF*||_∞  =  || Ft - F*||_∞

so we can rewrite eqt.1 as
|| Ft - F*||_∞   ≤   γ|| Ft-1 - F*||_∞          # i.e. at each t'th step, we get closer to F*

note: for a given contraction mapping B, "unique" F* means, there can be only one F*
proof by contradiction: given a contraction mapping B, suppose there are F* and G* s.t.
F* = BF*
G* = BG*  where F* = G*
now, because B is a contraction mapping, we have
|| BF* - BG*||_∞   ≤   γ||F* - G*||_∞
but this means we also must have
|| BF* - BG*||_∞  = || F* - G*||_∞  ≤  γ|| F* - G*||_∞  which cannot hold true for 0 ≤ γ < 1
thus, we prove there must be a unique F*

###
###  Bellman Operator Contractions
###

[BQ](s,a) = R(s,a) + γΣ T(s,a,s') max Q(s',a')      # the usual bellman operator notation
s'           a'

given Q1, Q2
|| BQ1 - BQ2||_∞   ≤   γ|| Q1 - Q2||_∞           # let's prove this.

proof:

|| BQ1 - BQ2||_∞   =  max| [BQ1](s,a) - [BQ2](s,a)|        # now let's do a tedious substitution
a,s

=  max| { R(s,a) + γΣ T(s,a,s') max Q1(s',a') } - { R(s,a) + γΣ T(s,a,s') max Q1(s',a') }|
a,s              s'           a'                           s'           a'

=  max| γΣ T(s,a,s'){max Q1(s',a') - max Q2(s',a')}|     # let's keep simplifying
a,s   s'           a'              a'

≤  γ max| max Q1(s',a') - max Q2(s',a')|    # notice the two a' are independent
s'   a'              a'

≤  γ max| Q1(s',a') - Q2(s',a')|            # now two a' are the same
s',a'                                  # see the proof of max()'s non-expansive property

=  γ|| Q1 - Q2||_∞

so yes bellman value iteration converges!

###
###  proof of the non-expansive property of max() operator
###

for all f,g  | max f(a) - max g(a)|  ≤  max| f(a) - g(a)|
a          a             a

proof: WLOG, assume  max f(a) ≥ max g(a)        # WLOG = without the loss of generality
a          a              # here each a is independent
# WLOG because it's trivial to flip

|| max f(a) - max g(a)||  =  max f(a) - max g(a)  # again, here each a is independent
a          a              a          a

=  f(a1) - g(a2)       where  a1 = argmax f(a),  a2 = argmax g(a)
a                  a
≤  f(a1) - g(a1)    # recall f(a1) ≥ g(a2) ≥ g(a1)
= | f(a1) - g(a1)|
≤  max| f(a) - g(a)|   # here "a" is one same a

note: non-expansive ≈ contraction in this context. here we only looked at max(), but other operators also have non-expansive property, which can be useful to prove stuff.
note: we can now move onto the proof of Q learning convergence, as it finds a solution to bellman equation.

###
###  Convergence theorem    (which basically says Q-learning converges)
###

define αt(s,a) = 0 if st != s, at != a     # t = current step, thus st = current state, at = current action
# basically, we set the learning rate 0, for state-action pairs that are not current s,a
# i.e. only keep the learning rate for the current s,a
theorem:
let B be a contraction mappping, and Q* = BQ* be its fixed point.    #  recall Q* being the unique solution to bellman equation
let Q0 be a Q function and define Qt+1 = [Bt Qt]Qt                   #  [Bt Qt] denotes an operator produced by applying Bt to Qt
then Qt -> Q*  if the following properties hold.                     #  Qt -> Q* means Q-learning converges. (that's all we are saying with this theorem really)

1) for all U1,U2,s,q
| ([Bt U1]Q*)(s,a) - ([Bt U2]Q*)(s,a) |   ≤   (1 - αt(s,a))|U1(s,a) - U2(s,a)|

2) for all Q,U,s,a
| ([Bt U]Q*)(s,a) - ([Bt U]Q)(s,a) |  ≤  γαt(s,a)|Q*(s,a) - Q(s,a)|

3) the usual learning rate properties for convergence
∞
3.1)   Σαt = ∞
t=0

∞
3.2)   Σ(αt)^2 < ∞
t=0

condition 1) takes care of averaging out the stochasticity
condition 2) works like value iteration, the property of contraction

###
###  Convergence theorem in Q-learning context
###

let's express Q-learning as an update rule, in the form of convergence theorem.

for s=st-1, a=at-1
([Bt Q]W)(s,a)  =  Q(s,a) + αt(s,a)(rt + γ max W(st,a') - Q(s,a))      # Bt is just an operator we use to update Q function at t'th step
a'
<st-1,at-1,st,rt>   # experience is you took action at-1 in state st-1 and ended up in state st with reward rt

we will argue operator [Bt Q] is in fact a Q-learning operator.

notice we have two Q value functions.
Q() estimates the value of st-1,at-1   # state we just left
W() estimates the value of st,a'       # state we just arrived at

now, this can fit the below two conditions. think W() = Q*() as if you somehow know true Q function as one-step look ahead.
think Q() = U()
1) for all U1,U2,s,q
| ([Bt U1]Q*)(s,a) - ([Bt U2]Q*)(s,a) |   ≤   (1 - αt(s,a))|U1(s,a) - U2(s,a)|

2) for all Q,U,s,a
| ([Bt U]Q*)(s,a) - ([Bt U]Q)(s,a) |  ≤  γαt(s,a)|Q*(s,a) - Q(s,a)|

#############################################
####    Advanced Algorithmic Analysis    ####
#############################################

(1) VI       # value iteration
(2) LP       # linear programming
(3) PI       # policy iteration

###
###  VI
###

recall we talked about how contraction mapping allows VI to converge, in the limit (in inifinite horizon), solving bellman equation.
"in the limit" is nice to know theoretically, but not realistically useful.
here are some more useful properties.
1
(1) for some t*<∞ polynomial in |S|, |A|, Rmax = max|R(s,a)|, -----, bits of precision for transition probability func T()
a,s           1-γ
π(s) = argmax Qt*(s,a) is optimal.
a

note: in this case, t* refers to some (less than ∞) step
so this claim is even stronger than the VI convergence (in infinite horizon) we previously talked about.
this claims even in less than infinity step, in polynomial time to those relevant variables, VI converges to optimal policy. note, V() itself doesn't converge, but its outout π() will converge.
note: strictly speaking, if the value of γ gets small, then 1/(1-γ) gets infinitely big, and it's no longer polynomial, and becomes exponential. so that's the weakness, which we will overcome by encoding MDP in LP framework.

ref: this property was originally introduced in cramer's rule.

(2) if |Vt(s) - Vt+1(s)|  <  ε  ∀s           # if your V() at step t is no different (at least not more than ε) from V() at previous step t-1, for all states,

2εγ
then  max |VπVt(s) - V*(s)|   <   -------     # max diff btwn your VπVt ()and optimal V*() is bounded by a small number 2εγ/(1-γ) which is good
s                            1-γ

note: VπVt(s) is a messy notation but it says
first get your greedy policy π() with respoect to Vt(),
i.e.    π(s) = argmax Vt*(s,a)
then define your value function V() based on that policy π()
i.e.   Vπ(s) = R(s,π(s)) + γ∑ T(s,π(s),s') * Vπ(s')
s'

note: this property is nice to know, and takes (1) one step further, because if your latest Vt() is good enough approximation (in terms of ε), then you can put a bound on how far your resulting V() is from optimal V*() in terms of ε & γ.

note: overall, based on (1),(2) should we always pick smaller γ ?   well, no, because smaller γ means stronger-discount future rewards, and emphasize immediate reward (short-sighted, myopic), which is not always wise.

(3)  given Bk = bellman operator of VI applied k times(steps)
it's contraction mapping.
i.e.   ||BkQ1 - BkQ2||_∞   ≤  γ^k ||Q1 - Q2||_∞

this is nothing fancy, it's saying applying VI k steps contracts the difference. gets you closer to optimal, as bounded by γ.

###
###  LP
###

recall, in VI, convergence of Q() for getting optimal policy π() was almost polynomial except for a smaller value of γ, it could get exponential. we will solve this issue by encoding MDP into LP framework.

LP is an optimization framework where models (variables,constrants) are represented in linear relationships.
e.g. maximize c^T * x  subject to Ax ≤ b and 0 ≤ x

(ref) https://en.wikipedia.org/wiki/Linear_programming

how do we encode our MDP in LP way ?

∀s,  V(s) = max( R(s,a) + γ∑ T(s,a,s')V(s') )     # here is our usual bellman equation
a             s'

===> we can turn it into LP format as below. (aka "primal" LP representation of MDP)

min ∑ V(s)
s

subject to   ∀ s,a    V(s) ≥ R(s,a) + γ∑ T(s,a,s')V(s')

note: an intuitive simpler example of what we did above is.
e.g.
given a problem  max(-3,4,5,7) = x       # obviously the answer is x=7
you can turn it into an LP problem as below
min x                # objective function
subject to  x ≥ -3   # linear constraints
x ≥ 4
x ≥ 5
x ≥ 7

===> as with any LP, given the above "primal", let's look at its "dual"

max  ∑ ∑ q(s,a)R(s,a)      # q(s,a) = policy flow
q(s,a) s a

subject to   1 + γ ∑ ∑ q(s,a) T(s,a,s') = ∑ q(s',a)     ∀ s'
s a                    a

∀ s,a   q(s,a) ≥ 0

###
###  PI
###

1.         initialize:  ∀s, Q0(s) = 0

2. policy improvement:  ∀s, πt(s) = argmax Qt(s,a)   for t ≥ 0     # outer loop, iterates all actions
a
3. policy evaluation:   Qt+1 = Qπt                                 # inner loop, iterates all states

4. repeat 2,3 until convergence

note: πt() denotes the policy (updated) based on Qt()
note: Qπt denotes Q value function (updated) based on πt()
other way to denote this is Qt|π()
note: Qt+1 = Rt + γ Qt(s,a)
= Rt + γ Σ T(s,π(s),s') * Qt(s')
s'

note: PI has the following properties.

1)  Qt -> Q*
2)  convergence is exact & complete in finite time    # true in VI too
3)  converges at least as fast as VI

note: we know PI takes less iteration (if we count the outer loop) than VI because each iteration in PI is longer.
but is there a bound in convergence time?
we know:
|S|  ≤  convergence time  ≤  |A|^|S|
(linear)                     (exponential)

(very good paper) https://editorialexpress.com/jrust/research/siam_dp_paper.pdf
(very good paper) https://arxiv.org/pdf/1301.6718.pdf

###
###  "Domination" concept
###

[notation]
π1 ≥ π2  denotes  π1 dominates π2
π1 > π2  denotes  π1 strictly dominates π2

recall Vπ() is simply a V() based on π()
i.e.
Vπ(s) = R(s,π(s)) + γ∑ T(s,π(s),s') * Vπ(s')
s'

[domination]

π1 ≥ π2   iff  ∀s∈S  Vπ1(s) ≥ Vπ2(s)     # i.e. π1 does no worse than π2

[strict domination]

π1 > π2  iff  ∀s∈S  Vπ1(s) ≥ Vπ2(s)  AND  ∃ s∈S  Vπ1(s) > Vπ2(s)

[ε-optimal policy]

π is ε-optimal  iff |Vπ(s) - Vπ*(s)| ≤ ε   ∀s∈S      # aka "bounded loss/regret"

###
###   why does PI work ?
###

a tale of two operators for two policies: π1, π2

[B1 V](s) = R(s,π1(s)) + γΣ T(s,π1(s),s')V(s')    # we just define the usual B1,B2 bellman operators for π1,π2
[B2 V](s) = R(s,π2(s)) + γΣ T(s,π2(s),s')V(s')

theorem:  if V1 ≥ V2,  then  B2V1 ≥ B2V2      # V1 ≥ V2 denotes V1 dominates V2
# here we say B2 is monotonic

proof: [B2V1 - B2V2](s)  =  γ∑ T(s,π2(s),s'){V1(s) - V2(s)}     #   V1(s) - V2(s) ≥ 0  is given
s'

≥ 0

note: this property is called "monotonicity"

##
##  another important property (called "value improvement") in PI
##

π1  -->  B1  -->  Q1 = B1Q1      #  Q1 is the fixed point of B1 operator
# recall we just defined B1 from π1 as  [B1 V](s) = R(s,π1(s)) + γΣ T(s,π1(s),s')V(s')
π2(s) = argmax Q1(s,a)  ∀s       #  π2 aka greedy policy wrt Q1
a

π2  -->  B2

Q1(s,a)  =  R(s,a) + γ∑ T(s,π1(s),s')Q1(s',π1(s'))   # think of this as t'th step of PI
s'

≤  R(s,a) + γ∑ T(s,π2(s),s')Q1(s',π2(s'))   # this is obviously equal or greater because this is t+1 step of PI
s'
=  B2Q1

thus we have proved the following property.

Q1 ≤ B2Q1                 # B2Q1 dominates Q1
# aka "value improvement" i.e. not gonna get stuck in local optima

###
###  PI proof
###

(video)

given a step of PI,
π1 : current policy
B1 : bellman operator associated with π1
Q1 : fixed point of B1
π2 : greedy policy eval of Q1
B2 : bellman operator associated with π2
Q2 : fixed point of B1

B2Q1 ≥ Q1      # B2Q1 dominates Q1,  aka "value improvement"

for k ≥ 0,  B2k+1Q1 ≥ B2kQ1   # monotonicity of B2 operator

lim   B2kQ1 ≥ Q1      # transitivity
k -> ∞
Q2 ≥ Q1      # fixed point

###############################################
####    reward function transformation     ####
###############################################

read the paper by andrew ng "Policy invariance under reward transformations: theory and application to reward shaping"

(Q) why might we want to change the reward function for an MDP ?
(A) motivation is to change the reward function to make an MDP easier to solve. (without chaging what an RL learner tries to optimize)
- "easier" can mean
-- semantics : an RL learner may get a better grasp of the MDP. solvability.
-- efficienty: computation time, sample, space. solvability.

###
###  how to change the MDP reward function without changing the optimal policy ?
###

<S,A,R,T,γ>      #  only change R

ideas
(1) multiply by positive constant (scalar)
(2) shift by constant (adding a constant)
(3) non-linear potential-based rewards

###
### (1) multiply by a (positive) scalar
###

Q(s,a) = R(s,a) + γ∑ T(s,a,s') max Q(s',a')    # the usual bellman equation
s'           a'

R'(s,a) = cR(s,a)   where  c > 0

then

Q'(s,a) = cQ(s,a)    # this is neat. see derivation below

cQ(s,a) = cR(s,a) + γ∑ T(s,a,s') max cQ(s',a')
s'           a'

= cR(s,a) + cγ∑ T(s,a,s') max Q(s',a')
s'           a'

###
### (2) adding a scalar
###

Q(s,a) = R(s,a) + γ∑ T(s,a,s') max Q(s',a')    # the usual bellman equation
s'           a'

R'(s,a) = R(s,a) + c

then
c
Q'(s,a) = Q(s,a) + -----    # see derivation below
1-γ
c
Q'(s,a) = R(s,a) + c + γ∑ T(s,a,s') max( Q(s',a') + ----- )    # the reason we have geometric series c/(1-γ) is because we have +c in every subsequent iteration, and multilied by gamma γ^n
s'           a'              1-γ
c
= R(s,a) + c + γ∑ T(s,a,s'){max(Q(s',a')) + ----- }    # because c/(1-γ) is a constant that doesn't depend on a', we can take it out of the max()
s'           a'              1-γ
γc
= Q(s,a) + c + -----
1-γ
c
= Q(s,a) + -----
1-γ

###
###  shaping rewards
###

suppose you are RL-training a robot soccer agent.
scoring a goal should be a big reward, but to program its movement, you must implement some rewards for each step, like approaching the ball, shooting the ball, etc.
but if you don't do it smart, the agent might do something stupid like step near the ball step away then repeat infinitely, to try collect points infinitely. (suboptimal positive loop)

so shaping rewards right is important. - we have "potential functions" for it.

###
### (3) potential-based reward shaping in RL
###

first, think change-in-state-based reward: agent gets reward for achieving/approaching certain states. (and lose reward for leaving those states)

e.g.
you get reward as you approach the ball:  1/distance
you lose reward as you move away from the ball: -1/distance

(so it doesnt do stupid loop)

Q(s,a) = ∑ T(s,a,s'){R(s,a,s') + γ*max Q(s',a')}    # the usual bellman equation
s'                         a'

R'(s,a,s') = R(s,a) - Ψ(s) + γΨ(s')     # -Ψ(s) = potential of leaving state s
# γΨ(s') = discounted potential of arriving at state s'
# simply you get Ψ(s) when you reach it, then lose Ψ(s) when you leave
then                                    # so the recursive definition is simple as below

Q'(s,a) = Q(s,a) - Ψ(s)      # derivation below

Q'(s,a) = ∑ T(s,a,s'){R(s,a,s') - Ψ(s) + γΨ(s') + γ*max(Q(s',a') - Ψ(s'))    # notice γΨ(s') gets cancelled
s'                                         a'

= Q(s,a) - Ψ(s)

###
###  Q-learning with potentials
###

let's define our usual Q() value function using potential Ψ()    # basically using R'()

given <s,a,r,s'>    # i.e. we just left s, after taking a, then arrived at s', got r

Q(s,a) := Q(s,a) + αt(r - Ψ(s) + γΨ(s') + γ*max(Q(s',a') - Q(s,a))      #  A := B denotes A defined as B
a'
Ψ(s) = max Q*(s,a)
a

Q(s,a) = Q*(s,a) - Ψ(s)
= Q*(s,a) - max Q*(s,a)
a

= 0 if a is optimal
< 0 otherwise

note: it's known that Q-learning with potentials can speed things up (because you are encoding additional info/knowledge for each state)
note: recall other more critical benefit of potential function is to avoid suboptimal loop

# note: it's been proven that
Q-learning with potential == Q-learning without potential but with Q-function initialized at the same potential values
(it's not surprising, because you are asserting certain info in the way you initialize your Q values)

##########################
####   Exploration    ####
##########################

exploration : a very topic which separates RL from the rest of ML.
recall "exploration-exploitation dilemma"

- bandits   # state-less stochastic model.
- deterministic MDPs
- stochastic MDPs

###
###  k-armed bandists
###

bandit : it means thief, but also the name of slot machine.
in this context, bandit == slot machine

suppose you have k bandits, each with an arm (lever to pull).
you pull the arm, then you get binary output: hit or miss
suppose, for all bandits, the payout is
1 for hit
0 for miss
BUT, each bandit has different probability for hit & miss. (aka payoff rate)
and you don't know the probability. you only observe instances of hit & miss.
(our end goal is to find the bandit that has the highest payoff rate)

suppose k=7

bandit:

a : 1/10    # i.e. for bandit "a", we pulled the arm ten times, and got hit once
b : 2/20    # i.e. for bandit "b", we got 2 hits out of 20 trials
c : 4/40
d : 1/5
e : 2/10
f : 4/20
g : 8/40

(Q) which one has the highest expected (maximum likelihood) payoff ?
(A) bandits d,e,f,g all has 20% probability

(Q) which arm gives you the most confident about its expected payoff ?
(A) confidence increases monotonically as a function of samples. thus bandits c & g
(we'll look at the math later)

###
###  confidence-based exploration
###

suppose we have two bandits A & B
A : 0/1
B : 1/2

what's the best strategy ?
- always A            # this
- always B            # and this are not good ideas, as we don't know which one has higher probability of hit.
- maximum likelihood  # also bad idea, because this is ideantical to saying always B in this case (never explores, but at least it's doing a sensible exploitation)
- maximum confidence  # also bad idea, because this is ideantical to saying always B in this case (never explores, and this is a stupid exploitation)
- minimum confidence  # this alternates between A & B. not intelligent. (never exploits, but this is good exploration)

==> so we want to strike a good balance between maximum likelihood (exploitation) AND minimum confidence (exploration)
==> we'll look at the metrics for deciding the balance.
(the end goal of course is to find the optimal, i.e. highest hit probability, bandit)

###
###  metrics
###

1. identify optimal bandit in the limit.   # computationally intractable. because you have to limit all bandits
2. maximize (discounted) expected reward.  # this actually translates to achieving 1.
--> a popluar method is "Gittins Index" which maps the number of hits over the number of trials (i.e. observed payoff rate) into an index value in such a way that it considers both expected reward + confidence so that you can always simply pick the highest index value bandit and you will end up picking the optimal bandit.
note: turns out Gittins Index only works for Bandit problem. (and doesn't generalize for other RL problems)

3. maximize expected reward over finite horizon
4. identify near-optimal bandit (at most ε error away from true-optimal bandit) with high probability (1-δ) in time τ(k,ε,δ) polynomial in terms of k,ε,δ     # PAC concept, τ() = time
5. nearly maximize reward (at most ε error away from true-optimal reward) with high probability (1-δ) in time τ(k,ε,δ) polynomial in terms of k,ε,δ   # related to 4.
6. pull a non-near optimal bandit (again with some notion of ε) no more than τ(k,ε,δ) times with high probability (1-δ)    # mistake bound, related to 4,5.
..   (note: mistake here is defined as the number of times you pull non-ε-optimal bandit)
..
(list goes on)

note: turns out 4,5,6 are all related (an algo for one can be translated to solve the others)

###
###  Find Best  implies  Few Mistakes
###

[Find Best algo]
algorithm that finds optimal bandit (within ε) with probability 1-δ in τ(k,ε,δ) pulls.
(i.e. the number of times you pull non-ε-otpimal bandits are bounded. aka mistake bound)

###
###  Few Mistakes  implies  Do Well
###

[Few Mistakes algo]
algotirhm that pulls ε-subotimal bandits at most τ(k,ε,δ) times with probability 1-δ
(i.e. it pulls only ε-otimal bandits after at most τ(k,ε,δ) times)

note: ε-subotimal == non-ε-otimal

===> this implies Do Well algo (see below) because after at most τ(k,ε,δ) pulls, Few Mistakes algo only pulls ε-optimal bandits, so you can keep pulling ε-optimal bandits until you reach your desired per-step average ε-optimal reward.

###
###  Do Well  implies  Find Best
###

[Do Well algo]
algorithm that, with probability 1-δ, has a per-step (average) reward ε-close-to-optimal after τ(k,ε,δ) steps

==> intuitively, to keep per-step average reward ε-optimal, means finding ε-optimal bandit

notice we can translate between FindBest-FewMistakes-DoWell.

###
###  Hoeffding bound  (aka tailbound)
###

let X1,X2,,,,Xn be iid with mean μ        # iid = independently identically distributed
_
let μ be our estimate of μ : Σ Xi / n
i

the following is a 100(1-δ)% confidence interval for μ
_                _
[ μ - Zδ/sqrt(n),  μ + Zδ/sqrt(n) ]     where Zδ = sqrt(1/2 * ln(2/δ))

===> we don't go into math proof, but intuitively,
- if you want to increase confidence (i.e. smaller δ, i.e. bigger Zδ), then you have to widen your bound.
- bigger n (i.e. more samples) gives you more confidence. in this case, bigger n makes the bound tighter, which is good. because if the bound is very wide, then it's not useful information.

so the info from Hoeffding bound gives you an idea of sample complexity for learning the value of bandits.

###
###  union bound
###

we take C samples of each bandit, then choose the bandit with best estimate.
want to be 1-δ sure it's within ε.

learn each bandit to within ε/2 with probability 1-δ/k

ε/2 + ε/2
arm 1   [_______]            # this bandit is correct with probability 1-δ/k
<-- ε -->            # i.e. wrong with probability δ/k
arm 2          [_______]     #
<-- ε -->     # given k bandits, at least one wrong with probability at most δ  (union bound)
arm 3     [_______]          #                  all k bandits correct with probability 1-δ
<-- ε -->          #
..                           # union bound:  P(A or B) = P(A) + P(B) - P(A && B)
..                           #                         < P(A) + P(B)
arm k       [_______]
<-- ε -->
---------------------------- value

note: arm == bandit

==> which arm gives the best value ?
in the above example, arm 2 looks the best. but its actual value may be at the lowest of its band, and arm 3's value may be at its upper bound, then in that case arm 3 is better.
what's the worst by which you are off from the true value of the optimal arm? well you cannot be off by more than ε if you pick the arm whose upper bound is highest than others'.

###
###  how many samples ?
###

- want to be ε/2 accurate with probability 1-δ/k
- let's rearrange Hoeffding bound.

sqrt(1/2 * ln(2k/δ))
--------------------  ≤  ε/2
sqrt(c)

2
c  ≥  ---ln(2k/δ)
ε^2

===> so we know the size of sample c, to choose ε-close optimal arm with probability 1-δ

###
###  MDP optimization criteria
###

(1) explore randomly    # some rare states (that might give good reward) cannot be explored if you do pure random exploration
(2) trap states         # need some way to avoid trap states
(3) mistake bound
- the number of ε-suboptimal action choices is bounded in poly(1/ε,1/δ,n,k) with probability 1-δ

###
###  exploring deterministic MDPs
###

Rmax algo: (for deterministic MDPs)
1. keep track of MDP experience
2. any unknown state-action pair, assume Rmax reward. (so we explore that s,a pair. aka optimism in the face of uncertainty)
3. solve the MDP
4. take action from π*

Rmax analysis
1. once all edges known, no more mistakes  (mistake = ε-suboptimal action)
2. stop visiting unknown states ?
- because of discount factor γ, some path/state we may know we don't have to explore, even if there is a loop with Rmax there.
3. we execute a path to an unknown state-action pair:
- given n states, k actions
- we may travel across n states to get to a previously unknown state, and that state has k actions: n*k
- we do it for every state: n
- so the mistake upper bound is  n*n*k = O(kn^2)
- lower bound is O(n^2) assuming you take the optimal action, but not realistic, so it's really O(kn^2)

###
###  general stochastic MDPs
###

- stochastic : Hoeffding bound until estimates are accurate
- sequential : unknown state-action pairs assumed optimistic (like Rmax)

Rmax algo: (for stochastic MDPs)
1. keep track of MDP experience: so you have estimates of T() and R()
2. any unknown state-action pair, assume Rmax reward:
- we introduce a parameter "c" : if we tried unknown s,a pair fewer than c times, then call it still unknown. otherwise we move to MLE
3. solve the MDP
4. take action from π*

###
###  simulation lemma
###

- if we have a good estimate of an MDP, then our optimized reward (value func) will be close to optimal.
- transitions and rewards off by α or less.

given a fixed π,

V1|π(s)   V2|π(s)     # 1 = MDP1,  2 = MDP2
T1,R1     T2,R2       # so our lemma is if |T1 - T2| ≤ α, and |R1 - R2| ≤ α, then we claim |V1 - V2| is also small (see below math on ε)

max | T1(s,a,s') - T2(s,a,s') |  ≤ α
s,a,s'

max | R1(s,a) - R2(s,a) |  ≤ α     # note it is fair to argue, α for T() and R() should be diff. like α_t and α_r
s,a

Rmax
define G = α + γnα ------     #  n = number of states, γ = usual discount factor
1-γ
then
G
ε  =  abs[ V1|π(s) - V2|π(s) ]  ≤  -----      # so we bound ε by G,γ
1-γ
ε(1-γ)
α = ------------------     # then we solve for α,
1 + γn Rmax         # so we know how closely we need to estimate T() & R() to get ε-close V()
----
1-γ

===> we want to know "c" that ensures α-close estimates of T() & R() for every s,a pair.
(we use Hoeffding bound, union bound. the step is so similar, we skip algebra here.)

###
###  Expolore-or-Exploit lemma
###

lemma:
if all transitions are either accurately estimated or unknown, optimal policy is either near-optimal or an unknown state is reached quickly (polynomial number of steps) with some probability.

#############################
####   Generalization    ####
#############################

[motivation]
given an MDP, there are sometimes zillion states, and some states are similar, thus can be generalized.
e.g.
if agent is 1 steps away from state X, and there are N states that can transition to state X, then we may consider the agent in any of those N states as similar.
a kind of generalization, which may help reduce complexity.

###
###  what to generalize
###

recall 3 approaches of RL

(1) π(s) = a
(2) value function V(s) = r
(3) T(), R()

===> as always, we focus on (2), as it gives a good intermediate property between (1) and (3)

###
###  basic update rule
###

general form:   Q(s,a) = F(w(a),f(s))       # F() = some general function
# f(s) = gives features about s
# w(a) = gives weight for a

note: the idea is to express our Q(s,a) value function as features and weights.

given <s,a,r,s'>

∂ Q(s,a)
Δwi(a) = α { r + γ max Q(s',a') - Q(s,a) } --------        # gradient step
__________________             ∂ wi(a)
Bellman equation

wi  = wi + Δwi

- diff btwn current Q(s,a) and discounted one step look up max Q val is TDerror. we wanna move to the correct direction.
- but how much do we move ?  --> gradient is the partial derivative of Q(s,a) wrt to wi(a). i.e. how much the weight influences the Q(s,a) value

α = learning rate

###
###  Linear Value Function Approximation
###

let's represent the general form as specifically the linear combination of features and weights (for a given action)

n
Q(s,a) = Σ fi(s) * wi(a)
i

- each action gets a set of n weights to represent the Q values for that action.
- simply taking the dot product of feature vector and weight vector.
- weights give "importance" of all the features in contributing to an action's value. (e.g. if weight_i = 0, then its feature is zero importance.)
- action can be considered part of feature, but in this case we are treating it as hyper parameter for generalization.

now, putting this together with the basic update rule, we get the gradient i.e. how does Q(s,a) value change as a function of particular weight wi(a)

∂ Q(s,a)
-------- = fi(s)
∂ wi(a)

###
###  does this generalization work ?
###

- only on some problems.
e.g.
3-layer NN backprop: by Tesauro (applied on TDgammon)
CMAC: by Sutton       # CMAC is like layered NN but the 1st layer is decided by human
linear neural networks (linear combination of feature-weight): by Singh, Parr
deep neural networks: by google deep mind, Minh et al

* read 3 papers
(1) TDgammon by Tesauro
(2) follow up by Bayan/Moore
(3) follow up by Sutton

note: Leemon Baird showed that Linear Value Function Approximation doesn't always converge. (sad)

###
###  Averagers: a particluar class of function approximator (proposed by Gordon)
###             that is known to converge !

let's say we have these three data points.
(data points == anchor points == basis points. we define them as a set "B")

|          .

|   .
|                      .

----------------------------
x1     x2          x3

===>  suppose you want to approximate the value for a point (let's say x2.2) between x2 & x3.
an intuitive way is to plot a line between x2 & x3, and pick where x2.2 crosses.
let's say x2.2 is geometrically 20% of the way from x2 to x3.
then this is just a weighted average:  V(x2.2) = 0.8 * V(x2) + 0.2 * V(x3)

===> then this is just appximation based on convex combination of anchor points     # very interpolative.
(for new approx points beyond x1,x3, we can have some simple rules, like it just takes x1 or x3 values, etc)
(the point here is convex combination gives you a bound.)

V(S)  =     Σ w(S,Sb) V(Sb)    s.t.   w(S,Sb) ≥ 0  ∀S,Sb
Sb ∈ B                     and
   Σ w(S,Sb) = 1  ∀S
Sb ∈ B
where
B = a set of anchor(basis) points
V(Sb) is a parameter

(Quiz)   which supervized learner can be expressed an an averager ?

###
###  connecting averager to MDPs
###

let's plug averager into the usual Bellman equation

V(s) = max( R(s,a) + γΣ[T(s,a,s') * V(s')] )     # the usual Bellman equation
a             s'

now,   V(s') =   Σ w(S',Sb) * V(Sb)     # treating s' as an unseen next state whose value we approximate
Sb ∈ B

so, by substitution, the bellman equation becomes the following

V(s) = max( R(s,a) + γΣ[T(s,a,s') * Σ w(s',Sb) * V(Sb)] )
a             s'          Sb∈B

= max( R(s,a) + γ Σ [ Σ {T(s,a,s') * w(s',Sb)} * V(Sb)] )     # just re-arranging things
a             Sb∈B s'

now, notice T(s,a,s') is just a convex function that takes value 0 ~ 1, and sums up to 1
also w(s',Sb) is a convex function that takes value 0 ~ 1, and sums up to 1
so their combination is also a convex function.
let's simplify the notation as follows.

Σ[ T(s,a,s') * w(s',Sb) ]  = T'(s,a,Sb)
s'

so we can re-write our bellman equation as below.

V(s) = max( R(s,a) + γ Σ [T'(s,a,Sb) * V(Sb)] )      # thus, indeed, we proved averager connects to MDPs
a             Sb∈B                           # we can solve MDPs using standard methods, like VI, Q-learning

note: there is a paper that talks about how the approximation error decreases as the number of anchor points increase. (intuitive phenomenon)

note: the lecture didn't cover, but read the paper on LSPI - least square policy iteration

########################################
####   Partially Observable MDPs    ####
########################################

you don't get to observe the states for certain.

A = set of actions
S = set of states
Z = set of observables (many instances of z. note: z is a partial info about state)
T = transition function
R = reward function
O = observation function : O(s,z) = probability of seeing z in state s

###
###  POMDP generalizes MDP  (yes!)
###

MDP: <S,A,T,R>
POMDP: <S,A,Z,T,R,O>

you can represent any MDP in POMDP.
a MDP is a POMDP with  Z=S, and
 O(s,z) = 1 iff s=z

###
###  state estimation
###

because you only partially observe state, you are not sure which state you are in, so you estimate Pr(state=1), Pr(state=2), and so on.
we look at formal systematic approach to this.

here is an example quiz.

b : belief state - it's a vector of probability distrib about states
e.g.  b(s) = probability we are in state s
e.g.  b,a,z -> b'     # we are in belief state b, take action a, then get observation z, and then form b'
conceptually, we turn POMDP into "belief MDP" where states of MDP are probability of states in the underlying POMDP.
here, reward is considered either not observed or observed and included in z.
some argue, r = f(z), so reward is part of observation.

b'(s')  =  Pr(s' | b,a,z)     # probability of being in state s' after b,a,z

Pr(z|b,a,s') * Pr(s'|b,a)       #  just using Bayes rule, to turn b'(s') into quantities we know of.
=  -----------------------------
Pr(z|b,a)                #  Pr(z|b,a) = normalization factor

Pr(z|b,a,s')                                    #  Pr(z|b,a,s') = Pr(z|s') = O(s',z)   because probability of observing z in s' is independent of a,s
= -------------- Σ [ Pr(s'|s,b,a) * Pr(s|b,a) ]    #  Pr(s'|b,a) = Σ Pr(s'|s,b,a) * Pr(s|b,a) = Σ Pr(s'|a,s) * b(s)
Pr(z|b,a)    s                                 #               s                            s

O(s',z)
= ----------- Σ T(s,a,s') b(s)       #  Pr(s'|a,s) = T(s,a,s')
Pr(z|b,a)  s

###
###  Value Iteration (VI) in POMDP
###

we need to carefully represent VI in POMDP in a finite way so it can still compute value even given an infinite state base from belief POMDP.

for t = 0, Vt(b) = 0
for t > 0, Vt(b) = max(R(b,a) + γΣ Pr(z|b,a) * Vt-1(b'))    where b' = SE(b,a,z)     # SE = state estimation
a            z

note:   R(b,a) = Σ Pr(s) * R(s,a)    # it's average reward over the current belief
s

= Σ b(s) * R(s,a)     # Pr(s) = b(s)
s

CLAIM:  Vt(b) =  max[ α*b ]  =  max [ Σ b(s) * α(s) ]      # known as "piecewise linear & convex"
α ∈ Γt         α ∈ Γt s

note:   Γt is a set of vectors
note:  what we are claiming is any value function V() at any step of VI, can be represented finitely as max of a bunch of linear functions. each linear function can handle infinite number of input. e.g. f(x) = 2x  can handle infinite number of input.

###
###   piecewise linear and convex : proof
###

we will prove by induction.

base case:  there exists a set Γ0 such that  V0(b) =  max α*b  = 0
α ∈ Γ0
what's the size of the vector b ?
- it's |s|
- recall Σ b(s) = 1
s

Γ0 = {[0,0,0,,,,,0]}                      # where length of this vector is |s| = N
= {[R(s0,a),R(s1,a),,,R(sN,a)] | a∈A}

general case:

assume we have Γt-1  s.t.  Vt-1(b) =  max α*b
α ∈ Γt-1
build Γt  s.t.  Vt(b) =  max α*b
α ∈ Γt

Vt(b) = max Vt|a(b)     # note this can be represented as  Γt = Ua Γt|a   where you take union of vectors for all actions
a∈A             # |b| = ∞, so we want to avoid exploring it. the above bag of vector and taking union lets you.

Vt|a(b) = Σ Vt|a,z(b)   # similarly, this can be represented as  Γt|a = (+)z Γt|a,z    where (+) denotes "cross sum"
z             # i.e. you take a set of vectors for each a,z pair, then sum up vectors for all z
# Vt|a() = Q function

Vt|a,z(b) = Σ [R(s,a) * b(s)] + γPr(z|b,a) * Vt(SE(b,a,z))      # see more math in the ref URL
s

note: size of b(s) is |z|

###
###   algorithmic approach
###

suppose you have 3 vectors

(1)       (2)
|    \       /
|     \     /
|      \   /
|       \ /
|_______/_\________(3)
|      /   \
----------------------------
s       e                 # s = start, e = end. let's say that's the belief state range.

a vector is
"dominated" : if another vector has bigger values for any range.
"purgable"  : if another vector has bigger values for the belief state range. (you just need to check two corners)

note: "strictly dominated" means bigger, and "dominated" means equal or bigger.

computational complexity wise - it's linear programming (i.e. polynomial)

###
###  RL for POMDPs
###

planning: you know everything (model of the world)  # e.g. PI, VI  which assume you have T(), R()

RL: you don't know models, so you need to interact, experience, explore/exploit.
- model-based RL: learning models  # learn POMDP
- model-free RL: don't bother learning models   # map observations to actions

| uncontrolled | controlled    # controlled = agent can choose actions
-----------------------------------------------
observed | Markov-Chain |     MDP
partially observed |     HMM      |    POMDP      # HMM = hidden markov model

###
###  learning memoryless policies (for POMDP)
###

memoryless: here it simply means the agent cannot remember its prev actions. so we cannot give a policy that directs sequential actions, like left-left-right.
we can only direct a single action with probabilistic split. e.g.  33% left, 67% right.

###
###   Bayesian RL  (for POMDP)
###

based on observables, you probabilistically try determine the current belief state.
(normalization done with Bayes rule, hence Bayesian RL for POMDP)

what is "optimal policy" in POMDP ?
- two aspects
-- (1) figure out MDP                                # explore
-- (2) maximize reward in the current belief state   # exploit

==> solving POMDP gives you the optimal policy.

Bayesian RL is actually planning in a kind of continuous POMDP.
- value function - not linear, but piecewise polynomical and convex
- BEETLE (an example bayesian RL algo)
- too expensive (compared to Q-learning)

###
###  predictive state representation (PSR)
###

POMDP: belief state is probability distribution over states.
(states are never fully observed. do they even exist? we don't really have a way to know. PSR is one approach to think about this.)

PSR: predictive state is probabilities of future predictions. (it's just another way to represent the RL problem/env/world. in terms of probability distribution of the outcome of test/action)

(very good quiz) https://www.youtube.com/watch?v=i2e30sQbp7M   (so you can convert between belief state <-> predictive state representations, given enough (right) tests)

[PSR theorem] any n-state POMDP can be represented by a PSR with no more than n tests, each of which is no longer than n steps long.

note: PSR lets you represent the world in terms of what you can (continuously) observe. and for some problems, it's easier to learn PSR than POMDP.
(we don't go into details, nor look at example PSR learning algo in this class)

#########################################################################
####     Options (for generalization, abstraction, scale, search)    ####
#########################################################################

###
###  what makes RL hard ?
###

- delayed reward
-- reward is weak feedback (you only get how you were doing, unlike supervised learning which tells you what you should do)
- bootstrapping - need exploration
- complexity in terms of #states, #actions, for function approximation of V(s), Q(s,a)
-- how to scale ?

###
###  temporal abstraction (of actions)
###

think about a huge x-by-y grid world, with obstacles, destination, agent current coordinate, etc.
your action space is Up, Down, Left, Right.
instead of million step instruction of U->L->D->R->D so on, you might wanna give an abstract instruction like "go to coordinate x,y and then go to this door, then search near foobar.."
===> i.e. you give new (more abstract in this case) actions.
generally, increasing action space (or state space) is frowned upon, but in this case, instead of potentially million step instruction, you may have reduced it to a 5-step instruction.
the benefit is enormous. (saves space, simplified instruction)
imagine the coordinate x,y is your state, then constructing policy π(s) = a for every state s in a large grid world is infinitely time consuming.
but if you can simply instruct (if x,y within this neighborhood, then take this particular action a), then it's computationally (both space and speed) simplified.
we call this "temporal abstraction"
if you simplified state space. action sequence, policy like so, then the complexity of learning process is also reduced.
think in terms of q learning, the agent can quickly back-propagate the values from the destination state, to further states, to update true Q(s,a)

here, loosely,  options ≅ temporally abstracted actions(choices)      # aka "temporally extended actions"

an option o = <I,π,β>       # generalization (aka temporal abstraction) of actions (i.e. an option = a set of actions with some policy and initial/terminal set of states with probability)

I = initiation set of states    # where the option may (legally) start (or is reasonable to execute) in one of those states
π = policy    (s,a) -> [0,1]    # gives the probability of being in state s, and taking particular action a. policy itself maps states to actions (either deterministic or stochastic).
β = termination set of states  s -> [0,1]  # a set of states, with each state given some probability of ending up in that state with the particular option

===> now think about how we can set up a Bellman equation accounting for this temporally abstracted action concept 'option'.
first, recall Bellman equation was for MDP. but here 'option' gives us SMDP.

##
## SMDP (semi MDP)    #  aka "multiple variable time MDP"
##

a,r   a,r   a,r   a,r   a,r
MDP: s --> s --> s --> s --> s --> ...    # here, transition/action is atomic. (you take a in s, get r, and end up in s'. that's always the same length 'atomic' time step. there is no room for other things to come in the middle)

o,R(s,o)       o,R(s,o)       o,R(s,o)
SMDP: s -------> s ---------------> s ----->  ...   # here, it's variable time length transition/action, because we take option o in state s, and the number of steps is variable. (because the number of actions in an option is variable)
# SMDP violates MDP's atomic transition/action property. so we have to be smart about getting our SMDP Bellman equation.

##
##  temporal abstraction (SMDP) Bellman equation
##

Vt+1(s) = max( R(s,o) + Σ[ F(s,o,s')* Vt(s')] )       # just our bellman equation, adapted to SMDP
o            s'                            # recall instead of an atomic action, we have temporally abstracted set of action (called option)

R(s,o) = E[r1 + γ*r2 + γ^2*r3 ... +  γ^(k-1)*rk | s, k steps]     # the idea is you execute o in s, which causes you k steps of transition, with some reward at each step.
# so R is the expected value over a set of rewards, discounted at each step, starting from state s, taking k steps as a result of executing option o (o ends after k steps)
F(s,o,s') = E[γ^k Δ(sk,s') | s, k steps]   # expectation of ending up in s' after k steps as a result of taking option o, having started in s

Δ(sk,s') = 1  if k'th state sk is equal to s'
= 0  otherwise

note: discount factor γ is hidden inside R() & F(), which is the key aspect of this effort of adapting Bellman equation to SMDP. because option entails multiple actions, we have to sort of embed discount factor within R(), F()
note: SMDP is a big topic, here we only covered briefly. the point is we developed SMDP Bellman equation which allows us to treat SMDP as MDP, as far as iterative value function learning goes.

(ref) https://en.wikipedia.org/wiki/Dirac_delta_function

###
###  example problems of options (= temporal abstraction of actions)
###

[Pac-Man game]

- action: L,R,U,D
-- what is its temporal abstraction (i.e. option)?
--- eat dot
--- avoid ghost
--- eat big dot (gains a temporary power)
--- eat ghost

- states
-- what are the features that comprise states?
--- dots : location, reserve
--- big dots : location, reserve
--- pacman : location, status (powered up?)
--- ghost : location, status (scared?)

[quiz]

to execute each option, what feature does pac-man need to know ?

(option)\(feature) |  dots  |  big dots  |  ghosts |  scared ghost |    # pac-man location is assumed always known anyway, so not included here
--------------------------------------------------------------------
eat dot      |  yes   |            |         |               |    # you can argue you need to know ghost location/status
eat big dot  |        |   yes      |         |               |
eat ghost    |        |            |   yes   |   yes         |
avoid ghost  |        |            |   yes   |   yes         |    # you can argue you don't need to know ghost status if you wanna avoid it anyway

===> as you can see, there is no one absolute answer, you can argue in multiple ways.
but note that if you can reduce required features for each option, that's good "state abstraction" which makes it computationally efficient.
(because if you require all features, then there is no abstraction, it's like saying know all states and solve the whole game, which is not useful)

NOTE: in general, this type of game is called "predator prey" where you try to eat (to not startve) while avoiding getting eaten.

###
###  options (temporal abstration of action) properties
###

- SMDP inherits MDP properties:
-- optimality of policy: "hierarcical optimality" i.e. policy gets optimal w.r.t the options you define
-- convergence proof
-- stability proof

- avoid boring parts of MDP  (e.g. less exploration required as options typically have some policy builtin, like "go to this door that connects to a room where the goal state is")

- allows state abstraction  (if we set up options in such a way that the agent doesn't require the knowledge of all state features, like we saw with Pac-Man quiz)

###
###  goal abstraction
###

another angle to think about this abstraction methodology is in terms of goals.
options themselves can be construed as "goals" e.g. "eat dots", "avoid ghost"
and you wanna achieve them all, but at any given time, you probably prioritize one over the others. (arbitration)
so they are "parallel goals"
i.e.
thru abstraction of options, we went from "sequencial actions" to "paralell goals"

##
##  arbitrating between multiple goals       # aka "modular RL"
##

there are a few known techniques.

(1) Greatest mass Q-learning

each option(goal) trains its own Q(s,a) function.
e.g.
"eat dots"      Q1(s,a)
"avoid ghost"   Q2(s,a)
"eat big dots"  Q3(s,a)
"eat ghost"     Q4(s,a)

===> then they just vote (sum), and pick the action with most values. i.e. Q(s,a) = Σ Qi(s,a)

note: this may not the best thing for everyone, because sometimes a mediocre (like 5th ranked action in each Q learner) action may get elected as a whole if everybody has disagrees on top actions.
(that can be good or bad, there is no free lunch after all.)

(2) Top Q-learnings

similar to (1), but just pick action of whichever Qi(s,a) that has the highest value. i.e. Q(s,a) = max Qi(s,a)

(3) Negotiated Q-learnings

this is actually complicated, but essentially the agent with the most to lose gets to pick the action.
(basically you don't want any agent to lose big, rather than picking an agent to win)

===> all these techniques suffer from a fundamental issue "Arrow's impossibility theorem" which basically says "fair voting" is impossible.
because for example, each Q-learner agent may have different metric system, so one's value "73" may be different from other's value "73"
for another example, are all modules/options/goals trying to achieve the same thing or different things?
maybe there can be an arbitrator that can take care of incompatible value metric sysems or arbitrate based on a larger shared goal among options?
(it is one reearch areain RL)

(ref) https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem

NOTE: notice some of so called "goals" are really "constraints"

e.g. "each XYZ" goal terminates when you achieve "success" so this is truly a "goal"
e.g. "avoid ghost" never succeeds, neer terminates until you "fail" so this is really a "constraint"

###
###   "scalability" (in distinction from abstraction)
###

"Monte Carlo Tree Search (MCTS)" algorithm     (a policy search algo)

o         each node is a state
/ | \       each branch is an action/option
o  o  o
/ \   /|\     assume for the current tree, you have some policy π(s)=a
o   o o o o

while computationally affordable:
1. select       # select action,  π(s)=a   # maybe it's essentially argmax_a Q(s,a)
2. expand       # once you hit the leaf where you don't know how to select action, because you don't know enough about next states, reward, values, then you expand (explore)
3. simulate     # assuming you can't expand to explore every single state,action pair (as there can be too many), then you sample/simulate some and the further steps (action, discounted reward) and estimate true Q(s,a)
4. back up      # back propagete the info up, in a Bellman equation way. so the more you expand/simulate, your upstream node/edge value estimate improves.

note: it's similar to "game tree" algo, but here we deal with a single agent MDP, not multi agent game tree.

note: when expanding, you sample (assuming there are too many action,states to explore) and simulate next s,a pairs, enough of them (maybe 5, maybe 100 of them, depending on your action,state space)
then for each of these subsequent s,a pairs, you use "roll out" policy πr(s)=a to explrore further downstream steps (finding out reward info, etc), in order to estimate Q(s,a)
note: "roll out" policy can be simple, like just take random actions in a given state

###
###  Properties of Monte Carlo Tree Search algo
###

- useful for large state space MDPs
- need lots of samples to get a good estimate     # hopefully you pick samples that are representative of the whole
- planning time is independent of the size of state space |S|
- running time is exponential in the horizon H (depth of tree, how far you wanna look into the future)
-- complexity is  O( (|A|*x)^H )    where x = how many iteration of the loop you run

note: this whole action/state/goal abstraction is an active research area, both in theory and application.

###############################
####    Game Theory - 1    ####  # see ML notes   http://kenics.net/gatech/ml/
###############################

###############################
####    Game Theory - 2    ####  # see ML notes   http://kenics.net/gatech/ml/
###############################

###############################
####    Game Theory - 3    ####
###############################

###
###  Solution Concepts
###

- Nash Equilibrium (aka NE where no player has interest to switch strategy even if given a choice)
- Correlated Equilibrium (CE)
- Trembling Hand
- Stackelberg
- Subgame Perfect
- coco (cooperative-competitive)
- bayes Nash
- more

====> we will focus on CE

###
###  general tso chicken (GTC) game
###

general tso was a chinese general from rangoon qing dynasty who had a chicken dish named after him.
here we mean general sum game of chicken.

- where two drivers drive toward each other, and whoever swervs first is chicken
(or two drivers both drive toward the cliff, and whoever stops first is chicken)

| dare | chicken
-------------------------    # general sum game
dare |  0,0 | 7,2         # reward is symmetric
checken |  2,7 | 6,6

what are the possible combinations of pure strategies?      # recall pure == deterministic
and which ones are NE ?

1.  D,D      # not NE
2.  D,C      # NE
3.  C,D      # NE
4.  C,C      # not NE

===>  but those two NEs suck, because depending on which side you are on, you don't want it.
you wanna be "D" (reward 7), rather than "C" (reward 2)
(multiple equilibiria and which one to select, is a big fundamental problem with NE, which is the motivation for CE)

What about NE with mixed strategies ?      # recall mixed == stochastic

===> turns out NE mixed strategy is
- dare 33.3%
- chiecken 66.7 %

strat | payout | probability
C,C  |   6    |   4/9
C,D  |   2    |   2/9
D,C  |   7    |   2/9
D,D  |   0    |   1/9

===> sum up the probability weighted payout, then you get 14/3 (approx 4.66 reward) as mixed strat NE

###
###   Correlated Equilibrium (CE)
###

as we discussed above, in general sum markov games, there can be multiple NEs with multiple payoff values.
- how can multiple agents select among multiple equilibria with multi equilibrium values? it's a challenge.
CE overcomes this problem, lets you pick some objective function to compute equilibrium value, thus resolving this equilibrium selection problem.

suppose there is a third party who picks a pure strategy for two palyers.
let's say a pure strategy is uniformly randomly chosen out of

1.  C,C
2.  D,C
3.  C,D

and the third party (aka "shared source of randomization") recommends each player what his action should be.
(both players are aware of the setup so far)

e.g.  D,C is picked.
then the third party tells player 1 "we recommend you action D"
similarly tells player 2 "we recommend you action C"
note the third party doesn't tell each player what the other player is being told.
so each player only knows what his recommended action is.

suppose, as player 2, you are being told the following, then what action should you take ?

message      | what you should do (to maximize your reward)
-----------------------------------------------------------------------------------------------------------------------
take action D | D, because we know the other player is being told C, so by taking D, we secure reward 7.0
take action C | C, because if you take D, then 50% chance it's C,D (reward 7.0), 50% D,D (reward 0), thus avg reward 3.5
| whereas if you take C, then 50% chance it's C,C (reward 6.0), 50% D,C (reward 2.0), thus avg reward 4.0

what's the expected value of CE ?

strat | reward | probability
C,C  |   4    |  1/3
D,C  |   4    |  1/3
C,D  |   7    |  1/3

===> 5 on avg   (see you surpassed reward 4.5 from mixed strategy NE)

so by probabilistically correlating your bahavior with the other player's, you get better expected equlibrium reward/value than NE.

###
###  CE properties
###

1. CE can be found in polynomial time   (advantage over NE)
2. all mixed NE are correlated, so CE exists
3. all convex combinations of mixed NE are correlated

###
###  coco (cooperative-competitive) values         # coco aka friend-or-foe
###

(two must read papers)  "Friend-or-Foe Q-learning in general-sum games" by Littman
 "Markov games as a framework for multi-agent reinforcement learning" by Littman

suppose two guys (tall & short), and a tree with 4 bananas.
- short guy cannot reach any bananas.
- tall guy can reach 2 bananas.
- if (1) tall guy climbs on short guy's shoulder, and (2) short guy allows it, then they get 4 bananas together. (but let's say for now, tall guy gets all 4 bananas)

so we have this matrix
tall_guy
| reach | climb
--------------------------
short   dont help |  0,2  |  0,0
guy         help |  0,2  |  0,4

===> obviously, tall guy ideally wants 4, but short guy is not getting any incentive to help.

###
###  coco definition
###

coco value of a game is defined as follows:
_
a general sum game of U,U

U = payoffs to you

U = payoffs to other
_                _
_            U+U              U-U
coco(U,U) = maxmax[-----] + minimax[-----]
2                2
_____________   ______________
cooperative     competitive(zerosum)
(just take max)   (just solve NE w/ linear programming which is in poly time)

(ref) "linear programming and zero sum games" by Ron Parr. https://www2.cs.duke.edu/courses/fall12/cps270/lpandgames.pdf

###
###  coco example
###

U = [0 0]     # short guy
[0 0]

U = [2 0]     # tall guy
[2 4]
_
U+U
--- = [1 0]     # maxmax
2    [1 2]      (you get 2 by help/climb strategy)
_
U-U
--- = [-1  0]   # minimax
2    [-1 -2]     (you get -1 by not_help/reach strategy)
_                             _                   _
coco(U,U) = 2 - 1    # if you swap U,U as in U=tall_guy, U=short_guy        _
= 1        # and recompute the above numbers, then you get coco(U,U) = 3

so, coco value is 1  (from short_guy's perspective)
3  (from tall_guy's perspective)
_         _
side payments  p = coco(U,U) - U(a*,a*) = 1 - 0 = 1
_        _      _    _                                _
p = coco(U,U) - U(a*,a*) = 3 - 4 = -1    # obviously, p is the inverse of p
_
where U(a*,a*) = max utility value of joint action (which is 0,4 from help,climb strategy in this case)
(here both players cooperate and take utility maximizing join actions)

===> so tall_guy should pay 1 banana to short_guy to incentivize help

###
###  coco properties
###
- known to be applicable to two player games (applicability to three or more playe games is still open problem)
- efficiently computable (linear programming. there are many solvers available)
- utility maximizing (joint action/behavior, then they split up to decide side payments)
- decompose game into sum of two games (cooperative & competitive, shifting around utilities)
- unique coco value (there can be multiple equilibrium policies but the equilibirum value of the game is unique, this is an advantage of CE over NE)
- can be extended to stochastic games (coco-Q, not nonexpansion, i.e. converges)
- not necessarily equilibrium (best response) i.e. some player may not like coco values and may prefer NE values
(side payment needs to be binding agreement)

###
###  Mechanism Design
###
- games produce behaviors
- based on behaviors you observe, you can make a game that produces particular behaviors you wanna see
- it's a cycle. let's see two examples "peer teaching" and "king solomon"

###
###  Peer Teaching
###

there are two players : student & teacher
questions in the order of difficulty (or you may call it in the spectrum of student's understanding)
e.g.
(individual questions)
o o o o o o o o o o o o o o o o o o o o
easy <------------------------------------------> hard
difficulty

teacher picks and asks student questions.
goal: maximize student's learning

what reward system should we have to incentivize both teacher and student learning ?

proposal (1)
- student gets +1 when answering correctly
- teacher gets +1 when answering correctly

===> incentivizes teacher to pick easy questions. not maximizing the learning.

proposal (2)
- student gets +1 when answering correctly
- teacher decides where student's understanding is in the spectrum, and asks questions around that area (maybe in a Gaussian distrib)
then gets +1 if student answers correct on the questions beyond their understanding line. (i.e. student answers hard questions correct)
gets +1 if student answers incorrect on the question they should get right. (i.e. student cannot correctly answer easy questions)
===> for teacher to get points from both sides of the bar, teacher is incentivized to set the bar where student's level of understanding really is.
from student's perspective, it's always better to learn more (then more points)

###
###  King Solomon
###

King has two wives (let's call them A & B). And there is a baby. Both wives claim to be the mother.
King wants to know which one is the real mother.
How can he get them to tell the truth ?  --> this is a Mechanism Design problem.

one assumption is the baby must be more valuable to the real mother than to a fake mother.
let's say Vreal > Vfake

King will
(0) choose a small fine F > 0
(1) ask wife A "is this your baby?"
if A says no, then B gets the baby
if A says yes, then goto (2)
(2) ask B "is this your baby?"
if B says no, then A gets the baby
if B says yes, then goto (3)
(3) A pays F, and B must announce V > F      # V = value
(4) ask A "is this your baby?"
if A says no, then B gets the baby, and B pays V
if A says yes, then A gets to keep the baby, but A pays V, and B pays F

===> the idea is fake mother will only bid as high as Vfake, and real mother will willingly pay Vfake (or even as much as Vreal), so real mom will win.
and since fake mom will have to pay F by saying yes and announcing V=Vfake > F, while true mom pays Vfake, so fake mom is incentivized to bail, and pay no fine.
===> this mechanism design will incentivize two wives to tell the truth, and King may learn the intrinsic value of baby to fake mom.

########################################################
####    Coordinating Communicating Coaching (CCC)   ####
########################################################

MDP framework we studied so far is "solipsistic" (i.e. only considers him alone VS the rest of thw world)

we wanna redefine our MDP/Markov planning/learning framework in such a way that considers coordination/communication/coaching between multi agents.

###
###  DEC POMDP    # DEC = decentralized
###

recall POMDP where agent takes action to (1) gain reward, (2) gain information (about the world that helps agent gain more reward)

DEC-POMDP is defined by

I  : finite set of agents
S  : states
Ai : agent i's actions
T  : joint transition function. i.e.  T(s,a,s')  where a = vector with one action per |I|
R  : shared reward function (Ri if POSG)  POSG = partially observable stochastic game
Zi : agent i's observations
O  : observation function

DEC-POMDP is where POMDP meets Game Theory. If reward function is seperate and distinct for each agent, it's really game theory, but as bove, we share reward function among agents, i.e. shared interest game.

###
###  DEC-POMDP properties
###

- elements of game theory and POMDPs
- NEXP-complete for finite horizon
- agents can benefit from communication
- optimal solution balance cost of communicating with cost of not communicating
- there are known algorithms, heuristics, applications.

###
###  DEC-POMDP example
###

suppose two players in a grid world, with the usual action set (e.g. up, down, left, right, stay)
each player only knows who is in current state (including himself), not knowing where the other person is (unless he comes into the same state).
goal/win is for two players to come to the same state. thus coordinating is in both parties' best interest.

what strategies are available ?
- one stays in the same state, and the other explores.
- both try reach one same state.

but being POMDP makes it hard. (current state not always deterministically known)

###
###  communicating & coaching
###

what is agent A wants agent B to learn to get an apple ?
agent A needs to communicate to / coach agent B to learn the task (i.e. build its own MDP, goal / reward function, etc)

===> inverse RL is an answer

###
###  inverse RL
###

[regular RL]

rewards  -->  RL  --> behavior
environment  -->

[inverse RL]

behavior -->  RL  --> rewards
environment -->

e.g. you can observe an agent's behavior in an environment, then learn what reward function the agent is operating under, also a lot of info about the environment.

there can be more than one algo for inverse RL.
one algo for inverse RL is MLIRL: maximum likelihood inverse RL
1. guess R
2. compute π
3. measure Pr(D|π)    # D = data
4. gradient on R, then goto 1.

==> read Littman paper, it turns out MLIRL finds reward func very well, but can be slow and can get stuck in local minima. (and getting the terminal states right is hard)

###
###  policy shaping
###

recall "reward shaping" where you messed with reward, transforming reward func, without affecting optimal policy.
suppose in the game of Pac-man, if you can inform Pac-man (our agent) that a certain action he took in a certain state was good or bad.
then this is additional reward you are shaping, but also it can be construed as "policy shaping" because you are directing what the agent should (shouldnt) do in a certain state.

##
##  policy shaping -  quiz 1
##
in a given state, let's say human gave pos or neg feedback.

human feedback\action |  x  y  z
------------------------------------
pos |  1  0  0
neg |  0  0  0

probability of human being correct = 1

Pr(action) = probability of action being optimal
------------------------------------------------
Pr(x) = 1
Pr(y) = 0 or not known.  (you might say 0.5 if uniformly random and no other knowledge)
Pr(z) = 0 or not known

===> because all we know is human said x is correct action to take, so we are sure about x.
assuming there is only one optimal action, then the rest is 0.
but if there can be more than one optimal action, then we have no information about y,z

note: ΣPr(a) = 1  only if we assume there is only one optimal action per state
a

note: if we assume there is only one optimal action per state, then the above table means the below table.

human feedback\action |  x  y  z
------------------------------------
pos |  1  0  0
neg |  0  1  1

##
## policy shaping -  quiz 2
##
in a given state, let's say human gave pos or neg feedback.

human feedback\action |  x  y  z
------------------------------------      # the same setup as quiz 1
pos |  1  0  0
neg |  0  0  0

probability of human being correct = 0.8     # notice this changed from quiz 1

assume only one optimal action per state. i.e. ΣPr(a) = 1
a
i.e.
human feedback\action |  x  y  z
------------------------------------      # the same setup as quiz 1
pos |  1  0  0
neg |  0  1  1

Pr(action) = probability of action being optimal
------------------------------------------------
Pr(x) = 4/6
Pr(y) = 1/6
Pr(z) = 1/6

===> the way to think about this is, without any knowledge, then we go with uniformly random,
Pr(x) = 1/3
Pr(y) = 1/3
Pr(z) = 1/3

then we know x is given positive signal 100%, with 80% likelihood of correctness.
and y,z are given negative signal 100%, again with 80% chance of correctness, i.e. positive signal 100%, with 20% chance of correctness.
(rigorous reasoning involves bayes rule and generative model)
Pr(x) = 1/3 * 0.8
Pr(y) = 1/3 * 0.2
Pr(z) = 1/3 * 0.2

==> normalizing the output to enforce ΣPr(a) = 1, you get 4/6, 1/6, 1/6 answer.
a
###
###  policy shaping
###

in a given state s,

# (1) general case:  there can be multiple optimal actions in a given state

C^Δa
P(a|Da) = -----------------
C^Δa + (1-C)^Δa

Da = data suggesting a is optimal action
Δa = #yes - #no
C = probability of feedback (yes or no) being correct

# (2) only one optimal action case: (i.e. only one optimal action in a given state)

P(a|Da) ∝ C^Δa * (1-C)^Σ[Δj]
j≠i

note "∝" denotes "proportional to" because you have to normalize like we did in the above example.
or you can replace it with an equal sign "=" if you put some normalization factor in the equation.

###
###  mulitple sources of info
###

you may train a policy π_H based on human feedback, but also the agent gets lots of info from the world, so it can train a policy π_A based on its real world experience.

policy\action   |  x  |  y  |  z  |
------------------------------------
π_H(s) | 2/3 | 1/6 | 1/6 |    # probability of each action being optimal in state s, according to policy π_H
π_A(s) | 1/3 | 1/3 | 1/3 |

==> there are algos that consider both to decide action.
if C = 1, then agent can trust human absolutely. otherwise, you wanna take π_A into account.
you can assign heavier weight/trust on π_A as C decreases and as experience increases.

if you equally trust them, then

optimal_action =  argmax P(a|π_H) * P(a|π_A)     # probability both policies agree on action a being optimal in a given state s
a

###
###  Drama Management
###

so far we've been looking at the topic of "coaching" & "communication" from the perspective of human influencing the agent via
- demonstration   # direct
- reward shaping  # indirect
- policy shaping  # indirect

but here we look at MDP from "drama management" roles,
(suppose in the game of Pacman)
i.e.
- author   # who designs the game
- agent    # Pacman himself
- player   # who controls Pacman

what is a story ?
- trajectory of observations as plot points
e.g.
if it's a mystery story, then among many observations (story plot points), you connect relevant ones in a relevant sequence to form a story of your choice.

* --------> * ---------> * -------> so on
murder     accusation   weapon is found

How can we influence a player ?

#  Trajectories as MDPs

states : partial sequences of plots   # hyper exponential possibilities. i.e. O(2^2^n) assuming n plots. because you can choose the sequence length, and choose whether a particular plot is included in the sequence
actions : story actions
model : player model
rewards : author evaluation

#  TTD-MDPs     # TTD = targeted trajectory distribution

trajectories (i.e. states) : plot sequence so far      # plot sequence == story
actions : actions
model : P(t' | a,t)         # probability of ending up in trajectory t'  after taking action a in trajectory t
target probability distribution : P(T)                # T = as in your target trajectory, so if you have one T as desired trajectory, then you set P(T)=1
----------------------------------------------------- # or if you have a few particular trajectories as desired target, then maybe P(T1) = 0.4, P(T2) = 0.3, P(T3) = 0.3
policy : probability distribution over action

===> turns out we can solve TTD-MDP in linear time (wrt the length of a story. i.e. # of plot points) - read the paper for formal proof.
but the gist is suppose you build a tree of possible trajectories.

(1)        # plot point (1)
/ | \       # then you branch into hypoer exponential possibilities of trajectories
(2)(3)(4)     # but once you reach (3) for example, then you can forget about (2) & (4) and their downstream branches as you can never get there from (3)
//|\/|\ |\\    # thus search time is not as bad as you might fear.

##########################
###    outroduction    ###
##########################

application of RL to real world problems.

- TD gammon
- elevator control
- helicopter control
- stock trading strategy
- DRL by deepmind to solve Atari games
- various other games
- robotics
- behavioral/neural science
- as long as you can model the problem as MDP for your RL algo to solve, where you can simulate the env/world which makes it easy for agent to interact, etc.

1. 2016-10-02 16:11:21 |
2. Category : gatech
3. Page View: