## Policy Optimization methods in Reinforcement Learning

### Contents

- **Markov Decision Process**
- **Bellman Equation**
  - Value Functions
- **Policy Optimization**
  - 1. Policy Iteration
  - 2. Value Iteration

19 October 2018 | Aditya Jain

Before we start discussing policy optimization methods in Reinforcement Learning, let me first clarify a few terms such as Markov decision process, Bellman equation, states, actions, rewards, policy, and value functions. Let's understand them one by one.

In Reinforcement Learning, an AI agent learns how to interact optimally with a real-time environment, using time-delayed labels called rewards as a signal.

**Markov Decision Process** is a mathematical framework for defining the reinforcement learning problem in terms of STATES, ACTIONS, and REWARDS.

Through interacting with an environment, an AI agent learns a policy that returns, for a given STATE, the action with the highest expected reward.

MDP is an approach to reinforcement learning for making decisions in a grid world, where the grid cells are the states. An MDP captures the world by dividing it into states, actions, a transition matrix, and rewards. The solution of an MDP is a policy, and the objective is to find the optimal policy for the task the MDP poses. Thus any reinforcement learning task composed of a set of states, actions, and rewards that follows the Markov property can be considered an MDP.

` \pi^\ast ` is called the optimal policy: among all policies, it is the one that maximizes the amount of reward received, or expected to be received, over a lifetime. An MDP need not have a natural end of lifetime, in which case you have to decide on an end time.

A policy is nothing but a guide telling which action to take in a given state. It is not a fixed plan; rather, it uncovers the underlying plan of the environment by returning the action to take for each state.

A Markov Decision Process (MDP) is a tuple ` (S, A, T, r, \gamma) `.

` S ` : Set of Observations

` A ` : Set of Actions

` T ` : Transition Model

` r ` : Reward Model

` \gamma ` : Discount factor (between 0 and 1), representing the relative importance of immediate versus future rewards.

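To make the tuple concrete, here is a tiny hypothetical MDP written out in Python. All of the numbers (two states, two actions, the transition probabilities, and the rewards) are invented purely for illustration:

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP as the tuple (S, A, T, r, gamma);
# every number below is made up for this sketch.
S = [0, 1]                      # set of states / observations
A = [0, 1]                      # set of actions
# T[a, s, s'] : probability of landing in s' after taking action a in state s
T = np.array([[[0.9, 0.1],
               [0.1, 0.9]],
              [[0.2, 0.8],
               [0.8, 0.2]]])
# r[s, a] : immediate reward for taking action a in state s
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9                     # discount factor, between 0 and 1

# sanity check: each row of T[a] must be a probability distribution over s'
assert np.allclose(T.sum(axis=2), 1.0)
```

A transition tensor indexed as `T[a, s, s']` is one common convention; the same model could equally be stored as a dictionary keyed by `(s, a)`.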

Questions that the Bellman equation answers:

Given the state I'm in, assuming I take the best possible action now and at each subsequent step, what long-term reward can I expect?

In other words: what is the value of the STATE?

Bellman equations are dynamic programming equations. If you don't know dynamic programming, it is better to clear up your concepts of it first.

$$ V(s) = \max_{a} \Big( R(s, a) + \gamma V(s') \Big) $$

` V(s) ` : Value of a given state s

` \max_a ` : Maximum over actions a

` R(s, a) ` : Reward for taking action a in state s

` \gamma ` : Discount factor (gamma)

` s' ` : Next state reached by choosing action a

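Putting these symbols together, a single Bellman backup for one state can be computed directly. This is a minimal sketch in which the rewards, transition probabilities, and current value estimates are all invented numbers:

```python
import numpy as np

gamma = 0.9                       # discount factor
# Hypothetical one-step quantities for a single state s with two actions:
R_s = np.array([1.0, 0.0])        # R(s, a) for each action a
P_s = np.array([[0.9, 0.1],       # P(s' | s, a) for action 0
                [0.2, 0.8]])      # P(s' | s, a) for action 1
V = np.array([0.0, 10.0])         # current estimate of V(s') for both successors

# V(s) = max_a ( R(s, a) + gamma * sum_s' P(s'|s, a) V(s') )
V_s = np.max(R_s + gamma * P_s @ V)
# picks action 1: 0.0 + 0.9 * 8.0 = 7.2
```

Here the expectation over successor states is written explicitly with transition probabilities; in a deterministic grid world it collapses to the single term ` \gamma V(s') ` shown above.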

**State-Value Function**: The expected discounted reward when starting in state ` s ` and following policy ` \pi ` thereafter. Denoted ` V(s) ` or ` V_\pi(s) `. It measures how good a state is.

**Action-Value Function**: Maps an action ` a ` taken in state ` s ` to a real value. Referred to as the Q-function and denoted ` Q(s, a) `. It measures how good a state-action pair is for the agent in the environment.

Until now we have studied the Reinforcement Learning environment and learned what our goal in that environment is: to find the optimal policy, or equivalently the optimal value function, since one leads to the other. In this blog we will discuss policy optimization using planning by dynamic programming. Dynamic programming assumes full knowledge of the MDP and is used for planning in an MDP.

**Here we will discuss two methods:**

- **Policy Iteration**: Policy evaluation `+` policy improvement; the two are repeated iteratively until the policy converges.
- **Value Iteration**: Finding the optimal value function `+` one-time policy extraction. There is no need to repeat the two, because once the value function is optimal, the policy extracted from it is also optimal.

- Evaluate the policy ` \pi `. $$ V_{\pi}(s) = E\Big[ R_{t+1} + \gamma R_{t+2} + .... | s_t = s \Big] $$
- Improve the policy by acting greedily with respect to ` V_\pi `. $$ \pi^\ast = greedy\Big( V_{\pi} \Big) $$

**Policy evaluation**: Estimate ` V_{\pi} ` (iterative policy evaluation).

**Policy improvement**: Generate ` \pi' \ge \pi ` (greedy policy improvement).

- Iterative application of Bellman expectation backup.
- ` V_1 -> V_2 -> V_3 -> .... -> V_{pi} `.

Start with a random value function and iteratively compute a new value function, using synchronous backups:

- At each iteration k+1:
- For all states ` s \in S `:
- Update ` V_{k+1}(s) ` from ` V_k(s') ` using the Bellman expectation equation. $$ V_{k+1}(s) = \sum_{a \in A} \pi(a|s)\Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V_k(s') \Big) $$ $$ V^{k+1} = R^{\pi} + \gamma P^{\pi}V^{k} $$
- Where ` s' ` is a successor of state ` s `.
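The synchronous backup above can be sketched in a few lines of Python. The 2-state MDP, its transition tensor `P[a, s, s']`, reward matrix `R[s, a]`, and the uniform random policy are all invented for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[a, s, s']
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma = 0.9
pi = np.full((2, 2), 0.5)                 # pi[s, a] = π(a|s): uniform random policy

V = np.zeros(2)                           # V_1: start from zeros
for _ in range(10_000):
    # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V_k(s')
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    # Bellman expectation backup: V_{k+1}(s) = sum_a π(a|s) Q[s, a]
    V_new = (pi * Q).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # converged to V_pi
        break
    V = V_new
```

Because every state is updated from the previous sweep's values, this is the synchronous variant; asynchronous (in-place) backups also converge but update states one at a time.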

- Find the best policy from the value function obtained by policy evaluation, using the greedy method.
- This process of policy iteration always converges to ` \pi^\ast `.
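Combining evaluation and greedy improvement, a minimal policy-iteration loop might look like the following sketch. The 2-state MDP and all of its numbers are invented for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[a, s, s']
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma = 0.9

policy = np.zeros(2, dtype=int)           # arbitrary initial deterministic policy
for _ in range(100):
    # 1. Policy evaluation: iterate the Bellman expectation backup for this policy.
    V = np.zeros(2)
    for _ in range(10_000):
        V_new = np.array([R[s, policy[s]] + gamma * P[policy[s], s] @ V
                          for s in range(2)])
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new
    # 2. Policy improvement: act greedily with respect to V_pi.
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    improved = np.argmax(Q, axis=1)
    if np.array_equal(improved, policy):   # policy stable => converged to pi*
        break
    policy = improved
```

The outer loop terminates because there are only finitely many deterministic policies and each improvement step is guaranteed not to make the policy worse.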

- ` V_1 -> V_2 -> V_3 -> .... -> V_\ast `. Start with a random value function and update values using the Bellman optimality equation.
- Using synchronous backups:

- At each iteration k+1:
- For all states ` s \in S `:
- Update ` V_{k+1}(s) ` from ` V_k(s') ` using the Bellman optimality equation. $$ V_{k+1}(s) = \max_{a \in A} \Big( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} V_k(s') \Big) $$ $$ V_{k+1} = \max_{a \in A} \Big( R^a + \gamma P^aV_k \Big) $$
- Where ` s' ` is a successor of state ` s `.

- Unlike policy iteration, there is no explicit policy.
- Intermediate value functions may not correspond to any policy.
- Once we finally obtain the optimal value function, we extract a policy from it using the greedy method. We do this only once, because the optimal value function always yields the optimal policy.
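A minimal value-iteration sketch, again on an invented 2-state MDP, including the one-time greedy extraction at the end:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[a, s, s']
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma = 0.9

V = np.zeros(2)                           # V_1: start from zeros
for _ in range(10_000):
    # Bellman optimality backup:
    # V_{k+1}(s) = max_a ( R(s, a) + gamma * sum_s' P(s'|s, a) V_k(s') )
    V_new = np.max(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # converged to V*
        break
    V = V_new

# One-time greedy policy extraction from the (near-)optimal value function.
policy = np.argmax(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)
```

Notice there is no explicit policy inside the loop; the `max` over actions plays the role of the greedy improvement at every sweep, and a policy appears only at the very end.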

Problem | Bellman Equation | Algorithm
---|---|---
Prediction | Bellman Expectation Equation | Iterative Policy Evaluation
Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration
Control | Bellman Optimality Equation | Value Iteration

- These algorithms are based on the state-value function ` V_\pi(s) ` or ` V_\ast(s) `
- Complexity ` O(mn^2) ` per iteration, for m actions and n states
- They could also be applied to the action-value function ` q_\pi(s, a) ` or ` q_\ast(s, a) `
- Complexity ` O(m^2 n^2) ` per iteration