Introduction to MDP
The Reinforcement Learning Loop
RL is a back and forth between an agent and an environment. The agent sends an action to the environment, and the environment replies with a new state and a reward.
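The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CoinFlipEnv`, `run_episode`, and the `reset`/`step` method names are hypothetical choices made for this example, and the environment is a toy in which the agent guesses a coin flip at each step.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a coin flip each step."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon          # number of steps per episode
        self.rng = random.Random(seed)  # private RNG so runs are repeatable
        self.t = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.t = 0
        return self.t  # the state is simply the step index

    def step(self, action):
        """Apply the action, then reply with (new_state, reward, done)."""
        coin = self.rng.choice([0, 1])
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    """Run the agent-environment loop once and record (state, action, reward)."""
    state = env.reset()
    trajectory = []
    done = False
    while not done:
        action = policy(state)                       # agent sends an action
        next_state, reward, done = env.step(action)  # environment replies
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


# An agent that always guesses 0, used only to exercise the loop.
trajectory = run_episode(CoinFlipEnv(), lambda state: 0)
```

Each entry of `trajectory` is one round of the back and forth: the state the agent saw, the action it sent, and the reward the environment returned.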
Markov Decision Process (MDP)
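The interaction is a sequence of state, action, and reward that repeats step after step:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
Here $S_t$ is the state at step $t$, $A_t$ is the action taken in it, and $R_{t+1}$ is the reward that arrives together with the next state $S_{t+1}$.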
The Markov Property
- ▹The next state depends only on the current state and the action taken in it, not on the whole history.
- ▹Predicting what happens next therefore does not require knowing any of the earlier states.
- ▹The reward here is just a single number, which can be too coarse a signal once the task gets complex.
- ▹A single number reward may work for a robotic finger, but for an autonomous driving vehicle it often will not.
- ▹All of reinforcement learning is built on the foundation of the MDP.
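The first two points in the list above can be stated as a single equation. The notation is an assumption for this sketch: $S_t$ is the state at step $t$, $A_t$ is the action taken in that state, and $P$ is the probability of the next state.

```latex
% Markov property: conditioning on the full history gives the same
% next-state distribution as conditioning on the current state and action.
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_0, A_0, S_1, A_1, \dots, S_t, A_t)
```

The left side uses only the current state and action, while the right side uses the entire history. The Markov property says the two are equal, so the history adds no information.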
The Goal of an MDP
Policies
A policy is how the agent decides which action to take in a given state. There are two kinds.
Deterministic policy
Takes a state as input and always returns the same single action for that state, so the agent follows a fixed path. For example, a robot crossing a grid that hits an obstacle always turns the same way, either always left or always right.
Stochastic policy
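Takes a state as input and returns a probability distribution over actions (say 70% left, 30% right), then picks an action at random according to those probabilities. Because the choice varies, the agent explores more paths and learns more.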
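The two kinds of policy can be contrasted in a short Python sketch. The function names, the `"obstacle"` and `"free"` state labels, and the choice to turn left at an obstacle are hypothetical. Only the 70% left / 30% right split comes from the text above.

```python
import random


def deterministic_policy(state):
    """Same state in, same action out: always turn left at an obstacle."""
    return "left" if state == "obstacle" else "right"


def stochastic_policy(state, rng):
    """Sample an action from a probability distribution over actions.

    In this sketch the distribution is the same for every state.
    """
    probs = {"left": 0.7, "right": 0.3}
    actions = list(probs)
    weights = list(probs.values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the sampling is repeatable

# The deterministic policy never varies for a given state.
fixed_actions = {deterministic_policy("obstacle") for _ in range(100)}

# The stochastic policy picks "left" roughly 70% of the time.
samples = [stochastic_policy("obstacle", rng) for _ in range(10_000)]
left_fraction = samples.count("left") / len(samples)
```

`fixed_actions` ends up containing a single action, while `samples` contains a mix of both, which is what lets a stochastic agent try paths a deterministic one would never take.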
Return and the Discount Factor
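The return is the discounted sum of rewards from time t onward, written $G_t$, where each later reward is scaled by one more factor of the discount factor γ:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$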
γ (gamma) is the discount factor, from 0 to 1. It sets how much the agent cares about future rewards versus the reward right now.
γ near 0
Myopic, short sighted
Future rewards are treated as less important than the present reward, so the agent barely looks ahead.
γ near 1
Far sighted
Future rewards are weighted highly, so the agent plans for the long run and acts with those rewards in mind.
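A small worked example makes the effect of γ concrete. The sketch below assumes the usual discounted return, in which the reward k steps ahead is weighted by γ to the power k. The function name `discounted_return` and the reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ...

    Computed backwards so each step is one multiply and one add.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three small rewards followed by one large reward at the end.
rewards = [1.0, 1.0, 1.0, 10.0]

myopic = discounted_return(rewards, gamma=0.1)      # about 1.12
far_sighted = discounted_return(rewards, gamma=0.9)  # about 10.0
```

With γ = 0.1 the large final reward contributes only 10 × 0.001 = 0.01 to the return, so it hardly matters to the agent. With γ = 0.9 it contributes 10 × 0.729 = 7.29 and dominates the total.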