Milan Ghimire

Notes

Reinforcement Learning

My revision notes on reinforcement learning, organised by chapter so I can come back and refresh quickly.

Introduction to MDP

The Reinforcement Learning Loop

RL is a back and forth between an agent and an environment. The agent sends an action to the environment, and the environment replies with a new state and a reward.

Agent → action → Environment
Agent ← state, reward ← Environment
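A minimal sketch of this loop in Python, to make the back and forth concrete. The Corridor environment, the run_loop helper and the always-go-right agent are all made up for illustration, not taken from any library.

```python
class Corridor:
    """Hypothetical toy environment: cells 0..4, with the goal at cell 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor.
        self.state = min(4, max(0, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        # The environment replies with the new state and a reward.
        return self.state, reward


def run_loop(env, choose_action, num_steps):
    """Run the agent-environment loop and record (state, action, reward) per step."""
    trajectory = []
    state = env.state
    for _ in range(num_steps):
        action = choose_action(state)          # agent -> environment
        next_state, reward = env.step(action)  # environment -> agent
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


# An agent that always moves right reaches the goal on its fourth step.
trajectory = run_loop(Corridor(), choose_action=lambda s: +1, num_steps=4)
print(trajectory)
```

Each recorded tuple is one (state, action, reward) step of the interaction.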

Markov Decision Process (MDP)

The interaction is just a sequence of states, actions and rewards repeating over time:

S_0, a_0, r_0, S_1, a_1, r_1, ...

Here S_t is the state at step t, a_t is the action taken in that state, and r_t is the reward received for it.

The Markov Property

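  • The next state depends only on the current state and the action taken in it, not on the whole history.
  • The agent does not need to remember earlier states, because the current state already carries everything relevant for predicting what happens next.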
  • The reward is a single number per step, which can be too coarse a signal once the task gets complex.
  • A single-number reward may be enough for a robotic finger, but for an autonomous vehicle it is often hard to capture everything that matters in one number.
  • All of reinforcement learning is built on the foundation of the MDP.
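A small sketch of what the Markov property looks like in code, assuming a made-up two-state MDP. The TRANSITIONS table and the sample_next_state helper are hypothetical; the point is only that the function takes the current state and action and nothing about the history.

```python
import random

# Hypothetical MDP: TRANSITIONS[(state, action)] maps each next state to its probability.
TRANSITIONS = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"A": 0.1, "B": 0.9},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 0.9, "B": 0.1},
}


def sample_next_state(state, action, rng):
    """Sample the next state from the current state and action only (no history)."""
    dist = TRANSITIONS[(state, action)]
    next_states = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
state = "A"
visited = [state]
for action in ["move", "stay", "move"]:
    state = sample_next_state(state, action, rng)
    visited.append(state)
print(visited)
```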

The Goal of an MDP

Find a policy that gives the highest expected accumulated reward, so the agent collects as much reward as possible over time.

Policies

A policy is how the agent decides which action to take in a given state. There are two kinds.

Deterministic policy

π : S → A, so a = π(s)

Takes a state as input and returns exactly one action, so the same state always produces the same action. For example, a robot crossing a grid that hits an obstacle always turns the same way, say left.
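As a sketch, a deterministic policy over a small state set can be nothing more than a lookup table. The state and action names below are invented for the grid robot example.

```python
# Hypothetical states and actions for a robot on a grid.
POLICY = {
    "path_clear": "forward",
    "obstacle_ahead": "turn_left",
    "at_goal": "stop",
}


def pi(state):
    """Deterministic policy: one state in, exactly one action out."""
    return POLICY[state]


# The same state always yields the same action.
first = pi("obstacle_ahead")
second = pi("obstacle_ahead")
print(first, second)
```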

Stochastic policy

π(a | s) = probability of taking action a in state s

A policy with randomness. It gives a probability distribution over actions (say 70% left, 30% right) and the agent samples from it, so it tries more paths and can discover options that a fixed policy would never visit.
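A sketch of the 70% left, 30% right example, assuming a single made-up state. Sampling many times should give roughly the stated proportions. The names POLICY and sample_action are illustrative only.

```python
import random

# pi(a | s) for one hypothetical state: 70% left, 30% right.
POLICY = {"obstacle_ahead": {"turn_left": 0.7, "turn_right": 0.3}}


def sample_action(state, rng):
    """Stochastic policy: draw one action from the distribution pi(. | state)."""
    probs = POLICY[state]
    return rng.choices(list(probs.keys()), weights=list(probs.values()), k=1)[0]


rng = random.Random(42)
counts = {"turn_left": 0, "turn_right": 0}
for _ in range(10_000):
    counts[sample_action("obstacle_ahead", rng)] += 1

# Empirical fraction of left turns; it should land close to 0.7.
left_fraction = counts["turn_left"] / 10_000
print(left_fraction)
```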

Return and the Discount Factor

The return is the discounted sum of rewards from time t onward, written G_t:

G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...

γ (gamma) is the discount factor, from 0 to 1. It sets how much the agent cares about future rewards versus the reward right now.
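A short sketch of computing the return for a finite reward sequence. The reward values are arbitrary, and the backward loop is just one convenient way to evaluate the sum.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    g = 0.0
    # Work backwards, using the recursion G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
g0 = discounted_return(rewards, gamma=0.5)
print(g0)  # 1 + 0.5*0 + 0.25*2 = 1.5
```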

γ near 0

Myopic, short sighted

Future rewards count for very little compared with the immediate reward, so the agent barely looks ahead. At γ = 0 only the immediate reward matters.

γ near 1

Far sighted

Future rewards are weighted almost as much as immediate ones, so the agent plans for the long run and acts with those rewards in mind.
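To see the effect of γ, here is a sketch comparing a small reward now with a larger reward three steps later. Both reward sequences and the preferred helper are invented for the comparison.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Two hypothetical reward sequences.
small_now = [1.0, 0.0, 0.0, 0.0]     # reward 1 immediately
large_later = [0.0, 0.0, 0.0, 10.0]  # reward 10 after three steps


def preferred(gamma):
    """Name of the sequence with the higher discounted return."""
    now = discounted_return(small_now, gamma)
    later = discounted_return(large_later, gamma)
    return "small_now" if now > later else "large_later"


myopic_choice = preferred(0.1)       # large_later is worth only about 0.01 here
far_sighted_choice = preferred(0.9)  # large_later is worth about 7.29 here
print(myopic_choice, far_sighted_choice)
```

With γ = 0.1 the delayed reward is discounted to almost nothing, so the immediate reward wins. With γ = 0.9 the delayed reward keeps most of its value and wins instead.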

The Objective

Find the policy that maximises the expected return.
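For a tiny problem the objective can be checked by brute force: list every deterministic policy, compute its return, and keep the best. The sketch below assumes a made-up three-cell corridor with deterministic moves, so one rollout gives the exact return; real problems are far too large for this and need proper learning algorithms.

```python
from itertools import product

GOAL = 2        # hypothetical corridor with cells 0, 1, 2; the goal is cell 2
GAMMA = 0.9
MAX_STEPS = 10  # cap on rollout length so policies that never reach the goal still stop


def rollout_return(policy):
    """Discounted return of one deterministic rollout starting in cell 0."""
    state, g, discount = 0, 0.0, 1.0
    for _ in range(MAX_STEPS):
        if state == GOAL:
            break
        state = min(GOAL, max(0, state + policy[state]))
        reward = 1.0 if state == GOAL else 0.0
        g += discount * reward
        discount *= GAMMA
    return g


# Enumerate every deterministic policy over the non-goal cells 0 and 1.
policies = [{0: a0, 1: a1} for a0, a1 in product((-1, +1), repeat=2)]
best_policy = max(policies, key=rollout_return)
print(best_policy, rollout_return(best_policy))
```

Only the policy that moves right in both cells ever reaches the goal, so it is the one selected.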