
Overview

Book

  • Reinforcement Learning: An Introduction, Sutton and Barto, Second Edition

Introduction to Reinforcement Learning

  • RL sits at the intersection of several fields. It is the science of making optimal decisions.

    • Machine Learning - CS
    • Optimal Control - Engineering
    • Reward System - Neuroscience
    • Classical/Operant Conditioning - Psychology
    • Operations Research - Mathematics
    • Bounded Rationality - Economics
  • Three branches of Machine Learning

    • Supervised Learning
    • Unsupervised Learning
    • Reinforcement Learning
  • Characteristics of RL that differ from supervised and unsupervised

    • A trial-and-error paradigm: there is no supervisor, only a reward signal
    • Feedback on a decision unfolds over time, so it is not instantaneous
    • Time really matters because of sequential decision-making processes. The systems are in general dynamic e.g., a robot moving in an environment
    • Agent's actions affect the subsequent data it receives
  • Examples of RL

    • Fly stunt maneuvers in a helicopter
    • Defeat the world champion at Backgammon
    • Manage an investment portfolio
    • Control a power station: a sequence of controls to minimize some cost
    • Make a humanoid robot walk, with rewards at every time step
    • Play many different Atari games better than humans

The Reinforcement Learning Problem

  • A reward $R_t$ is a scalar feedback signal
  • Indicates how well the agent is doing at time step $t$
  • The agent's job is to maximize the cumulative reward
  • RL is based on the reward hypothesis

Reward Hypothesis: All goals can be described by the maximization of the expected cumulative reward

Goal: As this is a sequential decision-making paradigm, the goal is to select actions to maximize total future reward.

  • Actions may have long-term consequences
  • Rewards may be delayed
  • It may be better to sacrifice immediate rewards to gain long-term rewards. So, one cannot be greedy.

Agent and Environment Formalism

  • Observation, $O_t$
  • Reward, $R_t$
  • Action, $A_t$

The goal is to build an RL algorithm for the agent's "brain": it sees an observation from the world, receives a reward signal from the world, and takes an action.

  • At each step $t$ the agent
    • Executes Action, $A_t$
    • Receives Observation, $O_t$
    • Receives Scalar Reward, $R_t$
  • The environment
    • Receives Action, $A_t$
    • Emits Observation, $O_t$
    • Emits Scalar Reward, $R_t$
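
A minimal Python sketch of this loop, assuming a hypothetical environment interface with `reset()` and `step(action)` methods (the class and method names are placeholders, not something defined in the lecture):

```python
import random

class RandomAgent:
    """Placeholder agent: ignores observations and picks actions uniformly at random."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, observation, reward):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """One pass of the agent-environment loop described above.

    `env` is assumed to expose reset() -> observation and
    step(action) -> (observation, reward, done).
    """
    observation, reward, total_reward = env.reset(), 0.0, 0.0
    for t in range(max_steps):
        action = agent.act(observation, reward)        # agent executes A_t
        observation, reward, done = env.step(action)   # environment emits O_t, R_t
        total_reward += reward                         # cumulative reward the agent tries to maximize
        if done:
            break
    return total_reward
```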

The experience of the agent is defined by the time series of observations, rewards, and actions. This experience is used to build our reinforcement learning algorithm. The experience is called the history:

\begin{gather*} H_t = A_1, O_1, R_1, \dots, A_t, O_t, R_t \end{gather*}
  • i.e., all observable variables up to time $t$

  • e.g., the sensorimotor stream of a robot or embodied agent

  • What happens next depends on the history:

    • The agent selects actions
    • The environment selects observations/rewards
  • State is the summary of information used to determine what happens next

  • Formally, state is a function of the history: $S_t = f(H_t)$

  • Three different definitions of state (the third, the information state, is defined in the next section)

    • The environment state, $S_t^e$, is the environment's private representation: the data the environment uses to pick the next observation/reward. This state is not usually visible to the agent, so our RL algorithm cannot depend on it directly.
    • The agent state, $S_t^a$, is the agent's internal representation. It captures what has happened to the agent so far and is used to pick the next action. This is the information we use to build RL algorithms. It can be any function of the history: $S_t^a = f(H_t)$

State Formal/Mathematical Definition

An information state, a.k.a. a Markov state, contains all the useful information from the history

  • Markov Property: A state $S_t$ is said to be Markov if and only if
\begin{gather*} P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t] \end{gather*}
  • If you have the Markov property, the future is independent of the past given the present
\begin{gather*} H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty} \end{gather*}
  • Once the current state is known, the history may be thrown away
  • i.e. The current state is a sufficient statistic for the future
  • The environment state $S_t^e$ is Markov by definition
  • The history $H_t$ is Markov by definition, albeit not a very useful one
Full Observability
  • An agent directly observes the environment state: $O_t = S_t^a = S_t^e$
  • Agent state = environment state = information state
  • This fully observable setting is formally known as a Markov Decision Process (MDP)
Partial Observability
  • An agent indirectly observes the environment
    • A robot with camera vision isn't told its absolute location
    • A trading agent only observes current prices
    • A poker-playing agent only observes public cards
  • Now we have an agent state that is not equal to the environment state
  • Formally this is a Partially Observable Markov Decision Process (POMDP)
  • An agent must construct its own state representation $S_t^a$, for example:
    • Remember the complete history: $S_t^a = H_t$
    • A probabilistic/Bayesian approach: keep beliefs about the environment state, i.e., maintain a probability distribution over where the agent thinks it is in the environment. This vector of probabilities is the agent's state representation: $S_t^a = (P[S_t^e = s^1], \dots, P[S_t^e = s^n])$
    • Without probabilities, use any real-valued vector, as a recurrent neural network does: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$, i.e., take the agent state at the previous time step, combine it linearly with the latest observation, and apply a nonlinearity (a small numerical sketch follows this list)
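
A rough numerical sketch of that recurrent update, with made-up dimensions, random weight matrices $W_s$, $W_o$, and a sigmoid nonlinearity (all of these choices are purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_agent_state(prev_state, observation, W_s, W_o):
    """Recurrent agent-state update: S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)."""
    return sigmoid(prev_state @ W_s + observation @ W_o)

# Toy example: 4-dimensional agent state, 3-dimensional observations.
rng = np.random.default_rng(0)
W_s = rng.normal(size=(4, 4))   # state-to-state weights (would normally be learned)
W_o = rng.normal(size=(3, 4))   # observation-to-state weights
state = np.zeros(4)
for obs in rng.normal(size=(5, 3)):   # a short stream of observations
    state = update_agent_state(state, obs, W_s, W_o)
```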

Major Components of an RL Agent

Policy

  • The policy is the agent's behavior function: the way it picks actions
  • It is a map from state to action; we learn it from experience, and the best policy maximizes the reward
  • Deterministic Policy: $a = \pi(s)$
  • Stochastic Policy: $\pi(a \mid s) = P[A = a \mid S = s]$, the probability of taking an action conditioned on a state; this helps with random exploratory decisions across the state space (a small sketch contrasting the two follows)
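
A small sketch contrasting the two policy types on a toy discrete state/action space (the state names, actions, and probabilities below are invented for illustration):

```python
import random

# Deterministic policy: a = pi(s), here just a lookup table.
deterministic_pi = {"s1": "left", "s2": "right"}

def act_deterministic(state):
    return deterministic_pi[state]

# Stochastic policy: pi(a|s) = P[A = a | S = s], a distribution over actions per state.
stochastic_pi = {
    "s1": {"left": 0.9, "right": 0.1},
    "s2": {"left": 0.2, "right": 0.8},
}

def act_stochastic(state):
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s1"), act_stochastic("s1"))
```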

Value Function

  • How good is each state and/or action? The value function estimates how much reward we expect to get, i.e., how well we do from a particular state
  • Prediction of future reward
  • Used to evaluate the goodness/badness of the states
  • The value function depends on the policy $\pi$ the agent follows
  • It can therefore be used to select between actions, e.g.
  • $V_\pi(s) = E_\pi[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid S_t = s]$
  • $\gamma$ is a discount factor over future rewards; it also controls the time horizon over which future rewards matter (a short numeric example follows)
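
For concreteness, here is the discounted return inside the expectation above computed for one sampled reward sequence (the reward values and $\gamma$ are arbitrary examples):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * R_{t+k} over a sampled reward sequence starting at time t."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: reward 1 now, 0 next step, 10 two steps later.
print(discounted_return([1.0, 0.0, 10.0], gamma=0.9))  # 1 + 0 + 0.81 * 10 = 9.1
```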

Model

  • The model is the agent's view/representation of the environment, i.e., what the agent learns about how the environment works
  • It has two parts:
    • Transition Model: $\mathrm{P}$ predicts the next state, i.e., the dynamics of the environment
    • Reward Model: $\mathrm{R}$ predicts the next (immediate) reward, e.g.
      • $\mathrm{P}_{ss'}^a = P[S' = s' \mid S = s, A = a]$
      • $\mathrm{R}_s^a = E[R \mid S = s, A = a]$
  • A model of the environment is often optional when solving RL problems; a simple counting-based sketch of learning one follows
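
One simple way to learn such a tabular model, sketched here as an assumption rather than anything prescribed in the lecture, is to count transitions and average rewards from observed experience:

```python
from collections import defaultdict

class TabularModel:
    """Empirical estimates of P_ss'^a and R_s^a from observed (s, a, r, s') transitions."""
    def __init__(self):
        self.transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sums = defaultdict(float)                           # (s, a) -> summed reward
        self.visits = defaultdict(int)                                  # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        self.transition_counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_prob(self, s, a, s_next):
        n = self.visits[(s, a)]
        return self.transition_counts[(s, a)][s_next] / n if n else 0.0

    def expected_reward(self, s, a):
        n = self.visits[(s, a)]
        return self.reward_sums[(s, a)] / n if n else 0.0
```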

Categorizing RL Agents

Based on the three components above

Value-Based

  • The value function is all it needs
  • The policy is implicit

Policy Based

  • The policy is represented explicitly and used to solve the problem, without explicitly storing a value function
  • Hence no value function

Actor-Critic

  • This stores both policy and value function

Model Free

  • We do not build an explicit model of the environment
  • Use policy and/or value function

Model-Based

  • We first build the model of the environment
  • Use policy and/or value function

Problems within RL

  • Learning and Planning: there are two fundamentally different problem setups in sequential decision-making; going back to the science of optimal decision-making, these are two very different settings we care about

Reinforcement Learning

  • The environment is initially unknown
  • The agent interacts with the environment and figures out the best way to behave to maximize the goal
  • Thus improving its policy

Planning

  • A model of the environment is fully known
  • The agent performs computations with its model (without any external interaction)
  • The agent improves its policy
  • a.k.a. deliberation, reasoning, introspection, pondering, thought, search
  • Plan ahead to find an optimal policy using planning methods such as tree search (a small value-iteration sketch follows)
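
As a sketch of planning with a fully known model, here is value iteration on a tiny invented MDP; the states, actions, transition probabilities, rewards, and discount factor are all made up for illustration (the lecture itself only names planning approaches such as tree search):

```python
# Hand-specified model: model[(s, a)] = list of (probability, next_state, reward).
model = {
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(0.9, "s2", 1.0), (0.1, "s1", 0.0)],
    ("s2", "stay"): [(1.0, "s2", 2.0)],
    ("s2", "go"):   [(1.0, "s1", 0.0)],
}
states, actions, gamma = ["s1", "s2"], ["stay", "go"], 0.9

# Value iteration: improve the value estimates using only the model, no interaction.
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])
            for a in actions
        )
        for s in states
    }

# Extract the greedy policy implied by the planned values.
policy = {
    s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)]))
    for s in states
}
print(policy)
```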

Exploration and Exploitation

  • Reinforcement learning is like trial-and-error learning
  • The agent should discover a good policy
  • From its experience of the environment without losing too much reward along the way
  • Exploration finds more information about the environment
  • Exploitation exploits known information to maximize reward
  • The exploration-exploitation trade-off is unique to reinforcement learning; it does not arise elsewhere in machine learning (a minimal ε-greedy sketch follows)
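
A common way to balance the two, sketched here as an assumption rather than something specified in the lecture, is ε-greedy action selection:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit).

    `q_values` maps action -> estimated value.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))   # exploration
    return max(q_values, key=q_values.get)     # exploitation

# Example with made-up value estimates.
print(epsilon_greedy({"left": 0.2, "right": 0.5, "stay": 0.1}))
```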

Prediction and Control

  • Prediction: Evaluate the future - given a policy
  • Control: Optimize the future - find the best policy
  • Typically, the prediction problem must be solved first in order to solve the control problem