Model-free (reinforcement learning)

Source: Wikipedia, the free encyclopedia.

In reinforcement learning (RL), a model-free algorithm is an algorithm which does not estimate the transition probability distribution (or the reward function) associated with the Markov decision process (MDP). The transition probability distribution (or transition model) and the reward function are often collectively called the "model" of the environment, hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm. Typical examples of model-free algorithms include Monte Carlo (MC) RL, SARSA, and Q-learning.
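As a concrete illustration, the following is a minimal sketch of tabular Q-learning in Python; the environment interface (reset(), step(), env.actions) and all hyperparameter values are assumptions made for the example, not part of the original article. The point is that the action values are updated directly from sampled transitions, without ever estimating transition probabilities or the reward function.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) from sampled transitions only.

    `env` is assumed (for this sketch) to expose reset() -> state,
    step(action) -> (next_state, reward, done), and a list `env.actions`.
    No transition model or reward function is ever estimated.
    """
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy for exploration.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Q-learning target uses the greedy (off-policy) value of the next state.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            td_target = reward + gamma * best_next
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])

            state = next_state
    return Q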

Monte Carlo (MC) estimation is a central component of a large class of model-free algorithms. MC learning is a branch of generalized policy iteration, which alternates between two steps: policy evaluation (PEV) and policy improvement (PIM). In this framework, each policy is first evaluated by estimating its value function. Then, based on that evaluation, a greedy search is performed to produce an improved policy. MC estimation is applied mainly to the first step, policy evaluation: the value of the current policy is estimated by averaging the returns of all collected samples. As more experience is accumulated, the estimate converges to the true value by the law of large numbers. MC policy evaluation therefore requires no prior knowledge of the environment dynamics; all it needs is experience, i.e., samples of states, actions, and rewards generated by interacting with a real environment.[2]
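The following is a minimal sketch of first-visit MC policy evaluation (the PEV step), assuming the same illustrative environment interface as above and a given policy function; the names and hyperparameters are assumptions for the example.

from collections import defaultdict

def mc_policy_evaluation(env, policy, episodes=1000, gamma=0.99):
    """First-visit Monte Carlo policy evaluation.

    Estimates V(s) under `policy` by averaging the returns observed the
    first time each state is visited in an episode; by the law of large
    numbers the average converges to the true value as experience grows.
    env.reset() -> state and env.step(action) -> (next_state, reward, done)
    are assumed interfaces for this sketch.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(episodes):
        # Generate one complete episode by following the given policy.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk the episode backwards, accumulating the return G.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if s not in set(s2 for s2, _ in episode[:t]):  # first visit of s
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V

Note that V[s] can only be updated once the episode has terminated and the full return is known, which is the episode-by-episode limitation discussed next.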

The estimation of the value function is critical for model-free RL algorithms. Unlike Monte Carlo (MC) methods, temporal difference (TD) methods learn the value function by reusing existing value estimates. If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference learning. TD can learn from an incomplete sequence of events without waiting for the final outcome, approximating the future return as a function of the current state. Like MC, TD uses only experience to estimate the value function, without any prior knowledge of the environment dynamics. The advantage of TD is that it updates the value function based on its own current estimate (bootstrapping). Therefore, TD learning algorithms can learn from incomplete episodes or continuing tasks in a step-by-step manner, while MC must be implemented in an episode-by-episode fashion.[2]
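A minimal sketch of TD(0) policy evaluation follows, under the same assumed interface as the MC sketch above. The key difference is that the value estimate is updated after every single step, using the current estimate of the next state's value in place of the rest of the (possibly unfinished) return.

from collections import defaultdict

def td0_policy_evaluation(env, policy, episodes=1000, alpha=0.05, gamma=0.99):
    """TD(0) policy evaluation.

    Unlike Monte Carlo, V is updated after every transition by
    bootstrapping from the current estimate V[next_state], so learning
    does not need to wait for the end of an episode.
    """
    V = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped TD target: reward plus discounted current estimate.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V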

Model-free reinforcement learning algorithms

Model-free reinforcement learning algorithms can start from a blank policy candidate and achieve superhuman performance in many complex tasks, including Atari games, StarCraft and Go. Deep neural networks are responsible for recent artificial intelligence breakthroughs, and they can be combined with reinforcement learning to create powerful agents such as DeepMind's AlphaGo. Mainstream model-free RL algorithms include Deep Q-Network (DQN), Dueling DQN, Double DQN (DDQN), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Distributional Soft Actor-Critic (DSAC).[2] Representative model-free algorithms, especially those based on deep learning, are listed in the table below; a minimal sketch of the DQN update follows the table.

Algorithm | Description                                     | Model      | Policy     | Action space           | State space                       | Operator
DQN       | Deep Q-Network                                  | Model-free | Off-policy | Discrete               | Typically discrete or continuous  | Q-value
DDPG      | Deep Deterministic Policy Gradient              | Model-free | Off-policy | Continuous             | Discrete or continuous            | Q-value
A3C       | Asynchronous Advantage Actor-Critic             | Model-free | On-policy  | Continuous             | Discrete or continuous            | Advantage
TRPO      | Trust Region Policy Optimization                | Model-free | On-policy  | Continuous or discrete | Discrete or continuous            | Advantage
PPO       | Proximal Policy Optimization                    | Model-free | On-policy  | Continuous or discrete | Discrete or continuous            | Advantage
TD3       | Twin Delayed Deep Deterministic Policy Gradient | Model-free | Off-policy | Continuous             | Continuous                        | Q-value
SAC       | Soft Actor-Critic                               | Model-free | Off-policy | Continuous             | Discrete or continuous            | Advantage
DSAC[3]   | Distributional Soft Actor-Critic                | Model-free | Off-policy | Continuous             | Continuous                        | Value distribution
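As one example of how deep learning enters these algorithms, the following is a minimal sketch of the core DQN update written against PyTorch: sampling from a replay buffer, computing a TD target over Q-values with a frozen target network, and taking one gradient step. The network sizes, hyperparameters, and the ReplayBuffer helper are illustrative assumptions, not the exact configuration of the original DQN paper.

import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        # States are assumed to be stored as 1-D float tensors.
        return (torch.stack(states), torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

def make_q_network(state_dim, num_actions):
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, num_actions))

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One gradient step of the DQN loss on a sampled mini-batch."""
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    # Q(s, a) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target uses the frozen target network (off-policy, max over actions).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In a full training loop, q_net would also drive epsilon-greedy action selection while interacting with the environment, and target_net would be periodically synchronized with q_net (e.g., via load_state_dict).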

References