Models and Planning
Source:: Reinforcement Learning: An Introduction (Sutton & Barto, 2018)
- Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning.
# Models and Planning
- By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.
- Given a state and an action, a model produces a prediction of the resultant next state and next reward.
- If the model is stochastic, then there are several possible next states and next rewards, each with some probability of occurring.
- Some models produce a description of all possibilities and their probabilities; these we call distribution models.
- Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models.
- Distribution models are stronger than sample models in that they can always be used to produce samples. However, in many applications it is much easier to obtain sample models than distribution models.
- Models can be used to mimic or simulate experience.
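As a rough illustration (a minimal sketch with a made-up two-state problem, not an example from the book): a distribution model returns every possible (next state, reward) pair together with its probability, while a sample model returns a single pair drawn according to those probabilities. The sample model below is built directly on top of the distribution model, since a distribution model can always be used to produce samples.

```python
import random

# Hypothetical two-state problem: from state 0, action "a" moves to state 1
# with reward +1 (prob 0.7) or stays in state 0 with reward 0 (prob 0.3).
# The states, actions, and probabilities are invented for illustration.

def distribution_model(state, action):
    """Distribution model: all possible (next_state, reward) pairs and their probabilities."""
    if state == 0 and action == "a":
        return [((1, 1.0), 0.7), ((0, 0.0), 0.3)]
    return [((state, 0.0), 1.0)]           # everything else: stay put, no reward

def sample_model(state, action):
    """Sample model: one (next_state, reward) pair, sampled from the distribution model."""
    outcomes = distribution_model(state, action)
    pairs, probs = zip(*outcomes)
    return random.choices(pairs, weights=probs, k=1)[0]

# Simulated experience: the model mimics what the environment might do.
next_state, reward = sample_model(0, "a")
```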
- We use the term planning to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment.
- In artificial intelligence, there are two distinct approaches to planning according to our definition.
- State-space planning is viewed primarily as a search through the state space for an optimal policy or an optimal path to a goal.
- A unifying view is that all state-space planning methods share a common structure:
- All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.
- They compute value functions by updates or backup operations applied to simulated experience.
- Plan-space planning is instead a search through the space of plans.
- Operators transform one plan into another, and value functions are defined over the space of plans.
- It includes evolutionary methods and “partial-order planning”.
- These methods are difficult to apply efficiently to the stochastic sequential decision problems that are the focus in reinforcement learning.
- The heart of both learning and planning methods is the estimation of value functions by backing-up update operations. The difference is that whereas planning uses simulated experience generated by a model, learning methods use real experience generated by the environment.
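That common structure can be made concrete with a small sketch in the spirit of the chapter's random-sample one-step tabular Q-planning: pick a state and action at random, ask a sample model for the resulting next state and reward, and apply a one-step Q-learning backup to that simulated transition. The `sample_model` signature, the finite `states`/`actions` lists, and the step-size and discount values are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, num_updates=10_000,
               alpha=0.1, gamma=0.95):
    """One-step tabular Q-planning from simulated experience (a sketch).

    Assumes sample_model(s, a) -> (next_state, reward), i.e. a sample model
    as described above, plus finite lists of states and actions.
    """
    Q = defaultdict(float)                 # Q[(state, action)], zero-initialized
    for _ in range(num_updates):
        s = random.choice(states)          # 1. select a state and action at random
        a = random.choice(actions)
        s_next, r = sample_model(s, a)     # 2. simulated experience from the model
        # 3. one-step Q-learning backup applied to the simulated transition
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q                               # the greedy policy w.r.t. Q is the planned policy
```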
# Dyna: Integrated Planning, Acting, and Learning
- Within a planning agent, there are at least two roles for real experience:
- it can be used to improve the model, called “model learning”
- it can be used to directly improve the value function and policy, called “direct reinforcement learning”
- Indirect (model-based) methods often make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. On the other hand, direct methods are much simpler and are not affected by biases in the design of the model.
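A minimal tabular Dyna-Q-style sketch of how these pieces fit together: each real transition drives a direct Q-learning update (direct RL) and is stored in a table-lookup model (model learning), and then `n_planning` extra updates are made from transitions replayed out of the model (planning). The gym-like `env.reset()`/`env.step()` interface and the hyperparameter values are assumptions for illustration, not the book's exact pseudocode.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q-style agent (sketch): acting + direct RL + model learning + planning.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)                 # action values, Q[(state, action)]
    model = {}                             # deterministic table model: (s, a) -> (r, s')

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # acting: epsilon-greedy action selection in the real environment
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # direct reinforcement learning: Q-learning update from real experience
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # model learning: remember what the environment just did
            model[(s, a)] = (r, s_next)
            # planning: n_planning updates from transitions replayed out of the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                p_target = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (p_target - Q[(ps, pa)])
            s = s_next
    return Q
```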