In on-policy methods, we use the same policy for value estimation as well as for control.
In off-policy methods, we use one policy for value estimation and another for control, i.e. one policy is used to generate behavior, called the behavior policy, while another is used for policy evaluation and improvement, called the target policy.
One benefit of this separation is that the target policy may be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions (exploration).
Off-policy methods follow the behavior policy while learning about and improving the target policy.
To explore all possibilities, we require that the behavior policy be soft (i.e., it selects all actions in all states with nonzero probability).
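For illustration, a minimal sketch (assuming a small tabular action-value array Q, which is not part of these notes): an epsilon-greedy behavior policy keeps every action's probability nonzero (soft), while the greedy target policy is deterministic.

```python
# Sketch only: epsilon-soft behavior policy vs. greedy target policy,
# both derived from the same (hypothetical) action-value table Q.
import numpy as np

def behavior_policy_probs(Q, state, epsilon=0.1):
    """Epsilon-greedy: every action keeps probability >= epsilon / |A| (soft)."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[state])] += 1.0 - epsilon
    return probs

def target_policy_action(Q, state):
    """Greedy (deterministic) target policy used for evaluation/improvement."""
    return int(np.argmax(Q[state]))

Q = np.random.rand(4, 3)                  # 4 states, 3 actions (made-up numbers)
state = 2
print(behavior_policy_probs(Q, state))    # soft: all entries nonzero
print(target_policy_action(Q, state))     # deterministic greedy choice
```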
Monte Carlo methods learn value functions and optimal policies from experience in the form of sample episodes.
[u] Advantages of Monte Carlo methods over Dynamic Programming methods:
They can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics.
They can be used with simulation or sample models.
It is easy and efficient to focus Monte Carlo methods on a small subset of the states.
They may be less harmed by violations of the Markov property.
Monte Carlo methods intermix policy evaluation and policy improvement steps, and can be implemented incrementally, on an episode-by-episode basis.
Maintaining sufficient exploration is an issue in Monte Carlo control methods.
One approach is to sidestep this problem by assuming that episodes begin with randomly selected state-action pairs. This is known as exploring starts (see the sketch below).
The problem is that, although this approach can sometimes be used in applications with simulated episodes, it is unlikely to be usable when learning from real experience.
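For concreteness, a minimal sketch of Monte Carlo control with exploring starts on a made-up two-action corridor task (the environment, episode cap, and constants are illustrative, not from these notes): every episode starts from a random state-action pair, and evaluation and greedy improvement alternate episode by episode.

```python
# Sketch only: first-visit Monte Carlo control with exploring starts
# on a toy corridor (states 0..3, action 0 = left, action 1 = right,
# reaching state 3 ends the episode with reward +1).
import random
from collections import defaultdict

N_STATES, N_ACTIONS, TERMINAL = 4, 2, 3

def step(state, action):
    """Deterministic corridor dynamics for this toy task."""
    nxt = max(0, state - 1) if action == 0 else min(TERMINAL, state + 1)
    return nxt, (1.0 if nxt == TERMINAL else 0.0), nxt == TERMINAL

Q = defaultdict(float)
returns = defaultdict(list)
policy = {s: 0 for s in range(N_STATES)}      # arbitrary deterministic initial policy

random.seed(0)
for _ in range(2000):
    # Exploring start: every episode begins with a random state-action pair.
    state = random.randrange(TERMINAL)
    action = random.randrange(N_ACTIONS)
    episode = []
    for _ in range(50):                       # cap length in case the greedy policy loops
        nxt, reward, done = step(state, action)
        episode.append((state, action, reward))
        if done:
            break
        state, action = nxt, policy[nxt]

    # First-visit evaluation, then greedy improvement, episode by episode.
    G, first_visit_G = 0.0, {}
    for (s, a, r) in reversed(episode):
        G = r + G                             # gamma = 1 in this toy task
        first_visit_G[(s, a)] = G             # overwriting keeps the earliest visit's return
    for (s, a), g in first_visit_G.items():
        returns[(s, a)].append(g)
        Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        policy[s] = max(range(N_ACTIONS), key=lambda act: Q[(s, act)])

print(policy)   # should settle on action 1 (move right) in every non-terminal state
```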
In on-policy methods, the agent commits to always exploring and tries to find the best policy that still explores.
In off-policy methods, the agent also explores, but learns a deterministic optimal policy that may be unrelated to the policy followed.
Off-policy prediction refers to learning the value function of a target policy from data generated by a different behavior policy. Such learning methods are based on some form of importance sampling.
Ordinary importance sampling uses a simple average of the weighted returns, whereas weighted importance sampling uses a weighted average.
Ordinary importance sampling produces unbiased estimates but can have very large, possibly infinite, variance, whereas weighted importance sampling is biased (the bias vanishes asymptotically), has much lower variance, and is preferred in practice.
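As a rough illustration (a single-state, single-step setting with made-up policies and rewards, not from these notes), the two estimators differ only in how they normalize the importance-weighted returns: ordinary importance sampling divides by the number of episodes, weighted importance sampling divides by the sum of the ratios.

```python
# Sketch only: ordinary vs. weighted importance sampling for one state.
# Episodes are generated under behavior policy b; we estimate the value
# of the same state under target policy pi.
import random

pi = [1.0, 0.0]           # target policy: deterministic, always action 0
b  = [0.5, 0.5]           # behavior policy: soft, both actions possible
mean_reward = [1.0, 0.0]  # expected reward per action (so v_pi = 1.0)

num, den, ordinary_sum, n = 0.0, 0.0, 0.0, 0
random.seed(0)
for _ in range(10000):
    a = 0 if random.random() < b[0] else 1
    G = mean_reward[a] + random.gauss(0.0, 0.1)   # noisy sampled return
    rho = pi[a] / b[a]                            # importance-sampling ratio
    ordinary_sum += rho * G
    num += rho * G
    den += rho
    n += 1

print("ordinary IS estimate:", ordinary_sum / n)        # unbiased, higher variance
print("weighted IS estimate:", num / max(den, 1e-12))   # biased, lower variance
```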
[?] How do Monte Carlo methods differ from Dynamic Programming methods?
They operate on sample experience.
Thus they can be used for direct learning without a model.
They do not bootstrap.
That is, they do not update their value estimates on the basis of other value estimates.
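For reference, the standard textbook forms make the contrast concrete: the Monte Carlo update targets the complete sampled return, whereas the DP update is built from the current estimates of successor states (bootstrapping):

Monte Carlo (no bootstrapping): $V(S_t) \leftarrow V(S_t) + \alpha \big[ G_t - V(S_t) \big]$, where $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is the full return observed from time $t$ to the end of the episode.

Dynamic Programming (bootstrapping): $v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_k(s') \big]$, which uses the current estimates $v_k(s')$ of successor states and requires the model $p(s', r \mid s, a)$.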