TD Prediction
Source:: 2018 Reinforcement Learning An Introduction
- Temporal-Difference Learning is a combination of Monte Carlo ideas and Dynamic Programming ideas.
- Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics.
- Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
- [?] What is the prediction problem?
- It is the problem of estimating the value function $v_\pi$ for a given policy $\pi$.
- Also known as policy evaluation.
- [?] What is the control problem?
- It is the problem of finding an optimal policy.
# TD Prediction
- Both TD and Monte Carlo methods use experience to solve the prediction problem.
- Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(S_t)$, TD methods need to wait only until the next time step.
- At time $t + 1$, they immediately form a target and make a useful update using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$.
- The simplest TD method makes the update $$V(S_t) \longleftarrow V(S_t) + \alpha\left[R_{t + 1} + \gamma V(S_{t+1}) - V(S_t) \right]$$ immediately on transition to $S_{t+1}$ (a code sketch of this update follows this list).
- This TD method is called TD(0), or one-step TD, because it is a special case of the TD$(\lambda)$ and $n$-step TD methods.
- Sample Updates involve looking ahead to a sample successor state (or state-action pair), using the value of the successor and the reward along the way to compute a backed-up value, and then updating the value of the original state (or state-action pair) accordingly.
- Sample updates differ from the expected updates of DP methods in that they are based on a single sample successor rather than on a complete distribution of all possible successors.
- The TD error is the difference between the better estimate $R_{t+1} + \gamma V(S_{t + 1})$ and the current estimate $V(S_t)$: $$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$
- The TD error at each time is the error in the estimate made at that time; because it depends on the next state and next reward, it is not actually available until one time step later.
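A minimal tabular TD(0) policy-evaluation sketch of the update above. The environment interface here (`env_step`, `start_state`, `policy`) is a hypothetical placeholder for illustration, not something defined in the book:

```python
from collections import defaultdict

def td0_prediction(env_step, start_state, policy,
                   num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (sketch, assuming a hypothetical
    interface: env_step(s, a) -> (reward, next_state, done),
    start_state() -> initial state, policy(s) -> action)."""
    V = defaultdict(float)  # V(s) initialized to 0 for every state
    for _ in range(num_episodes):
        s = start_state()
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)
            # TD target R + gamma * V(s'); V(terminal) is taken to be 0
            target = r + (0.0 if done else gamma * V[s_next])
            td_error = target - V[s]       # this is delta_t
            V[s] += alpha * td_error       # TD(0) update
            s = s_next
    return V
```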
# Advantages of TD Prediction Methods
- TD methods bootstrap, i.e., they learn a guess from a guess.
- TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.
- The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an online, fully incremental fashion.
- Some Monte Carlo methods must ignore or discount episodes on which experimental actions are taken, which can greatly slow learning. TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken.
- In practice, TD methods have usually been found to converge faster than constant-$\alpha$ MC methods on stochastic tasks.
# Optimality of TD(0)
# Sarsa: On-policy TD Control
- Unlike TD(0) for prediction, Sarsa learns an action-value function rather than a state-value function.
- We consider transitions from state-action pair to state-action pair, and learn the values of state-action pairs: $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)].$$
- If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero.
- This rule uses every element of the quintuple of events, $(S_t,A_t,R_{t+1},S_{t+1},A_{t+1})$, that make up a transition from one state–action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm.
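A minimal tabular Sarsa sketch with an $\varepsilon$-greedy policy, under the same hypothetical environment interface as the TD(0) sketch above; `actions` is assumed to be the list of available actions:

```python
import random
from collections import defaultdict

def sarsa(env_step, start_state, actions,
          num_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa (on-policy TD control) with epsilon-greedy behavior."""
    Q = defaultdict(float)  # Q(s, a) initialized to 0

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = start_state()
        a = eps_greedy(s)
        done = False
        while not done:
            r, s_next, done = env_step(s, a)
            a_next = eps_greedy(s_next)
            # Q(terminal, .) is defined as zero
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next  # the quintuple (S, A, R, S', A') drives the update
    return Q
```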
# Q-learning: Off-policy TD Control
- This was one of the early breakthroughs in reinforcement learning.
- In this case, the learned action-value function, $Q$, directly approximates $q_*$, the optimal action-value function, independent of the policy being followed: $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \underset{a}{\operatorname{max}}Q(S_{t+1}, a) - Q(S_t, A_t)].$$
- This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs.
- All that is required for correct convergence is that all state–action pairs continue to be updated.
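A minimal tabular Q-learning sketch (same hypothetical interface as the sketches above); the behavior policy is $\varepsilon$-greedy, while the update target uses the greedy maximum over next actions:

```python
import random
from collections import defaultdict

def q_learning(env_step, start_state, actions,
               num_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = start_state()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next, done = env_step(s, a)
            # Target uses max over next actions, independent of the policy followed
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```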
# Expected Sarsa
- It is just like Q-learning, but instead of the maximum over next state–action pairs it uses the expected value, taking into account how likely each action is under the current policy: $$\begin{aligned}
Q(S_t, A_t) & \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t)\right] \newline
& \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum\limits_{a} \pi(a \mid S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t)\right].
\end{aligned}$$
- Given the next state $S_{t+1}$, this algorithm moves deterministically in the same direction as Sarsa moves in expectation.
- Expected Sarsa is more complex computationally than Sarsa but, in return, it eliminates the variance due to the random selection of $A_{t+1}$.
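A minimal tabular Expected Sarsa sketch in which the policy being followed (and used in the expectation) is $\varepsilon$-greedy; the environment interface is the same hypothetical one as in the earlier sketches:

```python
import random
from collections import defaultdict

def expected_sarsa(env_step, start_state, actions,
                   num_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Expected Sarsa with an epsilon-greedy policy."""
    Q = defaultdict(float)

    def policy_probs(s):
        # pi(a | s) for an epsilon-greedy policy over the action set
        greedy = max(actions, key=lambda a: Q[(s, a)])
        probs = {a: epsilon / len(actions) for a in actions}
        probs[greedy] += 1.0 - epsilon
        return probs

    for _ in range(num_episodes):
        s = start_state()
        done = False
        while not done:
            probs = policy_probs(s)
            a = random.choices(actions, weights=[probs[x] for x in actions])[0]
            r, s_next, done = env_step(s, a)
            # Expectation over next actions under the current policy,
            # which removes the variance from sampling A_{t+1}
            expected_q = 0.0 if done else sum(
                p * Q[(s_next, x)] for x, p in policy_probs(s_next).items())
            Q[(s, a)] += alpha * (r + gamma * expected_q - Q[(s, a)])
            s = s_next
    return Q
```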