Convergence of Value-based RL
We often hear that value-based RL in the tabular setting always converges to a unique fixed point. Why is that?
We are going to prove this using the Banach fixed point theorem, by showing that the Bellman optimality operator is a contraction over the complete metric space of value functions equipped with the $L_\infty$ (infinity) norm. To get there, we first need the notions of a Cauchy sequence, a complete metric space, and a contraction.
Given a metric space $(\mathcal{X}, d)$, an infinite sequence $x_1,x_2,x_3,\dots$ is a Cauchy sequence if for every $\varepsilon\in\mathbb{R}^+$ there exists an $N\in\mathbb{Z}^+$ such that $d(x_a,x_b)<\varepsilon$ for all $a,b>N$. This formalises the idea of a sequence whose elements eventually become arbitrarily close together, i.e. one that settles down towards a limit. A metric space is complete if every Cauchy sequence of elements of $\mathcal{X}$ converges to a limit that itself lies in $\mathcal{X}$. Given a complete metric space $(\mathcal{X},d)$, a function $f:\mathcal{X}\rightarrow \mathcal{X}$ is a contraction if applying it to any two elements $x,x'\in\mathcal{X}$ moves them closer together by a factor of at least $\gamma\in[0,1)$, i.e.
\[d(f(x),f(x'))\leq\gamma d(x,x')\]The Banach fixed point theorem gives a proof of what is intuitively obvious: repeated application of a contraction induces a Cauchy sequence, and thus converges, from any starting point, to a unique fixed point $x^\ast\in\mathcal{X}$ satisfying $f(x^\ast)=x^\ast$.
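To make this concrete, here is a minimal numerical sketch (the map $f(x)=\gamma x+1$ with $\gamma=0.5$ and the starting point are made-up choices for illustration): the distance between successive iterates shrinks by a factor of $\gamma$ each step, so the iterates form a Cauchy sequence converging to the unique fixed point $x^\ast=2$.

```python
# Minimal sketch: iterate the contraction f(x) = gamma*x + 1 on (R, |.|).
# gamma = 0.5 < 1, so the Banach fixed point theorem guarantees a unique
# fixed point x* = 1 / (1 - gamma) = 2, reached from any starting point.
def f(x, gamma=0.5):
    return gamma * x + 1.0

x = 10.0  # arbitrary starting point
for k in range(20):
    x_next = f(x)
    print(f"step {k:2d}  x = {x_next:.6f}  |f(x) - x| = {abs(x_next - x):.2e}")
    x = x_next
```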
The operator we care about is the Bellman optimality operator $\mathbb{B}:\mathbb{R}^{|\mathcal{S}|}\rightarrow\mathbb{R}^{|\mathcal{S}|}$, which maps a value function to a new value function:
\[\mathbb{B}[V(s)]=\underset{a}{\max}\left(r(s,a)+\gamma\sum_{s'}P(s'\vert s,a)V(s')\right)\]$\mathbb{B}$ can be applied recursively to generate a sequence of value functions.
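As a concrete illustration, here is a minimal sketch of applying $\mathbb{B}$ repeatedly to a small made-up tabular MDP (the sizes `S = 4`, `A = 2`, the random transition tensor `P`, reward matrix `r`, and `gamma = 0.9` are all assumptions chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9                                    # made-up MDP sizes
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)   # P[s, a, s'] = P(s'|s,a)
r = rng.random((S, A))                                     # r(s, a)

def bellman_opt(V):
    # B[V](s) = max_a ( r(s, a) + gamma * sum_s' P(s'|s, a) V(s') )
    return np.max(r + gamma * P @ V, axis=1)

V = np.zeros(S)
for k in range(500):
    V_new = bellman_opt(V)
    if np.max(np.abs(V_new - V)) < 1e-10:  # distance measured in the infinity norm
        break
    V = V_new
print(f"converged after {k} applications of B; V* = {V_new}")
```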
Now let us use the infinity norm to define a metric space $(\mathbb{R}^{|\mathcal{S}|},\lVert\cdot\rVert_\infty)$ over value functions. If we can show that $\mathbb{B}$ is a contraction in $(\mathbb{R}^{|\mathcal{S}|},\lVert\cdot\rVert_\infty)$, then we can conclude that applying it repeatedly will eventually give a unique optimal value function $V^\ast$ and, in turn, an optimal policy $\pi^\ast$. First, let's write out the equation for the infinity norm after applying $\mathbb{B}$ to two value functions $V_1$, $V_2$:
\[\lVert\mathbb{B}[V_1(s)]-\mathbb{B}[V_2(s)]\rVert_\infty=\lVert\underset{a_1}{\max}(r(s,a_1)+\gamma\sum_{s'}P(s'\vert s,a_1)V_1(s'))-\underset{a_2}{\max}(r(s,a_2)+\gamma\sum_{s'}P(s'\vert s,a_2)V_2(s'))\rVert_\infty\]The key trick here is to recognise that the difference of two maxima is bounded by the maximum of the differences, $\lvert\max_a f(a)-\max_a g(a)\rvert\leq\max_a\lvert f(a)-g(a)\rvert$, so both terms can be bounded using a single shared maximising action $a_1$:
\[\leq\lVert\underset{a_1}{\max}\left(\cancel{r(s,a_1)}+\gamma\sum_{s'}P(s'\vert s,a_1)V_1(s')-\cancel{r(s,a_1)}-\gamma\sum_{s'}P(s'\vert s,a_1)V_2(s')\right)\rVert_\infty=\gamma\lVert \underset{a_1}{\max}\sum_{s'}P(s'\vert s,a_1)(V_1(s')-V_2(s'))\rVert_\infty\]Since the infinity norm returns the maximum value over the entire state space, this is equivalent to
\[=\gamma\underset{s,a_1}{\max}\sum_{s'}P(s'\vert s,a_1)(V_1(s')-V_2(s'))\leq\gamma\underset{s'}{\max}(V_1(s')-V_2(s'))\cancelto{1}{\sum_{s'}P(s'\vert s,a_1)}\leq\gamma\lVert V_1-V_2\rVert_\infty\]So therefore
\[\lVert\mathbb{B}[V_1(s)]-\mathbb{B}[V_2(s)]\rVert_\infty\leq\gamma\lVert V_1-V_2\rVert_\infty\]which means that $\mathbb{B}$ is a contraction in $(\mathbb{R}^{|\mathcal{S}|},\lVert\cdot\rVert_\infty)$, and the Banach fixed point theorem then guarantees convergence to the unique fixed point $V^\ast$.
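As a quick sanity check, here is a minimal numerical sketch of this inequality on another made-up random MDP (all sizes and random draws are illustrative assumptions): for random pairs of value functions, the $\infty$-norm distance after applying $\mathbb{B}$ never exceeds $\gamma$ times the distance before.

```python
import numpy as np

# Sketch: empirically check ||B[V1] - B[V2]||_inf <= gamma * ||V1 - V2||_inf
# on a small random tabular MDP (same made-up setup as the sketch above).
rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
r = rng.random((S, A))
bellman_opt = lambda V: np.max(r + gamma * P @ V, axis=1)

for _ in range(1000):
    V1, V2 = rng.normal(size=S), rng.normal(size=S)
    lhs = np.max(np.abs(bellman_opt(V1) - bellman_opt(V2)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    assert lhs <= rhs + 1e-12, (lhs, rhs)
print("contraction inequality held on all random value-function pairs")
```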
In contrast, fitted Q-iteration (FQI), which combines the Bellman backup with function approximation, gives no guarantee of convergence to the optimal solution.
- Proof sketch: the Bellman backup is a contraction under the $\infty$-norm, while the regression (projection) step is a contraction under the $\ell_2$-norm; their composition is not, in general, a contraction under either norm.
- $Q$-learning with function approximation is not true gradient descent (it is a semi-gradient method) because we don't take the gradient through the target value. Residual algorithms, which do take the gradient through the target, are slow and not numerically stable (sensitive to learning rates); see the sketch after this list.
- This also means that actor-critic methods are not guaranteed to converge when the critic is trained with bootstrapped targets and function approximation (pure policy gradient, by contrast, converges to a local maximum, just like supervised learning).
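To illustrate the semi-gradient point, here is a hypothetical sketch of a single update for a linear $Q$-function $Q(s,a)=w^\top\phi(s,a)$: the semi-gradient (standard $Q$-learning) update treats the bootstrapped target as a constant, whereas the residual-gradient update also differentiates through it. The feature map `phi`, the sampled transition, and all dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # made-up feature dimension
w = rng.normal(size=d)                  # linear Q-function: Q(s,a) = w . phi(s,a)
alpha, gamma = 0.1, 0.9

def phi(s, a):
    # hypothetical deterministic feature map, chosen only for illustration
    return np.random.default_rng(hash((s, a)) % (2**32)).normal(size=d)

# one made-up transition (s, a, r, s') and the greedy next action a'
s, a, r_t, s_next = 0, 1, 1.0, 2
a_next = max(range(3), key=lambda b: w @ phi(s_next, b))

target = r_t + gamma * w @ phi(s_next, a_next)   # bootstrapped target
delta = target - w @ phi(s, a)                   # TD error

# Semi-gradient (standard Q-learning): gradient taken only through Q(s, a);
# the target is treated as a constant.
w_semi = w + alpha * delta * phi(s, a)

# Residual gradient: differentiate the squared TD error through the target too.
w_resid = w + alpha * delta * (phi(s, a) - gamma * phi(s_next, a_next))

print("semi-gradient update    :", w_semi[:3])
print("residual-gradient update:", w_resid[:3])
```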