# Weekly Readings #16

**Published:**

Imitation by coaching; GAIL; human-centric vs robot-centric; DeepMimic.

## 📝 Papers

### He, He, Jason Eisner, and Hal Daume. “Imitation Learning by Coaching.” In *Advances in Neural Information Processing Systems 25*, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 3149–3157. Curran Associates, Inc., 2012.

This extension to the *DAgger* method for online imitation learning is designed to help in situations where a ground-truth environmental loss function is available, but the optimal policy is *far from what is achievable in the student’s policy space*. The idea is to let the learner target easier goals first, and gradually move towards optimality. An intermediate *coach* policy $\tilde{\pi}_i$ is introduced, which at training iteration $i$ selects actions as follows:

$$\tilde{\pi}_{i}(s)=\underset{a\in A}{\arg\max}\ \lambda_{i}\cdot\text{score}_{\pi_{i}}(s,a)-L(s,a)$$

where $\text{score}_{\pi_i}(s,a)$ is a measure of the likelihood of the student policy $\pi_i$ taking $a$ given $s$, $L(s,a)$ is the environmental loss value, and $\lambda_i\geq0$ controls how strongly the coach defers to the student’s current preferences (with $\lambda_i=0$ recovering the loss-minimising action). As in DAgger, training is done iteratively on a dataset generated by the student’s policy itself, but now it is labelled by the coach $\tilde{\pi}_{i}$. Coaching is shown theoretically to have a lower regret bound than DAgger, and this advantage is borne out empirically.
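The coach’s action rule can be illustrated with a minimal sketch, assuming the coach picks the action that maximises a weighted student score minus the environmental loss. The function name `coach_action`, the weight `lam` and the toy numbers are all hypothetical and only serve to show how the target moves from the student’s favourite action towards the loss-minimising one as the weight shrinks.

```python
import numpy as np

def coach_action(scores, losses, lam):
    """Pick the coach's action for a single state.

    scores: student's score for each action (higher = more likely to be taken).
    losses: environmental loss L(s, a) for each action.
    lam:    weight on the student's preference (hypothetical name).
    """
    scores = np.asarray(scores, dtype=float)
    losses = np.asarray(losses, dtype=float)
    # Trade off what the student already prefers against the true loss.
    return int(np.argmax(lam * scores - losses))

# Toy state with three actions (made-up numbers).
scores = [3.0, 1.0, 0.0]   # the student strongly prefers action 0
losses = [2.0, 0.3, 0.0]   # action 2 is optimal under the true loss

early_target = coach_action(scores, losses, lam=1.0)   # close to the student
middle_target = coach_action(scores, losses, lam=0.5)  # an easier intermediate goal
final_target = coach_action(scores, losses, lam=0.0)   # the loss-minimising action
```

With a large weight the coach simply endorses the student’s preferred action; as the weight decays, the label passes through a compromise action before reaching the one the environmental loss favours.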

### Ho, Jonathan, and Stefano Ermon. “Generative Adversarial Imitation Learning.” In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4565–4573. Curran Associates, Inc., 2016.

The two main approaches to imitation learning both have their disadvantages. Behavioural cloning (BC) requires a large amount of data to succeed due to compounding error caused by covariate shift. Inverse reinforcement learning (IRL) avoids this issue but requires reinforcement learning be run as an inner loop, making it extremely expensive. IRL is concerned with learning a cost function that we never use directly, and it feels like we are wasting effort; is there a more efficient formulation of the problem?

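In maximum causal entropy IRL, we seek to find a cost function $c:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ such that *the expert policy $\pi_E$ performs better than all other policies*, with the cost regularised by a convex function $\psi:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\rightarrow\bar{\mathbb{R}}$ to prevent overfitting ($\bar{\mathbb{R}}=\mathbb{R}\cup\{\infty\}$):

$$\text{IRL}_{\psi}\left(\pi_{E}\right)=\underset{c\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}}{\arg\max}\ -\psi(c)+\left(\min_{\pi\in\Pi}-H(\pi)+\mathbb{E}_{\pi}[c(s,a)]\right)-\mathbb{E}_{\pi_{E}}[c(s,a)]$$

where $H(\pi)$ is the causal entropy of the policy $\pi$ and $\Pi$ is the set of candidate policies. Let $\bar{c}$ be the outcome of this optimisation and $\text{RL}(\bar{c})=\text{RL}\circ\text{IRL}_{\psi}\left(\pi_{E}\right)$ be the policy learned with that cost function via reinforcement learning. In Appendix A.1, the authors prove that this policy can be equivalently written as

$$\text{RL}\circ\text{IRL}_{\psi}\left(\pi_{E}\right)=\underset{\pi\in\Pi}{\arg\min}\ -H(\pi)+\psi^{\ast}\left(\rho_{\pi}-\rho_{\pi_{E}}\right)$$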

where $\psi^\ast$ is called the *convex conjugate* of $\psi$ and $\rho_{\pi}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the *occupancy measure* of a policy $\pi$, defined as $\rho_{\pi}(s, a)=\pi(a \vert s) \sum_{t=0}^{\infty} \gamma^{t} P\left(s_{t}=s \vert \pi\right)$. The occupancy measure can be interpreted as the unnormalised distribution of state-action pairs that an agent encounters in the environment while executing its policy.
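To make the occupancy measure concrete, here is a small sketch that computes it exactly for a tabular MDP. The function name `occupancy_measure`, the array layout and the two-state toy MDP are assumptions made for illustration; the discounted state visitation is obtained by solving the linear system $d=\mu_0+\gamma P_\pi^\top d$ rather than by summing the infinite series.

```python
import numpy as np

def occupancy_measure(P, pi, mu0, gamma):
    """Discounted occupancy measure rho[s, a] for a tabular MDP.

    P:   transitions, shape (S, A, S), with P[s, a, t] = Pr(t | s, a).
    pi:  policy, shape (S, A), with pi[s, a] = pi(a | s).
    mu0: initial state distribution, shape (S,).
    """
    n_states = P.shape[0]
    # State-to-state transition matrix under pi: P_pi[s, t].
    P_pi = np.einsum("sa,sat->st", pi, P)
    # d[s] = sum_t gamma^t Pr(s_t = s | pi) solves d = mu0 + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu0)
    # rho(s, a) = pi(a | s) * d(s).
    return pi * d[:, None]

# Made-up MDP with two states and two actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])
gamma = 0.9

rho = occupancy_measure(P, pi, mu0, gamma)
total_mass = rho.sum()  # equals 1 / (1 - gamma), hence "unnormalised"
```

The total mass of $1/(1-\gamma)$ is what makes the measure unnormalised; multiplying by $(1-\gamma)$ turns it into a proper distribution over state-action pairs.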

It can be shown that each valid occupancy measure $\rho$ corresponds to a unique policy $\pi_\rho$. Hence what we are doing in the rewritten formulation is searching for the policy whose occupancy measure is closest to the expert’s, with closeness measured by $\psi^\ast$, the convex conjugate of the regulariser. Depending on the choice of $\psi$, we get different variants of the imitation learning problem.

Here a new regulariser, denoted $\psi_{\text{GA}}$, is proposed. This aims to combine the best of several previous approaches: close matching of $\rho_E$ while retaining tractability in large environments.

Intuitively, $\psi_{\text{GA}}$ places a low penalty on cost functions that assign negative cost to the expert’s state-action pairs, but a heavy penalty on those that assign the expert costs close to zero (which is the upper bound value).
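For reference, the regulariser is defined in the paper along the following lines (transcribed here rather than derived, so worth checking against the original; $g$ is an auxiliary function used only in this definition):

$$\psi_{\text{GA}}(c)=\begin{cases}\mathbb{E}_{\pi_E}\left[g(c(s,a))\right] & \text{if } c<0 \\ +\infty & \text{otherwise}\end{cases}\qquad g(x)=\begin{cases}-x-\log\left(1-e^{x}\right) & \text{if } x<0 \\ +\infty & \text{otherwise}\end{cases}$$

Since $g(x)\rightarrow\infty$ as $x\rightarrow0^-$, costs near zero on expert state-action pairs are penalised heavily. Because the expectation is taken under $\pi_E$, evaluating the regulariser needs only the expert’s data.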

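With this new regulariser, and introducing a parameter $\lambda\geq0$ to modulate the causal entropy term, we obtain a new imitation learning algorithm that happens to have another nice equivalence (up to a constant shift and scaling of $D_{\text{JS}}$):

$$\min_{\pi}\ \psi_{\text{GA}}^{\ast}\left(\rho_{\pi}-\rho_{\pi_{E}}\right)-\lambda H(\pi)=\min_{\pi}\ D_{\text{JS}}\left(\rho_{\pi},\rho_{\pi_{E}}\right)-\lambda H(\pi)$$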

After all this, our aim is to **find a policy whose occupancy measure minimises Jensen-Shannon divergence $D_{\text{JS}}$ to the expert’s**. The authors propose to do this using **generative adversarial networks**: a discriminator network parameterised by $w$ to estimate $D_{\text{JS}}$, and a policy network parameterised by $\theta$ to implement $\pi$. A standard adversarial training procedure, in which $w$ and $\theta$ are iteratively updated by Adam and policy gradient updates respectively, is all that we need.
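The link between the discriminator and the Jensen-Shannon divergence can be checked numerically in a toy setting. The sketch below treats two normalised occupancy measures as ordinary discrete distributions over three state-action pairs (made-up numbers) and uses a tabular discriminator in place of a network; the function names are hypothetical. It relies on the standard GAN result that the best discriminator is $D^\ast=p_\pi/(p_\pi+p_E)$, at which the objective equals $2D_{\text{JS}}-\log4$.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def discriminator_objective(D, p_policy, p_expert):
    """E_policy[log D] + E_expert[log(1 - D)] for a tabular discriminator D."""
    return float(np.sum(p_policy * np.log(D)) + np.sum(p_expert * np.log(1.0 - D)))

# Normalised occupancy measures over three state-action pairs (made up).
p_policy = np.array([0.5, 0.3, 0.2])
p_expert = np.array([0.2, 0.3, 0.5])

# Optimal discriminator: probability that a pair came from the policy.
D_star = p_policy / (p_policy + p_expert)
best_value = discriminator_objective(D_star, p_policy, p_expert)
js = js_divergence(p_policy, p_expert)

# Any other discriminator scores lower, so it underestimates the divergence.
worse_value = discriminator_objective(D_star + 0.1, p_policy, p_expert)
```

In GAIL proper the discriminator is a neural network trained on samples, so it only approximates this optimum; the policy update then pushes the policy towards regions the discriminator cannot tell apart from the expert’s.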

On a selection of baseline RL tasks of varying complexity, this *generative adversarial imitation learning* (GAIL) approach demonstrates better performance than BC and a couple of other baselines given a limited expert dataset. However, while GAIL is efficient in terms of expert interaction, it does require a lot of environmental interaction during training. Ultimately, the authors expect that a method combining well-chosen environmental models with expert interaction will present the best solution to the imitation learning problem.

### Laskey, Michael, Caleb Chuck, Jonathan Lee, Jeffrey Mahler, Sanjay Krishnan, Kevin Jamieson, Anca Dragan, and Ken Goldberg. “Comparing Human-Centric and Robot-Centric Sampling for Robot Deep Learning from Demonstrations.” *ArXiv:1610.00850 [Cs]*, March 28, 2017.

Several past works (most notably by Ross and Bagnell) have called for student-led imitation learning rather than teacher-led, because it produces better statistical guarantees with respect to the underlying environmental loss function. Here a counterargument is made, with the following main points:

- Humans find it difficult to accurately relabel poor trajectories as it requires high-cognitive-load counterfactual reasoning. [For imitation of other autonomous agents, this isn’t an issue.]
- The advantage of student-led sampling empirically seems to disappear if the student model is sufficiently expressive (e.g. a depth-100 decision tree) and gets enough training data, because it reduces the error enough for the quadratic explosion not to be a problem. [If we care about interpretability and data efficiency, this isn’t good enough!]
- Student-led sampling can fail to converge in cases where teacher-led does. [However, the example given shows teacher-led converging to a policy that only works in part of the state space, while student-led tries to learn a more complex whole-space policy for which it is insufficiently expressive. I’m not actually sure it’s clear-cut which is more desirable.]
- For environments where a critical bifurcation appears early and the student initially learns to take the ‘wrong’ path, even infinite on-policy training to minimise loss vs the teacher may fail to flip it back out onto the better path. [In my view this is probably the strongest point.]

### Peng, Xue Bin, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. “DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills.” *ACM Transactions on Graphics* 37, no. 4 (July 30, 2018): 1–14.

Given enough data, reinforcement learning (RL) is very effective at learning behaviour to satisfy its reward function, but in contexts such as character motion, where the ‘style’ of behaviour is important, results often look awkward and even humorous. *DeepMimic* is a kind of blend of RL and imitation learning, which uses reference clips to provide an additional reward term to combine with the task-specific one. This serves to incentivise more natural-looking policies.

In the character motion context explored, the state $s_t$ describes the configuration of the character’s body (positions and joint angles), and the action $a_t$ is drawn from a multidimensional Gaussian (fixed, spherical $\Sigma$), whose dimensions specify target orientations for PD controllers at each joint. Both are very high-dimensional: in the hundreds for $s_t$ and dozens for $a_t$, depending on the precise character. The reward function used is

$$r_t=\omega^I r_t^I+\omega^G r_t^G$$

where $r_t^G$ is the task-specific reward, $r_t^I$ measures the similarity to the corresponding state in a reference trajectory $\hat{q}_t$, and $\omega^I$ and $\omega^G$ are their respective weights. $r_t^I$ has a domain-specific composition, combining terms that quantify the similarity of joint angles, rotational velocities, end effector positions, and the character’s centre-of-mass.
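The structure of this reward can be sketched as follows. The sketch assumes each similarity term takes the form of an exponentiated negative squared error, and the weights and scales are illustrative placeholders rather than values verified against the paper; the function and dictionary names are likewise hypothetical.

```python
import math

# Illustrative placeholder weights and scales, NOT verified against the paper.
IMITATION_WEIGHTS = {"pose": 0.65, "velocity": 0.1, "end_effector": 0.15, "com": 0.1}
ERROR_SCALES = {"pose": 2.0, "velocity": 0.1, "end_effector": 40.0, "com": 10.0}

def imitation_reward(sq_errors):
    """r^I: weighted sum of exp(-scale * squared error), one term per feature group."""
    return sum(
        weight * math.exp(-ERROR_SCALES[name] * sq_errors[name])
        for name, weight in IMITATION_WEIGHTS.items()
    )

def total_reward(sq_errors, task_reward, w_imitate=0.7, w_task=0.3):
    """r_t = w_imitate * r^I + w_task * r^G (both weights are placeholders)."""
    return w_imitate * imitation_reward(sq_errors) + w_task * task_reward

# Squared tracking errors against the reference frame (made-up numbers).
perfect = {name: 0.0 for name in IMITATION_WEIGHTS}
sloppy = {name: 0.5 for name in IMITATION_WEIGHTS}

r_perfect = total_reward(perfect, task_reward=1.0)
r_sloppy = total_reward(sloppy, task_reward=1.0)
```

Bounding each term in $(0, 1]$ keeps the imitation reward on the same scale as a normalised task reward, so the two weights directly express how much style matters relative to task success.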

The task is solved using policy gradient RL, specifically proximal policy optimisation, a simpler relative of trust-region policy optimisation which limits the change between successive policies to enforce conservative updates and improve stability. A couple of training tricks prove important:

- Randomly sampling the initial state to be somewhere along the reference trajectory $\hat{q}$, so that the agent gets the chance to learn all parts equally well (otherwise failure early on would mean it never gets to see high-reward states later in the trajectory).
- Terminating a training episode whenever certain body elements touch the ground, to prevent training on a large amount of futile post-failure states.
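Both tricks can be sketched on a toy stand-in for the simulator. Everything here is hypothetical: `step_fn` replaces the physics engine and policy, returning a reward and a flag for whether the character fell at a given reference frame. The point is only to show where the random start and the early exit sit in an episode loop.

```python
import random

def sample_initial_frame(clip_length, rng):
    """Start the episode at a random frame of the reference clip."""
    return rng.randrange(clip_length)

def rollout(step_fn, clip_length, max_steps, rng):
    """Run one episode; step_fn(frame) -> (reward, fell) is a toy placeholder."""
    frame = sample_initial_frame(clip_length, rng)
    start, total = frame, 0.0
    for _ in range(max_steps):
        reward, fell = step_fn(frame)
        if fell:
            # Early termination: do not collect futile post-failure states.
            break
        total += reward
        frame = (frame + 1) % clip_length
    return start, total

def clumsy_agent(frame):
    """Toy agent that always falls at frame 5 and earns 1 per frame otherwise."""
    return 1.0, frame == 5

rng = random.Random(0)
episodes = [rollout(clumsy_agent, clip_length=10, max_steps=50, rng=rng)
            for _ in range(200)]
starts_after_failure_point = [start for start, _ in episodes if start > 5]
```

An agent that always started at frame 0 would never experience frames 6 to 9 while it keeps falling at frame 5; with random starts those later frames still contribute reward signal.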

In addition to successfully training individual motion skills (walking, running, backflips, rolls…), a number of interesting extensions are demonstrated:

- **Multi-clip reward**: To utilise multiple reference clips during training, simply take $r_t^I$ for each timestep as the *maximum* value from the set. This is shown to work where the set of skills are similar, and is demonstrated on walking with various turn angles. Concatenating the target heading to the agent’s input vector allows its walking direction to be controlled.
- **Skill selection**: Another way of combining skills is to periodically switch between skill clips during training, and notify the agent by concatenating a one-hot vector encoding the current skill to its input. During deployment this again allows the character to be controlled. It is interesting to observe how the agent learns transitions between skills that are never explicitly demonstrated.
- **Composite policies**: A third skill combination method is to train separate policies for each skill, and use a composite that weights each skill policy based on its predicted value at each timestep. This is shown to enable transitions between the most diverse set of skills (flip, roll, stand up after fall).
- **Environment retargeting**: Adding a height map of the surrounding ground (pre-processed by convolutional layers) to the agent’s input allows skills such as running to be adapted to work on uneven terrain. The walking skill is adapted to climb stairs, though the motion looks a little awkward, indicating that each skill clip has a relatively narrow range over which it can be naturally extrapolated.
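The first three combination methods reduce to very small operations, sketched below. The function names are hypothetical, and the softmax weighting with a temperature is an assumption about how "weights each skill policy based on its predicted value" might be realised, not a confirmed detail of the method.

```python
import numpy as np

def multi_clip_reward(per_clip_rewards):
    """Multi-clip reward: score the agent against its best-matching clip."""
    return max(per_clip_rewards)

def with_skill_one_hot(state, skill_index, n_skills):
    """Skill selection: append a one-hot skill code to the agent's input."""
    one_hot = np.zeros(n_skills)
    one_hot[skill_index] = 1.0
    return np.concatenate([state, one_hot])

def composite_weights(values, temperature=1.0):
    """Composite policy: softmax over each skill policy's predicted value
    (assumed form of the weighting; temperature is a hypothetical parameter)."""
    z = np.asarray(values, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

best = multi_clip_reward([0.2, 0.9, 0.5])
augmented = with_skill_one_hot(np.zeros(3), skill_index=1, n_skills=4)
weights = composite_weights([1.0, 3.0, 2.0], temperature=0.5)
```

A low temperature makes the composite behave almost like a hard switch to the highest-value skill, while a high one blends the skills more evenly.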

Robustness experiments show that the learned policies can survive equal-or-larger force perturbations than in past work. Disadvantages of the approach include its data hungriness (training takes about two days *per skill* on an 8-CPU AWS instance) and the fact that the agent learns skill timings that are inextricably tied to the reference clips. There is no mechanism for moving into and out of phase.

## 🗝️ Key Insights

*Everything’s been focused on various approaches to imitation learning this week.*

- The use of an intermediate ‘coach’ layer between the teacher and student policies is an interesting idea and one approach to delivering tailored, active learning. It operates very simply in the paper here, but there are plenty of refinements that could be made (e.g. being adaptive to the student’s uncertainty).
- The GAIL paper provides a compelling argument for reducing inverse reinforcement learning to a generative adversarial learning problem. InfoGAIL (read previously) attempts to enforce more interpretable representations during this process, but is there a way to make it *far* more interpretable (e.g. by using decision trees)?
- Most previous investigations of the topic suggest student-led demonstrations provide more sample-efficient learning than teacher-led ones (since they avoid covariate shift). While the “Comparing…” paper provides some arguments in the opposite direction, I don’t believe most of these are relevant to the context of imitation learning for black box interpretability.
- DeepMimic illustrates the potential of a halfway house between imitation learning and reinforcement learning, which balances fidelity with respect to the teacher against expected reward.