# Weekly Readings #26

** Published:**

## đ Papers

### Atkeson, Christopher G., and Stefan Schaal. *Robot Learning From Demonstration*, 1997.

This paper describes implementation of a model-based kind of imitation learning to perform the pendulum swing-up task by acceleration-based control of the horizontal position $x$ of a physical robot hand. The pendulum starts at angle $\theta=\pi$ and a successful swing-up moves it to $\theta=0$.

The approach makes use of a **task model** and a **imitation-based cost function** to be used by an **optimal control planner**. For the task model, the authors use a discrete-time parametric model of the angular velocity of an idealised pendulum attached to a horizontally-moving hand:

The two parameters $\alpha_1$ and $\alpha_2$, corresponding to the viscous damping and $\Delta g/l$ respectively ($\Delta$ is the timestep), are learned from data by linear regression. An alternative nonparametric *locally-weighted learning* model, to predict $\dot{\theta}*{k+1}$ as a function of $(\theta*{k}, \dot{\theta}*{k}, x*{k}, \dot{x}*{k}, \ddot{x}*{k})$, is also learned.

These models are used by an optimal control planner to minimise a cost function that penalises deviations from a demonstration trajectory and squared acceleration:

\[r\left(\mathbf{x}_{k}, \mathbf{u}_{k}, k\right)=\left(\mathbf{x}_{k}-\mathbf{x}_{k}^{\mathrm{d}}\right)^{\mathrm{T}}\left(\mathbf{x}_{k}-\mathbf{x}_{k}^{\mathrm{d}}\right)+\mathbf{u}_{k}^{\mathrm{T}} \mathbf{u}_{k}\]where the state is $\mathbf{x}=(\theta, \dot{\theta}, x, \dot{x})$, $\mathbf{x}^{\mathrm{d}}$ is the demonstrated motion, $k$ is the sample index, and the control is $\mathbf{u}=(\ddot{x})$.

With both parametric and non-parametric models, this approach successfully imitates human demonstrations with one back-and-forth motion of the pendulum, but it fails at the more difficult task of âpumpingâ the pendulum twice before swinging it up.

It is suggested that this is likely due to a mismatch between the model structure and the true system. To combat this, an additional free parameter is introduced: allowing the target angle used by the planner to vary away from $\theta=0$. Searching across a range of values identifies a value which allows the more complex task to be learned.

### Bhattacharyya, Raunak P., Derek J. Phillips, Changliu Liu, Jayesh K. Gupta, Katherine Driggs-Campbell, and Mykel J. Kochenderfer. âSimulating Emergent Properties of Human Driving Behavior Using Multi-Agent Reward Augmented Imitation Learningâ. In *ArXiv:1903.05766 [Cs]*, 2019.

Both imitation learning and reinforcement learning are challenging and unstable in multi-agent domains where undesirable emergent properties may develop. The authors propose to tackle this problem by augmenting the generative adversarial imitation learning (`GAIL`

) approach with prior knowledge in the form of hand-crafted rewards to disincentivise visiting a set of undesirable state-action pairs $U$.

`GAIL`

uses an adversarial two-model setup, whereby a discriminator network $D_\psi$ is trained to output a high score given state-action pairs from a demonstration policy $\pi^\ast$, and a low score given pairs from a learner policy $\pi_\theta$.

Rather than applying the reward augmentation as a hard constraint (which makes optimisation much more difficult), it is added as a regularisation term to the `GAIL`

objective:

This objective is solved in two iterative steps:

- Optimise the discriminator parameters $\psi$ using data from new rollouts.
- Use
*Trust Region Policy Optimisation*to update the policy parameters $\theta$. Part of the loss comes from the discriminator, and part comes from the reward augmentation term.

In the context of a traffic simulator environment with real-world training data, one reward augmentation scheme (âbinaryâ) consists of a large penalty $R$ if a vehicle crashes or drives off the road, and a smaller one $R/2$ if it performs a hard brake. An alternative (âsmoothâ) uses a linearly-increasing penalty as a function of distance to the road edge or braking magnitude.

`RAIL`

is compared with `GAIL`

on various metrics: positional divergence of vehicles from the training data, error rate (collision, off-road, hard-brake), and frequency of emergent phenomena (lane changes, time gaps between vehicles) compared with the training data. During these evaluations, a certain proportion of the vehicles use the learned policy, and the rest follow the training data. `RAIL`

is clearly seen to do better.

### Brennen, Andrea. âWhat Do People Really Want When They Say They Want âExplainable AIâ? We Asked 60 Stakeholders.â, 2020, 7.

Here the author summarises two key findings from stakeholder interviews.

**Lack of consistent terminology:**

- Identified synonyms for âExplainableâ included âAccountableâ, âFairâ, Reliableâ and âVerifiableâ. Some interviewees used terms interchangeably that others insisted were very different.
- Similarly, the two terms âInterpretabilityâ and âExplainabilityâ were used by some to mean the same thing, and others to describe two alternative approaches. What those alternatives actually were also varied!
- Finally, even âAIâ meant different things to different people. Some were interested only in deep learning models, while others extended their definition to incorporate nearly any automated manipulation of data.
- Miscommunication is exacerbated when words have both technical and nontechnical definitions, e.g. âFairnessâ.

**Multiple motivations for explainability:**

- Most people started with the notion of AI as a
*black box*, and that the existence of a*mismatch*between what they wanted to understand and what they currently understood. - But there are
*at least*three very distinct variants:**Debugging models**: explainability as a tool to be used by engineers for improving performance.**Identifying bias**: explainability to identify problematic data or inferences, important from business and legal perspectives.**Building trust**: explainability to help end-users understand unfamiliar technologies and assuage their anxiety.

- With different explainees and different content requirements, it seems reasonable to expect weâll need different solutions.

Identifying these issues, which hinder discussion and mutual understanding, is the first important step toward mitigating them.

### Elton, Daniel C. âSelf-Explaining AI as an Alternative to Interpretable AIâ. *ArXiv:2002.05149 [Cs, Stat]*, 24 April 2020.

The *double descent phenomenon* indicates that deep neural networks typically operate by interpolating between data points rather than by extracting a few high level rules as we might have hoped. This is a âdumbâ process: any apparent regularities that appear to be captured internally are properties of the data rather than a self-directed distillation process.

In addition to making failure likely outside the training distribution, this behaviour makes interpretation (which makes the implicit assumption that such latent rules exist) inherently hard. With this in mind, how can we ever hope to trust AI systems? Instead of post-hoc methods, Elton advocates for **self-explaining** AI models. In addition to the prediction itself, a self-explaining model would output:

- A human-understandable explanation;
- Confidence levels both both prediction and explanation.

Prediction and explanation branches from a shared set of primary layers. How can we trust that the explanation is actually relevant? One possible option is to measure the *mutual information* between the prediction $Y$ and the activation vector of the final shared latent layer $L$, and also between the explanation $E$ and $L$:

A disadvantage of this is that mutual information is extremely expensive to compute!

A couple more points made during the discussion:

- The term âblack boxâ should not be used as often as it is, because given enough time and effort we
*can*know exactly whatâs inside a neural network, and how it got there. The issue is that we donât have a good way of relating this knowledge to questions we actually care about. - Given that deep neural networks seem to operate using dumb interpolation, there seems to be a direct conflict between the accuracy of an interpretation, and its relevance to the domain of application.

## đ Theses

### Ross, Stephane. âInteractive Learning for Sequential Decisions and Predictions.â (2013).

Learning actively through interaction is necessary to obtain good and robust predictors for sequential prediction tasks. No-regret online learning methods provide a useful class of algorithms to learn efficiently from these interactions, and provide good performance both theoretically and empirically.

This thesis introduces interactive learning methods that address the major limitation of behavioural cloning: the sequential nature of control problems breaks the i.i.d. assumption. Any open-loop error causes the test distribution to diverge from the training one in the closed-loop setting, and poor performance results.

The ideal solution is to train on the distribution that the learned predictor will see in closed-loop, but this is a *chicken-and-egg* problem! A common strategy to deal with such problems is to adopt an iterative training approach, and this is what is done here. Overall, the effect is to **reduce imitation learning (and related problems) to online learning**.

#### Learning Behaviour from Demonstrations

Here we start to tackle the problem of learning good control policies from demonstration. All analysis assumes an episodic MDP of fixed length $T$, but does generalise to discounted infinite-horizon tasks. Performance is quantified in terms of a non-negative cost to be minimised $C(s,a)\in[0,C_{\max}]$, rather than reward $R(s,a)$ which may be positive or negative. This simplifies much of the analysis.

Our ultimate goal is to minimise $J(\pi)$, the expected sum of costs accumulated by $\pi$ over $T$-timestep episodes under the induced state distribution $d_\pi$. Since this is challenging to do directly, we introduce a surrogate loss $\ell(\pi,\pi^\ast)$ that quantifies some measure of deviation from a high-performing *expert* policy $\pi^\ast$. In the special case when we are interested in the learnerâs ability to predict the expertâs actions, $C=\ell$.

Behavioural cloning is the most naive solution to this problem, which simply performs supervised learning on a training distribution generated by $\pi^\ast$. $\epsilon$ the expected loss under this distribution. It can be proven that if $\ell$ is the 0-1 loss (or an upper bound such as the hinge or logistic loss), then $J(\pi) \leq J\left(\pi^{*}\right)+C_{\max } T^{2} \epsilon$, meaning the closed-loop cost of $\pi$ compared with $\pi^\ast$ **grows quadratically** with $T$. This is much worse than the i.i.d. setting, where the cost of classifying $T$ samples grows linearly with $T$.

What follows is a selection of approaches for closing this gap.

##### Forward Training

As a simple starting point, we can imagine training a *sequence* of policies $\pi_1,âŚ,\pi_T$ in order. For each $\pi_t$, we sample multiple $t$-step trajectories, starting in the initial state distribution and continued by executing $\pi_1,âŚ,\pi_{t-1}$. This produces a distribution of states for time $t$. We query the expert for these states and use the results to train $\pi_t$.

By training sequentially in this way, each $\pi_t$ is trained under the distribution of states itâs going to encounter during closed-loop execution. If previous policies produce poor predictions, then $\pi^\ast$ will demonstrate the necessary recovery behaviours at future steps.

Iterative forward training attains closed-loop loss that grows linearly with $T$ (the best possible case), but it is impractical when $T$ is large (or the MDP is continuing), and inefficient in that the same or similar behaviours may have to be learned many times over for different timesteps.

##### Stochastic Mixing Training

The two approaches reviewed here achieve similar guarantees by adopting the same general philosophy of *learning gradually*.

`SEARN`

works by starting from a policy $\pi_0=\pi^\ast$, i.e. one that just queries the expert. A new policy $\hat{\pi}_1$ is trained to minimise the surrogate loss $\ell$ on data from the execution of $\pi_0$. This is then **stochastically mixed** with $\pi_0$ to obtain the next policy

which at every timestep $t$, follows $\hat{\pi}_1$ with probability $\alpha$, and follows $\pi_0$ otherwise. It keeps iterating in a similar fashion, so that

\[\pi_{n}=(1-\alpha) \pi_{n-1}+\alpha \hat{\pi}_{n}=(1-\alpha)^{n} \pi_{0}+\alpha \sum_{i=1}^{n}(1-\alpha)^{n-i} \hat{\pi}_{i}\]After some large number of iterations $N$, `SEARN`

terminates and returns a final policy $\pi_N$ that never has to query $\pi^\ast$ by renormalising:

By default, `SEARN`

involves simulating an entire trajectory to collect each training datapoint, which is often impractical. A variant, called `SMILE`

, addresses this issue. It keeps the same update relationship between policies, mediated by $\alpha$, but at each step $n$ the training is simply to imitate the expert $\pi^\ast$ on the latest dataset. Hence each $T$-step trajectories yields $T$ datapoints for training instead of one. However, `SMILE`

has less strong performance guarantees than `SEARN`

or forward training.

Both `SEARN`

and `SMILE`

also have the disadvantage of requiring a stochastic policy which might be undesirable in practical (safety critical) applications.

##### Dataset Aggregation

This approach, the flagship of the thesis, is shortened to `DAgger`

. It can learn deterministic policies, and is even simpler than the preceding methods.

`DAgger`

starts by gathering a dataset $\mathcal{D}$ of expert-generated trajectories, trains a policy $\pi_2$ to imitate the expert on those data. Then at iteration $n$, it proceeds by collecting a dataset at each iteration under $\pi_n$, appends this to $\mathcal{D}$, and trains $\pi_{n+1}$ on the concatenation of all collected datasets. It is a kind of *follow the leader* algorithm in that at each iteration we find the policy that would give the best imitation in hindsight on all previous data.

This basic algorithm can be modified to use a mixture policy at each iteration $n$, that samples from either $\pi_n$ or $\pi^\ast$ according to a parameter $\beta_n$. In practice however, it is found that itâs often best to just set $\beta_1=1$ and $\beta_n=0$ for all $n\neq1$.

`DAgger`

works with any âno-regretâ online learning algorithm.

A couple of questions:

- Why not just collect data everywhere, e.g. via a random or noisy policy? Because (1) itâs very data-inefficient, and (2) when the policy class $\Pi$ is not expressive enough to fully
*realise*the target policy, we need to trade-off accuracy in different regions correctly fitting irrelevant data may actually harm performance. - Why not just use the last collected dataset as in policy iteration? This tends to be very unstable and leads to oscillation between multiple mediocre policies. When we constantly append to $\mathcal{D}$, the fraction of new data decreases over time, so the policy will tend to change more and more slowly, and stabilise around some behaviour that produces good predictions under inputs that were collected by similar predictors. We get increasingly close to the âidealâ of matching training and test distributions.

#### Learning Behaviour using Cost Information

Now consider the case where an expert is present, *and* additional cost information is available. Intuitively, this should enable better performance!

##### Forward Training with Cost-to-Go

Instead of only observing the expertâs actions in states reached at time $t$ by execution of $\pi_1,âŚ,\pi_{t-1}$, we instead *explore* actions to perform at $t$, follow $\pi^\ast$ until the end of the episode, and observe the total cost $Q$ of this sequence. Under the assumption that $\pi^\ast$ is a decent policy and that $\Pi$ contains similar good policies, then these $Q$ values give us a rough estimate of what good policies in $\Pi$ will be able to achieve at future steps.

$\pi_t$ is trained to minimise cost on the collected dataset of $(s_t,a_t,Q)$ triples. If we are lucky enough to be able to sample *all* actions from each $s_t$, we have all the information we need for cost-sensitive classification learning. Otherwise, we can reduce the problem to a regression one (predicting cost for a state-action pair) or use importance weighting techniques.

To select exploring actions from an $s_t$, we can simply use a random strategy. Better results might be attained by casting this as a contextual bandit problem. Developing efficient algorithms for this problem is still an open problem.

In cases where the expert is much better than any policy in $\Pi$, then the cost estimates may be very optimistic, and not reflective of the true prospects of the learner. This will not lead to good performance.

`DAgger`

with Cost-to-Go

A similar idea is used here. At iteration $n$, we collect each of $m$ sample as follows:

Follow current policy $\pi_n$ from the start of the episode to a randomly-sampled time $t\in{1..T}$.

Execute some exploration action $a_t$ in $s_t$.

Follow $\pi^\ast$ from $t+1$ to $T$, observing the total cost $Q$ starting at $t$.

After adding the $m$ new samples to $\mathcal{D}$, we train $\pi_{n+1}$ as a cost-sensitive classifier.

By minimising cost, this algorithm is effectively a regret reduction of imitation learning, rather than an error reduction as obtained when minimising classification loss in conventional `DAgger`

.

Utilising cost-to-go can often be expensive and impractical compared with simple imitation loss minimisation. A potential combination of the two approaches could be very effective.

## đď¸ Key Insights

- Atkeson and Schaalâs classic paper, oft-cited in the imitation learning literature, takes quite a heavily model-based approach compared with more recent works.
- That said, Bhattacharyya et alâs much newer
`RAIL`

model makes use of domain knowledge, in the form of hand-crafted penalties for known failure modes, added as an extra term to the`GAIL`

objective. - Brennenâs review lays bare the concerning lack of consistency or rigour in uses of words such as âinterpretabilityâ and âexplainabilityâ. With so many potential applications in mind, it might be that both terms are just too broad to be meaningful.
Eltonâs proposal for self-explaining AI is to me less interesting than some of the high-level observations he makes along the way. The term âblack boxâ is really inappropriate when what is missing is not

*knowledge*of a neural networkâs internal representations, but a way of*relating*them to issues we care about. And this seems doubly challenging if we believe the evidence that such representations encode mere interpolation rules rather than high-level regularities.- Rossâ thesis gives compelling arguments for the critical importance of interaction when doing imitation learning in dynamic contexts. His
`DAgger`

algorithm is an easy-to-implement scheme for implementing this interaction.