# Weekly Readings #17

** Published:**

Using $Q$ for imitation; differentiable decision trees and their application to RL; interactive explanations with Glass-Box.

## 📝 Papers

### Bastani, Osbert, Yewen Pu, and Armando Solar-Lezama. “Verifiable Reinforcement Learning via Policy Extraction.” In *Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 2494–2504. Curran Associates, Inc., 2018.

The motivation of this paper is to learn agent policies whose safety, stability and robustness can be formally verified. To achieve this, a modified DAgger process is completed to distill a pretrained DNN policy $\pi^\ast$ into a decision tree. The process, which the authors call *VIPER* (Verifiability via Iterative Policy ExtRaction), makes additional use of the targets policy’s $Q^{(\pi^\ast)}$ function, which yields far more compact trees than conventional imitation learning or DAgger.

It is first suggested that we modify DAgger to use the following convex loss

Rather than directly modifying the loss optimised by the decision tree inducer (here, CART), the same effect can produced in expectation by modifying how instances in the aggregated DAgger dataset are sampled for training. Specifically, we just need to sample each $(s,a)$ pair with a probability weighted by $\tilde{\ell}(s)$. A new tree is trained from scratch on each iteration.

Trees resulting from the VIPER process are demonstrated to be smaller and higher-performing than ones from DAgger (CartPole and Pong environments), and are then subjected to a variety of verification analyses which prove stability and performance guarantees. Such analyses could not be applied directly to a DNN policy.

### Löckel, Stefan, Jan Peters, and Peter van Vliet. “A Probabilistic Framework for Imitating Human Race Driver Behavior.” *ArXiv:2001.08255 [Cs, Stat]*, January 22, 2020.

Here the goal is to build realistic human driver models for use in simulations for vehicle optimisation. The problem setup is a standard behavioural cloning one, with a dataset $\mathcal{D}$ of state-action pairs from a stochastic driver-specific policy $\pi_E$, which we wish to imitate with a learned policy $\pi_M$ that *explicitly represents the variability* in $\pi_E$. The problem is solved by modularising it using domain-specific intuition.

The first stage is to learn a driver-specific *global target trajectory* model from demonstration data, which is represented as a Gaussian mixture model [I think] in an abstract weight space. Samples from this space can be used in turn to construct sample trajectories. The second stage performs *local path planning*, which involves fitting a kind of spline called a clothoid which starts at the vehicle’s current centre of mass and ends at some point on the target trajectory. The third and final stage is *action selection*. Here, the clothoid parameters, alongside a features representing the current position and orientation of the vehicle, are fed into a recurrent neural network which outputs a steering angle $\delta$, gas pedal actuation $g\in[0,1]$ and brake pedal actuation $b\in[0,1]$. The global target trajectory model has to be trained for each track topology, while the action selection model is track-independent, and the local path planning algorithm is fixed and requires no training.

Evaluation is done using a selection of domain-specific similarity metrics (e.g. lap time, steering angle distribution) alongside the number of laps completed before a crash, and the subjective *Imitation Game*-like human judgements as to whether trajectories are generated by $\pi_E$ or $\pi_M$. On each of these, the proposed modular approach does better than conventional imitation learning, as well as DAgger [which interestingly seems to confer no real benefit], although the number of test runs completed is remarkably small. The moral of the story: modularisation and domain knowledge helps us solve problems, which is not very surprising.

## Rodriguez, Ivan Dario Jimenez, Andrew Silva, Taylor Killian, Sung-Hyun Son, and Matthew Gombolay. “Towards Interpretable, Online Reinforcement Learning with Differentiable Decision Trees.” *ArXiv:1903.09338 [Cs, Stat]*, July 22, 2019.

Decision trees are not typically used in RL because they cannot be learned via gradient-based updates; prior attempts to bypass this have been hacky and haven’t yielded particularly impressive results.

Suárez and Lutsko [see paper below] developed the *differentiable decision tree* (DDT) model, in which each decision node $n$ computes a linear combination of the features $x$ (with weights $\textbf{w}_n$ and bias $t_n$) and applies a logistic activation function (with steepness $b_n$) whose output value is used as the probability of taking the rightmost child node.

This paper marks the first exploration of using a DDT for end-to-end RL. Theoretical analysis with an extremely simple 4-state MDP, and a DDT with just one decision node, provides evidence that the policy gradient paradigm will be more stable than Q-learning.

After this, a couple of simple modifications are made to Suárez and Lutsko’s model, to service the goal of interpretability. Firstly, a regularisation term is added to the policy gradient loss which encourages sparsity in the weights $\textbf{w}_n\forall n$ and thus for a single feature to be used at each node:

To ensure that the ‘chosen’ weight $\max (\textbf{w}_n)$ remains nicely bounded, a softmax operator is then applied to $\textbf{w}_n$. Another modification is to uniformly initialise the weights, because random initialisation is found to introduce bias. Finally, a method for extracting a conventional, discretised decision tree is introduced: simply finding the index of the ‘chosen’ feature $\text{argmax} _{j}(\textbf{w}_n^j)$ and taking the rightmost child node when

Evaluations in OpenAI Gym environments demonstrate that DDTs learned by RL exhibit comparable performance to neural network function approximators with the same or more parameters, and that the extracted discretised trees only suffer a marginal performance hit.

### Sokol, Kacper, and Peter Flach. “One Explanation Does Not Fit All: The Promise of Interactive Explanations for Machine Learning Transparency.” *ArXiv:2001.09734 [Cs, Stat]*, January 27, 2020.

Miller’s (2019) “seminal” call-to-action for the XAI to incorporate more ideas from the social sciences has been listened to in many ways, but a remaining neglected area is that of interactivity and personalisation of explanations. One of the reasons for this is the lack of a well-defined protocol for evaluating interactive explanations.

Here the authors present *Glass-Box*, their implementation of an interaction explanation framework. Glass-Box uses a decision tree as the underlying prediction model, which currently provides the largest range of options for global explanation (visualisation, feature importance) and local explanation (decision rule, counterfactual, exemplar).

Glass-Box has been implemented to work with the UCI German credit dataset, for which the output is a binary credit score. It is able to answer three kinds of counterfactual explanation query:

*Why…?*: returns the shortest (according to a distance metric in feature space) class-contrastive counterfactual.*Why given…?*: returns the shortest counterfactual that*only*uses a specific subset of features.*Why despite…?*: returns the shortest counterfactual that does*not*modify a specific subset of features.

The system is paired with an off-the-shelf NLP framework that allows the explainee to ask a sequence of valid queries of the above types. However, feedback from experimental subjects (conference attendees) suggests that the NLP module is slow and clunky, so future studies with different datasets will use the *Wizard of Oz* approach (have subjects believe they are using an NLP system but actually enter queries manually). This reduces the engineering overhead and enables more focus on the technical core of the work.

Other issues that need more focus include explainees’ tendencies to over-generalise from local explanations. This demands very careful interaction design. Future work will also consider how to construct a mental model of the explainee based on argumentation theory. The mental model will be utilised to adjust the granularity and complexity of explanations.

### Suarez, A., and J.F. Lutsko. “Globally Optimal Fuzzy Decision Trees for Classification and Regression.” *IEEE Transactions on Pattern Analysis and Machine Intelligence* 21, no. 12 (December 1999): 1297–1311.

This paper introduces a kind of fuzzy decision tree which has since come to be known as a *differentiable* decision tree. It augments the general CART framework with the core idea of fuzzy set theory: that of partial membership of a set.

The conventional crisp test used at a decision node $n$ is

with $\mu^\text{left}_n(\textbf{x}_i)=1-\mu^\text{right}_n(\textbf{x}_i)$. Here, $j_n$ is the index of the selected feature, $t_n$ is the threshold and $\mu^\text{child}_n(\textbf{x})$ indicates the membership of the input vector in each child node (currently $\in{0,1}$). This can be thought of as a special case of a test concerning the weighted sum of all features

which, for regression problems, can be fuzzified using a logistic activation function and continuous membership values $\in[0,1]$:

where $b_n$ is a width parameter. For regression, the defuzzified prediction $\bar{y}(\textbf{x})$ for a given input vector $\textbf{x}$ is computed as a membership-weighted sum of predicted values across all leaf nodes. Note that the predicted value of a leaf node $l$, denoted $d_l$, is itself a parameter to be learned.

In addition to being more expressive than a crisp, axis-aligned decision tree (can represent continuous functions), a fuzzy regression tree has a structure of continuous parameters. This means it is amenable to global optimisation, specifically by backpropagation of loss gradients. For a training dataset $\mathcal{D}$, the gradient of the mean squared loss $\mathcal{L}_\mathcal{D}$ with respect to each $d_l$ is

and the gradient with respect to the parameter vector of a decision node $\xi_n=[-b_nt_n,b_n\textbf{w}_n]$ is

where $\bar{y}_n^\text{child}(\textbf{x})$ is the partial estimate of $\bar{y}(\textbf{x})$ that can be computed from the subtree rooted at the child node, and

where $\tilde{\textbf{x}}=[1,\textbf{x}^{(1)},\textbf{x}^{(2)},…]$.

It is important to note that this optimisation scheme *assumes a fixed tree topology*. In experiments, the authors first train a crisp regression tree using CART, then use it to initialise a fuzzy tree which is subjected to gradient-based optimisation. The proposed initialisation method for each decision node is to copy the threshold $t_n$ exactly, set the weights $\textbf{w}_n$ to a one-hot vector encoding the selected feature $j_n$ [I think], and choose the width parameter $b_n$ via a rather complicated function of the range of $j_n$ across the training dataset, and impurity reduction effect of the node [see paper for details].

So far we have assumed that the fuzzy tree is being used for a regression problem. For classification, the scheme is quite a bit hackier. First, we need to introduce normalisation as follows [why?]:

where $\tilde{t}_n=t_n/\vert\textbf{w}_n\vert$. In addition, class labels are assigned to leaf nodes by the majority rule (so unlike in regression, they are not parameters), and the defuzzified prediction $\bar{y}(\textbf{x})$ is the class with the greatest sum of membership across all leaf nodes.

The global Gini impurity $\mathcal{G}_\mathcal{D}$ is chosen as the loss metric to optimise for classification. The loss gradient with respect to a decision node parameter $\alpha_n$ is

where for leaf nodes

(with $k=1..K$ being the classes), and for decision nodes

which is computed by backward recursion from the leaves. $\alpha_n$ can stand for either $\textbf{w}_n$ or $\tilde{t}_n$, with

and

The width parameters $b_n$ are not optimised directly. Instead, a range of widths are tried and the best-performing one is chosen.

The accuracy gain attained by using a fuzzy classification tree is less clear-cut than with regression, but we get an additional advantage, namely a measure of the proximity of a given example $\textbf{x}_i$ to a decision boundary (which is a good indicator of its probability of misclassification). This measure is the *fuzzy entropy*, calculated as follows:

where

This measure can easily be calculated concurrently with the classification of the example.

## 🗝️ Key Insights

A couple of pieces of evidence of how additional information to enrich the imitation learning process can improve its performance. The two examples were the use of the target policy’s state-action value function $Q$, and the division of the imitation problem into stages using domain konwledge.

Differentiable / fuzzy decision trees appear to be an immensely flexible model architecture, especially for regression, that have been criminally under-explored in the RL context where their online learning potential is extra valuable.

Sokol and Flach’s work on Glass-Box demonstrates the dual nature of the XAI problem: it needs both theoretical and algorithmic developments, and careful design of interaction protocols. We must be aware of which is our focus and strength, otherwise we risk wasting a lot of time creating suboptimal results!