# Weekly Readings #9

** Published:**

Model extraction; world models and representations; a MAS taxonomy.

## 📝 Papers

### Bastani, Osbert, Carolyn Kim, and Hamsa Bastani. “Interpretability via Model Extraction.” *ArXiv:1706.09773 [Cs, Stat]*, March 12, 2018.

This is a take on the cloning / model distillation / inverse learning challenge, in which the aim is to approximate a black box classifier $f\ :\ \mathcal{X}\rightarrow\mathcal{Y}$ using an interpretable model (decision tree) $T$. The imagined use of $T$ is to enable intuition-building and debugging prior to deploying $f$.

Since decision trees are prone to overfitting to small datasets, data augmentation is used. A Gaussian mixture model is fit to the training inputs $X_{train}$, yielding a distribution $\mathcal{P}$ over the input space. $\mathcal{P}$, rather than the dataset itself, is then used during purity calculations. For example, if the Gini impurity is used, the purity of a node $N$ is computed as

\[1-\sum_{y\in\mathcal{Y}}\text{Pr}_{x\sim\mathcal{P}}[\ f(x)=y\ \vert\ C_N\ ]\]where $C_N$ is the set of axis-aligned constraints at $N$. In practice, the probability is estimated using i.i.d. samples from $\mathcal{P}\ \vert\ C_N$. In the paper, some equations are provided showing how to sample from a Gaussian mixture.

The algorithm is used to interpret several supervised learning models and an OpenAI Gym control policy. A simple CART algorithm is also used for comparison; in both cases the trees are restricted to have $31$ nodes. It does quite a bit better in terms of *relative performance* on the underlying tasks: what is its error/reward compared with the error/reward of the black box? [I’m not sure this is a very good metric.]

The extracted trees are used to infer inclusion of undesirable features, and to debug biases in the RL control policy. The development of new ways to gain insight from extracted trees is identified as an important future work direction.

### Ha, David, and Jürgen Schmidhuber. “World Models.” *ArXiv:1803.10122 [Cs, Stat]*, March 28, 2018.

The goal here is to equip an agent with a mental model of its environment - learned by a generative network - that can be used to construct a compact state representation and can even provide ‘dream’ simulations for policy training. It generally is difficult to train large policy networks in RL due to the credit assignment problem. It is hoped that a large and detailed world model will allow the policy network to remain compact even for complex tasks.

The proposed system has three components:

A

**vision model**$V$ which encodes a high-dimensional (visual) observation $o_t$ into a low-dimensional state representation $z_t$. It is implemented as a convolutional variational autoencoder (ConvVAE).A

**memory RNN**$M$ which integrates historical states $z_1,…,z_t$ and actions $a_1,…,a_t$ to produce a probability distribution of the next state $z_{t+1}$. It is implemented as a*mixed density network*combined with an LSTM (MDN-RNN), which outputs a Gaussian mixture as the density function.A very small

\[a_t=W_c[z_t\ h_t]+b_c\]**controller**$C$ which uses representations from both $V$ and $M$ to select actions. It is simply implemented as a linear model that maps the concatenated vector $[z_t\ h_t]$ (where $h_t$ is the RNN hidden state) to an action $a_t$:where the weight matrix $W_c$ and bias vector $b_c$ are the parameters to be learned. The chosen optimisation algorithm is the

*covariance matrix adaptation evolution strategy*(CMA-ES).

During experiments on a track driving task, it is found that training with $z_t$ alone as the state representation does enable some learning, the dynamics encoded in $h_t$ allow much better learning. On a different task (moving a character to avoid fireballs), training in a ‘dream’ environment generated by $M$ does transfer to the actual environment, although adversarial cases are found where the agent learns to exploit imperfections in the dream representation in order to gain reward. These, of course, do not yield good performance in the actual environment.

The tasks used in these experiments are relatively simple. For more sophisticated tasks, it is suggested that an iterative training procedure for $M$ and $C$ would be useful. On each iteration, the agent is deployed in the actual environment, trajectories are collected, $M$ is retrained on these trajectories, then $C$ is retrained to maximise rewards inside of $M$.

### Merckling, Astrid, Alexandre Coninx, Loic Cressot, Stephane Doncieux, and Nicolas Perrin. “State Representation Learning from Demonstration.” *ArXiv:1910.01738 [Cs, Stat]*, September 15, 2019.

In this work, the aim is to learn a low-dimensional state representation from high-dimensional observations of an existing policy with access to the ground-truth state. By providing demonstrations for a range of policies and task setups, the hope is that the inferred state representation is generalisable and complete. This approach differs from most work in SRL in that it is entirely off-policy: the representation is learned from demonstration alone and not from direct interaction with the environment.

For a given task $T^i$ and target policy $\pi^i_{target}$, we can observe trajectories and store them in a dataset $\mathcal{D}^i$ where each element of $\mathcal{D}^i$ is a triple $(I_{t-1},I_{t},a_{t}^{i})$. $I_{t-1}$ and $I_t$ are consecutive high-dimensional observations and $a_{t}^{i}$ is a vector corresponding to the action taken at time $t$.

Given a *set* of $K$ tasks and associated policies, the aim is to learn a *single* mapping $\varphi(I_t^i)=\varphi_t^i$, implemented as a neural network. The final state representation used is $(\varphi^i_t,\Delta\varphi^i_t)$ where $\Delta\varphi^i_t=\varphi^i_t-\varphi^i_{t-1}$, since this is deemed likely to be useful for a dynamical system which is described by differential equations. This common representation is then used by $K$ independent *head* networks $\psi^{1:K}$ which aim to mimic the $K$ policies. We end up with $K$ optimisation problems:

for $i\in{1,…,K}$, where $\theta_i$ are the parameters in *both* $\varphi$ and $\psi^i$ that contribute to the output of $\psi^i$. Of course, the optimisation problems are not independent because parameters are shared.

The $K$ problems are solved semi-concurrently by iteratively sampling one of them and performing gradient descent on its loss function using a batch of samples from the associated dataset.

The experimental task is that of moving a 2D robotic arm to a goal location. Observations consist of images with added noise and distractors. Key findings:

- Asymptotic performance pretty much matches using the ground-truth state if no noise or distractors are used, but falls below when they are added. The method far outperforms using an autoencoder-based representation (which does not work at all in the noisy case), and offers a smaller advantage over using PCA.
- Performance increases incrementally as the allowed size of the representation is increased between $16$ and $48$, but begins to decrease again at $128$, and reaches zero at $4096$ (equal to image size).

### Muharram, M., and G.D. Smith. “Evolutionary Constructive Induction.” *IEEE Transactions on Knowledge and Data Engineering* 17, no. 11 (November 2005): 1518–28.

A *genetic programming* (GP) toolkit called *GPSys* is used to synthesise nonlinear attributes using the arithmetic operators ($+,-,\div,\times$) and a set of original features for a decision tree learner, deployed on five classification problems. Various fitness functions are examined, mostly based on Gini coefficient, information gain or a combination of the two. The evolved attributes enable better performance, and significantly more compact trees, compared with using the original features. A couple of examples are given of intuitively-useful attributes being discovered automatically.

On one of the datasets, this hybrid approach of GP and decision tree is shown to be more effective than simply using a GP alone to evolve a rule set from the original features. It is suggested that this is because rule evolution requires useful features to be synthesised separately for each use, whereas the hybrid allows them to be easily reused in multiple parts of the decision logic.

### Stone, Peter, and Manuela Veloso. “Multiagent Systems: A Survey from a Machine Learning Perspective:” Fort Belvoir, VA: Defense Technical Information Center, December 4, 1997.

Multiagent systems (MAS) have several advantages including speed (from parallelism), robustness and scalability, and potential ease-of-programming compared with global control.

This article taxonomises MAS according to two axes: degree of *heterogeneity* and degree of *communication*. Hence there are four broad classes of MAS:

**Homogenous non-communicating**. Goals, domain knowledge and capabilities are common. Mutual modeling is still often required (e.g. to prevent policy convergence), but this is relatively straightforward since each agent can assume each other one reasons identically given the same internal state and input. However, fully-recursive belief modelling is only practical up to a certain depth. Indirect communication may be possible through*stigmergy*.**Heterogenous non-communicating**. Goals, domain knowledge and capabilities differ between agents, hence we may have a mix of benevolent and competitive dynamics. This class includes MAS where learning is possible on a single-agent level, even if agents begin as homogenous (credit assignment is often challenging here). The mutual modeling problem is significantly more complex here: in the absence of communication, models must be constructed from observation alone. Roles, social conventions and teaming may all emerge.**Homogenous communicating**. The right implementation of communication can greatly improve efficiency and performance. There tends to be a tradeoff between degree of communication, and individual decision-making freedom. In robotic implementations, bandwidth limitations and latency make for tough technical challenges, but communication can be harnessed to augment sensing and localisation.**Heterogenous communicating**. This is the most general class of MAS. The design of commonly-usable communication protocols is far more difficult here: there are a few standard ones out there, and some work has been done on letting agents learn a protocol. In competitive systems, communication may be used to negotiate, mislead or confuse, and to make or break commitments. Mutual BDI modeling is common in this class of system.

These classes are explored within the complex and interesting case study of robotic football. At the extreme end of simplicity (homogenous non-communicating), players may have no idea they are even part of a team and have a policy as simple as running toward the ball and shooting where possible. An obvious form of heterogeneity is that of prescribed positions, and another is variation in skills which can be implemented success probabilities and noise levels for various action types. Communication is also clearly beneficial in implementations where each player has incomplete information (e.g. a limited view of the field).

In such a complex scenario, behavioural layering is important, with low-level kicking dynamics at one end, and tactical playing styles at the other.

## 🗝️ Key Insights

- A possible way to prevent overfitting in decision trees is to fit a generative model to the input features and train on samples from the model, rather than from the original dataset itself.
- The three-component
*world model*architecture, consisting of $V$, $M$ and (simple) $C$, seems like a very general and powerful one that can be used in a variety of ways. - Representations can be learned from pure observation alone, though it is important that multiple task variants or policies are observed, in order to prevent trivial overfitting.
- Genetic programming is an effective way of doing representation learning when the set of original features and operators can be specified
*a priori*. If the constructed features are to be used in a decision tree, it makes sense to evaluate their quality using the same purity metric as in the tree generator. - A general theme from the representation learning literature: modularising the representation and policy learning stages seems to improve efficiency and generalisation compared with end-to-end methods.
- Two of the most useful axes for taxonomising multi-agent systems are homogeneity and communication. Also, robotic football is a fantastically rich testbed for MAS research.