# Weekly Readings #13

**Published:**

Theory-of-mind as a general solution; factual and counterfactual explanation; semantic development in neural networks; cloning without action knowledge; intuition pumps.

## 📝 Papers

### Çelikok, Mustafa Mert, Tomi Peltola, Pedram Daee, and Samuel Kaski. “Interactive AI with a Theory of Mind.” *ArXiv:1912.05284 [Cs]*, December 1, 2019.

Currently we face two major problems: AI does not understand humans, and humans do not understand AI. Here it is argued that these two problems have the *same solution*: each agent needs to form a *theory of mind* (ToM) of the other.

To solve this problem we can look to prior work on human-human ToM (psychology) and agent-agent interaction (multi-agent systems, e.g. opponent modelling, I-POMDPs). It would also be beneficial to develop a taxonomy of approaches. Here an initial one is proposed that categorises ToM complexity, on a scale from using a fixed prescriptive model of the other agent as part of the environment, to treating it as an active, learning planner, which in turn has its own theory of mind.

### Guidotti, Riccardo, Anna Monreale, Fosca Giannotti, Dino Pedreschi, Salvatore Ruggieri, and Franco Turini. “Factual and Counterfactual Explanations for Black Box Decision Making.” *IEEE Intelligent Systems*, 2019.

Here factual *and* local counterfactual explanations of a binary classifier are produced by fitting a decision tree to data points near a target instance.

Nearby points are generated by an evolutionary process. Each member of the population $P$ is initialised as a copy of the target instance $x^\ast$. On each iteration, each instance $x$ is submitted to the black box $b$, which yields a prediction $b(x)$. Fitness is computed as

\[\text{fitness}(x)=I_{b(x)=c}+(1-d(x,x^\ast))-I_{x=x^\ast}\]

where $c\in\{0,1\}$ is one of the two classes, $d(x,x^\ast)\in[0,1]$ is a distance function between instances, and the final term ensures that $x^\ast$ itself is not retained. Both mutation and crossover are performed to maximise the fitness function. Finally, a selection of the best individuals is returned. The evolution process is completed for both classes, producing a balanced dataset.
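A minimal sketch of this fitness function in Python (the black box `b` and normalised distance `d` below are hypothetical stand-ins, not the paper's implementation):

```python
import numpy as np

def fitness(x, x_star, b, c, d):
    """Fitness of candidate x: reward matching target class c and
    staying close to x_star; penalise exact duplicates of x_star."""
    same_class = 1.0 if b(x) == c else 0.0                  # I_{b(x)=c}
    proximity = 1.0 - d(x, x_star)                          # 1 - d(x, x*)
    duplicate = 1.0 if np.array_equal(x, x_star) else 0.0   # I_{x=x*}
    return same_class + proximity - duplicate

# Toy black box and normalised distance (hypothetical placeholders).
b = lambda x: int(x.sum() > 1.0)
d = lambda x, y: min(1.0, np.linalg.norm(x - y) / 10.0)
x_star = np.array([0.2, 0.3])
```

Under this scoring, the target instance itself earns $1+1-1=1$, while a distinct nearby instance of the required class scores close to $2$.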

A *factual* explanation consists of the decision path to the target instance (pruning any rules that do not yield a different prediction). A *counterfactual* explanation consists of the minimal *number* of conditions that must be changed to modify the prediction (if multiple solutions exist, include all). Both kinds of explanation are produced, yielding two perspectives on the explanation question.
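A toy illustration of the counterfactual step, assuming a hand-written surrogate tree rather than the learned one (`predict`, the feature names, and the candidate boundary-crossing values are all hypothetical):

```python
from itertools import combinations

def predict(x):
    # Toy surrogate tree: class 1 iff age > 30 and income > 50.
    return 1 if (x["age"] > 30 and x["income"] > 50) else 0

def counterfactuals(x, candidates):
    """Return all smallest sets of condition changes that flip predict(x)."""
    base = predict(x)
    for k in range(1, len(candidates) + 1):
        found = [subset for subset in combinations(candidates.items(), k)
                 if predict({**x, **dict(subset)}) != base]
        if found:
            return found  # all minimal solutions, as required
    return []

x = {"age": 25, "income": 80}        # predicted class 0
changes = {"age": 35, "income": 40}  # candidate boundary-crossing values
print(counterfactuals(x, changes))
```

For this instance, changing `income` alone does not cross the decision boundary, so the single minimal counterfactual is the `age` change.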

Experiments with several open-source datasets show that the explanatory model has higher local fidelity than one from LIME. It also produces more readable explanations and requires less manual specification of hyperparameters.

### Li, Yunzhu, Jiaming Song, and Stefano Ermon. “Infogail: Interpretable Imitation Learning from Visual Demonstrations.” In *Advances in Neural Information Processing Systems*, 3812–3822, 2017.

In *Generative Adversarial Imitation Learning* (GAIL), a learning signal is provided by a discriminative classifier $D$ which tries to distinguish state-action pairs from the expert policy $\pi_E$ and the student $\pi$. Here the approach is extended by incorporating the idea from InfoGAN of learning a disentangled, semi-interpretable latent representation.

In InfoGAIL we handle variations in expert strategy between episodes by modelling $\pi$ as a *mixture* of policies, where one is selected for each episode conditioned on a latent code $c$. As in InfoGAN, disentanglement of $c$ is encouraged by enforcing high mutual information between $c$ and the state-action pairs in generated student trajectories. The mutual information is hard to maximise directly, so it is approximated by a variational lower bound, using a posterior estimate $Q(c\vert\tau)\approx P(c\vert\tau)$.
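For reference, the InfoGAN-style variational bound being maximised has the form (notation adapted; $H(c)$ is the entropy of the latent prior, constant with respect to the optimised parameters):

\[I(c;\tau)\geq\mathbb{E}_{c\sim p(c),\,\tau\sim\pi}\left[\log Q(c\vert\tau)\right]+H(c)\]

The bound is tight when $Q(c\vert\tau)=P(c\vert\tau)$, which motivates training $Q$ alongside $\pi$.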

With $D$, $\pi$ and $Q$ each represented as neural networks, the InfoGAIL algorithm proceeds as follows:

- Loop:
    - Sample a batch of latent codes $c_i\sim p(c)$.
    - Sample a batch of state-action pairs from $\pi$, with the latent code fixed during each episode.
    - Sample a batch from $\pi_E$ of the same size.
    - Run the discriminator $D$ on the batches and update its parameters using the discrimination error gradient.
    - Update the parameters of $Q$ using the mutual information gradient.
    - Update the parameters of $\pi$ using the *Trust Region Policy Optimisation* (TRPO) update rule.
        - TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be.
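Schematically (with an advantage estimate $A$ and step size $\delta$; notation mine, not the paper's), the TRPO update solves

\[\max_\theta\;\mathbb{E}\left[\frac{\pi_\theta(a\vert s)}{\pi_{\theta_\text{old}}(a\vert s)}\,A(s,a)\right]\quad\text{s.t.}\quad\mathbb{E}\left[D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot\vert s)\,\Vert\,\pi_\theta(\cdot\vert s)\big)\right]\leq\delta\]

i.e. the largest improvement step whose average KL divergence from the old policy stays below $\delta$.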

An additional feature included in InfoGAIL is *reward augmentation* which incentivises policy updates towards regions of desired behaviour by providing a surrogate state-based reward. This yields a hybrid between imitation learning and reinforcement learning.

InfoGAIL is tested in a 2D race track simulator, with the task of imitating a black box teacher with two distinct driving modes (e.g. overtaking vehicles to the left or right). Observations take the form of top-down images (preprocessed by a CNN) concatenated with domain-specific auxiliary information such as recent speeds and actions. Reward augmentation is provided via a constant reward at every timestep to encourage vehicles to ‘stay alive’.

In addition to learning good driving performance, the aim is to separate the modes using $c$, which is encoded as a one-hot vector with $2$ elements, and this seems to work well (around $80\%$ accuracy). This is a good result, since learning the two modes individually would require us to know how to distinguish them *a priori*.

### Saxe, Andrew M., James L. McClelland, and Surya Ganguli. “A Mathematical Theory of Semantic Development in Deep Neural Networks.” *ArXiv:1810.10531 [Cs, q-Bio, Stat]*, October 23, 2018.

Here a mathematical theory is developed to explain how deep neural architectures can abstract semantic knowledge from data. Specifically, we consider a *linear* network (i.e. no activation functions) with one hidden layer, for which the input vector $\textbf{x}$ identifies an item, and the output vector $\textbf{y}$ is a set of features to be associated with that item.

- Note that this isn’t a conventional perceptual setup. We can assume both $\textbf{x}$ and $\textbf{y}$ have been generated by other networks trained to identify items and features based on raw percepts.

It is well known that linear networks of any depth can only learn linear mappings, but their weight change dynamics (by backpropagation) are described by complex nonlinear differential equations, with up to cubic interactions between weights. In this respect a linear network provides a useful and analytically tractable model of its nonlinear counterpart.

The analysis concerns the learning dynamics of a deep linear network, starting from *small*, *random initial weights* (an important assumption), with a small learning rate which ensures that learning is driven by the statistical structure of the domain rather than any individual item. These dynamics can be understood in terms of the input-output correlation matrix of the data itself, $\Sigma^{yx}$.

$\Sigma^{yx}$ can be decomposed into the product of three matrices via singular value decomposition (SVD)

\[\Sigma^{yx}=\textbf{USV}^T=\sum_{\alpha=1}^{N_h}s_\alpha\textbf{u}^\alpha(\textbf{v}^\alpha)^T\]

where $N_h$ is the number of hidden neurons. Each matrix has a semantic interpretation:

- $\textbf{v}^\alpha$ is an *object analyser* vector, with one element per item $i$, which represents the position of $i$ along an important axis of semantic distinction $\alpha$ (e.g. animal vs plant, fast vs slow, light vs dark). This axis is encoded in the weights to/from the corresponding neuron.
    - $\textbf{v}^\alpha$ can be used as a measure of *typicality* of $i$ with respect to $\alpha$. A more typical item will have higher $\vert\textbf{v}_i^\alpha\vert$.
- $\textbf{u}^\alpha$ is a *feature synthesiser* vector for $\alpha$, with one element per feature $m$, which represents the positive/negative contribution of $m$ to $\alpha$ (e.g. “has roots” has a negative contribution on the animal vs plant dimension).
    - With the right normalisation, $\textbf{u}^\alpha$ can be used as a *category prototype* for the positive direction of $\alpha$.
- $s_\alpha$ is an element of the diagonal matrix $\textbf{S}$, whose value captures the overall explanatory power or *coherence* of $\alpha$ in the data. The SVD method produces an ordered $\textbf{S}$ matrix, i.e. $s_1\geq s_2\geq\ldots$
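A tiny worked example of this decomposition, using a hypothetical four-item, three-feature domain (items and features invented for illustration):

```python
import numpy as np

# Toy domain: items x features (1 = item has feature).
# Items: canary, salmon, oak, rose ; features: grow, move, roots.
Y = np.array([[1, 1, 0],   # canary
              [1, 1, 0],   # salmon
              [1, 0, 1],   # oak
              [1, 0, 1]], dtype=float)
X = np.eye(4)              # one-hot item identifiers

Sigma_yx = Y.T @ X / 4     # input-output correlation matrix (features x items)
U, s, Vt = np.linalg.svd(Sigma_yx, full_matrices=False)

# Singular values come out ordered by coherence: s[0] >= s[1] >= ...
assert np.all(np.diff(s) <= 1e-12)
# Mode 0 groups all items together (everything "grows"); mode 1 is the
# animal-vs-plant object analyser: opposite signs for (canary, salmon)
# versus (oak, rose).
print(np.sign(Vt[1][:2]), np.sign(Vt[1][2:]))
```

Here $\textbf{u}^1$ (column 1 of `U`) assigns "move" and "roots" opposite signs, matching the feature-synthesiser picture above.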

Now comes the really interesting bit. At time $t$ during learning, the deep network’s input-output mapping is a time-dependent version of this SVD result, in which $\textbf{U}$ and $\textbf{V}$ are shared but each $s_\alpha$ is replaced by a dynamic value $a_\alpha(t)$. During learning this value follows a *sigmoidal* trajectory between some initial value and $s_\alpha$ (as $t\rightarrow\infty$).

The fascinating aspect of this, and the heart of the results in this paper, is that the **rise time is shorter for larger $s_\alpha$**, hence more coherent concepts are learned more quickly by the network. In contrast, in a shallow network with no hidden layer, $a_\alpha(t)$ follows an exponential trajectory with much weaker rate dependency on $s_\alpha$.
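A quick numerical check of this rate dependence, using the qualitative mode dynamics $\tau\dot{a}=a(s-a)$ (deep) and $\tau\dot{a}=s-a$ (shallow); the constants and initial value here are assumptions, and the paper's exact equations differ by scale factors:

```python
def half_rise_time(s, deep, a0=0.01, tau=1.0, dt=1e-3, t_max=20.0):
    """Euler-integrate a single mode's strength a(t), starting from a
    small initial value, and return the time at which it first reaches
    half its asymptote s."""
    a, t = a0, 0.0
    while a < s / 2 and t < t_max:
        da = a * (s - a) if deep else (s - a)  # deep vs shallow dynamics
        a += dt * da / tau
        t += dt
    return t

# Deep network: more coherent modes (larger s) are learned faster...
assert half_rise_time(3.0, deep=True) < half_rise_time(1.0, deep=True)
# ...while the shallow network's rise time barely depends on s.
print(half_rise_time(3.0, deep=False), half_rise_time(1.0, deep=False))
```

The deep rise time shrinks roughly like $\ln(s/a_0)/s$, while the shallow one stays near $\tau\ln 2$ regardless of $s$.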

Both theoretical analysis and experiments are conducted to show that this coherence-dependent learning rate means that **broad categorical concepts are learned before fine-grained distinctions**, which mirrors observations of human learning.

This analysis reveals how a pre-trained network can be used to model basic induction and generalisation by similarity-based reasoning:

- If a novel feature $m$ is newly ascribed to a familiar item $i$ (by adjusting *only* the weights from the hidden layer to the novel feature so as to activate it appropriately), the feature is also ascribed to another item $j$ by an amount proportional to the similarity of the two hidden states $\textbf{h}_i$ and $\textbf{h}_j$ (computed by scalar projection).
- Similarly, if a novel item $i$ is introduced with a familiar feature $m$ (by adjusting *only* the item-to-hidden weights), it will also be assigned another feature $n$ by an amount proportional to the similarity of the features’ hidden representations.
    - These cannot be obtained directly, but at time $t$ the $\alpha$th component of $\mathbf{h}_n$ can be computed as $\mathbf{h}^\alpha_n=\mathbf{u}^\alpha_n\sqrt{a_\alpha(t)}$.

Thus the hidden layer of a deep network acts as a common representational space in which both features and items can be placed.

The authors finish by reminding us that this analysis has been done on an exceedingly simple network structure. We should now proceed to analyse more complex systems.

### Torabi, Faraz, Garrett Warnell, and Peter Stone. “Behavioral Cloning from Observation.” *ArXiv:1805.01954 [Cs]*, May 4, 2018.

Propose a framework for behavioural cloning in an MDP environment *without direct access to the actions* taken by the target agent. The approach consists of three steps:

- Learning an *inverse dynamics* model of the environment $P_\theta(a_t\vert s_t,s_{t+1})$ by following a (random) exploration policy. More specifically, the aim is to find a task-independent model, such that the distribution depends only on those features of the state that are *internal* to the agent $i$, denoted $s_t^i$ and $s_{t+1}^i$. Any sufficiently expressive function approximator can be used, and the parameters $\theta$ are optimised by supervised learning. In experiments with continuous actions, the distribution is assumed to be Gaussian over each internal state dimension and a neural network is trained to output the means and standard deviations. For discrete actions, the network outputs action probabilities via a softmax function.
- The inverse dynamics model is deployed to label a dataset of trajectories produced by the target policy $\pi^\ast$ (the maximum-likelihood action is chosen for each transition). This newly-labelled dataset is then used to learn a policy approximation $\pi_\phi(a^i_t\vert s_t^i)$. For continuous and discrete actions respectively, the same structural assumptions are made as above (i.e. Gaussian or softmax).
- Both learned models can optionally be improved with post-demonstration environmental interaction: following $\pi_\phi$ for a number of timesteps [and getting the actions relabeled by $\pi^\ast$?] produces datasets for refining first $\theta$, then $\phi$.
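A minimal sketch of the labelling step, with a hypothetical hard-coded inverse dynamics table standing in for the learned network:

```python
import numpy as np

def inverse_dynamics(s, s_next):
    """Stand-in for the learned model P_theta(a | s_t, s_{t+1}):
    here a fixed rule over a 1D state with three discrete actions."""
    if s_next - s > 0:
        return np.array([0.8, 0.1, 0.1])  # "increase" most likely
    return np.array([0.1, 0.1, 0.8])      # "decrease" most likely

def label_demonstration(states):
    """Attach maximum-likelihood actions to an action-free
    state trajectory from the target policy."""
    return [(s, int(np.argmax(inverse_dynamics(s, s_next))))
            for s, s_next in zip(states, states[1:])]

demo = [0.0, 1.0, 2.0, 1.5]       # observed states only, no actions
print(label_demonstration(demo))  # each transition gets its most likely action
```

The resulting (state, action) pairs form the dataset on which $\pi_\phi$ is trained by ordinary behavioural cloning.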

The approach is deployed on a handful of standard OpenAI Gym tasks. It performs favourably compared with alternative approaches that do have access to the actions taken by the target policy, and requires fewer post-demonstration environmental interactions.

## 📚 Books

### Dennett, D. & Hofstadter, D. (2001). *The Mind’s I: Fantasies and Reflections on Self and Soul*. Basic Books.

#### From Hardware to Software

Purpose, intelligence and identity can reside at multiple levels of a hierarchical system, and it is possible for these emergent properties to result from the interactions of many individually lacking components (ants → signals → colony). **When we seek a compact explanation for behaviour, we must choose the right level of abstraction. Anything else is liable to yield empty, nonsense or overly-complex results.**

A central thesis: *mind is a pattern perceived by a mind.*

#### Created Selves and Free Will

There is no hard distinction between a simulation / model of a phenomenon and the phenomenon itself. There is a spectrum, dependent on the degree of fidelity.

The concepts of free will and the self stand and fall together. Without a hard distinction between “me” and “the universe”, how can one possibly say that “I” (and not the universe) determine my actions? Furthermore, anecdotal evidence is sufficient to show that we rarely have an integrated, unconflicted will in the first place. We have a complex of wills arising from our many subcomponents.

In his argument against the presence of understanding in *The Chinese Room*, Searle’s rebuttal of all comers seems plausible because of a clever use of framing. Change a few parameters - the size of the system, its physical implementation, and its execution speed - and we can pump our intuitions towards a different conclusion about where intentionality and understanding lie. He is also very keen to emphasise a kind of “special causal sauce” present in the human brain, without questioning what this might actually consist of.

#### The Inner Eye

The state of a brain at time $t$ (along with the input it receives from the outside world) influences its state at time $t+1$. Whether or not “I” am in control of my thoughts and actions depends on whether I am willing to identify with previous iterations of my brain, or whether I claim to exist only in the present.

## 🗝️ Key Insights

- The idea of *theory of mind* as a unifying solution to the two major problems in AI (AI understanding humans; humans understanding AI) is a simple and compelling one.
- A clear example of the utility of combining factual and counterfactual explanations, and of how decision trees can be used to deliver both.
- As in InfoGAN, InfoGAIL is another piece of evidence that information theory is one route towards semantics; on a similar note, the stunningly elegant analysis in “A Mathematical Theory…” shows how meaningful compositionality can emerge from the dynamics of neural learning alone.
- Lack of direct access to the actions taken by the target policy in behavioural cloning needn’t be a dealbreaker; an additional step spent exploring the environmental dynamics to learn an inverse model appears to be effective.
- The notion of an intuition pump - a thought experiment with its parameters tuned so as to lead the reader towards a desired conclusion - is a helpful one to be wary of when learning about thorny philosophical issues.