Posts by Collection

2021 #Content/Paper by Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. There’s also an accompanying YouTube playlist here and walkthrough by Neel Nanda here

Attention

3 minute read

Note: traditional attention was first used as a mechanism to improve the performance of RNNs. The Attention is All You Need paper proposed self-attention as a complete replacement for the RNN itself, which has subsequently proven to be more performant as well as faster due to the lack of reliance on sequential processing.

Codebook Features - Sparse and Discrete Interpretability for Neural Networks

1 minute read

2024 ICML #Content/Paper by Alex Tamkin, Mohammad Taufeeque, and Noah D. Goodman.

Dictionary Learning

3 minute read

One of three approaches to alleviating superposition suggested in Anthropic’s toy models blog post, which has a long history in the sparse coding literature and may even be a strategy employed by the brain.

Interpreting Neural Networks through the Polytope Lens

less than 1 minute read

2022 #Content/Paper by Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, and Connor Leahy.

Linear Representation Hypothesis

less than 1 minute read

That monosemantic concepts are represented in neural networks as directions (vectors) in activation space, that conjunctions of concepts are represented as linear combinations of these vectors, and that we can understand the computations done by models between layers as nonlinear operations on them.

Mechanistic Interpretability

less than 1 minute read

Embodies a fundamental belief that neural networks are only black boxes by default, and that we can understand their function in terms of interpretable concepts and algorithms if we try hard enough.

Not All Language Model Features Are Linear

less than 1 minute read

2024 #Content/Paper by Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, and Max Tegmark.

Scaling Monosemanticity - Extracting Interpretable Features from Claude 3 Sonnet

1 minute read

2024 #Content/Blog by Anthropic.

Toy Models of Superposition

3 minute read

2022 #Content/Blog by Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Chris Olah.

Transformer

1 minute read

The term “transformer” doesn’t have a fully precise definition, but in general is used to refer to any neural sequence-to-sequence model where the only interaction between positions is through a sequence of multi-head attention layers that iteratively update a central embedding called a residual stream by addition.

portfolio

Assorted University Projects

Assessed work from BEng and MSc degrees

MSc Project: On Tour

Harnessing Social Data for Travel Recommendation

Minimal Hanabi Emulator

A minimal Hanabi emulator, written in pure Python.

publications

Use of Machine Vision and Intelligent Data Processing Algorithms to Monitor and Predict Crop Growth in Vertical Farms

Published in , 2018

Exerpt.

Recommended citation: Bewley, Tom. "Use of Machine Vision and Intelligent Data Processing Algorithms to Monitor and Predict Crop Growth in Vertical Farms." BEng Thesis, University of Bristol, 2018. https://bit.ly/2MHrhWb

On the Combination of Gamiﬁcation and Crowd Computation in Industrial Automation and Robotics Applications

Published in 2019 International Conference on Robotics and Automation (ICRA), 2019

A theoretical framework for embedding robotics problems into video games, enabling distributed control by a crowd of players.

Recommended citation: Bewley, Tom, and Liarokapis, Minas. "On the Combination of Gamification and Crowd Computation in Industrial Automation and Robotics Applications." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019. /files/gamification.pdf

On Tour: Harnessing Social Tourism Data for City and Point-of-Interest Recommendation

Published in , 2019

Exerpt.

Recommended citation: Bewley, Tom. "On Tour: Harnessing Social Tourism Data for City and Point-of-Interest Recommendation." MSc Thesis, University of Bristol, 2019. /files/msc-thesis.pdf

On Tour: Harnessing Social Tourism Data for City and Point of Interest Recommendation

Published in DSRS-Turing 2019: 1st International ‘Alan Turing’ Conference on Decision Support and Recommender Systems, 2019

A data-driven recommender system for tourism.

Recommended citation: Bewley, Tom, and Carrascosa, Ivan Palomares. "On Tour: Harnessing Social Tourism Data for City and Point of Interest Recommendation." 1st International ‘Alan Turing’ Conference on Decision Support and Recommender Systems (DSRS-Turing 2019). 2019. https://www.researchgate.net/profile/Tom_Bewley2/publication/338581229_On_Tour_Harnessing_Social_Tourism_Data_for_City_and_Point_of_Interest_Recommendation/links/5e1dd8e6458515d2b46d3eb6/On-Tour-Harnessing-Social-Tourism-Data-for-City-and-Point-of-Interest-Recommendation.pdf

Am I Building a White Box Agent or Interpreting a Black Box Agent?

Published in arXiv, 2020

Describing the various ways of training and evaluating white box approximations of black box policies, and how these can be in conflict.

Recommended citation: Bewley, Tom. "Am I Building a White Box Agent or Interpreting a Black Box Agent?" arXiv preprint 2007.01187. 2020. https://arxiv.org/abs/2007.01187

Modelling Agent Policies with Interpretable Imitation Learning

Published in Trustworthy AI - Integrating Learning, Optimization and Reasoning (also 1st TAILOR Workshop at ECAI 2020), 2020

Introducing I2L and preliminary results.

Recommended citation: Bewley T., Lawry J., Richards A. (2021) Modelling Agent Policies with Interpretable Imitation Learning. In: Heintz F., Milano M., O'Sullivan B. (eds) Trustworthy AI - Integrating Learning, Optimization and Reasoning. TAILOR 2020. Lecture Notes in Computer Science, vol 12641. Springer, Cham. https://link.springer.com/chapter/10.1007%2F978-3-030-73959-1_16

TripleTree: A Versatile Interpretable Representation of Black Box Agents and their Environments

Published in 35th AAAI Conference on Artificial Intelligence (AAAI 2021), 2020

Introducing a new decision tree model of black box agent behaviour, which jointly captures the policy, value function and temporal dynamics.

Recommended citation: Bewley, Tom and Lawry, Jonathan. "TripleTree: A Versatile Interpretable Representation of Black Box Agents and their Environments" 35th AAAI Conference on Artificial Intelligence (AAAI 2021). 2021. https://arxiv.org/abs/2009.04743

Interpretable Preference-based Reinforcement Learning with Tree-Structured Reward Functions

Published in 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2022), 2021

An online, active learning algorithm that uses human preferences to construct reward functions with intrinsically interpretable, compositional tree structures.

Recommended citation: Bewley, Tom and Lecue, Freddy. "Interpretable Preference-based Reinforcement Learning with Tree-Structured Reward Functions" Proc. of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2022). 2022. http://arxiv.org/abs/2112.11230

Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning

Published in Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022), 2022

We generalise reward modelling for reinforcement learning to handle non-Markovian rewards, and propose new interpretable multiple instance learning models for this problem.

Recommended citation: Early, Joseph, Tom Bewley, Christine Evers, and Sarvapali Ramchurn. "Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning" Proc. of the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022). 2022. https://arxiv.org/abs/2205.15367

Summarising and Comparing Agent Dynamics with Contrastive Spatiotemporal Abstraction

Published in XAI-IJCAI22 Workshop, 2022

A data-driven, model-agnostic technique for generating a human-interpretable summary of the salient points of contrast within an evolving dynamical system.

Recommended citation: Bewley, Tom and Lawry, Jonathan and Richards, Arthur. "Summarising and Comparing Agent Dynamics with Contrastive Spatiotemporal Abstraction" XAI-IJCAI22 Workshop. 2022. https://arxiv.org/abs/2201.07749

Reward Learning with Trees: Methods and Evaluation

Published in arXiv, 2022

We show that reward learning with tree models can be competitive with neural networks, and demonstrate some of its interpretability benefits.

Recommended citation: Bewley, Tom, Jonathan Lawry, Arthur Richards, Rachel Craddock, and Ian Henderson. "Reward Learning with Trees: Methods and Evaluation" arXiv preprint 2210.01007. 2022. https://arxiv.org/abs/2210.01007

Learning Interpretable Models of Aircraft Handling Behaviour by Reinforcement Learning from Human Feedback

Published in AIAA SciTech Forum, 2023

We show that reward learning with tree models can be competitive with neural networks in an aircraft handling domain, and demonstrate some of its interpretability benefits.

Recommended citation: Bewley, Tom, Jonathan Lawry, and Arthur Richards. "Learning Interpretable Models of Aircraft Handling Behaviour by Reinforcement Learning from Human Feedback" AIAA SciTech Forum. 2023. https://arxiv.org/abs/2305.16924

Conversative World Models

Published in Seventh Workshop on Generalization in Planning, 2023

We propose methods for improving the performance of zero-shot RL methods when trained on sub-optimal offline datasets.

Recommended citation: Jeen, Scott and Tom Bewley and Jonathan M. Cullen. "Conversative World Models" Seventh Workshop on Generalization in Planning. 2023. https://openreview.net/pdf?id=6oPZcdzFjW

Tree Models for Interpretable Agents

Published in , 2024

Exerpt.

Recommended citation: Bewley, Tom. "Tree Models for Interpretable Agents." PhD Thesis, University of Bristol, 2024. https://research-information.bris.ac.uk/en/studentTheses/tree-models-for-interpretable-agents

Counterfactual Metarules for Local and Global Recourse

Published in International Conference on Machine Learning (ICML), 2024

A model-agnostic method for learning local and global counterfactual explanations in the form of generalised rules.

Recommended citation: Bewley, Tom, Salim I. Amoukou, Saumitra Mishra, Daniele Magazzeni, and Manuela Veloso. "Counterfactual Metarules for Local and Global Recourse" International Conference on Machine Learning (ICML). 2024. https://arxiv.org/abs/2405.18875

readings

Weekly Readings #1

19 minute read

Published: October 13, 2019

As it stands I’m precisely 13 days into my PhD, which means a lot of reading, and I thought I’d kick this blog off with a weekly rolling ‘diary’ of things I read, watch and otherwise consume which may have some influence on my PhD topic. Most of the papers have words pertaining to explanation in there, and that’s because I did a massive scrape of papers with that keyword. I figured that would be a reasonable start.

Weekly Readings #2

23 minute read

Published: October 20, 2019

Approximately three weeks in, I’m starting to work on a case study project that will allow me to explore some of the key ideas around multi-agent explainability – collision avoidance within a population of autonomous vehicles on road / track networks. As a result, more of my reading this week has focused specifically on the multi-agent context.

Weekly Readings #3

8 minute read

Published: October 27, 2019

This week didn’t involve very much reading since I focused instead on my practical investigation of the traffic coordination problem. Nonetheless, I encountered a variety of fascinating ideas.

Weekly Readings #4

11 minute read

Published: November 03, 2019

Decision trees for state space segmentation; lightweight manual labelling as a ‘seed’ for interpretability; the dangerous of homogenous distributed control; AI and the climate crisis.

Weekly Readings #5

18 minute read

Published: November 10, 2019

The theory of why-questions; fidelity versus accuracy; trees and programs as RL policies; partially-interpretable hybrids.

Weekly Readings #6

9 minute read

Published: November 17, 2019

Modelling other agents; DAGGER; evaluating feature importance visualisations; self, soul and circular ethics.

Weekly Readings #7

8 minute read

Published: November 24, 2019

Meta learning causal relations; decomposing explanation questions; misleading explanations; the critical influence of metrics.

Weekly Readings #8

17 minute read

Published: December 01, 2019

State representation learning; emotions and qualitative regions for heuristic explanation; causal reasoning as a middle ground between statistics and mechanics; deep learning and neuroscientific discovery.

Weekly Readings #9

11 minute read

Published: December 08, 2019

Model extraction; world models and representations; a MAS taxonomy.

Weekly Readings #10

7 minute read

Published: December 14, 2019

State representation learning in Atari; AI shortcuts and ethical debt; cloning swarms.

Weekly Readings #11

11 minute read

Published: December 22, 2019

Distillation and cloning; onboard swarm evolution; The Mind’s I chapters.

Weekly Readings #12

14 minute read

Published: December 29, 2019

Goal hierarchies as rule sets; mutual information and auxiliary tasks for representation learning; model-based understanding.

Weekly Readings #13

14 minute read

Published: January 12, 2020

Theory-of-mind as a general solution; factual and counterfactual explanation; semantic development in neural networks; cloning without action knowledge; intuition pumps.

Weekly Readings #14

15 minute read

Published: January 19, 2020

Integrating knowledge and machine learning; folk psychology and intentionality; soft decision trees; conceptual spaces.

Weekly Readings #15

12 minute read

Published: January 26, 2020

Confident execution framework; explananda as differences; online decision tree induction; hybrid AI design patterns.

Weekly Readings #16

11 minute read

Published: February 02, 2020

Imitation by coaching; GAIL; human-centric vs robot-centric; DeepMimic.

Weekly Readings #17

12 minute read

Published: February 09, 2020

Using $Q$ for imitation; differentiable decision trees and their application to RL; interactive explanations with Glass-Box.

Weekly Readings #18

22 minute read

Published: February 16, 2020

Formalising interpretation and explanation; operationally-meaningful representations; Conceptual Spaces book.

Weekly Readings #19

10 minute read

Published: February 23, 2020

Symbols and cognition; robust AI through hybridisation; causal modelling via RL interventions; environment as an engineered system.

Weekly Readings #20

7 minute read

Published: March 01, 2020

Rule-based regularisation; DeepSHAP for augmenting GAN training; image schemas as conceptual primitives; imitating DDPG with a fuzzy rule-based system.

Weekly Readings #21

12 minute read

Published: March 15, 2020

Constraining embeddings with side information; latent actions; RL with abstract representations and models; fuzzy state prototypes; index-free imitation.

Weekly Readings #22

7 minute read

Published: March 22, 2020

Explanation-based tuning; saliency maps for vision-based policies; RL with differentiable decision trees.

Weekly Readings #23

10 minute read

Published: March 29, 2020

Explanatory debugging; latent canonicalisations; the perceptual user interface; automatic curriculum learning.

Weekly Readings #24

11 minute read

Published: April 19, 2020

Trustworthy AI; unifying imitation and policy gradient; soft decision trees; SRL with dimension specialisation.

Weekly Readings #25

13 minute read

Published: April 26, 2020

Imitation learning using value or reward.

Weekly Readings #26

17 minute read

Published: May 03, 2020

Terminological quagmires; (mis)interpreting interpolation; interaction is paramount.

talks

Scary Black Boxes: Why Explanation Lies at the Heart of Socially-responsible AI

Published: October 03, 2019

PechaKucha talk introducing the motivation behind my PhD topic, as part of a one-day conference on socially-responsible AI.

On Tour: Harnessing Social Tourism Data for City and Point of Interest Recommendation

Published: November 21, 2019

Talk presenting an overview of my MSc thesis project, as part of the 1st International ‘Alan Turing’ conference on Decision Support and Recommender Systems.

Explainable AI for Black Box Autonomous Agents

Published: August 27, 2020

Talk presenting an overview of my PhD research so far, as part of a one-day conference showcasing student work within the School of Computer Science, Electrical and Electronic Engineering, and Engineering Maths (SCEEM) at the University of Bristol.

Modelling Agent Policies with Interpretable Imitation Learning

Published: September 05, 2020

Presentation of my paper of the same name, which can be found here.

Towards Explanatory Interactive Reinforcement Learning for Aligned and Trustworthy Agents

Published: January 18, 2021

Recording of a talk delivered to fellow Bristol researchers on 18/01/21, as part of a one-day symposium on Combining Knowledge and Data.

teaching

Engineering Design: Design Project 2

Undergraduate course, Faculty of Engineering, University of Bristol, 2019

Technical assistance for second year concept generation and MATLAB modelling project.

Engineering Design: Design Project 4

Undergraduate course, Faculty of Engineering, University of Bristol, 2019

Technical assistance for fourth year research and modelling activities, which form the first half of the centrepiece two-year group design project in the latter half of the Engineering Design programme.