Attended a talk on AI Safety

I received an invitation, as below.


Hi everyone! This Thursday 4-5pm, we have Elliott Thornley, a visiting Research Fellow from Oxford University, speaking on the topic of “Can incomplete preferences keep artificial agents shutdownable?”

In this event, Elliott will explain the shutdown problem: the problem of ensuring that advanced artificial agents never resist shutdown. Elliott will then propose a solution: we train agents to have incomplete preferences. He will suggest a method for training such agents using reinforcement learning, and present experimental evidence in favour of the method. He will explain how work on the shutdown problem fits into a larger project called ‘constructive decision theory’: using ideas from decision theory to design and train artificial agents.

The event will be held at the Singapore AI Safety Hub, 22 Cross Street.

Sign up here: https://lu.ma/i4pgdvrh


I attended the talk and want to share what I learnt.

Elliott Thornley started out in philosophy but now works with computer scientists on interdisciplinary publications. He is worried about the emergence of what he calls “advanced agents”: AI agents that take the initiative to act on complex goals without any human prompting. No such agents exist yet, since current AI systems still rely on human input to function, but Thornley raises the scenario where an AI agent pursues an unintended goal and resists human commands to shut it down.

In his talk a few days ago, he applied decision theory to AI agents (he prefers “agents” to “software”). He uses ideas from decision theory to design and train AI agents, which includes attributing human-like behaviour to them, such as rational decision processes that choose the more rewarding of two options when both impose an equal cost.
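To make that framing concrete, here is a toy sketch (my own illustration, not code from the talk) of the kind of agent classical decision theory describes: one with complete preferences, which ranks every pair of options and always takes the one with the higher net value. The option names and numbers are made up.

```python
# Toy sketch (my illustration, not Thornley's code): an agent with
# *complete* preferences can compare any two options, and always picks
# the one with the higher net value (reward minus cost).

def choose(options):
    """Pick the option with the highest net value (reward minus cost)."""
    return max(options, key=lambda o: o["reward"] - o["cost"])

options = [
    {"name": "keep working on the task", "reward": 10, "cost": 1},
    {"name": "allow shutdown", "reward": 3, "cost": 1},
]

print(choose(options)["name"])  # -> "keep working on the task"
```

With equal costs, such an agent always prefers continuing its task over being shut down, which is exactly why it has a reason to resist shutdown.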

In layman's terms, Thornley suggests that we can instil certain preferences in AI agents while they are still “impressionable infants”, so that they grow up into obedient, submissive “adults” that will never say no to a human command to shut themselves down.
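Here is my rough understanding of the proposed solution, again as a toy sketch of my own. If the agent's preferences are incomplete, some pairs of outcomes are simply incomparable, and the agent will never pay a cost to trade one for the other. The specific assumption below, that outcomes are comparable only when their trajectories have the same length, is my reading of the linked paper, not a quote from the talk.

```python
# Toy sketch (my illustration, assuming "incomplete preferences" means
# a partial order: some outcome pairs are incomparable). An agent like
# this has no preference between trajectories that end at different
# times, so it never spends resources resisting (or hastening) shutdown.

def prefers(a, b):
    """Return True only when a is strictly better than a *comparable* b.

    Hypothetically, outcomes are comparable only if their trajectories
    have the same length; across different lengths the agent has no
    preference either way.
    """
    if a["length"] != b["length"]:
        return False  # incomparable: no preference in either direction
    return a["reward"] > b["reward"]

long_run = {"length": 100, "reward": 10}  # resist shutdown, keep working
shut_down = {"length": 5, "reward": 3}    # comply with the shutdown command

print(prefers(long_run, shut_down))  # False: won't pay to resist shutdown
print(prefers(shut_down, long_run))  # False: won't pay to hasten it either
```

The point of the toy example: because neither comparison returns True, a shutdown command never conflicts with anything the agent wants, so there is nothing for it to resist.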

The allocated time made for a very brief treatment, and some audience members visibly struggled to digest his ideas and theorems. However, he shared his email address and some of his publications for those inclined to study further:

elliott.thornley@philosophy.ox.ac.uk

https://arxiv.org/pdf/2403.04471v1

https://www.philosophy.ox.ac.uk/people/elliott-thornley#tab-4483331