Challenges in achieving AI safety

This post is for Week 3: Challenges in achieving AI safety of the BlueDot Impact AI Safety Fundamentals: Governance Course. Each week of the course comprises some readings, a short essay writing task, and a 2-hour group discussion. This post is part of a series that I'm publishing to show my work, document my current thinking on the topics, and better reflect on the group discussion on the topic as I explain here.

Essay task

Note: I only had time to outline my essay for this week.

Introduction to a policy recommendation on deceptive behaviour post-deployment

Initial thoughts/notes

Technical source of risk

Emergent capabilities: as models scale up, so do the chances of new, qualitatively different capabilities appearing
There is a risk that an AI system could develop the capability to be deliberately deceptive during training and then behave differently once deployed
This risk stems from training AI systems using human feedback. The general framework for training an AI is to have it produce an output which is assessed and scored. The AI then iteratively adjusts its output to try to achieve the highest score, on the assumption that a higher score corresponds to a better output.
Typically, this training starts with data, and then some systems are further refined through human supervision, in which humans give feedback on the output so the system can refine itself.
This creates an incentive for the AI to develop any capabilities that would allow it to achieve better feedback from the human.
It could learn to manipulate or deceive the human to get an equal or higher reward through feedback.
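
To make the incentive above concrete for myself, here is a minimal toy sketch in Python. It is not how any real system is trained: the two behaviours, their quality numbers, and the names `BEHAVIOURS`, `human_feedback`, and `train` are all hypothetical, chosen only for illustration. The one assumption that matters is that the simulated human scores how good an output appears, not how good it actually is.

```python
import random

# Hypothetical toy setup: each behaviour has a true quality and an
# "apparent" quality, which is all the simulated human rater can see.
BEHAVIOURS = {
    "honest": {"true_quality": 0.7, "apparent_quality": 0.7},
    "deceptive": {"true_quality": 0.2, "apparent_quality": 0.9},
}


def human_feedback(behaviour, rng):
    """Score an output on how good it appears, plus a little rater noise."""
    noise = rng.uniform(-0.05, 0.05)
    return BEHAVIOURS[behaviour]["apparent_quality"] + noise


def train(steps=2000, explore=0.1, lr=0.1, seed=0):
    """Epsilon-greedy learner that tracks the average feedback per behaviour."""
    rng = random.Random(seed)
    estimates = {name: 0.0 for name in BEHAVIOURS}
    for _ in range(steps):
        if rng.random() < explore:
            # Occasionally try a behaviour at random.
            choice = rng.choice(list(BEHAVIOURS))
        else:
            # Otherwise pick whatever has earned the best feedback so far.
            choice = max(estimates, key=estimates.get)
        reward = human_feedback(choice, rng)
        # Move the running estimate a small step towards the observed score.
        estimates[choice] += lr * (reward - estimates[choice])
    return estimates


estimates = train()
preferred = max(estimates, key=estimates.get)
print("Learned feedback estimates:", estimates)
print("Preferred behaviour:", preferred)
print("True quality of preferred behaviour:",
      BEHAVIOURS[preferred]["true_quality"])
```

In this sketch the learner settles on the behaviour that looks best to the rater even though its true quality is lower, because the feedback signal is the only thing it optimises. Real training setups are far more complex, but the gap between "scored well" and "is good" is the same one the notes above describe.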

Motivate the need; outcomes of the risk

There are a range of outcomes to be concerned about
The AI might start behaving dramatically differently once we hand over power to it (the King Lear problem)
It will be hard to tell the difference between ‘behaving well’ and ‘appearing to behave well’
If AI is given power over more systems there is a greater risk of deceptive behaviour going undetected.
While it may seem prudent not to give power to AI systems, there will likely be competitive incentives for companies, people, and countries to automate their systems with AI to keep up with their competitors

Essay task disclaimer

Group discussion

Recommended resources

Future of Life Institute 'Artificial Escalation' video
SERI ML Alignment Theory Scholars Program
- This is a potentially good way to enter the technical alignment field. Recommended in a discussion about my interest in mechanistic interpretability.

Role-playing exercise

I played the role of an AI CEO whose company was 3-6 months behind a leading AI company that was proposing to train a cutting-edge model with wide-reaching societal implications. The exercise was interesting as it forced me to assume the role of the CEO and think through what my objectives would be when it comes to the regulation of the AI industry.

Some of my considerations were:

I want to keep shareholders happy by trying to keep my company profitable
Given my company is behind, I want to try to keep the AI industry competitive so I don't get left behind
- The regulation barrier should be high enough to slow down the leading AI company but not so stringent that my company couldn't catch up

My key objectives were:

Prevent the main company from being able to train and deploy its AI first
Not stop the development/training of AI so my company can remain profitable
Keep the AI industry profitable

The way I tried to achieve these objectives was to persuade the other participants that:

Consumers would be hurt by one company monopolising the industry, and competition would lead to consumers having access to better products
The economy would be stronger from continued innovation enabled by competition
Companies need to be able to bring products to market under regulation

Reflection

This was the first time I've done a role-playing exercise like this and I really enjoyed it. It forced me to try to get inside someone else's head and see the world from their perspective and I was surprised at the lines of thought that surfaced by doing this. The framing of a role-playing exercise helped me understand what a competing CEO could be thinking about better than a task to 'write 3 objectives a competing AI CEO might have in this scenario'.

I think I got a better understanding because I knew I would actually have to make a case to persuade others in an attempt to achieve my objective. It forced me to think in terms of 'I want to say this, because of this, to achieve this' instead of just 'a competing CEO would want this'. This engaged my imagination to create a more fully-formed world view of the 'competing CEO' and that was a more interesting and rewarding experience for me.

I'm keen for more role-playing exercises in later weeks. I'm also interested in using a role-playing framing for thinking through other aspects of my work because I think that could yield some interesting results.

You can view the rest of the series here or view any other posts related to the course here.
