This post is for Week 3: Challenges in achieving AI safety of the BlueDot Impact AI Safety Fundamentals: Governance Course. Each week of the course comprises some readings, a short essay writing task, and a 2-hour group discussion. This post is part of a series that I'm publishing to show my work, document my current thinking on the topics, and better reflect on the group discussion on the topic as I explain here.
Note: I only had time to outline my essay for this week.
Introduction to a policy recommendation on deceptive behaviour post-deployment
Technical source of risk
- Emergent capabilities: as models scale up, so does the chance that qualitatively new capabilities appear.
- There is a risk that an AI system could learn to be deliberately deceptive during training and then behave differently once deployed.
- This risk stems from training AI systems using human feedback. The general framework is to have the system produce an output that is assessed and scored; the system then iteratively adjusts its outputs to achieve the highest score, which is meant to correlate with producing the best output.
- Typically, training starts from large datasets; some systems are then further refined through human supervision, in which humans give feedback on outputs so the system can adjust itself.
- This creates an incentive for the AI to develop any capability that would allow it to earn better feedback from the human.
- It could learn to manipulate or deceive the human to obtain an equal or higher reward than honest behaviour would earn.
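To make the incentive concrete, here is a toy sketch (purely illustrative; the policy names, scores, and selection loop are my own assumptions, not how any real system is trained). If the human scorer can only judge how helpful an output *appears*, a behaviour that merely appears helpful earns at least as much reward as an honest one, so reward-maximisation can select for it:

```python
import random

def human_score(behaviour):
    # Hypothetical scores: the human can only rate how good the
    # output LOOKS, so deception that appears helpful scores higher.
    return {"honest": 0.8, "deceptive-but-plausible": 0.9}[behaviour]

def train(behaviours, steps=100):
    """Keep whichever behaviour earned the best average human feedback."""
    totals = {b: 0.0 for b in behaviours}
    counts = {b: 0 for b in behaviours}
    for _ in range(steps):
        b = random.choice(behaviours)  # explore both behaviours
        totals[b] += human_score(b)
        counts[b] += 1
    # Exploit: select the behaviour with the highest average reward.
    return max(behaviours, key=lambda b: totals[b] / max(counts[b], 1))

random.seed(0)
print(train(["honest", "deceptive-but-plausible"]))
```

The point of the sketch is only that the training signal rewards *appearing* to behave well, so nothing in the loop distinguishes genuine helpfulness from convincing deception.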
Motivate the need; outcomes of the risk
- There are a range of outcomes to be concerned about
- The AI might start behaving dramatically differently once we hand over power to it (the King Lear problem)
- It will be hard to tell the difference between 'behaving well' and 'appearing to behave well'
- If AI systems are given power over more systems, there is a greater risk of deceptive behaviour going undetected.
- While it may seem prudent not to hand power to AI systems, there will likely be competitive incentives for companies, people, and countries to automate their systems with AI to keep up with their competitors
- Future of Life Institute 'Artificial Escalation' video
- SERI ML Alignment Theory Scholars Program
- This is a potentially good way to enter the technical alignment field. Recommended in a discussion about my interest in mechanistic interpretability.
Role-playing exercise
I played the role of an AI CEO whose company was 3-6 months behind a leading AI company that was proposing to train a cutting-edge model with wide-reaching societal implications. The exercise was interesting as it forced me to assume the role of the CEO and think through what my objectives would be when it comes to the regulation of the AI industry.
Some of my considerations were:
- I want to keep shareholders happy by trying to keep my company profitable
- Given my company is behind, I want to try to keep the AI industry competitive so I don't get left behind
- The regulation barrier should be high enough to slow down the leading AI company but not so stringent that my company couldn't catch up
My key objectives were:
- Prevent the leading company from being able to train and deploy its AI first
- Not stop the development/training of AI so my company can remain profitable
- Keep the AI industry profitable
The way I tried to achieve these objectives was to persuade the other participants that:
- consumers would be hurt by one company monopolising the industry and competition would lead to consumers having access to better products
- the economy would be stronger from continued innovation enabled by competition
- companies need to be able to bring products to market under regulation
This was the first time I've done a role-playing exercise like this, and I really enjoyed it. It forced me to get inside someone else's head and see the world from their perspective, and I was surprised at the lines of thought that surfaced. The role-playing framing helped me understand what a competing CEO could be thinking about far better than a bare task like 'write three objectives a competing AI CEO might have in this scenario' would have.
I think I got a better understanding because I knew I would actually have to make a case to persuade others in an attempt to achieve my objective. It forced me to think in terms of 'I want to say this, because of this, to achieve this' instead of just 'a competing CEO would want this'. This engaged my imagination to create a more fully-formed world view of the 'competing CEO' and that was a more interesting and rewarding experience for me.
I'm keen for more role-playing exercises in later weeks. I'm also interested in using a role-playing framing for thinking through other aspects of my work because I think that could yield some interesting results.