We Need To Build Trustworthy AI Systems To Monitor Other AI: Yoshua Bengio

It appears that one solution to the problem of misaligned AI systems could be other AI systems themselves.

Recently, Yoshua Bengio, the renowned computer scientist and a pioneer of deep learning, shared some insightful thoughts on mitigating the risks of increasingly powerful AI systems. His perspective highlights a critical challenge: as AI systems become more capable, ensuring their alignment with human values becomes increasingly difficult. Intriguingly, Bengio suggests that a potential solution lies in developing trustworthy AI systems to act as monitors for other, potentially less reliable AIs. This concept presents both exciting possibilities and complex challenges.

“One thing that we don’t know how to do right now is to mitigate the risks of the AI systems that are upcoming,” Bengio explained. “We’ve made a lot of progress in evaluations, but what to do when these systems develop dangerous capabilities is something that we don’t have very good answers for.”

Bengio continued, outlining his team’s approach: “And one thing that my team is working on is how we can build AI systems that will be so trustworthy that these are the systems that we want to use as monitors. In other words, to tell us whether the behavior or the queries that we’re getting in an AI agent that may not be totally trustworthy, are acceptable.” He emphasizes the need for highly reliable AI monitors to assess the behavior and outputs of other, potentially untrustworthy, AI agents.

He then delved into the strategy for developing these trustworthy monitors: “And the approach to make these monitors totally safe and very capable is that we’ve chosen to make them non-agent because all of the risks, all of the scenarios of [loss of] human control arise because of agency. In fact, we can see that as these systems become more agentic, they have more propensity to try to deceive us, escape, and so on as we build more and more powerful AI systems. We basically can’t trust them as much unless we figure out solutions.”
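
To make the idea concrete, here is a minimal, hypothetical sketch of the pattern Bengio describes: a non-agentic monitor that only evaluates and explains, never acts, sitting in front of an agent whose proposed actions are executed only if the monitor finds them acceptable. The class and function names, the keyword heuristic, and the risk threshold are all illustrative assumptions, not details of Bengio’s actual system.

```python
# Illustrative sketch only: a non-agentic monitor gating an agent's proposed actions.
# All names and the scoring heuristic are hypothetical, not Bengio's system.
from dataclasses import dataclass


@dataclass
class Assessment:
    risk: float       # monitor's estimated probability that the action is harmful
    explanation: str  # why the monitor assigned this risk


class NonAgenticMonitor:
    """Evaluates proposed actions but never takes any action itself."""

    def assess(self, context: str, proposed_action: str) -> Assessment:
        # Placeholder heuristic standing in for a trained, trustworthy model.
        risky = any(word in proposed_action.lower()
                    for word in ("delete", "exfiltrate", "disable"))
        return Assessment(
            risk=0.9 if risky else 0.05,
            explanation="matched a risky keyword" if risky else "no risk indicators found",
        )


RISK_THRESHOLD = 0.1  # maximum acceptable estimated risk (illustrative value)


def gated_execute(monitor: NonAgenticMonitor, context: str, proposed_action: str) -> str:
    """Run the agent's proposed action only if the monitor deems it acceptable."""
    verdict = monitor.assess(context, proposed_action)
    if verdict.risk > RISK_THRESHOLD:
        return f"BLOCKED ({verdict.risk:.2f}): {verdict.explanation}"
    return f"EXECUTED: {proposed_action}"


if __name__ == "__main__":
    monitor = NonAgenticMonitor()
    print(gated_execute(monitor, "user asked to tidy workspace", "archive old files"))
    print(gated_execute(monitor, "user asked to tidy workspace", "delete audit logs"))
```

The key property here is structural: the monitor exposes only an assessment interface, so it has no channel through which to act on the world itself.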

Finally, he elaborated on the nature of these non-agent monitors: “And one solution is to build systems that are not imitating us, not trying to please us – that’s basically how we train them right now – but trying to explain, why is it that people are saying the things that they’re saying, or the data that they’re observing. And that then allows [us] to query them about things like these causes… you know, why did this person do that, and latent things that we don’t directly observe, like actual harm, or some safety specification that we care about.” He suggests that these monitors should focus on explaining observed data rather than mimicking human behavior or seeking approval, allowing for deeper insight into underlying causes and potential harms.
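
The idea of querying such a monitor about latent variables, such as whether “actual harm” lies behind an observed behavior, can be illustrated with a toy Bayesian calculation. The sketch below is a simplification with made-up numbers rather than Bengio’s method: it assumes a prior on harmful intent and likelihoods for an observed behavior, computes the posterior probability of harm, and compares it against a risk budget.

```python
# Illustrative sketch only: querying a monitor for a latent "harm" variable via Bayes' rule.
# The priors, likelihoods, and variable names are invented for illustration and are not
# taken from Bengio's work.

def posterior_harm(prior_harm: float, p_obs_given_harm: float, p_obs_given_safe: float) -> float:
    """P(harm | observed behavior) from a prior and two likelihoods (Bayes' rule)."""
    joint_harm = prior_harm * p_obs_given_harm
    joint_safe = (1.0 - prior_harm) * p_obs_given_safe
    return joint_harm / (joint_harm + joint_safe)


# Example query: the observed behavior is "agent requests credentials it does not need".
prior_harm = 0.02        # assumed base rate of harmful intent
p_obs_given_harm = 0.60  # assumed chance of this behavior given harmful intent
p_obs_given_safe = 0.03  # assumed chance of this behavior given benign intent

p = posterior_harm(prior_harm, p_obs_given_harm, p_obs_given_safe)
print(f"P(harm | observation) = {p:.2f}")  # ~0.29 with these made-up numbers

# Decision rule: flag the behavior if the estimated probability of harm exceeds a budget.
RISK_BUDGET = 0.05
print("flag for review" if p > RISK_BUDGET else "acceptable")
```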

Bengio’s proposal to use trustworthy AI systems to oversee other AI suggests a move away from relying solely on human oversight, which becomes harder to sustain as AI systems grow more complex. The approach reflects a growing recognition that traditional methods of control may be insufficient to manage the risks of advanced AI. Building non-agent AI monitors offers a potential pathway to keeping increasingly powerful AI systems aligned with human values and intentions, and to preventing unintended and potentially harmful consequences. However, the practical challenges of building such systems remain substantial: they call for further research into interpretability and robustness, and for assurance that the monitors themselves are free from biases and vulnerabilities. The successful development of such trustworthy AI systems could prove crucial in navigating the complex landscape of advanced AI and harnessing its potential for good.