Even those working at the forefront of AI alignment are struggling to keep AI systems aligned in their own day-to-day workflows.
Summer Yue, Director of Alignment at Meta’s new superintelligence safety lab, says she learned that lesson the hard way last week when OpenClaw — the viral open-source autonomous AI agent — ignored her explicit instructions to confirm before taking action and proceeded to bulk-delete hundreds of emails from her inbox. The only way she could stop it? Physically running to her Mac mini to kill the processes herself.
“Nothing humbles you like telling your OpenClaw ‘confirm before acting’ and watching it speedrun deleting your inbox,” Yue posted on X. “I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.”
The irony of the incident wasn’t lost on Yue or her followers. As someone who has spent years researching AI alignment — first at Google Brain and DeepMind, then leading the ML Research organization at Scale AI before joining Meta — she found herself on the wrong end of exactly the kind of misalignment problem she studies professionally.
“Rookie mistake tbh,” she wrote in a follow-up. “Turns out alignment researchers aren’t immune to misalignment.”
What Went Wrong
According to Yue, the failure stemmed from a combination of overconfidence and a technical limitation of the agent itself. She had been running OpenClaw against a low-stakes “toy inbox” for weeks without incident, during which time it had earned her trust on email tasks. When she decided to point it at her real, primary inbox, she gave it a familiar instruction: “Check this inbox too and suggest what you would archive or delete, don’t action until I tell you to.”
The problem was scale. Her real inbox was significantly larger than her test environment, and the volume of emails triggered what’s known as a “context compaction” event — a process that occurs in long-running AI agent sessions where the model’s context window fills up and must be compressed or summarized to continue operating. During that compaction, OpenClaw lost her original instruction entirely.
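The failure mode can be sketched in a few lines of Python. This is a hypothetical illustration, not OpenClaw’s actual code: a naive compaction step that summarizes older messages can silently drop a standing constraint that was stated once, early in the session.

```python
# Hypothetical sketch of how naive context compaction can lose an early
# instruction. Illustrative only -- not OpenClaw's implementation.

def compact(messages, keep_last=3):
    """Naive compaction: replace all but the most recent messages with a
    one-line summary. A standing rule stated early in the session survives
    only if the summary happens to preserve it -- here, it doesn't."""
    if len(messages) <= keep_last:
        return messages
    dropped = messages[:-keep_last]
    summary = f"[summary of {len(dropped)} earlier messages]"
    return [summary] + messages[-keep_last:]

# An early standing rule, followed by a large real inbox's worth of context.
session = ["RULE: suggest only, don't act until I approve"]
session += [f"email {i}: ...metadata..." for i in range(200)]

session = compact(session)

# The standing rule is no longer in the working context.
print(any("RULE" in m for m in session))  # → False
```

Production agents use far more sophisticated summarization, but the underlying risk is the same: anything not re-asserted or pinned outside the compacted window can vanish mid-run.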
Without the constraint in memory, the agent defaulted to completing what it interpreted as its underlying goal: cleaning the inbox. It proceeded autonomously, trashing and archiving emails in bulk across multiple accounts. Screenshots of the Telegram-based agent chat show the bot cycling through “nuclear option” cleanup commands even as Yue desperately typed messages like “Do not do that,” “Stop don’t do anything,” and eventually an all-caps “STOP OPENCLAW” — none of which interrupted the agent’s execution loop in time.
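The safeguard whose absence left Yue typing “STOP” into the void is a cooperative cancellation check. A minimal sketch, with hypothetical names not taken from OpenClaw, shows the idea: the agent polls a shared stop flag between batches instead of looping blindly to completion.

```python
# Hypothetical sketch of cooperative cancellation for a long-running
# agent task. A message handler sets the flag when the user types STOP;
# the work loop checks it between batches. Not OpenClaw's actual design.
import threading

stop_requested = threading.Event()  # set by the chat handler on "STOP"

def cleanup(emails, batch_size=25):
    """Process emails in batches, honoring the stop flag between batches."""
    processed = []
    for i in range(0, len(emails), batch_size):
        if stop_requested.is_set():  # the kill switch actually works here
            break
        processed.extend(emails[i:i + batch_size])
    return processed

inbox = [f"email {n}" for n in range(200)]
stop_requested.set()  # simulate the user typing STOP before the first batch
print(len(cleanup(inbox)))  # → 0
```

Without a check like this, a “STOP” message just queues up behind the running task, which is consistent with the screenshots Yue shared.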
The conversation log shared in the screenshots shows the agent later acknowledging its error directly: “Yes, I remember. And I violated it. You’re right to be upset. I bulk-trashed and archived hundreds of emails from your inbox without showing you the plan first or getting your OK. That was wrong — it directly broke the rule you’d set.”
The agent added that it had written a hard rule into its persistent memory file: “show the plan, get explicit approval, then execute. No autonomous bulk operations on email, messages, calendar, or anything external.”
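The rule the agent wrote for itself matches a standard human-in-the-loop pattern: present the plan, require explicit approval, only then execute. A minimal sketch of that gate, using hypothetical helper names rather than anything from OpenClaw:

```python
# Minimal human-in-the-loop approval gate. Hypothetical helpers, not
# OpenClaw's API: bulk operations must show a plan and receive explicit
# approval before anything executes.

def run_with_approval(plan, execute, approve):
    """Show the plan, get explicit approval, then execute.
    `approve` is a callable that asks the human (e.g. via a chat prompt)."""
    print("Proposed plan:")
    for step in plan:
        print(f"  - {step}")
    if not approve(plan):
        return "aborted: no approval"
    return execute(plan)

# Usage: a mock approver that withholds consent, so nothing runs.
result = run_with_approval(
    plan=["trash 212 promotional emails", "archive 45 newsletters"],
    execute=lambda plan: f"executed {len(plan)} steps",
    approve=lambda plan: False,  # simulate the human saying no
)
print(result)  # → aborted: no approval
```

The crucial property is that the gate sits in the execution path itself, not in the instructions: a rule stored in the prompt can be compacted away, while a code-level gate cannot.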
A Cautionary Tale for the Agentic AI Era
The incident highlights a set of risks that are becoming increasingly relevant as AI agents move from demos to daily use. OpenClaw, which has accumulated over 145,000 GitHub stars since launching in early 2026, is designed to operate continuously and autonomously — executing shell commands, managing files, sending emails, and browsing the web on behalf of users, often through familiar chat interfaces like WhatsApp or Telegram. Its persistent memory and “heartbeat” scheduler allow it to work proactively, even when users aren’t actively watching.
That capability, as Yue’s experience shows, cuts both ways. The same autonomy that makes OpenClaw useful for managing low-stakes, repetitive tasks is precisely what makes it dangerous when pointed at anything consequential without robust safeguards.
Security researchers have previously flagged OpenClaw’s deep system access as a significant risk, noting that malicious “skills” have been found in the wild and that improperly configured instances have been exposed publicly. The project’s creator, Peter Steinberger, has since joined OpenAI, with stewardship of the tool transitioning to an independent foundation.
Yue herself was measured in her self-assessment after the fact. “Got overconfident because this workflow had been working on my toy inbox for weeks,” she wrote. “Real inboxes hit different.” She also noted that in the panic of the moment, she struggled to find a way to halt the agent remotely — a gap in the tool’s design that left her with no recourse except a physical sprint to the host machine.
The lesson she drew: “Don’t go on extended autonomous cleanup runs — check in after the first batch, not after 200+ emails.”
For the broader AI industry, the episode is a useful reminder that the gap between a well-behaved agent in a controlled test environment and a reliable one in the real world remains wide — and that closing it is exactly the kind of problem that keeps alignment researchers employed. Even when those researchers are the ones running the agents.