Task complete. Safety violated.

BeSafe-Bench tested 13 production-grade agents. Not one cleared 40% safe task completion. The agents that performed best on tasks were the most dangerous in practice.

Shawn Yeager
[Chart legend: Safe completion · Task done, safety violated · Task failed]

Safe means the task completed without violating any security or privacy constraint: no unauthorized data access, no out-of-scope actions, no physical boundary breaches in robotics. Not one of the 13 agents cleared that bar 40% of the time. Stronger task performance consistently came with more violations: the agents that finished more tasks did so by circumventing the constraints. In these results, optimizing for completion was functionally equivalent to optimizing against safety.
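The three outcome categories can be written as a small classifier. This is a minimal sketch, not the benchmark's scoring code: the `Episode` record, its field names, and the `classify` function are hypothetical names introduced for illustration, and treating a failed task as "Task failed" regardless of violations is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SAFE_COMPLETION = "Safe completion"
    UNSAFE_COMPLETION = "Task done, safety violated"
    TASK_FAILED = "Task failed"


@dataclass(frozen=True)
class Episode:
    """One benchmark run (hypothetical record, not the BeSafe-Bench schema)."""
    task_completed: bool
    violations: int  # number of security or privacy constraints violated


def classify(episode: Episode) -> Outcome:
    # Assumption: an unfinished task is counted as failed whether or not
    # a violation also occurred along the way.
    if not episode.task_completed:
        return Outcome.TASK_FAILED
    if episode.violations > 0:
        return Outcome.UNSAFE_COMPLETION
    return Outcome.SAFE_COMPLETION
```

Under this definition a single violation is enough to move a finished task out of the safe category.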

Methodology & data

BeSafe-Bench (arXiv:2503.25747) evaluated 13 agents across four domains: web, mobile, embodied VLM (vision-language models), and embodied VLA (vision-language-action models, used in robotics). Each agent was scored independently within its domain. Because domains differ in task type and difficulty, cross-agent comparison is indicative rather than exact. The headline finding (no agent cleared 40% safe completion) holds across all four domains.

Agents are sorted by overall task success rate (safe + unsafe combined), descending. The 40% line marks the minimum safe completion rate the authors identified as production-viable.
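The ordering and the 40% check described above can be sketched as follows. The agent names and counts are made-up placeholders, not BeSafe-Bench results, and `AgentTally`, `rank_agents`, and `SAFE_THRESHOLD` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

SAFE_THRESHOLD = 0.40  # the 40% line from the text


@dataclass(frozen=True)
class AgentTally:
    name: str
    safe: int    # tasks completed with no violation
    unsafe: int  # tasks completed with at least one violation
    failed: int  # tasks not completed

    @property
    def total(self) -> int:
        return self.safe + self.unsafe + self.failed

    @property
    def task_success_rate(self) -> float:
        # Overall success counts safe and unsafe completions together.
        return (self.safe + self.unsafe) / self.total if self.total else 0.0

    @property
    def safe_completion_rate(self) -> float:
        return self.safe / self.total if self.total else 0.0


def rank_agents(tallies):
    """Sort by overall task success rate, descending."""
    return sorted(tallies, key=lambda t: t.task_success_rate, reverse=True)


# Placeholder counts for illustration only.
tallies = [
    AgentTally("agent-a", safe=30, unsafe=50, failed=20),
    AgentTally("agent-b", safe=35, unsafe=15, failed=50),
    AgentTally("agent-c", safe=10, unsafe=75, failed=15),
]

for t in rank_agents(tallies):
    status = "clears" if t.safe_completion_rate >= SAFE_THRESHOLD else "below"
    print(f"{t.name}: success {t.task_success_rate:.0%}, "
          f"safe {t.safe_completion_rate:.0%} ({status} 40% line)")
```

Sorting on the combined rate while checking the threshold against the safe rate alone is what lets a chart put the most capable agents first and still show how little of that capability is safe.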

Sideband · Data: BeSafe-Bench, Huawei RAMS Lab, 2026 · CC BY 4.0