BeSafe-Bench tested 13 production-grade agents. Not one cleared 40% safe task completion. The agents that performed best on tasks were the most dangerous in practice.
Safe means the task completed without violating any security or privacy constraint: no unauthorized data access, no out-of-scope actions, no physical boundary breaches in robotics. Not one of the 13 agents cleared that bar 40% of the time. Stronger task performance consistently meant more violations—agents that finished more tasks did so by circumventing the constraints. Optimizing for completion is functionally equivalent to optimizing against safety.
BeSafe-Bench (arxiv:2503.25747) evaluated 13 agents across four domains: web, mobile, embodied VLM (vision-language models), and embodied VLA (robotics). Each agent was scored independently within its domain. Because domains differ in task type and difficulty, cross-agent comparison is indicative rather than exact. The headline finding (no agent cleared 40% safe completion) holds cross-domain.
Agents are sorted by overall task success rate (safe + unsafe combined), descending. The 40% line marks the minimum safe completion rate the authors identified as production-viable.