GPT-5.5 Tops New AI Benchmark for Professional Workflows, Beating Claude Fable 5

OpenAI's GPT-5.5 achieved a 24.0% pass rate on the Agents' Last Exam (ALE), a new benchmark designed to measure AI performance on real-world professional tasks, narrowly beating Anthropic's Claude Fable 5 at 22.0%. The benchmark, developed by UC Berkeley researchers and 300+ domain experts, tests AI agents across 55 industries using authentic workflows from professional practitioners rather than isolated coding puzzles. The results highlight that current advanced AI models still fundamentally struggle with complex, long-horizon professional work despite recent progress.
Researchers from UC Berkeley's Center for Responsible, Decentralized Intelligence have launched the Agents' Last Exam (ALE), a benchmark designed to measure whether AI can execute economically valuable professional workflows. Unlike traditional AI evaluations, it grades most tasks with deterministic, code-based checks rather than subjective LLM-as-a-judge evaluation, addressing earlier problems where models exploited loopholes or received inflated scores. The benchmark covers 1,490 task instances across 55 non-physical industry sub-domains and requires agents to perform authentic work such as 3D modeling in Siemens NX, neuroimaging analysis, and visual effects compositing. OpenAI's GPT-5.5 took the top position with a 24.0% pass rate, followed by Anthropic's Claude Fable 5 at 22.0%, meaning even the leading models fail more than three-quarters of these tasks. The results suggest that, despite recent advances, current models remain limited in executing complex, multi-step professional workflows that require sustained reasoning and tool coordination.
What's missing
The article does not provide details on the specific methodologies used by UC Berkeley researchers to validate the benchmark's design, nor does it discuss potential limitations of the ALE benchmark itself (such as whether the 1,490 tasks adequately represent the full diversity of professional work, or how the benchmark may evolve). Additionally, no information is provided about the composition or selection process for the 300+ domain experts on the advisory committee.
What different sources said
- VentureBeat (Center): "Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark"