Technology
17 min read
Gemini 3 Flash Aces APEX-Agents Benchmark for Complex Tasks
Quantum Zeitgeist
January 22, 2026

AI-Generated Summary
Researchers introduced the APEX-Agents benchmark to assess AI's ability to perform complex, professional tasks. The benchmark simulates realistic work environments with files and tools. Gemini 3 Flash achieved the highest score of 24.0% on this new evaluation. Both the benchmark and its evaluation infrastructure are now open-source.
Researchers have unveiled a new benchmark, the AI Productivity Index for Agents (APEX-Agents), designed to rigorously evaluate the capacity of large language models to perform complex, multi-step tasks mirroring those undertaken by professionals in demanding fields. Led by Bertie Vidgen, Austin Mann, and Abby Fennelly, all of Mercor, together with colleagues, this work introduces a challenging testbed requiring models to interact with realistic digital work environments, utilising files and tools to complete long-horizon assignments. The team tested eight leading models, finding that Gemini 3 Flash (Thinking=High) achieved the highest score of 24.0%, demonstrating a significant step forward in AI’s ability to handle practical, professional workloads. Crucially, the APEX-Agents benchmark, comprising 480 tasks with accompanying prompts, rubrics, and gold outputs, and the Archipelago evaluation infrastructure are both being released as open-source resources, promising to accelerate progress in this vital area of artificial intelligence.
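The article reports only the headline figure, but a benchmark built around per-task rubrics and gold outputs is typically summarised as an average of per-task rubric scores. The snippet below is purely a hypothetical sketch of that kind of aggregation; the data layout, task identifiers, and function names are assumptions for illustration, not the benchmark's released scoring code.

```python
from statistics import mean

def task_score(criteria_met: list[bool]) -> float:
    """Fraction of a task's rubric criteria that the agent's output satisfied."""
    return sum(criteria_met) / len(criteria_met)

def benchmark_score(per_task_results: dict[str, list[bool]]) -> float:
    """Average per-task rubric score across all tasks, expressed as a percentage."""
    return 100 * mean(task_score(c) for c in per_task_results.values())

# Toy example with three of the 480 tasks (illustrative values only).
results = {
    "banking_world_01/task_03": [True, False, False, True],
    "consulting_world_07/task_12": [False, False, True],
    "law_world_02/task_05": [True, False],
}
print(f"Aggregate score: {benchmark_score(results):.1f}%")  # 44.4% for these toy values
```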
APEX-Agents Benchmark for Professional AI Task Performance
Scientists have unveiled a new benchmark, APEX-Agents, designed to rigorously assess the capabilities of AI agents in executing complex, long-horizon tasks mirroring the work of investment banking analysts, management consultants, and corporate lawyers. The research introduces the AI Productivity Index for Agents, a novel method for evaluating whether these agents can navigate realistic work environments and effectively utilise files and tools to complete professional-level assignments. This work addresses a significant gap in current AI agent evaluations, which often suffer from a “sim-to-real” disconnect and fail to accurately reflect the nuances of professional workflows. The team meticulously constructed 480 tasks, split across 33 distinct simulated work environments, each built with input from industry professionals assuming realistic roles and responsibilities.
To facilitate further research and development, the researchers have open-sourced both the APEX-Agents benchmark, including all prompts, rubrics, gold outputs, files, and metadata, and Archipelago, their infrastructure for agent execution and evaluation. This commitment to open science allows the broader AI community to replicate, extend, and build upon their findings, accelerating progress in the field of AI-powered professional services. The creation of APEX-Agents involved a three-step process: first, building data-rich worlds based on real-world project scenarios; second, having industry professionals create challenging tasks within those worlds; and third, providing agents access to these environments with all necessary data and software. This approach, informed by the APEX Survey of 227 experts, ensures the benchmark accurately reflects the complexities of professional work and provides a meaningful measure of agentic AI performance.
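Because the article does not document Archipelago's actual interface, the sketch below only illustrates what driving one released task through an execution-and-evaluation harness of this kind could look like. The file layout, field names, and the agent and grader callables are hypothetical placeholders, not the real API.

```python
import json
from pathlib import Path

def load_task(task_dir: Path) -> dict:
    """Read one released task record (prompt, rubric, gold output, files, metadata).
    The file names here are assumptions about how the open-sourced data might be laid out."""
    record = json.loads((task_dir / "task.json").read_text())
    record["files"] = sorted(str(p) for p in (task_dir / "files").glob("**/*"))
    return record

def run_and_grade(agent, task: dict, grader) -> float:
    """Execute the agent with the task's data and software, then score it on the rubric."""
    output = agent.run(prompt=task["prompt"], files=task["files"], tools=task["tools"])
    met = [grader(output, criterion, task["gold_output"]) for criterion in task["rubric"]]
    return sum(met) / len(met)
```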
The APEX Survey revealed that core activities comprise 47% of professionals’ time, with tasks segmented into 18 inductively-identified categories, further informing the design of realistic and challenging tasks within the benchmark. This detailed understanding of professional workflows, combined with the comprehensive dataset and evaluation infrastructure, positions APEX-Agents as a crucial tool for driving innovation in AI agents and unlocking the potential for a future where skilled, on-demand expertise is available to all. The open-sourcing of both the benchmark and infrastructure promises to catalyse further advancements, ultimately reshaping knowledge work and dramatically increasing productivity across various industries.
APEX-Agents Benchmark for Professional Services AI Evaluation
Scientists pioneered the APEX-Agents benchmark to rigorously evaluate the capacity of artificial intelligence agents to perform complex, long-horizon tasks mirroring professional services work. The research team engineered a novel evaluation framework, constructing 33 distinct ‘worlds’: data-rich environments simulating realistic project scenarios in investment banking, management consulting, and law. Industry professionals, assuming roles such as partner and associate, collaboratively planned and executed projects over 5-10 days, generating high-quality deliverables such as spreadsheets, reports, and presentations, thereby establishing a gold standard for task completion. The study created 480 tasks within these worlds, averaging 1-2 hours in estimated completion time for experienced professionals, with each world containing, on average, 166 files.
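To make the reported structure concrete, here is a minimal sketch of how worlds and tasks like those described above might be represented as data. The class and field names are assumptions chosen for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class World:
    """One of the 33 simulated work environments (on average roughly 166 files each)."""
    world_id: str                            # e.g. "investment_banking_03" (hypothetical)
    domain: str                              # investment banking, management consulting, or law
    files: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)

@dataclass
class Task:
    """One of the 480 long-horizon assignments written by industry professionals."""
    task_id: str
    world_id: str
    prompt: str
    rubric: list[str]                        # expert-written grading criteria
    gold_output: str                         # reference deliverable (spreadsheet, report, deck)
    est_hours: float = 1.5                   # tasks average roughly 1-2 hours for a professional
```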
Experts then devised challenging tasks utilising these files, granting agents access to the same data and software a human professional would employ, including applications such as Calendar, Chat, Documents, and Spreadsheets (a total of 63 tools across all worlds, with two investment banking worlds expanding to 81 tools). Web search functionality was deliberately disabled to guarantee reproducibility of evaluations and maintain the integrity of the benchmark. Researchers drew on data from the APEX Survey of 227 professionals (58 financial analysts, 77 management consultants, and 92 lawyers, with an average of 10.8 years’ experience) to inform the creation of these worlds and tasks. Qualitative analysis of participants’ descriptions of their work activities, categorised into 18 inductively-identified areas, directly shaped the task design, ensuring relevance and authenticity.
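The article names the in-world applications and notes that web search is switched off for reproducibility. The configuration sketch below shows one way such a restricted tool set could be expressed; the names (ALLOWED_TOOLS, build_toolset) are hypothetical rather than taken from the benchmark's code.

```python
# Hypothetical tool configuration mirroring the article's description: agents only
# see in-world applications, and web search is disabled so evaluations stay reproducible.
ALLOWED_TOOLS = {"calendar", "chat", "documents", "spreadsheets"}  # plus other in-world apps
DISABLED_TOOLS = {"web_search"}

def build_toolset(world_tools: list[str]) -> list[str]:
    """Return the tools an agent may call in a given world."""
    return [t for t in world_tools if t in ALLOWED_TOOLS and t not in DISABLED_TOOLS]

# Example: a world that nominally lists web search still exposes only offline apps.
print(build_toolset(["documents", "spreadsheets", "web_search", "chat"]))
# -> ['documents', 'spreadsheets', 'chat']
```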
