Researchers have unveiled what they call “Humanity’s Last Exam” (HLE) – a new benchmark designed to rigorously assess the capabilities of today’s leading artificial intelligence (AI) models. The exam, developed by the Center for AI Safety and Scale AI, isn’t just another AI test; it aims to determine how close machines are to matching or exceeding human expertise across a vast range of subjects.
The Exam: Built to Break AI
Launched in January 2025, the HLE contains 2,500 questions spanning over 100 disciplines, vetted by over 1,000 subject-matter experts from 50 countries. Unlike typical AI benchmarks that rely on easily searchable information, the HLE questions are designed to be unambiguous but not readily available online. The goal: to measure genuine understanding, not just the ability to regurgitate data.
Initial tests in early 2025 showed poor results, with OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro failing to exceed 8.3% accuracy. The researchers predicted, however, that given the rapid pace of development, AI could surpass 50% accuracy by the end of 2025. As of February 2026, Google’s Gemini 3 Deep Think had achieved the highest score at 48.4%, still well below the roughly 90% typical of human experts in their fields.
Why This Matters: Beyond Just Scores
The HLE isn’t just about numbers. It highlights a critical gap between current AI and true artificial general intelligence (AGI). While AI excels at specific tasks, it still struggles with the broad, adaptable knowledge that humans possess. This test is a way to measure how close we are to machines that can genuinely think, reason, and learn like people.
The exam’s design is unusual in that every question must be precise, unambiguous, solvable, and non-searchable. The team automatically rejected any question that current models could already answer correctly, ensuring that only genuinely challenging material was included.
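To make that filtering idea concrete, here is a minimal, purely illustrative sketch in Python of a rejection pass of this kind. The function and model names, the single-model check, and the exact-match scoring are assumptions made for the example; they are not the HLE team’s actual pipeline, which tested questions against multiple models and relied on expert review.

```python
from typing import Callable, Dict, List

Question = Dict[str, str]  # {"prompt": ..., "answer": ...}

def filter_candidates(candidates: List[Question], ask: Callable[[str], str]) -> List[Question]:
    """Keep only the candidate questions that the tested model answers incorrectly."""
    kept: List[Question] = []
    for q in candidates:
        model_answer = ask(q["prompt"]).strip().lower()
        if model_answer != q["answer"].strip().lower():
            kept.append(q)  # the model failed, so the question survives the filter
    return kept

# Toy stand-in for a frontier model: it answers the easy question but not the hard one.
def toy_model(prompt: str) -> str:
    canned = {"What is 2 + 2?": "4"}
    return canned.get(prompt, "I don't know")

candidates = [
    {"prompt": "What is 2 + 2?", "answer": "4"},  # too easy: the model solves it, so it is rejected
    {"prompt": "An expert-level question goes here...", "answer": "expert-provided answer"},  # kept
]

print(filter_candidates(candidates, toy_model))
# -> only the second, unanswered question remains
```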
Examples of the Exam’s Rigor
The questions themselves cover a wide spectrum, from obscure trivia like “In Greek mythology, who was Jason’s maternal great-grandfather?” to complex physics problems requiring advanced understanding. This breadth, together with the sheer difficulty of the questions, distinguishes the HLE from earlier benchmarks such as the Massive Multitask Language Understanding (MMLU) dataset, whose multiple-choice questions leading models now answer with close to 90% accuracy. Gemini 3 Deep Think, for example, achieved 84.6% on the ARC-AGI-2 benchmark but failed to reach 50% on the HLE.
The Verdict: Not AGI Yet
The study’s authors emphasize that a high score on the HLE does not automatically signify AGI. Achieving expert-level performance on verifiable questions is only one piece of the puzzle. Real intelligence requires autonomous research capabilities, creative problem-solving, and the ability to apply knowledge in novel situations. As Manuel Schottdorf, a neuroscientist involved in the project, states: “Doing well on HLE is a necessary, but not a sufficient criterion to say that machines have reached true intelligence.”
The HLE represents a significant step in measuring AI’s progress, but it also underscores that we are still far from machines that can truly think like humans. The pursuit of AGI is ongoing, and benchmarks like this will be crucial in tracking how close we get.
