
It’s Getting Harder to Measure Just How Good AI Is Getting
What Happened to the Wall AI Was Supposed to Hit?
Toward the end of 2024, there was widespread speculation about whether AI’s “scaling laws” had finally hit a real technical limit. My view was that this question mattered less than people thought: regardless of scaling laws, existing AI systems are already powerful enough to reshape our world significantly, and the next few years will be defined by AI progress, technical walls or not.
Predicting AI’s trajectory is always tricky. The pace of innovation means predictions can be disproved within days, not just months. Case in point: shortly after I wrote about AI’s scaling laws, OpenAI announced o3, its newest reasoning model. It didn’t settle the scaling-law debate entirely, but it made clear that AI progress is far from hitting a wall.
o3 is incredibly impressive. To understand just how impressive, we need to explore how AI capabilities are measured.
Standardized Tests for AI
To compare language models, researchers test their performance on problems they haven’t seen before. This is harder than it sounds: these models are trained on vast swaths of text, so they have usually encountered existing tests, or close paraphrases of them, during training.
Machine learning researchers develop benchmarks to compare AI systems against each other and against human performance on tasks like math, programming, and text analysis. For years, benchmarks drawn from contests like the US Math Olympiad qualifier, along with graduate-level physics, biology, and chemistry exams, served as reliable yardsticks.
However, AI progress has been so rapid that benchmarks are becoming obsolete. In 2024, benchmark after benchmark became “saturated”: top models all score so high that the test no longer distinguishes between them (a minimal code sketch of this follows the examples below).
For instance:
- The GPQA benchmark, built from graduate-level physics, biology, and chemistry questions, was so difficult that PhDs in the relevant fields routinely scored below 70 percent. Top AIs now outperform those experts.
- On the Math Olympiad qualifier, top AIs now match elite human performance.
- The MMLU benchmark, which spans dozens of subjects from law to medicine, has likewise been saturated by top models.
- The ARC-AGI benchmark, intended to measure general humanlike intelligence, saw a version of o3 tuned on its public training data achieve a groundbreaking 88 percent.
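To make “saturation” concrete, here is a minimal sketch, in Python, of how this kind of benchmark comparison works. Everything in it (the `grade` helper, the answer key, and both model outputs) is hypothetical, invented purely to illustrate the idea; real benchmarks involve thousands of questions and far more careful grading.

```python
# Minimal sketch of benchmark scoring and saturation.
# All model names and answers are hypothetical placeholders.

def grade(model_answers, answer_key):
    """Fraction of questions a model answered correctly."""
    correct = sum(a == k for a, k in zip(model_answers, answer_key))
    return correct / len(answer_key)

answer_key = ["B", "D", "A", "C", "B", "A", "D", "C"]

# Two hypothetical frontier models, both near the ceiling.
scores = {
    "model_x": grade(["B", "D", "A", "C", "B", "A", "D", "C"], answer_key),
    "model_y": grade(["B", "D", "A", "C", "B", "A", "D", "B"], answer_key),
}

for name, score in scores.items():
    print(f"{name}: {score:.0%}")

# When every strong model scores near 100%, the remaining gap
# disappears into noise: the benchmark is "saturated" and can
# no longer tell us which model is more capable.
```

Once every strong model sits near the ceiling, the test tells us nothing about which is stronger, which is roughly what happened to MMLU and its peers.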
Researchers can create new benchmarks (like the forthcoming ARC-AGI-2), but those, too, may last only a few years before saturating. Moreover, new benchmarks increasingly test skills beyond what most humans possess, which makes AI progress harder to measure in relatable terms.
AI Progress Feels Invisible
As Garrison Lovely argued in Time, AI progress hasn’t “hit a wall” so much as become harder to notice. The leaps are now in areas beyond ordinary human expertise, such as solving elite-level math, programming, or biology problems.
We easily perceive the difference between a 5-year-old learning arithmetic and a high schooler mastering calculus. But distinguishing between a first-year math undergraduate and the world’s top mathematicians is much harder. Similarly, AI’s advancements in high-level problem-solving feel intangible to most people.
Nonetheless, these advances are transformative. AI is poised to automate a significant share of the intellectual work now done by humans, driven by three major forces:
- Becoming Cheaper. While o3 delivers astonishing results, its computational costs are steep: reportedly up to $1,000 to answer a single complex question. However, models like China’s DeepSeek suggest that high-quality performance can become far more affordable.
- Improving Interfaces. Better ways of interacting with AI are on the horizon. Innovations in how we use, combine, and verify AI tools will make them more accessible and efficient. For example, a system might route routine tasks to a mid-tier AI and switch to a more advanced model for complex queries, as sketched in the code after this list.
- Getting Smarter. New AI systems continue to improve at reasoning, problem-solving, and domain expertise. In fact, we are still figuring out how to measure their intelligence now that human-based benchmarks are saturated.
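As a rough illustration of the routing idea above, here is a minimal sketch assuming two stand-in model functions and a crude difficulty heuristic. None of this corresponds to a real vendor API; a production router would likely use a learned classifier rather than keyword matching.

```python
# Hypothetical sketch of a model cascade: route easy queries to a
# cheap model, escalate hard ones to an expensive frontier model.
# Both "models" here are stand-in functions, not real APIs.

def cheap_model(query: str) -> str:
    return f"[cheap model] answer to: {query}"

def frontier_model(query: str) -> str:
    return f"[frontier model] answer to: {query}"

def looks_hard(query: str) -> bool:
    """Crude stand-in for a difficulty classifier: very long queries
    or ones mentioning proofs/debugging get escalated."""
    return len(query) > 200 or any(
        word in query.lower() for word in ("prove", "debug", "optimize")
    )

def route(query: str) -> str:
    # Send routine questions to the cheap model; reserve the
    # expensive model for queries that look genuinely difficult.
    if looks_hard(query):
        return frontier_model(query)
    return cheap_model(query)

print(route("What's the capital of France?"))
print(route("Prove that there are infinitely many primes."))
```

The appeal of this pattern is economic: most queries are routine, so the expensive model runs only when it is likely to matter.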
The Defining Forces of the Future
These three trends (cost reduction, interface innovation, and intellectual improvement) will shape the next few years. AI’s potential to transform the world is undeniable, even if that transformation is not being handled responsibly.
None of these forces are hitting a wall. Any one of them could permanently change how we live and work. Together, they represent an unprecedented shift in human history.
This story originally appeared in the Future Perfect newsletter and was adapted from an article by Swati Sharma from Vox.