Google’s Gemini 2.5 Pro Tops Coding Charts and MENSA Tests in AI ‘IQ’ Battle
By: bitcoin ethereum news|2025/05/09 13:15:02
In brief
Google’s new Gemini 2.5 Pro tops the WebDev Arena leaderboard, outperforming competitors like Claude in coding tasks and making it a standout choice for developers.
The AI model also features a 1 million token context window (expandable to 2 million), enabling it to handle large codebases and complex projects far beyond the capacity of models like ChatGPT and Claude 3.7 Sonnet.
It also achieved the highest scores on reasoning benchmarks, including a MENSA IQ test and Humanity’s Last Exam, demonstrating the problem-solving skills that sophisticated development tasks require.

Google’s recently launched Gemini 2.5 Pro has risen to the top spot on coding leaderboards, beating Claude in the well-known WebDev Arena, a vendor-neutral ranking site akin to the LLM arena but focused specifically on measuring how good AI models are at coding. The achievement comes amid Google’s push to position its flagship AI model as a leader in both coding and reasoning tasks.

Released earlier this year, Gemini 2.5 Pro ranks first across several categories, including coding, style control, and creative writing. The model’s massive context window (one million tokens, expanding to two million soon) allows it to handle large codebases and complex projects that would choke even its closest competitors. For context, powerful models like ChatGPT and Claude 3.7 Sonnet can only handle up to 128K tokens.

Gemini also has the highest “IQ” of the AI models tested by TrackingAI, which put it through formalized MENSA tests, using verbalized questions from Mensa Norway to create a standardized way to compare AI models. Gemini 2.5 Pro scored higher than competitors on these tests, even on bespoke questions that are not publicly available and so are unlikely to appear in training data. With an IQ score of 115 in offline tests, the new Gemini ranks among the “bright minded,” given that average human intelligence falls between roughly 85 and 114 points.

But the notion of an AI having IQ needs unpacking.
AI systems don’t have intelligence quotients like humans do, so it’s better to think of the score as shorthand for performance on reasoning benchmarks. On benchmarks specifically designed for AI, Gemini 2.5 Pro scored 86.7% on the AIME 2025 math test and 84.0% on the GPQA science assessment. On Humanity’s Last Exam (HLE), a newer and harder benchmark created to avoid test saturation problems, Gemini 2.5 scored 18.8%, beating OpenAI’s o3 mini (14%) and Claude 3.7 Sonnet (8.9%) by a remarkable margin.

The new version of Gemini 2.5 Pro is now available for free (with rate limits) to all Gemini users. Google previously described this release as an “experimental version of 2.5 Pro,” part of its family of “thinking models” designed to reason through responses rather than simply generate text.

Despite not winning every benchmark, Gemini has caught developers’ attention with its versatility. The model can create complex applications from single prompts, building interactive web apps, endless runner games, and visual simulations without requiring detailed instructions. We tested the model by asking it to fix broken HTML5 code. It generated almost 1,000 lines of code, and the results beat Claude 3.7 Sonnet, the previous leader, in both quality and understanding of the full set of instructions.

For working developers, Gemini 2.5 Pro’s input costs $2.50 per million tokens and its output costs $15.00 per million tokens, positioning it as a cheaper alternative to some competitors while still offering impressive capabilities. The AI model handles up to 30,000 lines of code in its Advanced plan, making it suitable for enterprise-level projects. Its multimodal abilities (working with text, code, audio, images, and video) add flexibility that other coding-focused models can’t match.
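
To put the quoted prices in perspective, here is a minimal Python sketch of the arithmetic. It uses only the figures cited in this article ($2.50 and $15.00 per million tokens, and a 1 million token context window). The function names and the sample token counts are hypothetical, chosen for illustration, and the context check is a simplification that looks only at the prompt size and ignores any separate output limits.

```python
# Figures quoted in the article (USD per one million tokens).
INPUT_USD_PER_MILLION = 2.50
OUTPUT_USD_PER_MILLION = 15.00
# Context window quoted in the article, in tokens.
CONTEXT_WINDOW_TOKENS = 1_000_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the quoted rates."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


def fits_in_context(prompt_tokens: int) -> bool:
    """Simplified check: does the prompt fit in the quoted context window?"""
    return prompt_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # Hypothetical request: a 200,000-token prompt and a 10,000-token reply.
    cost = estimate_cost_usd(200_000, 10_000)
    print(f"Estimated cost: ${cost:.2f}")
    print("Prompt fits in context:", fits_in_context(200_000))
```

At these rates, output tokens cost six times as much as input tokens, so a long generated response can dominate the bill even when the prompt is much larger.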
Source: https://decrypt.co/318416/googles-gemini-2-5-pro-tops-coding-charts-mensa-tests-ai-iq-battle