March 27, 2026 ChainGPT

ARC-AGI-3 Slams AGI Claims: Top Models Score <1% — What Crypto Investors Should Know

Nvidia’s Jensen Huang stirred the pot last week on Lex Fridman’s podcast when he declared, “I think we’ve achieved AGI.” Two days later, a much harsher reality check arrived: ARC-AGI-3, the newest and most rigorous test for artificial general intelligence from the ARC Prize Foundation, landed, and frontier models scored well below 1%.

Key results
- ARC-AGI-3 top scores: Google’s Gemini 3.1 Pro 0.37%, OpenAI’s GPT-5.4 0.26%, Anthropic’s Claude Opus 4.6 0.25%, xAI’s Grok-4.20 0.00%.
- Humans: 100% of environments solved with no prior training or instructions.
- Best agent in a month-long developer preview: 12.58% (on the full benchmark under contest rules).

What makes ARC-AGI-3 different
- Not a trivia or coding exam: ARC-AGI-3 is a set of 135 original, interactive, game-like environments built from scratch by François Chollet and Mike Knoop’s foundation.
- Zero prior clues: agents receive no instructions, no goal statements, and no rule descriptions. The task is to explore, infer the game’s dynamics, form a plan, and execute it, something a typical five-year-old can often do.
- Designed to resist overfitting: 110 of the 135 environments are private (55 semi-private for API testing, 55 fully locked for competition), so teams can’t simply train on the dataset or memorize solutions.

Scoring that punishes brute force
- ARC-AGI-3 uses RHAE (Relative Human Action Efficiency). The baseline is the second-best, first-run human performance.
- Inefficiency is penalized heavily: an agent that takes ten times more actions than a human scores roughly 1% for that level, because the metric squares the penalty for wasted actions such as random wandering or repeated backtracking. That structure rewards focused, generalizable problem solving over noisy trial-and-error. (A rough illustrative sketch of this scoring appears at the end of the article.)

A methodological flashpoint
- The foundation’s official benchmark feeds agents JSON (structured data), not raw visuals. A custom harness built at Duke reportedly pushed Claude Opus 4.6 from 0.25% to 97.1% on one environment variant (TR87), though Claude’s overall ARC-AGI-3 score remained 0.25%.
- The debate: does the JSON/API format unfairly help or hinder models, or does it show that models are already better at processing human-friendly descriptions than at discovering rules from raw perceptual inputs? The foundation’s position: “Frame content perception and API format are not limiting factors… the real gap lies in reasoning and generalization.”

Why this matters, especially now
- The release arrives amid a week of AGI claims and branding: Jensen Huang’s statement, Arm calling a new chip the “AGI CPU,” Sam Altman saying OpenAI has “basically built AGI,” and Microsoft marketing its research toward ASI. The term AGI is being stretched to fit marketing and product narratives.
- Chollet’s simple litmus test: if a normal human with no instructions can solve a task and your system can’t, you don’t have AGI; you have a sophisticated but brittle autocomplete that needs heavy supervision.

What’s next
- ARC Prize 2026 is offering $2 million across three competition tracks, all hosted on Kaggle, and every winning solution must be open-sourced. The contest aims to drive transparent progress on true generalization and reasoning, not just benchmark-saturating tweaks.

Bottom line for the crypto and tech crowd
- Hype cycles can move markets and mindshare fast, but benchmarks like ARC-AGI-3 show where real capabilities stand. Current frontier LLMs, when tested without bespoke tooling, struggle with open-ended, unfamiliar tasks that require genuine exploration and reasoning.
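To make the efficiency penalty concrete, here is a minimal, purely illustrative sketch in Python. The ARC Prize Foundation’s exact RHAE formula is not spelled out in this article, so the squared-ratio scoring below, along with the hypothetical function name rhae_level_score and its human_actions / agent_actions parameters, is an assumption chosen only because it reproduces the “ten times more actions, about 1%” example described above.

```python
# Illustrative only: assumes a per-level score equal to the squared ratio of the
# human action baseline to the agent's action count. This exact formula is an
# assumption for intuition, not the foundation's published RHAE definition.

def rhae_level_score(human_actions: int, agent_actions: int) -> float:
    """Hypothetical per-level score as a percentage, capped at 100%."""
    if agent_actions <= 0:
        return 0.0
    efficiency = human_actions / agent_actions   # 1.0 means human-level efficiency
    return min(efficiency, 1.0) ** 2 * 100       # squaring punishes wasted actions

# A human baseline of 40 actions versus an agent that wanders for 400 actions:
print(rhae_level_score(40, 400))   # 1.0  -> about 1%, matching the article's example
print(rhae_level_score(40, 40))    # 100.0 -> human-level efficiency
```

Whatever the real formula looks like, the design intent is the same: an agent that meanders pays quadratically for every wasted action, so brute-force exploration cannot rescue its score.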
For investors and builders in crypto and Web3, who increasingly lean on AI for tooling, trading, and on-chain analysis, this is a reminder: the “G” in AGI (generalization) remains the hardest part, and today’s models are still far from human-level general problem solving.