March 27, 2026 ChainGPT

ARC-AGI-3 Slams AGI Claims: Top Models Score <1% — What Crypto Investors Should Know

Nvidia’s Jensen Huang stirred the pot last week on Lex Fridman’s podcast when he declared, “I think we’ve achieved AGI.” Two days later, a much harsher reality check arrived: ARC-AGI-3, the newest and most rigorous test for artificial general intelligence from the ARC Prize Foundation, landed, and frontier models scored well below 1%.

Key results
- ARC-AGI-3 top scores: Google’s Gemini 3.1 Pro 0.37%, OpenAI’s GPT-5.4 0.26%, Anthropic’s Claude Opus 4.6 0.25%, xAI’s Grok-4.20 0.00%.
- Humans: 100% of environments solved with no prior training or instructions.
- Best agent in a month-long developer preview: 12.58% (on the full benchmark under contest rules).

What makes ARC-AGI-3 different
- Not a trivia or coding exam: ARC-AGI-3 is a set of 135 original, interactive, game-like environments built from scratch by François Chollet and Mike Knoop’s foundation.
- Zero prior clues: agents receive no instructions, no goal statements, and no rule descriptions. The task is to explore, infer the game’s dynamics, form a plan, and execute it, something a typical five-year-old can often do.
- Designed to resist overfitting: 110 of the 135 environments are private (55 semi-private for API testing, 55 fully locked for competition), so teams can’t simply train on the dataset or memorize solutions.

Scoring that punishes brute force
- ARC-AGI-3 uses RHAE (Relative Human Action Efficiency). The baseline is the second-best, first-run human performance.
- Inefficiency is penalized heavily: an agent that takes ten times more actions than a human scores roughly 1% for that level, because the metric squares the penalty for wasted actions such as random wandering or repeated backtracking. That structure rewards focused, generalizable problem solving over noisy trial-and-error. (A rough illustrative sketch of this scoring appears at the end of the article.)

A methodological flashpoint
- The foundation’s official benchmark feeds agents JSON (structured data), not raw visuals. A custom harness built at Duke reportedly pushed Claude Opus 4.6 from 0.25% to 97.1% on one environment variant (TR87), though Claude’s overall ARC-AGI-3 score remained 0.25%.
- The debate: does the JSON/API format unfairly help or hinder models, or does it show that models are already better at processing human-friendly descriptions than at discovering rules from raw perceptual inputs? The foundation’s position: “Frame content perception and API format are not limiting factors… the real gap lies in reasoning and generalization.”

Why this matters, especially now
- The release arrives amid a week of AGI claims and branding: Jensen Huang’s statement, Arm calling a new chip the “AGI CPU,” Sam Altman saying OpenAI has “basically built AGI,” and Microsoft marketing its research toward ASI. The term AGI is being stretched to fit marketing and product narratives.
- Chollet’s simple litmus test: if a normal human with no instructions can solve a task and your system can’t, you don’t have AGI; you have a sophisticated but brittle autocomplete that needs heavy supervision.

What’s next
- ARC Prize 2026 is offering $2 million across three competition tracks, all hosted on Kaggle, and every winning solution must be open-sourced. The contest aims to drive transparent progress on true generalization and reasoning, not just benchmark-saturating tweaks.

Bottom line for the crypto and tech crowd
- Hype cycles can move markets and mindshare fast, but benchmarks like ARC-AGI-3 show where real capabilities stand. Current frontier LLMs, when tested without bespoke tooling, struggle with open-ended, unfamiliar tasks that require genuine exploration and reasoning.
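To make the efficiency penalty concrete, here is a minimal, purely illustrative sketch in Python. The ARC Prize Foundation’s exact RHAE formula is not spelled out in this article, so the squared-ratio scoring below, along with the hypothetical function name rhae_level_score and its human_actions / agent_actions parameters, is an assumption chosen only because it reproduces the “ten times more actions, about 1%” example described above.

```python
# Illustrative only: assumes a per-level score equal to the squared ratio of the
# human action baseline to the agent's action count. This exact formula is an
# assumption for intuition, not the foundation's published RHAE definition.

def rhae_level_score(human_actions: int, agent_actions: int) -> float:
    """Hypothetical per-level score as a percentage, capped at 100%."""
    if agent_actions <= 0:
        return 0.0
    efficiency = human_actions / agent_actions   # 1.0 means human-level efficiency
    return min(efficiency, 1.0) ** 2 * 100       # squaring punishes wasted actions

# A human baseline of 40 actions versus an agent that wanders for 400 actions:
print(rhae_level_score(40, 400))   # 1.0  -> about 1%, matching the article's example
print(rhae_level_score(40, 40))    # 100.0 -> human-level efficiency
```

Whatever the real formula looks like, the design intent is the same: an agent that meanders pays quadratically for every wasted action, so brute-force exploration cannot rescue its score.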
For investors and builders in crypto and Web3, who increasingly lean on AI for tooling, trading, and on-chain analysis, this is a reminder: the “G” in AGI (generalization) remains the hardest part, and today’s models are still far from human-level general problem solving.