QianLong (潜龙) · Chinese-Language AI Content and Tools Platform

Imagine trying to measure the intelligence of the world's most advanced AI models using a simple, absurd request: draw a pelican riding a bicycle.

This is exactly the benchmark renowned developer Simon Willison uses to evaluate Large Language Models (LLMs), asking each one to produce the drawing as SVG code. In a thought-provoking presentation, playfully framed as a retrospective from the speculative year of 2026, Willison mapped out a hypothetical, yet highly plausible, six-month explosion in artificial intelligence capabilities.

Why a pelican on a bike? The genius of the test lies in its absurdity. Pelicans are anatomically tricky to draw, bicycles are notoriously difficult for AI to render correctly, and the combination is so nonsensical that no AI laboratory would ever waste resources training its models specifically to generate it. That makes it a useful yardstick for a fast-moving field: in Willison's speculative timeline, the title of the industry's "best model" changed hands five times in a single month. Successive versions of major models constantly leapfrogged one another, all judged by their ability to render this avian cyclist.

But the real inflection point wasn't about generating quirky images. It was about the maturation of coding agents. Through advanced reinforcement learning techniques, AI programming assistants crossed a critical threshold. They transitioned from being frustrating novelties that required constant human supervision to reliable "daily drivers." Developers could finally trust these systems to write functional, complex software without spending hours debugging trivial mistakes.

This newfound reliability gave birth to an entirely new category of personal AI assistants, colloquially dubbed "Claws" (a name stemming from an early open-source project). As these autonomous agents became wildly popular, consumers began purchasing dedicated hardware, such as Mac Minis, just to run them locally. Rather than relying on cloud servers, people started turning everyday computers into dedicated digital habitats for their personal AI proxies. These agents acted as highly capable digital butlers: immensely powerful, but in need of strict safety guardrails to keep them within their users' intentions.

Simultaneously, the open-source ecosystem experienced a massive surge. From new iterations of models by US tech giants to colossal 1.5-terabyte models from Chinese laboratories, open-weight AI began matching the performance of proprietary leaders. Eventually, even a compressed 20-gigabyte model running locally on a standard laptop could flawlessly animate a pelican cruising on a bicycle.

When an off-the-shelf laptop can effortlessly generate such complex outputs, it signals more than just graphical prowess. It means the benchmark itself is dead. As AI capabilities compound at breakneck speed, our biggest challenge might not be building better models, but figuring out how to measure the intelligence of the ones we already have.

Key Points

Absurd benchmarks like a 'pelican on a bicycle' are effective because no AI lab would deliberately train its models on them.
AI coding agents have crossed a critical threshold, becoming reliable 'daily drivers' for real-world software development.
The rise of autonomous personal AI assistants has turned consumer hardware into dedicated digital habitats for AI.
Open-weight models, including compressed versions that run on a local laptop, now match proprietary leaders at complex generative tasks.

Why It Matters

As AI models rapidly outpace the benchmarks designed to test them, the shift toward highly capable, localized AI agents will fundamentally change how we interact with software.

Sources:

The last six months in LLMs in five minutes — Simon Willison's Weblog


QianLong Editorial Team · 2026/7/14

The Pelican on a Bicycle: Measuring the Unmeasurable in AI

