The AGI Paradox: If AI Is So Smart, Why Can’t It Design a Good Website?
In January 2025, Sam Altman wrote: "We are now confident we know how to build AGI."
Jensen Huang told Lex Fridman in March 2026: "I think we've achieved AGI."
Dario Amodei predicted AI at "Nobel Laureate" level by 2027. Mustafa Suleyman forecast "human-level performance on most professional tasks" within 18 months.
Meanwhile, the same AI systems supposedly approaching general intelligence cannot look at a website layout and tell you whether it feels off.
They cannot decide if a heading should be 56 pixels or 64 pixels based on what a brand needs to communicate. They cannot watch a user struggle with a form and feel the friction that reshapes their next design. They cannot look at a competitor’s site and choose dark mode — not because dark mode is trendy, but because everyone else in the market uses white, and being different is the strategy.
This is the AGI paradox. It tells us something important about what intelligence actually is.
The Paradox, Stated Plainly
If AGI is here, every cognitive task should be within reach.
Web design is a cognitive task. It requires reasoning, judgment, cultural awareness, emotional understanding, and creative problem-solving. These are the capabilities AGI is supposed to possess.
But AI-generated web design is, at best, competent. It follows conventions. It looks professional. And it is recognizable as AI-generated — not because of any single flaw, but because of a pervasive averageness. Everything is reasonable. Nothing is surprising. The result is a site that checks every box and moves no one.
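If AGI is here, design should be easy for it. If design is genuinely hard for AI, AGI is not here.
On the face of it, the claim that AGI has arrived and the observation that AI cannot design well cannot both be true.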
The resolution is not that one side is lying. It is that they are measuring different things. The thing they are not measuring is precisely what design reveals.
What AGI Claims Are Actually Standing On
Most AGI claims rest on benchmark performance.
AI scores 94% on a test covering 57 academic subjects. It passes the bar exam. It solves math olympiad problems. It matches doctors on medical licensing exams.
These are real achievements.
But look at the type of task. The benchmarks cited in support of AGI test problems with three specific properties:
There is a correct answer. Math competitions, bar exams, licensing tests — all have definitive right answers.
The evaluation is objective. A proof is correct or it is not. Code passes the tests or it does not.
The problem resembles training data. The model has seen problems structured like these before.
These three properties are what make a task solvable through statistical pattern completion. AI is extraordinarily good at tasks where the answer is knowable, verifiable, and resembles something it has seen.
The question is whether that describes intelligence or a very sophisticated pattern-completer.
Seven Things Benchmarks Do Not Test (But Design Does)
Design is a useful lens because it is a complete cognitive task. It does not test one capability in isolation. It tests the integration of capabilities that general intelligence would require.
1. Judgment without a right answer
When a designer decides between 12 pixels and 16 pixels of padding, there is no correct answer. It depends on the brand personality, surrounding elements, page density, and target audience. Twenty expert designers might give twenty different answers — and several could be excellent.
A research study called DesignPref found that 20 professional designers evaluating the same work reached a Krippendorff’s alpha of only 0.25. Alpha is a chance-corrected measure of inter-rater agreement: 1 means perfect agreement and 0 means agreement no better than chance, so 0.25 signals substantial disagreement. If experts cannot agree on what “good” means, evaluating design quality is not pattern matching. It is judgment: the ability to weigh factors that cannot be directly compared and arrive at a defensible position.
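To make a figure like 0.25 concrete, here is a minimal Python sketch of Krippendorff’s alpha for nominal labels. The ratings are invented for illustration and are not taken from DesignPref, and the function name is an assumption of this sketch rather than anything the study used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that
    different raters assigned to one item (missing ratings omitted).
    """
    # Coincidence matrix: every ordered pair of ratings within an item,
    # weighted by 1 / (number of ratings on that item - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating cannot be paired with anything
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label, and the overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)

    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1 - observed / expected


# Hypothetical data: four designers each pick the better of two layouts
# ("A" or "B") for five different briefs.
made_up_ratings = [
    ["A", "A", "B", "A"],
    ["B", "A", "B", "B"],
    ["A", "B", "A", "B"],
    ["A", "A", "A", "B"],
    ["B", "B", "A", "B"],
]
print(round(krippendorff_alpha_nominal(made_up_ratings), 3))
```

Raters who always agree score 1. Raters who split on every item can score below 0. A value near 0.25 sits much closer to chance than to consensus.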
Every benchmark that supports AGI claims has a right answer. Design does not. Most decisions that matter in business — strategy, hiring, product direction — do not either.
2. Intent
An expert designer is arguing for something. The headline is 64 pixels because the value proposition must dominate the viewport. The blue is this specific shade because it signals trust while standing apart from competitors who also use blue. The testimonial section has more whitespace because the designer wants readers to slow down and absorb social proof.
Every design decision serves a communicative goal. The designer has intent — a specific thing they are trying to make someone feel, think, or do.
AI has no intent. It generates output that looks like it has intent because it mimics designs made by people who had intent. The AI itself is not trying to accomplish anything. It is completing a pattern.
Think of the Pacific islanders who built wooden airplanes after World War II, hoping to attract more supply drops. They replicated the form of airport operations without understanding the function. The planes did not fly. AI design replicates the form of intentional design without the function. The designs do not argue for anything. They arrange.
3. Cultural fluency
When a designer puts a serif font on a fintech site, they are not making a typographic decision. They are making a cultural decision. Serifs signal “established” and “authoritative” — not because of any visual property of the letter shapes, but because serifs have appeared for centuries in courtrooms, banks, and broadsheet newspapers.
AI has processed text about typography and images of serifs. It knows serifs correlate with “traditional” in its training data. But it does not wield the cultural resonance — the felt connection between this typeface and the institutions people have encountered throughout their lives. It matches patterns. It does not use meaning as a tool.
4. Knowing what to remove
Expert design is heavily subtractive. Generate ideas, pick the best direction, develop it, then remove everything that does not earn its place. Repeat until what remains is essential.
AI is fundamentally generative. It produces output by adding. It has no mechanism for asking “does this element earn its place?” because it placed the element there based on statistical frequency, not strategic judgment.
The result: AI designs are consistently over-stuffed. Too many colors, too many font sizes, too many sections. They have the problem of “more” because the system has no capacity for disciplined “less.”
5. Empathy
Expert designers have watched hundreds of real people use websites. They have seen the confused squint when a navigation label is ambiguous. The frustrated sigh when a form asks for information already provided. The impatient scroll when a loading animation runs four seconds.
These are not abstract observations. They are emotional memories. When an expert designer creates a form with too many required fields, they feel a pang of discomfort — because they remember watching real people encounter real friction. That emotional calibration shapes design decisions below the level of conscious reasoning.
AI has never watched a user. It can follow accessibility guidelines — minimum tap targets, contrast ratios — but guidelines produce compliant design, not comfortable design. There is a gap between “passes the audit” and “feels good to use when you are one-handed on the subway, squinting in sunlight, with a toddler pulling at your arm.”
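The “passes the audit” side of that gap is mechanical enough to write down. Below is a sketch of the WCAG 2.x contrast-ratio check, using the relative-luminance formula and the 4.5:1 AA threshold for normal-size text. The function names and the sample colors are my own choices for illustration.

```python
def relative_luminance(hex_color):
    """Relative luminance of an sRGB color such as '#1a2b3c' (WCAG 2.x formula)."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding for each channel.
    r, g, b = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground, background):
    """True if the pair meets the WCAG AA minimum of 4.5:1 for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5


# Two nearly identical grays on white land on opposite sides of the threshold.
for gray in ("#767676", "#777777"):
    print(gray, round(contrast_ratio(gray, "#ffffff"), 2),
          passes_aa_normal_text(gray, "#ffffff"))
```

The check is binary and knows nothing about sunlight, one-handed use, or the toddler. Two grays that no reader could tell apart fall on opposite sides of the line, which is the difference between compliant and comfortable.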
6. Strategic differentiation
Your website exists in a browser tab next to two or three competitors. When your potential customer is comparing options, your design’s job is to win that comparison — not by being generically good, but by being strategically different.
An expert designer studies your competitors before touching a single pixel. If every competitor uses blue-and-white, maybe your brand should own green. If every competitor packs their homepage with features, maybe yours should be minimal — signaling confidence. If every competitor uses stock photography, maybe yours should use illustrations.
AI has zero competitive awareness. Ask it to design a fintech landing page and it produces the average of all fintech landing pages in its training data — which is exactly what your competitors’ sites look like. You have surrendered your most powerful differentiation tool before the visitor reads a word.
7. Learning from consequences
When an expert designer ships a website, they watch what happens next. They check analytics, review heatmaps, watch session recordings. Sometimes they see users do something completely unexpected — and that observation reshapes how they design every future project.
This feedback loop (design, ship, observe, learn, design better) runs across hundreds of projects over a 15-to-20-year career. It produces wisdom: accumulated judgment about what works, what does not, and why.
AI has no feedback loop. It generates a design and moves on. It never sees whether users converted or bounced. It never learns from the consequences of its choices. Every project starts from the same baseline. There is no accumulated wisdom.
METR researchers found that frontier AI models “can only reliably complete tasks of up to a few minutes long.” One reading is that the limit is not only technical: these systems do not operate in the consequence loop that characterizes real intelligence. They generate. They do not live with the results.
The Three Illusions That Make AGI Claims Sound Reasonable
Illusion 1: The benchmark illusion
“AI scores 94% on a test covering 57 subjects — that is general intelligence.”
This confuses generality of subject matter with generality of cognitive capability. A system can know facts about 57 subjects while being unable to exercise judgment in any of them. Breadth of knowledge is not depth of intelligence.
The clearest evidence is ARC-AGI, a benchmark of novel visual puzzles that untrained humans can typically solve. No domain knowledge is required, just perception and genuine reasoning about something new.
OpenAI’s o3 model scored 88% on ARC-AGI-1. Celebrations followed. Then ARC-AGI-2 was released — different puzzles, same format. The same model scored 3%. Average untrained humans scored 60%.
When ARC-AGI-3 launched in 2026, every frontier model scored 0%. Zero. On puzzles ordinary people solve without training.
The model had not learned to reason about novel patterns. It had learned to match the specific style of patterns in the first test. When the style changed, it collapsed.
Illusion 2: The economic illusion
OpenAI defines AGI as “systems that outperform humans at most economically valuable work.”
This replaces intelligence (a cognitive concept) with economic output (a market metric).
A spreadsheet outperforms a human at calculation. A database outperforms a human at recall. Neither is intelligent. They are tools that perform specific valuable functions. AI currently performs economically valuable work across many domains. This makes it a powerful tool. Tools are not intelligent just because they are useful.
Illusion 3: The mimicry illusion
AI outputs look like the outputs of intelligence. A generated website looks like a website a designer made.
But the Pacific islanders’ wooden airplanes looked like real airplanes too. The form was right. The function — flying — was absent.
AI design replicates the form of expert design: hero sections, clean color palettes, standard component patterns, responsive layouts. It does not replicate the function: communicating a specific brand identity to a specific audience, creating emotional resonance that drives behavior, standing out through intentional visual positioning, building trust through accumulated signals of craft and care.
The form is learnable from data. The function requires the designer. Mistaking form for function is how you end up with a website that looks great in a screenshot and underperforms in every metric that matters.
What We Actually Have
This is not an anti-AI argument. The progress is real. AI is the most powerful tool to emerge in a generation.
What it is not — yet — is general intelligence.
What we have is something more precise: narrow cognitive automation at scale. AI automates the well-defined subset of cognitive tasks across many domains simultaneously. It looks general because it spans many subjects. It is not general because within each domain, it only handles tasks with clear, verifiable answers.
The capabilities it is missing are not exotic. They are the ordinary capabilities of human intelligence:
- Judgment when there is no right answer
- Intent, acting in pursuit of a specific goal
- Cultural reasoning, understanding what things mean rather than what they resemble
- Empathy, feeling what someone else would feel
- Strategic thinking, making choices against a competitive context
- Learning from consequences, updating judgment based on real-world results
- Knowing what to remove, so that only what earns its place remains
Design requires all of them at once. That is why it is such a clear diagnostic.
The Smart Play
Use AI for speed and scale. Use human designers for strategy and taste.
- A human designer defines the brand, the strategy, the visual system.
- AI generates variations, handles production, implements across breakpoints.
- The designer evaluates, selects, refines, and makes sure every pixel serves the strategy.
- AI handles scale — applying the system consistently across dozens or hundreds of pages.
- The designer reviews what matters most: the pages that drive revenue, build trust, and define the brand.
This is not a compromise. It is a multiplication. The designer’s judgment is amplified by AI’s speed. AI’s output is elevated by the designer’s taste.
The AGI Paradox, Resolved
The people claiming AGI and the people observing that AI cannot design a great website are both correct — about different things.
AGI optimists are correct that AI now performs many cognitive tasks at or above human level. The capability gains are real.
Design reveals that performing tasks with clear answers is not the same as general intelligence. The capabilities design requires (judgment, intent, cultural reasoning, knowing what to remove, empathy, strategic differentiation, learning from consequences) are the capabilities that benchmarks do not test and current AI does not possess.
What we have is an extraordinarily powerful narrow automation technology that spans many domains — and so looks general — while remaining unable to exercise judgment, intent, or wisdom in any of them.
Design is the canary in the coal mine.
The question every builder should sit with: if your product’s competitive advantage depends on one of the seven capabilities above, how much of that advantage survives a world where everyone has access to the same AI tools?