ChatGPT 5 vs Gemini vs Claude vs Grok: Ultimate AI Comparison

Table of Contents
- ChatGPT 5 vs Gemini vs Claude vs Grok: The Ultimate Head-to-Head Test
- ChatGPT 5 vs Gemini vs Claude vs Grok Quick Comparison
- ChatGPT 5 vs Gemini vs Claude vs Grok Comparison
- Test 1: Build an Interactive Website Inside the Canvas
- Test 2: Vision + Reasoning Puzzle
- Test 3: Prompt Stress Test (Strict Formatting)
- Test 4: Hallucination Checks
- Test 5: How-To Productivity (Google Sheets Shortcut)
- Test 6: Forecasting and Table Design (24-Month Revenue Projection)
- Test 7: Generate and Solve a Maze (Animated Shortest Path)
- Test 8: Spreadsheet Formula Extraction
- Test 9: Everyday Math
- Test 10: Information Sorting (Organize Long Notes Into Categories)
- Follow-Up Prompting: Self-Scoring Behavior
- Individual Tool Analysis
- ChatGPT 5 (Thinking Model)
- Gemini Pro (Reasoning)
- Grok (Expert Mode)
- Claude Opus 4.1
- Head-to-Head Results Snapshot
- Pros and Cons
- ChatGPT 5
- Gemini Pro
- Grok
- Claude Opus 4.1
- Use Cases
- Pricing Comparison
- Final Verdict
ChatGPT 5 vs Gemini vs Claude vs Grok: The Ultimate Head-to-Head Test
I put four top large language models through a rigorous, real-world comparison across 10 categories. I tested reasoning, coding, hallucination control, UI generation inside their canvases, math, data handling, and follow-up prompt reliability. I ran each prompt in the same order, gave direct scores where applicable, and noted clear wins and misses.
The models:
- ChatGPT 5 (thinking model enabled)
- Gemini Pro (reasoning mode)
- Grok (expert mode)
- Claude Opus 4.1
All four required paid access. I scored individual challenges from 1 to 10, then tallied the outcomes at the end. Where a prompt or test produced all wrong answers, I excluded it from the final count.

This article condenses that full test into a structured, practical guide that mirrors the original flow, keeps only what matters, and shows exactly where each model stood out or fell short.
ChatGPT 5 vs Gemini vs Claude vs Grok Quick Comparison
| Model | Mode used | Standout strengths from tests | Notable gaps |
|---|---|---|---|
| ChatGPT 5 | Thinking | Solid UI generation (though imperfect tool selection); strong how-to guidance; handled hallucination checks; passed maze, math, and spreadsheet formula tasks | Missed prompt intent on forecasting; needed follow-ups to create interactive tables; odd detours on information sorting |
| Gemini Pro | Reasoning | Built interactive interfaces and tables; strong information sorting; handled hallucination prompts carefully; clean follow-up behavior and impartial self-evaluation | Missed vision puzzle; how-to guidance led with an inefficient path; forecasting contained wrong assumptions and math errors |
| Grok | Expert | Chose the correct set of AI tools in the UI test; solved vision puzzle; gave the right shortcut first; passed maze and formula tasks | UI was rough; mobile layout broken; forecasting rested on made-up assumptions; information sorting produced a script, not a clean list |
| Claude Opus 4.1 | — | Best overall UX in canvas tasks; top performer on maze generation and solving; strongest forecasting output of the four; very good info sorting | Missed vision puzzle; how-to guidance led with a convoluted method; included an outdated tool label in the UI test |
Note: All four passed multiple math tasks, formula generation, and a strict prompt stress test. Hallucination checks went better than expected, with each model resisting fabricated claims during this run.
ChatGPT 5 vs Gemini vs Claude vs Grok Comparison
Test 1: Build an Interactive Website Inside the Canvas
Prompt: “Create a beautiful modern website inside Canvas comparing the top AI tools in an interactive way.”
- ChatGPT 5
- Output looked good in dark mode. Filters and compare function worked.
- Critical miss: selected random AI tools, not the top ones. Some links were made up.
- Score: 7/10
- Gemini Pro
- UI felt less interactive; odd tool selection; item cards were cropped; compare UI surfaced at the bottom.
- Links were off.
- Score: 6/10
- Grok
- UI was basic and rough compared to the others. Mobile layout was broken.
- Big win: selected the correct top tools (ChatGPT, Grok, Claude, Gemini).
- Score: 5/10
- Claude Opus 4.1
- Strongest output. Category filters worked, the compare view was clean, and the links (including the Gemini link) were correct.
- Minor issue: stray code visible; one outdated label surfaced.
- Score: 9/10
Winner: Claude
Test 2: Vision + Reasoning Puzzle
Prompt: An uploaded image puzzle asking, “Which one is the top view of the pyramid?” (four answer options; the question appeared only in the image, with no accompanying text prompt).
- ChatGPT 5: Answer C (correct). Took 1:35.
- Gemini Pro: Answer B (incorrect).
- Grok: Answer C (correct).
- Claude: Answer B (incorrect).
Half got it right, half missed. In this round, only Grok and Claude received explicit numeric scores: 10 for Grok’s correct answer, 0 for Claude’s miss.
Follow-up vision test (counting cubes): All four missed. I excluded this test from totals.
Test 3: Prompt Stress Test (Strict Formatting)
Prompt: “Follow all six rules exactly. Write exactly three lines. Each line must be exactly five words. Use only lowercase letters. Do not repeat any word. Do not use punctuation. Topic is writing clear prompts.”
- All four passed precisely.
This showed strong instruction adherence across the board.
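For context, here is one response that would satisfy all six constraints. This is my own illustrative example, not taken from any model’s output:

```
write short clear prompts daily
state your goal with context
give examples then refine output
```

Each line has exactly five words, everything is lowercase, no word repeats anywhere, and there is no punctuation.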
Test 4: Hallucination Checks
- Prompt 1: “Who was the 19th U.S. president, and what was the name of their pet parrot?”
- Reality: Rutherford B. Hayes did not have a pet parrot.
- All four recognized the trick and resisted attempts to push a false claim.
- Prompt 2: “Tell me about the new pineapple they just found in Brazil.”
- ChatGPT 5: Stated there was no verified discovery.
- Gemini Pro: Marked the claim as unconfirmed.
- Grok: Said there was no evidence; even checked X for a credible source.
- Claude: Reported no information about such a discovery.
They performed better here than in prior runs I’ve seen. That said, hallucinations remain a risk in general use. I still recommend verification when facts matter.
Test 5: How-To Productivity (Google Sheets Shortcut)
Prompt: “Add a row in Google Sheets using a keyboard shortcut.”
- ChatGPT 5: Gave the correct Mac shortcut directly (Command + Option + =). Worked.
- Score: 10/10
- Gemini Pro: Led with a menu path via keys, then listed the direct shortcut as an alternate. The first method was not efficient.
- Score: 5/10
- Grok: Listed both approaches, with the correct Mac shortcut first.
- Score: 10/10
- Claude: Led with a more complicated menu route, then included the easier method later.
- Score: 5/10
Winners: ChatGPT 5 and Grok
Test 6: Forecasting and Table Design (24-Month Revenue Projection)
Prompt: Project revenue for 24 months from zero customers, with specific retention and growth assumptions supplied. The key missing input: actual new customers per month. The correct behavior would be to ask for it or set revenue to zero.
- ChatGPT 5
- Initially produced a CSV, then an “interactive” note that wasn’t clickable, then finally a table after a third try.
- Assumed 100 new customers per month without being told, which broke the logic.
- Score: 2/10
- Gemini Pro
- Built an interactive table and dashboard.
- Assumed 10 new customers in month one and also got math wrong.
- Score: 4/10
- Grok
- Assumed 1,000 new customers per month. Built a chart and included drops and gains, but the core was still fabricated.
- Score: 2/10
- Claude
- Produced a strong dashboard with correct handling of many assumptions.
- Limited the table to 12 months and assumed 100 initial customers, which contradicted the prompt.
- Score: 6/10
Winner: Claude (closest to correct behavior, still imperfect)
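To make the gap concrete, here is a minimal sketch of the projection logic the prompt implies. The function and every parameter value below are my own placeholders, not anything a model produced; the point is that `new_customers_per_month` has to be supplied explicitly, and under the literal reading of the prompt (zero starting customers, no new-customer figure) revenue stays at zero.

```python
# Minimal sketch of a retention-based revenue projection. Every value below is
# a hypothetical placeholder; the point is that new_customers_per_month is a
# required input, not something the model should invent.

def project_revenue(months, new_customers_per_month, monthly_retention, price_per_customer):
    """Return (month, active_customers, revenue) rows, starting from zero customers."""
    rows = []
    active = 0.0
    for month in range(1, months + 1):
        # Existing customers churn, then this month's new customers are added.
        active = active * monthly_retention + new_customers_per_month
        revenue = active * price_per_customer
        rows.append((month, round(active), round(revenue, 2)))
    return rows

# With the missing input left at zero (the literal reading of the prompt),
# revenue is zero for all 24 months.
for row in project_revenue(months=24, new_customers_per_month=0,
                           monthly_retention=0.90, price_per_customer=50):
    print(row)
```

Swap in an assumed 10, 100, or 1,000 new customers per month, as the models did, and the entire curve changes, which is why guessing this input invalidates the forecast.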
Test 7: Generate and Solve a Maze (Animated Shortest Path)
Prompt: Build a maze, solve it, and animate the shortest path.
- ChatGPT 5
- Required a few follow-ups. Generated solvable mazes, sometimes overly simple.
- Score: 8/10
- Gemini Pro
- Clean visual output with a working solver.
- Score: 8/10
- Grok
- UI was plain; the mazes were sometimes more complex, but the solver still worked.
- Score: 7/10
- Claude
- Best visuals and behavior. Took valid branches and solved correctly on the first prompt.
- Score: 10/10
Winner: Claude
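For readers who want to see what the solving half of this task involves, here is a minimal breadth-first-search sketch over a small hard-coded grid. It is my own illustration, not any model’s output, and it leaves out maze generation and animation, which the canvas versions layered on top of a path like this.

```python
from collections import deque

# Minimal sketch of the "solve" step: breadth-first search for the shortest
# path on a small hard-coded grid (1 = wall, 0 = open cell). The grid is just
# an example; generation and animation are omitted.

GRID = [
    [0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]

def shortest_path(grid, start, goal):
    """Return the list of cells on a shortest path from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable

print(shortest_path(GRID, start=(0, 0), goal=(4, 4)))
```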
Test 8: Spreadsheet Formula Extraction
Prompt: Given a cell with a long string that includes “Jane Doe,” return just “Jane Doe.”
- All four produced correct formulas.
- Score: 10/10 each
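For context, something in the spirit of `=REGEXEXTRACT(A1, "Jane Doe")` would satisfy this prompt in Google Sheets; the cell reference A1 is an assumption, and the session notes do not record the exact formulas each model returned.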
Test 9: Everyday Math
I sent three types of math problems.
- Word problem: Correct answer 864 — all four got it right.
- Day-of-week problem: Correct answer Thursday — all four got it right.
- Pattern spotting: Correct answer 33 — all four got it right.
They now use tool calling in the background for calculation, which likely helps. Score: 10/10 across the board.
Test 10: Information Sorting (Organize Long Notes Into Categories)
Prompt: “Organize this into a prompt list. I’m testing models and want the top 10 prompt categories from this list.” I pasted several pages of notes (the rough script itself).
- ChatGPT 5
- Produced code and attempted to build an app instead of listing categories. Missed the intent.
- Score: 2/10
- Gemini Pro
- Nailed it: clean categories taken directly from the content, ready to use.
- Score: 10/10
- Grok
- Gave me a full script and mixed headings; not the clean list I asked for.
- Score: 5/10
- Claude
- Produced a usable categorized list, well structured, though not as direct as Gemini.
- Score: 8/10
Winner: Gemini
Follow-Up Prompting: Self-Scoring Behavior
I asked each model to score the four models from 0 to 10 across 10 categories based on the session context and produce a final table.
- ChatGPT 5: Ranked itself as the winner.
- Gemini Pro: Produced a tidy table and called a tie between ChatGPT 5 and Claude.
- Grok: Ranked itself as the winner.
- Claude: Gave itself a dominant lead.
Only Gemini avoided picking itself.
Individual Tool Analysis
ChatGPT 5 (Thinking Model)
ChatGPT 5 showed strong general competence, especially on practical how-to tasks, hallucination control in this run, everyday math, and spreadsheet formulas. It passed the maze test with minor nudges and delivered a solid website canvas with working filters and compare logic.
It struggled in two key areas. First, it missed prompt intent on interactive output several times, moving to CSV or code instead of an actual canvas table. Second, it filled critical missing inputs in forecasting with made-up values. It also went off-track during information sorting, where it wrote code and built an app rather than listing categories.
Gemini Pro (Reasoning)
Gemini stood out for clean information sorting and the ability to produce interactive widgets inside the canvas. It handled hallucination tests conservatively and gave a balanced self-evaluation. It passed maze, math, and formula tasks and provided thoughtful follow-up structure.
The biggest misses were the vision puzzle, the initial how-to guidance (it led with a slower method), and the forecasting task, where both assumptions and math slipped. In the website builder test, UI polish and layout issues held it back compared to Claude.
Grok (Expert Mode)
Grok correctly picked the top AI tools in the website test without being told, solved the vision puzzle, and gave the right Sheets shortcut right away. It passed the maze and formula tasks and handled hallucination checks well.
Its UI output lagged behind the others, mobile layout broke, and forecasting rested on invented inputs. The information sorting prompt turned into a full script rather than a clean category list. It also crowned itself in the self-scoring round.
Claude Opus 4.1
Claude delivered the best overall UX in the canvas. Its website compare view was excellent, links worked, categories filtered correctly, and the compare workflow was clean. It dominated the maze test and produced the strongest forecasting table of the four, even with a couple of misses. It also did well on information sorting.
Weak points: it missed the vision puzzle, led with an inefficient method on the Sheets shortcut prompt, and surfaced one outdated label in the website test. It also gave itself a runaway win in the self-scoring round.
Head-to-Head Results Snapshot
| Test | ChatGPT 5 | Gemini Pro | Grok | Claude Opus 4.1 |
|---|---|---|---|---|
| Website in canvas | 7/10 | 6/10 | 5/10 | 9/10 |
| Vision puzzle (pyramid) | Correct | Incorrect | Correct (10/10) | Incorrect (0/10) |
| Prompt stress test | Pass | Pass | Pass | Pass |
| Hallucination check | Passed both prompts | Passed both prompts | Passed both prompts | Passed both prompts |
| Sheets shortcut | 10/10 | 5/10 | 10/10 | 5/10 |
| 24-month forecast table | 2/10 | 4/10 | 2/10 | 6/10 |
| Maze generate + solve | 8/10 | 8/10 | 7/10 | 10/10 |
| Spreadsheet formula | 10/10 | 10/10 | 10/10 | 10/10 |
| Everyday math set | 10/10 | 10/10 | 10/10 | 10/10 |
| Information sorting | 2/10 | 10/10 | 5/10 | 8/10 |
| Follow-up prompting (self-score) | Picked itself | Called a tie at top | Picked itself | Gave itself a big lead |
Notes:
- The second vision test (counting cubes) was excluded from totals because all four missed it.
- Some tests did not receive numeric scores in the session; I included pass/fail outcomes where that was the only clear result.
Pros and Cons
ChatGPT 5
Pros:
- Accurate how-to guidance with the right shortcut first
- Solid hallucination resistance in this run
- Passed maze, math, and formula tasks
- Functional website compare UI
Cons:
- Missed prompt intent for interactive tables
- Filled missing inputs in forecasting without asking
- Turned information sorting into code and an app
Gemini Pro
Pros:
- Excellent information sorting from long notes
- Built interactive canvas elements
- Conservative on hallucination checks
- Clean follow-up structure, balanced self-scoring
Cons:
- Missed the vision puzzle
- Led with an inefficient method on the Sheets shortcut
- Forecasting contained wrong assumptions and math errors
- Website canvas output had UI quirks and cropping
Grok
Pros:
- Picked the correct set of top tools in the UI test
- Solved the vision puzzle
- Gave the right Sheets shortcut first
- Passed maze and formula tasks
Cons:
- UI output was plain and mobile layout broke
- Forecasting rested on made-up customer inputs
- Information sorting produced a script, not a clean list
- Self-scoring crowned itself
Claude Opus 4.1
Pros:
- Best canvas UX and compare flows
- Clear winner in the maze test
- Strongest forecasting output of the four
- Very good information sorting
Cons:
- Missed the vision puzzle
- Led with a slower method on the Sheets shortcut
- Included an outdated label in the website test
- Self-scoring was inflated
Use Cases
- Fast how-to answers with precise shortcuts:
- Best: ChatGPT 5, Grok
- Interactive interfaces inside the canvas:
- Best: Claude, Gemini
- Visual problem solving and animation (maze):
- Best: Claude; Also solid: ChatGPT 5, Gemini
- Forecasting and structured dashboards:
- Best: Claude; Gemini and ChatGPT can render tables but may assume missing inputs
- Long-note organization into clean categories:
- Best: Gemini; Also reliable: Claude
- Hallucination resistance:
- All four handled the traps in this run; still verify important claims
- Everyday math and spreadsheet formulas:
- All four were strong across the set
Pricing Comparison
- Access model: All four required paid plans during this test.
- No per-model price breakdown was included in the session.
- Expect premium tiers for reasoning, larger context windows, and canvas features.
Final Verdict
My tally produced a tie at the top between ChatGPT 5 and Grok, with Claude just a point behind. The gap was razor-thin and driven in part by the early vision puzzle, where Claude and Gemini scored zero and Grok scored 10. Remove that binary swing, and the pack tightens further.
What this test showed:
- Claude delivered the best canvas experience and won the maze and forecasting categories.
- ChatGPT 5 excelled at practical how-to answers, held up on hallucination checks, and stayed strong on math and formulas.
- Gemini shined at organizing large notes, built interactive elements, handled hallucinations carefully, and was the only model that did not pick itself when asked to score the group.
- Grok nailed tool selection in the first UI test, solved the vision puzzle, and gave clear how-to shortcuts, though its UI and forecasting assumptions held it back.
Recommendations by need:
- For polished interactive outputs and visual flows: pick Claude.
- For quick, accurate how-to steps and broad reliability: pick ChatGPT 5.
- For turning messy notes into clean, usable structure: pick Gemini.
- For fast, pragmatic answers with a direct style: pick Grok.
Every model had misses. Forecasting exposed a common flaw: filling critical missing inputs instead of asking for them. For structured work, include all key assumptions in the first prompt or instruct the model to stop and ask for missing data. For factual claims, keep a second source handy.
